xie xingguo [Sat, 19 Jan 2019 09:19:10 +0000 (17:19 +0800)]
crush: fix upmap overkill
It appears that OSDMap::maybe_remove_pg_upmaps's sanity checks
are overzealous. With some customized crush rules it is possible
for osdmaptool to generate valid upmaps, but maybe_remove_pg_upmaps
will cancel them.
xie xingguo [Mon, 18 Feb 2019 07:40:22 +0000 (15:40 +0800)]
osd/OSDMap: use std::vector::reserve to reduce memory reallocation
In C++, vectors are dynamic arrays backed by contiguous blocks of
memory. When the memory allocated for a vector falls short of storing
new elements, a new block is allocated and all existing elements are
copied from the old location to the new one. This reallocation lets
vectors grow when required, but it is a costly operation whose time
complexity is linear in the number of elements.
Try to use std::vector::reserve whenever possible if performance
matters.
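A minimal sketch of the pattern (hypothetical names, not the actual
OSDMap code): when the final size is known up front, reserving the
capacity avoids every intermediate reallocation.
```
#include <vector>

// With reserve(), a single allocation happens up front, so the
// push_back() calls below never trigger a reallocation-and-copy.
std::vector<int> collect_osds(int count) {
  std::vector<int> osds;
  osds.reserve(count);
  for (int i = 0; i < count; ++i)
    osds.push_back(i);
  return osds;
}
```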
xie xingguo [Sat, 26 Jan 2019 10:03:15 +0000 (18:03 +0800)]
osd/OSDMap: more improvements to upmap
- add the ability to append a 2nd, 3rd, etc. pair to an existing upmap
when possible, rather than just continuing to the next PG (see the
sketch after this list)
- handle the underfull case: we can rm-pg-upmap-items if there exist
any upmaps that remapped a PG off an underfull OSD
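For illustration, a conceptual sketch of the first point, using a
simplified stand-in for pg_upmap_items (a map from PG to a list of
(from, to) OSD pairs); this is not the actual OSDMap code:
```
#include <map>
#include <utility>
#include <vector>

using pg_id = unsigned;

// Simplified stand-in: each PG maps to a list of (from_osd, to_osd)
// remap pairs.
std::map<pg_id, std::vector<std::pair<int, int>>> pg_upmap_items;

// Append another remap pair to a PG's existing entry instead of
// skipping the PG because it already has one.
void append_upmap_pair(pg_id pg, int from_osd, int to_osd) {
  pg_upmap_items[pg].emplace_back(from_osd, to_osd);
}
```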
xie xingguo [Fri, 18 Jan 2019 10:14:52 +0000 (18:14 +0800)]
osd/OSDMap: be more aggressive when trying to balance
Previously we required an absolute deviation >= 1 before an osd
could enter the overfull or underfull set, which could leave some
osds stuck severely underfull forever.
This patch tries to get us out of those corner cases in two ways
(see the sketch after this list):
- a standard deviation is introduced to evaluate the efficiency
of balancing, so the distribution of pgs always converges toward
the perfect state (standard deviation == 0).
- the overfull and underfull sets are populated more aggressively
by gradually letting the absolute deviation threshold converge
toward 0 instead of 1.
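A minimal sketch of the convergence metric, assuming per-OSD PG counts
as input (illustrative, not the actual OSDMap code):
```
#include <cmath>
#include <vector>

// Standard deviation of per-OSD PG counts; 0 means a perfectly
// balanced distribution, so the balancer can keep iterating while
// this value still improves.
double pg_count_stddev(const std::vector<int>& pgs_per_osd) {
  if (pgs_per_osd.empty())
    return 0;
  double mean = 0;
  for (int n : pgs_per_osd)
    mean += n;
  mean /= pgs_per_osd.size();
  double var = 0;
  for (int n : pgs_per_osd)
    var += (n - mean) * (n - mean);
  return std::sqrt(var / pgs_per_osd.size());
}
```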
It turns out the balancer module works even better now after
applying this patch. E.g.:
```
OSD=5 MON=1 MGR=1 MDS=0 ../src/vstart.sh -x -l -b -n -d
bin/ceph osd set-require-min-compat-client luminous
bin/ceph balancer mode upmap
bin/ceph balancer on
bin/ceph osd pool create rbd 117
# wait until automatic balancing is done
bin/ceph osd pool create aaa 133
```
__before__:
```
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.00980 1.00000 10 GiB 1.1 GiB 3.9 MiB 0 B 1 GiB 9.0 GiB 10.60 1.00 151 up
1 hdd 0.00980 1.00000 10 GiB 1.1 GiB 3.9 MiB 0 B 1 GiB 9.0 GiB 10.60 1.00 153 up
2 hdd 0.00980 1.00000 10 GiB 1.1 GiB 3.9 MiB 0 B 1 GiB 9.0 GiB 10.60 1.00 147 up
3 hdd 0.00980 1.00000 10 GiB 1.1 GiB 3.9 MiB 0 B 1 GiB 9.0 GiB 10.60 1.00 149 up
4 hdd 0.00980 1.00000 10 GiB 1.1 GiB 3.9 MiB 0 B 1 GiB 9.0 GiB 10.60 1.00 150 up
TOTAL 50 GiB 5.3 GiB 19 MiB 0 B 5 GiB 45 GiB 10.60
```
__after__:
```
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 hdd 0.00980 1.00000 10 GiB 1.1 GiB 6.2 MiB 0 B 1 GiB 9.0 GiB 10.62 1.00 150 up
1 hdd 0.00980 1.00000 10 GiB 1.1 GiB 6.2 MiB 0 B 1 GiB 9.0 GiB 10.62 1.00 151 up
2 hdd 0.00980 1.00000 10 GiB 1.1 GiB 6.2 MiB 0 B 1 GiB 9.0 GiB 10.62 1.00 149 up
3 hdd 0.00980 1.00000 10 GiB 1.1 GiB 6.2 MiB 0 B 1 GiB 9.0 GiB 10.62 1.00 150 up
4 hdd 0.00980 1.00000 10 GiB 1.1 GiB 6.2 MiB 0 B 1 GiB 9.0 GiB 10.62 1.00 150 up
TOTAL 50 GiB 5.3 GiB 31 MiB 0 B 5 GiB 45 GiB 10.62
```
xie xingguo [Sat, 12 Jan 2019 07:01:54 +0000 (15:01 +0800)]
osd/OSDMap: potential access violation fix
It seems we continue to access the iterator after it has been
invalidated by the __erase__ method.
The new approach is also more efficient, considering there could be
some extreme ec-pool (e.g., 8 + 2) consumers.
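A minimal sketch of the safe pattern (generic C++, not the actual
OSDMap code): erase() invalidates the erased iterator, so the loop
must continue from the iterator erase() returns.
```
#include <map>

// Erase entries while iterating without touching an invalidated
// iterator: map::erase() returns the next valid position.
void prune_zeroes(std::map<int, int>& m) {
  for (auto it = m.begin(); it != m.end(); ) {
    if (it->second == 0)
      it = m.erase(it);  // safe: advances past the erased element
    else
      ++it;
  }
}
```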
huangjun [Fri, 24 Aug 2018 14:47:02 +0000 (22:47 +0800)]
osd/OSDMap: don't map all pgs each time in calc_pg_upmaps
We have a cluster pool with 32768 pgs and 400 osds; doing upmap with
'--upmap-max 32768 --upmap-deviation 0.01' takes 600 seconds, which is
pretty slow.
After adding some debug code, we found most of the time is spent in
pg_to_up_acting_osds: one call takes about 12us on average, so mapping
the whole pool's pgs costs about 500ms per pass. We ultimately have
1429 pgs that need upmaps, so the total comes to about 600 seconds.
With this patch, it takes only 5 seconds to get the job done.
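A conceptual sketch of the idea, assuming the fix avoids recomputing
unchanged mappings (names are illustrative, not the calc_pg_upmaps
code):
```
#include <map>
#include <vector>

using pg_id = unsigned;

// Stand-in for the expensive pg_to_up_acting_osds() call.
std::vector<int> compute_up_osds(pg_id pg) {
  return {int(pg % 5), int((pg + 1) % 5)};
}

std::map<pg_id, std::vector<int>> up_cache;

// Compute each PG's mapping at most once per balancing run instead
// of remapping all pgs on every pass.
const std::vector<int>& up_osds_for(pg_id pg) {
  auto it = up_cache.find(pg);
  if (it == up_cache.end())
    it = up_cache.emplace(pg, compute_up_osds(pg)).first;
  return it->second;
}
```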
Noah Watkins [Thu, 17 Jan 2019 19:16:44 +0000 (11:16 -0800)]
cli: dump osd-fsid as part of osd find <id>
Dumps the osd-fsid uuid as part of the `osd find <id>` command.
Currently this uuid is only available via `osd dump`, but
ceph-ansible has a use case for interrogating a single osd without
needing the entire osdmap dump.
Neha Ojha [Mon, 7 Jan 2019 23:26:27 +0000 (15:26 -0800)]
mon/OSDMonitor.cc: make a note about reusing jewel feature bit
For OSD_PGLOG_HARDLIMIT, we have reused a jewel feature bit that was
retired in luminous. Therefore, we need to check that the release
version is >= CEPH_RELEASE_LUMINOUS before using it.
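A minimal sketch of the guard this note describes, with placeholder
constants rather than the real Ceph definitions:
```
#include <cstdint>

constexpr int kReleaseLuminous = 12;       // stands in for CEPH_RELEASE_LUMINOUS
constexpr uint64_t kPglogHardlimitBit = 1ULL << 21;  // illustrative bit position

// The bit alone is ambiguous because jewel used it for something
// else, so the peer's release must be checked before trusting it.
bool pglog_hardlimit_usable(int release, uint64_t features) {
  return release >= kReleaseLuminous &&
         (features & kPglogHardlimitBit) != 0;
}
```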
Conflicts:
src/include/rados.h
src/mon/MonCommands.h
src/mon/OSDMonitor.cc
src/osd/OSDMap.cc
Luminous does not have CEPH_OSDMAP_NOSNAPTRIM flag.
In nautilus, CEPH_OSDMAP_PGLOG_HARDLIMIT is set by default,
which is not the case in luminous.
In https://github.com/ceph/ceph/pull/21580 I set a trap to catch some
weird and random segfaults, and in a recent QA run I observed it was
successfully triggered by one of the test cases, see:
The root cause is that there might be holes in the log versions; thus
the approx_size() method will (almost) always overestimate the actual
number of log entries.
As a result, we are at risk of overtrimming log entries.
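To make the overestimate concrete, a small illustration (generic C++,
assuming a size estimate derived from the version range):
```
#include <vector>

int main() {
  // Log entry versions with a hole at 12-13.
  std::vector<unsigned> versions = {10, 11, 14, 15};
  unsigned range_estimate = versions.back() - versions.front() + 1;  // 6
  unsigned actual = versions.size();                                 // 4
  // A trim budget computed from range_estimate assumes more entries
  // exist than actually do, so trimming can cut deeper than intended.
  return range_estimate > actual ? 0 : 1;
}
```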
https://github.com/ceph/ceph/pull/18338 reveals a probably easier way
to fix the above problem, but unfortunately it can also cause a big
performance regression, hence this PR.
For the auto-repair (EIO caused) case, we will not reinitialize
**complete_to** (because last_complete is equal to last_update!),
and hence there is a chance that **complete_to** already points
to **log.end()** before we call recover_got.
We can simply drop it here, as we already log the **complete_to**
iterator change in a more compatible way a few lines below.
src/osd/PG.cc: remove redundant call to trim_log()
This change is motivated by the failure tracked in
https://tracker.ceph.com/issues/25198. The failure highlights a case
where a call to trim_log() after the PG has recovered races with the
previous op on a replica OSD. Since the previous operation has not
completed, the last_complete value for that OSD is not valid when we
try to trim the log. It is also worth noting that the race arises
because MOSDPGTrim goes through the strict queue as a peering message,
whereas regular ops go through the non-strict queue.
During the investigation of this bug, we noticed that with
https://tracker.ceph.com/issues/23979 we allow pg log trimming to
happen on the primary and replicas whenever we cross the upper bound
of the pg log. This also ensures that pg log trimming happens while
processing any new op.
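A conceptual sketch of that behavior (not the actual PGLog code),
assuming a hard upper bound on the log length:
```
#include <cstddef>
#include <deque>

// With a hard cap on the pg log, every op that pushes the log past
// the bound triggers a trim, so a separate trim_log() pass after
// recovery adds nothing.
struct PgLogSketch {
  std::deque<unsigned> entries;    // entry versions, oldest first
  std::size_t max_entries = 3000;  // e.g. osd_max_pg_log_entries

  void add(unsigned version) {
    entries.push_back(version);
    while (entries.size() > max_entries)
      entries.pop_front();         // trim while processing the op
  }
};
```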
Therefore, the function trim_log(), which earlier served the purpose
of trimming logs on the primary and replicas just before the PG went
into the Recovered state, is no longer required. It acted as a last
line of defense to trim logs once we no longer needed them, but the
call is redundant now because we limit the pg log length at all times.
Conflicts:
src/osd/PGLog.cc: Now it is possible to have complete_to version
less than or equal to trim version, because the pg log length upper
limit is a hard limit, and trim can proceed even when there is
pending recovery/backfill. So do not complain when this happens.
Remove async recovery components: The async recovery feature is not present
in luminous. We do not need commit 22d17fb5aad6ab9d7525d9492c0e96a36d02879e,
which adds a flag to remember async recovery. We have also removed async
recovery requirements from this commit and modified the commit message to
only reflect backfill.
Conflicts:
src/osd/PrimaryLogPG.cc: min_last_complete_ondisk and
pg_log.get_can_rollback_to() are no longer the limit of the pg log.
Make the head of the pg log the new limit for pg log trimming.
Volker Theile [Thu, 29 Nov 2018 12:48:30 +0000 (13:48 +0100)]
ceph-volume: Adapt code to support Python3
- raw_input() has been renamed to input() in Python 3
- Changed the signature of prompt_bool. Variables that shadow
built-ins must be named like xxx_, not _xxx
https://github.com/ceph/ceph/pull/25824 adds 'slow request' messages
to OSD logs. To deal with this, whitelist 'slow request' instead of
'slow requests'. This PR is specific to luminous because later
versions already whitelist it correctly.
IvanGuan [Fri, 4 Jan 2019 04:22:27 +0000 (12:22 +0800)]
client: fix fuse client hang because its pipe to mds is not ok
If a fuse client's session has been killed by the mds, and the mds
daemon restarts or a hot-standby switch happens right away, the client
may not receive any message from the monitor (due to the network or
whatever other reason) until the mds becomes active again. As a result,
the client never calls closed_mds_session, so the session remains in
STATE_OPEN even though the client cannot send any message to the mds
because its pipe is not ok. We should therefore close the stale session
so that it can be reopened.
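A conceptual sketch of the fix (illustrative names, not the actual
Client code): a session stuck OPEN on a dead pipe gets closed so a
fresh one can be established once the mds is active again.
```
enum class SessionState { OPEN, CLOSED };

struct MdsSession {
  SessionState state = SessionState::OPEN;
  bool pipe_ok = false;  // transport to the mds is broken
};

// Close a session the mds has already killed so the client can
// reopen it against the newly active mds.
void close_if_stale(MdsSession& s) {
  if (s.state == SessionState::OPEN && !s.pipe_ok)
    s.state = SessionState::CLOSED;
}
```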
As the largish g_conf() change from master isn't in mimic yet, use the
g_conf global structure. Also make rgw_op use the value from the
req_info ceph context, as we do for all requests.