Sage Weil [Wed, 13 Mar 2019 17:46:50 +0000 (12:46 -0500)]
qa/standalone/erasure-code/test-erasure-code: adjust test to avoid m=0
_DD is k=2 m=0, which we don't allow. Switch it to cDD.
I confess I don't fully understand why this was _DD to begin with, but
I'm pretty sure mapping is there to control the order of results so that
it can be mapped to the CRUSH rule output sanely, and the coding portion
is not relevant to the test.
Conditionally allow non-unique email address values for builtin
RGW users.
Fixes: http://tracker.ceph.com/issues/40089
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
(cherry picked from commit 974791522007cca6d8fb30e83677f0ddd7c4e71d)
Conflicts:
src/rgw/rgw_user.cc
- changed '_conf.get_val<bool>' to '_conf->get_val<bool>'
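The backported guard reads a boolean config option before enforcing email uniqueness. A minimal sketch of that pattern, assuming the option name and helper below (illustrative, not the exact backported code):
```c++
#include "common/ceph_context.h"  // CephContext and its config proxy

// Hedged sketch: only require unique email addresses when the (assumed)
// "rgw_user_unique_email" option is enabled. Note the pointer-style
// accessor, which is what the conflict note above refers to.
static bool email_must_be_unique(CephContext *cct) {
  return cct->_conf->get_val<bool>("rgw_user_unique_email");
}
```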
xie xingguo [Mon, 3 Jun 2019 08:43:25 +0000 (16:43 +0800)]
test: add parallel clean_pg_upmaps test
With the parallel clean_pg_upmaps feature on, the total time cost
of the performance test, which can now utilize up to 8 threads for
parallel upmap validation, decreased from:
Note that by default the mon uses only 4 worker threads for
CPU-intensive background work; you can further increase the
"mon_cpu_threads" option value if the time spent in
clean_pg_upmaps still matters to you.
xie xingguo [Mon, 3 Jun 2019 08:10:22 +0000 (16:10 +0800)]
mon/OSDMonitor: do clean_pg_upmaps the parallel way if necessary
There are certainly some cases in which we could reliably
skip this kind of checking, but there is no easy way to separate
those out.
However, this is clearly the general way to do the massive pg
upmap clean-up job more efficiently and hence should make sense
in all cases.
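For orientation, a heavily simplified, standalone sketch of the fan-out this enables; the real code goes through ParallelPGMapper and the mon's CPU thread pool, so everything below is illustrative only:
```c++
// Hedged sketch: split the pg list into chunks and validate each chunk's
// upmaps on its own thread. Ceph's implementation uses ParallelPGMapper and
// the mon_cpu_threads worker pool; this only shows the shape of the idea.
#include <algorithm>
#include <thread>
#include <vector>

struct PG { int id; };
static bool upmap_is_valid(const PG&) { return true; }  // placeholder check

void clean_pg_upmaps_parallel(const std::vector<PG>& pgs, unsigned nthreads) {
  if (nthreads == 0) nthreads = 1;
  std::vector<std::thread> workers;
  const size_t chunk = (pgs.size() + nthreads - 1) / nthreads;
  for (unsigned t = 0; t < nthreads; ++t) {
    const size_t begin = t * chunk;
    const size_t end = std::min(pgs.size(), begin + chunk);
    if (begin >= end) break;
    workers.emplace_back([&pgs, begin, end] {
      for (size_t i = begin; i < end; ++i) {
        if (!upmap_is_valid(pgs[i])) {
          // queue pgs[i] for upmap removal (omitted)
        }
      }
    });
  }
  for (auto& w : workers) w.join();
}
```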
xie xingguo [Mon, 17 Jun 2019 10:44:09 +0000 (18:44 +0800)]
osd: maybe_remove_pg_upmaps -> clean_pg_upmaps
Killing unnecessary or duplicated code should always be the preferred
option, as it is good for maintenance.
Also, I've noticed there is already a clean_temps helper, so renaming
maybe_remove_pg_upmaps to clean_pg_upmaps to keep pace with that
seems a natural choice to me.
xie xingguo [Wed, 5 Jun 2019 02:41:52 +0000 (10:41 +0800)]
osd/OSDMapMapping: make ParallelPGMapper accept input pgs
The existing "prime temp" mechanism is a good inspiration
for clusters with a large number of pgs that need to do various
calculations quickly.
I am planning to do the upmap tidy-up work the same way, hence
the need for an alternate way of specifying the pgs to process, other
than taking them directly from the map.
The upmap results are applied directly after calling
_pg_to_raw_osds, which means the validation basically has nothing to do
with the up/down status.
In other words, if a pg_upmap/pg_upmap_items entry remapped a pg
onto some down osds and is now producing a collided result,
we should still be able to detect and cancel it.
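A hedged, standalone sketch of the collision check described above; it mirrors the idea that the post-upmap raw mapping is examined for duplicates without consulting OSD up/down state:
```c++
// Hedged sketch: after applying an upmap to the raw CRUSH mapping, the entry
// is bogus if the same OSD appears more than once. As noted above, the
// up/down status of the target OSDs is deliberately not consulted.
#include <set>
#include <vector>

bool upmap_result_collides(const std::vector<int>& raw_with_upmap) {
  std::set<int> seen;
  for (int osd : raw_with_upmap) {
    if (osd >= 0 && !seen.insert(osd).second)
      return true;  // duplicate OSD -> collided mapping, cancel the upmap
  }
  return false;
}
```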
xie xingguo [Sat, 1 Jun 2019 02:43:10 +0000 (10:43 +0800)]
test/osd: add performance test case for maybe_remove_pg_upmap
Tom Byrne reported that maybe_remove_pg_upmap might become
super inefficient for large clusters with the balancer on.
To identify and resolve the problem, we need to add some good
measurements first.
osd/bluestore: Actually wait until completion in write_sync
This function is only used by RocksDB WAL writing, so it must sync data.
This fixes #18338 and thus allows `bluefs_preextend_wal_files` to actually be set
to true, gaining +100% single-thread write iops in disk-bound (HDD or bad SSD) setups.
To my knowledge it doesn't hurt performance in other cases.
Test it yourself on any HDD with `fio -ioengine=rbd -direct=1 -bs=4k -iodepth=1`.
Issue #18338 is easily reproduced without this patch by issuing a `kill -9` to the OSD
while doing `fio -ioengine=rbd -direct=1 -bs=4M -iodepth=16`.
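For context, a generic, hedged illustration of the contract being restored: a sync write must not return until the data is durable. This is plain POSIX, not the bluestore block-device code itself:
```c++
// Hedged sketch in plain POSIX: do not return from a "sync" write until the
// data is durable, e.g. pwrite(2) followed by fdatasync(2) (or an O_DSYNC fd).
// Short writes are ignored for brevity.
#include <cerrno>
#include <cstddef>
#include <unistd.h>

int write_sync_sketch(int fd, const void* buf, size_t len, off_t off) {
  ssize_t r = ::pwrite(fd, buf, len, off);
  if (r < 0)
    return -errno;
  if (::fdatasync(fd) < 0)  // actually wait until completion
    return -errno;
  return 0;
}
```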
mon: paxos: introduce new reset_pending_committing_finishers for safety
There are asserts about the state of the system and pending_finishers which can
be triggered by running arbitrary commands through again. They are correct
when not restarting, but when we do restart we need to take care to preserve
the same invariants as appropriate. Use this function to be careful about
the order of committing_finishers vs. pending_finishers and to make sure they're
both empty before any Contexts actually get called.
We also reorder a call to finish_contexts on the waiting_for_writeable list for
similar reasons.
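A hedged sketch of the ordering the new helper is meant to enforce: detach both lists first, so that both are already empty before any Context runs (types and signatures are simplified stand-ins, not the Paxos code):
```c++
// Hedged sketch with stand-in types: move the Contexts out of both member
// lists (committing first, then pending) so the lists are empty by the time
// the callbacks run and possibly restart paxos.
#include <cerrno>
#include <list>

struct Context {
  virtual ~Context() {}
  virtual void finish(int r) = 0;
};

std::list<Context*> committing_finishers;  // illustrative members
std::list<Context*> pending_finishers;

void reset_pending_committing_finishers() {
  std::list<Context*> ls;
  ls.splice(ls.end(), committing_finishers);  // committed-but-unfinished first
  ls.splice(ls.end(), pending_finishers);
  for (Context* c : ls) {                     // akin to finish_contexts()
    c->finish(-EAGAIN);
    delete c;
  }
}
```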
Metadata sync of a new bucket entrypoint may call rgw_link_bucket()
(which in turn calls into cls user) without deleting/unlinking the
previous bucket entrypoint. This prevented the new bucket entrypoint
from overwriting the creation_time of the old one.
rgw: return ERR_NO_SUCH_BUCKET early while evaluating bucket policy
Right now we create an ERR_NO_SUCH_BUCKET ret code but continue further
processing. Since this ret code isn't returned at any stage, we end up creating a
bucket instance anyway (which shouldn't happen) and then succeed the client
call in cases like put bucket versioning. Return an error code early in these
cases.
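The shape of the fix, sketched with illustrative names and a placeholder error constant:
```c++
// Hedged sketch: bail out as soon as the bucket lookup reports "no such
// bucket" instead of carrying the ret code forward. The constant value and
// helper below are placeholders, not the RGW definitions.
static const int ERR_NO_SUCH_BUCKET = 2009;  // placeholder value

int evaluate_bucket_policy(bool bucket_exists) {
  if (!bucket_exists)
    return -ERR_NO_SUCH_BUCKET;  // early return; no bucket instance is created
  // ... evaluate the policy and continue processing the request ...
  return 0;
}
```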
rgw: limit entries in remove_olh_pending_entries()
If there are too many entries to send in a single osd op, the osd rejects
the request with EINVAL. This error happens in follow_olh(), which means
that requests against the object logical head (requests with no version
id) can't be resolved to the current object version. In multisite, this
also causes data sync to get stuck in retries.
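A hedged sketch of the batching idea: cap how many entries go into each osd op and loop, so no single request exceeds the server-side limit (the cap and names are illustrative, not the backported constant):
```c++
// Hedged sketch: remove the pending-olh xattrs in bounded batches so no single
// osd op carries more entries than the server side will accept (EINVAL).
#include <algorithm>
#include <string>
#include <vector>

void remove_olh_pending_entries_sketch(const std::vector<std::string>& keys,
                                       size_t max_per_op) {
  if (max_per_op == 0) max_per_op = 1;
  for (size_t i = 0; i < keys.size(); i += max_per_op) {
    const size_t end = std::min(keys.size(), i + max_per_op);
    std::vector<std::string> batch(keys.begin() + i, keys.begin() + end);
    (void)batch;  // issue one rm_xattrs op for this batch against the OLH object
  }
}
```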
osd/PG: fix last_complete re-calculation on splitting
We now add a hard limit for pg_logs, which means we might keep trimming
old log entries irrespective of a pg's current missing_set. As a
result, this can cause the last_complete pointer to move far ahead of the real
on-disk version (the oldest need of the missing_set, for instance) that the
corresponding pg should have on splitting:
For the above example, the parent's last_complete cursor jumped from **0'0** to
**238'300** directly, due to the effort of trying to catch up with the oldest
log entry when splitting was done. However, back in v12.2.9 the primary
would still reference a shard's last_complete field when trying to figure out all
possible locations of a currently missing object (see PG::MissingLoc::add_source_info):
```c++
if (oinfo.last_complete < need) {
  if (omissing.is_missing(soid)) {
    ldout(pg->cct, 10) << "search_for_missing " << soid << " " << need
                       << " also missing on osd." << fromosd << dendl;
    continue;
  }
}
```
Hence a wrongly calculated last_complete could then make the primary wrongly
conclude that a specific shard might have the authoritative object it is currently
looking for:
Note that ```5:624c3a7a:::benchmark_data_smithi190_39968_object1382:head 226'110```
was actually missing on both the primary and shard osd.2, whereas the primary insisted
that the object should exist on shard osd.2!
https://github.com/ceph/ceph/pull/26175 posted an indirect fix
for the above problem by ignoring last_complete when checking the missing set,
but it generally makes more sense to fill in the last_complete field correctly
whenever possible.
Hence this additional fix.
J. Eric Ivancich [Thu, 29 Nov 2018 23:02:45 +0000 (18:02 -0500)]
rgw: fix bad versioned bucket stats after reshard
When a versioned bucket is resharded, the stats for bucket index
entries of type PlainIdx and InstanceIdx were both accumulated to
determine the bucket stats. This caused a doubling of some stats, such
as bytes used. This fix makes certain that only PlainIdx entries are
accumulated. This works for unversioned buckets as well, as they only
have PlainIdx entries.
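Sketch of the corrected accumulation, using illustrative stand-ins for the bucket index entry categories named above:
```c++
// Hedged sketch with illustrative types: only the plain index entries are
// summed; instance entries describe the same object versions, and counting
// both is what doubled byte totals after reshard.
#include <cstdint>
#include <vector>

enum class IdxType { Plain, Instance, OlhLink };  // illustrative categories

struct IdxEntry {
  IdxType type;
  uint64_t bytes;
};

uint64_t accumulate_bytes(const std::vector<IdxEntry>& entries) {
  uint64_t total = 0;
  for (const auto& e : entries) {
    if (e.type == IdxType::Plain)  // skip Instance entries to avoid doubling
      total += e.bytes;
  }
  return total;
}
```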
Jason Dillaman [Sat, 23 Feb 2019 00:25:44 +0000 (19:25 -0500)]
rbd-mirror: image replay should retry asok registration upon failure
If the asok registration fails (perhaps due to a race condition with
a deleted and recreated image of the same name), periodically attempt
to register the missing asok hook.
If the image replay was canceled prior to the start of the bootstrap
stage, the image replayer would be stuck attempting to shut down if
the bootstrap is paused behind an image sync.
Jason Dillaman [Fri, 22 Feb 2019 15:59:26 +0000 (10:59 -0500)]
rbd-mirror: complete pool watcher initialization if object missing
If the mirroring object is missing, complete the initialization and
continue to retry in the background. This is useful for cases where
the remote doesn't (yet) have mirroring enabled but the remote
pool watcher initialization is delaying the leader watcher promotion
to the point where the leader is blacklisted by its peers.
Jason Dillaman [Tue, 19 Feb 2019 21:06:48 +0000 (16:06 -0500)]
librbd: add missing shutdown states to managed lock helper
The PRE_SHUTTING_DOWN and SHUTTING_DOWN states were missed
in the 'is_state_shutdown' helper method. This resulted in
rbd-mirror potentially entering an infinite loop during
shutdown.
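The gist of the change, sketched against an assumed state enum (only the PRE_SHUTTING_DOWN and SHUTTING_DOWN names come from the message above; the rest is illustrative):
```c++
// Hedged sketch: the helper must treat the intermediate shutdown states as
// "shutting down" too, otherwise callers can spin forever waiting for a state
// change that never comes.
enum State {
  STATE_UNINITIALIZED,
  STATE_LOCKED,
  STATE_UNLOCKED,
  STATE_PRE_SHUTTING_DOWN,
  STATE_SHUTTING_DOWN,
  STATE_SHUTDOWN
};

bool is_state_shutdown(State s) {
  return s == STATE_PRE_SHUTTING_DOWN ||
         s == STATE_SHUTTING_DOWN ||
         s == STATE_SHUTDOWN;
}
```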
Jason Dillaman [Thu, 28 Feb 2019 21:43:27 +0000 (16:43 -0500)]
librbd: improve object map performance under high IOPS workloads
Do not zero-fill the BitVector's bitset prior to decoding the data.
Additionally, only read-modify-write the portions of the footer
that are potentially affected by the updated state.
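A hedged, generic illustration of the first point (avoiding a zero-fill of a buffer that decoding will overwrite anyway); the real change lives in the BitVector decode path in common/bit_vector.hpp:
```c++
// Hedged sketch: reserve capacity and append the decoded bytes, rather than
// resizing first (which value-initializes, i.e. zero-fills, every byte) and
// then overwriting them.
#include <cstddef>
#include <cstdint>
#include <vector>

void decode_bits(const uint8_t* src, std::size_t len,
                 std::vector<uint8_t>& bits) {
  bits.clear();
  bits.reserve(len);                        // no up-front zero-fill of len bytes
  bits.insert(bits.end(), src, src + len);  // copy only the decoded data
}
```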
Fixes: http://tracker.ceph.com/issues/38538
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 071671fff64f27943047610fe075a7e98f0f705c)
Conflicts:
src/cls/rbd/cls_rbd.cc
src/cls/rbd/cls_rbd_client.h
src/common/bit_vector.hpp
src/test/common/test_bit_vector.cc
src/test/librbd/test_ObjectMap.cc
Trivial conflicts with bufferlist::begin/cbegin and assert/ceph_assert