The new sharded wq implementation cannot handle a resent mon create
message when a split child already exists. This is a side effect of the
new pg create path instantiating the PG at the pool create epoch osdmap
and letting it roll forward through splits; the mon may be resending a
create for a pg that was already created elsewhere and split elsewhere,
such that one of those split children has peered back onto this same OSD.
When we roll forward our re-created empty parent, it may split and find
that the child already exists, crashing.
This is no longer a concern because the mgr-based controller for pg_num
will not split PGs until after the initial PGs are all created. (We
know this because the pool has the CREATED flag set.)
The old-style path had its own problem,
http://tracker.ceph.com/issues/22165. We would build the history and
instantiate the pg in the latest osdmap epoch, ignoring any split children
that should have been created between the pool create epoch and the
current epoch. Since we're now taking the new path, that is no longer
a problem.
Fixes: http://tracker.ceph.com/issues/22165
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Fri, 6 Apr 2018 16:26:26 +0000 (11:26 -0500)]
mon/OSDMonitor: set last_force_resend_prenautilus for pg_num_pending changes
This will force pre-nautilus clients to resend ops when we are adjusting
pg_num_pending. This is a big hammer: for nautilus+ clients, we only have
an interval change for the affected PGs (the two PGs that are about to
merge), whereas this compat hack will do an op resend for the whole pool.
However, it is better than requiring all clients be upgraded to nautilus in
order to do PG merges.
Note that we already do the same thing for pre-luminous clients for
splits, so we've already inflicted similar pain in the past (and, to my
knowledge, have not seen any negative feedback or fallout from that).
Sage Weil [Sat, 7 Apr 2018 02:53:35 +0000 (21:53 -0500)]
mgr: do not adjust pg_num until FLAG_CREATING removed from pool
This is more reliable than looking at PG states because the PG may have
gone active and sent a notification to the mon (pg created!) and mgr
(new state!) but the mon may not have persisted that information yet.
Sage Weil [Sat, 7 Apr 2018 02:39:14 +0000 (21:39 -0500)]
mon/OSDMonitor: set POOL_CREATING flag until initial pool pgs are created
Set the flag when the pool is created, and clear it when the initial set
of PGs have been created by the mon. Move the update_creating_pgs()
block so that we can process the pgid removal from the creating list and
the pool flag removal in the same epoch; otherwise we might remove the
pgid but have no cluster activity to roll over another osdmap epoch to
allow the pool flag to be removed.
Previously, we renamed the old last_force_resend to
last_force_resend_preluminous and created a new last_force_resend for
luminous+. This allowed us to force preluminous clients to resend ops
(because they didn't understand the new pg split => new interval rule)
without affecting luminous clients.
Do the same rename again, adding a last_force_resend_prenautilus (luminous
or mimic).
Adjust the OSD code accordingly so it matches the behavior we'll see from
a luminous client.
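The per-era fields mean the resend trigger a client honors depends on its feature era. A minimal sketch of that selection, with simplified names and illustrative era markers (the real check lives in the C++ Objecter/OSD code):

```python
# Illustrative feature-era markers; not the real Ceph feature bits.
LUMINOUS, NAUTILUS = 12, 14

def must_resend(op_epoch, client_era,
                last_force_resend_preluminous,
                last_force_resend_prenautilus,
                last_force_resend):
    # Pick the force-resend field that the client's era understands.
    if client_era < LUMINOUS:
        last = last_force_resend_preluminous
    elif client_era < NAUTILUS:
        last = last_force_resend_prenautilus
    else:
        last = last_force_resend
    # Resend if that field advanced past the epoch the op was sent in.
    return last > op_epoch
```

Bumping last_force_resend_prenautilus on a pg_num_pending change thus forces a resend from luminous/mimic clients without disturbing nautilus+ ones.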
Sage Weil [Fri, 13 Apr 2018 22:16:41 +0000 (17:16 -0500)]
osd/PGLog: merge_from helper
When merging two logs, we throw out all of the actual log entries.
However, we need to convert them to dup ops as appropriate, and merge
those together. Reuse the trim code to do this.
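The idea can be sketched roughly as follows; this is a hypothetical illustration, not the actual C++ PGLog code. Merging discards the log entries themselves but first converts each one into a "dup" record (reqid -> latest version), as trimming does, so completed client requests remain recognizable after the merge:

```python
# Hypothetical sketch: entries are (reqid, version) pairs; the merged
# result keeps only dup records, newest version per request id.
def merge_logs_to_dups(entries_a, entries_b, existing_dups=None):
    dups = dict(existing_dups or {})
    for reqid, version in list(entries_a) + list(entries_b):
        # Keep the newest version seen for each request id.
        if reqid not in dups or version > dups[reqid]:
            dups[reqid] = version
    return dups
```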
Sage Weil [Sat, 17 Feb 2018 17:38:57 +0000 (11:38 -0600)]
osd: notify mon when pending PGs are ready to merge
When a PG is in the pending merge state, its id is >= pg_num_pending and <
pg_num. When this happens, quiesce IO, peer, wait for activate to commit,
and then notify the mon that we are idle and safe to merge.
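The range check above amounts to the following; an illustrative helper only (the real check operates on pg_t in the OSD's C++ code):

```python
def in_pending_merge_range(pg_id, pg_num_pending, pg_num):
    """A PG whose id falls in [pg_num_pending, pg_num) is a merge
    source: it should quiesce IO, peer, and, once activate commits,
    tell the mon it is ready to merge."""
    return pg_num_pending <= pg_id < pg_num
```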
Sage Weil [Fri, 6 Apr 2018 15:26:10 +0000 (10:26 -0500)]
mgr: add simple controller to adjust pg[p]_num_actual
This is a pretty trivial controller. It adds some constraints that were
obviously not there before when the user could set these values to anything
they wanted, but does not implement all of the "nice" stepping that we'll
eventually want. That can come later.
Splits:
- throttle pg_num increases, currently using the same config option
(mon_osd_max_creating_pgs) that we used to throttle pg creation
- do not increase pg_num until the initial pg creation has completed.
Merges:
- wait until the source and target pgs for merge are active and clean
before doing a merge.
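One step of such a controller could look like this; a hypothetical sketch under the constraints listed above (parameter names are illustrative, and the real mgr logic differs in detail):

```python
def next_pg_num(pg_num, pg_num_target, initial_creation_done,
                inflight_creating_pgs, mon_osd_max_creating_pgs,
                merge_pgs_active_clean):
    """Move pg_num one controller step toward pg_num_target."""
    if pg_num_target > pg_num:
        # Splits: wait for the initial pg creation to complete, then
        # throttle the increase by the creating-pg budget.
        if not initial_creation_done:
            return pg_num
        budget = max(0, mon_osd_max_creating_pgs - inflight_creating_pgs)
        return min(pg_num_target, pg_num + budget)
    if pg_num_target < pg_num:
        # Merges: only proceed when source and target are active+clean.
        return pg_num - 1 if merge_pgs_active_clean else pg_num
    return pg_num
```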
Sage Weil [Fri, 16 Feb 2018 03:25:32 +0000 (21:25 -0600)]
mon/OSDMonitor: allow pg_num to adjusted up or down via pg[p]_num_target
The CLI now sets the *_target values, imposing only the subset of constraints that
the user needs to be concerned with.
New "pg_num_actual" and "pgp_num_actual" properties/commands are added that allow
the underlying raw values to be adjusted. For the merge case, this sets
pg_num_pending instead of pg_num so that the OSDs can go through the
merge prep process.
A controller (in a future commit) will make pg[p]_num converge to pg[p]_num_target.
Sage Weil [Mon, 9 Jul 2018 22:22:58 +0000 (17:22 -0500)]
os/bluestore: fix osr_drain before merge
We need to make sure the deferred writes on the source collection finish
before the merge so that ops ordered via the final target sequencer will
occur after those writes.
Sage Weil [Sun, 8 Jul 2018 19:24:49 +0000 (14:24 -0500)]
os/bluestore: allow reuse of osr from existing collection
We try to attach an old osr at prepare_new_collection time, but that
happens before a transaction is submitted, and we might have a
transaction that removes and then recreates a collection.
Move the logic to _osr_attach and extend it to include reusing an osr
in use by a collection already in coll_map. Also adjust the
_osr_register_zombie method to behave if the osr is already there, which
can happen with a remove, create, remove+create transaction sequence.
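The reuse rule can be sketched as follows; a hypothetical illustration of the preference order described above, not the actual BlueStore C++ code:

```python
def osr_attach(cid, coll_map, zombie_osrs, make_osr):
    """Prefer the osr of a same-cid collection already in coll_map,
    else revive a zombie osr left behind by a removed collection
    (deregistering it), else create a fresh sequencer."""
    if cid in coll_map:
        return coll_map[cid]           # collection exists: share its osr
    if cid in zombie_osrs:
        return zombie_osrs.pop(cid)    # reuse and deregister the zombie
    return make_osr(cid)
```

This keeps ops for a remove, create, remove+create sequence on one sequencer, preserving their ordering.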
Fixes: https://tracker.ceph.com/issues/25180
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Sat, 4 Aug 2018 18:51:05 +0000 (13:51 -0500)]
os/filestore: (re)implement merge
Merging is a bit different than splitting, because the two collections
may already be hashed at different levels. Since lookup etc rely on the
idea that the object is always at the deepest level of hashing, if you
merge collections with different levels that share some common bit prefix
then some objects will end up higher up the hierarchy even though deeper
hashed directories exist.
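The placement rule at issue can be illustrated like this; a toy sketch only (the real FileStore HashIndex uses reversed-hash nibbles and more machinery):

```python
def hashed_dir(hash_nibbles, level):
    """Toy model: an object whose hash begins with these nibbles is
    normally stored in the directory named by its first `level`
    nibbles, the deepest hashed level."""
    return "/".join(hash_nibbles[:level]) if level else "."

# Merging a source hashed at level 2 into a target hashed at level 3 can
# leave a source object at "3/f" even though the deeper "3/f/a" exists,
# so post-merge lookup cannot assume objects sit at the deepest level.
```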
ceph-volume lvm.batch use 'ceph' as the cluster name with filestore
Custom cluster names are currently broken in ceph-volume; this should be
addressed by http://tracker.ceph.com/issues/27210, which is out of scope
for these changes.
ceph-volume lvm.api use double -f flags when calling pvremove
Fairly destructive, just like everything else when zapping a device.
This is required when duplicate UUIDs are detected, something that
surfaced when testing with a loop device standing in for an NVMe device
(the loop device ends up with the same UUID as the NVMe device).
test/crimson: do not use unit.cc as the driver of unittest_seastar_denc
unit.cc initializes the CephContext and all of Ceph's infrastructure,
which is not necessary for the denc test. Also, seastar's builtin
allocator only pre-allocates 32 << 20 bytes; that is enough for the denc
test, but not necessarily enough to create CephContext and its friends.
One option is to use seastar's app template to initialize the memory
allocator properly; another is to avoid initializing CephContext in this
test.