Sage Weil [Wed, 9 Aug 2017 21:25:12 +0000 (17:25 -0400)]
crush/builder: fix ENOENT when removing last bucket item
We were decrementing size inside the loop, which broke the ENOENT condition check that follows it.
Fix by decrementing size only after we break out of the loop and verify
we found the item.
Fix a follow-on bug by avoiding realloc when we have 0 items left. This case
was never exercised before due to the ENOENT issue; now we return explicitly.
It's really not necessary to realloc at all, probably, since these are very
small arrays, but in any case leaving a single item allocation there in place of
a 0-length allocation is fine. (And the 0-length allocation behavior on realloc
is undefined; it may return either a pointer or NULL.)
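The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual crush/builder code: the `Bucket` struct and `bucket_remove_item` function are hypothetical stand-ins, and only the "search, check, then decrement" order and the "no realloc at 0 items" rule come from the commit message.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical, simplified stand-in for a CRUSH bucket: just an item array.
struct Bucket {
  int *items;
  unsigned size;
};

// Remove `item` from the bucket. Returns 0 on success, -ENOENT if the item
// is absent, -ENOMEM if shrinking the array fails.
int bucket_remove_item(Bucket *b, int item) {
  assert(b != nullptr);

  unsigned i;
  for (i = 0; i < b->size; i++) {
    if (b->items[i] == item)
      break;  // found; do NOT touch size here
  }
  // Compare against the unmodified size: i == size means "not found".
  // If size had already been decremented, removing the last item would
  // also hit this branch and return a spurious -ENOENT.
  if (i == b->size)
    return -ENOENT;

  // Shift the tail down over the removed slot, then shrink the count.
  for (unsigned j = i; j + 1 < b->size; j++)
    b->items[j] = b->items[j + 1];
  b->size--;

  // Avoid a 0-length realloc; keep the existing small allocation instead.
  if (b->size == 0)
    return 0;

  void *p = std::realloc(b->items, sizeof(int) * b->size);
  if (!p)
    return -ENOMEM;
  b->items = static_cast<int *>(p);
  return 0;
}
```

Removing the only remaining item returns 0 and leaves `items` pointing at the old allocation, so a later free of the bucket still works.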
Sage Weil [Fri, 11 Aug 2017 15:58:42 +0000 (11:58 -0400)]
os/bluestore: do not segv on kraken upgrade debug print
When loading an onode from kraken we have a compat path that calls
get_ref before the SharedBlob pointer is initialized. This is fine except
that if debugging is enabled the operator<< on the Blob will segv on
printing *b.shared_blob (which is NULL).
Fix operator<< to print something else if it is NULL. shared_blob does
get set up right after the call to decode() so having it be NULL at this
point is otherwise harmless.
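A null-safe stream operator of the kind described might look like the sketch below. The `Blob` and `SharedBlob` types here are simplified, hypothetical stand-ins rather than the BlueStore classes, and the placeholder text printed for the NULL case is an assumption.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-ins; the real BlueStore types carry far more state.
struct SharedBlob {
  int id;
};

std::ostream &operator<<(std::ostream &out, const SharedBlob &sb) {
  return out << "SharedBlob(" << sb.id << ")";
}

struct Blob {
  std::shared_ptr<SharedBlob> shared_blob;  // may still be unset right after decode
};

std::ostream &operator<<(std::ostream &out, const Blob &b) {
  out << "Blob(";
  if (b.shared_blob)
    out << *b.shared_blob;        // safe: pointer is set
  else
    out << "(shared_blob=NULL)";  // print a placeholder instead of dereferencing
  return out << ")";
}

// Small helper used only to render a Blob to a string.
std::string describe(const Blob &b) {
  assert(true);  // no preconditions; a Blob without a shared_blob is valid input
  std::ostringstream ss;
  ss << b;
  return ss.str();
}
```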
Sage Weil [Fri, 11 Aug 2017 16:46:09 +0000 (12:46 -0400)]
os/bluestore: fix clone dirty_range again
If we are cloning a blob for a 1 byte logical extent then dirty_range_begin
will equal _end and we won't dirty the source onode (with possibly newly
shared blobs).
Fix by using a separate flag to indicate whether we are dirtying instead
of overloading the begin/end markers for this. Note that even if they
are equal dirty_range will still dirty the shard in question.
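The difference between overloading the markers and using a separate flag can be shown with a small model. This is only a sketch under assumptions: `DirtyTracker` and its members are hypothetical, and the way the markers are updated here is simplified so that a single touched offset yields begin == end, which is the situation the commit describes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model of tracking which logical range a clone touched.
struct DirtyTracker {
  bool dirty = false;  // separate flag: "did we touch anything at all?"
  uint64_t begin = 0;  // lowest offset touched
  uint64_t end = 0;    // highest offset touched

  void note(uint64_t off) {
    if (!dirty) {
      dirty = true;
      begin = end = off;
    } else {
      begin = std::min(begin, off);
      end = std::max(end, off);
    }
  }

  // Overloaded-marker check: misses the case where begin == end.
  bool overloaded_check() const { return begin != end; }

  // Flag-based check: independent of the marker values.
  bool flag_check() const { return dirty; }
};
```

With one touched offset the overloaded check reports nothing to dirty, while the flag correctly reports that work is needed.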
Consider the following use case:
(1) randomly choose some OSDs (e.g., from different hosts) and try to make them for private use only,
say, by grouping them into 'pool1'
(2) ceph osd crush set-device-class pool1 'OSDs from (1)'
(3) ceph osd crush rule create-replicated rule_for_pool1 default host pool1
(4) ceph osd pool rename pool1 pool2
(5) ceph osd crush class rename pool1 pool2
In the above use case, we need to safely change a pool name without worrying
about any risk of data migration. That is why the 'osd crush class rename' command
is still needed here.
Greg Farnum [Wed, 9 Aug 2017 21:34:44 +0000 (14:34 -0700)]
mdsmon: treat the osdmon correctly when doing plugged updates
Make sure it's writeable before invoking changes, and propose_pending()
on it when we're done.
Make the PaxosService::C_RetryMessage public so we can do this from FSCommands.
Sage Weil [Tue, 8 Aug 2017 22:43:22 +0000 (18:43 -0400)]
mon/Elector: force election epoch bump on start
We are generally careful when bumping the epoch so that we can join
existing rounds. However, if we restart in the middle of an election,
and change versions, we need to be certain that our previous ACK (as
$version - 1) isn't accepted as truth for the restarted daemon (running
$version) keeping the same epoch.
The conservatism with bumping is to avoid spurious election cycles, but
mon restarts are more rare, and we need them here.
Andrew Schoen [Fri, 4 Aug 2017 15:40:39 +0000 (10:40 -0500)]
ceph-volume: adds a functional testing scenario for lvm create
This sets up the basic test harness and adds a test for the create
subcommand. The test uses ceph-ansible to deploy a cluster using
``ceph-volume lvm create``, tests the cluster state using the
ceph-ansible test suite, reboots the nodes and then tests again.
Sage Weil [Wed, 9 Aug 2017 20:40:43 +0000 (16:40 -0400)]
qa/suites/upgrade/jewel-x/parallel: thrash layout
We can't kill and restart osds because that will interfere with
the upgrade process. We can, however, thrash the layout by
tweaking osd weights and so on. This will exercise osd recovery
paths during the upgrade that aren't normally exercised (outside
of stress-split, which doesn't upgrade individual osds while they
are non-clean).
Sage Weil [Wed, 9 Aug 2017 16:50:57 +0000 (12:50 -0400)]
osd/PG: force rebuild of missing set on jewel upgrade
Previously we were detecting the need to rebuild missing based on
whether the "divergent_priors" omap key was present. Unfortunately,
jewel does not always set this, so it is not a reliable indicator.
(It only gets set if you actually have a divergent prior at some
point in the PG's life time on that OSD.)
Fix by using the info_struct_v on the PG to detect whether we need
to do the conversion. We didn't bump the value when we added
the missing persistence, but the fastinfo was also added during
the same period between jewel and kraken, so it will work just as
well.
James Page [Wed, 9 Aug 2017 09:04:37 +0000 (10:04 +0100)]
Align use of uint64_t in service_daemon::AttributeType
size_t on a 32-bit architecture is a 32 bit unsigned int which
created ambiguity when casting to bool, uint64_t or std::string
(which are boost::variants for service_daemon::AttributeType).
Align to use of uint64_t to resolve compilation failures in
all 32-bit architectures.
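The ambiguity can be illustrated with plain overloads, as a rough analogy for how a variant picks among its alternatives. This is an assumption-laden sketch: `Kind`, `classify`, and `classify_count` are hypothetical names, and boost::variant is not used here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

enum class Kind { Bool, U64, String };

// One overload per alternative, mirroring bool / uint64_t / std::string.
Kind classify(bool) { return Kind::Bool; }
Kind classify(uint64_t) { return Kind::U64; }
Kind classify(const std::string &) { return Kind::String; }

Kind classify_count(std::size_t n) {
  assert(n == static_cast<std::size_t>(static_cast<uint64_t>(n)));  // cast is lossless
  // Calling classify(n) directly is fine where size_t and uint64_t are the
  // same type, but where size_t is a 32-bit unsigned int the conversions to
  // bool and to uint64_t rank equally and the call is ambiguous.
  // Casting explicitly selects the uint64_t alternative on every platform.
  return classify(static_cast<uint64_t>(n));
}
```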
Gregory Farnum [Tue, 8 Aug 2017 21:27:28 +0000 (14:27 -0700)]
Merge pull request #16755 from ivancich/wip-pull-new-dmclock
osd: bring in dmclock library changes
Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
(cherry picked from commit 25f1edefbf21f17f5501d9894f0c4979c04b3f08)
Piotr Dałek [Wed, 7 Jun 2017 14:01:37 +0000 (16:01 +0200)]
rbd: parallelize rbd ls -l
When a cluster contains a large number of images, "rbd ls -l" takes a
long time to finish. In my particular case, it took about 58s to
process 3000 images.
"rbd ls -l" opens each image, and that takes the majority of the time, so
improve this by using aio_open() and aio_close() to do it
asynchronously. This reduced total processing time to around 15
seconds when using the default of 10 concurrently opened images.
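The bounded-concurrency idea can be sketched as below. This does not use librbd or its aio_open()/aio_close() calls; it swaps in std::async, and `ImageInfo`, `open_and_stat`, and `list_images` are hypothetical names. Only the window of 10 in-flight opens comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <future>
#include <string>
#include <vector>

struct ImageInfo {
  std::string name;
  std::size_t size;
};

// Hypothetical stand-in for "open the image, read its metadata, close it".
ImageInfo open_and_stat(const std::string &name) {
  return ImageInfo{name, name.size()};
}

// Keep at most `max_in_flight` opens running at once; results come back
// in the same order as `names` because the window is drained FIFO.
std::vector<ImageInfo> list_images(const std::vector<std::string> &names,
                                   std::size_t max_in_flight = 10) {
  assert(max_in_flight > 0);
  std::vector<ImageInfo> out;
  out.reserve(names.size());
  std::deque<std::future<ImageInfo>> in_flight;

  for (const auto &name : names) {
    if (in_flight.size() == max_in_flight) {
      // Window is full: wait for the oldest open before starting another.
      out.push_back(in_flight.front().get());
      in_flight.pop_front();
    }
    in_flight.push_back(std::async(std::launch::async, open_and_stat, name));
  }
  // Drain whatever is still running.
  while (!in_flight.empty()) {
    out.push_back(in_flight.front().get());
    in_flight.pop_front();
  }
  return out;
}
```

Draining in FIFO order keeps the listing stable at the cost of occasionally waiting on a slow image while faster ones have already finished.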
Abhishek L [Tue, 8 Aug 2017 18:53:40 +0000 (20:53 +0200)]
Merge pull request #16914 from theanalyst/wip-16734
luminous: rgw_lc: support for AWSv4 authentication
Reviewed-By: Daniel Gryniewicz <dang@redhat.com>
Reviewed-By: Radoslaw Zarzynski <rzarzynski@redhat.com>
Reviewed-By: Matt Benjamin <mbenjami@redhat.com>
* refs/remotes/upstream/pull/16378/head:
doc: remove accidental additions to release notes
qa/cephfs: Fix race in test_volume_client
qa/cephfs: Test filtered df
PendingReleaseNotes: add note about df filtering
client: Support new, filtered MStatfs
objecter: Support new, filtered MStatfs
mon/PGMap stats: Support new, filtered MStatfs
messages: Add optional data pool to MStatfs
Reviewed-by: John Spray <john.spray@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Sage Weil [Sat, 5 Aug 2017 19:30:15 +0000 (15:30 -0400)]
mon: include PGMonitor commands with mixed-version mons
While we have a mixed version cluster, we have to advertise our
PGMonitor commands to our peons or else commands like 'pg dump'
won't work.
Once the mon feature flag is set, we can drop that because each
mon will include the mgr commands (either those stored in paxos
or the statically compiled ones until that point).
Sage Weil [Sat, 5 Aug 2017 19:08:26 +0000 (15:08 -0400)]
mon: use vector<MonCommand> throughout for commands
The old code was pretty messy. This standardizes on std::vector
throughout. We also drop the win_election command args because
when we win an election we always set the leader commands to our
commands, and we can do that inside win_command() without passing
them in from here.
Jason Dillaman [Fri, 21 Jul 2017 15:18:46 +0000 (11:18 -0400)]
rbd-mirror: restore deletion propagation and image replayer cleanup
The previous intermediate commits removed handling for deletion
propagation and image replayer cleanup since this logic has been
moved from instance to image replayer. Note that eventually the
policy's release notification will be responsible for the cleanup
of image replayers.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>