Samuel Just [Fri, 7 Nov 2014 23:20:02 +0000 (15:20 -0800)]
osd/: go "peered" instead of "active" when < min_size
In the case of a replicated pool, the pg will transition to "peered"
rather than "active", where it can perform backfill bringing itself up
to min_size peers. To that end, most operations now require
is_peered rather than is_active (OSDOps being the primary exception).
Also, rather than using the query_epoch on the activation message as the
activation epoch (for last_epoch_started) on the replica, we instead
use the last_epoch_started in the info sent by the primary. This
allows the primary to not advance last_epoch_started past the last
known actual activation. This will prevent later peering epochs from
requiring the last_update from a peered epoch to go active (might be
divergent).
Fixes: #7862
Signed-off-by: Samuel Just <sam.just@inktank.com>
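The peered/active distinction described in the commit above can be sketched as a toy model. This is not the actual PG code: the function names and the string states are made up for illustration, and only the rule from the commit message (go "peered" when below min_size, with client OSDOps still requiring "active") is taken from the source.

```python
# Toy model only: names are illustrative, not Ceph's actual PG code.

def pg_state_after_peering(acting_replicas: int, min_size: int) -> str:
    """Return the state a replicated PG settles in once peering completes."""
    if acting_replicas >= min_size:
        return "active"   # enough replicas: client I/O may proceed
    return "peered"       # too few replicas: backfill can run, client ops wait


def can_serve(state: str, is_client_op: bool) -> bool:
    """Client ops (OSDOps) need 'active'; other work only needs peering done."""
    if is_client_op:
        return state == "active"
    return state in ("active", "peered")
```

With min_size 2 and a single surviving replica, this model reports "peered": backfill-style work is allowed while client ops are held back.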
Samuel Just [Tue, 18 Nov 2014 00:22:59 +0000 (16:22 -0800)]
PG::proc_master_log: assume les and history.les from log source
Otherwise, we might later go peered (not active) and not distribute
a non-0 last_epoch_started. This should be safe as the log recipient
will have a last_update reflecting that interval as well.
Samuel Just [Wed, 29 Oct 2014 21:10:13 +0000 (14:10 -0700)]
PG: break waiting_for_peered out of waiting_for_active
waiting_for_peered now holds ops until peering completes (activation,
not necessarily state active). waiting_for_active now holds
specifically MOSDOp blocked on:
- scrub
- replay
- state active
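The split between the two wait queues might look roughly like the sketch below. The class, its fields, and the submit() method are hypothetical; the scrub and replay conditions listed above are left out for brevity, so only the "state active" case is modelled.

```python
from collections import deque


class ToyPG:
    """Hypothetical stand-in for a PG with the two wait queues."""

    def __init__(self):
        self.peered = False   # peering has completed
        self.active = False   # PG is in state active
        self.waiting_for_peered = deque()
        self.waiting_for_active = deque()

    def submit(self, op, is_client_op):
        """Run the op or park it on the queue matching what it is blocked on."""
        if not self.peered:
            self.waiting_for_peered.append(op)
            return "waiting_for_peered"
        if is_client_op and not self.active:
            # Only client ops (MOSDOp) wait for state active in this sketch.
            self.waiting_for_active.append(op)
            return "waiting_for_active"
        return "run"
```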
Sage Weil [Thu, 29 Jan 2015 21:00:15 +0000 (13:00 -0800)]
librados: rename NOREUSE to NOCACHE
As far as I can tell, the posix_fadvise() distinction between WONTNEED and
NOREUSE is subtle: one says I won't access the data, and the other says
I will access it one more time and then not access it. That is, the
distinction is about time. This thread seems to confirm this
interpretation:
https://lkml.org/lkml/2011/6/27/44
Since we are attaching hints to the IO operations themselves, this
distinction doesn't make much sense for us. (Backends should be careful
about which hint they use; or rather, they should use WONTNEED *after*
doing the IO since NOREUSE is presently a no-op in Linux.)
However, we want to make a totally different distinction:
WONTNEED - nobody will access this -> drop it from the cache
NOCACHE - *i* won't access this again -> don't let me affect your caching
decisions or the working set you're maintaining for other
clients.
The NOCACHE name is made-up and distinct from NOREUSE only so that it is
different from POSIX and doesn't introduce confusion for people familiar
with the POSIX meaning. Perhaps a more accurate name would be IWONTNEED
but that is only one character apart and too error-prone IMO.
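To make the intended difference between the two hints concrete, here is a toy cache. The hint names follow the commit text; the ToyCache class and its behaviour are assumptions made for illustration and do not reflect how librados or any backend implements the flags.

```python
# Hint names mirror the commit text; everything else is a made-up model.
WONTNEED = "WONTNEED"
NOCACHE = "NOCACHE"


class ToyCache:
    def __init__(self):
        self.entries = {}

    def read(self, key, backing, hint=None):
        """Read key, consulting the cache first, then apply the hint."""
        data = self.entries[key] if key in self.entries else backing[key]
        if hint == WONTNEED:
            # Nobody will access this: drop it from the cache.
            self.entries.pop(key, None)
        elif hint == NOCACHE:
            # This client won't access it again: leave the cache untouched,
            # neither inserting the data nor evicting what others cached.
            pass
        else:
            self.entries[key] = data
        return data
```

In this model a NOCACHE read of an object that another client already cached leaves it cached, whereas a WONTNEED read evicts it.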
Yehuda Sadeh [Wed, 21 Jan 2015 00:36:34 +0000 (16:36 -0800)]
rgw: convert old replicalog entries if needed
If reading a bucket replicalog entry and one doesn't exist, fall back to
the old key, and convert it to the new one. When updating entries, if
entry does not exist do the same.
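The fallback-and-convert behaviour described above could be sketched like this. The dict stands in for the replica log store, and both key arguments are hypothetical; whether the real code removes the old key after conversion is not stated in the commit message, so the removal here is an assumption.

```python
def read_replicalog_entry(store: dict, new_key: str, old_key: str):
    """Return the entry for new_key, converting an old-style entry if needed."""
    if new_key in store:
        return store[new_key]
    if old_key in store:
        # Assumption: conversion moves the entry from the old key to the new.
        store[new_key] = store.pop(old_key)
        return store[new_key]
    return None  # neither form exists
```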
Jason Dillaman [Wed, 28 Jan 2015 15:32:20 +0000 (10:32 -0500)]
tests: ensure RBD integration tests exercise all features
The RBD_FEATURES environment variable was not being exported to
the Python and C++ integration tests. This resulted in the same
test cases being run multiple times instead of testing different
RBD features.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Yehuda Sadeh [Wed, 7 May 2014 23:19:56 +0000 (16:19 -0700)]
rgw: fix replica log indexing
Fixes: #8251
Previously we were indexing the replica log only by bucket name, even
though we were provided with the bucket instance id. Fix that, and also
add the option to be able to revert to the old behavior. For
radosgw-admin we can do it if --replicalog-index-by-instance=false is
provided. For the replica log REST api we now have the index-by-instance
http param (defaults to true).
John Spray [Tue, 27 Jan 2015 11:13:59 +0000 (11:13 +0000)]
common: filtering in `perf dump`
So that we can get out a particular subsystem
or particular counter without dumping
everything. Should make tests that watch perf
counters much less spammy!
`logger` and `counter` params are used with
an exact comparison here but the interface
should be amenable to extending to e.g. globbing
if we wanted to in the future.
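A minimal sketch of the exact-match filtering described above, assuming the dump is a mapping of logger name to a mapping of counter name to value. The function name and the sample data in the usage note are invented; only the `logger` and `counter` parameter names come from the commit message.

```python
def filter_perf_dump(dump, logger="", counter=""):
    """Keep only the requested logger and/or counter, using exact comparison.

    An empty string means "no filter" for that parameter.
    """
    result = {}
    for logger_name, counters in dump.items():
        if logger and logger_name != logger:
            continue
        selected = {
            name: value
            for name, value in counters.items()
            if not counter or name == counter
        }
        if selected:
            result[logger_name] = selected
    return result
```

For example, with a made-up dump {"osd": {"op": 3, "op_r": 1}, "mds": {"req": 7}}, filtering on logger "osd" and counter "op" yields {"osd": {"op": 3}}. Swapping the equality checks for fnmatch-style matching would be one way to add the globbing mentioned above.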
Mykola Golub [Tue, 27 Jan 2015 08:51:07 +0000 (10:51 +0200)]
tests: bring back useful test 'ceph tell osd.foo'
The test was removed in 1189138 (mon: make ceph tell mon.* version
work) as it began to fail due to #10439. After it was fixed in c4548f6
(pybind: ceph_argparse: validate incorrectly formed targets), the test
can be restored.
Jason Dillaman [Mon, 26 Jan 2015 00:38:46 +0000 (19:38 -0500)]
tests: replace existing gtest 1.5.0 with gmock/gtest 1.7.0
Google Testing Framework is included by default within the Google
C++ Mocking Framework. Update makefiles to use new gmock/gtest
libraries and remove old gtest source code.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 26 Jan 2015 03:15:07 +0000 (22:15 -0500)]
librbd: trim header update not using AIO
The original trim header update code was using blocking IO to
update the header. After migrating to an asynchronous trim
which performs all work in librados callbacks, it exposed a
potential deadlock in the librados_test_stub when attempting
to do blocking IO within a librados callback. This commit
changes the header update to use AIO.
Fixes: #10637
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 19 Jan 2015 22:33:41 +0000 (17:33 -0500)]
librbd: throttle async progress callbacks
Ensure that no more than one outstanding progress callback
is queued for notification. This will allow remote progress
updates to be sent at a rate that all watch/notify
clients can support.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
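One way to keep at most one progress callback outstanding is sketched below. The class and method names are hypothetical, and coalescing intermediate updates into the latest value is an assumption; the commit message only states that no more than one callback is queued at a time.

```python
class ProgressThrottle:
    """Allow at most one outstanding progress notification at a time."""

    def __init__(self, send):
        self._send = send        # callable that queues one notification
        self._in_flight = False
        self._latest = None      # most recent progress not yet sent

    def update(self, offset, total):
        """Record new progress; send it only if nothing is outstanding."""
        self._latest = (offset, total)   # newer progress replaces older
        if not self._in_flight:
            self._dispatch()

    def on_complete(self):
        """Call when the outstanding notification has been delivered."""
        self._in_flight = False
        if self._latest is not None:
            self._dispatch()

    def _dispatch(self):
        self._in_flight = True
        progress, self._latest = self._latest, None
        self._send(progress)
```

Slow watch/notify clients then see fewer, more recent updates instead of a growing backlog.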
Jason Dillaman [Wed, 14 Jan 2015 20:56:15 +0000 (15:56 -0500)]
librbd: add more robust retry handling to maintenance ops
When image locking is enabled, snapshot create, resize, and
flatten are coordinated with the lock owner. Previously, if the
lock owner changed during one of these operations, the
operation would fail. Now librbd will attempt to restart the
operation with the new lock owner (or become the owner itself).
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
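The restart behaviour described above can be pictured as a retry loop. Everything here is hypothetical: the LockOwnerChanged exception, the run_maintenance_op helper, and the max_attempts bound are illustrative and are not taken from librbd.

```python
class LockOwnerChanged(Exception):
    """Hypothetical signal that the lock owner changed mid-operation."""


def run_maintenance_op(request_op, max_attempts=3):
    """Re-issue the operation whenever the lock owner changes underneath it.

    request_op is expected to locate the current owner (or take the lock
    itself) each time it is called.
    """
    for _ in range(max_attempts):
        try:
            return request_op()
        except LockOwnerChanged:
            continue  # owner changed: restart against the new owner
    raise RuntimeError("lock owner kept changing; giving up")
```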
Jason Dillaman [Wed, 14 Jan 2015 16:49:13 +0000 (11:49 -0500)]
librbd: assert header lock ownership for maint operations
The resize, flatten, and snapshot maintenance operations now
use the new assert_lock feature to ensure that the current
client still owns the header lock when making changes.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 18 Nov 2014 08:56:41 +0000 (03:56 -0500)]
cls_lock: New assert_locked operation
The assert_locked operation can be combined with other
RADOS ops to prevent an update to a locked object when
the client doesn't own the lock. It will not attempt to
acquire the lock if the object is not currently locked.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
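The effect of combining an assert_locked step with an update can be sketched as an all-or-nothing compound operation. The dict layout and function name are invented for illustration, and treating an unlocked object as a failed assertion is an assumption based on the statement that the op never acquires the lock itself.

```python
def update_if_lock_owner(obj: dict, client: str, updates: dict) -> bool:
    """Apply updates only if `client` currently holds the object's lock."""
    # assert_locked step: check ownership, never try to take the lock.
    if obj.get("lock_owner") != client:
        return False  # whole compound op is rejected, nothing is written
    obj["data"].update(updates)
    return True
```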
Jason Dillaman [Thu, 9 Oct 2014 04:00:17 +0000 (00:00 -0400)]
librbd: Coordinate maintenance through exclusive lock leader
When the exclusive lock feature is enabled, only a single client can
modify the image. As a result, certain maintenance activities
need to be proxied from the maintenance client to the active
leader.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 10 Nov 2014 17:25:50 +0000 (12:25 -0500)]
librbd: Create async versions of long-running maintenance operations
Resize and flatten now have async versions. The existing resize
and flatten operations now use the async versions internally. The
async operations will be used by the client holding the exclusive
lock when it receives maintenance requests from other clients.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Sat, 24 Jan 2015 07:28:07 +0000 (02:28 -0500)]
librbd: trim would not complete if exclusive lock is lost
The trim completion context was not properly invoked if the
image's exclusive lock was lost between issuing a librados call
and receiving its completion.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
delco225 [Fri, 23 Jan 2015 09:32:30 +0000 (10:32 +0100)]
bug: error when installing ceph dependencies with install-deps.sh
The parsing is sensitive to i18n and will fail if, for instance, it is set to French.
Work around the problem by always setting the language to C so the script
can safely assume all output will be in English.
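The same idea, expressed in Python rather than in the shell script itself: build an environment that forces the C locale before running a tool whose output will be parsed. The helper name is hypothetical, and setting both LC_ALL and LANG is an assumption; the commit message does not say which variable the script sets.

```python
import os


def c_locale_env(base=None):
    """Return a copy of the environment with the locale forced to C."""
    env = dict(os.environ if base is None else base)
    env["LC_ALL"] = "C"   # LC_ALL overrides every other locale setting
    env["LANG"] = "C"
    return env
```

The result can be passed as the env argument of subprocess.run() so that the child's messages are not translated.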
Ken Dreyer [Fri, 23 Jan 2015 22:08:34 +0000 (15:08 -0700)]
ceph.spec.in: use wildcards to capture man pages
Use wildcard to capture gzipped man pages for ceph-clsinfo(8) and
librados-config(8). In addition to future-proofing us against
possible compression type changes down the road, this also aligns us
with the existing convention that's used to capture the rest of the man
page files.
Jason Dillaman [Fri, 23 Jan 2015 17:56:56 +0000 (12:56 -0500)]
librbd: potential deadlock on close_image
The owner_lock was incorrectly held when unregistering the image
watcher. It was possible for the ImageWatcher finisher to be
running code that was then deadlocked waiting to acquire the
owner_lock while the close_image thread was attempting to shutdown
the deadlocked finisher.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Wed, 21 Jan 2015 13:58:57 +0000 (08:58 -0500)]
librbd: fix copy-on-read / resize down race condition
There was a rare race condition between a pending CoR operation
and a resize down operation resulting in a CoR copyup past the
new, reduced parent overlap. This commit also adds additional
log message details for CoR.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 20 Jan 2015 17:12:43 +0000 (12:12 -0500)]
test: add rados_nobjects_list_xyz functions to librados test stub
The new RBD copy-on-read unit test case uses these RADOS functions
to verify that the CoR operation was successful. This implements
these functions in the librados_test_stub library.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Sat, 17 Jan 2015 05:18:24 +0000 (00:18 -0500)]
librbd: use finisher for copy-on-read copyup fulfillment
When the RBD cache is enabled, the ObjectCacher does not allow
reentrancy to read the full object. As a temporary workaround,
use the Finisher to handle CoR read requests.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Samuel Just [Fri, 23 Jan 2015 17:07:44 +0000 (09:07 -0800)]
ReplicatedPG::hit_set_persist: update ssc->snapset as well
This is a hack. The correct answer is to adapt this method and
finish_ctx to allow this method to use finish_ctx. This is complicated
by the presence of the hit set trims in the same repop, so for now, we
kick the can down the road a bit.
Fixes: 10616
Signed-off-by: Samuel Just <sjust@redhat.com>
Sage Weil [Fri, 23 Jan 2015 17:06:32 +0000 (09:06 -0800)]
librados: add FADVISE_NOREUSE
We left this off because it seemed the same as DONTNEED, but there is a
subtle distinction: DONTNEED means nobody will need it (and we probably
discard our cache), while NOREUSE means this client won't need it again
(and we should try to avoid polluting the cache from this IO only). At
least, that's the way we're defining it. posix_fadvise says:
POSIX_FADV_NOREUSE
The specified data will be accessed only once.
POSIX_FADV_DONTNEED
The specified data will not be accessed in the near future.
which is similar. I think our definitions make a bit more sense for the
multi-client environment.