Loic Dachary [Wed, 18 Mar 2015 23:32:39 +0000 (00:32 +0100)]
doc,tests: force checkout of submodules
When updating submodules, always checkout even if the HEAD is the
desired commit hash (update --force) to avoid the following:
* a directory gmock exists in hammer
* a submodule gmock replaces the directory gmock in master
* checkout master + submodule update : gmock/.git is created
* checkout hammer : the gmock directory still contains the .git from
master, because that .git did not exist in hammer and checkout won't
remove untracked directories
* checkout master + submodule update : git rev-parse HEAD is
at the desired commit although the content of the gmock directory
is from hammer
Greg Farnum [Fri, 6 Feb 2015 05:12:17 +0000 (21:12 -0800)]
fsync-tester: print info about PATH and locations of lsof lookup
We're seeing the lsof invocation fail (as not found) in testing and nobody can
identify why. Since attempting to reproduce the issue has not worked, this
patch will gather data from a genuinely in-vitro location.
Jason Dillaman [Wed, 4 Feb 2015 10:07:32 +0000 (05:07 -0500)]
osdc: Constrain max number of in-flight read requests
Constrain the number of in-flight RADOS read requests to the
cache size. This reduces the chance of the cache memory
ballooning during certain scenarios like copy-up which can
invoke many concurrent read requests.
Fixes: #9854
Backport: giant, firefly, dumpling
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
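Conceptually the constraint is a byte throttle on reads. A minimal sketch of the idea follows (not the actual ObjectCacher code; ReadThrottle and its methods are made-up names): a new read waits until the bytes already in flight plus its own length fit under the configured cache size.

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    // Hypothetical throttle: block a new read until it fits under max_bytes.
    // (A real implementation would also clamp 'len' to max_bytes so a single
    // oversized read cannot wait forever.)
    class ReadThrottle {
      std::mutex m;
      std::condition_variable cv;
      const uint64_t max_bytes;   // e.g. the configured cache size
      uint64_t in_flight = 0;     // bytes currently being read

    public:
      explicit ReadThrottle(uint64_t max) : max_bytes(max) {}

      void start_read(uint64_t len) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return in_flight + len <= max_bytes; });
        in_flight += len;
      }

      void finish_read(uint64_t len) {
        std::lock_guard<std::mutex> l(m);
        in_flight -= len;
        cv.notify_all();
      }
    };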
Loic Dachary [Fri, 30 Jan 2015 17:49:16 +0000 (18:49 +0100)]
osd: handle no-op write with snapshot case
If we have a transaction that does something to the object but it !exists
both before and after, we will continue through the write path. If the
snapdir object already exists, and we try to create it again, we will
leak a snapdir obc.
Fix is to not recreate the snapdir if it already exists.
Fixes: #10262
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Loic Dachary <ldachary@redhat.com>
(cherry picked from commit 02fae9fc54c10b5a932102bac43f32199d4cb612)
Sage Weil [Sun, 2 Nov 2014 22:27:12 +0000 (14:27 -0800)]
JournalingObjectStore: journal->committed_thru after replay
It's possible that the osd stopped between when the filestore
op_seq file was updated and when the journal was trimmed. In
that case, it's possible that on boot the journal might be
full, and yet not be trimmed because commit_start assumes
there is no work to do. Calling committed_thru on the journal
ensures that the journal matches committed_seq.
Backport: giant firefly emperor dumpling
Fixes: #6756
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
(cherry picked from commit 8924158df8580bbb462130b5bb7f753b13513a23)
Jason Dillaman [Mon, 19 Jan 2015 15:28:56 +0000 (10:28 -0500)]
librbd: gracefully handle deleted/renamed pools
snap_unprotect and list_children both attempt to scan all
pools. If a pool is deleted or renamed during the scan,
the methods would previously return -ENOENT. Both methods
have been modified to more gracefully handle this condition.
Fixes: #10270, #10122
Backport: giant, firefly
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 436923c68b77c900b7774fbef918c0d6e1614a36)
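The pattern can be sketched with the public librados C++ API (this is an illustration, not the librbd code; scan_pools is a hypothetical helper): a pool that vanishes between listing and ioctx_create() is skipped rather than failing the whole scan.

    #include <rados/librados.hpp>
    #include <cerrno>
    #include <list>
    #include <string>

    // Hypothetical helper: walk all pools, tolerating pools that are deleted
    // or renamed while the scan is in progress.
    int scan_pools(librados::Rados& rados) {
      std::list<std::string> pools;
      int r = rados.pool_list(pools);
      if (r < 0)
        return r;
      for (const auto& name : pools) {
        librados::IoCtx ioctx;
        r = rados.ioctx_create(name.c_str(), ioctx);
        if (r == -ENOENT)
          continue;        // pool disappeared mid-scan: skip it, don't fail
        if (r < 0)
          return r;        // any other error remains fatal
        // ... inspect the pool (e.g. look for child images) ...
      }
      return 0;
    }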
Jason Dillaman [Tue, 18 Nov 2014 02:49:26 +0000 (21:49 -0500)]
librbd: protect list_children from invalid child pool IoCtxs
While listing child images, don't ignore error codes returned
from librados when creating an IoCtx. This will prevent seg
faults from occurring when an invalid IoCtx is used.
Fixes: #10123
Backport: giant, firefly, dumpling
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 0d350b6817d7905908a4e432cd359ca1d36bab50)
Loic Dachary [Thu, 9 Oct 2014 16:52:17 +0000 (18:52 +0200)]
ceph-disk: run partprobe after zap
Not running partprobe after zapping a device can lead to the following:
* ceph-disk prepare /dev/loop2
* links are created in /dev/disk/by-partuuid
* ceph-disk zap /dev/loop2
* links are not removed from /dev/disk/by-partuuid
* ceph-disk prepare /dev/loop2
* some links are not created in /dev/disk/by-partuuid
The underlying assumption is that there is a bug in the way udev events
are handled by the operating system.
Loic Dachary [Fri, 10 Oct 2014 08:23:34 +0000 (10:23 +0200)]
ceph-disk: encapsulate partprobe / partx calls
Add the update_partition function to reduce code duplication.
The action is made an argument, although for now it is always -a, because
it will be -d when deleting a partition.
Use the update_partition function in prepare_journal_dev
Josh Durgin [Wed, 12 Nov 2014 02:16:02 +0000 (18:16 -0800)]
qa: allow small allocation diffs for exported rbds
The local filesystem may behave slightly differently. This isn't
foolproof, but seems to be reliable enough on rhel7 rootfs, where
exact comparison was failing.
Samuel Just [Tue, 23 Sep 2014 22:52:08 +0000 (15:52 -0700)]
ReplicatedPG: don't move on to the next snap immediately
If we have a bunch of trimmed snaps for which we have no
objects, we'll spin for a long time. Instead, requeue.
Fixes: #9487
Backport: dumpling, firefly, giant
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c17ac03a50da523f250eb6394c89cc7e93cb4659)
Sage Weil [Tue, 23 Sep 2014 23:21:33 +0000 (16:21 -0700)]
osd: initialize purged_snap on backfill start; restart backfill if change
If we backfill a PG to a new OSD, we currently neglect to initialize
purged_snaps. As a result, the first time the snaptrimmer runs it has to
churn through every deleted snap for all time, and to make matters worse
does so in one go with the PG lock held. This leads to badness on any
cluster with a significant number of removed snaps that experiences
backfill.
Resolve this by initializing purged_snaps when we finish backfill. The
backfill itself will clear out any stray snaps and ensure the object set
is in sync with purged_snaps. Note that purged_snaps on the primary
that is driving backfill will not change during this period as the
snaptrimmer is not scheduled unless the PG is clean (which it won't be
during backfill).
If by chance we interrupt backfill, go clean with other OSDs,
purge snaps, and then let this OSD rejoin, we will either restart
backfill (non-contiguous log) or the log will include the result of
the snap trim (the events that remove the trimmed snap).
Samuel Just [Mon, 29 Sep 2014 23:26:54 +0000 (16:26 -0700)]
ReplicatedPG: do not queue the snap trimmer constantly
Previously, we continuously requeued the snap trimmer while in
TrimmingObjects. This is not a good idea now that we try to
limit the number of snap trimming repops in flight and requeue
the snap trimmer directly as those repops complete.
Fixes: #9113
Backport: giant, dumpling, firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 34f38b68d89baf1dcbb4571d4f4d3076dc354538)
Greg Farnum [Thu, 23 Oct 2014 00:16:31 +0000 (17:16 -0700)]
client: cast m->get_client_tid() to compare to 16-bit Inode::flushing_cap_tid
m->get_client_tid() is 64 bits (as it should be), but Inode::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bit one
to 16 bits.
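Illustrative only (flush_acked is a made-up name, the types are simplified, and whether the real call site compares with == or >= is not asserted here): the point is that the 64-bit tid is downcast so the comparison happens at the 16-bit width that Inode::flushing_cap_tid actually stores.

    #include <cstdint>

    // Downcast the wide message tid before comparing; comparing at 64 bits
    // would go wrong once the low 16 bits have wrapped.
    bool flush_acked(uint64_t msg_client_tid, uint16_t flushing_cap_tid) {
      return static_cast<uint16_t>(msg_client_tid) >= flushing_cap_tid;
    }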
Xiaoxi Chen [Wed, 20 Aug 2014 07:35:44 +0000 (15:35 +0800)]
CrushWrapper: pick a ruleset same as rule_id
Originally, in the add_simple_ruleset function, the ruleset_id
is not reused but the rule_id is reused. So after some add/remove
operations against rules, a newly created rule is likely to have
ruleset != rule_id.
We don't want this to happen because we are trying to maintain the
invariant that ruleset == rule_id.
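A toy illustration of the invariant (not the CrushWrapper code; the container and names are invented): when reusing a free rule id, assign the ruleset the same number, so ruleset == rule_id survives add/remove churn.

    #include <optional>
    #include <vector>

    struct Rule { int rule_id; int ruleset; };

    // Hypothetical add: reuse the first free rule slot for the rule id and
    // set ruleset to the same value, preserving ruleset == rule_id.
    int add_simple_rule(std::vector<std::optional<Rule>>& rules) {
      int rule_id = 0;
      while (rule_id < static_cast<int>(rules.size()) && rules[rule_id])
        ++rule_id;
      if (rule_id == static_cast<int>(rules.size()))
        rules.emplace_back();
      rules[rule_id] = Rule{rule_id, /*ruleset=*/rule_id};
      return rule_id;
    }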
Greg Farnum [Thu, 9 Oct 2014 22:12:19 +0000 (15:12 -0700)]
mds: Locker: fix a NULL deref in _update_cap_fields
The MClientCaps* is allowed to be NULL, so we can't deref it unless
the dirty param is non-zero. So don't do the ahead-of-time lookup;
just call it explicitly in the if block.
David Zafman [Wed, 11 Dec 2013 01:29:48 +0000 (17:29 -0800)]
osd: Fix problems in ReplicatedPG::do_op() logic
Fix assert(is_degraded_object(soid)) in ReplicatedPG::wait_for_degraded_object()
Use last_backfill_started as the backfill line
Handle uncommon case of multi op source after backfill line and target before
backfill line and !is_degraded_object().
Include backfill line itself for before_backfill (<= instead of <)
We can't adjust last_backfill to object x until x has been fully
backfilled. pending_backfill_updates contains all those backfills
started, but which have not yet been reflected in pinfo.last_update.
backfills_in_flight contains those backfills which have not yet
completed. Thus, we can adjust last_backfill to the largest entry
in pending_backfill_updates not in backfills_in_flight.
Samuel Just [Mon, 28 Oct 2013 22:53:24 +0000 (15:53 -0700)]
ReplicatedPG: replace backfill_pos with last_backfill_started
last_backfill_started reflects what pinfo.last_backfill will be
once all currently outstanding backfills complete. backfill_pos
was tricky since we couldn't correctly initialize it without
doing the first backfill scan pair.
In recover_backfill, we rescan from last_backfill_started rather
than from backfill_pos. This ensures that we capture all clones
created between last_backfill_started and what previously had been
backfill_pos without special handling in make_writeable. The main
downside is that we will tend to "rescan" last_backfill_started.
Jason Dillaman [Sun, 7 Sep 2014 02:59:40 +0000 (22:59 -0400)]
Enforce cache size on read requests
In-flight cache reads were not previously counted against
new cache read requests, which could result in very large
cache usage. This effect is most noticeable when writing
small chunks to a cloned image since each write requires
a full object read from the parent.
Locker: accept ctime updates from clients without dirty write caps
The ctime changes any time the inode does. That can happen even without
the file itself having changed, so we'd better accept the update whenever
the auth caps have dirtied, without worrying about the file caps!
Fixes: #9514
Backport: firefly
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: John Spray <john.spray@redhat.com>
(cherry picked from commit 0ea20a668cf859881c49b33d1b6db4e636eda18a)
c70331db13bdbf1f967ece48cdbec28b97c3d19c backported some
CRUSH changes which necessitate backporting this as well, or
we get failures like successfully looking up an OSD with ID -1.
Reviewed-by: Greg Farnum <greg@inktank.com>
Loic Dachary [Sat, 26 Oct 2013 00:11:43 +0000 (02:11 +0200)]
common: rebuild_page_aligned sometimes rebuilds unaligned
rebuild_page_aligned relies on rebuild to create memory that is aligned
according to list::is_page_aligned(). However, when the bufferlist only
contains a single ptr and its size is not list::is_n_page_size(),
rebuild will not create the expected aligned bufferlist.
The allocation of the ptr is moved out of rebuild, which is now given the
ptr as an argument. The rebuild_page_aligned function always requires an
aligned ptr with buffer::create_page_aligned(_len) for consistency.
demonstrated the problem. It was assumed to be a feature but should have
been identified as a bug. The last line is replaced with
EXPECT_TRUE(bl.is_page_aligned());
Most tests related to is_page_aligned() wrongly assumed that
bufferptr ptr(2);
is never page aligned. Most of the time it is not, but sometimes it is,
when the pointer address is by chance on a CEPH_PAGE_SIZE boundary,
which triggered #6614. Non-aligned ptrs are created as follows instead:
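The alignment behaviour described above can be shown with a small self-contained sketch (this is not the bufferlist code and not the snippet the commit refers to; PAGE_SIZE_ stands in for CEPH_PAGE_SIZE): a plain heap allocation is page aligned only when its address happens to land on a page boundary, while posix_memalign(), like buffer::create_page_aligned(), guarantees it.

    #include <cstdint>
    #include <cstdlib>

    static const std::size_t PAGE_SIZE_ = 4096;  // stand-in for CEPH_PAGE_SIZE

    // True only if the address happens to sit on a page boundary; for a small
    // heap buffer either outcome is possible, so tests must not assume one.
    bool is_page_aligned(const void* p) {
      return (reinterpret_cast<std::uintptr_t>(p) & (PAGE_SIZE_ - 1)) == 0;
    }

    // Guaranteed-aligned allocation, analogous to buffer::create_page_aligned().
    void* make_page_aligned(std::size_t len) {
      void* p = nullptr;
      if (posix_memalign(&p, PAGE_SIZE_, len) != 0)
        return nullptr;
      return p;
    }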
Sage Weil [Thu, 25 Sep 2014 20:16:52 +0000 (13:16 -0700)]
osdc/Objecter: only post_rx_buffer if no op timeout
If we post an rx buffer and there is a timeout, the revocation can happen
while the reader has consumed the buffers but before it has decoded and
constructed the message. In particular, we calculate a crc32c over the
data portion of the message after we've taken the buffers and dropped the
lock.
Instead of fixing this race (for example, by reverifying rx_buffers under
the lock while calculating the crc.. bleh), just skip the rx buffer
optimization entirely when a timeout is present.
Note that this doesn't cover the op_cancel() paths, but none of those users
provide static buffers to read into.
Sage Weil [Tue, 24 Sep 2013 18:22:19 +0000 (11:22 -0700)]
osd/ReplicatedPG: respect RWORDERED rados flag
If this flag is set, we need to order reads as writes. In particular, this
means that reads will wait for degraded object recovery even if there is a
local copy. And subsequently will be ordered after a preceding write that
is waiting for the same thing.
Greg Farnum [Thu, 22 May 2014 04:41:23 +0000 (21:41 -0700)]
cephfs-java: build against older jni headers
Older versions of the JNI interface expected non-const parameters
to their memory move functions. It's unpleasant, but won't actually
change the memory in question, to do a const_cast in order to satisfy
those older headers. (And even if it *did* modify the memory, that
would be okay given our single user.)
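A hedged example of the workaround (copy_region and its signature are invented stand-ins for the JNI memory-move call): the cast only strips constness to satisfy the older prototype; the source region is still only read.

    #include <cstddef>
    #include <cstring>

    // Older JNI headers would declare the source as a plain 'void *'; cast
    // away const at the call site so the same code builds against both
    // header versions without modifying the memory.
    void copy_region(void* dst, const void* src, std::size_t len) {
      void* unconst_src = const_cast<void*>(src);
      std::memcpy(dst, unconst_src, len);  // stand-in for the JNI move call
    }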
Samuel Just [Thu, 31 Oct 2013 20:19:32 +0000 (13:19 -0700)]
sharedptr_registry.hpp: removed ptrs need to not blast contents
See the included unit test update. Consider:
1) x = lookup_or_create(1, 1)
2) remove(1)
3) y = lookup_or_create(1, 2)
4) x.reset()
5) z = lookup(1)
The bug is that z will be null since x.reset() caused the
cleanup callback to remove y's key value from contents.
To fix this, contents also records the pointer value for
the weak_ptr. The removal callback only removes the
key from contents if it matches the ptr in contents.
This should work since the pointer passed to the removal
callback must be unique up to that point, as it has
not yet been deleted.
This allowed a pg removal -> pg recreation -> pg removal
sequence to cause the second pg removal entry to be
erroneously cleared by the first pg removal's destructor
as it finally made its way through the removal queue.
Fixes: #5951
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 28e4271267976ab8405a24d015f2fb50a2f82c49)
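A minimal sketch of the fix described above (generic types, no locking, not the Ceph class): the map records the raw pointer next to the weak_ptr, and the deleter erases the entry only when that recorded pointer is the one being destroyed, so a newer value registered under the same key is left intact.

    #include <map>
    #include <memory>
    #include <utility>

    template <typename K, typename V>
    class Registry {
      // key -> (weak_ptr to the live value, raw pointer that created it)
      std::map<K, std::pair<std::weak_ptr<V>, V*>> contents;

      void on_removal(const K& key, V* ptr) {
        auto it = contents.find(key);
        // only erase if the entry still refers to this exact pointer
        if (it != contents.end() && it->second.second == ptr)
          contents.erase(it);
      }

    public:
      std::shared_ptr<V> lookup_or_create(const K& key, const V& val) {
        auto it = contents.find(key);
        if (it != contents.end())
          if (auto sp = it->second.first.lock())
            return sp;
        V* raw = new V(val);
        std::shared_ptr<V> sp(raw, [this, key](V* p) {
          on_removal(key, p);  // no-op if the key was re-registered meanwhile
          delete p;
        });
        contents[key] = {sp, raw};
        return sp;
      }

      std::shared_ptr<V> lookup(const K& key) {
        auto it = contents.find(key);
        if (it == contents.end())
          return nullptr;
        return it->second.first.lock();
      }
    };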
Loic Dachary [Mon, 12 Aug 2013 12:05:38 +0000 (14:05 +0200)]
sharedptr_registry: add a variant of get_next() and the empty() method
The SharedPtrRegistry::get_next() method with a value of type VPtr
instead of V is added because it is sometimes more convenient to not
copy the value when walking the registry. The
SharedPtrRegistry::empty() predicate method is added.
A single counter ( waiting ) accurately reflects the number of
waiters, regardless of which method is waiting. It is enough to allow
unit tests to synthesize all situations, including:
T1: x = lookup_or_create(0)
T1: release x part 1 (weak_ptrs now fail to lock)
T2: y = lookup_or_create(0)
T2: block in lookup_or_create (waiting == 1)
T1: z = lookup_or_create(1) (does not block because the key is different)
while holding the lock it increments waiting (waiting == 2)
and decrements it before returning (waiting is back to == 1)
T1: complete release x
T2: complete lookup_or_create(0) (waiting == 0)
The unit tests are modified to add a lookup on an unrelated key to
demonstrate that it does not reset waiting counter.
Covers 100% of the LOC and all the expected behavior, including thread
safety.
The sharedptr_registry is made a friend of the test class so that it can
synthesize race conditions. The lookup and lookup_or_create methods
set the new in_method data member before calling cond.Wait() so that
the caller knows it is waiting.
The crash occurs due to ImageCtx->parent->parent being uninitialized,
since the initial open_parent() -> open_image(parent) ->
ictx_refresh(parent) occurs before ImageCtx->parent->snap_id is set,
so refresh_parent() is not called to open an ImageCtx for the parent
of the parent. This leaves the ImageCtx->parent->parent NULL, but the
rest of ImageCtx->parent updated to point at the correct parent snapshot.
Setting the parent->snap_id earlier has some unintended side effects
currently, so for now just call refresh_parent() during
open_parent(). This is the easily backportable version of the
fix. Further patches can clean up this whole initialization process.
Sage Weil [Wed, 10 Sep 2014 15:00:50 +0000 (08:00 -0700)]
test/cli-integration/rbd: fix trailing space
Newer versions of json.tool remove the trailing ' ' after the comma. Add
it back in with sed so that the .t works on both old and new versions, and
so that we don't have to remove the trailing spaces from all of the test
cases.
Sage Weil [Sat, 16 Aug 2014 19:42:33 +0000 (12:42 -0700)]
os/FileStore: fix mount/remount force_sync race
Consider:
- mount
- sync_entry is doing some work
- umount
- set force_sync = true
- set done = true
- sync_entry exits (due to done)
- ..but does not set force_sync = false
- mount
- journal replay starts
- sync_entry sees force_sync and does a commit while op_seq == 0
...crash...
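One way to picture the problem, as a heavily simplified sketch (member names abbreviated, not the FileStore code): the sync thread must not let a force_sync request outlive its exit, otherwise a later mount in the same process sees the stale flag and commits while op_seq is still 0.

    #include <atomic>

    struct SyncState {
      std::atomic<bool> force_sync{false};
      std::atomic<bool> done{false};
    };

    // Simplified sync loop: consume force_sync requests while running, and
    // clear any request that raced with shutdown so it cannot leak into a
    // later mount.
    void sync_entry(SyncState& s) {
      while (!s.done) {
        bool forced = s.force_sync.exchange(false);
        (void)forced;  // a real loop would commit here when forced or on timer
      }
      s.force_sync = false;  // do not leave a stale flag behind on exit
    }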
Sage Weil [Mon, 8 Sep 2014 20:44:57 +0000 (13:44 -0700)]
osdc/Objecter: revoke rx_buffer on op_cancel
If we cancel a read, revoke the rx buffers to avoid a use-after-free and/or
other undefined badness by using user buffers that may no longer be
present.
Fixes: #9362
Backport: firefly, dumpling
Reported-by: Matthias Kiefer <matthias.kiefer@1und1.de>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 2305b2897acba38384358c33ca3bbfcae6f1c74e)
Josh Durgin [Mon, 11 Aug 2014 23:41:26 +0000 (16:41 -0700)]
librbd: fix error path cleanup for opening an image
If the image doesn't exist and caching is enabled, the ObjectCacher
was not being shutdown, and the ImageCtx was leaked. The IoCtx could
later be closed while the ObjectCacher was still running, resulting in
a segfault. Simply use the usual cleanup path in open_image(), which
works fine here.
Haomai Wang [Mon, 14 Jul 2014 06:27:17 +0000 (14:27 +0800)]
Add rbdcache max dirty object option
Librbd calculates the max number of dirty objects from rbd_cache_max_size,
which is not suitable for every case. If the user sets an image order of 24,
the calculated result is too small in practice, which increases the overhead
of the trim call invoked on each read/write op.
Now we make this a tunable option; by default the value is still calculated.
Danny Al-Gaaf [Wed, 4 Jun 2014 21:22:18 +0000 (23:22 +0200)]
librbd/internal.cc: check earlier for null pointer
Fix a potential null pointer deref: move the check for 'order != NULL'
to the beginning of the function to a) prevent a deref in the ldout() call
and b) leave the function as early as possible if the check fails.
[src/librbd/internal.cc:843] -> [src/librbd/internal.cc:865]: (warning)
Possible null pointer dereference: order - otherwise it is redundant
to check it against null.
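A hedged sketch of the reordering (get_image_order and its signature are made up for illustration): validate the out parameter before anything, including logging, can dereference it, and return early on failure.

    #include <cerrno>
    #include <iostream>

    // Check 'order' first; in the real function the ldout() call dereferenced
    // it, so the check must come before any logging or other work.
    int get_image_order(const char* image_name, int* order) {
      if (order == nullptr)
        return -EINVAL;
      std::cerr << "querying order for image " << image_name << "\n";
      *order = 22;  // placeholder for the real lookup
      return 0;
    }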