Greg Farnum [Thu, 23 Oct 2014 00:16:31 +0000 (17:16 -0700)]
client: cast m->get_client_tid() to compare to 16-bit Inode::flushing_cap_tid
m->get_client_tid() is 64 bits (as it should be), but Inode::flushing_cap_tid
is only 16 bits. 16 bits should be plenty to let the cap flush updates
pipeline appropriately, but we need to cast in the proper direction when
comparing these differently-sized versions. So downcast the 64-bit one
to 16 bits.
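A minimal sketch of the idea, with a simplified Inode and a hypothetical helper name; the point is only that the 64-bit tid gets truncated to 16 bits before the comparison:

    #include <cstdint>

    // Simplified stand-in for the real Inode; only the narrow tid matters here.
    struct Inode {
      uint16_t flushing_cap_tid = 0;
    };

    // Hypothetical helper: truncate the 64-bit message tid to the width of
    // flushing_cap_tid so both sides of the comparison wrap identically.
    bool flush_tid_matches(const Inode &in, uint64_t client_tid) {
      return static_cast<uint16_t>(client_tid) == in.flushing_cap_tid;
    }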
Sage Weil [Thu, 8 May 2014 21:19:22 +0000 (14:19 -0700)]
osd/ReplicatedPG: carry CopyOpRef in copy_from completion
There is a race with copy_from cancellation. The internal Objecter
completion decodes a bunch of data and copies it into pointers provided
when the op is queued. When we cancel, we need to ensure that we can cope
until control passes back to our provided completion.
Once we *do* get into the (ReplicatedPG) callbacks, we will bail out
because the tid in the CopyOp or FlushOp no longer matches.
Fix this by carrying a ref to keep the copy-from targets alive, and
clearing out the tids that we cancel.
Note that previously, the trigger for this was that the tid changes when
we handle a redirect, which made the op_cancel() call fail. With the
coming Objecter changes, this will no longer be the case. However, there
are also locking and threading changes that will make cancellation racy,
so we will not be able to rely on it always preventing the callback.
Either way, this will avoid the problem.
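As a rough illustration (generic std:: types, not the actual ReplicatedPG/CopyOp code), keeping the targets alive amounts to capturing a shared reference in the completion and clearing the tid on cancel:

    #include <cstdint>
    #include <functional>
    #include <memory>

    // Hypothetical stand-ins for CopyOp and its smart-pointer typedef.
    struct CopyOp {
      uint64_t objecter_tid = 0;   // tid of the in-flight Objecter op
      // ...the buffers the Objecter completion decodes into live here...
    };
    using CopyOpRef = std::shared_ptr<CopyOp>;

    // The completion captures a CopyOpRef, so the decode targets stay valid
    // even if the op is cancelled before the callback runs.
    std::function<void(int)> make_copyfrom_completion(CopyOpRef cop, uint64_t tid) {
      return [cop, tid](int result) {
        (void)result;
        if (cop->objecter_tid != tid)
          return;            // cancelled or restarted; bail out
        // ...normal completion path...
      };
    }

    void cancel_copy(CopyOpRef cop) {
      cop->objecter_tid = 0;  // make any still-queued completion a no-op
    }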
qa/workunits/cephtool/test.sh: fix thrash (ultimate)
Keep the osd thrash test to ensure it is a valid command but make it a
no-op by giving it a zero argument (meaning thrash 0 OSD maps).
Remove the loops that were added after the command in an attempt to wait
for the cluster to recover and not pollute the rest of the tests. Actual
testing of osd thrash would require a dedicated cluster because its
side effects are random and it is unnecessarily difficult to ensure they
have finished.
Then, we begin to flush 15 with a delete with snapc 4:[4] leaving the
backing pool with:
4:[4]:[4(4)]
Then, we finish flushing 15 with snapc 9:[4], leaving the backing
pool with:
9:[4]:[4(4)]+head
Next, snaps 10 and 15 are removed causing clone 10 to be removed leaving
the cache with:
30:[29,21,20,4]:[22(21),4(4)]+head
We next begin to flush 22 by sending a delete with snapc 4(4) since
prev_snapc is 4 <---------- here is the bug
The backing pool ignores this request since 4 < 9 (ORDERSNAP) leaving it
with:
9:[4]:[4(4)]
Then, we complete flushing 22 with snapc 19:[4] leaving the backing pool
with:
19:[4]:[4(4)]+head
Then, we begin to flush head by deleting with snapc 22:[21,20,4] leaving
the backing pool with:
22:[21,20,4]:[22(21,20),4(4)]
Finally, we flush head leaving the backing pool with:
30:[29,21,20,4]:[22(21*,20*),4(4)]+head
When we go to flush clone 22, all we know is that 22 is dirty, has snaps
[21], and 4 is clean. As part of flushing 22, we need to do two things:
1) Ensure that the current head is cloned as cloneid 4 with snaps [4] by
sending a delete at snapc 4:[4].
2) Flush the data at snap sequence < 21 by sending a copyfrom with snapc
20:[20,4].
Unfortunately, it is possible that 1, 1&2, or 1 and part of the flush
process for some other now non-existent clone have already been
performed. Because of that, between 1) and 2), we need to send
a second delete ensuring that the object does not exist at 20.
We have been setting it to the old head value. This is usually
harmless since the new head will virtually always be ahead of the
old head for claim_log_and_clear_rollback_info, but can cause trouble
in some edge cases.
Samuel Just [Mon, 15 Sep 2014 23:53:21 +0000 (16:53 -0700)]
PG::find_best_info: let history.last_epoch_started provide a lower bound
If we find an info.history.last_epoch_started above any
info.last_epoch_started, we must be missing updates and
min_last_update_acceptable should provisionally be max().
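A hedged sketch of that bound check, with simplified stand-ins for pg_info_t and eversion_t (field names and the surrounding logic are assumptions):

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <vector>

    // Simplified stand-ins for pg_info_t fields.
    struct PeerInfo {
      uint32_t last_epoch_started = 0;          // from the info itself
      uint32_t history_last_epoch_started = 0;  // from info.history
    };

    // If history advertises an epoch that no info has actually started,
    // we must be missing updates, so provisionally demand the maximum
    // last_update (i.e. accept nothing yet).
    uint64_t min_last_update_acceptable(const std::vector<PeerInfo> &infos) {
      uint32_t max_les = 0, max_history_les = 0;
      for (const auto &i : infos) {
        max_les = std::max(max_les, i.last_epoch_started);
        max_history_les = std::max(max_history_les, i.history_last_epoch_started);
      }
      if (max_history_les > max_les)
        return std::numeric_limits<uint64_t>::max();
      return 0;  // placeholder for the normal computation
    }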
Fixes: #9482
Backport: firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
The crash occurs due to ImageCtx->parent->parent being uninitialized,
since the initial open_parent() -> open_image(parent) ->
ictx_refresh(parent) occurs before ImageCtx->parent->snap_id is set,
so refresh_parent() is not called to open an ImageCtx for the parent
of the parent. This leaves the ImageCtx->parent->parent NULL, but the
rest of ImageCtx->parent updated to point at the correct parent snapshot.
Setting the parent->snap_id earlier has some unintended side effects
currently, so for now just call refresh_parent() during
open_parent(). This is the easily backportable version of the
fix. Further patches can clean up this whole initialization process.
init-radosgw.sysv: Support systemd for starting the gateway
When using RHEL7, the radosgw daemon needs to start under systemd.
Check whether systemd is running as PID 1. If it is, start
the daemon using: systemd-run -r <cmd>. pidof returns null
because it is executed too quickly, so add one second of sleep
so that the script reports the startup correctly.
Sage Weil [Mon, 8 Sep 2014 20:44:57 +0000 (13:44 -0700)]
osdc/Objecter: revoke rx_buffer on op_cancel
If we cancel a read, revoke the rx buffers to avoid a use-after-free and/or
other undefined badness by using user buffers that may no longer be
present.
Fixes: #9362
Backport: firefly, dumpling
Reported-by: Matthias Kiefer <matthias.kiefer@1und1.de>
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 2305b2897acba38384358c33ca3bbfcae6f1c74e)
Samuel Just [Wed, 27 Aug 2014 23:21:41 +0000 (16:21 -0700)]
PG::can_discard_op: do discard old subopreplies
Otherwise, a sub_op_reply from a previous interval can stick around
until we either one day go active again and get rid of it or delete the
pg which is holding it on its waiting_for_active list. While it sticks
around futilely waiting for the pg to once more go active, it will cause
harmless slow request warnings.
Sage Weil [Wed, 27 Aug 2014 13:19:12 +0000 (06:19 -0700)]
osd/PG: fix crash from second backfill reservation rejection
If we get more than one reservation rejection we should ignore them; when
we got the first we already sent out cancellations. More importantly, we
should not crash.
osd: OSDMap: ordered blacklist on non-classic encode function
Fixes: #9211
Backport: firefly
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 81102044f417bd99ca570d9234b1df5195e9a8c9)
Sage Weil [Tue, 26 Aug 2014 15:16:29 +0000 (08:16 -0700)]
osd/OSDMap: encode blacklist in deterministic order
When we use an unordered_map the encoding order is non-deterministic,
which is problematic for OSDMap. Construct an ordered map<> on encode
and use that. This lets us keep the hash table for lookups in the general
case.
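A sketch of the approach with toy types (not the real OSDMap encoder): copy the hash table into an ordered std::map just for encoding, so the byte stream never depends on hash iteration order:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Toy encoder: appends "addr=expire;" records to a byte buffer.
    void encode_blacklist(const std::unordered_map<std::string, uint32_t> &blacklist,
                          std::vector<char> &out) {
      // Keep the unordered_map for fast runtime lookups, but build a sorted
      // copy just for encoding so every OSD emits entries in the same order.
      std::map<std::string, uint32_t> ordered(blacklist.begin(), blacklist.end());
      for (const auto &[addr, expire] : ordered) {
        std::string rec = addr + "=" + std::to_string(expire) + ";";
        out.insert(out.end(), rec.begin(), rec.end());
      }
    }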
Sage Weil [Wed, 20 Aug 2014 15:59:46 +0000 (08:59 -0700)]
mon: add a cluster fingerprint
Generate it on cluster creation with the initial monmap. Include it in
the report. Provide no way for this uuid to be fed in to the cluster
(intentionally or not) so that it can be assumed to be a truly unique
identifier for the cluster.
Sage Weil [Sat, 16 Aug 2014 19:42:33 +0000 (12:42 -0700)]
os/FileStore: fix mount/remount force_sync race
Consider:
- mount
- sync_entry is doing some work
- umount
  - set force_sync = true
  - set done = true
- sync_entry exits (due to done)
  - ..but does not set force_sync = false
- mount
- journal replay starts
- sync_entry sees force_sync and does a commit while op_seq == 0
  ...crash...
Loic Dachary [Mon, 25 Aug 2014 15:05:04 +0000 (17:05 +0200)]
common: ROUND_UP_TO accepts any rounding factor
The ROUND_UP_TO function was limited to rounding factors that are powers
of two. That restriction saves a modulo, but the function is not used anywhere
the savings would make a difference. The implementation is changed to be generic.
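For illustration, a generic rounding helper along these lines (written as a function here rather than the original macro) handles any positive factor at the cost of a modulo:

    #include <cstdint>

    // Round n up to the next multiple of d (d > 0). Unlike a
    // power-of-two-only version, this works for any factor, e.g. 3 or 1000.
    inline uint64_t round_up_to(uint64_t n, uint64_t d) {
      uint64_t rem = n % d;
      return rem ? n + (d - rem) : n;
    }

    // Example: round_up_to(10, 3) == 12; round_up_to(12, 3) == 12.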
Haomai Wang [Thu, 20 Mar 2014 06:09:49 +0000 (14:09 +0800)]
Remove exclusive lock on GenericObjectMap
Most GenericObjectMap interfaces now take a header as an argument rather than
the combination of coll_t and ghobject_t, so the caller is responsible for
maintaining exclusive access to the header.
Haomai Wang [Tue, 26 Aug 2014 04:41:28 +0000 (04:41 +0000)]
Add random cache and replace SharedLRU in KeyValueStore
SharedLRU performs poorly in KeyValueStore with a large header cache size,
so a performance-optimized RandomCache can improve it.
RandomCache records the lookup frequency of each key. When evicting an
element, it randomly samples several elements, compares their frequencies,
and evicts the least frequently used one.
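A rough sketch of that eviction policy with generic containers (not the actual RandomCache code): sample a few random entries, compare their recorded frequencies, and drop the least-used of the sample:

    #include <cstdint>
    #include <iterator>
    #include <random>
    #include <string>
    #include <unordered_map>

    // Toy cache entry carrying a value and its lookup frequency.
    struct Entry { std::string value; uint64_t freq = 0; };

    void evict_one(std::unordered_map<std::string, Entry> &cache,
                   std::mt19937 &rng, int samples = 3) {
      if (cache.empty())
        return;
      std::uniform_int_distribution<size_t> pick(0, cache.size() - 1);
      auto victim = cache.end();
      for (int i = 0; i < samples; ++i) {
        // Walk to a random entry (linear here for brevity; a real cache
        // would index its entries to make sampling cheap).
        auto it = std::next(cache.begin(), pick(rng));
        if (victim == cache.end() || it->second.freq < victim->second.freq)
          victim = it;
      }
      cache.erase(victim);
    }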
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Conflicts:
Haomai Wang [Tue, 26 Aug 2014 04:40:16 +0000 (04:40 +0000)]
Add Header cache to KeyValueStore
Recent performance statistics show that header lookup has become the main
cost of read/write operations: roughly 50% of the time is spent on header
lookup and decode/encode logic.
Add a header cache based on the SharedLRU structure, which maintains the
cached headers and hands callers a pointer to the real header. This also
avoids the overhead of excessive header copies.
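Illustrative only (the real cache is Ceph's SharedLRU, not this bare map): the benefit is that lookups hand back a shared pointer to the cached header instead of decoding and copying it on every operation:

    #include <memory>
    #include <string>
    #include <unordered_map>

    struct Header { /* decoded per-object metadata */ };

    class HeaderCache {
      std::unordered_map<std::string, std::shared_ptr<Header>> cache_;  // eviction omitted
    public:
      // Return the cached header, or decode it once and cache the result;
      // callers share a pointer to the same header instead of copying it.
      std::shared_ptr<Header> lookup_or_load(const std::string &key) {
        auto it = cache_.find(key);
        if (it != cache_.end())
          return it->second;                  // no decode, no copy
        auto h = std::make_shared<Header>();  // stand-in for decode-from-disk
        cache_[key] = h;
        return h;
      }
    };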
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
Conflicts:
Sage Weil [Thu, 21 Aug 2014 20:05:35 +0000 (13:05 -0700)]
mon: fix occasional message leak after session reset
Consider:
- we get a message, put it on a wait list
- the client session resets
- we go back to process the message later and discard
- _ms_dispatch returns false, but nobody drops the msg ref
Since we call _ms_dispatch() a lot internally, we need to always return
true when we are an internal caller.
Loic Dachary [Thu, 21 Aug 2014 12:41:55 +0000 (14:41 +0200)]
erasure-code: preload the jerasure plugin variant (sse4,sse3,generic)
Preloading the jerasure plugin dlopens the plugin that is in
charge of selecting the variant optimized for the
CPU (sse4, sse3, generic). The variant plugin itself is not loaded at that
point because that does not happen at load() but when the factory() method is called.
The JerasurePlugin::preload method is modified to call the factory()
method to load jerasure_sse4 or jerasure_sse3 or jerasure_generic as a
side effect.
Indirectly loading another plugin in the factory() method is error prone
and should be moved to the load() method instead. This change should be
done in a separate commit.
Haomai Wang [Tue, 20 May 2014 06:32:18 +0000 (14:32 +0800)]
Fix set_alloc_hint op cause KeyValueStore crash problem
KeyValueStore does not currently support the set_alloc_hint op, but the
implementation of _do_transaction still needs to decode its arguments.
Otherwise, the arguments will be regarded as the next op.
Fix the same problem for MemStore.
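The shape of the fix, sketched with a placeholder argument stream rather than the real ObjectStore::Transaction iterator (names and the op-code value are assumptions): even though the op is ignored, its arguments must still be consumed, otherwise they are misread as the next op:

    #include <cstdint>
    #include <vector>

    // Minimal stand-in for the transaction argument stream.
    struct TransIter {
      std::vector<uint64_t> words;
      size_t pos = 0;
      uint64_t get_u64() { return words.at(pos++); }
    };

    enum { OP_SETALLOCHINT = 39 };  // placeholder op code

    void do_one_op(int op, TransIter &it) {
      switch (op) {
      case OP_SETALLOCHINT: {
        // The backend ignores the hint, but the encoded arguments must
        // still be consumed so the iterator stays aligned on the next op.
        uint64_t expected_object_size = it.get_u64();
        uint64_t expected_write_size  = it.get_u64();
        (void)expected_object_size;
        (void)expected_write_size;
        break;  // treated as a no-op
      }
      default:
        // ...ops the backend actually implements...
        break;
      }
    }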
Fix #8381
Reported-by: Xinxin Shu <xinxin.shu5040@gmail.com>
Signed-off-by: Haomai Wang <haomaiwang@gmail.com>
(cherry picked from commit c08adbc98ff5f380ecd215f8bd9cf3cab214913c)
Loic Dachary [Mon, 18 Aug 2014 23:30:15 +0000 (01:30 +0200)]
erasure-code: preload the jerasure plugin
Load the jerasure plugin when ceph-osd starts to avoid the following
scenario:
* ceph-osd-v1 is running but did not load jerasure
* ceph-osd-v2 is being installed, but installation takes time: the files
are installed before ceph-osd is restarted
* ceph-osd-v1 is required to handle an erasure coded placement group and
loads jerasure (the v2 version which is not API compatible)
* ceph-osd-v1 calls the v2 jerasure plugin and does not reference the
expected part of the code and crashes
Although this problem shows in the context of teuthology, it is unlikely
to happen on a real cluster because it involves upgrading immediately
after installing and running an OSD. Once it is backported to firefly,
it will not even happen in teuthology tests because the upgrade from
firefly to master will use the firefly version including this fix.
While it would be possible to walk the plugin directory and preload
whatever it contains, that would not work for plugins such as jerasure
that load other plugins depending on the CPU features, or even plugins
such as isa, which only work on specific CPUs.
Matt Benjamin [Thu, 29 May 2014 14:34:20 +0000 (10:34 -0400)]
Work around an apparent binding bug (GCC 4.8).
A reference to h->seq passed to std::pair ostensibly could not bind
because the header structure is packed. At first this looked like
a more general unaligned access problem, but the only location the
compiler rejects is a false positive.
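The usual shape of such a workaround, shown with an illustrative packed struct rather than the actual header type: copy the packed member into a properly aligned local before anything binds a reference to it:

    #include <cstdint>
    #include <map>

    #pragma pack(push, 1)
    struct PackedHeader {
      uint8_t  flags;
      uint64_t seq;   // misaligned because the struct is packed
    };
    #pragma pack(pop)

    void record(std::map<uint64_t, int> &index, const PackedHeader *h) {
      // Binding a reference straight to h->seq can be rejected (or warned
      // about) because the member may be unaligned; copy it to a local with
      // normal alignment and bind to that instead.
      uint64_t seq = h->seq;
      index.emplace(seq, 0);
    }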
common/config.cc: allow integer values to be parsed as SI units
We are allowing this for all and any integer values; that is, OPT_INT,
OPT_LONGLONG, OPT_U32 and OPT_U64.
It's on the user to use appropriate units. For instance, the user should
not use 'E(xabyte)' when setting a signed int, and use his best judgment
when setting options that, for instance, ought to receive seconds.
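A toy parser to show the behavior being allowed (the real logic lives in the config option parsing; the suffix semantics below are an assumption for illustration):

    #include <cstdint>
    #include <stdexcept>
    #include <string>

    // Parse "128", "128K", "4M", "2G", ... into a plain count.
    // Suffixes are treated as powers of 1024 here for illustration.
    uint64_t parse_si(const std::string &s) {
      size_t idx = 0;
      uint64_t n = std::stoull(s, &idx);
      if (idx == s.size())
        return n;
      switch (s[idx]) {
      case 'K': case 'k': return n << 10;
      case 'M': case 'm': return n << 20;
      case 'G': case 'g': return n << 30;
      case 'T': case 't': return n << 40;
      case 'E': case 'e': return n << 60;  // exabytes overflow small option types
      default: throw std::invalid_argument("unknown SI suffix: " + s);
      }
    }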
Sage Weil [Wed, 13 Aug 2014 00:25:10 +0000 (17:25 -0700)]
ceph-disk: use partition type UUIDs, and blkid
Use blkid to give us the GPT partition type. This lets us distinguish
between dmcrypt and non-dmcrypt partitions. Fake it if blkid doesn't
give us what we want and try with sgdisk. This isn't perfect (it can't
tell between dmcrypt and not dmcrypt), but such is life, and we are better
off than before.