Zero-length writes would hang because the completion was never
called. Reads would hit an assert about zero length in
Striper::file_to_extents().
Fix all of these cases by skipping zero-length extents. The completion
is created and finished when finish_adding_requests() is called. This
is slightly different from usual completions since it comes from the
same thread as the one scheduling the request, but zero-length aio
requests should never happen from things that might care about this,
like QEMU.
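A minimal sketch of the idea, not the librbd code: the Completion and Extent types and the submit() helper below are hypothetical stand-ins. It shows why a request made only of zero-length extents still completes: nothing is queued, and finish_adding_requests() fires the callback from the submitting thread.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for an aio completion; not the librbd class.
struct Completion {
  int pending = 0;       // sub-requests still in flight
  bool building = true;  // true until finish_adding_requests() is called
  bool done = false;
  std::function<void()> on_finish;

  void add_request() { ++pending; }

  void complete_request() {
    assert(pending > 0);
    --pending;
    maybe_finish();
  }

  // Called once by the submitter after all sub-requests are queued.
  void finish_adding_requests() {
    building = false;
    maybe_finish();
  }

 private:
  void maybe_finish() {
    if (!building && pending == 0 && !done) {
      done = true;
      if (on_finish) on_finish();
    }
  }
};

// Hypothetical extent description for this sketch.
struct Extent {
  uint64_t offset;
  uint64_t length;
};

// Queue one sub-request per non-empty extent and return how many were queued.
int submit(Completion& c, const std::vector<Extent>& extents) {
  int queued = 0;
  for (const Extent& e : extents) {
    if (e.length == 0) continue;  // skip zero-length extents entirely
    c.add_request();
    ++queued;
  }
  c.finish_adding_requests();  // completes immediately if nothing was queued
  return queued;
}
```

With only zero-length extents, submit() returns 0 and the callback has already run by the time it returns, which is the "same thread" behaviour described above.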
Writes and discards have had this bug since the beginning of
librbd. Reads might have avoided it until stripingv2 was added.
Sage Weil [Wed, 9 Apr 2014 00:28:54 +0000 (17:28 -0700)]
osd: do not block when updating osdmap superblock features
We are holding osd_lock in check_osdmap_features, which means we cannot
block while waiting for filestore operations to flush/apply without
risking deadlock.
The important constraint is that we commit that the feature is enabled
before also committing anything that utilizes sharded objects. The normal
commit sequencing does that already; there is no reason to block here.
Fixes: #8045
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 8 Apr 2014 19:26:19 +0000 (12:26 -0700)]
osd/PG: set CREATING pg state bit until we peer for the first time
We send PG state updates to the monitor while creating a PG before the
actual creation has been finalized and persisted. Because those updates
do not include the CREATING bit, the mon will remove the pgid from its
creating set. If the OSD(s) crash before persisting that PG creation, the
PG will never get created.
Fix this by leaving the CREATING bit set on the primary as long as
last_epoch_started==0. That is, until we successfully peer for the very
first time. Only then do we clear the bit and tell the monitor its duty
is complete.
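The rule can be sketched as follows. This is an illustration only: the flag values, the PGView struct, and both helpers are hypothetical and do not mirror the actual osd/PG code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag values; not Ceph's actual PG state constants.
constexpr uint32_t STATE_CREATING = 1u << 0;
constexpr uint32_t STATE_PEERING = 1u << 1;

// Hypothetical, reduced view of a PG for this sketch.
struct PGView {
  bool is_primary;
  uint32_t last_epoch_started;  // 0 until the first successful peering
};

// Record the first successful peering; later calls leave the value alone.
void on_first_peer(PGView& pg, uint32_t epoch) {
  assert(epoch > 0);
  if (pg.last_epoch_started == 0) pg.last_epoch_started = epoch;
}

// State bits to report: CREATING stays set on the primary until
// last_epoch_started becomes non-zero, and is cleared otherwise.
uint32_t state_to_report(const PGView& pg, uint32_t state) {
  if (pg.is_primary && pg.last_epoch_started == 0)
    return state | STATE_CREATING;
  return state & ~STATE_CREATING;
}
```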
Fixes: #8001
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 8 Apr 2014 17:52:43 +0000 (10:52 -0700)]
os/FileStore: reset journal state on umount
We observed a sequence like:
- replay journal
- sets JournalingObjectStore applied_op_seq
- umount
- mount
- initiate commit with previous applied_op_seq
- replay journal
- commit finishes
- on replay commit, we fail assert op > committed_seq
Although strictly speaking the assert failure is harmless here, we should
not let state leak through from a previous mount into this mount, or else
assertions become more difficult to reason about.
Fixes: #8019
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 8 Apr 2014 17:58:53 +0000 (10:58 -0700)]
vstart.sh: make crush location match up with what init-ceph does
This makes it so that ./init-ceph restart osd.0 won't modify the CRUSH
tree. And in any case, the localhost/localrack thing we were doing before
was pretty useless.
Sage Weil [Tue, 8 Apr 2014 16:01:14 +0000 (09:01 -0700)]
osd: drop unused same_for_*() helpers
These were all identical and mostly served to obscure the actual logic,
which is now captured by can_discard_op() and the matching Objecter
code on the client side.
Sage Weil [Tue, 8 Apr 2014 16:00:11 +0000 (09:00 -0700)]
osd: drop previous interval ops even if primary happens to be the same
If we have two consecutive intervals with the same primary, the client
will not resend the op and the same_primary_since epoch will not change,
and all is well.
If, however, we have 3 intervals, and the primary changes away and then
back to a particular OSD, the OSD will currently still process the old
request (assuming the timing works out) because it is currently the
primary. This is unnecessary because the client will resend the request.
It may even introduce a hard-to-hit ordering problem since whether or not
the OSD processes the message becomes dependent on how many subsequent
maps it has consumed when the request is processed.
Instead, simplify the minor tangle of helpers by making a single simple
check that discards requests from before same_primary_since. We can then
avoid using the same_for_*() helpers and drop the check from
handle_misdirected_op(), which is also nice because the name is now accurate
(it *only* deals with ops that are in fact misdirected, not just slow to
arrive).
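A rough sketch of the single check, assuming a simplified model: the Epoch alias, the PGState struct, and the start_interval() and should_discard() helpers are hypothetical and are not the OSD's actual functions. It walks through the three-interval case where the primary changes away and back.

```cpp
#include <cassert>
#include <cstdint>

using Epoch = uint32_t;  // hypothetical alias for this sketch

// Hypothetical, reduced PG state.
struct PGState {
  int primary;               // id of the OSD currently acting as primary
  Epoch same_primary_since;  // first epoch of the current primary's unbroken run
};

// Start a new interval; same_primary_since only moves when the primary changes.
void start_interval(PGState& pg, Epoch epoch, int new_primary) {
  assert(epoch >= pg.same_primary_since);
  if (new_primary != pg.primary) {
    pg.primary = new_primary;
    pg.same_primary_since = epoch;
  }
}

// The single check: drop anything sent before the current primary's run began.
// The client is expected to resend such requests.
bool should_discard(const PGState& pg, Epoch op_sent_epoch) {
  return op_sent_epoch < pg.same_primary_since;
}
```

In this model an op sent at epoch 15 survives a new interval with the same primary, but is discarded once the primary has moved to another OSD and back, even though the original OSD is primary again.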
The main change is to use shared_ptr instead of weak_ptr to define the
active request map. The reason is that a slave request needs to be
preserved until the master explicitly finishes it.
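To illustrate the lifetime difference, here is a small sketch with hypothetical names (Request, ActiveRequests, register_and_drop_local_ref); it is not the MDS code. Because the map holds a shared_ptr, the request stays alive after the creator drops its own reference, and is only destroyed on an explicit finish().

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

// Hypothetical request type for this sketch.
struct Request {
  uint64_t id;
};

// Map that owns its entries: a request lives until finish() is called.
class ActiveRequests {
 public:
  void add(const std::shared_ptr<Request>& r) { active_[r->id] = r; }

  bool contains(uint64_t id) const { return active_.count(id) > 0; }

  void finish(uint64_t id) {
    assert(contains(id));
    active_.erase(id);  // drops the map's owning reference
  }

 private:
  std::map<uint64_t, std::shared_ptr<Request>> active_;
};

// Create a request, register it, and let the local shared_ptr go out of
// scope. The returned weak_ptr only observes whether the request is alive.
std::weak_ptr<Request> register_and_drop_local_ref(ActiveRequests& reqs,
                                                   uint64_t id) {
  std::shared_ptr<Request> r = std::make_shared<Request>(Request{id});
  reqs.add(r);
  return std::weak_ptr<Request>(r);
}
```

Had the map stored weak_ptr values instead, the request would already be gone when register_and_drop_local_ref() returned.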
erasure-code: thread-safe initialization of gf-complete
Instead of relying on an implicit initialization happening during
encoding/decoding with galois.c:galois_init_default_field, call
gf.c:gf_init_easy for each w value when the plugin is loaded.
Loading the plugin is protected against race conditions by a lock.
It does not cover all possible uses of gf-complete but it is enough for
the ceph jerasure plugin.
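The pattern is roughly the following. This is a sketch under assumptions: FieldRegistry and init_field() are hypothetical placeholders, the widths passed in are whatever the caller chooses, and no gf-complete function is actually called here.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <vector>

// Hypothetical registry that eagerly sets up every field width at load time,
// under a lock, instead of lazily on first encode/decode.
class FieldRegistry {
 public:
  // Initialize each width once; returns how many were newly initialized.
  int load(const std::vector<int>& widths) {
    std::lock_guard<std::mutex> guard(lock_);
    int initialized = 0;
    for (int w : widths) {
      if (ready_.insert(w).second) {  // true only the first time w is seen
        init_field(w);
        ++initialized;
      }
    }
    return initialized;
  }

  bool is_ready(int w) const {
    std::lock_guard<std::mutex> guard(lock_);
    return ready_.count(w) > 0;
  }

 private:
  // Placeholder for the real per-width setup, which this sketch omits.
  static void init_field(int w) {
    assert(w > 0);
    (void)w;
  }

  mutable std::mutex lock_;
  std::set<int> ready_;
};
```

Concurrent callers of load() serialize on the lock, so each width is initialized exactly once and later encode/decode paths never trigger initialization themselves.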
mon: MonCommands: have all 'auth' commands require 'execute' caps
An earlier patch already has the entity requiring 'execute' caps for
read-only commands. This patch introduces the same requirement for *all*
auth commands, read-only and read-write alike.
While the rationale behind the earlier patch for leaving read-write
operations out of this requirement still holds, we now enforce this to
stay compatible with what was happening back on Dumpling with regard
to the 'execute' cap being required for auth commands. However, it should
be noted that back on Dumpling we were only requiring the 'execute' cap
for auth commands, regardless of read-only or read-write, and no other
caps were required.
Fixes: #7919
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Sun, 6 Apr 2014 23:03:50 +0000 (16:03 -0700)]
osd: fix map subscription in YOU_DIED osd_ping handler
If we have epoch X and find out we died as of epoch Y, we still want to
request X+1. Among other things, this fixes a 'stall' if Y happens to be
the most recent map published and no new maps are generated because we will
never get anything back from our subscription.
This makes this osdmap_subscribe() caller match every other caller by
passing in current epoch + 1.
Fixes: #8002
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 2 Apr 2014 15:49:33 +0000 (08:49 -0700)]
msgr: add ms_dump_on_send option
This is useful only for debugging. The encoded contents of a message are
dumped to the log on message send. This is useful when valgrind is
triggering warnings about uninitialized memory in messages because the
call chain will indicate which message type is to blame, whereas the
usual writer thread context does not tell us any useful information.
Sage Weil [Sat, 5 Apr 2014 23:58:55 +0000 (16:58 -0700)]
mon: wait for quorum for MMonGetVersion
We should not respond to checks for map versions when we are in the
probing or electing states or else clients will get incorrect results when
they ask what the latest map version is.
Fixes: #7997
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 5 Apr 2014 01:15:04 +0000 (18:15 -0700)]
osd: disable agent when stats_invalid (post-split)
After a split the pg stats are approximate but not precisely correct. Any
inaccuracy can be problematic for the agent because it determines the
level of effort and potentially full/blocking behavior based on that.
We could conceivably do some estimation here that is "safe" in that we
don't commit to too much effort (or back off later if it isn't paying off)
and never block, but that is error-prone.
Instead, just disable the agent until a scrub makes the stats reliable
again.
We should document that a scrub after split is recommended (in any case)
and especially important on cache tiers, but there are currently *no*
user docs about PG splitting.
Fixes: #7975
Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 4 Apr 2014 20:56:33 +0000 (13:56 -0700)]
osd/ReplicatedPG: do not hit_set_persist while potentially backfilling hit_set_*
The hit_set transactions may include both a modify of the new hit_set and
deletion of an old one, spanning the backfill boundary, and we may end up
sending a backfill target a blank transaction that does not correctly
remove the old object. Later it will notice the stray object and
throw an assertion.
Fix this by skipping hit_set_persist() if any of the backfill targets are
still working on the very first hash value in the PG (which is where all
of the hit_set objects live). This is coarse but simple.
Another solution would be to send separate ops for the trim/deletion and
new hit_set update, but that is a bit more complex and a bit more
runtime overhead (twice the messages).
Fixes: #7983
Signed-off-by: Sage Weil <sage@inktank.com>
mon: MonCommands.h: have 'auth' read-only operations require 'x' cap
This reintroduces the same semantics that were in place in dumpling prior
to the refactoring of the cap/command matching code.
We haven't added this requirement to auth read-write operations as that
would have the potential to break a lot of well-configured keyrings once
the users upgraded, without any significant gain -- we assume that if
they have set 'rw' caps on a given entity, they are indeed expecting said
entity to be a sort-of-privileged entity with regard to monitor access.
Fixes: #7919
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Greg Farnum [Wed, 12 Mar 2014 03:52:21 +0000 (20:52 -0700)]
SimpleLock: use MutationRef instead of raw pointers
While we're here, remove the non-const get_xlock_by() (because
we don't need it). Also note we return a full MutationRef
(instead of a ref to the stored one). It's necessary in case we
don't have a set-up more() object.
Greg Farnum [Fri, 7 Mar 2014 23:58:11 +0000 (15:58 -0800)]
mds: MDRequest: rename to MDRequestImpl, and declare MDRequestRef
We're switching the MDRequest to be used as a shared pointer. This is the
first step on the path to inserting an OpTracker into the MDS.
Give the MDRequestImpl a weak_ptr self_ref so that we can keep
using the elist for now.
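A minimal sketch of the self-reference idea, using hypothetical names (RequestImpl, RequestRef, make_request, ref_from_raw) rather than the real MDS types: code that only has a raw pointer, for example one taken from an intrusive list, can still recover an owning reference through the stored weak_ptr.

```cpp
#include <cassert>
#include <memory>

struct RequestImpl;
using RequestRef = std::shared_ptr<RequestImpl>;

// Hypothetical request object that remembers how to find its own owner.
struct RequestImpl {
  std::weak_ptr<RequestImpl> self_ref;  // non-owning, so no reference cycle
};

// Create the request and wire up its self reference.
RequestRef make_request() {
  RequestRef r = std::make_shared<RequestImpl>();
  r->self_ref = r;
  return r;
}

// Recover a shared reference from a raw pointer. Returns an empty ref if the
// object is no longer owned by any shared_ptr.
RequestRef ref_from_raw(RequestImpl* raw) {
  assert(raw != nullptr);
  return raw->self_ref.lock();
}
```

The standard library's std::enable_shared_from_this offers a similar facility; the explicit weak_ptr member is used here only because it matches the description above.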