Sage Weil [Tue, 8 Apr 2014 17:52:43 +0000 (10:52 -0700)]
os/FileStore: reset journal state on umount
We observed a sequence like:
- replay journal
- sets JournalingObjectStore applied_op_seq
- umount
- mount
- initiate commit with previous applied_op_seq
- replay journal
- commit finishes
- on replay commit, we fail assert op > committed_seq
Although strictly speaking the assert failure is harmless here, we should
not let state leak through from a previous mount into this mount; otherwise
assertions become much harder to reason about.
Fixes: #8019
Signed-off-by: Sage Weil <sage@inktank.com>
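The sequence above can be reproduced in a small model. This is a minimal sketch, not the FileStore code: JournalStateSketch and its methods are hypothetical, and committed_seq_ stands in for the value that survives on disk. Only the field name applied_op_seq and the assert come from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a journaling store; not the real
// JournalingObjectStore class.
class JournalStateSketch {
public:
  // Apply one op replayed from the journal.
  void replay_op(uint64_t seq) {
    assert(seq > committed_seq_);  // the invariant quoted in the commit message
    applied_op_seq_ = seq;
  }

  // Commit whatever has been applied so far; never moves backwards.
  void commit() {
    if (applied_op_seq_ > committed_seq_)
      committed_seq_ = applied_op_seq_;
  }

  // Drop per-mount state so the next mount starts clean.
  void umount() { applied_op_seq_ = 0; }

  uint64_t applied_op_seq() const { return applied_op_seq_; }
  uint64_t committed_seq() const { return committed_seq_; }

private:
  uint64_t applied_op_seq_ = 0;  // in-memory, per-mount state
  uint64_t committed_seq_ = 0;   // models the persisted commit point
};
```

If umount() left applied_op_seq_ untouched, a commit issued right after the next mount would advance committed_seq_ to the stale value, and replaying the same op would then trip the assert.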
Sage Weil [Tue, 8 Apr 2014 17:58:53 +0000 (10:58 -0700)]
vstart.sh: make crush location match up with what init-ceph does
This makes it so that ./init-ceph restart osd.0 won't modify the CRUSH
tree. And in any case, the localhost/localrack thing we were doing before
was pretty useless.
The main change is to use shared_ptr instead of weak_ptr to define the
active request map. The reason is that a slave request needs to be
preserved until the master explicitly finishes it.
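The ownership difference can be sketched as follows. RequestSketch, the two map aliases and alive() are hypothetical names for illustration, not the MDS types.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical request type; stands in for whatever the map tracks.
struct RequestSketch {
  int id;
};

// Observing map: an entry expires as soon as the last outside owner goes away.
using WeakRequestMap = std::map<int, std::weak_ptr<RequestSketch>>;

// Owning map: an entry stays alive until it is explicitly erased.
using OwningRequestMap = std::map<int, std::shared_ptr<RequestSketch>>;

// True if the weak map still refers to a live request with this id.
inline bool alive(const WeakRequestMap& m, int id) {
  auto it = m.find(id);
  return it != m.end() && !it->second.expired();
}
```

With the owning map, erasing the entry is the explicit "finish" step; with the observing map, the request can disappear while the map still holds its key.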
erasure-code: thread-safe initialization of gf-complete
Instead of relying on an implicit initialization happening during
encoding/decoding with galois.c:galois_init_default_field, call
gf.c:gf_init_easy for each w value when the plugin is loaded.
Loading the plugin is protected against race conditions by a lock.
It does not cover all possible uses of gf-complete but it is enough for
the ceph jerasure plugin.
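The pattern of eager initialization under a lock can be sketched like this. FieldInitSketch is hypothetical and does not call into gf-complete; the list of word sizes is an assumption chosen for the example.

```cpp
#include <cassert>
#include <initializer_list>
#include <mutex>
#include <set>

// Hypothetical registry standing in for per-w field setup.
class FieldInitSketch {
public:
  // Initialize every supported word size up front, under one lock,
  // instead of lazily on the first encode/decode.
  void load_plugin() {
    std::lock_guard<std::mutex> guard(lock_);
    for (int w : {4, 8, 16, 32}) {  // example word sizes (assumption)
      initialized_.insert(w);       // a real plugin would set up the field here
    }
  }

  // True once the field for this word size has been set up.
  bool ready(int w) {
    std::lock_guard<std::mutex> guard(lock_);
    return initialized_.count(w) > 0;
  }

private:
  std::mutex lock_;
  std::set<int> initialized_;
};
```

Calling load_plugin() twice is harmless in this sketch, which is the property that matters when several threads race to load the plugin.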
mon: MonCommands: have all 'auth' commands require 'execute' caps
An earlier patch already made the entity require 'execute' caps for
read-only commands. This patch introduces the same requirement for *all*
auth commands, read-only and read-write alike.
While the rationale behind the earlier patch for leaving read-write
operations out of this requirement still holds, we now enforce it to stay
compatible with Dumpling's behavior with regard to the 'execute' cap being
required for auth commands. However, it should
be noted that back on Dumpling we were only requiring the 'execute' cap
for auth commands, regardless of read-only or read-write, and no other
caps were required.
Fixes: #7919
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Sun, 6 Apr 2014 23:03:50 +0000 (16:03 -0700)]
osd: fix map subscription in YOU_DIED osd_ping handler
If we have epoch X and find out we died as of epoch Y, we still want to
request X+1. Among other things, this fixes a 'stall' when Y happens to be
the most recent map published and no new maps are generated: in that case
we never get anything back from our subscription.
This makes this osdmap_subscribe() caller match every other caller by
passing in current epoch + 1.
Fixes: #8002
Signed-off-by: Sage Weil <sage@inktank.com>
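The rule itself is one line. A minimal sketch, assuming a plain integer epoch; next_subscribe_epoch is a hypothetical helper, not the osdmap_subscribe() signature.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = uint32_t;  // assumption: epochs modelled as plain integers

// Epoch to subscribe from after a YOU_DIED ping. The epoch at which we were
// marked dead is deliberately ignored: we always want the map right after
// the one we already have.
inline epoch_t next_subscribe_epoch(epoch_t current_epoch,
                                    epoch_t /*died_epoch*/) {
  return current_epoch + 1;
}
```

For example, with X = 10 and Y = 20 the request is for epoch 11, so maps 11 through 20 are delivered even if no map newer than 20 is ever published.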
Sage Weil [Wed, 2 Apr 2014 15:49:33 +0000 (08:49 -0700)]
msgr: add ms_dump_on_send option
This is useful only for debugging. The encoded contents of a message are
dumped to the log on message send. This is useful when valgrind is
triggering warnings about uninitialized memory in messages because the
call chain will indicate which message type is to blame, whereas the
usual writer thread context does not tell us any useful information.
Sage Weil [Sat, 5 Apr 2014 23:58:55 +0000 (16:58 -0700)]
mon: wait for quorum for MMonGetVersion
We should not respond to checks for map versions when we are in the
probing or electing states or else clients will get incorrect results when
they ask what the latest map version is.
Fixes: #7997
Signed-off-by: Sage Weil <sage@inktank.com>
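One way to picture the behaviour is a responder that queues requests until it is in quorum. This is a sketch under assumptions: VersionResponderSketch, MonStateSketch and the integer client ids are hypothetical and do not mirror the monitor's real classes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class MonStateSketch { Probing, Electing, InQuorum };

// Hypothetical responder for "what is the latest map version?" requests.
class VersionResponderSketch {
public:
  // Answers immediately only when in quorum; otherwise the client is queued
  // and the function returns false without touching *version_out.
  bool handle_get_version(int client_id, uint64_t* version_out) {
    assert(version_out != nullptr);
    if (state_ != MonStateSketch::InQuorum) {
      waiting_.push_back(client_id);
      return false;
    }
    *version_out = latest_version_;
    return true;
  }

  // Entering quorum makes the version trustworthy and releases the waiters,
  // which the caller is expected to answer now.
  std::vector<int> enter_quorum(uint64_t latest_version) {
    state_ = MonStateSketch::InQuorum;
    latest_version_ = latest_version;
    std::vector<int> ready;
    ready.swap(waiting_);
    return ready;
  }

private:
  MonStateSketch state_ = MonStateSketch::Probing;
  uint64_t latest_version_ = 0;
  std::vector<int> waiting_;
};
```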
Sage Weil [Sat, 5 Apr 2014 01:15:04 +0000 (18:15 -0700)]
osd: disable agent when stats_invalid (post-split)
After a split the pg stats are approximate but not precisely correct. Any
inaccuracy can be problematic for the agent because it determines the
level of effort and potentially full/blocking behavior based on that.
We could conceivably do some estimation here that is "safe" in that we
don't commit to too much effort (or back off later if it isn't paying off)
and never block, but that is error-prone.
Instead, just disable the agent until a scrub makes the stats reliable
again.
We should document that a scrub after split is recommended (in any case)
and especially important on cache tiers, but there are currently *no*
user docs about PG splitting.
Fixes: #7975
Signed-off-by: Sage Weil <sage@inktank.com>
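The gating logic amounts to a flag that a split sets and a scrub clears. A minimal sketch: only the name stats_invalid comes from the commit title, and the struct and helper functions are hypothetical.

```cpp
#include <cassert>

// Hypothetical per-PG stats holder.
struct PGStatsSketch {
  bool stats_invalid = false;
};

// After a split the stats are only approximate.
inline void on_split(PGStatsSketch& s) { s.stats_invalid = true; }

// A completed scrub makes the stats reliable again.
inline void on_scrub_complete(PGStatsSketch& s) { s.stats_invalid = false; }

// The agent runs only while the stats can be trusted.
inline bool agent_enabled(const PGStatsSketch& s) { return !s.stats_invalid; }
```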
Sage Weil [Fri, 4 Apr 2014 20:56:33 +0000 (13:56 -0700)]
osd/ReplicatedPG: do not hit_set_persist while potentially backfilling hit_set_*
The hit_set transactions may include both a modify of the new hit_set and
deletion of an old one, spanning the backfill boundary, and we may end up
sending a backfill target a blank transaction that does not correctly
remove the old object. Later it will notice the stray object and
throw an assertion.
Fix this by skipping hit_set_persist() if any of the backfill targets are
still working on the very first hash value in the PG (which is where all
of the hit_set objects live). This is coarse but simple.
Another solution would be to send separate ops for the trim/deletion and
new hit_set update, but that is a bit more complex and a bit more
runtime overhead (twice the messages).
Fixes: #7983
Signed-off-by: Sage Weil <sage@inktank.com>
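The skip condition can be sketched as a check over backfill progress. This assumes each target's progress is represented by the hash value it has reached; should_persist_hit_set and that representation are hypothetical, not the ReplicatedPG code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns false while any backfill target has not yet moved past the first
// hash value in the PG, which is where the hit_set objects live.
// `backfill_positions` holds one progress value per target (assumption).
inline bool should_persist_hit_set(
    uint32_t first_hash, const std::vector<uint32_t>& backfill_positions) {
  for (uint32_t pos : backfill_positions) {
    if (pos <= first_hash)
      return false;  // this target may still be copying hit_set objects
  }
  return true;  // no targets, or all of them are past the first hash
}
```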
mon: MonCommands.h: have 'auth' read-only operations require 'x' cap
This reintroduces the same semantics that were in place in dumpling prior
to the refactoring of the cap/command matching code.
We haven't added this requirement to auth read-write operations as that
would have the potential to break a lot of well-configured keyrings once
the users upgraded, without any significant gain -- we assume that if
they have set 'rw' caps on a given entity, they are indeed expecting said
entity to be a sort-of-privileged entity with regard to monitor access.
Fixes: #7919
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Greg Farnum [Wed, 12 Mar 2014 03:52:21 +0000 (20:52 -0700)]
SimpleLock: use MutationRef instead of raw pointers
While we're here, remove the non-const get_xlock_by() (because
we don't need it). Also note we return a full MutationRef
(instead of a ref to the stored one). It's necessary in case we
don't have a set-up more() object.
Greg Farnum [Fri, 7 Mar 2014 23:58:11 +0000 (15:58 -0800)]
mds: MDRequest: rename to MDRequestImpl, and declare MDRequestRef
We're switching the MDRequest to be used as a shared pointer. This is the
first step on the path to inserting an OpTracker into the MDS.
Give the MDRequestImpl a weak_ptr self_ref so that we can keep
using the elist for now.
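The self-reference idea can be sketched as below. The Sketch-suffixed types and make_request() are hypothetical; only the names MDRequestImpl, MDRequestRef and self_ref come from the message, and the real class carries far more state.

```cpp
#include <cassert>
#include <memory>

struct MDRequestImplSketch;
using MDRequestRefSketch = std::shared_ptr<MDRequestImplSketch>;

struct MDRequestImplSketch {
  // Weak, so the object does not keep itself alive.
  std::weak_ptr<MDRequestImplSketch> self_ref;

  // Recover an owning reference when only a raw pointer is at hand,
  // for example one found by walking an intrusive list.
  MDRequestRefSketch get_ref() const { return self_ref.lock(); }
};

// Create a request and wire up its self-reference.
inline MDRequestRefSketch make_request() {
  MDRequestRefSketch r = std::make_shared<MDRequestImplSketch>();
  r->self_ref = r;
  return r;
}
```

The standard library offers std::enable_shared_from_this for the same purpose; the explicit weak_ptr member is shown here because that is what the message describes.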
ReplicatedPG: fix CEPH_OSD_OP_CREATE on cache pools
The following
./ceph osd pool create data-cache 8 8
./ceph osd tier add data data-cache
./ceph osd tier cache-mode data-cache writeback
./ceph osd tier set-overlay data data-cache
./rados -p data create foo
./rados -p data stat foo
results in
error stat-ing data/foo: No such file or directory
even though foo exists in the data-cache pool, as it should. STAT
checks for (exists && !is_whiteout()), but the whiteout flag isn't
cleared on CREATE as it is on WRITE and WRITEFULL. The problem is
that, for newly created 0-sized cache pool objects, the CREATE handler in
do_osd_ops() doesn't get a chance to queue OP_TOUCH, and so the logic
in prepare_transaction() considers CREATE to be a read and therefore
doesn't clear whiteout. Fix it by allowing the CREATE handler to queue
OP_TOUCH at all times, mimicking WRITE and WRITEFULL behaviour.
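The interaction can be modelled in a few lines. This is a simplified sketch: the types and functions below are hypothetical stand-ins for the object state, the queued transaction, and the roles of the CREATE handler and prepare_transaction().

```cpp
#include <cassert>

// Hypothetical object state as STAT sees it.
struct ObjectStateSketch {
  bool exists = false;
  bool whiteout = false;
};

// Hypothetical transaction: tracks only whether a touch was queued.
struct TxnSketch {
  bool touch_queued = false;
};

// After the fix: CREATE always queues a touch, as WRITE and WRITEFULL do.
inline void handle_create_sketch(TxnSketch& t) { t.touch_queued = true; }

// An empty transaction is treated as a read and leaves the whiteout alone;
// a non-empty one marks the object as existing and clears the whiteout.
inline void prepare_transaction_sketch(ObjectStateSketch& o,
                                       const TxnSketch& t) {
  if (!t.touch_queued)
    return;
  o.exists = true;
  o.whiteout = false;
}

// STAT succeeds only for objects that exist and are not whiteouts.
inline bool stat_visible(const ObjectStateSketch& o) {
  return o.exists && !o.whiteout;
}
```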