Ilya Dryomov [Tue, 18 Mar 2014 16:06:12 +0000 (18:06 +0200)]
qa: test_alloc_hint: flush journal before prodding the FS
OSDs that for some reason get behind on processing their op queue break
expect_alloc_hint_eq(), as it pokes the FS and not the journal. Fix it
by flushing the journal before proceeding with anything else.
Ilya Dryomov [Tue, 18 Mar 2014 16:06:12 +0000 (18:06 +0200)]
osd: add flush_journal admin socket command
Add flush_journal admin socket command to be able to flush journal to
the permanent store for online osds. (For offline osds we already have
ceph-osd --flush-journal.)
Instead of copying the files in the ceph repository, which is less
convenient.
When building, the headers are ignored even though they do
not exist. When creating the tarball with make dist, it fails because
they cannot be found. I misread src/gf_int.h to be include/gf_int.h and
wrongfully thought the submodules were to blame. This is why they were
removed shortly after being added.
Dan Mick [Thu, 6 Mar 2014 22:33:39 +0000 (14:33 -0800)]
Makefiles: remove libkeyutils from every binary except two
Only rbd and mount_ceph need secret.c, and only secret.c needs libkeyutils;
remove it from LIBCOMMON_DEPS so it's not a dependency for everything,
remove secret.c from libcommon.a, and add it to mount.ceph/rbd's sources;
add LIBKEYID_LIB to mount.ceph/rbd's LDADD.
Samuel Just [Tue, 18 Mar 2014 22:47:44 +0000 (15:47 -0700)]
ReplicatedPG::do_op: delay if snapdir is unreadable
Since all we really need on a snapdir is the context, we really only
need it to be !missing. However, it might become !missing before it
becomes !unreadable. That allows ops to end up in the
waiting_for_degraded queue before one in waiting_for_unreadable is
woken, which allows the ops to be reordered. Rather than reintroduce an
extra waiting_for_missing queue, simply require !unreadable for snapdir
(which implies !missing).
Fixes: #7777
Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Tue, 18 Mar 2014 19:35:03 +0000 (12:35 -0700)]
PG::start_peering_interval: always send_notify if !primary
Otherwise, we might get into a situation where the primary
forgets about a stray pg. This is simpler and does not
increase the number of notifies by much.
Fixes: #7733
Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Tue, 18 Mar 2014 19:09:05 +0000 (12:09 -0700)]
PG::find_best_info: fix log_tail component
The previous logic should have kept the current best info if it found a
replica which best could log-recover, but p couldn't. However, the
continue in that loop advanced the inner loop instead of the outer loop
allowing the primary case to take over in cases where best had a longer
tail. Instead, we will prefer the longer tail regardless of the other
infos to simplify the logic.
Fixes: #7755
Signed-off-by: Samuel Just <sam.just@inktank.com>
Loic Dachary [Sun, 16 Mar 2014 16:00:25 +0000 (17:00 +0100)]
osd,mon: use profile instead of properties
The qa and functional tests are adapted to the new command prototype
requiring a profile instead of a list of properties. When possible the
implicit ruleset creation is used to simplify the test setup.
Loic Dachary [Sun, 16 Mar 2014 20:11:56 +0000 (21:11 +0100)]
mon: set the profile and ruleset defaults early
The poolstr is removed from the prepare_pool_crush_ruleset prototype
because it no longer decides for the default ruleset, if it is not
omitted by the caller of osd pool create.
If no profile
    profile = default
If no ruleset and profile is default
    ruleset = erasure-code
If no ruleset and profile is not default
    ruleset = the name of the pool
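The defaulting rules above can be sketched in a few lines of Python. This is only an illustration of the decision table, not the monitor's C++ code; the function name `resolve_defaults` and its signature are hypothetical.

```python
from typing import Optional, Tuple

DEFAULT_PROFILE = "default"
DEFAULT_RULESET = "erasure-code"


def resolve_defaults(pool: str,
                     profile: Optional[str] = None,
                     ruleset: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper mirroring the rules in the commit message."""
    # If no profile: profile = default
    if not profile:
        profile = DEFAULT_PROFILE
    if not ruleset:
        if profile == DEFAULT_PROFILE:
            # No ruleset and default profile: use the erasure-code ruleset
            ruleset = DEFAULT_RULESET
        else:
            # No ruleset and a custom profile: name the ruleset after the pool
            ruleset = pool
    return profile, ruleset
```

A ruleset passed explicitly by the caller is returned unchanged, whichever profile is in use.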
Loic Dachary [Sun, 16 Mar 2014 15:24:58 +0000 (16:24 +0100)]
mon: add crush ruleset name to osd pool create
The ruleset to be used for the new erasure coded pool was expected in
the properties, under the name crush_ruleset. It does not belong to the
erasure code profile and needs to be added to the prototype explicitly.
The crush ruleset name is added to the prototype of the prepare_new_pool
and prepare_pool_crush_ruleset methods.
Loic Dachary [Sun, 16 Mar 2014 15:13:17 +0000 (16:13 +0100)]
osd: use erasure code profile when building the PGBackend
The PGBackend::build_pg_backend() prototype is modified to add an
OSDMapRef which is used to get the profile stored in the pg_pool_t
and pass it to the erasure code plugin.
Loic Dachary [Mon, 17 Mar 2014 21:47:03 +0000 (22:47 +0100)]
mon: OSDMonitor use erasure_code_profile instead of properties
The prepare_pool_properties is replaced with the
parse_erasure_code_profile method which looks up the same content from
OSDMap::erasure_code_profiles instead of the argument vector.
The type and name substitution is applied in OSDMonitor and the
MonCommand prototypes for osd pool create and osd create-erasure are
updated.
Loic Dachary [Sun, 16 Mar 2014 12:27:51 +0000 (13:27 +0100)]
osd: obsolete pg_pool_t properties with erasure_code_profile
The generic properties map is only used for erasure code and should be
specialized. It is obsoleted by a string that can be looked up in
OSDMap::erasure_code_profiles to retrieve the key/value pairs that will
be interpreted by the erasure code plugin.
Loic Dachary [Sun, 16 Mar 2014 12:00:50 +0000 (13:00 +0100)]
mon: add the erasure-code-profile {set,get,rm,ls} MonCommand
"erasure-code-profile set" parses the key=value pairs given in argument
and stores them in OSDMap::erasure_code_profiles. The
"erasure-code-profile get" supports plain text display if the Formatter
is not set (or invalid).
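As a rough illustration of the key=value parsing described above, here is a minimal Python sketch. It is not the monitor's implementation: `parse_profile` is a hypothetical name, the plain dict stands in for OSDMap::erasure_code_profiles, the example keys are illustrative, and rejecting malformed pairs with ValueError is an assumption about error handling.

```python
from typing import Dict, Iterable


def parse_profile(pairs: Iterable[str]) -> Dict[str, str]:
    """Turn ["k=2", "m=1"] into {"k": "2", "m": "1"} (hypothetical helper)."""
    profile: Dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            # Assumption: malformed input is rejected rather than ignored.
            raise ValueError("expected key=value, got {0!r}".format(pair))
        profile[key] = value
    return profile


# Stand-in for OSDMap::erasure_code_profiles: profile name -> key/value pairs.
erasure_code_profiles: Dict[str, Dict[str, str]] = {}
erasure_code_profiles["myprofile"] = parse_profile(["k=2", "m=1"])
```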
Sage Weil [Mon, 17 Mar 2014 23:21:17 +0000 (16:21 -0700)]
mon/Paxos: commit only after entire quorum acks
If a subset of the quorum accepts the proposal and we commit, we will start
sharing the new state. However, the mon that didn't yet reply with the
accept may still be sharing the old and stale value.
The simplest way to prevent this is not to commit until the entire quorum
replies. In the general case, there are no failures and this is just fine.
In the failure case, we will call a new election and have a smaller quorum
of (live) nodes and will recommit the same value.
A more performant solution would be to have a separate message invalidate
the old state and commit once we have all invalidations and a majority of
accepts. This will lower latency a bit in the non-failure case, but not
change the failure case significantly. Later!
Fixes: #7736
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
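The difference between committing on a majority and committing on the entire quorum can be sketched as follows. This is a toy Python model with hypothetical function names, using sets of monitor ranks; it is not taken from the Paxos code.

```python
from typing import Set


def majority_accepted(quorum: Set[int], accepted: Set[int]) -> bool:
    """Majority check: true as soon as more than half of the quorum has acked."""
    return len(accepted & quorum) > len(quorum) // 2


def can_commit(quorum: Set[int], accepted: Set[int]) -> bool:
    """Rule described in this commit: wait for every member of the quorum."""
    return quorum <= accepted


quorum = {0, 1, 2}
partial = {0, 1}      # mon 2 has not replied yet and may still share old state
complete = {0, 1, 2}
```

With `partial`, the majority check passes while `can_commit` does not, which is exactly the window in which the lagging mon could still serve the stale value.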
John Spray [Mon, 17 Mar 2014 16:17:05 +0000 (16:17 +0000)]
ceph.in: Better error on bad arg to 'tell'
Previously would get a rather enigmatic error:
UnboundLocalError: local variable 'ret' referenced before assignment
Now give something sensible:
ceph_argparse.ArgumentValid: Bad target type 'mds'
Also update a couple of the other catch-all exception handlers
so that they will let the (nicer) ArgumentError exception through
for humans to see instead of munging them into RuntimeErrors.
Signed-off-by: John Spray <john.spray@inktank.com>
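The failure mode and the fix can be illustrated with a small self-contained Python sketch. None of this is the actual ceph.in code: `ArgumentValid` here is a local stand-in class, the function names are hypothetical, and the set of accepted target types is invented for the example.

```python
class ArgumentValid(Exception):
    """Local stand-in for the exception named in the commit message."""


def send_old(target_type: str) -> int:
    # 'ret' is only assigned for known target types...
    if target_type in ("osd", "mon"):
        ret = 0
    # ...so any other type reaches this line with 'ret' unbound.
    return ret


def send_new(target_type: str) -> int:
    if target_type in ("osd", "mon"):
        return 0
    raise ArgumentValid("Bad target type '{0}'".format(target_type))


def run(target_type: str) -> int:
    try:
        return send_new(target_type)
    except ArgumentValid:
        # Let the readable error through untouched.
        raise
    except Exception as e:
        # Everything else is still wrapped by the catch-all.
        raise RuntimeError("unexpected failure: {0}".format(e))
```

Calling `send_old("mds")` raises UnboundLocalError, while `run("mds")` surfaces the ArgumentValid message instead of a RuntimeError.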
Loic Dachary [Sun, 16 Mar 2014 15:16:26 +0000 (16:16 +0100)]
mon: add helper to selection functions implementing tests
The call_TEST_functions will look up SHARE_MON_TEST_* shell functions and
run them after creating a mon. Each is expected to clean up and avoid
interference. It will also look up TEST_* shell functions and include them
in setup/teardown to provide them with a clean environment.
John Spray [Mon, 17 Mar 2014 14:48:59 +0000 (14:48 +0000)]
mds: avoid spurious TMAP2OMAP warning
The message "one or more OSDs do not support TMAP2OMAP" was printed
incorrectly when zero OSDs were up (and therefore the feature was
absent). Don't issue this prompt until at least one OSD is up.
Signed-off-by: John Spray <john.spray@inktank.com>
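A minimal sketch of the guard, in Python for illustration only: the feature bit value and the function name are hypothetical, and the real check lives in the MDS C++ code.

```python
from typing import List

FEATURE_TMAP2OMAP = 1 << 0  # hypothetical bit value, for illustration only


def should_warn(up_osd_features: List[int]) -> bool:
    """Warn only if at least one OSD is up and some up OSD lacks the feature."""
    if not up_osd_features:
        # Zero OSDs up: the feature looks absent, but there is nothing to warn about.
        return False
    return any(not (features & FEATURE_TMAP2OMAP) for features in up_osd_features)
```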
John Spray [Mon, 17 Mar 2014 11:11:16 +0000 (11:11 +0000)]
tools/rados: Allow binary file output of omap data
Extends getomapval and getomapheader to take an output
file argument the same way 'get' does. Removes the misleading
'-o' line from usage(), as there doesn't appear to be any
such option.
Signed-off-by: John Spray <john.spray@inktank.com>
Loic Dachary [Sun, 16 Mar 2014 11:14:30 +0000 (12:14 +0100)]
erasure-code: remove dependency to the global context
Instead of relying on derr to display error messages, add them to an
ostream parameter given in argument to load() and factory(). The erasure
code convenience library no longer depends on the global context that is
indirectly referenced by debug.h
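The pattern of reporting errors through a caller-supplied stream rather than a global logger can be sketched like this. It is a Python analogy with hypothetical names and a dict standing in for the plugin registry, not the C++ load() or factory() signatures.

```python
import errno
import io
from typing import Callable, Dict, TextIO


def load(plugin_name: str,
         registry: Dict[str, Callable[[], object]],
         errors: TextIO) -> int:
    """Hypothetical loader: writes diagnostics to 'errors', never to a global."""
    if plugin_name not in registry:
        errors.write("load: plugin {0!r} not found\n".format(plugin_name))
        return -errno.ENOENT
    return 0


# The caller decides where messages go; here they are captured in memory.
errors = io.StringIO()
status = load("missing", {}, errors)
```

Because the function only touches the stream it is handed, it can be used and tested without initializing any process-wide context.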
Loic Dachary [Sun, 16 Mar 2014 11:09:51 +0000 (12:09 +0100)]
common,erasure-code,mon: s/erasure-code-//
The parameters to erasure code do not need to be prefixed with the
erasure-code- string. There are only erasure-code parameters, and the
prefix was originally intended to disambiguate the erasure-code
properties, assuming that the properties map could be used for other
purposes.
Loic Dachary [Mon, 3 Mar 2014 14:40:13 +0000 (15:40 +0100)]
mon: tests for pool create erasure implicit ruleset creation
* Remove the tests checking that a missing or wrong crush_ruleset
parameter triggered an error.
* Add a test checking that a ruleset with the same name as the pool is
created implicitly when no crush_ruleset is specified.
Loic Dachary [Mon, 3 Mar 2014 14:25:21 +0000 (15:25 +0100)]
mon: pool create erasure implicit ruleset creation
If the crush_ruleset parameter is missing, set it to the pool name.
If the crush_ruleset parameter is set to a name that does not match any
of the existing rulesets, create one using the pool creation parameters.
If the ruleset exists and is in the pending map or if the ruleset was
just created (meaning it exists in the pending map), the
prepare_pool_crush_ruleset method returns EAGAIN so that the pool
creation message is retried after the pending map is proposed.
If the ruleset exists, it is used to initialize the newly created pool,
as before.
Create the ruleset and branch depending on the result:
* If it succeeds, wait
* If it already exists and is pending (-EALREADY), wait
* If it already exists (-EEXIST), return immediately
* If it fails for other reasons, return immediately
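The branching above can be sketched in Python as follows. This is an illustration under stated assumptions, not the monitor's code: `create_ruleset` is a hypothetical callable returning 0 or a negative errno, "wait" is modeled as returning -EAGAIN (as the message describes for the retry), and returning 0 on -EEXIST is an assumption about what "return immediately" yields.

```python
import errno
from typing import Callable


def prepare_pool_crush_ruleset(create_ruleset: Callable[[str], int],
                               ruleset_name: str) -> int:
    """Hypothetical sketch of the result handling described above."""
    err = create_ruleset(ruleset_name)
    if err == 0 or err == -errno.EALREADY:
        # Created just now, or already pending: retry once the pending
        # map has been proposed.
        return -errno.EAGAIN
    if err == -errno.EEXIST:
        # Assumption: an existing ruleset can be used right away.
        return 0
    # Any other failure is reported to the caller unchanged.
    return err
```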
Loic Dachary [Mon, 3 Mar 2014 13:36:50 +0000 (14:36 +0100)]
mon: create crush_ruleset_create_erasure helper
Move the code block verbatim, from "osd crush rule create-erasure" to the
new crush_ruleset_create_erasure() method helper. This step helps
separate the code changes from the code moving around unmodified.
Samuel Just [Sun, 16 Mar 2014 00:58:35 +0000 (17:58 -0700)]
OSD::handle_pg_query: on dne pg, send lb=hobject_t() if deleting
We will set lb=hobject_t() if we resurrect the pg. In that case,
we need to have sent that to the primary beforehand. If we
finish the removal before the pg is recreated, we'll just end
up backfilling it, which is ok since the pg doesn't exist anyway.
Fixes: #7740
Signed-off-by: Samuel Just <sam.just@inktank.com>