Loic Dachary [Sun, 16 Mar 2014 16:00:25 +0000 (17:00 +0100)]
osd,mon: use profile instead of properties
The qa and functional tests are adapted to the new command prototype
requiring a profile instead of a list of properties. When possible the
implicit ruleset creation is used to simplify the test setup.
Loic Dachary [Sun, 16 Mar 2014 20:11:56 +0000 (21:11 +0100)]
mon: set the profile and ruleset defaults early
The poolstr is removed from the prepare_pool_crush_ruleset prototype
because it no longer decides the default ruleset, if it is not
omitted by the caller of osd pool create.
If no profile
    profile = default
If no ruleset and profile is default
    ruleset = erasure-code
If no ruleset and profile is not default
    ruleset = the name of the pool
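The defaulting rules above can be sketched as follows. This is only an illustration in Python, not the monitor code: the function name resolve_defaults and its arguments are hypothetical.

```python
def resolve_defaults(pool_name, profile=None, ruleset=None):
    """Hypothetical helper mirroring the three defaulting rules."""
    # If no profile is given, fall back to the profile named "default".
    if not profile:
        profile = "default"
    # If no ruleset is given, the choice depends on the profile.
    if not ruleset:
        if profile == "default":
            ruleset = "erasure-code"
        else:
            ruleset = pool_name
    return profile, ruleset


print(resolve_defaults("mypool"))
print(resolve_defaults("mypool", profile="myprofile"))
```

An explicitly supplied profile or ruleset is returned unchanged.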
Loic Dachary [Sun, 16 Mar 2014 15:24:58 +0000 (16:24 +0100)]
mon: add crush ruleset name to osd pool create
The ruleset to be used for the new erasure coded pool was expected in
the properties, under the name crush_ruleset. It does not belong to the
erasure code profile and needs to be added to the prototype explicitly.
The crush ruleset name is added to the prototype of the prepare_new_pool
and prepare_pool_crush_ruleset methods.
Loic Dachary [Sun, 16 Mar 2014 15:13:17 +0000 (16:13 +0100)]
osd: use erasure code profile when building the PGBackend
The PGBackend::build_pg_backend() prototype is modified to add an
OSDMapRef which is used to get the profile stored in the pg_pool_t
and pass it to the erasure code plugin.
Loic Dachary [Mon, 17 Mar 2014 21:47:03 +0000 (22:47 +0100)]
mon: OSDMonitor use erasure_code_profile instead of properties
The prepare_pool_properties is replaced with the
parse_erasure_code_profile method which looks up the same content from
OSDMap::erasure_code_profiles instead of the argument vector.
The type and name substitution is applied in OSDMonitor and the
MonCommand prototypes for osd pool create and osd create-erasure are
updated.
Loic Dachary [Sun, 16 Mar 2014 12:27:51 +0000 (13:27 +0100)]
osd: obsolete pg_pool_t properties with erasure_code_profile
The generic properties map is only used for erasure code and should be
specialized. It is obsoleted by a string that can be looked up in
OSDMap::erasure_code_profiles to retrieve the key/value pairs that will
be interpreted by the erasure code plugin.
Loic Dachary [Sun, 16 Mar 2014 12:00:50 +0000 (13:00 +0100)]
mon: add the erasure-code-profile {set,get,rm,ls} MonCommand
"erasure-code-profile set" parses the key=value pairs given as arguments
and stores them in OSDMap::erasure_code_profiles.
"erasure-code-profile get" supports plain text display if the Formatter
is not set (or is invalid).
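A minimal sketch of the key=value parsing described above, in Python. The function parse_profile, the dict standing in for OSDMap::erasure_code_profiles, the profile name and the sample keys are all assumptions made for illustration. Rejecting an argument without "=" is also an assumption, not confirmed behaviour of the monitor.

```python
def parse_profile(args):
    """Turn ["k=2", "m=1"] into {"k": "2", "m": "1"} (hypothetical helper)."""
    profile = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            # Assumption: a malformed pair is treated as an error.
            raise ValueError("expected key=value, got %r" % arg)
        profile[key] = value
    return profile


# Stand-in for OSDMap::erasure_code_profiles: profile name -> key/value map.
erasure_code_profiles = {}
erasure_code_profiles["myprofile"] = parse_profile(["k=2", "m=1"])
```

A pool would then only need to carry the string "myprofile", and the key/value pairs are looked up by name when required.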
Sage Weil [Mon, 17 Mar 2014 23:21:17 +0000 (16:21 -0700)]
mon/Paxos: commit only after entire quorum acks
If a subset of the quorum accepts the proposal and we commit, we will start
sharing the new state. However, the mon that didn't yet reply with the
accept may still be sharing the old and stale value.
The simplest way to prevent this is not to commit until the entire quorum
replies. In the general case, there are no failures and this is just fine.
In the failure case, we will call a new election and have a smaller quorum
of (live) nodes and will recommit the same value.
A more performant solution would be to have a separate message invalidate
the old state and commit once we have all invalidations and a majority of
accepts. This will lower latency a bit in the non-failure case, but not
change the failure case significantly. Later!
Fixes: #7736
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
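The commit rule from the mon/Paxos change above can be illustrated with a small Python sketch. The function name can_commit and the use of integer ranks are hypothetical; the real logic lives in the C++ Paxos code.

```python
def can_commit(quorum, accepted):
    """True only once every member of the current quorum has accepted."""
    return set(quorum) <= set(accepted)


quorum = [0, 1, 2]
print(can_commit(quorum, {0, 1}))     # one mon has not replied yet
print(can_commit(quorum, {0, 1, 2}))  # entire quorum has acked

# Failure case: after a new election the quorum is smaller, so the
# same set of accepts is now sufficient to recommit the value.
print(can_commit([0, 1], {0, 1}))
```

Waiting for the whole quorum means no quorum member can still be serving the stale value once the commit is visible.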
John Spray [Mon, 17 Mar 2014 16:17:05 +0000 (16:17 +0000)]
ceph.in: Better error on bad arg to 'tell'
Previously the user would get a rather enigmatic error:
UnboundLocalError: local variable 'ret' referenced before assignment
Now give something sensible:
ceph_argparse.ArgumentValid: Bad target type 'mds'
Also update a couple of the other catch-all exception handlers
so that they will let the (nicer) ArgumentError exception through
for humans to see instead of munging them into RuntimeErrors.
Signed-off-by: John Spray <john.spray@inktank.com>
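The exception handling pattern described in the ceph.in change above might look roughly like this. It is a sketch only: ArgumentError here is a local stand-in class, and find_target, run and VALID_TARGETS are hypothetical names, not the actual ceph.in code.

```python
class ArgumentError(Exception):
    """Local stand-in for the readable argument exceptions."""


# Assumption for illustration only: the set of accepted target types.
VALID_TARGETS = ("osd", "mon")


def find_target(kind):
    if kind not in VALID_TARGETS:
        raise ArgumentError("Bad target type '%s'" % kind)
    return kind


def run(kind):
    try:
        return find_target(kind)
    except ArgumentError:
        # Let the readable error through untouched for humans to see.
        raise
    except Exception as e:
        # Anything else is still wrapped by the catch-all handler.
        raise RuntimeError("unexpected failure: %s" % e)
```

Listing the specific exception before the catch-all is what keeps it from being munged into a RuntimeError.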
Loic Dachary [Sun, 16 Mar 2014 15:16:26 +0000 (16:16 +0100)]
mon: add helper to select functions implementing tests
The call_TEST_functions will look up SHARE_MON_TEST_* shell functions and
run them after creating a mon. Each is expected to clean up and avoid
interference. It will also look up TEST_* shell functions and include them
in setup/teardown to provide them with a clean environment.
John Spray [Mon, 17 Mar 2014 14:48:59 +0000 (14:48 +0000)]
mds: avoid spurious TMAP2OMAP warning
The message "one or more OSDs do not support TMAP2OMAP" was printed
incorrectly when zero OSDs were up (and therefore the feature was
absent). Don't issue this prompt until at least one OSD is up.
Signed-off-by: John Spray <john.spray@inktank.com>
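The condition described in the TMAP2OMAP change above can be sketched in Python. The function should_warn and the representation of each up OSD as a set of feature names are assumptions made for illustration.

```python
def should_warn(up_osd_features, feature="TMAP2OMAP"):
    """Warn only if at least one OSD is up and some up OSD lacks the feature."""
    if not up_osd_features:
        # Zero OSDs up: the feature is trivially absent, so stay quiet.
        return False
    return any(feature not in features for features in up_osd_features)
```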
John Spray [Mon, 17 Mar 2014 11:11:16 +0000 (11:11 +0000)]
tools/rados: Allow binary file output of omap data
Extends getomapval and getomapheader to take an output
file argument the same way 'get' does. Removes the misleading
'-o' line from usage(), since there doesn't appear to be any
such option.
Signed-off-by: John Spray <john.spray@inktank.com>
Loic Dachary [Sun, 16 Mar 2014 11:14:30 +0000 (12:14 +0100)]
erasure-code: remove dependency to the global context
Instead of relying on derr to display error messages, write them to an
ostream parameter passed to load() and factory(). The erasure
code convenience library no longer depends on the global context that is
indirectly referenced by debug.h.
Loic Dachary [Sun, 16 Mar 2014 11:09:51 +0000 (12:09 +0100)]
common,erasure-code,mon: s/erasure-code-//
The parameters to erasure code do not need to be prefixed with the
erasure-code- string. There are only erasure-code parameters, and the
prefix was originally intended to disambiguate the erasure-code
properties, assuming that the properties map could be used for other
purposes.
Loic Dachary [Mon, 3 Mar 2014 14:40:13 +0000 (15:40 +0100)]
mon: tests for pool create erasure implicit ruleset creation
* Remove the tests checking that a missing or wrong crush_ruleset
parameter triggered an error.
* Add a test checking that a ruleset with the same name as the pool is
created implicitly when no crush_ruleset is specified.
Loic Dachary [Mon, 3 Mar 2014 14:25:21 +0000 (15:25 +0100)]
mon: pool create erasure implicit ruleset creation
If the crush_ruleset parameter is missing, set it to the pool name.
If the crush_ruleset parameter is set to a name that does not match any
of the existing rulesets, create one using the pool creation parameters.
If the ruleset exists and is in the pending map or if the ruleset was
just created (meaning it exists in the pending map), the
prepare_pool_crush_ruleset method returns EAGAIN so that the pool
creation message is retried after the pending map is proposed.
If the ruleset exists, it is used to initialize the newly created pool,
as before.
Create the ruleset and branch depending on the result:
* If it succeeds, wait
* If it already exists and is pending (-EALREADY), wait
* If it already exists (-EEXIST), return immediately
* If it fails for other reasons, return immediately
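The four branches listed above can be summarised in a short Python sketch. The function next_step and the action names "wait", "proceed" and "fail" are hypothetical. The convention that the creation result is 0 on success or a negative errno is assumed from the -EALREADY and -EEXIST values mentioned in the message.

```python
import errno


def next_step(create_result):
    """Map the ruleset creation result to what pool creation does next."""
    if create_result == 0:
        # Just created: the ruleset only exists in the pending map.
        return "wait"
    if create_result == -errno.EALREADY:
        # Already exists but is still pending.
        return "wait"
    if create_result == -errno.EEXIST:
        # Already exists and is usable for the new pool.
        return "proceed"
    # Any other error is reported right away.
    return "fail"
```

In both "wait" cases the pool creation message is retried after the pending map is proposed.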
Loic Dachary [Mon, 3 Mar 2014 13:36:50 +0000 (14:36 +0100)]
mon: create crush_ruleset_create_erasure helper
Move the code block verbatim from "osd crush rule create-erasure" to the
new crush_ruleset_create_erasure() helper method. This step helps
separate the code changes from the code moving around unmodified.
Samuel Just [Sun, 16 Mar 2014 00:58:35 +0000 (17:58 -0700)]
OSD::handle_pg_query: on dne pg, send lb=hobject_t() if deleting
We will set lb=hobject_t() if we resurrect the pg. In that case,
we need to have sent that to the primary beforehand. If we
finish the removal before the pg is recreated, we'll just end
up backfilling it, which is ok since the pg doesn't exist anyway.
Fixes: #7740
Signed-off-by: Samuel Just <sam.just@inktank.com>
Yan, Zheng [Sat, 15 Mar 2014 12:37:37 +0000 (20:37 +0800)]
mds: fix corner case of pushing inline data
The following sequence of events can happen:
- Client releases an inode, queues cap release message.
- A 'lookup' reply brings the same inode back, but the reply doesn't
contain inline data because the MDS didn't receive the cap release
message and thought the client already had up-to-date inline data.
The fix is to trigger a getattr if the client finds inline_version is zero.
The getattr mask is set to CEPH_STAT_CAP_INLINE_DATA, so that the MDS knows
the client does not have inline data.
Sage Weil [Fri, 14 Mar 2014 23:32:48 +0000 (16:32 -0700)]
osd/ReplicatedPG: fix enqueue_front race
When requeuing an item at the front, we need to shuffle the items in
pg_for_processing if there is an entry for this PG there. If so, we need
to hold the qlock for the duration of the requeue of the shuffled item
back into the primary queue in order to avoid reshuffling items. For
example, consider the queue has
A B C D
- dequeue1 gets (pg, A), puts A in the processing list
- dequeue1 tries to lock pg, blocks
- enqueue_front on X takes qlock, swaps it for A, drops qlock
- dequeue2 gets (pg, B), puts B in the processing list
- enqueue_front pushes X back into the original list
so we have processing: X B queue: A C D
- dequeue* get X, then B, then A C D
If we hold qlock for the duration of the enqueue_front, we prevent dequeue2
from sneaking in and shuffling B into the processing list before we have
crammed A back onto the front of the list.
This may have caused #7712.
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Fri, 14 Mar 2014 21:48:31 +0000 (14:48 -0700)]
PG::issue_repop: only adjust peer_info last_updates if not temp
Temp object repops have version eversion_t() since they don't
actually send log entries. Updating the last_updates here
caused the peer info last_updates to be incorrect until the
next non-temp repop.
Fixes: #7718
Signed-off-by: Samuel Just <sam.just@inktank.com>
Danny Al-Gaaf [Wed, 12 Mar 2014 21:56:44 +0000 (22:56 +0100)]
RGWListBucketMultiparts: init max_uploads/default_max with 0
CID 717377 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member "max_uploads" is not initialized
in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member "default_max" is not initialized
in this constructor nor in any functions that it calls.
Sage Weil [Sat, 22 Feb 2014 17:35:27 +0000 (09:35 -0800)]
mon/PGMap: only recalculate min_last_epoch_clean if incremental touches old min
If the Incremental updates a value that used to equal the old min, we may
have raised it and need to recalculate it at the end. Otherwise, we can
avoid recalculating at all!
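The optimisation described above can be illustrated with a small Python sketch. The function apply_incremental, the dict of per-entry epochs and the key names are hypothetical; the real code operates on the PGMap structures in C++.

```python
def apply_incremental(epochs, updates, old_min):
    """Apply updates in place and return the new minimum epoch."""
    # Did any update replace a value that used to equal the old minimum?
    touched_old_min = any(epochs.get(key) == old_min for key in updates)
    epochs.update(updates)
    if touched_old_min or old_min is None:
        # The old minimum may have been raised: do the full recalculation.
        return min(epochs.values()) if epochs else None
    # The entry holding the old minimum is untouched, so only the newly
    # written values can lower it. No full scan is needed.
    return min([old_min] + list(updates.values()))


epochs = {"a": 5, "b": 7, "c": 9}
current_min = apply_incremental(epochs, {"b": 8}, 5)  # no full scan needed
```

The full scan over all values is only paid for when an update overwrites an entry that held the old minimum.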
Sage Weil [Fri, 14 Mar 2014 19:46:57 +0000 (12:46 -0700)]
unittest_ceph_argparse: fix warnings
In file included from test/ceph_argparse.cc:17:0:
../src/gtest/include/gtest/gtest.h: In function ‘testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = int, T2 = long unsigned int]’:
../src/gtest/include/gtest/gtest.h:1333:30: instantiated from ‘static testing::AssertionResult testing::internal::EqHelper::Compare(const char*, const char*, const T1&, const T2&) [with T1 = int, T2 = long unsigned int]’
test/ceph_argparse.cc:344:207: instantiated from here
warning: ../src/gtest/include/gtest/gtest.h:1263:3: comparison between signed and unsigned integer expressions [-Wsign-compare]
Samuel Just [Fri, 14 Mar 2014 20:09:30 +0000 (13:09 -0700)]
PG: clear want_pg_temp in clear_primary_state only if primary
Clearing it in that way in on_shutdown() can cause a stray
shard to clobber the want_pg_temp value created by the primary
shard on the same osd. Thus, instead only clear it if we are
the primary.
Fixes: #7719
Signed-off-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Fri, 14 Mar 2014 18:02:30 +0000 (11:02 -0700)]
mon: only do timecheck with known monmap
If we are still on monmap epoch 0, our mon ranks cannot yet be trusted
since there is not yet a shared source of truth from paxos. If we do
timechecks, the code gets confused about the ranks in e.g. the
timecheck_waiting map.
Fixes: #7692
Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Fri, 14 Mar 2014 01:16:19 +0000 (18:16 -0700)]
PG::activate: handle peer contiguous with primary, but not auth_log
The added case covers a situation where a replica is not contiguous with
the auth_log, but is contiguous with the primary. Reshuffling the
active set to handle this would be tricky, so instead we just go ahead
and backfill it anyway. This is probably preferable in any case since
the replica in question would have to be significantly behind.
Fixes: #7696
Signed-off-by: Samuel Just <sam.just@inktank.com>