Ilya Dryomov [Tue, 18 Mar 2014 16:06:12 +0000 (18:06 +0200)]
qa: test_alloc_hint: flush journal before prodding the FS
OSDs that for some reason get behind on processing their op queue break
expect_alloc_hint_eq(), as it pokes the FS and not the journal. Fix it
by flushing the journal before proceeding with anything else.
Ilya Dryomov [Tue, 18 Mar 2014 16:06:12 +0000 (18:06 +0200)]
osd: add flush_journal admin socket command
Add a flush_journal admin socket command so that the journal can be
flushed to the permanent store for online osds. (For offline osds we
already have ceph-osd --flush-journal.)
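A minimal, self-contained C++ sketch of the idea (AdminSocket, register_command and Journal here are illustrative stand-ins, not the real Ceph classes): a running daemon exposes a flush_journal command that flushes its journal on demand, which is presumably what the qa fix above relies on.
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>

    struct Journal {
      void flush() { std::cout << "journal flushed to the object store\n"; }
    };

    // Simplified stand-in for an admin-socket command table.
    class AdminSocket {
      std::map<std::string, std::function<void()>> commands;
    public:
      void register_command(const std::string& name, std::function<void()> fn) {
        commands[name] = std::move(fn);
      }
      void handle(const std::string& name) {
        auto it = commands.find(name);
        if (it != commands.end()) it->second();
        else std::cout << "unknown command: " << name << "\n";
      }
    };

    int main() {
      Journal journal;
      AdminSocket asok;
      // Online flushing; offline osds keep using `ceph-osd --flush-journal`.
      asok.register_command("flush_journal", [&journal] { journal.flush(); });
      asok.handle("flush_journal");
    }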
Instead of copying the files into the ceph repository, which is less
convenient. When building, the headers are ignored even though they do
not exist; when creating the tarball with make dist, it fails because
they cannot be found. I misread src/gf_int.h as include/gf_int.h and
wrongly thought the submodules were to blame, which is why they were
removed shortly after being added.
Yan, Zheng [Wed, 19 Mar 2014 03:09:07 +0000 (11:09 +0800)]
mds: don't mark scatter locks dirty when dirfrag is dirty
The journal replay code has a check that decides which scatter locks
should be marked dirty, so don't unconditionally mark scatter locks
dirty when the dirfrag is dirty.
Yan, Zheng [Wed, 5 Mar 2014 10:56:01 +0000 (18:56 +0800)]
mds: improve success rate of subtree exporting
When exporting a subtree, the migrator acquires the required locks,
then freezes the subtree and releases the locks. After the subtree is
frozen, it tries acquiring the same locks again.
This patch makes scatter locks keep their old states if the inode has
an exporting dirfrag. It improves the chance that the migrator acquires
all required locks once the subtree is frozen.
Yan, Zheng [Wed, 5 Mar 2014 01:14:56 +0000 (09:14 +0800)]
mds: fix open remote dirfrag deadlock
During subtree migration, the importer may need to open subtree
bound dirfrags. Opening subtree bound dirfrags happens after the
exporter freezes the exporting subtree, so the discover message for
opening subtree bound dirfrags should not wait for any freezing
tree/directory; otherwise a deadlock can happen.
In MDCache::handle_discover(), there are two cases that can cause
discover messages to wait for a freezing tree/directory. One case is
fetching bare-bone dirfrags. The other case is, when merging dirfrags,
some of the dirfrags are frozen and some are freezing.
Yan, Zheng [Wed, 19 Feb 2014 05:12:09 +0000 (13:12 +0800)]
mds: rollback slave request after slave prepare is journalled
A resolve ack message can abort slave requests that are being
journalled. The slave rollback does not handle this case properly. The
fix is to mark the slave request as aborted in this case; the slave
rollback code is executed after the slave prepare is safely journalled.
Dan Mick [Thu, 6 Mar 2014 22:33:39 +0000 (14:33 -0800)]
Makefiles: remove libkeyutils from every binary except two
Only rbd and mount.ceph need secret.c, and only secret.c needs libkeyutils;
remove it from LIBCOMMON_DEPS so it's not a dependency for everything,
remove secret.c from libcommon.a, and add it to mount.ceph/rbd's sources;
add LIBKEYID_LIB to mount.ceph/rbd's LDADD.
Samuel Just [Tue, 18 Mar 2014 22:47:44 +0000 (15:47 -0700)]
ReplicatedPG::do_op: delay if snapdir is unreadable
Since all we really need from a snapdir is the context, it really only
needs to be !missing. However, it might become !missing before it
becomes !unreadable. That allows ops to end up in the
waiting_for_degraded queue before one in waiting_for_unreadable is
woken, which allows the ops to be reordered. Rather than reintroduce an
extra waiting_for_missing queue, simply require !unreadable for snapdir
(which implies !missing).
Fixes: #7777
Signed-off-by: Samuel Just <sam.just@inktank.com>
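A small illustrative sketch of the gating rule (plain structs, not the ReplicatedPG code): requiring !unreadable is strictly stronger than requiring !missing, so every delayed op waits on the same queue and is woken in arrival order.
    struct ObjectGate {
      bool missing = true;
      bool unreadable = true;   // may clear later than `missing`
    };

    // With the fix, snapdir must be !unreadable (which implies !missing)
    // before the op is allowed to proceed.
    bool must_delay_op(const ObjectGate& snapdir) {
      return snapdir.unreadable;
    }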
Sage Weil [Sat, 15 Mar 2014 03:26:04 +0000 (20:26 -0700)]
common/PrioritizedQueue: fix remove_by_class() corner case
If i is the first entry, then setting cur = begin() sets us up to point at
something that we are about to delete. Move the check to the end to avoid
this.
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
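A minimal sketch of the iterator pitfall, using a plain std::map in place of the real PrioritizedQueue buckets (names are illustrative): if the entry being removed is the first one, resetting the cursor to begin() before the erase leaves it pointing at the element about to be deleted; adjusting the cursor as part of the erase avoids the dangling iterator.
    #include <iostream>
    #include <map>

    int main() {
      std::map<int, int> queue = {{1, 10}, {2, 20}, {3, 30}};
      auto cur = queue.begin();        // cursor used for round-robin dequeue

      auto i = queue.begin();          // entry selected for removal
      if (cur == i)
        cur = queue.erase(i);          // move the cursor past the erased entry
      else
        queue.erase(i);

      if (cur != queue.end())
        std::cout << "cursor now at key " << cur->first << "\n";
    }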
Samuel Just [Tue, 18 Mar 2014 19:35:03 +0000 (12:35 -0700)]
PG::start_peering_interval: always send_notify if !primary
Otherwise, we might get into a situation where the primary
forgets about a stray pg. This is simpler and does not
increase the number of notifies by much.
Fixes: #7733
Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Tue, 18 Mar 2014 19:09:05 +0000 (12:09 -0700)]
PG::find_best_info: fix log_tail component
The previous logic should have kept the current best info if it found
a replica that best could log-recover but p could not. However, the
continue in that loop advanced the inner loop instead of the outer
loop, allowing the primary case to take over in cases where best had a
longer tail. Instead, we will prefer the longer tail regardless of the
other infos to simplify the logic.
Fixes: #7755
Signed-off-by: Samuel Just <sam.just@inktank.com>
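A sketch of the simplified rule the commit settles on, with illustrative field names rather than the real pg_info_t: among the candidate infos, prefer the one with the longer log tail, regardless of the other infos.
    #include <cstdint>
    #include <vector>

    struct Info {
      uint64_t log_tail;     // smaller value == older tail == longer log
      uint64_t last_update;
    };

    const Info* find_best(const std::vector<Info>& infos) {
      const Info* best = nullptr;
      for (const Info& p : infos) {
        if (!best || p.log_tail < best->log_tail)
          best = &p;         // longer tail wins
      }
      return best;
    }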
Loic Dachary [Sun, 16 Mar 2014 16:00:25 +0000 (17:00 +0100)]
osd,mon: use profile instead of properties
The qa and functional tests are adapted to the new command prototype
requiring a profile instead of a list of properties. When possible,
the implicit ruleset creation is used to simplify the test setup.
Loic Dachary [Sun, 16 Mar 2014 20:11:56 +0000 (21:11 +0100)]
mon: set the profile and ruleset defaults early
The poolstr is removed from the prepare_pool_crush_ruleset prototype
because it is no longer used to decide the default ruleset when the
caller of osd pool create omits one.
If no profile:
    profile = default
If no ruleset and profile is default:
    ruleset = erasure-code
If no ruleset and profile is not default:
    ruleset = the name of the pool
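A small sketch of that default resolution (simplified names, not the OSDMonitor code): resolve the profile first, then derive the ruleset name from it.
    #include <string>

    std::string pick_ruleset(const std::string& pool_name,
                             std::string profile,
                             std::string ruleset) {
      if (profile.empty())
        profile = "default";
      if (ruleset.empty())
        ruleset = (profile == "default") ? "erasure-code" : pool_name;
      return ruleset;
    }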
Loic Dachary [Sun, 16 Mar 2014 15:24:58 +0000 (16:24 +0100)]
mon: add crush ruleset name to osd pool create
The ruleset to be used for the new erasure coded pool was expected in
the properties, under the name crush_ruleset. It does not belong to the
erasure code profile and needs to be added to the prototype explicitly.
The crush ruleset name is added to the prototype of the prepare_new_pool
and prepare_pool_crush_ruleset methods.
Loic Dachary [Sun, 16 Mar 2014 15:13:17 +0000 (16:13 +0100)]
osd: use erasure code profile when building the PGBackend
The PGBackend::build_pg_backend() prototype is modified to add an
OSDMapRef which is used to get the profile stored in the pg_pool_t
and pass it to the erasure code plugin.
Loic Dachary [Mon, 17 Mar 2014 21:47:03 +0000 (22:47 +0100)]
mon: OSDMonitor use erasure_code_profile instead of properties
The prepare_pool_properties method is replaced with the
parse_erasure_code_profile method, which looks up the same content from
OSDMap::erasure_code_profiles instead of the argument vector.
The type and name substitution is applied in OSDMonitor, and the
MonCommand prototypes for osd pool create and osd create-erasure are
updated.
Loic Dachary [Sun, 16 Mar 2014 12:27:51 +0000 (13:27 +0100)]
osd: obsolete pg_pool_t properties with erasure_code_profile
The generic properties map is only used for erasure code and should be
specialized. It is obsoleted by a string that can be looked up in
OSDMap::erasure_code_profiles to retrieve the key/value pairs that will
be interpreted by the erasure code plugin.
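Rough shape of the change, with simplified stand-in structs rather than the real pg_pool_t: the per-pool key/value map is replaced by the name of a profile stored once in the OSDMap, so the pool only carries a reference to it.
    #include <map>
    #include <string>

    struct pool_before {
      // generic, but in practice only the erasure code plugin consumed it
      std::map<std::string, std::string> properties;
    };

    struct pool_after {
      // key into OSDMap::erasure_code_profiles
      std::string erasure_code_profile;
    };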
Loic Dachary [Sun, 16 Mar 2014 12:00:50 +0000 (13:00 +0100)]
mon: add the erasure-code-profile {set,get,rm,ls} MonCommand
"erasure-code-profile set" parses the key=value pairs given in argument
and stores them in OSDMap::erasure_code_profiles. The
"erasure-code-profile get" supports plain text display if the Formatter
is not set (or invalid).
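An illustrative sketch of the key=value parsing (not the actual OSDMonitor code): each "k=v" argument becomes one entry of the stored profile, e.g. parse_profile({"k=4", "m=2", "plugin=jerasure"}).
    #include <map>
    #include <string>
    #include <vector>

    std::map<std::string, std::string>
    parse_profile(const std::vector<std::string>& args) {
      std::map<std::string, std::string> profile;
      for (const std::string& kv : args) {
        std::string::size_type eq = kv.find('=');
        if (eq != std::string::npos)
          profile[kv.substr(0, eq)] = kv.substr(eq + 1);
      }
      return profile;
    }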
Sage Weil [Mon, 17 Mar 2014 23:21:17 +0000 (16:21 -0700)]
mon/Paxos: commit only after entire quorum acks
If a subset of the quorum accepts the proposal and we commit, we will start
sharing the new state. However, the mon that didn't yet reply with the
accept may still be sharing the old and stale value.
The simplest way to prevent this is not to commit until the entire quorum
replies. In the general case, there are no failures and this is just fine.
In the failure case, we will call a new election and have a smaller quorum
of (live) nodes and will recommit the same value.
A more performant solution would be to have a separate message invalidate
the old state and commit once we have all invalidations and a majority of
accepts. This will lower latency a bit in the non-failure case, but not
change the failure case significantly. Later!
Fixes: #7736
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
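Illustrative condition only, not the Paxos class: with this change the leader commits once every member of the current quorum has accepted, instead of after a bare majority.
    #include <cstddef>

    bool ready_to_commit(std::size_t num_accepted, std::size_t quorum_size) {
      return num_accepted == quorum_size;   // previously: a majority sufficed
    }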