Yan, Zheng [Wed, 6 Nov 2013 01:42:43 +0000 (09:42 +0800)]
mds: handle cache rejoin corner case
A recovering MDS may receive a strong cache rejoin from a survivor.
If the survivor then restarts, the recovering MDS receives a weak cache
rejoin from the same MDS. Before processing the weak cache rejoin,
we should scour replicas added by the obsolete strong cache rejoin.
Yan, Zheng [Wed, 6 Nov 2013 01:28:51 +0000 (09:28 +0800)]
mds: unify nonce type
MDSCacheObject::replica_nonce is defined as __s16, but nonce type
in MDSCacheObject::replica_map is int. This mismatch may confuse
MDCache::handle_cache_expire().
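A minimal sketch of why mixing widths can matter, assuming __s16 behaves like a signed 16-bit integer. The values are made up, Python is used only for brevity, and this does not reproduce any MDCache code:

```python
import ctypes


def store_as_s16(nonce: int) -> int:
    """Store a nonce in a signed 16-bit field and read it back."""
    # ctypes does not range-check; out-of-range values wrap, as in C.
    return ctypes.c_int16(nonce).value


small = 7       # fits in both types
large = 40000   # fits in an int, but not in a signed 16-bit field

print(store_as_s16(small) == small)   # the two views agree
print(store_as_s16(large) == large)   # the two views disagree
```

With a single nonce type on both sides, a comparison between the stored nonce and the one carried in the map cannot disagree for this reason.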
Yan, Zheng [Thu, 24 Oct 2013 09:10:59 +0000 (17:10 +0800)]
mds: rework stale import/export message detection
Current code uses import state to detect obsolete import/export messages.
It does not work in the following case: cancel a subtree export, export the
same subtree again, and the messages for the first export get dispatched.
This patch introduces a "transaction ID" for subtree exports. Each subtree
export has a unique TID, and the ID is recorded in all import/export related
messages. By comparing the TID, we can reliably detect stale messages.
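The idea can be sketched as follows. This is an illustration only: the class and field names (ExportMessage, Importer, current_tid) are hypothetical and are not the names used in the MDS code.

```python
from dataclasses import dataclass


@dataclass
class ExportMessage:
    subtree: str
    tid: int  # transaction ID of the export this message belongs to


class Importer:
    def __init__(self):
        # subtree -> TID of the export currently in progress (hypothetical)
        self.current_tid = {}

    def start(self, subtree, tid):
        self.current_tid[subtree] = tid

    def cancel(self, subtree):
        self.current_tid.pop(subtree, None)

    def is_stale(self, msg):
        # A message is stale unless its TID matches the export in progress.
        return self.current_tid.get(msg.subtree) != msg.tid


imp = Importer()
imp.start("/a", 1)
imp.cancel("/a")
imp.start("/a", 2)          # same subtree exported again
late = ExportMessage("/a", 1)   # message from the cancelled export
fresh = ExportMessage("/a", 2)
print(imp.is_stale(late), imp.is_stale(fresh))
```

A state-only check cannot tell the two messages apart, because both exports pass through the same states; the TID can.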
Yan, Zheng [Thu, 24 Oct 2013 08:05:56 +0000 (16:05 +0800)]
mds: put import/export related states together
Current code uses several STL maps to record import/export related
states. A map lookup is required for each state access, which is not
efficient. It's better to put import/export related states together.
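A rough sketch of the restructuring, in Python for brevity. The field names (state, peer, bystanders) are placeholders chosen for illustration, not the actual members used by the patch:

```python
from dataclasses import dataclass, field


@dataclass
class ExportState:
    # Hypothetical fields; the real structure may hold different members.
    state: int = 0
    peer: int = -1
    bystanders: set = field(default_factory=set)


# Before: parallel maps keyed by the same directory, one lookup per field.
export_state = {"dirA": 1}
export_peer = {"dirA": 3}

# After: a single map, so one lookup yields every related field.
exports = {"dirA": ExportState(state=1, peer=3)}
entry = exports["dirA"]
print(entry.state, entry.peer)
```

Keeping the fields in one record also removes the risk of the parallel maps drifting out of sync when an entry is erased from one map but not the others.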
Yan, Zheng [Wed, 23 Oct 2013 01:15:58 +0000 (09:15 +0800)]
mds: freeze tree deadlock detection.
There are two situations that result in a freeze tree deadlock.
Situation 1:
- mds.0 authpins an item in subtree A
- mds.0 sends request to mds.1 to authpin an item in subtree B
- mds.0 freezes subtree A
- mds.1 authpins an item in subtree B
- mds.1 sends request to mds.0 to authpin an item in subtree A
- mds.1 freezes subtree B
- mds.1 receives the remote authpin request from mds.0
(wait because subtree B is freezing)
- mds.0 receives the remote authpin request from mds.1
(wait because subtree A is freezing)
Situation 2:
- client request authpins items in subtree B
- freeze subtree B
- import subtree A which is parent of subtree B
(authpins parent inode of subtree B, see CDir::set_dir_auth())
- freeze subtree A
- client request tries authpinning items in subtree A
(wait because subtree A is freezing)
Enforcing an authpinning order can avoid the deadlock, but it's very
expensive. The deadlock is rare, so I think deadlock detection is
more suitable for the case.
This patch introduces freeze tree deadlock detection. We record the
time at which freezing a tree starts. If we fail to freeze the tree
within a given duration, we cancel the freeze.
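The detection described above can be sketched as a simple timeout check. The names (FreezingTree, check_freeze_timeout) and the timeout value are assumptions for illustration, not the identifiers or defaults used in the patch:

```python
class FreezingTree:
    def __init__(self, started_at):
        self.started_at = started_at  # time the freeze was requested
        self.frozen = False           # set once the freeze completes
        self.cancelled = False


def check_freeze_timeout(tree, now, timeout):
    """Cancel a freeze that has not completed within `timeout` seconds."""
    overdue = now - tree.started_at > timeout
    if not tree.frozen and not tree.cancelled and overdue:
        tree.cancelled = True  # give up; waiters blocked on the freeze can proceed
    return tree.cancelled


tree = FreezingTree(started_at=100.0)
print(check_freeze_timeout(tree, now=110.0, timeout=30.0))
print(check_freeze_timeout(tree, now=131.0, timeout=30.0))
```

This trades a bounded delay in the rare deadlock case for not having to enforce a global authpin order on every request.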
Steve Stock [Sat, 14 Dec 2013 21:44:06 +0000 (16:44 -0500)]
Add -n option to mount.ceph. Required by autofs when /etc/mtab is a link to /proc/mounts (e.g. Debian Wheezy), otherwise automounting a ceph file system fails. Also useful when /etc is read-only. feature 7006
Loic Dachary [Sun, 15 Dec 2013 13:31:27 +0000 (14:31 +0100)]
common: fix rare race condition in Throttle unit tests
The thread created to test Throttle race conditions updates a value
(throttle.get_current()) that is tested by the main gtest thread but is
not protected by a lock. Instead of adding a lock, the main thread tests
the value after pthread_join() on the child thread.
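The same pattern, sketched in Python rather than with gtest and pthreads: the main thread reads the shared value only after joining the child, so no lock is needed for the check. Counter and worker are hypothetical stand-ins for the Throttle object and the test thread.

```python
import threading


class Counter:
    def __init__(self):
        self.current = 0


def worker(counter, n):
    # Only this thread writes to counter.current.
    for _ in range(n):
        counter.current += 1


counter = Counter()
t = threading.Thread(target=worker, args=(counter, 1000))
t.start()
# Reading counter.current here would race with the worker.
t.join()
# After join() the worker has finished, so the value is stable.
print(counter.current)
```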
John Wilkins [Sat, 14 Dec 2013 00:08:37 +0000 (16:08 -0800)]
doc: Updates to federated config.
Reverted Emperor versionadded to Dumpling as it gets backported.
Added default index and bucket pools to pool creation.
Added default default_placement setting.
Added placement_pools key/value pair examples.
Added comments for re-running the procedure for the secondary region.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: modprobe with single_major=Y on newer kernels
On kernels that support it, and if 'rbd map' is given a chance to
modprobe, turn on the single-major device number allocation scheme. For
users who for some reason don't want it, the workaround is to insert
the rbd module manually before executing the first 'rbd map' command.
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: add support for single-major device number allocation scheme
With the preparatory commits ("rbd: match against wholedisk device
numbers on unmap" and "rbd: match against both major and minor on unmap
on kernels >= 3.14") in, this amounts to choosing to work with the new rbd
bus interfaces (/sys/bus/rbd/{add,remove}_single_major) if they are
available, instead of the old ones (/sys/bus/rbd/{add,remove}).
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: match against both major and minor on unmap on newer kernels
As described in commit "rbd: match against wholedisk device numbers on
unmap", currently we only match against major numbers. In preparation
for support for single-major device number allocation scheme, start
matching against minor numbers also, which newer kernels provide in
a /sys/bus/rbd/devices/<id>/minor sysfs attribute.
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: match against whole disks on unmap
Currently the way 'rbd unmap' translates a user-provided block device
into an rbd id is it matches the major number of the specified device
against /sys/bus/rbd/devices/<id>/major for each rbd mapping and
declares success on the first match. This works for both entire disks
and partitions, because under the current device number allocation
scheme, each mapping means a new major number.
In preparation for support for single-major device number allocation
scheme, which would require matching both major and minor numbers, make
sure to always match against entire disk device numbers, by converting
the specified device major:minor pair into a wholedisk major:minor pair.
To achieve that, use the libblkid library, which accomplishes this goal
by walking stable sysfs structures.
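A small sketch of the matching logic, assuming a single shared major number. The two tables are hypothetical stand-ins: WHOLE_DISK plays the role of the libblkid lookup, and RBD_DEVICES the role of the major and minor sysfs attributes of each mapping. All device numbers are made up.

```python
# Hypothetical stand-in for the whole-disk lookup: devno -> whole-disk devno.
WHOLE_DISK = {
    (250, 0): (250, 0),    # first mapping, whole disk
    (250, 1): (250, 0),    # a partition on the first mapping
    (250, 16): (250, 16),  # second mapping, whole disk
    (250, 17): (250, 16),  # a partition on the second mapping
}

# Hypothetical sysfs contents: rbd id -> (major, minor) of the whole disk.
RBD_DEVICES = {"0": (250, 0), "1": (250, 16)}


def rbd_id_for(devno):
    """Translate a user-supplied (major, minor) into an rbd id, or None."""
    whole = WHOLE_DISK.get(devno)
    if whole is None:
        return None
    for rbd_id, dev in RBD_DEVICES.items():
        if dev == whole:  # compare both major and minor
            return rbd_id
    return None


print(rbd_id_for((250, 17)))
```

Matching on the major number alone would return the first mapping for every device in this table, which is why the conversion to whole-disk numbers has to come first.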
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: switch to strict_strtol for major parsing
Use common/strict_strtol, which actually parses integers in a proper
way, instead of atoi for parsing /sys/bus/rbd/devices/<id>/major. This
is important because the kernel apparently can write things like
"(none)" into that file, and strict parsing is more bulletproof in general.
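To illustrate the difference, here is a Python sketch that mimics atoi-style lenient parsing next to a strict parser. Both helpers are hypothetical; strict_parse_int is not the signature of common/strict_strtol.

```python
import re


def lenient_atoi(text):
    """Mimic atoi: parse a leading integer, silently return 0 if none."""
    m = re.match(r"\s*[+-]?\d+", text)
    return int(m.group()) if m else 0


def strict_parse_int(text):
    """Return (value, error); error is non-empty for malformed input."""
    try:
        return int(text.strip(), 10), ""
    except ValueError:
        return 0, "not an integer: %r" % text


print(lenient_atoi("(none)"))        # indistinguishable from a real 0
print(strict_parse_int("(none)"))    # reports the problem instead
print(strict_parse_int("251\n"))
```

With the lenient version, a file containing "(none)" yields major number 0, which could then be compared against a real device number as if it were valid.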
Sage Weil [Wed, 4 Dec 2013 05:39:03 +0000 (21:39 -0800)]
mon/OSDMonitor: take 'osd pool set ...' value as a string again
We ran into problems before when we made this a string because a mixed
cluster of mons might forward a client request with the wrong schema.
To make this work, we make the new code understand both the new and
old schema, and also backport a change to emperor and dumpling to
handle the new schema.
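One way to picture "understand both schemas" is a handler that accepts the value either as an integer or as a string and normalizes it. This is a sketch under assumptions: the key name "val" and the idea that the other schema carried an integer are illustrative, and the real command parsing lives in C++.

```python
def pool_set_value(cmd):
    """Return the value argument as a string, whichever type the sender used."""
    val = cmd["val"]  # hypothetical key name
    if isinstance(val, int):
        return str(val)   # integer schema: convert
    if isinstance(val, str):
        return val        # string schema: pass through
    raise TypeError("unsupported value type: %s" % type(val).__name__)


print(pool_set_value({"val": 3}))
print(pool_set_value({"val": "3"}))
```

Because either form is accepted, a request forwarded between mons running different versions is still understood by the receiver.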
Greg Farnum [Mon, 9 Dec 2013 16:44:05 +0000 (08:44 -0800)]
Monitor: Elector: share the classic command set if we have a classic mon
The leader now checks to see if any monitors did not provide their
command set, and if so, shares the list of "classic" commands instead
of its own set. This will prevent users from seeing different commands
(depending on whether they connect to an old or new mon) while
performing upgrades, and will make it really obvious if they forgot
to upgrade one of the monitors!
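The leader's decision can be sketched like this. CLASSIC_COMMANDS holds placeholder entries and the function name is hypothetical; the real logic is in the Monitor and Elector classes.

```python
# Placeholder for the "classic" command list; the entries are illustrative.
CLASSIC_COMMANDS = frozenset({"health", "status"})


def commands_to_share(leader_commands, peer_commands):
    """Pick the command set the leader shares with the quorum.

    peer_commands maps a monitor name to its command set, or to None
    when that monitor did not provide one.
    """
    if any(cmds is None for cmds in peer_commands.values()):
        return CLASSIC_COMMANDS       # at least one classic mon in quorum
    return frozenset(leader_commands)


leader = {"health", "status", "new-command"}
print(sorted(commands_to_share(leader, {"b": leader, "c": None})))
print(sorted(commands_to_share(leader, {"b": leader, "c": leader})))
```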
Greg Farnum [Mon, 9 Dec 2013 16:41:54 +0000 (08:41 -0800)]
Elector: share local command set when deferring
We're about to use this at a basic level, to identify when we have
"classic" monitors in-quorum, but could also do something more
sophisticated like a set intersection on the commands.
Greg Farnum [Mon, 9 Dec 2013 06:17:39 +0000 (22:17 -0800)]
Monitor: import MonCommands.h from original Dumpling and expose it
If the Elector doesn't receive a set of commands from the elected leader, it
assumes the monitor is "classic" and uses the Dumpling command set as
the leader set.
Greg Farnum [Fri, 6 Dec 2013 22:08:48 +0000 (14:08 -0800)]
Elector: transmit local api on election win, accept leader's on loss
If we're the leader, just point to our local set. Disseminating these
will let peons advertise the full command set supported by the leader.
INCOMPLETE: does not yet handle winning Electors who do not send a command set.
Loic Dachary [Sun, 8 Dec 2013 13:38:59 +0000 (14:38 +0100)]
crush: fix map->choose_tries boundary test
CrushWrapper::start_choose_profile allocates map->choose_tries with
choose_total_tries elements. When crush_choose_firstn sets a value, it
tests against map->choose_local_tries which could lead to memory
corruption if map->choose_total_tries is smaller than
map->choose_local_tries.
Another undesirable but non-fatal side effect is that the output of crushtool
--show-choose-tries will be truncated to choose_local_tries, which is
set to a lower value than choose_total_tries by the default tunables.
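The principle behind the fix is that the bound being tested must be the size that was actually allocated. A minimal sketch, with record_try as a hypothetical helper and a Python list standing in for map->choose_tries:

```python
def record_try(choose_tries, ftotal):
    """Increment the histogram slot only if it lies inside the allocation."""
    if 0 <= ftotal < len(choose_tries):
        choose_tries[ftotal] += 1
        return True
    return False  # out of range: skip instead of writing past the end


choose_total_tries = 5               # made-up value for illustration
hist = [0] * choose_total_tries      # allocated with choose_total_tries slots
print(record_try(hist, 4), record_try(hist, 5), hist)
```

Testing against any other limit either permits writes past the end of the array or leaves part of the array unused, depending on which value is larger.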
Noah Watkins [Sat, 7 Dec 2013 17:59:13 +0000 (09:59 -0800)]
librbd: rename howmany to avoid conflict
A howmany macro exists on some platforms in standard headers, but there
really isn't any sort of standard that I've found. We just avoid the
conflict entirely this way.