Loic Dachary [Sun, 15 Dec 2013 13:31:27 +0000 (14:31 +0100)]
common: fix rare race condition in Throttle unit tests
The thread created to test Throttle race conditions updates a value (
throttle.get_current() ) that is tested by the main gtest thread but is
not protected by a lock. Instead of adding a lock, the main thread tests
the value after pthread_join() on the child thread.
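The join-before-read ordering can be sketched as follows. This is a minimal illustration, not the actual unit test: FakeThrottle and run_child_then_read are hypothetical names, and std::thread::join() stands in for pthread_join().

```cpp
#include <thread>

// Hypothetical stand-in for the object under test; not Ceph's Throttle class.
struct FakeThrottle {
    long current = 0;  // deliberately unsynchronized, like the value the test reads
    void get(long n) { current += n; }
    long get_current() const { return current; }
};

// Start a child thread that updates the throttle, then read the value only
// after join(). Joining orders the child's writes before the parent's read,
// so no lock is needed for this particular check.
long run_child_then_read(long amount) {
    FakeThrottle throttle;
    std::thread child([&throttle, amount] { throttle.get(amount); });
    child.join();
    return throttle.get_current();
}
```

Reading get_current() before the join would be a data race; reading it afterwards is not.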
John Wilkins [Sat, 14 Dec 2013 00:08:37 +0000 (16:08 -0800)]
doc: Updates to federated config.
Reverted Emperor versionadded to Dumpling as it gets backported.
Added default index and bucket pools to pool creation
Added default default_placement setting
Added placement_pools key val pair examples.
Added comments for re-running the procedure for the secondary region.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: modprobe with single_major=Y on newer kernels
On kernels that support it, and if 'rbd map' is given a chance to
modprobe, turn on single-major device number allocation scheme. For
users who for some reason don't want it, the workaround is to insert
the rbd module manually before executing the first 'rbd map' command.
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: add support for single-major device number allocation scheme
With the preparatory commits ("rbd: match against wholedisk device
numbers on unmap" and "rbd: match against both major and minor on unmap
on kernels >= 3.14") in, this amounts to choosing to work with new rbd
bus interfaces (/sys/bus/rbd/{add,remove}_single_major) if they are
available, instead of the old ones (/sys/bus/rbd/{add,remove}).
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: match against both major and minor on unmap on newer kernels
As described in commit "rbd: match against wholedisk device numbers on
unmap", currently we only match against major numbers. In preparation
for support for single-major device number allocation scheme, start
matching against minor numbers also, which newer kernels provide in
a /sys/bus/rbd/devices/<id>/minor sysfs attribute.
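A rough sketch of matching on both numbers, assuming the major and minor values have already been read from the per-mapping sysfs attributes. RbdMapping and find_rbd_id are hypothetical names used only for illustration, not the rbd tool's own code.

```cpp
#include <vector>

// Hypothetical record for one rbd mapping, as it might be read from
// /sys/bus/rbd/devices/<id>/{major,minor}.
struct RbdMapping {
    int id;
    int major;
    int minor;
};

// Return the rbd id whose major:minor pair matches the given whole-disk
// device numbers, or -1 if no mapping matches.
int find_rbd_id(const std::vector<RbdMapping>& mappings, int major, int minor) {
    for (const RbdMapping& m : mappings) {
        if (m.major == major && m.minor == minor)
            return m.id;
    }
    return -1;
}
```

When several mappings share one major number, matching on major alone would stop at the first mapping; comparing the minor as well tells them apart.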
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: match against whole disks on unmap
Currently the way 'rbd unmap' translates a user-provided block device
into an rbd id is it matches the major number of the specified device
against /sys/bus/rbd/devices/<id>/major for each rbd mapping and
declares success on the first match. This works for both entire disks
and partitions, because under the current device number allocation
scheme, each mapping means a new major number.
In preparation for support for single-major device number allocation
scheme, which would require matching both major and minor numbers, make
sure to always match against entire disk device numbers, by converting
the specified device major:minor pair into wholedisk major:minor pair.
To achieve that, use the libblkid library, which accomplishes this goal
by walking stable sysfs structures.
Ilya Dryomov [Fri, 13 Dec 2013 15:40:52 +0000 (17:40 +0200)]
rbd: switch to strict_strtol for major parsing
Use common/strict_strtol, which parses integers properly, instead of
atoi for parsing /sys/bus/rbd/devices/<id>/major. This is important
because the kernel apparently can write things like "(none)" into that
file, and strict parsing is more robust in general.
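The difference from atoi can be sketched as below. parse_int_strict is a hypothetical helper written for illustration; it is not the signature of Ceph's strict_strtol, and it assumes the caller has already stripped the trailing newline that sysfs files normally carry.

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical strict parser: succeeds only if the whole string is a
// base-10 integer that fits in an int. On success the value is stored in *out.
bool parse_int_strict(const std::string& text, int* out) {
    if (text.empty())
        return false;
    errno = 0;
    char* end = nullptr;
    long value = std::strtol(text.c_str(), &end, 10);
    if (end == text.c_str())
        return false;  // no digits were consumed, e.g. "(none)"
    if (*end != '\0')
        return false;  // trailing characters after the number
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX)
        return false;  // out of range for int
    *out = static_cast<int>(value);
    return true;
}
```

atoi("(none)") returns 0 with no error indication, which is indistinguishable from a real major number of 0; the strict variant reports the failure instead.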
Sage Weil [Wed, 4 Dec 2013 05:39:03 +0000 (21:39 -0800)]
mon/OSDMonitor: take 'osd pool set ...' value as a string again
We ran into problems before when we made this a string because a mixed
cluster of mons might forward a client request with the wrong schema.
To make this work, we make the new code understand both the new and
old schema, and also backport a change to emperor and dumpling to
handle the new schema.
Greg Farnum [Mon, 9 Dec 2013 16:44:05 +0000 (08:44 -0800)]
Monitor: Elector: share the classic command set if we have a classic mon
The leader now checks to see if any monitors did not provide their
command set, and if so, shares the list of "classic" commands instead
of his own set. This will prevent users from seeing different commands
(depending on whether they connect to an old or new mon) while
performing upgrades, and will make it really obvious if they forgot
to upgrade one of the monitors!
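The leader's decision can be sketched roughly as follows. This is an assumption-laden illustration, not the Elector code: CommandSet and choose_shared_commands are hypothetical names, and a monitor that did not provide its command set is modelled here as an empty set.

```cpp
#include <string>
#include <vector>

// Hypothetical representation of a monitor's command set.
typedef std::vector<std::string> CommandSet;

// If any monitor in the quorum did not provide a command set (modelled as an
// empty set), share the classic commands; otherwise share the leader's own.
CommandSet choose_shared_commands(const std::vector<CommandSet>& peer_sets,
                                  const CommandSet& leader_set,
                                  const CommandSet& classic_set) {
    for (const CommandSet& peer : peer_sets) {
        if (peer.empty())
            return classic_set;
    }
    return leader_set;
}
```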
Greg Farnum [Mon, 9 Dec 2013 16:41:54 +0000 (08:41 -0800)]
Elector: share local command set when deferring
We're about to use this at a basic level, to identify when we have
"classic" monitors in-quorum, but could also do something more
sophisticated like a set intersection on the commands.
Greg Farnum [Mon, 9 Dec 2013 06:17:39 +0000 (22:17 -0800)]
Monitor: import MonCommands.h from original Dumpling and expose it
If the Elector doesn't receive a set of commands from the elected leader, it
assumes the monitor is "classic" and uses the Dumpling command set as
the leader set.
Greg Farnum [Fri, 6 Dec 2013 22:08:48 +0000 (14:08 -0800)]
Elector: transmit local api on election win, accept leader's on loss
If we're the leader, just point to our local set. Disseminating these
will let peons advertise the full command set supported by the leader.
INCOMPLETE: does not yet handle winning Electors who do not send a command set.
Loic Dachary [Sun, 8 Dec 2013 13:38:59 +0000 (14:38 +0100)]
crush: fix map->choose_tries boundary test
CrushWrapper::start_choose_profile allocates map->choose_tries with
choose_total_tries elements. When crush_choose_firstn sets a value, it
tests against map->choose_local_tries which could lead to memory
corruption if map->choose_total_tries is smaller than
map->choose_local_tries.
Another undesirable but non-fatal side effect is that the output of
crushtool --show-choose-tries will be truncated to choose_local_tries,
which is set to a lower value than choose_total_tries by the default
tunables.
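The boundary problem can be illustrated with a small sketch in which the index is checked against the size the array was actually allocated with. ChooseProfile and record are hypothetical names; the real code works on the C crush_map structure.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical histogram of how many tries each placement needed.
struct ChooseProfile {
    std::vector<unsigned> tries;

    // Allocated with total_tries elements, mirroring choose_total_tries.
    explicit ChooseProfile(std::size_t total_tries) : tries(total_tries, 0) {}

    // Record one sample. The bound is the allocated size, not some other
    // limit, so an out-of-range index is rejected rather than written.
    bool record(std::size_t ftotal) {
        if (ftotal >= tries.size())
            return false;
        ++tries[ftotal];
        return true;
    }
};
```

Testing against any limit other than the allocation size either writes past the end of the array when that limit is larger, or silently drops samples when it is smaller.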
Noah Watkins [Sat, 7 Dec 2013 17:59:13 +0000 (09:59 -0800)]
librbd: rename howmany to avoid conflict
A howmany macro exists on some platforms in standard headers, but there
really isn't any sort of standard that I've found. We just avoid the
conflict entirely this way.
Checking for fdatasync uses the same approach as the qemu configure
script. The relevant commit is d1722a27f552a22561104210e0afad4577878e53.
Here is a copy of the commit message which explains the check:
Under Darwin, a symbol exists for the fdatasync() function, so that our
link test succeeds. However _POSIX_SYNCHRONIZED_IO is set to '-1'.
According to POSIX:2008, a value of -1 means the feature is not
supported.
A value of 0 means supported at compilation time, and a value greater 0
means supported at both compilation and run time.
Enable fdatasync() only if _POSIX_SYNCHRONIZED_IO is '>0'.
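At the source level, the same rule can be sketched as a preprocessor check. This is only an illustration of the condition described above, not the configure test itself; sync_data is a hypothetical wrapper that falls back to fsync().

```cpp
#include <unistd.h>

// Hypothetical wrapper: use fdatasync() only when the platform advertises
// _POSIX_SYNCHRONIZED_IO with a value greater than 0, otherwise use fsync().
int sync_data(int fd) {
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
    return fdatasync(fd);
#else
    return fsync(fd);
#endif
}
```

Either branch returns -1 and sets errno when given an invalid descriptor, so callers can treat the two paths uniformly.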
Josh Durgin [Sat, 7 Dec 2013 00:03:20 +0000 (16:03 -0800)]
objecter: don't take extra throttle budget for resent ops
These ops have already taken their budget in the original op_submit().
It will be returned via put_op_budget() when they complete.
If there were many localized reads of missing objects from replicas,
or cache pool redirects, this would cause the objecter to use up all
of its op throttle budget and hang.
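One way to express the invariant (budget is taken once per op and returned once) is sketched below. Op and OpBudget are hypothetical types, and the per-op flag is an illustrative mechanism chosen for this sketch rather than a description of how the objecter implements the fix.

```cpp
// Hypothetical op with a fixed cost and a flag recording whether it
// already holds budget.
struct Op {
    int cost = 1;
    bool budgeted = false;
};

// Hypothetical budget tracker. A real throttle would block when the budget
// is exhausted; this sketch just reports failure.
class OpBudget {
public:
    explicit OpBudget(int limit) : available_(limit) {}

    // Take budget for an op. A resent op that already holds budget is a no-op.
    bool take(Op& op) {
        if (op.budgeted)
            return true;
        if (op.cost > available_)
            return false;
        available_ -= op.cost;
        op.budgeted = true;
        return true;
    }

    // Return the op's budget on completion, at most once.
    void put(Op& op) {
        if (!op.budgeted)
            return;
        available_ += op.cost;
        op.budgeted = false;
    }

    int available() const { return available_; }

private:
    int available_;
};
```

If every resend took budget again while completion returned it only once, each redirect or retry would leak one unit until nothing was left, which matches the hang described above.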
Loic Dachary [Fri, 6 Dec 2013 23:31:54 +0000 (00:31 +0100)]
crush: detach_bucket must test item >= 0 not > 0
Since detach_bucket is a private helper solely used by move_bucket which
contains another ( correct ) safeguard, the code cannot be reached and
the problem can never happen. If another function uses detach_bucket,
it may happen.
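The off-by-one matters because CRUSH device ids are non-negative while bucket ids are negative, so id 0 is a valid device. A tiny sketch of the two tests, using hypothetical function names:

```cpp
// Correct test: device ids are >= 0, bucket ids are < 0.
bool is_device_item(int item) {
    return item >= 0;
}

// The flawed variant: treats device 0 as if it were not a device.
bool is_device_item_buggy(int item) {
    return item > 0;
}
```

The two functions differ only for item 0, which is exactly the case a guard written with > 0 would let through.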
// un-set the device name so we can use add_item later
build_rmap(name_map, name_rmap);
name_map.erase(id);
name_rmap.erase(id_name);
when insert_item refused to move a bucket for which a name already
exists. It was changed in 2013 by
4e2557a038dc1e8c68993ad8571d74e2eb8ea90a and now supports it. The
TestCrushWrapper unittest for move_bucket passes.