Loic Dachary [Sun, 8 Dec 2013 13:38:59 +0000 (14:38 +0100)]
crush: fix map->choose_tries boundary test
CrushWrapper::start_choose_profile allocates map->choose_tries with
choose_total_tries elements. When crush_choose_firstn sets a value, it
tests against map->choose_local_tries which could lead to memory
corruption if map->choose_total_tries is smaller than
map->choose_local_tries.
Another indesirable but non fatal side effect is that the output crushtool
--show-choose-tries will be truncated to choose_local_tries which is
set to a lower value than choose_total_tries by the default tuneables.
Noah Watkins [Sat, 7 Dec 2013 17:59:13 +0000 (09:59 -0800)]
librbd: rename howmany to avoid conflict
A howmany macro exists on some platforms in standard headers, but there
really isn't any sort of standard that I've found. We just avoid the
conflict entirely this way.
Checking for fdatasync uses the same approach as the qemu configure
script. The relevant commit is d1722a27f552a22561104210e0afad4577878e53.
Here is a copy of the commit message which explains the check:
Under Darwin, a symbol exists for the fdatasync() function, so that our
link test succeeds. However _POSIX_SYNCHRONIZED_IO is set to '-1'.
According to POSIX:2008, a value of -1 means the feature is not
supported.
A value of 0 means supported at compilation time, and a value greater 0
means supported at both compilation and run time.
Enable fdatasync() only if _POSIX_SYNCHRONIZED_IO is '>0'.
Loic Dachary [Fri, 6 Dec 2013 23:31:54 +0000 (00:31 +0100)]
crush: detach_bucket must test item >= 0 not > 0
Since detach_bucket is a private helper solely used by move_bucket which
contains another ( correct ) safeguard, the code cannot be reached and
the problem can never happen. If another function uses detach_bucket,
it may happen.
// un-set the device name so we can use add_item later
build_rmap(name_map, name_rmap);
name_map.erase(id);
name_rmap.erase(id_name);
when insert_item refused to move a bucket for which a name already
exists. It was changed in 2013 by 4e2557a038dc1e8c68993ad8571d74e2eb8ea90a and now supports it. The
TestCrushWrapper unittest for move_bucket pass.
Sage Weil [Tue, 3 Dec 2013 16:33:55 +0000 (08:33 -0800)]
crush/mapper: apply chooseleaf_tries to firstn mode too
Parameterize the attempts for the _firstn choose method, and apply the
rule-specified tries count to firstn mode as well. Note that we have
slightly different behavior here than with indep:
If the firstn value is not specified for firstn, we pass through the
normal attempt count. This maintains compatibility with legacy behavior.
Note that this is usually *not* actually N^2 work, though, because of the
descend_once tunable. However, descend_once is unfortunately *not* the
same thing as 1 chooseleaf try because it is only checked on a reject but
not on a collision. Sigh.
In contrast, for indep, if tries is not specified we default to 1
recursive attempt, because that is simply more sane, and we have the
option to do so. The descend_once tunable has no effect for indep.
Loic Dachary [Thu, 5 Dec 2013 18:41:50 +0000 (19:41 +0100)]
crush: check for invalid names in loc[]
Add the is_valid_crush_loc helper to test for invalid crush names in
insert_item and update_item, before performing any side
effect. Implement the associated unit tests.
This is (as near to) a trivial ObjectStore backend for the OSD as we can
get at the moment. Everything is stored in memory. We are slightly
tricky with the locking, but not overly so.
On umount we dump everything out to disk, and on mount we load it all in
again, so we have some very coarse persistence/durability... just enough
to make this usable in a non-failure environment.
Merge pull request #900 from ceph/wip-mon-mds-trim
mon: MDSMonitor: trim versions and let PaxosService decide whether to propose
We were not trimming mdsmap versions and were generating a new map every time
we modified the pending value.
Now we not only make sure that MDSMonitor will trim old maps (configurable
option allowing us to set the maximum number of maps to keep, defaulting to 500,
much like other services do) but we also delegate to PaxosService the decision on
whether to propose our pending value.
We also perform several modifications to 'ceph-kvstore-tool', allowing one to obtain
the contents of a given prefix:key and have them outputted to a file instead of stdout,
and also add support for getting the size of a given prefix:key's value.
'ceph report' was also modified so that we always output the first and last
committed versions for all services; up until this point, we would only output the
first committed version on all services, and only a few were also outputting the
last committed version.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
mon: MDSMonitor: implement 'get_trim_to()' to let the mon trim mdsmaps
This commit also adds two options to the MDSMonitor:
- mon_max_mdsmap_epochs: the maximum amount of maps we'll keep (def: 500)
- mon_mds_force_trim: the version we want to trim to
This results in 'get_trim_to()' returning the possible values:
- if we have set mon_mds_force_trim, and this value is greater than the
last committed version, trim to mon_mds_force_trim
- if we hold more than the max number of maps, trim to last - max
- if we have set mon_mds_force_trim and if we hold more than the max
number of maps, and mon_mds_force_trim is lower than last - max,
then trim to last - max
Backport: dumpling
Backport: emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Loic Dachary [Thu, 5 Dec 2013 12:01:00 +0000 (13:01 +0100)]
crush: remove redundant test in insert_item
A year after the last modification of test to check if an item was added
twice to the same bucket, the subtree_contains test was added a few
lines above it, making it redundant.
Loic Dachary [Thu, 5 Dec 2013 08:54:37 +0000 (09:54 +0100)]
crush: insert_item returns on error if bucket name is invalid
A bucket name may be created as a side effect of insert_item. All names
in the loc argument are checked for validity at the beginning of the
method and an error is returned immediately if one is found. This allows
to not check for errors when setting the name of an item later on.
Josh Durgin [Mon, 25 Nov 2013 21:43:43 +0000 (13:43 -0800)]
init, upstart: prevent daemons being started by both
There can be only one init system starting a daemon. If there is a
host entry in ceph.conf for a daemon, sysvinit would try to start it
even if the daemon's directory did not include a sysvinit file. This
preserves backwards compatibility with older installs using sysvinit,
but if an upstart file is present in the daemon's directory, upstart
will try to start them, regardless of host entries in ceph.conf.
If there's an upstart file in a daemon's directory and a host entry
for that daemon in ceph.conf, both sysvinit and upstart would attempt
to manage it.
Fix this by only starting daemons if the marker file for the other
init system is not present. This maintains backwards compatibility
with older installs using neither sysvinit or upstart marker files,
and does not break any valid configurations. The only configuration
that would break is one with both sysvinit and upstart files present
for the same daemon.
Sage Weil [Tue, 3 Dec 2013 16:16:41 +0000 (08:16 -0800)]
crush/mapper: new SET_CHOOSE_LEAF_TRIES command
Explicitly control the number of sample attempts, and allow the number of
tries in the recursive call to be explicitly controlled via the rule. This
is important because the amount of time we want to spend looking for a
solution may be rule dependent (e.g., higher for the wide indep pool than
the rep pools).
(We should do the same for the other tunables, by the way!)
Sage Weil [Tue, 3 Dec 2013 01:17:13 +0000 (17:17 -0800)]
crush/mapper: pass parent r value for indep call
Pass down the parent's 'r' value so that we will sample different values in
the recursive call when the parent tries multiple times. This avoids doing
useless work (calling multiple times and trying the same values).
Sage Weil [Tue, 3 Dec 2013 01:15:56 +0000 (17:15 -0800)]
crush/mapper: clarify numrep vs endpos
Pass numrep (the width of the result) separately from the number of results
we want *this* iteration. This makes things less awkward when we do a
recursive call (for chooseleaf) and want only one item.
Sage Weil [Sat, 2 Nov 2013 18:54:09 +0000 (11:54 -0700)]
crush/mapper: strip firstn conditionals out of crush_choose, rename
Now that indep is handled by crush_choose_indep, rename crush_choose to
crush_choose_firstn and remove all the conditionals. This ends up
stripping out *lots* of code.
Note that it *also* makes it obvious that the shenanigans we were playing
with r' for uniform buckets were broken for firstn mode. This appears to
have happened waaaay back in commit dae8bec9 (or earlier)... 2007.