Greg Farnum [Fri, 6 Dec 2013 22:08:48 +0000 (14:08 -0800)]
Elector: transmit local api on election win, accept leader's on loss
If we're the leader, just point to our local set. Disseminating these
will let peons advertise the full command set supported by the leader.
INCOMPLETE: does not yet handle winning Electors who do not send a command set.
Loic Dachary [Thu, 5 Dec 2013 18:41:50 +0000 (19:41 +0100)]
crush: check for invalid names in loc[]
Add the is_valid_crush_loc helper to test for invalid crush names in
insert_item and update_item, before performing any side
effects. Implement the associated unit tests.
Merge pull request #900 from ceph/wip-mon-mds-trim
mon: MDSMonitor: trim versions and let PaxosService decide whether to propose
We were not trimming mdsmap versions and were generating a new map every time
we modified the pending value.
Now we make sure that MDSMonitor trims old maps (a configurable option sets
the maximum number of maps to keep, defaulting to 500, much like other
services do), and we also delegate to PaxosService the decision on whether to
propose our pending value.
We also modify 'ceph-kvstore-tool' so that the contents of a given prefix:key
can be written to a file instead of stdout, and add support for getting the
size of a given prefix:key's value.
'ceph report' was also modified so that we always output the first and last
committed versions for all services; up until this point, we would only output the
first committed version on all services, and only a few were also outputting the
last committed version.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
mon: MDSMonitor: implement 'get_trim_to()' to let the mon trim mdsmaps
This commit also adds two options to the MDSMonitor:
- mon_max_mdsmap_epochs: the maximum number of maps we'll keep (default: 500)
- mon_mds_force_trim: the version we want to trim to
As a result, 'get_trim_to()' returns one of the following values:
- if we have set mon_mds_force_trim, and this value is greater than the
  last committed version, trim to mon_mds_force_trim
- if we hold more than the max number of maps, trim to last - max
- if we have set mon_mds_force_trim and if we hold more than the max
  number of maps, and mon_mds_force_trim is lower than last - max,
  then trim to last - max
Backport: dumpling
Backport: emperor
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
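The trim decision above can be sketched roughly as follows. This is an illustrative Python model, not the MDSMonitor code: the function name and parameters only mirror the options named in the message, a force_trim of 0 is assumed to mean "not set", and the comparison of mon_mds_force_trim against the last committed version is left out of the sketch.

```python
def get_trim_to(first_committed, last_committed, max_epochs=500, force_trim=0):
    """Return the version to trim to, or 0 when nothing needs trimming.

    Hypothetical model: force_trim == 0 is assumed to mean "option not set".
    """
    # A forced trim target, when set, is the starting point.
    floor = force_trim if force_trim > 0 else 0

    # Number of maps currently held between first and last committed.
    held = last_committed - first_committed

    # Holding more than the maximum, and the forced target (if any) would
    # still leave too many maps behind: trim to last - max instead.
    if held > max_epochs and floor < last_committed - max_epochs:
        return last_committed - max_epochs

    return floor
```

With 999 maps held and a maximum of 500, a forced target of 200 is overridden by last - max, while a forced target of 800 is kept because it already trims more than the maximum requires.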
Loic Dachary [Thu, 5 Dec 2013 12:01:00 +0000 (13:01 +0100)]
crush: remove redundant test in insert_item
A year after the last modification of the test that checks whether an item
was added twice to the same bucket, the subtree_contains test was added a
few lines above it, making the older test redundant.
Loic Dachary [Thu, 5 Dec 2013 08:54:37 +0000 (09:54 +0100)]
crush: insert_item returns on error if bucket name is invalid
A bucket name may be created as a side effect of insert_item. All names
in the loc argument are checked for validity at the beginning of the
method and an error is returned immediately if an invalid one is found.
This makes it unnecessary to check for errors when setting the name of an
item later on.
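A minimal sketch of the validate-first pattern described above, in Python for brevity. The data model (a dict of bucket name to child names) and the allowed character set are assumptions made for illustration; only the names insert_item and is_valid_crush_loc come from the commit messages.

```python
import errno
import string

# Assumption: names may contain letters, digits, '-', '_' and '.'.
VALID_CHARS = set(string.ascii_letters + string.digits + "-_.")


def is_valid_crush_name(name):
    """A name is valid if it is non-empty and uses only allowed characters."""
    return bool(name) and all(c in VALID_CHARS for c in name)


def is_valid_crush_loc(loc):
    """Check every type and bucket name in a loc mapping (type -> name)."""
    return all(is_valid_crush_name(t) and is_valid_crush_name(n)
               for t, n in loc.items())


def insert_item(buckets, item_name, loc):
    """Hypothetical insert: 'buckets' maps bucket name -> set of children."""
    # Validate everything up front, before any bucket is created.
    if not is_valid_crush_name(item_name) or not is_valid_crush_loc(loc):
        return -errno.EINVAL

    # From here on no name can be rejected, so no error handling is needed
    # when buckets are created as a side effect.
    for bucket_name in loc.values():
        buckets.setdefault(bucket_name, set()).add(item_name)
    return 0
```

Because the check runs first, a call with one bad name in loc leaves the bucket map untouched instead of half-updated.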
Josh Durgin [Mon, 25 Nov 2013 21:43:43 +0000 (13:43 -0800)]
init, upstart: prevent daemons being started by both
Only one init system should start a given daemon. If there is a
host entry in ceph.conf for a daemon, sysvinit would try to start it
even if the daemon's directory did not include a sysvinit file. This
preserves backwards compatibility with older installs using sysvinit.
However, if an upstart file is present in the daemon's directory, upstart
would also try to start the daemon, regardless of host entries in ceph.conf.
So with an upstart file in a daemon's directory and a host entry
for that daemon in ceph.conf, both sysvinit and upstart would attempt
to manage it.
Fix this by only starting daemons if the marker file for the other
init system is not present. This maintains backwards compatibility
with older installs using neither sysvinit nor upstart marker files,
and does not break any valid configurations. The only configuration
that would break is one with both sysvinit and upstart files present
for the same daemon.
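The rule can be modelled as a small decision function. This is a sketch in Python, not the init scripts themselves: the marker file names 'sysvinit' and 'upstart' and the function signature are assumptions for illustration.

```python
def should_start(init_system, files, has_host_entry=False):
    """Decide whether 'init_system' should start a daemon.

    'files' is the collection of file names found in the daemon's directory
    (for example the result of os.listdir on it). Marker names are assumed.
    """
    sysvinit_marker = "sysvinit" in files
    upstart_marker = "upstart" in files

    if init_system == "sysvinit":
        # Older installs have no marker at all and rely on the ceph.conf
        # host entry, so that still counts, unless upstart owns the daemon.
        return (sysvinit_marker or has_host_entry) and not upstart_marker

    if init_system == "upstart":
        # upstart only acts on its own marker, and backs off if the
        # sysvinit marker is present too.
        return upstart_marker and not sysvinit_marker

    raise ValueError("unknown init system: %s" % init_system)
```

With both markers present neither init system starts the daemon, which matches the one configuration the message says would break.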
Josh Durgin [Mon, 2 Dec 2013 22:54:04 +0000 (14:54 -0800)]
osd: read into correct variable for magic string
4d140a71a1a48081b449b7d8dde808eb6e74c6b2 refactored this and
introduced a bug. peek_meta() was accidentally reading into magic,
then replacing magic with val, which was always the empty string,
resulting in the osd always failing to start due to 'mismatched'
magic values.
Sage Weil [Mon, 2 Dec 2013 16:31:23 +0000 (08:31 -0800)]
osd/OSDMap: add region, pdu, pod types while we are at it
One user noted that they have a 'pdu' type in their type hierarchy that
typically spans multiple racks. Others are known to use the 'pod'
terminology; add that too. And I can imagine 'region' above datacenter.
Factor this into a helper to make things a bit less fragile.
Sage Weil [Sun, 10 Nov 2013 06:03:42 +0000 (22:03 -0800)]
osd/OSDMap: add 'chassis' to default type hierarchy
A chassis is usually bigger than a host but smaller than a rack. This will
be useful for a broad class of modern hardware that sticks multiple hosts
in the same chassis (in sleds, or on cards, or blades, or whatever).
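For illustration, a type hierarchy with chassis, pdu, pod and region added could be built by a helper like the one below. Only the relative positions stated in these messages (chassis between host and rack, pdu above rack, region above datacenter) are taken from the log; the placement of row, pod and room, the numeric ids, and the helper names are assumptions.

```python
# Ordered from smallest to largest; the index is used as the type id.
# Assumption: exact ordering of 'row', 'pod' and 'room' is illustrative.
DEFAULT_TYPES = [
    "osd", "host", "chassis", "rack", "row",
    "pdu", "pod", "room", "datacenter", "region", "root",
]


def build_type_map(names=DEFAULT_TYPES):
    """Return a mapping of type id -> type name."""
    return {type_id: name for type_id, name in enumerate(names)}


def is_above(a, b, names=DEFAULT_TYPES):
    """True if type 'a' sits above type 'b' in the hierarchy."""
    return names.index(a) > names.index(b)
```

Keeping the list in one place means adding a type is a one-line change and the ids follow automatically.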
Sage Weil [Sun, 3 Nov 2013 04:21:39 +0000 (21:21 -0700)]
os/ObjectStore: add {read,write}_meta
Move these from the OSD. Use a generic implementation in ObjectStore that
hopefully all backends can share (so that it can remain in sync with the
start/stop scripts, ceph-disk, and other orchestration machinery).
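One way such a generic implementation could look, assuming each metadata key is stored as a small text file in the store's directory so that scripts can read it too. This is a Python sketch under that assumption, not the ObjectStore code; the function signatures are hypothetical.

```python
import os


def write_meta(store_path, key, value):
    """Store 'value' under 'key' as a one-line text file in store_path."""
    with open(os.path.join(store_path, key), "w") as f:
        f.write(value + "\n")


def read_meta(store_path, key):
    """Return the value stored under 'key', or None if it does not exist."""
    try:
        with open(os.path.join(store_path, key)) as f:
            # Strip the trailing newline added by write_meta.
            return f.read().rstrip()
    except FileNotFoundError:
        return None
```

Plain files keep the format trivially readable from shell scripts and other tooling, which is the kind of sharing the message is after.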
Sage Weil [Sun, 3 Nov 2013 02:19:09 +0000 (19:19 -0700)]
osd: construct ObjectStore outside of OSD
This lets ceph_osd.cc handle the config details and use it directly for
all of the random command-line stuff, eliminating a bunch of mostly-
useless static wrappers in OSD.
Samuel Just [Wed, 27 Nov 2013 03:17:59 +0000 (19:17 -0800)]
PG: don't query unfound on empty pgs
When the replica responds, it responds with a notify
rather than a log, which the primary then ignores since
it is already in the peer_info map. Rather than fix that
we'll simply not send queries to peers we already know to
have no unfound objects.
Fixes: #6910
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: David Zafman <david.zafman@inktank.com>
Samuel Just [Tue, 26 Nov 2013 21:20:21 +0000 (13:20 -0800)]
PG: retry GetLog() each time we get a notify in Incomplete
If for some reason there are no up OSDs in the history which
happen to have usable copies of the pg, it's possible that
there is a usable copy elsewhere on the cluster which will
become known to the primary if it waits.
Fixes: #6909
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Yehuda Sadeh [Wed, 27 Nov 2013 21:34:00 +0000 (13:34 -0800)]
rgw: don't error out on empty owner when setting acls
Fixes: #6892
Backport: dumpling, emperor
s3cmd specifies an empty owner field when trying to set acls on an object
or bucket. We errored out because it didn't match the current owner name;
with this change we ignore an empty owner.
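The new behaviour amounts to something like the following. This is a hypothetical Python sketch, not the rgw code: the function name, the choice of exception, and the decision to fall back to the current owner are assumptions.

```python
def resolve_acl_owner(requested_owner, current_owner):
    """Return the owner to use for an ACL update.

    An empty requested owner (as sent by s3cmd) is ignored and the current
    owner is kept; a non-empty owner must still match the current one.
    """
    if not requested_owner:
        return current_owner
    if requested_owner != current_owner:
        raise PermissionError("owner does not match current owner")
    return requested_owner
```

A mismatching, non-empty owner is still rejected, so only the empty case changes.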