Dan Mick [Wed, 10 Jul 2013 23:41:24 +0000 (16:41 -0700)]
mon,auth: AuthMonitor, KeyRing: add Formatter-dumps of auth info
Signed-off-by: Dan Mick <dan.mick@inktank.com>
auth: KeyRing: encode_formatted() receives a label as first argument
Also, this allows us to standardize formatted output on the AuthMonitor,
so that all output starts with a section 'auth'. Other subsystems using
the KeyRing class, can specify whatever section they prefer.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Dan Mick [Wed, 10 Jul 2013 23:24:45 +0000 (16:24 -0700)]
ceph CLI: valid() no longer returns bool, it only raises on error
The type validation's valid() method was using a combination of
return code and exception to really indicate the same thing;
simplify by only raising on validation error, and change callers
to cope. validate_one() follows suit.
Also, allow validate() to be called with args that are dicts
(for REST support) rather than bare words. Rules: in 'name':'value',
the name must match the descriptor's name and the value must validate
(through valid()). If value is '', it's assumed to be the same as name
(one can pass, for example, "detail" as one parameter to
REST, but it will still show up as {'detail':''} here).
Tweak validate()'s algorithm a bit in the process, and make
validate_command() exit the bestcmds loop immediately on first
full validation.
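A minimal sketch of the two ideas above: a valid() that signals failure only by raising, and the rule that an empty dict value stands for its name. The names ArgumentError, IntType and normalize_arg are hypothetical and chosen for illustration; they are not taken from the ceph CLI source.

```python
class ArgumentError(Exception):
    """Raised when a value fails validation (hypothetical name)."""


class IntType:
    """Toy descriptor type: valid() returns nothing and raises on bad input."""

    def valid(self, s):
        try:
            self.val = int(s)
        except ValueError:
            raise ArgumentError("{0!r} is not an integer".format(s))


def normalize_arg(arg):
    """Turn a bare word or a one-entry dict into a (name, value) pair.

    An empty value is treated as equal to the name, so {'detail': ''}
    behaves like the bare word 'detail'. Bare words carry no name.
    """
    if isinstance(arg, dict):
        (name, value), = arg.items()
        return name, (value if value != '' else name)
    return None, arg
```

Callers then wrap valid() in try/except instead of testing a return code.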
Dan Mick [Wed, 10 Jul 2013 23:12:56 +0000 (16:12 -0700)]
MonCommands: add new fields: modulename, perms, availability
To help optimize the REST API, we need to know whether the commands
are read (GET) or write (PUT/POST). However, we also could use that
same info for permission/caps checking. Add modulename/perms as
the required caps for each command to drive both needs.
The availability field is to control whether a command is displayed/
advertised through the CLI or REST interfaces; some commands aren't
really useful for REST, and we may want to invent REST-only commands;
also, this gives us a way to deprecate commands quickly and leave the
code, should that be desirable. Make the CLI display only commands
marked with the 'CLI' marker.
Also stop renaming 'help' to 'helptext' in the client.
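As a rough illustration of how the new fields could drive both needs, the sketch below derives an HTTP method from perms and filters commands by availability. The table entries, the dict keys and the lowercase 'cli'/'rest' markers are assumptions made for this example, not the actual MonCommands definitions.

```python
# Hypothetical command table; field names and values are illustrative only.
COMMANDS = [
    {"sig": "auth list", "module": "auth", "perm": "r", "avail": "cli,rest"},
    {"sig": "osd out", "module": "osd", "perm": "rw", "avail": "cli,rest"},
    {"sig": "old thing", "module": "mon", "perm": "r", "avail": ""},
]


def http_method(cmd):
    """Read-only commands map to GET, anything that writes maps to PUT."""
    return "PUT" if "w" in cmd["perm"] else "GET"


def visible(cmds, interface):
    """Return the signatures advertised on the given interface."""
    return [c["sig"] for c in cmds if interface in c["avail"].split(",")]
```

A command with an empty availability stays in the table (and in the code) but is advertised nowhere, which is one way to deprecate it quickly.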
Get device-by-path by looking for it instead of assuming 3rd entry.
On some systems (virtual machines so far) the device-by-path entry
from udevadm is not always in the same spot, so actually look for
the right output rather than blindly assuming that it's a
specific field in the output.
Signed-off-by: Sandon Van Ness <sandon@inktank.com>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
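The by-path lookup described above might look roughly like this. It assumes the udevadm output has already been captured as a whitespace-separated list of symlink names; the function name and the sample string in the usage note are invented for the example.

```python
def find_by_path(symlinks_output):
    """Return the first disk/by-path entry, wherever it appears, or None."""
    for entry in symlinks_output.split():
        if entry.startswith("disk/by-path/"):
            return entry
    return None
```

With a made-up input such as "disk/by-id/x disk/by-uuid/y disk/by-path/z", this returns "disk/by-path/z" regardless of the entry's position.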
Sage Weil [Wed, 10 Jul 2013 18:02:08 +0000 (11:02 -0700)]
osd: limit number of inc osdmaps sent to peers, clients
We should not send an unbounded number of inc maps to our peers or clients.
In particular, if a peer is not contacted for a while, we may think they
have a very old map (say, 10000 epochs ago) and send thousands of inc maps
when the distribution shifts and we need to peer.
Note that if we do not send enough maps, the peers will make do by
requesting the map from somewhere else (currently the mon). Regardless
of the source, however, we must limit the amount that we speculatively
share as it usually is not needed.
Backport: cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
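A sketch of the capping idea, assuming the oldest missing incrementals are sent first and that the limit is a plain count. The function and parameter names are hypothetical and do not come from the OSD code.

```python
def inc_map_range(peer_epoch, our_epoch, max_maps):
    """Return (first, last) incremental epochs to share, or None.

    At most max_maps epochs are shared; a peer that is further behind
    is expected to fetch the remainder from elsewhere.
    """
    first = peer_epoch + 1
    last = min(our_epoch, peer_epoch + max_maps)
    if first > last:
        return None
    return first, last
```

For a peer believed to be 10000 epochs behind and a limit of 100, only epochs 101 through 200 would be shared in this sketch.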
Sage Weil [Wed, 10 Jul 2013 17:17:45 +0000 (10:17 -0700)]
mon: do not populate MMonCommand paxos version field
The field is not used or useful since the monitor does not even look
at it (in Monitor::handle_command()). Avoid populating it and the
subsequent confusion for poor developers.
Sage Weil [Wed, 10 Jul 2013 17:06:20 +0000 (10:06 -0700)]
messages/MPGStats: do not set paxos version to osdmap epoch
The PaxosServiceMessage version field is meant for client-coordinated
ordering of messages when switching between monitors (and is rarely
used). Do not fill it with the osdmap epoch lest it be compared to a
pgmap version, which may cause the mon to (near) indefinitely put it on
a wait queue until the pgmap version catches up.
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
David Zafman [Tue, 9 Jul 2013 01:58:12 +0000 (18:58 -0700)]
osd: Clean-up redundant use of object_locator_t
Remove locator arg from get_object_context()/find_object_context()
Remove locator from object_info_t but retain encode format
Remove locator from object_info_t dump output
Remove OLOC_BLANK
Signed-off-by: David Zafman <david.zafman@inktank.com>
David Zafman [Tue, 11 Jun 2013 01:18:59 +0000 (18:18 -0700)]
librados, os, osd, osdc, test: Add support for client specified namespaces
Add rados_ioctx_namespace_set_key() and librados::IoCtx::namespace_set_key()
Add namespace to admin-daemon operations
Support namespace in osd map command
Add namespace to object_locator_t and hobject_t
Add random namespaces to psim program
Feature: #4982 (OSD: namespaces pt 1 (librados/osd, not caps))
Signed-off-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Mon, 8 Jul 2013 21:31:29 +0000 (14:31 -0700)]
osd: change pg_stat_t::reported from eversion_t to a pair of fields
This rarely represents an actual eversion_t as the epoch and seq values are
bumped semi-independently to ensure it is always unique. Break it into
two separate fields to avoid confusion.
Drop now-unused and slightly curious inc() method.
Sage Weil [Mon, 8 Jul 2013 22:57:48 +0000 (15:57 -0700)]
mon: be smarter about calculating last_epoch_clean lower bound
We need to take PGs whose mapping has not changed in a long time into
account. For them, the pg state will indicate it was clean at the time of
the report, in which case we can use that as a lower-bound on their actual
latest epoch clean. If they are not currently clean (at report time), use
the last_epoch_clean value.
Fixes: #5519
Signed-off-by: Sage Weil <sage@inktank.com>
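The lower-bound rule above can be sketched as follows. The dict keys are hypothetical stand-ins for the per-PG stat fields, and the real monitor code may combine the per-PG bounds differently.

```python
def last_epoch_clean_floor(pg_reports):
    """Smallest per-PG lower bound on last_epoch_clean, or None if no PGs.

    A PG that was clean when it reported contributes its report epoch;
    any other PG contributes its recorded last_epoch_clean.
    """
    floor = None
    for pg in pg_reports:
        bound = pg["reported_epoch"] if pg["clean"] else pg["last_epoch_clean"]
        floor = bound if floor is None else min(floor, bound)
    return floor
```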
Sage Weil [Mon, 8 Jul 2013 20:27:58 +0000 (13:27 -0700)]
osd: report pg stats to mon at least every N (=500) epochs
The mon needs a moderately accurate last_epoch_clean value in order to trim
old osdmaps. To prevent a PG that hasn't peered or received IO in forever
from preventing this, send pg stats at some minimum frequency. This will
increase the pg stat report workload for the mon over an idle pool, but
should be no worse than a cluster that is getting actual IO and sees these
updates from normal stat updates.
This makes the reported update a bit more aggressive/useful in that the epoch
is the last map epoch processed by this PG and not just one that is >= the
current interval. Note that the semantics of this field are pretty useless
at this point.
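The minimum-frequency rule amounts to a check like the one below. This is a sketch only: the function and parameter names are assumptions, and only the default of 500 epochs comes from the commit title.

```python
def should_report(stats_changed, current_epoch, last_reported_epoch,
                  max_epochs=500):
    """Report when stats changed, or when too many epochs have passed."""
    return stats_changed or (current_epoch - last_reported_epoch >= max_epochs)
```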
Sage Weil [Tue, 9 Jul 2013 04:54:53 +0000 (21:54 -0700)]
mon: make service trim_to stateless
Call get_trim_to() when we need to know how much to trim (if any), and
calculate it then. No need to keep this in a hidden trim_version
variable and remember to update it. This drops several helpers and
accessors and makes get_trim_to() a single method that services need to
override.
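One way to picture the stateless approach: the base class asks get_trim_to() each time it considers trimming, and a service overrides only that method. The class names and the last_epoch_clean_floor attribute are hypothetical, not the monitor's actual classes.

```python
class ServiceSketch:
    def __init__(self, first_committed):
        self.first_committed = first_committed

    def get_trim_to(self):
        """Services override this; the default means nothing to trim."""
        return 0

    def maybe_trim(self):
        """Compute the trim target on demand instead of caching it."""
        trim_to = self.get_trim_to()
        if trim_to <= self.first_committed:
            return None
        trimmed = (self.first_committed, trim_to)
        self.first_committed = trim_to
        return trimmed


class MapServiceSketch(ServiceSketch):
    def __init__(self, first_committed, last_epoch_clean_floor):
        super().__init__(first_committed)
        self.last_epoch_clean_floor = last_epoch_clean_floor

    def get_trim_to(self):
        return self.last_epoch_clean_floor
```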
Sage Weil [Tue, 9 Jul 2013 17:55:05 +0000 (10:55 -0700)]
mon: preserve last_committed_floor across sync
Add a paranoid check to prevent us from forgetting how far ahead our
last_committed was when we sync. This prevents an ill-timed forced-sync
from allowing paxos to warp back in time.
This should never happen unless there is a perfect storm of bad admin
decisions and/or bugs, but we guard against it anyway.
See: #5256
Signed-off-by: Sage Weil <sage@inktank.com>
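A sketch of such a guard, under the assumption that the floor is recorded before local state is discarded and checked once the sync completes. The dict-based state and the function names are invented for illustration and do not mirror the monitor store layout.

```python
def begin_sync(state):
    """Remember how far ahead we were before discarding local paxos state."""
    state["last_committed_floor"] = max(state.get("last_committed_floor", 0),
                                        state.get("last_committed", 0))
    state["last_committed"] = 0


def finish_sync(state, synced_last_committed):
    """Refuse a sync result that would move paxos back in time."""
    if synced_last_committed < state["last_committed_floor"]:
        raise RuntimeError("sync would warp paxos back in time")
    state["last_committed"] = synced_last_committed
```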
Sage Weil [Fri, 5 Jul 2013 17:36:54 +0000 (10:36 -0700)]
mon/Paxos: remove unnecessary trim enable/disable
The sync no longer cares if we trim Paxos versions as we go, as long as we
don't trim so fast that we fall behind between GET_CHUNK messages, which
we can consider a tuning problem.
Sage Weil [Fri, 5 Jul 2013 17:34:46 +0000 (10:34 -0700)]
mon/Paxos: config min paxos txns to keep separately
We were using paxos_max_join_drift to control the minimum number of
paxos transactions to keep around. Instead, make this explicit, and
separate from the join drift.
Sage Weil [Tue, 9 Jul 2013 01:13:31 +0000 (18:13 -0700)]
mon: implement a simpler sync
The previous sync implementation was highly stateful and very complex.
This made it very hard to understand and to debug, and there were bugs
still lurking in the timeout code (at least).
Replace it with something much simpler:
- sync providers are almost stateless. they keep an iterator, identified
by a unique cookie, that times out in a simple way.
- sync requesters sync from whomever they fancy. namely anyone with newer
committed paxos state.
There are a few extra fields that might allow sync continuation later, but
this is complex and not necessary at this point.
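The provider and requester roles described above can be sketched as follows. Everything here (class name, chunk size, timeout handling, the dict standing in for the store) is an assumption for illustration and is far simpler than the monitor's real sync messages.

```python
import itertools


class SyncProviderSketch:
    """Keeps one iterator per cookie; idle sessions simply expire."""

    def __init__(self, store, timeout=30.0):
        self.store = store
        self.timeout = timeout
        self.sessions = {}  # cookie -> [iterator, deadline]
        self._cookies = itertools.count(1)

    def start(self, now):
        cookie = next(self._cookies)
        it = iter(sorted(self.store.items()))
        self.sessions[cookie] = [it, now + self.timeout]
        return cookie

    def get_chunk(self, cookie, now, n=2):
        # Drop any session whose deadline has passed.
        for c in [c for c, s in self.sessions.items() if s[1] <= now]:
            del self.sessions[c]
        session = self.sessions.get(cookie)
        if session is None:
            return None  # unknown, finished, or timed-out cookie
        session[1] = now + self.timeout
        chunk = list(itertools.islice(session[0], n))
        if len(chunk) < n:
            del self.sessions[cookie]  # short chunk: iterator exhausted
        return chunk


def candidate_providers(my_last_committed, peers):
    """Any peer with newer committed paxos state is acceptable."""
    return sorted(p for p, lc in peers.items() if lc > my_last_committed)
```

A requester whose cookie has expired just starts over with a new cookie, which is what keeps the provider side nearly stateless.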
rgw: call appropriate curl calls for waiting on sockets
If libcurl supports curl_multi_wait() then use it, otherwise
use select() and force a timeout, even if it has been disabled.
Otherwise we may wait forever for events that we can't wait for
as select() only uses fds < 1024.
Sage Weil [Mon, 8 Jul 2013 17:49:28 +0000 (10:49 -0700)]
mon/PaxosService: prevent reads until initial service commit is done
Do not process reads (or, by PaxosService::dispatch() implication, writes)
until we have committed the initial service state. This avoids things like
EPERM due to missing keys when we race with mon creation, triggered by
teuthology tests doing their health check after startup.
Fixes: #5515
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
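The gating described above could be pictured like this: requests that arrive before the first commit are parked and replayed afterwards. The class and method names are hypothetical, and the real PaxosService uses its own wait queues and callbacks.

```python
class GatedServiceSketch:
    def __init__(self):
        self.initial_commit_done = False
        self.waiting = []

    def handle(self, msg):
        return "handled " + msg

    def dispatch(self, msg):
        """Park the request until the initial service state is committed."""
        if not self.initial_commit_done:
            self.waiting.append(msg)
            return None
        return self.handle(msg)

    def on_initial_commit(self):
        """Mark the service readable and replay parked requests in order."""
        self.initial_commit_done = True
        pending, self.waiting = self.waiting, []
        return [self.handle(m) for m in pending]
```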
Sage Weil [Tue, 9 Jul 2013 04:38:11 +0000 (21:38 -0700)]
mon/PaxosService: trim periodically instead of via propose_pending
We want to trim old states even if there is no update activity. For
example, if a long-running rebalance finishes, all osdmap updates will
stop and we won't trim out old maps to free space.
Instead, trim at the same time as tick(). Remove the trim during
propose_pending() to force all trims through this path and avoid
introducing a new and rarely-exercised behavior.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Tue, 9 Jul 2013 00:46:40 +0000 (17:46 -0700)]
mon/OSDMonitor: fix base case for loading full osdmap
Right after cluster creation, first_committed is 1 and latest stashed is 0,
but we don't have the initial full map yet. Thereafter, we do (because we
write it with trim). Fixes afd6c7d8247075003e5be439ad59976c3d123218.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>