Sage Weil [Fri, 12 Jul 2013 21:47:09 +0000 (14:47 -0700)]
mon: fix scrub vs paxos race: refresh on commit, not round completion
Consider:
- paxos starts a commit N+1
- a majority of the peers ack it
- paxos::commit() writes N+1 it to disk
- tells peers to commit
- peers commit N+1, *and* refresh_from_paxos(), and generate N+1 full map
- leader does _scrub on N+1, without latest full osdmap
- peers do _scrub on N+1, with latest full osdmap
- leader finishes paxos gather, does refresh_from_paxos()
-> scrub fails.
Fix this by doing the refresh_from_paxos() at commit time and not when
the paxos round finishes. We move the refresh out of finish_proposal
and into its own helper, and update all callers accordingly. This
keeps on-disk state more tightly in sync with in-memory state and
avoids the need for a e.g., kludgey workaround in the scrub code.
We also simplify the bootstrap checks a bit by doing so immediately
and relying on the normal bootstrap paxos reset paths to clean up
any waiters.
Sage Weil [Fri, 12 Jul 2013 20:50:49 +0000 (13:50 -0700)]
mon/PaxosService: do not prepare new pending if still proposing
The _active callback can get called while are already proposing. If
that happens, we should not prepare a fresh new pending but should
wait for the previous proposal to finish.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Sage Weil [Sat, 13 Jul 2013 04:42:19 +0000 (21:42 -0700)]
mon/PaxosService: fix trim completion
Do not call C_Committed after trim or else we will prematurely clear
the bool proposing, propose something again using the same version, and
crash. We do not in fact need anything to happen here aside from the
refresh_from_paxos() that happens on its own.
Dan Mick [Fri, 12 Jul 2013 20:58:36 +0000 (13:58 -0700)]
ceph-rest-api: separate into module and front-end for WSGI deploy
To deploy ceph-rest-api within a WSGI server (apache/mod_wsgi,
nginx/uwsgi, etc.), there needs to be an importable (.py) module
that performs all init/config when imported. ceph-rest-api was
close, but it needs to be named properly, and there's no argument
passing, so it needs to get args from a fixed file or the env.
Separate most of ceph-rest-api into pybind/ceph_rest_api.py, and make
its arguments come from the environment, and init errors be
ImportError exceptions. Recase ceph-rest-api as a thin layer that
does the usual setup and arg parsing, and then sets args into the
environment and imports ceph_rest_api.py, catching exceptions and
reporting errors. This allows standalone execution as usual.
ceph-rest-api grabs a few module globals (addr/port and the flask.app)
to use after it imports.
Accept cluster name, and do the ceph.conf search using cluster name
in the appropriate places in the searched-for files.
Also ceph_rest_api.py gets a little cleanup (fewer global variables,
cleaner conf file search algorithm, better error reporting on conf
load)
Also: doc updates, packaging updates to include ceph_rest_api.py
Sage Weil [Fri, 12 Jul 2013 23:21:24 +0000 (16:21 -0700)]
msg/Pipe: fix RECONNECT_SEQ behavior
Calling handle_ack() here has no effect because we have already
spliced sent messages back into our out queue. Instead, pull them out
of there and discard. Add a few assertions along the way.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>
Sage Weil [Fri, 12 Jul 2013 20:12:51 +0000 (13:12 -0700)]
mon: AuthMonitor: don't try to auth import a null buffer
Hangs result if 'ceph auth import' is attempted without -i.
Check for this case and return error status. Also,
update auth import help to more-clearly indicate that "input"
means "-i <file>".
Fixes: #4599 Signed-off-by: Dan Mick <dan.mick@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
David Zafman [Fri, 12 Jul 2013 02:47:47 +0000 (19:47 -0700)]
test: idempotent filestore test failure
Remove obsolete use of collection_move()
Allow operations to be skipped if random selections don't make sense
Track total number of possible objects in m_num_objects
BUG: do_remove() was calling _do_touch() instead of _do_remove()
For ops that require an object, select from among existing objects in collection
Initialize m_num_objects unique objects across collections
touch: don't create an object that already exists in another collection
remove: Use remove_obj() to clear object from m_objects to have accurate tracking
clone/clone_range(): Select 2 existing objects in the collection
add: Skip operation if selected target object name exists in target collection
move: Removed this buggy operation that is only present for upgrades
Fixes: #5371 Fixes: #5240 Signed-off-by: David Zafman <david.zafman@inktank.com>
Sage Weil [Fri, 12 Jul 2013 00:42:14 +0000 (17:42 -0700)]
mon: stash latest state when flagging force_sync
Store our latest state when we set the force_sync flag. This is important
because we will clear the store the next time we start up and may not be
able to get a useful monmap at that point.
Sage Weil [Fri, 12 Jul 2013 00:45:46 +0000 (17:45 -0700)]
mon: fix off-by-one: no need to reapply previous last_committed after sync
The old last_committed is already committed; don't reapply. This also fixes
the case where lc was 0 (i.e., we did get_cookie_recent from the beginning
of time).
Sage Weil [Fri, 12 Jul 2013 01:43:24 +0000 (18:43 -0700)]
osd/OSDmap: fix OSDMap::Incremental::dump() for new pool names
The name is always present when pools are created, but not when they are
modified. Also, a name may be present with a new_pools entry if the pool
is just renamed. Separate it out completely in the dump.
Backport: cuttlefish, bobtail Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 11 Jul 2013 18:42:02 +0000 (11:42 -0700)]
msg/Message: use old footer for encoded message dump
This avoids the need for a conditional decoding check on ceph-dencoder,
and makes it match up with what encode_message() is doing. The new(ish)
fields in the footer (the signature) is not useful for the object
corpus.
Dan Mick [Thu, 11 Jul 2013 00:39:47 +0000 (17:39 -0700)]
Add 'ceph-rest-api'
ceph-rest-api is a Python WSGI module for accessing the Ceph cluster.
It supports most of the commands supported by the ceph CLI,
appropriately translated to HTTP GET/PUT requests. It is not a
truly RESTful interface.
Not supported at this moment: "tell", "pg <pgid>", and "daemon"
commands.
Configuration options are specified in ceph.conf, specified with
-c/--conf or obtained from $CEPH_CONF, /etc/ceph/ceph.conf,
~/.ceph/ceph.conf, or ./ceph.conf.
-n/--name specifies the client name, used for the cluster
authentication key and for the ceph.conf section name (default
is client.restapi).
Primitive human-level command discovery is supported; GET from
BASEURL (say, http://localhost:5000/api/v0.1) will show an HTML
table of all commands and arguments, method supported, and help strings.
Dan Mick [Thu, 11 Jul 2013 00:22:51 +0000 (17:22 -0700)]
mon: OSDMonitor: osd pool get: move to preproc, add formatted output
Move 'osd pool get' handling to preprocess_command()
It's a read operation; there's no need for it to be in prepare_command().
Also, add formatted output.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Dan Mick [Thu, 11 Jul 2013 00:15:35 +0000 (17:15 -0700)]
mon: Monitor: support multiple formatters on some status functions
Commands such as 'mon_status', 'quorum_status', 'sync_status' and
'sync_force' didn't support other formatter besides json. Regardless of
'--format=foo' being specified, they would always output in json.
This commit changes that behavior, allowing a format to be passed. These
functions do not output in plain-text however. Plain-text will default
to 'json' -- the reason: the information they provide are better outputted
in a structured fashion, and I was too lazy to come up with a plain-text
version that could be at least as good.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Dan Mick [Wed, 10 Jul 2013 23:41:24 +0000 (16:41 -0700)]
mon,auth: AuthMonitor, KeyRing: add Formatter-dumps of auth info
Signed-off-by: Dan Mick <dan.mick@inktank.com>
auth: KeyRing: encode_formatted() receives a label as first argument
Also, this allows us to standardize formatted output on the AuthMonitor,
so that all output starts with a section 'auth'. Other subsystems using
the KeyRing class, can specify whatever section they prefer.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Dan Mick [Wed, 10 Jul 2013 23:24:45 +0000 (16:24 -0700)]
ceph CLI: valid() no longer returns bool, but just exception
The type validation's valid() method was using a combination of
return code and exception to really indicate the same thing;
simplify by only raising on validation error, and change callers
to cope. validate_one() follows suit.
Also, allow validate() to be called with args that are dicts
(for REST support) rather than bare words. Rules: 'name':'value'
must both match descriptor's name and validate (through valid() for
the value. If value is '', it's assumed to be the same as name,
(one can pass, for example, "detail" as one parameter to
REST, but it will still show up as {'detail':''} here).
Tweak validate()'s algorithm a bit in the process, and make
validate_command() exit the bestcmds loop immediately on first
full validation.
Dan Mick [Wed, 10 Jul 2013 23:12:56 +0000 (16:12 -0700)]
MonCommands: add new fields: modulename, perms, availability
To help optimize the REST API, we need to know whether the commands
are read (GET) or write (PUT/POST). However, we also could use that
same info for permission/caps checking. Add modulename/perms as
the required caps for each command to drive both needs.
The availability field is to control whether a command is displayed/
advertised through the CLI or REST interfaces; some commands aren't
really useful for REST, and we may want to invent REST-only commands;
also, this gives us a way to deprecate commands quickly and leave the
code, should that be desirable. Make the CLI display only commands
marked with the 'CLI' marker.
Also stop renaming 'help' to 'helptext' in the client.
Get device-by-path by looking for it instead of assuming 3rd entry.
On some systems (virtual machines so far) the device-by-path entry
from udevadm is not always in the same spot so instead actually
look for the right output instead of blindy assuming that its a
specific field in the output.
Signed-off-by: Sandon Van Ness <sandon@inktank.com> Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
Get device-by-path by looking for it instead of assuming 3rd entry.
On some systems (virtual machines so far) the device-by-path entry
from udevadm is not always in the same spot so instead actually
look for the right output instead of blindy assuming that its a
specific field in the output.
Signed-off-by: Sandon Van Ness <sandon@inktank.com> Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
Gary Lowell [Mon, 8 Jul 2013 19:20:45 +0000 (12:20 -0700)]
Makefile.am: fix ceph_sbindir
This reinstates the fix for the ceph_sbindir from commit 352f362567bf270d0896fb7573df4ae5139a56fb, with corrections
from Danny's review commits pull request #389. Fixes: #5492 Reported-by: Denis Kaganovich <mahatma@eu.by> Reviewed-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de> Signed-off-by: Sage Weil <sage@inktank.com> Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Wed, 10 Jul 2013 18:02:08 +0000 (11:02 -0700)]
osd: limit number of inc osdmaps send to peers, clients
We should not send an unbounded number of inc maps to our peers or clients.
In particular, if a peer is not contacted for a while, we may think they
have a very old map (say, 10000 epochs ago) and send thousands of inc maps
when the distribution shifts and we need to peer.
Note that if we do not send enough maps, the peers will make do by
requesting the map from somewhere else (currently the mon). Regardless
of the source, however, we must limit the amount that we speculatively
share as it usually is not needed.
Backport: cuttlefish, bobtail Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Wed, 10 Jul 2013 17:17:45 +0000 (10:17 -0700)]
mon: do not populate MMonCommand paxos version field
The field is not used or useful since the monitor does not even look
at it (in Monitor::handle_command()). Avoid populating it and the
subsequent confusion for poor developers.
Sage Weil [Wed, 10 Jul 2013 17:06:20 +0000 (10:06 -0700)]
messages/MPGStats: do not set paxos version to osdmap epoch
The PaxosServiceMessage version field is meant for client-coordinated
ordering of messages when switching between monitors (and is rarely
used). Do not fill it with the osdmap epoch lest it be compared to a
pgmap version, which may cause the mon to (near) indefinitely put it on
a wait queue until the pgmap version catches up.
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Greg Farnum <greg@inktank.com>