Sage Weil [Sat, 7 Jan 2012 01:18:01 +0000 (17:18 -0800)]
osd: clean up src_oid, src_obc map key calculation
Be consistent about how we generate the src_oid and src_oloc, so that we
feed good values into find_object_context and use a consistent key for
the src_obc map<>. This fixes a crash in do_osd_ops() due to a missing
src_obc key when get_src_oloc() normalizes the key in do_op() but not
in do_osd_ops().
Also use a nicer name.
Fixes: #1897
Signed-off-by: Sage Weil <sage@newdream.net>
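A minimal sketch of the idea, in Python rather than the OSD's C++: the code
that fills the map and the code that reads it derive the key through one
helper, so a normalized locator cannot miss. The tuple-based locator, the
helpers src_key(), remember() and lookup(), and the normalization rule
(an empty locator key defaults to the base object's name) are assumptions
made for illustration, not the actual Ceph types or logic.

```python
def get_src_oloc(base_oid, base_oloc):
    """Normalize a (pool, key) locator for a source object.

    Assumed rule, for illustration only: an empty key falls back to the
    name of the base object.
    """
    pool, key = base_oloc
    return (pool, key or base_oid)


def src_key(base_oid, base_oloc, src_oid):
    """Single place where the src_obc map key is computed."""
    return (src_oid, get_src_oloc(base_oid, base_oloc))


def remember(src_obc, base_oid, base_oloc, src_oid):
    """Writer side (stands in for do_op): store the context under the shared key."""
    src_obc[src_key(base_oid, base_oloc, src_oid)] = {"oid": src_oid}


def lookup(src_obc, base_oid, base_oloc, src_oid):
    """Reader side (stands in for do_osd_ops): same helper, so the key always matches."""
    return src_obc[src_key(base_oid, base_oloc, src_oid)]
```

If the reader built its key from the raw locator instead of calling
src_key(), the lookup would raise KeyError, which is the analogue of the
crash described above.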
Greg Farnum [Thu, 5 Jan 2012 23:29:32 +0000 (15:29 -0800)]
mon: elector needs to reset leader_acked on every election start
Otherwise you never reset the leader_acked after a failed
election attempt, so if mon 0 is available on the first round
but then fails, you never make progress!
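The stall can be reproduced with a deliberately simplified model of a
rank-based election. The Elector class below, its deferral rule, and the
use of -1 for "nobody acked" are assumptions for illustration and do not
mirror the real elector code.

```python
class Elector:
    """Toy elector: defers to the lowest-ranked proposer it has seen."""

    def __init__(self, rank):
        self.rank = rank
        self.leader_acked = -1  # rank we deferred to; -1 means nobody yet

    def start(self):
        # The fix: forget whom we acked in the previous attempt.
        self.leader_acked = -1

    def handle_propose(self, from_rank):
        # Defer only to a monitor that outranks us and anyone we already acked.
        better_than_us = from_rank < self.rank
        better_than_acked = self.leader_acked < 0 or from_rank <= self.leader_acked
        if better_than_us and better_than_acked:
            self.leader_acked = from_rank
            return True
        return False
```

With mon 2 having acked mon 0 in the first round, a later proposal from
mon 1 is accepted only if start() has cleared leader_acked in between.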
Greg Farnum [Thu, 5 Jan 2012 22:03:43 +0000 (14:03 -0800)]
mon: kill client sessions when we're not in quorum
After a timeout of 2*mon_lease length (i.e., two election rounds),
kill existing client sessions so they can reconnect to a
monitor that's (hopefully) remained in the quorum. Let any
new client sessions stick around for a mon_lease interval, then
do the same to them.
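A rough sketch of the timeout rule, assuming sessions are tracked by
creation time. The function sessions_to_kill() and its arguments are
hypothetical; only the 2*mon_lease and mon_lease intervals come from the
description above.

```python
def sessions_to_kill(sessions, now, out_of_quorum_since, mon_lease):
    """Return the names of client sessions that should be closed.

    sessions maps a session name to its creation time. out_of_quorum_since
    is the time we dropped out of quorum, or None while we are in quorum.
    """
    if out_of_quorum_since is None:
        return []
    doomed = []
    for name, created in sessions.items():
        if created <= out_of_quorum_since:
            # Existing session: allow two lease intervals (two election rounds).
            deadline = out_of_quorum_since + 2 * mon_lease
        else:
            # Session opened while out of quorum: allow one lease interval.
            deadline = created + mon_lease
        if now >= deadline:
            doomed.append(name)
    return sorted(doomed)
```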
Sage Weil [Wed, 4 Jan 2012 21:21:36 +0000 (13:21 -0800)]
mon: rev cluster protocol
The OSDMap NEW and AUTOOUT bit additions subtly change the decoding of
the incremental maps in a reasonably harmless way in that the bits get
implicitly cleared whenever the OSD weight changes from non-zero. The
monitors need to agree on this to avoid odd behavior. We don't
care what clients see, since those bits are informational only.
Sage Weil [Wed, 4 Jan 2012 20:56:15 +0000 (12:56 -0800)]
mon: track auto-marked-out osds
Mark OSDs that were automatically marked OUT by the monitor because they
were down for too long. Clear the bit as soon as they are no longer out,
i.e., as soon as the weight is changed from 0.
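The bit handling amounts to the following, sketched here with a
hypothetical OsdState class. The bit value chosen for AUTOOUT and the
helper names are illustrative assumptions, not the monitor's actual code.

```python
AUTOOUT = 1 << 2  # illustrative bit value


class OsdState:
    def __init__(self):
        self.weight = 1.0
        self.flags = 0


def auto_mark_out(osd):
    """Monitor decided the OSD was down for too long: mark it out and remember why."""
    osd.weight = 0.0
    osd.flags |= AUTOOUT


def set_weight(osd, weight):
    """Any change away from weight 0 means the OSD is no longer out: clear the bit."""
    was_out = osd.weight == 0.0
    osd.weight = weight
    if was_out and weight != 0.0:
        osd.flags &= ~AUTOOUT
```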
Sage Weil [Wed, 4 Jan 2012 17:42:02 +0000 (09:42 -0800)]
osd: initialize backfill_pos on activate
Handling of writes depends on backfill_pos being initialized (to know what
is between the leading and trailing edge of the backfill), so it needs to
be initialized at activate time to avoid badness on writes prior to
recovery starting.
- initialize during activate to last_backfill
- update on receiving the digest to maintain the invariant that
backfill_pos = min(peer_backfill_info.start, backfill_info.start)
in recover_backfill().
Sage Weil [Sun, 1 Jan 2012 04:44:05 +0000 (20:44 -0800)]
osd: do not use incomplete peer for best info/log
For one, their stats are incomplete; if we use them we'll screw up everyone
else. For another, it doesn't do us any good if they are a bit ahead of
the peers: we/they may not even have the objects their newer log says were
updated. The only real use is if their log extends farther back in time,
but that is a problem in general that we'll eventually solve in other ways.
On the other hand, having the pg_stats sum only through last_backfill may
not have been the best choice; we could avoid that part of things by adding
an objects_backfilled field. But this is probably a good idea anyway.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Josh Durgin [Fri, 30 Dec 2011 02:36:54 +0000 (18:36 -0800)]
RadosModel: check for out of order replies within WriteOps
A single WriteOp already does multiple aio_writes. Each aio_write
gets a unique tid that is checked upon completion. There's no reason
to loop over the ranges twice since we can use the done flag instead
of the set of completions in WriteOp::finished().
Josh Durgin [Fri, 30 Dec 2011 02:30:57 +0000 (18:30 -0800)]
RadosModel: allow TestOps to pass data to their finish methods
This will allow nested writes to keep track of which write actually
completed. Also remove finish() and _finish() from TestOp subclasses
that had the same implementation as the superclass.
Greg Farnum [Sat, 24 Dec 2011 00:41:38 +0000 (16:41 -0800)]
osd: add a monitor timeout via MPGStatsAck messages
Keep track of when we have outstanding updates, and while we do, make
sure the monitor responds within a timeout (default 30 seconds). If
it doesn't, reconnect!
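One way to picture the bookkeeping, assuming each stats update carries a
transaction id that the ack echoes back. The StatsAckTracker class and its
method names are hypothetical; only the 30 second default comes from the
description above.

```python
class StatsAckTracker:
    """Tracks stats updates that the monitor has not acknowledged yet."""

    def __init__(self, timeout=30.0):
        self.timeout = timeout
        self.outstanding = {}  # tid -> time the update was sent

    def sent(self, tid, now):
        self.outstanding[tid] = now

    def acked(self, tid):
        self.outstanding.pop(tid, None)

    def should_reconnect(self, now):
        # Only time out while something is actually outstanding.
        if not self.outstanding:
            return False
        oldest = min(self.outstanding.values())
        return now - oldest > self.timeout
```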
Sage Weil [Sat, 31 Dec 2011 23:09:58 +0000 (15:09 -0800)]
osd: trigger RecoveryFinished event on recovery completion
Unconditionally trigger the RecoveryFinished event when start_recovery_ops
thinks it may be done. This lets us trigger the acting change (if needed),
or call finish_recovery() if needed.
This fixes the case where we are backfilling with up == acting, complete,
but don't call finish_recovery() or clear the backfill|degraded bits.
At some point we may want to move the is_all_uptodate() checks to the
caller.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 30 Dec 2011 23:51:45 +0000 (15:51 -0800)]
osd: do not backfill if any objects are missing on the primary
Someday we need to do something smarter so that a single unfound object
doesn't hold up replication of other objects. For now, this is the
simplest thing to do.
Florian Haas [Thu, 29 Dec 2011 19:58:01 +0000 (20:58 +0100)]
Add OCF-compliant resource agent for Ceph daemons
Add a wrapper around the ceph init script that makes
MDS, OSD and MON configurable as Open Cluster Framework
(OCF) compliant cluster resources. Allows Ceph
daemons to tie in with cluster resource managers that
support OCF, such as Pacemaker (http://www.clusterlabs.org).
Disabled by default, configure --with-ocf to enable.
Sage Weil [Fri, 30 Dec 2011 01:15:07 +0000 (17:15 -0800)]
mon: make full ratio config change callback safe
We can't propose_pending() from any context; do this in the tick() thread,
with the proper locking. Among other things, this fixes the crash on
startup that is now triggered due to eba235f2.
Florian Haas [Tue, 27 Dec 2011 10:43:47 +0000 (11:43 +0100)]
init script: be LSB compliant for exit code on status
An exit code of 1 on status is defined in LSB as
"program is dead, but pid file exists". Check for existence
of this pid file, and only set the exit status 1 if it's still there.
Set it to 3 ("program is not running") otherwise.
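The decision reduces to a small table. The sketch below expresses it as a
Python function rather than shell; status_exit_code() and its boolean
arguments are hypothetical stand-ins for the checks the init script
performs.

```python
def status_exit_code(running, pid_file_exists):
    """Map daemon state to an LSB 'status' exit code."""
    if running:
        return 0  # program is running
    if pid_file_exists:
        return 1  # program is dead, but pid file exists
    return 3      # program is not running
```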
Sage Weil [Thu, 29 Dec 2011 17:41:00 +0000 (09:41 -0800)]
osd: explicitly track leading edge of backfill
backfill_pos is the leading edge; last_backfill is the trailing edge.
Anything in between is either pushed, doesn't exist, or is in
backfills_in_flight.
For operations on non-degraded (in-progress) objects in that window, book
the stats update in pending_backfill_updates so that it will get applied
when last_backfill is advanced.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
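A simplified model of the window and the deferred stats, assuming objects
sort by name and stats are a single counter. The BackfillTracker class,
the peer_stats counter, and the exact boundary comparisons are
assumptions for illustration; only backfill_pos, last_backfill and
pending_backfill_updates are names taken from the description above.

```python
class BackfillTracker:
    def __init__(self, last_backfill):
        self.last_backfill = last_backfill  # trailing edge
        self.backfill_pos = last_backfill   # leading edge
        self.pending_backfill_updates = {}  # oid -> deferred stats delta
        self.peer_stats = 0                 # stats accounted for the backfill peer

    def note_write(self, oid, delta):
        if oid <= self.last_backfill:
            # Behind the trailing edge: the peer's stats already cover this object.
            self.peer_stats += delta
        elif oid <= self.backfill_pos:
            # Inside the window: book the update until last_backfill catches up.
            pending = self.pending_backfill_updates
            pending[oid] = pending.get(oid, 0) + delta
        # Past the leading edge: backfill has not reached the object yet.

    def advance_last_backfill(self, new_last):
        for oid in sorted(self.pending_backfill_updates):
            if oid <= new_last:
                self.peer_stats += self.pending_backfill_updates.pop(oid)
        self.last_backfill = new_last
```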
Sage Weil [Thu, 29 Dec 2011 17:00:46 +0000 (09:00 -0800)]
osd: get fsid from monmap, not osdmap
We may not have a valid OSDMap in all of these cases (notably, during
boot). Always take the fsid from the monmap, which will be valid after
we've authenticated.
Sage Weil [Thu, 29 Dec 2011 16:59:00 +0000 (08:59 -0800)]
monc: get latest monmap during authentication
Tell the monitor which monmap version we have in our initial auth message.
Make the monitor send the latest monmap if it has something newer. This
ensures that once authentication completes the monclient has the latest
monmap and a valid fsid.
Fixes: #1848
Signed-off-by: Sage Weil <sage@newdream.net>
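The exchange can be sketched as a version check on the monitor side. The
handle_auth() function and the dict-shaped messages are hypothetical and
only illustrate the "send the monmap if ours is newer" rule described
above.

```python
def handle_auth(client_monmap_epoch, current_monmap):
    """Build the monitor's auth reply, attaching the monmap if the client's is stale."""
    reply = {"auth": "ok"}
    if current_monmap["epoch"] > client_monmap_epoch:
        reply["monmap"] = current_monmap
    return reply
```

A client that reports epoch 0 (no monmap yet) always receives the current
map, so its fsid is valid by the time authentication completes.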