Sage Weil [Wed, 4 Jan 2012 21:21:36 +0000 (13:21 -0800)]
mon: rev cluster protocol
The OSDMap NEW and AUTOOUT bit additions subtly change the decoding of
the incremental maps in a reasonably harmless way in that the bits get
implicitly cleared whenever the OSD weight changes from non-zero. The
monitors need to agree on this behavior to avoid odd behavior. We don't
care what clients see, since those bits are informational only.
Sage Weil [Wed, 4 Jan 2012 20:56:15 +0000 (12:56 -0800)]
mon: track auto-marked-out osds
Mark OSDs that were automatically marked OUT by the monitor because they
were down for too long. Clear the bit as soon as they are no longer out,
that is, as soon as the weight is changed from 0.
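The bookkeeping described above could be modeled roughly as follows. This is a Python sketch, not the Ceph C++ code: the flag values, the OsdEntry class, and the helper names are all hypothetical and chosen only for illustration.

```python
# Hypothetical flag values for illustration; not Ceph's actual constants.
OSD_EXISTS = 1 << 0
OSD_UP = 1 << 1
OSD_AUTOOUT = 1 << 2


class OsdEntry:
    """Minimal stand-in for one OSD's state bits and weight."""

    def __init__(self, state=OSD_EXISTS, weight=1.0):
        self.state = state
        self.weight = weight


def auto_mark_out(osd):
    """Monitor marks a long-down OSD out and records that it did so."""
    osd.weight = 0
    osd.state |= OSD_AUTOOUT


def set_weight(osd, weight):
    """Apply a weight change; moving away from weight 0 clears AUTOOUT."""
    if osd.weight == 0 and weight != 0:
        osd.state &= ~OSD_AUTOOUT
    osd.weight = weight
```

Under these assumptions, an OSD that is auto-marked out and later given a non-zero weight ends up with the AUTOOUT bit cleared and its other state bits untouched.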
Sage Weil [Wed, 4 Jan 2012 17:42:02 +0000 (09:42 -0800)]
osd: initialize backfill_pos on activate
Handling of writes depends on backfill_pos being initialized (to know what
is between the leading and trailing edge of the backfill), so it needs to
be initialized at activate time to avoid badness on writes prior to
recovery starting.
- initialize during activate to last_backfill
- update on receiving the digest to maintain the invariant that
backfill_pos = min(peer_backfill_info.start, backfill_info.start)
in recover_backfill().
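A rough Python model of the two steps listed above, assuming object positions can be compared with ordinary ordering. The PG and BackfillInterval classes and the on_digest method are hypothetical stand-ins, not the actual ReplicatedPG code.

```python
from dataclasses import dataclass


@dataclass
class BackfillInterval:
    # First position covered by a scanned interval (illustrative field).
    start: str


class PG:
    """Toy placement group that tracks only the backfill positions."""

    def __init__(self, last_backfill):
        self.last_backfill = last_backfill
        self.backfill_pos = None  # uninitialized until activate()
        self.backfill_info = BackfillInterval(last_backfill)
        self.peer_backfill_info = BackfillInterval(last_backfill)

    def activate(self):
        # Writes that arrive before recovery starts need a valid leading edge.
        self.backfill_pos = self.last_backfill

    def on_digest(self, peer_start):
        # Keep backfill_pos = min(peer_backfill_info.start, backfill_info.start).
        self.peer_backfill_info = BackfillInterval(peer_start)
        self.backfill_pos = min(self.peer_backfill_info.start,
                                self.backfill_info.start)
```

In this model, backfill_pos is never left unset once the PG is active, so write handling can always tell whether an object lies between the trailing and leading edges.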
Sage Weil [Sun, 1 Jan 2012 04:44:05 +0000 (20:44 -0800)]
osd: do not use incomplete peer for best info/log
For one, their stats are incomplete; if we use them we'll screw up everyone
else. For another, it doesn't do us any good if they are a bit ahead of
the peers: we/they may not even have the objects their newer log says were
updated. The only real use is if their log extends farther back in time,
but that is a problem in general that we'll eventually solve in other ways.
On the other hand, having the pg_stats sum only through last_backfill may
not have been the best choice; we could avoid that part of things by adding
an objects_backfilled field. But this is probably a good idea anyway.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Josh Durgin [Fri, 30 Dec 2011 02:36:54 +0000 (18:36 -0800)]
RadosModel: check for out of order replies within WriteOps
A single WriteOp already does multiple aio_writes. Each aio_write
gets a unique tid that is checked upon completion. There's no reason
to loop over the ranges twice since we can use the done flag instead
of the set of completions in WriteOp::finished().
Josh Durgin [Fri, 30 Dec 2011 02:30:57 +0000 (18:30 -0800)]
RadosModel: allow TestOps to pass data to their finish methods
This will allow nested writes to keep track of which write actually
completed. Also remove finish() and _finish() from TestOp subclasses
that had the same implementation as the superclass.
Greg Farnum [Sat, 24 Dec 2011 00:41:38 +0000 (16:41 -0800)]
osd: add a monitor timeout via MPGStatsAck messages
Keep track of when we have outstanding updates, and while we do, make
sure the monitor responds within a timeout (default 30 seconds). If
it doesn't, reconnect!
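One way to picture the timeout logic, as a Python sketch under simplifying assumptions: a single ack is treated as covering all outstanding updates, and the StatsAckWatchdog class and its method names are hypothetical rather than taken from the OSD code.

```python
import time


class StatsAckWatchdog:
    """Tracks whether stats updates are outstanding and for how long."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.outstanding_since = None  # None means nothing is awaiting an ack

    def sent_update(self):
        # Start the timer only for the first unacknowledged update.
        if self.outstanding_since is None:
            self.outstanding_since = self.clock()

    def got_ack(self):
        # Simplification: one ack clears all outstanding updates.
        self.outstanding_since = None

    def should_reconnect(self):
        # True only while an update is outstanding and the timeout has passed.
        if self.outstanding_since is None:
            return False
        return self.clock() - self.outstanding_since > self.timeout
```

A periodic tick could call should_reconnect() and reopen the monitor session when it returns True. The clock is injectable so the behavior can be checked without waiting 30 seconds.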
Sage Weil [Sat, 31 Dec 2011 23:09:58 +0000 (15:09 -0800)]
osd: trigger RecoveryFinished event on recovery completion
Unconditionally trigger the RecoveryFinished event when start_recovery_ops
thinks it may be done. This lets us trigger the acting change (if needed),
or call finish_recovery() if needed.
This fixes the case where we are backfilling with up == acting, complete,
but don't call finish_recovery() or clear the backfill|degraded bits.
At some point we may want to move the is_all_uptodate() checks to the
caller.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 30 Dec 2011 23:51:45 +0000 (15:51 -0800)]
osd: do not backfill if any objects are missing on the primary
Someday we need to do something smarter so that a single unfound object
doesn't hold up replication of other objects. For now, this is the
simplest thing to do.
Florian Haas [Thu, 29 Dec 2011 19:58:01 +0000 (20:58 +0100)]
Add OCF-compliant resource agent for Ceph daemons
Add a wrapper around the ceph init script that makes
MDS, OSD and MON configurable as Open Cluster Framework
(OCF) compliant cluster resources. Allows Ceph
daemons to tie in with cluster resource managers that
support OCF, such as Pacemaker (http://www.clusterlabs.org).
Disabled by default; run configure with --with-ocf to enable.
Sage Weil [Fri, 30 Dec 2011 01:15:07 +0000 (17:15 -0800)]
mon: make full ratio config change callback safe
We can't propose_pending() from any context; do this in the tick() thread,
with the proper locking. Among other things, this fixes the crash on
startup that is now triggered due to eba235f2.
Florian Haas [Tue, 27 Dec 2011 10:43:47 +0000 (11:43 +0100)]
init script: be LSB compliant for exit code on status
An exit code of 1 on status is defined in LSB as
"program is dead, but pid file exists". Check for existence
of this pid file, and only set the exit status 1 if it's still there.
Set it to 3 ("program is not running") otherwise.
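The decision described above can be summarized in a small Python sketch. The real change lives in a shell init script; the function name, its parameters, and the injectable exists argument here are hypothetical.

```python
import os

# LSB exit codes for the "status" action of an init script.
LSB_RUNNING = 0               # program is running
LSB_DEAD_PIDFILE_EXISTS = 1   # program is dead, but pid file exists
LSB_NOT_RUNNING = 3           # program is not running


def status_exit_code(daemon_running, pid_file, exists=os.path.exists):
    """Pick the status exit code from the daemon state and the pid file."""
    if daemon_running:
        return LSB_RUNNING
    if exists(pid_file):
        return LSB_DEAD_PIDFILE_EXISTS
    return LSB_NOT_RUNNING
```

The exists parameter defaults to os.path.exists and is only there so the logic can be exercised without touching the filesystem.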
Sage Weil [Thu, 29 Dec 2011 17:41:00 +0000 (09:41 -0800)]
osd: explicitly track leading edge of backfill
backfill_pos is the leading edge; last_backfill is the trailing edge.
Anything in between is either pushed, doesn't exist, or in
backfills_in_flight.
For operations on non-degraded (in-progress) objects in that window, book
the stats update in pending_backfill_updates so that it will get applied
when last_backfill is advanced.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 29 Dec 2011 17:00:46 +0000 (09:00 -0800)]
osd: get fsid from monmap, not osdmap
We may not have a valid OSDMap in all of these cases (notably, during
boot). Always take the fsid from the monmap, which will be valid after
we've authenticated.
Sage Weil [Thu, 29 Dec 2011 16:59:00 +0000 (08:59 -0800)]
monc: get latest monmap during authentication
Tell the monitor which monmap version we have in our initial auth message.
Make the monitor send the latest monmap if it has something newer. This
ensures that once authentication completes the monclient has the latest
monmap and a valid fsid.
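The monitor-side decision might look like this in outline. It is a Python sketch in which the function name and the dictionary used as a monmap are hypothetical; the actual message handling is not shown in this log.

```python
def monmap_to_send(client_version, latest_version, latest_monmap):
    """Return the monmap to include in the auth reply, or None if not needed."""
    # Send the map only when the monitor has something newer than the client.
    if latest_version > client_version:
        return latest_monmap
    return None
```

A client that reports an older version receives the newer map along with the auth reply, so its fsid is valid by the time authentication completes.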
Fixes: #1848
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Thu, 22 Dec 2011 23:25:00 +0000 (15:25 -0800)]
filestore: fix config observer
Actually, I don't think this was fully implemented to begin with, so it's
not a 'fix' per se. This will let you use injectargs to adjust the
filestore config options during runtime.
Samuel Just [Thu, 22 Dec 2011 17:44:33 +0000 (09:44 -0800)]
ReplicatedPG: init backfill infos to last_backfill
We can scan starting from last_backfill to avoid rescanning portions
of the collection recovered by normal recovery. collection_list_partial
now includes begin if present. next will be <= the next object in the
collection. This way we can scan starting at last_backfill without
skipping last_backfill.
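The inclusive-begin listing can be illustrated with a Python sketch over a sorted list of object names. The list_partial function, its signature, and the use of None at the end of the collection are assumptions for illustration, not the actual collection_list_partial interface.

```python
import bisect


def list_partial(sorted_objects, begin, max_count):
    """Return up to max_count objects >= begin, plus a cursor for the next call.

    begin itself is included when it is present in the collection.
    """
    i = bisect.bisect_left(sorted_objects, begin)
    chunk = sorted_objects[i:i + max_count]
    rest = sorted_objects[i + max_count:]
    # Here the cursor is exactly the next object; None marks the end.
    next_cursor = rest[0] if rest else None
    return chunk, next_cursor
```

Because the listing is inclusive, passing last_backfill as begin returns last_backfill first when it exists, and feeding the returned cursor into the next call resumes the scan without skipping anything.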