Sage Weil [Wed, 4 Jan 2012 17:42:02 +0000 (09:42 -0800)]
osd: initialize backfill_pos on activate
Handling of writes depends on backfill_pos being initialized (to know what
is between the leading and trailing edge of the backfill), so it needs to
be initialized at activate time to avoid badness on writes prior to
recovery starting.
- initialize during activate to last_backfill
- update on receiving the digest to maintain the invariant that
  backfill_pos = min(peer_backfill_info.start, backfill_info.start)
  in recover_backfill().
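
The invariant above can be sketched in a few lines. This is a toy Python model, not the Ceph code: object positions are plain strings compared lexicographically, and compute_backfill_pos() is a hypothetical helper name.

```python
def compute_backfill_pos(backfill_info_start, peer_backfill_info_start):
    """Leading edge of the backfill: the smaller of the two scan starts."""
    return min(backfill_info_start, peer_backfill_info_start)


# At activate time nothing has been scanned yet, so both intervals are
# assumed to start at last_backfill and backfill_pos equals last_backfill.
last_backfill = "obj_0100"
backfill_pos = compute_backfill_pos(last_backfill, last_backfill)

# After a digest arrives, the peer's interval may start earlier than ours.
after_digest = compute_backfill_pos("obj_0150", "obj_0120")
```

Here after_digest is the peer's start, since it is the smaller of the two positions.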
Sage Weil [Sun, 1 Jan 2012 04:44:05 +0000 (20:44 -0800)]
osd: do not use incomplete peer for best info/log
For one, their stats are incomplete; if we use them we'll screw up everyone
else. For another, it doesn't do us any good if they are a bit ahead of
the peers: we/they may not even have the objects their newer log says were
updated. The only real use is if their log extends farther back in time,
but that is a problem in general that we'll eventually solve in other ways.
On the other hand, having the pg_stats sum only through last_backfill may
not have been the best choice; we could avoid that part of things by adding
an objects_backfilled field. But this is probably a good idea anyway.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
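
A rough sketch of the selection rule described above, assuming each peer info is a dict with hypothetical "incomplete" and "last_update" keys (the real pg info structure is not shown in this log):

```python
def choose_best_info(infos):
    """Pick the info with the newest last_update, skipping incomplete peers."""
    # Incomplete peers are excluded outright: their stats only cover part
    # of the PG, and their newer log entries may refer to missing objects.
    complete = [info for info in infos if not info["incomplete"]]
    if not complete:
        return None
    return max(complete, key=lambda info: info["last_update"])


peers = [
    {"osd": 0, "incomplete": False, "last_update": (10, 40)},
    {"osd": 1, "incomplete": True, "last_update": (10, 45)},   # ahead, but skipped
    {"osd": 2, "incomplete": False, "last_update": (10, 42)},
]
best = choose_best_info(peers)
```

The incomplete peer is ignored even though its last_update is the newest of the three.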
Sage Weil [Sat, 31 Dec 2011 23:09:58 +0000 (15:09 -0800)]
osd: trigger RecoveryFinished event on recovery completion
Unconditionally trigger the RecoveryFinished event when start_recovery_ops
thinks it may be done. This lets us trigger the acting change (if needed),
or call finish_recovery() if needed.
This fixes the case where we are backfilling with up == acting, complete,
but don't call finish_recovery() or clear the backfill|degraded bits.
At some point we may want to move the is_all_uptodate() checks to the
caller.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 30 Dec 2011 23:51:45 +0000 (15:51 -0800)]
osd: do not backfill if any objects are missing on the primary
Someday we need to do something smarter so that a single unfound object
doesn't hold up replication of other objects. For now, this is the
simplest thing to do.
Florian Haas [Thu, 29 Dec 2011 19:58:01 +0000 (20:58 +0100)]
Add OCF-compliant resource agent for Ceph daemons
Add a wrapper around the ceph init script that makes
MDS, OSD and MON configurable as Open Cluster Framework
(OCF) compliant cluster resources. Allows Ceph
daemons to tie in with cluster resource managers that
support OCF, such as Pacemaker (http://www.clusterlabs.org).
Disabled by default; configure with --with-ocf to enable.
Sage Weil [Fri, 30 Dec 2011 01:15:07 +0000 (17:15 -0800)]
mon: make full ratio config change callback safe
We can't call propose_pending() from an arbitrary context; do this in the tick() thread,
with the proper locking. Among other things, this fixes the crash on
startup that is now triggered due to eba235f2.
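
One way to picture the fix is a deferred-work pattern: the config callback only records the new value, and tick() does the actual proposal under the lock. The sketch below is a generic illustration with hypothetical names (FullRatioUpdater, on_config_change, proposed); it is not the monitor's code.

```python
import threading


class FullRatioUpdater:
    def __init__(self):
        self.lock = threading.Lock()
        self.pending_ratio = None   # value recorded by the config callback
        self.proposed = []          # stand-in for calls to propose_pending()

    def on_config_change(self, ratio):
        # May run in any thread: only record the request, never propose here.
        with self.lock:
            self.pending_ratio = ratio

    def tick(self):
        # Runs in the tick thread; the proposal happens with the lock held.
        with self.lock:
            if self.pending_ratio is not None:
                self.proposed.append(self.pending_ratio)
                self.pending_ratio = None


updater = FullRatioUpdater()
updater.on_config_change(0.95)
updater.tick()
updater.tick()  # nothing pending, so no second proposal
```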
Florian Haas [Tue, 27 Dec 2011 10:43:47 +0000 (11:43 +0100)]
init script: be LSB compliant for exit code on status
An exit code of 1 on status is defined in LSB as
"program is dead, but pid file exists". Check for existence
of this pid file, and only set the exit status 1 if it's still there.
Set it to 3 ("program is not running") otherwise.
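
The decision described above, modelled in Python rather than in the init script's own shell. The function name and arguments are hypothetical; the exit code values are the LSB ones quoted in the message, plus 0 for a running service.

```python
import os

LSB_RUNNING = 0           # program is running
LSB_DEAD_PIDFILE = 1      # program is dead, but pid file exists
LSB_NOT_RUNNING = 3       # program is not running


def status_exit_code(is_running, pid_file):
    """Map daemon state to an LSB exit code for the 'status' action."""
    if is_running:
        return LSB_RUNNING
    # Only report 1 when the stale pid file is still on disk.
    if os.path.exists(pid_file):
        return LSB_DEAD_PIDFILE
    return LSB_NOT_RUNNING
```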
Sage Weil [Thu, 29 Dec 2011 17:41:00 +0000 (09:41 -0800)]
osd: explicitly track leading edge of backfill
backfill_pos is the leading edge; last_backfill is the trailing edge.
Anything in between is either pushed, doesn't exist, or in
backfills_in_flight.
For operations on non-degraded (in-progress) objects in that window, book
the stats update in pending_backfill_updates so that it will get applied
when last_backfill is advanced.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
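
A toy model of the window between the two edges, assuming string object names and integer stat deltas. The class, its method names, and the exact boundary comparisons (<= versus <) are assumptions made for illustration; only last_backfill, backfill_pos and pending_backfill_updates come from the message above.

```python
class BackfillWindow:
    def __init__(self, last_backfill):
        self.last_backfill = last_backfill   # trailing edge
        self.backfill_pos = last_backfill    # leading edge
        self.applied_stats = 0
        self.pending_backfill_updates = {}   # object -> deferred stat delta

    def record_write(self, obj, delta):
        if obj <= self.last_backfill:
            # Behind the trailing edge: stats already cover this object.
            self.applied_stats += delta
        elif obj < self.backfill_pos:
            # Inside the window: book the update for later.
            pending = self.pending_backfill_updates
            pending[obj] = pending.get(obj, 0) + delta
        # Beyond the leading edge: not backfilled yet, nothing to book.

    def advance_last_backfill(self, new_last_backfill):
        self.last_backfill = new_last_backfill
        ready = [o for o in self.pending_backfill_updates if o <= new_last_backfill]
        for obj in sorted(ready):
            self.applied_stats += self.pending_backfill_updates.pop(obj)


w = BackfillWindow("b")
w.backfill_pos = "f"
w.record_write("a", 1)   # applied immediately
w.record_write("d", 2)   # deferred
w.record_write("z", 5)   # ignored for now
w.advance_last_backfill("e")
```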
Sage Weil [Thu, 29 Dec 2011 17:00:46 +0000 (09:00 -0800)]
osd: get fsid from monmap, not osdmap
We may not have a valid OSDMap in all of these cases (notably, during
boot). Always take the fsid from the monmap, which will be valid after
we've authenticated.
Sage Weil [Thu, 29 Dec 2011 16:59:00 +0000 (08:59 -0800)]
monc: get latest monmap during authentication
Tell the monitor which monmap version we have in our initial auth message.
Make the monitor send the latest monmap if it has something newer. This
ensures that once authentication completes the monclient has the latest
monmap and a valid fsid.
Fixes: #1848
Signed-off-by: Sage Weil <sage@newdream.net>
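
The exchange described above amounts to a version comparison on the monitor side. A minimal sketch, assuming integer monmap versions and a hypothetical monmap_for_auth_reply() helper:

```python
def monmap_for_auth_reply(client_version, monitor_version, monitor_monmap):
    """Return the monmap to send to the client, or None if it is current."""
    if monitor_version > client_version:
        return monitor_monmap
    return None


latest = {"version": 7, "fsid": "example-fsid"}   # placeholder contents
stale_client = monmap_for_auth_reply(5, latest["version"], latest)
current_client = monmap_for_auth_reply(7, latest["version"], latest)
```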
Sage Weil [Thu, 22 Dec 2011 23:25:00 +0000 (15:25 -0800)]
filestore: fix config observer
Actually, I don't think this was fully implemented to begin with, so it's
not a 'fix' per se. This will let you use injectargs to adjust the
filestore config options during runtime.
Samuel Just [Thu, 22 Dec 2011 17:44:33 +0000 (09:44 -0800)]
ReplicatedPG: init backfill infos to last_backfill
We can scan starting from last_backfill to avoid rescanning portions
of the collection recovered by normal recovery. collection_list_partial
now includes begin if present. next will be <= the next object in the
collection. This way we can scan starting at last_backfill without
skipping last_backfill.
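
The inclusive-begin listing can be illustrated on a sorted Python list. This models only the semantics described above, not the filestore API: list_partial() is a hypothetical name, and returning None for next at the end of the collection is an assumption.

```python
import bisect


def list_partial(sorted_objects, begin, max_count):
    """Return up to max_count objects >= begin, plus the next position."""
    # bisect_left keeps begin itself in the result when it is present.
    i = bisect.bisect_left(sorted_objects, begin)
    chunk = sorted_objects[i:i + max_count]
    j = i + len(chunk)
    next_pos = sorted_objects[j] if j < len(sorted_objects) else None
    return chunk, next_pos


objects = ["a", "c", "e", "g"]
chunk, next_pos = list_partial(objects, "c", 2)
```

Starting the scan at "c" returns "c" itself, so a scan that begins at last_backfill does not skip it.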
Samuel Just [Sat, 17 Dec 2011 02:04:32 +0000 (18:04 -0800)]
calc_acting: Prefer up[0] as primary if possible
Previously, we could get into a state where although up[0] has been
fully backfilled, acting[0] could be selected as a primary if it is able
to pull another peer into the acting set. This also collects the logic
of choosing the best info into a helper function.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Samuel Just [Mon, 19 Dec 2011 22:50:17 +0000 (14:50 -0800)]
MOSDRepScrub,ReplicatedPG: Add scrub_to to MOSDRepScrub
When scrub_from is set, also set scrub_to to the primary's
last_update_applied (which will also be the official last_update before
finalizing scrub began). Instead of waiting for last_update_applied to
catch up to last_update, the replica will wait for last_update_applied to
catch up to active_rep_scrub->scrub_to. This
avoids a race where the replica scrub is requeued before all of the
currently queued sub-ops have been processed.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
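
The replica's new wait condition, sketched with versions modelled as (epoch, version) tuples. The function name and the tuple representation are assumptions for illustration.

```python
def replica_can_scrub(last_update_applied, scrub_to):
    """True once the replica has applied everything up to scrub_to."""
    return last_update_applied >= scrub_to


scrub_to = (5, 10)   # primary's last_update_applied when scrub_from was set
still_waiting = replica_can_scrub((5, 9), scrub_to)
ready = replica_can_scrub((5, 10), scrub_to)
```

The target is fixed by the primary when the request is built, so the condition does not depend on the replica's own last_update.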
Sage Weil [Wed, 30 Nov 2011 22:13:14 +0000 (14:13 -0800)]
filejournal: uuid for fsid
Decode old header struct, but encode new class using more normal encoding
style. Embed in a bufferlist for later extensibility. Use the first
64 bits of the uuid for the per-entry magic, as before.
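
Deriving a 64-bit value from the first 8 bytes of a uuid could look like this in Python. The little-endian byte order and the entry_magic() name are assumptions; the message only says that the first 64 bits are used.

```python
import uuid


def entry_magic(fsid):
    """Interpret the first 8 bytes (64 bits) of the uuid as an integer."""
    return int.from_bytes(fsid.bytes[:8], "little")


fsid = uuid.UUID("00112233-4455-6677-8899-aabbccddeeff")
magic = entry_magic(fsid)
```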