Alexandre Oliva [Tue, 10 Jan 2012 03:41:45 +0000 (01:41 -0200)]
client: start caching readdir results after readdir_start
Use upper_bound rather than lower_bound to compute the initial pd within
insert_trace, so that we don't attempt to remove it if it happens to be
in the same frag as the new reply.
Fixes: #1774
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
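The difference between the two bounds can be sketched in Python, where bisect_left and bisect_right play the roles of C++ lower_bound and upper_bound on a sorted container. This is only an illustration of the bound semantics; the key list and the function name are hypothetical and do not come from the Ceph client.

```python
from bisect import bisect_left, bisect_right


def first_index_to_trim(sorted_keys, new_key, skip_equal):
    """Return the index of the first entry that a trim pass would start at.

    skip_equal=False mimics lower_bound: first entry >= new_key,
    so an entry equal to new_key is included in the trim.
    skip_equal=True mimics upper_bound: first entry > new_key,
    so an entry equal to new_key is left alone.
    """
    if skip_equal:
        return bisect_right(sorted_keys, new_key)
    return bisect_left(sorted_keys, new_key)
```

With an upper_bound-style lookup, an entry whose key equals the new one falls before the starting point and is therefore never considered for removal.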
Sage Weil [Tue, 10 Jan 2012 21:23:00 +0000 (13:23 -0800)]
osd: fail to peer if interval lacks any !incomplete replicas
We need at least one non-incomplete replica during a rw interval in order
to peer. The backfilling/incomplete replicas get log entries, but not
all object writes, so they are (mostly) excluded from the peering process
(find_best_info(), in particular).
We can't do this during the PriorSet calculation because we don't have
their PG::Info yet. But, once we get it, we need to make sure at least one
of the replicas during the last rw interval is not incomplete, or else we
should mark the pg DOWN (just like the PriorSet calculation does).
This logic mostly mirrors that of PriorSet, but additionally requires
the replicas be !incomplete.
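The extra requirement can be sketched roughly as follows. ReplicaInfo and can_peer are hypothetical names used for illustration; the real check operates on PG::Info inside the OSD and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ReplicaInfo:
    # Hypothetical stand-in for the per-replica info received during peering.
    osd: int
    incomplete: bool


def can_peer(last_rw_interval_replicas):
    """Return True if at least one replica from the last rw interval is
    not incomplete; otherwise the pg would be marked DOWN."""
    return any(not r.incomplete for r in last_rw_interval_replicas)
```

An interval that contains only incomplete (backfilling) replicas fails the check, as does an interval with no replicas at all.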
Greg Farnum [Tue, 10 Jan 2012 19:25:25 +0000 (11:25 -0800)]
mon: allow specifying pg_num and pgp_num when creating new pools.
Right now this is only exposed via the monitor command interface:
osd pool create <poolname> [pg_num [pgp_num]]
but it can be expanded to other interfaces as appropriate.
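A minimal sketch of how the optional arguments might be parsed is shown below. The function name, the default of 8 placement groups, and the choice to fall back to pg_num when pgp_num is omitted are all assumptions made for illustration; the commit does not state the defaults.

```python
def parse_pool_create(args, default_pg_num=8):
    """Parse: osd pool create <poolname> [pg_num [pgp_num]]

    default_pg_num is a hypothetical default, and pgp_num falling back
    to pg_num is an assumption of this sketch.
    """
    usage = "usage: osd pool create <poolname> [pg_num [pgp_num]]"
    if not 4 <= len(args) <= 6 or args[:3] != ["osd", "pool", "create"]:
        raise ValueError(usage)
    name = args[3]
    pg_num = int(args[4]) if len(args) > 4 else default_pg_num
    pgp_num = int(args[5]) if len(args) > 5 else pg_num
    return name, pg_num, pgp_num
```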
Greg Farnum [Tue, 10 Jan 2012 18:41:36 +0000 (10:41 -0800)]
mds: initiate monitor reconnect if beacon acks take too long
If it takes 2*mds_beacon_grace seconds (30 seconds total by default)
to get an ack back, maybe it's the monitor and not us. Try a reconnect,
which will just add the teensiest bit of load if we're wrong.
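The timeout test can be sketched as below. The function and parameter names are hypothetical; the 15-second grace value is inferred from the 30-second total mentioned above.

```python
def should_reconnect(now, last_acked_beacon, beacon_grace=15.0):
    """Return True once no beacon ack has arrived for 2 * beacon_grace seconds.

    now and last_acked_beacon are timestamps in seconds.
    """
    return now - last_acked_beacon > 2 * beacon_grace
```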
Alex Elder [Tue, 10 Jan 2012 02:13:41 +0000 (18:13 -0800)]
ceph: add a new "run_uml.sh" script to manage running a UML client
This script is used to automate most of what's required to run a
User-Mode Linux (UML) instance. This is mainly of interest for
ceph client developers who might benefit from the debugger access
that UML affords. It was written for ceph development but isn't
really dependent on ceph. It basically makes a few assumptions and
follows some conventions, and in doing so is able to encapsulate
most of the "tricky parts" of setting up to run a UML instance.
Sage Weil [Mon, 9 Jan 2012 00:23:55 +0000 (16:23 -0800)]
osd: populate_obc_watchers when object pulled to primary
We don't care about degraded state, only whether the object is on the
primary so that we can load the object_info_t.
In particular, this avoids problems with backfill, where an object is
not degraded and populated, is then degraded while we backfill to the
target, and then not degraded again, and populate_obc_watchers() is called
a second time.
Fixes: #1903
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Sat, 7 Jan 2012 01:18:01 +0000 (17:18 -0800)]
osd: clean up src_oid, src_obc map key calculation
Be consistent about how we generate the src_oid and src_oloc, so that we
feed a good value into find_object_context and use a consistent key for
the src_obc map<>. This fixes a crash in do_osd_ops() due to a missing
src_obc key when get_src_oloc() normalizes the key in do_op() but not
in do_osd_ops().
Also use a nicer name.
Fixes: #1897
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Fri, 6 Jan 2012 19:38:15 +0000 (11:38 -0800)]
objecter: ignore replies from old request attempts
If we know the request attempt, ignore old attempts.
If we do not know the attempt (because the server is old), accept the
reply. This could lead to doing some ACK callbacks we shouldn't in
extreme failure/recovery scenarios, but that is better than doing
the callbacks out of order.
Partially fixes: #1490
Signed-off-by: Sage Weil <sage@newdream.net>
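The filtering rule described above can be sketched as follows. The names are hypothetical, and None stands in for a reply from an old server that does not report the attempt number.

```python
def accept_reply(current_attempt, reply_attempt):
    """Decide whether a reply should be processed.

    reply_attempt is None when the server is too old to report which
    attempt it is answering; such replies are accepted.
    """
    if reply_attempt is None:
        return True
    # Drop replies that belong to an earlier attempt of the same request.
    return reply_attempt >= current_attempt
```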
Greg Farnum [Thu, 5 Jan 2012 23:29:32 +0000 (15:29 -0800)]
mon: elector needs to reset leader_acked on every election start
Otherwise you never reset the leader_acked after a failed
election attempt, so if mon 0 is available on the first round
but then fails, you never make progress!
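The shape of the fix can be sketched like this. The class below is a hypothetical stand-in, not the monitor's Elector; using -1 to mean "no leader acked" is an assumption of the sketch.

```python
class ElectorSketch:
    """Hypothetical model of the state that must be reset per election."""

    NO_LEADER = -1  # assumed sentinel for "no leader acked yet"

    def __init__(self):
        self.epoch = 0
        self.leader_acked = self.NO_LEADER

    def defer(self, rank):
        # Record that we acked `rank` as leader in this round.
        self.leader_acked = rank

    def start(self):
        # Begin a new election round and forget the previous ack, so a
        # leader that failed after being acked cannot block progress.
        self.epoch += 1
        self.leader_acked = self.NO_LEADER
```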
Greg Farnum [Thu, 5 Jan 2012 22:03:43 +0000 (14:03 -0800)]
mon: kill client sessions when we're not in quorum
After a timeout of 2*mon_lease (i.e., two election rounds),
kill existing client sessions so they can reconnect to a
monitor that's (hopefully) remained in the quorum. Let any
new client sessions stick around for a mon_lease interval, then
do the same to them.
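One way to picture the two deadlines is the sketch below. All names are hypothetical, and the mon_lease value of 5 seconds is an assumed figure used only to make the example concrete.

```python
def sessions_to_kill(now, out_of_quorum_since, sessions, mon_lease=5.0):
    """Return the names of client sessions whose deadline has passed.

    sessions maps a session name to the time it was opened.
    Sessions that existed when quorum was lost get 2 * mon_lease from
    that moment; sessions opened afterwards get one mon_lease each.
    """
    killed = []
    for name, opened in sessions.items():
        if opened <= out_of_quorum_since:
            deadline = out_of_quorum_since + 2 * mon_lease
        else:
            deadline = opened + mon_lease
        if now >= deadline:
            killed.append(name)
    return killed
```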
Sage Weil [Wed, 4 Jan 2012 21:21:36 +0000 (13:21 -0800)]
mon: rev cluster protocol
The OSDMap NEW and AUTOOUT bit additions subtly change the decoding of
the incremental maps in a reasonably harmless way in that the bits get
implicitly cleared whenever the OSD weight changes from non-zero. The
monitors need to agree on this behavior to avoid odd behavior. We don't
care what clients see, since those bits are informational only.
Sage Weil [Wed, 4 Jan 2012 20:56:15 +0000 (12:56 -0800)]
mon: track auto-marked-out osds
Mark OSDs that were automatically marked OUT by the monitor because they
were down for too long. Clear the bit as soon as they are no longer out,
i.e., as soon as the weight is changed from 0.
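The set-and-clear behaviour can be sketched as below. The class, the method names, and the numeric value of the AUTOOUT bit are hypothetical; only the rule itself comes from the description above.

```python
AUTOOUT = 0x1  # hypothetical bit value, for illustration only


class OsdStateSketch:
    def __init__(self, weight=1.0):
        self.weight = weight
        self.flags = 0

    def auto_mark_out(self):
        # The monitor marks the OSD out after it was down for too long.
        self.weight = 0.0
        self.flags |= AUTOOUT

    def set_weight(self, weight):
        self.weight = weight
        if weight != 0:
            # The OSD is no longer out, so the bit no longer applies.
            self.flags &= ~AUTOOUT
```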
Sage Weil [Wed, 4 Jan 2012 17:42:02 +0000 (09:42 -0800)]
osd: initialize backfill_pos on activate
Handling of writes depends on backfill_pos being initialized (to know what
is between the leading and trailing edge of the backfill), so it needs to
be initialized at activate time to avoid badness on writes prior to
recovery starting.
- initialize during activate to last_backfill
- update on receiving the digest to maintain the invariant that
backfill_pos = min(peer_backfill_info.start, backfill_info.start)
in recover_backfill().
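The two update points can be sketched as follows. The class is a hypothetical model, and plain strings stand in for object positions in sort order.

```python
class BackfillStateSketch:
    """Hypothetical model of when backfill_pos is set."""

    def __init__(self):
        self.backfill_pos = None

    def activate(self, last_backfill):
        # Initialize at activate time so writes that arrive before
        # recovery starts see a valid position.
        self.backfill_pos = last_backfill

    def on_digest(self, peer_backfill_start, backfill_start):
        # Maintain: backfill_pos = min(peer start, local start).
        self.backfill_pos = min(peer_backfill_start, backfill_start)
```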
Sage Weil [Sun, 1 Jan 2012 04:44:05 +0000 (20:44 -0800)]
osd: do not use incomplete peer for best info/log
For one, their stats are incomplete; if we use them we'll screw up everyone
else. For another, it doesn't do us any good if they are a bit ahead of
the peers: we/they may not even have the objects their newer log says were
updated. The only real use is if their log extends farther back in time,
but that is a problem in general that we'll eventually solve in other ways.
On the other hand, having the pg_stats sum only through last_backfill may
not have been the best choice; we could avoid that part of things by adding
an objects_backfilled field. But this is probably a good idea anyway.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>