Sage Weil [Thu, 12 Jan 2012 23:09:18 +0000 (15:09 -0800)]
osd: fill in empty item in peer_missing for strays
If we search_for_missing() on a host, make a corresponding entry in our
peer_missing map (if it isn't already there). This ensures we get (empty)
entries for strays, which makes all_unfound_are_queried_or_lost() happy.
Samuel Just [Thu, 12 Jan 2012 21:13:47 +0000 (13:13 -0800)]
ReplicatedPG: Do a write even for 0 length operation
Otherwise, a 0 length write to an offset past the end of the file will
cause the internal accounting to reflect the extended size of the file,
while the file on disk is not extended to match.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
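As a rough illustration of the 0 length write problem described above, the following toy model keeps a size counter separately from the backing data. ToyObject, accounted_size and the skip_empty flag are hypothetical names invented for this sketch, and it assumes that issuing the write extends the backing data to the write offset; it is not the ReplicatedPG code.

```python
class ToyObject:
    """Hypothetical object with separate size accounting and backing data."""

    def __init__(self):
        self.data = bytearray()   # stands in for the file on disk
        self.accounted_size = 0   # stands in for the internal accounting

    def write(self, offset, payload, skip_empty):
        end = offset + len(payload)
        # The accounting is always updated to cover the write range.
        self.accounted_size = max(self.accounted_size, end)
        if skip_empty and len(payload) == 0:
            # Assumed old behaviour: no write is issued for 0 length ops.
            return
        # Assumed new behaviour: issue the write, extending the data if needed.
        if end > len(self.data):
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload


old = ToyObject()
old.write(4096, b"", skip_empty=True)
print(old.accounted_size, len(old.data))   # accounting and data disagree

new = ToyObject()
new.write(4096, b"", skip_empty=False)
print(new.accounted_size, len(new.data))   # accounting and data agree
```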
Sage Weil [Thu, 12 Jan 2012 20:59:07 +0000 (12:59 -0800)]
qa/client/gen-1774.sh
Capture Alexandre's script for reproducing #1774 here for posterity, until
we write a properly harnessed test for this. Currently, workunits can't
mount/unmount, and we don't have a way to make ceph-fuse drop its cache.
Sage Weil [Thu, 12 Jan 2012 19:46:27 +0000 (11:46 -0800)]
osd: fix PG::Log::copy_up_to() tail
The tail needs to refer to the entry preceding the first entry in the
log. This updates copy_up_to() to match the basic structure of the other
copy_*() methods.
Sage Weil [Thu, 12 Jan 2012 19:07:02 +0000 (11:07 -0800)]
osd: reset last_complete on backfill restart
Since last_backfill is hobject_t(), we can set this equal to last_update.
This fixes a problem where last_complete precedes the abbreviated log we
send to the replica below.
Sage Weil [Thu, 12 Jan 2012 18:01:40 +0000 (10:01 -0800)]
COPYING: note licenses for all files, not just the default
This (mostly) copies debian/copyright for now, but there are format
restrictions for that file. Suggestions for a cleaner way to handle this
are welcome. In the meantime, this is better...
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Josh Durgin [Tue, 10 Jan 2012 22:16:41 +0000 (14:16 -0800)]
pg: add a configurable lower bound on log size
This helps prevent problems with retrying requests being detected as
duplicates. The problem occurs when the log is trimmed too
aggressively, and an earlier tid is removed from the log, while a
later one is not. The later request will be detected as a duplicate
and responded to immediately, possibly violating the ordering of the
requests.
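A minimal sketch of trimming with a lower bound on log size, assuming the log is an ordered sequence of (version, tid) pairs. trim_log, trim_to and min_entries are hypothetical names for this illustration and do not correspond to the actual PG log code or its config option.

```python
from collections import deque


def trim_log(log, trim_to, min_entries):
    """Drop entries with version <= trim_to, but never shrink below min_entries."""
    while len(log) > min_entries and log[0][0] <= trim_to:
        log.popleft()
    return log


# Ten entries, versions 1..10, each tagged with a made-up tid.
log = deque((v, "tid%d" % v) for v in range(1, 11))

# Without a lower bound, everything up to version 8 is dropped.
print(list(trim_log(deque(log), trim_to=8, min_entries=0)))

# With a lower bound of 5 entries, versions 6..10 survive, so earlier
# tids stay available for duplicate detection a little longer.
print(list(trim_log(deque(log), trim_to=8, min_entries=5)))
```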
Alexandre Oliva [Tue, 10 Jan 2012 03:41:45 +0000 (01:41 -0200)]
client: start caching readdir results after readdir_start
Use upper_bound rather than lower_bound to compute the initial pd within
insert_trace, so that we don't attempt to remove it if it happens to be
in the same frag as the new reply.
Fixes: #1774
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
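The difference between lower_bound and upper_bound in the readdir change above can be sketched with Python's bisect module, whose bisect_left and bisect_right behave like the two C++ lookups on a sorted sequence. The names list and the start key are made up for illustration; this does not model insert_trace or directory frags.

```python
from bisect import bisect_left, bisect_right


def lower_bound(keys, key):
    """Index of the first element >= key (like std::map::lower_bound)."""
    return bisect_left(keys, key)


def upper_bound(keys, key):
    """Index of the first element > key (like std::map::upper_bound)."""
    return bisect_right(keys, key)


names = ["a", "c", "e"]   # hypothetical sorted dentry names
start = "c"               # hypothetical readdir start key

# lower_bound lands on the start key itself, so the start entry would be
# treated as part of the range being replaced.
print(names[lower_bound(names, start)])

# upper_bound lands on the first entry strictly after the start key.
print(names[upper_bound(names, start)])
```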
Sage Weil [Tue, 10 Jan 2012 21:23:00 +0000 (13:23 -0800)]
osd: fail to peer if interval lacks any !incomplete replicas
We need at least one non-incomplete replica during a rw interval in order
to peer. The backfilling/incomplete replicas get log entries, but not
all object writes, so they are (mostly) excluded from the peering process
(find_best_info(), in particular).
We can't do this during the PriorSet calculation because we don't have
their PG::Info yet. But, once we get it, we need to make sure at least one
of the replicas during the last rw interval is not incomplete, or else we
should mark the pg DOWN (just like the PriorSet calculation does).
This logic mostly mirrors that of PriorSet, but additionally requires
the replicas be !incomplete.
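The check described above can be sketched roughly as follows, assuming each past interval records whether it may have gone rw and which OSDs were acting, and that the set of incomplete OSDs is known once their info has arrived. Interval, went_rw, acting and can_peer are hypothetical names for this sketch, not the actual PG or PriorSet code.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Interval:
    went_rw: bool      # whether writes may have happened in this interval
    acting: List[int]  # OSD ids that were acting during the interval


def can_peer(past_intervals: List[Interval], incomplete: Set[int]) -> bool:
    """True if the last rw interval has at least one non-incomplete replica."""
    for interval in reversed(past_intervals):
        if not interval.went_rw:
            continue
        # Only the most recent rw interval is examined in this sketch.
        return any(osd not in incomplete for osd in interval.acting)
    return True  # no rw interval at all, nothing to lose


intervals = [Interval(True, [0, 1]), Interval(False, [2]), Interval(True, [1, 2])]
print(can_peer(intervals, incomplete={1, 2}))  # would mark the pg DOWN
print(can_peer(intervals, incomplete={2}))     # osd 1 is complete, can peer
```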
Greg Farnum [Tue, 10 Jan 2012 19:25:25 +0000 (11:25 -0800)]
mon: allow specifying pg_num and pgp_num when creating new pools.
Right now this is only exposed via the monitor command interface:
osd pool create <poolname> [pg_num [pgp_num]]
but it can be expanded to other interfaces as appropriate.
Greg Farnum [Tue, 10 Jan 2012 18:41:36 +0000 (10:41 -0800)]
mds: initiate monitor reconnect if beacon acks take too long
If it takes 2*mds_beacon_grace seconds (30 seconds total by default)
to get an ack back, maybe it's the monitor and not us. Try a reconnect,
which will just add the teensiest bit of load if we're wrong.
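As a small illustration of the timeout above, the check might look like the following, assuming a 15 second grace (half of the 30 second total mentioned) and a strict comparison. should_reconnect and its parameters are hypothetical names, not the MDS beacon code.

```python
def should_reconnect(now, last_acked, beacon_grace=15.0):
    """True once no beacon ack has arrived for more than 2 * beacon_grace seconds."""
    return (now - last_acked) > 2 * beacon_grace


print(should_reconnect(now=100.0, last_acked=80.0))   # 20s without an ack
print(should_reconnect(now=100.0, last_acked=60.0))   # 40s without an ack
```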
Alex Elder [Tue, 10 Jan 2012 02:13:41 +0000 (18:13 -0800)]
ceph: add a new "run_uml.sh" script to manage running a UML client
This script is used to automate most of what's required to run a
User-Mode Linux (UML) instance. This is mainly of interest for
ceph client developers who might benefit from the debugger access
that UML affords. It was written for ceph development but isn't
really dependent on ceph. It basically makes a few assumptions and
follows some conventions, and in doing so is able to encapsulate
most of the "tricky parts" of setting up to run a UML instance.
Sage Weil [Mon, 9 Jan 2012 00:23:55 +0000 (16:23 -0800)]
osd: populate_obc_watchers when object pulled to primary
We don't care about degraded state, only whether the object is on the
primary so that we can load the object_info_t.
In particular, this avoids problems with backfill, where an object is
not degraded and populated, is then degraded while we backfill to the
target, and then not degraded again, and populate_obc_watchers() is called
a second time.
Fixes: #1903
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Sat, 7 Jan 2012 01:18:01 +0000 (17:18 -0800)]
osd: clean up src_oid, src_obc map key calculation
Be consistent about how we generate the src_oid and src_oloc, so that we
feed good values into find_object_context and use a consistent key for
the src_obc map<>. This fixes a crash in do_osd_ops() due to a missing
src_obc key when get_src_oloc() normalizes the key in do_op() but not
in do_osd_ops().
Also use a nicer name.
Fixes: #1897
Signed-off-by: Sage Weil <sage@newdream.net>
Sage Weil [Fri, 6 Jan 2012 19:38:15 +0000 (11:38 -0800)]
objecter: ignore replies from old request attempts
If we know the request attempt, ignore old attempts.
If we do not know the attempt (because the server is old), accept the
reply. This could lead to doing some ACK callbacks we shouldn't in
extreme failure/recovery scenarios, but that is better than doing
the callbacks out of order.
Partially fixes: #1490
Signed-off-by: Sage Weil <sage@newdream.net>
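The reply filtering rule for the objecter change above can be sketched as below, assuming the attempt number is an integer that grows with each resend and is None when the server is too old to report it. accept_reply and its parameter names are hypothetical, not the Objecter API.

```python
def accept_reply(current_attempt, reply_attempt):
    """Decide whether a reply should be processed for an in-flight request."""
    if reply_attempt is None:
        # Old server: the attempt is unknown, so accept the reply.
        return True
    # Known attempt: ignore replies that belong to an older attempt.
    return reply_attempt >= current_attempt


print(accept_reply(current_attempt=3, reply_attempt=2))     # stale, ignored
print(accept_reply(current_attempt=3, reply_attempt=3))     # current, accepted
print(accept_reply(current_attempt=3, reply_attempt=None))  # unknown, accepted
```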
Greg Farnum [Thu, 5 Jan 2012 23:29:32 +0000 (15:29 -0800)]
mon: elector needs to reset leader_acked on every election start
Otherwise you never reset the leader_acked after a failed
election attempt, so if mon 0 is available on the first round
but then fails, you never make progress!
Greg Farnum [Thu, 5 Jan 2012 22:03:43 +0000 (14:03 -0800)]
mon: kill client sessions when we're not in quorum
After a timeout of 2*mon_lease (i.e., two election rounds),
kill existing client sessions so they can reconnect to a
monitor that's (hopefully) remained in the quorum. Let any
new client sessions stick around for a mon_lease interval, then
do the same to them.