Greg Farnum [Fri, 10 Feb 2012 23:07:10 +0000 (15:07 -0800)]
mon: remove the last_consumed setting in Paxos
This was only ever used while initializing the Paxos machine, and it
doesn't need to be. Its existence is just an invitation to have races
between updating the stashed data and the stashed version.
Greg Farnum [Fri, 10 Feb 2012 23:02:03 +0000 (15:02 -0800)]
mon: handle inconsistent disk states on startup.
This lets us recover from an interrupted slurp while still noticing
other corruption issues. Rather than running init() and then
update_from_paxos() on each instance, we run init() and check
consistency. If it is consistent, we update_from_paxos as before. If
it is not, we do nothing and detect the slurping state
in handle_probe_reply(). (This assumes the disk was in a slurping state. If not, the
daemon crashes because something else went horribly wrong.)
While we're at it, remove unnecessary sets of first_committed. These
are done in the call to pax->trim_to().
Greg Farnum [Fri, 10 Feb 2012 18:42:24 +0000 (10:42 -0800)]
mon: add a slurping flag to the Paxos state
Set it before we start slurping, and clear it when we end slurping.
This allows us to differentiate between deliberately inconsistent
disk states, and broken disk states. Run simple checks in a new
is_consistent() call.
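A minimal sketch of how a slurping flag can separate a deliberately inconsistent disk state from a broken one. The struct, the enum, and the single version-ordering check are hypothetical stand-ins; the actual checks in is_consistent() are not spelled out in this log.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical on-disk summary; field names are assumptions for illustration.
struct DiskState {
  uint64_t first_committed;
  uint64_t last_committed;
  bool slurping;  // set before a slurp starts, cleared when it ends
};

enum class StartupAction { UpdateFromPaxos, ResumeSlurp, Abort };

// Assumed "simple check": the committed version range must be ordered.
bool is_consistent(const DiskState &s) {
  return s.first_committed <= s.last_committed;
}

// Decide what to do with the state found on disk at startup.
StartupAction classify(const DiskState &s) {
  if (is_consistent(s))
    return StartupAction::UpdateFromPaxos;
  // Inconsistent: tolerable only if we were interrupted mid-slurp.
  return s.slurping ? StartupAction::ResumeSlurp : StartupAction::Abort;
}
```

An inconsistent state with the flag set is treated as an interrupted slurp; the same state without the flag is treated as corruption.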
Greg Farnum [Fri, 10 Feb 2012 17:16:58 +0000 (09:16 -0800)]
mon: initialize paxos state in constructor
These should all be initialized in init() anyway
(except accepted_pn_from, which is set in collect and handle_collect),
but initializing them to safe defaults in the constructor provides
a safety net.
Greg Farnum [Wed, 8 Feb 2012 19:08:51 +0000 (11:08 -0800)]
mon: waitlist new sessions trying to connect while we're out of quorum
If we're stuck out of the quorum, we don't want clients connecting to
us. Instead, waitlist their requests: process them when we get into
a quorum, and check them every tick so we can toss them out
if we take too long to get into a quorum.
This change is smaller than it looks - most requests would
previously have been blocked anyway while waiting for Paxos to be
readable, and the ones that weren't (eg, one-time map subscribes)
should have been!
This also means we can tear out the cleanup code for new Sessions, so
tick() looks better.
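The waitlisting described above could be sketched roughly as follows. SessionWaitlist and its methods are hypothetical names, not the monitor's actual classes, and the timeout value is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical waitlist for requests that arrive while out of quorum.
class SessionWaitlist {
  struct Waiting {
    std::string request;
    double queued_at;  // seconds
  };
  std::deque<Waiting> waiting_;
  double timeout_;

public:
  explicit SessionWaitlist(double timeout) : timeout_(timeout) {}

  void enqueue(std::string request, double now) {
    waiting_.push_back(Waiting{std::move(request), now});
  }

  // Called every tick: drop requests that have waited longer than the timeout.
  // Entries are in arrival order, so expired ones are always at the front.
  std::size_t tick(double now) {
    std::size_t dropped = 0;
    while (!waiting_.empty() && now - waiting_.front().queued_at > timeout_) {
      waiting_.pop_front();
      ++dropped;
    }
    return dropped;
  }

  // Called on entering a quorum: hand back everything still waiting.
  std::vector<std::string> drain() {
    std::vector<std::string> ready;
    for (auto &w : waiting_)
      ready.push_back(std::move(w.request));
    waiting_.clear();
    return ready;
  }

  std::size_t size() const { return waiting_.size(); }
};
```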
Greg Farnum [Mon, 6 Feb 2012 23:31:20 +0000 (15:31 -0800)]
mon: make PaxosService::update_from_paxos return void.
You can't really recover from a failed update (as PGMonitor was trying
to do), and nothing in the system checks the return values.
So rip out the return values and change failed updates to an assert
failure (most of the other Monitors continue to have decode exceptions,
but we can keep the pretty that we have).
Sage Weil [Fri, 3 Feb 2012 17:27:47 +0000 (09:27 -0800)]
osd: reorder PG recovery_state initialization
The state machine state constructors print stuff to the logs, and the
PG::gen_prefix() includes all kinds of PG fields in that. Move the
recovery_state member to the end of the class so that all previous fields
are initialized when that happens. This makes valgrind shut up.
Some more deliberate tidying of the PG members and methods would also
be nice...
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Alexandre Oliva [Thu, 2 Feb 2012 19:41:46 +0000 (17:41 -0200)]
pick object from random osd for primary recovery
When recovering a primary, try the osds that have a copy of the object
in random order, rather than preferring the lowest-numbered.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Suggested-by: Samuel Just <samuel.just@dreamhost.com>
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
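As an illustration only, trying the candidate OSDs in random order can be sketched with std::shuffle. The helper name shuffled_sources is hypothetical, and the patch itself may select the source differently.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Hypothetical helper: given the ids of the OSDs that hold a copy of the
// object, return them in random order so that recovery does not always
// pull from the lowest-numbered one.
std::vector<int> shuffled_sources(std::vector<int> osds, std::mt19937 &rng) {
  std::shuffle(osds.begin(), osds.end(), rng);
  return osds;
}
```

The caller would then try each returned OSD in turn until one can serve the object.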
Jim Schutt [Wed, 1 Feb 2012 15:54:25 +0000 (08:54 -0700)]
common/Throttle: throttle in FIFO order
Under heavy write load from many clients, many reader threads will
be waiting in the policy throttler, all on a single condition variable.
When a wakeup is signalled, any of those threads may receive the
signal. This increases the variance in the message processing
latency, and in extreme cases can significantly delay a message.
This patch causes threads to exit a throttler in the same order
they entered.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
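One common way to get FIFO wakeups is to give each waiter its own condition variable and keep the waiters in a queue. The sketch below assumes that approach; FifoThrottle is a hypothetical class, not the actual common/Throttle code.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <iterator>
#include <list>
#include <mutex>

// Hypothetical FIFO throttle: waiters leave in the order they arrived.
class FifoThrottle {
  mutable std::mutex lock_;
  std::list<std::condition_variable *> waiters_;  // front = oldest waiter
  int64_t max_;
  int64_t current_ = 0;

  // Let an oversized request through when nothing else is in flight,
  // so that it cannot block forever.
  bool fits(int64_t c) const { return current_ == 0 || current_ + c <= max_; }

public:
  explicit FifoThrottle(int64_t max) : max_(max) {}

  // Block until we are the oldest waiter and `c` units fit under the limit.
  void get(int64_t c) {
    std::unique_lock<std::mutex> l(lock_);
    std::condition_variable cv;
    waiters_.push_back(&cv);
    auto me = std::prev(waiters_.end());
    while (waiters_.front() != &cv || !fits(c))
      cv.wait(l);
    waiters_.erase(me);
    current_ += c;
    // The next waiter may also fit; wake only that one.
    if (!waiters_.empty())
      waiters_.front()->notify_one();
  }

  // Release `c` units and wake the oldest waiter, if any.
  void put(int64_t c) {
    std::lock_guard<std::mutex> l(lock_);
    current_ -= c;
    if (!waiters_.empty())
      waiters_.front()->notify_one();
  }

  int64_t current() const {
    std::lock_guard<std::mutex> l(lock_);
    return current_;
  }
};
```

Because only the front waiter is ever signalled, a late arrival cannot overtake an earlier one, which is what bounds the latency variance described above.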
Sage Weil [Thu, 2 Feb 2012 05:08:31 +0000 (21:08 -0800)]
librados: discard incoming messages when DISCONNECTED
If we are disconnected (probably shutting down, if we are receiving a
message) then ignore anything incoming. This avoids passing it to
partially torn down subsystems like the objecter.
Sage Weil [Thu, 2 Feb 2012 05:07:40 +0000 (21:07 -0800)]
objecter: track whether initialized; add asserts
init() should be called when not initialized; shutdown() should not be
called unless initialized. No handle_* method should be called unless
initialized.
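The invariant amounts to a boolean guarded by asserts. A minimal sketch, using a hypothetical Component class rather than the real Objecter:

```cpp
#include <cassert>

// Hypothetical component that tracks whether it has been initialized.
class Component {
  bool initialized_ = false;
  int handled_ = 0;

public:
  void init() {
    assert(!initialized_);  // init() must not run twice
    initialized_ = true;
  }

  void shutdown() {
    assert(initialized_);  // shutdown() only after init()
    initialized_ = false;
  }

  // Stand-in for the handle_* methods.
  void handle_message() {
    assert(initialized_);  // no handlers before init() or after shutdown()
    ++handled_;
  }

  bool is_initialized() const { return initialized_; }
  int handled() const { return handled_; }
};
```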
Greg Farnum [Wed, 1 Feb 2012 21:25:37 +0000 (13:25 -0800)]
osd: add check_ops_in_flight()
By default it warns on requests that are more than 30 seconds old,
using an exponential backoff of that interval.
Also add state name retrieval to OpRequest.
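One reading of "exponential backoff of that interval" is that each op is warned about at 30 seconds, then 60, then 120, and so on. The sketch below assumes that reading; TrackedOp is a hypothetical stand-in for OpRequest, and this free function is not the OSD's actual implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an in-flight request.
struct TrackedOp {
  double received;    // arrival time, seconds
  int warn_multiple;  // starts at 1, doubles after each warning
};

// Warn about ops older than warn_age * warn_multiple and back off.
// Returns the number of ops warned about at time `now`.
int check_ops_in_flight(std::vector<TrackedOp> &ops, double now,
                        double warn_age = 30.0) {
  int warned = 0;
  for (auto &op : ops) {
    if (now - op.received > warn_age * op.warn_multiple) {
      ++warned;
      op.warn_multiple *= 2;  // next warning after twice the interval
    }
  }
  return warned;
}
```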
Greg Farnum [Mon, 30 Jan 2012 22:50:28 +0000 (14:50 -0800)]
osd: "mark" OpRequests as they move through the system.
Right now these are just informational flags which can be read out. Later
they might extend to timing information, separate lists for more precise
control over latency warnings, etc.
Sage Weil [Tue, 31 Jan 2012 21:00:45 +0000 (13:00 -0800)]
qa: test_backfill.sh: take osd.0 down
Mark this down to
1- trigger the WaitActingChange vs osd down race, and
2- help trigger a divergent log when osd.2 is blackholed+restarted during
backfill. e.g.,
Sage Weil [Tue, 31 Jan 2012 17:53:32 +0000 (09:53 -0800)]
osd: restart peering if requesting acting osd goes down
If we request an acting set, we need to restart peering if one of the
requested nodes goes down. This prevents a deadlock where we get stuck
in WaitActingChange because we have [a,b], want [a,b,c], but c is down and
our up and acting don't actually change.
Sage Weil [Tue, 31 Jan 2012 15:25:04 +0000 (07:25 -0800)]
osd: fix divergent backfill targets
During peering, a previous backfill target may have a slightly newer
last_update than the other options, but it will not be chosen because it
is incomplete. That caused a failed assert during activate() (#1983).
To fix, we remove the bad assert, and then fix merge_log() so that the
replica/backfill target will trim its divergent entries when it gets the
activation MLogRec. We also fix the handling of MInfoRec, as that can
trigger an analogous condition.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Tue, 31 Jan 2012 01:39:23 +0000 (17:39 -0800)]
filestore: implement filestore_blackhole hook
If true, we'll drop any new transactions on the floor. Useful for
triggering failure conditions (e.g., prior to killing ceph-osd itself, to
ensure some operations don't reach the local disk).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Mon, 30 Jan 2012 22:27:24 +0000 (14:27 -0800)]
osd: disable clone overlap for push/pull
There is a bug in the push/pull code. Disable the recovery smarts by
default until we fix #2002.
There is currently a race (in the callers) where:
- an adjacent clone is missing
- we (calculate some clone overlap? and) start pulling
- we get adjacent clone
- we get push, calc a different overlap, and then get confused.