Samuel Just [Wed, 25 May 2011 17:54:27 +0000 (10:54 -0700)]
PG: fix race in _activate_committed
Previously, _activate_committed would access the osdmap epoch racing
with handle_osd_map's osdmap update. This would allow a message to be
sent from a replica to the primary tagged with the same epoch as
last_warm_restart, though the event actually occurred before
last_warm_restart. Thus the primary would fail to ignore the event and
transition to crashed.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
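A minimal sketch of the kind of race described above, not the actual patch. It assumes the way out is to capture the epoch once, under a lock, when the event happens, and to tag the message with that value instead of re-reading the epoch later. MapEpoch, Notify and should_ignore are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

using epoch_t = uint32_t;

// Hypothetical stand-in for the OSD's current map epoch, guarded by a lock.
class MapEpoch {
  mutable std::mutex lock;
  epoch_t epoch = 0;
public:
  // What a map update would do: move the epoch forward, never backward.
  void advance(epoch_t e) {
    std::lock_guard<std::mutex> l(lock);
    if (e > epoch) epoch = e;
  }
  // Read the epoch under the same lock that protects updates.
  epoch_t get() const {
    std::lock_guard<std::mutex> l(lock);
    return epoch;
  }
};

// Hypothetical message, tagged with the epoch at which the event happened.
struct Notify { epoch_t epoch; };

// Primary side: events from before the restart epoch are ignored.
inline bool should_ignore(const Notify& m, epoch_t last_warm_restart) {
  return m.epoch < last_warm_restart;
}
```

If the tag is read after the map has already advanced, the message carries the newer epoch and the primary no longer ignores it, which is the failure described in the commit message.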
Sage Weil [Wed, 18 May 2011 04:29:33 +0000 (21:29 -0700)]
mds: do not shift to EXCL or MIX while rdlocked
There was an old change in file_eval() that was allowing us to switch from
SYNC to MIX or EXCL while there were rdlocks, which either caused lots of
lock thrashing or could (I think) hang things up completely. This was
from ea10a672, an ancient fix for something related that appears to have
taken out the rdlocked check by accident.
In my tests (one writer, one stat-er), this took things from long stalls
(up to 20 seconds) to very responsive stats. Yay!
Fixes: #791
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
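For illustration only, the restored check might look roughly like this. LockState, FileLock and eval_target are hypothetical simplifications and do not mirror the real file_eval() code.

```cpp
#include <cassert>

enum class LockState { SYNC, MIX, EXCL };

// Hypothetical, heavily simplified file lock.
struct FileLock {
  LockState state = LockState::SYNC;
  int num_rdlocks = 0;
  bool is_rdlocked() const { return num_rdlocks > 0; }
};

// Returns the state to move to. While readers hold a SYNC lock, a wanted
// shift to MIX or EXCL is deferred and the lock stays in SYNC.
inline LockState eval_target(const FileLock& l, LockState wanted) {
  if (l.state == LockState::SYNC && l.is_rdlocked() &&
      (wanted == LockState::MIX || wanted == LockState::EXCL))
    return l.state;
  return wanted;
}
```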
Sage Weil [Tue, 24 May 2011 16:47:06 +0000 (09:47 -0700)]
osd: add ability to explicitly mark unfound as lost
Instead of automatically marking unfound objects lost (once we've tried
every location we can think of), do it when the administrator explicitly
says to. This avoids marking things lost incorrectly when there are
peering issues, and also allows the administrator to decide whether there
may be offline osds that are worth bringing online.
Sage Weil [Tue, 24 May 2011 16:42:39 +0000 (09:42 -0700)]
osd: make automatic marking of unfound as lost optional
We may not want to do this automatically until we have more confidence in
the recovery code. Even then, possibly not. In particular, the OSDs may
believe they have contacted all possible homes for the data even though there
is some long-lost OSD that has the data on disk that is offline.
For now, we make the marking process explicit so that the administrator can
make the call.
Sage Weil [Fri, 20 May 2011 21:45:36 +0000 (14:45 -0700)]
osd: more heartbeat rework
A few things:
- track Connection* instead of entity_inst_t for hb peers
- we can only send maps over the cluster_messenger
- if peer is still alive, do that
- if peer is not, send dying MOSDPing ping with YOU_DIED flag
Sage Weil [Fri, 20 May 2011 19:55:29 +0000 (12:55 -0700)]
osd: rework peer map epoch caching
We try to keep track of which epochs our peers have so that we can be
semi-intelligent about which map incrementals we send preceding any
messages. Since this is useful from the heartbeat and cluster channels/
threads, protect the data with an inner lock and clean up the callers.
Be smarter about when we forget.
Make note of peer epoch when we receive a ping.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
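A rough sketch of a peer epoch cache protected by its own inner lock, so that both heartbeat and cluster threads can call it. PeerEpochCache and its methods are hypothetical names, and the forget rule shown (only drop an entry that is not newer than what the caller knows) is an assumption about what "smarter" means here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>

using epoch_t = uint32_t;

class PeerEpochCache {
  std::mutex lock;                 // inner lock; callers need no other lock
  std::map<int, epoch_t> epochs;   // peer osd id -> newest epoch we know it has
public:
  // Record that the peer has at least epoch e (e.g. on receiving a ping).
  // The cached value never moves backward. Returns the cached value.
  epoch_t note(int peer, epoch_t e) {
    std::lock_guard<std::mutex> l(lock);
    epoch_t& cur = epochs[peer];
    if (e > cur) cur = e;
    return cur;
  }
  // Returns 0 when nothing is known about the peer.
  epoch_t get(int peer) {
    std::lock_guard<std::mutex> l(lock);
    auto it = epochs.find(peer);
    return it == epochs.end() ? 0 : it->second;
  }
  // Forget the peer only if our cached epoch is not newer than as_of.
  void forget(int peer, epoch_t as_of) {
    std::lock_guard<std::mutex> l(lock);
    auto it = epochs.find(peer);
    if (it != epochs.end() && it->second <= as_of)
      epochs.erase(it);
  }
};
```

A sender would then only need to include incrementals newer than get(peer) before its message.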
Sage Weil [Fri, 20 May 2011 17:42:16 +0000 (10:42 -0700)]
osd: do not clobber explicitly requested heartbeat_to target address
Consider peer P.
- P goes down in, say, epoch 60, and back up in epoch 70
- P requests a heartbeat, as_of 70
- We update to map 50, and coincidentally add the same peer as a target
- We set the heartbeat_to[P] = 50 and start sending to the _old_ address
- P marks us down because we stop sending to the new addr
- We eventually get map 70, but it's too late!
Make sure we preserve any _to targets _and_ their epoch+inst.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
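The scenario above can be sketched as follows, assuming each target remembers the epoch and address it was recorded with. HbTarget and add_target_from_map are hypothetical names, and a plain string stands in for the peer's inst.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using epoch_t = uint32_t;

// Hypothetical heartbeat target: the epoch it is valid as of, plus its address.
struct HbTarget {
  epoch_t as_of;
  std::string addr;
};

// Add or refresh a target derived from the map at map_epoch, but never
// replace an entry that was recorded as of the same or a newer epoch.
inline void add_target_from_map(std::map<int, HbTarget>& heartbeat_to,
                                int peer, epoch_t map_epoch,
                                const std::string& addr) {
  auto it = heartbeat_to.find(peer);
  if (it != heartbeat_to.end() && it->second.as_of >= map_epoch)
    return;  // keep the explicitly requested, newer target
  heartbeat_to[peer] = HbTarget{map_epoch, addr};
}
```

With this rule, processing map 50 after P's as_of 70 request leaves the epoch 70 address in place.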
Sage Weil [Fri, 20 May 2011 16:29:10 +0000 (09:29 -0700)]
osd: request proper log extent for missing
We can't blindly ask for everything since last_epoch_started because that
may mean we get some fragment of a backlog. Look at the peer's log
ranges and request the correct thing. Also, in fulfill_log, infer what
the primary should have asked for if they make a bad request.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 20 May 2011 07:27:00 +0000 (00:27 -0700)]
osd: take remote log when it is clearly superior
I'm hitting a case where the primary is compensating for a replica's
last_complete < log.tail by sending a log+backlog, but the replica
isn't smart enough to take advantage. In this case,
replica: log(781'26629,781'26631]
from primary: log(781'26629,781'26631]+backlog
result: log(781'26629,781'26631]
Doh!
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 20 May 2011 07:14:24 +0000 (00:14 -0700)]
osd: fix compensation for bad last_complete
If the peer has a last_complete below their tail, we can get by with our
log (without backlog) if our tail is _before_ their last_complete, not
after. Otherwise, we need a backlog!
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 20 May 2011 06:40:12 +0000 (23:40 -0700)]
osd: include past acting osds if they were up
This fixes a bug where we were excluding up (but not acting) nodes from
past intervals, which in turn was triggering a nasty choose_acting loop
(because we _do_ already include acting but !up from the current
interval).
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Josh Durgin [Thu, 19 May 2011 21:31:30 +0000 (14:31 -0700)]
PG: choose_log_location: prefer OSDs with a backlog
Without preferring an OSD with a backlog, PGs would get stuck in the
active state when acting != up and the backlog was on an OSD with the
same last_update but a lower number or log_tail.
Josh Durgin [Wed, 18 May 2011 22:54:06 +0000 (15:54 -0700)]
PG: choose acting set and newest_update_osd based on a map of all osds
newest_update osd should be stable when the primary changes, to
prevent cycles of acting set choices. For the same reason, we should
not treat the primary as a special case in choose_acting.
Also remove the magic -1 from representing the current primary.
Josh Durgin [Wed, 18 May 2011 23:15:28 +0000 (16:15 -0700)]
PG: GetLog: don't fail if we get an outdated log
If we request a log from one osd, and then another member of our prior
set comes up with a later last_update, we should not fail when we
receive the first log.
Samuel Just [Tue, 17 May 2011 22:59:32 +0000 (15:59 -0700)]
PG: make choose_acting a bit smarter
This change allows old strays that don't need backlogs
to stay acting until current members of the up set are caught up.
This allows the up set to maintain its full size during peering.
Signed-off-by: Josh Durgin <josh.durgin@dreamhost.com>
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 18 May 2011 01:26:46 +0000 (18:26 -0700)]
page: redefine PAGE_* macros
Saw this on sid i386:
msg/SimpleMessenger.cc: In function 'void alloc_aligned_buffer(ceph::bufferlist&, int, int)':
msg/SimpleMessenger.cc:1782:14: error: '__sysconf' was not declared in this scope
msg/SimpleMessenger.cc:1789:23: error: '__sysconf' was not declared in this scope
Some header is clobbering our PAGE_* macros. Make our header more
forceful.
Josh Durgin [Wed, 18 May 2011 00:36:39 +0000 (17:36 -0700)]
PG: update same_acting_since when acting or up changes
This is a hack, since we currently use same_up_since to denote the beginning of an interval.
We should probably change this usage or rename it to same_interval_since.
Sage Weil [Tue, 17 May 2011 17:10:45 +0000 (10:10 -0700)]
msgr: avoid clearing connection_state on pipe replacement
read_message and write_message both dereference connection_state, so avoid
clearing it when replacing a pipe.
read_message still uses it to find rx_buffers in ways that may interfere
when two Pipes reference the connection, but currently that is only used
for lossy pipes. We could still take pipe_lock in that case, but it is
only an optimization (we copy the data if the buffers don't get used
directly) and probably not worth bothering with.
Sage Weil [Fri, 13 May 2011 20:01:52 +0000 (13:01 -0700)]
osd: lazily close connections to down peers
If we hear from a peer that should be dead, tell them, but mark our
connection so that it will close after that message is delivered or if
it encounters any errors.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 13 May 2011 20:01:08 +0000 (13:01 -0700)]
msgr: mark_down_on_empty and mark_disposable
Mark a connection to close when messages are sent, and to close on any
error. We can use this to tell people who should be dead that they should
be dead, but not waste resources reconnecting to them.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
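A toy model of the two markings, not the messenger's real code. Conn and its fields are hypothetical. Closing immediately when the queue is already empty is an assumption of this sketch.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical, single-threaded stand-in for a connection and its out queue.
struct Conn {
  std::deque<std::string> out_q;
  bool close_on_empty = false;
  bool disposable = false;
  bool open = true;

  // Close once everything queued has been sent.
  void mark_down_on_empty() {
    close_on_empty = true;
    if (out_q.empty()) open = false;  // assumption: nothing left to deliver
  }
  // Close on any error instead of reconnecting.
  void mark_disposable() { disposable = true; }

  // Deliver one queued message, then close if the queue drained.
  void send_one() {
    if (!open || out_q.empty()) return;
    out_q.pop_front();
    if (out_q.empty() && close_on_empty) open = false;
  }
  // A non-disposable connection would try to reconnect; this one stays open.
  void on_error() {
    if (disposable) open = false;
  }
};
```

Queuing the "you should be dead" message and then applying both marks gives the behaviour described: the message goes out if it can, and no resources are spent reconnecting afterwards.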
Samuel Just [Sat, 14 May 2011 00:30:50 +0000 (17:30 -0700)]
PG: Only pull the master log from a member of the prior_set
There must be a member of the prior_set such that no other
osd has a more recent last_update. This way, prior_set_affected
will ensure that we reset peering if the master log source
goes down.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Greg Farnum [Wed, 11 May 2011 23:52:40 +0000 (16:52 -0700)]
Objecter: switch handle_osd_map op resending around
We need to order the resend by tid. We could do that in a
set with a special-purpose comparison function, but just
switching to a map is easier.
Use a list for LingerOps, as those should also
be ordered but don't maintain tids like regular Ops do.
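A small sketch of why a map keyed by tid is enough: std::map iterates in ascending key order, so walking it resends in tid order no matter how the ops were collected. Op here is a hypothetical minimal struct and resend_order is not a real Objecter function.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical minimal op: only the transaction id matters here.
struct Op {
  uint64_t tid;
};

// Collect ops into a map keyed by tid; iterating the map then yields
// them in ascending tid order regardless of the input order.
inline std::vector<uint64_t> resend_order(const std::vector<Op*>& need_resend) {
  std::map<uint64_t, Op*> by_tid;
  for (Op* op : need_resend)
    by_tid[op->tid] = op;
  std::vector<uint64_t> order;
  for (const auto& p : by_tid)
    order.push_back(p.first);
  return order;
}
```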
Greg Farnum [Wed, 11 May 2011 23:52:24 +0000 (16:52 -0700)]
Objecter: implement operator<.
This will maintain ordering of Ops when they're in e.g. STL sets.
Previously Objecter::handle_osd_map would indiscriminately fire out
Op replays in an order determined by their pointer address! Obviously,
this could cause breakage.
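To illustrate the difference, a sketch assuming Ops are ordered by tid. A std::set of raw pointers with the default comparator orders by address, so this version passes a comparator that dereferences and uses operator<. OpPtrLess and replay_order are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical minimal op, ordered by its transaction id.
struct Op {
  uint64_t tid;
  bool operator<(const Op& other) const { return tid < other.tid; }
};

// Comparator for pointer containers: compare the pointed-to ops,
// not the pointer values.
struct OpPtrLess {
  bool operator()(const Op* a, const Op* b) const { return *a < *b; }
};

// Walk the set in its sorted order and report the tids visited.
inline std::vector<uint64_t> replay_order(const std::set<Op*, OpPtrLess>& ops) {
  std::vector<uint64_t> order;
  for (const Op* op : ops)
    order.push_back(op->tid);
  return order;
}
```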
Sage Weil [Wed, 11 May 2011 20:11:57 +0000 (13:11 -0700)]
osd: trigger a store snapshot when the osdmap says to
Move the OSDMap decoding up a bit so that we can either snapshot or flush.
We can't do it after we take map_lock or else we'll have problems dropping
and retaking osd_lock.