Sage Weil [Wed, 25 May 2011 21:54:15 +0000 (14:54 -0700)]
mds: fix canceled lock attempt
If a client tries to lock a file, has to wait, and then cancels the
attempt, the client will send an unlock request to unwind its state.
- the unlock now removes the waiting lock attempt from the wait list
- when the lock request retries and finds it is no longer on the wait
  list, it fails
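A minimal sketch of the idea, with made-up types (the real wait list
lives in the MDS locker code):

    #include <algorithm>
    #include <list>

    // Hypothetical stand-ins for the MDS file lock wait list.
    struct LockRequest { int client_id; };

    struct FileLockState {
      std::list<LockRequest*> waiting;

      // Cancel path: the unlock request unwinds the waiter.
      void cancel(LockRequest *req) {
        waiting.remove(req);
      }

      // Retry path: a request that is no longer on the wait list
      // was canceled in the meantime, so it must fail.
      bool retry_ok(LockRequest *req) const {
        return std::find(waiting.begin(), waiting.end(), req)
               != waiting.end();
      }
    };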
Samuel Just [Wed, 25 May 2011 17:54:27 +0000 (10:54 -0700)]
PG: fix race in _activate_committed
Previously, _activate_committed would access the osdmap epoch, racing
with handle_osd_map's osdmap update. This would allow a message to be
sent from a replica to the primary tagged with the same epoch as
last_warm_restart even though the event actually occurred before
last_warm_restart. The primary would then fail to ignore the event and
would incorrectly transition to crashed.
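The shape of the fix, sketched with hypothetical names (the real code
is in PG.cc): capture the epoch when the commit is queued instead of
reading the possibly-updated osdmap when the callback fires.

    #include <cstdint>
    #include <mutex>

    struct PG {
      std::mutex lock;
      uint64_t osdmap_epoch = 0;   // advanced by handle_osd_map()

      // Capture the epoch up front and pass it through to the
      // commit callback, so a concurrent map update cannot make
      // the replica's message look newer than it really is.
      uint64_t queue_activate() {
        std::lock_guard<std::mutex> l(lock);
        return osdmap_epoch;       // epoch to tag the message with
      }

      void _activate_committed(uint64_t epoch_at_queue_time) {
        // use epoch_at_queue_time, not the current osdmap epoch
      }
    };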
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
Sage Weil [Wed, 18 May 2011 04:29:33 +0000 (21:29 -0700)]
mds: do not shift to EXCL or MIX while rdlocked
There was an old change in file_eval() that allowed us to switch from
SYNC to MIX or EXCL while there were rdlocks, which either caused lots
of lock thrashing or could (I think) hang things up completely. It
dates from ea10a672, an ancient fix for something related that appears
to have removed the rdlocked check by accident.
In my tests (one writer, one stat-er), this took things from long stalls
(up to 20 seconds) to very responsive stats. Yay!
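Conceptually, the restored check looks like this (hypothetical,
simplified names; the real logic is in the MDS locker's file_eval()):

    // Do not leave SYNC while readers hold the lock.
    enum LockState { SYNC, MIX, EXCL };

    struct FileLock {
      LockState state = SYNC;
      int num_rdlocks = 0;
      bool is_rdlocked() const { return num_rdlocks > 0; }
    };

    void file_eval(FileLock &lock, bool want_excl) {
      if (lock.state == SYNC && lock.is_rdlocked())
        return;  // switching now would thrash or stall readers
      lock.state = want_excl ? EXCL : MIX;
    }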
Fixes: #791
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
If "rados put" uses write instead of write_full, the resulting object on
the server may be a mismash of old and new objects, if the old object
was longer than the new one. This is fairly counterintuitive behavior
for radostool, so remove it.
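For librados users the distinction looks like this (a sketch against
the C++ API; assumes an IoCtx already connected to a pool, and error
handling is elided):

    #include <rados/librados.hpp>

    void put_object(librados::IoCtx &io, librados::bufferlist &bl) {
      // write() only overwrites bl.length() bytes at the given
      // offset; if the existing object is longer, its tail
      // survives and the result mixes old and new data.
      io.write("myobject", bl, bl.length(), 0);

      // write_full() replaces the whole object, which is what
      // "rados put" should mean.
      io.write_full("myobject", bl);
    }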
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Tue, 24 May 2011 16:47:06 +0000 (09:47 -0700)]
osd: add ability to explicitly mark unfound as lost
Instead of automatically marking unfound objects lost (once we've tried
every location we can think of), do it when the administrator explicitly
says to. This avoids incorrectly marking objects lost when there are
peering issues, and also allows the administrator to decide whether there
may be offline osds that are worth bringing online.
Sage Weil [Tue, 24 May 2011 16:42:39 +0000 (09:42 -0700)]
osd: make automatically marking of unfound as lost optional
We may not want to do this automatically until we have more confidence in
the recovery code. Even then, possibly not. In particular, the OSDs may
believe they have contacted all possible homes for the data even though
there is some long-lost OSD that has the data on disk but is offline.
For now, we make the marking process explicit so that the administrator can
make the call.
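A sketch of the explicit-vs-automatic gate for marking unfound objects
lost (option and member names here are made up for illustration):

    struct UnfoundPolicy {
      bool all_locations_probed = false;    // tried everything?
      bool auto_mark_unfound_lost = false;  // optional, off by default

      bool may_mark_lost(bool admin_requested) const {
        if (!all_locations_probed)
          return false;           // still places left to look
        return admin_requested || auto_mark_unfound_lost;
      }
    };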
A CephContext represents the context held by a single library user.
There can be multiple CephContexts in the same process.
For daemons and utility programs, there will be only one CephContext.
The CephContext contains the configuration, the dout object, and
anything else that you might want to pass to libcommon with every
function call.
Move some non-config things out of md_config_t and into CephContext.
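Roughly the shape being introduced (a sketch; the real CephContext has
many more members and different details):

    #include <map>
    #include <string>

    struct md_config_t {          // the configuration
      std::map<std::string, std::string> values;
    };
    struct DoutState { /* per-context debug output */ };

    class CephContext {
    public:
      md_config_t _conf;          // this library user's configuration
      DoutState _dout;            // this library user's log stream
      // ... anything else passed to libcommon on every call
    };

    // Two library users in one process, two independent contexts:
    //   CephContext *a = new CephContext;
    //   CephContext *b = new CephContext;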
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Split common_init_daemonize from common_init_finish
Split off common_init_daemonize from common_init_finish. cfuse is a
daemon that calls common_init_finish, but handles daemonization itself.
This fixes cfuse.
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Get rid of the initialize-then-shutdown-crypto hack. We just initialize
crypto once, after it is safe to do so. There is now a single callback,
common_init_finish, which does the final stage of initialization,
including starting crypto and daemonization (if required).
common_init_finish needs to be done before messenger::start().
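The resulting startup order for a daemon, sketched (only the function
names come from this log; everything else is assumption):

    void daemon_main() {
      // 1. early init: parse config, build the CephContext
      // 2. common_init_finish(cct)  -- starts crypto, daemonizes
      //    (a program like cfuse that daemonizes itself calls
      //     common_init_daemonize() separately, on its own schedule)
      // 3. messenger->start()       -- safe only after step 2
    }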
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
Sage Weil [Fri, 20 May 2011 21:45:36 +0000 (14:45 -0700)]
osd: more heartbeat rework
A few things:
- track Connection* instead of entity_inst_t for hb peers
- we can only send maps over the cluster_messenger:
  - if the peer is still alive there, send the map that way
  - if the peer is not, send a dying MOSDPing with the YOU_DIED flag
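Sketch of the bookkeeping change, with simplified stand-in types:

    #include <map>
    #include <memory>

    // Before: heartbeat peers were tracked by address
    // (entity_inst_t); after: by open connection, so messages go
    // to the peer's current incarnation.
    struct Connection { /* an open session to a peer */ };
    using ConnectionRef = std::shared_ptr<Connection>;

    struct OSDHeartbeat {
      std::map<int, ConnectionRef> peer_conn;  // osd id -> connection

      void note_peer(int osd, ConnectionRef con) {
        peer_conn[osd] = std::move(con);
      }
    };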
Sage Weil [Fri, 20 May 2011 19:55:29 +0000 (12:55 -0700)]
osd: rework peer map epoch caching
We try to keep track of which epochs our peers have so that we can be
semi-intelligent about which map incrementals we send preceding any
messages. Since this is useful from both the heartbeat and cluster
channels/threads, protect the data with an inner lock and clean up the
callers.
Be smarter about when we forget.
Make note of peer epoch when we receive a ping.
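Sketch of the cache (hypothetical names; the point is the inner lock
shared by the heartbeat and cluster threads):

    #include <cstdint>
    #include <map>
    #include <mutex>

    class PeerEpochs {
      std::mutex lock;                     // inner lock
      std::map<int, uint64_t> peer_epoch;  // osd id -> last known epoch
    public:
      void note(int osd, uint64_t e) {
        std::lock_guard<std::mutex> l(lock);
        uint64_t &cur = peer_epoch[osd];
        if (e > cur)
          cur = e;                         // only move forward
      }
      uint64_t get(int osd) {
        std::lock_guard<std::mutex> l(lock);
        auto it = peer_epoch.find(osd);
        return it == peer_epoch.end() ? 0 : it->second;
      }
    };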
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 20 May 2011 17:42:16 +0000 (10:42 -0700)]
osd: do not clobber explicitly requested heartbeat_to target address
Consider peer P.
- P goes down in, say, epoch 60, and comes back up in epoch 70
- P requests a heartbeat, as_of 70
- We update to map 50, and coincidentally add the same peer as a target
- We set heartbeat_to[P] = 50 and start sending to the _old_ address
- P marks us down because we stop sending to the new addr
- We eventually get map 70, but it's too late!
Make sure we preserve any _to targets _and_ their epoch+inst.
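In sketch form (simplified types), the update must be epoch-guarded:

    #include <cstdint>
    #include <map>

    struct HBTarget { uint64_t epoch; /* plus addr/inst */ };

    void add_heartbeat_target(std::map<int, HBTarget> &to,
                              int peer, HBTarget t) {
      auto it = to.find(peer);
      if (it != to.end() && it->second.epoch >= t.epoch)
        return;  // existing entry is as new or newer; keep it
      to[peer] = t;   // otherwise take the new epoch+inst
    }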
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 20 May 2011 16:29:10 +0000 (09:29 -0700)]
osd: request proper log extent for missing
We can't blindly ask for everything since last_epoch_started because that
may mean we get some fragment of a backlog. Look at the peer's log
ranges and request the correct thing. Also, in fulfill_log, infer what
the primary should have asked for if it makes a bad request.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 20 May 2011 07:27:00 +0000 (00:27 -0700)]
osd: take remote log when it is clearly superior
I'm hitting a case where the primary is compensating for a replica's
last_complete < log.tail by sending a log+backlog, but the replica
isn't smart enough to take advantage. In this case,
  replica:       log(781'26629,781'26631]
  from primary:  log(781'26629,781'26631]+backlog
  result:        log(781'26629,781'26631]
Doh!
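One plausible reading of "clearly superior", sketched with simplified
log bounds: the remote log covers at least our range and also carries a
backlog that we lack.

    #include <cstdint>

    struct LogInfo {
      uint64_t tail, head;
      bool backlog;
    };

    bool remote_is_superior(const LogInfo &ours, const LogInfo &theirs) {
      return theirs.tail <= ours.tail &&
             theirs.head >= ours.head &&
             theirs.backlog && !ours.backlog;
    }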
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 20 May 2011 07:14:24 +0000 (00:14 -0700)]
osd: fix compensation for bad last_complete
If the peer has a last_complete below their tail, we can get by with our
log (without backlog) only if our tail is _before_ their last_complete,
not after. Otherwise, we need a backlog!
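Sketched with simplified names, the corrected test is just:

    #include <cstdint>

    // Our plain log reaches far enough back only if our tail is at
    // or before the peer's last_complete; otherwise send a backlog.
    bool need_backlog(uint64_t our_log_tail,
                      uint64_t peer_last_complete) {
      return our_log_tail > peer_last_complete;
    }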
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Fri, 20 May 2011 06:40:12 +0000 (23:40 -0700)]
osd: include past acting osds if they were up
This fixes a bug where we were excluding up (but not acting) nodes from
past intervals, which in turn was triggering a nasty choose_acting loop
(because we _do_ already include acting but !up from the current
interval).
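Sketch of the membership rule (simplified): an OSD belongs to a past
interval if it was acting _or_ up during it.

    #include <set>

    struct Interval {
      std::set<int> acting, up;

      bool involves(int osd) const {
        return acting.count(osd) || up.count(osd);
      }
    };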
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Sage Weil [Thu, 19 May 2011 22:03:13 +0000 (15:03 -0700)]
client: hold FILE_BUFFER ref while waiting for dirty throttle
We may block in the write path because we've reached our dirty data limit.
Hold a reference to the FILE_BUFFER cap during that interval so we don't
lose the cap and put new dirty buffers into the objectcacher out of turn.
(We could also recheck our ability to take the ref after blocking, but I
think this is cleaner.)
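A sketch of the write path with hypothetical helper names (the real
client takes a reference on the FILE_BUFFER cap):

    struct Inode { int cap_refs = 0; };

    void get_cap_ref(Inode *in) { ++in->cap_refs; }  // pin the cap
    void put_cap_ref(Inode *in) { --in->cap_refs; }  // release it

    void write_path(Inode *in) {
      get_cap_ref(in);    // take the FILE_BUFFER ref up front
      // ... may block here waiting under the dirty data limit;
      // the held ref keeps the cap from being lost meanwhile ...
      // ... queue new dirty buffers in the objectcacher ...
      put_cap_ref(in);    // drop it once the buffers are queued
    }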
Sage Weil [Thu, 19 May 2011 22:00:34 +0000 (15:00 -0700)]
client: assert(in) on _flush
We should never arrive in _flush() and not have a reference to the inode
in question, because the presence of dirty buffers pins the inode. This
condition was introduced forever ago; clean it out.