Greg Farnum [Fri, 26 Feb 2010 19:28:25 +0000 (11:28 -0800)]
msgr: Remove the type CEPH_ENTITY_TYPE_ADMIN.
It *looks* like this won't break EntityName's isAdmin() function as that
depends on a set name, and client.admin will satisfy it. I think.
Greg Farnum [Fri, 26 Feb 2010 18:57:37 +0000 (10:57 -0800)]
ceph_fs: Split CEPH_FEATURE_{SUPPORTED, REQUIRED} flags into service-based flags
msgr: New get_{required,supported}_bits methods which calculate required bits
based on type of self and of peer. Replace all hard-coded flag uses with these.
Sage Weil [Tue, 2 Mar 2010 00:34:17 +0000 (16:34 -0800)]
mds: put forced open sessions in OPENING then OPEN
We use OPENING state to indicate sessions that are being
imported. Fix get_or_add_open_session() to NOT set the session
state (except to STATE_NEW if new) so that the caller can do
the right thing. Otherwise, prepare_force_open_sessions()
can't tell if it just forced open a session (and needs it to be
OPENING) or if it was already open. Subsequently cap migrations
weren't working if the client didn't already have a session
open.
There is still a bug: if the import aborts, we have an OPENING
session with no actual open client_session message queued. Maybe
we should have a different state instead of OPENING... IMPORTING?
Sage Weil [Fri, 26 Feb 2010 20:30:07 +0000 (12:30 -0800)]
osd: use onreadable_sync finishers to drop ondisk locks
This fixes a deadlock where we hold pg->lock and block waiting
for the ondisk lock, but the unlock completion is stuck behind
something else waiting on pg->lock in the finisher queue.
Sage Weil [Fri, 26 Feb 2010 20:29:17 +0000 (12:29 -0800)]
filestore: add onreadable_sync callback
Add an additional completion context that gets called
synchronously when the operation completes, instead of getting
shunted to the async finisher thread. This allows us to make
sure certain completion events happen without getting 'stuck
in line' behind other completions with conflicting locks.
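
A minimal sketch of the idea, not the FileStore code. Finisher,
apply_op and demo_order are hypothetical names, and the finisher here
is a plain queue drained by hand rather than a thread. The sync
callback runs inline when the operation completes; the async one waits
behind whatever is already queued.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the async finisher: callbacks are queued and
// only run when drain() is called.
struct Finisher {
    std::deque<std::function<void()>> queue;

    void enqueue(std::function<void()> cb) {
        assert(cb);  // an empty callback would throw when invoked
        queue.push_back(std::move(cb));
    }

    void drain() {
        while (!queue.empty()) {
            std::function<void()> cb = std::move(queue.front());
            queue.pop_front();
            cb();
        }
    }
};

// Complete an operation: run the sync callback inline, then hand the async
// callback to the finisher, where it waits its turn.
inline void apply_op(Finisher& fin,
                     std::function<void()> onreadable_sync,
                     std::function<void()> onreadable) {
    // ... the operation itself would be applied here ...
    if (onreadable_sync)
        onreadable_sync();
    if (onreadable)
        fin.enqueue(std::move(onreadable));
}

// Record the order in which the callbacks fire when another completion is
// already sitting in the finisher queue.
inline std::vector<std::string> demo_order() {
    std::vector<std::string> log;
    Finisher fin;
    fin.enqueue([&] { log.push_back("earlier completion"); });
    apply_op(fin,
             [&] { log.push_back("sync"); },
             [&] { log.push_back("async"); });
    fin.drain();
    return log;
}
```

In demo_order() the sync callback fires before the completion that was
queued earlier, which is the property the ondisk unlock relies on.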
Sage Weil [Fri, 26 Feb 2010 19:10:29 +0000 (11:10 -0800)]
objectstore: conflate onjournal and ondisk
No callers actually make this distinction, and the ObjectStore
hides the details of if/whether/how things go to the journal
or disk or in what order, so simplify things all around.
Sage Weil [Thu, 25 Feb 2010 23:50:09 +0000 (15:50 -0800)]
mds: revise mds sessionmap encoding [disk format change]
Encode session name before session itself, so that we can use
an existing session instead of allocating a new one. This lets
us keep eagerly reconnecting clients that connect before we
load the sessionmap.
Add proper struct_v. Drop useless/incorrect 'n' value.
Continue to read old format, of course. Some minor hackery
because we didn't have a struct_v before.
Sage Weil [Thu, 25 Feb 2010 19:48:14 +0000 (11:48 -0800)]
mds: fix trim_non_auth empty lru case
If our lru is empty, make sure we clean things out _after_
unpinning the subtrees! This came up after an mds leaving the
cluster crashed before it finished, and on replay/rejoin had
no auth subtrees.
Sage Weil [Thu, 25 Feb 2010 18:40:25 +0000 (10:40 -0800)]
osd: detect permanently lost objects, and continue
If we mark an osd lost, and subsequently there are some objects
that are permanently lost, recover. Adjust the missing map to
no longer expect those new revisions. (FIXME: pg stats are not
correctly adjusted; a repair will be needed.)
Sage Weil [Wed, 24 Feb 2010 18:48:31 +0000 (10:48 -0800)]
filer: remove -> purge_range, and scale to large ranges
Redefine the remove interface to operate over a range of object
numbers, not a byte range, since we are removing objects. It is
the caller's responsibility to ensure they have the proper
range (by mapping from the ceph_file_layout).
Also behave well when the range is large by allowing only a few
remove requests in flight at once.
Eventually the objecter probably needs a more generalized request
throttling mechanism, but this will do for now.
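
A rough sketch of the bounded in-flight behaviour described above,
under stated assumptions: RangePurger, PurgeStats and simulate_purge
are hypothetical names, not the Filer interface, and the "send"
callback stands in for whatever issues the actual remove request.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical helper: removes object numbers [first, first + num) while
// keeping at most max_in_flight requests outstanding.
class RangePurger {
public:
    RangePurger(uint64_t first, uint64_t num, unsigned max_in_flight,
                std::function<void(uint64_t)> send_remove)
        : next_(first), end_(first + num), max_(max_in_flight),
          send_(std::move(send_remove)) {
        assert(max_ > 0);  // with a zero window nothing could ever be sent
    }

    void start() { fill(); }

    // Call once for every acknowledged removal; this frees a slot.
    void on_ack() {
        assert(in_flight_ > 0);
        --in_flight_;
        fill();
    }

    bool done() const { return next_ == end_ && in_flight_ == 0; }
    unsigned in_flight() const { return in_flight_; }

private:
    // Issue requests until the window is full or the range is exhausted.
    void fill() {
        while (in_flight_ < max_ && next_ < end_) {
            ++in_flight_;
            send_(next_++);
        }
    }

    uint64_t next_;
    uint64_t end_;
    unsigned max_;
    unsigned in_flight_ = 0;
    std::function<void(uint64_t)> send_;
};

struct PurgeStats {
    uint64_t sent = 0;   // total removals issued
    unsigned peak = 0;   // most requests outstanding at any time
};

// Drive a purge to completion, acking one request at a time.
inline PurgeStats simulate_purge(uint64_t first, uint64_t num,
                                 unsigned max_in_flight) {
    PurgeStats st;
    RangePurger p(first, num, max_in_flight, [&](uint64_t) { ++st.sent; });
    p.start();
    st.peak = p.in_flight();
    while (!p.done()) {
        p.on_ack();
        st.peak = std::max(st.peak, p.in_flight());
    }
    return st;
}
```

The window only refills from on_ack(), so a very large range never
queues more than max_in_flight requests at once.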
Sage Weil [Wed, 24 Feb 2010 05:08:56 +0000 (21:08 -0800)]
mds: make scatter_nudge actually nudge when replica asks
If we're not replicated, there is no need to twiddle the
lock state; we can just write out any dirty data, as when we
have delayed rstat propagation. If we are replicated, though,
and a replica asks to nudge the lock, we had better nudge the
lock state!
Sage Weil [Tue, 23 Feb 2010 23:51:39 +0000 (15:51 -0800)]
mds: fix file purge race
Handle the case where a new inode ref appears while we are
purging an inode. If so, we just truncate it to 0, so that next
time we go through purge_stray() we don't have to do the work
over again.
This can happen if a client goes snooping in the stray dir (or
who knows what else!).
Sage Weil [Thu, 18 Feb 2010 23:05:33 +0000 (15:05 -0800)]
objectstore: simpler transaction encoding
Just concatenate operations to a bufferlist as we go. No
distinct decoding step is needed; we parse the transaction as it
is replayed/applied. This avoids the old decoded intermediate
representation overhead.
Since we still decode the old version, that code is still there,
but not used for anything new.
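
To illustrate the append-then-parse-on-apply approach, here is a toy
sketch. The op codes, the length-prefixed string layout and the
in-memory std::map "store" are all assumptions made for the example;
they are not the ObjectStore wire format, and a real on-disk encoding
would fix the byte order rather than use the host's.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical op codes for the sketch.
enum : uint8_t { OP_WRITE = 1, OP_REMOVE = 2 };

// Each call appends its op straight to the byte buffer; there is no
// intermediate list of decoded operations.
class Transaction {
public:
    void write(const std::string& oid, const std::string& data) {
        put_u8(OP_WRITE);
        put_str(oid);
        put_str(data);
    }
    void remove(const std::string& oid) {
        put_u8(OP_REMOVE);
        put_str(oid);
    }
    const std::string& bytes() const { return buf_; }

private:
    void put_u8(uint8_t v) { buf_.push_back(static_cast<char>(v)); }
    void put_str(const std::string& s) {
        assert(s.size() <= UINT32_MAX);
        uint32_t len = static_cast<uint32_t>(s.size());
        char raw[sizeof(len)];
        std::memcpy(raw, &len, sizeof(len));  // host byte order (toy format)
        buf_.append(raw, sizeof(len));
        buf_.append(s);
    }
    std::string buf_;
};

// Parse the buffer while applying it to a toy in-memory store.
inline void apply(const std::string& buf,
                  std::map<std::string, std::string>& store) {
    std::size_t off = 0;
    auto get_str = [&]() {
        uint32_t len = 0;
        if (buf.size() - off < sizeof(len))
            throw std::runtime_error("truncated length");
        std::memcpy(&len, buf.data() + off, sizeof(len));
        off += sizeof(len);
        if (buf.size() - off < len)
            throw std::runtime_error("truncated string");
        std::string s = buf.substr(off, len);
        off += len;
        return s;
    };
    while (off < buf.size()) {
        uint8_t op = static_cast<uint8_t>(buf[off++]);
        if (op == OP_WRITE) {
            std::string oid = get_str();
            store[oid] = get_str();
        } else if (op == OP_REMOVE) {
            store.erase(get_str());
        } else {
            throw std::runtime_error("unknown op");
        }
    }
}
```

Replay and normal application can share apply(), since both simply
walk the same bytes.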
Sage Weil [Tue, 16 Feb 2010 23:07:35 +0000 (15:07 -0800)]
osd: pool cleanups
missed this before:
- no need to initialize in create_pending(), constructor does that
- int32_t, not int
- pool_max while we're at it
- initialize pool_max in OSDMap constructor
Greg Farnum [Fri, 12 Feb 2010 21:21:22 +0000 (13:21 -0800)]
osd: Deal with pools being removed from OSDMap.
This potentially has issues, since pools are not removed from the map
until after all the PGs are removed (which is threaded, not inline with
map delivery). But Sage thinks it's okay and the system keeps working
even if you delete a pool while benchmarking on it with rados.
Greg Farnum [Fri, 12 Feb 2010 00:57:23 +0000 (16:57 -0800)]
OSDMap: get_pg_pool now returns a pointer
This lets us return NULL if the pool isn't in the map, which is
needed functionality for pool deletion. Meanwhile, code which
expects the pool to exist will continue to cause a crash if it doesn't.
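
A small sketch of the pointer-returning lookup, assuming a toy map
type. ToyOSDMap, PoolInfo and pool_size_or are hypothetical; only the
get_pg_pool name comes from the commit, and its real signature may
differ.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical pool record for the sketch.
struct PoolInfo {
    int32_t size;
};

class ToyOSDMap {
public:
    void add_pool(int32_t id, PoolInfo p) {
        assert(id >= 0);
        pools_[id] = p;
    }
    void remove_pool(int32_t id) { pools_.erase(id); }

    // Returns nullptr when the pool is not in the map, so callers can
    // tell "deleted" apart from "present".
    const PoolInfo* get_pg_pool(int32_t id) const {
        auto it = pools_.find(id);
        return it == pools_.end() ? nullptr : &it->second;
    }

private:
    std::map<int32_t, PoolInfo> pools_;
};

// A caller that tolerates pool deletion checks the pointer first.
inline int32_t pool_size_or(const ToyOSDMap& m, int32_t id, int32_t fallback) {
    const PoolInfo* p = m.get_pg_pool(id);
    return p ? p->size : fallback;
}
```

Code that dereferences the result without checking still crashes on a
missing pool, matching the behaviour the message describes.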
Sage Weil [Mon, 15 Feb 2010 21:47:41 +0000 (13:47 -0800)]
mds: infer 'follows' in journal_dirty_inode on non-head inodes
There are lots of callers to journal_dirty_inode that may
unwittingly be dealing with a non-head inode (e.g.
check_file_max). If the provided inode is snapped, infer an
appropriate 'follows' value so as not to cow_inode() again.
Sage Weil [Fri, 12 Feb 2010 22:45:02 +0000 (14:45 -0800)]
osd: fix recovery requeue race
If a recovery op finished right as another recovery op was
being started, we could get into start_recovery_ops() and get
max = 0 and not start anything. Since the PG wasn't being
requeued for later, it would never recover. So, requeue if we
race and get max == 0.
that look a bit like multiple procs were racing into
join_reader(). Add an assert to catch that if it happens again,
and also wrap thread starts in pipe_lock to ensure we keep the
_running flags in sync with reality. Add in a few other
sanity checks too.
Sage Weil [Fri, 12 Feb 2010 21:35:57 +0000 (13:35 -0800)]
mon: note mds beacon times more carefully
We need to update the beacon timestamp even when we are updating
the mds state. Otherwise we can get caught in a busy loop
between marking an mds laggy and !laggy because the beacon stamp
never updates.
So even if we are updating, and the reply will be slow, update
our timestamp, so we don't mark the mds laggy.
Sage Weil [Fri, 12 Feb 2010 21:27:49 +0000 (13:27 -0800)]
osd: bail out of interval loop completely
We're going backwards, so once this test fails, it always fails,
and we can break instead of continue. Any skipped intervals will
be pruned shortly anyway.
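
A sketch of why break is safe here, under assumptions: the commit does
not say which test is involved, so this example uses "interval ends
before min_epoch" as a stand-in, and Interval, ScanResult and
scan_backwards are hypothetical names. The point is only that a test
which is monotonic in the walk direction needs no further checks once
it fails.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical past interval covering epochs [first, last].
struct Interval {
    uint32_t first;
    uint32_t last;
};

struct ScanResult {
    std::vector<Interval> kept;  // intervals that passed the test
    std::size_t visited = 0;     // how many intervals were examined
};

// Walk from newest to oldest. Intervals are sorted oldest first, so once
// one ends before min_epoch every older one does too and we can stop.
inline ScanResult scan_backwards(const std::vector<Interval>& oldest_first,
                                 uint32_t min_epoch) {
    ScanResult r;
    for (auto it = oldest_first.rbegin(); it != oldest_first.rend(); ++it) {
        assert(it->first <= it->last);
        ++r.visited;
        if (it->last < min_epoch)
            break;  // 'continue' would visit the rest for no benefit
        r.kept.push_back(*it);
    }
    return r;
}
```

With four intervals and a cutoff that the two newest satisfy, the loop
examines three entries and stops instead of visiting all four.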