Sage Weil [Thu, 14 May 2009 23:10:27 +0000 (16:10 -0700)]
mds: drain wrlocks before going from LOCK->SYNC in file_eval
This avoids excessive waits for the journal to flush on lock->sync
when a client request holds a wrlock. There's no reason to hurry: if
someone needs it sync we can do the transition then; otherwise, it'll
happen when the wrlock is dropped.
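
A minimal sketch of the deferral described above, not the actual mds code. The class name `FileLockSketch`, its fields, and `force_sync` are hypothetical; only the LOCK and SYNC states, wrlocks, and file_eval come from the commit message.

```python
class FileLockSketch:
    """Toy model of a file lock that can sit in LOCK or SYNC."""

    def __init__(self):
        self.state = "LOCK"
        self.num_wrlocks = 0  # wrlocks currently held by client requests

    def get_wrlock(self):
        self.num_wrlocks += 1

    def put_wrlock(self):
        self.num_wrlocks -= 1
        # Dropping the last wrlock re-evaluates the lock, so the deferred
        # LOCK->SYNC transition happens here without anyone waiting on it.
        self.file_eval()

    def file_eval(self):
        # Only move LOCK->SYNC once all wrlocks have drained.
        if self.state == "LOCK" and self.num_wrlocks == 0:
            self.state = "SYNC"

    def force_sync(self):
        # Assumption: stands in for "someone needs it sync", where the
        # transition is started on demand instead of eagerly.
        self.state = "SYNC"
```

With a wrlock held, `file_eval()` leaves the lock in LOCK; the move to SYNC happens when `put_wrlock()` drops the last wrlock.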
Sage Weil [Thu, 14 May 2009 20:57:25 +0000 (13:57 -0700)]
mds: break CAP_RDCACHE into CAP_SHARED, CAP_CACHE
FILE_CAP_RDCACHE was being used to mean both read access to the
file attributes (size, mtime) and permission to retain cached
data. That led to an incorrect definition of the filelock in the
mds, and in turn bugs with multiple client access. These are now
CAP_*_SHARED (all locks) and CAP_FILE_CACHE (filelock only).
The main observed symptom was a client creating files in a
directory and a second client not seeing them, due to RDCACHE not
being revoked and rdcache_gen thus not incrementing, allowing a
dcache readdir to proceed.
Sage Weil [Wed, 13 May 2009 21:55:07 +0000 (14:55 -0700)]
crush: fall back to exhaustive bucket search for any bucket type
If we don't get a bucket-specific choice in 5 tries, do an
exhaustive search (based on a random permutation). Only then give
up on the bucket and retry descent.
Note that the search-based fallback does not honor weighting at
all.
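
The two-phase selection above can be sketched roughly as follows. This is an illustration only, not the crush code: `choose_from_bucket`, `bucket_choose`, and `acceptable` are hypothetical names, and seeding the permutation from the input `x` is an assumption made to keep the sketch deterministic.

```python
import random

def choose_from_bucket(items, bucket_choose, acceptable, x, local_tries=5):
    """Pick an acceptable item from a bucket, or None to give up on it.

    items         -- list of item ids in the bucket
    bucket_choose -- hypothetical bucket-specific choice function (items, x, r)
    acceptable    -- hypothetical predicate rejecting e.g. collisions
    x             -- integer input value being placed
    """
    # Phase 1: bucket-specific choice, up to local_tries attempts.
    for r in range(local_tries):
        item = bucket_choose(items, x, r)
        if acceptable(item):
            return item

    # Phase 2: exhaustive search over a random permutation of the items.
    # Note that this ignores item weights entirely.
    perm = list(items)
    random.Random(x).shuffle(perm)
    for item in perm:
        if acceptable(item):
            return item

    # Nothing usable in this bucket: the caller retries the descent.
    return None
```

Because the fallback visits every item once, it finds an acceptable item whenever one exists, at the cost of ignoring the weights that the bucket-specific choice would honor.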
Sage Weil [Sun, 10 May 2009 05:12:50 +0000 (22:12 -0700)]
osd: skip initial bit of peering if already have_master_log
Once we have settled on the master log, we want to skip that
step of peer(). This matters because peer() can be called on an
active PG if an osd shows up with stray content. We still want
to peer() then, in case there are missing objects to be
found.
Sage Weil [Fri, 8 May 2009 21:04:03 +0000 (14:04 -0700)]
osd: maintain up_epoch AND boot_epoch; revise OSDSuperblock accordingly
In order to make the superblock clean interval meaningful after we
are marked down and then up again (over the life of a single
cosd process instance), we track both boot_epoch and up_epoch,
and keep [boot_epoch,clean_thru] in the superblock.
This avoids seeing crashed pgs when an osd is wrongly marked down
and the osd marks itself up again.
Sage Weil [Fri, 8 May 2009 03:49:51 +0000 (20:49 -0700)]
osd: factor out clear_recovery_state from {cancel,finish}_recovery
Also kill OSD::num_pulling counter, which is wrong anyway
(it lacks locking) and probably not needed given the more
general recovery_op accounting.
Sage Weil [Fri, 8 May 2009 03:47:31 +0000 (20:47 -0700)]
osd: make sure _finish_recovery only completes when it's supposed to
Because finish_recovery does a sync, the final cleanup is
deferred, and we have to make sure we are still done (and we
haven't, say, repeered or something).
In this case, we thus make sure we don't clear out pg
recovery_ops when we actually have ops in progress.
Sage Weil [Wed, 6 May 2009 20:12:29 +0000 (13:12 -0700)]
osd: move .snap out of object_t
This makes the snap versioning completely orthogonal to the logical
object name (object_t). This is key since eventually object_t
won't be structured. And the old way made for an awkward interface
anyway.
Also killed the .snap = 0 special casing, which AFAICS was
useless.
Sage Weil [Thu, 7 May 2009 21:39:47 +0000 (14:39 -0700)]
kclient: recalculate pgid each time request is sent
The pg calculation depends on osdmap parameters that are transient. In
contrast, the rest of calc_layout is concerned with file striping, which
is fixed (at least over the lifetime of the request).
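
A rough sketch of the split described above, under stated assumptions: the `OsdMap` and `Request` types, the `pg_num` field, and the CRC-modulo mapping are all hypothetical stand-ins, not the kclient's actual structures or placement function. The point is only that the pgid is derived at send time from whichever osdmap is current, while the object name produced by striping stays fixed.

```python
import zlib
from dataclasses import dataclass

@dataclass
class OsdMap:
    epoch: int
    pg_num: int  # transient: may differ between osdmap epochs

@dataclass
class Request:
    object_name: str  # fixed by file striping for the life of the request

def calc_pgid(req, osdmap):
    # Placeholder mapping (assumption): hash the object name and reduce
    # it by the osdmap's current pg count.
    return zlib.crc32(req.object_name.encode()) % osdmap.pg_num

def send(req, osdmap):
    # Recompute the pgid on every send instead of caching it on the
    # request, so a resend after an osdmap change targets the right pg.
    pgid = calc_pgid(req, osdmap)
    return (osdmap.epoch, pgid)
```

Resending the same `Request` against a newer `OsdMap` with a different `pg_num` yields a pgid computed from the new map, without touching the striping result.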