Sage Weil [Mon, 1 Dec 2008 15:13:30 +0000 (07:13 -0800)]
osd: optionally avoid zeroing trimmed log on disk
This is a half-hearted attempt to keep old PG log content around. It'll
still be lost if a PG moves to another node or the entire log is written
to disk for some other reason.
Sage Weil [Mon, 1 Dec 2008 04:42:44 +0000 (20:42 -0800)]
osd: skip peer_info on down osds
We don't clean old/down OSDs out of peer_info map, since we may not
restart peering when strays go up/down. That's fine... just make sure
we ignore them later.
Sage Weil [Wed, 26 Nov 2008 19:23:15 +0000 (11:23 -0800)]
osd: move stats into PG::Info (disk format change)
We want the pg stats to propagate along with last_update. Do so
in merge_log.
Also, stop doing delayed stats update on primary; we always update
the in-core copy of Info, and only delay applying the transaction
to disk. At least currently.
Sage Weil [Mon, 24 Nov 2008 21:49:17 +0000 (13:49 -0800)]
osd: remove snap collection after it is trimmed
We can still end up with empty collections for existing snaps but no
local objects. However, they'll eventually go away when the snap is
deleted, so who cares.
Use a single struct to track all of our osd up/down info. Include
down_at, the epoch we last marked the osd down.
Fix PG::build_prior to require that the osd was clean through the _entire_
interval in question.
In monitor, adjust new clean interval forward to down_at-1 if the up_from
matches the interval we mounted. That is, if the OSD shut down cleanly,
it obviously remained clean at least until we marked it down in the map.
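
A minimal sketch of the two rules above, not the actual OSDMap/OSDMonitor code. The struct name, its field layout, and both helper functions are hypothetical; only down_at, up_from, and the idea of a last clean interval come from the commit text.

```cpp
#include <cstdint>

using epoch_t = uint32_t;

// Hypothetical single struct holding all up/down info for one OSD.
struct osd_updown_info {
  epoch_t last_clean_first = 0;  // first epoch of the last clean interval
  epoch_t last_clean_last = 0;   // last epoch of the last clean interval (inclusive)
  epoch_t up_from = 0;           // epoch the OSD was last marked up
  epoch_t down_at = 0;           // epoch the OSD was last marked down
};

// build_prior side: the OSD counts as clean only if its clean interval
// covers the _entire_ interval [first, last], not just part of it.
bool clean_through(const osd_updown_info& i, epoch_t first, epoch_t last) {
  return i.last_clean_first <= first && last <= i.last_clean_last;
}

// Monitor side (assumed shape): record a reported clean interval [begin, end].
// If the OSD's up_from matches the interval it mounted, it stayed clean until
// we marked it down, so push the end forward to down_at - 1.
void record_clean_interval(osd_updown_info& i, epoch_t begin, epoch_t end) {
  if (i.up_from == begin && i.down_at > 0 && end < i.down_at - 1)
    end = i.down_at - 1;
  i.last_clean_first = begin;
  i.last_clean_last = end;
}
```

With up_from = 10 and down_at = 20, a reported interval of [10, 15] is stored as [10, 19], so an interval ending at epoch 19 passes clean_through while one ending at epoch 20 does not.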
Sage Weil [Mon, 24 Nov 2008 21:26:11 +0000 (13:26 -0800)]
mds: move filelock to lock state if we can't wrlock but lock is stable
For example, lock may be sync c=1 when we're trying to reset
max_size to 0. We need to make sure the lock will change state
before we wait on WAIT_STABLE.
Sage Weil [Mon, 24 Nov 2008 21:23:53 +0000 (13:23 -0800)]
mds: rejournal using EUpdate instead of EOpen if no caps in check_inode_max_size
Using EOpen on non-open files is imprecise. More importantly, if
it is a snapped inode, the EOpen replay code won't be able to
look up the ino and will throw an assertion.
So, use EUpdate instead to record the new info when necessary.
Sage Weil [Mon, 24 Nov 2008 18:00:21 +0000 (10:00 -0800)]
osd: use last_clean_interval in build_prior logic
We now mark a PG crashed if any OSD in a given interval is neither still
alive nor cleanly shut down during that interval. If neither condition is
met, it may have crashed.
It isn't a perfect set of criteria, since last_clean_interval only tracks
clean shutdowns of an OSD. We could, for example, track another interval
generated via old osd_up_thru, but the conditions for that are different,
since that requires survival past the end of the interval, not a clean
shutdown during the interval. This should capture the common case, though,
of a clean unmount.
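
The check described above can be sketched roughly as follows. This is an illustration only: OsdIntervalState and interval_may_have_crashed are hypothetical names, and the two booleans stand in for whatever the real code derives from the OSD map and last_clean_interval.

```cpp
#include <vector>

// Hypothetical summary of what we know about one OSD for a past interval.
struct OsdIntervalState {
  bool still_alive;     // the OSD has stayed up since the interval
  bool clean_shutdown;  // last_clean_interval shows a clean shutdown in the interval
};

// The interval may have crashed if any OSD in it is neither still alive
// nor known to have shut down cleanly.
bool interval_may_have_crashed(const std::vector<OsdIntervalState>& osds) {
  for (const auto& o : osds) {
    if (!o.still_alive && !o.clean_shutdown)
      return true;
  }
  return false;
}
```

An interval whose OSDs are each either alive or cleanly unmounted is not flagged; a single OSD meeting neither condition is enough to flag it.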
Sage Weil [Mon, 24 Nov 2008 18:17:58 +0000 (10:17 -0800)]
osd: remove bad assertion in pick_read_snap
The clone oid.snap does not necessarily correspond to the newest
snap, since snaps may be deleted, or because the snap is
named based on the snap context seq and not the oldest snap it
contains.
Sage Weil [Sat, 22 Nov 2008 16:57:19 +0000 (08:57 -0800)]
mds: fix up completed_request handling during journal replay
The completed_requests is handled separately from the session table
itself, in that we may add completed requests to the table even when
we may have loaded newer info. But the handling was a bit wrong.
We make sure we only add completed requests if the session is already
open... and remove the unnecessary trim (if the sessionmap is newer, the
session is already closed, and thus we have no request info).
Sage Weil [Fri, 21 Nov 2008 22:58:00 +0000 (14:58 -0800)]
mds: make open_remote_ino terminate if the anchortrace refers to a non-existent ino
Since anchor lookup is racy, we may have to do multiple lookups
(in the case of a concurrent anchor table update). If we don't
find the ino in the anchor, remember the anchor version when we
try again. If we fail again at the same point, and the anchor
has not changed, fail.
Create an open_remote_dentry helper that does this. If we fail,
set the CDentry::STATE_BADREMOTEINO state bit.
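
The termination rule can be sketched as a small retry loop. This is not the MDS implementation: TraceResult, open_remote_with_retry, and the try_trace callback are hypothetical, and a false return stands in for the point where the real helper would set the bad-remote-ino state bit.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical outcome of one attempt to follow the anchor trace.
struct TraceResult {
  bool found;               // reached the target ino
  uint64_t dead_end;        // where the trace stopped, if not found
  uint64_t anchor_version;  // anchor table version seen by this attempt
};

// Retry the racy lookup. Give up only when two consecutive attempts fail
// at the same point and the anchor version has not changed in between.
bool open_remote_with_retry(const std::function<TraceResult()>& try_trace) {
  bool have_last = false;
  uint64_t last_dead_end = 0;
  uint64_t last_version = 0;
  for (;;) {
    TraceResult r = try_trace();
    if (r.found)
      return true;
    if (have_last && r.dead_end == last_dead_end &&
        r.anchor_version == last_version)
      return false;  // same dead end, unchanged anchor: the ino does not exist
    have_last = true;
    last_dead_end = r.dead_end;
    last_version = r.anchor_version;
  }
}
```

A failure followed by a changed anchor version is treated as a race and retried; only a repeated failure against the same version ends the loop.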
Sage Weil [Fri, 21 Nov 2008 22:34:40 +0000 (14:34 -0800)]
mds: add version to anchor; avoid looping in open_remote_ino
If we do not find the item referenced by the anchor trace, we
try again, but keep track of the anchor version we ended on. If
we hit a dead end at the same point next time and the anchor
hasn't changed, we give up.