Before, we would provide "have" and a bool "onetime" flag. The struct was
also screwed up with an extra __le64. Then have=0 was a special case
that meant "give me the latest".
The problem is this is ambiguous between the usual "give me everything
since X" and "give me your latest", because you might actually have 0 and
want 1..current.
Changes protocol and cleans up the struct:
- now "start" and "flags", where only 1 flag (ONETIME) is defined
- clean up sub_want_* methods throughout
- fix all sub_want callers to ask for _start_ (not have) epoch, or 0 for
any/latest
- add a feature bit; still talks to old clients w/o that bit
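A minimal sketch of the new request semantics, to show why "start" removes
the ambiguity. Only "start", "flags" and ONETIME come from the message above;
the type name, helper functions and the flag's numeric value are hypothetical
and are not taken from the actual Ceph headers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical constant: the only flag defined is ONETIME.
constexpr uint8_t SUBSCRIBE_ONETIME = 1;

// Hypothetical request item: "start" replaces the old "have" field.
struct SubscribeItem {
  uint64_t start;  // first epoch wanted, or 0 for "any/latest"
  uint8_t flags;   // SUBSCRIBE_ONETIME or 0
};

// Epochs a server whose newest epoch is `current` would send (assumed policy).
std::vector<uint64_t> epochs_to_send(const SubscribeItem& item, uint64_t current) {
  std::vector<uint64_t> out;
  if (current == 0)
    return out;                      // nothing published yet
  if (item.start == 0) {
    out.push_back(current);          // "give me your latest"
    return out;
  }
  for (uint64_t e = item.start; e <= current; ++e)
    out.push_back(e);                // "give me everything from start on"
  return out;
}

// A ONETIME subscription is dropped after it has been answered.
bool keep_subscription(const SubscribeItem& item) {
  return (item.flags & SUBSCRIBE_ONETIME) == 0;
}
```

Under this sketch a client that holds nothing and wants 1..current asks for
start=1, while a client that only wants the newest map asks for start=0, so
the two requests no longer look the same.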
Sage Weil [Fri, 23 Jul 2010 19:57:52 +0000 (12:57 -0700)]
osd: make assemble_backlog more tolerant of races
The assemble_backlog is merging data generated while racing with online
updates. It needs to tolerate races with things like delete. For example,
- generate_backlog identifies object A
- client deletes+logs A
- assemble_backlog sees A's deletion entry.
We may want to merge the backlog entry in this case.
Sage Weil [Fri, 23 Jul 2010 22:14:43 +0000 (15:14 -0700)]
mds: fix lease issue mask
We define 1 to be the only lease "mask" we currently support (for dentry)
and divorce ourselves from the CEPH_LOCK namespace for this purpose. We
did this in c8d7c970e864e9f130b7c44b29c322771067026e, but screwed up the
issue function. Fix the caller, and pay attention to @mask here.
Sage Weil [Thu, 22 Jul 2010 22:55:23 +0000 (15:55 -0700)]
osd: simplify heartbeat checks
- Only check heartbeats when we have heartbeat_lock and osdmap rdlocked,
and thus _know_ heartbeat info and map are in sync. Drop unnecessary
consistency checks.
- Check heartbeats in tick(), since the heartbeat thread may miss it if
it's unlucky.
Sage Weil [Fri, 23 Jul 2010 18:45:26 +0000 (11:45 -0700)]
osd: zero ondisklog pointers when starting pg deletion
This fixes a problem where the osd stops part way through pg cleanup. It
sees the old ondisklog bounds, but then fails to load the log, and crashes
on startup.
We just need to zero the ondisklog bounds when we zero the log.
Sage Weil [Thu, 22 Jul 2010 19:01:11 +0000 (12:01 -0700)]
mds: be careful obeying REQRDLOCK
Only do a simple_sync() if we are stable, auth, and not already sync. The
client request can race with other state changes, so be careful. The
client will also retry on any state change, so we can safely ignore if
things don't look quite right.
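The guard described above can be sketched as a single predicate. The struct
and function names here are hypothetical stand-ins and do not reflect the
real MDS lock classes; only the three conditions come from the message.

```cpp
#include <cassert>

// Hypothetical snapshot of the lock state when REQRDLOCK arrives.
struct LockView {
  bool stable;  // lock is not in a transitional state
  bool auth;    // this MDS is authoritative for the object
  bool sync;    // lock is already in the sync state
};

// Act on the request only when all three conditions hold; otherwise ignore
// it and rely on the client retrying after the next state change.
bool should_sync_on_reqrdlock(const LockView& l) {
  return l.stable && l.auth && !l.sync;
}
```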
Sage Weil [Tue, 20 Jul 2010 22:59:11 +0000 (15:59 -0700)]
mds: initialize snaprealm created, current_parent_since on creation
Need to initialize created and current_parent_since on new snaprealms
when they are created, or else we get incorrect results from the likes of
SnapRealm::get_snap_info() (e.g., parent snaps from before we were
created).
Sage Weil [Tue, 20 Jul 2010 20:23:51 +0000 (13:23 -0700)]
osd: clear failure_queue when marked down
This prevents bleed through of failures (due to not getting heartbeats, due
to us being marked down) so they don't get sent after we are marked back
up again.
This fixes one possible source of up/down flapping...
Sage Weil [Mon, 19 Jul 2010 23:22:52 +0000 (16:22 -0700)]
osd: refactor push code
- send_push_op() does a push, nothing else
- push_start() starts a primary->replica push, tracks state
- push_to_replica() ensures we push head first, calculates cloning, etc.
Sage Weil [Mon, 19 Jul 2010 21:44:09 +0000 (14:44 -0700)]
osd: recover degraded objects _before_ modifying them
This will slow down writes to degraded objects because we will wait for
them to recover before applying the write. OTOH it will be robust in the case
of large objects. We can optimize the small object update (and overwrite)
cases later.
Sage Weil [Fri, 16 Jul 2010 18:43:14 +0000 (11:43 -0700)]
mds: journal dirty items in order
There was some weird thing where dirty items were added to the front of
the list in the EMetaBlob, dating from 2007. I have no idea why. But it
was breaking rename, which dirties the same inode twice in some cases
(same src and dst dir). Since it would replay in reverse, the inode would
end up with an older state. If that happened to be the last time the inode
was modified in the journal prior to an mds restart+replay, we'd get bad
dirstat/rstat metadata.
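A toy model of the ordering problem, assuming replay walks the list front to
back and the last entry applied for an inode wins. DirtyEntry and the helper
functions are hypothetical and are not the EMetaBlob code.

```cpp
#include <cassert>
#include <deque>
#include <map>

// Hypothetical journaled update: inode number plus the version it carries.
struct DirtyEntry {
  int ino;
  int version;
};

// Old behaviour: each newly dirtied item goes to the front of the list.
std::deque<DirtyEntry> journal_front_insert(const std::deque<DirtyEntry>& updates) {
  std::deque<DirtyEntry> journal;
  for (const auto& u : updates)
    journal.push_front(u);
  return journal;
}

// Fixed behaviour: items are journaled in the order they were dirtied.
std::deque<DirtyEntry> journal_in_order(const std::deque<DirtyEntry>& updates) {
  std::deque<DirtyEntry> journal;
  for (const auto& u : updates)
    journal.push_back(u);
  return journal;
}

// Replay applies entries front to back; a later entry overwrites an earlier one.
std::map<int, int> replay(const std::deque<DirtyEntry>& journal) {
  std::map<int, int> state;
  for (const auto& e : journal)
    state[e.ino] = e.version;
  return state;
}
```

With the same inode dirtied twice, as in a rename within one directory, the
front-insert journal replays the newer entry first and ends on the older
version, while the in-order journal ends on the newer one.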
Sage Weil [Thu, 15 Jul 2010 22:13:25 +0000 (15:13 -0700)]
mds: remove bogus 'oldest snap' floor on lssnap result
I suspect the intent was to exclude snaps from parents from before we
existed. However, get_snap_info() already does that intelligently. And
get_oldest_snap() was wrong: this dir's snaps have little to do with
the nested snaprealms, old_inodes isn't a good indicator of birth time,
etc.
This fixes a bug where we couldn't see a snap in .snaps, but after the mds
restart could (because old_inodes included more old versions after
replay).