Greg Farnum [Thu, 4 Aug 2011 16:15:25 +0000 (09:15 -0700)]
osd: Initialize new PGs with correct info.history.same_primary_since
Previously we were initializing based on the local osdmap epoch, which
is often correct, but if we process the MOSDPGCreate in an epoch after
the PG was created in the OSDMap, we could have problems with clients
sending out messages based on the creation epoch which the OSD would
reject as being earlier than same_primary_since. See bug #1357.
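A minimal sketch of the failure mode, written in Python rather than Ceph's C++. The function and variable names are hypothetical, and the sketch assumes the corrected value is the PG's creation epoch; only the rule that a request older than same_primary_since is rejected comes from the message above.

```python
def osd_accepts(request_epoch, same_primary_since):
    """Hypothetical check: reject requests older than same_primary_since."""
    return request_epoch >= same_primary_since


pg_created_epoch = 10    # epoch in which the PG appeared in the OSDMap
local_osdmap_epoch = 12  # epoch the OSD had reached when it saw MOSDPGCreate

# Old behaviour: initialize from the local osdmap epoch.
old_init = local_osdmap_epoch
# Assumed fixed behaviour: initialize from the PG's creation epoch.
new_init = pg_created_epoch

# A client addresses the PG using the epoch in which it was created.
client_request_epoch = pg_created_epoch
rejected_before = not osd_accepts(client_request_epoch, old_init)
accepted_after = osd_accepts(client_request_epoch, new_init)
```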
Sage Weil [Tue, 2 Aug 2011 21:19:41 +0000 (14:19 -0700)]
osd: change src_oid encoding -- FLAG DAY
The old encoding was mutually exclusive with putting any data payload on
the operation. That was stupid: we can't, for example, do xattr ops
on a src_oid.
Fix this by just including the oid in the data payload inline whenever the
bit is set in the op code. This changes the client protocol in an
incompatible way, which means users of the CLONERANGE operation need to be
upgraded/downgraded in unison.
Josh Durgin [Mon, 1 Aug 2011 17:45:19 +0000 (10:45 -0700)]
librados: fix notify deadlock
The success of the notify call needs to be checked before waiting to
receive a notification. If we try notifying on an object that does not
exist, for example, it should fail with -ENOENT, and not hang.
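The ordering described above can be sketched as follows. This is an illustration in Python, not the librados API: the notify() function, its parameters, and the use of a set to model existing objects are all assumptions made for the example.

```python
import errno
import threading


def notify(objects, oid, notified, timeout=1.0):
    """Hypothetical stand-in for a notify call; not the librados API."""
    rc = 0 if oid in objects else -errno.ENOENT
    if rc < 0:
        # Fail fast: no notification will ever arrive for a missing object,
        # so waiting here would hang.
        return rc
    # Only wait once the notify itself is known to have succeeded.
    if not notified.wait(timeout):
        return -errno.ETIMEDOUT
    return 0


done = threading.Event()
done.set()  # pretend the notification has already arrived

rc_missing = notify({"foo"}, "bar", threading.Event())
rc_ok = notify({"foo"}, "foo", done)
```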
Josh Durgin [Tue, 2 Aug 2011 19:14:06 +0000 (12:14 -0700)]
osd: put_object_context: tolerate pgs being deleted
PGs that are queued for deletion won't be in the osdmap,
and may not be in the pg_map, but if they are, it's safe to
put object_context. Otherwise, the pg is being deleted and
will clean up the object contexts itself.
osd, pg: clean up watchers on pg deletion and shutdown
Watchers and their object contexts need to be cleaned up so
they aren't used after the pg is gone. This happened if the
pool was deleted and the connection to the watcher was reset.
Sage Weil [Sat, 30 Jul 2011 05:10:17 +0000 (22:10 -0700)]
mds: fix create_subtree_map for new dirs
Currently mkdir foo ; rmdir foo fails because we can't get_subtree_map()
on a new directory that isn't linked in the committed plane. Since we are
journaling the projected subtree, it makes sense to use
get_projected_subtree_map() here.
It's easiest to keep both the old and new directories in the rename
project map instead of looking at the next-to-most-recent parent for the
inode. The committed version is irrelevant (could conceivably be multiple
renames behind) and the current projected parent is just newdir; we need
olddir too, and we don't project for cross-mds rename anyway.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
This allows clients to determine whether they have the latest
mds, mon, or osd map. This is useful for figuring out if a pool
does not exist, or if the osdmap with it simply hasn't been
received yet.
Sage Weil [Fri, 29 Jul 2011 20:44:24 +0000 (13:44 -0700)]
mds: identify slave requests with reqid + attempt number
We need to distinguish between different attempts to process a request, or
else we can get annoying races in the slave request handling code. E.g.,
- request sent to mds A
- A authpins items on B, B registered slave_request
- A forwards request to C, sends slave finish to B
- C receives request, sends authpin slave request to B
- B receives C's authpin request, discards (*)
- B receives A's finish, closes slave request
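A rough sketch of the idea, in Python with hypothetical names. It assumes the race comes from B treating both attempts as the same slave request when only the reqid is used as the key; keying on (reqid, attempt) keeps them apart.

```python
# Hypothetical table of slave requests on mds B, keyed by (reqid, attempt).
slave_requests = {}


def register(reqid, attempt):
    """Record a slave request for one specific attempt."""
    slave_requests[(reqid, attempt)] = "active"


def finish(reqid, attempt):
    """Close only the matching attempt; return True if it existed."""
    return slave_requests.pop((reqid, attempt), None) is not None


register("client.1:42", 0)           # authpin on behalf of A's attempt
register("client.1:42", 1)           # C's attempt after the forward is kept
closed = finish("client.1:42", 0)    # A's finish closes attempt 0 only
```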
Sage Weil [Thu, 28 Jul 2011 22:24:23 +0000 (15:24 -0700)]
mds: fix log trimming races
trim() would iterate over segments. It would take the *p segment, ++p,
then call try_expire(). But the _expired() function would also clean up
and (if possible) retire subsequent segments on the list if they were on
the expired list, invalidating the p iterator.
Untangle the mess by making expired segment trimming (i.e. removing from
segment list) a separate operation performed only by trim() (probably a
good idea anyway). This keeps the iterator safe/stable.
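The separation can be illustrated with a small Python model. The class and method names are hypothetical, and the sketch assumes that segments are only retired from the front of the list; the point is that expiring a segment merely records the fact, while removal from the segment list happens in one place.

```python
class Log:
    """Toy model of a segmented log; not the MDLog implementation."""

    def __init__(self, segments):
        self.segments = list(segments)
        self.expired = set()

    def mark_expired(self, seg):
        # Only records the fact; never modifies self.segments.
        self.expired.add(seg)

    def trim(self, can_expire):
        # Phase 1: iterate over a stable snapshot and expire what we can.
        for seg in list(self.segments):
            if can_expire(seg):
                self.mark_expired(seg)
        # Phase 2: the only place segments are removed from the list.
        while self.segments and self.segments[0] in self.expired:
            self.expired.discard(self.segments.pop(0))


log = Log([1, 2, 3, 4])
log.trim(lambda seg: seg in {1, 2, 4})
```

Segment 4 stays on the expired set until segment 3 ahead of it can also be retired.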
Sage Weil [Thu, 28 Jul 2011 21:51:06 +0000 (14:51 -0700)]
mds: separate type for gratuitous debug ESubtreeMaps
Give these a different type so they are not interpreted as subtree
boundaries during replay. Otherwise we break the truncate_finish code,
which references the truncate_start logsegment by offset. Probably other
stuff too.
Sage Weil [Thu, 28 Jul 2011 18:27:25 +0000 (11:27 -0700)]
client: open session with all mds targets
If we have an open session with an mds, we also need to have an open
session with each of its targets. The problem is if we, say,
- client has old mdsmap
- mds A adds B as target in mdsmap
- send request to mds A
- A exports to B
- we get the EXPORT, but B isn't listed as a target for A in client map
- client gets updated map
At the time we receive the map we need to open the session to B. We can't
really do it when we get the EXPORT because we don't know the target MDS.
We can either track which exports are pending to do it, or just blindly
open sessions with targets for any MDSs we have caps with. Which is
basically every session we have open. That's simplest for now.
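The "blindly open sessions with targets" approach might look like the following Python sketch. The function name and the dictionary representation of the mdsmap are assumptions for illustration, not the client's actual data structures.

```python
def sessions_to_open(mdsmap_targets, open_sessions):
    """Return the MDSs we should open sessions with after a map update.

    mdsmap_targets: hypothetical dict mapping each mds to its set of targets.
    open_sessions:  set of MDSs we already have sessions with.
    """
    wanted = set()
    for mds in open_sessions:
        wanted.update(mdsmap_targets.get(mds, ()))
    return wanted - set(open_sessions)


# The updated map lists B as a target of A; we only have a session with A.
new_map = {"A": {"B"}, "B": set()}
to_open = sessions_to_open(new_map, {"A"})
```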
mds: Split the CInode::scatter_wanted field in two
We use this field to indicate we want a scatter or an unscatter. Make
that distinction explicit.
Also, clear the unscatter_wanted in simple_lock when we start a gather!
Sage Weil [Thu, 28 Jul 2011 17:10:51 +0000 (10:10 -0700)]
heartbeatmap: warn if previous deadline is missed
This will generate missed deadline noise in the log that may otherwise be
missed by an infrequent heartbeat_interval. We generally want to know if
deadlines are missed, but we don't necessarily need to touch the heartbeat
file every second. This gets us both.
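A minimal sketch of the idea in Python, with hypothetical names. It assumes the check runs whenever a thread resets its timeout, so a missed deadline is logged even if the periodic health check did not run in between.

```python
class Heartbeat:
    """Toy per-thread heartbeat handle; not the HeartbeatMap implementation."""

    def __init__(self):
        self.deadline = None
        self.warnings = []

    def reset_timeout(self, now, grace):
        # Before installing a new deadline, check whether the previous
        # one was already missed and record a warning if so.
        if self.deadline is not None and now > self.deadline:
            late = now - self.deadline
            self.warnings.append(f"missed deadline by {late:.1f}s")
        self.deadline = now + grace


hb = Heartbeat()
hb.reset_timeout(0.0, 5.0)   # deadline is now 5.0
hb.reset_timeout(3.0, 5.0)   # on time; deadline moves to 8.0
hb.reset_timeout(10.0, 5.0)  # 2.0s late relative to the previous deadline
```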