Sage Weil [Wed, 29 May 2013 20:26:45 +0000 (13:26 -0700)]
osd: wait for healthy pings from peers in waiting-for-healthy state
If we are (wrongly) marked down, we need to go into the waiting-for-healthy
state and verify that our network interfaces are working before trying to
rejoin the cluster.
- make _is_healthy() check require positive proof of pings working
- do heartbeat checks and updates in this state
- reset the random peers every heartbeat_interval, in case we keep picking
bad ones
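A minimal sketch of the positive-proof rule above, with hypothetical names and types (the real check is the OSD's internal heartbeat state): a peer that has not replied recently counts against health.

    #include <chrono>
    #include <map>

    using Clock = std::chrono::steady_clock;

    struct PeerStat {
      Clock::time_point last_reply{};  // stays at epoch until a reply lands
    };

    // Positive proof: every heartbeat peer must have replied since we
    // entered the waiting-for-healthy state, and recently enough.
    bool pings_look_healthy(const std::map<int, PeerStat>& peers,
                            Clock::time_point entered_state,
                            std::chrono::seconds grace) {
      if (peers.empty())
        return false;                  // no peers yet means no proof
      const auto now = Clock::now();
      for (const auto& [osd, stat] : peers) {
        (void)osd;
        if (stat.last_reply < entered_state)
          return false;                // never heard back in this state
        if (now - stat.last_reply > grace)
          return false;                // last reply is too stale
      }
      return true;
    }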
Sage Weil [Wed, 29 May 2013 20:15:41 +0000 (13:15 -0700)]
osd: distinguish between definitely healthy and definitely not unhealthy
is_unhealthy() will assume they are healthy for some period after we
send our first ping attempt. is_healthy() is now a strict check that we
know they are healthy.
Switch the failure report check to use is_unhealthy(); use is_healthy()
everywhere else, including the waiting-for-healthy pre-boot checks.
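A compact sketch of the distinction, with simplified stand-in fields: is_unhealthy() grants a grace period after the first ping we send, while is_healthy() demands a recent reply.

    #include <chrono>

    using Clock = std::chrono::steady_clock;

    struct HeartbeatInfo {
      Clock::time_point first_tx{};  // when we first pinged this peer
      Clock::time_point last_rx{};   // last reply; zero if none yet
    };

    // Strict: we have positive, recent proof that the peer answers.
    bool is_healthy(const HeartbeatInfo& hb, Clock::time_point cutoff) {
      return hb.last_rx > cutoff;
    }

    // Lenient: a peer that never replied is only "unhealthy" once the
    // grace period after our first ping has expired; this is the form
    // the failure report check uses.
    bool is_unhealthy(const HeartbeatInfo& hb, Clock::time_point cutoff) {
      if (hb.last_rx == Clock::time_point{})  // no reply yet
        return hb.first_tx < cutoff;          // grace period expired?
      return hb.last_rx < cutoff;
    }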
Sage Weil [Mon, 27 May 2013 22:24:56 +0000 (15:24 -0700)]
osd: simplify is_healthy() check during boot
This has a slight behavior change in that we ask the mon for the latest
osdmap if our internal heartbeat is failing. That isn't useful yet, but
will be shortly.
Sage Weil [Wed, 29 May 2013 16:49:11 +0000 (09:49 -0700)]
osd: do not assume head obc object exists when getting snapdir
For a list-snaps operation on the snapdir, do not assume that the obc for the
head means the object exists. This fixes a race between a head deletion and
a list-snaps that wrongly returns ENOENT, triggered by the DiffIterateStress
test when thrashing OSDs.
Fixes: #5183
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
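A tiny illustrative guard, with stand-in types (the real code uses the OSD's object context and object state): a cached head context alone is not proof the head object exists.

    struct ObjectState   { bool exists = false; };
    struct ObjectContext { ObjectState obs; };

    // list-snaps on the snapdir may only trust the head when the object
    // state says it still exists, not merely because a context for the
    // head happens to be cached.
    bool head_ok_for_list_snaps(const ObjectContext* head_obc) {
      return head_obc && head_obc->obs.exists;
    }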
- check against both front and back cons; either one may have failed.
- close *both* front and back before reopening either. This is
overkill, but slightly simpler code.
- fix leak of con when marking down
- handle race against osdmap update and note_down_osd
Fixes: #5172
Signed-off-by: Sage Weil <sage@inktank.com>
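A sketch of those rules with stand-in types (hypothetical names, not the actual OSD structures):

    struct Connection { bool failed = false; void mark_down() {} };

    struct HeartbeatPeer {
      Connection* con_front = nullptr;
      Connection* con_back  = nullptr;

      // Either channel failing means the peer's heartbeat has failed.
      bool hb_failed() const {
        return (con_front && con_front->failed) ||
               (con_back  && con_back->failed);
      }

      // Close *both* channels before reopening either; clearing the
      // pointers here is what prevents leaking a connection when the
      // peer is marked down.
      void reset_connections() {
        if (con_front) { con_front->mark_down(); con_front = nullptr; }
        if (con_back)  { con_back->mark_down();  con_back  = nullptr; }
      }
    };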
Samuel Just [Tue, 28 May 2013 18:10:05 +0000 (11:10 -0700)]
HashIndex: sync top directory during start_split,merge,col_split
Otherwise, the links might be ordered after the in-progress operation
tag write. We need the in-progress operation tag to correctly recover
from an interrupted merge, split, or col_split.
Fixes: #5180
Backport: cuttlefish, bobtail
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
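A sketch of the ordering requirement using plain POSIX calls (the real code goes through the filestore layer): make the directory's links durable before writing the in-progress operation tag.

    #include <fcntl.h>
    #include <unistd.h>

    // Make the directory's entries (the links) durable on disk.
    int sync_dir(const char* path) {
      int fd = ::open(path, O_RDONLY | O_DIRECTORY);
      if (fd < 0)
        return -1;
      int r = ::fsync(fd);
      ::close(fd);
      return r;
    }

    // Caller, at the start of a split/merge/col_split:
    //   if (sync_dir(top_dir) == 0)
    //     write_in_progress_op_tag(...);  // only after the sync, so the
    //                                     // tag never lands before the
    //                                     // links it describes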
Yan, Zheng [Sun, 26 May 2013 11:04:34 +0000 (19:04 +0800)]
mds: use "open-by-ino" function to open remote link
Also add a new config option "mds_open_remote_link_mode". The anchor
approach is used by default. If the mode is non-zero, use the open-by-ino
function. If the open-by-ino function fails and the mode is 1, retry
using the anchor approach; otherwise trigger an assertion.
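A sketch of the mode dispatch described above; the option name is real, but the helpers and control flow here are simplified stand-ins:

    #include <cassert>

    // Stubs standing in for the two resolution paths.
    bool open_by_anchor() { /* anchor-table lookup */    return true; }
    bool open_by_ino()    { /* backtrace-based lookup */ return true; }

    void open_remote_link(int mds_open_remote_link_mode) {
      if (mds_open_remote_link_mode == 0) {  // default: anchor approach
        open_by_anchor();
        return;
      }
      if (!open_by_ino()) {
        if (mds_open_remote_link_mode == 1)
          open_by_anchor();                  // mode 1: fall back to anchors
        else
          assert(0 == "open-by-ino failed"); // other modes: fatal
      }
    }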
Yan, Zheng [Sat, 25 May 2013 13:30:38 +0000 (21:30 +0800)]
mds: open missing cap inodes
When a recovering MDS enters the reconnect stage, clients send
reconnect messages to it. The message lists open files, their paths,
and issued caps. If an inode is not in the cache, the recovering MDS
uses the path the client provides to determine if it is the inode's
authority. If not, the recovering MDS exports the inode's caps to the
other MDS. The issue here is that the path the client provides isn't
always accurate.
The fix is to use the recently added "open inode by ino" function to
open any missing cap inodes when the recovering MDS enters the rejoin
stage. Send cache rejoin messages to the other MDSes after all caps'
authorities are determined.
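A sketch of that rejoin-stage flow, with hypothetical names (the real logic is asynchronous and lives in MDCache):

    #include <cstdint>
    #include <set>

    std::set<uint64_t> missing_cap_inos;  // inodes named in client
                                          // reconnects but not in cache

    int  open_ino(uint64_t ino) { return 0; }  // "open inode by ino" stub
    void send_cache_rejoins()   {}             // stub

    void enter_rejoin_stage() {
      // Resolve the authority of every missing cap inode first...
      for (uint64_t ino : missing_cap_inos)
        open_ino(ino);
      // ...and only then send cache rejoin messages to the other MDSes,
      // exporting caps for the inodes we are not auth for.
      send_cache_rejoins();
    }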
Yan, Zheng [Wed, 15 May 2013 02:28:58 +0000 (10:28 +0800)]
mds: open inode by ino
This patch adds "open-by-ino" helper. It utilizes backtrace to find
inode's path and open the inode. The algorithm looks like:
1. Check MDS peers. If any MDS has the inode in its cache, goto step 6.
2. Fetch backtrace. If backtrace was previously fetched and get the
same backtrace again, return -EIO.
3. Traverse the path in backtrace. If the inode is found, goto step 6;
if non-auth dirfrag is encountered, goto next step. If fail to find
the inode in its parent dir, goto step 1.
4. Request MDS peers to traverse the path in backtrace. If the inode
is found, goto step 6. If MDS peer encounters non-auth dirfrag, it
stops traversing. If any MDS peer fails to find the inode in its
parent dir, goto step 1.
5. Use the same algorithm to open the inode's parent. Goto step 3 if
succeeds; goto step 1 if fails.
6. return the inode's auth MDS ID.
The algorithm has two main assumptions:
1. If an inode is in its auth MDS's cache, its on-disk backtrace
can be out of date.
2. If an inode is not in any MDS's cache, its on-disk backtrace
must be up to date.
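The loop below is a compilable sketch of the six steps; every helper is a stub standing in for the real, asynchronous MDCache machinery, and for simplicity a successful step 5 restarts at step 1 here rather than resuming at step 3.

    #include <cerrno>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Backtrace {
      std::vector<std::string> ancestors;
      bool operator==(const Backtrace& o) const {
        return ancestors == o.ancestors;
      }
    };
    enum TraverseResult { FOUND, NON_AUTH_DIRFRAG, NOT_IN_PARENT };

    // Stubs for the real machinery:
    int check_peers_for_ino(uint64_t) { return -1; }             // step 1
    bool fetch_backtrace(uint64_t, Backtrace*) { return false; } // step 2
    TraverseResult traverse_local(const Backtrace&)
      { return NOT_IN_PARENT; }                                  // step 3
    TraverseResult traverse_on_peers(const Backtrace&, int*)
      { return NOT_IN_PARENT; }                                  // step 4
    uint64_t parent_ino(const Backtrace&) { return 0; }
    int whoami() { return 0; }

    int open_ino(uint64_t ino) {
      Backtrace prev;
      bool have_prev = false;
      for (;;) {
        int auth = check_peers_for_ino(ino);       // step 1
        if (auth >= 0)
          return auth;                             // step 6
        Backtrace bt;
        if (!fetch_backtrace(ino, &bt))            // step 2
          return -EIO;
        if (have_prev && bt == prev)               // same backtrace twice
          return -EIO;
        prev = bt;
        have_prev = true;
        TraverseResult r = traverse_local(bt);     // step 3
        if (r == FOUND)
          return whoami();                         // step 6
        if (r == NON_AUTH_DIRFRAG) {
          r = traverse_on_peers(bt, &auth);        // step 4
          if (r == FOUND)
            return auth;                           // step 6
          if (r == NON_AUTH_DIRFRAG &&
              open_ino(parent_ino(bt)) >= 0)       // step 5
            continue;                              // (real code: step 3)
        }
        // fall through: back to step 1
      }
    }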
Yan, Zheng [Fri, 17 May 2013 08:43:01 +0000 (16:43 +0800)]
mds: bring back old style backtrace handling
To queue a backtrace update, the current code allocates a BacktraceInfo
structure and adds it to the log segment's update_backtraces list. The
main issue with this approach is that BacktraceInfo is independent of
the inode. It's very inconvenient to find the pending backtrace updates
for given inodes. When exporting inodes from one MDS to another MDS, we
need to find and cancel all pending backtrace updates on the source
MDS.
This patch brings back the old backtrace handling code and adapts it to
the current backtrace format. The basic idea behind the old code is:
when an inode's backtrace becomes dirty, add the inode to the log
segment's dirty_parent_inodes list.
Compared to the current backtrace handling, another difference is that
the backtrace update is journalled in EMetaBlob::fullbit.
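A simplified sketch of the bookkeeping difference, with std::set standing in for the intrusive list a real implementation would use: with the inode itself on the log segment's dirty list, cancelling a pending update at export time is a direct erase.

    #include <set>

    struct CInode;  // the cached inode type

    struct LogSegment {
      std::set<CInode*> dirty_parent_inodes;  // inodes whose backtraces
                                              // still need updating
    };

    void mark_backtrace_dirty(CInode* in, LogSegment* ls) {
      ls->dirty_parent_inodes.insert(in);
    }

    // On export, cancelling the pending update is trivial, unlike
    // searching free-floating BacktraceInfo records for a match.
    void cancel_backtrace_update(CInode* in, LogSegment* ls) {
      ls->dirty_parent_inodes.erase(in);
    }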
Yan, Zheng [Fri, 17 May 2013 08:02:03 +0000 (16:02 +0800)]
mds: journal backtrace update in EMetaBlob::fullbit
The current way to journal a backtrace update is to set
EMetaBlob::update_bt to true. The problem is that an EMetaBlob can
include several inodes. If an EMetaBlob's update_bt is true, the
journal replay code has to queue backtrace updates for all inodes in
the EMetaBlob.
This patch adds two new flags to class EMetaBlob::fullbit, making it
able to journal backtrace updates.
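A sketch with illustrative flag names (the actual flags are whatever the patch adds to EMetaBlob::fullbit): replay queues a backtrace update only for inodes actually marked, not for every inode in the blob.

    #include <cstdint>

    struct fullbit_sketch {
      static constexpr uint8_t DIRTY_PARENT = 1 << 0; // backtrace dirty
      static constexpr uint8_t DIRTY_POOL   = 1 << 1; // old-pool copy too
      uint8_t state = 0;

      bool is_dirty_parent() const { return state & DIRTY_PARENT; }
      bool is_dirty_pool()   const { return state & DIRTY_POOL; }
    };

    // Replay: queue a backtrace update only when the bit is set, e.g.
    //   if (fb.is_dirty_parent()) queue_backtrace_update(ino);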
Yan, Zheng [Wed, 15 May 2013 03:24:36 +0000 (11:24 +0800)]
mds: warn on unconnected snap realms
When there is more than one active MDS, restarting an MDS triggers the
assertion "reconnected_snaprealms.empty()" quite often. If there is no
snapshot in the FS, the items left in reconnected_snaprealms should be
other MDSes' mdsdirs. I think it's harmless.
If there are snapshots in the FS, the assertion probably can catch
real bugs. But at present the snapshot feature is broken, and fixing it
is non-trivial. So replace the assertion with a warning.
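The shape of the change, sketched with a plain stream in place of the MDS logging machinery: keep the diagnostic, drop the abort.

    #include <cstddef>
    #include <iostream>

    void check_reconnected_snaprealms(std::size_t leftover) {
      // Before: assert(reconnected_snaprealms.empty()); (aborts the MDS)
      if (leftover != 0)
        std::cerr << "mds warning: " << leftover
                  << " unconnected snap realm(s) after rejoin" << std::endl;
    }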
Yan, Zheng [Mon, 6 May 2013 01:09:59 +0000 (09:09 +0800)]
mds: defer releasing cap if necessary
When an inode is freezing or frozen, we defer processing MClientCaps
messages and cap releases embedded in requests. The same deferral
logic should also cover MClientCapRelease messages.
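A sketch of the deferral pattern with stand-in types: park the message on the inode's waiter list and re-dispatch it after the inode thaws.

    #include <functional>
    #include <vector>

    struct Message;

    struct Inode {
      bool freezing = false, frozen = false;
      std::vector<std::function<void()>> waiters;  // run after unfreeze

      bool is_frozen_or_freezing() const { return freezing || frozen; }
    };

    void handle_client_cap_release(Inode* in, Message* m) {
      if (in->is_frozen_or_freezing()) {
        // Same deferral as MClientCaps: retry once the inode thaws.
        in->waiters.push_back([in, m] {
          handle_client_cap_release(in, m);
        });
        return;
      }
      // ... process the release ...
    }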
Yan, Zheng [Thu, 16 May 2013 17:44:23 +0000 (01:44 +0800)]
mds: fix Locker::request_inode_file_caps()
After sending the cache rejoin message, a replica needs to notify the
auth MDS when cap_wanted changes. But it can send the MInodeFileCaps
message only after receiving the auth MDS' rejoin ack.
Locker::request_inode_file_caps() has the correct wait logic, but it
skips sending the MInodeFileCaps message if the auth MDS is still in
the rejoin state.
The fix is to defer sending the MInodeFileCaps message until the auth
MDS is active. It also makes the function's wait logic less tricky.
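A sketch of the deferral (hypothetical structure; the real function is Locker::request_inode_file_caps()):

    #include <functional>

    struct MDSMapSketch { bool is_active(int mds) const { return true; } };

    struct LockerSketch {
      MDSMapSketch mdsmap;
      std::function<void(int)> waiting;  // parked retry, if any

      void send_inode_file_caps(int mds) { /* send MInodeFileCaps */ }

      // Instead of skipping the send while the auth MDS is rejoining,
      // defer it until the auth MDS is active, then send unconditionally.
      void request_inode_file_caps(int auth) {
        if (!mdsmap.is_active(auth)) {
          waiting = [this](int mds) { request_inode_file_caps(mds); };
          return;                      // retried when auth goes active
        }
        send_inode_file_caps(auth);
      }
    };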
mds: send slave request after target MDS is active
When failure of a peer is detected, MDCache::handle_mds_failure()
checks if there are requests waiting for slave replies from the failed
peer, and adds them to the "wait for active peer" list.
The "retry request" logic only covers slave requests sent before
MDCache::handle_mds_failure() is called. If a slave request was sent
while the peer isn't up, we wait for its reply forever.
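A sketch of the implied fix, with hypothetical names: check the peer at send time and park the request on the same wait-for-active list.

    #include <map>
    #include <vector>

    struct MDRequest;

    std::map<int, std::vector<MDRequest*>> waiting_for_active_peer;
    bool peer_is_active(int mds) { return true; }   // stub
    void do_send_slave_request(MDRequest*, int) {}  // stub

    void send_slave_request(MDRequest* mdr, int target) {
      if (!peer_is_active(target)) {
        // Park the request instead of sending into the void and then
        // waiting for a reply forever.
        waiting_for_active_peer[target].push_back(mdr);
        return;
      }
      do_send_slave_request(mdr, target);
    }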
Yan, Zheng [Tue, 7 May 2013 00:56:11 +0000 (08:56 +0800)]
mds: remove buggy cache rejoin code
I previously added code to handle a corner case of cache rejoin: an
entire subtree, together with the inode the subtree root belongs to,
was trimmed between sending the cache rejoin and receiving the rejoin
ack. In this case, we should send a cache expire message to the
subtree's auth MDS. But the code is completely broken; remove it
temporarily.
The current code uses import state to detect obsolete import
discover/prep messages. It does not work for this case: cancel a
subtree import, import the same subtree again, and the discover/prep
message for the first import gets dispatched.
For an unlink/rename request, the target dentry's linkage may change
before all locks are acquired. So we need to check if the existing
stray dentry is valid.
Yan, Zheng [Wed, 15 May 2013 08:35:39 +0000 (16:35 +0800)]
mds: fix slave commit tracking
An MDS may crash after journalling a slave commit, but before sending
the commit ack to the master. Later when the MDS restarts, it will not
send the commit ack to the master, so the master waits for the commit
ack forever. The fix is to remove the failed MDS from requests'
uncommitted slave lists. When the failed MDS recovers, its resolve
message will tell the master which slave requests are not committed.
The master will re-add the recovering MDS to requests' uncommitted
slave lists if necessary.
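A sketch of the master-side bookkeeping (simplified types): drop a failed MDS from every request's uncommitted slave set, and re-add it only for the requests its resolve message names as uncommitted.

    #include <cstdint>
    #include <map>
    #include <set>

    using metareqid = uint64_t;

    // Master-side: which slaves have not yet acked each request's commit.
    std::map<metareqid, std::set<int>> uncommitted_slaves;

    void handle_mds_failure(int who) {
      for (auto& [reqid, slaves] : uncommitted_slaves) {
        (void)reqid;
        slaves.erase(who);  // stop waiting for a commit ack from 'who'
      }
    }

    // The recovering MDS's resolve message lists its uncommitted slave
    // requests; re-add it only for those.
    void handle_resolve(int who, const std::set<metareqid>& uncommitted) {
      for (metareqid reqid : uncommitted)
        uncommitted_slaves[reqid].insert(who);
    }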