Yan, Zheng [Mon, 20 Feb 2017 03:27:09 +0000 (11:27 +0800)]
mds: properly set default dir_hash for directory inodes
MDCache::handle_cache_rejoin_strong() may add new inodes (race with
cache expire). These inodes are only updated at the very end of the
function. Before they get updated, MDCache::handle_cache_rejoin_strong()
may add dentries to them, so the dir_hash type of these inodes
should be set to the default value.
Yan, Zheng [Fri, 17 Feb 2017 02:51:43 +0000 (10:51 +0800)]
mds: don't call kick_discovers() for recovering mds twice
MDSRankDispatcher::handle_mds_map() calls kick_discovers() when
the recovering mds enters rejoin state. No need to call it again
when the recovering mds enters clientreplay/active state.
Yan, Zheng [Wed, 15 Feb 2017 07:51:00 +0000 (15:51 +0800)]
mds: avoid race between cache expire and MDentryLink
Commit 22535340 tried to fix the race between cache expire and
MDentryLink. It avoids trimming a null dentry whose lock is
not readable. The fix does not handle the case where the MDS
first receives an MDentryUnlink message, then receives an
MDentryLink message.
Yan, Zheng [Mon, 13 Feb 2017 01:30:43 +0000 (09:30 +0800)]
mds: fix deadlock when wrlock and remote_wrlock the same lock
When handling trans-authority rename, the master mds may ask slave
mds to wrlock a lock, then try to wrlock the same lock locally.
If the master can't wrlock the lock locally, it needs to drop the
remote wrlock and wait. Otherwise deadlock happens. The code does
not handle a corner case: Lock::wrlock_start() can sleep even
when SimpleLock::can_wrlock() returns true.
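
The corner case above can be illustrated with a toy model. This is only
a sketch under stated assumptions: ToyLock, try_local_wrlock() and
acquire_both() are hypothetical names, not Ceph APIs. The point it shows
is that the decision to keep the remote wrlock should follow the result
of the start call itself, not the result of a can_wrlock()-style check.

```cpp
#include <cassert>

// Toy model only: none of these names are Ceph APIs.
struct ToyLock {
  bool can_wrlock = true;        // what a can_wrlock()-style check reports
  bool start_must_wait = false;  // the start call may still have to wait
  bool remote_wrlocked = false;
  bool local_wrlocked = false;
};

// Models a wrlock_start()-style call: returns false if the caller must wait,
// which can happen even though can_wrlock is true.
inline bool try_local_wrlock(ToyLock& l) {
  if (!l.can_wrlock || l.start_must_wait)
    return false;
  l.local_wrlocked = true;
  return true;
}

// Returns true if both the remote and the local wrlock are held.
// On failure the remote wrlock is dropped before the caller waits.
inline bool acquire_both(ToyLock& l) {
  assert(!l.remote_wrlocked);    // precondition: remote wrlock not yet held
  l.remote_wrlocked = true;      // the slave granted the remote wrlock
  if (try_local_wrlock(l))
    return true;
  l.remote_wrlocked = false;     // drop it, then wait and retry later
  return false;
}
```

In this model, a lock with can_wrlock set and start_must_wait set makes
acquire_both() return false with neither lock held.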
Yan, Zheng [Fri, 10 Feb 2017 09:31:37 +0000 (17:31 +0800)]
mds: issue new caps to client even when session is stale
If early reply is not allowed for an open request, MDS does not send
the reply to the client immediately after adding new caps. Later, when
MDS sends the reply, the client session can be in stale state. If MDS
does not issue the new caps to the client along with the reply, the
new caps get lost. This issue can cause MDS to hang at revoking
caps.
Yan, Zheng [Thu, 9 Feb 2017 03:43:57 +0000 (11:43 +0800)]
mds: note subtree bounds when rolling back rename
mds can do a slave rename that moves a directory inode (whose dirfrags
are all non-auth) to a new auth, then roll back the slave rename. If
there is an ESubtreeMap event between the log event of the slave rename
and the log event of the rollback, the ESubtreeMap does not have
information about the inode's non-auth dirfrags.
Later, when mds replays the log, the log event of the slave rename can
be missing. So mds needs to re-create subtree bounds when replaying
the log event of the rename rollback.
Yan, Zheng [Wed, 8 Feb 2017 03:25:16 +0000 (11:25 +0800)]
mds: properly set ambiguous auth on auth mds of rename source inode
When doing trans-authority rename, the master mds may send two slave
requests to the auth mds of the rename source inode. The first slave
request sets ambiguous auth on the rename source inode. The second slave
request is sent after receiving all bystanders' slave request replies.
Current code uses the mdr->more()->is_ambiguous_auth bit to indicate
whether the first slave request was sent. The is_ambiguous_auth bit is
set when calling Server::_rename_prepare_witness(). This causes a problem
if Server::_rename_prepare_witness() can't send the slave request
immediately and wants to retry the MDRequest later. The fix is to set
is_ambiguous_auth when receiving the reply for the first slave request.
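
A toy sketch of the ordering described above, assuming hypothetical
names (ToyRequest, prepare_witness(), handle_first_slave_reply()) that
are not Ceph APIs. It only shows why the flag is safer to set on reply:
a prepare step that could not send anything leaves the flag clear, so a
later retry still sends the first slave request.

```cpp
#include <cassert>

// Toy model only: none of these names are Ceph APIs.
struct ToyRequest {
  bool is_ambiguous_auth = false;  // set once the first slave request is acked
  int slave_requests_sent = 0;
};

// Models a prepare step that may be unable to send right now.
// Returns false when the whole request has to be retried later.
inline bool prepare_witness(ToyRequest& r, bool can_send_now) {
  if (!can_send_now)
    return false;                  // nothing was sent, so the flag stays clear
  ++r.slave_requests_sent;
  return true;
}

// Models receiving the reply to the first slave request.
inline void handle_first_slave_reply(ToyRequest& r) {
  assert(r.slave_requests_sent > 0);
  r.is_ambiguous_auth = true;
}
```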
Yan, Zheng [Tue, 7 Feb 2017 02:25:06 +0000 (10:25 +0800)]
mds: disambiguate other mds' imports when cluster enters rejoin state
When the mds cluster is in rejoin state, we know all mds have finished
their exports. All export abort notifications have been processed
by bystander mds. So it's safe to disambiguate other mds' imports.
Yan, Zheng [Tue, 7 Feb 2017 05:15:59 +0000 (13:15 +0800)]
mds: wait acknowledgment for export abort notification
To disambiguate other mds' failed imports, a survivor bystander mds
needs to receive either the exporter mds' export abort notification or
the exporter mds' resolve message. For a bystander mds, it's hard to
distinguish "export succeeded" from the case "hasn't received
export abort notification".
To handle this problem, we rely on the fact that a survivor mds does
not send a resolve message to the recovering mds until it finishes
all its exports. Without the resolve message, the recovering mds
can't go to rejoin state. So when the mds cluster is in rejoin state,
we know all mds have finished their exports. If export abort
notifications also require acknowledgments, then when the mds cluster
is in rejoin state, we know all export abort notifications have been
processed by bystander mds. So bystander mds can disambiguate other
mds' imports.
Yan, Zheng [Mon, 6 Feb 2017 09:15:40 +0000 (17:15 +0800)]
mds: cleanup ambiguous slave update when master mds fails
When the auth mds of the rename source dentry fails, slave updates in
witness mds become ambiguous. Witnesses need to ask the master if they
should roll back the updates. This type of rollback is special: the
corresponding MDRequest struct needs to be preserved after rollback. If
the master mds also fails, slave updates in witness mds are no longer
special. The corresponding MDRequest struct needs to be cleaned up after
rollback.
Yan, Zheng [Mon, 6 Feb 2017 08:47:43 +0000 (16:47 +0800)]
mds: log master commit after all slave commits get journaled
When a survivor mds sends a resolve message to the recovering mds, it
also records committing slave requests in the message. So the recovering
mds knows the slave commit is still being journaled. It journals the
master commit after receiving the corresponding OP_COMMITTED message.
Yan, Zheng [Fri, 3 Feb 2017 06:58:56 +0000 (14:58 +0800)]
mds: avoid journaling unnecessary dirfrags in ESubtreeMap
EMetaBlob::add_dir_context() skips adding inodes that have already
been journaled in the last ESubtreeMap. The log replay code only
replays the first ESubtreeMap. For the remaining ESubtreeMap events, it
just verifies that the subtree map in the cache matches the ESubtreeMap.
If unnecessary inodes were included in a non-first ESubtreeMap, these
inodes do not get added to the cache, and the log replay code can
find these inodes missing when replaying the remaining events in
the log segment.
The previous attempt (commit a9b959dfb7) to fix this issue was not
complete. This patch makes MDCache::create_subtree_map() journal
dirfrags according to the simplified subtree map. It should fix this
issue completely.
Vikhyat Umrao [Thu, 16 Feb 2017 18:21:11 +0000 (23:51 +0530)]
auth: 'ceph auth import -i' overwrites caps, if caps are not specified
If caps are not specified in the given keyring file, 'ceph auth import -i'
overwrites the existing caps. It should alert the user and should not
allow this import, because 'ceph auth list' keeps all the keyrings with
caps, and importing a 'client.admin' user keyring without caps locks the
cluster with error [1], since the admin keyring caps are missing in
'ceph auth'.
[1] Error connecting to cluster: PermissionDeniedError
Sage Weil [Fri, 17 Feb 2017 19:50:38 +0000 (14:50 -0500)]
osd/PGLog: reindex properly on pg log split
When pg_log_t::split_out_child() runs it builds the list, which means the
old indexes are wrong (they point to bad memory), but index() will not
rebuild them because ever since b858e869e78927dccebaa350d246bd74af7f1de9
we won't rebuild them if they are already built.
Fix that by calling unindex() before the split.
Further, the new child log also needs to be indexed. Fix that too.
Fixes: http://tracker.ceph.com/issues/18975
Signed-off-by: Sage Weil <sage@redhat.com>
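
The failure mode and the fix in the PGLog commit can be sketched with a
toy log. This is not the pg_log_t implementation: ToyEntry, ToyLog and
their members are hypothetical, and the split rule (match on a pg
number) is an assumption made for illustration. The sketch keeps the
"skip if already built" behavior of index(), so the split has to call
unindex() first and then index both the parent and the child.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <unordered_map>

// Toy model only: none of these names are Ceph APIs.
struct ToyEntry {
  std::string oid;
  int pg;
};

struct ToyLog {
  std::list<ToyEntry> entries;
  std::unordered_map<std::string, const ToyEntry*> by_oid;
  bool indexed = false;

  void index() {
    if (indexed) return;            // already built: do not rebuild
    for (const auto& e : entries) by_oid[e.oid] = &e;
    indexed = true;
  }

  void unindex() {
    by_oid.clear();
    indexed = false;
  }

  // Copies entries of child_pg into child, keeps the rest, and
  // rebuilds both indexes against the newly built lists.
  void split_out_child(int child_pg, ToyLog& child) {
    assert(&child != this);
    unindex();                      // old pointers are about to dangle
    std::list<ToyEntry> old;
    old.swap(entries);
    for (const auto& e : old)
      (e.pg == child_pg ? child.entries : entries).push_back(e);
    index();                        // parent index points into the new list
    child.unindex();
    child.index();                  // the child log needs an index too
  }
};
```

Without the unindex() call at the top of split_out_child(), index()
would return early in this model and by_oid would keep pointers into
the destroyed old list.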
This script currently has a syntax error, but still exits with
success, which is hiding that failure. Expose it by allowing
the 'sudo' exit code to be the script's exit code.
rgw: make sending Content-Length in 204 and 304 controllable
This commit introduces a new configurable "rgw print prohibited
content length" to let the operator decide whether RadosGW complies
with RFC 7230 (a part of the HTTP specification) or violates it
but follows Swift's behavior.
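
The decision this configurable controls can be sketched as a single
predicate. This is an illustration only, not the RadosGW code: the
function name and parameter are hypothetical, and the set of affected
statuses (204 and 304) is taken from the commit title.

```cpp
#include <cassert>

// Hypothetical helper, not a RadosGW API.
// Returns true if a Content-Length header should be emitted.
// print_prohibited_content_length stands in for the operator's setting.
inline bool should_send_content_length(int status,
                                       bool print_prohibited_content_length) {
  assert(status >= 100 && status <= 599);   // valid HTTP status range
  const bool affected = (status == 204 || status == 304);
  return !affected || print_prohibited_content_length;
}
```

With the setting off, the header is suppressed for 204 and 304 only;
with it on, the header is sent for every status, matching the Swift
behavior mentioned above.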