Yan, Zheng [Thu, 5 Dec 2019 09:24:47 +0000 (17:24 +0800)]
mds: fix deadlock when xlocking policylock
Previous commit makes Server::rdlock_path_pin_ref() rdlock snaplock and
policylock. A deadlock happens if we use this function for requests that
xlock snaplock or policylock.
When a client has 'CEPH_CAP_FILE_EXCL|CEPH_CAP_DIR_UNLINK' caps on a
directory inode, it can asynchronously unlink a file from the directory
if the corresponding dentry is a primary link.
When a client has 'CEPH_CAP_FILE_EXCL|CEPH_CAP_DIR_CREATE' caps on a
directory inode, it can asynchronously create a new file in the
directory as long as no file with the same name exists.
Define cap bits for async dir operations. The lock cache for a given
type of dir operation can be delegated to a client through the cap
mechanism. As long as the client holds the corresponding cap, dir
operation requests from the client can use the lock cache. If the MDS
wants to invalidate a lock cache, it needs to first revoke the
corresponding cap from the client.
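A minimal sketch of the cap check described above. The constants and
function names here are illustrative only; the real cap bit values are
defined in include/ceph_fs.h and differ from these.

    #include <cstdint>

    // Hypothetical cap bits for illustration; not the real values.
    constexpr uint32_t CAP_FILE_EXCL  = 1u << 0;
    constexpr uint32_t CAP_DIR_CREATE = 1u << 1;
    constexpr uint32_t CAP_DIR_UNLINK = 1u << 2;

    // A client may only issue the async variant of a dir operation while
    // it still holds the corresponding caps; the MDS revokes the cap
    // (and only then drops the lock cache) when it needs the locks back.
    bool can_async_create(uint32_t caps) {
      return (caps & (CAP_FILE_EXCL | CAP_DIR_CREATE)) ==
             (CAP_FILE_EXCL | CAP_DIR_CREATE);
    }

    bool can_async_unlink(uint32_t caps) {
      return (caps & (CAP_FILE_EXCL | CAP_DIR_UNLINK)) ==
             (CAP_FILE_EXCL | CAP_DIR_UNLINK);
    }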
mds: suppress frozen inode when locks of a dir operation are cached
A frozen inode in a directory with lock caches may cause an ABBA
deadlock. The request that owns the frozen inode can't get locks
because the lock cache is holding conflicting locks. The request that
uses the lock cache can't auth pin the frozen inode.
The solution is to not create a lock cache when the directory contains
a freezing or frozen inode. When the MDS starts freezing an inode,
invalidate all lock caches on its parent directory.
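A miniature sketch of the rule above, using hypothetical stand-ins for
CDir/CInode rather than the real MDS types.

    #include <vector>

    struct Inode { bool freezing = false; bool frozen = false; };

    struct Dir {
      std::vector<Inode*> inodes;
      int lock_cache_count = 0;

      // Never create a lock cache over a freezing or frozen inode.
      bool may_create_lock_cache() const {
        for (const Inode* in : inodes)
          if (in->freezing || in->frozen)
            return false;
        return true;
      }
    };

    // Conversely, starting to freeze an inode first invalidates every
    // lock cache on its parent directory, so neither side ends up
    // waiting on the other.
    void start_freezing(Inode& in, Dir& parent) {
      parent.lock_cache_count = 0;  // stands in for "invalidate all lock caches"
      in.freezing = true;
    }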
mds: invalidate lock caches when freezing dirfrag/subtree
Add a list to CDir that tracks lock caches holding auth pins on objects
in the CDir. When the MDS wants to freeze a dirfrag or subtree, it can
find those lock caches and invalidate them.
mds: invalidate lock caches if they hold conflicting locks
Add a list to SimpleLock that tracks lock caches holding locks on the
SimpleLock itself. When the MDS wants to change the lock state, it can
find those lock caches and invalidate them.
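A sketch of the bookkeeping described in the two entries above; the
types are hypothetical illustrations, not the real CDir/SimpleLock.

    #include <list>

    struct LockCache;  // forward declaration

    struct SimpleLockLike {
      int state = 0;
      std::list<LockCache*> lock_caches;  // caches holding this lock
    };

    struct DirLike {
      std::list<LockCache*> auth_pinning_caches;  // caches auth-pinning objects here
    };

    struct LockCache { bool invalidated = false; };

    // Before changing a lock's state or freezing a dirfrag/subtree, walk
    // the list and invalidate every cache that would block the transition.
    template <typename Container>
    void invalidate_all(Container& caches) {
      for (LockCache* lc : caches)
        lc->invalidated = true;
      caches.clear();
    }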
The lock cache preserves the locks and auth pins required for directory
operations. The MDS can create a lock cache once it has acquired all
locks of a directory operation. The lock cache can then be used for
later operations of the same type on the same directory.
For example, when the MDS has acquired all locks of an unlink operation,
it creates a lock cache, which holds wrlocks on the directory inode's
filelock and nestlock, and rdlocks on ancestor inodes' snaplocks. For
later unlink operations on the same directory, the MDS only needs to
xlock the dentry to unlink and xlock the linklock of the inode to
unlink.
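A self-contained sketch of the unlink example above. The names and the
string-based lock labels are purely illustrative; the real lock cache
lives in the MDCache/Locker code and is not keyed by strings.

    #include <cstdint>
    #include <set>
    #include <string>

    // A lock cache remembers the "expensive" locks of one unlink: wrlocks
    // on the directory inode's filelock/nestlock and rdlocks on ancestor
    // snaplocks.
    struct LockCache {
      uint64_t dir_ino;            // directory the cache belongs to
      std::set<std::string> held;
      bool invalidated = false;
    };

    struct UnlinkRequest {
      uint64_t dir_ino;
      std::string dentry;
    };

    // With a valid cache for this directory, a later unlink only has to
    // xlock the dentry and the victim inode's linklock.
    std::set<std::string> locks_still_needed(const UnlinkRequest& req,
                                             const LockCache* cache) {
      std::set<std::string> needed = {
        "dentry:" + req.dentry + ":x",
        "victim.linklock:x",
      };
      if (!cache || cache->invalidated || cache->dir_ino != req.dir_ino)
        needed.insert({"dir.filelock:wr", "dir.nestlock:wr",
                       "ancestor.snaplocks:rd"});
      return needed;
    }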
Take snap rdlocks (instead of taking dentry locks) on the subtree's
ancestor inodes. The path to the subtree is stable once they are all
locked.
For dirfragtree locks on subtree bounds, it's not convenient to rdlock
them in top-down order (paths to them are not stable). The solution is
to kick them to the SYNC state and then try taking rdlocks on all of
them.
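A rough sketch of the 'kick to SYNC, then try rdlock' idea, with a toy
state machine standing in for the real SimpleLock states.

    #include <vector>

    enum class State { SYNC, LOCK, XLOCK };

    struct BoundLock {
      State state = State::LOCK;
      int readers = 0;
      bool try_rdlock() {
        if (state != State::SYNC) return false;
        ++readers;
        return true;
      }
    };

    bool rdlock_subtree_bounds(std::vector<BoundLock>& bounds) {
      for (auto& l : bounds)
        if (l.state != State::SYNC)
          l.state = State::SYNC;   // "kick" toward SYNC (asynchronous in reality)
      for (auto& l : bounds)
        if (!l.try_rdlock())
          return false;            // caller retries after the state change completes
      return true;
    }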
Yan, Zheng [Mon, 7 Oct 2019 07:50:50 +0000 (15:50 +0800)]
mds: add 'path_locked' flag to MDCache::find_ino_peers()
The MDS now relies on snaplocks to ensure that paths for slave requests
are stable. MDCache::handle_find_ino_reply() may encounter an xlocked
dentry during path traversal.
If an inode's snaplock is locked by an MDRequest, the inode cannot be
renamed by another MDRequest. In the rename case, the snaplocks of all
inodes in the paths are locked, so the paths are stable while handling
a slave rename. It's OK to discover all components in the paths, even
if they are xlocked.
The helper function is for requests that operate on two paths. It
ensures that the two paths get locked in the proper order. The rules
are:
1. Lock directory inodes or dentries according to which trees they
are under. Lock objects under fs root before objects under mdsdir.
2. Lock directory inodes or dentries according to their depth, in
ascending order.
3. Lock directory inodes or dentries according to inode numbers or
dentries' parent inode numbers, in ascending order.
4. Lock dentries in the same directory in order of their keys.
5. Lock non-directory inodes according to inode numbers, in ascending
order.
This patch also makes handle_client_link() and handle_client_rename()
use this helper function.
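The ordering rules above can be sketched as a comparator over a
flattened description of each lock target. This is an illustration
only; the real MDS sorts its own CInode/CDentry objects, and this
sketch assumes directory-side objects are taken before plain inodes.

    #include <cstdint>
    #include <string>
    #include <tuple>

    struct LockTarget {
      bool under_mdsdir;      // rule 1: fs-root objects sort before mdsdir objects
      bool is_dir;            // directory inode/dentry vs. non-directory inode
      int depth;              // rule 2: distance from the tree root
      uint64_t ino;           // rule 3: inode number, or dentry's parent inode number
      std::string dentry_key; // rule 4: dentry key within its directory ("" for inodes)
    };

    bool lock_order_less(const LockTarget& a, const LockTarget& b) {
      if (a.is_dir != b.is_dir)
        return a.is_dir;       // directory-tree objects first; rule 5 applies last
      if (a.is_dir)
        return std::tie(a.under_mdsdir, a.depth, a.ino, a.dentry_key) <
               std::tie(b.under_mdsdir, b.depth, b.ino, b.dentry_key);
      return a.ino < b.ino;    // rule 5: non-directory inodes by inode number
    }

    // Usage: std::sort(targets.begin(), targets.end(), lock_order_less);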
mds: make Server::rdlock_path_xlock_dentry take locks
Introduce MDS_TRAVERSE_XLOCK_DENTRY, which instructs
MDCache::path_traverse() to take the appropriate locks (xlock the
dentry, wrlock the directory's filelock/nestlock) for file
creation/deletion.
mds: take snaplock and policylock during path traversal
To take locks in top-down order for an MDRequest, we need to first take
snap/policy rdlocks on the ancestor inodes of the request's base inode.
It's not convenient to use Locker::acquire_locks() for this because the
path to the request's base inode can change before all of these locks
are rdlocked.
This patch introduces Locker::try_rdlock_snap_layout(), which tries
taking snap/policy rdlocks on the request's base inode and its
ancestors all at the same time. MDCache::path_traverse() calls this
function first, then uses Locker::acquire_locks() to take snaplocks on
the components of the request's path.
This patch also reorders the inode locks, putting snaplock and
policylock at the front, because some requests (such as setattr) may
xlock other locks after taking snaplock/policylock.
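A minimal sketch of the all-or-nothing rdlock attempt described above,
with hypothetical types in place of the real inode/lock classes.

    #include <vector>

    struct InodeLocks {
      bool snap_rdlockable = true;
      bool policy_rdlockable = true;
      int snap_readers = 0, policy_readers = 0;
    };

    // Try to rdlock snaplock/policylock on the base inode and all its
    // ancestors in one shot. Returns false without holding anything if
    // any of them can't be rdlocked, so the caller can bail out and
    // retry instead of waiting with partial locks.
    bool try_rdlock_snap_layout(std::vector<InodeLocks*>& base_and_ancestors) {
      for (auto* in : base_and_ancestors)
        if (!in->snap_rdlockable || !in->policy_rdlockable)
          return false;
      for (auto* in : base_and_ancestors) {
        ++in->snap_readers;
        ++in->policy_readers;
      }
      return true;
    }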
Yan, Zheng [Wed, 14 Aug 2019 03:22:35 +0000 (11:22 +0800)]
mds: let Locker::acquire_locks()'s caller choose locking order
This patch makes Locker::acquire_locks() lock objects in the order
specified by its caller. Locker::acquire_locks() only rearranges locks
within the same object (relieving callers of remembering that order).
This patch is preparation for 'lock objects in top-down order'.
Besides, this patch allows an MDRequest to lock objects step by step.
For example: call Locker::acquire_locks() to lock a dentry; after the
dentry is locked, call Locker::acquire_locks() to lock the inode linked
by the dentry.
Locking objects step by step introduces a problem. An MDRequest may
need to auth pin extra objects after taking some locks. If any object
cannot be auth pinned, the MDRequest needs to drop all locks before
going to wait. For a slave auth pin request, this patch makes the slave
MDS send a notification back to the master MDS if the auth pin request
is blocked. The master MDS drops its locks when receiving the
notification.
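A sketch of the 'lock step by step, drop everything if a later auth pin
blocks' behaviour, using hypothetical stand-ins for the Locker and
MDRequest interfaces.

    #include <vector>

    struct Object { bool lockable = true; bool auth_pinnable = true; };

    struct Request {
      std::vector<Object*> locked;
      void drop_all_locks() { locked.clear(); }
    };

    // One step of acquire_locks(): lock a single object in caller order.
    bool acquire_step(Request& req, Object& obj) {
      if (!obj.lockable) return false;
      req.locked.push_back(&obj);
      return true;
    }

    // After taking some locks, the request may still need to auth pin
    // more objects (possibly via a slave MDS). If the pin is blocked, the
    // slave notifies the master and the master drops all locks before
    // waiting, instead of sleeping while holding them.
    bool auth_pin_or_drop(Request& req, Object& extra) {
      if (!extra.auth_pinnable) {
        req.drop_all_locks();
        return false;
      }
      return true;
    }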
Sage Weil [Sun, 24 Nov 2019 02:30:28 +0000 (20:30 -0600)]
Merge PR #31778 into master
* refs/pull/31778/head:
os/bluestore: pin onodes as they are added to the cache
Revert "Revert "Merge pull request #30964 from markhpc/wip-bs-cache-trim-pinned""
Reviewed-by: Mark Nelson <mnelson@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Sage Weil [Sat, 23 Nov 2019 14:48:32 +0000 (08:48 -0600)]
Merge PR #31796 into master
* refs/pull/31796/head:
PendingReleaseNotes: note about the removal of 'nvme' class
common/blkdev: drop is_nvme() method
os/bluestore/KernelDevice: get rid of 'nvme' type
Sage Weil [Fri, 22 Nov 2019 21:28:17 +0000 (15:28 -0600)]
Merge PR #31502 into master
* refs/pull/31502/head:
qa/tasks/ceph2: get ceph-daemon from same place as ceph
qa/tasks/ceph2: use safe_while
qa/tasks/ceph2: pull image using sha1
qa/tasks/ceph2: docker needs quay.io/ prefix for image name
qa/workunits/rados/test_python: make sure rbd pool exists
qa/suites/rados/ssh: new tests!
qa/tasks/ceph2: pull ceph-ci/ceph:$branch
qa/tasks/ceph2: register_daemons after pods start
qa/tasks/ceph2: fix conf
qa/tasks/ceph2: add restart
qa/tasks/ceph2: pass ceph-daemon path to DaemonState
qa/tasks/ceph2: tolerate no mdss or 1 mgr
qa/tasks/ceph: replace wait_for_osds_up with manager.wait_for_all_osds_up
qa/tasks/ceph: wait-until-healthy
qa/tasks/ceph2: set up managers
qa/tasks/ceph2: use seed ceph.conf
qa/tasks/ceph: healthy: use manager helpers (instead of teuthology/misc ones)
qa/tasks/ceph2: name mds daemons
qa/tasks/ceph2: fix osd ordering
qa/tasks/ceph2: start up mdss
qa/tasks/ceph2: set up daemon handles and use them to stop
qa/tasks/ceph2: make it multicluster-aware
qa/tasks/ceph2: can bring up mon, mgr, osds!
qa/tasks/ceph2: basic task to bring up cluster with ceph-daemon and ssh
These are unused since 1d29722f801c ("switch monc, daemons to use new
msgr2 auth frame exchange"). As they default to false, a confused user
might flip them to true and think that their client <-> OSD traffic is
encrypted.
The new set of options was added in c7ee66c3e54b
("auth,msg/async/ProtocolV2: negotiate connection modes").
Before this change, librados applications were responsible for calling
`AioCompletion::release()` explicitly to release its internal pimpl
pointer. This is error prone and not intuitive.
After this change, the destructors of `AioCompletion` and
`PoolAsyncCompletion` do this automatically, while
`AioCompletion::release()` and `PoolAsyncCompletion::release()` still
delete the instance as they did before. So this change is backward
compatible: existing librados clients can still use `ptr->release()`
to free the completion instance, while new clients can just `delete
ptr`.
librados_test_stub is updated accordingly to match the new model.
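A small sketch of the two usage patterns, assuming the standard librados
C++ API (librados.hpp) and its no-argument aio_create_completion()
overload; cluster setup and actual I/O are omitted.

    #include <rados/librados.hpp>

    int main() {
      librados::Rados cluster;
      // ... init2()/conf_read_file()/connect() omitted ...

      // Old pattern: release() frees the completion and its internal impl.
      librados::AioCompletion *c1 = cluster.aio_create_completion();
      c1->release();

      // New pattern (after this change): plain delete also works, because
      // the destructor now releases the internal pimpl pointer itself.
      librados::AioCompletion *c2 = cluster.aio_create_completion();
      delete c2;
      return 0;
    }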
Sage Weil [Fri, 22 Nov 2019 14:50:52 +0000 (08:50 -0600)]
Merge PR #31798 into master
* refs/pull/31798/head:
ceph-daemon: ceph-volume works without an fsid
ceph-daemon: several commands that can infer fsids still require them
ceph-daemon: fix fsid inference
Tim Serong [Fri, 22 Nov 2019 09:25:19 +0000 (20:25 +1100)]
mgr/PyModule: correctly remove config options
Previously, incorrect parameters were being passed to "config rm",
causing it to do nothing. This commit also ensures the correct
error message is shown for both the set and remove failure cases.
I've also moved the update of the in-memory config map to *after*
the value is persisted, to ensure the config map actually reflects
what's stored.
Fixes: https://tracker.ceph.com/issues/42958
Signed-off-by: Tim Serong <tserong@suse.com>
Sage Weil [Thu, 21 Nov 2019 22:19:17 +0000 (16:19 -0600)]
os/bluestore/KernelDevice: get rid of 'nvme' type
We are either 'hdd' or 'ssd' based on the rotational flag. Previously,
we would try to distinguish between an NVMe and a SATA/SAS SSD and set
the class to 'nvme'. This was misguided: the interface is not important
and has no direct bearing on the device performance. Moreover, HDD
manufacturers are planning to produce rotational HDDs that use the NVMe
interface instead of SATA/SAS.
So, drop this.
This may be somewhat disruptive to clusters where we were detecting
nvme but now are not. However, the good news is that this doesn't seem
to trigger for real deployments because LVM breaks the is_nvme()
method.
Patrick Donnelly [Thu, 21 Nov 2019 18:09:39 +0000 (10:09 -0800)]
mds: release free heap pages after trim
MDS free heap space can grow too large for some workloads (like
smallfile and recursive deletes). This can cause the MDS mapped memory
to grow well beyond memory targets.
When we finally use the PriorityCache in the MDS, this will not be
necessary anymore as the PriorityCache already does this.
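The idea is simply to hand free heap pages back to the OS after a cache
trim. The sketch below uses glibc's malloc_trim() purely as an
illustration of that; the MDS actually goes through its configured
allocator's release hook (e.g. tcmalloc), not this call.

    #include <malloc.h>

    // Return as much unused heap as possible to the kernel after trimming
    // the cache, so mapped memory tracks the memory target more closely.
    void release_free_heap_pages() {
      malloc_trim(0);
    }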
Fixes: https://tracker.ceph.com/issues/42938
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Sage Weil [Mon, 18 Nov 2019 01:45:36 +0000 (19:45 -0600)]
mgr/ssh: add mode option
Set the mgr/ssh/mode option to ceph-daemon-package to switch to a mode
where we assume ceph-daemon is installed on the remote host and we can
ssh as user cephdaemon.