Patrick Donnelly [Fri, 22 Sep 2017 16:53:37 +0000 (09:53 -0700)]
Merge PR #17854 into luminous
* refs/remotes/upstream/pull/17854/head:
mds: avoid sending cap import message when inode is frozen
client: fix message order check in handle_cap_export()
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
ceph: do link/rename semantic checks after srcdn is readable
For a hard link, the source inode must not be a directory. For a rename,
the types of the source and destination inodes must match. If srcdn is a
replica and we do these checks while it is not readable, the wrong
source inode may be used in these checks.
client: set client_try_dentry_invalidate to false by default
By default, ceph-fuse uses a side effect of 'dentry invalidation' to
trim the kernel dcache if it runs on a kernel < 3.18. The implementation
of the kernel function d_invalidate() changed in the 3.18 kernel, so the
method no longer works for upstream kernels >= 3.18.
The RHEL 3.10 kernel includes a backport of the patches that change the
implementation of d_invalidate(). So checking the kernel version to
decide whether the 'dentry invalidation' method works is unreliable.
Douglas Fuller [Tue, 12 Sep 2017 17:22:09 +0000 (13:22 -0400)]
qa/tasks/cephfs: Whitelist POOL_APP_NOT_ENABLED for test_misc
test_misc verifies that ceph fs new will not create a filesystem
on a pool that already contains objects. As part of the test, it
inserts a dummy object into a pool and then attempts to use it for
CephFS. This triggers POOL_APP_NOT_ENABLED. Setting the application
metadata for the pool (and having ceph fs new fail because of the
existing metadata) would then exercise a different failure case.
Jeff Layton [Fri, 25 Aug 2017 12:31:47 +0000 (08:31 -0400)]
client: add mountedness check inside client_lock
Currently we check for mountedness in the high level wrappers, but those
checks are lockless. It's possible to have a call that races with
ceph_unmount(). It could pass one of the is_mounted() checks in the
wrapper, and then block on the client_lock while the unmount is actually
running. Eventually it picks up and runs after the unmount returns, with
questionable results -- possibly even a crash in some cases.
For now, we can explain this away with a simple admonition that
applications should ensure that no calls are running when ceph_unmount
is called. In the future though, we may need to forcibly shut down the
mount when certain events occur (not returning a lease or delegation in
time, for instance).
Sprinkle in a bunch of "unmounting" checks after taking the client_lock,
and simply have the functions return errors (or sensible values in some
cases) when the Client is being downed. With that, we ensure that this
sort of race can't occur, even when the unmount is not being driven by
userland. Note too that in some places I've replaced assertions in the
code with error returns, as that's nicer behavior for libraries.
Note that this can't replace the ->is_mounted() checks in the lockless
wrappers as those are needed to determine whether the client pointer in
the ceph_mount_info is still valid. The admonition not to allow
ceph_unmount to race with other calls is therefore still necessary.
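
A minimal sketch of the check-after-lock pattern described above, written
in Python for brevity (the real Client is C++). The class shape, the file
table, and the choice of ENOTCONN as the returned error are assumptions
made for illustration, not the actual libcephfs code.

```python
import errno
import threading


class Client:
    """Hypothetical stand-in for the libcephfs Client class."""

    def __init__(self):
        self.client_lock = threading.Lock()
        self.unmounting = False
        self.files = {"/a": b"hello"}  # illustrative state torn down on unmount

    def read_size(self, path):
        with self.client_lock:
            # Re-check the state only after the lock is held: an unmount may
            # have run while this call was blocked waiting for client_lock.
            if self.unmounting:
                return -errno.ENOTCONN  # assumed error code
            return len(self.files.get(path, b""))

    def unmount(self):
        with self.client_lock:
            self.unmounting = True
            self.files.clear()


client = Client()
before = client.read_size("/a")   # normal call while mounted
client.unmount()
after = client.read_size("/a")    # call arriving after the unmount
```

Returning an error rather than asserting lets a library caller recover,
which matches the stated preference for error returns over assertions.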
Douglas Fuller [Wed, 12 Jul 2017 15:43:39 +0000 (10:43 -0500)]
qa/cephfs: support CephFS recovery pools
Add support for testing recovery of CephFS metadata into an alternate
RADOS pool, useful as a disaster recovery mechanism that avoids
modifying the metadata in-place.
Douglas Fuller [Wed, 12 Jul 2017 15:41:11 +0000 (10:41 -0500)]
qa/ceph_test_case: support CephFS recovery pools
Add support for testing recovery of CephFS metadata into an alternate
RADOS pool, useful as a disaster recovery mechanism that avoids
modifying the metadata in-place.
Yan, Zheng [Tue, 29 Aug 2017 03:35:56 +0000 (11:35 +0800)]
mds: avoid sending cap import message when inode is frozen
To export an inode to another mds, the mds needs to:
- Freeze the inode (stop issuing caps to clients)
- Flush client sessions (ensure clients have received all cap messages)
- Send cap export message
These steps guarantee that clients receive cap import/export messages
in the proper order (in case the inode gets exported several times
within a short time).
When the inode is frozen, the mds may have already flushed client
sessions, so the mds shouldn't send cap import messages.
Yan, Zheng [Mon, 28 Aug 2017 09:13:31 +0000 (17:13 +0800)]
client: fix message order check in handle_cap_export()
If the importer mds' cap already exists but the cap ID does not match,
the client must already have received the corresponding import message
(the imported caps were released later), because the cap ID does not
change as long as the client holds the caps.
mds: check ongoing scatter-gather process before capping log
When deactivating an mds, MDLog::trim() may start a scatter-gather
process on the mdsdir inode. Locker::scatter_writebehind() submits
a log entry. So the mds should make sure there is no scatter-gather
in progress before capping the log.
The fast dispatch refactor in 3cc48278bf0ee5c9535d04b60a661f988c50063b
eliminated the osdmap subscription in the ms_fast_dispatch path, which
meant ops could reach a PG without having the latest map. In a cluster
with few osdmap updates, where the monitor fails to send a new map to
an osd (it tries one random osd), this can result in indefinitely
blocked requests.
Fix this by adding an OSDService mechanism for scheduling a new osdmap
subscription request.
We need to prevent duplicates in the final result. For example, we
can currently take
[1,2,3] and apply [(1,2)] and get [2,2,3]
or
[1,2,3] and apply [(3,2)] and get [1,2,2]
The rest of the system is not prepared to handle duplicates in the
result set like this.
The reverted commit was intended to allow
[1,2,3] and [(1,2),(2,1)] to get [2,1,3]
to reorder primaries. First, this bidirectional swap is hard to implement
in a way that also prevents dups. For example,
[1,2,3] and [(1,4),(2,3),(3,4)] would give [4,3,4]
but if we just dropped the last step we'd have [4,3,3], which
is also invalid, etc. Simpler to just not handle bidirectional
swaps. In practice, they are not needed: if you just want to choose
a different primary then use primary_affinity, or pg_upmap
(not pg_upmap_items).
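
The duplicate-free behaviour described above can be sketched as follows.
This is a Python illustration under stated assumptions: the function name
and the "skip the pair if its target is already present" rule are one
simple way to satisfy the no-duplicates requirement, not a copy of the
OSDMap code.

```python
def apply_upmap_items(raw, items):
    """Apply (from, to) pairs to an OSD list without creating duplicates.

    Hypothetical helper: a pair is ignored when its target already appears
    in the result, or when its source is not present.
    """
    result = list(raw)
    for src, dst in items:
        if dst in result:
            continue  # applying this pair would duplicate dst
        if src in result:
            result[result.index(src)] = dst
    return result


no_dup_first = apply_upmap_items([1, 2, 3], [(1, 2)])           # pair skipped
no_dup_last = apply_upmap_items([1, 2, 3], [(3, 2)])            # pair skipped
no_swap = apply_upmap_items([1, 2, 3], [(1, 2), (2, 1)])        # both skipped
chain = apply_upmap_items([1, 2, 3], [(1, 4), (2, 3), (3, 4)])  # only (1,4) applies
```

Under this rule a bidirectional swap is a no-op, which is consistent with
the decision not to support swaps through pg_upmap_items.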
cmake: do not pass $SIMD_COMPILE_FLAGS to rocksdb cmake
Passing it enables SSE42 globally in rocksdb, and we would end up with
a binary that is not portable to machines without SSE42.
Fixes: http://tracker.ceph.com/issues/20529
Signed-off-by: Kefu Chai <kchai@redhat.com>
Conflicts:
this change is not cherry-picked from master, because the
PR targeting master (https://github.com/ceph/ceph/pull/17388) is
still pending review. The cmake changes are also different if
we want to use a recent commit of rocksdb, as its cmake addresses
the portability issues differently.
to pick up the fix that disables SSE42 globally and only enables it for
crc32c. This change is pushed to ceph/rocksdb:luminous.
Signed-off-by: Kefu Chai <kchai@redhat.com>
Conflicts:
this change is not cherry-picked from master, as the master PR
(https://github.com/ceph/ceph/pull/17388) is still pending review.
The latest rocksdb cmake was also revised to address the portability
issues, so the fix on the ceph side is different if we want to use a
recent rocksdb commit.
During up:resolve, the MDS tries to merge each subtree with its parent. During
testing, QE found that many thousands of subtrees in a directory (made possible
using pins) would cause the MDS to spend minutes printing out subtree maps to
the debug log. This causes the heartbeat code to consider the MDS as stalled so
beacons are no longer sent to the mons resulting in the MDS being removed from
the rank.
A more complete solution to this problem is to selectively print subtrees
relating to the operation (e.g. the subtree and its parents).
Fixes: http://tracker.ceph.com/issues/21221
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1485783
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit d0747a37fd06053b2206bb9a952f7ab77f0db2f0)
Patrick Donnelly [Mon, 11 Sep 2017 22:21:52 +0000 (15:21 -0700)]
mds: support limiting cache by memory
This introduces two config parameters:
mds_cache_memory_limit: Sets the soft maximum of the cache to the given
byte count. (Like mds_cache_size, this doesn't actually limit the maximum
size of the cache. It just dictates the steady-state size.)
mds_cache_reservation: This replaces mds_health_cache_threshold everywhere
except the Beacon heartbeat sent to the mons. The idea here is to specify a
reservation of memory (5% by default) for operations and the MDS tries to
always maintain that reservation. So, the MDS will recall caps from clients
when it begins dipping into its reservation of memory.
mds_cache_size still limits the cache by inode count but is now 0 by default
(i.e. unlimited). The new preferred way of specifying cache limits is by memory
size. The default is 1GB.
Fixes: http://tracker.ceph.com/issues/20594
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1464976
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 06c94de584e6cd7d347bcdfb79d9fef4fed0d277)
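
A rough sketch of how the memory limit and the reservation interact for
mds_cache_memory_limit and mds_cache_reservation, in Python for brevity.
The function names are hypothetical, 1GB is assumed to mean 2^30 bytes,
and the threshold formula is an interpretation of "dipping into its
reservation", not the MDS code.

```python
GIB = 1 << 30  # assumption: the 1GB default is 2**30 bytes


def recall_threshold(memory_limit, reservation=0.05):
    """Cache size above which the MDS is eating into its reservation."""
    return int(memory_limit * (1.0 - reservation))


def should_recall_caps(cache_bytes, memory_limit=GIB, reservation=0.05):
    """True when the cache has grown into the reserved headroom."""
    return cache_bytes > recall_threshold(memory_limit, reservation)


threshold = recall_threshold(GIB)          # roughly 95% of the limit
half_full = should_recall_caps(GIB // 2)   # well below the threshold
at_limit = should_recall_caps(GIB)         # inside the reservation
```

Because the limit is soft, the cache can exceed mds_cache_memory_limit;
the sketch only shows when cap recall would start, not a hard cap.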
Avoids an unnecessary "max" size of the LRU which was used to calculate the
midpoint. Instead, just dynamically move the LRUObjects between top and bottom
on-the-fly.
This change is necessary for a cache which does not limit by the number
of objects but by some other metric. (In this case, memory.)
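
To illustrate the idea, here is a small Python sketch of an LRU that keeps
a hot "top" list and a cold "bottom" list and rebalances them on every
operation instead of relying on a configured maximum size. The class name,
the midpoint value, and the one-directional rebalance are assumptions for
illustration and do not mirror the LRU class in the tree.

```python
from collections import deque


class MidpointLRU:
    """Hypothetical two-segment LRU with no maximum size."""

    def __init__(self, midpoint=0.5):
        self.midpoint = midpoint  # fraction of objects allowed in `top`
        self.top = deque()        # hot segment, index 0 is most recent
        self.bottom = deque()     # cold segment, expiry pops from the right

    def __len__(self):
        return len(self.top) + len(self.bottom)

    def _adjust(self):
        # Derive the wanted size of `top` from the current object count,
        # then demote the coldest entries of `top` until it fits.
        want = int(self.midpoint * len(self))
        while len(self.top) > want:
            self.bottom.appendleft(self.top.pop())

    def touch(self, obj):
        # Move (or insert) obj to the hottest position, then rebalance.
        for segment in (self.top, self.bottom):
            if obj in segment:
                segment.remove(obj)
        self.top.appendleft(obj)
        self._adjust()

    def expire(self):
        # Evict the coldest object, or return None when empty.
        self._adjust()
        if self.bottom:
            return self.bottom.pop()
        if self.top:
            return self.top.pop()
        return None


lru = MidpointLRU(midpoint=0.5)
for name in ["a", "b", "c", "d"]:
    lru.touch(name)
```

Since the split is computed from the live object count, the caller is free
to decide when to expire entries based on any metric, such as memory use.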
Patrick Donnelly [Tue, 12 Sep 2017 21:29:49 +0000 (14:29 -0700)]
mds: go back to compact_map for replicas
Zheng observed that an alloc_ptr doesn't really work in this case since any
call to get_replicas() will cause the map to be allocated, nullifying the
benefit. Use a compact_map until a better solution can be written. (This means
that the map will be allocated outside the mempool.)
Patrick Donnelly [Fri, 28 Jul 2017 00:21:54 +0000 (17:21 -0700)]
mds: use mempool for cache objects
The purpose of this is to allow us to track memory usage by cached objects so
we can limit cache size based on memory available/allocated to the MDS.
This commit is a first step: it adds CInode, CDir, and CDentry to the mempool
but not all of the containers in these classes (e.g. std::map). However,
MDSCacheObject has been changed to allocate its containers through the mempool
by converting compact_* containers to the std versions offered through mempool
via the new alloc_ptr.
(A compact_* class simply wraps a pointer to the std:: version to reduce memory
usage of an object when the container is only occasionally used. The alloc_ptr
allows us to achieve the same thing explicitly with only a little handholding:
when all entries in the wrapped container are deleted, the caller must call
alloc_ptr.release().)
Patrick Donnelly [Thu, 27 Jul 2017 19:10:14 +0000 (12:10 -0700)]
common: add alloc_ptr smart pointer
This ptr is like a unique_ptr except that it allocates the underlying object
on access. The idea is that we can save memory if the object is only needed
sometimes.
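
The allocate-on-access behaviour can be sketched in Python as below. This
is an analogue for illustration only: the class and method names are
hypothetical apart from release(), which the log mentions, and the real
alloc_ptr is a C++ smart pointer.

```python
class AllocPtr:
    """Hypothetical analogue of alloc_ptr: allocate on first access."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds the wrapped object
        self._obj = None

    @property
    def allocated(self):
        return self._obj is not None

    def get(self):
        # Any access, including a read-only one, triggers allocation.
        if self._obj is None:
            self._obj = self._factory()
        return self._obj

    def release(self):
        # The caller drops the object once the container is empty again.
        self._obj = None


replicas = AllocPtr(dict)
allocated_before = replicas.allocated   # nothing allocated yet
replicas.get()["mds.1"] = 1             # first access allocates the dict
allocated_after = replicas.allocated
replicas.get().clear()
replicas.release()                      # explicit handholding by the caller
size_on_read = len(replicas.get())      # a plain read allocates again
```

The last line shows the drawback noted in the compact_map commit: a
read-only accessor that goes through get() still allocates the container.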