We need to prevent duplicates in the final result. For example, we
can currently take
[1,2,3] and apply [(1,2)] and get [2,2,3]
or
[1,2,3] and apply [(3,2)] and get [1,2,2]
The rest of the system is not prepared to handle duplicates in the
result set like this.
The reverted commit was intended to allow
[1,2,3] and [(1,2),(2,1)] to get [2,1,3]
to reorder primaries. First, this bidirectional swap is hard to implement
in a way that also prevents dups. For example,
[1,2,3] and [(1,4),(2,3),(3,4)] would give [4,3,4]
but if we just dropped the last step we'd have [4,3,3], which
is also invalid, etc. It is simpler to just not handle bidirectional
swaps. In practice, they are not needed: if you just want to choose
a different primary then use primary_affinity, or pg_upmap
(not pg_upmap_items).
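For illustration only (the names below are made up; this is not the OSDMap code),
applying such remap pairs and rejecting any result that contains duplicates might
look like:

    // illustrative: apply pg_upmap_items-style (from, to) pairs to an OSD list
    // and reject the result if it contains duplicates
    #include <algorithm>
    #include <iostream>
    #include <set>
    #include <utility>
    #include <vector>

    std::vector<int> apply_upmap_items(std::vector<int> osds,
                                       const std::vector<std::pair<int,int>>& pairs) {
      for (const auto& [from, to] : pairs)
        std::replace(osds.begin(), osds.end(), from, to);
      return osds;
    }

    bool has_dups(const std::vector<int>& osds) {
      std::set<int> uniq(osds.begin(), osds.end());
      return uniq.size() != osds.size();
    }

    int main() {
      std::vector<int> up = {1, 2, 3};
      auto mapped = apply_upmap_items(up, {{1, 2}});        // -> [2,2,3]
      std::cout << "reject: " << has_dups(mapped) << "\n";  // prints 1: must be rejected
    }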
cmake: do not pass $SIMD_COMPILE_FLAGS to rocksdb cmake
which enables SSE42 globally in rocksdb, leaving us with a
binary that is not portable to machines without SSE42 support.
Fixes: http://tracker.ceph.com/issues/20529
Signed-off-by: Kefu Chai <kchai@redhat.com>
Conflicts:
this change is not cherry-picked from master, because the
PR targeting master (https://github.com/ceph/ceph/pull/17388) is
still pending review, and the cmake change is different if
we want to use a recent commit of rocksdb, as it addresses the
portability issues differently in its cmake.
to pick up the fix to disable SSE42 globally, and only enable it for
crc32c. This change is pushed to ceph/rocksdb:luminous.
Signed-off-by: Kefu Chai <kchai@redhat.com>
Conflicts:
this change is not cherry-picked from master, as the master PR
(https://github.com/ceph/ceph/pull/17388) is still pending review.
The latest rocksdb cmake was revised to address the portability
issues, so the fix on the ceph side is different if we want to use a
recent rocksdb commit.
During up:resolve, the MDS tries to merge each subtree with its parent. During
testing, QE found that many thousands of subtrees in a directory (made possible
using pins) would cause the MDS to spend minutes printing out subtree maps to
the debug log. This causes the heartbeat code to consider the MDS stalled, so
beacons are no longer sent to the mons, resulting in the MDS being removed from
its rank.
A more complete solution to this problem is to selectively print subtrees
relating to the operation (e.g. the subtree and its parents).
Fixes: http://tracker.ceph.com/issues/21221
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1485783
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit d0747a37fd06053b2206bb9a952f7ab77f0db2f0)
Patrick Donnelly [Mon, 11 Sep 2017 22:21:52 +0000 (15:21 -0700)]
mds: support limiting cache by memory
This introduces two config parameters:
mds_cache_memory_limit: Sets the soft maximum of the cache to the given
byte count. (Like mds_cache_size, this doesn't actually limit the maximum
size of the cache. It just dictates the steady-state size.)
mds_cache_reservation: This replaces mds_health_cache_threshold everywhere
except the Beacon heartbeat sent to the mons. The idea here is to specify a
reservation of memory (5% by default) for operations and the MDS tries to
always maintain that reservation. So, the MDS will recall caps from clients
when it begins dipping into its reservation of memory.
mds_cache_size still limits the cache by inode count but now defaults to 0
(i.e. unlimited). The new preferred way of specifying cache limits is by memory
size. The default is 1GB.
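For illustration, the limits described above could be set in ceph.conf roughly as
follows (example values only, mirroring the stated defaults):

    [mds]
    # soft memory target for the cache, in bytes (example: 1GB, the stated default)
    mds_cache_memory_limit = 1073741824
    # keep ~5% of that memory in reserve for ongoing operations
    mds_cache_reservation = 0.05
    # inode-count limit; 0 (the new default) means unlimited
    mds_cache_size = 0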
Fixes: http://tracker.ceph.com/issues/20594
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1464976
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
(cherry picked from commit 06c94de584e6cd7d347bcdfb79d9fef4fed0d277)
Avoids an unnecessary "max" size of the LRU which was used to calculate the
midpoint. Instead, just dynamically move the LRUObjects between top and bottom
on-the-fly.
This change is necessary for a cache which does not limit by the number
of objects but by some other metric. (In this case, memory.)
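A rough sketch of the idea (illustrative only, not the actual Ceph LRU code): keep
a hot and a cold list and rebalance them on every insert or access, so no fixed
maximum size is needed to locate the midpoint, and trimming can start from the
cold end:

    // illustrative two-segment LRU with a dynamically maintained midpoint
    #include <cstddef>
    #include <list>

    template <typename T>
    class midpoint_lru {
      std::list<T> top, bottom;   // hot and cold segments
      double top_ratio;           // fraction of objects kept hot

      void rebalance() {
        std::size_t total = top.size() + bottom.size();
        // demote from the hot end until the ratio holds; no fixed "max" required
        while (!top.empty() && top.size() > total * top_ratio) {
          bottom.push_front(top.back());
          top.pop_back();
        }
      }

    public:
      explicit midpoint_lru(double ratio = 0.5) : top_ratio(ratio) {}

      void insert(const T& obj) {
        bottom.push_front(obj);   // new objects start in the cold segment
        rebalance();
      }

      void touch(const T& obj) {
        top.remove(obj);          // promote on access, wherever it currently lives
        bottom.remove(obj);
        top.push_front(obj);
        rebalance();
      }

      bool expire(T* out) {       // trim from the cold end first
        if (top.empty() && bottom.empty())
          return false;
        std::list<T>& src = bottom.empty() ? top : bottom;
        *out = src.back();
        src.pop_back();
        return true;
      }
    };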
Patrick Donnelly [Tue, 12 Sep 2017 21:29:49 +0000 (14:29 -0700)]
mds: go back to compact_map for replicas
Zheng observed that an alloc_ptr doesn't really work in this case since any
call to get_replicas() will cause the map to be allocated, nullifying the
benefit. Use a compact_map until a better solution can be written. (This means
that the map will be allocated outside the mempool.)
Patrick Donnelly [Fri, 28 Jul 2017 00:21:54 +0000 (17:21 -0700)]
mds: use mempool for cache objects
The purpose of this is to allow us to track memory usage by cached objects so
we can limit cache size based on memory available/allocated to the MDS.
This commit is a first step: it adds CInode, CDir, and CDentry to the mempool
but not all of the containers in these classes (e.g. std::map). However,
MDSCacheObject has been changed to allocate its containers through the mempool
by converting compact_* containers to the std versions offered through mempool
via the new alloc_ptr.
(A compact_* class simply wraps a pointer to the std:: version to reduce memory
usage of an object when the container is only occasionally used. The alloc_ptr
allows us to achieve the same thing explicitly with only a little handholding:
when all entries in the wrapped container are deleted, the caller must call
alloc_ptr.release().)
Patrick Donnelly [Thu, 27 Jul 2017 19:10:14 +0000 (12:10 -0700)]
common: add alloc_ptr smart pointer
This ptr is like a unique_ptr except it allocates the underlying object on
access. The idea is that we can save memory if the object is only needed
sometimes.
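A simplified sketch of the idea (hypothetical names, not the actual common/
implementation): a unique_ptr-like wrapper that allocates its target on first
access and is dropped by the caller once the contents become empty again:

    // simplified allocate-on-access pointer, in the spirit of alloc_ptr
    #include <map>
    #include <memory>
    #include <string>

    template <typename T>
    class lazy_alloc_ptr {
      mutable std::unique_ptr<T> ptr;
    public:
      T* get() const {
        if (!ptr)
          ptr = std::make_unique<T>();  // allocate only on first access
        return ptr.get();
      }
      T& operator*() const { return *get(); }
      T* operator->() const { return get(); }
      explicit operator bool() const { return static_cast<bool>(ptr); }
      void release() { ptr.reset(); }   // caller calls this once the container is empty
    };

    // a rarely-used member costs only one pointer until someone touches it
    struct cache_object {
      lazy_alloc_ptr<std::map<int, std::string>> replicas;
    };

    int main() {
      cache_object o;                   // no map allocated yet
      o.replicas->emplace(0, "osd.3");  // map allocated here, on first use
      o.replicas->clear();
      o.replicas.release();             // back to a single null pointer
    }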
hechuang [Thu, 29 Jun 2017 02:38:23 +0000 (10:38 +0800)]
rgw: data encryption does not follow the AWS agreement
Encryption request headers should not be sent for GET requests and HEAD
requests if your object uses SSE-KMS/SSE-S3, or you'll get an HTTP 400
BadRequest error.
John Spray [Thu, 7 Sep 2017 13:44:36 +0000 (09:44 -0400)]
mon: fix dropping mgr metadata for active mgr
drop_standby() was dropping the mgr's metadata, and it was only getting
added back in certain locations. Instead, make the metadata
drop conditional and only do it in the places where we're
really dropping the daemon, not when we're promoting
it to active.
Fixes: http://tracker.ceph.com/issues/21260
Signed-off-by: John Spray <john.spray@redhat.com>
(cherry picked from commit 29c6f9adf178f6611a625740f395e397cad9147b)
* should reset it, in case we reuse it after initializing it.
* initialize the value of `p` using the C++11 style initializer, so it
is zero-initialized.
* revert 2a83ef3c, which disabled a warning:
./include/encoding.h:317:7: warning: 't' may be used uninitialized in
this function [-Wmaybe-uninitialized]
where `t` is the temporary variable used to initialize the value of
`p`.
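For illustration (not the actual encoding.h code), the difference between the two
initialization styles:

    // illustrative: default-initialization vs. C++11 value-initialization
    #include <cstdint>

    struct payload {
      std::uint32_t len;    // no default member initializer
    };

    int main() {
      payload a;            // default-initialized: a.len is indeterminate and can
                            // trigger -Wmaybe-uninitialized when read
      payload b{};          // value-initialized: b.len is zero-initialized
      (void)a;
      return static_cast<int>(b.len);   // always 0
    }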
tools/ceph_objectstore_tool: fix 'dup' unable to duplicate meta PG
Recently we planned to bring a Jewel cluster up to Luminous.
After that was done, which turned out to be a big success,
we then tried to transform all FileStore osds into BlueStore ones
offline, but with no luck. The ceph_objectstore_tool kept complaining:
--------------------------------------------------------------------
dup from filestore: /var/lib/ceph/osd/ceph-20.old
to bluestore: /var/lib/ceph/osd/ceph-20
fsid d444b253-337d-4d15-9d63-86ae134ec9ac
65 collections
1/65 meta
cannot get bit count for collection meta: (61) No data available
--------------------------------------------------------------------
The root cause is that for FileStore, Luminous will always try to rewrite the
pg "bits" as a file attribute on load if that attribute is not available.
But since the meta pg is never loaded (we skip it during OSD::load_pgs()),
we actually never get the chance to do so, which makes the
dup method of ceph_objectstore_tool very unhappy, since it always
expects to see such an attribute in the underlying store.
Fix the above problem by manually skipping loading the "bits" attribute
for dup if the underlying ObjectStore is FileStore.