Jan Fajerski [Tue, 27 Nov 2018 14:49:01 +0000 (15:49 +0100)]
ceph-volume: refactor strategies for explicit device arguments
This commit refactors lvm strategies to:
a) change the constructor to receive usage specific device lists
b) add a classmethod with_auto_devices for automatic splitting by the
rotational flag (keeping the backwards-compatible behavior)
c) rename ssds and hdds member variables to wal_devs and block_devs
d) add db_devs member variable
Andrew Schoen [Tue, 4 Dec 2018 20:23:47 +0000 (14:23 -0600)]
ceph-volume: batch mixed type scenarios have no need to calculate data size
We know with a mixed type scenario the device used for data will be used
at 100% capacity. This means we do not need to be explicit when asking
for the size of the data lvs, which avoids rounding errors with very
small device sizes.
Andrew Schoen [Fri, 30 Nov 2018 17:55:27 +0000 (11:55 -0600)]
ceph-volume: set use_large_block_db in validate, not compute
The self.use_large_block_db property was never getting set, because the
branch in compute that sets it was never reached: block_db_size was reset in
validate if it was 0. We need to set self.use_large_block_db in
validate instead of compute.
Andrew Schoen [Thu, 6 Dec 2018 15:56:58 +0000 (09:56 -0600)]
ceph-volume: use Device.lvm_size in batch strategies
We should show the user what the size of the device will be after lvm
creates a pv out of it. This way there isn't a discrepancy between the
sizes that are reported to the user and what is actually created.
Matt Benjamin [Fri, 8 Mar 2019 20:41:05 +0000 (15:41 -0500)]
rgw: prefix-delimiter listing: support >1 character delimiter
Fix prefix and CommonPrefix extraction logic in
RGWRados::Bucket::List::list_objects_ordered so as to permit
arbitrary-length string delimiters.
Fixes: https://tracker.ceph.com/issues/24821
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
(cherry picked from commit e3c1ea244234aace7368d5a5ee95af2f6a529b00)
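For illustration only, the delimiter handling described above boils down to
searching for the full delimiter string after the prefix and keeping
everything up to the end of that match; a minimal standalone sketch (not the
actual RGWRados::Bucket::List::list_objects_ordered code, all names here are
made up for the example):
    #include <optional>
    #include <string>

    // Given an object key, the requested prefix and a delimiter of any
    // length, return the CommonPrefix the key should collapse into, if any.
    std::optional<std::string> common_prefix(const std::string& key,
                                             const std::string& prefix,
                                             const std::string& delim) {
      if (delim.empty() || key.compare(0, prefix.size(), prefix) != 0)
        return std::nullopt;
      // search for the delimiter only in the part of the key after the prefix
      auto pos = key.find(delim, prefix.size());
      if (pos == std::string::npos)
        return std::nullopt;
      // keep everything up to and including the *whole* delimiter,
      // not just its first character
      return key.substr(0, pos + delim.size());
    }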
Tim Serong [Wed, 20 Mar 2019 07:52:14 +0000 (18:52 +1100)]
cmake: remove cython 0.29's subinterpreter check during install
Commit 3bde34af8a removed cython 0.29's subinterpreter check when
building the various python modules during `make`, but unfortunately
they're *rebuilt* during `make install`, with the rebuild overwriting
the original build. The original fix was of course missing from the
install stage...
Unfortunately, this completely breaks ceph-mgr. Until we can
figure out a better long term solution, this commit removes
cython's subinterpreter check, via some careful abuse of the
C preprocessor.
This works because when cython is invoked, it first generates
some C code, then compiles it. We know it's going to generate
C code including:
    int __Pyx_check_single_interpreter(void) { ... }
and:
    if (__Pyx_check_single_interpreter())
        return NULL;
This replaces the call to __Pyx_check_single_interpreter()
with a literal 0, removing the subinterpreter check.
The void0 dead_function(void) thing is necessary because
the __Pyx_check_single_interpreter() macro also clobbers
that function definition, so we need to make sure it's
replaced with something that works as a function definition.
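Concretely, the "careful abuse" amounts to a pair of preprocessor definitions
along these lines (illustrative equivalents of what gets passed on the
compiler command line; the exact spelling in the build system may differ):
    /* rewrite the call site to a literal 0, and keep the clobbered
       definition syntactically valid */
    #define __Pyx_check_single_interpreter(ARG) ARG ## 0
    #define void0 dead_function(void)

    /* cython-generated definition:
         int __Pyx_check_single_interpreter(void) { ... }
       expands via ARG=void to
         int void0 { ... }   ->   int dead_function(void) { ... }
       i.e. a harmless but still valid function definition.

       cython-generated call site:
         if (__Pyx_check_single_interpreter()) return NULL;
       expands (empty ARG) to
         if (0) return NULL;
       so the subinterpreter check is compiled out. */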
Matt Benjamin [Fri, 7 Jun 2019 14:20:01 +0000 (10:20 -0400)]
rgw_file: all directories are virtual with respect to contents
This change causes directory handles to always report an mtime of
"now." This is not an invalidate per se--it interacts with the
nfs implementation to produce that result when the implementation
updates its cached attributes. Hence, it can be modulated by timers
or other rules governing attribute caching at the upper layer.
Fixes: http://tracker.ceph.com/issues/40204
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
(cherry picked from commit b4c7d0faeff667c25ab255786999ef0cc844ea2b)
Sage Weil [Mon, 9 Apr 2018 19:58:03 +0000 (14:58 -0500)]
os/filestore: estimate omap_allocated
Assume all of leveldb/rocksdb is omap. This is an overestimate, but
better than nothing.
We don't populate the metadata overhead (no easy way to calculate this
that comes to mind). And we don't populate the compression-related
fields. It's possible we could make something up here in the VDO
case...
Sage Weil [Tue, 10 Apr 2018 17:55:55 +0000 (12:55 -0500)]
os/ObjectMap: add get_db() accessor
This is just to let us get at the underlying KeyValueDB for DBObjectMap.
It is not really better or worse than adding accessors for things like
GetEstimatedSize() to ObjectMap.
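Put together with the previous commit, the idea is roughly the following (a
sketch, not the actual FileStore statfs code; the estimator call and its
signature are assumptions used for illustration):
    #include <cstdint>

    // stand-ins for the real interfaces; get_db() is the accessor this
    // commit adds, get_estimated_size() is assumed here for illustration
    struct KeyValueDB { uint64_t get_estimated_size(); };
    struct ObjectMap  { KeyValueDB* get_db(); };

    // charge the whole leveldb/rocksdb store to omap_allocated --
    // an overestimate, but better than reporting nothing
    uint64_t estimate_omap_allocated(ObjectMap* object_map) {
      KeyValueDB* db = object_map ? object_map->get_db() : nullptr;
      return db ? db->get_estimated_size() : 0;
    }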
Jason Dillaman [Mon, 29 Apr 2019 14:13:21 +0000 (10:13 -0400)]
librbd: simplify IO flush handling through AsyncOperation
Allow ImageFlushRequest to directly execute a flush call through
AsyncOperation. This will allow the flush to be directly linked
to its preceding IOs.
Conditionally allow non-unique email address values for builtin
RGW users.
Fixes: http://tracker.ceph.com/issues/40089
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
(cherry picked from commit 974791522007cca6d8fb30e83677f0ddd7c4e71d)
Conflicts:
src/rgw/rgw_user.cc
- changed '_conf.get_val<bool>' to '_conf->get_val<bool>'
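The conflict noted above is only about the config accessor spelling; the
guard itself has roughly this shape (a sketch; the option name
"rgw_user_unique_email" is an assumption used for illustration, not taken
from the text):
    #include "common/ceph_context.h"   // CephContext

    // sketch: decide whether a duplicate email should be rejected
    bool email_must_be_unique(CephContext* cct) {
      // on this branch the config proxy is accessed as a pointer, hence
      // '_conf->get_val<bool>' rather than '_conf.get_val<bool>' -- the
      // spelling difference behind the cherry-pick conflict above
      return cct->_conf->get_val<bool>("rgw_user_unique_email");
    }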
xie xingguo [Mon, 3 Jun 2019 08:43:25 +0000 (16:43 +0800)]
test: add parallel clean_pg_upmaps test
With the parallel clean_pg_upmaps feature on, the total time cost of the
performance test, which can now utilize up to 8 threads for parallel
upmap validation, decreased from:
Note that by default the mon uses only 4 worker threads for
CPU-intensive background work; you can further increase
the "mon_cpu_threads" option value if the time consumed by
clean_pg_upmaps still matters to you.
xie xingguo [Mon, 3 Jun 2019 08:10:22 +0000 (16:10 +0800)]
mon/OSDMonitor: do clean_pg_upmaps the parallel way if necessary
There are certainly some cases in which we could reliably
skip this kind of checking, but there is no easy way to separate
those out.
However, this is clearly the general way to do the massive pg
upmap clean-up job more efficiently and hence should make sense
in all cases.
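The general shape of the parallel path is to shard the pg list across worker
threads, validate each shard independently, and merge the entries that need
cancelling; a minimal standalone sketch of that pattern (not the actual
ParallelPGMapper/OSDMonitor code):
    #include <cstdint>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    using pg_id = uint64_t;

    // 'upmap_is_bogus' stands in for the real per-pg upmap validity check
    std::vector<pg_id> find_bogus_upmaps(
        const std::vector<pg_id>& pgs,
        unsigned nthreads,                            // e.g. mon_cpu_threads
        const std::function<bool(pg_id)>& upmap_is_bogus) {
      std::vector<pg_id> to_cancel;
      std::mutex lock;
      std::vector<std::thread> workers;
      for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
          std::vector<pg_id> local;
          for (size_t i = t; i < pgs.size(); i += nthreads)  // shard by stride
            if (upmap_is_bogus(pgs[i]))
              local.push_back(pgs[i]);
          std::lock_guard<std::mutex> g(lock);
          to_cancel.insert(to_cancel.end(), local.begin(), local.end());
        });
      }
      for (auto& w : workers)
        w.join();
      return to_cancel;
    }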
xie xingguo [Mon, 17 Jun 2019 10:44:09 +0000 (18:44 +0800)]
osd: maybe_remove_pg_upmaps -> clean_pg_upmaps
Killing unnecessary or duplicated code should always be the preferred
option, since it is good for maintenance.
Also, I've noticed there is already a clean_temps helper, so renaming
maybe_remove_pg_upmaps to clean_pg_upmaps to keep pace with
that seems a natural choice to me.
xie xingguo [Wed, 5 Jun 2019 02:41:52 +0000 (10:41 +0800)]
osd/OSDMapMapping: make ParallelPGMapper accept input pgs
The existing "prime temp" machinism is a good inspiration
for cluster with a large number of pgs that need to do various
calculations quickly.
I am planning to do the upmap tidy-up work the same way, hence
the need for an alternate way of specifying pgs to process other
than taking directly from the map.
The upmap results are directly applied after calling
_pg_to_raw_osds, which means it basically has nothing to do
with the up/down status.
In other words, if a pg_upmap/pg_upmap_items entry remapped a pg
onto some down osds and is now causing a collided result,
we should still be able to detect and cancel it.
xie xingguo [Sat, 1 Jun 2019 02:43:10 +0000 (10:43 +0800)]
test/osd: add performance test case for maybe_remove_pg_upmap
Tom Byrne reported that maybe_remove_pg_upmap might become
super inefficient for large clusters with the balancer on.
To identify and resolve the problem, we need to add some good
measurements first.
osd/bluestore: Actually wait until completion in write_sync
This function is only used by RocksDB WAL writing so it must sync data.
This fixes #18338 and thus allows actually setting `bluefs_preextend_wal_files`
to true, gaining +100% single-thread write iops in disk-bound (HDD or bad SSD) setups.
To my knowledge it doesn't hurt performance in other cases.
Test it yourself on any HDD with `fio -ioengine=rbd -direct=1 -bs=4k -iodepth=1`.
Issue #18338 is easily reproduced without this patch by issuing a `kill -9` to the OSD
while doing `fio -ioengine=rbd -direct=1 -bs=4M -iodepth=16`.
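The invariant being restored is simply that write_sync does not return until
the data is durable, which is what RocksDB expects of its WAL writes; in
generic POSIX terms (purely illustrative, not the BlueStore/BlueFS code):
    #include <sys/types.h>
    #include <unistd.h>   // pwrite, fdatasync

    // illustrative only: a synchronous write that does not return until
    // the data has reached stable storage
    ssize_t write_sync_sketch(int fd, const void* buf, size_t len, off_t off) {
      ssize_t r = ::pwrite(fd, buf, len, off);
      if (r < 0)
        return r;
      // without waiting for the data to hit stable storage, a crash
      // (e.g. kill -9) can lose WAL data that RocksDB already considers
      // committed -- the failure mode described in issue #18338
      if (::fdatasync(fd) < 0)
        return -1;
      return r;
    }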
mon: paxos: introduce new reset_pending_committing_finishers for safety
There are asserts about the state of the system and pending_finishers which can
be triggered by running arbitrary commands through again. They are correct
when not restarting, but when we do restart we need to take care to preserve
the same invariants as appropriate. Use this function to be careful about
the order of committing_finishers vs pending_finishers and to make sure they're
both empty before any Contexts actually get called.
We also reorder a call to finish_contexts on the waiting_for_writeable list for
similar reasons.
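The careful ordering described above (committing_finishers drained ahead of
pending_finishers, and both member lists emptied before any Context runs, so
a callback that re-enters paxos sees the expected empty state) looks roughly
like this; a sketch using plain std::function callbacks, not the actual
Paxos/Context types:
    #include <functional>
    #include <list>

    struct FinisherLists {
      std::list<std::function<void()>> committing_finishers;
      std::list<std::function<void()>> pending_finishers;

      // sketch of reset_pending_committing_finishers(): move everything
      // out of the member lists *first*, preserving committing-before-
      // pending order, then run the callbacks from the local copy so the
      // members are already empty if a callback re-enters this code
      void reset_pending_committing_finishers() {
        std::list<std::function<void()>> to_run;
        to_run.splice(to_run.end(), committing_finishers);
        to_run.splice(to_run.end(), pending_finishers);
        for (auto& fin : to_run)
          fin();
      }
    };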
Metadata sync of a new bucket entrypoint may call rgw_link_bucket()
(which in turn calls into cls user) without deleting/unlinking the
previous bucket entrypoint. This prevented the new bucket entrypoint
from overwriting the creation_time of the old one.