Sage Weil [Wed, 11 Aug 2021 14:58:39 +0000 (10:58 -0400)]
Merge PR #42682 into master
* refs/pull/42682/head:
cephadm: no need to explicitly enable prometheus module
mgr/cephadm: enable prometheus module before deploying prometheus
mgr/cephadm: drop daemon_id arg to CephadmService.config()
doc/cephadm: no need to manually enable the prometheus module
Reviewed-by: Sebastian Wagner <sewagner@redhat.com>
Currently BlueStore keeps its allocation info inside RocksDB.
BlueStore commits all allocation information (alloc/release) into RocksDB (column-family B) before the client write completes, causing a delay in the write path and adding significant load to the CPU/memory/disk.
Committing all state into RocksDB allows Ceph to survive failures without losing the allocation state.
The new code skips the RocksDB updates at allocation time and instead performs a full destage of the allocator object, with all the OSD allocation state, in a single step during umount().
This results in a 25% increase in IOPS and reduced latency in small random-write workloads, but exposes the system to losing allocation info in failure cases where we don't call umount().
We added code to perform a full allocation-map rebuild from the information stored inside the ONodes, which is used in those failure cases.
When we perform a graceful shutdown there is no need for recovery; we simply read the allocation-map from the flat file where it was stored during umount(). (In fact this mode is faster and shaves a few seconds off boot time, since reading a flat file is faster than iterating over RocksDB.)
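A minimal sketch of that mount-time split, assuming hypothetical names (destage_allocator_to_file, rebuild_from_onodes and load_on_mount are illustrative, not BlueStore's real API):
```
// Hypothetical sketch of the destage/recovery split described above;
// none of these names are BlueStore's real API.
#include <filesystem>
#include <iostream>

struct Allocator {};  // stand-in for the in-memory allocation map

// On clean umount(): write the whole allocation map to a flat file once,
// instead of committing every alloc/release to RocksDB on the write path.
void destage_allocator_to_file(const Allocator&, const std::filesystem::path& p) {
    std::cout << "destaging allocation map to " << p << "\n";
}

// After a shutdown that skipped umount(): rebuild the allocation map by
// scanning the ONodes -- the slow recovery path (minutes, not seconds).
Allocator rebuild_from_onodes() {
    std::cout << "recovering allocation map from ONodes\n";
    return {};
}

Allocator load_on_mount(const std::filesystem::path& p, bool clean_shutdown) {
    if (clean_shutdown && std::filesystem::exists(p)) {
        // Fast path: reading one flat file beats iterating over RocksDB.
        std::cout << "reading allocation map from " << p << "\n";
        return {};
    }
    return rebuild_from_onodes();
}

int main() {
    // Simulate mounting after an unclean shutdown: takes the recovery path.
    Allocator a = load_on_mount("osd0.allocmap", /*clean_shutdown=*/false);
    (void)a;
}
```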
Open Issues:
There is a bug in the src/stop.sh script: it kills ceph-osd without invoking umount(), which means anyone using it will always take the recovery path.
Adam Kupczyk is fixing this issue in a separate PR.
A simple workaround is to call 'killall -15 ceph-osd' before running src/stop.sh.
Fast-Shutdown and Ceph Suicide (done when the system underperforms) stop the system without a proper drain and without a call to umount().
This will trigger a full recovery, which can be long (3 minutes in my testing, but your mileage may vary).
We plan on adding a follow-up PR doing the following in Fast-Shutdown and Ceph Suicide (see the sketch after this list):
Block the OSD queues from accepting any new requests
Delete all items in the queue which we didn't start yet
Drain all in-flight tasks
Call umount() (and destage the allocation-map)
If the drain doesn't complete within a predefined time limit (say 3 minutes) -> kill the OSD
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
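A rough sketch of that drain-with-timeout sequence, in generic C++ (OpQueue, the condition-variable drain and the abort-on-timeout are illustrative stand-ins, not the actual OSD queue code):
```
#include <chrono>
#include <condition_variable>
#include <cstdlib>
#include <deque>
#include <mutex>

// Generic illustration of the five planned steps; not actual OSD code.
struct OpQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<int> pending;   // requests we have not started yet
    int in_flight = 0;         // tasks already running
    bool accepting = true;

    void shutdown_with_timeout(std::chrono::seconds limit) {
        std::unique_lock<std::mutex> l(m);
        accepting = false;     // 1. block the queue from accepting new requests
        pending.clear();       // 2. delete items we didn't start yet
        // 3. drain in-flight tasks, but only up to the time limit
        bool drained = cv.wait_for(l, limit, [this] { return in_flight == 0; });
        if (!drained) {
            std::abort();      // 5. drain timed out -> kill the OSD
        }
        // 4. drained in time: safe to umount() and destage the allocation map
    }
};

int main() {
    OpQueue q;
    q.shutdown_with_timeout(std::chrono::seconds(180));  // e.g. 3 minutes
}
```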
create allocator from on-disk onodes and BlueFS inodes
change allocator + add stat counters + report illegal physical-extents
compare allocator after rebuild from ONodes
prevent collection from being open twice
removed FSCK repo check for null-fm
Bug-Fix: don't add BlueFS allocation to shared allocator
add configuration option to commit to No-Column-B
Only invalidate allocation file after opening rocksdb in read-write mode
fix tests not to expect failure in cases inapplicable to null-allocator
accept a non-existing allocation file and don't fail the invalidation, as it can happen legally
don't commit to null-fm when db is opened in repair-mode
add a reverse mechanism from null_fm to real_fm (using RocksDB)
use Ceph encode/decode, add more info to header/trailer, add crc protection
Code cleanup
some changes requested by Adam (cleanup and style changes)
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
Sage Weil [Thu, 5 Aug 2021 14:24:13 +0000 (10:24 -0400)]
mgr/cephadm: enable prometheus module before deploying prometheus
The mon will restart the mgr when the module is enabled, so we don't
really have to do anything here. The raise is there just in case the
mgr doesn't immediately get the new mgrmap and respawn, although there is
likely no harm done if we continue to deploy prometheus in the meantime,
even if we're interrupted partway through.
Kefu Chai [Wed, 11 Aug 2021 08:29:10 +0000 (16:29 +0800)]
script/run-make.sh: retry if dpkg was interrupted
there is a chance that apt-get is interrupted midway when a new PR
cancels the running jenkins job; the next job running apt-get or dpkg
would then run into issues like:
E: dpkg was interrupted, you must manually run 'sudo dpkg --configure -a' to correct the problem.
Build step 'Execute shell' marked build as failure
Kefu Chai [Wed, 11 Aug 2021 09:32:20 +0000 (17:32 +0800)]
do_cmake.sh: do not set BOOST_J
do_cmake.sh is called from the configure() function in src/script/run-make.sh,
which also sets BOOST_J if it is not already set, so we can drop the code
setting BOOST_J in do_cmake.sh.
this helps silence cmake warnings like:
CMake Warning:
Manually-specified variables were not used by the project:
Sage Weil [Tue, 10 Aug 2021 20:47:34 +0000 (16:47 -0400)]
Merge PR #42318 into master
* refs/pull/42318/head:
mgr/rook: update DefaultFetcher device path to look at local and fix bug
mgr/rook: add node and PV name information to Device in DefaultFetcher
mgr/rook: fix typing errors in Fetcher classes
mgr/rook: create and use DefaultFetcher and LSOFetcher classes
mgr/rook: create KubernetesCustomResource class to fetch CRs
mgr/rook: fix device ls error handling
mgr/rook: change storage class module option name and default value
mgr/rook: fix typing errors related to storage_class_name and device ls
mgr/rook: make `device ls` only display pvs in specified storage class
mgr/rook: add StorageV1Api and storage_class_name to RookCluster
mgr/rook: add StorageV1Api to RookOrchestrator
mgr/rook: add mgr/rook/storage_class_name to ceph config
mgr/rook: ceph orch device ls fetch and display info about PVs
mgr/rook: add CustomObjectsApi to RookCluster
Reviewed-by: Juan Miguel Olmo <jolmomar@redhat.com>
Sage Weil [Tue, 10 Aug 2021 20:37:38 +0000 (16:37 -0400)]
Merge PR #42691 into master
* refs/pull/42691/head:
mgr/nfs: add --port to 'nfs cluster create' and port to 'nfs cluster info'
qa/suites/orch/cephadm/smoke-roleless: test taking ganeshas offline
qa/tasks/vip: exec with bash -ex
qa/suites/orch/cephadm: separate test_nfs from test_orch_cli
Sage Weil [Tue, 10 Aug 2021 14:36:37 +0000 (10:36 -0400)]
Merge PR #42680 into master
* refs/pull/42680/head:
src/pybind/mgr/nfs/tests: pass cluster_id to from_export_block()
src/pybind/mgr/nfs: remove `tag` option
src/pybind/mgr/nfs: remove per daemon config test
src/pybind/mgr/nfs: directly use cluster_id and remove daemon related stuff
script: run-cbt.sh tests crimson with CyanStore instead of MemStore.
These tests were always supposed to run against CyanStore. However,
commit e6ed65db8b4e0a2f8026c2e35a12dd292c5f2b8c (PR #42437) changed
the meaning of `--memstore` and introduced `--cyanstore` to be used
instead. This commit makes `run-cbt.sh` aware of the new switch.
This PR updates the text in the RADOS Guide
(the Ceph Storage Cluster Guide) that appears
at the beginning of the "Storage Devices"
chapter. I did the following:
- rewrote some of the sentences so that
they read more like written text than like
spoken language
- added "Ceph Manager" to the list of daemons
that a Ceph cluster comprises
- that's about it.
mgr/dashboard: rgw service creation form: add realm and zone to service spec.
Align rgw service id pattern with cephadm: https://github.com/ceph/ceph/pull/39877
- Update rgw pattern to allow service id for non-multisite config.
- Extract realm and zone from service id (when detected) and add them to the service spec.
Fixes: https://tracker.ceph.com/issues/44605
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
mgr/dashboard: connect-rgw: rename to set-rgw-credentials; refactoring
- Rename the dashboard command to better reflect its behavior.
- Rename '_radosgw_admin' method to 'send_rgwadmin_command' for consistency with
'send_mon_command', and move it to mgr_module.py.
- Cleanup: remove unneeded rgw settings.
- Better error handling and test coverage.
Fixes: https://tracker.ceph.com/issues/44605
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
Alfonso Martínez [Wed, 28 Jul 2021 07:48:18 +0000 (09:48 +0200)]
mgr/dashboard: connect-rgw: adaptation and test coverage
- Align Dashboard with cephadm: configure credentials using the same logic.
- Fix: create a 'dashboard' user per realm (before: only on 1st realm).
- Lint fixes, test coverage, method renaming to better reflect behavior and method visibility.
Fixes: https://tracker.ceph.com/issues/44605
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
Adam Kupczyk [Mon, 9 Aug 2021 13:59:46 +0000 (15:59 +0200)]
os/bluestore: Better handling of deferred write trigger
Now deferred write in _do_alloc_write does not depend on the blob size,
but on the size of the extent allocated on disk.
It is now possible to set bluestore_prefer_deferred_size way larger than
bluestore_max_blob_size and still get the desired behavior.
Example: for deferred=256K, blob=64K: when the op writes 128K, both blobs will be
written as deferred; when the op writes 256K, everything goes as a regular write.
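A hedged sketch of the new trigger (use_deferred is a made-up helper; the strict '<' threshold is a simplification of _do_alloc_write chosen to match the example above):
```
#include <cstdint>
#include <iostream>

// Simplified model of the trigger described above; the real _do_alloc_write
// has more conditions than this.
bool use_deferred(uint64_t extent_alloc_len, uint64_t prefer_deferred_size) {
    // Deferred iff the extent allocated on disk is smaller than the
    // threshold; the blob size no longer matters.
    return extent_alloc_len < prefer_deferred_size;
}

int main() {
    const uint64_t prefer_deferred = 256 * 1024;  // bluestore_prefer_deferred_size
    // 128K op write: two 64K blobs, allocated extent 128K < 256K -> deferred.
    std::cout << use_deferred(128 * 1024, prefer_deferred) << "\n";  // prints 1
    // 256K op write: allocated extent reaches the threshold -> regular write.
    std::cout << use_deferred(256 * 1024, prefer_deferred) << "\n";  // prints 0
}
```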
Sage Weil [Mon, 9 Aug 2021 18:15:28 +0000 (14:15 -0400)]
cephadm: fix container name detection
'enter' was broken because we weren't correctly identifying the container
name. Strip the newline from the inspect result so that we can reliably
match against the 'running' state.
Neha Ojha [Mon, 9 Aug 2021 14:35:01 +0000 (14:35 +0000)]
qa/suites/rados/perf/ceph.yaml: remove rgw
This is no longer required because we removed the cosbench workloads in fd350fd0150a2d4072f055658c20314a435a19ba. Removing it also prevents
failures like the following, caused by changes that break the rgw task:
```
2021-08-06T20:13:25.812 INFO:teuthology.orchestra.run.smithi060.stderr:curl: (7) Failed to connect to smithi060.front.sepia.ceph.com port 80: Connection refused
2021-08-06T20:15:33.813 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_04c2febe7099917d97a71271f17abb5710030132/teuthology/contextutil.py", line 31, in nested
vars.append(enter())
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/home/teuthworker/src/github.com_ceph_ceph-c_3c0f8c8164075af7aac4d1f2805d3f4580709461/qa/tasks/rgw.py", line 191, in start_rgw
wait_for_radosgw(url, remote)
File "/home/teuthworker/src/github.com_ceph_ceph-c_3c0f8c8164075af7aac4d1f2805d3f4580709461/qa/tasks/util/rgw.py", line 94, in wait_for_radosgw
assert exit_status == 0
AssertionError
```
Kefu Chai [Sun, 8 Aug 2021 17:21:38 +0000 (01:21 +0800)]
crimson/common: instantiate interrupt_cond in .cc
so we can explicitly instantiate it.
this should address the segfault when accessing interrupt_cond when
it is defined as a plain thread local storage template variable in the
header file.
it seems Clang is not able to identify the access to the TLS variable, and
the value of the %fs segment register of the main thread is always zero if
interrupt_cond is defined as a plain global variable stored in
thread-local storage.
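A minimal sketch of the general fix pattern, assuming illustrative names (my_cond and the file split are hypothetical, not crimson's actual code): declare the thread-local variable extern in the header and define it exactly once in a .cc file, so every translation unit refers to one explicitly instantiated TLS object:
```
// interrupt_cond.h -- declaration only; every TU sees the same variable.
// (my_cond and these file names are illustrative, not crimson's code.)
struct my_cond {};  // stand-in for a concrete interruption condition

extern thread_local my_cond* interrupt_cond;

// interrupt_cond.cc -- the single explicit definition; the compiler emits one
// TLS instance here instead of an implicit instantiation per translation unit.
thread_local my_cond* interrupt_cond = nullptr;

int main() {
    interrupt_cond = nullptr;  // ordinary TLS access through the declaration
}
```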