qa/standalone: Fix test_activate_osd() test in ceph-helpers.sh
Modify test_activate_osd() to determine the type of scheduler in use and then
verify the value of osd_max_backfills accordingly, because the mclock scheduler
overrides this option to 1000 during OSD initialization.
The test previously passed because the OSD daemon was killed but not marked
down, so when it was brought back up the wait-for-OSD-up check completed
quickly, before the OSD had picked up the latest config values.
Now, upon killing the OSD, the osd_fast_shutdown sequence notifies the mon
(see PR: https://github.com/ceph/ceph/pull/44807) and the OSD is marked down
and dead. Bringing it up again makes the wait-for-OSD-up check take longer,
which is sufficient for the config values to be updated. This results in the
correct values being read from the config 'Values' map.
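A minimal sketch of the kind of check the updated test performs (the OSD id, the use of jq, and the expected non-mclock value of 1 are illustrative assumptions, not the actual test code):
$ scheduler=$(ceph daemon osd.0 config get osd_op_queue | jq -r '.osd_op_queue')
$ backfills=$(ceph daemon osd.0 config get osd_max_backfills | jq -r '.osd_max_backfills')
$ if [ "$scheduler" = "mclock_scheduler" ]; then expected=1000; else expected=1; fi   # mclock overrides osd_max_backfills to 1000
$ [ "$backfills" = "$expected" ]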
Satoru Takeuchi [Thu, 18 Nov 2021 20:48:18 +0000 (20:48 +0000)]
osd: make osd_fast_shutdown_notify_mon option true by default
The osd_fast_shutdown_notify_mon option is false by default, so users suffer
from error log floods, slow ops, and long I/O timeouts on a voluntary OSD
shutdown before they become aware of this option's existence. Let's make it
true by default.
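A short, hedged illustration of inspecting and reverting the new default with the standard config commands (the shown output is simply what the new default implies):
$ ceph config get osd osd_fast_shutdown_notify_mon
true
$ ceph config set osd osd_fast_shutdown_notify_mon false   # opt back into the old behaviour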
doc: Improvements to mClock configuration reference documentation
Improve the documentation around:
- mclock client types.
- Describe mclock config profiles in greater detail.
- Add notes about manually benchmarking OSDs and tuning bluestore throttle
parameters.
- Include a couple of missing mclock configuration options.
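As a quick, illustrative example of the kind of option the updated documentation covers, inspecting and switching the active mclock profile might look like this (the output line is illustrative, not a guaranteed default):
$ ceph config show osd.0 osd_mclock_profile
high_client_ops
$ ceph config set osd osd_mclock_profile high_recovery_ops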
Ronen Friedman [Fri, 25 Mar 2022 10:45:47 +0000 (10:45 +0000)]
osd/scrub: restart snap trimming only after scrubbing is done
Snap trimming that was postponed as the target PG was scrubbing
must be restarted at scrub completion.
PR #38111 moved the trimming restart to just before the scrub fully
terminates. The current PR fixes that.
Trimming is also restarted in those cases where scrub was
queued but aborted immediately.
During the LibRadosWatchNotify.Watch2Delete test, rados_watch_check() can return error -102 if a reconnect happened; in that case the broken pipe triggers a reconnect and -102 is returned.
Fix a problem in store_test::BluestoreBrokenNoSharedBlobRepairTest where the check for an active null-fm was wrong and so reported bogus errors when null-fm was inactive.
The check needs to access the dynamic value and not the config setting (which can be overridden). Signed-off-by: Gabriel BenHanokh <gbenhano@redhat.com>
(cherry picked from commit 2969539d20a8157d62ae27f842c43b801efdc0ee)
Bug fix from PR #44370: need_to_destage_allocation_file was force-set to true on device expansion without checking whether we work in null-fm mode. Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
(cherry picked from commit f7ebef8a804b8ce193bcbee4284dc28102708f37)
os/bluestore: Disable NCB functionality on rotational drives
The NCB code needs to recover the allocation map after an OSD crash.
The recovery process on rotational drives is about 20x slower than on SSDs, making this solution unacceptable for that environment.
Currently, for CEPH_OSD_OP_OMAPRMKEYRANGE ops, clean_omap gets set to true,
which results in incomplete recovery of objects and inconsistent PGs after a
scrub.
Ilya Dryomov [Wed, 16 Mar 2022 19:05:56 +0000 (20:05 +0100)]
librados: check latest osdmap on ENOENT in pool_reverse_lookup()
Avoid spurious ENOENT errors from rados_pool_reverse_lookup() and
Rados::pool_reverse_lookup().
This makes lookup by id consistent with lookup by name: the latter has been
checking the latest osdmap since commit 7e5669b11b14 ("rados: we need to get
the latest osdmap when pool does not exists").
Teoman ONAY [Thu, 11 Nov 2021 15:05:49 +0000 (15:05 +0000)]
cephadm: remove containers pids-limit
The default pids-limit (docker 4096 / podman 2048) prevents some
customizations from working (HTTP threads on RGW) and limits the number
of LUNs per iSCSI target.
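A hedged way to verify the effective limit on a running container (the container name is a placeholder, and the inspect field path is an assumption about the docker/podman inspect format, not about cephadm):
$ podman inspect --format '{{.HostConfig.PidsLimit}}' <container-name>   # 0 or -1 means no limit after this change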
iovecs have an unsigned length (size_t), and before this patch the total
length was computed by adding each iovec's length to a signed length
variable (ssize_t). While the code checked whether the resulting length was
negative on overflow, the case where the length is positive after overflow
was not checked. This patch fixes the overflow check by changing the length
to an unsigned size_t.
Additionally, this patch fixes the case where some iovecs have been
added to the bufferlist and the aio completion has been blocked, but
adding an additional iovec fails because of overflow. This leads to
the UserBufferDeleter trying to unblock the completion on destruction
of the bufferlist but asserting because the completion was never
armed. We avoid this by first computing the total length and checking
for overflows and iovcnt before adding them to the bufferlist.
Ilya Dryomov [Sat, 19 Mar 2022 13:04:52 +0000 (14:04 +0100)]
qa/workunits/rbd/cli_generic.sh: relax trash purge schedule status assert
Commit 08df6e0fd006 ("qa/workunits/rbd: expand LevelSpec parsing
coverage") didn't account for images with a separate data pool. This
was missed because of small-cache-pool.yaml breakage.
Add the snaptrim duration to the JSON-formatted output of the pg dump
stats. Define methods for a PG to set the snaptrim begin time and to
calculate the total time spent trimming all the objects for the snaps in
the PG's snap_trimq.
Tests:
- Librados C and C++ API tests to verify the time spent for a snaptrim
operation on a PG. These tests use the self-managed snaps APIs.
- Standalone tests to verify snaptrim duration using rados pool snaps.
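A hedged example of reading the new value from the JSON output (the field name snaptrim_duration and the jq path are assumptions about the added stat, not verified output):
$ ceph pg ls --format=json | jq '.pg_stats[] | {pgid, snaptrim_duration}'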
Add a new column, OBJECTS_TRIMMED, to the pg dump stats that shows the
number of objects trimmed when a snap is removed.
When a PG splits, the stats from the parent PG are copied to the child
PG. In such a case, reset objects_trimmed to 0 for the child PG
(see PeeringState::split_into()). Otherwise, incorrect stats would be shown
for a child PG after the split operation.
Tests:
- Librados C and C++ API tests to verify the number of objects trimmed
during snaptrim operation. These tests use the self-managed snaps APIs.
- Standalone tests to verify objects trimmed using rados pool snaps.
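Similarly, a hedged one-liner for the new counter (field name objects_trimmed assumed):
$ ceph pg ls --format=json | jq '.pg_stats[] | {pgid, objects_trimmed}'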
J. Eric Ivancich [Wed, 22 Dec 2021 19:45:59 +0000 (14:45 -0500)]
rgw: in bucket reshard list, clarify new num shards is tentative
With dynamic bucket index resharding, when the average number of
objects per shard exceeds the configured value, that bucket is
scheduled for reshard. That bucket may receive more new objects before
the resharding takes place. As a result, the existing code
re-calculates the number of new shards just prior to resharding,
rather than waste a resharding opportunity with too low a value.
The same holds true for a user-scheduled resharding.
A user reported confusion that the number reported in `radosgw-admin
reshard list` wasn't the number that the reshard operation ultimately
used. This commit makes it clear that the new number of shards is
"tentative". And test_rgw_reshard.py is updated to reflect this
altered output.
Additionally this commit adds some modernization and efficiency to the
"reshard list" subcommand.
Matt Benjamin [Fri, 24 Dec 2021 19:35:00 +0000 (14:35 -0500)]
rgwlc: warn on missing RGW_ATTR_LC
This should not happen. If it does (e.g., due to damaged bucket_info),
log the event to assist with debugging.
Fixes: https://tracker.ceph.com/issues/53728 Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
(cherry picked from commit ae1a75c09d11d8f0b626c781112c35de353c0c89)
Xuehan Xu [Sat, 2 Jan 2021 14:50:23 +0000 (22:50 +0800)]
librgw: make rgw file handle versioned
The reason that we need this is that there could be the following scenario:
1. rgw_setattr sets the file attr;
2. rgw_write writes some new data, and encodes its attr to store into rados;
3. before the actual persistence of the file's attr bl, rgw_lookup loads the file's
previous attr and modifies the current file handle's metadata;
4. rgw_write's result persisted to rados;
5. rgw_setattr persists the current file handle's metadata, which is by now stale, to rados.
In this case, the attr in rados would be out of date, which means loss of data.
In RGWBucketCtl::chown we have one RGWObjectCtx for all objects of a bucket.
In RGWObjectCtx there is a cache mechanism (std::map) for object states that
grows continuously. For buckets with millions of objects this mechanism leads
to huge memory usage. The chown process does not really need this caching, so
we can create one RGWObjectCtx per 1000 objects to limit memory usage.
Fixes: https://tracker.ceph.com/issues/53599 Signed-off-by: Mohammad Fatemipour <mohammad.fatemipour@sotoon.ir>
(cherry picked from commit cf2d83ef81458524715c23e802977dc0760c847f)
Addition of a SCRUB_DURATION field that shows how long the scrub/deep-scrub of a pg took.
This field will be displayed in the output of the "ceph pg dump --format=json" and "ceph pg ls-by-pool --format=json" commands.
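A hedged example of extracting the new field with the commands named above (the pool name, the jq path, and the exact field name scrub_duration are assumptions):
$ ceph pg ls-by-pool rbd --format=json | jq '.pg_stats[] | {pgid, scrub_duration}'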
Lucian Petrut [Mon, 7 Mar 2022 08:12:23 +0000 (08:12 +0000)]
include: Define dlfcn.h on Windows
"dlfcn.h" is not available on Windows, so Ceph provides a drop-in
replacement through "dlfcn_compat.h".
The issue is that directly including "dlfcn.h" currently fails, so we simply
add a file called "dlfcn.h" that includes "dlfcn_compat.h".
Kamoltat [Mon, 28 Feb 2022 21:40:43 +0000 (21:40 +0000)]
upgrade/pacific-x/parallel: Added mds.a and mds.b
Added mds daemons so that the test can create CephFS pools and set options
using `do_set_pool()` in FSCommand.cc, such that we can cover corner cases
like that in
ceph-fuse: perform cleanup if test_dentry_handling failed
If the remount fails for some reason, ceph_abort() gets called, which
terminates the child process without cleanup.
To fix this issue, the ceph_abort() call is moved to after the cleanup is
performed.
quiesce all activities and destage allocations to disk before killing the OSD
1) keep the old (unsafe) fast-shutdown when we are not using NCB (non null-manager())
2) skip service.prepare_to_stop() which can take as much as 10 seconds
3) skip debug options in fast-shutdown
4) set_state(STATE_STOPPING) which will stop accepting new tasks to this OSD
5) clear op_shardedwq queues; this is safe since we haven't started processing them
6) stop timer
7) drain osd_op_tp (no new items will be added)
8) now we can safely call umount, which will close_db/bluefs and destage allocations to disk
9) skip _shutdown_cache() when we are in the middle of a fast-shutdown
10) increase debug level on fast-shutdown
11) add option for bluestore_qfsck_on_mount to force scan on mount for all tests
12) disable fsck-on-umount when running fast-shutdown
13) add an option to increase debug level at fast-shutdown umount()
14) set a time limit to fast-shutdown
15) Bug-Fix BlueStore::pool_statfs don't access db after it was removed
16) Fix error message for qfsck (error was caused by PR https://github.com/ceph/ceph/pull/44563)
17) make the shutdown timeout configurable (see the sketch below)
Fixes: https://tracker.ceph.com/issues/53266 Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
(cherry picked from commit 9b2a64a5f6ea743b2a4f4c2dbd703248d88b2a96)
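A hedged sketch of exercising the knobs from items 11 and 17 of the list above (bluestore_qfsck_on_mount is named in item 11; osd_fast_shutdown_timeout is an assumed name for the configurable shutdown timeout; the values are illustrative):
$ ceph config set osd bluestore_qfsck_on_mount true        # force the quick fsck scan on mount (item 11)
$ ceph config set osd osd_fast_shutdown_timeout 15         # assumed option name for the shutdown time limit (item 17)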
Dan van der Ster [Thu, 24 Feb 2022 08:42:00 +0000 (09:42 +0100)]
osd: require osd_pg_max_concurrent_snap_trims > 0
If osd_pg_max_concurrent_snap_trims is zero, we mistakenly clear
the snaptrim queue. Require it to be > 0.
Fixes: https://tracker.ceph.com/issues/54396 Signed-off-by: Dan van der Ster <daniel.vanderster@cern.ch>
(cherry picked from commit 29545b617b3b0324f9b0b20e032e3e38557115eb)
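For illustration, adjusting the option with the standard config command; the value 2 is only an example, and as described above a value of 0 is no longer honoured:
$ ceph config set osd osd_pg_max_concurrent_snap_trims 2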
Ilya Dryomov [Tue, 8 Mar 2022 12:56:15 +0000 (13:56 +0100)]
test/librbd/test_notify.py: effect post object map rebuild assert
Instead of just optionally skipping the update_features test, commit
9c0b239d70cd ("qa/upgrade: conditionally disable update_features
tests") moved it after the rebuild_object_map test. This isn't right
because the update_features test invalidates the object map as a side
effect and the rebuild_object_map test is what makes it valid again.
Invoke "rbd mirror snapshot schedule ls -R" and "rbd mirror snapshot
schedule status" commands on all levels, consistently. In particular,
make sure that an image level schedule is listed for a recursive query
at the pool level both before and after the schedule kicks in:
$ rbd create --size 1G --mirror-image-mode snapshot -p foo bar
$ rbd mirror snapshot schedule add -p foo --image bar 1m
$ rbd mirror snapshot schedule ls -p foo -R
POOL NAMESPACE IMAGE SCHEDULE
foo bar every 1m
<wait for schedule to become visible in status>
$ rbd mirror snapshot schedule ls -p foo -R
POOL NAMESPACE IMAGE SCHEDULE
foo bar every 1m
Also, make sure that pool and image level status queries work:
$ rbd mirror snapshot schedule status -p foo
SCHEDULE TIME IMAGE
2022-03-04 07:14:00 foo/bar
$ rbd mirror snapshot schedule status -p foo --image bar
SCHEDULE TIME IMAGE
2022-03-04 07:14:00 foo/bar
Both of these issues are fixed by the previous commit.
Sunny Kumar [Thu, 24 Feb 2022 16:07:39 +0000 (16:07 +0000)]
mgr/rbd_support: cast pool_id from int to str when collecting LevelSpec
While collecting LevelSpec using class method from_name make sure to cast
pool_id from int to string. This is necessary to match the internal
representation of LevelSpec where pool_id is maintained as str.
Kefu Chai [Sat, 5 Mar 2022 04:49:57 +0000 (12:49 +0800)]
cmake: pass RTE_DEVEL_BUILD=n when building dpdk
Ceph is still using the Makefile-based build system for building DPDK,
and DPDK enables -Werror if RTE_DEVEL_BUILD is 'y', which is the default
when DPDK is built from a git repo.
Newer GCC is pickier than older versions, so to prevent a possible FTBFS
when we switch to a newer GCC for building old branches whose DPDK
submodule might not include the changes addressing those warnings, let's
just disable this option.
The only effect of this option is to add -Werror to CFLAGS, and build
warnings from DPDK are not our focus when developing Ceph in most cases,
so this should be fine.
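For reference, a minimal sketch of what this amounts to when driving DPDK's legacy Makefile build by hand (target and paths are illustrative; in Ceph the variable is passed by CMake when it builds the dpdk submodule):
$ make -C src/dpdk config T=x86_64-native-linuxapp-gcc
$ make -C src/dpdk RTE_DEVEL_BUILD=n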