Vikhyat Umrao [Fri, 30 Aug 2019 07:16:46 +0000 (00:16 -0700)]
radosgw-admin: add support for --bucket-id in bucket stats command
Fixes: https://tracker.ceph.com/issues/41061
Signed-off-by: Vikhyat Umrao <vikhyat@redhat.com>
(cherry picked from commit 4cd16e13ca0c8709091737ad2cb2b37a3b19840d)
Conflicts:
src/rgw/rgw_admin.cc
nautilus uses opt_cmd == OPT_BUCKET_STATS
nautilus does not have store->ctl()->meta.mgr; use store->meta_mgr
src/rgw/rgw_bucket.cc
nautilus has a different declaration for RGWBucket::link
nautilus cannot take nullptr in rgw_bucket_parse_bucket_key; use &shard_id
src/rgw/rgw_bucket.h
nautilus does not have set_tenant(); add it
nautilus does not have get_tenant(); add it
qa/tasks: do not cancel pending pg num changes on mimic
mimic does not support auto split/merge, but we do test mimic-x on
nautilus, which ends up with failures like:
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_py2/teuthology/contextutil.py", line 34, in nested
yield vars
File "/home/teuthworker/src/git.ceph.com_ceph_nautilus/qa/tasks/ceph.py", line 1928, in task
ctx.managers[config['cluster']].stop_pg_num_changes()
File "/home/teuthworker/src/git.ceph.com_ceph_nautilus/qa/tasks/ceph_manager.py", line 1806, in stop_pg_num_changes
if pool['pg_num'] != pool['pg_num_target']:
KeyError: 'pg_num_target'
so we need to skip this if 'pg_num_target' is not in pg_pool_t::dump().
this change is not cherry-picked from master, as we don't test
mimic-x on master.
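A minimal sketch of the guard, assuming the pool dicts come from an OSDMap dump (names are illustrative, not the actual qa/tasks code):

    def pools_with_pending_pg_num_changes(osd_dump):
        """Return pools with an unfinished pg_num change, tolerating
        pre-nautilus OSDMaps that lack 'pg_num_target'."""
        pending = []
        for pool in osd_dump['pools']:
            # mimic's pg_pool_t::dump() has no 'pg_num_target' key, so
            # skip the comparison instead of raising KeyError
            if 'pg_num_target' not in pool:
                continue
            if pool['pg_num'] != pool['pg_num_target']:
                pending.append(pool['pool_name'])
        return pending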
Jason Dillaman [Wed, 11 Mar 2020 19:11:10 +0000 (15:11 -0400)]
qa/workunits/rbd: wait for nbd map to close after unmap
The unmap action only sends a signal to the kernel to notify the
rbd-nbd daemon to disconnect. Therefore, it's possible that an
unmap followed by an immediate re-map to the same device might
fail since the unmap is still in progress.
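A hedged sketch of the wait, assuming we poll the kernel's per-device pid file in sysfs (path and timeout are illustrative; the actual change is in the rbd qa workunit scripts):

    import os
    import time

    def wait_for_nbd_unmap(dev, timeout=30):
        """Poll until the nbd device has no rbd-nbd daemon attached,
        so an immediate re-map of the same device cannot race."""
        pid_file = '/sys/block/{}/pid'.format(os.path.basename(dev))
        deadline = time.time() + timeout
        while time.time() < deadline:
            if not os.path.exists(pid_file):
                return True  # device fully torn down; safe to re-map
            time.sleep(0.5)
        return False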
Fixes: https://tracker.ceph.com/issues/44567
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 80a3f18cafb4add1624cc690bba436a1284dc634)
rgw: reshard: skip stale bucket id entries from reshard queue
If we encounter a reshard queue entry whose bucket ID is older than the
bucket's current ID, some other process or a manual reshard has already
processed this entry, so skip it this time. An alternative would be to
verify that the num_shards recorded in the queue entry is >= the bucket's
current shard count, but that could reshard a recently manually resharded
bucket again, which might not be intended.
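A hedged sketch of the staleness check (Python for illustration only; the actual change lives in rgw's C++ reshard code and these names are assumptions):

    from collections import namedtuple

    ReshardEntry = namedtuple('ReshardEntry', ['bucket_name', 'bucket_id'])
    BucketInfo = namedtuple('BucketInfo', ['bucket_name', 'bucket_id'])

    def is_stale(entry, current):
        # an entry queued against an older bucket instance means another
        # worker or a manual reshard already replaced the bucket; process
        # it again and we would reshard a freshly resharded bucket
        return entry.bucket_id != current.bucket_id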
Jan Fajerski [Wed, 4 Mar 2020 10:39:40 +0000 (11:39 +0100)]
ceph-volume: available_lvm: vg space takes precedence
This changes available_lvm to check for generic reasons only if no VGs
were found. A VG can contain a (mounted) lv, which triggers the
ro/locked test, despite the VG having space available.
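A hedged sketch of the precedence, with illustrative attribute names (not the actual ceph-volume API):

    def available_lvm(device):
        # a device that already carries VGs is judged by VG free space
        # alone; a mounted LV on such a VG must not trip the generic
        # read-only/locked rejections
        if device.vgs:
            return any(vg.free_extents > 0 for vg in device.vgs)
        # only VG-less devices fall through to the generic checks
        return not (device.read_only or device.locked)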
Volker Theile [Wed, 6 Nov 2019 14:02:49 +0000 (15:02 +0100)]
mgr/dashboard: Refactor Python unittests
* Make use of the KVStoreMockMixin class to get rid of duplicate code.
* Fake the index.html file to be able to run tests/test_home.py locally without building the frontend in production mode.
* Encapsulate helper functions in controllers/home.py, otherwise tests/test_feature_toggles.py would need to fake the filesystem, because load_controllers() will load the home.py controller and fail due to missing files in the filesystem.
in tasks/module_selftest.yaml, `TestModuleSelftest.test_telegraf()` is
called, but we fail to prepare a unix domain socket to which the telegraf
module can send stats. the telegraf module does not catch the
FileNotFoundError exception, so the exception propagates to ceph-mgr
and is found by the test; hence the test is marked a failure whenever
telegraf is tested.
in this change,
* catch this exception, so it won't be caught by ceph-mgr
* whitelist the error message, so the test can pass
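A minimal sketch of the catch described above, assuming the module writes to a unix datagram socket (names simplified, not the exact telegraf module code):

    import socket

    def send_to_telegraf(path, payload):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        try:
            sock.connect(path)
            sock.sendall(payload)
        except FileNotFoundError as e:
            # the socket may simply not exist yet; log and carry on
            # instead of letting the exception propagate to ceph-mgr
            print('telegraf socket {} missing: {}'.format(path, e))
        finally:
            sock.close()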
Tim Serong [Thu, 2 Apr 2020 05:46:57 +0000 (16:46 +1100)]
mgr/PyModule: fix missing tracebacks in handle_pyerror()
In certain cases, errors raised in mgr modules don't actually result in a
proper traceback in the mgr log; all you see is a message like "'Hello'
object has no attribute 'dneasdfasdf'", but you have no idea where that
came from, which is a complete PITA to debug.
Here's what's going on: handle_pyerror() calls PyErr_Fetch() to get
information about the error that occurred, then passes that information
back to python's traceback.format_exception() function to get the traceback.
If we write code in an mgr module that explicitly raises an exception
(e.g.: 'raise RuntimeError("that didn't work")'), the error value returned
by PyErr_Fetch() is of type RuntimeError, and traceback.format_exception()
does the right thing. If however we accidentally write code that's just
broken (e.g.: 'self.dneasdfasdf += 1'), the error value returned is not
an actual exception, it's just a string. So traceback.format_exception()
freaks out with something like "'str' object has no attribute '__cause__'"
(which we don't actually ever see in the logs), which in turn dumps us in a
"catch (error_already_set const &)" block, which just prints out the
single line error string.
https://docs.python.org/3/c-api/exceptions.html#c.PyErr_NormalizeException
tells us that "Under certain circumstances, the values returned by
PyErr_Fetch() below can be “unnormalized”, meaning that *exc is a class
object but *val is not an instance of the same class.". And that's exactly
the problem we're having here. We're getting a 'str', not an Exception.
Adding a call to PyErr_NormalizeException() turns the value back into a
proper Exception type and traceback.format_exception() now always does the
right thing.
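The failure mode is easy to reproduce in plain Python by handing traceback.format_exception() an unnormalized value, i.e. a bare string where an exception instance is expected:

    import traceback

    try:
        # a (type, value, tb) triple whose value is a plain str, which is
        # what an unnormalized PyErr_Fetch() result can look like
        traceback.format_exception(AttributeError, "'Hello' object has no attribute 'dneasdfasdf'", None)
    except AttributeError as e:
        print(e)  # 'str' object has no attribute '__cause__'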
I've also added calls to peek_pyerror() in the catch blocks, so if anything
else ever somehow causes traceback.format_exception to fail, we'll at least
have an idea of what it is in the log.
Sage Weil [Wed, 11 Sep 2019 22:26:52 +0000 (17:26 -0500)]
mon: disable min pg per osd warning
Now that the pg_autoscaler is on by default, it is "normal" (and okay) to
have a small number of PGs in the cluster if the overall cluster usage is
also low. This setting just results in a health warning out of the box
when you create a pool and haven't written any data yet.
J. Eric Ivancich [Fri, 31 Jan 2020 20:01:40 +0000 (15:01 -0500)]
rgw: fix bug with (un)ordered bucket listing and marker w/ namespace
When listing without specifying a namespace, the returned entries
could be in one or more namespaces. The marker used to continue the
listing may therefore contain a namespace, and that needs to be
preserved. This fixes a bug in both ordered and unordered listings
where it was not preserved.
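A hedged sketch of preserving the namespace in the marker (rgw encodes namespaced keys with the namespace embedded in the name; these names are illustrative, not the actual C++ code):

    from collections import namedtuple

    Entry = namedtuple('Entry', ['namespace', 'name'])

    def continuation_marker(last):
        # the marker must carry the full key, namespace included, so the
        # next page of the listing resumes in the right namespace
        if last.namespace:
            return '_{}_{}'.format(last.namespace, last.name)
        return last.name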
Shilpa Jagannath [Tue, 26 Nov 2019 08:03:52 +0000 (13:33 +0530)]
rgw: when a period lookup for oldest_realm_epoch returns ENOENT,
find the oldest period available and update RGWMetadataLogHistory. This
avoids an empty cursor being passed to ceph_assert() in
PurgePeriodLogsCR::operate() when the period history is incomplete.
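A hedged sketch of the fallback (illustrative names; the real change is in rgw's C++ metadata log code):

    import errno

    def anchor_epoch(lookup, recorded_epoch, known_epochs):
        # if the recorded oldest period is gone, anchor the history on
        # the oldest period we can still find instead of handing an
        # empty cursor to the purge coroutine
        if lookup(recorded_epoch) == -errno.ENOENT and known_epochs:
            return min(known_epochs)
        return recorded_epoch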
mgr/dashboard: fix COVERAGE_PATH in run-backend-api-tests.sh
As we cannot backport directly https://github.com/ceph/ceph/pull/33407
to nautilus and COVERAGE_PATH is invalid for CentOS7 + py2,
this fix is applied directly to nautilus.
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
Yan, Zheng [Tue, 9 Oct 2018 03:46:56 +0000 (11:46 +0800)]
mds: handle bad purge queue item encoding
The bad encoding was introduced by commit a88f8d5eb4
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
(cherry picked from commit ba06fcbe3e8345dc2c4c4c3dc3bcc18acf5ab076)
Conflicts:
src/mds/PurgeQueue.cc: advance changed to +=
Fixes: https://tracker.ceph.com/issues/36635
Note: This commit from v13.2.3 fixes a bad backport in v13.2.2. It is
also required in Octopus/Nautilus to handle upgrades.
(cherry picked from commit b73d1989bcbea227017607f8dd6e79633ec11f8f)
Conflicts:
src/mds/PurgeQueue.cc: += changed to advance
Sage Weil [Tue, 3 Mar 2020 16:09:06 +0000 (10:09 -0600)]
common/ceph_time: tolerate mono time going backwards
Some kernels (and possibly some hardware?) can trigger a monotonic clock
that goes back in time. That, in turn, can lead to a negative monotonic
time span. This would trigger an assert.
Since this problem seems to be widespread, tolerate the case and interpret
it as a 0-length interval (vs something negative).
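A minimal sketch of the clamp (the real change is in common/ceph_time's C++; Python only to illustrate the interpretation):

    def mono_span_ns(start_ns, end_ns):
        # a monotonic clock stepping backwards would yield a negative
        # span; clamp to a 0-length interval instead of asserting
        return max(0, end_ns - start_ns)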
Fixes: https://tracker.ceph.com/issues/44078
Fixes: https://tracker.ceph.com/issues/43365
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit cd00378720459d92394697e2cb8086f30c220312)
Conflicts:
src/pybind/rbd/rbd.pyx
- no "snap_exists", "snap_get_name", "snap_get_id",
"mirror_image_create_snapshot", "mirror_image_get_mode", "config_set",
"config_get", "config_remove", "snap_get_mirror_namespace", in nautilus
- nautilus "mirror_image_enable" does not take any argument
zhangdaolong [Tue, 24 Mar 2020 00:51:44 +0000 (08:51 +0800)]
pybind/rbd: fix ImageNotFound being raised when no lockers are obtained
When no lockers are obtained, an ImageNotFound exception was raised even
though the image exists. When the number of lockers is zero, no
exception should be raised.
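A hedged sketch of the intended behaviour (illustrative names, not the actual rbd.pyx code):

    def get_lockers(image):
        lockers = image.list_lockers()
        if not lockers:
            # an unlocked image is a normal state; report "no lockers"
            # rather than raising ImageNotFound
            return []
        return lockers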
Fixes: https://tracker.ceph.com/issues/44613
Signed-off-by: zhangdaolong <zhangdaolong@fiberhome.com>
(cherry picked from commit a183aac978dac69f996250324975073a78cb476b)
Conflicts:
qa/tasks/ceph.py
- when tmpfs journal was enabled, something slightly different was
happening in nautilus compared to master, but this commit is ripping
out the whole thing
Neha [Tue, 25 Feb 2020 03:01:41 +0000 (03:01 +0000)]
osd/PeeringState.h: ignore RemoteBackfillReserved in WaitLocalBackfillReserved
It is possible to dequeue an outstanding RemoteBackfillReserved, though we may have
already released reservations for that backfill target. Currently, if this happens
while we are in WaitLocalBackfillReserved, it can lead to a crash on the primary.
Prevent this by treating this condition as a no-op.
The longer term fix is to add a RELEASE_ACK mechanism, which prevents the primary
from scheduling a backfill retry until all the RELEASE_ACKs have been received.
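A hedged sketch of the no-op (the real fix is a boost::statechart reaction in PeeringState.h; Python only to illustrate the decision):

    def react_in_wait_local_backfill_reserved(event):
        # a RemoteBackfillReserved dequeued after its reservation was
        # already released is stale in this state; discard it rather
        # than crash the primary
        if event == 'RemoteBackfillReserved':
            return 'discard'
        return 'forward'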
Sage Weil [Wed, 19 Feb 2020 23:42:09 +0000 (17:42 -0600)]
mon: stash newer map on bootstrap when addr doesn't match
If we have to respawn because a newer monmap comes along where our addr or
rank changes, we need to use that on restart in order to make progress.
Stash the newer map in a temporary location and use it when we restart.
Don't bother cleaning up. Having this map here is harmless, since we
only use it if it is newer than what is in paxos.
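A minimal sketch of the preference, with illustrative names (the real logic is in the C++ mon bootstrap path):

    def choose_monmap(paxos_map, stashed_map):
        # the stashed map is only consulted when it is newer than what
        # paxos has, so leaving it behind is harmless
        if stashed_map is not None and stashed_map.epoch > paxos_map.epoch:
            return stashed_map
        return paxos_map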