Ilya Dryomov [Mon, 19 Jun 2023 18:42:46 +0000 (20:42 +0200)]
test/librados: make sense of LibRadosMiscConnectFailure.ConnectFailure
Over the years, with commits 8bc197400d94 ("ceph_test_librados_api_misc:
fix stupid LibRadosMiscConnectFailure.ConnectFailure test") and f357459e6b15 ("test/librados: modify LibRadosMiscConnectFailure.ConnectFailure
to comply with new seconds unit"), this test has lost its original
meaning. Any ability to time out is definitely gone since a 1 second
timeout is too high to kick in on a normally functioning setup, not to
mention that the timeout was being ballooned to 10 seconds until the
previous commit.
The first connection attempt would normally succeed while the second
one would immediately fail with a "cannot identify monitors to contact"
error due to the cluster handle getting recreated. The 16-iteration
loop is dead code.
This commit just codifies the above to avoid the appearance that this
test has anything to do with timeouts.
Ilya Dryomov [Mon, 19 Jun 2023 17:53:39 +0000 (19:53 +0200)]
mon/MonClient: resurrect original client_mount_timeout handling
While reducing a "waiting for config" timeout from 30 seconds to 3
(mon_client_hunt_interval default) and instead introducing 10 retries,
commit 3c2b30e4c5dd ("mon/MonClient: apply timeout while fetching
config") also subjected authenticate() to these retries. However,
authenticate() is going by client_mount_timeout which defaults to
5 minutes. As a result, when the monitors are unreachable or there
are other connectivity issues, we end up taking 50 minutes to return
ETIMEDOUT from rados_connect().
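The 50 minute figure follows directly from the numbers above: each of the 10 retries can wait out the full 5 minute client_mount_timeout. A minimal arithmetic sketch (the variable names are illustrative and are not MonClient identifiers):

```python
from datetime import timedelta

# Values taken from the commit message above; the variable names are
# illustrative and are not identifiers from MonClient.
client_mount_timeout = timedelta(minutes=5)  # default, per the message
retries = 10                                 # introduced by 3c2b30e4c5dd

# If every attempt waits out the full timeout, the waits add up.
worst_case = retries * client_mount_timeout
worst_case_minutes = worst_case.total_seconds() / 60
print(worst_case_minutes)  # 50.0
```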
Conflicts:
src/test/librados/aio.cc:
removed test case for rados_aio_write_op_operate2()
which wasn't backported
test case for rados_aio_write_op_operate() uses rados_stat()
instead of rados_stat2() which doesn't exist on pacific
no test_data.m_oid, used "foo" for oids
Nitzan Mordechai [Wed, 10 May 2023 09:42:07 +0000 (09:42 +0000)]
mon/MonClient: before complete auth with error, reopen session
When MonClient tries to authenticate and fails with -EAGAIN, there is
a possibility that we are no longer hunting and do not have active_con.
This results in MonClient disconnecting while ticks continue
without an open session.
The solution is to check at the end of auth whether we got an -EAGAIN
error and, if so, reopen the session and try auth again on the next tick.
Casey Bodley [Tue, 23 May 2023 16:31:54 +0000 (12:31 -0400)]
librados: use ObjectOperationImpl for rados_write_op_t
the c++ api uses ObjectOperationImpl to wrap ObjectOperation with
additional storage for an optional mtime. the c api now reuses
ObjectOperationImpl for its write operations only - the mtime isn't
needed for read ops
librbd: localize snap_remove op for mirror snapshots
A client may not request the lock quickly enough to obtain the
exclusive lock for an operation when a competing client responds
more quickly. This can happen when a peer site has
different performance characteristics or latency. Instead of
relying on this unpredictable behavior, localize the operation to
the primary cluster.
Fixes: https://tracker.ceph.com/issues/59393
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit ac552c9b4d65198db8038d397a3060d5a030917d)
Conflicts:
src/cls/rbd/cls_rbd.cc [ commit 3a93b40 ("librbd:
s/boost::variant/std::variant/") not in pacific ]
src/librbd/mirror/snapshot/UnlinkPeerRequest.cc [ ditto ]
ceph-volume: fix a bug in `get_lvm_fast_allocs()` (batch)
`get_lvm_fast_allocs()` in `devices/lvm/batch.py` calls the property
`Device.used_by_ceph` in order to filter out devices that are already
used by ceph. The issue is that `Device.used_by_ceph()` itself filters
out journal devices (db/wal) given that a db/wal device can be shared
between multiple OSDs. The consequence is that `Device.used_by_ceph()`
always returns False for a db/wal device (even if it is actually
already used by ceph) so `get_lvm_fast_allocs()` always returns the
full list of the passed db/wal devices on the `lvm batch` CLI command.
Finally, the logic in `devices.lvm.batch.get_deployment_layout()`
checks whether the length of the list returned by `get_lvm_fast_allocs()`
is equal to `num_osds` (the number of OSDs being created); if not, it fails.
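The filtering problem described above can be reproduced with a toy model. This is only a sketch under stated assumptions: `ToyDevice` and `toy_fast_allocs` are hypothetical stand-ins, not the ceph-volume `Device` class or `get_lvm_fast_allocs()`, and the real property inspects LVs rather than a boolean flag.

```python
from dataclasses import dataclass


@dataclass
class ToyDevice:
    """Hypothetical stand-in for a device; not the ceph-volume Device class."""
    path: str
    role: str            # "data", "db" or "wal"
    holds_ceph_lv: bool  # True if a Ceph LV already lives on the device

    @property
    def used_by_ceph(self) -> bool:
        # Mirrors the behaviour described above: db/wal devices are
        # filtered out, so they are never reported as used.
        if self.role in ("db", "wal"):
            return False
        return self.holds_ceph_lv


def toy_fast_allocs(devices):
    """Keep only the devices that do not look used by Ceph."""
    return [d for d in devices if not d.used_by_ceph]


# Two db devices, one of which already carries a Ceph LV.
db_devices = [
    ToyDevice("/dev/sdx", "db", holds_ceph_lv=True),
    ToyDevice("/dev/sdy", "db", holds_ceph_lv=False),
]
remaining = toy_fast_allocs(db_devices)
print(len(remaining))  # the in-use db device was not filtered out
```

In this model the filter is a no-op for db/wal devices, so the returned list always has the same length as the list that was passed in.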
Laura Flores [Mon, 5 Jun 2023 20:23:42 +0000 (15:23 -0500)]
qa/suites/rados: remove rook coverage from the rados suite
The rook team relies on a daily CI system to validate
rook changes. It doesn't seem that the teuthology tests
are maintained, so it makes sense to remove them from the
rados suite.
By removing this symlink, rook test coverage will remain
in the orch suite, and coverage will only be removed from the
rados suite.
Workaround for: https://tracker.ceph.com/issues/58585
Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit c26674ef4c6cbbdd94c54cafbd66e98704f044d7)
This commit https://github.com/ceph/ceph/commit/bdb2241ca5a9758e8c52d47320d8b5ea0766aea9
was updating on logging changes in quincy, but seems to have been
erroneously included in a pacific batch backport https://github.com/ceph/ceph/pull/42736
This stuff doesn't work in pacific. For example,
[ceph: root@vm-00 /]# ceph version
ceph version 16.2.13-257-gd8c5d349 (d8c5d34975dce1c5eb0aa3a7979a4d9b9a99d1ec) pacific (stable)
[ceph: root@vm-00 /]# ceph config set global log_to_journald false
Error EINVAL: unrecognized config option 'log_to_journald'
Ilya Dryomov [Sat, 27 May 2023 10:28:40 +0000 (12:28 +0200)]
osd/OSDCap: allow rbd.metadata_list method under rbd-read-only profile
This was missed in commit acc447d5de7b ("osd/OSDCap: rbd profile
permits use of rbd.metadata_list cls method") which adjusted only
"profile rbd" OSD cap. Listing image metadata is an essential part
of opening the image and "profile rbd-read-only" OSD cap must allow
it too.
While at it, constrain the existing grant for rbd profile from "any
object in the pool" to just "rbd_info object in the global namespace of
the pool" as this is where pool-level image metadata actually lives.
Nitzan Mordechai [Wed, 17 May 2023 05:47:09 +0000 (05:47 +0000)]
test: correct osd pool default size
Using the default pool size of 2 with random eio thrashing can cause
some of the objects to be marked as lost.
Fix the typo 'osd default pool size: 3' to read 'osd pool default size: 3'
so that the pool size is correctly set to 3.
Nitzan Mordechai [Thu, 18 May 2023 13:37:38 +0000 (13:37 +0000)]
test: monitor thrasher wait until quorum
With a 1 second delay we may sometimes fail to get the correct quorum
length because the monitor has not updated in time.
With this fix, we wait for quorum, checking every few
seconds (3) until timeout (30).
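The wait-and-recheck pattern described above can be sketched as a generic polling helper. This is an illustration only: `wait_until` and `quorum_has_expected_size` are hypothetical names, not functions from the thrasher or teuthology, and only the 3 second interval and 30 second timeout come from the message.

```python
import time


def wait_until(predicate, interval=3.0, timeout=30.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll predicate() every `interval` seconds until it returns True
    or `timeout` seconds have elapsed. Returns True on success."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with a fake quorum that forms on the third check. The sleep
# function is stubbed out so the example finishes instantly.
checks = []


def quorum_has_expected_size():
    checks.append(1)
    return len(checks) >= 3


ok = wait_until(quorum_has_expected_size, sleep=lambda seconds: None)
```

Checking the predicate before sleeping means a quorum that is already correct costs no delay, unlike a fixed sleep followed by a single check.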
Zac Dover [Thu, 25 May 2023 09:01:49 +0000 (19:01 +1000)]
doc/rados: fix link in common.rst
Fix a link in doc/rados/configuration/common.rst that was missing its
final letter, causing a 404 error when readers attempted to follow it.
This bug was reported by stalwart friend of the Ceph documentation
project Eugen Block, who is here credited as a co-author. This bug was
reported at https://pad.ceph.com/p/Report_Documentation_Bugs.
Zac Dover [Mon, 22 May 2023 21:41:09 +0000 (07:41 +1000)]
doc/glossary: update bluestore entry
Update the BlueStore entry in the glossary, explaining that as of Reef
BlueStore and only BlueStore (and not FileStore) is the storage backend
for Ceph.
This topic has been discussed many times; recently at the Dev
Summit of Cephalocon 2023.
This commit is the minimal version of the work, contained entirely
within the `doc`. However, it will likely be expanded, as there
were ideas such as adding cache tiering back to the experimental feature
list (Sam) to warn users when deploying a new cluster.
doc: Add missing `ceph` command in documentation section `REPLACING AN OSD`
Signed-off-by: Alexander Proschek <alexander.proschek@protonmail.com>
(cherry picked from commit 0557d5e465556adba6d25db62a40ba55a5dd2400)
Zac Dover [Thu, 18 May 2023 21:07:02 +0000 (07:07 +1000)]
doc/radosgw: explain multisite dynamic sharding
Add a note to doc/radosgw/dynamicresharding.rst and a note to
doc/radosgw/multisite.rst that explains that dynamic resharding is not
supported in multisite deployments in releases prior to Reef.
This commit is made in response to a request from Mathias Chapelain.
Co-authored-by: Anthony D'Atri <anthony.datri@gmail.com>
Signed-off-by: Zac Dover <zac.dover@proton.me>
(cherry picked from commit d4ed4223d914328361528990f89f1ee4acd30e79)