Igor Fedotov [Fri, 11 Nov 2022 14:31:19 +0000 (17:31 +0300)]
os/bluestore: introduce a cooldown period for failed BlueFS allocations.
When using bluefs_shared_alloc_size, one might get into a long-lasting state
in which chunks of that size are no longer available and a fallback to the
shared device's min alloc size occurs. The introduced cooldown is intended to
prevent repeated allocation attempts with bluefs_shared_alloc_size for
a while. The rationale is to eliminate the performance penalty these failing
attempts might cause.
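A minimal sketch of the idea (identifiers are illustrative, not the actual
BlueFS members):

    #include <chrono>
    #include <cstdint>

    using mono_clock = std::chrono::steady_clock;

    struct AllocCooldownSketch {
      uint64_t shared_alloc_size = 64 * 1024;  // bluefs_shared_alloc_size
      uint64_t min_alloc_size = 4 * 1024;      // shared device min alloc size
      std::chrono::seconds cooldown{600};      // hypothetical cooldown period
      mono_clock::time_point retry_after{};    // epoch == no cooldown active

      uint64_t pick_alloc_size() {
        // Skip the large-chunk attempt while the cooldown is in effect.
        if (mono_clock::now() < retry_after)
          return min_alloc_size;
        return shared_alloc_size;
      }
      void on_shared_alloc_failure() {
        // Arm the cooldown so we do not retry a doomed allocation size
        // on every subsequent request.
        retry_after = mono_clock::now() + cooldown;
      }
    };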
Igor Fedotov [Thu, 10 Nov 2022 22:06:15 +0000 (01:06 +0300)]
os/bluestore: support main/slow device's alloc unit for BlueFS.
This effectively enables 4K allocation units for BlueFS, but it is not turned
on by default for the sake of performance. On a main device that lacks enough
large contiguous free extents, though, enabling it might do the trick
(a rough sketch of the resulting fallback follows the conflicts list below).
Conflicts:
src/os/bluestore/BlueFS.cc
(trivial, no https://github.com/ceph/ceph/pull/39871/)
src/os/bluestore/BlueStore.cc
(trivial, no commits for zoned support)
src/test/objectstore/test_bluefs.cc
(trivial, no https://github.com/ceph/ceph/pull/45883)
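A rough sketch of the fallback this enables (the allocator interface here is
illustrative, not the real BlueFS code):

    #include <cstdint>
    #include <optional>

    // Hypothetical allocator interface, for illustration only.
    struct Allocator {
      std::optional<uint64_t> allocate(uint64_t want, uint64_t unit);
    };

    std::optional<uint64_t> bluefs_allocate(Allocator& a, uint64_t want,
                                            uint64_t shared_unit,  // e.g. 64K
                                            uint64_t dev_unit) {   // e.g. 4K
      // Prefer the large allocation unit, but a fragmented main device may
      // only be able to satisfy the request at its own smaller unit.
      if (auto r = a.allocate(want, shared_unit))
        return r;
      return a.allocate(want, dev_unit);
    }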
Igor Fedotov [Wed, 2 Nov 2022 16:39:14 +0000 (19:39 +0300)]
os/bluestore: prepend compacted BlueFS log with a starter part.
The rationale is to keep the initial log fnode after compaction small enough
to fit into the 4K superblock. Without that, compacted metadata might require
an fnode longer than 4K, which goes beyond the existing 4K superblock;
BlueFS currently asserts in this case.
Hence the resulting log allocation disposition is:
- superblock (4K) keeps the initial log fnode, which refers to:
  op_init, op_update_inc(log), op_jump(next seq)
- the updated log fnode, built from the superblock one plus the above
  op_update_inc, refers to:
  compacted meta (a bunch of op_update and others)
  ...
- more op_update_inc(log) entries follow if the log is extended
  ...
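A toy illustration of the constraint being addressed (sizes and encoding are
made up; only the shape of the check matters):

    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // The superblock payload, including the initial log fnode's extent
    // list, must encode into a single 4K block.
    constexpr std::size_t SUPERBLOCK_SIZE = 4096;

    std::size_t encoded_size(const std::vector<uint64_t>& log_extents) {
      return 512 /* fixed fields, illustrative */ + log_extents.size() * 16;
    }

    void check_fits(const std::vector<uint64_t>& log_extents) {
      // Without the small "starter" part, a heavily fragmented compacted
      // log could overflow this; BlueFS asserts in that case for now.
      assert(encoded_size(log_extents) <= SUPERBLOCK_SIZE);
    }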
Adam Kupczyk [Thu, 23 Dec 2021 14:12:52 +0000 (15:12 +0100)]
os/bluestore/bluefs: Cleanup on pending_release variable
Moved pending_release to struct dirty {}.
Restructured BlueFS::open_for_write to modify pending_release under dirty.lock.
Now all pending_release modifications are under dirty.lock.
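Schematically (types are illustrative; the real BlueFS structures differ):

    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <vector>

    struct BlueFSSketch {
      struct {
        std::mutex lock;
        // Previously a free-standing member; placing it in the dirty
        // domain makes "protected by dirty.lock" part of the layout.
        std::map<unsigned, std::vector<uint64_t>> pending_release;
      } dirty;

      void queue_release(unsigned bdev, uint64_t offset) {
        std::lock_guard l(dirty.lock);  // every modification under dirty.lock
        dirty.pending_release[bdev].push_back(offset);
      }
    };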
Adam Kupczyk [Thu, 23 Dec 2021 12:26:17 +0000 (13:26 +0100)]
os/bluestore/bluefs: Modify _update_logger_stats
Extract the updating of the number of files and the log size from
_update_logger_stats and put it exactly where the modification happens.
This avoids the problem of having to take nodes.lock and log.lock there.
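Schematically (counter names are illustrative):

    #include <atomic>
    #include <cstdint>

    // Update the counter at the modification site instead of recomputing
    // it in a stats function that would need nodes.lock and log.lock.
    std::atomic<uint64_t> logger_num_files{0};

    void on_file_created() { ++logger_num_files; }  // where the file map grows
    void on_file_deleted() { --logger_num_files; }  // where it shrinks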
Adam Kupczyk [Tue, 2 Nov 2021 12:17:31 +0000 (13:17 +0100)]
os/bluestore/bluefs: Fix missing lock in compact_log_async
During the phase that switches the log fnode from new_log to the actual log
it is necessary to hold the lock; that locking has been added.
Also modified the procedure for transporting extents from old_log to new_log:
new_log now receives the additional extents instead of them being removed
from old_log, which shortens the time we need to hold log.lock.
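The splice itself can then be brief, along these lines (illustrative, not the
actual code):

    #include <cstdint>
    #include <mutex>
    #include <vector>

    struct Extent { uint64_t bdev, offset, length; };

    std::mutex log_lock;                  // stands in for log.lock
    std::vector<Extent> log_extents;      // old log fnode's extents
    std::vector<Extent> new_log_extents;  // new log fnode's extents

    // Append the old log's extents to new_log rather than erasing them
    // from old_log: only this short append needs log.lock.
    void adopt_old_log_extents() {
      std::lock_guard l(log_lock);
      new_log_extents.insert(new_log_extents.end(),
                             log_extents.begin(), log_extents.end());
    }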
Adam Kupczyk [Wed, 9 Feb 2022 15:19:56 +0000 (16:19 +0100)]
os/bluestore/bluefs: Fix improper vselector tracking in _flush_special()
Moves vselector size tracking outside _flush_special(). Function
_compact_log_async...() updated the sizes twice.
The problem could not be solved by making the second size modification a
plain update, as that would possibly disrupt the vselector consistency check
(_vselector_check()). The consistency-tracking feature relies on the fact
that either log.lock or nodes.lock is held when the check is performed,
which is not true for _compact_log_async...().
Now _flush_special() does not update vselector sizes by itself but leaves
the update to the caller. A sketch of the resulting split follows below.
Fixes: https://tracker.ceph.com/issues/54248
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 4bc0f61d23299724fad2d8e6f2858734f1db6e5a)
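A sketch of the split (names illustrative; real signatures differ):

    #include <cstdint>

    struct VSelector { void add_usage(uint64_t delta) { /* ... */ } };

    // _flush_special now only reports how much the file grew...
    uint64_t flush_special(/* FileWriter& */) {
      uint64_t new_allocated = 0;  // filled by the extend-and-write path
      return new_allocated;        // no vselector update in here
    }

    // ...and the caller folds that into the volume selector exactly once,
    // under whichever lock it already holds.
    void flush_and_track(VSelector& vs) {
      vs.add_usage(flush_special());
    }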
Adam Kupczyk [Sat, 30 Oct 2021 18:05:35 +0000 (20:05 +0200)]
os/bluestore/bluefs: Fix incorrect file capture in compact_log_async
It was possible to skip the capture of a file that was recently modified.
The new procedure flushes pending files and captures state under a single
log.lock. It is much less interruptible than I had hoped for, but I cannot
do better for now.
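Roughly (illustrative; the real capture serializes dirs and fnodes):

    #include <mutex>
    #include <vector>

    struct File { bool dirty = false; };

    std::mutex log_lock;             // stands in for log.lock
    std::vector<File*> dirty_files;

    // Flush and capture under the same log.lock acquisition, so a file
    // modified between the two steps can no longer be missed.
    void capture_for_compaction() {
      std::lock_guard l(log_lock);
      for (auto* f : dirty_files)
        f->dirty = false;            // flush pending files
      // ... capture dir/file state for the compacted log here ...
    }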
Adam Kupczyk [Tue, 19 Oct 2021 12:38:32 +0000 (12:38 +0000)]
os/bluestore/bluefs: Fix false collision with lockdep module
The usual locking sequence is 1) FileWriter, 2) File (sketched below).
In _compact_log_async_LD_NF_D it was in the reversed order.
No real deadlock was possible, but lockdep complained.
Bonus: improved the lock dependency graph.
Fixes: https://tracker.ceph.com/issues/52939
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 7b7945d6117eb7502729c5dd0b5d383d8bc73f10)
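In sketch form (types illustrative): lockdep tracks lock *order*, so even a
provably safe path must follow the usual order to avoid a false positive.

    #include <mutex>

    struct File       { std::mutex lock; };
    struct FileWriter { std::mutex lock; File* file; };

    void locked_path(FileWriter& w) {
      std::lock_guard l1(w.lock);        // 1) FileWriter
      std::lock_guard l2(w.file->lock);  // 2) File -- the usual order
    }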
Adam Kupczyk [Tue, 10 Aug 2021 13:15:52 +0000 (15:15 +0200)]
os/bluestore/bluefs: Rearrange locks in preallocate
Rearranged locks in preallocate to avoid a possible deadlock with
compact_log_async_dump_metadata_NF.
The cycle was:
L->N rename/mkdir
N->F compact_log_async_dump_metadata_NF
F->L preallocate
Modified File.dirty_seq to capture dirty.seq_stable instead of 0. This is
used to distinguish files already serialized for the
compact_log_async_dump_metadata() function.
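For illustration, breaking the cycle means removing the F->L edge, e.g. by
not acquiring log.lock while a File lock is held (the exact ordering used in
the fix is in the code; this is only the shape of the idea):

    #include <mutex>

    std::mutex log_lock, nodes_lock, file_lock;  // L, N, F

    void preallocate_sketch() {
      std::lock_guard l(log_lock);   // take L first...
      std::lock_guard f(file_lock);  // ...then F, so no F->L edge remains
      // ... allocate and record the new extents ...
    }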
Adam Kupczyk [Tue, 22 Jun 2021 11:15:21 +0000 (13:15 +0200)]
os/bluestore/bluefs: Reorganize BlueFS state variables
Reorganize BlueFS state variables into separate domains: 1) log, 2) dirty,
3) nodes. Each has its own lock. This change is intended to make it easier
to control which locks need to be held when specific elements are modified.
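In outline (illustrative; the real domains carry many more members):

    #include <mutex>

    struct BlueFSStateSketch {
      // Each domain owns its lock, so "what does this lock protect?"
      // can be answered by looking at the struct itself.
      struct { std::mutex lock; /* log fnode, seq, writer ... */ } log;
      struct { std::mutex lock; /* dirty files, pending_release ... */ } dirty;
      struct { std::mutex lock; /* dir and file maps ... */ } nodes;
    };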
Adam Kupczyk [Sat, 5 Jun 2021 18:50:46 +0000 (20:50 +0200)]
os/bluestore/bluefs: Refactor _flush
This refactor prepares _flush for fine-grained locks in BlueFS.
Introduced _flush_special, a flush path dedicated to the BlueFS special
files (ino=0 and ino=1). Function _flush no longer accepts these special
files.
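The resulting split, in sketch form (illustrative):

    #include <cstdint>
    #include <stdexcept>

    // Regular files and the special files (ino 0 and 1, the latter being
    // the BlueFS log) now take distinct paths, so each path can use the
    // locking scheme appropriate to it.
    void flush(uint64_t ino) {
      if (ino <= 1)
        throw std::logic_error("special files go through _flush_special");
      // ... regular-file flush ...
    }

    void flush_special(uint64_t /* ino: 0 or 1 */) {
      // ... special-file flush; vselector updates are left to the caller ...
    }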
Adam Kupczyk [Sat, 5 Jun 2021 06:55:14 +0000 (08:55 +0200)]
os/bluestore/bluefs: Refactor flush_and_sync_log
This refactor prepares flush_and_sync_log and compact_log_async for
fine-grained locks in BlueFS.
No new logic is introduced, but the refactor is accompanied by some new
comments.
librbd: localize snap_remove op for mirror snapshots
A client's lock request may not be made quickly enough to obtain the
exclusive lock for operations when another competing client responds more
quickly. This can happen when a peer site has different performance
characteristics or latency. Instead of relying on this unpredictable
behavior, localize the operation to the primary cluster.
Fixes: https://tracker.ceph.com/issues/59393
Signed-off-by: Christopher Hoffman <choffman@redhat.com>
(cherry picked from commit ac552c9b4d65198db8038d397a3060d5a030917d)
Conflicts:
src/cls/rbd/cls_rbd.cc [ commit 3a93b40 ("librbd:
s/boost::variant/std::variant/") not in pacific ]
src/librbd/mirror/snapshot/UnlinkPeerRequest.cc [ ditto ]
Laura Flores [Mon, 5 Jun 2023 20:23:42 +0000 (15:23 -0500)]
qa/suites/rados: remove rook coverage from the rados suite
The rook team relies on a daily CI system to validate
rook changes. It doesn't seem that the teuthology tests
are maintained, so it makes sense to remove them from the
rados suite.
By removing this symlink, rook test coverage will remain
in the orch suite, and coverage will only be removed from the
rados suite.
Workaround for: https://tracker.ceph.com/issues/58585
Signed-off-by: Laura Flores <lflores@redhat.com>
(cherry picked from commit c26674ef4c6cbbdd94c54cafbd66e98704f044d7)
This commit https://github.com/ceph/ceph/commit/bdb2241ca5a9758e8c52d47320d8b5ea0766aea9
depends on logging changes made in quincy, but seems to have been
erroneously included in a pacific batch backport https://github.com/ceph/ceph/pull/42736.
This does not work in pacific. For example:
[ceph: root@vm-00 /]# ceph version
ceph version 16.2.13-257-gd8c5d349 (d8c5d34975dce1c5eb0aa3a7979a4d9b9a99d1ec) pacific (stable)
[ceph: root@vm-00 /]# ceph config set global log_to_journald false
Error EINVAL: unrecognized config option 'log_to_journald'
Ilya Dryomov [Sat, 27 May 2023 10:28:40 +0000 (12:28 +0200)]
osd/OSDCap: allow rbd.metadata_list method under rbd-read-only profile
This was missed in commit acc447d5de7b ("osd/OSDCap: rbd profile
permits use of rbd.metadata_list cls method"), which adjusted only the
"profile rbd" OSD cap. Listing image metadata is an essential part
of opening the image, and the "profile rbd-read-only" OSD cap must allow
it too.
While at it, constrain the existing grant for the rbd profile from "any
object in the pool" to just "the rbd_info object in the global namespace of
the pool", as this is where pool-level image metadata actually lives.
Zac Dover [Thu, 25 May 2023 09:01:49 +0000 (19:01 +1000)]
doc/rados: fix link in common.rst
Fix a link in doc/rados/configuration/common.rst that was missing its
final letter, causing a 404 error when readers attempted to follow it.
This bug was reported by Eugen Block, a stalwart friend of the Ceph
documentation project, who is credited here as a co-author. It was
reported at https://pad.ceph.com/p/Report_Documentation_Bugs.
Zac Dover [Mon, 22 May 2023 21:41:09 +0000 (07:41 +1000)]
doc/glossary: update bluestore entry
Update the BlueStore entry in the glossary, explaining that as of Reef,
BlueStore (and not FileStore) is the only storage backend for Ceph.
This topic has been discussed many times; recently at the Dev
Summit of Cephalocon 2023.
This commit is the minimal version of the work and is contained entirely
within `doc`. However, it will likely be expanded, as there were ideas
such as adding cache tiering back to the experimental feature list (Sam)
to warn users when deploying a new cluster.
doc: Add missing `ceph` command in documentation section `REPLACING AN OSD`
Signed-off-by: Alexander Proschek <alexander.proschek@protonmail.com>
(cherry picked from commit 0557d5e465556adba6d25db62a40ba55a5dd2400)