Adam Kupczyk [Mon, 10 Jun 2024 16:03:24 +0000 (16:03 +0000)]
os/bluestore: Add admin socket commands to inspect onode metadata
Add admin socket commands:
1) bluestore collections
Lists collections.
2) bluestore list <coll> [start object] [max count]
Lists objects in collection <coll>, optionally starting from [start object]. Default is 100 entries; 0 = unlimited.
3) bluestore onode metadata <object>
Prints onode metadata as seen by BlueStore.
It might happen (usually in tests) that two BlueStore instances are created at the same time.
Since admin socket command names must be unique, the second instance fails to register.
Use the first registration to detect whether we can register at all.
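For illustration, a minimal sketch of the probe-with-first-register idea (AdminSocketStub and the function names here are simplified stand-ins, not the actual Ceph AdminSocket API):

  #include <cerrno>
  #include <map>
  #include <string>

  // Simplified stand-in for the admin socket registry (illustrative only):
  // register_command() returns -EEXIST when the name is already taken.
  struct AdminSocketStub {
    std::map<std::string, int> commands;
    int register_command(const std::string& name) {
      return commands.emplace(name, 0).second ? 0 : -EEXIST;
    }
  };

  // Probe with the first registration: if another BlueStore instance already
  // owns these commands, bail out instead of failing on each one.
  int register_bluestore_commands(AdminSocketStub& asok) {
    int r = asok.register_command("bluestore collections");
    if (r < 0)
      return r;  // most likely -EEXIST: a second instance is running
    asok.register_command("bluestore list");
    asok.register_command("bluestore onode metadata");
    return 0;
  }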
Adam Kupczyk [Tue, 8 Apr 2025 08:36:21 +0000 (08:36 +0000)]
os/bluestore: Add do_write_v2_compressed()
Modify do_write_v2() to branch into do_write_v2_compressed().
Segmented and regular cases are recognized and handled properly.
New do_write_v2_compressed() oversees compression / recompression.
Make one Estimator per Collection.
This makes it possible for the estimator to learn collection-specific compressibility.
In write_v2_compressed, use the compressor already selected in choose_write_options.
Make Collection create Estimator on first use.
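A minimal sketch of the create-on-first-use pattern, with simplified stand-ins for the BlueStore types:

  #include <memory>
  #include <mutex>

  struct Estimator {
    // accumulates collection-specific compressibility statistics over time
  };

  struct Collection {
    std::mutex lock;
    std::unique_ptr<Estimator> estimator;

    // Created lazily on first use, so collections that never see compressed
    // writes pay nothing, and each collection learns from its own data only.
    Estimator* get_estimator() {
      std::scoped_lock l{lock};
      if (!estimator)
        estimator = std::make_unique<Estimator>();
      return estimator.get();
    }
  };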
Adam Kupczyk [Tue, 8 Apr 2025 11:03:22 +0000 (11:03 +0000)]
os/bluestore/compression: Main part of recompression feature
Add a recompression scanner that looks around the write region to estimate how much
would be gained if we read some more surrounding data and wrote a larger region.
Added Compression.h / Compression.cc.
Added debug_bluestore_compression dout.
Created Scanner class.
Provides write_lookaround() for scanning loaded extents.
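Illustrative shape of that call (the signature and Region type are assumptions, not the actual Compression.h interface):

  #include <cstdint>

  struct Scanner {
    struct Region {
      uint64_t offset;
      uint64_t length;
    };

    // Given a write to [offset, offset + length), look at neighboring loaded
    // extents and report how far the rewrite should expand, bounded by the
    // lookaround limits on each side.
    Region write_lookaround(uint64_t offset, uint64_t length,
                            uint64_t left_limit, uint64_t right_limit) {
      // The real implementation walks loaded extents on both sides and asks
      // the Estimator whether recompressing them is expected to pay off.
      // This placeholder just returns the unexpanded write region.
      (void)left_limit; (void)right_limit;
      return Region{offset, length};
    }
  };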
Adam Kupczyk [Wed, 9 Apr 2025 16:03:52 +0000 (16:03 +0000)]
os/bluestore/compression: Estimator class
Add CMake rules to compile.
Add bluestore_compression dout subsys.
Created Estimator class.
It is used by Scanner to decide whether a specific extent should be recompressed.
Prepare for a future machine learning / adaptive estimation algorithm.
So far the Estimator's logic is relatively simple:
it learns expected recompression values and uses them to predict outcomes in subsequent iterations.
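A minimal sketch of that learn-and-predict loop, assuming a simple exponentially weighted moving average of observed compression ratios (the real Estimator may use a different scheme):

  #include <cstdint>

  struct RatioEstimator {
    double expected_ratio = 1.0;          // compressed/raw, 1.0 = incompressible
    static constexpr double alpha = 0.25; // weight of the newest observation

    // Predict the compressed size of an extent before paying for compression.
    uint64_t predict(uint64_t raw_size) const {
      return static_cast<uint64_t>(raw_size * expected_ratio);
    }

    // Feed the actual outcome back so the next prediction improves.
    void observe(uint64_t raw_size, uint64_t compressed_size) {
      double r = double(compressed_size) / double(raw_size);
      expected_ratio = alpha * r + (1.0 - alpha) * expected_ratio;
    }
  };

The averaging keeps per-collection state tiny while still adapting when the workload's compressibility drifts.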
Adam Kupczyk [Tue, 19 Mar 2024 21:18:18 +0000 (21:18 +0000)]
os/bluestore/writer: Split do_write, add handling of compressed
Split do_write into do_write and do_write_with_blobs.
The original is used when only uncompressed data is written.
The new one accepts a stream of data formatted into blobs;
the blobs can be compressed or uncompressed.
Add blob_create_full_compressed.
Fix do_put_new_blobs to handle compressed blobs.
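A sketch of the resulting split (names follow the commit; the bodies are placeholders, not BlueStore's real write path):

  #include <cstdint>
  #include <vector>

  struct Blob {
    bool compressed = false;
    std::vector<uint8_t> data;
  };

  // Path for plain, uncompressed data only.
  void do_write(uint64_t /*offset*/, const std::vector<uint8_t>& /*data*/) {
    // ... allocate and write uncompressed extents ...
  }

  // Path for a stream of data already formatted into blobs; each blob may be
  // compressed or uncompressed.
  void do_write_with_blobs(uint64_t /*offset*/, std::vector<Blob> blobs) {
    for (auto& b : blobs) {
      if (b.compressed) {
        // e.g. the blob_create_full_compressed(...) path
      } else {
        // regular uncompressed placement
      }
    }
  }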
Adam Kupczyk [Thu, 13 Jun 2024 20:07:22 +0000 (20:07 +0000)]
os/bluestore: Moved selection of compressor to choose_write_options
This is borrowed from https://github.com/ceph/ceph/pull/57631;
selectively cherry-picked from commit:
os/bluestore: implement data reformatting on reads

Signed-off-by: Garry Drankovich <garry.drankovich@clyso.com>
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
Adam Kupczyk [Tue, 1 Apr 2025 14:01:23 +0000 (14:01 +0000)]
os/bluestore/bluefs: Fix race condition between truncate() and unlink()
It was possible for unlink() to interrupt an ongoing truncate().
As a result, unlink() finishes properly, but truncate() is not aware
of it and:
1) updates a file that has already been removed
2) releases the same allocations again
Now fixed by checking whether the file is deleted while holding the FILE lock.
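A minimal sketch of the fix, with File as a simplified stand-in for the BlueFS type:

  #include <mutex>

  struct File {
    std::mutex lock;      // the "FILE lock"
    bool deleted = false; // set by unlink()
  };

  void truncate(File& f /*, uint64_t new_size */) {
    std::scoped_lock l{f.lock};
    if (f.deleted) {
      // unlink() won the race: the file and its allocations are already gone,
      // so updating it or releasing the same extents again would corrupt state.
      return;
    }
    // ... safe to update the file and release trailing allocations here ...
  }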
Casey Bodley [Thu, 7 Nov 2024 20:36:45 +0000 (15:36 -0500)]
rgw/rados: add concurrent io algorithms for sharded data
cls/rgw provides the base class CLSRGWConcurrentIO as a Swiss Army knife
for bucket index operations that visit every shard object. While it uses
asynchronous librados requests to perform the IO, it blocks on a
condition variable when waiting for the AioCompletions.
For use in coroutines, we need a version of this that suspends instead
of blocking. And to support both stackful and stackless coroutines, we
want a fully generic async interface templated on CompletionToken.
While the CLSRGWConcurrentIO algorithm works for all current uses
(reads and writes, with/without retries, with/without cleanup), I chose
to break this into 3 algorithms with well-defined semantics:
1. Reads: to produce a successful result, all shard operations must
succeed, so any shard's failure causes the rest to be cancelled or
skipped. Supports retries for ListBucket (RGWBIAdvanceAndRetryError).
2. Writes: even if some shards fail, we still want to visit every shard
before returning the error. Supports retries for log trimming
operations (repeat until ENODATA).
3. Revertible writes: similar to reads, requires all shard operations to
succeed. On any failure, the side effects of any successful writes
must be reverted before returning. Only used by IndexInit (any created
shards are removed on failure).
Each algorithm provides a pure virtual base class that must be
implemented for each type of operation, similar to how existing
operations inherit from CLSRGWConcurrentIO.
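A hedged sketch of what such per-operation base classes could look like (names and signatures are illustrative, not the actual rgw/rados headers):

  #include <system_error>

  struct ShardedReadOp {
    virtual ~ShardedReadOp() = default;
    // Issue the async request for one shard.
    virtual void send(int shard) = 0;
    // Reads: any shard failure cancels/skips the rest, unless the error is
    // retryable (e.g. ListBucket's RGWBIAdvanceAndRetryError).
    virtual bool should_retry(std::error_code ec) const = 0;
  };

  struct ShardedWriteOp {
    virtual ~ShardedWriteOp() = default;
    // Writes: failures are recorded, but every shard is still visited before
    // the first error is returned (log trimming repeats until ENODATA).
    virtual void send(int shard) = 0;
  };

  struct RevertibleShardedWriteOp : ShardedWriteOp {
    // Revertible writes: called for each shard that succeeded once any other
    // shard fails (IndexInit removes the shard objects it created).
    virtual void revert(int shard) = 0;
  };

Splitting the swiss-army-knife into three interfaces lets each algorithm state its cancellation and cleanup semantics in the type, rather than in flags on a single base class.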
mgr/dashboard: Fix empty ceph version in GET api/hosts
Fixes https://tracker.ceph.com/issues/70821
Due to pagination, the host list is fetched from the orchestrator, which caused a regression: the orchestrator list always reports an empty ceph version.
Caused by https://github.com/ceph/ceph/pull/52154
Also fixed tests, as the new version addition was causing the whole JSON object mock to fail.
test/librbd/test_notify.py: drop RBD_DISABLE_UPDATE_FEATURES
This was put in place in commit 9c0b239d70cd ("qa/upgrade:
conditionally disable update_features tests") to paper over a backwards
compatibility issue that arose from commit 01ff1530544c ("librbd: make
all maintenance op notifications async"). It's not needed in squid or
later because upgrades from octopus are tested only until reef.
test/librbd/test_notify.py: force line-buffered output
"master" and "slave" invocations are intended to run in parallel and
coordinate between themselves. Ensure that their respective output is
properly timestamped and ordered in the teuthology.log file.
Crimson's suite is relatively limited and is currently run for
main only (not for prior releases).
Changes to Crimson are more delicate, and having more main runs
to compare against might help with (git-)bisecting issues.
Shilpa Jagannath [Mon, 16 Dec 2024 20:28:36 +0000 (15:28 -0500)]
rgw/trim: fix ENOENT return response from bucket sync status query.
Only handle ENOENT when the bucket metadata is deleted. There is a case
where we get ENOENT when the status objects have not been created yet,
for example when bucket metadata is created and synced but no data
exists yet, so the bucket sync status has not been initialized. These cases
don't need special handling.
Shilpa Jagannath [Wed, 26 Jun 2024 07:04:08 +0000 (03:04 -0400)]
rgw/multisite: In a multisite environment with bucket sync policies configured,
we may end up orphaning objects on remote zones when a bucket deletion
is issued on the metadata master. To avoid this, list the bucket on remote
zones and delete the bucket only when it is empty. If a zone is unreachable, we
drop that zone and continue with bucket deletion. Such zones might have
orphaned objects that will have to be cleaned up using the radosgw-admin tool.
rgw/multisite: handle the 'deleted' index log addition in RGWBucketInstanceMetadataHandler.
Create an async CR for removing bucket instance info in the bilog trimming logic.
cmake: Fix googletest deprecated warnings by using target_compile_options()
Previously, we attempted to disable deprecated declarations warnings when
building gtest by adding `-Wno-deprecated-declarations` to the COMPILE_OPTIONS
property of the googletest directory. However, this approach failed to apply
the option when actually building gtest.
This change applies the compile option directly to the `gtest` target using
target_compile_options() instead. Verified by forcing the condition to TRUE
and confirming the option is included when building `gtest-all.cc` through
`cmake --build ~/dev/ceph/build --target gtest --verbose`.