From: Peter Dillinger Date: Fri, 7 Nov 2025 17:04:52 +0000 (-0800) Subject: Auto-tune manifest file size (#14076) X-Git-Tag: v10.8.3~3 X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=41c6640085de4c75b13ec292c7ccd5776b288ce2;p=rocksdb.git Auto-tune manifest file size (#14076) Summary: Adds auto-tuning of manifest file size to avoid the need to scale `max_manifest_file_size` in proportion to things like number of SST files to properly balance (a) manifest file write amp and new file creation, vs. (b) manifest file space amp and replay time, including non-incremental space usage in backups. (Manifest file write amp comes from re-writing a "live" record when the manifest file is re-created, or "compacted"; space amp is usage beyond what would be used by a compacted manifest file.) In more detail, * Add new option `max_manifest_space_amp_pct` with default value of 500, which defaults to 0.2 write amp and up to roughly 5.0 space amp, except `max_manifest_file_size` is treated as the "minimum" size before re-creating ("compacting") the manifest file. * `max_manifest_file_size` in a way means the same thing, with the same default of 1GB, but in a way has taken on a new role. What is the same is that we do not re-create the manifest file before reaching this size (except for DB re-open), and so users are very unlikely to see a change in default behavior (auto-tuning only kicking in if auto-tuning would exceed 1GB for effective max size for the current manifest file). The new role is as a file size lower bound before auto-tuning kicks in, to minimize churn in files considered "negligibly small." We recommend a new setting of around 1MB or even smaller like 64KB, and expect something like this to become the default soon. * These two options along with `manifest_preallocation_size` are now mutable with SetDBOptions. The effect is nearly immediate, affecting the next write to the current manifest file. Also in this PR: * Refactoring of VersionSet to allow it to get (more) settings from MutableDBOptions. This touches a number of files in not very interesting ways, but notably we have to be careful about thread-safe access to MutableDBOptions fields, and even fields within VersionSet. I have decided to save copies of relevant fields from MutableDBOptions to simplify testing, etc. by not saving a reference to MutableDBOptions but getting notified of updates. * Updated some logging in VersionSet to provide some basic data about final and compacted manifest sizes (effects of auto-tuning), making sure to avoid I/O while holding DB mutex. * Added db_etc3_test.cc which is intended as a successor to db_test and db_test2, but having "test.cc" in its name for easier exclusion of test files when using `git grep`. Intended follow-up: rename db_test2 to db_etc2_test * Moved+updated `ManifestRollOver` test to the new file to be closer to other manifest file rollover testing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/14076 Test Plan: As for correctness, new unit test AutoTuneManifestSize is pretty thorough. Some other unit tests updated appropriately. Manual tests in the performance section were also audited for expected behavior based on the new logging in the DB LOG. Example LOG data with -max_manifest_file_size=2048 -max_manifest_space_amp_pct=500: ``` 2025/10/24-11:12:48.979472 2150678 [/version_set.cc:5927] Created manifest 5, compacted+appended from 52 to 116 2025/10/24-11:12:49.626441 2150682 [/version_set.cc:5927] Created manifest 24, compacted+appended from 2169 to 1801 2025/10/24-11:12:52.194592 2150682 [/version_set.cc:5927] Created manifest 91, compacted+appended from 10913 to 8707 2025/10/24-11:13:02.969944 2150682 [/version_set.cc:5927] Created manifest 362, compacted+appended from 52259 to 13321 2025/10/24-11:13:18.815120 2150681 [/version_set.cc:5927] Created manifest 765, compacted+appended from 80064 to 13304 2025/10/24-11:13:35.590905 2150681 [/version_set.cc:5927] Created manifest 1167, compacted+appended from 79863 to 13304 ``` As you can see, it only took a few iterations of ramp-up to settle on the auto-tuned max manifest size for tracking ~122 live SST files, around 80KB and compacting down to about 13KB. (13KB * (500 + 100) / 100 = 78KB). With the default large setting for max_manifest_file_size, we end up with a 232KB manifest, which is more than 90% wasted space. (A long-running DB would be much worse.) As for performance, we don't expect a difference, even with TransactionDB because actual writing of the manifest is done without holding the DB mutex. I was not able to see a performance regression using db_bench with FIFO compaction and >1000 ~10MB SST files, including settings of -max_manifest_file_size=2048 -max_manifest_space_amp_pct={500,10,0}. No "hiccups" visible with -histogram either. I also tried seeding a 1 second delay in writing new manifest files (other than the first). This had no significant effect at -max_manifest_space_amp_pct=500 but at 100 started causing write stalls in my test. In many ways this is kind of a worst case scenario and out-of-proportion test, but gives me more confidence that a higher number like 500 is probably the best balance in general. Reviewed By: xingbowang Differential Revision: D85445178 Pulled By: pdillinger fbshipit-source-id: 1e6e07e89c586762dd65c65bb7cb2b8b719513f9 --- diff --git a/BUCK b/BUCK index c4327a3f7..117c5e700 100644 --- a/BUCK +++ b/BUCK @@ -4844,6 +4844,12 @@ cpp_unittest_wrapper(name="db_encryption_test", extra_compiler_flags=[]) +cpp_unittest_wrapper(name="db_etc3_test", + srcs=["db/db_etc3_test.cc"], + deps=[":rocksdb_test_lib"], + extra_compiler_flags=[]) + + cpp_unittest_wrapper(name="db_flush_test", srcs=["db/db_flush_test.cc"], deps=[":rocksdb_test_lib"], diff --git a/CMakeLists.txt b/CMakeLists.txt index e0bbbc4c5..833c510a2 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -1359,6 +1359,7 @@ if(WITH_TESTS) db/db_clip_test.cc db/db_dynamic_level_test.cc db/db_encryption_test.cc + db/db_etc3_test.cc db/db_flush_test.cc db/db_inplace_update_test.cc db/db_io_failure_test.cc diff --git a/Makefile b/Makefile index a6a5dd3b4..29040dfb1 100644 --- a/Makefile +++ b/Makefile @@ -1496,6 +1496,9 @@ db_test: $(OBJ_DIR)/db/db_test.o $(TEST_LIBRARY) $(LIBRARY) db_test2: $(OBJ_DIR)/db/db_test2.o $(TEST_LIBRARY) $(LIBRARY) $(AM_LINK) +db_etc3_test: $(OBJ_DIR)/db/db_etc3_test.o $(TEST_LIBRARY) $(LIBRARY) + $(AM_LINK) + compression_test: $(OBJ_DIR)/util/compression_test.o $(TEST_LIBRARY) $(LIBRARY) $(AM_LINK) diff --git a/db/compaction/compaction_job_test.cc b/db/compaction/compaction_job_test.cc index e1e11e76f..2836ed20e 100644 --- a/db/compaction/compaction_job_test.cc +++ b/db/compaction/compaction_job_test.cc @@ -211,8 +211,8 @@ class CompactionJobTestBase : public testing::Test { table_cache_(NewLRUCache(50000, 16)), write_buffer_manager_(db_options_.db_write_buffer_size), versions_(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", @@ -546,13 +546,13 @@ class CompactionJobTestBase : public testing::Test { ASSERT_OK(s); db_options_.info_log = info_log; - versions_.reset( - new VersionSet(dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, - /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, - test::kUnitTestDbId, /*db_session_id=*/"", - /*daily_offpeak_time_utc=*/"", - /*error_handler=*/nullptr, /*unchanging=*/false)); + versions_.reset(new VersionSet( + dbname_, &db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, + /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, + test::kUnitTestDbId, /*db_session_id=*/"", + /*daily_offpeak_time_utc=*/"", + /*error_handler=*/nullptr, /*unchanging=*/false)); compaction_job_stats_.Reset(); VersionEdit new_db; diff --git a/db/db_basic_test.cc b/db/db_basic_test.cc index 9a4a5b983..b115e7069 100644 --- a/db/db_basic_test.cc +++ b/db/db_basic_test.cc @@ -675,30 +675,6 @@ TEST_F(DBBasicTest, Flush) { } while (ChangeCompactOptions()); } -TEST_F(DBBasicTest, ManifestRollOver) { - do { - Options options; - options.max_manifest_file_size = 10; // 10 bytes - options = CurrentOptions(options); - CreateAndReopenWithCF({"pikachu"}, options); - { - ASSERT_OK(Put(1, "manifest_key1", std::string(1000, '1'))); - ASSERT_OK(Put(1, "manifest_key2", std::string(1000, '2'))); - ASSERT_OK(Put(1, "manifest_key3", std::string(1000, '3'))); - uint64_t manifest_before_flush = dbfull()->TEST_Current_Manifest_FileNo(); - ASSERT_OK(Flush(1)); // This should trigger LogAndApply. - uint64_t manifest_after_flush = dbfull()->TEST_Current_Manifest_FileNo(); - ASSERT_GT(manifest_after_flush, manifest_before_flush); - ReopenWithColumnFamilies({"default", "pikachu"}, options); - ASSERT_GT(dbfull()->TEST_Current_Manifest_FileNo(), manifest_after_flush); - // check if a new manifest file got inserted or not. - ASSERT_EQ(std::string(1000, '1'), Get(1, "manifest_key1")); - ASSERT_EQ(std::string(1000, '2'), Get(1, "manifest_key2")); - ASSERT_EQ(std::string(1000, '3'), Get(1, "manifest_key3")); - } - } while (ChangeCompactOptions()); -} - TEST_F(DBBasicTest, IdentityAcrossRestarts) { constexpr size_t kMinIdSize = 10; do { diff --git a/db/db_etc3_test.cc b/db/db_etc3_test.cc new file mode 100644 index 000000000..e5152fcd5 --- /dev/null +++ b/db/db_etc3_test.cc @@ -0,0 +1,161 @@ +// Copyright (c) 2011-present, Facebook, Inc. All rights reserved. +// This source code is licensed under both the GPLv2 (found in the +// COPYING file in the root directory) and Apache 2.0 License +// (found in the LICENSE.Apache file in the root directory). + +#include "db/db_test_util.h" + +namespace ROCKSDB_NAMESPACE { + +class DBEtc3Test : public DBTestBase { + public: + DBEtc3Test() : DBTestBase("db_etc3_test", /*env_do_fsync=*/true) {} +}; + +TEST_F(DBEtc3Test, ManifestRollOver) { + do { + Options options; + // Force new manifest on each manifest write + options.max_manifest_file_size = 0; + options.max_manifest_space_amp_pct = 0; + options = CurrentOptions(options); + CreateAndReopenWithCF({"pikachu"}, options); + { + ASSERT_OK(Put(1, "key1", std::string(1000, '1'))); + ASSERT_OK(Put(1, "key2", std::string(1000, '2'))); + ASSERT_OK(Put(1, "key3", std::string(1000, '3'))); + uint64_t manifest_before_flush = dbfull()->TEST_Current_Manifest_FileNo(); + ASSERT_OK(Flush(1)); // This should trigger LogAndApply. + uint64_t manifest_after_flush = dbfull()->TEST_Current_Manifest_FileNo(); + ASSERT_GT(manifest_after_flush, manifest_before_flush); + // Re-open should always re-create manifest file + ReopenWithColumnFamilies({"default", "pikachu"}, options); + ASSERT_GT(dbfull()->TEST_Current_Manifest_FileNo(), manifest_after_flush); + ASSERT_EQ(std::string(1000, '1'), Get(1, "key1")); + ASSERT_EQ(std::string(1000, '2'), Get(1, "key2")); + ASSERT_EQ(std::string(1000, '3'), Get(1, "key3")); + } + } while (ChangeCompactOptions()); +} + +TEST_F(DBEtc3Test, AutoTuneManifestSize) { + // Ensure we have auto-tuning beyond max_manifest_file_size by default + ASSERT_EQ(DBOptions{}.max_manifest_space_amp_pct, 500); + + Options options = CurrentOptions(); + ASSERT_OK(db_->SetOptions({{"level0_file_num_compaction_trigger", "20"}})); + + // Use large column family names to essentially control the amount of payload + // data needed for the manifest file. Drop manifest entries don't include the + // CF name so are small. + uint64_t prev_manifest_num = 0, cur_manifest_num = 0; + std::deque handles; + int counter = 5; + auto AddCfFn = [&]() { + std::string name = "cf" + std::to_string(counter++); + name.resize(1000, 'a'); + ASSERT_OK(db_->CreateColumnFamily(options, name, &handles.emplace_back())); + prev_manifest_num = cur_manifest_num; + cur_manifest_num = dbfull()->TEST_Current_Manifest_FileNo(); + }; + auto DropCfFn = [&]() { + ASSERT_OK(db_->DropColumnFamily(handles.front())); + ASSERT_OK(db_->DestroyColumnFamilyHandle(handles.front())); + handles.pop_front(); + prev_manifest_num = cur_manifest_num; + cur_manifest_num = dbfull()->TEST_Current_Manifest_FileNo(); + }; + auto TrivialManifestWriteFn = [&]() { + ASSERT_OK(Put("x", std::to_string(counter++))); + ASSERT_OK(Flush()); + prev_manifest_num = cur_manifest_num; + cur_manifest_num = dbfull()->TEST_Current_Manifest_FileNo(); + }; + + options.max_manifest_file_size = 1000000; + options.max_manifest_space_amp_pct = 0; // no auto-tuning yet + DestroyAndReopen(options); + + // With the generous (minimum) maximum manifest size, should not be rotated + AddCfFn(); + AddCfFn(); + AddCfFn(); + ASSERT_EQ(prev_manifest_num, cur_manifest_num); + + // Change options for small max and (still) no auto-tuning + ASSERT_OK(db_->SetDBOptions({{"max_manifest_file_size", "3000"}})); + + // Takes effect on the next manifest write + TrivialManifestWriteFn(); + ASSERT_LT(prev_manifest_num, cur_manifest_num); + + // Now we have to rewrite the whole manifest on each write because the + // compacted size exceeds the "max" size. + AddCfFn(); + ASSERT_LT(prev_manifest_num, cur_manifest_num); + DropCfFn(); + ASSERT_LT(prev_manifest_num, cur_manifest_num); + AddCfFn(); + ASSERT_LT(prev_manifest_num, cur_manifest_num); + TrivialManifestWriteFn(); + ASSERT_LT(prev_manifest_num, cur_manifest_num); + + // Enabling auto-tuning should fix this, immediately for next manifest writes. + // This will allow up to double-ish the size of the compacted manifest, + // which last should have been 4000 + some bytes. + ASSERT_EQ(handles.size(), 4U); + ASSERT_OK(db_->SetDBOptions({{"max_manifest_space_amp_pct", "105"}})); + + // After 9 CF names should be enough to rotate the manifest + for (int i = 1; i <= 5; ++i) { + if ((i % 2) == 1) { + DropCfFn(); + } + AddCfFn(); + ASSERT_EQ(prev_manifest_num, cur_manifest_num); + } + TrivialManifestWriteFn(); + ASSERT_LT(prev_manifest_num, cur_manifest_num); + + // We now have a different last compacted manifest size, should be + // able to go beyond 9 CFs named in manifest this time. + ASSERT_EQ(handles.size(), 6U); + + DropCfFn(); + DropCfFn(); + for (int i = 1; i <= 4; ++i) { + DropCfFn(); + AddCfFn(); + ASSERT_EQ(prev_manifest_num, cur_manifest_num); + } + // We've written 10 named CFs to the manifest. We should be able to + // dynamically change the auto-tuning still based on the last "compacted" + // manifest size of 7000 + some bytes. + ASSERT_OK(db_->SetDBOptions({{"max_manifest_space_amp_pct", "51"}})); + TrivialManifestWriteFn(); + ASSERT_LT(prev_manifest_num, cur_manifest_num); + // And the "compacted" manifest size has reset again, so should be changed + // again sooner. + ASSERT_EQ(handles.size(), 4U); + for (int i = 1; i <= 2; ++i) { + AddCfFn(); + ASSERT_EQ(prev_manifest_num, cur_manifest_num); + } + // Enough for manifest change + AddCfFn(); + ASSERT_LT(prev_manifest_num, cur_manifest_num); + + // Wrap up + while (!handles.empty()) { + DropCfFn(); + } +} + +} // namespace ROCKSDB_NAMESPACE + +int main(int argc, char** argv) { + ROCKSDB_NAMESPACE::port::InstallStackTraceHandler(); + ::testing::InitGoogleTest(&argc, argv); + RegisterCustomObjects(argc, argv); + return RUN_ALL_TESTS(); +} diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc index 55bf299c3..1ece5b066 100644 --- a/db/db_impl/db_impl.cc +++ b/db/db_impl/db_impl.cc @@ -258,10 +258,10 @@ DBImpl::DBImpl(const DBOptions& options, const std::string& dbname, [this]() { this->TriggerPeriodicCompaction(); }); versions_.reset(new VersionSet( - dbname_, &immutable_db_options_, file_options_, table_cache_.get(), - write_buffer_manager_, &write_controller_, &block_cache_tracer_, - io_tracer_, db_id_, db_session_id_, options.daily_offpeak_time_utc, - &error_handler_, read_only)); + dbname_, &immutable_db_options_, mutable_db_options_, file_options_, + table_cache_.get(), write_buffer_manager_, &write_controller_, + &block_cache_tracer_, io_tracer_, db_id_, db_session_id_, + options.daily_offpeak_time_utc, &error_handler_, read_only)); column_family_memtables_.reset( new ColumnFamilyMemTablesImpl(versions_->GetColumnFamilySet())); @@ -1412,7 +1412,7 @@ Status DBImpl::SetDBOptions( file_options_for_compaction_ = FileOptions(new_db_options); file_options_for_compaction_ = fs_->OptimizeForCompactionTableWrite( file_options_for_compaction_, immutable_db_options_); - versions_->ChangeFileOptions(mutable_db_options_); + versions_->UpdatedMutableDbOptions(mutable_db_options_, &mutex_); // TODO(xiez): clarify why apply optimize for read to write options file_options_for_compaction_ = fs_->OptimizeForCompactionTableRead( file_options_for_compaction_, immutable_db_options_); diff --git a/db/db_impl/db_impl_follower.cc b/db/db_impl/db_impl_follower.cc index 1ff12cec0..1262c5bdf 100644 --- a/db/db_impl/db_impl_follower.cc +++ b/db/db_impl/db_impl_follower.cc @@ -293,9 +293,9 @@ Status DB::OpenAsFollower( DBImplFollower* impl = new DBImplFollower(tmp_opts, std::move(new_env), dbname, src_path); impl->versions_.reset(new ReactiveVersionSet( - dbname, &impl->immutable_db_options_, impl->file_options_, - impl->table_cache_.get(), impl->write_buffer_manager_, - &impl->write_controller_, impl->io_tracer_)); + dbname, &impl->immutable_db_options_, impl->mutable_db_options_, + impl->file_options_, impl->table_cache_.get(), + impl->write_buffer_manager_, &impl->write_controller_, impl->io_tracer_)); impl->column_family_memtables_.reset( new ColumnFamilyMemTablesImpl(impl->versions_->GetColumnFamilySet())); impl->wal_in_db_path_ = impl->immutable_db_options_.IsWalDirSameAsDBPath(); diff --git a/db/db_impl/db_impl_open.cc b/db/db_impl/db_impl_open.cc index a9871d6bb..cccc3ea2c 100644 --- a/db/db_impl/db_impl_open.cc +++ b/db/db_impl/db_impl_open.cc @@ -329,7 +329,7 @@ Status DBImpl::NewDB(std::vector* new_filenames) { } FileTypeSet tmp_set = immutable_db_options_.checksum_handoff_file_types; file->SetPreallocationBlockSize( - immutable_db_options_.manifest_preallocation_size); + mutable_db_options_.manifest_preallocation_size); std::unique_ptr file_writer(new WritableFileWriter( std::move(file), manifest, file_options, immutable_db_options_.clock, io_tracer_, nullptr /* stats */, diff --git a/db/db_impl/db_impl_secondary.cc b/db/db_impl/db_impl_secondary.cc index 05337a019..5e6de87c5 100644 --- a/db/db_impl/db_impl_secondary.cc +++ b/db/db_impl/db_impl_secondary.cc @@ -783,9 +783,9 @@ Status DB::OpenAsSecondary( handles->clear(); DBImplSecondary* impl = new DBImplSecondary(tmp_opts, dbname, secondary_path); impl->versions_.reset(new ReactiveVersionSet( - dbname, &impl->immutable_db_options_, impl->file_options_, - impl->table_cache_.get(), impl->write_buffer_manager_, - &impl->write_controller_, impl->io_tracer_)); + dbname, &impl->immutable_db_options_, impl->mutable_db_options_, + impl->file_options_, impl->table_cache_.get(), + impl->write_buffer_manager_, &impl->write_controller_, impl->io_tracer_)); impl->column_family_memtables_.reset( new ColumnFamilyMemTablesImpl(impl->versions_->GetColumnFamilySet())); impl->wal_in_db_path_ = impl->immutable_db_options_.IsWalDirSameAsDBPath(); diff --git a/db/db_options_test.cc b/db/db_options_test.cc index cfe0b8f96..36c4f211a 100644 --- a/db/db_options_test.cc +++ b/db/db_options_test.cc @@ -432,12 +432,47 @@ TEST_F(DBOptionsTest, SetWalBytesPerSync) { ASSERT_GT(low_bytes_per_sync, counter); } +TEST_F(DBOptionsTest, MutableManifestOptions) { + // These aren't end-to-end tests, but sufficient to ensure the VersionSet + // receives the updates with SetDBOptions + for (int64_t i : {0, 1, 100, 100000, 10000000}) { + ASSERT_OK( + db_->SetDBOptions({{"max_manifest_file_size", std::to_string(i)}})); + ASSERT_EQ(i, + static_cast(db_->GetDBOptions().max_manifest_file_size)); + ASSERT_EQ(i, + static_cast( + dbfull()->GetVersionSet()->TEST_GetMinMaxManifestFileSize())); + if (i > 1) { + ++i; + } + ASSERT_OK( + db_->SetDBOptions({{"max_manifest_space_amp_pct", std::to_string(i)}})); + ASSERT_EQ(i, static_cast( + db_->GetDBOptions().max_manifest_space_amp_pct)); + ASSERT_EQ(i, + static_cast( + dbfull()->GetVersionSet()->TEST_GetMaxManifestSpaceAmpPct())); + if (i > 1) { + ++i; + } + ASSERT_OK(db_->SetDBOptions( + {{"manifest_preallocation_size", std::to_string(i)}})); + ASSERT_EQ(i, static_cast( + db_->GetDBOptions().manifest_preallocation_size)); + ASSERT_EQ( + i, static_cast( + dbfull()->GetVersionSet()->TEST_GetManifestPreallocationSize())); + } +} + TEST_F(DBOptionsTest, WritableFileMaxBufferSize) { Options options; options.create_if_missing = true; options.writable_file_max_buffer_size = 1024 * 1024; options.level0_file_num_compaction_trigger = 3; options.max_manifest_file_size = 1; + options.max_manifest_space_amp_pct = 0; options.env = env_; int buffer_size = 1024 * 1024; Reopen(options); diff --git a/db/db_test2.cc b/db/db_test2.cc index 1a565c8e1..33da1ffaf 100644 --- a/db/db_test2.cc +++ b/db/db_test2.cc @@ -5205,6 +5205,7 @@ TEST_F(DBTest2, SwitchMemtableRaceWithNewManifest) { Options options = CurrentOptions(); DestroyAndReopen(options); options.max_manifest_file_size = 10; + options.max_manifest_space_amp_pct = 0; options.create_if_missing = true; CreateAndReopenWithCF({"pikachu"}, options); ASSERT_EQ(2, handles_.size()); @@ -5896,6 +5897,7 @@ TEST_P(RenameCurrentTest, Flush) { Destroy(last_options_); Options options = GetDefaultOptions(); options.max_manifest_file_size = 1; + options.max_manifest_space_amp_pct = 0; options.create_if_missing = true; Reopen(options); ASSERT_OK(Put("key", "value")); @@ -5915,6 +5917,7 @@ TEST_P(RenameCurrentTest, Compaction) { Destroy(last_options_); Options options = GetDefaultOptions(); options.max_manifest_file_size = 1; + options.max_manifest_space_amp_pct = 0; options.create_if_missing = true; Reopen(options); ASSERT_OK(Put("a", "a_value")); diff --git a/db/db_test_util.cc b/db/db_test_util.cc index 0cefcfd41..a0608b30b 100644 --- a/db/db_test_util.cc +++ b/db/db_test_util.cc @@ -454,7 +454,8 @@ Options DBTestBase::GetOptions( options.allow_mmap_reads = can_allow_mmap; break; case kManifestFileSize: - options.max_manifest_file_size = 50; // 50 bytes + options.max_manifest_file_size = 50; // 50 bytes + options.max_manifest_space_amp_pct = 0; // old behavior break; case kPerfOptions: options.delayed_write_rate = 8 * 1024 * 1024; diff --git a/db/db_wal_test.cc b/db/db_wal_test.cc index da9ef3158..75e13724a 100644 --- a/db/db_wal_test.cc +++ b/db/db_wal_test.cc @@ -1746,8 +1746,8 @@ class RecoveryTestHelper { WriteController write_controller; versions.reset(new VersionSet( - test->dbname_, &db_options, file_options, table_cache.get(), - &write_buffer_manager, &write_controller, + test->dbname_, &db_options, MutableDBOptions{options}, file_options, + table_cache.get(), &write_buffer_manager, &write_controller, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", options.daily_offpeak_time_utc, @@ -2277,6 +2277,7 @@ TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL) { Options options = CurrentOptions(); // Small size to force manifest creation options.max_manifest_file_size = 1; + options.max_manifest_space_amp_pct = 0; options.track_and_verify_wals_in_manifest = true; DestroyAndReopen(options); diff --git a/db/flush_job_test.cc b/db/flush_job_test.cc index b84bb3d8b..3d4cf1d8d 100644 --- a/db/flush_job_test.cc +++ b/db/flush_job_test.cc @@ -142,13 +142,13 @@ class FlushJobTestBase : public testing::Test { column_families.emplace_back(cf_name, cf_options_); } - versions_.reset( - new VersionSet(dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, - /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, - test::kUnitTestDbId, /*db_session_id=*/"", - /*daily_offpeak_time_utc=*/"", - /*error_handler=*/nullptr, /*read_only=*/false)); + versions_.reset(new VersionSet( + dbname_, &db_options_, MutableDBOptions{options_}, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, + /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, + test::kUnitTestDbId, /*db_session_id=*/"", + /*daily_offpeak_time_utc=*/"", + /*error_handler=*/nullptr, /*read_only=*/false)); EXPECT_OK(versions_->Recover(column_families, false)); } diff --git a/db/memtable_list_test.cc b/db/memtable_list_test.cc index 97e36e00f..7065b125b 100644 --- a/db/memtable_list_test.cc +++ b/db/memtable_list_test.cc @@ -112,7 +112,8 @@ class MemTableListTest : public testing::Test { WriteBufferManager write_buffer_manager(db_options.db_write_buffer_size); WriteController write_controller(10000000u); - VersionSet versions(dbname, &immutable_db_options, env_options, + VersionSet versions(dbname, &immutable_db_options, + MutableDBOptions{db_options}, env_options, table_cache.get(), &write_buffer_manager, &write_controller, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", @@ -163,7 +164,8 @@ class MemTableListTest : public testing::Test { WriteBufferManager write_buffer_manager(db_options.db_write_buffer_size); WriteController write_controller(10000000u); - VersionSet versions(dbname, &immutable_db_options, env_options, + VersionSet versions(dbname, &immutable_db_options, + MutableDBOptions{db_options}, env_options, table_cache.get(), &write_buffer_manager, &write_controller, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", diff --git a/db/repair.cc b/db/repair.cc index 6d184eba8..05672957f 100644 --- a/db/repair.cc +++ b/db/repair.cc @@ -120,8 +120,8 @@ class Repairer { /*io_tracer=*/nullptr, db_session_id_)), wb_(db_options_.db_write_buffer_size), wc_(db_options_.delayed_write_rate), - vset_(dbname_, &immutable_db_options_, file_options_, - raw_table_cache_.get(), &wb_, &wc_, + vset_(dbname_, &immutable_db_options_, MutableDBOptions{db_options_}, + file_options_, raw_table_cache_.get(), &wb_, &wc_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", db_session_id_, db_options.daily_offpeak_time_utc, /*error_handler=*/nullptr, /*read_only=*/false), diff --git a/db/version_set.cc b/db/version_set.cc index 90c8e1a8b..ff3d89f58 100644 --- a/db/version_set.cc +++ b/db/version_set.cc @@ -5307,6 +5307,7 @@ void AtomicGroupReadBuffer::Clear() { VersionSet::VersionSet( const std::string& dbname, const ImmutableDBOptions* _db_options, + const MutableDBOptions& mutable_db_options, const FileOptions& storage_options, Cache* table_cache, WriteBufferManager* write_buffer_manager, WriteController* write_controller, BlockCacheTracer* const block_cache_tracer, @@ -5335,6 +5336,7 @@ VersionSet::VersionSet( prev_log_number_(0), current_version_number_(0), manifest_file_size_(0), + last_compacted_manifest_file_size_(0), file_options_(storage_options), block_cache_tracer_(block_cache_tracer), io_tracer_(io_tracer), @@ -5342,7 +5344,9 @@ VersionSet::VersionSet( offpeak_time_option_(OffpeakTimeOption(daily_offpeak_time_utc)), error_handler_(error_handler), unchanging_(unchanging), - closed_(false) {} + closed_(false) { + UpdatedMutableDbOptions(mutable_db_options, /*mu=*/nullptr); +} Status VersionSet::Close(FSDirectory* db_dir, InstrumentedMutex* mu) { Status s; @@ -5438,11 +5442,35 @@ void VersionSet::Reset() { current_version_number_ = 0; manifest_writers_.clear(); manifest_file_size_ = 0; + last_compacted_manifest_file_size_ = 0; + TuneMaxManifestFileSize(); obsolete_files_.clear(); obsolete_manifests_.clear(); wals_.Reset(); } +void VersionSet::UpdatedMutableDbOptions( + const MutableDBOptions& updated_options, InstrumentedMutex* mu) { + // Must be holding mutex if not called during initialization + if (manifest_file_size_ > 0) { + mu->AssertHeld(); + } + file_options_.writable_file_max_buffer_size = + updated_options.writable_file_max_buffer_size; + min_max_manifest_file_size_ = updated_options.max_manifest_file_size; + max_manifest_space_amp_pct_ = static_cast( + std::max(updated_options.max_manifest_space_amp_pct, 0)); + manifest_preallocation_size_ = updated_options.manifest_preallocation_size; + TuneMaxManifestFileSize(); +} + +void VersionSet::TuneMaxManifestFileSize() { + tuned_max_manifest_file_size_ = + std::max(min_max_manifest_file_size_, + last_compacted_manifest_file_size_ * + (100U + max_manifest_space_amp_pct_) / 100U); +} + void VersionSet::AppendVersion(ColumnFamilyData* column_family_data, Version* v) { // compute new compaction score @@ -5686,10 +5714,11 @@ Status VersionSet::ProcessManifestWrites( } #endif // NDEBUG + uint64_t prev_manifest_file_size = manifest_file_size_; assert(pending_manifest_file_number_ == 0); if (!skip_manifest_write && (!descriptor_log_ || - manifest_file_size_ > db_options_->max_manifest_file_size)) { + prev_manifest_file_size >= tuned_max_manifest_file_size_)) { TEST_SYNC_POINT("VersionSet::ProcessManifestWrites:BeforeNewManifest"); new_descriptor_log = true; } else { @@ -5729,6 +5758,8 @@ Status VersionSet::ProcessManifestWrites( IOStatus manifest_io_status; manifest_io_status.PermitUncheckedError(); std::unique_ptr new_desc_log_ptr; + // Save before releasing mu + uint64_t manifest_preallocation_size = manifest_preallocation_size_; if (skip_manifest_write) { if (s.ok()) { constexpr bool update_stats = true; @@ -5772,16 +5803,13 @@ Status VersionSet::ProcessManifestWrites( // This is fine because everything inside of this block is serialized -- // only one thread can be here at the same time // create new manifest file - ROCKS_LOG_INFO(db_options_->info_log, "Creating manifest %" PRIu64 "\n", - pending_manifest_file_number_); std::string descriptor_fname = DescriptorFileName(dbname_, pending_manifest_file_number_); std::unique_ptr descriptor_file; io_s = NewWritableFile(fs_.get(), descriptor_fname, &descriptor_file, opt_file_opts); if (io_s.ok()) { - descriptor_file->SetPreallocationBlockSize( - db_options_->manifest_preallocation_size); + descriptor_file->SetPreallocationBlockSize(manifest_preallocation_size); FileTypeSet tmp_set = db_options_->checksum_handoff_file_types; std::unique_ptr file_writer(new WritableFileWriter( std::move(descriptor_file), descriptor_fname, opt_file_opts, clock_, @@ -5882,6 +5910,13 @@ Status VersionSet::ProcessManifestWrites( if (s.ok()) { // find offset in manifest file where this version is stored. new_manifest_file_size = raw_desc_log_ptr->file()->GetFileSize(); + if (new_descriptor_log) { + ROCKS_LOG_INFO(db_options_->info_log, + "Created manifest %" PRIu64 + ", compacted+appended from %" PRIu64 " to %" PRIu64 "\n", + pending_manifest_file_number_, prev_manifest_file_size, + new_manifest_file_size); + } } if (first_writer.edit_list.front()->IsColumnFamilyDrop()) { @@ -5930,6 +5965,8 @@ Status VersionSet::ProcessManifestWrites( descriptor_log_ = std::move(new_desc_log_ptr); obsolete_manifests_.emplace_back( DescriptorFileName("", manifest_file_number_)); + last_compacted_manifest_file_size_ = new_manifest_file_size; + TuneMaxManifestFileSize(); } // Install the new versions @@ -6563,14 +6600,16 @@ Status VersionSet::ReduceNumberOfLevels(const std::string& dbname, const ReadOptions read_options; const WriteOptions write_options; - ImmutableDBOptions db_options(*options); + ImmutableDBOptions imm_db_options(*options); + MutableDBOptions mutable_db_options(*options); ColumnFamilyOptions cf_options(*options); std::shared_ptr tc(NewLRUCache(options->max_open_files - 10, options->table_cache_numshardbits)); WriteController wc(options->delayed_write_rate); WriteBufferManager wb(options->db_write_buffer_size); - VersionSet versions(dbname, &db_options, file_options, tc.get(), &wb, &wc, - nullptr /*BlockCacheTracer*/, nullptr /*IOTracer*/, + VersionSet versions(dbname, &imm_db_options, mutable_db_options, file_options, + tc.get(), &wb, &wc, nullptr /*BlockCacheTracer*/, + nullptr /*IOTracer*/, /*db_id*/ "", /*db_session_id*/ "", options->daily_offpeak_time_utc, /*error_handler_*/ nullptr, /*unchanging=*/false); @@ -7622,12 +7661,13 @@ Status VersionSet::VerifyFileMetadata(const ReadOptions& read_options, } ReactiveVersionSet::ReactiveVersionSet( - const std::string& dbname, const ImmutableDBOptions* _db_options, + const std::string& dbname, const ImmutableDBOptions* imm_db_options, + const MutableDBOptions& mutable_db_options, const FileOptions& _file_options, Cache* table_cache, WriteBufferManager* write_buffer_manager, WriteController* write_controller, const std::shared_ptr& io_tracer) - : VersionSet(dbname, _db_options, _file_options, table_cache, - write_buffer_manager, write_controller, + : VersionSet(dbname, imm_db_options, mutable_db_options, _file_options, + table_cache, write_buffer_manager, write_controller, /*block_cache_tracer=*/nullptr, io_tracer, /*db_id*/ "", /*db_session_id*/ "", /*daily_offpeak_time_utc*/ "", /*error_handler=*/nullptr, /*unchanging=*/false) {} diff --git a/db/version_set.h b/db/version_set.h index b20ab972f..175cebd65 100644 --- a/db/version_set.h +++ b/db/version_set.h @@ -1186,7 +1186,9 @@ class AtomicGroupReadBuffer { // but false for secondary instance or writable DB). class VersionSet { public: - VersionSet(const std::string& dbname, const ImmutableDBOptions* db_options, + VersionSet(const std::string& dbname, + const ImmutableDBOptions* imm_db_options, + const MutableDBOptions& mutable_db_options, const FileOptions& file_options, Cache* table_cache, WriteBufferManager* write_buffer_manager, WriteController* write_controller, @@ -1203,6 +1205,13 @@ class VersionSet { virtual Status Close(FSDirectory* db_dir, InstrumentedMutex* mu); + // Requires: already holding DB mutex `mu`, to ensure + // * Safely read values from `updated_options` + // * Safely update fields on `this` (must be read elsewhere while holding mu) + // except `mu` can be nullptr during initialization + void UpdatedMutableDbOptions(const MutableDBOptions& updated_options, + InstrumentedMutex* mu); + Status LogAndApplyToDefaultColumnFamily( const ReadOptions& read_options, const WriteOptions& write_options, VersionEdit* edit, InstrumentedMutex* mu, @@ -1548,10 +1557,6 @@ class VersionSet { } const FileOptions& file_options() { return file_options_; } - void ChangeFileOptions(const MutableDBOptions& new_options) { - file_options_.writable_file_max_buffer_size = - new_options.writable_file_max_buffer_size; - } // TODO - Consider updating together when file options change in SetDBOptions const OffpeakTimeOption& offpeak_time_option() { @@ -1590,6 +1595,16 @@ class VersionSet { bool& TEST_unchanging() { return const_cast(unchanging_); } + uint64_t TEST_GetMinMaxManifestFileSize() { + return min_max_manifest_file_size_; + } + unsigned TEST_GetMaxManifestSpaceAmpPct() { + return max_manifest_space_amp_pct_; + } + size_t TEST_GetManifestPreallocationSize() { + return manifest_preallocation_size_; + } + protected: struct ManifestWriter; @@ -1610,6 +1625,7 @@ class VersionSet { } }; + // Revert back to a post-construction state (keep same options/settings) void Reset(); // Returns approximated offset of a key in a file for a given version. @@ -1648,6 +1664,11 @@ class VersionSet { ColumnFamilyData* cfd, const std::string& fpath, int level, const FileMetaData& meta); + // Auto-tune next max size for the current manifest file based on its initial + // "compacted" size and other parameters saved in this VersionSet. Must be + // holding DB mutex if outside of DB startup. + void TuneMaxManifestFileSize(); + // Protected by DB mutex. WalSet wals_; @@ -1699,6 +1720,20 @@ class VersionSet { // Current size of manifest file uint64_t manifest_file_size_; + // Size of the populated manifest file last time it was re-written from + // scratch. + uint64_t last_compacted_manifest_file_size_; + + // Auto-tuned max allowed size for the current manifest file + uint64_t tuned_max_manifest_file_size_; + + // Saved copy of max_manifest_file_size in (Mutable)DBOptions + uint64_t min_max_manifest_file_size_; + // Saved, sanitized copy from (Mutable)DBOptions + unsigned max_manifest_space_amp_pct_; + // Saved copy from (Mutable)DBOptions + size_t manifest_preallocation_size_; + // Obsolete files, or during DB shutdown any files not referenced by what's // left of the in-memory LSM state. std::vector obsolete_files_; @@ -1751,6 +1786,7 @@ class ReactiveVersionSet : public VersionSet { public: ReactiveVersionSet(const std::string& dbname, const ImmutableDBOptions* _db_options, + const MutableDBOptions& mutable_db_options, const FileOptions& _file_options, Cache* table_cache, WriteBufferManager* write_buffer_manager, WriteController* write_controller, diff --git a/db/version_set_test.cc b/db/version_set_test.cc index 65cee38de..fefde1170 100644 --- a/db/version_set_test.cc +++ b/db/version_set_test.cc @@ -1159,12 +1159,12 @@ class VersionSetTestBase { : env_(nullptr), dbname_(test::PerThreadDBPath(name)), options_(), - db_options_(options_), + imm_db_options_(options_), cf_options_(options_), - immutable_options_(db_options_, cf_options_), + immutable_options_(imm_db_options_, cf_options_), mutable_cf_options_(cf_options_), table_cache_(NewLRUCache(50000, 16)), - write_buffer_manager_(db_options_.db_write_buffer_size), + write_buffer_manager_(imm_db_options_.db_write_buffer_size), shutting_down_(false), table_factory_(std::make_shared()) { EXPECT_OK(test::CreateEnvFromSystem(ConfigOptions(), &env_, &env_guard_)); @@ -1178,8 +1178,8 @@ class VersionSetTestBase { EXPECT_OK(fs_->CreateDirIfMissing(dbname_, IOOptions(), nullptr)); options_.env = env_; - db_options_.env = env_; - db_options_.fs = fs_; + imm_db_options_.env = env_; + imm_db_options_.fs = fs_; immutable_options_.env = env_; immutable_options_.fs = fs_; immutable_options_.clock = env_->GetSystemClock().get(); @@ -1188,16 +1188,17 @@ class VersionSetTestBase { mutable_cf_options_.table_factory = table_factory_; versions_.reset(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*read_only=*/false)); reactive_versions_ = std::make_shared( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, nullptr); - db_options_.db_paths.emplace_back(dbname_, - std::numeric_limits::max()); + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, + nullptr); + imm_db_options_.db_paths.emplace_back(dbname_, + std::numeric_limits::max()); } virtual ~VersionSetTestBase() { @@ -1220,7 +1221,7 @@ class VersionSetTestBase { ASSERT_OK( SetIdentityFile(WriteOptions(), env_, dbname_, Temperature::kUnknown)); VersionEdit new_db; - if (db_options_.write_dbid_to_manifest) { + if (imm_db_options_.write_dbid_to_manifest) { DBOptions tmp_db_options; tmp_db_options.env = env_; std::unique_ptr impl(new DBImpl(tmp_db_options, dbname_)); @@ -1381,8 +1382,8 @@ class VersionSetTestBase { void ReopenDB() { versions_.reset(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*read_only=*/false)); @@ -1471,7 +1472,8 @@ class VersionSetTestBase { const std::string dbname_; EnvOptions env_options_; Options options_; - ImmutableDBOptions db_options_; + ImmutableDBOptions imm_db_options_; + MutableDBOptions mutable_db_options_; ColumnFamilyOptions cf_options_; ImmutableOptions immutable_options_; MutableCFOptions mutable_cf_options_; @@ -1902,8 +1904,8 @@ TEST_F(VersionSetTest, WalAddition) { // Recover a new VersionSet. { std::unique_ptr new_versions(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*unchanging=*/false)); @@ -1970,8 +1972,8 @@ TEST_F(VersionSetTest, WalCloseWithoutSync) { // Recover a new VersionSet. { std::unique_ptr new_versions(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*unchanging=*/false)); @@ -2024,8 +2026,8 @@ TEST_F(VersionSetTest, WalDeletion) { // Recover a new VersionSet, only the non-closed WAL should show up. { std::unique_ptr new_versions(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*unchanging=*/false)); @@ -2063,8 +2065,8 @@ TEST_F(VersionSetTest, WalDeletion) { // Recover from the new MANIFEST, only the non-closed WAL should show up. { std::unique_ptr new_versions(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*unchanging=*/false)); @@ -2184,8 +2186,8 @@ TEST_F(VersionSetTest, DeleteWalsBeforeNonExistingWalNumber) { // Recover a new VersionSet, WAL0 is deleted, WAL1 is not. { std::unique_ptr new_versions(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*unchanging=*/false)); @@ -2221,8 +2223,8 @@ TEST_F(VersionSetTest, DeleteAllWals) { // Recover a new VersionSet, all WALs are deleted. { std::unique_ptr new_versions(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*unchanging=*/false)); @@ -2264,8 +2266,8 @@ TEST_F(VersionSetTest, AtomicGroupWithWalEdits) { // kept. { std::unique_ptr new_versions(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*unchanging=*/false)); @@ -2444,8 +2446,8 @@ class VersionSetWithTimestampTest : public VersionSetTest { void VerifyFullHistoryTsLow(uint64_t expected_ts_low) { std::unique_ptr vset(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &imm_db_options_, mutable_db_options_, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*unchanging=*/false)); @@ -3500,7 +3502,7 @@ class VersionSetTestEmptyDb std::unique_ptr* log_writer) override { assert(nullptr != log_writer); VersionEdit new_db; - if (db_options_.write_dbid_to_manifest) { + if (imm_db_options_.write_dbid_to_manifest) { ASSERT_OK(SetIdentityFile(WriteOptions(), env_, dbname_, Temperature::kUnknown)); DBOptions tmp_db_options; @@ -3532,7 +3534,7 @@ class VersionSetTestEmptyDb const std::string VersionSetTestEmptyDb::kUnknownColumnFamilyName = "unknown"; TEST_P(VersionSetTestEmptyDb, OpenFromIncompleteManifest0) { - db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); + imm_db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); PrepareManifest(nullptr, nullptr, &log_writer_); log_writer_.reset(); CreateCurrentFile(); @@ -3564,7 +3566,7 @@ TEST_P(VersionSetTestEmptyDb, OpenFromIncompleteManifest0) { } TEST_P(VersionSetTestEmptyDb, OpenFromIncompleteManifest1) { - db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); + imm_db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); PrepareManifest(nullptr, nullptr, &log_writer_); // Only a subset of column families in the MANIFEST. VersionEdit new_cf1; @@ -3605,7 +3607,7 @@ TEST_P(VersionSetTestEmptyDb, OpenFromIncompleteManifest1) { } TEST_P(VersionSetTestEmptyDb, OpenFromInCompleteManifest2) { - db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); + imm_db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); PrepareManifest(nullptr, nullptr, &log_writer_); // Write all column families but no log_number, next_file_number and // last_sequence. @@ -3651,7 +3653,7 @@ TEST_P(VersionSetTestEmptyDb, OpenFromInCompleteManifest2) { } TEST_P(VersionSetTestEmptyDb, OpenManifestWithUnknownCF) { - db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); + imm_db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); PrepareManifest(nullptr, nullptr, &log_writer_); // Write all column families but no log_number, next_file_number and // last_sequence. @@ -3708,7 +3710,7 @@ TEST_P(VersionSetTestEmptyDb, OpenManifestWithUnknownCF) { } TEST_P(VersionSetTestEmptyDb, OpenCompleteManifest) { - db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); + imm_db_options_.write_dbid_to_manifest = std::get<0>(GetParam()); PrepareManifest(nullptr, nullptr, &log_writer_); // Write all column families but no log_number, next_file_number and // last_sequence. @@ -3828,7 +3830,7 @@ class VersionSetTestMissingFiles : public VersionSetTestBase, ASSERT_OK(s); log_writer->reset(new log::Writer(std::move(file_writer), 0, false)); VersionEdit new_db; - if (db_options_.write_dbid_to_manifest) { + if (imm_db_options_.write_dbid_to_manifest) { DBOptions tmp_db_options; tmp_db_options.env = env_; std::unique_ptr impl(new DBImpl(tmp_db_options, dbname_)); @@ -4088,7 +4090,7 @@ TEST_F(VersionSetTestMissingFiles, NoFileMissing) { } TEST_F(VersionSetTestMissingFiles, MinLogNumberToKeep2PC) { - db_options_.allow_2pc = true; + imm_db_options_.allow_2pc = true; NewDB(); SstInfo sst(100, kDefaultColumnFamilyName, "a", 0 /* level */, diff --git a/db/version_util.h b/db/version_util.h index 2690a00f4..7219d11c8 100644 --- a/db/version_util.h +++ b/db/version_util.h @@ -1,4 +1,4 @@ -// Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved. +// Copyright (c) Meta Platforms, Inc. and affiliates. // This source code is licensed under both the GPLv2 (found in the // COPYING file in the root directory) and Apache 2.0 License // (found in the LICENSE.Apache file in the root directory). @@ -23,7 +23,8 @@ class OfflineManifestWriter { immutable_db_options_(WithDbPath(options, db_path)), tc_(NewLRUCache(1 << 20 /* capacity */, options.table_cache_numshardbits)), - versions_(db_path, &immutable_db_options_, sopt_, tc_.get(), &wb_, &wc_, + versions_(db_path, &immutable_db_options_, MutableDBOptions{options}, + sopt_, tc_.get(), &wb_, &wc_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", options.daily_offpeak_time_utc, diff --git a/db/wal_manager_test.cc b/db/wal_manager_test.cc index 5b5ba7c0a..a114df0db 100644 --- a/db/wal_manager_test.cc +++ b/db/wal_manager_test.cc @@ -50,8 +50,8 @@ class WalManagerTest : public testing::Test { db_options_.clock = env_->GetSystemClock().get(); versions_.reset(new VersionSet( - dbname_, &db_options_, env_options_, table_cache_.get(), - &write_buffer_manager_, &write_controller_, + dbname_, &db_options_, MutableDBOptions{}, env_options_, + table_cache_.get(), &write_buffer_manager_, &write_controller_, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", /*daily_offpeak_time_utc=*/"", /*error_handler=*/nullptr, /*read_only=*/false)); diff --git a/db_stress_tool/db_stress_common.h b/db_stress_tool/db_stress_common.h index 9c3b6563f..619c24e75 100644 --- a/db_stress_tool/db_stress_common.h +++ b/db_stress_tool/db_stress_common.h @@ -248,6 +248,7 @@ DECLARE_string(fs_uri); DECLARE_uint64(ops_per_thread); DECLARE_uint64(log2_keys_per_lock); DECLARE_uint64(max_manifest_file_size); +DECLARE_int32(max_manifest_space_amp_pct); DECLARE_bool(in_place_update); DECLARE_string(memtablerep); DECLARE_int32(prefix_size); diff --git a/db_stress_tool/db_stress_gflags.cc b/db_stress_tool/db_stress_gflags.cc index 47b5f715f..e9f7e172b 100644 --- a/db_stress_tool/db_stress_gflags.cc +++ b/db_stress_tool/db_stress_gflags.cc @@ -978,7 +978,11 @@ DEFINE_uint64(log2_keys_per_lock, 2, "Log2 of number of keys per lock"); static const bool FLAGS_log2_keys_per_lock_dummy __attribute__((__unused__)) = RegisterFlagValidator(&FLAGS_log2_keys_per_lock, &ValidateUint32Range); -DEFINE_uint64(max_manifest_file_size, 16384, "Maximum size of a MANIFEST file"); +DEFINE_uint64(max_manifest_file_size, 16384, + "Maximum size of a MANIFEST file (without auto-tuning)"); + +DEFINE_int32(max_manifest_space_amp_pct, 500, + "Max manifest space amp percentage for auto-tuning"); DEFINE_bool(in_place_update, false, "On true, does inplace update in memtable"); diff --git a/db_stress_tool/db_stress_test_base.cc b/db_stress_tool/db_stress_test_base.cc index b37a0307f..f95b80376 100644 --- a/db_stress_tool/db_stress_test_base.cc +++ b/db_stress_tool/db_stress_test_base.cc @@ -4420,6 +4420,7 @@ void InitializeOptionsFromFlags( options.compression_opts.checksum = true; } options.max_manifest_file_size = FLAGS_max_manifest_file_size; + options.max_manifest_space_amp_pct = FLAGS_max_manifest_space_amp_pct; options.max_subcompactions = static_cast(FLAGS_subcompactions); options.allow_concurrent_memtable_write = FLAGS_allow_concurrent_memtable_write; diff --git a/include/rocksdb/options.h b/include/rocksdb/options.h index 5fbc3324d..df9a179f0 100644 --- a/include/rocksdb/options.h +++ b/include/rocksdb/options.h @@ -958,12 +958,67 @@ struct DBOptions { // Default: 0 size_t recycle_log_file_num = 0; - // manifest file is rolled over on reaching this limit. - // The older manifest file be deleted. - // The default value is 1GB so that the manifest file can grow, but not - // reach the limit of storage capacity. + // The manifest file is rolled over on reaching this limit AND the + // space amp limit described in max_manifest_space_amp_pct. More trade-off + // details there. + // + // NOTE: this option used to be a hard limit, but that made this a dangerous + // tuning parameter for optimizing manifest file size because the best + // size really depends on the DB size and average SST file size (and other + // settings). Now it is essentially a minimum for the auto-tuned max manifest + // file size. + // + // Until the max_manifest_space_amp_pct feature is fully validated to show a + // smaller default here like 1MB is appropriate, the default value is 1GB to + // match historical behavior (without it being a hard limit in case of giant + // compacted manifest size). + // + // This option is mutable with SetDBOptions(), taking effect on the next + // manifest write (e.g. completed DB compaction or flush). uint64_t max_manifest_file_size = 1024 * 1024 * 1024; + // This option mostly replaces max_manifest_file_size to control an auto-tuned + // balance of manifest write amplification and space amplification. A new + // manifest file is created with the "compacted" contents of the old one when + // current_manifest_size + // > + // max(max_manifest_file_size, + // est_compacted_manifest_size * (1 + max_manifest_space_amp_pct/100)) + // + // where est_compacted_manifest_size is an estimate of how big a new compacted + // version of the current manifest would be. Currently, the estimate used is + // the last newly-written manifest, in its "compacted" form. + // + // Space amplification in the manifest file might be less of a concern for + // primary storage space and more of a concern for DB recover time and size of + // backup files that aren't incremental between backups. To minimize manifest + // churn on initial DB population, setting max_manifest_file_size to something + // not too small, like 1MB, should suffice. Similarly, write amp on the + // manifest file is likely not a direct concern but completed compactions and + // flushes cannot (currently) be committed while the (relatively small) + // manifest file is being compacted. Manifest compactions should not + // interfere with user write latency or throughput unless the DB is + // chronically stalling or close to stalling writes already. + // + // For this option to have a meaningful effect, it is recommended to set + // max_manifest_file_size to something modest like 1MB. Then we can interpret + // values for this option as follows, starting with minimum space amp and + // maximum write amp: + // * 0 - Every manifest write (flush, compaction, etc.) generates a whole new + // manifest. Only useful for testing. + // * very small - Doesn't take many manifest writes to generate a whole new + // manifest. + // * 100 - In a DB with pretty consistent number of SST files, etc., achieves + // about 1.0 write amp (writing about 2x the theoretical minimum) and a max of + // about 1.0 space amp (manifest up to 2x the compacted size). + // * 500 - Recommended and default: 0.2 write amp and up to roughly 5.0 space + // amp. + // * 10000 - 0.01 write amp and up to 100 space amp on the manifest. + // + // This option is mutable with SetDBOptions(), taking effect on the next + // manifest write (e.g. completed DB compaction or flush). + int max_manifest_space_amp_pct = 500; + // Number of shards used for table cache. int table_cache_numshardbits = 6; diff --git a/options/db_options.cc b/options/db_options.cc index 3e06c4ceb..dfacea8e5 100644 --- a/options/db_options.cc +++ b/options/db_options.cc @@ -124,6 +124,18 @@ static std::unordered_map {offsetof(struct MutableDBOptions, max_background_flushes), OptionType::kInt, OptionVerificationType::kNormal, OptionTypeFlags::kMutable}}, + {"max_manifest_file_size", + {offsetof(struct MutableDBOptions, max_manifest_file_size), + OptionType::kUInt64T, OptionVerificationType::kNormal, + OptionTypeFlags::kMutable}}, + {"max_manifest_space_amp_pct", + {offsetof(struct MutableDBOptions, max_manifest_space_amp_pct), + OptionType::kInt, OptionVerificationType::kNormal, + OptionTypeFlags::kMutable}}, + {"manifest_preallocation_size", + {offsetof(struct MutableDBOptions, manifest_preallocation_size), + OptionType::kSizeT, OptionVerificationType::kNormal, + OptionTypeFlags::kMutable}}, {"daily_offpeak_time_utc", {offsetof(struct MutableDBOptions, daily_offpeak_time_utc), OptionType::kString, OptionVerificationType::kNormal, @@ -288,10 +300,6 @@ static std::unordered_map {offsetof(struct ImmutableDBOptions, log_file_time_to_roll), OptionType::kSizeT, OptionVerificationType::kNormal, OptionTypeFlags::kNone}}, - {"manifest_preallocation_size", - {offsetof(struct ImmutableDBOptions, manifest_preallocation_size), - OptionType::kSizeT, OptionVerificationType::kNormal, - OptionTypeFlags::kNone}}, {"max_log_file_size", {offsetof(struct ImmutableDBOptions, max_log_file_size), OptionType::kSizeT, OptionVerificationType::kNormal, @@ -310,10 +318,6 @@ static std::unordered_map {offsetof(struct ImmutableDBOptions, WAL_ttl_seconds), OptionType::kUInt64T, OptionVerificationType::kNormal, OptionTypeFlags::kNone}}, - {"max_manifest_file_size", - {offsetof(struct ImmutableDBOptions, max_manifest_file_size), - OptionType::kUInt64T, OptionVerificationType::kNormal, - OptionTypeFlags::kNone}}, {"persist_stats_to_disk", {offsetof(struct ImmutableDBOptions, persist_stats_to_disk), OptionType::kBoolean, OptionVerificationType::kNormal, @@ -657,7 +661,7 @@ class DBOptionsConfigurable : public MutableDBConfigurable { explicit DBOptionsConfigurable( const DBOptions& opts, const std::unordered_map* map = nullptr) - : MutableDBConfigurable(MutableDBOptions(opts), map), db_options_(opts) { + : MutableDBConfigurable(MutableDBOptions{opts}, map), db_options_(opts) { // The ImmutableDBOptions currently requires the env to be non-null. Make // sure it is if (opts.env != nullptr) { @@ -708,7 +712,7 @@ std::unique_ptr DBOptionsAsConfigurable( return ptr; } -ImmutableDBOptions::ImmutableDBOptions() : ImmutableDBOptions(Options()) {} +ImmutableDBOptions::ImmutableDBOptions() : ImmutableDBOptions(DBOptions{}) {} ImmutableDBOptions::ImmutableDBOptions(const DBOptions& options) : create_if_missing(options.create_if_missing), @@ -737,13 +741,11 @@ ImmutableDBOptions::ImmutableDBOptions(const DBOptions& options) log_file_time_to_roll(options.log_file_time_to_roll), keep_log_file_num(options.keep_log_file_num), recycle_log_file_num(options.recycle_log_file_num), - max_manifest_file_size(options.max_manifest_file_size), table_cache_numshardbits(options.table_cache_numshardbits), WAL_ttl_seconds(options.WAL_ttl_seconds), WAL_size_limit_MB(options.WAL_size_limit_MB), max_write_batch_group_size_bytes( options.max_write_batch_group_size_bytes), - manifest_preallocation_size(options.manifest_preallocation_size), allow_mmap_reads(options.allow_mmap_reads), allow_mmap_writes(options.allow_mmap_writes), use_direct_reads(options.use_direct_reads), @@ -850,9 +852,6 @@ void ImmutableDBOptions::Dump(Logger* log) const { ROCKS_LOG_HEADER( log, " Options.max_log_file_size: %" ROCKSDB_PRIszt, max_log_file_size); - ROCKS_LOG_HEADER(log, - " Options.max_manifest_file_size: %" PRIu64, - max_manifest_file_size); ROCKS_LOG_HEADER( log, " Options.log_file_time_to_roll: %" ROCKSDB_PRIszt, log_file_time_to_roll); @@ -892,9 +891,6 @@ void ImmutableDBOptions::Dump(Logger* log) const { " " "Options.max_write_batch_group_size_bytes: %" PRIu64, max_write_batch_group_size_bytes); - ROCKS_LOG_HEADER( - log, " Options.manifest_preallocation_size: %" ROCKSDB_PRIszt, - manifest_preallocation_size); ROCKS_LOG_HEADER(log, " Options.is_fd_close_on_exec: %d", is_fd_close_on_exec); ROCKS_LOG_HEADER(log, " Options.advise_random_on_open: %d", @@ -1025,24 +1021,7 @@ const std::string& ImmutableDBOptions::GetWalDir( } } -MutableDBOptions::MutableDBOptions() - : max_background_jobs(2), - max_background_compactions(-1), - max_subcompactions(0), - avoid_flush_during_shutdown(false), - writable_file_max_buffer_size(1024 * 1024), - delayed_write_rate(2 * 1024U * 1024U), - max_total_wal_size(0), - delete_obsolete_files_period_micros(6ULL * 60 * 60 * 1000000), - stats_dump_period_sec(600), - stats_persist_period_sec(600), - stats_history_buffer_size(1024 * 1024), - max_open_files(-1), - bytes_per_sync(0), - wal_bytes_per_sync(0), - strict_bytes_per_sync(false), - compaction_readahead_size(0), - max_background_flushes(-1) {} +MutableDBOptions::MutableDBOptions() : MutableDBOptions(DBOptions{}) {} MutableDBOptions::MutableDBOptions(const DBOptions& options) : max_background_jobs(options.max_background_jobs), @@ -1063,6 +1042,9 @@ MutableDBOptions::MutableDBOptions(const DBOptions& options) strict_bytes_per_sync(options.strict_bytes_per_sync), compaction_readahead_size(options.compaction_readahead_size), max_background_flushes(options.max_background_flushes), + max_manifest_file_size(options.max_manifest_file_size), + max_manifest_space_amp_pct(options.max_manifest_space_amp_pct), + manifest_preallocation_size(options.manifest_preallocation_size), daily_offpeak_time_utc(options.daily_offpeak_time_utc) {} void MutableDBOptions::Dump(Logger* log) const { @@ -1107,6 +1089,15 @@ void MutableDBOptions::Dump(Logger* log) const { compaction_readahead_size); ROCKS_LOG_HEADER(log, " Options.max_background_flushes: %d", max_background_flushes); + ROCKS_LOG_HEADER(log, + " Options.max_manifest_file_size: %" PRIu64, + max_manifest_file_size); + ROCKS_LOG_HEADER(log, + " Options.max_manifest_space_amp_pct: %d", + max_manifest_space_amp_pct); + ROCKS_LOG_HEADER( + log, " Options.manifest_preallocation_size: %" ROCKSDB_PRIszt, + manifest_preallocation_size); ROCKS_LOG_HEADER(log, "Options.daily_offpeak_time_utc: %s", daily_offpeak_time_utc.c_str()); } diff --git a/options/db_options.h b/options/db_options.h index c23a6f1c9..ef8607d8b 100644 --- a/options/db_options.h +++ b/options/db_options.h @@ -47,12 +47,10 @@ struct ImmutableDBOptions { size_t log_file_time_to_roll; size_t keep_log_file_num; size_t recycle_log_file_num; - uint64_t max_manifest_file_size; int table_cache_numshardbits; uint64_t WAL_ttl_seconds; uint64_t WAL_size_limit_MB; uint64_t max_write_batch_group_size_bytes; - size_t manifest_preallocation_size; bool allow_mmap_reads; bool allow_mmap_writes; bool use_direct_reads; @@ -146,6 +144,9 @@ struct MutableDBOptions { bool strict_bytes_per_sync; size_t compaction_readahead_size; int max_background_flushes; + uint64_t max_manifest_file_size; + int max_manifest_space_amp_pct; + size_t manifest_preallocation_size; std::string daily_offpeak_time_utc; }; diff --git a/options/options_helper.cc b/options/options_helper.cc index c98ec320e..0963428df 100644 --- a/options/options_helper.cc +++ b/options/options_helper.cc @@ -99,13 +99,15 @@ void BuildDBOptions(const ImmutableDBOptions& immutable_db_options, options.log_file_time_to_roll = immutable_db_options.log_file_time_to_roll; options.keep_log_file_num = immutable_db_options.keep_log_file_num; options.recycle_log_file_num = immutable_db_options.recycle_log_file_num; - options.max_manifest_file_size = immutable_db_options.max_manifest_file_size; + options.max_manifest_file_size = mutable_db_options.max_manifest_file_size; + options.max_manifest_space_amp_pct = + mutable_db_options.max_manifest_space_amp_pct; options.table_cache_numshardbits = immutable_db_options.table_cache_numshardbits; options.WAL_ttl_seconds = immutable_db_options.WAL_ttl_seconds; options.WAL_size_limit_MB = immutable_db_options.WAL_size_limit_MB; options.manifest_preallocation_size = - immutable_db_options.manifest_preallocation_size; + mutable_db_options.manifest_preallocation_size; options.allow_mmap_reads = immutable_db_options.allow_mmap_reads; options.allow_mmap_writes = immutable_db_options.allow_mmap_writes; options.use_direct_reads = immutable_db_options.use_direct_reads; diff --git a/options/options_settable_test.cc b/options/options_settable_test.cc index c5b587383..89a01896c 100644 --- a/options/options_settable_test.cc +++ b/options/options_settable_test.cc @@ -408,6 +408,7 @@ TEST_F(OptionsSettableTest, DBOptionsAllFieldsSettable) { "skip_stats_update_on_db_open=false;" "skip_checking_sst_file_sizes_on_db_open=false;" "max_manifest_file_size=4295009941;" + "max_manifest_space_amp_pct=321;" "db_log_dir=path/to/db_log_dir;" "writable_file_max_buffer_size=1048576;" "paranoid_checks=true;" diff --git a/options/options_test.cc b/options/options_test.cc index 2af5f5e14..7111872f5 100644 --- a/options/options_test.cc +++ b/options/options_test.cc @@ -160,6 +160,7 @@ TEST_F(OptionsTest, GetOptionsFromMapTest) { {"keep_log_file_num", "39"}, {"recycle_log_file_num", "5"}, {"max_manifest_file_size", "40"}, + {"max_manifest_space_amp_pct", "42"}, {"table_cache_numshardbits", "41"}, {"WAL_ttl_seconds", "43"}, {"WAL_size_limit_MB", "44"}, @@ -341,7 +342,8 @@ TEST_F(OptionsTest, GetOptionsFromMapTest) { ASSERT_EQ(new_db_opt.log_file_time_to_roll, 38U); ASSERT_EQ(new_db_opt.keep_log_file_num, 39U); ASSERT_EQ(new_db_opt.recycle_log_file_num, 5U); - ASSERT_EQ(new_db_opt.max_manifest_file_size, static_cast(40)); + ASSERT_EQ(new_db_opt.max_manifest_file_size, uint64_t{40}); + ASSERT_EQ(new_db_opt.max_manifest_space_amp_pct, 42); ASSERT_EQ(new_db_opt.table_cache_numshardbits, 41); ASSERT_EQ(new_db_opt.WAL_ttl_seconds, static_cast(43)); ASSERT_EQ(new_db_opt.WAL_size_limit_MB, static_cast(44)); @@ -2479,6 +2481,7 @@ TEST_F(OptionsOldApiTest, GetOptionsFromMapTest) { {"keep_log_file_num", "39"}, {"recycle_log_file_num", "5"}, {"max_manifest_file_size", "40"}, + {"max_manifest_space_amp_pct", "42"}, {"table_cache_numshardbits", "41"}, {"WAL_ttl_seconds", "43"}, {"WAL_size_limit_MB", "44"}, @@ -2665,7 +2668,8 @@ TEST_F(OptionsOldApiTest, GetOptionsFromMapTest) { ASSERT_EQ(new_db_opt.log_file_time_to_roll, 38U); ASSERT_EQ(new_db_opt.keep_log_file_num, 39U); ASSERT_EQ(new_db_opt.recycle_log_file_num, 5U); - ASSERT_EQ(new_db_opt.max_manifest_file_size, static_cast(40)); + ASSERT_EQ(new_db_opt.max_manifest_file_size, uint64_t{40}); + ASSERT_EQ(new_db_opt.max_manifest_space_amp_pct, 42); ASSERT_EQ(new_db_opt.table_cache_numshardbits, 41); ASSERT_EQ(new_db_opt.WAL_ttl_seconds, static_cast(43)); ASSERT_EQ(new_db_opt.WAL_size_limit_MB, static_cast(44)); diff --git a/src.mk b/src.mk index 06310de3d..68bfb2186 100644 --- a/src.mk +++ b/src.mk @@ -494,6 +494,7 @@ TEST_MAIN_SOURCES = \ db/db_clip_test.cc \ db/db_dynamic_level_test.cc \ db/db_encryption_test.cc \ + db/db_etc3_test.cc \ db/db_flush_test.cc \ db/db_follower_test.cc \ db/db_readonly_with_timestamp_test.cc \ diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc index caf23ee61..655bba868 100644 --- a/tools/db_bench_tool.cc +++ b/tools/db_bench_tool.cc @@ -446,6 +446,14 @@ DEFINE_int64(db_write_buffer_size, ROCKSDB_NAMESPACE::Options().db_write_buffer_size, "Number of bytes to buffer in all memtables before compacting"); +DEFINE_int64(max_manifest_file_size, + ROCKSDB_NAMESPACE::Options().max_manifest_file_size, + "Max manifest file size (or minimum max with auto-tuning)"); + +DEFINE_int32(max_manifest_space_amp_pct, + ROCKSDB_NAMESPACE::Options().max_manifest_space_amp_pct, + "Max manifest space amp percentage for auto-tuning"); + DEFINE_bool(cost_write_buffer_to_cache, false, "The usage of memtable is costed to the block cache"); @@ -4368,6 +4376,8 @@ class Benchmark { options.write_buffer_manager.reset( new WriteBufferManager(FLAGS_db_write_buffer_size, cache_)); } + options.max_manifest_file_size = FLAGS_max_manifest_file_size; + options.max_manifest_space_amp_pct = FLAGS_max_manifest_space_amp_pct; options.arena_block_size = FLAGS_arena_block_size; options.write_buffer_size = FLAGS_write_buffer_size; options.max_write_buffer_number = FLAGS_max_write_buffer_number; diff --git a/tools/db_crashtest.py b/tools/db_crashtest.py index cab484f08..2738a8c11 100644 --- a/tools/db_crashtest.py +++ b/tools/db_crashtest.py @@ -245,8 +245,9 @@ default_params = { # Test small max_manifest_file_size in a smaller chance, as most of the # time we wnat manifest history to be preserved to help debug "max_manifest_file_size": lambda: random.choice( - [t * 16384 if t < 3 else 1024 * 1024 * 1024 for t in range(1, 30)] + [t * 2048 if t < 5 else 1024 * 1024 * 1024 for t in range(1, 30)] ), + "max_manifest_space_amp_pct": lambda: random.choice([0, 10, 100, 1000]), # Sync mode might make test runs slower so running it in a smaller chance "sync": lambda: random.choice([1 if t == 0 else 0 for t in range(0, 20)]), "bytes_per_sync": lambda: random.choice([0, 262144]), diff --git a/tools/ldb_cmd.cc b/tools/ldb_cmd.cc index 9ab70b974..328f7d875 100644 --- a/tools/ldb_cmd.cc +++ b/tools/ldb_cmd.cc @@ -1610,7 +1610,8 @@ void DumpManifestFile(Options options, std::string file, bool verbose, bool hex, WriteController wc(options.delayed_write_rate); WriteBufferManager wb(options.db_write_buffer_size); ImmutableDBOptions immutable_db_options(options); - VersionSet versions(dbname, &immutable_db_options, sopt, tc.get(), &wb, &wc, + VersionSet versions(dbname, &immutable_db_options, MutableDBOptions{}, sopt, + tc.get(), &wb, &wc, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", options.daily_offpeak_time_utc, @@ -1805,7 +1806,8 @@ Status GetLiveFilesChecksumInfoFromVersionSet(Options options, WriteController wc(options.delayed_write_rate); WriteBufferManager wb(options.db_write_buffer_size); ImmutableDBOptions immutable_db_options(options); - VersionSet versions(dbname, &immutable_db_options, sopt, tc.get(), &wb, &wc, + VersionSet versions(dbname, &immutable_db_options, MutableDBOptions{options}, + sopt, tc.get(), &wb, &wc, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", options.daily_offpeak_time_utc, @@ -2660,7 +2662,8 @@ Status ReduceDBLevelsCommand::GetOldNumOfLevels(Options& opt, int* levels) { const InternalKeyComparator cmp(opt.comparator); WriteController wc(opt.delayed_write_rate); WriteBufferManager wb(opt.db_write_buffer_size); - VersionSet versions(db_path_, &db_options, soptions, tc.get(), &wb, &wc, + VersionSet versions(db_path_, &db_options, MutableDBOptions{opt}, soptions, + tc.get(), &wb, &wc, /*block_cache_tracer=*/nullptr, /*io_tracer=*/nullptr, /*db_id=*/"", /*db_session_id=*/"", opt.daily_offpeak_time_utc, diff --git a/tools/ldb_cmd_test.cc b/tools/ldb_cmd_test.cc index 5715f93db..6943780f7 100644 --- a/tools/ldb_cmd_test.cc +++ b/tools/ldb_cmd_test.cc @@ -208,8 +208,9 @@ class FileChecksumTestHelper { WriteController wc(options_.delayed_write_rate); WriteBufferManager wb(options_.db_write_buffer_size); ImmutableDBOptions immutable_db_options(options_); - VersionSet versions(dbname_, &immutable_db_options, sopt, tc.get(), &wb, - &wc, nullptr, nullptr, "", "", + VersionSet versions(dbname_, &immutable_db_options, + MutableDBOptions{options_}, sopt, tc.get(), &wb, &wc, + nullptr, nullptr, "", "", options_.daily_offpeak_time_utc, nullptr, /*read_only=*/false); std::vector cf_name_list; diff --git a/unreleased_history/new_features/auto_tune_manifest.md b/unreleased_history/new_features/auto_tune_manifest.md new file mode 100644 index 000000000..9bc95a05e --- /dev/null +++ b/unreleased_history/new_features/auto_tune_manifest.md @@ -0,0 +1 @@ +* Added an auto-tuning feature for DB manifest file size that also (by default) improves the safety of existing configurations in case `max_manifest_file_size` is repeatedly exceeded. The new recommendation is to set `max_manifest_file_size` to something small like 1MB and tune `max_manifest_space_amp_pct` as needed to balance write amp and space amp in the manifest. Refer to comments on those options in `DBOptions` for details. Both options are (now) mutable. diff --git a/utilities/backup/backup_engine_test.cc b/utilities/backup/backup_engine_test.cc index 51581fb00..9438b8574 100644 --- a/utilities/backup/backup_engine_test.cc +++ b/utilities/backup/backup_engine_test.cc @@ -3541,6 +3541,7 @@ TEST_F(BackupEngineTest, EnvFailures) { TEST_F(BackupEngineTest, ChangeManifestDuringBackupCreation) { DestroyDBWithoutCheck(dbname_, options_); options_.max_manifest_file_size = 0; // always rollover manifest for file add + options_.max_manifest_space_amp_pct = 0; OpenDBAndBackupEngine(true); FillDB(db_.get(), 0, 100, kAutoFlushOnly); diff --git a/utilities/checkpoint/checkpoint_test.cc b/utilities/checkpoint/checkpoint_test.cc index a514c3400..f7ca4136e 100644 --- a/utilities/checkpoint/checkpoint_test.cc +++ b/utilities/checkpoint/checkpoint_test.cc @@ -596,6 +596,7 @@ TEST_F(CheckpointTest, CheckpointCFNoFlush) { TEST_F(CheckpointTest, CurrentFileModifiedWhileCheckpointing) { Options options = CurrentOptions(); options.max_manifest_file_size = 0; // always rollover manifest for file add + options.max_manifest_space_amp_pct = 0; Reopen(options); ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(