]> git-server-git.apps.pok.os.sepia.ceph.com Git - rocksdb.git/log
rocksdb.git
4 years agoFix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714)
anand76 [Thu, 26 Aug 2021 22:39:32 +0000 (15:39 -0700)]
Fix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714)

Summary:
DumpStats() iterates through the ColumnFamilySet. There is a potential
race condition because it does Ref the cfd, and the cfd could get
destroyed during the iteration.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30580199

Pulled By: anand1976

fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850

4 years agoDeflake test `CompactionJobTest.InputSerialization` (#8712)
Jay Zhuang [Thu, 26 Aug 2021 16:26:41 +0000 (09:26 -0700)]
Deflake test `CompactionJobTest.InputSerialization` (#8712)

Summary:
It's invalid to have an empty file name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8712

Test Plan:
```
$ gtest-parallel ./compaction_job_test --gtest_filter=CompactionJobTest.InputSerialization -r 10000
```

Reviewed By: pdillinger

Differential Revision: D30566739

Pulled By: jay-zhuang

fbshipit-source-id: 41e73175e3c95c4b73b4fdcd33470788d4e29d37

4 years agoMake Configurable/Customizable options copyable (#8704)
mrambacher [Thu, 26 Aug 2021 00:46:31 +0000 (17:46 -0700)]
Make Configurable/Customizable options copyable (#8704)

Summary:
The atomic variable "is_prepared_" was keeping Configurable objects from being copy-constructed.  Removed the atomic to allow copies.

Since the variable is only changed from false to true (and never back), there is no reason it had to be atomic.

Added tests that simple Configurable and Customizable objects can be put on the stack and copied.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8704

Reviewed By: anand1976

Differential Revision: D30530526

Pulled By: ltamasi

fbshipit-source-id: 4dd4439b3e5ad7fa396573d0b25d9fb709160576

4 years agoFix legocastle Python commands for CentOS 8 (#8701)
Andrew Kryczka [Wed, 25 Aug 2021 23:52:50 +0000 (16:52 -0700)]
Fix legocastle Python commands for CentOS 8 (#8701)

Summary:
There is no longer an unversioned `python` command that refers to Python
3; the recommended alternative is `/usr/bin/env python3`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8701

Test Plan: - [internal link] https://www.internalfb.com/intern/sandcastle/group/nonce/5100000000000001/

Reviewed By: riversand963

Differential Revision: D30520380

Pulled By: ajkr

fbshipit-source-id: 2af459a64a15fb2a011e98b156f31d322f6b2d25

4 years agoTemporarily disable block-based filter when stress testing timestamp (#8703)
Yanqin Jin [Wed, 25 Aug 2021 02:03:58 +0000 (19:03 -0700)]
Temporarily disable block-based filter when stress testing timestamp (#8703)

Summary:
Current implementation does not support user-defined timestamp when
block-based filter is used. Will implement the support in the future, or
wait to see if block-based filter can be deprecated and removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8703

Test Plan: make whitebox_crash_test_with_ts

Reviewed By: pdillinger

Differential Revision: D30528931

Pulled By: riversand963

fbshipit-source-id: 60dd74ee0a6194e69072069d8c4bd876f249f38d

4 years agoFix a bug of secondary instance sequence going backward (#8653)
Yanqin Jin [Wed, 25 Aug 2021 01:17:32 +0000 (18:17 -0700)]
Fix a bug of secondary instance sequence going backward (#8653)

Summary:
Recent refactor of `ReactiveVersionSet::ReadAndApply()` uses
`ManifestTailer` whose `Iterate()` method can cause the db's
`last_sequence_` to go backward. Consequently, read requests can see
out-dated data. For example, latest changes to the primary will not be
seen on the secondary even after a `TryCatchUpWithPrimary()` if no new
write batches are read from the WALs and no new MANIFEST entries are
read from the MANIFEST.

Fix the bug so that `VersionEditHandler::CheckIterationResult` will
never decrease `last_sequence_`, `last_allocated_sequence_` and
`last_published_sequence_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8653

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30272084

Pulled By: riversand963

fbshipit-source-id: c6a49c534b2509b93ef62d8936ed0acd5b860eaa

4 years agoSimplify `TraceAnalyzer` (#8697)
Merlin Mao [Wed, 25 Aug 2021 01:17:31 +0000 (18:17 -0700)]
Simplify `TraceAnalyzer` (#8697)

Summary:
Handler functions now use a common output function to output to stdout/files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8697

Test Plan: `trace_analyzer_test` can pass.

Reviewed By: zhichao-cao

Differential Revision: D30527696

Pulled By: autopear

fbshipit-source-id: c626cf4d53a39665a9c4bcf0cb019c448434abe4

4 years agoAdd port::GetProcessID() (#8693)
Peter Dillinger [Wed, 25 Aug 2021 00:45:01 +0000 (17:45 -0700)]
Add port::GetProcessID() (#8693)

Summary:
Useful in some places for object uniqueness across processes.
Currently used for generating a host-wide identifier of Cache objects
but expected to be used soon in some unique id generation code.

`int64_t` is chosen for return type because POSIX uses signed integer type,
usually `int`, for `pid_t` and Windows uses `DWORD`, which is `uint32_t`.

Future work: avoid copy-pasted declarations in port_*.h, perhaps with
port_common.h always included from port.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8693

Test Plan: manual for now

Reviewed By: ajkr, anand1976

Differential Revision: D30492876

Pulled By: pdillinger

fbshipit-source-id: 39fc2788623cc9f4787866bdb67a4d183dde7eef

4 years agoAllow iterate refresh for secondary instance (#8700)
Yanqin Jin [Tue, 24 Aug 2021 22:39:31 +0000 (15:39 -0700)]
Allow iterate refresh for secondary instance (#8700)

Summary:
Test plan
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8700

Reviewed By: zhichao-cao

Differential Revision: D30523907

Pulled By: riversand963

fbshipit-source-id: 68928ab4dafb64ce80ab7bc69d83727a4713ab91

4 years agoRefactor WriteBufferManager::CacheRep into CacheReservationManager (#8506)
Hui Xiao [Tue, 24 Aug 2021 19:42:31 +0000 (12:42 -0700)]
Refactor WriteBufferManager::CacheRep into CacheReservationManager (#8506)

Summary:
Context:
To help cap various memory usage by a single limit of the block cache capacity, we charge the memory usage through inserting/releasing dummy entries in the block cache. CacheReservationManager is such a class (non thread-safe) responsible for  inserting/removing dummy entries to reserve cache space for memory used by the class user.

- Refactored the inner private class CacheRep of WriteBufferManager into public CacheReservationManager class for reusability such as for https://github.com/facebook/rocksdb/pull/8428

- Encapsulated implementation details of cache key generation and dummy entries insertion/release in cache reservation as discussed in https://github.com/facebook/rocksdb/pull/8506#discussion_r666550838

- Consolidated increase/decrease cache reservation into one API - UpdateCacheReservation.

- Adjusted the previous dummy entry release algorithm in decreasing cache reservation to be loop-releasing dummy entries to stay symmetric to dummy entry insertion algorithm

- Made the previous dummy entry release algorithm in delayed decrease mode more aggressive for better decreasing cache reservation when memory used is less likely to increase back.

  Previously, the algorithms only release 1 dummy entries when new_mem_used < 3/4 * cache_allocated_size_ and cache_allocated_size_ - kSizeDummyEntry > new_mem_used.
Now, the algorithms loop-releases as many dummy entries as possible when new_mem_used < 3/4 * cache_allocated_size_.

- Updated WriteBufferManager's test cases to adapt to changes on the release algorithm mentioned above and left comment for some test cases for clarity

- Replaced the previous cache key prefix generation (utilizing object address related to the cache client) with one that utilizes Cache->NewID() to prevent cache-key collision among dummy entry clients sharing the same cache.

  The specific collision we are preventing happens when the object address is reused for a new cache-key prefix while the old cache-key using that same object address in its prefix still exists in the cache. This could happen due to that, under LRU cache policy, there is a possible delay in releasing a cache entry after the cache client object owning that cache entry get deallocated. In this case, the object address related to the cache client object can get reused for other client object to generate a new cache-key prefix.

  This prefix generation can be made obsolete after Peter's unification of all the code generating cache key, mentioned in https://github.com/facebook/rocksdb/pull/8506#discussion_r667265255

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8506

Test Plan:
- Passing the added unit tests cache_reservation_manager_test.cc
- Passing existing and adjusted write_buffer_manager_test.cc

Reviewed By: ajkr

Differential Revision: D29644135

Pulled By: hx235

fbshipit-source-id: 0fc93fbfe4a40bb41be85c314f8f2bafa8b741f7

4 years agoDeflake write-prepared and write-unprepared tests (#8696)
Andrew Kryczka [Tue, 24 Aug 2021 06:08:25 +0000 (23:08 -0700)]
Deflake write-prepared and write-unprepared tests (#8696)

Summary:
The `JobContext::job_snapshot` referenced DB state but could
have been deleted by a BG thread after the signal/unlock allowing
shutdown to proceed. Then we would see an error like this (valgrind):

```
==354104== Thread 2:
==354104== Invalid read of size 8
==354104==    at 0x694C4D: rocksdb::ManagedSnapshot::~ManagedSnapshot() (snapshot_impl.cc:20)
==354104==    by 0x58F5BA: operator() (unique_ptr.h:81)
==354104==    by 0x58F5BA: operator() (unique_ptr.h:75)
==354104==    by 0x58F5BA: ~unique_ptr (unique_ptr.h:292)
==354104==    by 0x58F5BA: rocksdb::JobContext::~JobContext() (job_context.h:221)
==354104==    by 0x5F155E: rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) (db_impl_compaction_flush.cc:2696)
==354104==    by 0x5F1BC2: rocksdb::DBImpl::BGWorkCompaction(void*) (db_impl_compaction_flush.cc:2468)
==354104==    by 0x83707A: operator() (std_function.h:688)
==354104==    by 0x83707A: rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) (threadpool_imp.cc:266)
==354104==    by 0x8373ED: rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) (threadpool_imp.cc:307)
==354104==    by 0x492A800: execute_native_thread_routine (in /usr/local/fbcode/platform009/lib/libstdc++.so.6.0.28)
==354104==    by 0x4A5020B: start_thread (in /usr/local/fbcode/platform009/lib/libpthread-2.30.so)
==354104==    by 0x4CF281E: clone (in /usr/local/fbcode/platform009/lib/libc-2.30.so)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8696

Test Plan: unable to repro

Reviewed By: pdillinger

Differential Revision: D30505277

Pulled By: ajkr

fbshipit-source-id: 5a99f34137cd14d06b0f624add6d37a70a61135d

4 years agoRefactor TraceAnalyzer to use `TraceRecord::Handler` to avoid casting. (#8678)
Merlin Mao [Tue, 24 Aug 2021 00:17:13 +0000 (17:17 -0700)]
Refactor TraceAnalyzer to use `TraceRecord::Handler` to avoid casting. (#8678)

Summary:
`TraceAnalyzer` privately inherits `TraceRecord::Handler` and `WriteBatch::Handler`.

`trace_analyzer_test` can pass with this change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8678

Reviewed By: zhichao-cao

Differential Revision: D30459814

Pulled By: autopear

fbshipit-source-id: a27f59ac4600f7c3682830c9b1d9dc79e53425be

4 years agoAdd extra information to RemoteCompaction APIs (#8680)
Jay Zhuang [Mon, 23 Aug 2021 23:26:13 +0000 (16:26 -0700)]
Add extra information to RemoteCompaction APIs (#8680)

Summary:
Currently, we only provide job_id in RemoteCompaction APIs, the
main problem of `job_id` is it cannot uniquely identify a compaction job
between DB instances or between sessions.
Providing DB and session id to the user, which will make building cross
DB compaction service easier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30444859

Pulled By: jay-zhuang

fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448

4 years agoAllow intentionally swallowed errors in BlockBasedFilterBlockReader (#8695)
Peter Dillinger [Mon, 23 Aug 2021 22:49:27 +0000 (15:49 -0700)]
Allow intentionally swallowed errors in BlockBasedFilterBlockReader (#8695)

Summary:
To avoid getting "Didn't get expected error from Get" from
crash test by enabling block-based filter in crash test in https://github.com/facebook/rocksdb/issues/8679.
Basically, this applies the pattern of IGNORE_STATUS_IF_ERROR in
full_filter_block.cc to block_based_filter_block.cc

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8695

Test Plan: watch for resolution of crash test runs

Reviewed By: ltamasi

Differential Revision: D30496748

Pulled By: pdillinger

fbshipit-source-id: f7808fcf14c0e787fe81da03fa8303244590d273

4 years agoFix typo in 6.24.0 HISTORY.md (#8694)
Peter Dillinger [Mon, 23 Aug 2021 20:29:06 +0000 (13:29 -0700)]
Fix typo in 6.24.0 HISTORY.md (#8694)

Summary:
fix typo

Also, clarified change of C API signatures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8694

Test Plan: visual

Reviewed By: ltamasi

Differential Revision: D30492882

Pulled By: pdillinger

fbshipit-source-id: ac6dc3dcefa01c91fd87fc7f50279ea5e13fa41d

4 years agoFix LITE build (#8689)
mrambacher [Mon, 23 Aug 2021 12:08:54 +0000 (05:08 -0700)]
Fix LITE build (#8689)

Summary:
Conditional compilation of static functions not used in LITE mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8689

Reviewed By: ltamasi

Differential Revision: D30476218

Pulled By: mrambacher

fbshipit-source-id: 5f3af90982d34818f47d2cb1d36dd5816d0333a5

4 years agoUpdate version.h and HISTORY.md for the 6.24 release (#8688)
Levi Tamasi [Sat, 21 Aug 2021 05:27:22 +0000 (22:27 -0700)]
Update version.h and HISTORY.md for the 6.24 release (#8688)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8688

Reviewed By: ajkr, riversand963

Differential Revision: D30467746

Pulled By: ltamasi

fbshipit-source-id: 0fce0d42fe2fe3cb56d7a89607154b3b957f09b6

4 years agoEmbed original file number in SST table properties (#8686)
Peter Dillinger [Sat, 21 Aug 2021 03:39:52 +0000 (20:39 -0700)]
Embed original file number in SST table properties (#8686)

Summary:
I very recently realized that with https://github.com/facebook/rocksdb/issues/8669 we cannot later add
file numbers to external SST files (so that more can share db session
ids for better uniqueness properties), because of forward compatibility.
We would have a version of RocksDB that assumes session IDs are unique
on external SST files and therefore can't really break that invariant in
future files.

This change adds a table property for "orig_file_number" which is
populated by normal SST files and also external SST files generated by
SstFileWriter. SstFileWriter now keeps a db_session_id for life of the
object and increments its own file numbers for embedding in table
properties. (They are arguably "fake" file numbers because these numbers
and not embedded in the file name.)

While updating block_based_table_builder, I removed several unnecessary
fields from Rep, because following the pattern would have created
another unnecessary field.

This change also updates block_based_table_reader to use this new
property when available, which means that for newer SST files, we can
determine the stable/original <db_session_id,file_number> unique
identifier using just the file contents, not the file name. (It's a bit
complicated; detailed comments in block_based_table_reader.)

Also added DB host id to properties listing by sst_dump, which could be
useful in debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8686

Test Plan: majorly overhauled StableCacheKeys test for this change

Reviewed By: zhichao-cao

Differential Revision: D30457742

Pulled By: pdillinger

fbshipit-source-id: 2e5ae7dddeb94fb9d8eac8a928486aed8b8cd445

4 years agoUpgrade xxhash, add Hash128 (#8634)
Peter Dillinger [Sat, 21 Aug 2021 01:40:53 +0000 (18:40 -0700)]
Upgrade xxhash, add Hash128 (#8634)

Summary:
With expected use for a 128-bit hash, xxhash library is
upgraded to current dev (2c611a76f914828bed675f0f342d6c4199ffee1e)
as of Aug 6 so that we can use production version of XXH3_128bits
as new Hash128 function (added in hash128.h).

To make this work, however, we have to carve out the "preview" version
of XXH3 that is used in new SST Bloom and Ribbon filters, since that
will not get maintenance in xxhash releases. I have consolidated all the
relevant code into xxph3.h and made it "inline only" (no .cc file). The
working name for this hash function is changed from XXH3p to XXPH3
(XX Preview Hash) because the latter is easier to get working with no
symbol name conflicts between the headers.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8634

Test Plan:
no expected change in existing functionality. For Hash128,
added some unit tests based on those for Hash64 to ensure some basic
properties and that the values do not change accidentally.

Reviewed By: zhichao-cao

Differential Revision: D30173490

Pulled By: pdillinger

fbshipit-source-id: 06aa542a7a28b353bc2c865b9b2f8bdfe44158e4

4 years agoAdd Bloom/Ribbon hybrid API support (#8679)
Peter Dillinger [Sat, 21 Aug 2021 00:59:24 +0000 (17:59 -0700)]
Add Bloom/Ribbon hybrid API support (#8679)

Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.

So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)

I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.

C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.

BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679

Test Plan: new + updated tests, including crash test

Reviewed By: jay-zhuang

Differential Revision: D30445797

Pulled By: pdillinger

fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1

4 years agoAdd `IteratorTraceExecutionResult` for iterator related trace records. (#8687)
Merlin Mao [Fri, 20 Aug 2021 22:32:55 +0000 (15:32 -0700)]
Add `IteratorTraceExecutionResult` for iterator related trace records. (#8687)

Summary:
- Allow to get `Valid()`, `status()`, `key()` and `value()` of an iterator from `IteratorTraceExecutionResult`.
- Move lower bound and upper bound from `IteratorSeekQueryTraceRecord` to `IteratorQueryTraceRecord`.

Added test in `DBTest2.TraceAndReplay`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8687

Reviewed By: zhichao-cao

Differential Revision: D30457630

Pulled By: autopear

fbshipit-source-id: be433099a25895b3aa6f0c00f95ad7b1d7489c1d

4 years agoAdd a PerfContext counter for secondary cache hits (#8685)
anand76 [Fri, 20 Aug 2021 22:16:33 +0000 (15:16 -0700)]
Add a PerfContext counter for secondary cache hits (#8685)

Summary:
Add a PerfContext counter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8685

Reviewed By: zhichao-cao

Differential Revision: D30453957

Pulled By: anand1976

fbshipit-source-id: 42888a3ced240e1c44446d52d3b04adfb01f5665

4 years agoUpdate the block_read_count/block_read_byte counters in MultiGet (#8676)
anand76 [Fri, 20 Aug 2021 18:49:53 +0000 (11:49 -0700)]
Update the block_read_count/block_read_byte counters in MultiGet (#8676)

Summary:
MultiGet in block based table reader doesn't use BlockFetcher. As a result, the block_read_count and block_read_byte PerfContext counters were not being updated. This fixes that by updating them in MultiRead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8676

Reviewed By: zhichao-cao

Differential Revision: D30428680

Pulled By: anand1976

fbshipit-source-id: 21846efe92588fc17123665dd06733693a40126d

4 years agoFix blob callback in compaction and atomic flush (#8681)
Akanksha Mahajan [Fri, 20 Aug 2021 18:37:53 +0000 (11:37 -0700)]
Fix blob callback in compaction and atomic flush (#8681)

Summary:
Pass BlobFileCompletionCallback  in case of atomic flush and
compaction job which is currently nullptr(default parameter).
BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to
SstFileManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681

Test Plan: CircleCI jobs

Reviewed By: ltamasi

Differential Revision: D30445998

Pulled By: akankshamahajan15

fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021

4 years agoAdd iterator's lower and upper bounds to `TraceRecord` (#8677)
Merlin Mao [Fri, 20 Aug 2021 00:26:11 +0000 (17:26 -0700)]
Add iterator's lower and upper bounds to `TraceRecord` (#8677)

Summary:
Trace file V2 added lower/upper bounds to `Iterator::Seek()` and `Iterator::SeekForPrev()`. They were not used anywhere during the execution of a `TraceRecord`. Now they are added to be used by `ReadOptions` during `Iterator::Seek()` and `Iterator::SeekForPrev()` if they are set.

Added test cases in `DBTest2.TraceAndManualReplay`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8677

Reviewed By: zhichao-cao

Differential Revision: D30438255

Pulled By: autopear

fbshipit-source-id: 82563006be0b69155990e506a74951c18af8d288

4 years agoFix some minor issues in the Customizable infrastructure (#8566)
mrambacher [Thu, 19 Aug 2021 17:09:30 +0000 (10:09 -0700)]
Fix some minor issues in the Customizable infrastructure (#8566)

Summary:
- Fix issue with OptionType::Vector when the nested item is a Customizable with no names
- Fix issue with OptionType::Vector to appropriately wrap the elements in a Vector;
- Fix an issue with nested Customizable object with a null immutable object still appearing in the mutable options;
- Fix/Add tests for null/empty customizable objects
- Move the RegisterTestObjects from customizable_test into testutil.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566

Reviewed By: zhichao-cao

Differential Revision: D30303724

Pulled By: mrambacher

fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5

4 years agoAdd condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event...
Baptiste Lemaire [Thu, 19 Aug 2021 00:39:00 +0000 (17:39 -0700)]
Add condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event listeners to mempurge tests. (#8672)

Summary:
Previously, when a `FlushJob` was redirected to a MemPurge, the function `DBImpl::NotifyOnFlushComplete` was called, which created a series of issues because the JobInfo was not correctly collected from the memtables.
This diff aims at correcting these two issues (`FlushJobInfo` collection in `FlushJob::MemPurge` , no call to `DBImpl::NotifyOnFlushComplete` after successful mempurge).
Event listeners were added to the unit tests to handle these situations.
Surprisingly none of the crashtests caught this issue, I will try to add event listeners to crash tests in the future.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8672

Reviewed By: akankshamahajan15

Differential Revision: D30383109

Pulled By: bjlemaire

fbshipit-source-id: 35a8d4295886923ee4049a6447f00022cb221c73

4 years agoAllow Replayer to report the results of TraceRecords. (#8657)
Merlin Mao [Thu, 19 Aug 2021 00:04:36 +0000 (17:04 -0700)]
Allow Replayer to report the results of TraceRecords. (#8657)

Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.

New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".

`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657

Reviewed By: ajkr

Differential Revision: D30290216

Pulled By: autopear

fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6

4 years agoStable cache keys on ingested SST files (#8669)
Peter Dillinger [Wed, 18 Aug 2021 18:32:00 +0000 (11:32 -0700)]
Stable cache keys on ingested SST files (#8669)

Summary:
Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even
the same file ingested into different DBs sharing a block cache.

Note: These new cache keys are currently only enabled when FileSystem
does not provide GetUniqueId. For now, they are typically larger,
so slightly less efficient.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669

Test Plan: Extended unit test

Reviewed By: zhichao-cao

Differential Revision: D30398532

Pulled By: pdillinger

fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7

4 years agoFix bug caused by releasing snapshot(s) during compaction (#8608)
Yanqin Jin [Wed, 18 Aug 2021 05:13:21 +0000 (22:13 -0700)]
Fix bug caused by releasing snapshot(s) during compaction (#8608)

Summary:
In debug mode, we are seeing assertion failure as follows

```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```

It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.

In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.

Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.

Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```

It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30145645

Pulled By: riversand963

fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81

4 years agoAdd statistics support to integrated BlobDB (#8667)
Levi Tamasi [Wed, 18 Aug 2021 00:21:16 +0000 (17:21 -0700)]
Add statistics support to integrated BlobDB (#8667)

Summary:
The patch adds statistics support to the integrated BlobDB implementation,
namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and
`BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms
`BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like
`BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`,
`BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.)
Note that the vast majority of the old BlobDB's tickers/histograms are not
really applicable to the new implementation, since they e.g. pertain to calling
dedicated BlobDB APIs (which the integrated BlobDB does not have) or are
tied to the legacy BlobDB's design of writing blob files synchronously when
a write API is called. Such statistics are marked "legacy BlobDB only" in
`statistics.h`.

Fixes https://github.com/facebook/rocksdb/issues/8645 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667

Test Plan: Ran `make check` and tested the new statistics using `db_bench`.

Reviewed By: riversand963

Differential Revision: D30356884

Pulled By: ltamasi

fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4

4 years agoExclude property kLiveSstFilesSizeAtTemperature from stress_test (#8668)
Jay Zhuang [Tue, 17 Aug 2021 16:04:09 +0000 (09:04 -0700)]
Exclude property kLiveSstFilesSizeAtTemperature from stress_test (#8668)

Summary:
Just like other per_level properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8668

Test Plan: stress_test

Reviewed By: zhichao-cao

Differential Revision: D30360967

Pulled By: jay-zhuang

fbshipit-source-id: 70da2557b95c55e8081b04ebf1a909a0fe69488f

4 years agoAdd a stat to count secondary cache hits (#8666)
anand76 [Tue, 17 Aug 2021 04:00:17 +0000 (21:00 -0700)]
Add a stat to count secondary cache hits (#8666)

Summary:
Add a stat for secondary cache hits. The ```Cache::Lookup``` API had an unused ```stats``` parameter. This PR uses that to pass the pointer to a ```Statistics``` object that ```LRUCache``` uses to record the stat.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666

Test Plan: Update a unit test in lru_cache_test

Reviewed By: zhichao-cao

Differential Revision: D30353816

Pulled By: anand1976

fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77

4 years agoStable cache keys using DB session ids in SSTs (#8659)
Peter Dillinger [Tue, 17 Aug 2021 03:36:19 +0000 (20:36 -0700)]
Stable cache keys using DB session ids in SSTs (#8659)

Summary:
Use DB session ids in SST table properties to make cache keys
stable across DB re-open and copy / move / restore / etc.

These new cache keys are currently only enabled when FileSystem does not
provide GetUniqueId. For now, they are typically larger, so slightly
less efficient.

Relevant to https://github.com/facebook/rocksdb/issues/7405

This change has a minor regression in PersistentCache functionality:
metaindex blocks are no longer cached in PersistentCache. Table properties
blocks already were not but ideally should be. I didn't spent effort to
fix & test these issues because we don't believe PersistentCache is used much
if at all and expect SecondaryCache to replace it. (Though PRs are welcome.)

FIXME: there is more to be fixed for stable cache keys on external SST files

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8659

Test Plan:
new unit test added, which fails when disabling new
functionality

Reviewed By: zhichao-cao

Differential Revision: D30297705

Pulled By: pdillinger

fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a

4 years agoAdd db_test2 to to ASSERT_STATUS_CHECKED (#8640)
Adam Retter [Mon, 16 Aug 2021 15:09:46 +0000 (08:09 -0700)]
Add db_test2 to to ASSERT_STATUS_CHECKED (#8640)

Summary:
This is the `db_test2` parts of https://github.com/facebook/rocksdb/pull/7737 reworked on the latest HEAD.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8640

Reviewed By: akankshamahajan15

Differential Revision: D30303684

Pulled By: mrambacher

fbshipit-source-id: 263e2f82d849bde4048b60aed8b31e7deed4706a

4 years agoSupport dynamic sector size in alignment validation for Windows. (#8613)
Burton Li [Mon, 16 Aug 2021 14:30:57 +0000 (07:30 -0700)]
Support dynamic sector size in alignment validation for Windows. (#8613)

Summary:
- Use dynamic section size when calling IsSectorAligned()
- Support relative path for GetSectorSize().
- Move buffer and sector alignment check to assert for better retail performance.
- Typo fixes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8613

Reviewed By: ajkr

Differential Revision: D30136082

Pulled By: mrambacher

fbshipit-source-id: e8cb849befdcae4fea99de5ed5dd6565e612425f

4 years agoUse non-zero exit codes in benchmark.sh when the benchmark cannot be run (#8554)
Adam Retter [Mon, 16 Aug 2021 13:24:35 +0000 (06:24 -0700)]
Use non-zero exit codes in benchmark.sh when the benchmark cannot be run (#8554)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8554

Reviewed By: ajkr

Differential Revision: D29756562

Pulled By: mrambacher

fbshipit-source-id: ab2f5ef988c8ac7ea7c633e6a3dacaf16f021529

4 years agoAdd property `LiveSstFilesSizeAtTemperature` for tiered storage (#8644)
Jay Zhuang [Sun, 15 Aug 2021 21:16:43 +0000 (14:16 -0700)]
Add property `LiveSstFilesSizeAtTemperature` for tiered storage (#8644)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8644

Reviewed By: siying, zhichao-cao

Differential Revision: D30236535

Pulled By: jay-zhuang

fbshipit-source-id: 1758d1c46d83a5087560fb63d53a016bf999da81

4 years agoImprove MemPurge sampling (#8656)
Baptiste Lemaire [Fri, 13 Aug 2021 21:34:43 +0000 (14:34 -0700)]
Improve MemPurge sampling (#8656)

Summary:
Previously, the `MemPurge` sampling function was assessing whether a random entry from a memtable was garbage or not by simply querying the given memtable (see https://github.com/facebook/rocksdb/issues/8628 for more details).
In this diff, I am updating the sampling function by querying not only the memtable the entry was drawn from, but also all subsequent memtables that have a greater memtable ID.
I also added the size of the value for KV entries in the payload/useful payload estimates (which was also one of the reasons why sampling was not as good as mempurging all the time in terms of L0 SST files reduction).
Once these changes were made, I was able to clean obsolete objects and functions from the `MemtableList` struct, and did a bit of cleanup everywhere.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8656

Reviewed By: pdillinger

Differential Revision: D30288583

Pulled By: bjlemaire

fbshipit-source-id: 7646a545ec56f4715949daa59ab5eee74540feb3

4 years agoCode cleanup for trace replayer (#8652)
Merlin Mao [Thu, 12 Aug 2021 16:21:40 +0000 (09:21 -0700)]
Code cleanup for trace replayer (#8652)

Summary:
- Remove extra `;` in trace_record.h
- Remove some unnecessary `assert` in trace_record_handler.cc
- Initialize `env_` after` exec_handler_` in `ReplayerImpl` to let db be asserted in creating the handler before getting `db->GetEnv()`.
- Update history to include the new `TraceReader::Reset()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652

Reviewed By: ajkr

Differential Revision: D30276872

Pulled By: autopear

fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37

4 years agoMake TraceRecord and Replayer public (#8611)
Merlin Mao [Thu, 12 Aug 2021 02:31:44 +0000 (19:31 -0700)]
Make TraceRecord and Replayer public (#8611)

Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.

User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.

Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611

Reviewed By: ajkr

Differential Revision: D30266329

Pulled By: autopear

fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8

4 years agoRe-add retired mempurge flag definitions for legacy-options-file temporary support...
Baptiste Lemaire [Wed, 11 Aug 2021 23:04:19 +0000 (16:04 -0700)]
Re-add retired mempurge flag definitions for legacy-options-file temporary support. (#8650)

Summary:
Current internal regression tests pass in an old option flag `experimental_allow_mempurge` to a more recently built db.
This flag was retired and removed in a recent PR (https://github.com/facebook/rocksdb/issues/8628), and therefore, the following error comes up : `Failed: Invalid argument: Could not find option: : experimental_allow_mempurge`.
In this PR, I reintroduce the two flags retired in https://github.com/facebook/rocksdb/issues/8628, `experimental_allow_mempurge` and `experimental_mempurge_policy` in `db_options.cc` and mark them both as `kDeprecated`.
This is a temporary fix to save us time to find a long term solution, which hopefully will consist in ignoring options prefixed with `experimental_` that are no longer recognized.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8650

Reviewed By: pdillinger

Differential Revision: D30257307

Pulled By: bjlemaire

fbshipit-source-id: 35303655fd2dd9789fd9e3c450e9d8009f3c1f54

4 years agoUpdate and enhance check_format_compatible.sh (#8651)
Peter Dillinger [Wed, 11 Aug 2021 23:01:27 +0000 (16:01 -0700)]
Update and enhance check_format_compatible.sh (#8651)

Summary:
The last few releases overlooked adding to this test. This
change fixes that.

This change also fixes the problem of older branches not understanding
ROCKSDB_NO_FBCODE and referencing compilers no longer supported.
During the test, build_detect_platform is patched to force no FBCODE
compiler usage. (We should not need to update old branches perpetually.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8651

Test Plan: local run reproduces regression described in https://github.com/facebook/rocksdb/issues/8650

Reviewed By: jay-zhuang, zhichao-cao

Differential Revision: D30261872

Pulled By: pdillinger

fbshipit-source-id: 02b447d224d7e0eb8613c63185437ded146713bc

4 years agoAdd suggestion for btrfs user to disable preallocation (#8646)
Jay Zhuang [Wed, 11 Aug 2021 21:52:46 +0000 (14:52 -0700)]
Add suggestion for btrfs user to disable preallocation (#8646)

Summary:
Add comment for `options.allow_fallocate` that btrfs
preallocated space are not freed and a suggestion to disable
preallocation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8646

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D30240050

Pulled By: jay-zhuang

fbshipit-source-id: 75b7190bc8276ce8d8ac2d0cb9064b386cbf4768

4 years agoMemtable sampling for mempurge heuristic. (#8628)
Baptiste Lemaire [Wed, 11 Aug 2021 01:07:48 +0000 (18:07 -0700)]
Memtable sampling for mempurge heuristic. (#8628)

Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628

Reviewed By: pdillinger

Differential Revision: D30149315

Pulled By: bjlemaire

fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27

4 years agoAttempt to deflake DBTestXactLogIterator.TransactionLogIteratorCorruptedLog (#8627)
Levi Tamasi [Tue, 10 Aug 2021 18:08:34 +0000 (11:08 -0700)]
Attempt to deflake DBTestXactLogIterator.TransactionLogIteratorCorruptedLog (#8627)

Summary:
The patch attempts to deflake `DBTestXactLogIterator.TransactionLogIteratorCorruptedLog`
by disabling file deletions while retrieving the list of WAL files and truncating the first WAL file.
This is to prevent the `PurgeObsoleteFiles` call triggered by `GetSortedWalFiles` from
invalidating the result of `GetSortedWalFiles`. The patch also cleans up the test case a bit
and changes it to using `test::TruncateFile` instead of calling the `truncate` syscall directly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8627

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D30147002

Pulled By: ltamasi

fbshipit-source-id: db11072a4ad8900a2f859cb5294e22b1888c23f6

4 years agoSimplify GenericRateLimiter algorithm (#8602)
Andrew Kryczka [Mon, 9 Aug 2021 23:46:14 +0000 (16:46 -0700)]
Simplify GenericRateLimiter algorithm (#8602)

Summary:
`GenericRateLimiter` slow path handles requests that cannot be satisfied
immediately.  Such requests enter a queue, and their thread stays in `Request()`
until they are granted or the rate limiter is stopped.  These threads are
responsible for unblocking themselves.  The work to do so is split into two main
duties.

(1) Waiting for the next refill time.
(2) Refilling the bytes and granting requests.

Prior to this PR, the slow path logic involved a leader election algorithm to
pick one thread to perform (1) followed by (2).  It elected the thread whose
request was at the front of the highest priority non-empty queue since that
request was most likely to be granted.  This algorithm was efficient in terms of
reducing intermediate wakeups, which is a thread waking up only to resume
waiting after finding its request is not granted.  However, the conceptual
complexity of this algorithm was too high.  It took me a long time to draw a
timeline to understand how it works for just one edge case yet there were so
many.

This PR drops the leader election to reduce conceptual complexity.  Now, the two
duties can be performed by whichever thread acquires the lock first.  The risk
of this change is increasing the number of intermediate wakeups, however, we
took steps to mitigate that.

- `wait_until_refill_pending_` flag ensures only one thread performs (1). This\
prevents the thundering herd problem at the next refill time. The remaining\
threads wait on their condition variable with an unbounded duration -- thus we\
must remember to notify them to ensure forward progress.
- (1) is typically done by a thread at the front of a queue. This is trivial\
when the queues are initially empty as the first choice that arrives must be\
the only entry in its queue. When queues are initially non-empty, we achieve\
this by having (2) notify a thread at the front of a queue (preferring higher\
priority) to perform the next duty.
- We do not require any additional wakeup for (2). Typically it will just be\
done by the thread that finished (1).

Combined, the second and third bullet points above suggest the refill/granting
will typically be done by a request at the front of its queue.  This is
important because one wakeup is saved when a granted request happens to be in an
already running thread.

Note there are a few cases that still lead to intermediate wakeup, however.  The
first two are existing issues that also apply to the old algorithm, however, the
third (including both subpoints) is new.

- No request may be granted (only possible when rate limit dynamically\
decreases).
- Requests from a different queue may be granted.
- (2) may be run by a non-front request thread causing it to not be granted even\
if some requests in that same queue are granted. It can happen for a couple\
(unlikely) reasons.
  - A new request may sneak in and grab the lock at the refill time, before the\
thread finishing (1) can wake up and grab it.
  - A new request may sneak in and grab the lock and execute (1) before (2)'s\
chosen candidate can wake up and grab the lock. Then that non-front request\
thread performing (1) can carry over to perform (2).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602

Test Plan:
- Use existing tests. The edge cases listed in the comment are all performance\
related; I could not really think of any related to correctness. The logic\
looks the same whether a thread wakes up/finishes its work early/on-time/late,\
or whether the thread is chosen vs. "steals" the work.
- Verified write throughput and CPU overhead are basically the same with and\
  without this change, even in a rate limiter heavy workload:

Test command:
```
$ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000
```

Results before this PR:

```
fillrandom   :     108.463 micros/op 9219 ops/sec;    9.0 MB/s
7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k
```

Results after this PR:

```
fillrandom   :     108.108 micros/op 9250 ops/sec;    9.0 MB/s
7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k
```

Reviewed By: hx235

Differential Revision: D30048013

Pulled By: ajkr

fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b

4 years agorocksdb: don't call LZ4_loadDictHC with null dictionary
Lucian Grijincu [Mon, 9 Aug 2021 23:04:26 +0000 (16:04 -0700)]
rocksdb: don't call LZ4_loadDictHC with null dictionary

Summary: UBSAN revealed a pointer underflow when `LZ4HC_init_internal` is called with a null `start`.

Reviewed By: ajkr

Differential Revision: D30181874

fbshipit-source-id: ca9bbac1a85c58782871d7f153af733b000cc66c

4 years agoAdd an unittest for tiered storage universal compaction (#8631)
Jay Zhuang [Mon, 9 Aug 2021 20:43:18 +0000 (13:43 -0700)]
Add an unittest for tiered storage universal compaction (#8631)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8631

Reviewed By: siying

Differential Revision: D30200385

Pulled By: jay-zhuang

fbshipit-source-id: 0fa2bb15e74ff81762d767f234078e0fe0106c55

4 years agoMove old files to warm tier in FIFO compactions (#8310)
sdong [Mon, 9 Aug 2021 19:50:19 +0000 (12:50 -0700)]
Move old files to warm tier in FIFO compactions (#8310)

Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310

Test Plan: Add several unit tests.

Reviewed By: ajkr

Differential Revision: D28493792

fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501

4 years agoFix db_stress failure (#8632)
Akanksha Mahajan [Sat, 7 Aug 2021 16:20:11 +0000 (09:20 -0700)]
Fix db_stress failure (#8632)

Summary:
FaultInjectionTestFS injects error in Rename operation. Because
of injected error, info.log fails to be created if rename  returns error and info_log is set to nullptr which leads to this assertion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8632

Test Plan: run the db_stress job locally

Reviewed By: ajkr

Differential Revision: D30167387

Pulled By: akankshamahajan15

fbshipit-source-id: 8d08c4c33e8f0cabd368bbb498d21b9de0660067

4 years agoAdd more C bindings for OptimisticTransactionDB (#8526)
Roy Crihfield [Sat, 7 Aug 2021 02:05:32 +0000 (19:05 -0700)]
Add more C bindings for OptimisticTransactionDB (#8526)

Summary:
* `rocksdb_optimistictransactiondb_checkpoint_object_create`
* `rocksdb_optimistictransactiondb_write`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8526

Reviewed By: ajkr

Differential Revision: D30076822

Pulled By: jay-zhuang

fbshipit-source-id: a59956a8d5449e75d39a8087fbb2bad148cf697d

4 years agoPrevent joining detached thread in ThreadPoolImpl (#8635)
Andrew Kryczka [Sat, 7 Aug 2021 02:05:01 +0000 (19:05 -0700)]
Prevent joining detached thread in ThreadPoolImpl (#8635)

Summary:
This draining mechanism should not be run during `JoinThreads()` because it can detach threads that will be joined. Joining detached threads would throw an exception.

With this PR, we skip draining when `JoinThreads()` has already decided what threads to `join()`, so the threads will exit naturally once the work queue empties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8635

Test Plan: verified it unblocked using `WaitForJobsAndJoinAllThreads()` in https://github.com/facebook/rocksdb/issues/8611.

Reviewed By: riversand963

Differential Revision: D30174587

Pulled By: ajkr

fbshipit-source-id: 144966398a607987e0763c7152a0f653fdbf3c8b

4 years agoFix the sorting of KeyContexts for batched MultiGet (#8633)
Levi Tamasi [Fri, 6 Aug 2021 23:26:04 +0000 (16:26 -0700)]
Fix the sorting of KeyContexts for batched MultiGet (#8633)

Summary:
`CompareKeyContext::operator()` on the trunk has a bug: when comparing
column family IDs, `lhs` is used for both sides of the comparison. This
results in the `KeyContext`s getting sorted solely based on key, which
in turn means that keys with the same column family do not necessarily
form a single range in the sorted list. This violates an assumption of the
batched `MultiGet` logic, leading to the same column family
showing up multiple times in the list of `MultiGetColumnFamilyData`.
The end result is the code attempting to check out the thread-local
`SuperVersion` for the same CF multiple times, causing an
assertion violation in debug builds and memory corruption/crash in
release builds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D30169182

Pulled By: ltamasi

fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023

4 years agoFix the wrong comment of level compaction cf paths test (#8533)
Zaorang Yang [Fri, 6 Aug 2021 22:26:06 +0000 (15:26 -0700)]
Fix the wrong comment of level compaction cf paths test (#8533)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8533

Reviewed By: ajkr

Differential Revision: D29718067

fbshipit-source-id: b4b91c9271362e7a7d47ddbaf28f56fb537cc668

4 years agoRemove unused variable - run_had_errors (#8599)
Peter (Stig) Edwards [Fri, 6 Aug 2021 21:45:39 +0000 (14:45 -0700)]
Remove unused variable - run_had_errors (#8599)

Summary:
Unused since https://github.com/facebook/rocksdb/commit/ab718b415fc9b2a66a2ed642c18803f764839d7b .
Noticed on https://lgtm.com/projects/g/facebook/rocksdb/snapshot/b215f1a83226f111ff52305987af93564272b7d3/files/tools/db_crashtest.py?sort=name&dir=ASC&mode=heatmap#xf254f528ad18f108:1

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8599

Reviewed By: ajkr

Differential Revision: D30057041

Pulled By: zhichao-cao

fbshipit-source-id: e80438cf9717086d2bf67461e19393d426a7676e

4 years agoUpdate benchmark.sh (#8615)
HappyUncle [Fri, 6 Aug 2021 21:34:37 +0000 (14:34 -0700)]
Update benchmark.sh (#8615)

Summary:
Fix help message.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8615

Reviewed By: siying

Differential Revision: D30136092

Pulled By: mrambacher

fbshipit-source-id: edf4112570514d709560baaf96a47c5f36f00665

4 years agoMake backup restore atomic, with sync option (#8568)
Peter Dillinger [Fri, 6 Aug 2021 16:48:53 +0000 (09:48 -0700)]
Make backup restore atomic, with sync option (#8568)

Summary:
Guarantees that if a restore is interrupted, DB::Open will fail. This works by
restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT.

Also makes restore respect BackupEngineOptions::sync (default true). When set,
the restore is guaranteed persisted by the time it returns OK. Also makes the above
atomicity guarantee work in case the interruption is power loss or OS crash (not just
process interruption or crash).

Fixes https://github.com/facebook/rocksdb/issues/8500

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568

Test Plan:
added to backup mini-stress unit test. Passes with
gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp)

Reviewed By: akankshamahajan15

Differential Revision: D29812605

Pulled By: pdillinger

fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e

4 years agoCorrect javadoc for Env#setBackgroundThreads(int) (#8576)
Brendan MacDonell [Fri, 6 Aug 2021 15:51:05 +0000 (08:51 -0700)]
Correct javadoc for Env#setBackgroundThreads(int) (#8576)

Summary:
By default, the low priority pool is not the flush pool, so calling `Env#setBackgroundThreads` without providing a priority will not do what the caller expected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8576

Reviewed By: ajkr

Differential Revision: D29925154

Pulled By: mrambacher

fbshipit-source-id: cd7211fc374e7d9929a9b88ea0a5ba8134b76099

4 years agoMake MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481)
mrambacher [Fri, 6 Aug 2021 15:26:23 +0000 (08:26 -0700)]
Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481)

Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
 - Added Options/Configurable/Object Registration for TTL and Cassandra variants
 - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char.  Made the delimiter into a configurable option
 - Added tests for new functionality

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481

Reviewed By: zhichao-cao

Differential Revision: D30136050

Pulled By: mrambacher

fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af

4 years agoDynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620)
Akanksha Mahajan [Fri, 6 Aug 2021 02:43:44 +0000 (19:43 -0700)]
Dynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620)

Summary:
Dynamically configure BlockBasedTableOptions.prepopulate_block_cache using DB::SetOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8620

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D30091319

Pulled By: akankshamahajan15

fbshipit-source-id: fb586d1848a8dd525bba7b2f9eeac34f2fc6d82c

4 years agoAttempt to deflake ObsoleteFilesTest.DeleteObsoleteOptionsFile (#8624)
Levi Tamasi [Fri, 6 Aug 2021 01:35:02 +0000 (18:35 -0700)]
Attempt to deflake ObsoleteFilesTest.DeleteObsoleteOptionsFile (#8624)

Summary:
We've been seeing occasional crashes on CI while inserting into the
vectors in `ObsoleteFilesTest.DeleteObsoleteOptionsFile`. The crashes
don't reproduce locally (could be either a race or an object lifecycle
issue) but the good news is that the vectors in question are not really
used for anything meaningful by the test. (The assertion about the sizes
of the two vectors being equal is guaranteed to hold, since the two sync
points where they are populated are right after each other.) The patch
simply removes the vectors from the test, alongside the associated
callbacks and sync points.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8624

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D30118485

Pulled By: ltamasi

fbshipit-source-id: 0a4c3d06584e84cd2b1dcc212d274fa1b89cb647

4 years agoUpdate HISTORY for PR8585 (#8623)
Yanqin Jin [Thu, 5 Aug 2021 01:44:53 +0000 (18:44 -0700)]
Update HISTORY for PR8585 (#8623)

Summary:
Update HISTORY.md for PR https://github.com/facebook/rocksdb/issues/8585 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8623

Reviewed By: ltamasi

Differential Revision: D30121910

Pulled By: riversand963

fbshipit-source-id: 525af43fad908a498f22ed4f934ec5cbf60e6d25

4 years agoDo not attempt to rename non-existent info log (#8622)
Andrew Kryczka [Thu, 5 Aug 2021 00:24:06 +0000 (17:24 -0700)]
Do not attempt to rename non-existent info log (#8622)

Summary:
Previously we attempted to rename "LOG" to "LOG.old.*" without checking
its existence first. "LOG" had no reason to exist in a new DB.

Errors in renaming a non-existent "LOG" were swallowed via
`PermitUncheckedError()` so things worked. However the storage service's
error monitoring was detecting all these benign rename failures. So it
is better to fix it. Also with this PR we can now distinguish rename failure
for other reasons and return them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622

Test Plan: new unit test

Reviewed By: akankshamahajan15

Differential Revision: D30115189

Pulled By: ajkr

fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170

4 years agoFix clang failure (#8621)
Akanksha Mahajan [Thu, 5 Aug 2021 00:11:47 +0000 (17:11 -0700)]
Fix clang failure (#8621)

Summary:
Fixed clang failure because of memory leak

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8621

Test Plan: CircleCI clang job

Reviewed By: pdillinger

Differential Revision: D30114337

Pulled By: akankshamahajan15

fbshipit-source-id: 16572b9bcbaa053c2ab7bc1c344148d0e6f8039c

4 years agoRemove corruption error injection in FaultInjectionTestFS (#8616)
anand76 [Wed, 4 Aug 2021 22:48:10 +0000 (15:48 -0700)]
Remove corruption error injection in FaultInjectionTestFS (#8616)

Summary:
```FaultInjectionTestFS``` injects various types of read errors in ```FileSystem``` APIs. One type of error is corruption errors, where data is intentionally corrupted or truncated. There is corresponding validation in db_stress to verify that an injected error results in a user visible Get/MultiGet error. However, for corruption errors, its hard to know when a corruption is supposed to be detected by the user request, due to prefetching and, in case of direct IO, padding. This results in false positives. So remove that functionality.

Block checksum validation for Get/MultiGet is confined to ```BlockFetcher```, so we don't lose a lot by disabling this since its a small surface area to test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8616

Reviewed By: zhichao-cao

Differential Revision: D30074422

Pulled By: anand1976

fbshipit-source-id: 6a61fac18f95514c15364b75013799ddf83294df

4 years agoImprove rate limiter implementation's readability (#8596)
hx235 [Wed, 4 Aug 2021 17:42:49 +0000 (10:42 -0700)]
Improve rate limiter implementation's readability (#8596)

Summary:
Context:
As need for new feature of resource management using RocksDB's rate limiter like [https://github.com/facebook/rocksdb/issues/8595](https://github.com/facebook/rocksdb/pull/8595) arises, it is about time to re-learn our rate limiter and make this learning process easier for others by improving its readability. The comment/assertion/one extra else-branch are added based on my best understanding toward the rate_limiter.cc and rate_limiter_test.cc up to date after giving it a hard read.
- Add code comments/assertion/one extra else-branch (that is not affecting existing behavior, see PR comment) to describe how leader-election works under multi-thread settings in GenericRateLimiter::Request()
- Add code comments to describe a non-obvious trick during clean-up of rate limiter destructor
- Add code comments to explain more about the starvation being fixed in GenericRateLimiter::Refill() through partial byte-granting
- Add code comments to the rate limiter's setup in a complicated unit test in rate_limiter_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8596

Test Plan: - passed existing rate_limiter_test.cc

Reviewed By: ajkr

Differential Revision: D29982590

Pulled By: hx235

fbshipit-source-id: c3592986bb5b0c90d8229fe44f425251ec7e8a0a

4 years agoMention PR 8605 in HISTORY.md (#8619)
Levi Tamasi [Tue, 3 Aug 2021 23:13:41 +0000 (16:13 -0700)]
Mention PR 8605 in HISTORY.md (#8619)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8619

Reviewed By: riversand963

Differential Revision: D30081937

Pulled By: ltamasi

fbshipit-source-id: 57505957ae2c22d4b194aa28cb3fd261b3b39919

4 years agoFix NotifyOnFlushCompleted() for atomic flush (#8585)
Yanqin Jin [Tue, 3 Aug 2021 20:30:05 +0000 (13:30 -0700)]
Fix NotifyOnFlushCompleted() for atomic flush (#8585)

Summary:
PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure
`OnFlushCompleted()` is called after committing flush results to
MANIFEST. However, `flush_jobs_info_` is not updated in atomic
flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`.

This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D29913720

Pulled By: riversand963

fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda

4 years agoCache warming blocks during flush (#8561)
Akanksha Mahajan [Tue, 3 Aug 2021 19:42:22 +0000 (12:42 -0700)]
Cache warming blocks during flush (#8561)

Summary:
Insert warm blocks  (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561

Test Plan: Added unit test

Reviewed By: anand1976

Differential Revision: D29773411

Pulled By: akankshamahajan15

fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66

4 years agoFix db stress crash mempurge (#8604)
Baptiste Lemaire [Tue, 3 Aug 2021 03:25:39 +0000 (20:25 -0700)]
Fix db stress crash mempurge (#8604)

Summary:
The db_stress crash was caused by a call to `IsFlushPending()` made by a stats function which triggered an `assert([false])`, which I didn't plan when I created the `trigger_flush` bool. It turns out that this bool variable is not useful: I created it because I thought the `imm_flush_needed` atomic bool would actually trigger a flush.
It turns out that this bool is only checked in `IsFlushPending` - this is its only use - and a flush is triggered by either a background thread checking on the imm array, or by an explicit call to `SchedulePendingFlush` which creates a flush request, that is then added to a flush request queue.
In this PR, I reverted the MemtableList::Add function to what it was before my changes.
I tested the fix by running the exact command line that deterministically triggered the assert error (see below), which confirmed that this is where the error was coming from.
I also run `db_crashtest.py whitebox` and `blackbox` for a couple hours locally before committing this PR.
Experiment run:

```./db_stress --acquire_snapshot_one_in=0 --allow_concurrent_memtable_write=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=76.90653425292307 --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_ttl=2 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --enable_compaction_filter=1 --enable_pipelined_write=0 --expected_values_path=/dev/shm/rocksdb/rocksdb_crashtest_expected --experimental_allow_mempurge=1 --experimental_mempurge_policy=kAlternate --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --long_running_snapshots=1 --mark_for_compaction_one_file_in=10 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=-1 --prefixpercent=0 --progress_reports=0 --read_fault_one_in=0 --readpercent=60 --recycle_log_file_num=1 --reopen=20 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync=1 --sync_fault_injection=False --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=3 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_ribbon_filter=1 --user_timestamp_size=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8604

Reviewed By: pdillinger

Differential Revision: D30047295

Pulled By: bjlemaire

fbshipit-source-id: b9e379bfa3d6b9bd2b275725fb0bca4bd81a3dbe

4 years agoRevert checkpoint fix (#8607)
Merlin Mao [Tue, 3 Aug 2021 01:27:11 +0000 (18:27 -0700)]
Revert checkpoint fix (#8607)

Summary:
PR https://github.com/facebook/rocksdb/pull/8572 looses custom types in the options file. Need more API changes to fix this issue. Revert this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8607

Reviewed By: ajkr

Differential Revision: D30058289

Pulled By: autopear

fbshipit-source-id: 78f5a154c0bf193e8441bae4a36fa79b95277fd4

4 years agoFix a race in ColumnFamilyData::UnrefAndTryDelete (#8605)
Levi Tamasi [Tue, 3 Aug 2021 01:10:57 +0000 (18:10 -0700)]
Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605)

Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
https://github.com/facebook/rocksdb/issues/8440 ,
https://github.com/facebook/rocksdb/issues/8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.

Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the
unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.

The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605

Test Plan: Ran `make check` and stress tests locally.

Reviewed By: pdillinger

Differential Revision: D30051035

Pulled By: ltamasi

fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528

4 years agoFix a issue with initializing blob header buffer (#8537)
yangzaorang [Tue, 3 Aug 2021 00:13:36 +0000 (17:13 -0700)]
Fix a issue with initializing blob header buffer (#8537)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8537

Reviewed By: ajkr

Differential Revision: D29838132

Pulled By: jay-zhuang

fbshipit-source-id: e3e78d5f85f240a1800ace417a8b634f74488e41

4 years agoAllow to use a string as a delimiter in StringAppendOperator (#8536)
Mikhail Golubev [Mon, 2 Aug 2021 23:49:54 +0000 (16:49 -0700)]
Allow to use a string as a delimiter in StringAppendOperator (#8536)

Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536

Reviewed By: mrambacher

Differential Revision: D29962120

Pulled By: zhichao-cao

fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f

4 years agoAllow WAL dir to change with db dir (#8582)
mrambacher [Fri, 30 Jul 2021 19:15:04 +0000 (12:15 -0700)]
Allow WAL dir to change with db dir (#8582)

Summary:
Prior to this change, the "wal_dir"  DBOption would always be set (defaults to dbname) when the DBOptions were sanitized.  Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.

After this change, the "wal_dir" option is only set under specific circumstances.  Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname.  Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).

Tests were added to the core and ldb to test that a database could be created and renamed without issue.  Additional tests for various permutations of wal_dir were also added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582

Reviewed By: pdillinger, autopear

Differential Revision: D29881122

Pulled By: mrambacher

fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb

4 years agoSeveral simple local code clean-ups (#8565)
Yanqin Jin [Fri, 30 Jul 2021 19:06:47 +0000 (12:06 -0700)]
Several simple local code clean-ups (#8565)

Summary:
This PR tries to remove some unnecessary checks as well as unreachable code blocks to
improve readability. An obvious non-public API method naming typo is also corrected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565

Test Plan: make check

Reviewed By: lth

Differential Revision: D29963984

Pulled By: riversand963

fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a

4 years agoFix insecure internal API for GetImpl (#8590)
Peter Dillinger [Thu, 29 Jul 2021 21:58:35 +0000 (14:58 -0700)]
Fix insecure internal API for GetImpl (#8590)

Summary:
Calling the GetImpl function could leave reference to a local
callback function in a field of a parameter struct. As this is
performance-critical code, I'm not going to attempt to sanitize this
code too much, but make the existing hack a bit cleaner by reverting
what it overwrites in the input struct.

Added SaveAndRestore utility class to make that easier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590

Test Plan:
added unit test for SaveAndRestore; existing tests for
GetImpl

Reviewed By: riversand963

Differential Revision: D29947983

Pulled By: pdillinger

fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb

4 years agoCreate fillanddeleteuniquerandom benchmark (db_bench), with new option flags. (#8593)
Baptiste Lemaire [Thu, 29 Jul 2021 21:57:03 +0000 (14:57 -0700)]
Create fillanddeleteuniquerandom benchmark (db_bench), with new option flags. (#8593)

Summary:
Introduction of a new `fillanddeleteuniquerandom` benchmark (`db_bench`) with 5 new option flags to simulate a benchmark where the following sequence is repeated multiple times:
"A set of keys S1 is inserted ('`disposable entries`'), then after some delay another set of keys S2 is inserted ('`persistent entries`') and the first set of keys S1 is deleted. S2 artificially represents the insertion of hypothetical results from some undefined computation done on the first set of keys S1. The next sequence can start as soon as the last disposable entry in the set S1 of this sequence is inserted, if the `delay` is non negligible."
New flags:
- `disposable_entries_delete_delay`: minimum delay in microseconds between insertion of the last `disposable` entry, and the start of the insertion of the first `persistent` entry.
- `disposable_entries_batch_size`: number of `disposable` entries inserted at the beginning of each sequence.
- `disposable_entries_value_size`: size of the random `value` string for the `disposable` entries.
- `persistent_entries_batch_size`: number of `persistent` entries inserted at the end of each sequence, right before the deletion of the `disposable` entries starts.
- `persistent_entries_value_size`: size of the random value string for the `persistent` entries.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8593

Reviewed By: pdillinger

Differential Revision: D29974436

Pulled By: bjlemaire

fbshipit-source-id: f578033e5b45e8268ba6fa6f38f4770c2e6e801d

4 years agoDB::GetSortedWalFiles() to ensure file deletion is disabled (#8591)
sdong [Thu, 29 Jul 2021 18:50:00 +0000 (11:50 -0700)]
DB::GetSortedWalFiles() to ensure file deletion is disabled (#8591)

Summary:
If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done.

Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8591

Test Plan: Run all existing tests

Reviewed By: pdillinger

Differential Revision: D29969412

fbshipit-source-id: d5f42b5271608a35b9b07687ce18157d7447b0de

4 years agoSome fixes and enhancements to `ldb repair` (#8544)
Peter Dillinger [Wed, 28 Jul 2021 23:43:16 +0000 (16:43 -0700)]
Some fixes and enhancements to `ldb repair` (#8544)

Summary:
* Basic handling of SST file with just range tombstones rather than
failing assertion about smallest_seqno <= largest_seqno
* Adds --verbose option so that there exists a way to see the INFO
output from Repairer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8544

Test Plan: unit test added, manual testing for --verbose

Reviewed By: ajkr

Differential Revision: D29954805

Pulled By: pdillinger

fbshipit-source-id: 696af25805fc36cc178b04ba6045922a22625fd9

4 years agoFix HISTORY.md for #8518 (#8594)
Jay Zhuang [Wed, 28 Jul 2021 23:09:41 +0000 (16:09 -0700)]
Fix HISTORY.md for #8518 (#8594)

Summary:
PR https://github.com/facebook/rocksdb/issues/8518 merge the change to wrong section.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8594

Reviewed By: riversand963

Differential Revision: D29974565

Pulled By: jay-zhuang

fbshipit-source-id: 51c930d93fbdb406fe31ff73c96548a6f88b9965

4 years agoReplace macros in compaction_iterator.cc with inline functions (#8592)
jimmycleary [Wed, 28 Jul 2021 21:52:35 +0000 (14:52 -0700)]
Replace macros in compaction_iterator.cc with inline functions (#8592)

Summary:
Internal task T96186510.

Created new inline member functions in `CompactionIterator`,
`DefinitelyInSnapshot`, `DefinitelyNotInSnapshot`, and
`InEarliestSnapshot` to replace the macros at the top of
`compaction_iterator.cc`.

Placed the definitions in `compaction_iterator.h` in accordance with
Google's style guide for inline functions. Separated the declarations
and definitions, and only placed the `inline` keyword on the
definitions, in line with ISO CPP recommendations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8592

Test Plan: Ran `make check`.  Successful build and all tests appeared to pass.

Reviewed By: riversand963

Differential Revision: D29966782

Pulled By: jimmycFB

fbshipit-source-id: 3584290bbbabf862e9ab58852281f46d37f58be6

4 years agoAdd experimental mempurge policy flag to db_stress. (#8588)
Baptiste Lemaire [Wed, 28 Jul 2021 20:27:10 +0000 (13:27 -0700)]
Add experimental mempurge policy flag to db_stress. (#8588)

Summary:
Add `experimental_mempurge_policy` flag to `db_stress` and `db_crashtest.py`.
This flag is only read if the `experimental_allow_mempurge` flag is set to `true`. This flag can take the following values: `kAlways`, and `kAlternate` (default).
- `kAlways`: a flush is always redirected to a mempurge. If the mempurge aborts, the a regular flush proceeds.
- `kAlternate`: if one or more of the flush input memtables is an mempurge output memtable, then a flush is performed, else a mempurge is carried out. Similar to kAlways, if a mempurge aborts, the FlushJob proceeds to a regular flush to storage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8588

Reviewed By: pdillinger

Differential Revision: D29934251

Pulled By: bjlemaire

fbshipit-source-id: 90c1debed2029b9915d066914556547507c33dae

4 years agoFix use-after-free on implicit temporary FileOptions (#8571)
Peter Dillinger [Wed, 28 Jul 2021 04:48:22 +0000 (21:48 -0700)]
Fix use-after-free on implicit temporary FileOptions (#8571)

Summary:
FileOptions has an implicit conversion from EnvOptions and some
internal APIs take `const FileOptions&` and save the reference, which is
counter to Google C++ guidelines,

> Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements.

This is at least a problem for repair.cc, which passes an EnvOptions to
TableCache(), which would save a reference to the temporary copy as
FileOptions. This was unfortunately only caught as a side effect of
changes in https://github.com/facebook/rocksdb/issues/8544.

This change fixes the repair.cc case and updates the involved internal
APIs that save a reference to use `const FileOptions*` instead.

Unfortunately, I don't know how to get any of our sanitizers to reliably
report bugs like this, so I can't rule out more existing in our
codebase.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571

Test Plan:
Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on
AWS EC2)

Reviewed By: ajkr

Differential Revision: D29943890

Pulled By: pdillinger

fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9

4 years agoFix missing Handle release in TableCache::GetRangeTombstoneIterator (#8589)
Peter Dillinger [Wed, 28 Jul 2021 04:30:54 +0000 (21:30 -0700)]
Fix missing Handle release in TableCache::GetRangeTombstoneIterator (#8589)

Summary:
This appears to be little used code so not a major bug, but is
blocking https://github.com/facebook/rocksdb/issues/8544

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8589

Test Plan:
Added regression test to the end of
DBRangeDelTest::TableEvictedDuringScan. Without this fix, ASAN reports
memory leak.

Reviewed By: ajkr

Differential Revision: D29943623

Pulled By: pdillinger

fbshipit-source-id: f7115fa6d4440aef83888ff609aa03d09216463b

4 years agoAdd MultiGet to replay (#8577)
Zhichao Cao [Tue, 27 Jul 2021 20:55:15 +0000 (13:55 -0700)]
Add MultiGet to replay (#8577)

Summary:
When the trace contains the MultiGet record, with this PR, it can replay the MultiGet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8577

Test Plan: make check and replay the real trace.

Reviewed By: anand1976

Differential Revision: D29864060

Pulled By: zhichao-cao

fbshipit-source-id: 5288d4fc9b6a3cb331de1e0c635d4e044dcb534a

4 years agoPass extra db_stress args to fbcode crash tests (#8587)
anand76 [Tue, 27 Jul 2021 19:45:59 +0000 (12:45 -0700)]
Pass extra db_stress args to fbcode crash tests (#8587)

Summary:
Allow extra arguments to be passed to db_stress in fbcode crash tests by the ```rocksdb-lego-determinator``` invoker.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8587

Reviewed By: zhichao-cao

Differential Revision: D29940217

Pulled By: anand1976

fbshipit-source-id: 17cbcd2def60eff2a895553f917694496c4742aa

4 years agoMake EventListener into a Customizable Class (#8473)
mrambacher [Tue, 27 Jul 2021 14:46:09 +0000 (07:46 -0700)]
Make EventListener into a Customizable Class (#8473)

Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "".  If there is no name, the listener cannot be loaded from the ObjectRegistry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473

Reviewed By: zhichao-cao

Differential Revision: D29901488

Pulled By: mrambacher

fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254

4 years agoAdd periodic_compaction_seconds option to RocksJava (#8579)
Anatolii Zhmaiev [Tue, 27 Jul 2021 00:32:42 +0000 (17:32 -0700)]
Add periodic_compaction_seconds option to RocksJava (#8579)

Summary:
Fixes https://github.com/facebook/rocksdb/issues/8578

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8579

Reviewed By: ajkr

Differential Revision: D29895081

Pulled By: mrambacher

fbshipit-source-id: 3e4120e26a3e8252f8301d657c0aaa0b8550cddf

4 years agoAdd simple heuristics for experimental mempurge. (#8583)
Baptiste Lemaire [Mon, 26 Jul 2021 18:55:27 +0000 (11:55 -0700)]
Add simple heuristics for experimental mempurge. (#8583)

Summary:
Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE.
`ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput).
`ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios.
This PR also includes support for `experimental_mempurge_policy` in `db_bench`.
Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583

Reviewed By: pdillinger

Differential Revision: D29888050

Pulled By: bjlemaire

fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1

4 years agoDisable DistributedMutex test by default (#8584)
Jay Zhuang [Fri, 23 Jul 2021 22:54:29 +0000 (15:54 -0700)]
Disable DistributedMutex test by default (#8584)

Summary:
DistributedMutex hasn't been used in the code base and enabling
`USE_FOLLY_DISTRIBUTED_MUTEX` only runs the mutex tests from third-party
lib. So disabling it for now.
The implementation may also out of date, should re-sync with folly before
using.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8584

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D29888960

Pulled By: jay-zhuang

fbshipit-source-id: 3e75f73386c6ed03efb96a1400258d602a724f17

4 years agodb_bench_tool.cc: fix copy - paste (#8553)
leipeng [Fri, 23 Jul 2021 21:30:41 +0000 (14:30 -0700)]
db_bench_tool.cc: fix copy - paste (#8553)

Summary:
PR https://github.com/facebook/rocksdb/issues/8519 fix db_bench_tool.cc for MSVC build errors by simply copy-paste, this PR fix the copy-paste while also works for MSVC.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8553

Reviewed By: ajkr

Differential Revision: D29838056

Pulled By: jay-zhuang

fbshipit-source-id: 0cd60c146b87a355c3dc1061dfe813169d75cea4

4 years agoCompactionJob::Install(): fix log truncation (#8563)
leipeng [Fri, 23 Jul 2021 18:38:18 +0000 (11:38 -0700)]
CompactionJob::Install(): fix log truncation (#8563)

Summary:
event log info may be truncated, the default buffer size is 512, this PR changes buffer size to 8192.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8563

Reviewed By: ajkr

Differential Revision: D29838229

Pulled By: jay-zhuang

fbshipit-source-id: 00c5dea3caff0641a209f02c972e92d65b505f50

4 years agoCheckpoint dir options fix (#8572)
Merlin Mao [Fri, 23 Jul 2021 18:11:25 +0000 (11:11 -0700)]
Checkpoint dir options fix (#8572)

Summary:
Originally the 2 options `db_log_dir` and `wal_dir` will be reused in a snapshot db since the options files are just copied. By default, if `wal_dir` was not set when a db was created, it is set to the db's dir. Therefore, the snapshot db will use the same WAL dir. If both the original db and the snapshot db write to or delete from the WAL dir, one may modify or delete files which belong to the other. The same applies to `db_log_dir` as well, but as info log files are not copied or linked, it is simpler for this option.

2 arguments are added to `Checkpoint::CreateCheckpoint()`, allowing to override these 2 options.

`wal_dir`:  If the function argument `wal_dir` is empty, or set to the original db location, or the checkpoint location, the snapshot's `wal_dir` option will be updated to the checkpoint location. Otherwise, the absolute path specified in the argument will be used. During checkpointing, live WAL files will be copied or linked the new location, instead of the current WAL dir specified in the original db.

`db_log_dir`: Same as `wal_dir`, but no files will be copied or linked.

A new unit test was added: `CheckpointTest.CheckpointWithOptionsDirsTest`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8572

Test Plan:
New unit test
```
checkpoint_test --gtest_filter="CheckpointTest.CheckpointWithOptionsDirsTest"
```

Output
```
Note: Google Test filter = CheckpointTest.CheckpointWithOptionsDirsTest
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CheckpointTest
[ RUN      ] CheckpointTest.CheckpointWithOptionsDirsTest
[       OK ] CheckpointTest.CheckpointWithOptionsDirsTest (11712 ms)
[----------] 1 test from CheckpointTest (11712 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11713 ms total)
[  PASSED  ] 1 test.
```
This test will fail without this patch. Just modify the code to remove the 2 arguments introduced in this patch in `CreateCheckpoint()`.

Reviewed By: zhichao-cao

Differential Revision: D29832761

Pulled By: autopear

fbshipit-source-id: e6a639b4d674380df82998c0839e79cab695fe29

4 years agoFix a minor issue with initializing the test path (#8555)
Drewryz [Fri, 23 Jul 2021 15:37:27 +0000 (08:37 -0700)]
Fix a minor issue with initializing the test path (#8555)

Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555

Reviewed By: ajkr

Differential Revision: D29758399

Pulled By: jay-zhuang

fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011

4 years agoRetire superfluous functions introduced in earlier mempurge PRs. (#8558)
Baptiste Lemaire [Fri, 23 Jul 2021 01:26:47 +0000 (18:26 -0700)]
Retire superfluous functions introduced in earlier mempurge PRs. (#8558)

Summary:
The main challenge to make the memtable garbage collection prototype (nicknamed `mempurge`) was to not get rid of WAL files that contain unflushed (but mempurged) data. That was successfully guaranteed by not writing the VersionEdit to the MANIFEST file after a successful mempurge.
By not writing VersionEdits to the `MANIFEST` file after a succesful mempurge operation, we do not change the earliest log file number that contains unflushed data: `cfd->GetLogNumber()` (`cfd->SetLogNumber()` is only called in `VersionSet::ProcessManifestWrites`). As a result, a number of functions introduced earlier just for the mempurge operation are not obscolete/redundant. (e.g.: `FlushJob::ExtractEarliestLogFileNumber`), and this PR aims at cleaning up all these now-unnecessary functions. In particular, we no longer need to store the earliest log file number in the `MemTable` struct itself. This PR therefore also reverts the `MemTable` struct to its original form.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8558

Test Plan: Already included in `db_flush_test.cc`.

Reviewed By: anand1976

Differential Revision: D29764351

Pulled By: bjlemaire

fbshipit-source-id: 0f43b260fa270251862512f397d3f24ee62e8437

4 years agoAnalyze MultiGet in trace_analyzer (#8575)
Zhichao Cao [Thu, 22 Jul 2021 23:51:19 +0000 (16:51 -0700)]
Analyze MultiGet in trace_analyzer (#8575)

Summary:
Now we can analyze the MultiGet queries in the trace file and generate a set of the statistic and analysis files. Note that, when one MultiGet access N keys, we count each sub-get-query individually. But the over all query number is still the MultiGet not the sub-get-query.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8575

Test Plan: added new unit test and make check

Reviewed By: anand1976

Differential Revision: D29860633

Pulled By: zhichao-cao

fbshipit-source-id: a132128527f36828d266df8e36e3ec626c2170be

4 years agoReturn error if trying to open secondary on missing or inaccessible primary (#8200)
Yanqin Jin [Thu, 22 Jul 2021 22:47:35 +0000 (15:47 -0700)]
Return error if trying to open secondary on missing or inaccessible primary (#8200)

Summary:
If the primary's CURRENT file is missing or inaccessible, the secondary should not hang
trying repeatedly to switch to the next MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8200

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D27840627

Pulled By: riversand963

fbshipit-source-id: 071fed97cbab1bc5cdefd1dc235e5cd406c174e1

4 years agoFix an race condition during multiple DB opening (#8574)
Jay Zhuang [Thu, 22 Jul 2021 20:41:48 +0000 (13:41 -0700)]
Fix an race condition during multiple DB opening (#8574)

Summary:
ObjectLibrary is shared between multiple DB instances, the
Register() could have race condition.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8574

Test Plan: pass the failed test

Reviewed By: ajkr

Differential Revision: D29855096

Pulled By: jay-zhuang

fbshipit-source-id: 541eed0bd495d2c963d858d81e7eabf1ba16153c