]> git-server-git.apps.pok.os.sepia.ceph.com Git - rocksdb.git/log
rocksdb.git
4 years agoCreate CNAME
Joel Marcey [Wed, 1 Sep 2021 20:28:13 +0000 (13:28 -0700)]
Create CNAME

4 years agoUpdate branch name to "main" in CircleCI config (#8726)
Levi Tamasi [Wed, 1 Sep 2021 18:56:37 +0000 (11:56 -0700)]
Update branch name to "main" in CircleCI config (#8726)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8726

Reviewed By: jay-zhuang

Differential Revision: D30675182

Pulled By: ltamasi

fbshipit-source-id: d5746931f6d942ed3a9d21325335cfc9e111a7f3

4 years agoMake format-diff.sh branch name agnostic (#8731)
Levi Tamasi [Wed, 1 Sep 2021 18:56:33 +0000 (11:56 -0700)]
Make format-diff.sh branch name agnostic (#8731)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8731

Reviewed By: jay-zhuang

Differential Revision: D30678124

Pulled By: ltamasi

fbshipit-source-id: 0131b6707f0c5d1d887bcd45781623143b5ccae0

4 years agobuild with platform assembler (#8733)
Jay Zhuang [Wed, 1 Sep 2021 17:05:57 +0000 (10:05 -0700)]
build with platform assembler (#8733)

Summary:
Required for platform009, which is incompat with the centos
assembler.
author: pbrady@fb.com D29099768

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8733

Test Plan: CI

Reviewed By: pixelb

Differential Revision: D30687156

Pulled By: jay-zhuang

fbshipit-source-id: 796f90842cbf0ca11bad07e7d654bce1fafc4ba0

4 years agoImplement superior user & mid IO priority level in GenericRateLimiter (#8595)
Hui Xiao [Tue, 31 Aug 2021 17:59:14 +0000 (10:59 -0700)]
Implement superior user & mid IO priority level in GenericRateLimiter (#8595)

Summary:
Context:
An extra IO_USER priority in rate limiter allows users to optionally charge WAL writes / SST reads to rate limiter at this priority level, which then has higher priority than IO_HIGH and IO_LOW. With an extra IO_USER priority, it allows users to better specify the relative urgency/importance among different requests in rate limiter. As a consequence, IO resource management can better prioritize and limit resource based on user's need.

The IO_USER is implemented as superior priority in GenericRateLimiter, in the sense that its request queue will always be iterated first without being constrained to fairness. The reason is that the notion of fairness is only meaningful in helping lower priorities in background IO (i.e, IO_HIGH/MID/LOW) to gain some fair chance to run so that it does not block foreground IO (i.e, the ones that are charged at the level of IO_USER). As we can see, the ultimate goal here is to not blocking foreground IO at IO_USER level, which justifies the superiority of IO_USER.

Similar benefits exist for IO_MID priority.
- Rewrote the logic of deciding the order of iterating request queues of high/low priorities to include the extra user/mid priority w/o affecting the existing behavior (see PR's [comment](https://github.com/facebook/rocksdb/pull/8595/files#r678749331))
- Included the request queue of user-pri/mid-pri in the code path of next-leader-candidate signaling and GenericRateLimiter's destructor
- Included the extra user/mid-pri in bookkeeping data structures: total_bytes_through_ and total_requests_
- Re-written the previous impl of explicitly iterating priorities with a loop from Env::IO_LOW to Env::IO_TOTAL

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8595

Test Plan:
- passed existing rate_limiter_test.cc
- passed added unit tests in rate_limiter_test.cc
- run performance test to verify performance with only high/low requests is not affected by this change
   - Set-up command:
   `TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --duration=5 --compression_type=none --num=100000000 --disable_auto_compactions=true --write_buffer_size=1048576 --writable_file_max_buffer_size=65536 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --level0_slowdown_writes_trigger=$(((1 << 31) - 1)) --level0_stop_writes_trigger=$(((1 << 31) - 1))`

    - Test command:
   `TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=overwrite --use_existing_db=true --disable_wal=true --duration=30 --compression_type=none --num=100000000 --write_buffer_size=1048576 --writable_file_max_buffer_size=65536 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --level0_slowdown_writes_trigger=$(((1 << 31) - 1)) --level0_stop_writes_trigger=$(((1 << 31) - 1)) --statistics=true --rate_limiter_bytes_per_sec=1048576 --rate_limiter_refill_period_us=1000  --threads=32 |& grep -E '(flush|compact)\.write\.bytes'`

   - Before (on branch upstream/master):
   `rocksdb.compact.write.bytes COUNT : 4014162`
   `rocksdb.flush.write.bytes COUNT : 26715832`
    rocksdb.flush.write.bytes/rocksdb.compact.write.bytes ~= 6.66

   - After (on branch rate_limiter_user_pri):
  `rocksdb.compact.write.bytes COUNT : 3807822`
  `rocksdb.flush.write.bytes COUNT : 26098659`
   rocksdb.flush.write.bytes/rocksdb.compact.write.bytes ~= 6.85

Reviewed By: ajkr

Differential Revision: D30577783

Pulled By: hx235

fbshipit-source-id: 0881f2705ffd13ecd331256bde7e8ec874a353f4

4 years agoReplace `std::shared_ptr<SystemClock>` by `SystemClock*` in `TraceExecutionHandler...
Qizhong Mao [Tue, 31 Aug 2021 17:56:08 +0000 (10:56 -0700)]
Replace `std::shared_ptr<SystemClock>` by `SystemClock*` in `TraceExecutionHandler` (#8729)

Summary:
All/most trace related APIs directly use `SystemClock*` (https://github.com/facebook/rocksdb/pull/8033). Do the same in `TraceExecutionHandler`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8729

Test Plan: None

Reviewed By: zhichao-cao

Differential Revision: D30672159

Pulled By: autopear

fbshipit-source-id: 017db4912c6ac1cfede842b8b122cf569a394f25

4 years agoFix a race in LRUCacheShard::Promote (#8717)
anand76 [Tue, 31 Aug 2021 02:09:43 +0000 (19:09 -0700)]
Fix a race in LRUCacheShard::Promote (#8717)

Summary:
In ```LRUCacheShard::Promote```, a reference is released outside the LRU mutex. Fix the race condition.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8717

Reviewed By: zhichao-cao

Differential Revision: D30649206

Pulled By: anand1976

fbshipit-source-id: 09c0af05b2294a7fe2c02876a61b0bad6e3ada61

4 years agoBuilt-in support for generating unique IDs, bug fix (#8708)
Peter Dillinger [Mon, 30 Aug 2021 22:19:39 +0000 (15:19 -0700)]
Built-in support for generating unique IDs, bug fix (#8708)

Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.

This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.

Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
  * Lots of info from default Env like time, process id, and hostname.
  * std::random_device
  * port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.

DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.

GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.

Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708

Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.

Reviewed By: zhichao-cao, mrambacher

Differential Revision: D30563780

Pulled By: pdillinger

fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364

4 years agoUpdate comments, fix typos. (#8721)
Merlin Mao [Fri, 27 Aug 2021 20:15:25 +0000 (13:15 -0700)]
Update comments, fix typos. (#8721)

Summary:
- Removed the default empty constructors of `TraceWriter` and `TraceReader`.
- Removed unused `ReadFooter()` from `ReplayerImpl`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8721

Test Plan: None

Reviewed By: zhichao-cao

Differential Revision: D30609743

Pulled By: autopear

fbshipit-source-id: 7e2626b015bd57ebb408a2836b4b4217cea10002

4 years agoRefactor with VersionBuilder (#8706)
Zaorang Yang [Fri, 27 Aug 2021 19:10:01 +0000 (12:10 -0700)]
Refactor with VersionBuilder (#8706)

Summary:
Introduce a new function to save sst files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8706

Reviewed By: jay-zhuang

Differential Revision: D30544242

Pulled By: riversand963

fbshipit-source-id: 554755852daff7ae1c7864b0029f51b27099ee09

4 years agoFix typo in the comment of log_empty_ (#8711)
James Yin [Fri, 27 Aug 2021 19:09:13 +0000 (12:09 -0700)]
Fix typo in the comment of log_empty_ (#8711)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8711

Reviewed By: riversand963

Differential Revision: D30566761

Pulled By: jay-zhuang

fbshipit-source-id: dd4690f5e2af2d263ed75ea1b9ed24692fe81362

4 years agoAdd 6.24 to the format compatibility checker (#8716)
Levi Tamasi [Thu, 26 Aug 2021 23:33:55 +0000 (16:33 -0700)]
Add 6.24 to the format compatibility checker (#8716)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8716

Reviewed By: jay-zhuang

Differential Revision: D30587722

Pulled By: ltamasi

fbshipit-source-id: d2f5b08084778779c5a8b85635977babd8194d5c

4 years agoFix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714)
anand76 [Thu, 26 Aug 2021 22:39:32 +0000 (15:39 -0700)]
Fix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714)

Summary:
DumpStats() iterates through the ColumnFamilySet. There is a potential
race condition because it does Ref the cfd, and the cfd could get
destroyed during the iteration.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8714

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D30580199

Pulled By: anand1976

fbshipit-source-id: 60a3443ad0d4f7ac6a977dec780e6d2c1b70b850

4 years agoDeflake test `CompactionJobTest.InputSerialization` (#8712)
Jay Zhuang [Thu, 26 Aug 2021 16:26:41 +0000 (09:26 -0700)]
Deflake test `CompactionJobTest.InputSerialization` (#8712)

Summary:
It's invalid to have an empty file name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8712

Test Plan:
```
$ gtest-parallel ./compaction_job_test --gtest_filter=CompactionJobTest.InputSerialization -r 10000
```

Reviewed By: pdillinger

Differential Revision: D30566739

Pulled By: jay-zhuang

fbshipit-source-id: 41e73175e3c95c4b73b4fdcd33470788d4e29d37

4 years agoMake Configurable/Customizable options copyable (#8704)
mrambacher [Thu, 26 Aug 2021 00:46:31 +0000 (17:46 -0700)]
Make Configurable/Customizable options copyable (#8704)

Summary:
The atomic variable "is_prepared_" was keeping Configurable objects from being copy-constructed.  Removed the atomic to allow copies.

Since the variable is only changed from false to true (and never back), there is no reason it had to be atomic.

Added tests that simple Configurable and Customizable objects can be put on the stack and copied.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8704

Reviewed By: anand1976

Differential Revision: D30530526

Pulled By: ltamasi

fbshipit-source-id: 4dd4439b3e5ad7fa396573d0b25d9fb709160576

4 years agoFix legocastle Python commands for CentOS 8 (#8701)
Andrew Kryczka [Wed, 25 Aug 2021 23:52:50 +0000 (16:52 -0700)]
Fix legocastle Python commands for CentOS 8 (#8701)

Summary:
There is no longer an unversioned `python` command that refers to Python
3; the recommended alternative is `/usr/bin/env python3`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8701

Test Plan: - [internal link] https://www.internalfb.com/intern/sandcastle/group/nonce/5100000000000001/

Reviewed By: riversand963

Differential Revision: D30520380

Pulled By: ajkr

fbshipit-source-id: 2af459a64a15fb2a011e98b156f31d322f6b2d25

4 years agoTemporarily disable block-based filter when stress testing timestamp (#8703)
Yanqin Jin [Wed, 25 Aug 2021 02:03:58 +0000 (19:03 -0700)]
Temporarily disable block-based filter when stress testing timestamp (#8703)

Summary:
Current implementation does not support user-defined timestamp when
block-based filter is used. Will implement the support in the future, or
wait to see if block-based filter can be deprecated and removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8703

Test Plan: make whitebox_crash_test_with_ts

Reviewed By: pdillinger

Differential Revision: D30528931

Pulled By: riversand963

fbshipit-source-id: 60dd74ee0a6194e69072069d8c4bd876f249f38d

4 years agoFix a bug of secondary instance sequence going backward (#8653)
Yanqin Jin [Wed, 25 Aug 2021 01:17:32 +0000 (18:17 -0700)]
Fix a bug of secondary instance sequence going backward (#8653)

Summary:
Recent refactor of `ReactiveVersionSet::ReadAndApply()` uses
`ManifestTailer` whose `Iterate()` method can cause the db's
`last_sequence_` to go backward. Consequently, read requests can see
out-dated data. For example, latest changes to the primary will not be
seen on the secondary even after a `TryCatchUpWithPrimary()` if no new
write batches are read from the WALs and no new MANIFEST entries are
read from the MANIFEST.

Fix the bug so that `VersionEditHandler::CheckIterationResult` will
never decrease `last_sequence_`, `last_allocated_sequence_` and
`last_published_sequence_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8653

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30272084

Pulled By: riversand963

fbshipit-source-id: c6a49c534b2509b93ef62d8936ed0acd5b860eaa

4 years agoSimplify `TraceAnalyzer` (#8697)
Merlin Mao [Wed, 25 Aug 2021 01:17:31 +0000 (18:17 -0700)]
Simplify `TraceAnalyzer` (#8697)

Summary:
Handler functions now use a common output function to output to stdout/files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8697

Test Plan: `trace_analyzer_test` can pass.

Reviewed By: zhichao-cao

Differential Revision: D30527696

Pulled By: autopear

fbshipit-source-id: c626cf4d53a39665a9c4bcf0cb019c448434abe4

4 years agoAdd port::GetProcessID() (#8693)
Peter Dillinger [Wed, 25 Aug 2021 00:45:01 +0000 (17:45 -0700)]
Add port::GetProcessID() (#8693)

Summary:
Useful in some places for object uniqueness across processes.
Currently used for generating a host-wide identifier of Cache objects
but expected to be used soon in some unique id generation code.

`int64_t` is chosen for return type because POSIX uses signed integer type,
usually `int`, for `pid_t` and Windows uses `DWORD`, which is `uint32_t`.

Future work: avoid copy-pasted declarations in port_*.h, perhaps with
port_common.h always included from port.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8693

Test Plan: manual for now

Reviewed By: ajkr, anand1976

Differential Revision: D30492876

Pulled By: pdillinger

fbshipit-source-id: 39fc2788623cc9f4787866bdb67a4d183dde7eef

4 years agoAllow iterate refresh for secondary instance (#8700)
Yanqin Jin [Tue, 24 Aug 2021 22:39:31 +0000 (15:39 -0700)]
Allow iterate refresh for secondary instance (#8700)

Summary:
Test plan
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8700

Reviewed By: zhichao-cao

Differential Revision: D30523907

Pulled By: riversand963

fbshipit-source-id: 68928ab4dafb64ce80ab7bc69d83727a4713ab91

4 years agoRefactor WriteBufferManager::CacheRep into CacheReservationManager (#8506)
Hui Xiao [Tue, 24 Aug 2021 19:42:31 +0000 (12:42 -0700)]
Refactor WriteBufferManager::CacheRep into CacheReservationManager (#8506)

Summary:
Context:
To help cap various memory usage by a single limit of the block cache capacity, we charge the memory usage through inserting/releasing dummy entries in the block cache. CacheReservationManager is such a class (non thread-safe) responsible for  inserting/removing dummy entries to reserve cache space for memory used by the class user.

- Refactored the inner private class CacheRep of WriteBufferManager into public CacheReservationManager class for reusability such as for https://github.com/facebook/rocksdb/pull/8428

- Encapsulated implementation details of cache key generation and dummy entries insertion/release in cache reservation as discussed in https://github.com/facebook/rocksdb/pull/8506#discussion_r666550838

- Consolidated increase/decrease cache reservation into one API - UpdateCacheReservation.

- Adjusted the previous dummy entry release algorithm in decreasing cache reservation to be loop-releasing dummy entries to stay symmetric to dummy entry insertion algorithm

- Made the previous dummy entry release algorithm in delayed decrease mode more aggressive for better decreasing cache reservation when memory used is less likely to increase back.

  Previously, the algorithms only release 1 dummy entries when new_mem_used < 3/4 * cache_allocated_size_ and cache_allocated_size_ - kSizeDummyEntry > new_mem_used.
Now, the algorithms loop-releases as many dummy entries as possible when new_mem_used < 3/4 * cache_allocated_size_.

- Updated WriteBufferManager's test cases to adapt to changes on the release algorithm mentioned above and left comment for some test cases for clarity

- Replaced the previous cache key prefix generation (utilizing object address related to the cache client) with one that utilizes Cache->NewID() to prevent cache-key collision among dummy entry clients sharing the same cache.

  The specific collision we are preventing happens when the object address is reused for a new cache-key prefix while the old cache-key using that same object address in its prefix still exists in the cache. This could happen due to that, under LRU cache policy, there is a possible delay in releasing a cache entry after the cache client object owning that cache entry get deallocated. In this case, the object address related to the cache client object can get reused for other client object to generate a new cache-key prefix.

  This prefix generation can be made obsolete after Peter's unification of all the code generating cache key, mentioned in https://github.com/facebook/rocksdb/pull/8506#discussion_r667265255

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8506

Test Plan:
- Passing the added unit tests cache_reservation_manager_test.cc
- Passing existing and adjusted write_buffer_manager_test.cc

Reviewed By: ajkr

Differential Revision: D29644135

Pulled By: hx235

fbshipit-source-id: 0fc93fbfe4a40bb41be85c314f8f2bafa8b741f7

4 years agoDeflake write-prepared and write-unprepared tests (#8696)
Andrew Kryczka [Tue, 24 Aug 2021 06:08:25 +0000 (23:08 -0700)]
Deflake write-prepared and write-unprepared tests (#8696)

Summary:
The `JobContext::job_snapshot` referenced DB state but could
have been deleted by a BG thread after the signal/unlock allowing
shutdown to proceed. Then we would see an error like this (valgrind):

```
==354104== Thread 2:
==354104== Invalid read of size 8
==354104==    at 0x694C4D: rocksdb::ManagedSnapshot::~ManagedSnapshot() (snapshot_impl.cc:20)
==354104==    by 0x58F5BA: operator() (unique_ptr.h:81)
==354104==    by 0x58F5BA: operator() (unique_ptr.h:75)
==354104==    by 0x58F5BA: ~unique_ptr (unique_ptr.h:292)
==354104==    by 0x58F5BA: rocksdb::JobContext::~JobContext() (job_context.h:221)
==354104==    by 0x5F155E: rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) (db_impl_compaction_flush.cc:2696)
==354104==    by 0x5F1BC2: rocksdb::DBImpl::BGWorkCompaction(void*) (db_impl_compaction_flush.cc:2468)
==354104==    by 0x83707A: operator() (std_function.h:688)
==354104==    by 0x83707A: rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) (threadpool_imp.cc:266)
==354104==    by 0x8373ED: rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) (threadpool_imp.cc:307)
==354104==    by 0x492A800: execute_native_thread_routine (in /usr/local/fbcode/platform009/lib/libstdc++.so.6.0.28)
==354104==    by 0x4A5020B: start_thread (in /usr/local/fbcode/platform009/lib/libpthread-2.30.so)
==354104==    by 0x4CF281E: clone (in /usr/local/fbcode/platform009/lib/libc-2.30.so)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8696

Test Plan: unable to repro

Reviewed By: pdillinger

Differential Revision: D30505277

Pulled By: ajkr

fbshipit-source-id: 5a99f34137cd14d06b0f624add6d37a70a61135d

4 years agoRefactor TraceAnalyzer to use `TraceRecord::Handler` to avoid casting. (#8678)
Merlin Mao [Tue, 24 Aug 2021 00:17:13 +0000 (17:17 -0700)]
Refactor TraceAnalyzer to use `TraceRecord::Handler` to avoid casting. (#8678)

Summary:
`TraceAnalyzer` privately inherits `TraceRecord::Handler` and `WriteBatch::Handler`.

`trace_analyzer_test` can pass with this change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8678

Reviewed By: zhichao-cao

Differential Revision: D30459814

Pulled By: autopear

fbshipit-source-id: a27f59ac4600f7c3682830c9b1d9dc79e53425be

4 years agoAdd extra information to RemoteCompaction APIs (#8680)
Jay Zhuang [Mon, 23 Aug 2021 23:26:13 +0000 (16:26 -0700)]
Add extra information to RemoteCompaction APIs (#8680)

Summary:
Currently, we only provide job_id in RemoteCompaction APIs, the
main problem of `job_id` is it cannot uniquely identify a compaction job
between DB instances or between sessions.
Providing DB and session id to the user, which will make building cross
DB compaction service easier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680

Test Plan: unittest

Reviewed By: ajkr

Differential Revision: D30444859

Pulled By: jay-zhuang

fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448

4 years agoAllow intentionally swallowed errors in BlockBasedFilterBlockReader (#8695)
Peter Dillinger [Mon, 23 Aug 2021 22:49:27 +0000 (15:49 -0700)]
Allow intentionally swallowed errors in BlockBasedFilterBlockReader (#8695)

Summary:
To avoid getting "Didn't get expected error from Get" from
crash test by enabling block-based filter in crash test in https://github.com/facebook/rocksdb/issues/8679.
Basically, this applies the pattern of IGNORE_STATUS_IF_ERROR in
full_filter_block.cc to block_based_filter_block.cc

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8695

Test Plan: watch for resolution of crash test runs

Reviewed By: ltamasi

Differential Revision: D30496748

Pulled By: pdillinger

fbshipit-source-id: f7808fcf14c0e787fe81da03fa8303244590d273

4 years agoFix typo in 6.24.0 HISTORY.md (#8694)
Peter Dillinger [Mon, 23 Aug 2021 20:29:06 +0000 (13:29 -0700)]
Fix typo in 6.24.0 HISTORY.md (#8694)

Summary:
fix typo

Also, clarified change of C API signatures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8694

Test Plan: visual

Reviewed By: ltamasi

Differential Revision: D30492882

Pulled By: pdillinger

fbshipit-source-id: ac6dc3dcefa01c91fd87fc7f50279ea5e13fa41d

4 years agoFix LITE build (#8689)
mrambacher [Mon, 23 Aug 2021 12:08:54 +0000 (05:08 -0700)]
Fix LITE build (#8689)

Summary:
Conditional compilation of static functions not used in LITE mode.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8689

Reviewed By: ltamasi

Differential Revision: D30476218

Pulled By: mrambacher

fbshipit-source-id: 5f3af90982d34818f47d2cb1d36dd5816d0333a5

4 years agoUpdate version.h and HISTORY.md for the 6.24 release (#8688)
Levi Tamasi [Sat, 21 Aug 2021 05:27:22 +0000 (22:27 -0700)]
Update version.h and HISTORY.md for the 6.24 release (#8688)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8688

Reviewed By: ajkr, riversand963

Differential Revision: D30467746

Pulled By: ltamasi

fbshipit-source-id: 0fce0d42fe2fe3cb56d7a89607154b3b957f09b6

4 years agoEmbed original file number in SST table properties (#8686)
Peter Dillinger [Sat, 21 Aug 2021 03:39:52 +0000 (20:39 -0700)]
Embed original file number in SST table properties (#8686)

Summary:
I very recently realized that with https://github.com/facebook/rocksdb/issues/8669 we cannot later add
file numbers to external SST files (so that more can share db session
ids for better uniqueness properties), because of forward compatibility.
We would have a version of RocksDB that assumes session IDs are unique
on external SST files and therefore can't really break that invariant in
future files.

This change adds a table property for "orig_file_number" which is
populated by normal SST files and also external SST files generated by
SstFileWriter. SstFileWriter now keeps a db_session_id for life of the
object and increments its own file numbers for embedding in table
properties. (They are arguably "fake" file numbers because these numbers
and not embedded in the file name.)

While updating block_based_table_builder, I removed several unnecessary
fields from Rep, because following the pattern would have created
another unnecessary field.

This change also updates block_based_table_reader to use this new
property when available, which means that for newer SST files, we can
determine the stable/original <db_session_id,file_number> unique
identifier using just the file contents, not the file name. (It's a bit
complicated; detailed comments in block_based_table_reader.)

Also added DB host id to properties listing by sst_dump, which could be
useful in debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8686

Test Plan: majorly overhauled StableCacheKeys test for this change

Reviewed By: zhichao-cao

Differential Revision: D30457742

Pulled By: pdillinger

fbshipit-source-id: 2e5ae7dddeb94fb9d8eac8a928486aed8b8cd445

4 years agoUpgrade xxhash, add Hash128 (#8634)
Peter Dillinger [Sat, 21 Aug 2021 01:40:53 +0000 (18:40 -0700)]
Upgrade xxhash, add Hash128 (#8634)

Summary:
With expected use for a 128-bit hash, xxhash library is
upgraded to current dev (2c611a76f914828bed675f0f342d6c4199ffee1e)
as of Aug 6 so that we can use production version of XXH3_128bits
as new Hash128 function (added in hash128.h).

To make this work, however, we have to carve out the "preview" version
of XXH3 that is used in new SST Bloom and Ribbon filters, since that
will not get maintenance in xxhash releases. I have consolidated all the
relevant code into xxph3.h and made it "inline only" (no .cc file). The
working name for this hash function is changed from XXH3p to XXPH3
(XX Preview Hash) because the latter is easier to get working with no
symbol name conflicts between the headers.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8634

Test Plan:
no expected change in existing functionality. For Hash128,
added some unit tests based on those for Hash64 to ensure some basic
properties and that the values do not change accidentally.

Reviewed By: zhichao-cao

Differential Revision: D30173490

Pulled By: pdillinger

fbshipit-source-id: 06aa542a7a28b353bc2c865b9b2f8bdfe44158e4

4 years agoAdd Bloom/Ribbon hybrid API support (#8679)
Peter Dillinger [Sat, 21 Aug 2021 00:59:24 +0000 (17:59 -0700)]
Add Bloom/Ribbon hybrid API support (#8679)

Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.

So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)

I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.

C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.

BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679

Test Plan: new + updated tests, including crash test

Reviewed By: jay-zhuang

Differential Revision: D30445797

Pulled By: pdillinger

fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1

4 years agoAdd `IteratorTraceExecutionResult` for iterator related trace records. (#8687)
Merlin Mao [Fri, 20 Aug 2021 22:32:55 +0000 (15:32 -0700)]
Add `IteratorTraceExecutionResult` for iterator related trace records. (#8687)

Summary:
- Allow to get `Valid()`, `status()`, `key()` and `value()` of an iterator from `IteratorTraceExecutionResult`.
- Move lower bound and upper bound from `IteratorSeekQueryTraceRecord` to `IteratorQueryTraceRecord`.

Added test in `DBTest2.TraceAndReplay`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8687

Reviewed By: zhichao-cao

Differential Revision: D30457630

Pulled By: autopear

fbshipit-source-id: be433099a25895b3aa6f0c00f95ad7b1d7489c1d

4 years agoAdd a PerfContext counter for secondary cache hits (#8685)
anand76 [Fri, 20 Aug 2021 22:16:33 +0000 (15:16 -0700)]
Add a PerfContext counter for secondary cache hits (#8685)

Summary:
Add a PerfContext counter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8685

Reviewed By: zhichao-cao

Differential Revision: D30453957

Pulled By: anand1976

fbshipit-source-id: 42888a3ced240e1c44446d52d3b04adfb01f5665

4 years agoUpdate the block_read_count/block_read_byte counters in MultiGet (#8676)
anand76 [Fri, 20 Aug 2021 18:49:53 +0000 (11:49 -0700)]
Update the block_read_count/block_read_byte counters in MultiGet (#8676)

Summary:
MultiGet in block based table reader doesn't use BlockFetcher. As a result, the block_read_count and block_read_byte PerfContext counters were not being updated. This fixes that by updating them in MultiRead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8676

Reviewed By: zhichao-cao

Differential Revision: D30428680

Pulled By: anand1976

fbshipit-source-id: 21846efe92588fc17123665dd06733693a40126d

4 years agoFix blob callback in compaction and atomic flush (#8681)
Akanksha Mahajan [Fri, 20 Aug 2021 18:37:53 +0000 (11:37 -0700)]
Fix blob callback in compaction and atomic flush (#8681)

Summary:
Pass BlobFileCompletionCallback  in case of atomic flush and
compaction job which is currently nullptr(default parameter).
BlobFileCompletionCallback is used in case of IntegratedBlobDB to report new blob files to
SstFileManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8681

Test Plan: CircleCI jobs

Reviewed By: ltamasi

Differential Revision: D30445998

Pulled By: akankshamahajan15

fbshipit-source-id: ba48093843864faec57f1f365cce7b5a569c4021

4 years agoAdd iterator's lower and upper bounds to `TraceRecord` (#8677)
Merlin Mao [Fri, 20 Aug 2021 00:26:11 +0000 (17:26 -0700)]
Add iterator's lower and upper bounds to `TraceRecord` (#8677)

Summary:
Trace file V2 added lower/upper bounds to `Iterator::Seek()` and `Iterator::SeekForPrev()`. They were not used anywhere during the execution of a `TraceRecord`. Now they are added to be used by `ReadOptions` during `Iterator::Seek()` and `Iterator::SeekForPrev()` if they are set.

Added test cases in `DBTest2.TraceAndManualReplay`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8677

Reviewed By: zhichao-cao

Differential Revision: D30438255

Pulled By: autopear

fbshipit-source-id: 82563006be0b69155990e506a74951c18af8d288

4 years agoFix some minor issues in the Customizable infrastructure (#8566)
mrambacher [Thu, 19 Aug 2021 17:09:30 +0000 (10:09 -0700)]
Fix some minor issues in the Customizable infrastructure (#8566)

Summary:
- Fix issue with OptionType::Vector when the nested item is a Customizable with no names
- Fix issue with OptionType::Vector to appropriately wrap the elements in a Vector;
- Fix an issue with nested Customizable object with a null immutable object still appearing in the mutable options;
- Fix/Add tests for null/empty customizable objects
- Move the RegisterTestObjects from customizable_test into testutil.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566

Reviewed By: zhichao-cao

Differential Revision: D30303724

Pulled By: mrambacher

fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5

4 years agoAdd condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event...
Baptiste Lemaire [Thu, 19 Aug 2021 00:39:00 +0000 (17:39 -0700)]
Add condition on NotifyOnFlushComplete that FlushJob was not mempurge. Add event listeners to mempurge tests. (#8672)

Summary:
Previously, when a `FlushJob` was redirected to a MemPurge, the function `DBImpl::NotifyOnFlushComplete` was called, which created a series of issues because the JobInfo was not correctly collected from the memtables.
This diff aims at correcting these two issues (`FlushJobInfo` collection in `FlushJob::MemPurge` , no call to `DBImpl::NotifyOnFlushComplete` after successful mempurge).
Event listeners were added to the unit tests to handle these situations.
Surprisingly none of the crashtests caught this issue, I will try to add event listeners to crash tests in the future.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8672

Reviewed By: akankshamahajan15

Differential Revision: D30383109

Pulled By: bjlemaire

fbshipit-source-id: 35a8d4295886923ee4049a6447f00022cb221c73

4 years agoAllow Replayer to report the results of TraceRecords. (#8657)
Merlin Mao [Thu, 19 Aug 2021 00:04:36 +0000 (17:04 -0700)]
Allow Replayer to report the results of TraceRecords. (#8657)

Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.

New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".

`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657

Reviewed By: ajkr

Differential Revision: D30290216

Pulled By: autopear

fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6

4 years agoStable cache keys on ingested SST files (#8669)
Peter Dillinger [Wed, 18 Aug 2021 18:32:00 +0000 (11:32 -0700)]
Stable cache keys on ingested SST files (#8669)

Summary:
Extends https://github.com/facebook/rocksdb/issues/8659 to work for ingested external SST files, even
the same file ingested into different DBs sharing a block cache.

Note: These new cache keys are currently only enabled when FileSystem
does not provide GetUniqueId. For now, they are typically larger,
so slightly less efficient.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8669

Test Plan: Extended unit test

Reviewed By: zhichao-cao

Differential Revision: D30398532

Pulled By: pdillinger

fbshipit-source-id: 1f13e2af4b8bfff5741953a69466e9589fbc23c7

4 years agoFix bug caused by releasing snapshot(s) during compaction (#8608)
Yanqin Jin [Wed, 18 Aug 2021 05:13:21 +0000 (22:13 -0700)]
Fix bug caused by releasing snapshot(s) during compaction (#8608)

Summary:
In debug mode, we are seeing assertion failure as follows

```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```

It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.

In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.

Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.

Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```

It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D30145645

Pulled By: riversand963

fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81

4 years agoAdd statistics support to integrated BlobDB (#8667)
Levi Tamasi [Wed, 18 Aug 2021 00:21:16 +0000 (17:21 -0700)]
Add statistics support to integrated BlobDB (#8667)

Summary:
The patch adds statistics support to the integrated BlobDB implementation,
namely the tickers `BLOB_DB_BLOB_FILE_BYTES_READ` and
`BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED`, and the histograms
`BLOB_DB_(DE)COMPRESSION_MICROS`. (Some other statistics, like
`BLOB_DB_BLOB_FILE_BYTES_WRITTEN`, `BLOB_DB_BLOB_FILE_SYNCED`,
`BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS` were already supported.)
Note that the vast majority of the old BlobDB's tickers/histograms are not
really applicable to the new implementation, since they e.g. pertain to calling
dedicated BlobDB APIs (which the integrated BlobDB does not have) or are
tied to the legacy BlobDB's design of writing blob files synchronously when
a write API is called. Such statistics are marked "legacy BlobDB only" in
`statistics.h`.

Fixes https://github.com/facebook/rocksdb/issues/8645 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8667

Test Plan: Ran `make check` and tested the new statistics using `db_bench`.

Reviewed By: riversand963

Differential Revision: D30356884

Pulled By: ltamasi

fbshipit-source-id: 5f8a833faee60401c5643c2f0a6c0415488190a4

4 years agoExclude property kLiveSstFilesSizeAtTemperature from stress_test (#8668)
Jay Zhuang [Tue, 17 Aug 2021 16:04:09 +0000 (09:04 -0700)]
Exclude property kLiveSstFilesSizeAtTemperature from stress_test (#8668)

Summary:
Just like other per_level properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8668

Test Plan: stress_test

Reviewed By: zhichao-cao

Differential Revision: D30360967

Pulled By: jay-zhuang

fbshipit-source-id: 70da2557b95c55e8081b04ebf1a909a0fe69488f

4 years agoAdd a stat to count secondary cache hits (#8666)
anand76 [Tue, 17 Aug 2021 04:00:17 +0000 (21:00 -0700)]
Add a stat to count secondary cache hits (#8666)

Summary:
Add a stat for secondary cache hits. The ```Cache::Lookup``` API had an unused ```stats``` parameter. This PR uses that to pass the pointer to a ```Statistics``` object that ```LRUCache``` uses to record the stat.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666

Test Plan: Update a unit test in lru_cache_test

Reviewed By: zhichao-cao

Differential Revision: D30353816

Pulled By: anand1976

fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77

4 years agoStable cache keys using DB session ids in SSTs (#8659)
Peter Dillinger [Tue, 17 Aug 2021 03:36:19 +0000 (20:36 -0700)]
Stable cache keys using DB session ids in SSTs (#8659)

Summary:
Use DB session ids in SST table properties to make cache keys
stable across DB re-open and copy / move / restore / etc.

These new cache keys are currently only enabled when FileSystem does not
provide GetUniqueId. For now, they are typically larger, so slightly
less efficient.

Relevant to https://github.com/facebook/rocksdb/issues/7405

This change has a minor regression in PersistentCache functionality:
metaindex blocks are no longer cached in PersistentCache. Table properties
blocks already were not but ideally should be. I didn't spent effort to
fix & test these issues because we don't believe PersistentCache is used much
if at all and expect SecondaryCache to replace it. (Though PRs are welcome.)

FIXME: there is more to be fixed for stable cache keys on external SST files

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8659

Test Plan:
new unit test added, which fails when disabling new
functionality

Reviewed By: zhichao-cao

Differential Revision: D30297705

Pulled By: pdillinger

fbshipit-source-id: e8539a5c8802a79340405629870f2e3fb3822d3a

4 years agoAdd db_test2 to to ASSERT_STATUS_CHECKED (#8640)
Adam Retter [Mon, 16 Aug 2021 15:09:46 +0000 (08:09 -0700)]
Add db_test2 to to ASSERT_STATUS_CHECKED (#8640)

Summary:
This is the `db_test2` parts of https://github.com/facebook/rocksdb/pull/7737 reworked on the latest HEAD.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8640

Reviewed By: akankshamahajan15

Differential Revision: D30303684

Pulled By: mrambacher

fbshipit-source-id: 263e2f82d849bde4048b60aed8b31e7deed4706a

4 years agoSupport dynamic sector size in alignment validation for Windows. (#8613)
Burton Li [Mon, 16 Aug 2021 14:30:57 +0000 (07:30 -0700)]
Support dynamic sector size in alignment validation for Windows. (#8613)

Summary:
- Use dynamic section size when calling IsSectorAligned()
- Support relative path for GetSectorSize().
- Move buffer and sector alignment check to assert for better retail performance.
- Typo fixes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8613

Reviewed By: ajkr

Differential Revision: D30136082

Pulled By: mrambacher

fbshipit-source-id: e8cb849befdcae4fea99de5ed5dd6565e612425f

4 years agoUse non-zero exit codes in benchmark.sh when the benchmark cannot be run (#8554)
Adam Retter [Mon, 16 Aug 2021 13:24:35 +0000 (06:24 -0700)]
Use non-zero exit codes in benchmark.sh when the benchmark cannot be run (#8554)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8554

Reviewed By: ajkr

Differential Revision: D29756562

Pulled By: mrambacher

fbshipit-source-id: ab2f5ef988c8ac7ea7c633e6a3dacaf16f021529

4 years agoAdd property `LiveSstFilesSizeAtTemperature` for tiered storage (#8644)
Jay Zhuang [Sun, 15 Aug 2021 21:16:43 +0000 (14:16 -0700)]
Add property `LiveSstFilesSizeAtTemperature` for tiered storage (#8644)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8644

Reviewed By: siying, zhichao-cao

Differential Revision: D30236535

Pulled By: jay-zhuang

fbshipit-source-id: 1758d1c46d83a5087560fb63d53a016bf999da81

4 years agoImprove MemPurge sampling (#8656)
Baptiste Lemaire [Fri, 13 Aug 2021 21:34:43 +0000 (14:34 -0700)]
Improve MemPurge sampling (#8656)

Summary:
Previously, the `MemPurge` sampling function was assessing whether a random entry from a memtable was garbage or not by simply querying the given memtable (see https://github.com/facebook/rocksdb/issues/8628 for more details).
In this diff, I am updating the sampling function by querying not only the memtable the entry was drawn from, but also all subsequent memtables that have a greater memtable ID.
I also added the size of the value for KV entries in the payload/useful payload estimates (which was also one of the reasons why sampling was not as good as mempurging all the time in terms of L0 SST files reduction).
Once these changes were made, I was able to clean obsolete objects and functions from the `MemtableList` struct, and did a bit of cleanup everywhere.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8656

Reviewed By: pdillinger

Differential Revision: D30288583

Pulled By: bjlemaire

fbshipit-source-id: 7646a545ec56f4715949daa59ab5eee74540feb3

4 years agoCode cleanup for trace replayer (#8652)
Merlin Mao [Thu, 12 Aug 2021 16:21:40 +0000 (09:21 -0700)]
Code cleanup for trace replayer (#8652)

Summary:
- Remove extra `;` in trace_record.h
- Remove some unnecessary `assert` in trace_record_handler.cc
- Initialize `env_` after` exec_handler_` in `ReplayerImpl` to let db be asserted in creating the handler before getting `db->GetEnv()`.
- Update history to include the new `TraceReader::Reset()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652

Reviewed By: ajkr

Differential Revision: D30276872

Pulled By: autopear

fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37

4 years agoMake TraceRecord and Replayer public (#8611)
Merlin Mao [Thu, 12 Aug 2021 02:31:44 +0000 (19:31 -0700)]
Make TraceRecord and Replayer public (#8611)

Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.

User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.

Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611

Reviewed By: ajkr

Differential Revision: D30266329

Pulled By: autopear

fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8

4 years agoRe-add retired mempurge flag definitions for legacy-options-file temporary support...
Baptiste Lemaire [Wed, 11 Aug 2021 23:04:19 +0000 (16:04 -0700)]
Re-add retired mempurge flag definitions for legacy-options-file temporary support. (#8650)

Summary:
Current internal regression tests pass in an old option flag `experimental_allow_mempurge` to a more recently built db.
This flag was retired and removed in a recent PR (https://github.com/facebook/rocksdb/issues/8628), and therefore, the following error comes up : `Failed: Invalid argument: Could not find option: : experimental_allow_mempurge`.
In this PR, I reintroduce the two flags retired in https://github.com/facebook/rocksdb/issues/8628, `experimental_allow_mempurge` and `experimental_mempurge_policy` in `db_options.cc` and mark them both as `kDeprecated`.
This is a temporary fix to save us time to find a long term solution, which hopefully will consist in ignoring options prefixed with `experimental_` that are no longer recognized.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8650

Reviewed By: pdillinger

Differential Revision: D30257307

Pulled By: bjlemaire

fbshipit-source-id: 35303655fd2dd9789fd9e3c450e9d8009f3c1f54

4 years agoUpdate and enhance check_format_compatible.sh (#8651)
Peter Dillinger [Wed, 11 Aug 2021 23:01:27 +0000 (16:01 -0700)]
Update and enhance check_format_compatible.sh (#8651)

Summary:
The last few releases overlooked adding to this test. This
change fixes that.

This change also fixes the problem of older branches not understanding
ROCKSDB_NO_FBCODE and referencing compilers no longer supported.
During the test, build_detect_platform is patched to force no FBCODE
compiler usage. (We should not need to update old branches perpetually.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8651

Test Plan: local run reproduces regression described in https://github.com/facebook/rocksdb/issues/8650

Reviewed By: jay-zhuang, zhichao-cao

Differential Revision: D30261872

Pulled By: pdillinger

fbshipit-source-id: 02b447d224d7e0eb8613c63185437ded146713bc

4 years agoAdd suggestion for btrfs user to disable preallocation (#8646)
Jay Zhuang [Wed, 11 Aug 2021 21:52:46 +0000 (14:52 -0700)]
Add suggestion for btrfs user to disable preallocation (#8646)

Summary:
Add comment for `options.allow_fallocate` that btrfs
preallocated space are not freed and a suggestion to disable
preallocation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8646

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D30240050

Pulled By: jay-zhuang

fbshipit-source-id: 75b7190bc8276ce8d8ac2d0cb9064b386cbf4768

4 years agoMemtable sampling for mempurge heuristic. (#8628)
Baptiste Lemaire [Wed, 11 Aug 2021 01:07:48 +0000 (18:07 -0700)]
Memtable sampling for mempurge heuristic. (#8628)

Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628

Reviewed By: pdillinger

Differential Revision: D30149315

Pulled By: bjlemaire

fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27

4 years agoAttempt to deflake DBTestXactLogIterator.TransactionLogIteratorCorruptedLog (#8627)
Levi Tamasi [Tue, 10 Aug 2021 18:08:34 +0000 (11:08 -0700)]
Attempt to deflake DBTestXactLogIterator.TransactionLogIteratorCorruptedLog (#8627)

Summary:
The patch attempts to deflake `DBTestXactLogIterator.TransactionLogIteratorCorruptedLog`
by disabling file deletions while retrieving the list of WAL files and truncating the first WAL file.
This is to prevent the `PurgeObsoleteFiles` call triggered by `GetSortedWalFiles` from
invalidating the result of `GetSortedWalFiles`. The patch also cleans up the test case a bit
and changes it to using `test::TruncateFile` instead of calling the `truncate` syscall directly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8627

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D30147002

Pulled By: ltamasi

fbshipit-source-id: db11072a4ad8900a2f859cb5294e22b1888c23f6

4 years agoSimplify GenericRateLimiter algorithm (#8602)
Andrew Kryczka [Mon, 9 Aug 2021 23:46:14 +0000 (16:46 -0700)]
Simplify GenericRateLimiter algorithm (#8602)

Summary:
`GenericRateLimiter` slow path handles requests that cannot be satisfied
immediately.  Such requests enter a queue, and their thread stays in `Request()`
until they are granted or the rate limiter is stopped.  These threads are
responsible for unblocking themselves.  The work to do so is split into two main
duties.

(1) Waiting for the next refill time.
(2) Refilling the bytes and granting requests.

Prior to this PR, the slow path logic involved a leader election algorithm to
pick one thread to perform (1) followed by (2).  It elected the thread whose
request was at the front of the highest priority non-empty queue since that
request was most likely to be granted.  This algorithm was efficient in terms of
reducing intermediate wakeups, which is a thread waking up only to resume
waiting after finding its request is not granted.  However, the conceptual
complexity of this algorithm was too high.  It took me a long time to draw a
timeline to understand how it works for just one edge case yet there were so
many.

This PR drops the leader election to reduce conceptual complexity.  Now, the two
duties can be performed by whichever thread acquires the lock first.  The risk
of this change is increasing the number of intermediate wakeups, however, we
took steps to mitigate that.

- `wait_until_refill_pending_` flag ensures only one thread performs (1). This\
prevents the thundering herd problem at the next refill time. The remaining\
threads wait on their condition variable with an unbounded duration -- thus we\
must remember to notify them to ensure forward progress.
- (1) is typically done by a thread at the front of a queue. This is trivial\
when the queues are initially empty as the first choice that arrives must be\
the only entry in its queue. When queues are initially non-empty, we achieve\
this by having (2) notify a thread at the front of a queue (preferring higher\
priority) to perform the next duty.
- We do not require any additional wakeup for (2). Typically it will just be\
done by the thread that finished (1).

Combined, the second and third bullet points above suggest the refill/granting
will typically be done by a request at the front of its queue.  This is
important because one wakeup is saved when a granted request happens to be in an
already running thread.

Note there are a few cases that still lead to intermediate wakeup, however.  The
first two are existing issues that also apply to the old algorithm, however, the
third (including both subpoints) is new.

- No request may be granted (only possible when rate limit dynamically\
decreases).
- Requests from a different queue may be granted.
- (2) may be run by a non-front request thread causing it to not be granted even\
if some requests in that same queue are granted. It can happen for a couple\
(unlikely) reasons.
  - A new request may sneak in and grab the lock at the refill time, before the\
thread finishing (1) can wake up and grab it.
  - A new request may sneak in and grab the lock and execute (1) before (2)'s\
chosen candidate can wake up and grab the lock. Then that non-front request\
thread performing (1) can carry over to perform (2).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602

Test Plan:
- Use existing tests. The edge cases listed in the comment are all performance\
related; I could not really think of any related to correctness. The logic\
looks the same whether a thread wakes up/finishes its work early/on-time/late,\
or whether the thread is chosen vs. "steals" the work.
- Verified write throughput and CPU overhead are basically the same with and\
  without this change, even in a rate limiter heavy workload:

Test command:
```
$ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000
```

Results before this PR:

```
fillrandom   :     108.463 micros/op 9219 ops/sec;    9.0 MB/s
7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k
```

Results after this PR:

```
fillrandom   :     108.108 micros/op 9250 ops/sec;    9.0 MB/s
7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k
```

Reviewed By: hx235

Differential Revision: D30048013

Pulled By: ajkr

fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b

4 years agorocksdb: don't call LZ4_loadDictHC with null dictionary
Lucian Grijincu [Mon, 9 Aug 2021 23:04:26 +0000 (16:04 -0700)]
rocksdb: don't call LZ4_loadDictHC with null dictionary

Summary: UBSAN revealed a pointer underflow when `LZ4HC_init_internal` is called with a null `start`.

Reviewed By: ajkr

Differential Revision: D30181874

fbshipit-source-id: ca9bbac1a85c58782871d7f153af733b000cc66c

4 years agoAdd an unittest for tiered storage universal compaction (#8631)
Jay Zhuang [Mon, 9 Aug 2021 20:43:18 +0000 (13:43 -0700)]
Add an unittest for tiered storage universal compaction (#8631)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8631

Reviewed By: siying

Differential Revision: D30200385

Pulled By: jay-zhuang

fbshipit-source-id: 0fa2bb15e74ff81762d767f234078e0fe0106c55

4 years agoMove old files to warm tier in FIFO compactions (#8310)
sdong [Mon, 9 Aug 2021 19:50:19 +0000 (12:50 -0700)]
Move old files to warm tier in FIFO compactions (#8310)

Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310

Test Plan: Add several unit tests.

Reviewed By: ajkr

Differential Revision: D28493792

fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501

4 years agoFix db_stress failure (#8632)
Akanksha Mahajan [Sat, 7 Aug 2021 16:20:11 +0000 (09:20 -0700)]
Fix db_stress failure (#8632)

Summary:
FaultInjectionTestFS injects error in Rename operation. Because
of injected error, info.log fails to be created if rename  returns error and info_log is set to nullptr which leads to this assertion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8632

Test Plan: run the db_stress job locally

Reviewed By: ajkr

Differential Revision: D30167387

Pulled By: akankshamahajan15

fbshipit-source-id: 8d08c4c33e8f0cabd368bbb498d21b9de0660067

4 years agoAdd more C bindings for OptimisticTransactionDB (#8526)
Roy Crihfield [Sat, 7 Aug 2021 02:05:32 +0000 (19:05 -0700)]
Add more C bindings for OptimisticTransactionDB (#8526)

Summary:
* `rocksdb_optimistictransactiondb_checkpoint_object_create`
* `rocksdb_optimistictransactiondb_write`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8526

Reviewed By: ajkr

Differential Revision: D30076822

Pulled By: jay-zhuang

fbshipit-source-id: a59956a8d5449e75d39a8087fbb2bad148cf697d

4 years agoPrevent joining detached thread in ThreadPoolImpl (#8635)
Andrew Kryczka [Sat, 7 Aug 2021 02:05:01 +0000 (19:05 -0700)]
Prevent joining detached thread in ThreadPoolImpl (#8635)

Summary:
This draining mechanism should not be run during `JoinThreads()` because it can detach threads that will be joined. Joining detached threads would throw an exception.

With this PR, we skip draining when `JoinThreads()` has already decided what threads to `join()`, so the threads will exit naturally once the work queue empties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8635

Test Plan: verified it unblocked using `WaitForJobsAndJoinAllThreads()` in https://github.com/facebook/rocksdb/issues/8611.

Reviewed By: riversand963

Differential Revision: D30174587

Pulled By: ajkr

fbshipit-source-id: 144966398a607987e0763c7152a0f653fdbf3c8b

4 years agoFix the sorting of KeyContexts for batched MultiGet (#8633)
Levi Tamasi [Fri, 6 Aug 2021 23:26:04 +0000 (16:26 -0700)]
Fix the sorting of KeyContexts for batched MultiGet (#8633)

Summary:
`CompareKeyContext::operator()` on the trunk has a bug: when comparing
column family IDs, `lhs` is used for both sides of the comparison. This
results in the `KeyContext`s getting sorted solely based on key, which
in turn means that keys with the same column family do not necessarily
form a single range in the sorted list. This violates an assumption of the
batched `MultiGet` logic, leading to the same column family
showing up multiple times in the list of `MultiGetColumnFamilyData`.
The end result is the code attempting to check out the thread-local
`SuperVersion` for the same CF multiple times, causing an
assertion violation in debug builds and memory corruption/crash in
release builds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8633

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D30169182

Pulled By: ltamasi

fbshipit-source-id: a47710652df7e95b14b40fb710924c11a8478023

4 years agoFix the wrong comment of level compaction cf paths test (#8533)
Zaorang Yang [Fri, 6 Aug 2021 22:26:06 +0000 (15:26 -0700)]
Fix the wrong comment of level compaction cf paths test (#8533)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8533

Reviewed By: ajkr

Differential Revision: D29718067

fbshipit-source-id: b4b91c9271362e7a7d47ddbaf28f56fb537cc668

4 years agoRemove unused variable - run_had_errors (#8599)
Peter (Stig) Edwards [Fri, 6 Aug 2021 21:45:39 +0000 (14:45 -0700)]
Remove unused variable - run_had_errors (#8599)

Summary:
Unused since https://github.com/facebook/rocksdb/commit/ab718b415fc9b2a66a2ed642c18803f764839d7b .
Noticed on https://lgtm.com/projects/g/facebook/rocksdb/snapshot/b215f1a83226f111ff52305987af93564272b7d3/files/tools/db_crashtest.py?sort=name&dir=ASC&mode=heatmap#xf254f528ad18f108:1

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8599

Reviewed By: ajkr

Differential Revision: D30057041

Pulled By: zhichao-cao

fbshipit-source-id: e80438cf9717086d2bf67461e19393d426a7676e

4 years agoUpdate benchmark.sh (#8615)
HappyUncle [Fri, 6 Aug 2021 21:34:37 +0000 (14:34 -0700)]
Update benchmark.sh (#8615)

Summary:
Fix help message.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8615

Reviewed By: siying

Differential Revision: D30136092

Pulled By: mrambacher

fbshipit-source-id: edf4112570514d709560baaf96a47c5f36f00665

4 years agoMake backup restore atomic, with sync option (#8568)
Peter Dillinger [Fri, 6 Aug 2021 16:48:53 +0000 (09:48 -0700)]
Make backup restore atomic, with sync option (#8568)

Summary:
Guarantees that if a restore is interrupted, DB::Open will fail. This works by
restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT.

Also makes restore respect BackupEngineOptions::sync (default true). When set,
the restore is guaranteed persisted by the time it returns OK. Also makes the above
atomicity guarantee work in case the interruption is power loss or OS crash (not just
process interruption or crash).

Fixes https://github.com/facebook/rocksdb/issues/8500

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568

Test Plan:
added to backup mini-stress unit test. Passes with
gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp)

Reviewed By: akankshamahajan15

Differential Revision: D29812605

Pulled By: pdillinger

fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e

4 years agoCorrect javadoc for Env#setBackgroundThreads(int) (#8576)
Brendan MacDonell [Fri, 6 Aug 2021 15:51:05 +0000 (08:51 -0700)]
Correct javadoc for Env#setBackgroundThreads(int) (#8576)

Summary:
By default, the low priority pool is not the flush pool, so calling `Env#setBackgroundThreads` without providing a priority will not do what the caller expected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8576

Reviewed By: ajkr

Differential Revision: D29925154

Pulled By: mrambacher

fbshipit-source-id: cd7211fc374e7d9929a9b88ea0a5ba8134b76099

4 years agoMake MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481)
mrambacher [Fri, 6 Aug 2021 15:26:23 +0000 (08:26 -0700)]
Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481)

Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
 - Added Options/Configurable/Object Registration for TTL and Cassandra variants
 - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char.  Made the delimiter into a configurable option
 - Added tests for new functionality

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481

Reviewed By: zhichao-cao

Differential Revision: D30136050

Pulled By: mrambacher

fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af

4 years agoDynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620)
Akanksha Mahajan [Fri, 6 Aug 2021 02:43:44 +0000 (19:43 -0700)]
Dynamically configure BlockBasedTableOptions.prepopulate_block_cache (#8620)

Summary:
Dynamically configure BlockBasedTableOptions.prepopulate_block_cache using DB::SetOptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8620

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D30091319

Pulled By: akankshamahajan15

fbshipit-source-id: fb586d1848a8dd525bba7b2f9eeac34f2fc6d82c

4 years agoAttempt to deflake ObsoleteFilesTest.DeleteObsoleteOptionsFile (#8624)
Levi Tamasi [Fri, 6 Aug 2021 01:35:02 +0000 (18:35 -0700)]
Attempt to deflake ObsoleteFilesTest.DeleteObsoleteOptionsFile (#8624)

Summary:
We've been seeing occasional crashes on CI while inserting into the
vectors in `ObsoleteFilesTest.DeleteObsoleteOptionsFile`. The crashes
don't reproduce locally (could be either a race or an object lifecycle
issue) but the good news is that the vectors in question are not really
used for anything meaningful by the test. (The assertion about the sizes
of the two vectors being equal is guaranteed to hold, since the two sync
points where they are populated are right after each other.) The patch
simply removes the vectors from the test, alongside the associated
callbacks and sync points.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8624

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D30118485

Pulled By: ltamasi

fbshipit-source-id: 0a4c3d06584e84cd2b1dcc212d274fa1b89cb647

4 years agoUpdate HISTORY for PR8585 (#8623)
Yanqin Jin [Thu, 5 Aug 2021 01:44:53 +0000 (18:44 -0700)]
Update HISTORY for PR8585 (#8623)

Summary:
Update HISTORY.md for PR https://github.com/facebook/rocksdb/issues/8585 .

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8623

Reviewed By: ltamasi

Differential Revision: D30121910

Pulled By: riversand963

fbshipit-source-id: 525af43fad908a498f22ed4f934ec5cbf60e6d25

4 years agoDo not attempt to rename non-existent info log (#8622)
Andrew Kryczka [Thu, 5 Aug 2021 00:24:06 +0000 (17:24 -0700)]
Do not attempt to rename non-existent info log (#8622)

Summary:
Previously we attempted to rename "LOG" to "LOG.old.*" without checking
its existence first. "LOG" had no reason to exist in a new DB.

Errors in renaming a non-existent "LOG" were swallowed via
`PermitUncheckedError()` so things worked. However the storage service's
error monitoring was detecting all these benign rename failures. So it
is better to fix it. Also with this PR we can now distinguish rename failure
for other reasons and return them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8622

Test Plan: new unit test

Reviewed By: akankshamahajan15

Differential Revision: D30115189

Pulled By: ajkr

fbshipit-source-id: e2f337ffb2bd171be0203172abc8e16e7809b170

4 years agoFix clang failure (#8621)
Akanksha Mahajan [Thu, 5 Aug 2021 00:11:47 +0000 (17:11 -0700)]
Fix clang failure (#8621)

Summary:
Fixed clang failure because of memory leak

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8621

Test Plan: CircleCI clang job

Reviewed By: pdillinger

Differential Revision: D30114337

Pulled By: akankshamahajan15

fbshipit-source-id: 16572b9bcbaa053c2ab7bc1c344148d0e6f8039c

4 years agoRemove corruption error injection in FaultInjectionTestFS (#8616)
anand76 [Wed, 4 Aug 2021 22:48:10 +0000 (15:48 -0700)]
Remove corruption error injection in FaultInjectionTestFS (#8616)

Summary:
```FaultInjectionTestFS``` injects various types of read errors in ```FileSystem``` APIs. One type of error is corruption errors, where data is intentionally corrupted or truncated. There is corresponding validation in db_stress to verify that an injected error results in a user visible Get/MultiGet error. However, for corruption errors, its hard to know when a corruption is supposed to be detected by the user request, due to prefetching and, in case of direct IO, padding. This results in false positives. So remove that functionality.

Block checksum validation for Get/MultiGet is confined to ```BlockFetcher```, so we don't lose a lot by disabling this since its a small surface area to test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8616

Reviewed By: zhichao-cao

Differential Revision: D30074422

Pulled By: anand1976

fbshipit-source-id: 6a61fac18f95514c15364b75013799ddf83294df

4 years agoImprove rate limiter implementation's readability (#8596)
hx235 [Wed, 4 Aug 2021 17:42:49 +0000 (10:42 -0700)]
Improve rate limiter implementation's readability (#8596)

Summary:
Context:
As need for new feature of resource management using RocksDB's rate limiter like [https://github.com/facebook/rocksdb/issues/8595](https://github.com/facebook/rocksdb/pull/8595) arises, it is about time to re-learn our rate limiter and make this learning process easier for others by improving its readability. The comment/assertion/one extra else-branch are added based on my best understanding toward the rate_limiter.cc and rate_limiter_test.cc up to date after giving it a hard read.
- Add code comments/assertion/one extra else-branch (that is not affecting existing behavior, see PR comment) to describe how leader-election works under multi-thread settings in GenericRateLimiter::Request()
- Add code comments to describe a non-obvious trick during clean-up of rate limiter destructor
- Add code comments to explain more about the starvation being fixed in GenericRateLimiter::Refill() through partial byte-granting
- Add code comments to the rate limiter's setup in a complicated unit test in rate_limiter_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8596

Test Plan: - passed existing rate_limiter_test.cc

Reviewed By: ajkr

Differential Revision: D29982590

Pulled By: hx235

fbshipit-source-id: c3592986bb5b0c90d8229fe44f425251ec7e8a0a

4 years agoMention PR 8605 in HISTORY.md (#8619)
Levi Tamasi [Tue, 3 Aug 2021 23:13:41 +0000 (16:13 -0700)]
Mention PR 8605 in HISTORY.md (#8619)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8619

Reviewed By: riversand963

Differential Revision: D30081937

Pulled By: ltamasi

fbshipit-source-id: 57505957ae2c22d4b194aa28cb3fd261b3b39919

4 years agoFix NotifyOnFlushCompleted() for atomic flush (#8585)
Yanqin Jin [Tue, 3 Aug 2021 20:30:05 +0000 (13:30 -0700)]
Fix NotifyOnFlushCompleted() for atomic flush (#8585)

Summary:
PR https://github.com/facebook/rocksdb/issues/5908 added `flush_jobs_info_` to `FlushJob` to make sure
`OnFlushCompleted()` is called after committing flush results to
MANIFEST. However, `flush_jobs_info_` is not updated in atomic
flush, causing `NotifyOnFlushCompleted()` to skip `OnFlushCompleted()`.

This PR fixes this, in a similar way to https://github.com/facebook/rocksdb/issues/5908 that handles regular flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8585

Test Plan: make check

Reviewed By: jay-zhuang

Differential Revision: D29913720

Pulled By: riversand963

fbshipit-source-id: 4ff023c98372fa2c93188d4a5c8a4e9ffa0f4dda

4 years agoCache warming blocks during flush (#8561)
Akanksha Mahajan [Tue, 3 Aug 2021 19:42:22 +0000 (12:42 -0700)]
Cache warming blocks during flush (#8561)

Summary:
Insert warm blocks  (data, uncompressed dict, index and filter blocks) during flush in Block cache which is enabled under option BlockBasedTableOptions.prepopulate_block_cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8561

Test Plan: Added unit test

Reviewed By: anand1976

Differential Revision: D29773411

Pulled By: akankshamahajan15

fbshipit-source-id: 6631123c10134340ef0bd7e90baafaa6deba0e66

4 years agoFix db stress crash mempurge (#8604)
Baptiste Lemaire [Tue, 3 Aug 2021 03:25:39 +0000 (20:25 -0700)]
Fix db stress crash mempurge (#8604)

Summary:
The db_stress crash was caused by a call to `IsFlushPending()` made by a stats function which triggered an `assert([false])`, which I didn't plan when I created the `trigger_flush` bool. It turns out that this bool variable is not useful: I created it because I thought the `imm_flush_needed` atomic bool would actually trigger a flush.
It turns out that this bool is only checked in `IsFlushPending` - this is its only use - and a flush is triggered by either a background thread checking on the imm array, or by an explicit call to `SchedulePendingFlush` which creates a flush request, that is then added to a flush request queue.
In this PR, I reverted the MemtableList::Add function to what it was before my changes.
I tested the fix by running the exact command line that deterministically triggered the assert error (see below), which confirmed that this is where the error was coming from.
I also run `db_crashtest.py whitebox` and `blackbox` for a couple hours locally before committing this PR.
Experiment run:

```./db_stress --acquire_snapshot_one_in=0 --allow_concurrent_memtable_write=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=76.90653425292307 --bottommost_compression_type=disable --cache_index_and_filter_blocks=1 --cache_size=1048576 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_ttl=2 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --enable_compaction_filter=1 --enable_pipelined_write=0 --expected_values_path=/dev/shm/rocksdb/rocksdb_crashtest_expected --experimental_allow_mempurge=1 --experimental_mempurge_policy=kAlternate --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --long_running_snapshots=1 --mark_for_compaction_one_file_in=10 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=-1 --prefixpercent=0 --progress_reports=0 --read_fault_one_in=0 --readpercent=60 --recycle_log_file_num=1 --reopen=20 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync=1 --sync_fault_injection=False --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=3 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_ribbon_filter=1 --user_timestamp_size=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8604

Reviewed By: pdillinger

Differential Revision: D30047295

Pulled By: bjlemaire

fbshipit-source-id: b9e379bfa3d6b9bd2b275725fb0bca4bd81a3dbe

4 years agoRevert checkpoint fix (#8607)
Merlin Mao [Tue, 3 Aug 2021 01:27:11 +0000 (18:27 -0700)]
Revert checkpoint fix (#8607)

Summary:
PR https://github.com/facebook/rocksdb/pull/8572 looses custom types in the options file. Need more API changes to fix this issue. Revert this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8607

Reviewed By: ajkr

Differential Revision: D30058289

Pulled By: autopear

fbshipit-source-id: 78f5a154c0bf193e8441bae4a36fa79b95277fd4

4 years agoFix a race in ColumnFamilyData::UnrefAndTryDelete (#8605)
Levi Tamasi [Tue, 3 Aug 2021 01:10:57 +0000 (18:10 -0700)]
Fix a race in ColumnFamilyData::UnrefAndTryDelete (#8605)

Summary:
The `ColumnFamilyData::UnrefAndTryDelete` code currently on the trunk
unlocks the DB mutex before destroying the `ThreadLocalPtr` holding
the per-thread `SuperVersion` pointers when the only remaining reference
is the back reference from `super_version_`. The idea behind this was to
break the circular dependency between `ColumnFamilyData` and `SuperVersion`:
when the penultimate reference goes away, `ColumnFamilyData` can clean up
the `SuperVersion`, which can in turn clean up `ColumnFamilyData`. (Assuming there
is a `SuperVersion` and it is not referenced by anything else.) However,
unlocking the mutex throws a wrench in this plan by making it possible for another thread
to jump in and take another reference to the `ColumnFamilyData`, keeping the
object alive in a zombie `ThreadLocalPtr`-less state. This can cause issues like
https://github.com/facebook/rocksdb/issues/8440 ,
https://github.com/facebook/rocksdb/issues/8382 ,
and might also explain the `was_last_ref` assertion failures from the `ColumnFamilySet`
destructor we sometimes observe during close in our stress tests.

Digging through the archives, this unlocking goes way back to 2014 (or earlier). The original
rationale was that `SuperVersionUnrefHandle` used to lock the mutex so it can call
`SuperVersion::Cleanup`; however, this logic turned out to be deadlock-prone.
https://github.com/facebook/rocksdb/pull/3510 fixed the deadlock but left the
unlocking in place. https://github.com/facebook/rocksdb/pull/6147 then introduced
the circular dependency and associated cleanup logic described above (in order
to enable iterators to keep the `ColumnFamilyData` for dropped column families alive),
and moved the unlocking-relocking snippet to its present location in `UnrefAndTryDelete`.
Finally, https://github.com/facebook/rocksdb/pull/7749 fixed a memory leak but
apparently exacerbated the race by (otherwise correctly) switching to `UnrefAndTryDelete`
in `SuperVersion::Cleanup`.

The patch simply eliminates the unlocking and relocking, which has been unnecessary
ever since https://github.com/facebook/rocksdb/issues/3510 made `SuperVersionUnrefHandle` lock-free.
This closes the window during which another thread could increase the reference count,
and hopefully fixes the issues above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8605

Test Plan: Ran `make check` and stress tests locally.

Reviewed By: pdillinger

Differential Revision: D30051035

Pulled By: ltamasi

fbshipit-source-id: 8fe559e4b4ad69fc142579f8bc393ef525918528

4 years agoFix a issue with initializing blob header buffer (#8537)
yangzaorang [Tue, 3 Aug 2021 00:13:36 +0000 (17:13 -0700)]
Fix a issue with initializing blob header buffer (#8537)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8537

Reviewed By: ajkr

Differential Revision: D29838132

Pulled By: jay-zhuang

fbshipit-source-id: e3e78d5f85f240a1800ace417a8b634f74488e41

4 years agoAllow to use a string as a delimiter in StringAppendOperator (#8536)
Mikhail Golubev [Mon, 2 Aug 2021 23:49:54 +0000 (16:49 -0700)]
Allow to use a string as a delimiter in StringAppendOperator (#8536)

Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536

Reviewed By: mrambacher

Differential Revision: D29962120

Pulled By: zhichao-cao

fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f

4 years agoAllow WAL dir to change with db dir (#8582)
mrambacher [Fri, 30 Jul 2021 19:15:04 +0000 (12:15 -0700)]
Allow WAL dir to change with db dir (#8582)

Summary:
Prior to this change, the "wal_dir"  DBOption would always be set (defaults to dbname) when the DBOptions were sanitized.  Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.

After this change, the "wal_dir" option is only set under specific circumstances.  Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname.  Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).

Tests were added to the core and ldb to test that a database could be created and renamed without issue.  Additional tests for various permutations of wal_dir were also added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582

Reviewed By: pdillinger, autopear

Differential Revision: D29881122

Pulled By: mrambacher

fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb

4 years agoSeveral simple local code clean-ups (#8565)
Yanqin Jin [Fri, 30 Jul 2021 19:06:47 +0000 (12:06 -0700)]
Several simple local code clean-ups (#8565)

Summary:
This PR tries to remove some unnecessary checks as well as unreachable code blocks to
improve readability. An obvious non-public API method naming typo is also corrected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565

Test Plan: make check

Reviewed By: lth

Differential Revision: D29963984

Pulled By: riversand963

fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a

4 years agoFix insecure internal API for GetImpl (#8590)
Peter Dillinger [Thu, 29 Jul 2021 21:58:35 +0000 (14:58 -0700)]
Fix insecure internal API for GetImpl (#8590)

Summary:
Calling the GetImpl function could leave reference to a local
callback function in a field of a parameter struct. As this is
performance-critical code, I'm not going to attempt to sanitize this
code too much, but make the existing hack a bit cleaner by reverting
what it overwrites in the input struct.

Added SaveAndRestore utility class to make that easier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8590

Test Plan:
added unit test for SaveAndRestore; existing tests for
GetImpl

Reviewed By: riversand963

Differential Revision: D29947983

Pulled By: pdillinger

fbshipit-source-id: 2f608853f970bc06724e834cc84dcc4b8599ddeb

4 years agoCreate fillanddeleteuniquerandom benchmark (db_bench), with new option flags. (#8593)
Baptiste Lemaire [Thu, 29 Jul 2021 21:57:03 +0000 (14:57 -0700)]
Create fillanddeleteuniquerandom benchmark (db_bench), with new option flags. (#8593)

Summary:
Introduction of a new `fillanddeleteuniquerandom` benchmark (`db_bench`) with 5 new option flags to simulate a benchmark where the following sequence is repeated multiple times:
"A set of keys S1 is inserted ('`disposable entries`'), then after some delay another set of keys S2 is inserted ('`persistent entries`') and the first set of keys S1 is deleted. S2 artificially represents the insertion of hypothetical results from some undefined computation done on the first set of keys S1. The next sequence can start as soon as the last disposable entry in the set S1 of this sequence is inserted, if the `delay` is non negligible."
New flags:
- `disposable_entries_delete_delay`: minimum delay in microseconds between insertion of the last `disposable` entry, and the start of the insertion of the first `persistent` entry.
- `disposable_entries_batch_size`: number of `disposable` entries inserted at the beginning of each sequence.
- `disposable_entries_value_size`: size of the random `value` string for the `disposable` entries.
- `persistent_entries_batch_size`: number of `persistent` entries inserted at the end of each sequence, right before the deletion of the `disposable` entries starts.
- `persistent_entries_value_size`: size of the random value string for the `persistent` entries.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8593

Reviewed By: pdillinger

Differential Revision: D29974436

Pulled By: bjlemaire

fbshipit-source-id: f578033e5b45e8268ba6fa6f38f4770c2e6e801d

4 years agoDB::GetSortedWalFiles() to ensure file deletion is disabled (#8591)
sdong [Thu, 29 Jul 2021 18:50:00 +0000 (11:50 -0700)]
DB::GetSortedWalFiles() to ensure file deletion is disabled (#8591)

Summary:
If DB::GetSortedWalFiles() runs without file deletion disbled, file might get deleted in the middle and error is returned to users. It makes the function hard to use. Fix it by disabling file deletion if it is not done.

Fix another minor issue of logging within DB mutex, which should not be done unless a major failure happens.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8591

Test Plan: Run all existing tests

Reviewed By: pdillinger

Differential Revision: D29969412

fbshipit-source-id: d5f42b5271608a35b9b07687ce18157d7447b0de

4 years agoSome fixes and enhancements to `ldb repair` (#8544)
Peter Dillinger [Wed, 28 Jul 2021 23:43:16 +0000 (16:43 -0700)]
Some fixes and enhancements to `ldb repair` (#8544)

Summary:
* Basic handling of SST file with just range tombstones rather than
failing assertion about smallest_seqno <= largest_seqno
* Adds --verbose option so that there exists a way to see the INFO
output from Repairer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8544

Test Plan: unit test added, manual testing for --verbose

Reviewed By: ajkr

Differential Revision: D29954805

Pulled By: pdillinger

fbshipit-source-id: 696af25805fc36cc178b04ba6045922a22625fd9

4 years agoFix HISTORY.md for #8518 (#8594)
Jay Zhuang [Wed, 28 Jul 2021 23:09:41 +0000 (16:09 -0700)]
Fix HISTORY.md for #8518 (#8594)

Summary:
PR https://github.com/facebook/rocksdb/issues/8518 merge the change to wrong section.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8594

Reviewed By: riversand963

Differential Revision: D29974565

Pulled By: jay-zhuang

fbshipit-source-id: 51c930d93fbdb406fe31ff73c96548a6f88b9965

4 years agoReplace macros in compaction_iterator.cc with inline functions (#8592)
jimmycleary [Wed, 28 Jul 2021 21:52:35 +0000 (14:52 -0700)]
Replace macros in compaction_iterator.cc with inline functions (#8592)

Summary:
Internal task T96186510.

Created new inline member functions in `CompactionIterator`,
`DefinitelyInSnapshot`, `DefinitelyNotInSnapshot`, and
`InEarliestSnapshot` to replace the macros at the top of
`compaction_iterator.cc`.

Placed the definitions in `compaction_iterator.h` in accordance with
Google's style guide for inline functions. Separated the declarations
and definitions, and only placed the `inline` keyword on the
definitions, in line with ISO CPP recommendations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8592

Test Plan: Ran `make check`.  Successful build and all tests appeared to pass.

Reviewed By: riversand963

Differential Revision: D29966782

Pulled By: jimmycFB

fbshipit-source-id: 3584290bbbabf862e9ab58852281f46d37f58be6

4 years agoAdd experimental mempurge policy flag to db_stress. (#8588)
Baptiste Lemaire [Wed, 28 Jul 2021 20:27:10 +0000 (13:27 -0700)]
Add experimental mempurge policy flag to db_stress. (#8588)

Summary:
Add `experimental_mempurge_policy` flag to `db_stress` and `db_crashtest.py`.
This flag is only read if the `experimental_allow_mempurge` flag is set to `true`. This flag can take the following values: `kAlways`, and `kAlternate` (default).
- `kAlways`: a flush is always redirected to a mempurge. If the mempurge aborts, the a regular flush proceeds.
- `kAlternate`: if one or more of the flush input memtables is an mempurge output memtable, then a flush is performed, else a mempurge is carried out. Similar to kAlways, if a mempurge aborts, the FlushJob proceeds to a regular flush to storage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8588

Reviewed By: pdillinger

Differential Revision: D29934251

Pulled By: bjlemaire

fbshipit-source-id: 90c1debed2029b9915d066914556547507c33dae

4 years agoFix use-after-free on implicit temporary FileOptions (#8571)
Peter Dillinger [Wed, 28 Jul 2021 04:48:22 +0000 (21:48 -0700)]
Fix use-after-free on implicit temporary FileOptions (#8571)

Summary:
FileOptions has an implicit conversion from EnvOptions and some
internal APIs take `const FileOptions&` and save the reference, which is
counter to Google C++ guidelines,

> Avoid defining functions that require a const reference parameter to outlive the call, because const reference parameters bind to temporaries. Instead, find a way to eliminate the lifetime requirement (for example, by copying the parameter), or pass it by const pointer and document the lifetime and non-null requirements.

This is at least a problem for repair.cc, which passes an EnvOptions to
TableCache(), which would save a reference to the temporary copy as
FileOptions. This was unfortunately only caught as a side effect of
changes in https://github.com/facebook/rocksdb/issues/8544.

This change fixes the repair.cc case and updates the involved internal
APIs that save a reference to use `const FileOptions*` instead.

Unfortunately, I don't know how to get any of our sanitizers to reliably
report bugs like this, so I can't rule out more existing in our
codebase.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8571

Test Plan:
Test that issues seen with https://github.com/facebook/rocksdb/issues/8544 are fixed (can reproduce on
AWS EC2)

Reviewed By: ajkr

Differential Revision: D29943890

Pulled By: pdillinger

fbshipit-source-id: 95f9c5251548777b4dc994c1a083dd2add5799c9

4 years agoFix missing Handle release in TableCache::GetRangeTombstoneIterator (#8589)
Peter Dillinger [Wed, 28 Jul 2021 04:30:54 +0000 (21:30 -0700)]
Fix missing Handle release in TableCache::GetRangeTombstoneIterator (#8589)

Summary:
This appears to be little used code so not a major bug, but is
blocking https://github.com/facebook/rocksdb/issues/8544

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8589

Test Plan:
Added regression test to the end of
DBRangeDelTest::TableEvictedDuringScan. Without this fix, ASAN reports
memory leak.

Reviewed By: ajkr

Differential Revision: D29943623

Pulled By: pdillinger

fbshipit-source-id: f7115fa6d4440aef83888ff609aa03d09216463b

4 years agoAdd MultiGet to replay (#8577)
Zhichao Cao [Tue, 27 Jul 2021 20:55:15 +0000 (13:55 -0700)]
Add MultiGet to replay (#8577)

Summary:
When the trace contains the MultiGet record, with this PR, it can replay the MultiGet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8577

Test Plan: make check and replay the real trace.

Reviewed By: anand1976

Differential Revision: D29864060

Pulled By: zhichao-cao

fbshipit-source-id: 5288d4fc9b6a3cb331de1e0c635d4e044dcb534a

4 years agoPass extra db_stress args to fbcode crash tests (#8587)
anand76 [Tue, 27 Jul 2021 19:45:59 +0000 (12:45 -0700)]
Pass extra db_stress args to fbcode crash tests (#8587)

Summary:
Allow extra arguments to be passed to db_stress in fbcode crash tests by the ```rocksdb-lego-determinator``` invoker.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8587

Reviewed By: zhichao-cao

Differential Revision: D29940217

Pulled By: anand1976

fbshipit-source-id: 17cbcd2def60eff2a895553f917694496c4742aa