git-server-git.apps.pok.os.sepia.ceph.com Git

]> git-server-git.apps.pok.os.sepia.ceph.com Git - rocksdb.git/log

Khem Raj [Wed, 25 Jan 2023 05:40:43 +0000 (21:40 -0800)]

checkpoint.h: Add missing includes <cstdint>

It uses uint64_t and it comes from <cstdint>
This is needed with GCC 13 and newer [1]

[1] https://www.gnu.org/software/gcc/gcc-13/porting_to.html

Signed-off-by: Khem Raj <raj.khem@gmail.com>
(cherry picked from commit 31012cdfa435d9203da3c3de8127b66bf018692a)

commit | commitdiff | tree

Heiko Becker [Wed, 25 Jan 2023 22:30:32 +0000 (14:30 -0800)]

Fix build with gcc 13 by including <cstdint> (#11118)

Summary:
Like other versions before, gcc 13 moved some includes around and as a result <cstdint> is no longer transitively included [1]. Explicitly include it for uint{32,64}_t.

[1] https://gcc.gnu.org/gcc-13/porting_to.html#header-dep-changes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11118

Reviewed By: cbi42

Differential Revision: D42711356

Pulled By: ajkr

fbshipit-source-id: 5ea257b85b7017f40fd8fdbce965336da95c55b2
(cherry picked from commit 88edfbfb5e1cac228f7cc31fbec24bb637fe54b1)

commit | commitdiff | tree

Radoslaw Zarzynski [Tue, 21 Aug 2018 19:47:48 +0000 (21:47 +0200)]

db: VersionEdit can understand the format introduced in PR 3488.

RocksDB's PR 3488 introduced a new format of `VersionEdit`
encoding which is not understandable for older versions.
Thus, the change broke forward compatibility and has been
reverted in PR 3762. Later, PR 3765 reiterated the concept
but in a way that does not provide compatibility with the very
short-living format of PR 3488.

This change tries to address the issue of accessing DBs
in format of 3488 by ignoring the special entries.

Fixes: http://tracker.ceph.com/issues/25146
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
(cherry picked from commit ffed5e68515620e1de0b66d96d2ec15cef1c82a7)
(cherry picked from commit c540de6f709b66efd41436694f72d6f7986a325b)

commit | commitdiff | tree

anand76 [Thu, 23 Feb 2023 18:00:36 +0000 (10:00 -0800)]

Update version.h to 7.10.2

commit | commitdiff | tree

anand76 [Fri, 10 Feb 2023 22:59:29 +0000 (14:59 -0800)]

Update HISTORY.md for 7.10.2

commit | commitdiff | tree

anand76 [Wed, 8 Feb 2023 20:05:49 +0000 (12:05 -0800)]

Fix bug in WAL streaming uncompression (#11198)

Summary:
Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug is when consecutive fragments of a compressed record are located at the same offset in the log reader buffer, the second fragment input buffer is treated as a leftover from the previous input buffer. As a result, the offset in the `ZSTD_inBuffer` is not reset.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198

Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it.

Reviewed By: ajkr, cbi42

Differential Revision: D43102692

Pulled By: anand1976

fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd

commit | commitdiff | tree

anand76 [Fri, 3 Feb 2023 00:35:27 +0000 (16:35 -0800)]

Return any errors returned by ReadAsync to the MultiGet caller (#11171)

Summary:
Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user.

Tests:
Update existing unit tests and add a new one for this scenario

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171

Reviewed By: akankshamahajan15

Differential Revision: D42950057

Pulled By: anand1976

fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869

commit | commitdiff | tree

Andrew Kryczka [Thu, 2 Feb 2023 16:51:12 +0000 (08:51 -0800)]

update HISTORY.md and version.h for 7.10.1

commit | commitdiff | tree

Andrew Kryczka [Wed, 1 Feb 2023 03:51:00 +0000 (19:51 -0800)]

add release note for GetMergeOperands() fix

commit | commitdiff | tree

Andrew Kryczka [Thu, 26 Jan 2023 23:11:19 +0000 (15:11 -0800)]

Fix GetMergeOperands() returning MergeInProgress (#11136)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11136

Test Plan: the provided unit test used to fail due to `GetMergeOperands()` returning `Status::MergeInProgress()`; it passes now because the `GetMergeOperands()` call returns `Status::OK()`

Reviewed By: pdillinger

Differential Revision: D42759198

Pulled By: ajkr

fbshipit-source-id: 878f9f40ccc1d7e2fe7b1352814bae3a49c19939

commit | commitdiff | tree

Andrew Kryczka [Wed, 1 Feb 2023 00:57:49 +0000 (16:57 -0800)]

Allow canceling manual compaction while waiting for conflicting compaction (#11165)

Summary:
This PR adds logic to the `RunManualCompaction()` loop to check for cancellation before waiting on any conflicting compactions to finish. In case of cancellation, `RunManualCompaction()` no longer waits on conflicting compactions

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11165

Test Plan: repro test case

Reviewed By: cbi42

Differential Revision: D42864058

Pulled By: ajkr

fbshipit-source-id: ea4dd1a8f294abe212905495a8fbe8f07fca3f5a

commit | commitdiff | tree

Hui Xiao [Tue, 24 Jan 2023 17:54:04 +0000 (09:54 -0800)]

Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job owns flush_reason instead of CFD (#11111)

Summary:
**Context:**
Concurrent flushes on the same CF can set on `ColumnFamilyData::flush_reason` before each other flush finishes. An symptom is one CF has different flush_reason with others though all of them are in an atomic flush `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed. `

**Summary:**
Suggested by ltamasi, we now refactor and let FlushRequest/Job to own flush_reason as there is no good way to define `ColumnFamilyData::flush_reason` in face of concurrent flushes on the same CF (which wasn't the case a long time ago when `ColumnFamilyData::flush_reason ` first introduced`)

**Tets:**
- new unit test
- make check
- aggressive crash test rehearsal

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111

Reviewed By: ajkr

Differential Revision: D42644600

Pulled By: hx235

fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf

commit | commitdiff | tree

Peter Dillinger [Wed, 25 Jan 2023 22:18:27 +0000 (14:18 -0800)]

Fix DelayWrite() calls for two_write_queues (#11130)

Summary:
PR https://github.com/facebook/rocksdb/issues/11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been about to generate a unit test that reproduces this failure but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. https://github.com/facebook/rocksdb/issues/11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader.

This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that dimishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue.

HISTORY not updated because this is intended for the same release where the bug was introduced.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11130

Test Plan:
Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case.

Will also run crash_test_with_multiops_wc_txn for a while looking for issues.

Reviewed By: ajkr

Differential Revision: D42749330

Pulled By: pdillinger

fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b

commit | commitdiff | tree

Hui Xiao [Mon, 23 Jan 2023 19:11:36 +0000 (11:11 -0800)]

Update history for 7.10.fb

commit | commitdiff | tree

Andrew Kryczka [Fri, 20 Jan 2023 22:40:30 +0000 (14:40 -0800)]

Add API to limit blast radius of merge operator failure (#11092)

Summary:
Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.

To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092

Reviewed By: siying

Differential Revision: D42525673

Pulled By: ajkr

fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee

commit | commitdiff | tree

akankshamahajan [Fri, 20 Jan 2023 18:17:57 +0000 (10:17 -0800)]

Enhance async scan prefetch unit tests (#11087)

Summary:
Add more coverage in unit tests for async scan. The added unit test fails without PR https://github.com/facebook/rocksdb/pull/10939.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11087

Test Plan: CircleCI jobs status for new unit tests.

Reviewed By: anand1976

Differential Revision: D42487931

Pulled By: akankshamahajan15

fbshipit-source-id: d59ed7666599bd0d2733ac5d76bd70984b54c5a9

commit | commitdiff | tree

codeoos [Thu, 19 Jan 2023 21:59:48 +0000 (13:59 -0800)]

Fix error maybe-uninitialized #11100 (#11101)

Summary:
In this issue [11100](https://github.com/facebook/rocksdb/issues/11100)
I try to upgrade dependencies of [BaikalDB](https://github.com/baidu/BaikalDB) and tool chain to gcc-12.I found that when I build rocksdb v6.26.0(maybe I can use newer version),I found that in file trace_replay/trace_replay.cc,the compiler tell me "error mybe-uninitialized".I dound that it can be fixed very easy,so I make this pull request.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11101

Reviewed By: ajkr

Differential Revision: D42583031

Pulled By: cbi42

fbshipit-source-id: 7f399f09441a30fe88b83cec5e2fd9885bad5c06

commit | commitdiff | tree

leipeng [Thu, 19 Jan 2023 21:28:58 +0000 (13:28 -0800)]

remove unused InternalIteratorBase::is_mutable_ (#11104)

Summary:
`InternalIteratorBase::is_mutable_` is not used any more, remove it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11104

Reviewed By: ajkr

Differential Revision: D42582747

Pulled By: cbi42

fbshipit-source-id: d30bf75151fc8414df0ae112a6ec4943b5b7330b

commit | commitdiff | tree

Peter Dillinger [Thu, 19 Jan 2023 20:07:50 +0000 (12:07 -0800)]

Upgrade xxhash.h to latest dev (#11098)

Summary:
Upgrading xxhash.h to latest dev version as of 1/17/2023, which is d7197ddea81364a539051f116ca77926100fc77f This should improve performance on some ARM machines.

I allowed some of our RocksDB-specific changes to be made obsolete where it seemed appropriate, for example
* xxhash.h has its own fallthrough marker (which I hope works for us)
* As in https://github.com/Cyan4973/xxHash/pull/549

Merging and resolving conflicts one way or the other was all that went into this diff. Except I had to mix the two sides around `defined(__loongarch64)`

How I did the upgrade (for future reference), so that I could use usual merge conflict resolution:
```
# New branch to help with merging
git checkout -b xxh_merge_base
# Check out RocksDB revision before last xxhash.h upgrade
git reset --hard 22161b7547652af82a5dc67458de9ca8946ac83d^
# Create a commit with the raw base version from xxHash repo (from xxHash repo)
git show 2c611a76f914828bed675f0f342d6c4199ffee1e:xxhash.h > ../rocksdb/util/xxhash.h
# In RocksDB repo
git commit -a
# Merge in the last xxhash.h upgrade
git merge 22161b7547652af82a5dc67458de9ca8946ac83d
# Resolve conflict using committed version
git show 22161b7547652af82a5dc67458de9ca8946ac83d:util/xxhash.h > util/xxhash.h
git commit -a
# Catch up to upstream
git merge upstream/main

# Create a different branch for applying raw upgrade
git checkout -b xxh_upgrade_2023
# Find the RocksDB commit we made for the raw base version from xxHash
git log main..HEAD
# Rewind to it
git reset --hard 2428b727a9a19c6078bb75895805d7488cbdd08c
# Copy in latest raw version (from xxHash repo)
cat xxhash.h > ../rocksdb/util/xxhash.h
# Merge in RocksDB changes, use typical tools for conflict resolution
git merge xxh_merge_base
```

Branch https://github.com/facebook/rocksdb/tree/xxhash_merge_base can be used as a base for future xxhash merges.

Fixes https://github.com/facebook/rocksdb/issues/11073

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11098

Test Plan:
existing tests (e.g. Bloom filter schema stability tests)

Also seems to include a small performance boost on my Intel dev machine, using `./db_bench --benchmarks=xxh3[-X50] 2>&1 | egrep -o 'operations;.*' | sort`

Fastest out of 50 runs, before: 15477.3 MB/s
Fastest out of 50 runs, after: 15850.7 MB/s, and 11 more runs faster than the "before" number

Slowest out of 50 runs, before: 12267.5 MB/s
Slowest out of 50 runs, after: 13897.1 MB/s

More repetitions show the distinction is repeatable

Reviewed By: hx235

Differential Revision: D42560010

Pulled By: pdillinger

fbshipit-source-id: c43ee52f1c5fe0ba3d6d6e4eebb22ded5f5492ea

commit | commitdiff | tree

Changyu Bi [Thu, 19 Jan 2023 00:38:07 +0000 (16:38 -0800)]

Fix asan failure caused by range tombstone start key use-after-free (#11106)

Summary:
the `last_tombstone_start_user_key` variable in `BuildTable()` and in `CompactionOutputs::AddRangeDels()` may point to a start key that is freed if user-defined timestamp is enabled. This was causing ASAN failure and this PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11106

Test Plan: Added UT for repro.

Reviewed By: ajkr

Differential Revision: D42590862

Pulled By: cbi42

fbshipit-source-id: c493265ececdf89636d801d55ae929806c4d4b2c

commit | commitdiff | tree

akankshamahajan [Wed, 18 Jan 2023 21:24:37 +0000 (13:24 -0800)]

Fix crash in block_cache_trace_analyzer if reference key is null in case of MultiGet (#11042)

Summary:
Same as title
Error:
```
block_cache_trace_analyzer: ./db/dbformat.h:421: uint64_t rocksdb::GetInternalKeySeqno(const rocksdb::Slice&): Assertion `n >= kNumInternalBytes' failed.
Aborted (core dumped)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11042

Test Plan:
- Added new unit test which fails without the fix.
- Also ran manually on traces to confirm.

Reviewed By: anand1976

Differential Revision: D42481587

Pulled By: akankshamahajan15

fbshipit-source-id: 7c33eb03a4a4d8ffbabcfbe0efa1e4d11bde3ba2

commit | commitdiff | tree

Changyu Bi [Wed, 18 Jan 2023 00:42:41 +0000 (16:42 -0800)]

Consider TTL compaction file cutting earlier to prevent small output file (#11075)

Summary:
in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075

Test Plan: - Added new unit test.

Reviewed By: hx235

Differential Revision: D42398739

Pulled By: cbi42

fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7

commit | commitdiff | tree

Changyu Bi [Tue, 17 Jan 2023 20:47:44 +0000 (12:47 -0800)]

Avoid counting extra range tombstone compensated size in `AddRangeDels()` (#11091)

Summary:
in `CompactionOutputs::AddRangeDels()`, range tombstones with the same start and end key but different sequence numbers all contribute to compensated range tombstone size. This PR removes this redundancy. This PR also includes a fix from https://github.com/facebook/rocksdb/issues/11067 where a range tombstone that is not within a file's range was being added to the file. This fixes an assertion failure for `icmp.Compare(start, end) <= 0` in VersionSet::ApproximateSize() when calculating compensated range tombstone size. Assertions and a comment/essay was added to reason that no such range tombstone will be added after this fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11091

Test Plan:
- Added unit tests
- Stress test with small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10`

Reviewed By: ajkr

Differential Revision: D42521588

Pulled By: cbi42

fbshipit-source-id: 5bda3fe38997995314e1f7592319af12b69bc4f8

commit | commitdiff | tree

Changyu Bi [Fri, 13 Jan 2023 20:28:21 +0000 (12:28 -0800)]

Revert #10802 Consider range tombstone in compaction output file cutting (#11089)

Summary:
This reverts commit f02c708aa32829bbbd70aa3493af8444e76e4350 since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them) and that I do not have a high confidence to fix all of them and ensure no further ones before the next release branch cut. There are also come existing issue found during bug fixing. We will work on it and try to merge it to the release after.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089

Test Plan: existing CI.

Reviewed By: ajkr

Differential Revision: D42505972

Pulled By: cbi42

fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153

commit | commitdiff | tree

leipeng [Fri, 13 Jan 2023 19:47:26 +0000 (11:47 -0800)]

db_bench: let -benchmark=compact respect -subcompactions (#11077)

Summary:
When running `-benchmarks=compact`, `-subcompactions` does not take effect.

`-subcompactions` option comment says it is for L0-L1 compactions, it is natural to extend it to CompactionRangeOptions.max_subcompactions.

This PR set CompactionRangeOptions.max_subcompactions = FLAGS_subcompactions

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11077

Reviewed By: akankshamahajan15

Differential Revision: D42506251

Pulled By: ajkr

fbshipit-source-id: f77c9a99d32ff7af59f3c452c9e16aaeb0360304

commit | commitdiff | tree

Wenlong Zhang [Fri, 13 Jan 2023 16:42:44 +0000 (08:42 -0800)]

support loongarch64 for rocksdb (#10036)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10036

Reviewed By: hx235

Differential Revision: D42424074

Pulled By: ajkr

fbshipit-source-id: 004adb75005a26bd01c5d568d1ec6ac442cd59dd

commit | commitdiff | tree

anand76 [Fri, 13 Jan 2023 02:09:07 +0000 (18:09 -0800)]

Add a unit test for async prefetch fix in #11049 (#11084)

Summary:
Add a unit test in prefetch_test for https://github.com/facebook/rocksdb/issues/11049

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11084

Test Plan: Verify the test fails without https://github.com/facebook/rocksdb/issues/11049 and passes with it

Reviewed By: akankshamahajan15

Differential Revision: D42485828

Pulled By: anand1976

fbshipit-source-id: ae512f2d121745a1f5212645a9b58868976c1f83

commit | commitdiff | tree

Peter Dillinger [Wed, 11 Jan 2023 22:20:40 +0000 (14:20 -0800)]

Major Cache refactoring, CPU efficiency improvement (#10975)

Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).

The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.

* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)

## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
  * Simpler for implementations to deal with just one Insert and one Lookup.
  * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
  * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
  * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
  * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
  * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.

## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).

The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.

These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)

Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.

## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.

## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.

The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.

## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)

## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975

Test Plan:
tests updated

Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):

34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83

Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.

Reviewed By: anand1976

Differential Revision: D42417818

Pulled By: pdillinger

fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432

commit | commitdiff | tree

Changyu Bi [Thu, 5 Jan 2023 20:10:02 +0000 (12:10 -0800)]

Fix some unit test failure in ExternalSSTFileBasicTest (#11070)

Summary:
valgrind build for `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithMixedValueType` and `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno` started failing (see error message in T141554665). I could not repro but I suspect it is due to file ingestion range overlapping with ongoing compaction, which caused a new global seqno being assigned after https://github.com/facebook/rocksdb/issues/10988.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11070

Test Plan: monitor future valgrind tests result.

Reviewed By: hx235

Differential Revision: D42319056

Pulled By: cbi42

fbshipit-source-id: acbcd841a2a15e36b278f39ba514f4b9a6ee43ca

commit | commitdiff | tree

Niklas Fiekas [Thu, 5 Jan 2023 03:36:43 +0000 (19:36 -0800)]

Add C API for ReadOptions::async_io (#11062)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11062

Reviewed By: hx235

Differential Revision: D42297489

Pulled By: ajkr

fbshipit-source-id: 03fe1477c1ae1f8af73dc77a6986fdc7025edf4f

commit | commitdiff | tree

ehds [Thu, 5 Jan 2023 03:35:34 +0000 (19:35 -0800)]

fix shared state used after free (#11059)

Summary:
Before this pr, the destruction order is `shared` -> `db_`(StressTest destruction) -> `stress`, but `compaction_filter` of `db_` will hold the `shared` pointer, so `shared` maybe used after free.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11059

Reviewed By: hx235

Differential Revision: D42297366

Pulled By: ajkr

fbshipit-source-id: 17b314635359acacd5ba62f9db5f955f451133f7

commit | commitdiff | tree

Hui Xiao [Tue, 3 Jan 2023 19:54:58 +0000 (11:54 -0800)]

Add back Options::CompactionOptionsFIFO::allow_compaction to stress/crash test (#11063)

Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/10777 was reverted (https://github.com/facebook/rocksdb/pull/10999) due to internal blocker and replaced with a better fix https://github.com/facebook/rocksdb/pull/10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by the PR.

It's an useful coverage cuz setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and then trigger more FIFO compaction. This speed up discovery of bug related to FIFO compaction like https://github.com/facebook/rocksdb/pull/10955. To see the speedup, compare the failure occurrence in following commands with `Options::CompactionOptionsFIFO::allow_compaction=true/false`

```
--fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65
```

Therefore this PR is adding it back to stress/crash test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11063

Test Plan: Rehearsal stress test to make sure stress/crash test is stable

Reviewed By: ajkr

Differential Revision: D42283650

Pulled By: hx235

fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef

commit | commitdiff | tree

Changyu Bi [Sat, 31 Dec 2022 18:56:55 +0000 (10:56 -0800)]

Fix BackupEngineTest.ExcludeFiles memory leak (#11066)

Summary:
Valgrind was complaining about the test BackupEngineTest.ExcludeFiles. The cause is backup_engine not being freed similar to https://github.com/facebook/rocksdb/issues/9610.
```
==18228== Command: ./backup_engine_test --gtest_filter=BackupEngineTest.ExcludeFiles
==18228==
Note: Google Test filter = BackupEngineTest.ExcludeFiles
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BackupEngineTest
[ RUN      ] BackupEngineTest.ExcludeFiles
[       OK ] BackupEngineTest.ExcludeFiles (16264 ms)
[----------] 1 test from BackupEngineTest (16273 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (16306 ms total)
[  PASSED  ] 1 test.
==18228==
==18228== HEAP SUMMARY:
==18228==     in use at exit: 14,099 bytes in 159 blocks
==18228==   total heap usage: 255,328 allocs, 255,169 frees, 497,538,546 bytes allocated
==18228==
==18228== 19 bytes in 1 blocks are possibly lost in loss record 4 of 67
==18228==    at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==18228==    by 0x1E752D: void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .constprop.0] (basic_string.tcc:219)
==18228==    by 0x1F1898: _M_construct_aux<char*> (basic_string.h:251)
==18228==    by 0x1F1898: _M_construct<char*> (basic_string.h:270)
==18228==    by 0x1F1898: basic_string (basic_string.h:455)
==18228==    by 0x1F1898: construct<std::__cxx11::basic_string<char>, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (new_allocator.h:146)
==18228==    by 0x1F1898: construct<std::__cxx11::basic_string<char>, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (alloc_traits.h:483)
==18228==    by 0x1F1898: push_back (stl_vector.h:1189)
==18228==    by 0x1F1898: rocksdb::(anonymous namespace)::TestFs::NewWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) (backup_engine_test.cc:208)
==18228==    by 0x4B3583: rocksdb::NewWritableFile(rocksdb::FileSystem*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::FileOptions const&) (read_write_util.cc:23)
==18228==    by 0x31C3A8: rocksdb::DBImpl::CreateWAL(unsigned long, unsigned long, unsigned long, rocksdb::log::Writer**) (db_impl_open.cc:1752)
==18228==    by 0x321A8C: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1852)
==18228==    by 0x322E7F: Open (db_impl_open.cc:1660)
==18228==    by 0x322E7F: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1637)
==18228==    by 0x1EE1CD: InitializeDBAndBackupEngine (backup_engine_test.cc:724)
==18228==    by 0x1EE1CD: rocksdb::(anonymous namespace)::BackupEngineTest::OpenDBAndBackupEngine(bool, bool, rocksdb::(anonymous namespace)::BackupEngineTest::ShareOption) (backup_engine_test.cc:732)
==18228==    by 0x217585: rocksdb::(anonymous namespace)::BackupEngineTest_ExcludeFiles_Test::TestBody() (backup_engine_test.cc:4232)
==18228==    by 0x296143: HandleSehExceptionsInMethodIfSupported<testing::Test, void> (gtest-all.cc:3899)
==18228==    by 0x296143: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3935)
==18228==    by 0x28A0A5: testing::Test::Run() [clone .part.0] (gtest-all.cc:3973)
==18228==    by 0x28A364: Run (gtest-all.cc:3965)
==18228==    by 0x28A364: testing::TestInfo::Run() [clone .part.0] (gtest-all.cc:4149)
...
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11066

Test Plan: make -j24 J=24 ROCKSDBTESTS_SUBSET=backup_engine_test valgrind_check_some

Reviewed By: ajkr

Differential Revision: D42297791

Pulled By: cbi42

fbshipit-source-id: db67982b27b91cc78e1a9f4a96da0cba7c9785b7

commit | commitdiff | tree

mrambacher [Sat, 31 Dec 2022 00:55:58 +0000 (16:55 -0800)]

Add ability to have unit tests for ROCKSDB_PLUGINS (#11052)

Summary:
This is based on speedb PR [143](https://github.com/speedb-io/speedb/pull/143).

This PR adds the ability to add a xxx_TESTS variable to the make or cmake files for a plugin. When set, those files will be added to the unit tests built and executed by the corresponding make system.

Note that the rule for building plugin tests via make could be expanded to almost every other unit test in RocksDB. This expansion would allow for a much smaller/simpler Makefile and make it easier to add new test files to RocksDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11052

Reviewed By: cbi42

Differential Revision: D42212269

Pulled By: ajkr

fbshipit-source-id: d02668f7f4466900d63c90bb4f7962d23fcc7114

commit | commitdiff | tree

ywave [Sat, 31 Dec 2022 00:55:55 +0000 (16:55 -0800)]

Fix typo in flushing stats CF (#11055)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11055

Test Plan: make check

Reviewed By: cbi42

Differential Revision: D42232828

Pulled By: ajkr

fbshipit-source-id: 3b46514aebff4da7e47b9954b90800ba4a3ba30b

commit | commitdiff | tree

HuangYi [Sat, 31 Dec 2022 00:53:00 +0000 (16:53 -0800)]

add c-api for setting option optimize_filters_for_memory (#11044)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11044

Reviewed By: cbi42

Differential Revision: D42152851

Pulled By: ajkr

fbshipit-source-id: 81710d9503ba4f23f112c72ebf16a48112e27158

commit | commitdiff | tree

Hui Xiao [Thu, 29 Dec 2022 23:05:36 +0000 (15:05 -0800)]

Add missing range conflict check between file ingestion and RefitLevel() (#10988)

Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.

RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.

Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](https://github.com/facebook/rocksdb/blame/c62f3221698fd273b673d4f7e54eabb8329a4369/db/db_iter.cc#L342-L343) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.

Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.

**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (https://github.com/facebook/rocksdb/commit/0f88160f67d36ea30e3aca3a3cef924c3a009be6, https://github.com/facebook/rocksdb/commit/5c64fb67d2fc198f1a73ff3ae543749a6a41f513 and https://github.com/facebook/rocksdb/commit/87dfc1d23e0e16ff73e15f63c6fa0fb3b3fc8c8c.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988

Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: cbi42

Differential Revision: D41535685

Pulled By: hx235

fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258

commit | commitdiff | tree

Changyu Bi [Thu, 29 Dec 2022 21:28:24 +0000 (13:28 -0800)]

Include estimated bytes deleted by range tombstones in compensated file size (#10734)

Summary:
compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734

Test Plan:
- Added unit tests.
- benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used:
```
./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000
```

In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size.

Compaction stats from this PR:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      2/0   31.54 MB   0.5      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     63.4    135.56            110.94       544    0.249       0      0       0.0       0.0
  L4      3/0   96.55 MB   0.8     18.5     6.7     11.8      18.4      6.6       0.0   2.7     65.3     64.9    290.08            284.03       108    2.686    284M  1957K       0.0       0.0
  L5     15/0   404.41 MB   1.0     19.1     7.7     11.4      18.8      7.4       0.3   2.5     66.6     65.7    292.93            285.34       220    1.332    293M  3808K       0.0       0.0
  L6    143/0    4.12 GB   0.0     45.0     7.5     37.5      41.6      4.1       0.0   5.5     71.2     65.9    647.00            632.66       251    2.578    739M    47M       0.0       0.0
Sum    163/0    4.64 GB   0.0     82.6    21.9     60.7      87.2     26.5       0.3  10.4     61.9     65.4   1365.58           1312.97      1123    1.216   1318M    52M       0.0       0.0
```

Compaction stats from main:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     60.5    142.12            115.89       569    0.250       0      0       0.0       0.0
  L4      3/0   85.68 MB   1.0     17.7     6.8     10.9      17.6      6.7       0.0   2.6     62.7     62.3    289.05            281.79       112    2.581    272M  2309K       0.0       0.0
  L5     11/0   293.73 MB   1.0     18.8     7.5     11.2      18.5      7.2       0.5   2.5     64.9     63.9    296.07            288.50       220    1.346    288M  4365K       0.0       0.0
  L6    130/0    3.94 GB   0.0     51.5     7.6     43.9      47.9      3.9       0.0   6.3     67.2     62.4    784.95            765.92       258    3.042    848M    51M       0.0       0.0
Sum    144/0    4.31 GB   0.0     88.0    21.9     66.0      92.3     26.3       0.5  11.0     59.6     62.5   1512.19           1452.09      1159    1.305   1409M    58M       0.0       0.0```

Reviewed By: ajkr

Differential Revision: D39834713

Pulled By: cbi42

fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b

commit | commitdiff | tree

Peter Dillinger [Thu, 29 Dec 2022 18:42:50 +0000 (10:42 -0800)]

Add BackupEngine feature to exclude files (#11030)

Summary:
We have a request for RocksDB to essentially support
disconnected incremental backup. In other words, if there is limited
or no connectivity to the primary backup dir, we should still be able to
take an incremental backup relative to that primary backup dir,
assuming some metadata about that primary dir is available (and
obviously anticipating primary backup dir will be fully available if
restore is needed).

To support that, this feature allows the API user to "exclude" DB
files from backup. This only applies to files that can be shared
between backups (sst and blob files), and excluded files are
tracked in the backup metadata sufficiently to ensure they are
restored at restore time. At restore time, the user provides
a set of alternate backup directories (as open BackupEngines, which
can be read-only), and excluded files must be found in one of the
backup directories ("included" in some backup).

This feature depends on backup schema version 2 features, though
schema version 2.0 support is not sufficient to read / restore a
backup with exclusions. This change updates the schema version to
2.1 because of this feature, so that it's easy to recognize whether
a RocksDB release supports this feature, while backups not using the
feature are fully compatible with 2.0.

Also in this PR:
* Stacked on https://github.com/facebook/rocksdb/pull/11029
* Allow progress_callback to be empty, not just no-op function, and
recover from exceptions thrown by BackupEngine callbacks.
* The internal-only `AsBackupEngine()` function is working around the
diamond hierarchy of `BackupEngineImplThreadSafe` to get to the
internals, without using confusing features like virtual inheritance.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11030

Test Plan: unit tests added / updated

Reviewed By: ajkr

Differential Revision: D42004388

Pulled By: pdillinger

fbshipit-source-id: 31b6e533d308a5462e528d9012d650482d974077

commit | commitdiff | tree

anand76 [Thu, 22 Dec 2022 06:42:19 +0000 (22:42 -0800)]

Avoid mixing sync and async prefetch (#11050)

Summary:
Reading uncompression dict block always uses sync reads, while data blocks may use async reads and prefetching. This causes problems in FilePrefetchBuffer. So avoid mixing the two by reading the uncompression dict straight from the file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11050

Test Plan: Crash test

Reviewed By: akankshamahajan15

Differential Revision: D42194682

Pulled By: anand1976

fbshipit-source-id: aaa8b396fdfe966b157e210f5ef8501c45b7b69e

commit | commitdiff | tree

Peter Dillinger [Wed, 21 Dec 2022 23:41:10 +0000 (15:41 -0800)]

Make CompactRange() more aware of SstPartitionerFactory (#11032)

Summary:
Some users are at least considering using SstPartitioner to support efficient physical migration of specific key ranges between RocksDB instances. One might expect manual `CompactRange()` over a narrow key range across some partition to enforce partitioning of any SST files crossing that partition boundary, but that currently only works if there are keys within that range.

This change makes the overlap logic in CompactRange more aware of the partitioner to automatically select relevant files crossing a partition boundary, even when they otherwise would not be selected due to the compaction range falling in a gap between entries.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11032

Test Plan: unit test included

Reviewed By: hx235

Differential Revision: D41981380

Pulled By: pdillinger

fbshipit-source-id: 2fe445bdddc73c00276c20f295cc1fa33d15b05a

commit | commitdiff | tree

Alan Paxton [Wed, 21 Dec 2022 19:54:24 +0000 (11:54 -0800)]

Improve Java API get() performance by reducing copies (#10970)

Summary:
Performance improvements for `get()` paths in the RocksJava API (JNI).
Document describing the performance results.

Replace uses of the legacy `DB::Get()` method wrapper returning data in a `std::string` with direct calls to `DB::Get()` passing a pinnable slice to receive this data. Copying from a pinned slice direct to the destination java byte array, without going via an intervening std::string, is a major performance gain for this code path.

Note that this gain only comes where `DB::Get()` is able to return a pinned buffer; where it has to copy into the buffer owned by the slice, there is still the intervening copy and no performance gain. It may be possible to address this case too, but it is not trivial.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10970

Reviewed By: pdillinger

Differential Revision: D42125567

Pulled By: ajkr

fbshipit-source-id: b7a4df7523b0420cadb1e9b6c7da3ec030a8da34

commit | commitdiff | tree

anand76 [Wed, 21 Dec 2022 17:15:53 +0000 (09:15 -0800)]

Fix async prefetch heap use after free (#11049)

Summary:
This PR fixes a heap use after free bug in the async prefetch code that happens in the following scenario -
1. Scan thread starts 2 async reads for Seek, one for the seek block and one for prefetching
2. Before the first read in https://github.com/facebook/rocksdb/issues/1 completes, another thread reads and loads the block in cache
3. The first scan thread finds the block in cache, continues and the next block cache miss is for a block that spans the boundary of the 2 prefetch buffers, and the 1st read is complete but the 2nd one is not complete yet
4. The scan thread will reallocate (i.e free the old buffer and allocate a new one) the 2nd prefetch buffer, and the in-progress prefetch is orphaned
5. The orphaned prefetch finally completes, resulting in a use after free

Also add a few asserts to surface bugs earlier in the crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11049

Test Plan: Repro with db_stress and verify the fix

Reviewed By: akankshamahajan15

Differential Revision: D42181118

Pulled By: anand1976

fbshipit-source-id: 1ac55d2f64a89ce128c1c574262b8aa7d82eb8cc

commit | commitdiff | tree

Changyu Bi [Tue, 20 Dec 2022 00:36:39 +0000 (16:36 -0800)]

Fix an assertion failure in `CompactionOutputs::AddRangeDels()` (#11040)

Summary:
the [assertion](https://github.com/facebook/rocksdb/blob/c3f720c60db59c27486d8f18e094f9d1eb3c33cf/db/compaction/compaction_outputs.cc#L643) in `CompactionOutputs::AddRangeDels()` can fail after https://github.com/facebook/rocksdb/pull/10802. The assertion fails when `lower_bound_from_range_tombstone` is true during `AddRangeDels()` for a new compaction output file, while the lower bound range tombstone key has seqno 0 and op_type kTypeRangeDeletion. It can have seqno 0 when it was truncated at a point key whose seqno was zeroed out during compaction, the seqno and op_type could be set [here](https://github.com/facebook/rocksdb/blob/c3f720c60db59c27486d8f18e094f9d1eb3c33cf/db/compaction/compaction_outputs.cc#L594). This PR fixes the assertion excluding the case when `lower_bound_from_range_tombstone` is true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11040

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D42119914

Pulled By: cbi42

fbshipit-source-id: 0897e71b5304cb02aac30f71667b590c37b72baf

commit | commitdiff | tree

ehds [Mon, 19 Dec 2022 23:06:22 +0000 (15:06 -0800)]

snapshots of FragmentedRangeTombstoneList must in ascending order (#11046)

Summary:
`snapshots` argument of `FragmentedRangeTombstoneList` should be in ascending order.

If we pass it in descending order order， it will not work.

for example:

```
  auto range_del_iter = MakeRangeDelIter({{"a", "e", 3},{"a","e", 6}});

  FragmentedRangeTombstoneList fragment_list(
      std::move(range_del_iter), bytewise_icmp, true /* for_compaction */,
      {8 ,7 ,4} /* snapshots */);
    FragmentedRangeTombstoneIterator iter(&fragment_list, bytewise_icmp,
                                        kMaxSequenceNumber /* upper_bound */);
  VerifyFragmentedRangeDels(&iter, {{"a", "e", 6}, {"a", "e", 3}});
```
VerifyFragmentedRangeDels will fail.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11046

Reviewed By: ajkr

Differential Revision: D42148654

Pulled By: cbi42

fbshipit-source-id: a2e76f96dccf56fcca1a91cb8da9b99145f68026

commit | commitdiff | tree

anand76 [Mon, 19 Dec 2022 19:38:42 +0000 (11:38 -0800)]

Prevent db_stress failure when io_uring is disabled (#11045)

Summary:
The IO uring usage is disabled in RocksDB by default and, as a result, PosixRandomAccessFile::ReadAsync returns a NotSupported() status. This was causing stress test failures with MultiGet and async_io combination. Fix it by relying on redirection of ReadAsync to Read when default Env is used in db_stress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11045

Reviewed By: akankshamahajan15

Differential Revision: D42136213

Pulled By: anand1976

fbshipit-source-id: fc7904d8ece74d7e8f2e1a34c3d70bd5774fb45f

commit | commitdiff | tree

anand76 [Thu, 15 Dec 2022 23:48:50 +0000 (15:48 -0800)]

Enable ReadAsync testing and fault injection in db_stress (#11037)

Summary:
The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037

Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890.

Reviewed By: akankshamahajan15

Differential Revision: D42061290

Pulled By: anand1976

fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074

commit | commitdiff | tree

Changyu Bi [Thu, 15 Dec 2022 17:11:54 +0000 (09:11 -0800)]

Consider range tombstone in compaction output file cutting (#10802)

Summary:
This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802

Test Plan:
- added unit tests in db_range_del_test.
- stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000`

Reviewed By: ajkr, jay-zhuang

Differential Revision: D40308827

Pulled By: cbi42

fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df

commit | commitdiff | tree

sdong [Wed, 14 Dec 2022 20:06:24 +0000 (12:06 -0800)]

~SleepingBackgroundTask() to wake up the sleeping task (#11036)

Summary:
Right now, in unit tests, when background tests are sleeping using SleepingBackgroundTask, and the test exits with test assertion failure, the process will hang and it might prevent us to see the test failure message in CI runs. Try to wake up the thread so that the test can exit correctly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11036

Test Plan: Watch CI succeeds

Reviewed By: riversand963

Differential Revision: D42020489

fbshipit-source-id: 5b8441b18d5f67bbb3ade59a1225a8d3c860c2eb

commit | commitdiff | tree

Alan Paxton [Wed, 14 Dec 2022 18:49:32 +0000 (10:49 -0800)]

JNI native memory leak - release array elements (#10981)

Summary:
Closes https://github.com/facebook/rocksdb/issues/10980

Reproduced as per the suggestion in the ticket, and `$ jcmd <PID> VM.native_memory | grep Internal` reports that we are no longer leaking internal memory with the suggested fix.

I did the repro in `MultiGetTest.java` which I have optimized imports on. It did not seem helpful to leave the test code around as it would be onerous to build a memory leak reproducer, and regression seems a remote possibility.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10981

Reviewed By: riversand963

Differential Revision: D41498748

Pulled By: ajkr

fbshipit-source-id: 8c6dd0d608172879c8bda479c7c9c05c12d34e70

commit | commitdiff | tree

Yanqin Jin [Wed, 14 Dec 2022 05:45:00 +0000 (21:45 -0800)]

Revise LockWAL/UnlockWAL implementation (#11020)

Summary:
RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and
releases the internal `DBImpl::log_write_mutex_`.

According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288
> Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_.

This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application
should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing
the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()`
is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`.

There are two issues:
1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after
calling `LockWAL()` because log_write_mutex is not recursive.
2. Two threads may dead lock due to lock order inversion.

To fix these issues, we can modify the implementation of LockWAL so that it does not keep
`log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can
instead manually inject a write stall so that all future writes will be stopped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D41785203

Pulled By: riversand963

fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce

commit | commitdiff | tree

Hui Xiao [Tue, 13 Dec 2022 21:29:37 +0000 (13:29 -0800)]

Sort L0 files by newly introduced epoch_num (#10922)

Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
-  File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
    - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`

Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number`  ordering check. This will result in file ordering s1 <  s2 <  s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.

**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file  is assigned with the minimum `epoch_number` among input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or  by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922

Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9

commit | commitdiff | tree

Peter Dillinger [Tue, 13 Dec 2022 17:42:34 +0000 (09:42 -0800)]

Fix bug updating latest backup on delete (#11029)

Summary:
Previously, the "latest" valid backup would not be updated on delete.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11029

Test Plan: unit test included (added to an existing test for efficiency)

Reviewed By: hx235

Differential Revision: D41967925

Pulled By: pdillinger

fbshipit-source-id: ca143354d281eb979557ea421902cd26803a1137

commit | commitdiff | tree

Arvid Lunnemark [Mon, 12 Dec 2022 18:39:53 +0000 (10:39 -0800)]

replace sprintf with its safe version snprintf (v2) (#11011)

Summary:
same motivations as https://github.com/facebook/rocksdb/pull/5475, applied to the last remaining `sprintf`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11011

Reviewed By: pdillinger

Differential Revision: D41673500

Pulled By: ajkr

fbshipit-source-id: 88618ea791cafad86a9a491799c45979d46e3544

commit | commitdiff | tree

Jay Zhuang [Mon, 12 Dec 2022 18:37:55 +0000 (10:37 -0800)]

Add an unittest for Periodic compaction conflict with ongoing compaction (#10908)

Summary:
Add a tiered storage migration test which would conflict with
an ongoing penultimate level compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10908

Test Plan: Test only change

Reviewed By: anand1976

Differential Revision: D40864509

Pulled By: ajkr

fbshipit-source-id: e316e849a01a6c71a41be130101f909b6c0498cb

commit | commitdiff | tree

Andrew Kryczka [Sat, 10 Dec 2022 23:07:42 +0000 (15:07 -0800)]

Use VersionBuilder for CF import ordering/validation (#11028)

Summary:
Besides the existing ordering and validation, more is coming to VersionBuilder/VersionStorageInfo, like migration of epoch_numbers from older RocksDB versions. We should start using those common classes for importing CFs, instead of duplicating their ordering, validation, and migration logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11028

Test Plan: rely on existing tests

Reviewed By: hx235

Differential Revision: D41865427

Pulled By: ajkr

fbshipit-source-id: 873f5cd87b8902a2380c3b71373ce0b0db3a0c50

commit | commitdiff | tree

Peter Dillinger [Fri, 9 Dec 2022 18:03:47 +0000 (10:03 -0800)]

Improve error messages for SST footer and size errors (#11009)

Summary:
Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure:
* Magic number error is reported first if we attempt to open an SST file and the footer is completely bad.
* Footer errors are reported with affected file.
* If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009

Test Plan:
unit tests added, some manual

Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed).

Reviewed By: siying

Differential Revision: D41656272

Pulled By: pdillinger

fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df

commit | commitdiff | tree

dependabot[bot] [Thu, 8 Dec 2022 17:21:36 +0000 (09:21 -0800)]

Bump nokogiri from 1.13.9 to 1.13.10 in /docs (#11024)

Summary:
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.13.9 to 1.13.10.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/releases">nokogiri's releases</a>.</em></p>
<blockquote>
<h2>1.13.10 / 2022-12-07</h2>
<h3>Security</h3>
<ul>
<li>[CRuby] Address CVE-2022-23476, unchecked return value from <code>xmlTextReaderExpand</code>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-qv4q-mr5r-qprj">GHSA-qv4q-mr5r-qprj</a> for more information.</li>
</ul>
<h3>Improvements</h3>
<ul>
<li>[CRuby] <code>XML::Reader#attribute_hash</code> now returns <code>nil</code> on parse errors. This restores the behavior of <code>#attributes</code> from v1.13.7 and earlier. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a>]</li>
</ul>
<hr />
<p>sha256 checksums:</p>
<pre><code>777ce2e80f64772e91459b943e531dfef387e768f2255f9bc7a1655f254bbaa1  nokogiri-1.13.10-aarch64-linux.gem
b432ff47c51386e07f7e275374fe031c1349e37eaef2216759063bc5fa5624aa  nokogiri-1.13.10-arm64-darwin.gem
73ac581ddcb680a912e92da928ffdbac7b36afd3368418f2cee861b96e8c830b  nokogiri-1.13.10-java.gem
916aa17e624611dddbf2976ecce1b4a80633c6378f8465cff0efab022ebc2900  nokogiri-1.13.10-x64-mingw-ucrt.gem
0f85a1ad8c2b02c166a6637237133505b71a05f1bb41b91447005449769bced0  nokogiri-1.13.10-x64-mingw32.gem
91fa3a8724a1ce20fccbd718dafd9acbde099258183ac486992a61b00bb17020  nokogiri-1.13.10-x86-linux.gem
d6663f5900ccd8f72d43660d7f082565b7ffcaade0b9a59a74b3ef8791034168  nokogiri-1.13.10-x86-mingw32.gem
81755fc4b8130ef9678c76a2e5af3db7a0a6664b3cba7d9fe8ef75e7d979e91b  nokogiri-1.13.10-x86_64-darwin.gem
51d5246705dedad0a09b374d09cc193e7383a5dd32136a690a3cd56e95adf0a3  nokogiri-1.13.10-x86_64-linux.gem
d3ee00f26c151763da1691c7fc6871ddd03e532f74f85101f5acedc2d099e958  nokogiri-1.13.10.gem
</code></pre>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md">nokogiri's changelog</a>.</em></p>
<blockquote>
<h2>1.13.10 / 2022-12-07</h2>
<h3>Security</h3>
<ul>
<li>[CRuby] Address CVE-2022-23476, unchecked return value from <code>xmlTextReaderExpand</code>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-qv4q-mr5r-qprj">GHSA-qv4q-mr5r-qprj</a> for more information.</li>
</ul>
<h3>Improvements</h3>
<ul>
<li>[CRuby] <code>XML::Reader#attribute_hash</code> now returns <code>nil</code> on parse errors. This restores the behavior of <code>#attributes</code> from v1.13.7 and earlier. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a>]</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="https://github.com/sparklemotion/nokogiri/commit/4c80121dc309e67fa3d9f66a00516bad39b42c31"><code>4c80121</code></a> version bump to v1.13.10</li>
<li><a href="https://github.com/sparklemotion/nokogiri/commit/85410e38410f670cbbc8c5b00d07b843caee88ce"><code>85410e3</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a> from sparklemotion/flavorjones-fix-reader-error-hand...</li>
<li><a href="https://github.com/sparklemotion/nokogiri/commit/9fe0761c47c0d4270d1a5220cfd25de080350d50"><code>9fe0761</code></a> fix(cruby): XML::Reader#attribute_hash returns nil on error</li>
<li><a href="https://github.com/sparklemotion/nokogiri/commit/3b9c736bee91f95514da309eef28b06c0c29ce3a"><code>3b9c736</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2717">https://github.com/facebook/rocksdb/issues/2717</a> from sparklemotion/flavorjones-lock-psych-to-fix-bui...</li>
<li><a href="https://github.com/sparklemotion/nokogiri/commit/2efa87b49a26d1e961c2a0c143ecf28a67033677"><code>2efa87b</code></a> test: skip large cdata test on system libxml2</li>
<li><a href="https://github.com/sparklemotion/nokogiri/commit/3187d6739c90864a7bb59cf8276facb1a47ca85d"><code>3187d67</code></a> dep(dev): pin psych to v4 until v5 builds in CI</li>
<li><a href="https://github.com/sparklemotion/nokogiri/commit/a16b4bf14cec72e1a396c28a85135cd9abb08d9b"><code>a16b4bf</code></a> style(rubocop): disable Minitest/EmptyLineBeforeAssertionMethods</li>
<li>See full diff in <a href="https://github.com/sparklemotion/nokogiri/compare/v1.13.9...v1.13.10">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=nokogiri&package-manager=bundler&previous-version=1.13.9&new-version=1.13.10)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `dependabot rebase` will rebase this PR
- `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `dependabot merge` will merge this PR after your CI passes on it
- `dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `dependabot cancel merge` will cancel a previously requested merge and block automerging
- `dependabot reopen` will reopen this PR if it is closed
- `dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- `dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
- `dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
- `dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
- `dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/facebook/rocksdb/network/alerts).

</details>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11024

Reviewed By: ltamasi

Differential Revision: D41830779

Pulled By: riversand963

fbshipit-source-id: d5ec751cfb947413a6eabe412871bf30b9db5674

commit | commitdiff | tree

Hui Xiao [Wed, 7 Dec 2022 02:33:35 +0000 (18:33 -0800)]

Move two history entries mistake out of 7.9 section (#11013)

Summary:
https://github.com/facebook/rocksdb/pull/10892 and https://github.com/facebook/rocksdb/pull/10955 mistakenly added two entries under sealed 7.9.history section. This PR fixes these two.

No need to update 7.9 branch (https://github.com/facebook/rocksdb/blob/7.9.fb/HISTORY.md) cuz it's cut before these two PRs landed

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11013

Reviewed By: cbi42

Differential Revision: D41666514

Pulled By: hx235

fbshipit-source-id: c4bc7a29ff663664bf0be1ba1c7eab4d00a61528

commit | commitdiff | tree

Hui Xiao [Wed, 7 Dec 2022 02:31:43 +0000 (18:31 -0800)]

Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL (#11016)

Summary:
**Context/Summary:**
Credit to ajkr's https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134,
flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to `Flush()` called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true). Fix this by making checking `sync_point_called == true` after obsoleted WAL is found and `SyncWAL()` is called. Also rename the "sync_point_called" to be something more specific.

Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11016

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D41717786

Pulled By: hx235

fbshipit-source-id: ad1e4701a987285bbe6c8e7d9b05c4db06b4edf4

commit | commitdiff | tree

Changyu Bi [Mon, 5 Dec 2022 21:46:27 +0000 (13:46 -0800)]

Fix an assertion failure in `TimestampTablePropertiesCollector` for empty output (#11015)

Summary:
when the compaction output file is empty, the assertion in `TimestampTablePropertiesCollector::Finish()` breaks. This PR fixes this assert and added unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11015

Test Plan: added UT.

Reviewed By: ajkr

Differential Revision: D41716719

Pulled By: cbi42

fbshipit-source-id: d891d46be4c4805e3d49be6b80c9d75f1bd51080

commit | commitdiff | tree

anand76 [Mon, 5 Dec 2022 06:58:25 +0000 (22:58 -0800)]

Fix table cache leak in MultiGet with async_io (#10997)

Summary:
When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997

Test Plan:
1. New unit test in db_basic_test
2. asan_crash

Reviewed By: pdillinger

Differential Revision: D41587244

Pulled By: anand1976

fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64

commit | commitdiff | tree

Peter Dillinger [Thu, 1 Dec 2022 21:18:40 +0000 (13:18 -0800)]

Fix use of make_unique in Arena::AllocateNewBlock (#11012)

Summary:
The change to `make_unique<char[]>` in https://github.com/facebook/rocksdb/issues/10810 inadvertently started initializing data in Arena blocks, which could lead to increased memory use due to (at least on our implementation) force-mapping pages as a result. This change goes back to `new char[]` while keeping all the other good parts of https://github.com/facebook/rocksdb/issues/10810.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11012

Test Plan: unit test added (fails on Linux before fix)

Reviewed By: anand1976

Differential Revision: D41658893

Pulled By: pdillinger

fbshipit-source-id: 267b7dccfadaeeb1be767d43c602a6abb0e71cd0

commit | commitdiff | tree

WLeoo [Thu, 1 Dec 2022 03:27:28 +0000 (19:27 -0800)]

Fix an uninitialized variable warning for g++ 12.2.0 (#10995)

Summary:
/home/wl/rocksdbtry/rocksdb-WL/util/bloom_test.cc: In constructor ‘rocksdb::RawFilterTester::RawFilterTester()’:
/home/wl/rocksdbtry/rocksdb-WL/util/bloom_test.cc:813:40: error: member ‘rocksdb::RawFilterTester::data_’ is used uninitialized [-Werror=uninitialized]

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10995

Reviewed By: cbi42

Differential Revision: D41620186

Pulled By: ajkr

fbshipit-source-id: a6ebd3820ef12e0af322cbfb7eb553de5bdfcb29

commit | commitdiff | tree

Jay Zhuang [Wed, 30 Nov 2022 21:49:51 +0000 (13:49 -0800)]

Blog: Time Aware Tiered Storage in RocksDB (#11005)

Summary:
Blog for time aware tiered storage

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11005

Test Plan: document only

Reviewed By: siying

Differential Revision: D41620430

Pulled By: ajkr

fbshipit-source-id: d85406a50a7b1553495f2f2f72143dbd90101b01

commit | commitdiff | tree

Hui Xiao [Tue, 29 Nov 2022 22:14:43 +0000 (14:14 -0800)]

Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892)

Summary:
**Context**
`Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown.

`DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest.
And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open.
- Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`.

As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown.

**Summary**
This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest.

**Test**
- Make check
- Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after
- [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears
- [Ongoing]Regular CI stress test to confirm such fix didn't harm anything

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892

Reviewed By: ajkr

Differential Revision: D40778965

Pulled By: hx235

fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f

commit | commitdiff | tree

Hui Xiao [Tue, 29 Nov 2022 18:56:42 +0000 (10:56 -0800)]

Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999)

Summary:
**Context/Summary:**

This reverts commit fc74abb436c5807e967aecc892bb8fa127f2fa85 and related HISTORY record.

The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004

Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D41572604

Pulled By: hx235

fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a

commit | commitdiff | tree

Changyu Bi [Tue, 29 Nov 2022 03:27:22 +0000 (19:27 -0800)]

Remove copying of range tombstones keys in iterator (#10878)

Summary:
In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding InternalKey comparator that compares `Slice` for internal key and `ParsedInternalKey` directly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878

Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmarks: I did not get improvement when compiling with DEBUG_LEVEL=0, and saw many noise. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see improvement.
```
# Favorable set up: half of the writes are DeleteRange.
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50

# benchmark command
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472  --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25

# main
readseq [AVG    5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec
readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec

# this PR
readseq [AVG    5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec
readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec
```

Reviewed By: ajkr

Differential Revision: D40711170

Pulled By: cbi42

fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f

commit | commitdiff | tree

Hui Xiao [Mon, 28 Nov 2022 23:45:03 +0000 (15:45 -0800)]

Trigger FIFO file deletion in non L0 only if exceeding max_table_files_size (#10955)

Summary:
**Context**

https://github.com/facebook/rocksdb/pull/10348 allows multi-level FIFO but accidentally made change to the logic of deleting files in `FIFOCompactionPicker::PickSizeCompaction`. With [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR156) and [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR235) together, it deletes one file in non-L0 even when `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`, which is incorrect.

As a consequence, FIFO exercises more file deletion in our crash testing, which is not able to verify correctly on deleted keys in the file deleted by compaction. This results in errors `error : inconsistent values for key 000000000000239F000000000000012B000000000000028B: expected state has the key, Get() returns NotFound.
Verification failed :(` or `Expected state has key 00000000000023A90000000000000003787878, iterator is at key 00000000000023A9000000000000004178
Column family: default, op_logs: S 00000000000023A90000000000000003787878`

**Summary**:
- Delete file for non-L0 only if `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`
- Add some helpful log to LOG file

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10955

Test Plan:
- Errors repro-ed by
```
./db_stress --preserve_unverified_changes=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=10 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=1000000 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=0 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=65 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=20
```
is gone after this fix
- CI

Reviewed By: ajkr

Differential Revision: D41319441

Pulled By: hx235

fbshipit-source-id: 6939753767007f7449ea7055b1420aabd03d7709

commit | commitdiff | tree

relife22 [Mon, 28 Nov 2022 17:42:42 +0000 (09:42 -0800)]

Add Apache Spark as a user (#10993)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10993

Reviewed By: ajkr

Differential Revision: D41543962

Pulled By: cbi42

fbshipit-source-id: a895d7863543bd64734c5c9faa7b55b0732b3d60

commit | commitdiff | tree

Changyu Bi [Wed, 23 Nov 2022 22:27:14 +0000 (14:27 -0800)]

Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966)

Summary:
Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966

Test Plan:
- added unit test
- stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100`
- ran different stress tests over sandcastle
- Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones.

Reviewed By: ajkr

Differential Revision: D41414172

Pulled By: cbi42

fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981

commit | commitdiff | tree

Andrew Kryczka [Wed, 23 Nov 2022 17:20:58 +0000 (09:20 -0800)]

Support tiering when file endpoints overlap (#10961)

Summary:
Enabled output to penultimate level when file endpoints overlap. This is probably only possible when range tombstones span files. Otherwise the overlapping files would all be included in the penultimate level inputs thanks to our atomic compaction unit logic.

Also, corrected `penultimate_output_range_type_`, which is a minor fix as it appears only used for logging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10961

Test Plan: updated unit test

Reviewed By: cbi42

Differential Revision: D41370615

Pulled By: ajkr

fbshipit-source-id: 7e75ec369a3b41b8382b336446c81825a4c4f572

commit | commitdiff | tree

Yanqin Jin [Wed, 23 Nov 2022 06:53:31 +0000 (22:53 -0800)]

Make best-efforts recovery verify SST unique ID before Version construction (#10962)

Summary:
The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true).

With best_efforts_recovery being true, RocksDB will recover to the latest point in
MANIFEST such that all valid SST files included up to this point pass unique ID checks as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D41378241

Pulled By: riversand963

fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c

commit | commitdiff | tree

jsteemann [Tue, 22 Nov 2022 23:51:01 +0000 (15:51 -0800)]

fix compile warnings (#10976)

Summary:
Fixes lots of compile warnings related to missing override specifiers, e.g.
```
./3rdParty/rocksdb/trace_replay/block_cache_tracer.h:130:10: warning: ‘virtual rocksdb::Status rocksdb::BlockCacheTraceWriterImpl::WriteBlockAccess(const rocksdb::BlockCacheTraceRecord&, const rocksdb::Slice&, const rocksdb::Slice&, const rocksdb::Slice&)’ can be marked override [-Wsuggest-override]
  130 |   Status WriteBlockAccess(const BlockCacheTraceRecord& record,
      |          ^~~~~~~~~~~~~~~~
./3rdParty/rocksdb/trace_replay/block_cache_tracer.h:136:10: warning: ‘virtual rocksdb::Status rocksdb::BlockCacheTraceWriterImpl::WriteHeader()’ can be marked override [-Wsuggest-override]
  136 |   Status WriteHeader();
      |          ^~~~~~~~~~~
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10976

Reviewed By: riversand963

Differential Revision: D41478588

Pulled By: ajkr

fbshipit-source-id: d30b0457241999e38b16aacf6dabe3e691f7c46f

commit | commitdiff | tree

Alan Paxton [Tue, 22 Nov 2022 23:48:59 +0000 (15:48 -0800)]

improve copying of Env in Options (#10666)

Summary:
Closes https://github.com/facebook/rocksdb/issues/9909

- Constructing an Options from a DBOptions should use the Env from the DBOptions
- DBOptions should be constructed with the default Env as the env_, rather than null. Why ever not ?

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10666

Reviewed By: riversand963

Differential Revision: D40515418

Pulled By: ajkr

fbshipit-source-id: 4122ba3f537660720262694c21ab4bfb13b6f8de

commit | commitdiff | tree

Andrew Kryczka [Tue, 22 Nov 2022 21:07:17 +0000 (13:07 -0800)]

Deflake DBTest2.TraceAndReplay by relaxing latency checks (#10979)

Summary:
Since the latency measurement uses real time it is possible for the operation to complete in zero microseconds and then fail these checks. We saw this with the operation that invokes Get() on an invalid CF. This PR relaxes the assertions to allow for operations completing in zero microseconds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10979

Reviewed By: riversand963

Differential Revision: D41478300

Pulled By: ajkr

fbshipit-source-id: 50ef096bd8f0162b31adb46f54ae6ddc337d0a5e

commit | commitdiff | tree

anand76 [Tue, 22 Nov 2022 03:24:42 +0000 (19:24 -0800)]

Post 7.9.0 release branch cut updates (#10974)

Summary:
Update HISTORY.md, version.h, and check_format_compatible.sh

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10974

Reviewed By: akankshamahajan15

Differential Revision: D41455289

Pulled By: anand1976

fbshipit-source-id: 99888ebcb9109e5ced80584a66b20123f8783c0b

commit | commitdiff | tree

Changyu Bi [Tue, 22 Nov 2022 01:08:50 +0000 (17:08 -0800)]

Set correct temperature for range tombstone only file in penultimate level (#10972)

Summary:
before this PR, if there is a range tombstone-only file generated in penultimate level, it is marked the `last_level_temperature`. This PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10972

Test Plan: added unit test for this scenario.

Reviewed By: ajkr

Differential Revision: D41449215

Pulled By: cbi42

fbshipit-source-id: 1e06b5ae3bc0183db2991a45965a9807a7e8be0c

commit | commitdiff | tree

anand76 [Tue, 22 Nov 2022 01:00:01 +0000 (17:00 -0800)]

Update HISTORY.md for 7.9.0 (#10973)

Summary:
Update HISTORY.md for 7.9.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10973

Reviewed By: pdillinger

Differential Revision: D41453720

Pulled By: anand1976

fbshipit-source-id: 47a23d4b6539ec6a9a09c9e69c026f7c8b10afa7

commit | commitdiff | tree

Peter Dillinger [Tue, 22 Nov 2022 00:17:36 +0000 (16:17 -0800)]

Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945)

Summary:
Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945

Test Plan: existing tests for CacheDumper, added basic unit test

Reviewed By: anand1976

Differential Revision: D41231497

Pulled By: pdillinger

fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068

commit | commitdiff | tree

Andrew Kryczka [Tue, 22 Nov 2022 00:14:03 +0000 (16:14 -0800)]

Fix CompactionIterator flag for penultimate level output (#10967)

Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967

Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
- Before this fix, Y was always zero
- After this fix, Y gradually increased throughout the benchmark

Reviewed By: riversand963

Differential Revision: D41417726

Pulled By: ajkr

fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329

commit | commitdiff | tree

Peter Dillinger [Mon, 21 Nov 2022 20:08:21 +0000 (12:08 -0800)]

Observe and warn about misconfigured HyperClockCache (#10965)

Summary:
Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused.

Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965

Test Plan:
unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.

Also manual testing with db_bench. Create db with
```
./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
```
Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
```
./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
```
Seeing warnings / errors or not as expected.

Reviewed By: anand1976

Differential Revision: D41406932

Pulled By: pdillinger

fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7

commit | commitdiff | tree

Yanqin Jin [Fri, 18 Nov 2022 04:43:50 +0000 (20:43 -0800)]

Test Merge with timestamps in stress test (#10948)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10948

Test Plan: make crash_test_with_ts

Reviewed By: ltamasi

Differential Revision: D41390854

Pulled By: riversand963

fbshipit-source-id: 599e114da8e2b2bbff5628fb8c67fa0393a31c05

commit | commitdiff | tree

Peter Dillinger [Thu, 17 Nov 2022 22:44:59 +0000 (14:44 -0800)]

Mark HyperClockCache as production-ready (#10963)

Summary:
After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963

Test Plan: existing tests, comment changes only

Reviewed By: siying

Differential Revision: D41384083

Pulled By: pdillinger

fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d

commit | commitdiff | tree

Levi Tamasi [Wed, 16 Nov 2022 20:22:35 +0000 (12:22 -0800)]

Mention wide-column support in HISTORY.md (#10959)

Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10959

Reviewed By: akankshamahajan15

Differential Revision: D41348198

Pulled By: ltamasi

fbshipit-source-id: 51e89d03c1fe87f576a766f609a7f233a519c83d

commit | commitdiff | tree

Peter Dillinger [Wed, 16 Nov 2022 18:15:55 +0000 (10:15 -0800)]

Remove prototype FastLRUCache (#10954)

Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954

Test Plan: tests updated

Reviewed By: akankshamahajan15

Differential Revision: D41310123

Pulled By: pdillinger

fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c

commit | commitdiff | tree

Peter Dillinger [Tue, 15 Nov 2022 18:47:15 +0000 (10:47 -0800)]

Re-arrange cache.h to prepare for refactoring (#10942)

Summary:
No material changes to code or comments, just re-arranging things to prepare for a big refactoring, making it easier to what changed. Some specifics:
* This groups things together in Cache in anticipation of secondary cache features being marked production-ready (vs. experimental).
* CacheEntryRole will be needed in definition of class Cache, so that has been moved above it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10942

Test Plan: existing tests

Reviewed By: anand1976

Differential Revision: D41205509

Pulled By: pdillinger

fbshipit-source-id: 3f2559ab1651c758918dc97056951fa2b5eb0348

commit | commitdiff | tree

Levi Tamasi [Tue, 15 Nov 2022 16:06:41 +0000 (08:06 -0800)]

Support using GetMergeOperands for verification with wide columns (#10952)

Summary:
With the recent changes, `GetMergeOperands` is now supported for wide-column entities as well, so we can use it for verification purposes in the non-batched stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10952

Test Plan: Ran a simple non-batched ops blackbox crash test.

Reviewed By: riversand963

Differential Revision: D41292114

Pulled By: ltamasi

fbshipit-source-id: 70b4c756a4a1fecb445c16c7096aad805a51203c

commit | commitdiff | tree

Akanksha Mahajan [Tue, 15 Nov 2022 00:14:41 +0000 (16:14 -0800)]

Fix db_stress failure in async_io in FilePrefetchBuffer (#10949)

Summary:
Fix db_stress failure in async_io in FilePrefetchBuffer.

From the logs, assertion was caused when
- prev_offset_ = offset but somehow prev_len != 0 and explicit_prefetch_submitted_ = true. That scenario is when we send async request to prefetch buffer during seek but in second seek that data is found in cache. prev_offset_ and prev_len_ get updated but we were not setting explicit_prefetch_submitted_ = false because of which buffers were getting out of sync.
It's possible a read by another thread might have loaded the block into the cache in the meantime.

Particular assertion example:
```
prev_offset: 0, prev_len_: 8097 , offset: 0, length: 8097, actual_length: 8097 , actual_offset: 0 ,
curr_: 0, bufs_[curr_].offset_: 4096 ,bufs_[curr_].CurrentSize(): 48541 , async_len_to_read: 278528, bufs_[curr_].async_in_progress_: false
second: 1, bufs_[second].offset_: 282624 ,bufs_[second].CurrentSize(): 0, async_len_to_read: 262144 ,bufs_[second].async_in_progress_: true ,
explicit_prefetch_submitted_: true , copy_to_third_buffer: false
```
As we can see curr_ was expected to read 278528 but it read 48541. Also buffers are out of sync.
Also `explicit_prefetch_submitted_` is set true but prev_len not 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10949

Test Plan:
- Ran db_bench for regression to make sure there is no regression;
- Ran db_stress failing without this fix,
- Ran build-linux-mini-crashtest 7- 8 times locally + CircleCI

Reviewed By: anand1976

Differential Revision: D41257786

Pulled By: akankshamahajan15

fbshipit-source-id: 1d100f94f8c06bbbe4cc76ca27f1bbc820c2494f

commit | commitdiff | tree

xiaochenfan [Mon, 14 Nov 2022 19:49:06 +0000 (11:49 -0800)]

Fix broken dependency: update zlib from 1.2.12 to 1.2.13 (#10833)

Summary:
zlib(https://zlib.net/) has released v1.2.13.

1.2.12 is no longer available for downloading and Makefile for rocksdb will be broken due to can't find the source .tar.gz.

https://nvd.nist.gov/vuln/detail/CVE-2022-37434

This pr update the version number and the shasum of new .tar.gz file. (1.2.13)

Fixes https://github.com/facebook/rocksdb/issues/10876

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10833

Reviewed By: hx235

Differential Revision: D40575954

Pulled By: ajkr

fbshipit-source-id: 3e560e453ddf58d045214fc4e64f83bef91f22e5

commit | commitdiff | tree

akankshamahajan [Mon, 14 Nov 2022 19:39:22 +0000 (11:39 -0800)]

Update unit test to avoid timeout (#10950)

Summary:
Update unit test to avoid timeout

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10950

Reviewed By: hx235

Differential Revision: D41258892

Pulled By: akankshamahajan15

fbshipit-source-id: cbfe94da63e9e54544a307845deb79ba42458301

commit | commitdiff | tree

anand76 [Mon, 14 Nov 2022 05:38:35 +0000 (21:38 -0800)]

Add some async read stats (#10947)

Summary:
Add stats for time spent in the ReadAsync call, and async read errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947

Test Plan: Run db_bench and look at stats

Reviewed By: akankshamahajan15

Differential Revision: D41236637

Pulled By: anand1976

fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf

commit | commitdiff | tree

Peter Dillinger [Sat, 12 Nov 2022 01:35:53 +0000 (17:35 -0800)]

Don't attempt to use SecondaryCache on block_cache_compressed (#10944)

Summary:
Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!

This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.

Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.

This problem was discovered during refactoring but didn't want to combine bug fix with that refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944

Test Plan: test added that fails on base revision (at least with ASAN)

Reviewed By: akankshamahajan15

Differential Revision: D41205578

Pulled By: pdillinger

fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f

commit | commitdiff | tree

Levi Tamasi [Sat, 12 Nov 2022 00:32:32 +0000 (16:32 -0800)]

Support Merge for wide-column entities in the compaction logic (#10946)

Summary:
The patch extends the compaction logic to handle `Merge`s in conjunction with wide-column entities. As usual, the merge operation is applied to the anonymous default column, and any other columns are unaffected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10946

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41233722

Pulled By: ltamasi

fbshipit-source-id: dfd9b1362222f01bafcecb139eb48480eb279fed

commit | commitdiff | tree

akankshamahajan [Fri, 11 Nov 2022 21:34:49 +0000 (13:34 -0800)]

Fix async_io regression in scans (#10939)

Summary:
Fix async_io regression in scans due to incorrect check which was causing the valid data in buffer to be cleared during seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10939

Test Plan:
- stress tests  export CRASH_TEST_EXT_ARGS="--async_io=1"
    make crash_test -j32
- Ran db_bench command which was caught the regression:
./db_bench --db=/rocksdb_async_io_testing/prefix_scan --disable_wal=1 --use_existing_db=true --benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=963 -duration=30 -ops_between_duration_checks=1 --async_io=true --compaction_readahead_size=4194304 --log_readahead_size=0 --blob_compaction_readahead_size=0 --initial_auto_readahead_size=65536 --num_file_reads_for_auto_readahead=0 --max_auto_readahead_size=524288

seekrandom   :    3777.415 micros/op 264 ops/sec 30.000 seconds 7942 operations;  132.3 MB/s (7942 of 7942 found)

Reviewed By: anand1976

Differential Revision: D41173899

Pulled By: akankshamahajan15

fbshipit-source-id: 2d75b06457d65b1851c92382565d9c3fac329dfe

commit | commitdiff | tree

Levi Tamasi [Fri, 11 Nov 2022 02:00:08 +0000 (18:00 -0800)]

Support Merge with wide-column entities in iterator (#10941)

Summary:
The patch adds `Merge` support for wide-column entities in `DBIter`. As before, the `Merge` operation is applied to the default column of the entity; any other columns are unchanged. As a small cleanup, the PR also changes the signature of `DBIter::Merge` to simply return a boolean instead of the `Merge` operation's `Status` since the actual `Status` is already stored in a member variable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10941

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41195471

Pulled By: ltamasi

fbshipit-source-id: 362cf555897296e252c3de5ddfbd569ef34f85ef

commit | commitdiff | tree

Levi Tamasi [Fri, 11 Nov 2022 01:29:57 +0000 (17:29 -0800)]

Refactor MergeHelper::MergeUntil a bit (#10943)

Summary:
The patch untangles some nested ifs in `MergeHelper::MergeUntil`. This will come in handy when extending the compaction logic to support `Merge` for wide-column entities, and also enables us to eliminate some repeated branching on value type and to decrease the scope of some variables.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10943

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41201946

Pulled By: ltamasi

fbshipit-source-id: 890bd3d4e31cdccadca614489a94686d76485ba9

commit | commitdiff | tree

Levi Tamasi [Wed, 9 Nov 2022 20:54:05 +0000 (12:54 -0800)]

Revisit the interface of MergeHelper::TimedFullMerge(WithEntity) (#10932)

Summary:
The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
overload, which makes the responsibilities clearer by making sure the query
result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
`SaveValue` and `GetContext`. Second, it changes the interface of
`TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
is a more decoupled design which will come in handy when adding support
for `Merge` with wide-column entities to `DBIter`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41129399

Pulled By: ltamasi

fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677

commit | commitdiff | tree

Levi Tamasi [Tue, 8 Nov 2022 22:49:16 +0000 (14:49 -0800)]

Clear saved value in DBIter::{Next, Prev} (#10934)

Summary:
`DBIter::saved_value_` stores the result of any `Merge` that was performed to compute the iterator's current value. This value can be ditched whenever the iterator's position is changed, and is already cleared in `Seek`, `SeekForPrev`, `SeekToFirst`, and `SeekToLast`. With the patch, it is also cleared in `Next` and `Prev`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10934

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41133473

Pulled By: ltamasi

fbshipit-source-id: cf9e936f48151e64e455cc1664d6e9f4a03aa308

commit | commitdiff | tree

Daniel Engel [Tue, 8 Nov 2022 19:56:55 +0000 (11:56 -0800)]

Fix use of crc32c 3way on portable builds using MSVC (#10667)

Summary:
Hello,
As discussed previously in this [discussion](https://github.com/facebook/rocksdb/pull/9680#discussion_r853105163), the mentioned PR introduced a regression in portable versions that compile with MSVC - crc_3way optimization won't be used even in cases where it is supported.

This PR aims to fix just that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10667

Reviewed By: akankshamahajan15

Differential Revision: D40644592

Pulled By: ajkr

fbshipit-source-id: dadbeb10d57c19800e74288258ec3b96095557dd

Unnamed repository; edit this file 'description' to name the repository.

RSS Atom