Hui Xiao [Fri, 8 Oct 2021 23:59:21 +0000 (16:59 -0700)]
Update HISTORY.md for #8428 (#9001)
Summary:
Context:
HISTORY.md was not properly updated along with the change in https://github.com/facebook/rocksdb/pull/8428, where we introduced a change of accounting compression dictionary buffering memory and an extra condition of triggering data unbuffering.
Updated HISTORY.md for https://github.com/facebook/rocksdb/pull/8428 in 6.25.0 HISTORY.md section.
Updated blog post https://rocksdb.org/blog/2021/05/31/dictionary-compression.html.
Return Status::NotSupported() in RateLimiter::GetTotalPendingRequests default impl (#8950)
Summary:
Context:
After more discussion, a fix in https://github.com/facebook/rocksdb/issues/8938 might turn out to be too restrictive for the case where `GetTotalPendingRequests` might be invoked on RateLimiter classes that does not support the recently added API `RateLimiter::GetTotalPendingRequests` (https://github.com/facebook/rocksdb/issues/8890) due to the `assert(false)` in https://github.com/facebook/rocksdb/issues/8938. Furthermore, sentinel value like `-1` proposed in https://github.com/facebook/rocksdb/issues/8938 is easy to be ignored and unchecked. Therefore we decided to adopt `Status::NotSupported()`, which is also a convention of adding new API to public header in RocksDB.
- Changed return value type of `RateLimiter::GetTotalPendingRequests` in related declaration/definition
- Passed in pointer argument to hold the output instead of returning it as before
- Adapted to the changes above in calling `RateLimiter::GetTotalPendingRequests` in test
- Minor improvement to `TEST_F(RateLimiterTest, GetTotalPendingRequests)`: added failure message for assertion and replaced repetitive statements with a loop
RandomAccessFileReader::MultiRead() should not return read bytes not read (#8941)
Summary:
Right now, if underlying read returns fewer bytes than asked for, RandomAccessFileReader::MultiRead() still returns those in the buffer to upper layer. This can be a surprise to upper layer.
This is unlikely to cause incorrect data. To cause incorrect data, checksum checking in upper layer should pass with short reads, whose chance is low.
Make RateLimiter::GetTotalPendingRequest() non pure virtual for backward compability (#8938)
Summary:
Context/Summary:
https://github.com/facebook/rocksdb/pull/8890 added a public API `RateLimiter::GetTotalPendingRequest()` but mistakenly marked it as pure virtual, forcing RateLimiter's derived classes to implement this function and breaking backward compatibility.
This PR makes `RateLimiter::GetTotalPendingRequest()` as non-pure virtual method by providing a trivial implementation in rate_limiter.h
Peter Dillinger [Tue, 21 Sep 2021 18:12:19 +0000 (11:12 -0700)]
Finish BackupEngine migration to IOStatus (#8940)
Summary:
Updates a few remaining functions that should have been updated
from Status -> IOStatus, and adds to HISTORY for the overall change
including https://github.com/facebook/rocksdb/issues/8820.
Jay Zhuang [Sat, 18 Sep 2021 06:24:03 +0000 (23:24 -0700)]
RemoteCompaction support Fallback to local compaction (#8709)
Summary:
Add support for fallback to local compaction, the user can
return `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to
run the compaction locally instead of waiting for the remote compaction
result.
Yanqin Jin [Sat, 18 Sep 2021 01:43:32 +0000 (18:43 -0700)]
Batch blob read IO for MultiGet (#8699)
Summary:
In batched `MultiGet()`, RocksDB batches blob read IO and uses `RandomAccessFileReader::MultiRead()`
to read the blobs instead of issuing multiple `Read()`.
Summary:
- Fixed a bug in `RateLimiterTest.GeneratePriorityIterationOrder` that the callbacks in this test were not called starting from `i = 1`. Fix by increasing `rate_bytes_per_sec` and requested bytes.
- The bug is due to the previous `rate_bytes_per_sec` was set too small, resulting in `refill_bytes_per_period` less than `kMinRefillBytesPerPeriod`. Hence the actual `refill_bytes_per_period` was equal to `kMinRefillBytesPerPeriod` due to the logic [here](https://github.com/facebook/rocksdb/blob/main/util/rate_limiter.cc#L302-L303) and it ended up being greater than the previously set requested bytes. Therefore starting from `i = 1`, `RefillBytesAndGrantRequests()` and `GeneratePriorityIterationOrder` won't be called and the test callbacks was not triggered to execute the assertion.
- Added internal flag to assert callbacks are called in `RateLimiterTest.GeneratePriorityIterationOrder` to prevent any future changes defeat the purpose of the test [as suggested](https://github.com/facebook/rocksdb/pull/8890#discussion_r704915134)
- Increased `rate_bytes_per_sec` and bytes of each request in `RateLimiterTest.GetTotalBytesThrough`, `RateLimiterTest.GetTotalRequests`, `RateLimiterTest.GetTotalPendingRequests` to trigger the "long path" of execution (i.e, the one trigger RefillBytesAndGrantRequests()) to increase test coverage
- This increased the running time of the three tests, see test plan for time difference running locally
- Cleared up sync point effects after each test by calling `SyncPoint::GetInstance()->DisableProcessing();` and `SyncPoint::GetInstance()->ClearAllCallBacks();` in `~RateLimiterTest()` [as suggested](https://github.com/facebook/rocksdb/pull/8595/files#r697534279)
- It's fine to call these two methods even when `EnableProcessing()` or `SetCallBack()` is not called in the test or is already cleaned up. In those cases, calling these two functions in destructor is effectively no-op.
- This will allow cleaning up sync point effects of previous test even when the previous test failed in assertion.
- Added missing `SyncPoint::GetInstance()->DisableProcessing();` and `SyncPoint::GetInstance()->ClearCallBacks(..);` in existing tests for completeness
- Called `SyncPoint::GetInstance()->DisableProcessing();` and `SyncPoint::GetInstance()->ClearCallBacks(..);` in loop in `RateLimiterTest.GeneratePriorityIterationOrder` for completeness
Test Plan:
- Passing existing tests
- To verify the 1st change, run `RateLimiterTest.GeneratePriorityIterationOrder` with assertions of callbacks are indeed called under original `rate_bytes_per_sec` and request byte and under updated `rate_bytes_per_sec` and request byte. The former will fail the assertion while the latter succeeds.
- Here is the increased test time due to the 3rd change mentioned above in the summary. The relevant 3 tests mentioned in total increase the test time by 6s (~6000/33848 = 17.7% of the original total test time), which IMO is acceptable for better test coverage through running the "long path".
- current (run on branch rate_limiter_ut_improve locally)
[ RUN ] RateLimiterTest.GetTotalBytesThrough
[ OK ] RateLimiterTest.GetTotalBytesThrough (3000 ms)
[ RUN ] RateLimiterTest.GetTotalRequests
[ OK ] RateLimiterTest.GetTotalRequests (3001 ms)
[ RUN ] RateLimiterTest.GetTotalPendingRequests
[ OK ] RateLimiterTest.GetTotalPendingRequests (0 ms)
...
[----------] 10 tests from RateLimiterTest (43349 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test case ran. (43349 ms total)
[ PASSED ] 10 tests.
- previous (run on branch main locally)
[ RUN ] RateLimiterTest.GetTotalBytesThrough
[ OK ] RateLimiterTest.GetTotalBytesThrough (0 ms)
[ RUN ] RateLimiterTest.GetTotalRequests
[ OK ] RateLimiterTest.GetTotalRequests (0 ms)
[ RUN ] RateLimiterTest.GetTotalPendingRequests
[ OK ] RateLimiterTest.GetTotalPendingRequests (0 ms)
...
[----------] 10 tests from RateLimiterTest (33848 ms total)
[----------] Global test environment tear-down
[==========] 10 tests from 1 test case ran. (33848 ms total)
[ PASSED ] 10 tests.
Summary:
This keeps the implementations/API backward compatible. Implementations of Statistics will need to override this method (and be registered with the ObjectRegistry) in order to be created via CreateFromString.
Expose blob file information through the EventListener interface (#8675)
Summary:
1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files.
2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
3. Test OnFile*Finish operations notifications with Blob Files.
4. Log the blob file creation/deletion events through EventLogger in Log file.
Summary:
Right now, the failure injection test for MultiGet() is not sufficient. Improve it with TestFSRandomAccessFile::MultiRead() injecting failures.
More robust checking of IO uring completion data (#8894)
Summary:
Potential bugs in the IO uring implementation can cause bad data to be returned in the completion queue. Add some checks in the PosixRandomAccessFile::MultiRead completion handling code to catch such errors and fail the entire MultiRead. Also log some diagnostic messages and stack trace.
Levi Tamasi [Wed, 15 Sep 2021 19:15:47 +0000 (12:15 -0700)]
Use the write amplification value calculated by RocksDB in benchmark.sh (#8915)
Summary:
Currently, `benchmark.sh` computes write amplification itself; the patch
changes the script to use the value calculated by RocksDB (which is
printed as part of the periodic statistics). This also has the benefit
of being correct for BlobDB as well, since it also considers the amount
of data written to blob files.
Always iniitalize ArenaWrappedDBIter::db_iter_ to nullptr (#8889)
Summary:
ArenaWrappedDBIter::db_iter_ should never be nullptr. However, when debugging a segfault, it's hard to distinguish it is not initialized (not possible) and other corruption. Add this nullptr to help distinguish the case.
Andrew Kryczka [Tue, 14 Sep 2021 20:13:36 +0000 (13:13 -0700)]
Adapt key-value checksum for timestamp-suffixed keys (#8914)
Summary:
After https://github.com/facebook/rocksdb/issues/8725, keys added to `WriteBatch` may be timestamp-suffixed, while `WriteBatch` has no awareness of the timestamp size. Therefore, `WriteBatch` can no longer calculate timestamp checksum separately from the rest of the key's checksum in all cases.
This PR changes the definition of key in KV checksum to include the timestamp suffix. That way we do not need to worry about where the timestamp begins within the key. I believe the only practical effect of this change is now `AssignTimestamp()` requires recomputing the whole key checksum (`UpdateK()`) rather than just the timestamp portion (`UpdateT()`).
Adam Retter [Tue, 14 Sep 2021 18:08:46 +0000 (11:08 -0700)]
Improve benchmark.sh (#8730)
Summary:
* Started on some proper usage text to document the options
* Added a `JOB_ID` parameter, so that we can trace jobs and relate them to other assets
* Now generates a correct TSV file of the summary
* Summary has new additional fields:
* RocksDB Version
* Date
* Job ID
* db_bench log files now also include the Job ID
Summary:
Fix WAL log data corruption when using DBOptions.manual_wal_flush(true) and WriteOptions.sync(true) together (https://github.com/facebook/rocksdb/issues/8723)
Summary:
These tests would frequently fail to find SST files due to race
condition in running ldb (read-only) on an open DB which might do automatic
compaction. But only sometimes would that failure translate into test
failure because the implementation of ldb file_checksum_dump would
swallow many errors. Now,
* DB closed while running ldb to avoid unnecessary race condition
* Detect and report/propagate more failures in `ldb file_checksum_dump`
* Use --hex so that random binary data is not printed to console
Peter Dillinger [Mon, 13 Sep 2021 19:16:51 +0000 (12:16 -0700)]
Bypass unused parameterization in ExternalSSTFileBasicTest.IngestExte… (#8910)
Summary:
Facebook infrastructure doesn't like continuously skipping
tests, so fixing this permanently disabled parameterization to BYPASS
instead of SKIP. (Internal ref: T100525285)
Levi Tamasi [Mon, 13 Sep 2021 17:46:07 +0000 (10:46 -0700)]
Use GetBlobFileSize instead of GetTotalBlobBytes in DB properties (#8902)
Summary:
The patch adjusts the definition of BlobDB's DB properties a bit by
switching to `GetBlobFileSize` from `GetTotalBlobBytes`. The
difference is that the value returned by `GetBlobFileSize` includes
the blob file header and footer as well, and thus matches the on-disk
size of blob files. In addition, the patch removes the `Version` number
from the `blob_stats` property, and updates/extends the unit tests a little.
Change the File System File Wrappers to std::unique_ptr (#8618)
Summary:
This allows the wrapper classes to own the wrapped object and eliminates confusion as to ownership. Previously, many classes implemented their own ownership solutions. Fixes https://github.com/facebook/rocksdb/issues/8606
Yanqin Jin [Sun, 12 Sep 2021 22:33:15 +0000 (15:33 -0700)]
Allow WriteBatch to have keys with different timestamp sizes (#8725)
Summary:
In the past, we unnecessarily requires all keys in the same write batch
to be from column families whose timestamps' formats are the same for
simplicity. Specifically, we cannot use the same write batch to write to
two column families, one of which enables timestamp while the other
disables it.
The limitation is due to the member `timestamp_size_` that used to exist
in each `WriteBatch` object. We pass a timestamp_size to the constructor
of `WriteBatch`. Therefore, users can simply use the old
`WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while
the internal implementation of `WriteBatch` will take care of memory
allocation for timestamps.
The above is not necessary.
One the one hand, users can set up a memory buffer to store user key and
then contiguously append the timestamp to the user key. Then the user
can pass this buffer to the `WriteBatch::Put(Slice&)` API.
On the other hand, users can set up a SliceParts object which is an
array of Slices and let the last Slice to point to the memory buffer
storing timestamp. Then the user can pass the SliceParts object to the
`WriteBatch::Put(SliceParts&)` API.
Peter Dillinger [Fri, 10 Sep 2021 16:59:05 +0000 (09:59 -0700)]
Fix and detect headers with missing dependencies (#8893)
Summary:
It's always annoying to find a header does not include its own
dependencies and only works when included after other includes. This
change adds `make check-headers` which validates that each header can
be included at the top of a file. Some headers are excluded e.g. because
of platform or external dependencies.
rocksdb_namespace.h had to be re-worked slightly to enable checking for
failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.)
Fixes mostly involve adding and cleaning up #includes, but for
FileTraceWriter, a constructor was out-of-lined to make a forward
declaration sufficient.
This check is not currently run with `make check` but is added to
CircleCI build-linux-unity since that one is already relatively fast.
Add public API RateLimiter::GetTotalPendingRequests() (#8890)
Summary:
Context/Summary:
As users requested, a public API RateLimiter::GetTotalPendingRequests() is added to expose the total number of pending requests for bytes in the rate limiter, which is the size of the request queue of that priority (or of all priorities, if IO_TOTAL is interested) at the time when this API is called.
Add support to the ObjectRegistry for ManagedObjects (#8658)
Summary:
ManagedObjects are shared pointer objects where RocksDB wants to share a single object between multiple configurations. For example, the Cache may be shared between multiple column families/tables or the Statistics may be shared between multiple databases.
ManagedObjects are stored in the ObjectRegistry by Type (e.g. Cache) and ID. For a given type/ID name, a single object is stored.
Levi Tamasi [Fri, 10 Sep 2021 01:57:01 +0000 (18:57 -0700)]
Support timestamps in SstFileWriter (#8899)
Summary:
As a first step of supporting user-defined timestamps with ingestion, the
patch adds timestamp support to `SstFileWriter`; namely, it adds new
versions of the `Put` and `Delete` APIs that take timestamps. (`Merge`
and `DeleteRange` are currently not supported with user-defined timestamps
in general but once those features are implemented, we can handle them
in `SstFileWriter` in a similar fashion.) The new APIs validate the size of
the timestamp provided by the client. Similarly, calls to the pre-existing
timestamp-less APIs are now disallowed when user-defined timestamps are
in use according to the comparator.
Support custom Env in db_sst_test and external_sst_file_basic_test (#8888)
Summary:
Support custom Env in these tests. Some custom Envs do not support reopening a file for write, either normal mode or Random RW mode. Added some additional checks in external_sst_file_basic_test to accommodate those Envs.
Charge read to rate limiter in BackupEngine (#8722)
Summary:
Context:
While all the non-trivial write operations in BackupEngine go through the RateLimiter, reads currently do not. In general, this is not a huge issue because (especially since some I/O efficiency fixes) reads in BackupEngine are mostly limited by corresponding writes, for both backup and restore. But in principle we should charge the RateLimiter for reads as well.
- Charged read operations in `BackupEngineImpl::CopyOrCreateFile`, `BackupEngineImpl::ReadFileAndComputeChecksum`, `BackupEngineImpl::BackupMeta::LoadFromFile` and `BackupEngineImpl::GetFileDbIdentities`
Andrew Kryczka [Wed, 8 Sep 2021 20:37:59 +0000 (13:37 -0700)]
prevent stranded LATEST_BACKUP in BackupEngineTest.NoDeleteWithReadOnly (#8887)
Summary:
A "LATEST_BACKUP" file was left in the backup directory by
"BackupEngineTest.NoDeleteWithReadOnly" test, affecting future test
runs. In particular, it caused "BackupEngineTest.IOStats" to fail since
it relies on backup directory containing only data written by its
`BackupEngine`.
The fix is to promote "LATEST_BACKUP" to an explicitly managed file so
it is deleted in `BackupEngineTest` constructor if it exists.
Account for dictionary-building buffer in global memory limit (#8428)
Summary:
Context:
Some data blocks are temporarily buffered in memory in BlockBasedTableBuilder for building compression dictionary used in data block compression. Currently this memory usage is not counted toward our global memory usage utilizing block cache capacity. To improve that, this PR charges that memory usage into the block cache to achieve better memory tracking and limiting.
- Reserve memory in block cache for buffered data blocks that are used to build a compression dictionary
- Release all the memory associated with buffering the data blocks mentioned above in EnterUnbuffered(), which is called when (a) buffer limit is exceeded after buffering OR (b) the block cache becomes full after reservation OR (c) BlockBasedTableBuilder calls Finish()
Summary:
RocksDB exposes certain internal statistics via the DB property interface.
However, there are currently no properties related to BlobDB.
For starters, we would like to add the following BlobDB properties:
`rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
`rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
`rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions.
`rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
`rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.
Test Plan:
```
➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
[==========] Running 16 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 10 tests from DBBlobBasicTest
[ RUN ] DBBlobBasicTest.GetBlob
[ OK ] DBBlobBasicTest.GetBlob (12 ms)
[ RUN ] DBBlobBasicTest.MultiGetBlobs
[ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
[ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex
[ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
[ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
[ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
[ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
[ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
[ RUN ] DBBlobBasicTest.GenerateIOTracing
[ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
[ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
[ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
[ RUN ] DBBlobBasicTest.GetMergeBlobWithPut
[ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
[ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut
[ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
[ RUN ] DBBlobBasicTest.BlobDBProperties
[ OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
[----------] 10 tests from DBBlobBasicTest (124 ms total)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)
[----------] Global test environment tear-down
[==========] 16 tests from 2 test cases ran. (2190 ms total)
[ PASSED ] 16 tests.
```
Make MemTableRepFactory into a Customizable class (#8419)
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
Peter Dillinger [Wed, 8 Sep 2021 05:40:37 +0000 (22:40 -0700)]
Fix POSIX LockFile after failure to create file (#8747)
Summary:
Failure to create the lock file (e.g. out of space) could
prevent future LockFile attempts in the same process on the same file
from succeeding.
Also added DEBUG code to fail assertion if PosixFileLock is destroyed
without using UnlockFile (which is a risk because FileLock is in the
public API with virtual destructor).
Peter Dillinger [Wed, 8 Sep 2021 04:18:21 +0000 (21:18 -0700)]
Add (& fix) some simple source code checks (#8821)
Summary:
* Don't hardcode namespace rocksdb (use ROCKSDB_NAMESPACE)
* Don't #include <rocksdb/...> (use double quotes)
* Support putting NOCOMMIT (any case) in source code that should not be
committed/pushed in current state.
These will be run with `make check` and in GitHub actions
Andrew Kryczka [Wed, 8 Sep 2021 01:23:58 +0000 (18:23 -0700)]
Bytes read/written stats for `CreateNewBackup*()` (#8819)
Summary:
Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work.
Jay Zhuang [Tue, 7 Sep 2021 22:38:04 +0000 (15:38 -0700)]
Remove asan_symbolize.py for internal asan build (#8737)
Summary:
asan_symbolize.py is not compatible with python3. Also make it
consistent with public CI, which doesn't use asan_symbolize.py
And update coverage_test.sh to use python3.
Andrew Kryczka [Tue, 7 Sep 2021 20:25:24 +0000 (13:25 -0700)]
Bytes read stat for `VerifyChecksum()` and `VerifyFileChecksums()` APIs (#8741)
Summary:
- Clarified some comments on compatibility for adding new ticker stats
- Added read I/O stats for `VerifyChecksum()` and `VerifyFileChecksums()` APIs
Peter Dillinger [Tue, 7 Sep 2021 20:04:07 +0000 (13:04 -0700)]
Improve support for using regexes (#8740)
Summary:
* Consolidate use of std::regex for testing to testharness.cc, to
minimize Facebook linters constantly flagging uses in non-production
code.
* Improve syntax and error messages for asserting some string matches a
regex in tests.
* Add a public Regex wrapper class to encapsulate existing usage in
ObjectRegistry.
* Remove unnecessary include <regex>
* Put warnings that use of Regex in production code could cause bad
performance or stack overflow.
Intended follow-up work:
* Replace std::regex with another underlying implementation like RE2
* Improve ObjectRegistry interface in terms of possibly confusing literal
string matching vs. regex and in terms of reporting invalid regex.
Test Plan:
tests updated, basic unit test for public Regex, and some manual
testing of temporary changes to see example error messages:
utilities/backupable/backupable_db_test.cc:917: Failure
000010_1162373755_138626.blob (child.name)
does not match regex
[0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern)
db/db_basic_test.cc:74: Failure
R3SHSBA8C4U0CIMV2ZB0 (sid3)
does not match regex [0-9A-Z]{20}HAHAHA
Jay Zhuang [Sat, 4 Sep 2021 02:03:56 +0000 (19:03 -0700)]
Fix regression test script (#8753)
Summary:
Regression test is broken and not running:
1. failed test is not reporting, fix it by add `set -e`
2. internal regression test is not run inside github, removing that
3. fix a few minor issues to pass the test
4. delete unused binary size build, and regression test is reporting binary size now.
Peter Dillinger [Wed, 1 Sep 2021 21:14:50 +0000 (14:14 -0700)]
Remove some unneeded code (#8736)
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons
Hui Xiao [Tue, 31 Aug 2021 17:59:14 +0000 (10:59 -0700)]
Implement superior user & mid IO priority level in GenericRateLimiter (#8595)
Summary:
Context:
An extra IO_USER priority in rate limiter allows users to optionally charge WAL writes / SST reads to rate limiter at this priority level, which then has higher priority than IO_HIGH and IO_LOW. With an extra IO_USER priority, it allows users to better specify the relative urgency/importance among different requests in rate limiter. As a consequence, IO resource management can better prioritize and limit resource based on user's need.
The IO_USER is implemented as superior priority in GenericRateLimiter, in the sense that its request queue will always be iterated first without being constrained to fairness. The reason is that the notion of fairness is only meaningful in helping lower priorities in background IO (i.e, IO_HIGH/MID/LOW) to gain some fair chance to run so that it does not block foreground IO (i.e, the ones that are charged at the level of IO_USER). As we can see, the ultimate goal here is to not blocking foreground IO at IO_USER level, which justifies the superiority of IO_USER.
Similar benefits exist for IO_MID priority.
- Rewrote the logic of deciding the order of iterating request queues of high/low priorities to include the extra user/mid priority w/o affecting the existing behavior (see PR's [comment](https://github.com/facebook/rocksdb/pull/8595/files#r678749331))
- Included the request queue of user-pri/mid-pri in the code path of next-leader-candidate signaling and GenericRateLimiter's destructor
- Included the extra user/mid-pri in bookkeeping data structures: total_bytes_through_ and total_requests_
- Re-written the previous impl of explicitly iterating priorities with a loop from Env::IO_LOW to Env::IO_TOTAL
Test Plan:
- passed existing rate_limiter_test.cc
- passed added unit tests in rate_limiter_test.cc
- run performance test to verify performance with only high/low requests is not affected by this change
- Set-up command:
`TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --duration=5 --compression_type=none --num=100000000 --disable_auto_compactions=true --write_buffer_size=1048576 --writable_file_max_buffer_size=65536 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --level0_slowdown_writes_trigger=$(((1 << 31) - 1)) --level0_stop_writes_trigger=$(((1 << 31) - 1))`
Qizhong Mao [Tue, 31 Aug 2021 17:56:08 +0000 (10:56 -0700)]
Replace `std::shared_ptr<SystemClock>` by `SystemClock*` in `TraceExecutionHandler` (#8729)
Summary:
All/most trace related APIs directly use `SystemClock*` (https://github.com/facebook/rocksdb/pull/8033). Do the same in `TraceExecutionHandler`.
Peter Dillinger [Mon, 30 Aug 2021 22:19:39 +0000 (15:19 -0700)]
Built-in support for generating unique IDs, bug fix (#8708)
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.
This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.
Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
* Lots of info from default Env like time, process id, and hostname.
* std::random_device
* port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.
DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.
GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.
Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).
Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.
anand76 [Thu, 26 Aug 2021 22:39:32 +0000 (15:39 -0700)]
Fix a race condition in DumpStats() during iteration of the ColumnFamilySet (#8714)
Summary:
DumpStats() iterates through the ColumnFamilySet. There is a potential
race condition because it does Ref the cfd, and the cfd could get
destroyed during the iteration.
Yanqin Jin [Wed, 25 Aug 2021 02:03:58 +0000 (19:03 -0700)]
Temporarily disable block-based filter when stress testing timestamp (#8703)
Summary:
Current implementation does not support user-defined timestamp when
block-based filter is used. Will implement the support in the future, or
wait to see if block-based filter can be deprecated and removed.
Yanqin Jin [Wed, 25 Aug 2021 01:17:32 +0000 (18:17 -0700)]
Fix a bug of secondary instance sequence going backward (#8653)
Summary:
Recent refactor of `ReactiveVersionSet::ReadAndApply()` uses
`ManifestTailer` whose `Iterate()` method can cause the db's
`last_sequence_` to go backward. Consequently, read requests can see
out-dated data. For example, latest changes to the primary will not be
seen on the secondary even after a `TryCatchUpWithPrimary()` if no new
write batches are read from the WALs and no new MANIFEST entries are
read from the MANIFEST.
Fix the bug so that `VersionEditHandler::CheckIterationResult` will
never decrease `last_sequence_`, `last_allocated_sequence_` and
`last_published_sequence_`.
Peter Dillinger [Wed, 25 Aug 2021 00:45:01 +0000 (17:45 -0700)]
Add port::GetProcessID() (#8693)
Summary:
Useful in some places for object uniqueness across processes.
Currently used for generating a host-wide identifier of Cache objects
but expected to be used soon in some unique id generation code.
`int64_t` is chosen for return type because POSIX uses signed integer type,
usually `int`, for `pid_t` and Windows uses `DWORD`, which is `uint32_t`.
Future work: avoid copy-pasted declarations in port_*.h, perhaps with
port_common.h always included from port.h
Hui Xiao [Tue, 24 Aug 2021 19:42:31 +0000 (12:42 -0700)]
Refactor WriteBufferManager::CacheRep into CacheReservationManager (#8506)
Summary:
Context:
To help cap various memory usage by a single limit of the block cache capacity, we charge the memory usage through inserting/releasing dummy entries in the block cache. CacheReservationManager is such a class (non thread-safe) responsible for inserting/removing dummy entries to reserve cache space for memory used by the class user.
- Refactored the inner private class CacheRep of WriteBufferManager into public CacheReservationManager class for reusability such as for https://github.com/facebook/rocksdb/pull/8428
- Encapsulated implementation details of cache key generation and dummy entries insertion/release in cache reservation as discussed in https://github.com/facebook/rocksdb/pull/8506#discussion_r666550838
- Consolidated increase/decrease cache reservation into one API - UpdateCacheReservation.
- Adjusted the previous dummy entry release algorithm in decreasing cache reservation to be loop-releasing dummy entries to stay symmetric to dummy entry insertion algorithm
- Made the previous dummy entry release algorithm in delayed decrease mode more aggressive for better decreasing cache reservation when memory used is less likely to increase back.
Previously, the algorithms only release 1 dummy entries when new_mem_used < 3/4 * cache_allocated_size_ and cache_allocated_size_ - kSizeDummyEntry > new_mem_used.
Now, the algorithms loop-releases as many dummy entries as possible when new_mem_used < 3/4 * cache_allocated_size_.
- Updated WriteBufferManager's test cases to adapt to changes on the release algorithm mentioned above and left comment for some test cases for clarity
- Replaced the previous cache key prefix generation (utilizing object address related to the cache client) with one that utilizes Cache->NewID() to prevent cache-key collision among dummy entry clients sharing the same cache.
The specific collision we are preventing happens when the object address is reused for a new cache-key prefix while the old cache-key using that same object address in its prefix still exists in the cache. This could happen due to that, under LRU cache policy, there is a possible delay in releasing a cache entry after the cache client object owning that cache entry get deallocated. In this case, the object address related to the cache client object can get reused for other client object to generate a new cache-key prefix.
This prefix generation can be made obsolete after Peter's unification of all the code generating cache key, mentioned in https://github.com/facebook/rocksdb/pull/8506#discussion_r667265255
Andrew Kryczka [Tue, 24 Aug 2021 06:08:25 +0000 (23:08 -0700)]
Deflake write-prepared and write-unprepared tests (#8696)
Summary:
The `JobContext::job_snapshot` referenced DB state but could
have been deleted by a BG thread after the signal/unlock allowing
shutdown to proceed. Then we would see an error like this (valgrind):
```
==354104== Thread 2:
==354104== Invalid read of size 8
==354104== at 0x694C4D: rocksdb::ManagedSnapshot::~ManagedSnapshot() (snapshot_impl.cc:20)
==354104== by 0x58F5BA: operator() (unique_ptr.h:81)
==354104== by 0x58F5BA: operator() (unique_ptr.h:75)
==354104== by 0x58F5BA: ~unique_ptr (unique_ptr.h:292)
==354104== by 0x58F5BA: rocksdb::JobContext::~JobContext() (job_context.h:221)
==354104== by 0x5F155E: rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) (db_impl_compaction_flush.cc:2696)
==354104== by 0x5F1BC2: rocksdb::DBImpl::BGWorkCompaction(void*) (db_impl_compaction_flush.cc:2468)
==354104== by 0x83707A: operator() (std_function.h:688)
==354104== by 0x83707A: rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) (threadpool_imp.cc:266)
==354104== by 0x8373ED: rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) (threadpool_imp.cc:307)
==354104== by 0x492A800: execute_native_thread_routine (in /usr/local/fbcode/platform009/lib/libstdc++.so.6.0.28)
==354104== by 0x4A5020B: start_thread (in /usr/local/fbcode/platform009/lib/libpthread-2.30.so)
==354104== by 0x4CF281E: clone (in /usr/local/fbcode/platform009/lib/libc-2.30.so)
```
Jay Zhuang [Mon, 23 Aug 2021 23:26:13 +0000 (16:26 -0700)]
Add extra information to RemoteCompaction APIs (#8680)
Summary:
Currently, we only provide job_id in RemoteCompaction APIs, the
main problem of `job_id` is it cannot uniquely identify a compaction job
between DB instances or between sessions.
Providing DB and session id to the user, which will make building cross
DB compaction service easier.