rocksdb.git
3 years agoLRUCache midpoint insertion master
Yi Wu [Thu, 24 May 2018 22:45:49 +0000 (15:45 -0700)]
LRUCache midpoint insertion

Summary:
Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache.
Closes https://github.com/facebook/rocksdb/pull/3877

Differential Revision: D8100895

Pulled By: yiwu-arbug

fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da

3 years agoCatchup with posix features
Dmitri Smirnov [Thu, 24 May 2018 22:05:00 +0000 (15:05 -0700)]
Catchup with posix features

Summary:
Catch up with Posix features
  NewWritableRWFile must fail when file does not exists
  Implement Env::Truncate()
  Adjust Env options optimization functions
  Implement MemoryMappedBuffer on Windows.
Closes https://github.com/facebook/rocksdb/pull/3857

Differential Revision: D8053610

Pulled By: ajkr

fbshipit-source-id: ccd0d46c29648a9f6f496873bc1c9d6c5547487e

3 years agoport_posix: use posix_memalign() for aligned_alloc
Kefu Chai [Thu, 24 May 2018 18:59:27 +0000 (11:59 -0700)]
port_posix: use posix_memalign() for aligned_alloc

Summary:
to workaround issue of http://tracker.ceph.com/issues/21422 .
and in tcmalloc aligned_alloc and posix_memalign() are basically the
same thing. the same applies to GNU glibc.

fixes #3175

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/3862

Differential Revision: D8147930

Pulled By: yiwu-arbug

fbshipit-source-id: 355afe93c4dd0a96a0d711ef190e8b86fbe8d11d

3 years agorefactor constructor of LRUCacheShard
Yi Wu [Thu, 24 May 2018 01:53:17 +0000 (18:53 -0700)]
refactor constructor of LRUCacheShard

Summary:
Update LRUCacheShard constructor so that adding new params to it don't need to add extra SetXXX() methods.
Closes https://github.com/facebook/rocksdb/pull/3896

Differential Revision: D8128618

Pulled By: yiwu-arbug

fbshipit-source-id: 6afa715de1493a50de413678761a765e3af9b83b

3 years agoIntroduce library-independent default compression level
Andrew Kryczka [Thu, 24 May 2018 01:33:00 +0000 (18:33 -0700)]
Introduce library-independent default compression level

Summary:
Previously we were using -1 as the default for every library, which was legacy from our zlib options. That worked for a while, but after zstd introduced https://github.com/facebook/zstd/commit/a146ee04ae5866b948be0c1911418e0436d80cb4, it started giving poor compression ratios by default in zstd.

This PR adds a constant to RocksDB public API, `CompressionOptions::kDefaultCompressionLevel`, which will get translated to the default value specific to the compression library being used in "util/compression.h". The constant uses a number that appears to be larger than any library's maximum compression level.
Closes https://github.com/facebook/rocksdb/pull/3895

Differential Revision: D8125780

Pulled By: ajkr

fbshipit-source-id: 2db157a89118cd4f94577c2f4a0a5ff31c8391c6

3 years agoSpecify the underlying type of enums.
Yanqin Jin [Wed, 23 May 2018 22:59:16 +0000 (15:59 -0700)]
Specify the underlying type of enums.

Summary:
Explicitly specify the underlying type of enums help developers understand the physical storage.
Closes https://github.com/facebook/rocksdb/pull/3892

Differential Revision: D8107027

Pulled By: riversand963

fbshipit-source-id: a00efecbba46df4a3c8eed0994a2d4972ad1a1d3

3 years agoFix a backward compatibility problem with table_properties being nullptr
Zhongyi Xie [Tue, 22 May 2018 20:51:21 +0000 (13:51 -0700)]
Fix a backward compatibility problem with table_properties being nullptr

Summary:
Currently when ldb built from master tries to open a DB from version 2.2, there will be a segfault because table_properties didn't exist back then.
Closes https://github.com/facebook/rocksdb/pull/3890

Differential Revision: D8100914

Pulled By: miasantreble

fbshipit-source-id: b255e8aedc54695432be2e704839c857dabdd65a

3 years agoFix Issue #3771: Slice ctor checks for nullptr and creates empty string
Jacquin Mininger [Tue, 22 May 2018 20:34:26 +0000 (13:34 -0700)]
Fix Issue #3771: Slice ctor checks for nullptr and creates empty string

Summary:
Fix Issue #3771   : Check for nullptr in Slice constructor
Slice ctor checks for nullptr and creates empty string if the string does not exist
Closes https://github.com/facebook/rocksdb/pull/3887

Differential Revision: D8098852

Pulled By: ajkr

fbshipit-source-id: 04471077defa9776ce7b8c389a61312ce31002fb

3 years agoAvoid sleep in DBTest.GroupCommitTest to fix flakiness
Andrew Kryczka [Tue, 22 May 2018 18:58:51 +0000 (11:58 -0700)]
Avoid sleep in DBTest.GroupCommitTest to fix flakiness

Summary:
DBTest.GroupCommitTest would often fail when run under valgrind because its sleeps were insufficient to guarantee a group commit had multiple entries. Instead we can use sync point to force a leader to wait until a non-leader thread has enqueued its work, thus guaranteeing a leader can do group commit work for multiple threads.
Closes https://github.com/facebook/rocksdb/pull/3883

Differential Revision: D8079429

Pulled By: ajkr

fbshipit-source-id: 61dc50fad29d2c85547842f681288de60fa29049

3 years agoAvoid single-deleting merge operands in db_stress
Andrew Kryczka [Tue, 22 May 2018 17:50:09 +0000 (10:50 -0700)]
Avoid single-deleting merge operands in db_stress

Summary:
I repro'd some of the "unexpected value" failures showing up in our CI lately and they always happened on keys that have a mix of single deletes and merge operands. The `SingleDelete()` API comment mentions it's incompatible with `Merge()`, so this PR prevents `db_stress` from mixing them.
Closes https://github.com/facebook/rocksdb/pull/3878

Differential Revision: D8097346

Pulled By: ajkr

fbshipit-source-id: 357a48c6a31156f4f8db3ce565638ad924c437a1

3 years agoPersistRocksDBOptions() to use WritableFileWriter
Siying Dong [Mon, 21 May 2018 23:37:47 +0000 (16:37 -0700)]
PersistRocksDBOptions() to use WritableFileWriter

Summary:
By using WritableFileWriter rather than WritableFile directly, we can buffer multiple Append() calls to one write() file system call, which will be expensive to underlying Env without its own write buffering.
Closes https://github.com/facebook/rocksdb/pull/3882

Differential Revision: D8080673

Pulled By: siying

fbshipit-source-id: e0db900cb3c178166aa738f3985db65e3ae2cf1b

3 years agoMove prefix_extractor to MutableCFOptions
Zhongyi Xie [Mon, 21 May 2018 21:33:55 +0000 (14:33 -0700)]
Move prefix_extractor to MutableCFOptions

Summary:
Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users.
This PR aims to make it possible to dynamically change bloom filter config.
Closes https://github.com/facebook/rocksdb/pull/3601

Differential Revision: D7253114

Pulled By: miasantreble

fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c

3 years agoUpdate ColumnFamilyTest for multi-CF verification
Yanqin Jin [Mon, 21 May 2018 18:52:31 +0000 (11:52 -0700)]
Update ColumnFamilyTest for multi-CF verification

Summary:
Change `keys_` from `set<string>` to `vector<set<string>>` so that each column
family's keys are stored in one set.

ajkr When you have a chance, can you PTAL? Thanks!
Closes https://github.com/facebook/rocksdb/pull/3871

Differential Revision: D8056447

Pulled By: riversand963

fbshipit-source-id: 650d0f9cad02b1bc005fc329ad76edbf053e6386

3 years agoPrint histogram count and sum in statistics string
Andrew Kryczka [Mon, 21 May 2018 18:09:00 +0000 (11:09 -0700)]
Print histogram count and sum in statistics string

Summary:
Previously it only printed percentiles, even though our histogram keeps track of count and sum (and more). There have been many times we want to know more than the percentiles. For example, we currently want sum of "rocksdb.compression.times.nanos" and sum of "rocksdb.decompression.times.nanos", which would allow us to know the relative cost of compression vs decompression.

This PR adds count and sum to the string printed by `StatisticsImpl::ToString`. This is a bit risky as there are definitely parsers assuming the old format. I will mention it in HISTORY.md and hope for the best...
Closes https://github.com/facebook/rocksdb/pull/3863

Differential Revision: D8038831

Pulled By: ajkr

fbshipit-source-id: 0465b72e4b0cbf18ef965f4efe402601d16d5b5c

3 years agoAssert keys/values pinned by range deletion meta-block iterators
Andrew Kryczka [Mon, 21 May 2018 16:42:49 +0000 (09:42 -0700)]
Assert keys/values pinned by range deletion meta-block iterators

Summary:
`RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion.
Closes https://github.com/facebook/rocksdb/pull/3875

Differential Revision: D8063667

Pulled By: ajkr

fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41

3 years agoAdd missing test files to src.mk
Andrew Kryczka [Mon, 21 May 2018 16:35:46 +0000 (09:35 -0700)]
Add missing test files to src.mk

Summary:
We only generate the header dependency (".cc.d") files for files mentioned in "src.mk". When we don't generate them, changes to header dependencies do not cause `make` to recompile the dependent ".o". Then it takes a while for developers (or maybe just me) to realize `make clean` is necessary.
Closes https://github.com/facebook/rocksdb/pull/3876

Differential Revision: D8065389

Pulled By: ajkr

fbshipit-source-id: 0f62eee7bcab15b0215791564e6ab3775d46996b

3 years agofix a division by zero bug
Zhongyi Xie [Sat, 19 May 2018 04:44:07 +0000 (21:44 -0700)]
fix a division by zero bug

Summary:
fixes the failing clang_analyze contrun test
Closes https://github.com/facebook/rocksdb/pull/3872

Differential Revision: D8059241

Pulled By: miasantreble

fbshipit-source-id: e8fc1838004fe16a823456188386b8b39429803b

3 years agoclass Block to store num_restarts_
Siying Dong [Fri, 18 May 2018 19:54:09 +0000 (12:54 -0700)]
class Block to store num_restarts_

Summary:
Right now, every Block::NewIterator() reads num_restarts_ from the block, which is already read in Block::Block(). This sometimes cause a CPU cache miss. Although fetching this cacheline can usually benefit follow-up block restart offset reading, as they are close to each other, it's almost free to get ride of this read by storing it in the Block class.
Closes https://github.com/facebook/rocksdb/pull/3869

Differential Revision: D8052493

Pulled By: siying

fbshipit-source-id: 9c72360f0c2d7329f3c198ce4eaedd2bc14b87c1

3 years agoSet the default value of max_manifest_file_size.
Yanqin Jin [Fri, 18 May 2018 14:58:59 +0000 (07:58 -0700)]
Set the default value of max_manifest_file_size.

Summary:
In the past, the default value of max_manifest_file_size is uint64_t::MAX,
allowing a long running RocksDB process to grow its MANIFEST file to take up
the entire disk, as reported in [issue 3851](https://github.com/facebook/rocksdb/issues/3851). It is reasonable and common to provide a default non-max value for this option. Therefore, I set the value to 1GB.

siying miasantreble Please let me know whether this looks good to you. Thanks!
Closes https://github.com/facebook/rocksdb/pull/3867

Differential Revision: D8051524

Pulled By: riversand963

fbshipit-source-id: 50251f0804b1fa933a19a30d19d261ea8b9d2b72

3 years agoImplement key shortening functions in ReverseBytewiseComparator
Siying Dong [Fri, 18 May 2018 01:24:20 +0000 (18:24 -0700)]
Implement key shortening functions in ReverseBytewiseComparator

Summary:
Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too as it uses BytewiseComparatorImpl::FindShortestSeparator() but the function actually won't do anything if the the first key is larger than the second.

Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty.
Closes https://github.com/facebook/rocksdb/pull/3836

Differential Revision: D7959762

Pulled By: siying

fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683

3 years agoadd override to virtual functions
Zhongyi Xie [Fri, 18 May 2018 00:51:46 +0000 (17:51 -0700)]
add override to virtual functions

Summary:
this will fix the failing clang_check test
Closes https://github.com/facebook/rocksdb/pull/3868

Differential Revision: D8050880

Pulled By: miasantreble

fbshipit-source-id: 749932e2e4025f835c961c068d601e522a126da6

3 years agoReorder field based on esan data
Xin Tong [Fri, 18 May 2018 00:49:06 +0000 (17:49 -0700)]
Reorder field based on esan data

Summary:
Running. TEST_TMPDIR=/dev/shm ./buck-out/gen/rocks/tools/rocks_db_bench --benchmarks=readwhilewriting --num=5000000 -benchmark_write_rate_limit=2000000 --threads=32

Collected esan data and reorder field. Accesses to 4th and 6th fields take majority of the access.  Group them. Overall, this struct takes 10%+ of the total accesses in the program. (637773011/6107964986)

==2433831==  class rocksdb::InlineSkipList
==2433831==   size = 48, count = 637773011, ratio = 112412, array access = 0
==2433831==   # 0: offset = 0,   size = 2,       count = 455137, type = i16
==2433831==   # 1: offset = 2,   size = 2,       count = 6,      type = i16
==2433831==   # 2: offset = 4,   size = 4,       count = 182303, type = i32
==2433831==   # 3: offset = 8,   size = 8,       count = 263953900, type = %"class.rocksdb::MemTableRep::KeyComparator"*
==2433831==   # 4: offset = 16,  size = 8,       count = 136409, type = %"class.rocksdb::Allocator"*
==2433831==   # 5: offset = 24,  size = 8,       count = 366628820, type = %"struct.rocksdb::InlineSkipList<const rocksdb::MemTableRep::KeyComparator &>::Node"*
==2433831==   # 6: offset = 32,  size = 4,       count = 6280031, type = %"struct.std::atomic" = type { %"struct.std::__atomic_base" }
==2433831==   # 7: offset = 40,  size = 8,       count = 136405, type = %"struct.rocksdb::InlineSkipList<const rocksdb::MemTableRep::KeyComparator &>::Splice"*
==2433831==EfficiencySanitizer: total struct field access count = 6107964986

Before re-ordering
[trentxintong@devbig460.frc2 ~/fbsource/fbcode]$ fgrep readwhilewriting
without-ro.log
readwhilewriting :       0.036 micros/op 27545605 ops/sec;   26.8 MB/s
(45954 of 5000000 found)
readwhilewriting :       0.036 micros/op 28024240 ops/sec;   27.2 MB/s
(43158 of 5000000 found)
readwhilewriting :       0.037 micros/op 27345145 ops/sec;   27.1 MB/s
(46725 of 5000000 found)
readwhilewriting :       0.037 micros/op 27072588 ops/sec;   27.3 MB/s
(42605 of 5000000 found)
readwhilewriting :       0.034 micros/op 29578781 ops/sec;   28.3 MB/s
(44294 of 5000000 found)
readwhilewriting :       0.035 micros/op 28528304 ops/sec;   27.7 MB/s
(44176 of 5000000 found)
readwhilewriting :       0.037 micros/op 27075497 ops/sec;   26.5 MB/s
(43763 of 5000000 found)
readwhilewriting :       0.036 micros/op 28024117 ops/sec;   27.1 MB/s
(40622 of 5000000 found)
readwhilewriting :       0.037 micros/op 27078709 ops/sec;   27.6 MB/s
(47774 of 5000000 found)
readwhilewriting :       0.034 micros/op 29020689 ops/sec;   28.1 MB/s
(45066 of 5000000 found)
AVERAGE()=27.37 MB/s

After re-ordering
[trentxintong@devbig460.frc2 ~/fbsource/fbcode]$ fgrep readwhilewriting
ro.log
readwhilewriting :       0.036 micros/op 27542409 ops/sec;   27.7 MB/s
(46163 of 5000000 found)
readwhilewriting :       0.036 micros/op 28021148 ops/sec;   28.2 MB/s
(46155 of 5000000 found)
readwhilewriting :       0.036 micros/op 28021035 ops/sec;   27.3 MB/s
(44039 of 5000000 found)
readwhilewriting :       0.036 micros/op 27538659 ops/sec;   27.5 MB/s
(46781 of 5000000 found)
readwhilewriting :       0.036 micros/op 28028604 ops/sec;   27.6 MB/s
(44689 of 5000000 found)
readwhilewriting :       0.036 micros/op 27541452 ops/sec;   27.3 MB/s
(43156 of 5000000 found)
readwhilewriting :       0.034 micros/op 29041338 ops/sec;   28.8 MB/s
(44895 of 5000000 found)
readwhilewriting :       0.036 micros/op 27784974 ops/sec;   26.3 MB/s
(39963 of 5000000 found)
readwhilewriting :       0.036 micros/op 27538892 ops/sec;   28.1 MB/s
(46570 of 5000000 found)
readwhilewriting :       0.038 micros/op 26622473 ops/sec;   27.0 MB/s
(43236 of 5000000 found)
AVERAGE()=27.58 MB/s
Closes https://github.com/facebook/rocksdb/pull/3855

Reviewed By: siying

Differential Revision: D8048781

Pulled By: trentxintong

fbshipit-source-id: bc9807a9845e2a92cb171ce1ecb5a2c8a51f1481

3 years agoUpdate HISTORY and version for upcoming 5.14
Fosco Marotto [Thu, 17 May 2018 21:20:54 +0000 (14:20 -0700)]
Update HISTORY and version for upcoming 5.14

Summary: Closes https://github.com/facebook/rocksdb/pull/3866

Differential Revision: D8043563

Pulled By: gfosco

fbshipit-source-id: da4af20e604534602ac0e07943135513fd9a9f53

3 years agoIn instrumented mutex, take timing once for both of perf_context and statistics
Siying Dong [Thu, 17 May 2018 19:46:23 +0000 (12:46 -0700)]
In instrumented mutex, take timing once for both of perf_context and statistics

Summary: Closes https://github.com/facebook/rocksdb/pull/3427

Differential Revision: D6827236

Pulled By: siying

fbshipit-source-id: d8a2cc525c90df625510565669f2659014259a8a

3 years agoChange and clarify the relationship between Valid(), status() and Seek*() for all...
Mike Kolupaev [Thu, 17 May 2018 09:44:14 +0000 (02:44 -0700)]
Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs

Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
 * If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
 * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.

However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).

This PR changes the convention to:
 * If status() is not ok, Valid() always returns false.
 * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)

This does sacrifice the two use cases listed above, but siying said it's ok.

Overview of the changes:
 * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
 * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
 * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.

To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:

Iterators that didn't need changes:
 * status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
 * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.

Iterators with changes (see inline comments for details):
 * DBIter - an overhaul:
    - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
    - It had a few code paths silently discarding subiterator's status. The stress test caught a few.
    - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
    - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
    - It used to not reset status on seek for some types of errors.
    - Some simplifications and better comments.
    - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
 * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
 * ForwardIterator - changed to the new convention, also slightly simplified.
 * ForwardLevelIterator - fixed some bugs and simplified.
 * LevelIterator - simplified.
 * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
 * BlockBasedTableIterator - minor changes.
 * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
 * PlainTableIterator - some seeks used to not reset status.
 * CuckooTableIterator - tiny code cleanup.
 * ManagedIterator - fixed some bugs.
 * BaseDeltaIterator - changed to the new convention and fixed a bug.
 * BlobDBIterator - seeks used to not reset status.
 * KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810

Differential Revision: D7888019

Pulled By: al13n321

fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d

3 years agoFix race condition between log_.erase and log_.back
Maysam Yabandeh [Wed, 16 May 2018 19:51:31 +0000 (12:51 -0700)]
Fix race condition between log_.erase and log_.back

Summary:
log_ contract specifies that it should not be modified unless both mutex_ and log_write_mutex_ are held. log_.erase however does that with only holding mutex_. This causes a race condition with two_write_queues since logs_.back is read with holding only log_write_mutex_ (which is correct according to logs_ contract) but logs_.erase is called concurrently. This is probably the cause of logs_.back returning nullptr in https://github.com/facebook/rocksdb/issues/3852 although I could not reproduce it.
Fixes https://github.com/facebook/rocksdb/issues/3852
Closes https://github.com/facebook/rocksdb/pull/3859

Differential Revision: D8026103

Pulled By: maysamyabandeh

fbshipit-source-id: ee394e00fe4aa520d884c5ef87981e9d6b5ccb28

3 years agoFix geo_db may seek an error key when they have the same quadkey
acelyc111 [Tue, 15 May 2018 06:54:06 +0000 (23:54 -0700)]
Fix geo_db may seek an error key when they have the same quadkey

Summary: Closes https://github.com/facebook/rocksdb/pull/3832

Differential Revision: D7994326

Pulled By: miasantreble

fbshipit-source-id: 84a81b35b97750360423a9d4eca5b5a14d002134

3 years agoSuppress tsan lock-order-inversion on FlushWAL
Maysam Yabandeh [Tue, 15 May 2018 03:56:44 +0000 (20:56 -0700)]
Suppress tsan lock-order-inversion on FlushWAL

Summary:
TSAN reports a false alarm for lock-order-inversion in DBWriteTest.IOErrorOnWALWritePropagateToWriteThreadFollower but Open and FlushWAL are not run concurrently. Suppressing the error by skipping FlushWAL in the test until TSAN is fixed.

The alternative would be to use
```
TSAN_OPTIONS="suppressions=tsan-suppressions.txt" ./db_write_test
```
but it does not seem straightforward to integrate it to our test infra.
Closes https://github.com/facebook/rocksdb/pull/3854

Differential Revision: D8000202

Pulled By: maysamyabandeh

fbshipit-source-id: fde33483d963a7ad84d3145123821f64960a4802

3 years agoBottommost level-based compactions in bottom-pri pool
Andrew Kryczka [Mon, 14 May 2018 21:44:04 +0000 (14:44 -0700)]
Bottommost level-based compactions in bottom-pri pool

Summary:
This feature was introduced for universal compaction in cc01985d. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level.
Closes https://github.com/facebook/rocksdb/pull/3835

Differential Revision: D7957179

Pulled By: ajkr

fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4

3 years agoFix db_stress build on mac
Sagar Vemuri [Mon, 14 May 2018 18:01:41 +0000 (11:01 -0700)]
Fix db_stress build on mac

Summary:
I noticed, while debugging an unrelated issue, that db_stress is failing to build on mac, leading to a failed `make all`.
```
$ make db_stress -j4
...
tools/db_stress.cc:862:69: error: cannot initialize a parameter of type 'uint64_t *' (aka 'unsigned long long *') with an rvalue of type 'size_t *' (aka 'unsigned long *')
        status = FLAGS_env->GetFileSize(FLAGS_expected_values_path, &size);
                                                                    ^~~~~
./include/rocksdb/env.h:277:66: note: passing argument to parameter 'file_size' here
  virtual Status GetFileSize(const std::string& fname, uint64_t* file_size) = 0;
                                                                 ^
1 error generated.
make: *** [tools/db_stress.o] Error 1
make: *** Waiting for unfinished jobs....
```
Closes https://github.com/facebook/rocksdb/pull/3839

Differential Revision: D7979236

Pulled By: sagar0

fbshipit-source-id: 0615e7bb5405bade71e4203803bf723720422d62

3 years agoPass manual_wal_flush also to the first wal file
Maysam Yabandeh [Mon, 14 May 2018 17:53:32 +0000 (10:53 -0700)]
Pass manual_wal_flush also to the first wal file

Summary:
Currently manual_wal_flush if set in the options will be used only for the wal files created during wal switch. The configuration thus does not affect the first wal file. The patch fixes that and also update the related unit tests.
This PR is built on top of https://github.com/facebook/rocksdb/pull/3756
Closes https://github.com/facebook/rocksdb/pull/3824

Differential Revision: D7909153

Pulled By: maysamyabandeh

fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce

3 years agoClarify the ownership of root db after TransactionDB::Open
Maysam Yabandeh [Fri, 11 May 2018 22:08:16 +0000 (15:08 -0700)]
Clarify the ownership of root db after TransactionDB::Open

Summary:
The patch clarifies the ownership of the root db after TransactionDB::Open. If it is a success the ownership if with the TransactionDB, and the root db will be deleted when the destructor of the base class, StackableDB, is called. If it is failure, the temporarily created root db will also be deleted properly.
The patch also includes lots of useful formatting changes.

Closes https://github.com/facebook/rocksdb/pull/3714 upon which this patch is built.
Closes https://github.com/facebook/rocksdb/pull/3806

Differential Revision: D7878010

Pulled By: maysamyabandeh

fbshipit-source-id: f54f3942e29434143ae5a2423ceec9c7072cd4c2

3 years agoFix formatting in log message
Sergey Elin [Fri, 11 May 2018 18:14:20 +0000 (11:14 -0700)]
Fix formatting in log message

Summary:
Add missing space.
Closes https://github.com/facebook/rocksdb/pull/3826

Differential Revision: D7956059

Pulled By: miasantreble

fbshipit-source-id: 3aeba76385f8726399a3086c46de710636a31191

3 years agoApply use_direct_io_for_flush_and_compaction to writes only
Andrew Kryczka [Thu, 10 May 2018 02:26:43 +0000 (19:26 -0700)]
Apply use_direct_io_for_flush_and_compaction to writes only

Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.

This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background direct writes happening simultaneously with buffered reads since we never read from and write to the same file simultaneously.
Closes https://github.com/facebook/rocksdb/pull/3829

Differential Revision: D7915443

Pulled By: ajkr

fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279

3 years agoRefactor argument handling in db_crashtest.py
Andrew Kryczka [Wed, 9 May 2018 20:32:03 +0000 (13:32 -0700)]
Refactor argument handling in db_crashtest.py

Summary:
- Any options unknown to `db_crashtest.py` are now passed directly to `db_stress`. This way, we won't need to update `db_crashtest.py` every time `db_stress` gets a new option.
- Remove `db_crashtest.py` redundant arguments where the value is the same as `db_stress`'s default
- Remove `db_crashtest.py` redundant arguments where the value is the same in a previously applied options map. For example, default_params are always applied before whitebox_default_params, so if they require the same value for an argument, that value only needs to be provided in default_params.
- Made the simple option maps applied in addition to the regular option maps. Previously they were exclusive which led to lots of duplication
Closes https://github.com/facebook/rocksdb/pull/3809

Differential Revision: D7885779

Pulled By: ajkr

fbshipit-source-id: 3a3243b55724d6d5bff36e939b582b9b62c538a8

3 years agoDisallow to open RandomRW file if the file doesn't exist
Siying Dong [Wed, 9 May 2018 17:24:11 +0000 (10:24 -0700)]
Disallow to open RandomRW file if the file doesn't exist

Summary:
The only use of RandomRW is to change seqno when bulkloading, and in this use case, the file should exist. We should fail the file opening in this case.
Closes https://github.com/facebook/rocksdb/pull/3827

Differential Revision: D7913719

Pulled By: siying

fbshipit-source-id: 62cf6734f1a6acb9e14f715b927da388131c3492

3 years agoMake BlockIter final
Siying Dong [Wed, 9 May 2018 17:18:05 +0000 (10:18 -0700)]
Make BlockIter final

Summary:
Now BlockBasedTableIterator directly uses BlockIter. By making BlockIter final, we can prevent unintended virtual function overriding.
Closes https://github.com/facebook/rocksdb/pull/3828

Differential Revision: D7933816

Pulled By: siying

fbshipit-source-id: 026a08cb5c5b6d3d6f44743152b4251da4756f2c

3 years agoIntroduce and use the option to disable stall notifications structures
Dmitri Smirnov [Wed, 9 May 2018 17:04:55 +0000 (10:04 -0700)]
Introduce and use the option to disable stall notifications structures

Summary:
and code. Removing this helps with insert performance.
Closes https://github.com/facebook/rocksdb/pull/3830

Differential Revision: D7921030

Pulled By: siying

fbshipit-source-id: 84e80d50a7ef96f5441c51c9a0d089c50217cce2

3 years agoAdd missing options in BuildColumnfamilyOptions
Huachao Huang [Tue, 8 May 2018 19:06:01 +0000 (12:06 -0700)]
Add missing options in BuildColumnfamilyOptions

Summary:
soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit are added to BuildColumnfamilyOptions.
Closes https://github.com/facebook/rocksdb/pull/3823

Differential Revision: D7909246

Pulled By: maysamyabandeh

fbshipit-source-id: 89032efbf6b5bd302ea50cbd7a234977984a1fca

3 years agoDisable readahead when using mmap for reads
Andrew Kryczka [Tue, 8 May 2018 19:03:29 +0000 (12:03 -0700)]
Disable readahead when using mmap for reads

Summary:
`ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the  `mmap`'d memory region. This PR:

- prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files
- adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer
Closes https://github.com/facebook/rocksdb/pull/3813

Differential Revision: D7891513

Pulled By: ajkr

fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d

3 years agoLink jemalloc
Tongliang Liao [Mon, 7 May 2018 21:21:18 +0000 (14:21 -0700)]
Link jemalloc

Summary:
Fix undefined reference to `malloc_*` linking errors on Linux.
Closes https://github.com/facebook/rocksdb/pull/3817

Differential Revision: D7899066

Pulled By: ajkr

fbshipit-source-id: 18c46569a59608388d6240f1b8ec20c2d2557dec

3 years agoAllows other cmake-specific "true" for USE_RTTI.
Tongliang Liao [Mon, 7 May 2018 21:17:49 +0000 (14:17 -0700)]
Allows other cmake-specific "true" for USE_RTTI.

Summary:
People also use ON/OFF, TRUE/FALSE and other switch options that is allowed by cmake.
Closes https://github.com/facebook/rocksdb/pull/3814

Differential Revision: D7899032

Pulled By: ajkr

fbshipit-source-id: b71511af59e0a78eedafb639b5002c47050bf3c2

3 years agoSearch paths provided by intel's "tbbvars.sh".
Tongliang Liao [Mon, 7 May 2018 21:15:29 +0000 (14:15 -0700)]
Search paths provided by intel's "tbbvars.sh".

Summary:
TBBROOT and LIBRARY_PATH are set in env by the script.

With TBB 2018 the library path is $TBBROOT/lib/intel64/gcc4.7 for anything above gcc 4.7, which is both compiler and architecture related. We cannot simply do ${TBB_ROOT_DIR}/lib.
Closes https://github.com/facebook/rocksdb/pull/3815

Differential Revision: D7899006

Pulled By: ajkr

fbshipit-source-id: 159ab1f6a5c40452ed6aa8d79300206953d916c2

3 years agoSplit FaultInjectionTest.FaultTest to avoid timeout
Maysam Yabandeh [Mon, 7 May 2018 19:15:54 +0000 (12:15 -0700)]
Split FaultInjectionTest.FaultTest to avoid timeout

Summary:
tsan flavor of this test occasionally times out in our test infra. The patch split the test to two, each working on half of the option range.
Before:
[       OK ] FaultTest/FaultInjectionTest.FaultTest/0 (5918 ms)
[       OK ] FaultTest/FaultInjectionTest.FaultTest/1 (5336 ms)
After:
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/0 (2930 ms)
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/1 (2676 ms)
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/2 (2759 ms)
[       OK ] FaultTest/FaultInjectionTestSplitted.FaultTest/3 (2546 ms)
Closes https://github.com/facebook/rocksdb/pull/3819

Differential Revision: D7894975

Pulled By: maysamyabandeh

fbshipit-source-id: 809f1411cbcc27f8aa71a6b29a16b039f51b67c9

3 years agoRecommit "Avoid adding tombstones of the same file to RangeDelAggregator multiple...
LingBin [Fri, 4 May 2018 23:37:39 +0000 (16:37 -0700)]
Recommit "Avoid adding tombstones of the same file to RangeDelAggregator multiple times"

Summary:
The origin commit #3635  will hurt performance for users who aren't using range deletions, because unneeded std::set operations, so it was reverted by commit 44653c7b7aabe821e671946e732dda7ae6b43d1b. (see #3672)

To fix this, move the set to  and add a check in , i.e., file will be added only if  is non-nullptr.

The db_bench command which find the performance regression:
> ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 > --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 > --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 > -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none

Before and after the modification, I re-run this command on the machine, the results of are as follows:

  **fillrandom**
 Table | P50 | P75 | P99 | P99.9 | P99.99 |
  ---- | --- | --- | --- | ----- | ------ |
 before commit | 5.92 | 8.57 | 19.63 | 980.97 | 12196.00 |
 after commit  | 5.91 | 8.55 | 19.34 | 965.56 | 13513.56 |

 **seekrandomwhilewriting**
  Table | P50 | P75 | P99 | P99.9 | P99.99 |
   ---- | --- | --- | --- | ----- | ------ |
 before commit | 1418.62 | 1867.01 | 3823.28 | 4980.99 | 9240.00 |
 after commit  | 1450.54 | 1880.61 | 3962.87 | 5429.60 | 7542.86 |
Closes https://github.com/facebook/rocksdb/pull/3800

Differential Revision: D7874245

Pulled By: ajkr

fbshipit-source-id: 2e8bec781b3f7399246babd66395c88619534a17

3 years agoFix db_stress memory leak ASAN error
Andrew Kryczka [Fri, 4 May 2018 23:33:38 +0000 (16:33 -0700)]
Fix db_stress memory leak ASAN error

Summary:
In case `--expected_values_path` is unset, we allocate a buffer internally to hold the expected DB state. This PR makes sure it is freed.
Closes https://github.com/facebook/rocksdb/pull/3804

Differential Revision: D7874694

Pulled By: ajkr

fbshipit-source-id: a8f7655e009507c4e639ceebfc3525d69c856e3b

3 years agoEvenly split HarnessTest.Randomized
Maysam Yabandeh [Fri, 4 May 2018 22:14:54 +0000 (15:14 -0700)]
Evenly split HarnessTest.Randomized

Summary:
Currently HarnessTest.Randomized is already split but some of the splits are faster than the others. The reason is that each split takes a continuous range of the generated args and the test with later args takes longer to finish. The patch evenly split the args among splits in a round robin fashion.
Before:
```
[       OK ] HarnessTest.Randomized1n2 (2278 ms)
[       OK ] HarnessTest.Randomized3n4 (1095 ms)
[       OK ] HarnessTest.Randomized5 (658 ms)
[       OK ] HarnessTest.Randomized6 (1258 ms)
[       OK ] HarnessTest.Randomized7 (6476 ms)
[       OK ] HarnessTest.Randomized8 (8182 ms)
```
After
```
[       OK ] HarnessTest.Randomized1 (2649 ms)
[       OK ] HarnessTest.Randomized2 (2645 ms)
[       OK ] HarnessTest.Randomized3 (2577 ms)
[       OK ] HarnessTest.Randomized4 (2490 ms)
[       OK ] HarnessTest.Randomized5 (2553 ms)
[       OK ] HarnessTest.Randomized6 (2560 ms)
[       OK ] HarnessTest.Randomized7 (2501 ms)
[       OK ] HarnessTest.Randomized8 (2574 ms)
```
Closes https://github.com/facebook/rocksdb/pull/3808

Differential Revision: D7882663

Pulled By: maysamyabandeh

fbshipit-source-id: 09b749a9684b6d7d65466aa4b00c5334a49e833e

3 years agoRename vars to satisfy unity built
Maysam Yabandeh [Fri, 4 May 2018 22:14:20 +0000 (15:14 -0700)]
Rename vars to satisfy unity built

Summary:
Tested by "make unity_test"
Closes https://github.com/facebook/rocksdb/pull/3807

Differential Revision: D7882657

Pulled By: maysamyabandeh

fbshipit-source-id: 84862c18d7f2fc762bd96ad070eaeb6936e45159

3 years agoAdd USE_RTTI and default behavior to CMakeLists
Fosco Marotto [Fri, 4 May 2018 22:09:40 +0000 (15:09 -0700)]
Add USE_RTTI and default behavior to CMakeLists

Summary:
Proposed fix for #3701
Closes https://github.com/facebook/rocksdb/pull/3801

Differential Revision: D7868264

Pulled By: gfosco

fbshipit-source-id: 013963ed3d172c8dc2abd1dd5982580082ca5d2d

3 years agoFix crash test allocation error under TSAN
Andrew Kryczka [Fri, 4 May 2018 20:40:58 +0000 (13:40 -0700)]
Fix crash test allocation error under TSAN

Summary:
We were seeing the following error: "ThreadSanitizer: DenseSlabAllocator overflow. Dying."

It is fixable by mmap'ing a smaller region for keys' expected values, which this PR achieves by reducing the number of keys.
Closes https://github.com/facebook/rocksdb/pull/3803

Differential Revision: D7874478

Pulled By: ajkr

fbshipit-source-id: 433939f5cb92410ab4777d540cb0cc2ee0fe6c2e

3 years agoMaxFileSizeForLevel: adjust max_file_size for dynamic level compaction
Zhongyi Xie [Thu, 3 May 2018 23:35:46 +0000 (16:35 -0700)]
MaxFileSizeForLevel: adjust max_file_size for dynamic level compaction

Summary:
`MutableCFOptions::RefreshDerivedOptions` always assume base level is L1, which is not true when `level_compaction_dynamic_level_bytes=true` and Level based compaction is used.
This PR fixes this by recomputing `max_file_size` at query time (in `MaxFileSizeForLevel`)
Fixes https://github.com/facebook/rocksdb/issues/3229

In master:

```
Level Files Size(MB)
--------------------
  0       14      846
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5       15      366
  6       11      481
Cumulative compaction: 3.83 GB write, 2.27 GB read
```
In branch:
```
Level Files Size(MB)
--------------------
  0        9      544
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5        0        0
  6      445      935
Cumulative compaction: 2.91 GB write, 1.46 GB read
```

db_bench command used:
```
./db_bench --benchmarks="fillrandom,deleterandom,fillrandom,levelstats,stats" --statistics -deletes=5000 -db=tmp -compression_type=none --num=20000 -value_size=100000 -level_compaction_dynamic_level_bytes=true -target_file_size_base=2097152 -target_file_size_multiplier=2
```
Closes https://github.com/facebook/rocksdb/pull/3755

Differential Revision: D7721381

Pulled By: miasantreble

fbshipit-source-id: 39afb8503190bac3b466adf9bbf2a9b3655789f8

3 years agoBetter destroydb
Dmitri Smirnov [Thu, 3 May 2018 23:07:41 +0000 (16:07 -0700)]
Better destroydb

Summary:
Delete archive directory before WAL folder
  since archive may be contained as a subfolder.
  Also improve loop readability.
Closes https://github.com/facebook/rocksdb/pull/3797

Differential Revision: D7866378

Pulled By: riversand963

fbshipit-source-id: 0c45d97677ce6fbefa3f8d602ef5e2a2a925e6f5

3 years agoSpeedup ManualCompactionTest.Test
Maysam Yabandeh [Thu, 3 May 2018 22:58:42 +0000 (15:58 -0700)]
Speedup ManualCompactionTest.Test

Summary:
ManualCompactionTest.Test occasionally times out in tsan flavor of our test infra. The patch reduces the number of keys to make the test run faster. The change does not seem to negatively impact the coverage of the test.
Closes https://github.com/facebook/rocksdb/pull/3802

Differential Revision: D7865596

Pulled By: maysamyabandeh

fbshipit-source-id: b4f60e32c3ae1677e25506f71c766e33fa985785

3 years agoSkip deleted WALs during recovery
Siying Dong [Thu, 3 May 2018 22:35:11 +0000 (15:35 -0700)]
Skip deleted WALs during recovery

Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.

Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)

This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765

Differential Revision: D7747618

Pulled By: siying

fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729

3 years agoWritePrepared Txn: enable rollback in stress test
Maysam Yabandeh [Thu, 3 May 2018 01:09:55 +0000 (18:09 -0700)]
WritePrepared Txn: enable rollback in stress test

Summary:
Rollback was disabled in stress test since there was a concurrency issue in WritePrepared rollback algorithm. The issue is fixed by caching the column family handles in WritePrepared to skip getting them from the db when needed for rollback.

Tested by running transaction stress test under tsan.
Closes https://github.com/facebook/rocksdb/pull/3785

Differential Revision: D7793727

Pulled By: maysamyabandeh

fbshipit-source-id: d81ab6fda0e53186ca69944cfe0712ce4869451e

3 years agoWritePrepared Txn: split SeqAdvanceConcurrentTest
Maysam Yabandeh [Thu, 3 May 2018 01:05:03 +0000 (18:05 -0700)]
WritePrepared Txn: split SeqAdvanceConcurrentTest

Summary:
The tsan flavor of SeqAdvanceConcurrentTest times out in our test infra. The patch splits it into 10 tests.
On my vm before:
[       OK ] WritePreparedTransactionTest/WritePreparedTransactionTest.SeqAdvanceConcurrentTest/0 (5194 ms)
after:
[       OK ] OneWriteQueue/SeqAdvanceConcurrentTest.SeqAdvanceConcurrentTest/0 (1906 ms)
Closes https://github.com/facebook/rocksdb/pull/3799

Differential Revision: D7854515

Pulled By: maysamyabandeh

fbshipit-source-id: 4fbac42a1f974326cbc237f8cb9d6232d379c431

3 years agoavoid double delete on dummy record insertion failure
Zhongyi Xie [Tue, 1 May 2018 22:42:06 +0000 (15:42 -0700)]
avoid double delete on dummy record insertion failure

Summary:
When the dummy record insertion fails, there is no need to explicitly delete the block as it will be registered for cleanup regardless.
Closes https://github.com/facebook/rocksdb/pull/3688

Differential Revision: D7537741

Pulled By: miasantreble

fbshipit-source-id: fcd3a3d3d382ee8e2c7ced0a4980e683d93a16d6

3 years agoAdjust pread/pwrite to return Status
Dmitri Smirnov [Tue, 1 May 2018 20:38:36 +0000 (13:38 -0700)]
Adjust pread/pwrite to return Status

Summary:
Returning bytes_read causes the caller to call GetLastError()
  to report failure but the lasterror may be overwritten by then
  so we lose the error code.
  Fix up CMake file to include xpress source code only when needed.
  Fix warning for the uninitialized var.
Closes https://github.com/facebook/rocksdb/pull/3795

Differential Revision: D7832935

Pulled By: anand1976

fbshipit-source-id: 4be21affb9b85d361b96244f4ef459f492b7cb2b

3 years agoinitialize local variable for UBSAN in PosixEnv function
Andrew Kryczka [Tue, 1 May 2018 20:18:01 +0000 (13:18 -0700)]
initialize local variable for UBSAN in PosixEnv function

Summary:
this is a repeat commit of a8a28da2159648a2f72c35ea507371df8a97a2a9, which got reverted together with 6afe22db2e667799d8c903db61750d676bffe152, but forgotten about when that commit was un-reverted in 46152d53bf58748fc3ed0681d8970c342bcfc47a.
Closes https://github.com/facebook/rocksdb/pull/3796

Differential Revision: D7826077

Pulled By: ajkr

fbshipit-source-id: edb22375da56e2feda50c5b35f942f4d2d52b19c

3 years agoSecond attempt at db_stress crash-recovery verification
Andrew Kryczka [Mon, 30 Apr 2018 19:23:45 +0000 (12:23 -0700)]
Second attempt at db_stress crash-recovery verification

Summary:
- Original commit: a4fb1f8c049ee9d61a9da8cf23b64d2c7d36a33f
- Revert commit (we reverted as a quick fix to get crash tests passing): 6afe22db2e667799d8c903db61750d676bffe152

This PR includes the contents of the original commit plus two bug fixes, which are:

- In whitebox crash test, only set `--expected_values_path` for `db_stress` runs in the first half of the crash test's duration. In the second half, a fresh DB is created for each `db_stress` run, so we cannot maintain expected state across `db_stress` runs.
- Made `Exists()` return true for `UNKNOWN_SENTINEL` values. I previously had an assert in `Exists()` that value was not `UNKNOWN_SENTINEL`. But it is possible for post-crash-recovery expected values to be `UNKNOWN_SENTINEL` (i.e., if the crash happens in the middle of an update), in which case this assertion would be tripped. The effect of returning true in this case is there may be cases where a `SingleDelete` deletes no data. But if we had returned false, the effect would be calling `SingleDelete` on a key with multiple older versions, which is not supported.
Closes https://github.com/facebook/rocksdb/pull/3793

Differential Revision: D7811671

Pulled By: ajkr

fbshipit-source-id: 67e0295bfb1695ff9674837f2e05bb29c50efc30

3 years agofix missing perfcontext destroy declare in C API
Vincent Lee [Mon, 30 Apr 2018 18:27:25 +0000 (11:27 -0700)]
fix missing perfcontext destroy declare in C API

Summary:
`rocksdb_perfcontext_destroy` declare is missing in C API.
Closes https://github.com/facebook/rocksdb/pull/3787

Differential Revision: D7816490

Pulled By: ajkr

fbshipit-source-id: 3a488607bfc897c7ce846a1b3c2b7af693134d0d

3 years agoexpose WAL iterator in the C API
Victor Grishchenko [Fri, 27 Apr 2018 23:46:38 +0000 (16:46 -0700)]
expose WAL iterator in the C API

Summary:
A minor change: I wrapped TransactionLogIterator for the C API.
I needed that for the golang binding.
Closes https://github.com/facebook/rocksdb/pull/3304

Differential Revision: D6628736

Pulled By: miasantreble

fbshipit-source-id: 3374f3c64b1d7b225696b8767090917761e2f30a

3 years agorevert db_stress crash-recovery verification
Andrew Kryczka [Fri, 27 Apr 2018 19:54:16 +0000 (12:54 -0700)]
revert db_stress crash-recovery verification

Summary:
crash-recovery verification is failing in the whitebox testing, which may or may not be a valid correctness issue -- need more time to investigate. In the meantime, reverting so we don't mask other failures.
Closes https://github.com/facebook/rocksdb/pull/3786

Differential Revision: D7794516

Pulled By: ajkr

fbshipit-source-id: 28ccdfdb9ec9b3b0fb08c15cbf9d2e282201ff33

3 years agoremove prefixscanrandom from db_bench help
Zhongyi Xie [Fri, 27 Apr 2018 19:12:43 +0000 (12:12 -0700)]
remove prefixscanrandom from db_bench help

Summary:
fix issue reported in https://github.com/facebook/rocksdb/issues/3757
Closes https://github.com/facebook/rocksdb/pull/3784

Differential Revision: D7794107

Pulled By: miasantreble

fbshipit-source-id: 43535074fcb82adb5656bcb916284b2dfc5cbb64

3 years agoAdd max_subcompactions as a compaction option
Huachao Huang [Fri, 27 Apr 2018 18:48:21 +0000 (11:48 -0700)]
Add max_subcompactions as a compaction option

Summary:
Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
I add a `max_subcompactions` options to `CompactionOptions` so that we can choose a proper concurrency dynamically.
Closes https://github.com/facebook/rocksdb/pull/3775

Differential Revision: D7792357

Pulled By: ajkr

fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c

3 years agoRename pending_compaction_ to queued_for_compaction_.
Yanqin Jin [Fri, 27 Apr 2018 18:11:12 +0000 (11:11 -0700)]
Rename pending_compaction_ to queued_for_compaction_.

Summary:
We use `queued_for_flush_` to indicate a column family has been added to the
flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used
`pending_compaction_` which can also be ambiguous.
Closes https://github.com/facebook/rocksdb/pull/3781

Differential Revision: D7790063

Pulled By: riversand963

fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804

3 years agoRename pending_flush_ to queued_for_flush_.
Yanqin Jin [Fri, 27 Apr 2018 04:09:53 +0000 (21:09 -0700)]
Rename pending_flush_ to queued_for_flush_.

Summary:
With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::ScheedulePendingFlush

```
if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) {
...
}
```

`Pending` is ambiguous, and I feel `queued_for_flush` is a better name,
especially for the sake of readability.
Closes https://github.com/facebook/rocksdb/pull/3777

Differential Revision: D7783066

Pulled By: riversand963

fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229

3 years agoAdd virtual Truncate method to Env
Nathan VanBenschoten [Fri, 27 Apr 2018 04:08:46 +0000 (21:08 -0700)]
Add virtual Truncate method to Env

Summary:
This change adds a virtual `Truncate` method to `Env`, which truncates
the named file to the specified size. At the moment, this is only
supported for `MockEnv`, but other `Env's` could be extended to override
the method too. This is the same approach that methods like `LinkFile` and
`AreSameFile` have taken.

This is useful for any user of the in-memory `Env`. The implementation's
header is not exported, so before this change, it was impossible to
access it's already existing `Truncate` method.
Closes https://github.com/facebook/rocksdb/pull/3779

Differential Revision: D7785789

Pulled By: ajkr

fbshipit-source-id: 3bcdaeea7b7180529f7d9b496dc67b791a00bbf0

3 years agoAllow options file in db_stress and db_crashtest
Andrew Kryczka [Fri, 27 Apr 2018 01:38:45 +0000 (18:38 -0700)]
Allow options file in db_stress and db_crashtest

Summary:
- When options file is provided to db_stress, take supported options from the file instead of from flags
- Call `BuildOptionsTable` after `Open` so it can use `options_` once it has been populated either from flags or from file
- Allow options filename to be passed via `db_crashtest.py`
Closes https://github.com/facebook/rocksdb/pull/3768

Differential Revision: D7755331

Pulled By: ajkr

fbshipit-source-id: 5205cc5deb0d74d677b9832174153812bab9a60a

3 years agoRemove block-based table assertion for non-empty filter block
Andrew Kryczka [Thu, 26 Apr 2018 21:33:02 +0000 (14:33 -0700)]
Remove block-based table assertion for non-empty filter block

Summary:
7a6353bd1c516fe3f7118248c77035697c5ac247 prevents empty filter blocks from being written for SST files containing range deletions only. However the assertion this PR removes is still a problem as we could be reading from a DB generated by a RocksDB build without the 7a6353bd1c516fe3f7118248c77035697c5ac247 patch. So remove the assertion. We already don't do this check when `cache_index_and_filter_blocks=false`, so it should be safe.
Closes https://github.com/facebook/rocksdb/pull/3773

Differential Revision: D7769964

Pulled By: ajkr

fbshipit-source-id: 7285762446f2cd2ccf16efd7a988a106fbb0d8d3

3 years agoSync parent directory after deleting a file in delete scheduler
Siying Dong [Thu, 26 Apr 2018 20:51:39 +0000 (13:51 -0700)]
Sync parent directory after deleting a file in delete scheduler

Summary:
sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want.
Closes https://github.com/facebook/rocksdb/pull/3767

Differential Revision: D7760136

Pulled By: siying

fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f

3 years agoFix the bloom filter skipping empty prefixes
Maysam Yabandeh [Thu, 26 Apr 2018 20:22:08 +0000 (13:22 -0700)]
Fix the bloom filter skipping empty prefixes

Summary:
bc0da4b5125ac4f43c88879522013814355338e7 optimized bloom filters by skipping duplicate entires when the whole key and prefixes are both added to the bloom. It however used empty string as the initial value of the last entry added to the bloom. This is incorrect since empty key/prefix are valid entires by themselves. This patch fixes that.
Closes https://github.com/facebook/rocksdb/pull/3776

Differential Revision: D7778803

Pulled By: maysamyabandeh

fbshipit-source-id: d5a065daebee17f9403cac51e9d5626aac87bfbc

3 years agoWritePrepared Txn: disable rollback in stress test
Maysam Yabandeh [Thu, 26 Apr 2018 16:14:28 +0000 (09:14 -0700)]
WritePrepared Txn: disable rollback in stress test

Summary:
WritePrepared rollback implementation is not ready to be invoked in the middle of workload. This is due the lack of synchronization to obtain the cf handle from db. Temporarily disabling this until the problem with rollback is fixed.
Closes https://github.com/facebook/rocksdb/pull/3772

Differential Revision: D7769041

Pulled By: maysamyabandeh

fbshipit-source-id: 0e3b0ce679bc2afba82e653a40afa3f045722754

3 years agoRate limiter should be allowed to share between different rocksdb instances in C API
Vincent Lee [Wed, 25 Apr 2018 22:45:18 +0000 (15:45 -0700)]
Rate limiter should be allowed to share between different rocksdb instances in C API

Summary:
Currently, the `rocksdb_options_set_ratelimiter` in  `c.cc` will change the input to nil, which make it is
 not possible to use the shared rate limiter create by `rocksdb_ratelimiter_create` in different rocksdb option.

In this pr, I changed it to shared ptr.
Closes https://github.com/facebook/rocksdb/pull/3758

Differential Revision: D7749740

Pulled By: ajkr

fbshipit-source-id: c6121f8ca75402afdb4b295ce63c2338d253a1b5

3 years agoFix clang build failure with -Wgnu-redeclared-enum
Anand Ananthabhotla [Wed, 25 Apr 2018 22:29:36 +0000 (15:29 -0700)]
Fix clang build failure with -Wgnu-redeclared-enum

Summary:
In include/rocksdb/db.h, enum EntryType is redeclared even though
original declaration in types.h in included.
Closes https://github.com/facebook/rocksdb/pull/3766

Differential Revision: D7765504

Pulled By: anand1976

fbshipit-source-id: 622a8ecb306993915be1b9dd5cdd79dbc6a4ea05

3 years agocmake: add options for enabling TBB and NUMA support
Kefu Chai [Wed, 25 Apr 2018 21:26:11 +0000 (14:26 -0700)]
cmake: add options for enabling TBB and NUMA support

Summary:
see also https://github.com/facebook/rocksdb/issues/3036

Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/3750

Differential Revision: D7765170

Pulled By: ajkr

fbshipit-source-id: 455788b3131bf62a4987a65684b757e68473eed9

3 years agoinitialize local variable for UBSAN in PosixEnv function
Andrew Kryczka [Wed, 25 Apr 2018 20:36:26 +0000 (13:36 -0700)]
initialize local variable for UBSAN in PosixEnv function

Summary:
It seems clear to me that the variable is initialized before line 492, but it wasn't clear to UBSAN. The failure was:

```
In file included from ./env/io_posix.h:14:0,
                 from env/env_posix.cc:44:
./include/rocksdb/env.h: In member function ‘virtual rocksdb::Status rocksdb::{anonymous}::PosixEnv::NewMemoryMappedFileBuffer(const string&, std::unique_ptr<rocksdb::MemoryMappedFileBuffer>*)’:
./include/rocksdb/env.h:822:36: error: ‘base’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
       : base(_base), length(_length) {}
                                    ^
env/env_posix.cc:482:11: note: ‘base’ was declared here
     void* base;
```

We can just initialize to nullptr to keep UBSAN happy.
Closes https://github.com/facebook/rocksdb/pull/3770

Differential Revision: D7756287

Pulled By: ajkr

fbshipit-source-id: 0f2efb9594e2d3a30706a4ca7e1d4a6328031bf2

3 years agoPass -latomic to linker when using clang
Andrew Kryczka [Wed, 25 Apr 2018 19:08:24 +0000 (12:08 -0700)]
Pass -latomic to linker when using clang

Summary:
clang compilation is failing due to a4fb1f8c049ee9d61a9da8cf23b64d2c7d36a33f. In that commit I added a call to `std::atomic::is_lock_free` which was evidently relying on a compiler builtin only present in gcc.

Drawbacks to this fix are:

- users may need to install libatomic
- there might be cases where clang is used even though USE_CLANG is unset (e.g., when clang is the only available compiler). I didn't figure out how to add -latomic in those cases...

An alternative fix mentioned in http://lists.llvm.org/pipermail/llvm-bugs/2017-August/057263.html is using -stdlib=libc++ with clang.
Closes https://github.com/facebook/rocksdb/pull/3769

Differential Revision: D7756261

Pulled By: ajkr

fbshipit-source-id: 26888300683fa9970ab5950239d1aa217e8efd49

3 years agoAdd crash-recovery correctness check to db_stress
Andrew Kryczka [Tue, 24 Apr 2018 22:46:41 +0000 (15:46 -0700)]
Add crash-recovery correctness check to db_stress

Summary:
Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values.

In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by looking at `std::atomic::is_always_lock_free`.

For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class.

`db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup.
Closes https://github.com/facebook/rocksdb/pull/3629

Differential Revision: D7463144

Pulled By: ajkr

fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b

3 years agoSkip duplicate bloom keys when whole_key and prefix are mixed
Maysam Yabandeh [Tue, 24 Apr 2018 17:53:38 +0000 (10:53 -0700)]
Skip duplicate bloom keys when whole_key and prefix are mixed

Summary:
Currently we rely on FilterBitsBuilder to skip the duplicate keys. It does that by comparing that hash of the key to the hash of the last added entry. This logic breaks however when we have whole_key_filtering mixed with prefix blooms as their addition to FilterBitsBuilder will be interleaved. The patch fixes that by comparing the last whole key and last prefix with the whole key and prefix of the new key respectively and skip the call to FilterBitsBuilder if it is a duplicate.
Closes https://github.com/facebook/rocksdb/pull/3764

Differential Revision: D7744413

Pulled By: maysamyabandeh

fbshipit-source-id: 15df73bbbafdfd754d4e1f42ea07f47b03bc5eb8

3 years agoSupport lowering CPU priority of background threads
Gabriel Wicke [Tue, 24 Apr 2018 15:38:01 +0000 (08:38 -0700)]
Support lowering CPU priority of background threads

Summary:
Background activities like compaction can negatively affect
latency of higher-priority tasks like request processing. To avoid this,
rocksdb already lowers the IO priority of background threads on Linux
systems. While this takes care of typical IO-bound systems, it does not
help much when CPU (temporarily) becomes the bottleneck. This is
especially likely when using more expensive compression settings.

This patch adds an API to allow for lowering the CPU priority of
background threads, modeled on the IO priority API. Benchmarks (see
below) show significant latency and throughput improvements when CPU
bound. As a result, workloads with some CPU usage bursts should benefit
from lower latencies at a given utilization, or should be able to push
utilization higher at a given request latency target.

A useful side effect is that compaction CPU usage is now easily visible
in common tools, allowing for an easier estimation of the contribution
of compaction vs. request processing threads.

As with IO priority, the implementation is limited to Linux, degrading
to a no-op on other systems.
Closes https://github.com/facebook/rocksdb/pull/3763

Differential Revision: D7740096

Pulled By: gwicke

fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c

3 years agoImprove write time breakdown stats
Mike Kolupaev [Tue, 24 Apr 2018 00:53:27 +0000 (17:53 -0700)]
Improve write time breakdown stats

Summary:
There's a group of stats in PerfContext for profiling the write path. They break down the write time into WAL write, memtable insert, throttling, and everything else. We use these stats a lot for figuring out the cause of slow writes.

These stats got a bit out of date and are now categorizing some interesting things as "everything else", and also do some double counting. This PR fixes it and adds two new stats: time spent waiting for other threads of the batch group, and time spent waiting for scheduling flushes/compactions. Probably these will be enough to explain all the occasional abnormally slow (multiple seconds) writes that we're seeing.
Closes https://github.com/facebook/rocksdb/pull/3602

Differential Revision: D7251562

Pulled By: al13n321

fbshipit-source-id: 0a2d0f5a4fa5677455e1f566da931cb46efe2a0d

3 years agoRevert "Skip deleted WALs during recovery"
Siying Dong [Mon, 23 Apr 2018 18:56:43 +0000 (11:56 -0700)]
Revert "Skip deleted WALs during recovery"

Summary:
This reverts commit 73f21a7b2177aeb82b9f518222e2b9ea8fbb7c4f.

It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error:

"Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory"

This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file.
Closes https://github.com/facebook/rocksdb/pull/3762

Differential Revision: D7730035

Pulled By: siying

fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77

3 years agoAvoid directory renames in BackupEngine
Andrew Kryczka [Sat, 21 Apr 2018 00:23:34 +0000 (17:23 -0700)]
Avoid directory renames in BackupEngine

Summary:
We used to name private directories like "1.tmp" while BackupEngine populated them, and then rename without the ".tmp" suffix (i.e., rename "1.tmp" to "1") after all files were copied. On glusterfs, directory renames like this require operations across many hosts, and partial failures have caused operational problems.

Fortunately we don't need to rename private directories. We already have a meta-file that uses the tempfile-rename pattern to commit a backup atomically after all its files have been successfully copied. So we can copy private files directly to their final location, so now there's no directory rename.
Closes https://github.com/facebook/rocksdb/pull/3749

Differential Revision: D7705610

Pulled By: ajkr

fbshipit-source-id: fd724a28dd2bf993ce323a5f2cb7e7d6980cc346

3 years agoDisable EnvPosixTest::FilePermission
Yi Wu [Fri, 20 Apr 2018 22:27:37 +0000 (15:27 -0700)]
Disable EnvPosixTest::FilePermission

Summary:
The test is flaky in our CI but could not be reproduce manually on the same CI host. Disabling it.
Closes https://github.com/facebook/rocksdb/pull/3753

Differential Revision: D7716320

Pulled By: yiwu-arbug

fbshipit-source-id: 6bed3b05880c1d24e8dc86bc970e5181bc98fb45

3 years agoWritePrepared Txn: rollback via commit
Maysam Yabandeh [Fri, 20 Apr 2018 22:25:12 +0000 (15:25 -0700)]
WritePrepared Txn: rollback via commit

Summary:
Currently WritePrepared rolls back a transaction with prepare sequence number prepare_seq by i) write a single rollback batch with rollback_seq, ii) add <rollback_seq, rollback_seq> to commit cache, iii) remove prepare_seq from PrepareHeap.
This is correct assuming that there is no snapshot taken when a transaction is rolled back. This is the case the way MySQL does rollback which is after recovery. Otherwise if max_evicted_seq advances the prepare_seq, the live snapshot might assume data as committed since it does not find them in CommitCache.
The change is to simply add <prepare_seq. rollback_seq> to commit cache before removing prepare_seq from PrepareHeap. In this way if max_evicted_seq advances prpeare_seq, the existing mechanism that we have to check evicted entries against live snapshots will make sure that the live snapshot will not see the data of rolled back transaction.
Closes https://github.com/facebook/rocksdb/pull/3745

Differential Revision: D7696193

Pulled By: maysamyabandeh

fbshipit-source-id: c9a2d46341ddc03554dded1303520a1cab74ef9c

3 years agoAdd a stat for MultiGet keys found, update memtable hit/miss stats
Anand Ananthabhotla [Fri, 20 Apr 2018 22:17:46 +0000 (15:17 -0700)]
Add a stat for MultiGet keys found, update memtable hit/miss stats

Summary:
1. Add a new ticker stat rocksdb.number.multiget.keys.found to track the
number of keys successfully read
2. Update rocksdb.memtable.hit/miss in DBImpl::MultiGet(). It was being done in
DBImpl::GetImpl(), but not MultiGet
Closes https://github.com/facebook/rocksdb/pull/3730

Differential Revision: D7677364

Pulled By: anand1976

fbshipit-source-id: af22bd0ef8ddc5cf2b4244b0a024e539fe48bca5

3 years agoWritePrepared Txn: enable TryAgain for duplicates at the end of the batch
Maysam Yabandeh [Fri, 20 Apr 2018 22:13:47 +0000 (15:13 -0700)]
WritePrepared Txn: enable TryAgain for duplicates at the end of the batch

Summary:
The WriteBatch::Iterate will try with a larger sequence number if the memtable reports a duplicate. This status is specified with TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when WAL is created via WritePrepared since it always appends a batch separator if a natural one does not exist. However when reading a WAL generated by WriteCommitted this batch separator might  not exist. Although WritePrepared is not supposed to be able to read the WAL generated by WriteCommitted we should avoid confusing scenarios in which the behavior becomes unpredictable. The path fixes that by allowing TryAgain even for the last entry of the write batch.
Closes https://github.com/facebook/rocksdb/pull/3747

Differential Revision: D7708391

Pulled By: maysamyabandeh

fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d

3 years agoPropagate fill_cache config to partitioned index iterator
Maysam Yabandeh [Fri, 20 Apr 2018 22:08:00 +0000 (15:08 -0700)]
Propagate fill_cache config to partitioned index iterator

Summary:
Currently the partitioned index iterator creates a new ReadOptions which ignores the fill_cache config set to ReadOptions passed by the user. The patch propagates fill_cache from the user's ReadOptions to that of partition index iterator.
Also it clarifies the contract of fill_cache that i) it does not apply to filters, ii) it still charges block cache for the size of the data block, it still pin the block if it is already in the block cache.
Closes https://github.com/facebook/rocksdb/pull/3739

Differential Revision: D7678308

Pulled By: maysamyabandeh

fbshipit-source-id: 53ed96424ae922e499e2d4e3580ddc3f0db893da

3 years agoFix GitHub issue #3716: gcc-8 warnings
przemyslaw.skibinski@percona.com [Fri, 20 Apr 2018 20:28:05 +0000 (13:28 -0700)]
Fix GitHub issue #3716: gcc-8 warnings

Summary:
Fix the following gcc-8 warnings:
- conflicting C language linkage declaration [-Werror]
- writing to an object with no trivial copy-assignment [-Werror=class-memaccess]
- array subscript -1 is below array bounds [-Werror=array-bounds]

Solves https://github.com/facebook/rocksdb/issues/3716
Closes https://github.com/facebook/rocksdb/pull/3736

Differential Revision: D7684161

Pulled By: yiwu-arbug

fbshipit-source-id: 47c0423d26b74add251f1d3595211eee1e41e54a

3 years agofix compilation error: implicit conversion loses integer precision
Zhongyi Xie [Fri, 20 Apr 2018 01:51:35 +0000 (18:51 -0700)]
fix compilation error: implicit conversion loses integer precision

Summary:
Fix compilation error with clang:
> tools/db_stress.cc:2598:21: error: implicit conversion loses integer precision: 'gflags::uint64' (aka 'unsigned long') to 'uint32_t' (aka 'unsigned int') [-Werror,-Wshorten-64-to-32]
        Random rand(FLAGS_seed);
               ~~~~ ^~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/3746

Differential Revision: D7703209

Pulled By: miasantreble

fbshipit-source-id: 18c56a5138a2f308e4213594bc82e8e64bc21570

3 years agoCMake: Read rocksdb version from version.h header file
Paweł Bylica [Fri, 20 Apr 2018 00:39:30 +0000 (17:39 -0700)]
CMake: Read rocksdb version from version.h header file

Summary:
This replaces reading the rocksdb version by external shell script. This does not work reliably on Windows (I wander how it works on AppVeyor).
Closes https://github.com/facebook/rocksdb/pull/3737

Differential Revision: D7703106

Pulled By: ajkr

fbshipit-source-id: 4079c7c77431757e9ddc801363ed896b18fdbf23

3 years agocheck return status for Sync() and Append() calls to avoid corruption
Zhongyi Xie [Thu, 19 Apr 2018 21:00:52 +0000 (14:00 -0700)]
check return status for Sync() and Append() calls to avoid corruption

Summary:
Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error status are not checked. This could lead to potential corruption as later calls will overwrite the error status.
Closes https://github.com/facebook/rocksdb/pull/3740

Differential Revision: D7678848

Pulled By: miasantreble

fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77

3 years agoAdd block cache related DB properties
Yi Wu [Thu, 19 Apr 2018 04:35:12 +0000 (21:35 -0700)]
Add block cache related DB properties

Summary:
Add DB properties "rocksdb.block-cache-capacity", "rocksdb.block-cache-usage", "rocksdb.block-cache-pinned-usage" to show block cache usage.
Closes https://github.com/facebook/rocksdb/pull/3734

Differential Revision: D7657180

Pulled By: yiwu-arbug

fbshipit-source-id: dd34a019d5878dab539c51ee82669e97b2b745fd

3 years agoinclude thread-pool priority in thread names
Andrew Kryczka [Thu, 19 Apr 2018 00:25:37 +0000 (17:25 -0700)]
include thread-pool priority in thread names

Summary:
Previously threads were named "rocksdb:bg\<index in thread pool\>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to bottom-level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority.
Closes https://github.com/facebook/rocksdb/pull/3702

Differential Revision: D7581415

Pulled By: ajkr

fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1

3 years agoImprove db_stress with transactions
Maysam Yabandeh [Wed, 18 Apr 2018 23:25:33 +0000 (16:25 -0700)]
Improve db_stress with transactions

Summary:
db_stress was already capable running transactions by setting use_txn. Running it under stress showed a couple of problems fixed in this patch.
- The uncommitted transaction must be either rolled back or commit after recovery.
- Current implementation of WritePrepared transaction cannot handle cf drop before crash. Clarified that in the comments and added safety checks. When running with use_txn, clear_column_family_one_in must be set to 0.
Closes https://github.com/facebook/rocksdb/pull/3733

Differential Revision: D7654419

Pulled By: maysamyabandeh

fbshipit-source-id: a024bad80a9dc99677398c00d29ff17d4436b7f3

3 years agoAdd missing whitespace.
Yanqin Jin [Tue, 17 Apr 2018 16:41:57 +0000 (09:41 -0700)]
Add missing whitespace.

Summary: Closes https://github.com/facebook/rocksdb/pull/3729

Differential Revision: D7645465

Pulled By: riversand963

fbshipit-source-id: a64da0960fe6c39847ef848b8888fe9a9c1df25d

3 years agodb_bench fillXXXdeterministic should respect compression type
Yi Wu [Tue, 17 Apr 2018 00:56:59 +0000 (17:56 -0700)]
db_bench fillXXXdeterministic should respect compression type

Summary:
db_bench fillXXXdeterministic should respect compression type when calling CompactFiles().
Closes https://github.com/facebook/rocksdb/pull/3731

Differential Revision: D7647761

Pulled By: yiwu-arbug

fbshipit-source-id: 15e12429e0dd93ece2231b015f2e26c2d94781e6

3 years agoImprove the comment on TableFactory::NewTableReader()
Harry Wong [Mon, 16 Apr 2018 23:46:22 +0000 (16:46 -0700)]
Improve the comment on TableFactory::NewTableReader()

Summary:
`DBImpl::AddFile()` has been replaced by `DBImpl::IngestExternalFile()`.
Closes https://github.com/facebook/rocksdb/pull/3726

Differential Revision: D7646875

Pulled By: ajkr

fbshipit-source-id: 241eb7a8d88527fdc5c26b0c3f6faec3296451f8

3 years agoInitialize a boolean member variable of a struct.
Yanqin Jin [Mon, 16 Apr 2018 21:18:51 +0000 (14:18 -0700)]
Initialize a boolean member variable of a struct.

Summary:
The reason for this initialization is that LLVM UBSAN check will fail due to
uninitialized bool. [StackOverflow post](https://stackoverflow.com/questions/31420154/runtime-error-load-of-value-127-which-is-not-a-valid-value-for-type-bool).

UBSAN log:
> ===== Running external_sst_file_basic_test
[==========] Running 7 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 7 tests from ExternalSSTFileBasicTest
[ RUN      ] ExternalSSTFileBasicTest.Basic
[       OK ] ExternalSSTFileBasicTest.Basic (6 ms)
[ RUN      ] ExternalSSTFileBasicTest.NoCopy
db/external_sst_file_ingestion_job.h:23:8: runtime error: load of value 253, which is not a valid value for type 'bool'

miasantreble  I've tested this locally using the following command.
```
TEST_TMPDIR=/dev/shm/rocksdb COMPILE_WITH_UBSAN=1 OPT=-g make J=1 -j8 ubsan_check
```

ajkr This PR is related to your review comment in [PR](https://github.com/facebook/rocksdb/pull/3713/). It turns out that, with UBSAN enabled, we must provide a default value for boolean member variables.
Closes https://github.com/facebook/rocksdb/pull/3728

Differential Revision: D7642476

Pulled By: riversand963

fbshipit-source-id: 4c09a4b8d271151cb99ae7393db9e4ad9f29762e