Yehuda Sadeh [Thu, 17 Nov 2016 00:20:17 +0000 (16:20 -0800)]
rgw: clean rgw_obj
Instead of storing both the oid and the name, store only the name and
calculate the oid when needed (the same goes for the locator). Also give
more coherent names to the various fields.
Two main changes here:
1. A newly created rgw_bucket no longer has predetermined placement pools
assigned to it. The placement_id param in the objects themselves points
at where the data is located. This affects the object's tail location;
the head is located where the bucket instance's placement rule points.
2. Modify the object manifest to use rgw_raw_obj instead of rgw_obj.
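As an illustration of the oid/locator cleanup described above (a minimal
sketch only; the field names and prefixing rules here are assumptions,
not the actual rgw code), the oid and locator become values derived from
the stored name rather than fields kept in sync:

    #include <string>

    struct rgw_obj_sketch {
      std::string name;       // the stored identity
      std::string ns;         // namespace, may be empty
      std::string instance;   // version instance, may be empty

      // oid is computed on demand instead of being stored alongside the name
      std::string get_oid() const {
        std::string oid = name;
        if (!ns.empty())
          oid = "_" + ns + "_" + oid;   // simplified; the real prefixing differs
        if (!instance.empty())
          oid += "__" + instance;
        return oid;
      }

      // the locator follows the same idea: computed when needed, not stored
      std::string get_loc() const {
        return std::string();           // default locator in this sketch
      }
    };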
Yehuda Sadeh [Sat, 8 Oct 2016 04:03:32 +0000 (21:03 -0700)]
rgw: introduce rgw_pool, rgw_raw_obj
Pools are now represented by rgw_pool (and no longer by rgw_bucket),
and rgw_raw_obj is used to reference rados objects and all 'system'
objects (vs. rgw_obj, which is used for rgw objects).
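A minimal sketch of the distinction (illustrative only; these are not the
actual rgw definitions):

    #include <string>

    struct rgw_pool_sketch {
      std::string name;   // rados pool name
      std::string ns;     // optional rados namespace
    };

    // A raw rados object: pool + oid (+ locator); used for 'system' objects.
    struct rgw_raw_obj_sketch {
      rgw_pool_sketch pool;
      std::string oid;
      std::string loc;
    };
    // rgw_obj, by contrast, names an rgw-level object via its bucket and key.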
optimistyzy [Thu, 9 Mar 2017 06:30:33 +0000 (14:30 +0800)]
Bluestore, NVMeDevice: fix the core id for rte_remote_launch
Previously, we used the id, which will always be 1 since we only support
one NVMe device per OSD. We also have a coremask conf option whose default
is 3, so we use core 1, which happens to be correct. However, if another
core mask is specified, e.g. 0xC, passing 1 to start the dpdk_thread is
wrong; we need to pass core id = 4.
Since each shared data structure only uses one CPU core, just pass
rte_get_next_lcore(-1, 0, 0), which selects the first slave core.
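A rough sketch of the change (not the actual NVMEDevice code;
dpdk_thread_adaptor and shared_data are placeholders here):

    #include <rte_launch.h>
    #include <rte_lcore.h>

    static int dpdk_thread_adaptor(void *arg) {  // placeholder worker function
      (void)arg;
      return 0;
    }

    void launch_io_thread(void *shared_data) {
      // Before: rte_eal_remote_launch(dpdk_thread_adaptor, shared_data, 1);
      // hardcoding lcore 1 only works when the coremask happens to include it.
      unsigned lcore = rte_get_next_lcore(-1, 0, 0);  // first enabled lcore in the mask
      rte_eal_remote_launch(dpdk_thread_adaptor, shared_data, lcore);
    }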
John Spray [Wed, 1 Mar 2017 12:33:05 +0000 (12:33 +0000)]
mds: flush PQ even when not consuming
In normal operation we generate flushes from
_consume when we read from the journaler. However,
we should also have a fallback flush mechanism for
situations where can_consume() is false for a long time.
This comes up in testing when we set the throttle to zero to
prevent progress, but would also come up in real life if
we were busy purging a few very large files, or if purging
was stuck due to bad PGs in the data pool -- we don't want
that to stop us completing appends to the PQ.
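A hedged sketch of the fallback (stub types only; the real PurgeQueue and
Journaler are more involved):

    #include <cstdint>

    struct JournalerStub {                // stand-in for osdc Journaler
      uint64_t write_pos = 0, flush_pos = 0;
      void flush() { flush_pos = write_pos; }
    };

    struct PurgeQueueSketch {
      JournalerStub journaler;
      bool can_consume = false;           // e.g. throttle set to zero in testing

      // Called periodically: push appended entries out even while consumption
      // is stalled, so appends to the queue keep completing.
      void maybe_flush() {
        if (!can_consume && journaler.write_pos > journaler.flush_pos)
          journaler.flush();
      }
    };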
John Spray [Mon, 13 Feb 2017 12:01:40 +0000 (12:01 +0000)]
mds: write_head when reading in PurgeQueue
Previously, write_head calls were only generated
on the write side, so if you had a big queue
and were just working through consuming it, you
wouldn't record your progress, and a daemon
restart would end up repeating a load of work.
John Spray [Mon, 13 Feb 2017 12:00:42 +0000 (12:00 +0000)]
osdc: expose Journaler::write_head_needed
So that callers on the read side can optionally
do their own write_head calls according to
the same condition that Journaler uses
internally for its write_head during _flush().
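A sketch of how a read-side caller might use this (a stub that only mimics
the shape of the API; the 4MB gap is an assumption):

    #include <cstdint>

    struct JournalerHeadStub {
      uint64_t expire_pos = 0, last_written_head_pos = 0;

      // Mirrors the idea behind Journaler::write_head_needed(): has the read
      // side advanced far enough past the last persisted head to be worth it?
      bool write_head_needed() const {
        return expire_pos > last_written_head_pos + (4u << 20);
      }
      void write_head() { last_written_head_pos = expire_pos; }
    };

    // After consuming entries, persist progress so a restart does not redo them.
    void after_consume(JournalerHeadStub &j, uint64_t new_expire_pos) {
      j.expire_pos = new_expire_pos;
      if (j.write_head_needed())
        j.write_head();
    }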
John Spray [Mon, 13 Feb 2017 00:16:29 +0000 (00:16 +0000)]
osdc: less aggressive prefetch in read/write Journaler
Previously, if doing a write/is_readable/write/is_readable sequence,
you'd end up doing a flush after every write, even though there
was already a flush in flight that would advance the readable-ness
of the journal.
Because this flush-during-read path is only active when using
a read/write journal such as in PurgeQueue, tweak the behaviour
to suit this case.
This was an unused code path. If anyone set a nonzero
value here, the MDS would crash, because the Timer implementation
has changed since this code was written and now requires
add_event_after callers to hold the right lock.
John Spray [Wed, 8 Feb 2017 16:24:24 +0000 (16:24 +0000)]
mds: expose progress during PurgeQueue drain
We don't track an item count, but we do have
the number of bytes left in the Journaler, so
we can use that to give an indication of progress
while the MDS rank shutdown is waiting for
the PurgeQueue to do its thing.
Also lift the ops limit on the PurgeQueue
when it goes into the drain phase.
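A sketch of the kind of progress estimate this enables (parameter names are
assumptions):

    #include <cstdint>

    // Progress is expressed as the fraction of journal bytes already purged,
    // since the PurgeQueue does not track an item count.
    double drain_progress(uint64_t write_pos, uint64_t read_pos,
                          uint64_t drain_initial_bytes) {
      if (drain_initial_bytes == 0)
        return 1.0;
      uint64_t remaining = write_pos - read_pos;   // bytes still to be purged
      return 1.0 - double(remaining) / double(drain_initial_bytes);
    }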
John Spray [Mon, 5 Dec 2016 15:40:00 +0000 (15:40 +0000)]
mds: move throttling code out of StrayManager
This will belong in PurgeQueue from now on. We assume
that there is no need to throttle the rate of insertions
into the purge queue, as it is an efficient, sequentially
written journal.
John Spray [Thu, 1 Dec 2016 20:22:43 +0000 (20:22 +0000)]
mds: use a persistent queue for purging deleted files
To avoid creating stray directories of unbounded size
and all the associated pain, use a more appropriate
data structure to store a FIFO of inodes that need
purging.
Fixes: http://tracker.ceph.com/issues/11950
Signed-off-by: John Spray <john.spray@redhat.com>
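A minimal sketch of the persistent-FIFO idea (not the actual
PurgeItem/PurgeQueue types; the on-disk journal is stood in for by a deque):

    #include <cstdint>
    #include <deque>
    #include <string>

    struct PurgeItemSketch {        // what must be purged for one deleted inode
      uint64_t ino = 0;
      uint64_t size = 0;            // used to work out which objects to delete
      std::string data_pool;
    };

    // Items are appended at the back and consumed from the front, so the purge
    // backlog no longer has to live in stray directories of unbounded size.
    struct PurgeFifoSketch {
      std::deque<PurgeItemSketch> fifo;
      void push(const PurgeItemSketch &item) { fifo.push_back(item); }
      bool pop(PurgeItemSketch &out) {
        if (fifo.empty())
          return false;
        out = fifo.front();
        fifo.pop_front();
        return true;
      }
    };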
John Spray [Thu, 1 Dec 2016 19:10:35 +0000 (19:10 +0000)]
osdc/Journaler: wrap recover() completion in finisher
Otherwise, the callback will deadlock if it in turn
calls into any Journaler functions. We don't care
about performance here because this is done once at startup.
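A generic sketch of the pattern (not the Journaler code): queueing the
completion on a finisher instead of calling it inline under the lock avoids
the re-entrancy deadlock:

    #include <functional>
    #include <mutex>
    #include <thread>

    struct FinisherSketch {                 // stand-in for Ceph's Finisher
      void queue(std::function<void()> fn) {
        std::thread(std::move(fn)).detach();  // the real Finisher uses one worker
      }
    };

    std::mutex journaler_lock;
    FinisherSketch finisher;

    void recover_done(std::function<void()> on_recover) {
      std::lock_guard<std::mutex> l(journaler_lock);
      // Calling on_recover() inline here would deadlock if it re-enters and
      // takes journaler_lock again; hand it to the finisher thread instead.
      finisher.queue(std::move(on_recover));
    }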
John Spray [Thu, 1 Dec 2016 18:59:26 +0000 (18:59 +0000)]
osdc/Journaler: add have_waiter()
Allows users of wait_for_readable to conveniently
see if there is already a waiter. Yes, they could
do this themselves, but I'd rather peek at an existing
variable than add a new one caller-side.
John Spray [Thu, 1 Dec 2016 15:27:39 +0000 (15:27 +0000)]
osdc/Journaler: remove incorrect assertion
This asserted that flush_pos would be ahead of
safe_pos after calling _flush. However, this
is not guaranteed to be the case because
prezeroing might prevent us from flushing
right now.
Venky Shankar [Wed, 1 Mar 2017 06:51:49 +0000 (12:21 +0530)]
rbd: cleanup unused throttle in v2 import
v2 import does not use a throttle at the moment, although v1
import does. Initialize the throttle only where it is
necessary and avoid passing it to functions that do not
require it.
Add an option to prefer a WAL write if the write is below a size threshold,
even if we could avoid it. This lets you trade some write-amp (by
journaling data to rocksdb) for latency in cases where the WAL device is
much faster than the main device.
This affects:
- writes to new extent locations below min_alloc_size
- writes to unallocated space below min_alloc_size
- "big" writes above min_alloc_size that are below the prefer_wal_size
threshold.
Note that the threshold is applied to individual blobs, not to the entire
write, so if you have a larger write torn into two pieces/blobs that are
each below the threshold, they will both go through the WAL.
Set different defaults for HDD and SSD, since this makes more sense for HDD
where seeks are expensive.
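A sketch of the per-blob decision (the option and function names here are
placeholders, not the actual BlueStore code):

    #include <cstdint>

    // Even when a blob write could avoid the WAL (new extent location,
    // unallocated space, or a "big" write), prefer journaling it if it is
    // smaller than the configured threshold.
    bool prefer_wal_for_blob(uint64_t blob_len, uint64_t prefer_wal_size) {
      return blob_len < prefer_wal_size;  // below threshold -> journal to rocksdb
    }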