Matt Benjamin [Fri, 7 Jun 2019 14:20:01 +0000 (10:20 -0400)]
rgw_file: all directories are virtual with respect to contents
This change causes directory handles to always report an mtime of
"now." This is not an invalidate per se--it interacts with the
nfs implementation to produce that result when the implementation
updates its cached attributes. Hence, it can be modulated by timers
or other rules governing attribute caching at the upper layer.
Fixes: http://tracker.ceph.com/issues/40204
Signed-off-by: Matt Benjamin <mbenjamin@redhat.com>
(cherry picked from commit b4c7d0faeff667c25ab255786999ef0cc844ea2b)
xie xingguo [Mon, 3 Jun 2019 08:43:25 +0000 (16:43 +0800)]
test: add parallel clean_pg_upmaps test
With the parallel clean_pg_upmaps feature on, the total time cost
of the performance test, which can now utilize up to 8 threads for
parallel upmap validation, decreased from:
Note that by default the mon uses only 4 worker threads for
CPU-intensive background work; you can further increase the
"mon_cpu_threads" option value if you find that the runtime of
clean_pg_upmaps still matters.
xie xingguo [Mon, 3 Jun 2019 08:10:22 +0000 (16:10 +0800)]
mon/OSDMonitor: do clean_pg_upmaps the parallel way if necessary
There are certainly some cases in which we could reliably
skip this kind of checking, but there is no easy way to separate
those out.
However, this is clearly the general way to do the massive pg
upmap clean-up job more efficiently and hence should make sense
in all cases.
xie xingguo [Mon, 17 Jun 2019 10:44:09 +0000 (18:44 +0800)]
osd: maybe_remove_pg_upmaps -> clean_pg_upmaps
Killing unnecessary or duplicated code should always be the
preferred option, as it is good for maintainability.
Also, I've noticed there is already a clean_temps helper, so renaming
maybe_remove_pg_upmaps to clean_pg_upmaps, to at least keep pace with
that, seems a natural choice.
xie xingguo [Wed, 5 Jun 2019 02:41:52 +0000 (10:41 +0800)]
osd/OSDMapMapping: make ParallelPGMapper accept input pgs
The existing "prime temp" mechanism is a good inspiration
for clusters with a large number of pgs that need to do various
calculations quickly.
I am planning to do the upmap tidy-up work the same way, hence
the need for an alternate way of specifying pgs to process other
than taking directly from the map.
The upmap results are directly applied after calling
_pg_to_raw_osds, which means it basically has nothing to do
with the up/down status.
In other words, if a pg_upmap/pg_upmap_items entry remapped a pg
onto some down osds and is now producing a colliding result,
we should still be able to detect and cancel it.
xie xingguo [Sat, 1 Jun 2019 02:43:10 +0000 (10:43 +0800)]
test/osd: add performance test case for maybe_remove_pg_upmaps
Tom Byrne reported that maybe_remove_pg_upmaps might become
super inefficient for large clusters with the balancer on.
To identify and resolve the problem, we need to add some good
measurements first.
osd/bluestore: Actually wait until completion in write_sync
This function is only used by RocksDB WAL writing so it must sync data.
This fixes #18338 and thus allows actually setting `bluefs_preextend_wal_files`
to true, gaining +100% single-thread write iops in disk-bound (HDD or bad SSD) setups.
To my knowledge it doesn't hurt performance in other cases.
Test it yourself on any HDD with `fio -ioengine=rbd -direct=1 -bs=4k -iodepth=1`.
Issue #18338 is easily reproduced without this patch by issuing a `kill -9` to the OSD
while doing `fio -ioengine=rbd -direct=1 -bs=4M -iodepth=16`.
mon: paxos: introduce new reset_pending_committing_finishers for safety
There are asserts about the state of the system and pending_finishers which can
be triggered by re-running arbitrary commands through Paxos. They are correct
when not restarting, but when we do restart we need to take care to preserve
the same invariants as appropriate. Use this function to be careful about
the order of committing_finishers v pending_finishers and to make sure they're
both empty before any Contexts actually get called.
We also reorder a call to finish_contexts on the waiting_for_writeable list for
similar reasons.
osd/PG: fix last_complete re-calculation on splitting
We now enforce a hard limit on pg_log length, which means we might keep
trimming old log entries irrespective of the pg's current missing_set. As a
result, this can cause the last_complete pointer to move far ahead of the real
on-disk version (the oldest need of the missing_set, for instance) that the
corresponding pg should have on splitting:
For the above example, the parent's last_complete cursor changed from **0'0** to
**238'300** directly, due to trying to catch up to the oldest
log entry change when splitting was done. However, back in v12.2.9 the primary
would still reference a shard's last_complete field when trying to figure out all
possible locations of a currently missing object (see PG::MissingLoc::add_source_info):
```c++
if (oinfo.last_complete < need) {
  if (omissing.is_missing(soid)) {
    ldout(pg->cct, 10) << "search_for_missing " << soid << " " << need
                       << " also missing on osd." << fromosd << dendl;
    continue;
  }
}
```
Hence a wrongly calculated last_complete could then make the primary mistakenly
conclude that a specific shard might have the authoritative object it is currently
looking for:
note that ```5:624c3a7a:::benchmark_data_smithi190_39968_object1382:head 226'110```
was actually missing on both the primary and shard osd.2, whereas the primary
insisted that the object should exist on shard osd.2!
https://github.com/ceph/ceph/pull/26175 posted an indirect fix
for the above problem by ignoring last_complete when checking the missing set,
but it should generally make more sense to fill in the last_complete field correctly
whenever possible.
Hence this additional fix.
Jason Dillaman [Sat, 23 Feb 2019 00:25:44 +0000 (19:25 -0500)]
rbd-mirror: image replay should retry asok registration upon failure
If the asok registration fails (perhaps due to a race condition with
a deleted and recreated image of the same name), periodically attempt
to register the missing asok hook.
If the image replay was canceled prior to the start of the bootstrap
stage, the image replayer would be stuck attempting to shut down if
the bootstrap is paused behind an image sync.
Jason Dillaman [Fri, 22 Feb 2019 15:59:26 +0000 (10:59 -0500)]
rbd-mirror: complete pool watcher initialization if object missing
If the mirroring object is missing, complete the initialization and
continue to retry in the background. This is useful for cases where
the remote doesn't (yet) have mirroring enabled but the remote
pool watcher initialization is delaying the leader watcher promotion
to the point where the leader is blacklisted by its peers.
Jason Dillaman [Tue, 19 Feb 2019 21:06:48 +0000 (16:06 -0500)]
librbd: add missing shutdown states to managed lock helper
The PRE_SHUTTING_DOWN and SHUTTING_DOWN states were missed
in the 'is_state_shutdown' helper method. This resulted in
rbd-mirror potentially entering an infinite loop during
shutdown.
Jason Dillaman [Thu, 28 Feb 2019 21:43:27 +0000 (16:43 -0500)]
librbd: improve object map performance under high IOPS workloads
Do not zero-fill the BitVector's bitset prior to decoding the data.
Additionally, only read-update-modify the portions of the footer
that are potentially affected by the updated state.
Fixes: http://tracker.ceph.com/issues/38538
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 071671fff64f27943047610fe075a7e98f0f705c)
Conflicts:
src/cls/rbd/cls_rbd.cc
src/cls/rbd/cls_rbd_client.h
src/common/bit_vector.hpp
src/test/common/test_bit_vector.cc
src/test/librbd/test_ObjectMap.cc
Trivial conflicts with bufferlist::begin/cbegin and assert/ceph_assert
rgw: allow radosgw-admin to list bucket w --allow-unordered
Presently the `radosgw-admin bucket list --bucket=<bucket>` lists the
objects in lexical order. This can be an expensive operation since
objects are not stored in bucket index shards in order and a selection
sort process is done across all bucket index shards.
By allowing the user to add the "--allow-unordered" command-line flag,
a more efficient bucket listing is enabled. This is particularly
important for buckets with a large number of objects.
Robin H. Johnson [Thu, 23 Aug 2018 17:57:24 +0000 (10:57 -0700)]
rgw: use chunked encoding to get partial results out faster
Some operations can take a long time to have their complete result.
If a RGW operation does not set a content-length header, the RGW
frontends (CivetWeb, Beast) buffer the entire request so that a
Content-Length header can be sent.
If the RGW operation takes long enough, the buffering time may exceed
keepalive values, and because no bytes have been sent in the connection,
the connection will be reset.
If an HTTP response header contains neither Content-Length nor chunked
Transfer-Encoding, HTTP keep-alive is not possible.
To fix the issue within these requirements, use chunked
Transfer-Encoding for the following operations:
RGWCopyObj & RGWDeleteMultiObj specifically use send_partial_response
for long-running operations, and are the most impacted by this issue,
esp. for large inputs. RGWCopyObj attempts to send a Progress header
during the copy, but it's not actually passed on to the client until the
end of the copy, because it's buffered by the RGW frontends!
The HTTP/1.1 specification REQUIRES chunked encoding to be supported,
and the specification does NOT require "chunked" to be included in the
"TE" request header.
This patch has one side-effect: it causes many more small IP packets.
When combined with high-latency links, this can increase the apparent
deletion time due to round trips and TCP slow start. Future improvements
to the RGW frontends are possible in two separate but related ways:
- The FE could continue to push more chunks without waiting for the ACK
on the previous chunk, esp. while under the TCP window size.
- The FE could be patched for different buffer flushing behaviors, as
that behavior is presently unclear (packets of 200-500 bytes seen).
Performance results:
- Bucket with 5M objects, index sharded 32 ways.
- Index on SSD 3x replicas, Data on spinning disk, 5:2
- Multi-delete of 1000 keys, with a common prefix.
- Cache of index primed by listing the common prefix immediately before
deletion.
- Timing data captured at the RGW.
- Timing t0 is the TCP ACK sent by the RGW at the end of the response
body.
- Client is ~75ms away from RGW.
BEFORE:
Time to first byte of response header: 11.3 seconds.
Entire operation: 11.5 seconds.
Response packets: 17
AFTER:
Time to first byte of response header: 3.5ms
Entire operation: 16.36 seconds
Response packets: 206
Backport: mimic, luminous
Issue: http://tracker.ceph.com/issues/12713
Signed-off-by: Robin H. Johnson <rjohnson@digitalocean.com>
(cherry picked from commit d22c1f96707ba9ae84578932bd4d741f6c101a54)
xie xingguo [Sat, 23 Mar 2019 01:50:27 +0000 (09:50 +0800)]
osd/OSDMap: calc_pg_upmaps - restrict optimization to origin pools only
The current implementation will try to cancel any pg_upmaps that
would otherwise re-map a PG out from an underfull osd, which is wrong,
e.g., because it could reliably fire the following assert:
xie xingguo [Sat, 19 Jan 2019 09:19:10 +0000 (17:19 +0800)]
crush: fix upmap overkill
It appears that OSDMap::maybe_remove_pg_upmaps's sanity checks
are overzealous. With some customized crush rules it is possible
for osdmaptool to generate valid upmaps, but maybe_remove_pg_upmaps
will cancel them.
xie xingguo [Mon, 18 Feb 2019 07:40:22 +0000 (15:40 +0800)]
osd/OSDMap: using std::vector::reserve to reduce memory reallocation
In C++ vectors are dynamic arrays.
Vectors are assigned memory in blocks of contiguous locations.
When the memory allocated for the vector falls short of storing
new elements, a new memory block is allocated and all
elements are copied from the old location to the new one.
This reallocation lets vectors grow as required; however, it is a
costly operation, and the time complexity involved in this step is
linear. Try to use std::vector::reserve whenever possible if
performance matters.