Robin H. Johnson [Thu, 23 Aug 2018 17:57:24 +0000 (10:57 -0700)]
rgw: use chunked encoding to get partial results out faster
Some operations can take a long time to produce their complete result.
If an RGW operation does not set a Content-Length header, the RGW
frontends (CivetWeb, Beast) buffer the entire response so that a
Content-Length header can be sent.
If the RGW operation takes long enough, the buffering time may exceed
keepalive values, and because no bytes have been sent on the connection,
the connection will be reset.
If an HTTP response header contains neither Content-Length nor chunked
Transfer-Encoding, HTTP keep-alive is not possible.
To fix the issue within these requirements, use chunked
Transfer-Encoding for the following operations:
RGWCopyObj & RGWDeleteMultiObj specifically use send_partial_response
for long-running operations, and are the most impacted by this issue,
esp. for large inputs. RGWCopyObj attempts to send a Progress header
during the copy, but it's not actually passed on to the client until the
end of the copy, because it's buffered by the RGW frontends!
The HTTP/1.1 specification REQUIRES chunked encoding to be supported,
and the specification does NOT require "chunked" to be included in the
"TE" request header.
This patch has one side-effect: it causes many more small IP packets.
When combined with high-latency links, this can increase the apparent
deletion time due to round trips and TCP slow start. Future improvements
to the RGW frontends are possible in two separate but related ways:
- The FE could continue to push more chunks without waiting for the ACK
on the previous chunk, esp. while under the TCP window size.
- The FE could be patched for different buffer-flushing behaviors, as
that behavior is presently unclear (packets of 200-500 bytes seen).
Performance results:
- Bucket with 5M objects, index sharded 32 ways.
- Index on SSD 3x replicas, Data on spinning disk, 5:2
- Multi-delete of 1000 keys, with a common prefix.
- Cache of index primed by listing the common prefix immediately before
deletion.
- Timing data captured at the RGW.
- Timing t0 is the TCP ACK sent by the RGW at the end of the response
body.
- Client is ~75ms away from RGW.
BEFORE:
Time to first byte of response header: 11.3 seconds.
Entire operation: 11.5 seconds.
Response packets: 17
AFTER:
Time to first byte of response header: 3.5ms
Entire operation: 16.36 seconds
Response packets: 206
Backport: mimic, luminous
Issue: http://tracker.ceph.com/issues/12713
Signed-off-by: Robin H. Johnson <rjohnson@digitalocean.com>
(cherry picked from commit d22c1f96707ba9ae84578932bd4d741f6c101a54)
xie xingguo [Sat, 23 Mar 2019 01:50:27 +0000 (09:50 +0800)]
osd/OSDMap: calc_pg_upmaps - restrict optimization to origin pools only
The current implementation will try to cancel any pg_upmaps that
would otherwise re-map a PG out from an underfull osd, which is wrong,
e.g., because it could reliably fire the following assert:
xie xingguo [Sat, 19 Jan 2019 09:19:10 +0000 (17:19 +0800)]
crush: fix upmap overkill
It appears that OSDMap::maybe_remove_pg_upmaps's sanity checks
are overzealous. With some customized crush rules it is possible
for osdmaptool to generate valid upmaps, but maybe_remove_pg_upmaps
will cancel them.
xie xingguo [Mon, 18 Feb 2019 07:40:22 +0000 (15:40 +0800)]
osd/OSDMap: using std::vector::reserve to reduce memory reallocation
In C++, vectors are dynamic arrays.
A vector's elements are stored in a single contiguous block of memory.
When the memory allocated for the vector falls short of storing
new elements, a new memory block is allocated and all existing
elements are copied from the old location to the new one.
This reallocation is what lets vectors grow when required; however,
it is a costly operation whose time complexity is linear in the
number of elements.
Try to use std::vector::reserve whenever possible if performance
matters.
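For illustration, a generic sketch of the pattern (not a specific
OSDMap call site): reserving up front replaces the repeated
grow-and-copy cycle with a single allocation.

    #include <cstddef>
    #include <vector>

    // Generic sketch, not a specific OSDMap call site: with reserve(),
    // the vector allocates once instead of reallocating (and copying
    // every element) repeatedly as it grows.
    std::vector<int> build(std::size_t n) {
      std::vector<int> v;
      v.reserve(n);                        // one allocation, known final size
      for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i));  // never triggers reallocation now
      return v;
    }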
xie xingguo [Sat, 26 Jan 2019 10:03:15 +0000 (18:03 +0800)]
osd/OSDMap: more improvements to upmap
- add the ability to append a 2nd, 3rd, etc. pair to existing upmaps
when possible, rather than just continuing to the next PG
- handle the underfull case: we can rm-pg-upmap-items if there exist
any upmaps which remapped a PG out from an underfull OSD
Neha Ojha [Mon, 4 Mar 2019 04:29:05 +0000 (20:29 -0800)]
osd/PG: skip rollforward when !transaction_applied during append_log()
Earlier, when !transaction_applied, we called pg_log.roll_forward(&handler),
which advanced the crt and trimmed the entries in roll_forward(). Because
of this, when _merge_object_divergent_entries() later tried to roll back
entries, those objects were not found in the backend, and we hit
http://tracker.ceph.com/issues/36739.
With this change, we advance the crt value without deleting the
objects, so that _merge_object_divergent_entries() does not fail
because of deleted objects.
xie xingguo [Tue, 26 Mar 2019 07:02:02 +0000 (15:02 +0800)]
osd/PG: move down peers out from peer_purged
In purge_strays(), we aggressively clear stray_set and
add all related peers to peer_purged.
However, if the corresponding peer is down and later comes
up again, having (unconditionally) added it to peer_purged
will prevent the primary from re-purging it
(see Active::react(const MNotifyRec& notevt)).
On consuming a new osdmap, let's move any down peers out of
peer_purged at the same time. This way we lower the risk
of leaving any leftover PGs behind.
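A hedged sketch of the idea with simplified types (the real code
tracks pg_shard_t entries and consults the OSDMap):

    #include <functional>
    #include <set>

    // Hedged sketch, simplified types (the real code uses pg_shard_t
    // and the OSDMap): while consuming a new osdmap, drop every peer
    // the map reports down from peer_purged, so a peer that later
    // comes back up can be purged again if it still holds leftovers.
    void prune_purged_peers(std::set<int>& peer_purged,
                            const std::function<bool(int)>& is_up) {
      for (auto p = peer_purged.begin(); p != peer_purged.end(); ) {
        if (!is_up(*p))
          p = peer_purged.erase(p);  // down peer: allow future re-purge
        else
          ++p;
      }
    }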
xie xingguo [Tue, 26 Mar 2019 12:04:15 +0000 (20:04 +0800)]
osd/PG: introduce all_missing_unfound helper
We use pg_log.missing to track each peer's missing objects separately,
whereas missing_loc records the locations of all (probably existing) good
copies of both the primary's and the replicas' missing objects. Hence an
item from pg_log.missing and one from missing_loc have different meanings
and are not directly comparable.
During recovery, we can skip recovering the primary only if
- the primary is good, i.e., has no missing objects at all,
- or all of the primary's missing objects do exist in missing_loc and are
currently unfound.
Obviously, the current "all missing objects are unfound" checker is broken.
Fix by introducing an independent all_missing_unfound helper so that the
count of missing objects that are currently unfound is correct.
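A hedged sketch of the intended check with simplified types (the real
helper walks the primary's missing set and consults missing_loc):

    #include <set>
    #include <string>

    // Hedged sketch, simplified types (the real helper walks the
    // primary's pg_log missing set and consults missing_loc): primary
    // recovery may be skipped only when every object the primary is
    // missing is also currently unfound, i.e. has no known good copy.
    bool all_missing_unfound(const std::set<std::string>& primary_missing,
                             const std::set<std::string>& unfound) {
      for (const auto& oid : primary_missing) {
        if (!unfound.count(oid))
          return false;  // a good copy is known; recover this object
      }
      return true;
    }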
Zengran Zhang [Wed, 27 Mar 2019 01:39:31 +0000 (09:39 +0800)]
osd: shutdown recovery_request_timer earlier
recovery_request_timer may hold some QueuePeeringEvts, each of which
holds a PGRef; if we don't shut it down earlier, it can potentially
leak PGRefs when kicking PGs.
The markdown test is based on marking down a specific number of times, but
the duplicate commands from the CLI may not get absorbed/batched by the
mon, breaking the test. Override the default qa/tasks/workunit.py
behavior of sending dups.
J. Eric Ivancich [Fri, 15 Feb 2019 01:30:46 +0000 (20:30 -0500)]
rgw: resolve bugs and clean up garbage collection code
Does a number of things to clean up rgw gc code:
* adds additional logging to make future debugging easier
* resolves bug where the truncated flag was not always set correctly
in gc_iterate_entries
* resolves bug where marker in RGWGC::process was not advanced
* resolves bug in which gc entries with a zero-length chain
were not trimmed
* resolves bug where same gc entry tag was added to list for
deletion multiple times
Fixes: http://tracker.ceph.com/issues/38454
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
(cherry picked from commit 73d7d369a2cc5edc8d51d70e02420e6efbfbe297)
Conflicts:
src/rgw/rgw_gc.cc dout() vs ldpp_dout()
Kefu Chai [Thu, 14 Jun 2018 01:32:08 +0000 (09:32 +0800)]
cmake: update BuildSPDK for spdk-18.05
In spdk v18.05, libuuid is linked into libspdk_util.a, where it is
used by lib/util/uuid.c, and libspdk_vol.a uses the wrapper function
exposed by libspdk_util.a, so update the CMake script to
reflect the change.
Igor Fedotov [Mon, 27 Aug 2018 13:22:56 +0000 (16:22 +0300)]
os/tests: fix garbageCollection test case from store_test.
While running the test case with an SSD as the block device, one could
face a failure caused by an unexpectedly small blob size limit:
compression resulted in two blocks rather than a single one, which
violated the test case's constraints.
Nautilus+ releases don't have the issue, which is probably related to
modifications of BlueStore::MempoolThread::_trim_shards that introduce
a different calculation for the meta/data_alloc parameters.
Jonas Jelten [Mon, 1 Apr 2019 10:28:09 +0000 (12:28 +0200)]
osd/PG: discover missing objects when an OSD peers and PG is degraded
When a PG is remapped from OSD `a` to OSD `b`, the objects are
backfilled. When OSD `a` is restarted, objects become degraded
as `a` is no longer queried or considered as a backfill source.
As the PG is degraded, `PG::discover_all_missing` is not called
when a candidate OSD peers with the primary: The PG is already
active, thus `PG::activate` (and in turn missing object discovery)
is not called. Discovery is also not initiated from
`PG::RecoveryState::Active::react(const MNotifyRec& notevt)`
as there are no unfound objects.
This patch adds a call to `discover_all_missing` when
an OSD sends its `MNotifyRec` message and the PG is degraded.
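A hedged sketch of the resulting trigger condition with simplified
types (the real handler is PG::RecoveryState::Active::react(const
MNotifyRec&)):

    #include <functional>

    // Hedged sketch, simplified types (the real code lives in
    // PG::RecoveryState::Active::react(const MNotifyRec&)): on a peer
    // notification, also initiate missing-object discovery when the
    // PG is degraded, not only when it has unfound objects.
    struct PGView {
      bool has_unfound;
      bool degraded;
    };

    void on_notify(const PGView& pg,
                   const std::function<void()>& discover_all_missing) {
      if (pg.has_unfound || pg.degraded)  // "|| pg.degraded" is new
        discover_all_missing();
    }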