xie xingguo [Mon, 3 Jun 2019 08:10:22 +0000 (16:10 +0800)]
mon/OSDMonitor: do clean_pg_upmaps the parallel way if necessary
There are certainly cases where we could reliably skip this kind
of check, but there is no easy way to separate those out.
However, parallelizing is clearly the general way to do the massive
pg upmap clean-up job more efficiently, and hence should make sense
in all cases.
xie xingguo [Mon, 17 Jun 2019 10:44:09 +0000 (18:44 +0800)]
osd: maybe_remove_pg_upmaps -> clean_pg_upmaps
Removing unnecessary or duplicated code should always be the
preferred option, since it eases maintenance.
Also, there is already a clean_temps helper, so renaming
maybe_remove_pg_upmaps to clean_pg_upmaps to stay consistent with
it is a natural choice.
xie xingguo [Wed, 5 Jun 2019 02:41:52 +0000 (10:41 +0800)]
osd/OSDMapMapping: make ParallelPGMapper can accept input pgs
The existing "prime temp" mechanism is a good inspiration
for clusters with a large number of pgs that need to do various
calculations quickly.
I am planning to do the upmap tidy-up work the same way, hence
the need for an alternate way of specifying the pgs to process,
other than taking them directly from the map.
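A minimal sketch of the idea, not the ParallelPGMapper code itself: the
caller hands over an explicit list of pgs, the list is cut into chunks,
and a pool of workers checks each chunk. The names split_into_chunks,
process_parallel, check_pg and pgs_per_chunk are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(pgs, pgs_per_chunk):
    """Cut a caller-supplied pg list into fixed-size chunks of work."""
    return [pgs[i:i + pgs_per_chunk] for i in range(0, len(pgs), pgs_per_chunk)]


def process_parallel(pgs, check_pg, pgs_per_chunk=2, workers=4):
    """Run check_pg over the given pgs and return those it flags, in order."""
    def run(chunk):
        return [pg for pg in chunk if check_pg(pg)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order; list() collects before the pool closes
        results = list(pool.map(run, split_into_chunks(pgs, pgs_per_chunk)))
    return [pg for chunk in results for pg in chunk]
```

The point of taking the pg list as an argument is that the same worker
pool can serve jobs that only care about a subset of the map.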
The upmap results are applied directly after calling
_pg_to_raw_osds, which means they basically have nothing to do
with the up/down status.
In other words, if a pg_upmap/pg_upmap_items entry remapped a pg
onto some down osds and is now causing a colliding result,
we should still be able to detect and cancel it.
xie xingguo [Sat, 1 Jun 2019 02:43:10 +0000 (10:43 +0800)]
test/osd: add performance test case for maybe_remove_pg_upmap
Tom Byrne reported that maybe_remove_pg_upmap might become
super inefficient for large clusters with balancer on.
To identify and resolve the problem, we need to add some good
measurements first.
rgw: allow radosgw-admin to list bucket w --allow-unordered
Presently the `radosgw-admin bucket list --bucket=<bucket>` lists the
objects in lexical order. This can be an expensive operation since
objects are not stored in bucket index shards in order and a selection
sort process is done across all bucket index shards.
By allowing the user to add the "--allow-unordered" command-line flag,
a more efficient bucket listing is enabled. This is particularly
important for buckets with a large number of objects.
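A rough illustration of the cost difference, assuming each index shard
can return its own entries in sorted order. The shard contents and the
function names are hypothetical, and a heap-based merge stands in for
the selection process RGW actually performs.

```python
import heapq
from itertools import chain


def list_ordered(shards):
    """Merge per-shard sorted listings into one lexically ordered listing.

    Every emitted entry requires comparing the heads of all shards.
    """
    return list(heapq.merge(*shards))


def list_unordered(shards):
    """Drain the shards one after another, with no cross-shard comparison."""
    return list(chain.from_iterable(shards))


shards = [["a", "d"], ["b", "e"], ["c", "f"]]
ordered = list_ordered(shards)      # lexical order across shards
unordered = list_unordered(shards)  # shard order, same set of entries
```

Both listings contain the same entries; only the ordered one pays for
the cross-shard merge.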
Robin H. Johnson [Thu, 23 Aug 2018 17:57:24 +0000 (10:57 -0700)]
rgw: use chunked encoding to get partial results out faster
Some operations can take a long time to produce their complete result.
If an RGW operation does not set a Content-Length header, the RGW
frontends (CivetWeb, Beast) buffer the entire response so that a
Content-Length header can be sent.
If the RGW operation takes long enough, the buffering time may exceed
keepalive values, and because no bytes have been sent on the connection,
the connection will be reset.
If an HTTP response header contains neither Content-Length nor chunked
Transfer-Encoding, HTTP keep-alive is not possible.
To fix the issue within these requirements, use chunked
Transfer-Encoding for the following operations:
RGWCopyObj & RGWDeleteMultiObj specifically use send_partial_response
for long-running operations, and are the most impacted by this issue,
esp. for large inputs. RGWCopyObj attempts to send a Progress header
during the copy, but it's not actually passed on to the client until the
end of the copy, because it's buffered by the RGW frontends!
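For reference, a minimal sketch of how chunked Transfer-Encoding frames
a body so that partial results can leave before the total length is
known. This follows the HTTP/1.1 chunk format and is not the RGW
frontend code; encode_chunk and encode_chunked are hypothetical names.

```python
def encode_chunk(data: bytes) -> bytes:
    """Frame one chunk: size in hex, CRLF, payload, CRLF."""
    return b"%x\r\n%s\r\n" % (len(data), data)


def encode_chunked(parts):
    """Yield each non-empty part as a chunk, then the terminating chunk."""
    for part in parts:
        if part:  # a zero-length chunk would end the body early
            yield encode_chunk(part)
    yield b"0\r\n\r\n"  # last chunk, no trailers


body = b"".join(encode_chunked([b"<Deleted>", b"", b"<Key>a</Key>"]))
```

Each yielded chunk can be written to the socket as soon as it is
produced, which is what lets the first bytes go out early.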
The HTTP/1.1 specification REQUIRES chunked encoding to be supported,
and the specification does NOT require "chunked" to be included in the
"TE" request header.
This patch has one side effect: it causes many more small IP packets.
When combined with high-latency links this can increase the apparent
deletion time due to round trips and TCP slow start. Future improvements
to the RGW frontends are possible in two separate but related ways:
- The FE could continue to push more chunks without waiting for the ACK
on the previous chunk, esp. while under the TCP window size.
- The FE could be patched for different buffer flushing behaviors, as
that behavior is presently unclear (packets of 200-500 bytes seen).
Performance results:
- Bucket with 5M objects, index sharded 32 ways.
- Index on SSD 3x replicas, Data on spinning disk, 5:2
- Multi-delete of 1000 keys, with a common prefix.
- Cache of index primed by listing the common prefix immediately before
deletion.
- Timing data captured at the RGW.
- Timing t0 is the TCP ACK sent by the RGW at the end of the response
body.
- Client is ~75ms away from RGW.
BEFORE:
Time to first byte of response header: 11.3 seconds.
Entire operation: 11.5 seconds.
Response packets: 17
AFTER:
Time to first byte of response header: 3.5ms
Entire operation: 16.36 seconds
Response packets: 206
Backport: mimic, luminous
Issue: http://tracker.ceph.com/issues/12713
Signed-off-by: Robin H. Johnson <rjohnson@digitalocean.com>
(cherry picked from commit d22c1f96707ba9ae84578932bd4d741f6c101a54)
xie xingguo [Sat, 23 Mar 2019 01:50:27 +0000 (09:50 +0800)]
osd/OSDMap: calc_pg_upmaps - restrict optimization to origin pools only
The current implementation will try to cancel any pg_upmaps that
would otherwise re-map a PG out from an underfull osd, which is wrong,
e.g., because it could reliably fire the following assert:
xie xingguo [Sat, 19 Jan 2019 09:19:10 +0000 (17:19 +0800)]
crush: fix upmap overkill
It appears that OSDMap::maybe_remove_pg_upmaps's sanity checks
are overzealous. With some customized crush rules it is possible
for osdmaptool to generate valid upmaps, but maybe_remove_pg_upmaps
will cancel them.
xie xingguo [Mon, 18 Feb 2019 07:40:22 +0000 (15:40 +0800)]
osd/OSDMap: using std::vector::reserve to reduce memory reallocation
In C++, vectors are dynamic arrays that store their elements in
a contiguous block of memory.
When the allocated block is too small to hold a new element, a
new, larger block is allocated and all elements are copied from
the old location to the new one.
This reallocation lets vectors grow when required.
However, it is a costly operation: the time complexity of each
reallocation is linear in the number of elements.
Use std::vector::reserve whenever the final size is known or can
be estimated in advance, if performance matters.
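A toy model of the cost, written in Python rather than C++ so it only
counts copies instead of measuring them. It assumes capacity doubles
when the block is full; real growth factors are implementation-defined.
The function name and the `reserved` parameter are hypothetical.

```python
def copies_on_growth(n, reserved=0):
    """Count element copies caused by reallocation while appending n items.

    `reserved` mimics calling reserve() before the first append.
    """
    capacity, size, copies = reserved, 0, 0
    for _ in range(n):
        if size == capacity:
            capacity = max(1, capacity * 2)  # assumed doubling policy
            copies += size  # existing elements move to the new block
        size += 1
    return copies


without_reserve = copies_on_growth(1000)
with_reserve = copies_on_growth(1000, reserved=1000)
```

Under this model, appending 1000 elements without reserving costs 1023
element copies across the reallocations; reserving up front costs none.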
xie xingguo [Sat, 26 Jan 2019 10:03:15 +0000 (18:03 +0800)]
osd/OSDMap: more improvements to upmap
- add ability of appending a 2nd, 3rd, etc... pair to existing upmaps
when possible, rather than just continuing to the next PG
- handle the underfull case: we can rm-pg-upmap-items if there exist
any upmaps which remapped a PG out from an underfull OSD
Neha Ojha [Mon, 4 Mar 2019 04:29:05 +0000 (20:29 -0800)]
osd/PG: skip rollforward when !transaction_applied during append_log()
Earlier, we did pg_log.roll_forward(&handler), when
!transaction_applied, which advanced the crt and trimmed the entries
in rollforward(). Due to this, during _merge_object_divergent_entries(),
when we tried to rollback entries, those objects were not found in the
backend, and thus we hit this bug http://tracker.ceph.com/issues/36739.
With this change, we are advancing the crt value, without deleting the
objects, so that _merge_object_divergent_entries() does not fail
because of deleted objects.
xie xingguo [Tue, 26 Mar 2019 07:02:02 +0000 (15:02 +0800)]
osd/PG: move down peers out from peer_purged
In purge_strays(), we aggressively clear stray_set and
add all related peers to peer_purged.
However, if the corresponding peer is down and then comes
up again, having (unconditionally) added it to peer_purged
will prevent the primary from re-purging it.
(See Active::react(const MNotifyRec& notevt))
On consuming a new osdmap, let's move any down peers out of
peer_purged at the same time. This way we can lower the risk
of leaving any leftover PGs behind.
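The clean-up step can be pictured roughly as below. This is a sketch
with plain sets, not the PG code; drop_down_peers and the is_down
callback are hypothetical stand-ins for the osdmap lookup.

```python
def drop_down_peers(peer_purged, is_down):
    """Return peer_purged without the peers that the new map marks down."""
    return {osd for osd in peer_purged if not is_down(osd)}


down = {3}
peer_purged = drop_down_peers({1, 3, 5}, lambda osd: osd in down)
```

Once osd 3 is no longer in peer_purged, a later notify from it can
trigger a purge again.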
xie xingguo [Tue, 26 Mar 2019 12:04:15 +0000 (20:04 +0800)]
osd/PG: introduce all_missing_unfound helper
We use pg_log.missing to track each peer's missing objects separately,
whereas missing_loc records the locations of all (probably existing) good copies
of both the primary's and the replicas' missing objects. Hence items from
pg_log.missing and missing_loc have different meanings and are not comparable.
During recovery, we can skip recovering the primary only if
- the primary is good, i.e., has no missing objects at all
- or all of the primary's missing objects do exist in missing_loc and are
currently unfound
Obviously, the current "all missing objects are unfound" checker is broken.
Fix by introducing an independent all_missing_unfound helper that correctly
counts the missing objects that are currently unfound.
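A sketch of the check described above, using plain sets and dicts as
stand-ins for pg_log.missing and missing_loc. The parameter names and
can_skip_primary are hypothetical, and the real helper may differ in
detail.

```python
def all_missing_unfound(missing, needs_recovery, locations):
    """True if every missing object is tracked and has no known good copy."""
    for oid in missing:
        if oid not in needs_recovery:
            return False  # not tracked by missing_loc at all
        if locations.get(oid):
            return False  # a good copy is known, so it is recoverable
    return True


def can_skip_primary(missing, needs_recovery, locations):
    """Skip the primary if it misses nothing or everything is unfound."""
    return not missing or all_missing_unfound(missing, needs_recovery, locations)
```

Checking each of the primary's missing objects individually avoids
comparing counts taken from two structures that track different things.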
Zengran Zhang [Wed, 27 Mar 2019 01:39:31 +0000 (09:39 +0800)]
osd: shutdown recovery_request_timer earlier
recovery_request_timer may hold some QueuePeeringEvts which hold a PGRef;
if we don't shut it down earlier, it can potentially leak the PGRef
when kicking the pg.