Yehuda Sadeh [Tue, 6 May 2014 23:55:27 +0000 (16:55 -0700)]
rgw: fix stripe_size calculation
Fixes: #8299
Backport: firefly
The stripe size calculation was broken; it specifically affected cases
where we had a manifest that described multiple parts.
Yehuda Sadeh [Tue, 6 May 2014 18:06:29 +0000 (11:06 -0700)]
rgw: cut short object read if a chunk returns error
Fixes: #8289
Backport: firefly, dumpling
When reading an object, if we hit an error when trying to read one of
the rados objects then we should just stop. Otherwise we're just going
to continue reading the rest of the object, and since it can't be sent
back to the client (as we have a hole in the middle), we end up
accumulating everything in memory.
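A minimal standalone sketch of the early-abort pattern this describes (not the actual RGW code; the chunk reader and buffer types here are hypothetical stand-ins): stop iterating as soon as one chunk read fails, rather than buffering the remaining chunks of an object that can no longer be returned intact.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <vector>

    // Hypothetical chunk reader: returns 0 on success, negative errno on failure.
    using chunk_reader_t = std::function<int(size_t chunk_index, std::string *out)>;

    // Read an object made of `num_chunks` rados objects. On the first chunk
    // error, stop immediately and propagate the error instead of continuing
    // to read (and accumulate) the rest of the object in memory.
    int read_object(chunk_reader_t read_chunk, size_t num_chunks,
                    std::vector<std::string> *chunks)
    {
      for (size_t i = 0; i < num_chunks; ++i) {
        std::string buf;
        int r = read_chunk(i, &buf);
        if (r < 0) {
          return r;  // cut the read short: a hole in the middle can't be sent on
        }
        chunks->push_back(std::move(buf));
      }
      return 0;
    }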
Sage Weil [Fri, 18 Apr 2014 20:50:11 +0000 (13:50 -0700)]
osd: throttle snap trimming with simple delay
This is not particularly smart, but it is *a* knob that lets you make
the snap trimmer slow down. It's a float and a simple delay, so it is
adjustable at runtime. Default is 0 (no change in behavior).
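A rough sketch of what such a knob amounts to (the config option is named hypothetically here and this is not the OSD's actual trim loop): each pass through the trimmer sleeps for the configured number of seconds, and a value of 0 leaves behavior unchanged.

    #include <chrono>
    #include <thread>

    // Hypothetical float-valued knob (seconds); 0 means no delay, matching
    // the default described above. Adjustable at runtime in the real OSD.
    static double snap_trim_sleep = 0.0;

    static void trim_one_snap_object() { /* placeholder for the actual trim work */ }

    // Trim loop with a simple, adjustable delay between items.
    void trim_snaps(int items)
    {
      for (int i = 0; i < items; ++i) {
        trim_one_snap_object();
        if (snap_trim_sleep > 0.0) {
          std::this_thread::sleep_for(
              std::chrono::duration<double>(snap_trim_sleep));
        }
      }
    }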
Calling _finish_hunting() clears the bool hunting flag, which means we
don't retry by connecting to another mon periodically. Instead, we send
keepalives every 10s. But, since we aren't yet in state HAVE_SESSION, we
don't check that the keepalives are getting responses. This means that an
ill-timed connection reset (say, after we get a MonMap, but before we
finish authenticating) can drop the monc into a black hole that does not
retry.
Instead, we should *only* call _finish_hunting() when we complete the
authentication handshake.
Fixes: #8278
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 77a6f0aefebebf057f02bfb95c088a30ed93c53f)
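A simplified sketch of the handshake ordering described above (hypothetical member names, not the real MonClient): the hunting flag is cleared only once the authentication handshake completes, not merely when a connection is established and a MonMap arrives.

    // Simplified mon-client state, with hypothetical member names.
    struct MonClientSketch {
      bool hunting = true;        // still probing mons for a usable session
      bool have_session = false;  // authentication handshake completed

      void handle_monmap() {
        // Receiving a MonMap is NOT enough: if we stopped hunting here, a
        // connection reset before auth finishes would leave us sending
        // unchecked keepalives with no periodic retry against other mons.
      }

      void handle_auth_done() {
        have_session = true;
        finish_hunting();  // only now is it safe to stop hunting
      }

      void finish_hunting() { hunting = false; }
    };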
Sage Weil [Tue, 6 May 2014 18:01:27 +0000 (11:01 -0700)]
osd/ReplicatedPG: fix whiteouts for other cache mode
We were special casing WRITEBACK mode for handling whiteouts; this needs to
also include the FORWARD and READONLY modes. To avoid having to list
specific cache modes, though, just check != NONE.
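In spirit, the change amounts to broadening a cache-mode check; a small sketch with hypothetical names (the real logic lives in ReplicatedPG):

    enum class cache_mode_t { NONE, WRITEBACK, FORWARD, READONLY };

    // Before: whiteouts were only handled in WRITEBACK mode.
    // After: handle them in every cache mode except NONE, so FORWARD and
    // READONLY pools get the same whiteout treatment without having to
    // enumerate each mode.
    bool should_handle_whiteouts(cache_mode_t mode)
    {
      return mode != cache_mode_t::NONE;
    }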
Sage Weil [Thu, 1 May 2014 23:53:17 +0000 (16:53 -0700)]
osd: Prevent divide by zero in agent_choose_mode()
Fixes: #8175
Backport: firefly
Signed-off-by: David Zafman <david.zafman@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f47f867952e6b2a16a296c82bb9b585b21cde6c8)
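The fix boils down to guarding a ratio computation against a zero divisor; a minimal sketch of the pattern (hypothetical variable names, not the actual agent_choose_mode() math):

    #include <cstdint>

    // Compute a fullness ratio without dividing by zero when the pool is
    // empty (e.g. no user objects yet). Returns 0 in that case.
    double fullness_ratio(uint64_t used_objects, uint64_t total_objects)
    {
      if (total_objects == 0)
        return 0.0;  // avoid divide by zero
      return static_cast<double>(used_objects) / total_objects;
    }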
David Zafman [Tue, 22 Apr 2014 06:52:04 +0000 (23:52 -0700)]
osd, common: If agent_work() finds no objs to work on delay 5 (default) secs
Add config osd_agent_delay_time of 5 seconds
Honor delay by ignoring agent_choose_mode() calls
Add tier_delay to logger
Treat restart after delay like we were previously idle
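A standalone sketch of the idle back-off described above (illustrative names, not the actual agent code): when agent_work() finds nothing to do, the agent records a "delay until" time and ignores agent_choose_mode()-style reevaluations until the delay expires, after which it restarts as if it had been idle.

    #include <chrono>

    // Hypothetical sketch of the tiering agent's delay logic.
    struct AgentSketch {
      std::chrono::seconds delay_time{5};  // cf. osd_agent_delay_time default
      std::chrono::steady_clock::time_point delay_until{};  // epoch: no delay

      // Called when agent_work() found no objects to operate on.
      void start_delay() {
        delay_until = std::chrono::steady_clock::now() + delay_time;
      }

      // agent_choose_mode()-style calls are ignored while the delay is active.
      bool should_choose_mode() const {
        return std::chrono::steady_clock::now() >= delay_until;
      }
    };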
Yan, Zheng [Sat, 3 May 2014 21:17:15 +0000 (05:17 +0800)]
osd: check blacklisted clients in ReplicatedPG::do_op()
The OSD checks whether a client is blacklisted only when receiving an OSD
request. It's possible that the request's sender gets blacklisted while the
request is sitting in some waiting list.
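A rough sketch of rechecking the blacklist at dispatch time rather than only on message receipt (the types and names here are illustrative, not the actual ReplicatedPG::do_op() code):

    #include <set>
    #include <string>

    // Illustrative types: the real code checks the OSDMap's blacklist against
    // the entity address on the op.
    struct Request {
      std::string client_addr;
    };

    struct OSDMapView {
      std::set<std::string> blacklist;
      bool is_blacklisted(const std::string &addr) const {
        return blacklist.count(addr) > 0;
      }
    };

    // do_op-style entry point: even if the op passed the blacklist check when
    // it was received, it may have sat in a waiting list while the client was
    // blacklisted, so check again before executing it.
    bool do_op(const Request &req, const OSDMapView &osdmap)
    {
      if (osdmap.is_blacklisted(req.client_addr)) {
        // drop or error out the request; the sender is now blacklisted
        return false;
      }
      // ... execute the op ...
      return true;
    }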
Samuel Just [Fri, 18 Apr 2014 00:26:17 +0000 (17:26 -0700)]
ReplicatedPG: block scrub on blocked object contexts
Fixes: #8011
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e66f2e36c06ca00c1147f922d3513f56b122a5c0)
Samuel Just [Thu, 24 Apr 2014 19:48:44 +0000 (12:48 -0700)]
ECBackend::continue_recovery_op: handle a source shard going down
get_min_avail_to_read_shards might return an error if there are
no longer enough sources to reconstruct the missing shards.
This is possible if osds went down while we were writing the
previous chunk -- we already notice in check_recovery_sources
if a source goes down during a read.
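A minimal sketch of the failure path being added (hypothetical helper signature and types, not the real ECBackend): the recovery step must be prepared for the "choose shards to read" helper to fail, because sources that were available when the previous chunk was written may have gone down since.

    #include <cerrno>
    #include <set>

    using shard_id_t = int;

    // Hypothetical: pick enough available shards to reconstruct the missing
    // ones; returns negative errno if not enough sources remain.
    int get_min_avail_to_read_shards(const std::set<shard_id_t> &available,
                                     unsigned k_needed,
                                     std::set<shard_id_t> *to_read)
    {
      if (available.size() < k_needed)
        return -EIO;  // too many sources went down; cannot reconstruct
      auto it = available.begin();
      for (unsigned i = 0; i < k_needed; ++i, ++it)
        to_read->insert(*it);
      return 0;
    }

    // Recovery step: handle the error instead of assuming sources still exist.
    int continue_recovery_op(const std::set<shard_id_t> &available, unsigned k)
    {
      std::set<shard_id_t> to_read;
      int r = get_min_avail_to_read_shards(available, k, &to_read);
      if (r < 0)
        return r;  // surface the failure; recovery of this object must stop
      // ... issue reads for the next chunk ...
      return 0;
    }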
Samuel Just [Fri, 11 Apr 2014 01:15:30 +0000 (18:15 -0700)]
ReplicatedPG: do not preserve op context during flush
Any information stashed in the OpContext may be obsolete by the time we
actually mark the object clean. Instead, let the start_flush caller
clean up its OpContext and in try_flush_mark_clean we'll create a new
one. The primary reason to keep the OpContext would have been locking,
but we can set the obc as blocking without holding an OpContext, and
that would allow trimming to happen in the meantime (which is good, since
trim_object does not respect rw locks because it doesn't change user-visible
state). In try_flush_mark_clean, we requeue the fop->op along
with (but ahead of) the fop->dup_ops.
Modified qemu-iotests workunit script to check for versions
that use the latest qemu (currently only Trusty). Limit the
tests to those that are applicable to rbd.
client: check cap ID when handling cap export message
handle the following sequence of events:
- mds0 exports an inode to mds1. client receives the cap import
message from mds1. caps from mds0 are removed while handling
the cap import message.
- mds1 exports an inode to mds0. client receives the cap export
message from mds1. handle_cap_export() adds placeholder caps
for mds0
- client receives the first cap export message (for exporting
inode from mds0 to mds1)
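A simplified sketch of the guard this implies in handle_cap_export() (hypothetical structures; the real client code is more involved): ignore a cap export message whose cap ID no longer matches the cap currently held for that MDS, since it refers to a cap that was already removed or replaced.

    #include <cstdint>
    #include <map>

    using mds_rank_t = int;

    struct Cap {
      uint64_t cap_id = 0;
    };

    struct InodeSketch {
      std::map<mds_rank_t, Cap> caps;  // caps held per MDS
    };

    // Only act on the export if the message's cap_id matches the cap we
    // actually hold from that MDS; a stale message (e.g. the first export in
    // the sequence above, arriving late) is ignored.
    bool handle_cap_export(InodeSketch &in, mds_rank_t from, uint64_t msg_cap_id)
    {
      auto it = in.caps.find(from);
      if (it == in.caps.end() || it->second.cap_id != msg_cap_id)
        return false;  // stale export message; nothing to do
      in.caps.erase(it);  // ... migrate/remove the cap as the export requires ...
      return true;
    }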
Sage Weil [Sun, 20 Apr 2014 05:04:33 +0000 (22:04 -0700)]
osd: change in up set primary constitutes a peering interval change
In several places, a change in the up_primary triggers a new peering
interval, but the places that actually generate the new past intervals,
including check_new_interval(), did not enforce that. This becomes
somewhat obvious when you see that those callers are ignoring the
up_primary output argument for pg_to_up_acting_osds().
Fix this by adding arguments to check_new_interval and fixing the callers
to pass them in properly. Add a unit test case to verify this.
Note that the past interval struct itself does not record who the
up_primary was; possibly it should.
Fixes: #8139
Signed-off-by: Sage Weil <sage@inktank.com>
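The shape of the fix, sketched with hypothetical parameters (the real check_new_interval() takes far more arguments): the old and new up primaries are passed in explicitly and compared, so a change in the up primary alone forces a new interval.

    #include <vector>

    // Hypothetical, heavily trimmed-down version of the interval check: the
    // real function also compares acting sets, pool size, min_size, etc.
    bool is_new_interval(int old_up_primary, int new_up_primary,
                         int old_acting_primary, int new_acting_primary,
                         const std::vector<int> &old_up,
                         const std::vector<int> &new_up)
    {
      return old_up_primary != new_up_primary ||   // the added condition
             old_acting_primary != new_acting_primary ||
             old_up != new_up;
    }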
Samuel Just [Tue, 8 Apr 2014 21:03:59 +0000 (14:03 -0700)]
ReplicatedPG: do not create whiteout clones
First, make_writeable treats whiteout heads like snapdir for
cloning purposes. Second, to ensure that we send the correct
deletes on flush to the backing pool, we instead use oi.snaps
on any clone we are flushing to infer the snaps during which
head did not exist and send a delete as appropriate prior to
the copy_from.
Normally, we'd have a problem if the delete and the copy_from
completed, but an interval change intervened before the dirty
flag was cleared since we'd end up re-deleting the object.
To avoid that, we use the CEPH_OSD_FLAG_ORDERSNAP flag.
Additionally, we will use the correct snap_seq on the delete
or flush as appropriate to ensure that the previous clone
gets created with the same clone id as in the cache pool.
Fixes: #7942
Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Thu, 17 Apr 2014 19:27:07 +0000 (12:27 -0700)]
osd/: propagate hit_set history with repop
We don't actually send the whole info on each repop, just the log
entries, updated stats, and a few other bits. For hit_set ops, we need
to also communicate the new hit_set history status atomically with the
log entries and the transaction. Thus, we add a channel for an optional
pg_hit_set_history_t field in PGBackend::submit_transaction interface
and associated messages and implementations to update the hit_set info
field along with the log entries.
This also means that hit_set_(persist|trim) update an
updated_hit_set_history field on the OpContext instead of directly
modifying the info field.
Fixes: #8124
Signed-off-by: Samuel Just <sam.just@inktank.com>
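A sketch of threading an optional hit_set history update through the transaction-submission path (hypothetical signature and stand-in types; the real PGBackend::submit_transaction interface differs): the history travels with the log entries so replicas apply both atomically, and hit_set_persist()/hit_set_trim() fill in the OpContext field rather than mutating the info directly.

    #include <optional>
    #include <string>
    #include <vector>

    // Stand-ins for the real types.
    struct pg_hit_set_history_t { /* ... */ };
    struct pg_log_entry_t { std::string op; };
    struct Transaction { /* ... */ };

    struct OpContextSketch {
      std::vector<pg_log_entry_t> log_entries;
      // Filled by hit_set_persist()/hit_set_trim() instead of writing the
      // PG info directly; applied together with the log entries.
      std::optional<pg_hit_set_history_t> updated_hit_set_history;
    };

    // Hypothetical submit path: the optional history rides along with the
    // transaction and log entries so the update is atomic on every replica.
    void submit_transaction(const Transaction &t,
                            const std::vector<pg_log_entry_t> &log_entries,
                            const std::optional<pg_hit_set_history_t> &hset_history)
    {
      // ... encode t, log_entries and, if present, hset_history into the
      // replicated op; replicas update their hit_set info when applying ...
      (void)t; (void)log_entries; (void)hset_history;
    }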