Yehuda Sadeh [Fri, 9 Jan 2015 18:23:35 +0000 (10:23 -0800)]
rgw: only keep track for cleanup of rados objects that were written
Fixes: #10311
We keep track of the rados objects we have written so that we can clean
them up if needed. Previously we weren't very accurate about this and were
also recording the head object before it had actually been written. The
tracking now only applies to the tail data, which also makes the logic a
bit clearer.
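A minimal sketch of the idea (not the actual rgw code; ObjWriter,
write_tail(), and do_rados_write() are invented names): record an object
for cleanup only once its tail write has actually been issued, so a head
object that is yet to be written never ends up on the list.

    #include <set>
    #include <string>

    struct ObjWriter {
      std::set<std::string> written_objs;   // tail objects eligible for cleanup

      int write_tail(const std::string& oid) {
        int r = do_rados_write(oid);        // hypothetical write call
        if (r == 0)
          written_objs.insert(oid);         // track only after a real write
        return r;
      }

      int do_rados_write(const std::string&) { return 0; }  // stub for the sketch
      // the head object is written separately and is not added here
    };
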
Yehuda Sadeh [Wed, 7 Jan 2015 23:30:27 +0000 (15:30 -0800)]
rgw: max shards configuration is part of the zone config
The zone config params are set in the region configuration. There is also
a ceph.conf option (rgw_override_bucket_index_max_shards) for overriding
this value per rgw instance.
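As a rough illustration of that precedence (effective_max_shards() is a
made-up helper; only the rgw_override_bucket_index_max_shards option name
comes from the text above), a non-zero per-rgw override wins over the
zone value:

    #include <cstdint>

    // zone_max_shards comes from the zone/region config;
    // override_max_shards is rgw_override_bucket_index_max_shards from ceph.conf
    uint32_t effective_max_shards(uint32_t zone_max_shards,
                                  uint32_t override_max_shards) {
      return override_max_shards ? override_max_shards : zone_max_shards;
    }
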
Yehuda Sadeh [Tue, 9 Dec 2014 21:58:09 +0000 (13:58 -0800)]
rgw: pass num shards on bucket initialization
We need to pass the actual number of shards that will be used for this
specific bucket. A bucket may be created by applying metadata from a
different zone, so its number of shards might be different.
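A hedged sketch of what passing an explicit shard count at bucket index
initialization could look like (the function and the ".dir." naming here
are illustrative assumptions, not the actual rgw code):

    #include <string>
    #include <vector>

    std::vector<std::string> init_index_objects(const std::string& bucket_id,
                                                int num_shards) {
      // num_shards comes from the bucket's own metadata (possibly set by
      // another zone), not from the local zone default
      std::vector<std::string> oids;
      if (num_shards <= 0) {
        oids.push_back(".dir." + bucket_id);                            // unsharded
      } else {
        for (int i = 0; i < num_shards; ++i)
          oids.push_back(".dir." + bucket_id + "." + std::to_string(i));
      }
      return oids;
    }
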
Yehuda Sadeh [Fri, 5 Dec 2014 23:52:26 +0000 (15:52 -0800)]
rgw, cls_rgw: keep shard ids with oids
Instead of keeping just a list of oids, keep the shard ids alongside them,
so that we know on which shard each operation happened.
Bucket markers now use just the numeric shard id instead of the bucket
instance shard id, which makes it easier to parse the markers
appropriately.
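A loose sketch of the data-shape change and of a numeric-id marker
(make_marker() and the '#' separator are assumptions for illustration):

    #include <map>
    #include <string>

    using ShardOids = std::map<int, std::string>;   // shard_id -> oid, kept together

    // a marker that embeds only the numeric shard id is easy to split and parse
    std::string make_marker(int shard_id, const std::string& position) {
      return std::to_string(shard_id) + "#" + position;
    }
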
Yehuda Sadeh [Fri, 5 Dec 2014 22:10:50 +0000 (14:10 -0800)]
cls_rgw: clean up CLSRGWConcurrentIO
The class is no longer a template and keeps a map of oids keyed by
shard_id. issue_op() is now called with both the shard_id and the oid; the
shard id is used to map the results in the derived classes.
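Roughly, the reworked class might look like the following non-template
base (ConcurrentIOSketch is an invented stand-in for CLSRGWConcurrentIO;
the real class issues its ops asynchronously):

    #include <map>
    #include <string>

    class ConcurrentIOSketch {
    protected:
      std::map<int, std::string> objs;        // shard_id -> oid
      virtual int issue_op(int shard_id, const std::string& oid) = 0;
    public:
      virtual ~ConcurrentIOSketch() = default;
      int operator()() {
        for (const auto& [shard_id, oid] : objs) {
          int r = issue_op(shard_id, oid);    // derived class keys results by shard_id
          if (r < 0)
            return r;
        }
        return 0;
      }
    };
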
Sage Weil [Thu, 8 Jan 2015 19:10:45 +0000 (11:10 -0800)]
osd: assert there is a peering event
This became conditional way back in 12e22b3d44eba51a70d8babebc2684f0c46575a7
for unclear reasons. It probably predates the in_use checks. In any case,
at this point, we should only arrive here if the PG was queued, implying
that there will always be an event to process.
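Conceptually the change amounts to the following (the surrounding types
are made up; only the shape of the check matters): a silent early return
becomes an assertion, since a queued PG implies a pending event.

    #include <cassert>
    #include <deque>

    struct Event {};
    struct PG { std::deque<Event> peering_queue; };

    void process_peering_events(PG* pg) {
      // before: if (pg->peering_queue.empty()) return;
      assert(!pg->peering_queue.empty());   // we only get here if the PG was queued
      pg->peering_queue.pop_front();
      // ... handle the event ...
    }
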
Sage Weil [Thu, 8 Jan 2015 21:34:52 +0000 (13:34 -0800)]
osd: requeue PG when we skip handling a peering event
If we don't handle the event, we need to put the PG back into the peering
queue or else the event won't get processed until the next event is
queued, at which point we'll be processing events with a delay.
The queue_null is not necessary (and is a waste of effort) because the
event is still in pg->peering_queue and the PG is queued.
Note that this only triggers when we exceed osd_map_max_advance, usually
when there is a lot of peering and recovery activity going on. A
workaround is to increase that value, but if you exceed osd_map_cache_size
you expose yourself to cache thrashing by the peering work queue, which
can cause serious problems with heavily degraded clusters and has bitten
lots of people on dumpling.
Backport: giant, firefly
Fixes: #10431
Signed-off-by: Sage Weil <sage@redhat.com>
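A simplified illustration of the fix; can_handle_event_now() and
requeue_pg_for_peering() are invented helpers standing in for the real
OSD code paths:

    struct PG { bool map_too_far_ahead = false; };

    static bool can_handle_event_now(PG* pg) { return !pg->map_too_far_ahead; }
    static void requeue_pg_for_peering(PG*) { /* push the PG back on the peering wq */ }

    void maybe_handle_peering_event(PG* pg) {
      if (!can_handle_event_now(pg)) {      // e.g. osd_map_max_advance exceeded
        requeue_pg_for_peering(pg);         // the event stays in pg->peering_queue
        return;                             // no queue_null needed
      }
      // ... dequeue and handle the event ...
    }
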
Yehuda Sadeh [Wed, 7 Jan 2015 21:56:14 +0000 (13:56 -0800)]
rgw: index swift keys appropriately
Fixes: #10471
Backport: firefly, giant
We need to index the swift keys by the full uid:subuser when decoding the
json representation, to keep it in line with how we store them when
creating them through other mechanisms.
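A hedged sketch of the decoding side (SwiftKey and index_swift_key() are
placeholders, not the actual rgw structures): the map is keyed by the full
"uid:subuser" string so it matches entries created through other code
paths.

    #include <map>
    #include <string>

    struct SwiftKey { std::string user; std::string key; };   // user is "uid:subuser"

    void index_swift_key(std::map<std::string, SwiftKey>& swift_keys,
                         const std::string& uid, const SwiftKey& k) {
      // fall back to the bare uid only if the decoded entry lacks a user field
      swift_keys[k.user.empty() ? uid : k.user] = k;
    }
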
Sage Weil [Mon, 29 Dec 2014 23:47:28 +0000 (15:47 -0800)]
client: fix quota signed/unsigned warning
client/Client.cc: In member function 'bool Client::is_quota_bytes_exceeded(Inode*, uint64_t)':
client/Client.cc:10393:66: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (quota->max_bytes && (rstat->rbytes + new_bytes) > quota->max_bytes)
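One possible fix, shown here as a self-contained stand-in for the real
member function (the actual types in Client.cc may differ): cast the
signed quota limit to uint64_t before the comparison.

    #include <cstdint>

    bool is_quota_bytes_exceeded(int64_t max_bytes, uint64_t rbytes, uint64_t new_bytes) {
      return max_bytes && (rbytes + new_bytes) > (uint64_t)max_bytes;
    }
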
Sage Weil [Tue, 23 Dec 2014 20:39:08 +0000 (12:39 -0800)]
mon: provide encoded canonical full OSDMap from primary
Currently we make each monitor apply the incremental and encode the full
map locally. The original motivation was to save bandwidth, but the
savings are minimal to modest and the complexity associated with doing this
is huge.
This strategy also causes problems now that we have OSDMap crcs, since
old mons/clusters may have diverging full OSDMaps due to mixed-version
clusters. See #10422.
Instead, include the encoded full map in the paxos transaction. We will
still apply the incremental and check the crc, but if the check fails and
we have the correct version, we reload it from disk and move on. If we
don't, we will continue as we have before--the primary mon doesn't have
support for crcs yet. When it does, we will start verifying and/or get our
full map back into sync.
Fixes: #10422
Signed-off-by: Sage Weil <sage@redhat.com>
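Very roughly, the idea looks like this (Tx, stash_full_map(), and
choose_full_map() are placeholder names; the real code works with
bufferlists and the mon's store): the leader stashes the canonical
encoding in the transaction, and a peer falls back to it if its locally
built map diverges.

    #include <map>
    #include <string>

    struct Tx { std::map<std::string, std::string> kv; };    // stand-in for a paxos txn

    void stash_full_map(Tx& t, unsigned epoch, const std::string& canonical_full_map) {
      t.kv["full_" + std::to_string(epoch)] = canonical_full_map;
    }

    static bool crc_matches(const std::string& a, const std::string& b) { return a == b; }

    std::string choose_full_map(const std::string& locally_encoded,
                                const std::string& canonical) {
      // if our locally built map diverges, use the canonical one from the leader
      return crc_matches(locally_encoded, canonical) ? locally_encoded : canonical;
    }
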
Sage Weil [Tue, 23 Dec 2014 23:59:19 +0000 (15:59 -0800)]
osdc/Objecter: improve pool deletion detection
Currently we can have a race like so:
- send op in epoch X
- osd replies
- pool deleted in epoch X+1
- client gets X+1, sends map epoch check
- client gets reply
-> fails assert that there is no map check in flight
Avoid this situation by inferring that the pool has been deleted when we
see that we previously sent the request but the pool is no longer present.
Since pool ids are not reused, there is no point in doing a synchronous
map check at all.
Backport: giant
Fixes: #10372
Signed-off-by: Sage Weil <sage@redhat.com>
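A simplified version of the inference (Op and pool_was_deleted() are
made-up stand-ins for the Objecter internals): an op that was already sent
to a pool which has since vanished from the osdmap can be treated as
hitting a deleted pool, with no synchronous map check.

    #include <cstdint>
    #include <set>

    struct Op { int64_t pool_id; uint32_t sent_epoch; };

    bool pool_was_deleted(const Op& op, const std::set<int64_t>& pools_in_current_map) {
      // pool ids are never reused, so "previously sent, now missing" is conclusive
      return op.sent_epoch > 0 && !pools_in_current_map.count(op.pool_id);
    }
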
Sage Weil [Fri, 19 Dec 2014 19:48:27 +0000 (11:48 -0800)]
librados: add rados_watch_flush() call
Add a call so that callers can make sure all queued callbacks have
completed before shutting down the ioctx. This avoids a segv triggered
by the LibRadosWatchNotifyPPTests/LibRadosWatchNotifyPP.WatchNotify2Timeout/1
test due to the ioctx being destroyed when the in-progress callback
does a notify_ack.
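A typical shutdown sequence using the new call might look like the
following (illustrative only, error handling omitted): flush outstanding
watch/notify callbacks before tearing down the ioctx.

    #include <rados/librados.h>

    void shutdown_example(rados_t cluster, rados_ioctx_t ioctx) {
      rados_watch_flush(cluster);    // wait for queued watch/notify callbacks to finish
      rados_ioctx_destroy(ioctx);    // now safe: no in-flight callback can touch ioctx
      rados_shutdown(cluster);
    }
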
Sage Weil [Fri, 19 Dec 2014 16:37:00 +0000 (08:37 -0800)]
osdc/Objecter: do notify completion callback in fast-dispatch context
The notify completion has exactly one user: the librados caller, which
does nothing but take a local (inner) lock and signal a Cond. Do this
in the fast-dispatch context for simplicity.
Notably, this makes the notify completion (and timeout) trigger a
notify2() return (with ETIMEDOUT) even when the finisher queue that
normally delivers notifies is busy, for example with a notify that is
being very slow. In our case, the unit test does a sleep(3) to test
timeouts, but that also prevented the ETIMEDOUT notification from being
delivered to the caller. This patch resolves that.
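In plain C++ terms, the completion amounts to something like the struct
below (the real code uses Ceph's Mutex/Cond rather than the standard
library): it only takes a small local lock and signals a condition, so it
is cheap enough to run directly in fast-dispatch.

    #include <condition_variable>
    #include <mutex>

    struct NotifyWaiter {
      std::mutex lock;
      std::condition_variable cond;
      bool done = false;
      int result = 0;

      // called from fast-dispatch when the notify completes or times out
      void complete(int r) {
        std::lock_guard<std::mutex> l(lock);
        result = r;                  // e.g. -ETIMEDOUT on timeout
        done = true;
        cond.notify_all();
      }

      int wait() {                   // what notify2() effectively blocks on
        std::unique_lock<std::mutex> l(lock);
        cond.wait(l, [this] { return done; });
        return result;
      }
    };
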
Sage Weil [Fri, 19 Dec 2014 00:49:06 +0000 (16:49 -0800)]
osd: only verify OSDMap crc if it is known
Only verify that we encode a full map with the correct CRC if we actually
have the crc value in the Incremental. Otherwise, any map from an old mon
will fail the check, and we'll try to request a full map with a message
the old mon doesn't understand.
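Schematically, the guard is as below (field and function names are
illustrative): the crc check is skipped whenever the incremental from an
older mon carries no full-map crc.

    #include <cstdint>
    #include <optional>

    bool full_map_crc_ok(const std::optional<uint32_t>& inc_full_crc,
                         uint32_t locally_encoded_crc) {
      if (!inc_full_crc)
        return true;                 // old mon: nothing to verify against
      return *inc_full_crc == locally_encoded_crc;
    }
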
Haomai Wang [Tue, 9 Dec 2014 08:55:28 +0000 (16:55 +0800)]
AsyncConnection: Fix time event called after AsyncMessenger destruction
AsyncConnection partially relies on time events to handle async events,
but this is not very effective, so use dispatch_event_external instead.
Also, if a client tries to connect to the server after a period of time,
it may call AsyncConnection::process, which references the AsyncMessenger.
Since AsyncMessenger doesn't use reference counting, this can result in a
segmentation fault. Now we record the time event ids and delete these
registered time events when stopping the connection.
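A simplified picture of the fix (EventCenter here is a bare stand-in for
the real event center, and ConnectionSketch for AsyncConnection): remember
the ids of registered time events and cancel them when the connection
stops, so nothing can fire after the owning AsyncMessenger is gone.

    #include <cstdint>
    #include <set>

    struct EventCenter {             // stand-in: the real one also takes a callback
      uint64_t create_time_event(uint64_t /*usec*/) { return ++last_id; }
      void delete_time_event(uint64_t /*id*/) {}
      uint64_t last_id = 0;
    };

    struct ConnectionSketch {
      EventCenter* center;
      std::set<uint64_t> registered_time_events;

      void arm_timer(uint64_t usec) {
        registered_time_events.insert(center->create_time_event(usec));
      }

      void stop() {
        for (uint64_t id : registered_time_events)
          center->delete_time_event(id);   // no callback left to run after stop
        registered_time_events.clear();
      }
    };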