Samuel Just [Fri, 29 Aug 2014 21:04:04 +0000 (14:04 -0700)]
PG::init: clear rollback info for backfill as well
Otherwise, we won't remove the old rollback objects from a resurrected pg. In
rare cases, this can cause us to get an EEXIST if we happen to reuse the same
rename id on the same object in a subsequent interval.
Samuel Just [Wed, 27 Aug 2014 23:21:41 +0000 (16:21 -0700)]
PG::can_discard_op: do discard old subopreplies
Otherwise, a sub_op_reply from a previous interval can stick around
until we either one day go active again and get rid of it or delete the
pg which is holding it on its waiting_for_active list. While it sticks
around futily waiting for the pg to once more go active, it will cause
harmless slow request warnings.
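A minimal sketch of the discard rule described above, with illustrative names (not Ceph's actual API): a reply carries the map epoch of the interval it was sent in, and if that epoch predates the PG's current interval the reply is stale and can be dropped instead of parked on `waiting_for_active`.

```python
def can_discard_reply(reply_epoch: int, same_interval_since: int) -> bool:
    """A subop reply sent before the PG's current interval began is stale."""
    return reply_epoch < same_interval_since

# A reply from epoch 40 against a PG whose interval began at epoch 45
# is discarded; a reply from the current interval is kept.
stale = can_discard_reply(40, 45)
fresh = can_discard_reply(45, 45)
```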
Fixes: #9259
Backport: firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
Yehuda Sadeh [Fri, 22 Aug 2014 04:53:38 +0000 (21:53 -0700)]
rgw: clear bufferlist if write_data() successful
Fixes: #9201
Backport: firefly
We sometimes need to call RGWPutObjProcessor::handle_data() again,
so that we send the pending data. However, we failed to clear the buffer
that was already sent, and thus it was resent. This triggers when using
non-default pool alignments.
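A toy model of the fix, with illustrative names rather than the real RGWPutObjProcessor interface: the processor accumulates pending data, `write_data()` sends some aligned prefix, and the sent portion must be dropped from the buffer or the next `handle_data()` call resends it.

```python
class PutObjProcessor:
    """Illustrative stand-in for RGWPutObjProcessor (not the real class)."""
    def __init__(self, align: int):
        self.align = align
        self.pending = b""
        self.written = b""

    def write_data(self, data: bytes) -> int:
        # send only whole aligned chunks; return how many bytes went out
        n = (len(data) // self.align) * self.align
        self.written += data[:n]
        return n

    def handle_data(self, data: bytes):
        self.pending += data
        sent = self.write_data(self.pending)
        self.pending = self.pending[sent:]  # the fix: clear what was sent

proc = PutObjProcessor(align=4)
proc.handle_data(b"abcdef")   # sends "abcd"; "ef" stays pending
proc.handle_data(b"gh")       # sends "efgh"; nothing is resent
```

Without the slice after `write_data()`, the second call would send the already-written prefix again.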
Then, we begin to flush 15 with a delete with snapc 4:[4] leaving the
backing pool with:
4:[4]:[4(4)]
Then, we finish flushing 15 with snapc 9:[4], leaving the backing
pool with:
9:[4]:[4(4)]+head
Next, snaps 10 and 15 are removed causing clone 10 to be removed leaving
the cache with:
30:[29,21,20,4]:[22(21),4(4)]+head
We next begin to flush 22 by sending a delete with snapc 4:[4] since
prev_snapc is 4 <---------- here is the bug
The backing pool ignores this request since 4 < 9 (ORDERSNAP) leaving it
with:
9:[4]:[4(4)]
Then, we complete flushing 22 with snapc 19:[4] leaving the backing pool
with:
19:[4]:[4(4)]+head
Then, we begin to flush head by deleting with snapc 22:[21,20,4] leaving
the backing pool with:
22:[21,20,4]:[22(21,20),4(4)]
Finally, we flush head leaving the backing pool with:
30:[29,21,20,4]:[22(21*,20*),4(4)]+head
When we go to flush clone 22, all we know is that 22 is dirty, has snaps
[21], and 4 is clean. As part of flushing 22, we need to do two things:
1) Ensure that the current head is cloned as cloneid 4 with snaps [4] by
sending a delete at snapc 4:[4].
2) Flush the data at snap sequence < 21 by sending a copyfrom with snapc
20:[20,4].
Unfortunately, it is possible that 1, 1&2, or 1 and part of the flush
process for some other now non-existent clone have already been
performed. Because of that, between 1) and 2), we need to send
a second delete ensuring that the object does not exist at 20.
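The ORDERSNAP behavior the walkthrough relies on can be modeled with a toy backing pool (a deliberate simplification; cloning and real snap sets are elided): the pool tracks the highest snapc seq it has applied and silently ignores any op whose snapc seq is older, which is exactly why the stale delete at snapc 4 has no effect once the pool has seen seq 9.

```python
class BackingPool:
    """Toy model of ORDERSNAP ordering on the base pool."""
    def __init__(self):
        self.seq = 0

    def delete(self, snapc_seq: int) -> bool:
        if snapc_seq < self.seq:   # ORDERSNAP: stale snap context, ignore
            return False
        self.seq = snapc_seq
        return True

pool = BackingPool()
pool.delete(4)            # initial delete at snapc 4 applies
pool.seq = 9              # a later flush moved the pool's seq to 9
ignored = not pool.delete(4)   # the buggy resend at snapc 4 is dropped (4 < 9)
```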
Fixes: #9054
Backport: firefly
Signed-off-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Tue, 26 Aug 2014 15:16:29 +0000 (08:16 -0700)]
osd/OSDMap: encode blacklist in deterministic order
When we use an unordered_map the encoding order is non-deterministic,
which is problematic for OSDMap. Construct an ordered map<> on encode
and use that. This lets us keep the hash table for lookups in the general
case.
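The approach can be illustrated in miniature (Python dicts preserve insertion order, unlike C++'s `unordered_map`, so treat the two differently-ordered inputs as standing in for two hash-table iteration orders): build a sorted copy only at encode time, and keep the hash map for fast lookups the rest of the time.

```python
def encode_blacklist(blacklist: dict) -> bytes:
    """Encode in a deterministic order regardless of insertion/iteration order."""
    ordered = sorted(blacklist.items())   # ordered map built only for encoding
    return repr(ordered).encode()

a = encode_blacklist({"10.0.0.2:0/123": 5, "10.0.0.1:0/456": 3})
b = encode_blacklist({"10.0.0.1:0/456": 3, "10.0.0.2:0/123": 5})
# a == b: the two encodings match byte-for-byte
```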
Fixes: #9211
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
We need to identify whether an object is just composed of a head, or
also has a tail. Test for pre-firefly objects ("explicit objs") was
broken as it was just looking at the number of explicit objs in the
manifest. However, this is insufficient, as we might have empty head,
and in this case it wouldn't appear, so we need to check whether the
sole object is actually pointing at the head.
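A hedged sketch of the corrected test, with an invented manifest shape (the real RGW manifest is richer): counting explicit objs alone misclassifies an object whose head is empty, so the check must also confirm that the single entry actually points at the head.

```python
def head_only(manifest: dict, head_name: str) -> bool:
    """Object has no tail iff the sole explicit obj is the head itself."""
    objs = manifest["explicit_objs"]
    return len(objs) == 1 and objs[0] == head_name

just_head = head_only({"explicit_objs": ["myobj"]}, "myobj")
tail_only = head_only({"explicit_objs": ["myobj_tail.1"]}, "myobj")
```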
Somnath Roy [Mon, 18 Aug 2014 23:59:36 +0000 (16:59 -0700)]
CollectionIndex: Collection name is added to the access_lock name
The CollectionIndex constructor is changed to accept the coll_t so
that the collection name can be used to form the access_lock (RWLock)
name. This is needed because lockdep requires a unique lock name for
each Index object; otherwise it reports a recursive lock error and
asserts.
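The naming scheme can be sketched in one line (the format string is illustrative, not the exact one used): deriving the lock name from the collection id gives each Index's lock a distinct identity as far as lockdep is concerned.

```python
def access_lock_name(coll: str) -> str:
    """Per-collection lock name so lockdep sees distinct locks."""
    return f"CollectionIndex::access_lock({coll})"

a = access_lock_name("1.0_head")
b = access_lock_name("1.1_head")
```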
Sage Weil [Thu, 21 Aug 2014 20:05:35 +0000 (13:05 -0700)]
mon: fix occasional message leak after session reset
Consider:
- we get a message, put it on a wait list
- the client session resets
- we go back to process the message later and discard
- _ms_dispatch returns false, but nobody drops the msg ref
Since we call _ms_dispatch() a lot internally, we need to always return
true when we are an internal caller.
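A loose toy model of the ownership rule (refcounting and dispatch are heavily simplified; the real monitor code differs): whoever returns true from dispatch owns the message and must drop its reference, so an internal caller that returns false on the discard path leaves the ref dangling.

```python
class Msg:
    """Stand-in for a refcounted Message."""
    def __init__(self):
        self.nref = 1
    def put(self):
        self.nref -= 1

def ms_dispatch(msg: Msg, internal: bool) -> bool:
    handled = False            # e.g. session was reset, message discarded
    if handled or internal:
        msg.put()              # we own the ref; drop it
        return True
    return False

leaked = Msg()
ms_dispatch(leaked, internal=False)  # buggy path: nobody drops the ref
fixed = Msg()
ms_dispatch(fixed, internal=True)    # fixed path: internal caller drops it
```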
Fixes: #9176
Backport: firefly, dumpling
Signed-off-by: Sage Weil <sage@redhat.com>
Boris Ranto [Fri, 15 Aug 2014 17:34:27 +0000 (19:34 +0200)]
Fix -Wno-format and -Werror=format-security options clash
This causes a build failure in the latest Fedora builds: ceph_test_librbd_fsx adds the -Wno-format cflag, but the default AM_CFLAGS already contain -Werror=format-security. Previous releases tolerated this combination, but the latest Fedora rawhide no longer does. ceph_test_librbd_fsx builds fine without -Wno-format on x86_64, so there is likely no need for the flag anymore.
Signed-off-by: Boris Ranto <branto@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Sage Weil [Fri, 15 Aug 2014 15:55:10 +0000 (08:55 -0700)]
osd: only require crush features for rules that are actually used
Often there will be a CRUSH rule present for erasure coding that uses the
new CRUSH steps or indep mode. If these rules are not referenced by any
pool, we do not need clients to support the mapping behavior. This is true
because the encoding has not changed; only the expected CRUSH output.
Fixes: #8963
Backport: firefly
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Wed, 13 Aug 2014 23:17:02 +0000 (16:17 -0700)]
mon/Paxos: share state and verify contiguity early in collect phase
We verify peons are contiguous and share new paxos states to catch peons
up at the end of the round. Do this each time we (potentially) get new
states via a collect message. This will allow peons to be pulled forward
and remain contiguous when they otherwise would not have been able to.
For example, if we got mon.1 first and then mon.2 second, we would store the new txns
and then boot mon.1 out at the end because 15..25 is not contiguous with
28..40. However, with this change, we share 26..30 to mon.1 when we get
the collect, and then 31..40 when we get mon.2's collect, pulling them
both into the final quorum.
It also breaks the 'catch-up' work into smaller pieces, which ought to
smooth out latency a bit.
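The sharing step can be sketched with the numbers from the example above (version ranges as plain integers; the real code ships serialized paxos transactions): when a collect arrives from a peon whose history overlaps or abuts ours, send it the committed states it is missing, pulling it forward so it stays contiguous with the final quorum.

```python
def share(our_first: int, our_last: int, peon_last: int) -> list:
    """Versions to send so a contiguous/overlapping peon catches up."""
    if peon_last + 1 >= our_first:            # can be caught up with commits
        return list(range(peon_last + 1, our_last + 1))
    return []                                  # gap: commits alone won't help

# mon.1 has ..25; when we hold 26..30 we share 26..30 on its collect
sent = share(26, 30, 25)
```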
Sage Weil [Thu, 14 Aug 2014 23:55:58 +0000 (16:55 -0700)]
mon/Paxos: verify all new peons are still contiguous at end of round
During the collect phase we verify that each peon has versions overlapping
or contiguous with ours (and can therefore be caught up with some
series of transactions). However, we *also* assimilate any new states we
get from those peers, and that may move our own first_committed forward
in time. This means that an early responder might have originally been
contiguous, but a later one moved us forward, and when the round finished
they were not contiguous any more. This leads to a crash on the peon
when they get our first begin message.
For example:
- we have 10..20
- first peon has 5..15
- ok!
- second peon has 18..30
- we apply this state
- we are now 18..30
- we finish the round
- send commit to first peon (empty.. we aren't contiguous)
- send no commit to second peon (we match)
- we send a begin for state 31
- first peon crashes (its lc is still 15)
Prevent this by checking at the end of the round if we are still
contiguous. If not, bootstrap. This is similar to the check we do above,
but reverse to make sure *we* aren't too far ahead of *them*.
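The recheck described above reduces to a one-line predicate (a simplification; real paxos state carries more than two integers): after assimilating peer states our first_committed may have advanced, and any peon whose last_committed no longer reaches it cannot be caught up by commits alone, so we bootstrap instead of sending it a begin it cannot apply.

```python
def still_contiguous(our_first: int, peon_last: int) -> bool:
    """Peon can be caught up iff its history overlaps or abuts ours."""
    return peon_last + 1 >= our_first

# From the example: we start at 10..20 and the first peon has 5..15 -> ok.
before = still_contiguous(10, 15)
# After assimilating 18..30 from the second peon, it no longer is.
after = still_contiguous(18, 15)
```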
Fixes: #9053
Signed-off-by: Sage Weil <sage@redhat.com>
Loic Dachary [Tue, 3 Jun 2014 20:20:29 +0000 (22:20 +0200)]
erasure-code: parse function for the mapping parameter
Each D letter is a data chunk. For instance:
_DDD_DDD
is going to parse into:
[ 1, 2, 3, 5, 6, 7 ]
the 0 and 4 positions are not used by chunks and do not show in the
mapping. Implement ErasureCode::parse to support a reasonable default
for the mapping parameter.
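The parse described above fits in a one-line sketch (illustrative Python, not the C++ ErasureCode::parse itself): each 'D' marks a position used by a chunk, and every other character marks an unused position that is skipped.

```python
def parse_mapping(mapping: str) -> list:
    """Positions marked 'D' hold chunks; other positions are unused."""
    return [i for i, c in enumerate(mapping) if c == 'D']

result = parse_mapping("_DDD_DDD")   # -> [1, 2, 3, 5, 6, 7]
```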
Add support for erasure code plugins that do not sequentially map the
chunks encoded to the corresponding index. This is mostly transparent to
the caller, except when it comes to retrieving the data chunks when
reading. For this purpose there needs to be a remapping function so the
caller has a way to figure out which chunks actually contain the data
and reorder them.
Somnath Roy [Mon, 30 Jun 2014 08:28:07 +0000 (01:28 -0700)]
FileStore: Index caching is introduced for performance improvement
IndexManager now has an Index cache; an Index is created only if it is
not found in the cache. Previously, each op created an Index object,
and other ops requesting the same index had to wait until the previous
op was done; after the lookup finished, the Index object was destroyed.
Creating and destroying these objects on every op was a major
performance hit, so an Index cache has been implemented to persist
them. An RWLock has been introduced in the CollectionIndex class and is
responsible for synchronizing lookup and create.
Also, since these Index objects are now persistent, there is no need to
use smart pointers, so Index is now a wrapper class around a
CollectionIndex*. It is now the responsibility of the users of Index to
lock explicitly before using it. The Index object alone is sufficient
for locking; there is no need to hold an IndexPath. The function
interfaces of lfn_open and lfn_find are changed accordingly.
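The caching scheme can be sketched as follows (illustrative API; the stand-in dict takes the place of a real CollectionIndex object): the manager keeps one Index per collection and creates it only on a miss, so concurrent ops share the cached object instead of rebuilding it.

```python
import threading

class IndexManager:
    """Illustrative cache: one Index per collection, created on miss."""
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def get_index(self, coll: str):
        with self._lock:
            idx = self._cache.get(coll)
            if idx is None:
                # stand-in for a CollectionIndex, with its own access lock
                idx = {"coll": coll, "access_lock": threading.Lock()}
                self._cache[coll] = idx
            return idx

mgr = IndexManager()
a = mgr.get_index("1.0_head")
b = mgr.get_index("1.0_head")   # cache hit: same object, not a new one
```

Mirroring the commit's rule, callers would take `idx["access_lock"]` explicitly before using the index.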
Signed-off-by: Somnath Roy <somnath.roy@sandisk.com>