The logic was a bit broken. Basically, we want to make sure
that region names are the same. However, if the region name is not
set, then we need to check whether it's the master region. This
can happen in upgrade cases where originally we didn't have
a region name set.
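A minimal sketch of the check described above, written in Python for brevity. The function and parameter names are hypothetical and do not correspond to actual rgw identifiers; it also assumes that "it" in the text refers to the local region.

```python
def bucket_in_local_region(bucket_region, local_region, local_is_master):
    """Return True if a bucket should be treated as belonging to the local region.

    All names here are hypothetical; this only illustrates the decision order.
    """
    if bucket_region:
        # Normal case: region names must match exactly.
        return bucket_region == local_region
    # Upgrade case: no region name was recorded for the bucket, so fall back
    # to checking whether the local region is the master region.
    return local_is_master
```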
Multiple fixes:
- sync master, secondary entry point ver on creation
- use correct entry point version when removing entry point
- check correct version on bucket removal
was never initialized correctly anyway. It was only supposed to
be used for buckets, but it was never initialized in that case.
Using s->bucket_info.objv_tracker instead.
rgw: forward delete bucket request to master after removal
We can only forward the bucket removal to the master if it was
successfully removed locally.
The master region has no knowledge about whether the
bucket can be removed or not (e.g., there may still be objects in the
bucket). If we send the request to the master first, it will happily
remove the bucket even though the local removal might fail in the end.
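The ordering argued for above can be sketched roughly as follows. This is an illustration only: `remove_locally` and `forward_to_master` are hypothetical callables, not rgw functions, and the errno-style return convention is an assumption.

```python
def delete_bucket(remove_locally, forward_to_master, is_master):
    """Remove a bucket locally, then forward the request to the master.

    remove_locally and forward_to_master are hypothetical callables that
    return 0 on success or a negative errno-style code on failure.
    """
    ret = remove_locally()
    if ret < 0:
        # Local removal failed (e.g. bucket not empty): never tell the master.
        return ret
    if is_master:
        # We are the master region, so there is nobody to forward to.
        return 0
    return forward_to_master()
```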
We had a problem with bucket recreation, where we identified
that the bucket already existed, but missed the fact that it's
the same bucket, so removal of the bucket index was wrong.
Sage Weil [Thu, 18 Jul 2013 21:50:32 +0000 (14:50 -0700)]
client: mark_down by con
We have the con handy; use it. This avoids generating a spurious RESET
event, which we do not need or do anything useful with. Note that in this
case we are not attaching anything to the Connection priv field.
Sage Weil [Thu, 18 Jul 2013 21:44:17 +0000 (14:44 -0700)]
mon: break con <-> session ref cycle in mon even if shutting down
If we get a reset during shutdown, we should still break the cycle to avoid
tripping the valgrind leak detection. Note that we are touching no
internal Monitor state here and the locking has not changed.
Sage Weil [Wed, 17 Jul 2013 05:43:26 +0000 (22:43 -0700)]
msgr: generate reset event on mark_down to addr (not con)
If the caller is marking down an addr, they presumably don't have the
Connection* handy, so we should generate a reset event to help them
clean up con <-> session ref cycles.
Sage Weil [Fri, 19 Jul 2013 17:37:16 +0000 (10:37 -0700)]
mon/PGMap: avoid negative pg stats when calculating rates
We periodically see strange values come out of the estimated cluster
throughput and recovery rates. Pretty sure this is caused by feeding
negative values into the rate arithmetic and then giving the si_t
helpers mangled (sign-extended + bit shifted) values.
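To illustrate why a negative delta looks mangled once it is treated as unsigned, here is a small Python sketch. It shows generic 64-bit sign extension, not the exact si_t code path; the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def as_uint64(value):
    """Reinterpret a signed integer as an unsigned 64-bit value."""
    return value & MASK64


def safe_delta(new, old):
    """Clamp a counter delta at zero before it reaches unsigned formatting."""
    return max(new - old, 0)
```

With these helpers, `as_uint64(-5)` yields 18446744073709551611, which is the kind of implausible rate a formatter would print, while `safe_delta(3, 8)` yields 0.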
Sage Weil [Fri, 19 Jul 2013 17:39:17 +0000 (10:39 -0700)]
mon/PGMap: use signed values for calculated rates
si_t (and friends) do not handle signed values, but at least we can
give the Formatters unmangled values. This shouldn't happen (tm), but
if it does this will make things a bit less confusing and makes the code
a bit less fragile.
Sage Weil [Thu, 18 Jul 2013 04:52:50 +0000 (21:52 -0700)]
osd: add floor() method to pg/osd stat structs
We often want to maintain a nonnegative value. We generalize
this to floors other than zero only because it makes the function
call make intuitive sense; I don't think it is at all useful.
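A rough sketch of what such a floor() method could look like, in Python rather than C++. `StatSum` and its two fields are hypothetical stand-ins for the real pg/osd stat structs, which carry many more counters.

```python
from dataclasses import dataclass


@dataclass
class StatSum:
    """Hypothetical stand-in for a pg/osd stat struct with a few counters."""
    num_bytes: int = 0
    num_objects: int = 0

    def floor(self, f=0):
        # Clamp every counter so that none falls below f.
        self.num_bytes = max(self.num_bytes, f)
        self.num_objects = max(self.num_objects, f)


def delta(new, old):
    """Difference of two samples, floored at zero before rate arithmetic."""
    d = StatSum(new.num_bytes - old.num_bytes, new.num_objects - old.num_objects)
    d.floor(0)
    return d
```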
Sage Weil [Thu, 18 Jul 2013 23:24:00 +0000 (16:24 -0700)]
msg/Pipe: work around incorrect features reported by earlier versions
If we see a peer reporting features ~0ull, we know they are deluded in a
particular way and should infer what features they *actually* have. Do
this right when the features come over the wire to catch all users.
Fixes: #5655 Signed-off-by: Samuel Just <sam.just@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
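The workaround in this commit amounts to sanitizing the feature word as soon as it is read. A minimal sketch, assuming a placeholder mask: `ASSUMED_LEGACY_FEATURES` is hypothetical, and the real set of features to infer is defined in the Ceph source and not reproduced here.

```python
ALL_ONES = (1 << 64) - 1  # what ~0ull looks like on the wire

# Hypothetical mask: the real set of features the affected versions support
# is defined in the Ceph source and is not reproduced here.
ASSUMED_LEGACY_FEATURES = (1 << 33) - 1


def sanitize_peer_features(reported):
    """Replace an impossible all-ones feature word with an inferred set."""
    if reported == ALL_ONES:
        return ASSUMED_LEGACY_FEATURES
    return reported
```

Doing this once at the point where the value arrives means every later consumer sees the corrected value.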
Sage Weil [Thu, 18 Jul 2013 21:06:41 +0000 (14:06 -0700)]
cls_lock: fix duration test
It's possible for us to just be really slow when getting the reply to the
first op or doing the second op, resulting in a successful lock. If we
do get a success, assert that at least that amount of time has passed to
avoid any false positives.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
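The adjusted test logic in this commit can be sketched as below. `try_lock` is a hypothetical callable standing in for the second lock attempt; the injectable `now` clock is an addition made only so the sketch is easy to exercise.

```python
import time


def check_second_lock(try_lock, first_lock_time, duration, now=time.monotonic):
    """Validate the outcome of a second lock attempt on a timed lock.

    try_lock is a hypothetical callable returning True if the lock was taken.
    """
    acquired = try_lock()
    if acquired:
        # Success is only legitimate if the first lock has had time to expire.
        elapsed = now() - first_lock_time
        assert elapsed >= duration, "lock granted before the first one expired"
    return acquired
```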
Samuel Just [Thu, 18 Jul 2013 17:12:17 +0000 (10:12 -0700)]
FileStore: add global replay guard for split, collection_rename
In the event of a split or collection rename, we need to ensure that
we don't replay any operations on objects within those collections
prior to that point. Thus, we mark a global replay guard on the
collection after doing a syncfs and make sure to check that in
_check_replay_guard() for all object operations.
Fixes: #5154 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
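A simplified model of the guard check in this commit, in Python. The `Collection` class, its fields, and the comparison rules are assumptions made for illustration; the real _check_replay_guard() works on on-disk attributes and has more cases.

```python
class Collection:
    """Hypothetical in-memory model of a collection with a global replay guard."""

    def __init__(self):
        self.global_guard = None   # journal sequence of the last split/rename
        self.object_guards = {}    # per-object guards, keyed by object name


def should_replay(coll, obj, op_seq):
    """Return True if a journaled op on obj must be re-applied during replay."""
    # An op at or before the collection-wide guard predates the split/rename.
    if coll.global_guard is not None and op_seq <= coll.global_guard:
        return False
    guard = coll.object_guards.get(obj)
    return guard is None or op_seq > guard
```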
Sage Weil [Wed, 17 Jul 2013 04:33:28 +0000 (21:33 -0700)]
rgw: drop unused assignment
rgw/rgw_rados.cc: In member function 'virtual int RGWPutObjProcessor_Atomic::handle_data(ceph::bufferlist&, off_t, void**)':
rgw/rgw_rados.cc:648:5: warning: parameter 'ofs' set but not used [-Wunused-but-set-parameter]
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Sage Weil [Wed, 17 Jul 2013 22:49:16 +0000 (15:49 -0700)]
mon: make 'health' warn about slow requests
Currently we see slow request warnings go by in the cluster log, but they
are not reflected by 'ceph health'. Use the new op queue histograms to
raise a flag there as well.
For example:
HEALTH_WARN 59 requests are blocked > 32 sec; 2 osds have slow requests
21 ops are blocked > 65.536 sec
38 ops are blocked > 32.768 sec
16 ops are blocked > 65.536 sec on osd.1
23 ops are blocked > 32.768 sec on osd.1
5 ops are blocked > 65.536 sec on osd.2
15 ops are blocked > 32.768 sec on osd.2
2 osds have slow requests
Fixes: #5505 Signed-off-by: Sage Weil <sage@inktank.com>
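As a rough illustration of how the detail lines in the example output could be derived from a histogram, consider the sketch below. The bucket layout (bucket i holds ages in [2^(i-1), 2^i) milliseconds), the `warn_ms` default, and the function name are all assumptions; only the message wording is taken from the example output.

```python
def slow_request_summary(hist, warn_ms=32768):
    """Build 'N ops are blocked > X sec' lines from a power-of-2 histogram.

    hist[i] is assumed to count ops whose age lies in [2**(i-1), 2**i) ms;
    that layout and the warn_ms default are assumptions for this sketch.
    """
    lines = []
    total = 0
    for i in range(len(hist) - 1, 0, -1):   # report the oldest bucket first
        lower_ms = 1 << (i - 1)
        if hist[i] and lower_ms >= warn_ms:
            total += hist[i]
            lines.append("%d ops are blocked > %g sec" % (hist[i], lower_ms / 1000.0))
    return total, lines
```

The per-OSD lines and the "osds have slow requests" count are not modelled here.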
Sage Weil [Wed, 17 Jul 2013 21:21:40 +0000 (14:21 -0700)]
osd: include op queue age histogram in osd_stat_t
This includes a simple power-of-2 histogram of op ages in the op queue
inside osd_stat_t. This can be used for a coarse view of overall cluster
performance (it will get summed by the mon), to identify specific outlier
osds who have a higher latency than the others, or to identify stuck ops.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
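A power-of-2 histogram of this kind can be sketched in a few lines of Python. Measuring ages in milliseconds and the helper names are assumptions; the actual osd_stat_t encoding is not shown.

```python
def add_op_age(hist, age_ms):
    """Count one op; bucket i (i >= 1) holds ages in [2**(i-1), 2**i) ms."""
    bucket = int(age_ms).bit_length()  # 0 -> bucket 0, 1 -> 1, 2..3 -> 2, ...
    while len(hist) <= bucket:
        hist.append(0)
    hist[bucket] += 1


def merge(total, other):
    """Sum two histograms element-wise, as a monitor might across OSDs."""
    for i, count in enumerate(other):
        while len(total) <= i:
            total.append(0)
        total[i] += count
```

Because every OSD uses the same bucket boundaries, summing is a plain element-wise addition, and an outlier OSD shows up as extra weight in its high buckets.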
Samuel Just [Wed, 17 Jul 2013 22:04:10 +0000 (15:04 -0700)]
PG: start flush on primary only after we process the master log
Once we start serving reads, stray objects must have already
been removed. Therefore, we have to flush all operations
up to the transaction writing out the authoritative log.
On replicas, we flush in Stray() if we will not eventually
be activated and in ReplicaActive if we are in the acting
set. This way a replica won't serve a replica read until
the store is consistent.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Samuel Just [Wed, 17 Jul 2013 19:51:19 +0000 (12:51 -0700)]
ReplicatedPG: replace clean_up_local with a debug check
Stray objects should have been cleaned up in the merge_log
transactions. Only on the primary have those operations
necessarily been flushed at activate().
Fixes: #5084 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 16 Jul 2013 20:13:46 +0000 (13:13 -0700)]
msg/Pipe: hold pipe_lock during important parts of accept()
Previously we did not bother with locking for accept() because we were
not visible to any other threads. However, we need to close accepting
Pipes from mark_down_all(), which means we need to handle interference.
Fix up the locking so that we hold pipe_lock when looking at Pipe state
and verify that we are still in the ACCEPTING state any time we retake
the lock.
Sage Weil [Tue, 16 Jul 2013 23:25:28 +0000 (16:25 -0700)]
msgr: adjust nonce on rebind()
We can have a situation where:
- we have a pipe to a peer
- pipe goes to standby (on peer)
- we rebind to a new port
- ....
- we rebind again to the same old port
- we connect to peer
and get reattached to the ancient pipe from two instances back. Avoid that
by picking a new nonce each time we rebind.
Add 1,000,000 each time so that the port is still legible in the printed
output.
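The nonce adjustment is simple enough to show directly. A sketch in Python, with a hypothetical helper name; the starting nonce value used in the note below is made up for illustration.

```python
NONCE_STEP = 1000000  # increment taken from the description above


def next_nonce(nonce):
    """Return the nonce to use after a rebind."""
    return nonce + NONCE_STEP
```

Starting from an example nonce of 12345, two rebinds give 1012345 and then 2012345: the low digits stay recognizable, and an address that reuses an old port no longer matches the old instance.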
Sage Weil [Tue, 16 Jul 2013 17:09:02 +0000 (10:09 -0700)]
msg/Pipe: avoid creating empty out_q entry
We need to maintain the invariant that all sub queues in out_q are never
empty. Fix discard_requeued_up_to() to avoid creating an entry unless we
know it is already present.
This bug leads to an incorrect reconnect attempt when
- we accept a pipe (lossless peer)
- they send some stuff, maybe
- fault
- we initiate reconnect, even though we have nothing queued
In particular, we shouldn't reconnect because we aren't checking for
resets, and the fact that our out_seq is 0 while the peer's might be
something else entirely will trigger asserts later.
This fixes at least one source of #5626, and possibly #5517.
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com>
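The same pitfall exists in Python: indexing a `defaultdict` creates the key, much like `operator[]` on a C++ map. The sketch below models out_q as a mapping from priority to a queue of (seq, payload) pairs; that shape and the function signature are assumptions, not the real Pipe code.

```python
from collections import defaultdict, deque


def discard_requeued_up_to(out_q, priority, seq):
    """Drop queued messages with sequence <= seq without creating empty entries.

    out_q maps priority -> deque of (seq, payload) pairs.
    """
    if priority not in out_q:      # membership test does not insert a key
        return
    q = out_q[priority]
    while q and q[0][0] <= seq:
        q.popleft()
    if not q:
        del out_q[priority]        # keep the invariant: no empty sub-queues
```

Checking membership first is what keeps "is anything queued?" equivalent to "is out_q non-empty?".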
Sage Weil [Wed, 17 Jul 2013 16:36:36 +0000 (09:36 -0700)]
qa/workunits/cephtest/test.sh: put 'osd ls' before any 'osd create' tests
A monc/mon connection fault or the dup command test flag may mean an extra
osd id is created that isn't actually up; reorder so that doesn't screw
up 'osd ls'.
Samuel Just [Tue, 16 Jul 2013 23:16:47 +0000 (16:16 -0700)]
OSD::_try_resurrect_pg: fix cur/pgid confusion
This bug prevented resurrection of ancestor pgs where
necessary.
Fixes: #5269
This may result in pg A being created just before pg B
is resurrected and split into A and B, resulting in one
or the other operation getting an EEXIST.
Backport: cuttlefish Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 9 Jul 2013 04:12:49 +0000 (21:12 -0700)]
ceph: send successful commands twice with CEPH_CLI_TEST_DUP_COMMAND
Monitor commands need to be idempotent. This helps us test this by
simply issuing any successful command a second time so that we notice
when a dup submission fails.
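The idea can be sketched as a thin wrapper around whatever sends the command. `send` is a hypothetical callable and the (ret, output) return shape is an assumption; only the environment variable name comes from the commit subject.

```python
import os


def run_command(send, cmd, environ=os.environ):
    """Send cmd; when the test flag is set, resend it after a success.

    send is a hypothetical callable returning (ret, output), ret == 0 on success.
    """
    ret, out = send(cmd)
    if ret == 0 and environ.get("CEPH_CLI_TEST_DUP_COMMAND"):
        # An idempotent command must succeed again with the same arguments.
        ret, out = send(cmd)
    return ret, out
```

Returning the result of the second submission is what makes a non-idempotent command visible as a test failure.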
Sage Weil [Tue, 16 Jul 2013 22:28:07 +0000 (15:28 -0700)]
osd/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy
Ensure that the snap does in fact exist before we try to remove it. This
avoids a crash where we get two dup rmsnap requests (due to thrashing, or
a reconnect, or something), the committed (p) value does have the snap, but
the uncommitted (pp) does not. This fails the old test such that we try
to remove it from pp again, and assert.
Restructure the flow so that it is easier to distinguish the committed
short return from the uncommitted return (which must still wait for the
commit).
0> 2013-07-16 14:21:27.189060 7fdf301e9700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_snap(snapid_t)' thread 7fdf301e9700 time 2013-07-16 14:21:27.187095
osd/osd_types.cc: 662: FAILED assert(snaps.count(s))
ceph version 0.66-602-gcd39d8a (cd39d8a6727d81b889869e98f5869e4227b50720)
1: (pg_pool_t::remove_snap(snapid_t)+0x6d) [0x7ad6dd]
2: (OSDMonitor::prepare_command(MMonCommand*)+0x6407) [0x5c1517]
3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1fb) [0x5c41ab]
4: (PaxosService::dispatch(PaxosServiceMessage*)+0x937) [0x598c87]
5: (Monitor::handle_command(MMonCommand*)+0xe56) [0x56ec36]
6: (Monitor::_ms_dispatch(Message*)+0xd1d) [0x5719ad]
7: (Monitor::handle_forward(MForward*)+0x821) [0x572831]
8: (Monitor::_ms_dispatch(Message*)+0xe44) [0x571ad4]
9: (Monitor::ms_dispatch(Message*)+0x32) [0x588c52]
10: (DispatchQueue::entry()+0x549) [0x7cf1d9]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7060fd]
12: (()+0x7e9a) [0x7fdf35165e9a]
13: (clone()+0x6d) [0x7fdf334fcccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
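The restructured rmsnap flow described in this commit can be modelled roughly as follows. The sets stand in for the snaps of the committed (p) and uncommitted (pp) pool state; the function name and return labels are hypothetical.

```python
def prepare_rmsnap(committed_snaps, pending_snaps, snap):
    """Decide how to handle 'osd pool rmsnap' given committed and pending state.

    Returns 'done' if nothing needs to change, 'wait' if a removal is already
    pending and only the commit must be awaited, or 'proposed' after queueing
    the removal. The return labels are hypothetical.
    """
    if snap not in committed_snaps:
        return "done"              # never existed or already removed
    if snap not in pending_snaps:
        return "wait"              # duplicate request: removal already queued
    pending_snaps.discard(snap)    # safe: presence was checked just above
    return "proposed"
```

A duplicate request arriving before the commit takes the "wait" branch instead of removing the snap from the pending state a second time, which is the path that hit the assert in the trace above.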