Sage Weil [Wed, 17 Jul 2013 22:49:16 +0000 (15:49 -0700)]
mon: make 'health' warn about slow requests
Currently we see slow request warnings go by in the cluster log, but they
are not reflected by 'ceph health'. Use the new op queue histograms to
raise a flag there as well.
For example:
HEALTH_WARN 59 requests are blocked > 32 sec; 2 osds have slow requests
21 ops are blocked > 65.536 sec
38 ops are blocked > 32.768 sec
16 ops are blocked > 65.536 sec on osd.1
23 ops are blocked > 32.768 sec on osd.1
5 ops are blocked > 65.536 sec on osd.2
15 ops are blocked > 32.768 sec on osd.2
2 osds have slow requests
Fixes: #5505 Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 17 Jul 2013 21:21:40 +0000 (14:21 -0700)]
osd: include op queue age histogram in osd_stat_t
This includes a simple power-of-2 histogram of op ages in the op queue
inside osd_stat_t. This can be used for a coarse view of overall cluster
performance (it will get summed by the mon), to identify specific outlier
osds who have a higher latency than the others, or to identify stuck ops.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Wed, 17 Jul 2013 22:04:10 +0000 (15:04 -0700)]
PG: start flush on primary only after we process the master log
Once we start serving reads, stray objects must have already
been removed. Therefore, we have to flush all operations
up to the transaction writing out the authoritative log.
On replicas, we flush in Stray() if we will not eventually
be activated and in ReplicaActive if we are in the acting
set. This way a replica won't serve a replica read until
the store is consistent.
Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Samuel Just [Wed, 17 Jul 2013 19:51:19 +0000 (12:51 -0700)]
ReplicatedPG: replace clean_up_local with a debug check
Stray objects should have been cleaned up in the merge_log
transactions. Only on the primary have those operations
necessarily been flushed at activate().
Fixes: 5084 Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 16 Jul 2013 20:13:46 +0000 (13:13 -0700)]
msg/Pipe: hold pipe_lock during important parts of accept()
Previously we did not bother with locking for accept() because we were
not visible to any other threads. However, we need to close accepting
Pipes from mark_down_all(), which means we need to handle interference.
Fix up the locking so that we hold pipe_lock when looking at Pipe state
and verify that we are still in the ACCEPTING state any time we retake
the lock.
Sage Weil [Tue, 16 Jul 2013 23:25:28 +0000 (16:25 -0700)]
msgr: adjust nonce on rebind()
We can have a situation where:
- we have a pipe to a peer
- pipe goes to standby (on peer)
- we rebind to a new port
- ....
- we rebind again to the same old port
- we connect to peer
and get reattached to the ancient pipe from two instances back. Avoid that
by picking a new nonce each time we rebind.
Add 1,000,000 each time so that the port is still legible in the printed
output.
Sage Weil [Tue, 16 Jul 2013 17:09:02 +0000 (10:09 -0700)]
msg/Pipe: avoid creating empty out_q entry
We need to maintain the invariant that all sub queues in out_q are never
empty. Fix discard_requeued_up_to() to avoid creating an entry unless we
know it is already present.
This bug leads to an incorrect reconnect attempt when
- we accept a pipe (lossless peer)
- they send some stuff, maybe
- fault
- we initiate reconnect, even tho we have nothing queued
In particular, we shouldn't reconnect because we aren't checking for
resets, and the fact that our out_seq is 0 while the peer's might be
something else entirely will trigger asserts later.
This fixes at least one source of #5626, and possibly #5517.
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 17 Jul 2013 16:36:36 +0000 (09:36 -0700)]
qa/workunits/cephtest/test.sh: put 'osd ls' before any 'osd create' tests
A monc/mon connection fault or the dup command test flag may mean an extra
osd id is created that we isn't actually up; reorder so that doesn't screw
up 'osd ls'.
Samuel Just [Tue, 16 Jul 2013 23:16:47 +0000 (16:16 -0700)]
OSD::_try_resurrect_pg: fix cur/pgid confusion
This bug prevented resurrection of ancestor pgs where
necessary.
Fixes: #5269
This may result in pg A being created just before pg B
is resurrected and split into A and B resulting in one
or the other operations getting and EEXIST.
Backport: cuttlefish Signed-off-by: Samuel Just <sam.just@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 9 Jul 2013 04:12:49 +0000 (21:12 -0700)]
ceph: send successful commands twice with CEPH_CLI_TEST_DUP_COMMAND
Monitor commands need to be idempotent. This helps us test this by
simply issuing any successful command a second time so that we notice
when a dup submission fails.
Sage Weil [Tue, 16 Jul 2013 22:28:07 +0000 (15:28 -0700)]
osd/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy
Ensure that the snap does in fact exist before we try to remove it. This
avoids a crash where a we get two dup rmsnap requests (due to thrashing, or
a reconnect, or something), the committed (p) value does have the snap, but
the uncommitted (pp) does not. This fails the old test such that we try
to remove it from pp again, and assert.
Restructure the flow so that it is easier to distinguish the committed
short return from the uncommitted return (which must still wait for the
commit).
0> 2013-07-16 14:21:27.189060 7fdf301e9700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_snap(snapid_t)' thread 7fdf301e9700 time 2013-07-16 14:21:27.187095
osd/osd_types.cc: 662: FAILED assert(snaps.count(s))
ceph version 0.66-602-gcd39d8a (cd39d8a6727d81b889869e98f5869e4227b50720)
1: (pg_pool_t::remove_snap(snapid_t)+0x6d) [0x7ad6dd]
2: (OSDMonitor::prepare_command(MMonCommand*)+0x6407) [0x5c1517]
3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1fb) [0x5c41ab]
4: (PaxosService::dispatch(PaxosServiceMessage*)+0x937) [0x598c87]
5: (Monitor::handle_command(MMonCommand*)+0xe56) [0x56ec36]
6: (Monitor::_ms_dispatch(Message*)+0xd1d) [0x5719ad]
7: (Monitor::handle_forward(MForward*)+0x821) [0x572831]
8: (Monitor::_ms_dispatch(Message*)+0xe44) [0x571ad4]
9: (Monitor::ms_dispatch(Message*)+0x32) [0x588c52]
10: (DispatchQueue::entry()+0x549) [0x7cf1d9]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7060fd]
12: (()+0x7e9a) [0x7fdf35165e9a]
13: (clone()+0x6d) [0x7fdf334fcccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
mon: OSDMonitor: only thrash and propose if we are the leader
'thrash_map' is only set if we are the leader, so we would thrash and
propose the pending value if we are the leader. However, we should keep
the 'is_leader()' check not only for clarity's sake (an unfamiliar reader
may cry OMGBUG, prompting to a patch much like this), but also because
we may lose a subsequent election and become a peon instead, while still
holding a 'thrash_map' value > 0 -- and we really don't want to propose
while being a peon.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 16 Jul 2013 21:49:33 +0000 (14:49 -0700)]
mon/OSDMonitor: send_to_waiting() in on_active()
The send_latest() helper may put a message in the waiting_for_map list
if we are not readable, but currently send_to_waiting() is only called
from update_from_paxos(), and it is possible that we may be unreadable
but not get a map update.
Instead, share the map when we are active. Do the same for check_subs(),
which is also about sharing the *new* map. Leave
share_map_with_random_osd() and process_failures() which are not
concerned with whether this is the latest map or not.
Fixes: #5439
ECANCELED there means that we lost in a race to write the object. We
should treat it as a successful write. This is reviving an old behavior
that was changed inadvertently.
Sage Weil [Tue, 16 Jul 2013 20:14:50 +0000 (13:14 -0700)]
ceph-disk: rely on /dev/disk/by-partuuid instead of special-casing journal symlinks
This was necessary when ceph-disk-udev didn't create the by-partuuid (and
other) symlinks for us, but now it is fragile and error-prone. (It also
appears to be broken on a certain customer RHEL VM.) See d7f7d613512fe39ec883e11d201793c75ee05db1.
Instead, just use the by-partuuid symlinks that we spent all that ugly
effort generating.
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Dan Mick <dan.mick@inktank.com>
mon: Monitor: StoreConverter: clearer debug message on 'needs_conversion()'
The previous debug message outputted the function's name, as often our
functions do. This was however a source of bewilderment, as users would
see those in logs and think their stores would need conversion. Changing
this message is trivial enough and it will make ceph users happier log
readers.
Backport: cuttlefish Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
mon: Monitor: do not reopen MonitorDBStore during conversion
We already open the store on ceph_mon.cc, before we start the conversion.
Given we are unable to reproduce this every time a conversion is triggered,
we are led to believe that this causes a race in leveldb that will lead
to 'store.db/LOCK' being locked upon the open this patch removes.
Regardless, reopening the db here is pointless as we already did it when
we reach Monitor::StoreConverter::convert().
Fixes: #5640
Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
When we set bucket.instance meta, we need to set
the correct bucket placement to the bucket (according to
the specific placement rule). However, it might be that
bucket placement was never configured and we just go by
the defaults, using the old legacy pools selection.
Sage Weil [Sun, 14 Jul 2013 22:54:29 +0000 (15:54 -0700)]
messages/MClientReconnect: clear data when encoding
The MClientReconnect puts everything in the data payload portion of
the message and nothing in the front portion. That means that if the
message is resent (socket failure or something), the messenger thinks it
hasn't been encoded yet (front empty) and reencodes, which means
everything gets added (again) to the data portion.
Decoding keep decoding until it runs out of data, so the second copy
means we decode garbage snap realms, leading to the crash in bug
Clearing data each time around resolves the problem, although it does
mean we do the encoding work multiple times. We could alternatively
(or also) stick some data in the front portion of the payload
(ignored), but that changes the wire protocol and I would rather not
do that.
Sage Weil [Sat, 13 Jul 2013 04:52:30 +0000 (21:52 -0700)]
mon: set forwarded message recv stamp
Set it to the stamp of the MForward that carried us. One could argue
we really want the original receive stamp on the origin, but that is
not available to us, and this is better than nothing.
In particular, this gives 'ceph log ...' commands a timestamp when they
are forwarded via a peon. The stamp is still between when the request
is sent and when it is committed/acked, so all is well from the
client's perspective.
Sage Weil [Sat, 13 Jul 2013 15:36:25 +0000 (08:36 -0700)]
mon: set peon state to electing if other mons call an election
Previously we would call mon->reset() and set various flags (like
exited_quorum timestamp), but the state would remain PEON. Make an
explicit join_election() callback and set the state there, and add
asserts in reset() (renamed to be private) so that we ensure all
callers are well-behaved.
Sage Weil [Sat, 13 Jul 2013 15:11:45 +0000 (08:11 -0700)]
mon: once sync full is chosen, make sure we don't change our mind
It is possible for a sequence like:
- probe
- first probe reply has paxos trim that indicates a full sync is
needed
- start sync
- clear store
- something happens that makes us abort and bootstrap (e.g., the
provider mon restarts
- probe
- first probe reply has older paxos trim bound and we call an election
- on election completion, we crash because we have no data.
Non-determinism of the probe decision aside, we need to ensure that
the info we share during probe (fc, lc) is accurate, and that once we
clear the store we know we *must* do a full sync.
Fixes: #5621
Backport: cuttlefish Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 15 Jul 2013 19:56:26 +0000 (12:56 -0700)]
mon/PaxosService: consolidate resetting in restart()
We had duplicated code in election_finished() and restart(), and it was
incomplete. Put it all in restart() only (the mon should have called
restart() long before the election finishes). Note that we cannot
assert as much in election_finished() because another service may have
just cross-proposed.
Sage Weil [Fri, 12 Jul 2013 21:47:09 +0000 (14:47 -0700)]
mon: fix scrub vs paxos race: refresh on commit, not round completion
Consider:
- paxos starts a commit N+1
- a majority of the peers ack it
- paxos::commit() writes N+1 it to disk
- tells peers to commit
- peers commit N+1, *and* refresh_from_paxos(), and generate N+1 full map
- leader does _scrub on N+1, without latest full osdmap
- peers do _scrub on N+1, with latest full osdmap
- leader finishes paxos gather, does refresh_from_paxos()
-> scrub fails.
Fix this by doing the refresh_from_paxos() at commit time and not when
the paxos round finishes. We move the refresh out of finish_proposal
and into its own helper, and update all callers accordingly. This
keeps on-disk state more tightly in sync with in-memory state and
avoids the need for a e.g., kludgey workaround in the scrub code.
We also simplify the bootstrap checks a bit by doing so immediately
and relying on the normal bootstrap paxos reset paths to clean up
any waiters.
Sage Weil [Sat, 13 Jul 2013 15:11:45 +0000 (08:11 -0700)]
mon: once sync full is chosen, make sure we don't change our mind
It is possible for a sequence like:
- probe
- first probe reply has paxos trim that indicates a full sync is
needed
- start sync
- clear store
- something happens that makes us abort and bootstrap (e.g., the
provider mon restarts
- probe
- first probe reply has older paxos trim bound and we call an election
- on election completion, we crash because we have no data.
Non-determinism of the probe decision aside, we need to ensure that
the info we share during probe (fc, lc) is accurate, and that once we
clear the store we know we *must* do a full sync.
Sage Weil [Mon, 15 Jul 2013 16:58:08 +0000 (09:58 -0700)]
rgw: fix more warnings
test/test_rgw_admin_opstate.cc: In member function 'int admin_log::test_helper::extract_input(int, char**)':
warning: test/test_rgw_admin_opstate.cc:129:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_opstate.cc:131:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_opstate.cc:133:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_opstate.cc:135:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
test/test_rgw_admin_log.cc: In member function 'int admin_log::test_helper::extract_input(int, char**)':
warning: test/test_rgw_admin_log.cc:132:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_log.cc:134:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_log.cc:136:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_log.cc:138:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
Sage Weil [Sun, 14 Jul 2013 23:37:45 +0000 (16:37 -0700)]
test_rgw_admin_meta: fix warnings
test/test_rgw_admin_meta.cc: In member function 'int admin_meta::test_helper::extract_input(int, char**)':
warning: test/test_rgw_admin_meta.cc:126:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_meta.cc:128:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_meta.cc:130:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_meta.cc:132:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
Sage Weil [Sun, 14 Jul 2013 23:36:21 +0000 (16:36 -0700)]
cls_rgw: fix warning
cls/rgw/cls_rgw.cc: In function 'int get_obj_vals(cls_method_context_t, const string&, const string&, int, std::map, ceph::buffer::list>*)':
warning: cls/rgw/cls_rgw.cc:175:28: narrowing conversion of '129' from 'int' to 'char' inside { } is ill-formed in C++11 [-Wnarrowing]