git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
12 years agomon: make 'health' warn about slow requests 444/head
Sage Weil [Wed, 17 Jul 2013 22:49:16 +0000 (15:49 -0700)]
mon: make 'health' warn about slow requests

Currently we see slow request warnings go by in the cluster log, but they
are not reflected by 'ceph health'.  Use the new op queue histograms to
raise a flag there as well.

For example:

HEALTH_WARN 59 requests are blocked > 32 sec; 2 osds have slow requests
21 ops are blocked > 65.536 sec
38 ops are blocked > 32.768 sec
16 ops are blocked > 65.536 sec on osd.1
23 ops are blocked > 32.768 sec on osd.1
5 ops are blocked > 65.536 sec on osd.2
15 ops are blocked > 32.768 sec on osd.2
2 osds have slow requests
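
A minimal sketch of how a summary like the one above could be assembled from per-OSD counts, written in Python for brevity. The `summarize_slow_requests` helper and its input layout are hypothetical and are not the monitor's actual code; the numbers are taken from the example output.

```python
from collections import defaultdict

def summarize_slow_requests(per_osd, warn_sec=32):
    """Build health lines from {osd_name: {blocked_for_sec: op_count}}.

    The input layout is an assumption made for this sketch.
    """
    totals = defaultdict(int)   # blocked-for threshold -> ops across all osds
    detail = []                 # per-osd lines
    slow_osds = 0
    for osd in sorted(per_osd):
        slow = {sec: n for sec, n in per_osd[osd].items()
                if sec > warn_sec and n > 0}
        if not slow:
            continue
        slow_osds += 1
        for sec in sorted(slow, reverse=True):
            totals[sec] += slow[sec]
            detail.append("%d ops are blocked > %s sec on %s" % (slow[sec], sec, osd))
    if not totals:
        return ["HEALTH_OK"]
    summary = ("HEALTH_WARN %d requests are blocked > %d sec; %d osds have slow requests"
               % (sum(totals.values()), warn_sec, slow_osds))
    overall = ["%d ops are blocked > %s sec" % (totals[sec], sec)
               for sec in sorted(totals, reverse=True)]
    return [summary] + overall + detail + ["%d osds have slow requests" % slow_osds]

example = {
    "osd.1": {65.536: 16, 32.768: 23},
    "osd.2": {65.536: 5, 32.768: 15},
}
for line in summarize_slow_requests(example):
    print(line)
```

The per-OSD counts add up to the cluster-wide lines: 16 + 5 = 21 and 23 + 15 = 38, for 59 blocked requests in total.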

Fixes: #5505
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: include op queue age histogram in osd_stat_t
Sage Weil [Wed, 17 Jul 2013 21:21:40 +0000 (14:21 -0700)]
osd: include op queue age histogram in osd_stat_t

This includes a simple power-of-2 histogram of op ages in the op queue
inside osd_stat_t.  This can be used for a coarse view of overall cluster
performance (it will get summed by the mon), to identify specific outlier
osds who have a higher latency than the others, or to identify stuck ops.
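
As an illustration only, a power-of-2 histogram and the summing step could look like the Python sketch below. The function names, the list-based layout, and the use of milliseconds as the unit are assumptions made for this sketch, not the osd_stat_t encoding.

```python
def bucket_for(age_ms):
    """Bucket 0 holds age 0; bucket i (i >= 1) holds ages in [2**(i-1), 2**i) ms."""
    return int(age_ms).bit_length()

def build_histogram(ages_ms):
    """Count op ages into power-of-2 buckets, growing the list as needed."""
    hist = []
    for age in ages_ms:
        i = bucket_for(age)
        if i >= len(hist):
            hist.extend([0] * (i + 1 - len(hist)))
        hist[i] += 1
    return hist

def merge(a, b):
    """Sum two histograms bucket by bucket (what a cluster-wide view would do)."""
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(n)]
```

Because every histogram uses the same bucket boundaries, summing them is a plain element-wise addition, and an outlier OSD shows up as unusually large counts in the high buckets.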

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoqa/workunits/cephtool/test.sh: test 'osd create <uuid>'
Sage Weil [Thu, 18 Jul 2013 01:17:29 +0000 (18:17 -0700)]
qa/workunits/cephtool/test.sh: test 'osd create <uuid>'

Make sure it gives us back the same id.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agoPG: start flush on primary only after we process the master log
Samuel Just [Wed, 17 Jul 2013 22:04:10 +0000 (15:04 -0700)]
PG: start flush on primary only after we process the master log

Once we start serving reads, stray objects must have already
been removed.  Therefore, we have to flush all operations
up to the transaction writing out the authoritative log.
On replicas, we flush in Stray() if we will not eventually
be activated and in ReplicaActive if we are in the acting
set.  This way a replica won't serve a replica read until
the store is consistent.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoReplicatedPG: replace clean_up_local with a debug check
Samuel Just [Wed, 17 Jul 2013 19:51:19 +0000 (12:51 -0700)]
ReplicatedPG: replace clean_up_local with a debug check

Stray objects should have been cleaned up in the merge_log
transactions.  Only on the primary have those operations
necessarily been flushed at activate().

Fixes: #5084
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomsgr: fix a typo/goto-cross from dd4addef2d
Greg Farnum [Wed, 17 Jul 2013 22:23:12 +0000 (15:23 -0700)]
msgr: fix a typo/goto-cross from dd4addef2d

We didn't build or review carefully enough!

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #441 from ceph/wip-5626
Sage Weil [Wed, 17 Jul 2013 21:50:41 +0000 (14:50 -0700)]
Merge pull request #441 from ceph/wip-5626

msgr fixes for lossless peer sessions

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoosd: make 'from dead osd' message more informative 441/head
Sage Weil [Tue, 16 Jul 2013 21:21:08 +0000 (14:21 -0700)]
osd: make 'from dead osd' message more informative

I thought I saw some weirdness here.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: a bit of additional debug output
Sage Weil [Tue, 16 Jul 2013 21:17:05 +0000 (14:17 -0700)]
msg/Pipe: a bit of additional debug output

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: hold pipe_lock during important parts of accept()
Sage Weil [Tue, 16 Jul 2013 20:13:46 +0000 (13:13 -0700)]
msg/Pipe: hold pipe_lock during important parts of accept()

Previously we did not bother with locking for accept() because we were
not visible to any other threads.  However, we need to close accepting
Pipes from mark_down_all(), which means we need to handle interference.

Fix up the locking so that we hold pipe_lock when looking at Pipe state
and verify that we are still in the ACCEPTING state any time we retake
the lock.
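
The pattern described here, re-checking the state every time the lock is retaken, can be sketched in Python as follows. The `Pipe` class below is a hypothetical stand-in with made-up state names; it is not the messenger's real Pipe.

```python
import threading

class Pipe:
    ACCEPTING, OPEN, CLOSED = "accepting", "open", "closed"

    def __init__(self):
        self.pipe_lock = threading.Lock()
        self.state = Pipe.ACCEPTING

    def stop(self):
        """What a mark_down_all()-style caller would do from another thread."""
        with self.pipe_lock:
            self.state = Pipe.CLOSED

    def accept(self, handshake):
        """Run the blocking handshake unlocked, then re-validate under the lock."""
        handshake()                      # slow I/O, lock not held
        with self.pipe_lock:
            if self.state != Pipe.ACCEPTING:
                return False             # someone closed us in the meantime
            self.state = Pipe.OPEN
            return True
```

The point of the re-check is that anything may have happened while the lock was dropped, so the state read before the handshake cannot be trusted afterwards.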

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: close accepting_pipes from mark_down_all()
Sage Weil [Tue, 16 Jul 2013 00:16:23 +0000 (17:16 -0700)]
msgr: close accepting_pipes from mark_down_all()

We need to catch these pipes too, particularly when doing a rebind(),
to avoid them leaking through.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: maintain list of accepting pipes
Sage Weil [Tue, 16 Jul 2013 00:14:25 +0000 (17:14 -0700)]
msgr: maintain list of accepting pipes

New pipes exist in a sort of limbo before we know who the peer is and
add them to rank_pipe.  Keep a list of them in accepting_pipes for that
period.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: adjust nonce on rebind()
Sage Weil [Tue, 16 Jul 2013 23:25:28 +0000 (16:25 -0700)]
msgr: adjust nonce on rebind()

We can have a situation where:

 - we have a pipe to a peer
 - pipe goes to standby (on peer)
 - we rebind to a new port
 - ....
 - we rebind again to the same old port
 - we connect to peer

and get reattached to the ancient pipe from two instances back.  Avoid that
by picking a new nonce each time we rebind.

Add 1,000,000 each time so that the port is still legible in the printed
output.
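
A small Python sketch of the arithmetic, assuming a hypothetical `next_nonce` helper and an arbitrary starting value:

```python
NONCE_STEP = 1000000  # step named in the commit message

def next_nonce(nonce):
    """Return the nonce to use after one more rebind."""
    return nonce + NONCE_STEP

nonce = 12345                 # arbitrary starting value for illustration
for _ in range(2):            # two rebinds
    nonce = next_nonce(nonce)
print(nonce)                  # low-order digits are unchanged
```

Since only the digits above the millions place change, the original low-order digits stay readable in printed addresses while each rebind still yields a distinct nonce.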

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: mark_down_all() after, not before, rebind
Sage Weil [Tue, 16 Jul 2013 00:10:23 +0000 (17:10 -0700)]
msgr: mark_down_all() after, not before, rebind

If we are shutting down all old connections and binding to new ports,
we want to avoid a sequence like:

 - close all previous connections
 - new connection comes in on old port
 - rebind to new ports
 -> connection from old port leaks through

As a first step, close all connections after we shut down the old
accepter and before we start the new one.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: unlock msgr->lock earlier in accept()
Sage Weil [Tue, 16 Jul 2013 20:01:18 +0000 (13:01 -0700)]
msg/Pipe: unlock msgr->lock earlier in accept()

Small cleanup.  Nothing needs msgr->lock for the previously larger
window.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: avoid creating empty out_q entry
Sage Weil [Tue, 16 Jul 2013 17:09:02 +0000 (10:09 -0700)]
msg/Pipe: avoid creating empty out_q entry

We need to maintain the invariant that all sub queues in out_q are never
empty.  Fix discard_requeued_up_to() to avoid creating an entry unless we
know it is already present.
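
The hazard is the usual one with auto-creating maps: merely looking up a missing key creates an empty entry. A Python sketch using `defaultdict` to mimic that behaviour is shown below; the function body is a simplified, hypothetical stand-in and not the real discard_requeued_up_to().

```python
from collections import defaultdict

def discard_requeued_up_to(out_q, priority, seq):
    """Drop queued (seq, msg) pairs with seq <= the given value.

    out_q maps priority -> non-empty list; the invariant is that no
    priority ever maps to an empty list.
    """
    if priority not in out_q:
        return                      # membership test does not create an entry
    remaining = [(s, m) for (s, m) in out_q[priority] if s > seq]
    if remaining:
        out_q[priority] = remaining
    else:
        del out_q[priority]         # never leave an empty sub queue behind

out_q = defaultdict(list)
discard_requeued_up_to(out_q, 127, 5)
print(len(out_q))                   # still empty, so nothing looks "queued"
```

With the invariant intact, "is anything queued?" can be answered by checking whether the map is empty, which is the kind of check a reconnect decision would rely on.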

This bug leads to an incorrect reconnect attempt when

 - we accept a pipe (lossless peer)
 - they send some stuff, maybe
 - fault
 - we initiate reconnect, even tho we have nothing queued

In particular, we shouldn't reconnect because we aren't checking for
resets, and the fact that our out_seq is 0 while the peer's might be
something else entirely will trigger asserts later.

This fixes at least one source of #5626, and possibly #5517.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: assert lock is held in various helpers
Sage Weil [Mon, 15 Jul 2013 21:47:05 +0000 (14:47 -0700)]
msg/Pipe: assert lock is held in various helpers

These all require that we hold pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph_mon: obtain backup monmap if store is marked with 'force_sync'
Joao Eduardo Luis [Wed, 17 Jul 2013 18:50:38 +0000 (19:50 +0100)]
ceph_mon: obtain backup monmap if store is marked with 'force_sync'

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomon/OSDMonitor: make 'osd pool mksnap ...' not expose uncommitted state
Sage Weil [Wed, 17 Jul 2013 00:08:23 +0000 (17:08 -0700)]
mon/OSDMonitor: make 'osd pool mksnap ...' not expose uncommitted state

We were returning success without waiting if the pending pool state had
the snap.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agoqa/workunits/cephtest/test.sh: put 'osd ls' before any 'osd create' tests
Sage Weil [Wed, 17 Jul 2013 16:36:36 +0000 (09:36 -0700)]
qa/workunits/cephtest/test.sh: put 'osd ls' before any 'osd create' tests

A monc/mon connection fault or the dup command test flag may mean an extra
osd id is created that isn't actually up; reorder so that doesn't screw
up 'osd ls'.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: MonCommands: remove obsolete 'sync status' command
Joao Eduardo Luis [Wed, 17 Jul 2013 14:50:37 +0000 (15:50 +0100)]
mon: MonCommands: remove obsolete 'sync status' command

Obsoleted by the sync refactor from
da0aff28ab478bcc3136715f92bc1af8d4b403c1

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoOSD::_try_resurrect_pg: fix cur/pgid confusion
Samuel Just [Tue, 16 Jul 2013 23:16:47 +0000 (16:16 -0700)]
OSD::_try_resurrect_pg: fix cur/pgid confusion

This bug prevented resurrection of ancestor pgs where
necessary.

Fixes: #5269
This may result in pg A being created just before pg B
is resurrected and split into A and B, resulting in one
or the other operation getting an EEXIST.

Backport: cuttlefish
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomon/AuthMonitor: make 'auth del ...' idempotent
Sage Weil [Wed, 17 Jul 2013 00:21:33 +0000 (17:21 -0700)]
mon/AuthMonitor: make 'auth del ...' idempotent

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoqa/workunits/cephtool/test.sh: mds cluster_down/up are idempotent
Sage Weil [Wed, 17 Jul 2013 00:14:09 +0000 (17:14 -0700)]
qa/workunits/cephtool/test.sh: mds cluster_down/up are idempotent

As of d45429b81ab9817284d6dca98077cb77b5e8280f; fix the test.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph: send successful commands twice with CEPH_CLI_TEST_DUP_COMMAND
Sage Weil [Tue, 9 Jul 2013 04:12:49 +0000 (21:12 -0700)]
ceph: send successful commands twice with CEPH_CLI_TEST_DUP_COMMAND

Monitor commands need to be idempotent.  This helps us test this by
simply issuing any successful command a second time so that we notice
when a dup submission fails.
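
A rough Python sketch of the idea. Only the environment variable name comes from the commit; `run_command` and the `send` callable are hypothetical, and the real CLI may differ in how it reports the second result.

```python
import errno
import os

def run_command(send, cmd, env=os.environ):
    """Send cmd; if it succeeded and the test flag is set, send it again.

    `send` is any callable returning (return_code, output).
    """
    ret, out = send(cmd)
    if ret == 0 and env.get("CEPH_CLI_TEST_DUP_COMMAND"):
        # Re-submit: an idempotent command must succeed the second time too.
        ret, out = send(cmd)
    return ret, out

def make_non_idempotent():
    """Fake command handler that fails with EEXIST on a duplicate."""
    seen = set()
    def send(cmd):
        if cmd in seen:
            return -errno.EEXIST, "already exists"
        seen.add(cmd)
        return 0, "created"
    return send
```

Running a test suite with the flag set turns every successful command into a duplicate-submission test, so a handler like the fake one above is caught immediately.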

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/MDSMonitor: make 'mds cluster_{up,down}' idempotent
Sage Weil [Tue, 16 Jul 2013 23:26:57 +0000 (16:26 -0700)]
mon/MDSMonitor: make 'mds cluster_{up,down}' idempotent

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosdmaptool: fix cli tests
Sage Weil [Tue, 16 Jul 2013 23:10:08 +0000 (16:10 -0700)]
osdmaptool: fix cli tests

From the HASHPSPOOL change in acbc2f0bc0b4266125403aebb28e6e3a2365394d.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-ceph-disk' into next
Sage Weil [Tue, 16 Jul 2013 22:52:37 +0000 (15:52 -0700)]
Merge branch 'wip-ceph-disk' into next

Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
Tested-by: Jing Yuan Luke <jyluke@gmail.com>
12 years agoceph-disk: use /sys/block to determine partition device names
Sage Weil [Thu, 11 Jul 2013 19:59:56 +0000 (12:59 -0700)]
ceph-disk: use /sys/block to determine partition device names

Not all devices are basename + number; some have intervening character(s),
like /dev/cciss/c0d1p2.
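
A hedged Python sketch of the sysfs-based approach. The helper names and the `sysfs_block` parameter are made up for this example and are not ceph-disk's actual functions; the sketch relies on sysfs listing each partition as a subdirectory of its parent device and using '!' where the /dev name has '/'.

```python
import os

def list_partitions(dev_name, sysfs_block="/sys/block"):
    """Return partition names of a device given its kernel name.

    dev_name is the sysfs name, e.g. 'sda' or 'cciss!c0d1'.
    """
    base = os.path.join(sysfs_block, dev_name)
    return [entry for entry in sorted(os.listdir(base))
            if entry.startswith(dev_name)
            and os.path.isdir(os.path.join(base, entry))]

def dev_path(name):
    """Map a sysfs name back to a /dev path ('!' stands for '/')."""
    return "/dev/" + name.replace("!", "/")
```

Asking the kernel which partitions exist avoids guessing at naming schemes such as the extra 'p' in c0d1p2.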

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-disk: reimplement is_partition() using /sys/block
Sage Weil [Wed, 3 Jul 2013 18:01:58 +0000 (11:01 -0700)]
ceph-disk: reimplement is_partition() using /sys/block

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-disk: use get_dev_name() helper throughout
Sage Weil [Wed, 3 Jul 2013 18:01:39 +0000 (11:01 -0700)]
ceph-disk: use get_dev_name() helper throughout

This is more robust than the broken split trick.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-disk: refactor list_[all_]partitions
Sage Weil [Wed, 3 Jul 2013 17:55:36 +0000 (10:55 -0700)]
ceph-disk: refactor list_[all_]partitions

Make these methods work in terms of device *names*, not paths, and fix up
the only direct list_partitions() caller to do the same.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-disk: add get_dev_name, path helpers
Sage Weil [Wed, 3 Jul 2013 17:52:29 +0000 (10:52 -0700)]
ceph-disk: add get_dev_name, path helpers

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/OSDMonitor: fix typo
Sage Weil [Tue, 16 Jul 2013 22:36:53 +0000 (15:36 -0700)]
mon/OSDMonitor: fix typo

From 5eac38797d9eb5a59fcff1d81571cff7a2f10e66

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy
Sage Weil [Tue, 16 Jul 2013 22:28:07 +0000 (15:28 -0700)]
osd/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy

Ensure that the snap does in fact exist before we try to remove it.  This
avoids a crash where we get two dup rmsnap requests (due to thrashing, or
a reconnect, or something), the committed (p) value does have the snap, but
the uncommitted (pp) does not.  This fails the old test such that we try
to remove it from pp again, and assert.

Restructure the flow so that it is easier to distinguish the committed
short return from the uncommitted return (which must still wait for the
commit).
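
The restructured flow might be pictured roughly as in the Python sketch below. It uses sets of snap names for the committed (p) and pending (pp) pool state; the function name and return values are hypothetical and the real prepare_command() logic is more involved.

```python
def prepare_rmsnap(committed, pending, snap):
    """Handle 'osd pool rmsnap' against committed and pending snap sets.

    Returns "done" when the removal is already committed, or
    "wait_for_commit" when the caller must wait for the proposal.
    """
    if snap not in committed and snap not in pending:
        return "done"                 # committed short return
    if snap in pending:
        pending.remove(snap)          # only remove what actually exists
    # Either we just queued the removal or an earlier dup request did.
    return "wait_for_commit"
```

A duplicate request arriving before the commit finds the snap in the committed set but not in the pending one; it now simply waits instead of trying to remove the snap a second time.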

     0> 2013-07-16 14:21:27.189060 7fdf301e9700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_snap(snapid_t)' thread 7fdf301e9700 time 2013-07-16 14:21:27.187095
osd/osd_types.cc: 662: FAILED assert(snaps.count(s))

 ceph version 0.66-602-gcd39d8a (cd39d8a6727d81b889869e98f5869e4227b50720)
 1: (pg_pool_t::remove_snap(snapid_t)+0x6d) [0x7ad6dd]
 2: (OSDMonitor::prepare_command(MMonCommand*)+0x6407) [0x5c1517]
 3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1fb) [0x5c41ab]
 4: (PaxosService::dispatch(PaxosServiceMessage*)+0x937) [0x598c87]
 5: (Monitor::handle_command(MMonCommand*)+0xe56) [0x56ec36]
 6: (Monitor::_ms_dispatch(Message*)+0xd1d) [0x5719ad]
 7: (Monitor::handle_forward(MForward*)+0x821) [0x572831]
 8: (Monitor::_ms_dispatch(Message*)+0xe44) [0x571ad4]
 9: (Monitor::ms_dispatch(Message*)+0x32) [0x588c52]
 10: (DispatchQueue::entry()+0x549) [0x7cf1d9]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7060fd]
 12: (()+0x7e9a) [0x7fdf35165e9a]
 13: (clone()+0x6d) [0x7fdf334fcccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agoObjectStore: add omap_rmkeyrange to dump
Samuel Just [Tue, 16 Jul 2013 17:53:51 +0000 (10:53 -0700)]
ObjectStore: add omap_rmkeyrange to dump

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoOSD: add perfcounter tracking messages delayed pending a map
Samuel Just [Mon, 15 Jul 2013 23:12:07 +0000 (16:12 -0700)]
OSD: add perfcounter tracking messages delayed pending a map

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoFileStore: add a perf counter for time spent acquiring op queue throttle
Samuel Just [Mon, 15 Jul 2013 23:11:08 +0000 (16:11 -0700)]
FileStore: add a perf counter for time spent acquiring op queue throttle

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-4779' into next
Sage Weil [Tue, 16 Jul 2013 22:24:03 +0000 (15:24 -0700)]
Merge branch 'wip-4779' into next

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #439 from yehudasa/wip-rgw-next
Gregory Farnum [Tue, 16 Jul 2013 22:17:25 +0000 (15:17 -0700)]
Merge pull request #439 from yehudasa/wip-rgw-next

rgw: quiet down ECANCELED on put_obj_meta()
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomon/OSDMonitor: return error if we can't set the new bucket's name
Sage Weil [Mon, 15 Jul 2013 23:12:50 +0000 (16:12 -0700)]
mon/OSDMonitor: return error if we can't set the new bucket's name

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agocrush: return EINVAL on invalid name from {insert,update,create_or_move}_item, set_it...
Sage Weil [Mon, 15 Jul 2013 23:12:23 +0000 (16:12 -0700)]
crush: return EINVAL on invalid name from {insert,update,create_or_move}_item, set_item_name

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agocrush: add is_valid_crush_name() helper
Sage Weil [Mon, 15 Jul 2013 22:55:39 +0000 (15:55 -0700)]
crush: add is_valid_crush_name() helper

[A-Za-z0-9-_.]+
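
For illustration, the same character class expressed as a Python check. This is a sketch and not the C++ helper itself; the hyphen is escaped here only to make it unambiguous inside the class.

```python
import re

# Letters, digits, '-', '_' and '.'; at least one character.
_VALID_NAME = re.compile(r"[A-Za-z0-9\-_.]+")

def is_valid_crush_name(name):
    """True if the whole name consists of allowed characters."""
    return _VALID_NAME.fullmatch(name) is not None
```

Using a full match matters: a search would accept any string that merely contains one valid character.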

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agomon: OSDMonitor: only thrash and propose if we are the leader
Joao Eduardo Luis [Tue, 16 Jul 2013 22:02:55 +0000 (23:02 +0100)]
mon: OSDMonitor: only thrash and propose if we are the leader

'thrash_map' is only set if we are the leader, so we would thrash and
propose the pending value only if we are the leader.  However, we should keep
the 'is_leader()' check not only for clarity's sake (an unfamiliar reader
may cry OMGBUG, prompting a patch much like this), but also because
we may lose a subsequent election and become a peon instead, while still
holding a 'thrash_map' value > 0 -- and we really don't want to propose
while being a peon.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomon/MDSMonitor: make 'ceph mds remove_data_pool ...' idempotent
Sage Weil [Tue, 16 Jul 2013 21:52:16 +0000 (14:52 -0700)]
mon/MDSMonitor: make 'ceph mds remove_data_pool ...' idempotent

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/OSDMonitor: clean up waiting_for_map messages on shutdown
Sage Weil [Tue, 16 Jul 2013 21:49:55 +0000 (14:49 -0700)]
mon/OSDMonitor: clean up waiting_for_map messages on shutdown

Do not leak these.

Fixes: #5643
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agomon/OSDMonitor: send_to_waiting() in on_active()
Sage Weil [Tue, 16 Jul 2013 21:49:33 +0000 (14:49 -0700)]
mon/OSDMonitor: send_to_waiting() in on_active()

The send_latest() helper may put a message in the waiting_for_map list
if we are not readable, but currently send_to_waiting() is only called
from update_from_paxos(), and it is possible that we may be unreadable
but not get a map update.

Instead, share the map when we are active.  Do the same for check_subs(),
which is also about sharing the *new* map.  Leave
share_map_with_random_osd() and process_failures() which are not
concerned with whether this is the latest map or not.

This problem surfaced when we changed the timing of refresh relative to
paxos commit, since update_from_paxos() is now not normally called while
readable; see f1ce8d7c955a2443111bf7d9e16b4c563d445712 and
c711203c0d4b924e5951aa808b243bf06e7ad23a.

Fixes: #5643
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agorgw: quiet down ECANCELED on put_obj_meta() 439/head
Yehuda Sadeh [Tue, 16 Jul 2013 20:42:03 +0000 (13:42 -0700)]
rgw: quiet down ECANCELED on put_obj_meta()

Fixes: #5439
ECANCELED there means that we lost in a race to write the object. We
should treat it as a successful write. This is reviving an old behavior
that was changed inadvertently.
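
In Python terms, the behaviour described could be sketched like this. The `finish_put_obj_meta` wrapper is hypothetical, and the negative-errno return convention is assumed for the sketch.

```python
import errno

def finish_put_obj_meta(ret):
    """Map the result of a put_obj_meta-style call to what the caller sees.

    ret is 0 on success or a negative errno value.
    """
    if ret == -errno.ECANCELED:
        # We lost a race to write the object; someone else's write landed,
        # so report success rather than an error.
        return 0
    return ret
```

Other errors pass through unchanged, so only the lost-race case is quieted.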

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoosd: do not enable HASHPSPOOL pool feature by default
Sage Weil [Tue, 16 Jul 2013 20:41:08 +0000 (13:41 -0700)]
osd: do not enable HASHPSPOOL pool feature by default

This was added in kernel 3.9 and should not yet be enabled by default.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-disk: rely on /dev/disk/by-partuuid instead of special-casing journal symlinks
Sage Weil [Tue, 16 Jul 2013 20:14:50 +0000 (13:14 -0700)]
ceph-disk: rely on /dev/disk/by-partuuid instead of special-casing journal symlinks

This was necessary when ceph-disk-udev didn't create the by-partuuid (and
other) symlinks for us, but now it is fragile and error-prone.  (It also
appears to be broken on a certain customer RHEL VM.)  See
d7f7d613512fe39ec883e11d201793c75ee05db1.

Instead, just use the by-partuuid symlinks that we spent all that ugly
effort generating.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agoPendingReleaseNotes: formatted ceph CLI output and ceph-rest-api
Dan Mick [Tue, 16 Jul 2013 01:27:15 +0000 (18:27 -0700)]
PendingReleaseNotes: formatted ceph CLI output and ceph-rest-api

Signed-off-by: Dan Mick <dan.mick@inktank.com>
12 years agomon: Monitor: StoreConverter: clearer debug message on 'needs_conversion()'
Joao Eduardo Luis [Tue, 16 Jul 2013 15:49:48 +0000 (16:49 +0100)]
mon: Monitor: StoreConverter: clearer debug message on 'needs_conversion()'

The previous debug message outputted the function's name, as often our
functions do.  This was however a source of bewilderment, as users would
see those in logs and think their stores would need conversion.  Changing
this message is trivial enough and it will make ceph users happier log
readers.

Backport: cuttlefish
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomon: Monitor: StoreConverter: sanitize 'store' pointer on init
Joao Eduardo Luis [Tue, 16 Jul 2013 15:46:53 +0000 (16:46 +0100)]
mon: Monitor: StoreConverter: sanitize 'store' pointer on init

We are supposed to have umount'ed the store and set the pointer to NULL.
We should not tolerate any other case on init().

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomon: Monitor: do not reopen MonitorDBStore during conversion
Joao Eduardo Luis [Tue, 16 Jul 2013 15:45:39 +0000 (16:45 +0100)]
mon: Monitor: do not reopen MonitorDBStore during conversion

We already open the store on ceph_mon.cc, before we start the conversion.
Given we are unable to reproduce this every time a conversion is triggered,
we are led to believe that this causes a race in leveldb that will lead
to 'store.db/LOCK' being locked upon the open this patch removes.

Regardless, reopening the db here is pointless as we already did it when
we reach Monitor::StoreConverter::convert().

Fixes: #5640
Backport: cuttlefish

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #438 from yehudasa/wip-rgw-next
Gregory Farnum [Tue, 16 Jul 2013 16:33:52 +0000 (09:33 -0700)]
Merge pull request #438 from yehudasa/wip-rgw-next

Fix an issue with bucket placements and with listing on new installations.

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agorgw: handle ENOENT when listing bucket metadata entries 438/head
Yehuda Sadeh [Tue, 16 Jul 2013 01:43:56 +0000 (18:43 -0700)]
rgw: handle ENOENT when listing bucket metadata entries

Just return success (with an empty list)

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: fix bucket placement assignment
Yehuda Sadeh [Mon, 15 Jul 2013 22:49:42 +0000 (15:49 -0700)]
rgw: fix bucket placement assignment

When we set bucket.instance meta, we need to set
the correct bucket placement to the bucket (according to
the specific placement rule). However, it might be that
bucket placement was never configured and we just go by
the defaults, using the old legacy pools selection.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoOSD: add config option for peering_wq batch size
Samuel Just [Mon, 15 Jul 2013 20:44:20 +0000 (13:44 -0700)]
OSD: add config option for peering_wq batch size

Large peering_wq batch sizes may excessively delay
peering messages, resulting in unreasonably long
peering.  Making the batch size configurable allows it
to be lowered, which may speed up peering.

Backport: cuttlefish
Related: #5084
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomon: make report pure json
Sage Weil [Mon, 15 Jul 2013 21:29:14 +0000 (14:29 -0700)]
mon: make report pure json

Put the crc in the status string and drop the header and footer.  If users
want to capture it,

ceph report 2>&1 > foo.txt

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-mon-report' into next
Sage Weil [Mon, 15 Jul 2013 21:23:40 +0000 (14:23 -0700)]
Merge remote-tracking branch 'gh/wip-mon-report' into next

12 years agoceph: drop --threshold hack for 'pg dump_stuck'
Sage Weil [Sat, 13 Jul 2013 21:09:10 +0000 (14:09 -0700)]
ceph: drop --threshold hack for 'pg dump_stuck'

We can live with the incompatibility here; the hack is currently
not working anyway (see #5623).

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agomsg/Pipe: be a bit more explicit about encoding outgoing messages
Sage Weil [Sun, 14 Jul 2013 15:55:52 +0000 (08:55 -0700)]
msg/Pipe: be a bit more explicit about encoding outgoing messages

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomessages/MClientReconnect: clear data when encoding
Sage Weil [Sun, 14 Jul 2013 22:54:29 +0000 (15:54 -0700)]
messages/MClientReconnect: clear data when encoding

The MClientReconnect puts everything in the data payload portion of
the message and nothing in the front portion.  That means that if the
message is resent (socket failure or something), the messenger thinks it
hasn't been encoded yet (front empty) and reencodes, which means
everything gets added (again) to the data portion.

Decoding keeps going until it runs out of data, so the second copy
means we decode garbage snap realms, leading to the crash in bug

Clearing data each time around resolves the problem, although it does
mean we do the encoding work multiple times.  We could alternatively
(or also) stick some data in the front portion of the payload
(ignored), but that changes the wire protocol and I would rather not
do that.
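
A toy Python model of the problem and the fix. The class, the 4-byte record format, and the method names are invented for this sketch and do not reflect the real wire encoding.

```python
class ReconnectMessage:
    """Message that encodes everything into `data` and leaves `front` empty."""

    def __init__(self, realms):
        self.realms = list(realms)
        self.front = b""
        self.data = bytearray()

    def encode_payload(self):
        # The fix: start from a clean buffer on every encode, because an
        # empty `front` makes a resend look like "not encoded yet".
        self.data.clear()
        for realm in self.realms:
            self.data += realm.to_bytes(4, "little")

def decode(data):
    """Decode 4-byte records until the buffer runs out."""
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data), 4)]

msg = ReconnectMessage([1, 2])
msg.encode_payload()
msg.encode_payload()          # resend after a socket failure
print(decode(msg.data))
```

Without the `clear()` call the second encode would append a second copy, and a decoder that reads until the data runs out would see twice as many records.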

Fixes: #4565
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoMerge pull request #436 from ceph/wip-mon-fixes
Sage Weil [Mon, 15 Jul 2013 20:46:43 +0000 (13:46 -0700)]
Merge pull request #436 from ceph/wip-mon-fixes

Wip mon fixes

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomon: set forwarded message recv stamp 436/head
Sage Weil [Sat, 13 Jul 2013 04:52:30 +0000 (21:52 -0700)]
mon: set forwarded message recv stamp

Set it to the stamp of the MForward that carried us.  One could argue
we really want the original receive stamp on the origin, but that is
not available to us, and this is better than nothing.

In particular, this gives 'ceph log ...' commands a timestamp when they
are forwarded via a peon.  The stamp is still between when the request
is sent and when it is committed/acked, so all is well from the
client's perspective.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: drop win_election() _reset() kludge and strengthen assertions
Sage Weil [Sat, 13 Jul 2013 15:38:40 +0000 (08:38 -0700)]
mon: drop win_election() _reset() kludge and strengthen assertions

This is only there for the benefit of win_standalone_election(), but it
doesn't need it, it clutters the code, and weakens our assertions.

Now the only win_election() callers are win_standalone_election() (which
is a single path that just did _reset()) and from the elector.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: set peon state to electing if other mons call an election
Sage Weil [Sat, 13 Jul 2013 15:36:25 +0000 (08:36 -0700)]
mon: set peon state to electing if other mons call an election

Previously we would call mon->reset() and set various flags (like
exited_quorum timestamp), but the state would remain PEON.  Make an
explicit join_election() callback and set the state there, and add
asserts in reset() (renamed to be private) so that we ensure all
callers are well-behaved.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: once sync full is chosen, make sure we don't change our mind
Sage Weil [Sat, 13 Jul 2013 15:11:45 +0000 (08:11 -0700)]
mon: once sync full is chosen, make sure we don't change our mind

It is possible for a sequence like:

 - probe
 - first probe reply has paxos trim that indicates a full sync is
   needed
 - start sync
 - clear store
 - something happens that makes us abort and bootstrap (e.g., the
   provider mon restarts)
 - probe
 - first probe reply has older paxos trim bound and we call an election
 - on election completion, we crash because we have no data.

Non-determinism of the probe decision aside, we need to ensure that
the info we share during probe (fc, lc) is accurate, and that once we
clear the store we know we *must* do a full sync.

Fixes: #5621
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/PaxosService: consolidate resetting in restart()
Sage Weil [Mon, 15 Jul 2013 19:56:26 +0000 (12:56 -0700)]
mon/PaxosService: consolidate resetting in restart()

We had duplicated code in election_finished() and restart(), and it was
incomplete.  Put it all in restart() only (the mon should have called
restart() long before the election finishes).  Note that we cannot
assert as much in election_finished() because another service may have
just cross-proposed.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/PaxosService: assert not proposing in propose_pending
Sage Weil [Fri, 12 Jul 2013 22:04:00 +0000 (15:04 -0700)]
mon/PaxosService: assert not proposing in propose_pending

Drop the useless active check after the assert, too.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/Paxos: separate proposal commit from the end of the round
Sage Weil [Sat, 13 Jul 2013 04:20:05 +0000 (21:20 -0700)]
mon/Paxos: separate proposal commit from the end of the round

Each commit should match with exactly one proposal; finish it when we
actually commit it and make sensible asserts.

The old finish_proposal() turns into finish_round(), and performs
generic checks and cleanup associated with the transition from
updating -> active.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/Paxos: make all handle_accept paths go via out label
Sage Weil [Mon, 15 Jul 2013 19:57:40 +0000 (12:57 -0700)]
mon/Paxos: make all handle_accept paths go via out label

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'next'
Sage Weil [Mon, 15 Jul 2013 20:20:32 +0000 (13:20 -0700)]
Merge branch 'next'

12 years agomon: fix scrub vs paxos race: refresh on commit, not round completion
Sage Weil [Fri, 12 Jul 2013 21:47:09 +0000 (14:47 -0700)]
mon: fix scrub vs paxos race: refresh on commit, not round completion

Consider:

 - paxos starts a commit N+1
 - a majority of the peers ack it
  - paxos::commit() writes N+1 to disk
  - tells peers to commit
 - peers commit N+1, *and* refresh_from_paxos(), and generate N+1 full map
 - leader does _scrub on N+1, without latest full osdmap
 - peers do _scrub on N+1, with latest full osdmap
 - leader finishes paxos gather, does refresh_from_paxos()
 -> scrub fails.

Fix this by doing the refresh_from_paxos() at commit time and not when
the paxos round finishes.  We move the refresh out of finish_proposal
and into its own helper, and update all callers accordingly.  This
keeps on-disk state more tightly in sync with in-memory state and
avoids the need for, e.g., a kludgey workaround in the scrub code.

We also simplify the bootstrap checks a bit by doing so immediately
and relying on the normal bootstrap paxos reset paths to clean up
any waiters.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #437 from kri5/wip-fix-typo-rgw
Yehuda Sadeh [Mon, 15 Jul 2013 19:20:15 +0000 (12:20 -0700)]
Merge pull request #437 from kri5/wip-fix-typo-rgw

rgw: Fix typo in rgw_user.cc

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoMerge remote-tracking branch 'origin/wip-rgw-warnings' into next
Yehuda Sadeh [Mon, 15 Jul 2013 18:17:03 +0000 (11:17 -0700)]
Merge remote-tracking branch 'origin/wip-rgw-warnings' into next

Conflicts:
src/test/test_rgw_admin_log.cc
src/test/test_rgw_admin_meta.cc
src/test/test_rgw_admin_opstate.cc

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: fix bucket instance json encoding
Yehuda Sadeh [Sat, 13 Jul 2013 03:26:06 +0000 (20:26 -0700)]
rgw: fix bucket instance json encoding

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw_admin: fix gc list encoding
Yehuda Sadeh [Sat, 13 Jul 2013 02:10:25 +0000 (19:10 -0700)]
rgw_admin: fix gc list encoding

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoMerge pull request #434 from gregsfortytwo/next
Yehuda Sadeh [Mon, 15 Jul 2013 17:24:18 +0000 (10:24 -0700)]
Merge pull request #434 from gregsfortytwo/next

test_rgw: fix a number of unsigned/signed comparison warnings

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agodoc: Fixed link in Calxeda repo instruction.
John Wilkins [Mon, 15 Jul 2013 17:05:28 +0000 (10:05 -0700)]
doc: Fixed link in Calxeda repo instruction.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agomon: once sync full is chosen, make sure we don't change our mind
Sage Weil [Sat, 13 Jul 2013 15:11:45 +0000 (08:11 -0700)]
mon: once sync full is chosen, make sure we don't change our mind

It is possible for a sequence like:

 - probe
 - first probe reply has paxos trim that indicates a full sync is
   needed
 - start sync
 - clear store
 - something happens that makes us abort and bootstrap (e.g., the
   provider mon restarts)
 - probe
 - first probe reply has older paxos trim bound and we call an election
 - on election completion, we crash because we have no data.

Non-determinism of the probe decision aside, we need to ensure that
the info we share during probe (fc, lc) is accurate, and that once we
clear the store we know we *must* do a full sync.
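
One way to picture the "don't change our mind" rule is a marker that is
written when the store is cleared and checked before any later probe
decision. The sketch below assumes a toy key/value store; ToyStore, the
"sync_in_progress" key, and decide() are hypothetical, and the real
probe criteria are more involved.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical toy store; not the monitor's real store interface.
struct ToyStore {
  std::map<std::string, std::string> kv;

  bool sync_in_progress() const {
    return kv.count("sync_in_progress") > 0;
  }

  // Clearing the store also records that a full sync is now mandatory.
  void clear_for_full_sync() {
    kv.clear();
    kv["sync_in_progress"] = "1";
    assert(sync_in_progress());
  }
};

enum class ProbeAction { CallElection, FullSync };

// our_last_committed: newest version we claim to have (lc).
// peer_first_committed: oldest version the peer still keeps (fc).
ProbeAction decide(const ToyStore& store,
                   int our_last_committed,
                   int peer_first_committed) {
  // Once the store has been cleared, the decision is no longer open.
  if (store.sync_in_progress()) return ProbeAction::FullSync;
  // Otherwise a full sync is needed only if the peer has trimmed past us.
  if (our_last_committed < peer_first_committed) return ProbeAction::FullSync;
  return ProbeAction::CallElection;
}
```

Here a second probe reply with an older trim bound cannot lead to an
election on an empty store, because the marker is checked first.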

Fixes: #5621
Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agorgw: fix more warnings
Sage Weil [Mon, 15 Jul 2013 16:58:08 +0000 (09:58 -0700)]
rgw: fix more warnings

test/test_rgw_admin_opstate.cc: In member function 'int admin_log::test_helper::extract_input(int, char**)':
warning: test/test_rgw_admin_opstate.cc:129:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_opstate.cc:131:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_opstate.cc:133:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_opstate.cc:135:24: comparison between signed and unsigned integer expressions [-Wsign-compare]

test/test_rgw_admin_log.cc: In member function 'int admin_log::test_helper::extract_input(int, char**)':
warning: test/test_rgw_admin_log.cc:132:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_log.cc:134:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_log.cc:136:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_log.cc:138:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
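
For reference, -Wsign-compare fires when a signed and an unsigned value
are compared directly. The sketch below shows two common ways to avoid
it; count_nonempty() and within_argc() are hypothetical helpers and are
not taken from the test files named above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: counts non-empty strings in args.
std::size_t count_nonempty(const std::vector<std::string>& args) {
  std::size_t n = 0;
  // Declaring i as int here would trigger -Wsign-compare against
  // args.size(), which is unsigned; use the container's index type.
  for (std::size_t i = 0; i < args.size(); ++i) {
    if (!args[i].empty()) ++n;
  }
  assert(n <= args.size());
  return n;
}

// Hypothetical helper: checks an unsigned index against a signed argc.
bool within_argc(int argc, std::size_t index) {
  // Rule out negative values first, then compare as unsigned.
  return argc > 0 && index < static_cast<std::size_t>(argc);
}
```

Checking the sign before the cast matters: casting a negative int to
std::size_t yields a very large value, which would make the comparison
succeed when it should not.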

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agotest_rgw: fix a number of unsigned/signed comparison warnings 434/head
Greg Farnum [Fri, 12 Jul 2013 23:38:39 +0000 (16:38 -0700)]
test_rgw: fix a number of unsigned/signed comparison warnings

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agorgw: Fix typo in rgw_user.cc 433/head 437/head
Christophe Courtaut [Mon, 15 Jul 2013 11:47:58 +0000 (13:47 +0200)]
rgw: Fix typo in rgw_user.cc

Signed-off-by: Christophe Courtaut <christophe.courtaut@gmail.com>
12 years agotest_rgw_admin_meta: fix warnings
Sage Weil [Sun, 14 Jul 2013 23:37:45 +0000 (16:37 -0700)]
test_rgw_admin_meta: fix warnings

test/test_rgw_admin_meta.cc: In member function 'int admin_meta::test_helper::extract_input(int, char**)':
warning: test/test_rgw_admin_meta.cc:126:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_meta.cc:128:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_meta.cc:130:24: comparison between signed and unsigned integer expressions [-Wsign-compare]
warning: test/test_rgw_admin_meta.cc:132:24: comparison between signed and unsigned integer expressions [-Wsign-compare]

on armv7l

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agocls_rgw: fix warning
Sage Weil [Sun, 14 Jul 2013 23:36:21 +0000 (16:36 -0700)]
cls_rgw: fix warning

cls/rgw/cls_rgw.cc: In function 'int get_obj_vals(cls_method_context_t, const string&, const string&, int, std::map, ceph::buffer::list>*)':
warning: cls/rgw/cls_rgw.cc:175:28: narrowing conversion of '129' from 'int' to 'char' inside { } is ill-formed in C++11 [-Wnarrowing]

on armv7l
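
A -Wnarrowing diagnostic of this kind appears when a braced initializer
holds an int constant that is not representable in the target's char.
The sketch below shows one common fix, an explicit cast; make_marker()
is a hypothetical function, and 129 simply mirrors the value in the
warning rather than the actual cls_rgw code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper; 129 mirrors the number in the warning text.
std::string make_marker() {
  // Writing `{ 129, 0 }` would narrow int to char inside braces.
  // The explicit cast makes the conversion intentional.
  const char buf[] = { static_cast<char>(129), 0 };
  assert(static_cast<unsigned char>(buf[0]) == 129);
  return std::string(buf);
}
```

Reading the byte back through unsigned char gives 129 regardless of
whether plain char is signed or unsigned on the build platform.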

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-hadoop-doc'
Noah Watkins [Sun, 14 Jul 2013 23:32:28 +0000 (16:32 -0700)]
Merge branch 'wip-hadoop-doc'

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
12 years agodoc: update Hadoop docs with plugin download
Noah Watkins [Sun, 24 Mar 2013 18:22:11 +0000 (11:22 -0700)]
doc: update Hadoop docs with plugin download

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
12 years agodoc: document new hadoop config options
Noah Watkins [Sun, 17 Mar 2013 19:10:16 +0000 (12:10 -0700)]
doc: document new hadoop config options

Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
12 years agoMakefile: cls_rgw needs cls_rgw_types linked in now too
Sage Weil [Sun, 14 Jul 2013 23:24:46 +0000 (16:24 -0700)]
Makefile: cls_rgw needs cls_rgw_types linked in now too

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: include some (basic) auth info in report 430/head
Sage Weil [Sun, 14 Jul 2013 23:20:54 +0000 (16:20 -0700)]
mon: include some (basic) auth info in report

Nothing privileged!

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: include paxos info in report
Sage Weil [Sun, 14 Jul 2013 23:16:55 +0000 (16:16 -0700)]
mon: include paxos info in report

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: move quorum out of monmap
Sage Weil [Sun, 14 Jul 2013 23:12:34 +0000 (16:12 -0700)]
mon: move quorum out of monmap

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: include service first_committed in report
Sage Weil [Sun, 14 Jul 2013 23:12:16 +0000 (16:12 -0700)]
mon: include service first_committed in report

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #429 from llonchj/hypertable_changes
Sage Weil [Sun, 14 Jul 2013 22:32:41 +0000 (15:32 -0700)]
Merge pull request #429 from llonchj/hypertable_changes

Hypertable changes

12 years agoUse mon_host instead of mon_addr in ceph_conf 429/head
Jordi Llonch [Sun, 14 Jul 2013 21:28:02 +0000 (07:28 +1000)]
Use mon_host instead of mon_addr in ceph_conf

Signed-off-by: Jordi Llonch <llonchj@gmail.com>
12 years agohypertable recent version prototyping includes bool verify in length and read functions
Jordi Llonch [Sun, 14 Jul 2013 21:26:58 +0000 (07:26 +1000)]
hypertable recent version prototyping includes bool verify in length and read functions

Signed-off-by: Jordi Llonch <llonchj@gmail.com>
12 years agoMakefile: build cls_rgw even if we're not building radosgw
Sage Weil [Sat, 13 Jul 2013 21:56:21 +0000 (14:56 -0700)]
Makefile: build cls_rgw even if we're not building radosgw

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMakefile: fix cls_rgw linkage
Sage Weil [Sat, 13 Jul 2013 21:51:26 +0000 (14:51 -0700)]
Makefile: fix cls_rgw linkage

Broken by 0c83b5fec233b7fc63205233403e7df32139d039.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMakefile: fix cls_refcount linkage
Sage Weil [Sat, 13 Jul 2013 21:02:41 +0000 (14:02 -0700)]
Makefile: fix cls_refcount linkage

Broken by d0bee5d85c1c862639450bad69c0ad20a98ef5c9.

Fixes: #5622
Signed-off-by: Sage Weil <sage@inktank.com>