git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
12 years agoRevert "mon: OSDMonitor: only thrash and propose if we are the leader"
Sage Weil [Fri, 19 Jul 2013 23:23:04 +0000 (16:23 -0700)]
Revert "mon: OSDMonitor: only thrash and propose if we are the leader"

This reverts commit 5eac38797d9eb5a59fcff1d81571cff7a2f10e66.

12 years agoRevert "mon/OSDMonitor: fix typo"
Sage Weil [Fri, 19 Jul 2013 23:22:48 +0000 (16:22 -0700)]
Revert "mon/OSDMonitor: fix typo"

This reverts commit d656aed599ee754646e16386ce5a4ab0117f2d6e.

12 years agomon: improve osdmap subscription debug output
Sage Weil [Fri, 19 Jul 2013 21:50:03 +0000 (14:50 -0700)]
mon: improve osdmap subscription debug output

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-stats' into next
Sage Weil [Fri, 19 Jul 2013 21:49:25 +0000 (14:49 -0700)]
Merge remote-tracking branch 'gh/wip-stats' into next

Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge branch 'wip-rgw-next-2' into next
Greg Farnum [Fri, 19 Jul 2013 20:25:48 +0000 (13:25 -0700)]
Merge branch 'wip-rgw-next-2' into next

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agorgw: remove extra unused param from RGWRados::get_attr()
Yehuda Sadeh [Fri, 19 Jul 2013 20:06:53 +0000 (13:06 -0700)]
rgw: remove extra unused param from RGWRados::get_attr()

No user for the extra obj_version param.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agocls_rgw: quiet down verbose log message
Yehuda Sadeh [Fri, 19 Jul 2013 18:19:05 +0000 (11:19 -0700)]
cls_rgw: quiet down verbose log message

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: replace logic that compares regions
Yehuda Sadeh [Fri, 19 Jul 2013 16:44:43 +0000 (09:44 -0700)]
rgw: replace logic that compares regions

The logic was a bit broken. Basically, we want to make sure
that region names are the same. However, if region name is not
set then we need to check whether it's the master region. This
can happen in upgrade cases where originally we didn't have
a region name set.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
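The comparison described in this commit can be sketched roughly as follows. This is an illustrative Python sketch, not the rgw C++ code: the function name `same_region` and its parameters are hypothetical, and it assumes the possibly unset name is the one recorded with the bucket.

```python
def same_region(bucket_region: str, local_region: str, local_is_master: bool) -> bool:
    """Return True if a bucket should be treated as belonging to the local region.

    Hypothetical helper; names and parameters are assumptions for illustration.
    """
    if bucket_region:
        # Normal case: the region names must match.
        return bucket_region == local_region
    # Upgrade case: no region name was recorded, so fall back to
    # checking whether the local region is the master region.
    return local_is_master
```

With this shape, a bucket created before region names existed only matches on the master region.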
12 years agorgw-admin: link / unlink should report errors
Yehuda Sadeh [Wed, 17 Jul 2013 23:14:02 +0000 (16:14 -0700)]
rgw-admin: link / unlink should report errors

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: fix time parsing in replica log
Yehuda Sadeh [Fri, 19 Jul 2013 04:50:51 +0000 (21:50 -0700)]
rgw: fix time parsing in replica log

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: bucket entry point object ver fixes
Yehuda Sadeh [Fri, 19 Jul 2013 00:40:52 +0000 (17:40 -0700)]
rgw: bucket entry point object ver fixes

Multiple fixes:
 - sync master, secondary entry point ver on creation
 - use correct entry point version when removing entry point
 - check correct version on bucket removal

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: remove s->objv_tracker
Yehuda Sadeh [Thu, 18 Jul 2013 20:07:55 +0000 (13:07 -0700)]
rgw: remove s->objv_tracker

It was never initialized correctly anyway. It was only supposed to
be used for buckets, but it was never initialized in that case.
Using s->bucket_info.objv_tracker instead.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: forward delete bucket request to master after removal
Yehuda Sadeh [Thu, 18 Jul 2013 18:16:15 +0000 (11:16 -0700)]
rgw: forward delete bucket request to master after removal

We can only forward the bucket removal to the master if it was
successfully removed locally.
The master region has no knowledge about whether the
bucket can be removed or not, e.g., there are still objects in the
bucket. If we send it to the master first, then it'll happily remove it
even though it might fail in the end.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
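The ordering argued for above (remove locally first, forward to the master only on success) might look like this minimal sketch. `local_remove` and `forward_to_master` are hypothetical callables, and the convention of returning 0 on success and a negative code on failure is an assumption.

```python
def delete_bucket(local_remove, forward_to_master, is_master: bool) -> int:
    """Remove a bucket locally, then forward the removal to the master region.

    Both callables are hypothetical and return 0 on success or a negative
    errno-style code on failure.
    """
    ret = local_remove()
    if ret < 0:
        # Local removal failed (e.g. the bucket still holds objects):
        # do not let the master remove its entry.
        return ret
    if not is_master:
        return forward_to_master()
    return 0
```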
12 years agorgw: adjust error for bucket removal on secondary region
Yehuda Sadeh [Thu, 18 Jul 2013 17:48:39 +0000 (10:48 -0700)]
rgw: adjust error for bucket removal on secondary region

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: forward x_amz_meta headers when forwarding a request
Yehuda Sadeh [Thu, 18 Jul 2013 00:20:30 +0000 (17:20 -0700)]
rgw: forward x_amz_meta headers when forwarding a request

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: fix bucket re-creation on secondary region
Yehuda Sadeh [Wed, 17 Jul 2013 23:34:50 +0000 (16:34 -0700)]
rgw: fix bucket re-creation on secondary region

We had a problem with bucket re-creation, where we identified
that the bucket already existed, but missed the fact that it's
the same bucket, so removal of the bucket index was wrong.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agomon/MonClient: fix small leak
Sage Weil [Thu, 18 Jul 2013 23:58:50 +0000 (16:58 -0700)]
mon/MonClient: fix small leak

We need to delete the version_req_d here.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: mark addr-based [lazy_]send_message and get_connection deprecated
Sage Weil [Thu, 18 Jul 2013 22:05:22 +0000 (15:05 -0700)]
msgr: mark addr-based [lazy_]send_message and get_connection deprecated

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoclient: mark_down by con
Sage Weil [Thu, 18 Jul 2013 21:50:32 +0000 (14:50 -0700)]
client: mark_down by con

We have the con handy; use it.  This avoids generating a spurious RESET
event, which we do not need or do anything useful with.  Note that in this
case we are not attaching anything to the Connection priv field.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: mark_down session by con, not addr
Sage Weil [Thu, 18 Jul 2013 21:46:57 +0000 (14:46 -0700)]
mon: mark_down session by con, not addr

We have the ConnectionRef here; use it.  This avoids generating a spurious
RESET event for the connection.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: break con <-> session ref cycle in mon even if shutting down
Sage Weil [Thu, 18 Jul 2013 21:44:17 +0000 (14:44 -0700)]
mon: break con <-> session ref cycle in mon even if shutting down

If we get a reset during shutdown, we should still break the cycle to avoid
tripping the valgrind leak detection.  Note that we are touching no
internal Monitor state here and the locking has not changed.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/SimpleMessenger: remove duplicated interface docs
Sage Weil [Thu, 18 Jul 2013 18:28:09 +0000 (11:28 -0700)]
msg/SimpleMessenger: remove duplicated interface docs

Document these in the interface, not the implementation; having two copies
clutters the header and invites them to get out of sync.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: update docs for mark_down, mark_down_all semantics
Sage Weil [Thu, 18 Jul 2013 17:53:04 +0000 (10:53 -0700)]
msgr: update docs for mark_down, mark_down_all semantics

* RESET events
* note that the reset detection only happens if it is enabled in the
  policy.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: generate reset event on mark_down to addr (not con)
Sage Weil [Wed, 17 Jul 2013 05:43:26 +0000 (22:43 -0700)]
msgr: generate reset event on mark_down to addr (not con)

If the caller is marking down an addr, they presumably don't have the
Connection* handy, so we should generate a reset event to help them
clean up con <-> session ref cycles.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd/ReplicatedPG: fix obc leak on invalid LIST_SNAPS op
Sage Weil [Thu, 18 Jul 2013 22:02:07 +0000 (15:02 -0700)]
osd/ReplicatedPG: fix obc leak on invalid LIST_SNAPS op

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: break con <-> session cycle when marking down old peers
Sage Weil [Thu, 18 Jul 2013 22:02:02 +0000 (15:02 -0700)]
osd: break con <-> session cycle when marking down old peers

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: make ms_handle_reset debug more useful
Sage Weil [Thu, 18 Jul 2013 22:01:53 +0000 (15:01 -0700)]
osd: make ms_handle_reset debug more useful

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agomon/PGMap: don't mangle stamp_delta in clear_delta()
Sage Weil [Fri, 19 Jul 2013 17:55:02 +0000 (10:55 -0700)]
mon/PGMap: don't mangle stamp_delta in clear_delta()

This is a delta, not a timestamp.

This was triggered when a cluster was idle for 2* the mon_delta_reset_interval,
and required a mon restart to fix.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: log PG state changes at level 5
Sage Weil [Wed, 10 Jul 2013 19:54:18 +0000 (12:54 -0700)]
osd: log PG state changes at level 5

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agomon/PGMap: avoid negative pg stats when calculating rates 446/head
Sage Weil [Fri, 19 Jul 2013 17:37:16 +0000 (10:37 -0700)]
mon/PGMap: avoid negative pg stats when calculating rates

We periodically see strange values come out of the estimated cluster
throughput and recovery rates.  Pretty sure this is caused by feeding
negative values into the rate arithmetic and then giving the si_t
helpers mangled (sign-extended + bit shifted) values.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/PGMap: use signed values for calculated rates
Sage Weil [Fri, 19 Jul 2013 17:39:17 +0000 (10:39 -0700)]
mon/PGMap: use signed values for calculated rates

si_t (and friends) does not handle signed values, but at least we can
give the Formatters unmangled values.  This shouldn't happen (tm), but
if it does this will make things a bit less confusing and makes the code
a bit less fragile.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoReplicatedPG: track temp collection contents, clear during on_change
Samuel Just [Fri, 19 Jul 2013 02:26:02 +0000 (19:26 -0700)]
ReplicatedPG: track temp collection contents, clear during on_change

We also assert in on_flushed() that the temp collection is actually
empty.

Fixes: #5670
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoPG, ReplicatedPG: pass a transaction down to ReplicatedPG::on_change
Samuel Just [Fri, 19 Jul 2013 02:25:14 +0000 (19:25 -0700)]
PG, ReplicatedPG: pass a transaction down to ReplicatedPG::on_change

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoosd: add floor() method to pg/osd stat structs
Sage Weil [Thu, 18 Jul 2013 04:52:50 +0000 (21:52 -0700)]
osd: add floor() method to pg/osd stat structs

We often want to maintain a nonnegative value.  We generalize
this to floors other than zero only because it makes the function
call make intuitive sense; I don't think it is at all useful.

Signed-off-by: Sage Weil <sage@inktank.com>
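As a rough illustration of the floor() idea, the sketch below clamps every field of a toy stat struct to a minimum value. The class and field names are illustrative only and do not reproduce the actual pg/osd stat structs.

```python
from dataclasses import dataclass


@dataclass
class StatSum:
    """Toy stand-in for a stat struct; field names are illustrative only."""
    num_bytes: int = 0
    num_objects: int = 0

    def floor(self, f: int) -> None:
        # Raise any field that fell below f back up to f;
        # floor(0) keeps every value nonnegative.
        self.num_bytes = max(self.num_bytes, f)
        self.num_objects = max(self.num_objects, f)
```

Calling `floor(0)` after subtracting two samples is the case the commit message has in mind.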
12 years agoosd: make pool_stat_t *log_size fields signed
Sage Weil [Thu, 18 Jul 2013 04:47:14 +0000 (21:47 -0700)]
osd: make pool_stat_t *log_size fields signed

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/MonClient: better debugging on version requests
Sage Weil [Fri, 19 Jul 2013 16:59:25 +0000 (09:59 -0700)]
mon/MonClient: better debugging on version requests

From leak hunting, but useful.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: work around incorrect features reported by earlier versions
Sage Weil [Thu, 18 Jul 2013 23:24:00 +0000 (16:24 -0700)]
msg/Pipe: work around incorrect features reported by earlier versions

If we see a peer reporting features ~0ull, we know they are deluded in a
particular way and should infer what features they *actually* have.  Do
this right when the features come over the wire to catch all users.

Fixes: #5655
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
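A minimal sketch of the workaround, assuming the correction is applied once at the point where the peer's feature mask is read. The `inferred` argument is a placeholder: the commit message does not say which mask the real code substitutes.

```python
ALL_FEATURES = 0xFFFFFFFFFFFFFFFF  # ~0ull: every feature bit set


def sanitize_features(reported: int, inferred: int) -> int:
    """Return the feature mask to trust for a peer.

    `inferred` is a placeholder for whatever mask the real code derives;
    its value is not given in the commit message.
    """
    if reported == ALL_FEATURES:
        # A peer claiming every possible feature is known to be mistaken.
        return inferred
    return reported
```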
12 years agoMessage,OSD,PG: make Connection::features private
Sage Weil [Fri, 19 Jul 2013 15:08:02 +0000 (08:08 -0700)]
Message,OSD,PG: make Connection::features private

Use has_feature() method too.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agotest: update cli test for radosgw-admin
Yehuda Sadeh [Fri, 19 Jul 2013 14:47:51 +0000 (07:47 -0700)]
test: update cli test for radosgw-admin

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoMerge pull request #448 from kri5/wip-5416
Yehuda Sadeh [Fri, 19 Jul 2013 14:20:51 +0000 (07:20 -0700)]
Merge pull request #448 from kri5/wip-5416

rgw: Adds --rgw-zone --rgw-region help text.

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: Adds --rgw-zone --rgw-region help text. 448/head
Christophe Courtaut [Fri, 19 Jul 2013 08:13:51 +0000 (10:13 +0200)]
rgw: Adds --rgw-zone --rgw-region help text.

Signed-off-by: Christophe Courtaut <christophe.courtaut@gmail.com>
12 years agomon/MonClient: fix small leak
Sage Weil [Thu, 18 Jul 2013 23:58:50 +0000 (16:58 -0700)]
mon/MonClient: fix small leak

We need to delete the version_req_d here.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #445 from ceph/wip-osd-leaks
Sage Weil [Fri, 19 Jul 2013 01:03:48 +0000 (18:03 -0700)]
Merge pull request #445 from ceph/wip-osd-leaks

fix msgr issues causing osd leaks on shutdown

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoinit-ceph: don't activate-all for vstart clusters
Sage Weil [Fri, 19 Jul 2013 00:10:51 +0000 (17:10 -0700)]
init-ceph: don't activate-all for vstart clusters

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/PGMonitor: fix 'pg map' output key names
Sage Weil [Thu, 18 Jul 2013 23:53:23 +0000 (16:53 -0700)]
mon/PGMonitor: fix 'pg map' output key names

This got lost in a big file of fixes a while back.  :/

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoPG: add perf counter for peering latency
Samuel Just [Thu, 18 Jul 2013 21:33:37 +0000 (14:33 -0700)]
PG: add perf counter for peering latency

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomsgr: mark addr-based [lazy_]send_message and get_connection deprecated 445/head
Sage Weil [Thu, 18 Jul 2013 22:05:22 +0000 (15:05 -0700)]
msgr: mark addr-based [lazy_]send_message and get_connection deprecated

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoclient: mark_down by con
Sage Weil [Thu, 18 Jul 2013 21:50:32 +0000 (14:50 -0700)]
client: mark_down by con

We have the con handy; use it.  This avoids generating a spurious RESET
event, which we do not need or do anything useful with.  Note that in this
case we are not attaching anything to the Connection priv field.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: mark_down session by con, not addr
Sage Weil [Thu, 18 Jul 2013 21:46:57 +0000 (14:46 -0700)]
mon: mark_down session by con, not addr

We have the ConnectionRef here; use it.  This avoids generating a spurious
RESET event for the connection.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: break con <-> session ref cycle in mon even if shutting down
Sage Weil [Thu, 18 Jul 2013 21:44:17 +0000 (14:44 -0700)]
mon: break con <-> session ref cycle in mon even if shutting down

If we get a reset during shutdown, we should still break the cycle to avoid
tripping the valgrind leak detection.  Note that we are touching no
internal Monitor state here and the locking has not changed.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/SimpleMessenger: remove duplicated interface docs
Sage Weil [Thu, 18 Jul 2013 18:28:09 +0000 (11:28 -0700)]
msg/SimpleMessenger: remove duplicated interface docs

Document these in the interface, not the implementation; having two copies
clutters the header and invites them to get out of sync.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: update docs for mark_down, mark_down_all semantics
Sage Weil [Thu, 18 Jul 2013 17:53:04 +0000 (10:53 -0700)]
msgr: update docs for mark_down, mark_down_all semantics

* RESET events
* note that the reset detection only happens if it is enabled in the
  policy.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: generate reset event on mark_down to addr (not con)
Sage Weil [Wed, 17 Jul 2013 05:43:26 +0000 (22:43 -0700)]
msgr: generate reset event on mark_down to addr (not con)

If the caller is marking down an addr, they presumably don't have the
Connection* handy, so we should generate a reset event to help them
clean up con <-> session ref cycles.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd/ReplicatedPG: fix obc leak on invalid LIST_SNAPS op
Sage Weil [Thu, 18 Jul 2013 22:02:07 +0000 (15:02 -0700)]
osd/ReplicatedPG: fix obc leak on invalid LIST_SNAPS op

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: break con <-> session cycle when marking down old peers
Sage Weil [Thu, 18 Jul 2013 22:02:02 +0000 (15:02 -0700)]
osd: break con <-> session cycle when marking down old peers

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoosd: make ms_handle_reset debug more useful
Sage Weil [Thu, 18 Jul 2013 22:01:53 +0000 (15:01 -0700)]
osd: make ms_handle_reset debug more useful

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agocls_lock: fix duration test
Sage Weil [Thu, 18 Jul 2013 21:06:41 +0000 (14:06 -0700)]
cls_lock: fix duration test

It's possible for us to just be really slow when getting the reply to the
first op or doing the second op, resulting in a successful lock.  If we
do get a success, assert that at least that amount of time has passed to
avoid any false positives.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
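The check described above could be sketched like this. The function name and the convention that 0 means the second lock attempt succeeded are assumptions; the real test is written in C++.

```python
def check_second_lock(result: int, elapsed: float, duration: float) -> None:
    """Validate the second lock attempt in a lock-duration test.

    result: 0 on success, negative errno-style code otherwise (assumed).
    elapsed: seconds since the first lock was taken.
    duration: lifetime of the first lock in seconds.
    """
    if result == 0 and elapsed < duration:
        # Success is only legitimate once the first lock has expired.
        raise AssertionError("lock succeeded before the first lock expired")
```

A slow run that succeeds after the lock has expired passes; only an early success is treated as a real failure.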
12 years agomds: tracedn should be NULL for LOOKUPINO/LOOKUPHASH reply
Yan, Zheng [Thu, 18 Jul 2013 02:01:09 +0000 (10:01 +0800)]
mds: tracedn should be NULL for LOOKUPINO/LOOKUPHASH reply

Fixes: #5658
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoFileStore: add global replay guard for split, collection_rename
Samuel Just [Thu, 18 Jul 2013 17:12:17 +0000 (10:12 -0700)]
FileStore: add global replay guard for split, collection_rename

In the event of a split or collection rename, we need to ensure that
we don't replay any operations on objects within those collections
prior to that point.  Thus, we mark a global replay guard on the
collection after doing a syncfs and make sure to check that in
_check_replay_guard() for all object operations.

Fixes: #5154
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
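A toy model of the guard check, assuming operations carry a monotonically increasing sequence number. The class and function names are illustrative and are not FileStore's; the syncfs step is only noted in a comment.

```python
class Collection:
    """Toy collection carrying a global replay guard (names are illustrative)."""

    def __init__(self) -> None:
        self.guard_seq = 0  # sequence number of the last split or rename

    def set_global_replay_guard(self, seq: int) -> None:
        # The real code performs a syncfs before recording the guard;
        # that step is omitted in this sketch.
        self.guard_seq = max(self.guard_seq, seq)


def should_replay(coll: Collection, op_seq: int) -> bool:
    # Skip any object operation issued at or before the guarded point.
    return op_seq > coll.guard_seq
```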
12 years agomsg/Pipe: do not hold pipe_lock for verify_authorizer()
Sage Weil [Thu, 18 Jul 2013 16:55:43 +0000 (09:55 -0700)]
msg/Pipe: do not hold pipe_lock for verify_authorizer()

We shouldn't hold the pipe_lock while doing the ms_verify_authorizer
upcalls.

Fix by unlocking a bit earlier, and verifying our state is still correct
in the failure path.

This regression was introduced by ecab4bb9513385bd765cca23e4e2fadb7ac4bac2.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomon: fix off-by-one in check for when sync falls behind
Sage Weil [Thu, 18 Jul 2013 04:31:46 +0000 (21:31 -0700)]
mon: fix off-by-one in check for when sync falls behind

This is what e213b1bc25a212ffe42623c1d4b4eadf9f69319e intended to do
but managed to bungle by using >= instead of >.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agoMerge pull request #444 from ceph/wip-osd-latency
Sage Weil [Thu, 18 Jul 2013 05:03:07 +0000 (22:03 -0700)]
Merge pull request #444 from ceph/wip-osd-latency

osd: include op queue age histogram in osd_stat_t

Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agorgw: drop unused assignment
Sage Weil [Wed, 17 Jul 2013 04:33:28 +0000 (21:33 -0700)]
rgw: drop unused assignment

rgw/rgw_rados.cc: In member function 'virtual int RGWPutObjProcessor_Atomic::handle_data(ceph::bufferlist&, off_t, void**)':
rgw/rgw_rados.cc:648:5: warning: parameter 'ofs' set but not used [-Wunused-but-set-parameter]

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agomon: make 'health' warn about slow requests 444/head
Sage Weil [Wed, 17 Jul 2013 22:49:16 +0000 (15:49 -0700)]
mon: make 'health' warn about slow requests

Currently we see slow request warnings go by in the cluster log, but they
are not reflected by 'ceph health'.  Use the new op queue histograms to
raise a flag there as well.

For example:

HEALTH_WARN 59 requests are blocked > 32 sec; 2 osds have slow requests
21 ops are blocked > 65.536 sec
38 ops are blocked > 32.768 sec
16 ops are blocked > 65.536 sec on osd.1
23 ops are blocked > 32.768 sec on osd.1
5 ops are blocked > 65.536 sec on osd.2
15 ops are blocked > 32.768 sec on osd.2
2 osds have slow requests

Fixes: #5505
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: include op queue age histogram in osd_stat_t
Sage Weil [Wed, 17 Jul 2013 21:21:40 +0000 (14:21 -0700)]
osd: include op queue age histogram in osd_stat_t

This includes a simple power-of-2 histogram of op ages in the op queue
inside osd_stat_t.  This can be used for a coarse view of overall cluster
performance (it will get summed by the mon), to identify specific outlier
osds who have a higher latency than the others, or to identify stuck ops.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
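A power-of-2 histogram of this kind can be sketched in a few lines. The millisecond unit and the function name are assumptions, chosen so that the bucket boundaries line up with thresholds such as 32.768 sec (2^15 ms) and 65.536 sec (2^16 ms).

```python
def add_op_age(histogram: list, age_ms: int) -> None:
    """Count one op in a power-of-2 histogram.

    Bucket i holds ages in [2**i, 2**(i+1)) milliseconds; ages below 1 ms
    are counted in bucket 0. The unit is an assumption for illustration.
    """
    bucket = max(age_ms, 1).bit_length() - 1
    while len(histogram) <= bucket:
        histogram.append(0)  # grow the histogram on demand
    histogram[bucket] += 1
```

Because each OSD reports the same bucket layout, summing histograms element by element gives a cluster-wide view.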
12 years agoqa/workunits/cephtool/test.sh: test 'osd create <uuid>'
Sage Weil [Thu, 18 Jul 2013 01:17:29 +0000 (18:17 -0700)]
qa/workunits/cephtool/test.sh: test 'osd create <uuid>'

Make sure it gives us back the same id.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agoPG: start flush on primary only after we process the master log
Samuel Just [Wed, 17 Jul 2013 22:04:10 +0000 (15:04 -0700)]
PG: start flush on primary only after we process the master log

Once we start serving reads, stray objects must have already
been removed.  Therefore, we have to flush all operations
up to the transaction writing out the authoritative log.
On replicas, we flush in Stray() if we will not eventually
be activated and in ReplicaActive if we are in the acting
set.  This way a replica won't serve a replica read until
the store is consistent.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoReplicatedPG: replace clean_up_local with a debug check
Samuel Just [Wed, 17 Jul 2013 19:51:19 +0000 (12:51 -0700)]
ReplicatedPG: replace clean_up_local with a debug check

Stray objects should have been cleaned up in the merge_log
transactions.  Only on the primary have those operations
necessarily been flushed at activate().

Fixes: #5084
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomsgr: fix a typo/goto-cross from dd4addef2d
Greg Farnum [Wed, 17 Jul 2013 22:23:12 +0000 (15:23 -0700)]
msgr: fix a typo/goto-cross from dd4addef2d

We didn't build or review carefully enough!

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #441 from ceph/wip-5626
Sage Weil [Wed, 17 Jul 2013 21:50:41 +0000 (14:50 -0700)]
Merge pull request #441 from ceph/wip-5626

msgr fixes for lossless peer sessions

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoosd: make 'from dead osd' message more informative 441/head
Sage Weil [Tue, 16 Jul 2013 21:21:08 +0000 (14:21 -0700)]
osd: make 'from dead osd' message more informative

I thought I saw some weirdness here.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: a bit of additional debug output
Sage Weil [Tue, 16 Jul 2013 21:17:05 +0000 (14:17 -0700)]
msg/Pipe: a bit of additional debug output

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: hold pipe_lock during important parts of accept()
Sage Weil [Tue, 16 Jul 2013 20:13:46 +0000 (13:13 -0700)]
msg/Pipe: hold pipe_lock during important parts of accept()

Previously we did not bother with locking for accept() because we were
not visible to any other threads.  However, we need to close accepting
Pipes from mark_down_all(), which means we need to handle interference.

Fix up the locking so that we hold pipe_lock when looking at Pipe state
and verify that we are still in the ACCEPTING state any time we retake
the lock.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: close accepting_pipes from mark_down_all()
Sage Weil [Tue, 16 Jul 2013 00:16:23 +0000 (17:16 -0700)]
msgr: close accepting_pipes from mark_down_all()

We need to catch these pipes too, particularly when doing a rebind(),
to avoid them leaking through.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: maintain list of accepting pipes
Sage Weil [Tue, 16 Jul 2013 00:14:25 +0000 (17:14 -0700)]
msgr: maintain list of accepting pipes

New pipes exist in a sort of limbo before we know who the peer is and
add them to rank_pipe.  Keep a list of them in accepting_pipes for that
period.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsgr: adjust nonce on rebind()
Sage Weil [Tue, 16 Jul 2013 23:25:28 +0000 (16:25 -0700)]
msgr: adjust nonce on rebind()

We can have a situation where:

 - we have a pipe to a peer
 - pipe goes to standby (on peer)
 - we rebind to a new port
 - ....
 - we rebind again to the same old port
 - we connect to peer

and get reattached to the ancient pipe from two instances back.  Avoid that
by picking a new nonce each time we rebind.

Add 1,000,000 each time so that the port is still legible in the printed
output.

Signed-off-by: Sage Weil <sage@inktank.com>
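The nonce adjustment amounts to the arithmetic below. The step of 1,000,000 comes from the commit message; the function name is hypothetical.

```python
NONCE_STEP = 1_000_000  # step size taken from the commit message


def next_nonce(nonce: int) -> int:
    """Return the nonce to use after a rebind (hypothetical helper)."""
    return nonce + NONCE_STEP
```

Starting from an arbitrary example value of 6800, two rebinds give 2006800, so the low-order digits of the original value stay readable in printed output.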
12 years agomsgr: mark_down_all() after, not before, rebind
Sage Weil [Tue, 16 Jul 2013 00:10:23 +0000 (17:10 -0700)]
msgr: mark_down_all() after, not before, rebind

If we are shutting down all old connections and binding to new ports,
we want to avoid a sequence like:

 - close all previous connections
 - new connection comes in on old port
 - rebind to new ports
 -> connection from old port leaks through

As a first step, close all connections after we shut down the old
accepter and before we start the new one.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: unlock msgr->lock earlier in accept()
Sage Weil [Tue, 16 Jul 2013 20:01:18 +0000 (13:01 -0700)]
msg/Pipe: unlock msgr->lock earlier in accept()

Small cleanup.  Nothing needs msgr->lock for the previously larger
window.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomsg/Pipe: avoid creating empty out_q entry
Sage Weil [Tue, 16 Jul 2013 17:09:02 +0000 (10:09 -0700)]
msg/Pipe: avoid creating empty out_q entry

We need to maintain the invariant that all sub queues in out_q are never
empty.  Fix discard_requeued_up_to() to avoid creating an entry unless we
know it is already present.

This bug leads to an incorrect reconnect attempt when

 - we accept a pipe (lossless peer)
 - they send some stuff, maybe
 - fault
 - we initiate reconnect, even though we have nothing queued

In particular, we shouldn't reconnect because we aren't checking for
resets, and the fact that our out_seq is 0 while the peer's might be
something else entirely will trigger asserts later.

This fixes at least one source of #5626, and possibly #5517.

Backport: cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
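The invariant can be illustrated with a Python `defaultdict`, which creates an entry on plain lookup, much as indexing a C++ map does. This is a toy model under that analogy; the queue layout and the explicit deletion of emptied sub-queues are assumptions of the sketch.

```python
from collections import defaultdict, deque


def discard_requeued_up_to(out_q, priority: int, seq: int) -> None:
    """Drop requeued messages with sequence <= seq from one sub-queue.

    out_q maps priority -> deque of (seq, payload); a toy model only.
    """
    if priority not in out_q:
        # A plain out_q[priority] lookup on a defaultdict would create
        # an empty sub-queue here and break the invariant.
        return
    q = out_q[priority]
    while q and q[0][0] <= seq:
        q.popleft()
    if not q:
        del out_q[priority]  # keep the invariant: no empty sub-queues
```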
12 years agomsg/Pipe: assert lock is held in various helpers
Sage Weil [Mon, 15 Jul 2013 21:47:05 +0000 (14:47 -0700)]
msg/Pipe: assert lock is held in various helpers

These all require that we hold pipe_lock.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph_mon: obtain backup monmap if store is marked with 'force_sync'
Joao Eduardo Luis [Wed, 17 Jul 2013 18:50:38 +0000 (19:50 +0100)]
ceph_mon: obtain backup monmap if store is marked with 'force_sync'

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomon/OSDMonitor: make 'osd pool mksnap ...' not expose uncommitted state
Sage Weil [Wed, 17 Jul 2013 00:08:23 +0000 (17:08 -0700)]
mon/OSDMonitor: make 'osd pool mksnap ...' not expose uncommitted state

We were returning success without waiting if the pending pool state had
the snap.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agoqa/workunits/cephtest/test.sh: put 'osd ls' before any 'osd create' tests
Sage Weil [Wed, 17 Jul 2013 16:36:36 +0000 (09:36 -0700)]
qa/workunits/cephtest/test.sh: put 'osd ls' before any 'osd create' tests

A monc/mon connection fault or the dup command test flag may mean an extra
osd id is created that isn't actually up; reorder so that doesn't screw
up 'osd ls'.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon: MonCommands: remove obsolete 'sync status' command
Joao Eduardo Luis [Wed, 17 Jul 2013 14:50:37 +0000 (15:50 +0100)]
mon: MonCommands: remove obsolete 'sync status' command

Obsoleted by the sync refactor from
da0aff28ab478bcc3136715f92bc1af8d4b403c1

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoOSD::_try_resurrect_pg: fix cur/pgid confusion
Samuel Just [Tue, 16 Jul 2013 23:16:47 +0000 (16:16 -0700)]
OSD::_try_resurrect_pg: fix cur/pgid confusion

This bug prevented resurrection of ancestor pgs where
necessary.

Fixes: #5269
This may result in pg A being created just before pg B
is resurrected and split into A and B resulting in one
or the other operation getting an EEXIST.

Backport: cuttlefish
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomon/AuthMonitor: make 'auth del ...' idempotent
Sage Weil [Wed, 17 Jul 2013 00:21:33 +0000 (17:21 -0700)]
mon/AuthMonitor: make 'auth del ...' idempotent

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoqa/workunits/cephtool/test.sh: mds cluster_down/up are idempotent
Sage Weil [Wed, 17 Jul 2013 00:14:09 +0000 (17:14 -0700)]
qa/workunits/cephtool/test.sh: mds cluster_down/up are idempotent

As of d45429b81ab9817284d6dca98077cb77b5e8280f; fix the test.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph: send successful commands twice with CEPH_CLI_TEST_DUP_COMMAND
Sage Weil [Tue, 9 Jul 2013 04:12:49 +0000 (21:12 -0700)]
ceph: send successful commands twice with CEPH_CLI_TEST_DUP_COMMAND

Monitor commands need to be idempotent.  This helps us test this by
simply issuing any successful command a second time so that we notice
when a dup submission fails.

Signed-off-by: Sage Weil <sage@inktank.com>
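The duplicate-submission idea can be sketched as a thin wrapper. `send` is a hypothetical callable standing in for whatever submits a command to the monitor; only the environment variable name comes from the commit.

```python
import os


def run_command(send, cmd):
    """Send cmd; resend it if it succeeded and the test flag is set.

    `send` is a hypothetical callable returning (ret, output), where
    ret == 0 means success.
    """
    ret, out = send(cmd)
    if ret == 0 and os.environ.get("CEPH_CLI_TEST_DUP_COMMAND"):
        # An idempotent command must succeed again with the same arguments;
        # the second result is what the caller sees.
        ret, out = send(cmd)
    return ret, out
```

A non-idempotent command then shows up as a failure in any test run with the flag set.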
12 years agomon/MDSMonitor: make 'mds cluster_{up,down}' idempotent
Sage Weil [Tue, 16 Jul 2013 23:26:57 +0000 (16:26 -0700)]
mon/MDSMonitor: make 'mds cluster_{up,down}' idempotent

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosdmaptool: fix cli tests
Sage Weil [Tue, 16 Jul 2013 23:10:08 +0000 (16:10 -0700)]
osdmaptool: fix cli tests

From the HASHPSPOOL change in acbc2f0bc0b4266125403aebb28e6e3a2365394d.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-ceph-disk' into next
Sage Weil [Tue, 16 Jul 2013 22:52:37 +0000 (15:52 -0700)]
Merge branch 'wip-ceph-disk' into next

Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
Tested-by: Jing Yuan Luke <jyluke@gmail.com>
12 years agoceph-disk: use /sys/block to determine partition device names
Sage Weil [Thu, 11 Jul 2013 19:59:56 +0000 (12:59 -0700)]
ceph-disk: use /sys/block to determine partition device names

Not all devices are basename + number; some have intervening character(s),
like /dev/cciss/c0d1p2.

Signed-off-by: Sage Weil <sage@inktank.com>
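One way to enumerate partitions from sysfs rather than by string manipulation is sketched below. The function name and signature are illustrative, not ceph-disk's; the sketch relies on Linux exposing a `partition` file inside each partition's directory, and on sysfs naming devices such as cciss/c0d1 with `!` in place of `/`.

```python
import os


def list_partitions(dev_name: str, sys_block: str = "/sys/block") -> list:
    """Return the partition names of dev_name by scanning sysfs.

    A child directory counts as a partition if it contains a 'partition'
    file. Name and signature are illustrative, not ceph-disk's.
    """
    base = os.path.join(sys_block, dev_name)
    parts = []
    for entry in sorted(os.listdir(base)):
        if os.path.isfile(os.path.join(base, entry, "partition")):
            parts.append(entry)
    return parts
```

The `sys_block` parameter exists only so the function can be pointed at a fake directory tree in tests.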
12 years agoceph-disk: reimplement is_partition() using /sys/block
Sage Weil [Wed, 3 Jul 2013 18:01:58 +0000 (11:01 -0700)]
ceph-disk: reimplement is_partition() using /sys/block

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-disk: use get_dev_name() helper throughout
Sage Weil [Wed, 3 Jul 2013 18:01:39 +0000 (11:01 -0700)]
ceph-disk: use get_dev_name() helper throughout

This is more robust than the broken split trick.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-disk: refactor list_[all_]partitions
Sage Weil [Wed, 3 Jul 2013 17:55:36 +0000 (10:55 -0700)]
ceph-disk: refactor list_[all_]partitions

Make these methods work in terms of device *names*, not paths, and fix up
the only direct list_partitions() caller to do the same.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-disk: add get_dev_name, path helpers
Sage Weil [Wed, 3 Jul 2013 17:52:29 +0000 (10:52 -0700)]
ceph-disk: add get_dev_name, path helpers

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomon/OSDMonitor: fix typo
Sage Weil [Tue, 16 Jul 2013 22:36:53 +0000 (15:36 -0700)]
mon/OSDMonitor: fix typo

From 5eac38797d9eb5a59fcff1d81571cff7a2f10e66

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy
Sage Weil [Tue, 16 Jul 2013 22:28:07 +0000 (15:28 -0700)]
osd/OSDMonitor: make 'osd pool rmsnap ...' not racy/crashy

Ensure that the snap does in fact exist before we try to remove it.  This
avoids a crash where we get two dup rmsnap requests (due to thrashing, or
a reconnect, or something), the committed (p) value does have the snap, but
the uncommitted (pp) does not.  This fails the old test such that we try
to remove it from pp again, and assert.

Restructure the flow so that it is easier to distinguish the committed
short return from the uncommitted return (which must still wait for the
commit).

     0> 2013-07-16 14:21:27.189060 7fdf301e9700 -1 osd/osd_types.cc: In function 'void pg_pool_t::remove_snap(snapid_t)' thread 7fdf301e9700 time 2013-07-16 14:21:27.187095
osd/osd_types.cc: 662: FAILED assert(snaps.count(s))

 ceph version 0.66-602-gcd39d8a (cd39d8a6727d81b889869e98f5869e4227b50720)
 1: (pg_pool_t::remove_snap(snapid_t)+0x6d) [0x7ad6dd]
 2: (OSDMonitor::prepare_command(MMonCommand*)+0x6407) [0x5c1517]
 3: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x1fb) [0x5c41ab]
 4: (PaxosService::dispatch(PaxosServiceMessage*)+0x937) [0x598c87]
 5: (Monitor::handle_command(MMonCommand*)+0xe56) [0x56ec36]
 6: (Monitor::_ms_dispatch(Message*)+0xd1d) [0x5719ad]
 7: (Monitor::handle_forward(MForward*)+0x821) [0x572831]
 8: (Monitor::_ms_dispatch(Message*)+0xe44) [0x571ad4]
 9: (Monitor::ms_dispatch(Message*)+0x32) [0x588c52]
 10: (DispatchQueue::entry()+0x549) [0x7cf1d9]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x7060fd]
 12: (()+0x7e9a) [0x7fdf35165e9a]
 13: (clone()+0x6d) [0x7fdf334fcccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
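The restructured flow can be modelled with two sets, one for the committed pool state (p) and one for the pending state (pp). This is a simplified sketch with hypothetical names and return values, not the OSDMonitor code.

```python
def prepare_rmsnap(committed: set, pending: set, snap: str) -> str:
    """Decide how to handle an 'osd pool rmsnap' request.

    Returns 'done' for the committed short return, or 'wait' when the
    caller must wait for the pending change to commit. Illustrative only.
    """
    if snap not in committed:
        return "done"  # snap is already gone in the committed state
    if snap in pending:
        pending.remove(snap)  # first request: queue the removal
    # On a duplicate request the removal is already queued, so it is
    # not removed a second time; either way, wait for the commit.
    return "wait"
```

A duplicate request arriving before the commit returns 'wait' without touching the pending state again, which is the case that previously hit the assert.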
12 years agoObjectStore: add omap_rmkeyrange to dump
Samuel Just [Tue, 16 Jul 2013 17:53:51 +0000 (10:53 -0700)]
ObjectStore: add omap_rmkeyrange to dump

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoOSD: add perfcounter tracking messages delayed pending a map
Samuel Just [Mon, 15 Jul 2013 23:12:07 +0000 (16:12 -0700)]
OSD: add perfcounter tracking messages delayed pending a map

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>