git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
11 years ago workunits: add a test for caching redirects 560/head
Greg Farnum [Thu, 29 Aug 2013 22:26:08 +0000 (15:26 -0700)]
workunits: add a test for caching redirects

This may need to change since it exploits some of the loose
consistency we currently have with caching pools, but for now
it checks that the Objecter does what we want.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago osd_types: note that write_tier wins if read_tier is different
Greg Farnum [Thu, 29 Aug 2013 20:58:04 +0000 (13:58 -0700)]
osd_types: note that write_tier wins if read_tier is different

For pg_pool_t.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago Objecter: respect read_tier & write_tier for initial op submission
Greg Farnum [Thu, 29 Aug 2013 20:57:10 +0000 (13:57 -0700)]
Objecter: respect read_tier & write_tier for initial op submission

We overwrite target_oloc.pool with the appropriate [read|write]_tier.
write_tier wins if it matches both.
We don't handle any sort of redirect yet.
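
The selection rule above can be sketched as follows. This is a minimal
illustration only, not the Objecter's code: PoolTiers and pick_target_pool
are hypothetical names, and -1 is assumed to mean "no tier configured".

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the tiering fields of a pool.
// A value of -1 is assumed to mean "no tier configured".
struct PoolTiers {
  int64_t read_tier = -1;
  int64_t write_tier = -1;
};

// Returns the pool an op should initially be sent to.
// The write_tier check runs last, so it wins for ops that both read and write.
int64_t pick_target_pool(int64_t base_pool, const PoolTiers& tiers,
                         bool is_read, bool is_write) {
  int64_t target = base_pool;
  if (is_read && tiers.read_tier >= 0)
    target = tiers.read_tier;
  if (is_write && tiers.write_tier >= 0)
    target = tiers.write_tier;
  return target;
}
```

With read_tier = 2 and write_tier = 3 on base pool 1, a pure read goes to
pool 2 while a write, or an op that does both, goes to pool 3.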

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago Objecter: be careful about precalculated pgids
Greg Farnum [Thu, 29 Aug 2013 20:52:35 +0000 (13:52 -0700)]
Objecter: be careful about precalculated pgids

The only current user of the precalc_pgid field is list_objects. That's
fine, but we don't want new users to inadvertently appear and somehow
break the caching/tiering stuff by forcing us to go to the base pool
when we should be talking to somebody else. Add an assert to catch
these cases.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago Objecter: add an Op::target_oloc and use it instead of base_oloc in send_op()
Greg Farnum [Thu, 29 Aug 2013 20:12:41 +0000 (13:12 -0700)]
Objecter: add an Op::target_oloc and use it instead of base_oloc in send_op()

For now we simply set target_oloc = base_oloc in recalc_op_target(), but
we will shortly be doing more interesting things with it there.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago Objecter: rename Op::oloc -> Op::base_oloc
Greg Farnum [Thu, 29 Aug 2013 20:08:03 +0000 (13:08 -0700)]
Objecter: rename Op::oloc -> Op::base_oloc

We want to be able to target other pools for caching and tiering, so
we need to take an oloc from the client and translate it into an
actual target. Rename oloc to base_oloc to make clear which one it is.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago mon/OSDMonitor: 'osd tier {set,remove}-overlay <pool> [tierpool]' 554/head
Greg Farnum [Thu, 29 Aug 2013 00:49:48 +0000 (17:49 -0700)]
mon/OSDMonitor: 'osd tier {set,remove}-overlay <pool> [tierpool]'

Also prevent 'osd tier remove ...' if the tierpool is the current overlay.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago qa/workunits/cephtool/test.sh: test osd tier CLI
Greg Farnum [Thu, 29 Aug 2013 00:47:42 +0000 (17:47 -0700)]
qa/workunits/cephtool/test.sh: test osd tier CLI

Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago mon/OSDMonitor: 'osd tier cache-mode <pool> <mode>'
Sage Weil [Tue, 27 Aug 2013 20:44:52 +0000 (13:44 -0700)]
mon/OSDMonitor: 'osd tier cache-mode <pool> <mode>'

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago mon/OSDMonitor: 'osd pool tier <add|remove> <pool> <tierpool>'
Sage Weil [Tue, 27 Aug 2013 20:43:09 +0000 (13:43 -0700)]
mon/OSDMonitor: 'osd pool tier <add|remove> <pool> <tierpool>'

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago osd/OSDMonitor: avoid polluting pending_inc on error for 'osd pool set ...'
Sage Weil [Tue, 27 Aug 2013 19:47:53 +0000 (12:47 -0700)]
osd/OSDMonitor: avoid polluting pending_inc on error for 'osd pool set ...'

Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago osd_types: add pg_pool_t cache-related fields
Sage Weil [Mon, 26 Aug 2013 22:59:54 +0000 (15:59 -0700)]
osd_types: add pg_pool_t cache-related fields

We add fields sufficient to specify
* many pools have a tiering relationship with pool foo
* pool foo is a tier pool for pool bar
* the tiering relationship between foo and bar is specified
  by cache_mode
* client reads and writes for pool foo should be directed to
  pools bar and baz, respectively (where probably, but not
  necessarily, baz == bar or baz == foo).

This lets us specify very sophisticated caching policies on
the server side that all clients going forward can handle
simply by directing the messages as the read_tier and write_tier
flags, and the (not-yet-implemented) redirect replies
from OSDs, specify.
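
As a rough sketch of the relationships listed above, assuming a simplified
pool record: the Pool struct, the CacheMode values, the helper functions,
and the choice to store cache_mode on the tier pool are all illustrative
assumptions, not the actual pg_pool_t definition.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Hypothetical cache modes; the real set of modes is not listed in the text.
enum class CacheMode { None, Writeback };

// Simplified sketch of the cache-related pool fields.
struct Pool {
  std::set<int64_t> tiers;  // pools that are tiers of this pool
  int64_t tier_of = -1;     // pool this one is a tier for (-1: none)
  int64_t read_tier = -1;   // where client reads should go (-1: this pool)
  int64_t write_tier = -1;  // where client writes should go (-1: this pool)
  CacheMode cache_mode = CacheMode::None;
};

typedef std::map<int64_t, Pool> PoolMap;

// Record that `tier` is a tier pool for `base`, with the given cache mode.
void add_tier(PoolMap& pools, int64_t base, int64_t tier, CacheMode mode) {
  pools[base].tiers.insert(tier);
  pools[tier].tier_of = base;
  pools[tier].cache_mode = mode;
}

// Direct client reads and writes for `base` to the given pools.
void set_overlay(PoolMap& pools, int64_t base,
                 int64_t read_pool, int64_t write_pool) {
  pools[base].read_tier = read_pool;
  pools[base].write_tier = write_pool;
}
```

A client that understands only read_tier and write_tier can follow any
policy expressed this way without knowing the cache mode itself.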

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago doc/dev/cache-pool: document cache pool management interface
Sage Weil [Mon, 26 Aug 2013 23:57:58 +0000 (16:57 -0700)]
doc/dev/cache-pool: document cache pool management interface

Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago add CEPH_FEATURE_OSD_CACHEPOOL
Sage Weil [Mon, 26 Aug 2013 22:11:43 +0000 (15:11 -0700)]
add CEPH_FEATURE_OSD_CACHEPOOL

Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago Merge pull request #549 from ceph/wip-6029
Gregory Farnum [Wed, 28 Aug 2013 16:15:36 +0000 (09:15 -0700)]
Merge pull request #549 from ceph/wip-6029

Make user_version a first-class citizen
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Just <sam.just@inktank.com>
11 years ago Merge remote-tracking branch 'origin/master' into wip-6029 549/head
Greg Farnum [Wed, 28 Aug 2013 00:26:36 +0000 (17:26 -0700)]
Merge remote-tracking branch 'origin/master' into wip-6029

Conflicts:
src/librados/AioCompletionImpl.h

11 years ago doc: update to describe new OSD version support as it actually exists
Greg Farnum [Tue, 27 Aug 2013 22:21:49 +0000 (15:21 -0700)]
doc: update to describe new OSD version support as it actually exists

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago ReplicatedPG: add OpContext::user_at_version
Greg Farnum [Wed, 28 Aug 2013 00:24:24 +0000 (17:24 -0700)]
ReplicatedPG: add OpContext::user_at_version

Set this up with the existing at_version member, but only increase
it for user_modify ops. Use this when logging the PG's user_version. In
order to maintain compatibility with old clients on classic pools, we
force user_version to follow at_version whenever it's updated.

Now that we have and are maintaining this PG user version, use it
for the user version on ops that get ENOENT back, when short-circuiting
replies as part of reply_op_error()[1], or when replying to repops
in eval_repop; further use it for the cls_current_version() function. This
is a small semantic change for that function, as previously it would
generally return the same value as the user would get sent back via
MOSDOpReply -- but I don't think it was something you could count on.
We now define it as being the user version of the PG at the start of the
op, and as a bonus it is defined even for read ops (the at_version is
only filled in on write operations).

[1]: We tweak PGLog to make it easier to retrieve both user and PG versions.
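
The split between the PG version and the user version might be modelled
like this. It is a reduced sketch under stated assumptions: PGVersions and
its members are hypothetical, and epochs, logging, and error replies are
left out.

```cpp
#include <cstdint>

// Hypothetical, reduced model of a PG's version bookkeeping.
struct PGVersions {
  uint64_t at_version = 0;         // advances on every logged update
  uint64_t last_user_version = 0;  // advances only on user modifications

  // Apply one update and return the user version to report for it.
  uint64_t apply_update(bool user_modify) {
    ++at_version;
    if (user_modify) {
      // For compatibility with old clients, the user version follows
      // at_version whenever it is updated.
      last_user_version = at_version;
    }
    return last_user_version;
  }

  // Reads change neither counter; they report the current user version.
  uint64_t read() const { return last_user_version; }
};
```

An internal (non-user) update therefore bumps at_version but leaves the
value that users observe unchanged.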

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago MOSDOpReply: stop filling in replay_version from the MOSDOp to begin with
Greg Farnum [Tue, 27 Aug 2013 19:55:52 +0000 (12:55 -0700)]
MOSDOpReply: stop filling in replay_version from the MOSDOp to begin with

It's just asking for trouble.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago MOSDOpReply: switch to comprehensive instead of individual version setters
Greg Farnum [Tue, 27 Aug 2013 21:06:49 +0000 (14:06 -0700)]
MOSDOpReply: switch to comprehensive instead of individual version setters

There's little point to updating versions individually when we can
do so en masse and avoid mistakes in duplication.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago MOSDOpReply: add enough fields to be backwards compatible.
Greg Farnum [Tue, 27 Aug 2013 18:02:44 +0000 (11:02 -0700)]
MOSDOpReply: add enough fields to be backwards compatible.

The system we've been building up works out very nicely for new clients,
but they could not have interoperated with old clients that were only
referring to our replay_version. In order to deal with this, we add
a bad_replay_version to MOSDOpReply which is encoded where we used
to encode replay_version. bad_replay_version will follow the same semantics
as reassert_version used to (except that it is filled in on reads), but
is not accessible to new clients, who can see only our properly-controlled
replay_version and user_version. This will let old and new clients
interoperate correctly when communicating about watches, etc.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago osd: actually fill in user_version in pg_log_entry_t
Greg Farnum [Wed, 28 Aug 2013 00:14:56 +0000 (17:14 -0700)]
osd: actually fill in user_version in pg_log_entry_t

We now require it when creating a pg_log_entry_t. The user_version
is the version which info.last_user_version should be set to
after the transaction is applied, which for everything except for
a user-modify op is going to be the version it was already at.
For now we are filling in the user-modify op's changing user_version
to be ctx->at_version.version

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago osd: add last_user_version to pg_info_t
Greg Farnum [Wed, 21 Aug 2013 18:26:28 +0000 (11:26 -0700)]
osd: add last_user_version to pg_info_t

We add a corresponding user_version to pg_log_entry_t, and the logic
to assign from one to the other and to recover last_user_version from
a master's log. We aren't yet setting it to anything, though.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago ReplicatedPG: remove OpContext::reply_user_version
Greg Farnum [Wed, 21 Aug 2013 00:11:14 +0000 (17:11 -0700)]
ReplicatedPG: remove OpContext::reply_user_version

ctx->new_obs.oi.user_version is initialized to ctx->obs.oi.user_version,
and for read ops it won't be changed. That means
reply_user_version == ctx->new_obs.oi.user_version in all cases, which
means we don't want it.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago osd: switch object_info_t::user_version to be a version_t
Greg Farnum [Wed, 21 Aug 2013 00:13:53 +0000 (17:13 -0700)]
osd: switch object_info_t::user_version to be a version_t

We never expose the full eversion_t data to users, and do not want to.
However, we pull some tricks in the encode/decode functions to avoid
having to change the object_info_t disk format for this change.
When we can break compatibility, we should simplify this.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago ReplicatedPG: Fill in the MOSDOpReply's user_version
Greg Farnum [Tue, 20 Aug 2013 23:22:27 +0000 (16:22 -0700)]
ReplicatedPG: Fill in the MOSDOpReply's user_version

As part of this, rename OpContext::reply_version->reply_user_version.
The semantics that necessitate the reply_version are only for user versions,
so rename it for clarity. Then use the reply_user_version in
set_user_version() (if the op succeeded).
For now we use the PG version for ENOENT (preserving the previous
semantics), but that will get changed to the pg's user_version soon
as well.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago ReplicatedPG: set the replay version based on the at_version
Greg Farnum [Tue, 20 Aug 2013 23:18:18 +0000 (16:18 -0700)]
ReplicatedPG: set the replay version based on the at_version

The replay version is not for users to consume, so we don't want
to use the user_version for it.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago Objecter: expose MOSDOp's new user_version instead of the replay_version
Greg Farnum [Tue, 20 Aug 2013 20:55:54 +0000 (13:55 -0700)]
Objecter: expose MOSDOp's new user_version instead of the replay_version

We don't want users to ever see the replay_version, which is about
to become private RADOS data.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago Objecter: librados: mass switch from eversion_t to version_t
Greg Farnum [Tue, 20 Aug 2013 21:21:04 +0000 (14:21 -0700)]
Objecter: librados: mass switch from eversion_t to version_t

There are a lot of pointers throughout our request infrastructure used solely
for exporting the version to users. The interfaces we actually expose only
provide a uint64_t (leaving off eversion_t's epoch), and that's all we're
going to maintain in our new user_version scheme, so don't pretend we'll
have more in our internal interfaces.
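
To make the distinction concrete, here is a small sketch with simplified
stand-in types. The field widths and the to_user_version helper are
assumptions for illustration; only the idea that users see a bare 64-bit
version without the epoch comes from the text above.

```cpp
#include <cstdint>

typedef uint64_t version_t;  // the uint64_t that user-facing calls return
typedef uint32_t epoch_t;    // assumed width, for illustration only

// Simplified stand-in for an (epoch, version) pair.
struct eversion_t {
  epoch_t epoch;
  version_t version;
};

// Hypothetical helper: what a user-facing interface would hand back.
version_t to_user_version(const eversion_t& ev) {
  return ev.version;  // the epoch is deliberately dropped
}
```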

I audited this pretty carefully; in particular:
Op::objver is only used for passing data back to users via the calling
functions IoCtxImpl::last_objver, etc
IoCtxImpl::last_objver is used only for the set_sync_op_version() call, which
provides data only for the uint64_t get_last_version() and
rados_get_last_version() calls.
AioCompletionImpl::objver is used only for the uint64_t get_version() call.
LingerOp::pobjver is used only for referencing things that are now version_t.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago Objecter: rename Op::version to Op::replay_version
Greg Farnum [Tue, 20 Aug 2013 21:21:32 +0000 (14:21 -0700)]
Objecter: rename Op::version to Op::replay_version

This is used for replay, so let's be more precise!

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago MOSDOpReply: add user_version field
Greg Farnum [Wed, 28 Aug 2013 00:02:15 +0000 (17:02 -0700)]
MOSDOpReply: add user_version field

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago doc: include plan for new user_version support
Greg Farnum [Tue, 27 Aug 2013 22:16:29 +0000 (15:16 -0700)]
doc: include plan for new user_version support

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago ReplicatedPG: do not do a redundant set of ctx->new_obs.oi.version
Greg Farnum [Thu, 22 Aug 2013 21:54:19 +0000 (14:54 -0700)]
ReplicatedPG: do not do a redundant set of ctx->new_obs.oi.version

We set this in the if below for writes, and for reads it doesn't need to
be updated (and isn't). Remove the confusing double-set so future code
inspectors don't get concerned there's a bug like I did.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago ReplicatedPG: remove long-dead branch
Greg Farnum [Mon, 26 Aug 2013 21:38:30 +0000 (14:38 -0700)]
ReplicatedPG: remove long-dead branch

This was confusing the heck out of me when trying to figure out
why I was hitting an assert. So replace the if-else block with
a more appropriate assert and don't include any misleading calls
to prepare_transaction() from sub_op_modify().

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago MOSDOpReply: rename *_version() -> *_replay_version()
Greg Farnum [Wed, 28 Aug 2013 00:00:38 +0000 (17:00 -0700)]
MOSDOpReply: rename *_version() -> *_replay_version()

We have been returning the object's "user version" and using that
for replay, but that is in fact incorrect. In preparation for fixing
up the user version semantics, rename get_version to get_replay_version
and set_version to set_replay_version.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago MOSDOpReply: rename reassert_version -> replay_version
Greg Farnum [Tue, 27 Aug 2013 23:56:40 +0000 (16:56 -0700)]
MOSDOpReply: rename reassert_version -> replay_version

Because that's what it's for. reassert_version is a bit ambiguous.

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago docs: document how the current OSD PG/object versions work
Greg Farnum [Tue, 27 Aug 2013 22:08:28 +0000 (15:08 -0700)]
docs: document how the current OSD PG/object versions work

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years ago Merge pull request #545 from dachary/wip-6117
athanatos [Tue, 27 Aug 2013 17:56:49 +0000 (10:56 -0700)]
Merge pull request #545 from dachary/wip-6117

SharedPtrRegistry: get_next must not delete while holding the lock

Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years ago doc: Updated to accurately reflect that upstart applies to a single node.
John Wilkins [Tue, 27 Aug 2013 17:25:50 +0000 (10:25 -0700)]
doc: Updated to accurately reflect that upstart applies to a single node.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
11 years ago ceph.spec.in: radosgw package doesn't require mod_fcgi
Gary Lowell [Tue, 27 Aug 2013 16:53:12 +0000 (09:53 -0700)]
ceph.spec.in: radosgw package doesn't require mod_fcgi

Fixes #5702

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
11 years ago librbd: fix debug print in aio_write
Sage Weil [Tue, 27 Aug 2013 15:30:50 +0000 (08:30 -0700)]
librbd: fix debug print in aio_write

Reported-by: James Harper <james.harper@bendigoit.com.au>
Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago cleanup: removed last references to g_conf from auth
Roald J. van Loon [Tue, 27 Aug 2013 15:17:19 +0000 (08:17 -0700)]
cleanup: removed last references to g_conf from auth

Trivial cleanup. There were still 3 references to g_conf in CephxKeyServer.
Replaced them in favor of cct->_conf.

Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
11 years ago SharedPtrRegistry: get_next must not delete while holding the lock 545/head
Loic Dachary [Tue, 27 Aug 2013 14:09:17 +0000 (16:09 +0200)]
SharedPtrRegistry: get_next must not delete while holding the lock

    bool get_next(const K &key, pair<K, VPtr> *next)

may indirectly delete the object pointed to by next->second when
doing:

    *next = make_pair(i->first, next_val);

and it will deadlock (EDEADLK) when

    void operator()(V *to_remove) {
      {
        Mutex::Locker l(parent->lock);

tries to acquire the lock because it is already held. The
Mutex::Locker is isolated in a block and the *next* parameter is set
outside of the block.
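
The shape of the fix can be sketched with standard-library types. This is
a much-reduced, hypothetical Registry (int keys and values, std::mutex in
place of Mutex), not the SharedPtrRegistry code itself; it only shows the
lock being confined to a block and *next being assigned after that block.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical, much-reduced registry for illustration only.
class Registry {
 public:
  typedef std::shared_ptr<int> VPtr;

  VPtr add(int key, int value) {
    VPtr p(new int(value), Remover{this, key});
    std::lock_guard<std::mutex> l(lock_);
    contents_[key] = p;
    return p;
  }

  bool get_next(int key, std::pair<int, VPtr>* next) {
    std::pair<int, VPtr> found;
    {
      std::lock_guard<std::mutex> l(lock_);
      auto i = contents_.upper_bound(key);
      // Skip entries whose object is already being destroyed.
      while (i != contents_.end() && !(found.second = i->second.lock()))
        ++i;
      if (i == contents_.end())
        return false;
      found.first = i->first;
    }
    // The lock is released here. Overwriting *next may drop the last
    // reference to its old value, which runs Remover and takes lock_.
    if (next)
      *next = found;
    return true;
  }

  std::size_t size() {
    std::lock_guard<std::mutex> l(lock_);
    return contents_.size();
  }

 private:
  // Custom deleter: unregisters the key, then frees the object.
  struct Remover {
    Registry* parent;
    int key;
    void operator()(int* to_remove) const {
      {
        std::lock_guard<std::mutex> l(parent->lock_);
        auto i = parent->contents_.find(key);
        if (i != parent->contents_.end() && i->second.expired())
          parent->contents_.erase(i);
      }
      delete to_remove;
    }
  };

  std::mutex lock_;
  std::map<int, std::weak_ptr<int> > contents_;
};
```

Had the assignment to *next stayed inside the locked block, the deleter
would try to take a non-recursive mutex that its own thread already holds.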

A test case demonstrating the problem is added to test_sharedptr_registry.cc

http://tracker.ceph.com/issues/6117 fixes #6117

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years ago common: move SharedPtrRegistry test after t.join 544/head
Loic Dachary [Tue, 27 Aug 2013 11:58:33 +0000 (13:58 +0200)]
common: move SharedPtrRegistry test after t.join

The thread created to test SharedPtrRegistry race conditions updates a
value ( ptr ) that is tested by the main gtest thread but is not
protected by a lock. Instead of adding a lock, the main thread tests
the value after pthread_join() on the child thread.

http://tracker.ceph.com/issues/6130 fixes #6130

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years ago Merge remote-tracking branch 'gh/next'
Sage Weil [Tue, 27 Aug 2013 01:11:32 +0000 (18:11 -0700)]
Merge remote-tracking branch 'gh/next'

11 years ago osd: install admin socket commands after signals
Sage Weil [Sat, 24 Aug 2013 21:04:09 +0000 (14:04 -0700)]
osd: install admin socket commands after signals

This lets us tell by the presence of the admin socket commands whether
a signal will make us shut down cleanly.  See #5924.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years ago mon/DataHealthService: preserve compat of data stats dump
Sage Weil [Mon, 26 Aug 2013 20:19:27 +0000 (13:19 -0700)]
mon/DataHealthService: preserve compat of data stats dump

See 96621bdb004e539a0186fb592f44d51cf49f1c31.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago Merge pull request #526 from ceph/wip-5909
Sage Weil [Mon, 26 Aug 2013 20:17:20 +0000 (13:17 -0700)]
Merge pull request #526 from ceph/wip-5909

mon: Early warning system for monitor stores growing over predefined threshold

Reviewed-by: Sage Weil <sage@inktank.com>
11 years ago Merge pull request #540 from ceph/wip-doc-update
Sage Weil [Mon, 26 Aug 2013 17:42:34 +0000 (10:42 -0700)]
Merge pull request #540 from ceph/wip-doc-update

List packages needed for RPM-based distros

11 years ago WBThrottle: use fdatasync instead of fsync
Samuel Just [Thu, 22 Aug 2013 18:19:52 +0000 (11:19 -0700)]
WBThrottle: use fdatasync instead of fsync

Backport: dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years ago FileStore: add config option to disable the wbthrottle
Samuel Just [Thu, 22 Aug 2013 18:19:37 +0000 (11:19 -0700)]
FileStore: add config option to disable the wbthrottle

Backport: dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years ago fix nss lib name 540/head
Alfredo Deza [Mon, 26 Aug 2013 16:48:56 +0000 (12:48 -0400)]
fix nss lib name

Signed-off-by: Alfredo Deza <alfredo.deza@inktank.com>
11 years ago update the README with required RPM packages
Alfredo Deza [Mon, 26 Aug 2013 16:05:00 +0000 (12:05 -0400)]
update the README with required RPM packages

Signed-off-by: Alfredo Deza <alfredo.deza@inktank.com>
11 years ago Merge branch 'sleinen'
Josh Durgin [Mon, 26 Aug 2013 00:12:42 +0000 (17:12 -0700)]
Merge branch 'sleinen'

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years ago Improve warning message when there are unfound objects, but probing
Simon Leinen [Sun, 4 Aug 2013 14:34:52 +0000 (14:34 +0000)]
Improve warning message when there are unfound objects, but probing
hasn't finished yet.

Signed-off-by: Simon Leinen <simon.leinen@switch.ch>
11 years ago Merge remote-tracking branch 'gh/next'
Sage Weil [Sat, 24 Aug 2013 21:12:44 +0000 (14:12 -0700)]
Merge remote-tracking branch 'gh/next'

11 years ago mon: DataHealthService: monitor backing store's size and report it 526/head
Joao Eduardo Luis [Thu, 22 Aug 2013 15:08:22 +0000 (16:08 +0100)]
mon: DataHealthService: monitor backing store's size and report it

If the store's size grows beyond what we believe to be reasonable, we must
let the user know that something fishy may be going on.  This intends to
act as an early warning system for monitors suffering from leveldb
compaction issues.  However, if the monitor's store is just growing a lot
due to normal cluster behaviour, we made sure that the warning threshold
is adjustable by tuning 'mon_leveldb_size_warn' (defaulting to 40GB).
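
A minimal sketch of such a threshold check, assuming "40GB" means
40 * 2^30 bytes. The function name, the message text, and the exact
comparison are illustrative assumptions, not DataHealthService code.

```cpp
#include <cstdint>
#include <string>

// Assumed default for 'mon_leveldb_size_warn': 40 GB taken as 40 * 2^30 bytes.
const uint64_t kDefaultStoreSizeWarn = 40ULL * 1024 * 1024 * 1024;

// Returns a warning message if the store has grown beyond the threshold,
// or an empty string otherwise.
std::string check_store_size(uint64_t store_bytes,
                             uint64_t warn_threshold = kDefaultStoreSizeWarn) {
  if (store_bytes > warn_threshold)
    return "store is getting too big: " + std::to_string(store_bytes) +
           " bytes";
  return std::string();
}
```

Raising the threshold argument corresponds to tuning the option for
clusters whose stores are legitimately large.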

Fixes: #5909
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
11 years ago mon: mon_types: DataStats: add 'dump(Formatter*)' method
Joao Eduardo Luis [Thu, 22 Aug 2013 15:05:17 +0000 (16:05 +0100)]
mon: mon_types: DataStats: add 'dump(Formatter*)' method

... and use it on DataHealthService.cc, instead of building our own
version of the classes' formatted output.

Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
11 years ago mon: MonitorDBStore: rely on backing store to provide estimated store size
Joao Eduardo Luis [Thu, 22 Aug 2013 14:57:05 +0000 (15:57 +0100)]
mon: MonitorDBStore: rely on backing store to provide estimated store size

Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
11 years ago test: ceph_test_store_tool: output estimated store size on 'get-size'
Joao Eduardo Luis [Thu, 22 Aug 2013 15:17:12 +0000 (16:17 +0100)]
test: ceph_test_store_tool: output estimated store size on 'get-size'

Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
11 years ago Merge pull request #514 from kri5/wip-clang-compilation
Sage Weil [Sat, 24 Aug 2013 04:21:57 +0000 (21:21 -0700)]
Merge pull request #514 from kri5/wip-clang-compilation

Do not use some compilation flag invalid for clang

Reviewed-by: Loic Dachary <loic@dachary.org>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years ago Merge pull request #522 from kri5/master
Sage Weil [Sat, 24 Aug 2013 04:18:44 +0000 (21:18 -0700)]
Merge pull request #522 from kri5/master

vstart.sh: Allow to run multiple cluster instances.

11 years ago Merge pull request #531 from dmick/wip-6099
Sage Weil [Fri, 23 Aug 2013 22:30:41 +0000 (15:30 -0700)]
Merge pull request #531 from dmick/wip-6099

ceph_rest_api.py: create own default for log_file

Reviewed-by: Sage Weil <sage@inktank.com>
11 years ago rados-config: do not load ceph.conf
Sage Weil [Fri, 23 Aug 2013 22:21:41 +0000 (15:21 -0700)]
rados-config: do not load ceph.conf

Fixes: #2901
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years ago osd/ReplicatedPG: require write payload match length
Sage Weil [Fri, 23 Aug 2013 22:11:49 +0000 (15:11 -0700)]
osd/ReplicatedPG: require write payload match length

Hopefully this won't break old clients; I can't think of any.  We *should*
be picky about our requests.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years ago osd/ReplicatedPG: verify we have enough data for WRITE and WRITEFULL
Sage Weil [Fri, 23 Aug 2013 22:02:00 +0000 (15:02 -0700)]
osd/ReplicatedPG: verify we have enough data for WRITE and WRITEFULL

Fixes: #2207
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years ago ceph_rest_api.py: create own default for log_file 531/head
Dan Mick [Fri, 23 Aug 2013 00:30:24 +0000 (17:30 -0700)]
ceph_rest_api.py: create own default for log_file

common/config thinks the default log_file for non-daemons should be "".
Override that so that the default is
    /var/log/ceph/{cluster}-{name}.{pid}.log
since ceph-rest-api is more of a daemon than a client.

Fixes: #6099
Backport: dumpling
Signed-off-by: Dan Mick <dan.mick@inktank.com>
11 years ago ReplicatedPG: mark stats invalid when marking unfound lost
Samuel Just [Fri, 23 Aug 2013 21:50:42 +0000 (14:50 -0700)]
ReplicatedPG: mark stats invalid when marking unfound lost

Fixes: #3660
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years ago ReplicatedPG: make watch timeout configurable
Samuel Just [Fri, 23 Aug 2013 21:50:20 +0000 (14:50 -0700)]
ReplicatedPG: make watch timeout configurable

Fixes: #2354
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years ago osd/OSDCap: allow . for unquoted strings
Sage Weil [Fri, 23 Aug 2013 21:56:46 +0000 (14:56 -0700)]
osd/OSDCap: allow . for unquoted strings

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years ago mon/MonCap: allow . in unquoted string
Sage Weil [Fri, 23 Aug 2013 21:56:37 +0000 (14:56 -0700)]
mon/MonCap: allow . in unquoted string

Fixes: #5967
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years ago librados: make safe and complete callback arguments separate
Sage Weil [Fri, 23 Aug 2013 21:56:12 +0000 (14:56 -0700)]
librados: make safe and complete callback arguments separate

Fixes: #2914
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years ago mds: remove waiting lock before merging with neighbours
David Disseldorp [Mon, 29 Jul 2013 15:05:44 +0000 (17:05 +0200)]
mds: remove waiting lock before merging with neighbours

CephFS currently deadlocks under CTDB's ping_pong POSIX locking test
when run concurrently on multiple nodes.
The deadlock is caused by failed removal of a waiting_locks entry when
the waiting lock is merged with an existing lock, e.g:

Initial MDS state (two clients, same file):
held_locks -- start: 0, length: 1, client: 4116, pid: 7899, type: 2
      start: 2, length: 1, client: 4110, pid: 40767, type: 2
waiting_locks -- start: 1, length: 1, client: 4116, pid: 7899, type: 2

Waiting lock entry 4116@1:1 fires:
handle_client_file_setlock: start: 1, length: 1,
    client: 4116, pid: 7899, type: 2

MDS state after lock is obtained:
held_locks -- start: 0, length: 2, client: 4116, pid: 7899, type: 2
      start: 2, length: 1, client: 4110, pid: 40767, type: 2
waiting_locks -- start: 1, length: 1, client: 4116, pid: 7899, type: 2

Note that the waiting 4116@1:1 lock entry is merged with the existing
4116@0:1 held lock to become a 4116@0:2 held lock. However, the now
handled 4116@1:1 waiting_locks entry remains.

When handling a lock request, the MDS calls adjust_locks() to merge
the new lock with available neighbours. If the new lock is merged,
then the waiting_locks entry is not located in the subsequent
remove_waiting() call because adjust_locks changed the new lock to
include the old locks.
This fix ensures that the waiting_locks entry is removed prior to
modification during merge.
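
The ordering problem can be reproduced with a simplified model. This is a
sketch only: FileLock and LockState are hypothetical, locks carry just a
range and a client, and lock types and conflicts between clients are
ignored.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, simplified lock record; real locks also carry pid and type.
struct FileLock {
  uint64_t start;
  uint64_t length;
  int64_t client;
};

struct LockState {
  std::vector<FileLock> held_locks;
  std::vector<FileLock> waiting_locks;

  // Remove the waiting entry that exactly matches `l`; true if found.
  bool remove_waiting(const FileLock& l) {
    for (auto it = waiting_locks.begin(); it != waiting_locks.end(); ++it) {
      if (it->start == l.start && it->length == l.length &&
          it->client == l.client) {
        waiting_locks.erase(it);
        return true;
      }
    }
    return false;
  }

  // Grant `l`, merging it with overlapping or adjacent locks of its client.
  void add_lock(FileLock l, bool was_waiting) {
    // Must happen before the merge: afterwards `l` covers a wider range
    // and would no longer match its waiting_locks entry.
    if (was_waiting)
      remove_waiting(l);
    for (auto it = held_locks.begin(); it != held_locks.end();) {
      uint64_t l_end = l.start + l.length;
      uint64_t it_end = it->start + it->length;
      if (it->client == l.client && it->start <= l_end && l.start <= it_end) {
        uint64_t new_start = std::min(l.start, it->start);
        l.length = std::max(l_end, it_end) - new_start;
        l.start = new_start;
        it = held_locks.erase(it);
      } else {
        ++it;
      }
    }
    held_locks.push_back(l);
  }
};
```

Replaying the state shown above (held 4116@0:1 and 4110@2:1, waiting
4116@1:1) through add_lock leaves a single 4116@0:2 held lock and an
empty waiting list.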

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Greg Farnum <greg@inktank.com>
11 years ago doc: Fixed broken link by adding Transitioning to ceph-deploy to this doc.
John Wilkins [Fri, 23 Aug 2013 20:43:44 +0000 (13:43 -0700)]
doc: Fixed broken link by adding Transitioning to ceph-deploy to this doc.

fixes: 6107

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
11 years ago Merge pull request #495 from kri5/wip-5820
Yehuda Sadeh [Fri, 23 Aug 2013 20:16:16 +0000 (13:16 -0700)]
Merge pull request #495 from kri5/wip-5820

rgw: rgw-admin throw an error when invalid flag is passed

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
11 years ago Merge pull request #533 from ceph/wip-osd-healthy-tuanble
Sage Weil [Fri, 23 Aug 2013 19:45:06 +0000 (12:45 -0700)]
Merge pull request #533 from ceph/wip-osd-healthy-tuanble

osd: add 'osd heartbeat min healthy ratio' tunable

Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years ago Merge pull request #535 from ceph/wip-readdir-r-sucks
Yehuda Sadeh [Fri, 23 Aug 2013 19:00:30 +0000 (12:00 -0700)]
Merge pull request #535 from ceph/wip-readdir-r-sucks

Fix readdir_r invocation

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
11 years ago os: make readdir_r buffers larger 535/head
Sage Weil [Fri, 23 Aug 2013 18:45:35 +0000 (11:45 -0700)]
os: make readdir_r buffers larger

PATH_MAX isn't quite big enough.

Backport: dumpling, cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago os: fix readdir_r buffer size
Sage Weil [Fri, 23 Aug 2013 18:45:08 +0000 (11:45 -0700)]
os: fix readdir_r buffer size

The buffer needs to be big or else we'll walk all over the stack.

Backport: dumpling, cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago os: KeyValueDB: expose interface to obtain estimated store size
Joao Eduardo Luis [Thu, 22 Aug 2013 15:17:02 +0000 (16:17 +0100)]
os: KeyValueDB: expose interface to obtain estimated store size

On LevelDBStore, instead of using leveldb's GetApproximateSizes() function,
we assess the store's raw size from the contents of
the store dir (this means .sst's, .log's, etc).  The reason behind this
approach is that GetApproximateSizes() would expect us to provide a range
of keys for which to obtain an approximate size; on the other hand, what we
really want is to obtain the size of the store -- not the size of the
data (besides, with the compaction issues we've been seeing, we wonder
how reliable such approximation would be).

Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
11 years ago mon/Paxos: fix another uncommitted value corner case
Sage Weil [Thu, 22 Aug 2013 22:54:48 +0000 (15:54 -0700)]
mon/Paxos: fix another uncommitted value corner case

It is possible that we begin the paxos recovery with an uncommitted
value for, say, commit 100.  During last/collect we discover 100 has been
committed already.  But also, another node provides an uncommitted value
for 101 with the same pn.  Currently, we refuse to learn it, because the
pn is not strictly greater than our current uncommitted pn... even though it is
the next last_committed+1 value that we need.

There are two possible fixes here:

 - make this a >= as we can accept newer values from the same pn.
 - discard our uncommitted value metadata when we commit the value.

Let's do both!
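
Both fixes can be illustrated with a reduced model of the recovery state.
RecoveryState, should_learn, and on_commit are hypothetical names; this is
a sketch of the two rules described above, not the Paxos implementation.

```cpp
#include <cstdint>

// Hypothetical, reduced view of one node's Paxos recovery state.
struct RecoveryState {
  uint64_t last_committed;  // highest committed version
  uint64_t uncommitted_v;   // version of the uncommitted value held (0: none)
  uint64_t uncommitted_pn;  // proposal number it was accepted under (0: none)
};

// Decide whether to learn a peer's uncommitted value (peer_v, peer_pn).
bool should_learn(const RecoveryState& s, uint64_t peer_v, uint64_t peer_pn) {
  // Only the value right after last_committed is of interest.
  if (peer_v != s.last_committed + 1)
    return false;
  // First fix: ">=" rather than ">", so a newer value under the same pn
  // is accepted.
  return peer_pn >= s.uncommitted_pn;
}

// Second fix: forget the uncommitted metadata once that value is committed.
void on_commit(RecoveryState& s, uint64_t committed_v) {
  s.last_committed = committed_v;
  if (s.uncommitted_v <= committed_v) {
    s.uncommitted_v = 0;
    s.uncommitted_pn = 0;
  }
}
```

In the scenario above, either rule alone lets the node learn value 101
with the same pn; applying both also avoids carrying stale metadata.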

Fixes: #6090
Signed-off-by: Sage Weil <sage@inktank.com>
11 years ago rgw: bucket meta remove don't overwrite entry point first
Yehuda Sadeh [Mon, 19 Aug 2013 23:56:27 +0000 (16:56 -0700)]
rgw: bucket meta remove don't overwrite entry point first

Fixes: #6056
When removing a bucket metadata entry we first unlink the bucket
and then we remove the bucket entrypoint object. Originally
when unlinking the bucket we first overwrote the bucket entrypoint
entry marking it as 'unlinked'. However, this is not really needed
as we're just about to remove it. The original version triggered
a bug, as we needed to propagate the new header version first (which
we didn't do, so the subsequent bucket removal failed).

Reviewed-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agoceph-disk: specify the filetype when mounting
Alfredo Deza [Fri, 23 Aug 2013 12:56:07 +0000 (08:56 -0400)]
ceph-disk: specify the filetype when mounting

Signed-off-by: Alfredo Deza <alfredo.deza@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years agodoc/release-notes: v0.67.2
Sage Weil [Fri, 23 Aug 2013 15:12:46 +0000 (08:12 -0700)]
doc/release-notes: v0.67.2

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #528 from kri5/wip-radosgw-admin-help
Yehuda Sadeh [Fri, 23 Aug 2013 14:17:39 +0000 (07:17 -0700)]
Merge pull request #528 from kri5/wip-radosgw-admin-help

rgw: Adds --system option help to radosgw-admin

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agorgw: Adds --system option help to radosgw-admin 528/head
Christophe Courtaut [Thu, 22 Aug 2013 15:54:08 +0000 (17:54 +0200)]
rgw: Adds --system option help to radosgw-admin

Signed-off-by: Christophe Courtaut <christophe.courtaut@gmail.com>
11 years agoosd: add 'osd heartbeat min healthy ratio' tunable 533/head
Sage Weil [Fri, 23 Aug 2013 04:44:31 +0000 (21:44 -0700)]
osd: add 'osd heartbeat min healthy ratio' tunable

This was hard-coded to 1/3; make it tunable.
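
As a rough illustration of what a tunable ratio means here, the check
might look like the sketch below. The function enough_healthy_peers() and
its parameters are hypothetical; only the former 1/3 default comes from
this commit message, and the real OSD code may count peers differently.

```cpp
#include <cstddef>

// Hypothetical helper: returns true when the number of healthy heartbeat
// peers is at least min_healthy_ratio of all peers. The default mirrors
// the previously hard-coded 1/3.
bool enough_healthy_peers(std::size_t healthy, std::size_t total,
                          double min_healthy_ratio = 1.0 / 3.0) {
  if (total == 0) {
    return true;  // assumption: with no peers there is nothing to compare
  }
  return static_cast<double>(healthy) >=
         static_cast<double>(total) * min_healthy_ratio;
}
```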

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #532 from dmick/next
Sage Weil [Fri, 23 Aug 2013 04:34:57 +0000 (21:34 -0700)]
Merge pull request #532 from dmick/next

PGMonitor: pg dump_stuck should respect --format (plain works fine)

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoQA: Compile fsstress if missing on machine.
Sandon Van Ness [Fri, 23 Aug 2013 02:44:40 +0000 (19:44 -0700)]
QA: Compile fsstress if missing on machine.

Some distros lack the ltp-kernel packages, and all we need is
fsstress. This just modifies the shell script to download and compile
fsstress from source and copy it to the right location if it doesn't
already exist where it is expected. It is a very small, quick
compile, and currently only SLES and Debian do not have it already.

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sandon Van Ness <sandon@inktank.com>
11 years agoPGMonitor: pg dump_stuck should respect --format (plain works fine) 532/head
Dan Mick [Fri, 23 Aug 2013 01:53:13 +0000 (18:53 -0700)]
PGMonitor: pg dump_stuck should respect --format (plain works fine)

Signed-off-by: Dan Mick <dan.mick@inktank.com>
11 years agoinit-ceph: behave if incompletely installed
Sage Weil [Sat, 20 Jul 2013 16:02:40 +0000 (09:02 -0700)]
init-ceph: behave if incompletely installed

e.g., Debian 'removed, config remains' state

Fixes: #5695
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Fri, 23 Aug 2013 00:23:09 +0000 (17:23 -0700)]
Merge remote-tracking branch 'gh/next'

11 years agoMOSDOpReply: set reassert_version for very old clients
Greg Farnum [Thu, 22 Aug 2013 22:28:15 +0000 (15:28 -0700)]
MOSDOpReply: set reassert_version for very old clients

I think this must make every sufficiently-old client fail on replay --
very bad!

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years agoyasm-wrapper: more futzing to behave on fedora 19
Sage Weil [Thu, 22 Aug 2013 21:20:57 +0000 (14:20 -0700)]
yasm-wrapper: more futzing to behave on fedora 19

Some new arguments, and behave (return success) when the touch target isn't
specified.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agorgw: fix crash when creating new zone on init
Yehuda Sadeh [Thu, 22 Aug 2013 17:53:12 +0000 (10:53 -0700)]
rgw: fix crash when creating new zone on init

Moving the watch/notify init before the zone init,
as we might need to send a notification.

Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agoceph.spec.in: remove trailing paren in previous commit
Gary Lowell [Thu, 22 Aug 2013 20:29:32 +0000 (13:29 -0700)]
ceph.spec.in:  remove trailing paren in previous commit

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
11 years agoceph.spec.in: Don't invoke debug_package macro on centos.
Gary Lowell [Thu, 22 Aug 2013 18:07:16 +0000 (11:07 -0700)]
ceph.spec.in:  Don't invoke debug_package macro on centos.

If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default.  The build will fail when the package is installed
and the specfile also invokes the macro.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
11 years agoMerge pull request #414 from dachary/wip-5510
athanatos [Thu, 22 Aug 2013 17:24:52 +0000 (10:24 -0700)]
Merge pull request #414 from dachary/wip-5510

replace ObjectContext pointers with shared_ptr

Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years agoMerge pull request #527 from ceph/wip-mon-fix-verbose-output
Sage Weil [Thu, 22 Aug 2013 16:17:16 +0000 (09:17 -0700)]
Merge pull request #527 from ceph/wip-mon-fix-verbose-output

mon: remove lingering debug output

Reviewed-by: Sage Weil <sage@inktank.com>