git.apps.os.sepia.ceph.com Git - ceph.git/log
Roald J. van Loon [Fri, 9 Aug 2013 11:31:10 +0000 (13:31 +0200)]
Validate S3 tokens against Keystone

- Added config option to allow S3 to use Keystone auth
- Implemented JSONDecoder for KeystoneToken
- RGW_Auth_S3::authorize now uses rgw_store_user_info on keystone auth
- Minor fix in get_canon_resource; dout is now after the assignment

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
Sage Weil [Sat, 31 Aug 2013 23:46:52 +0000 (16:46 -0700)]
Merge pull request #561 from ceph/wip-6178

os: LevelDBStore: ignore ENOENT files when estimating store size

Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 31 Aug 2013 17:31:31 +0000 (10:31 -0700)]
Merge branch 'next'

Roald J. van Loon [Sat, 31 Aug 2013 17:30:14 +0000 (10:30 -0700)]
mon: fix uninitialized Op field

- Uninitialized field in MonitorLevelDB::Op causes random build errors.

Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
Roald J. van Loon [Fri, 30 Aug 2013 21:05:52 +0000 (23:05 +0200)]
automake cleanup: uninitialized version_t

Because it is potentially used uninitialized, this variable can end up
holding a completely random uint64_t value.

Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
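The bug class behind the automake cleanup commit above can be shown with a minimal sketch. It assumes `version_t` is a plain 64-bit unsigned integer; `lookup_version()` and `current_version()` are hypothetical helpers, not Ceph code. Without the `= 0`, the "not found" path would return whatever happened to be on the stack.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: version_t is a 64-bit unsigned integer.
using version_t = uint64_t;

// Hypothetical lookup that only assigns *out on success.
bool lookup_version(bool found, version_t stored, version_t *out) {
  assert(out != nullptr);
  if (!found)
    return false;  // *out is left untouched
  *out = stored;
  return true;
}

// Hypothetical caller. Initializing v means the "not found" path returns a
// well-defined 0 instead of an indeterminate value.
version_t current_version(bool found, version_t stored) {
  version_t v = 0;
  lookup_version(found, stored, &v);
  return v;
}
```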
Sage Weil [Sat, 31 Aug 2013 00:02:49 +0000 (17:02 -0700)]
Merge pull request #541 from ceph/wip-6036

osd objecter; copy-get

Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Tue, 27 Aug 2013 22:01:02 +0000 (15:01 -0700)]
osd/ReplicatedPG: do not requeue if not primary

This saves us a bit of work, since we will discard the op anyway if
we aren't primary (or even if we become primary again before we get to
it).

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 27 Aug 2013 22:25:50 +0000 (15:25 -0700)]
osd: COPY_GET operation

Add a new rados operation to copy all user-visible content for an object
in a simple, safe way.  Use a new object_copy_cursor_t to keep track of
our position.

Signed-off-by: Sage Weil <sage@inktank.com>
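A cursor lets a large object be copied in bounded-size pieces across several requests. The sketch below is a simplified, hypothetical model that tracks only the data offset; the fields of the real object_copy_cursor_t are not shown in this log, so `copy_cursor` and `copy_get_chunk()` are illustrative names only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified cursor: tracks only progress through the data.
struct copy_cursor {
  uint64_t data_offset = 0;
  bool data_complete = false;
};

// Copy up to max_bytes of src into dst, resuming at the cursor position,
// then advance the cursor. Each call models one request/reply round trip.
void copy_get_chunk(const std::string &src, std::string &dst,
                    copy_cursor &cursor, uint64_t max_bytes) {
  assert(max_bytes > 0);  // otherwise the copy would never finish
  if (cursor.data_complete)
    return;
  uint64_t remaining = src.size() - cursor.data_offset;
  uint64_t len = std::min(remaining, max_bytes);
  dst.append(src, cursor.data_offset, len);
  cursor.data_offset += len;
  cursor.data_complete = (cursor.data_offset == src.size());
}
```

Because the cursor carries all the state, the side doing the copying can stop and resume between chunks without the source keeping per-copy state.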
Sage Weil [Sun, 25 Aug 2013 04:58:11 +0000 (21:58 -0700)]
osd/ReplicatedPG: factor {execute,reply}_ctx() out of do_op()

Separate the processing of an OpContext from the preamble and
allocation, so that we can delay the execution for some ops (like the
COPYFROM operation we're about to add).

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 17 Aug 2013 06:33:06 +0000 (23:33 -0700)]
osd: feed OSDMaps to the Objecter

Feed every map message we see (that isn't discarded for some other
reason) to the Objecter.  It has the same continuity requirements that
the OSD has, so it should be satisfied with what we get.  It can also
request maps via our MonClient.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 17 Aug 2013 06:17:03 +0000 (23:17 -0700)]
osd: add an Objecter instance

It gets its own lock, timer, and osdmap.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 26 Aug 2013 20:58:47 +0000 (13:58 -0700)]
osd: discriminate based on connection messenger, not peer type

Replace ->get_source().is_osd() checks and instead see if it is the
cluster_messenger so that we do not confuse ourselves when we get
legit requests from other OSDs on our public interface.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 17 Aug 2013 23:23:24 +0000 (16:23 -0700)]
ceph-osd: rename msgr vars

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 17 Aug 2013 06:03:26 +0000 (23:03 -0700)]
osd: add a separate messenger for the Objecter

We will give the OSD's Objecter its own messenger so that it does not
interfere with the OSD when it marks things up or down.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 19 Aug 2013 04:26:44 +0000 (21:26 -0700)]
osd/ReplicatedPG: add whitespace

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 17 Aug 2013 06:45:14 +0000 (23:45 -0700)]
osd: less whitespace

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sun, 18 Aug 2013 00:01:53 +0000 (17:01 -0700)]
osdc/Objecter: allow ops to be canceled

This is useful in general, and specifically will be useful for the
rados COPY operation.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 17 Aug 2013 06:27:39 +0000 (23:27 -0700)]
osdc/Objecter: only request map on startup if epoch == 0

Normal clients have no map and need one to get started.  If we are the
OSD, we will already have one and will get fed maps as they come in.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sun, 25 Aug 2013 18:12:44 +0000 (11:12 -0700)]
osd, objecter: clean up assert_ver()

Create a separate union in the args and clean up the code a bit so that
this doesn't reuse the (unrelated) watch helpers.  No change in
protocol.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Sat, 24 Aug 2013 04:34:28 +0000 (21:34 -0700)]
osd/ReplicatedPG: drop src_obc.clear() calls

These are all about to go out of scope; no need to clear them
explicitly.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 21 Aug 2013 05:23:54 +0000 (22:23 -0700)]
os/ObjectStore: add bufferlist variant of setattrs

And hopefully we can kill the bufferptr ones someday!

Signed-off-by: Sage Weil <sage@inktank.com>
David Zafman [Fri, 30 Aug 2013 23:17:16 +0000 (16:17 -0700)]
unittest_lfnindex testing older HASH_INDEX_TAG

Switch to work with new HOBJECT_WITH_POOL

fixes: #6196

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Fri, 30 Aug 2013 22:41:02 +0000 (15:41 -0700)]
doc/rados/operations/pools: remove experimental note about pg splitting

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Fri, 30 Aug 2013 22:24:41 +0000 (15:24 -0700)]
Merge pull request #560 from ceph/wip-6032-cache-objecter

Wip 6032 cache objecter

Reviewed-by: Sage Weil <sage@inktank.com>
Gregory Farnum [Fri, 30 Aug 2013 21:13:25 +0000 (14:13 -0700)]
Merge pull request #554 from ceph/wip-tier-interface

Specify a user and pg_pool_t interface for tiering/caching specifications

Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Greg Farnum [Thu, 29 Aug 2013 22:26:08 +0000 (15:26 -0700)]
workunits: add a test for caching redirects

This may need to change since it exploits some of the loose
consistency we currently have with caching pools, but for now
it checks that the Objecter does what we want.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Thu, 29 Aug 2013 00:49:48 +0000 (17:49 -0700)]
mon/OSDMonitor: 'osd tier {set,remove}-overlay <pool> [tierpool]'

Also prevent 'osd tier remove ...' if the tierpool is the current overlay.

Signed-off-by: Sage Weil <sage@inktank.com>
Greg Farnum [Thu, 29 Aug 2013 20:58:04 +0000 (13:58 -0700)]
osd_types: note that write_tier wins if read_tier is different

For pg_pool_t.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Thu, 29 Aug 2013 00:47:42 +0000 (17:47 -0700)]
qa/workunits/cephtool/test.sh: test osd tier CLI

Signed-off-by: Sage Weil <sage@inktank.com>
Greg Farnum [Thu, 29 Aug 2013 20:57:10 +0000 (13:57 -0700)]
Objecter: respect read_tier & write_tier for initial op submission

We overwrite target_oloc.pool with the appropriate [read|write]_tier.
If an op both reads and writes, write_tier wins.
We don't handle any sort of redirect yet.

Signed-off-by: Greg Farnum <greg@inktank.com>
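The selection rule from this commit can be sketched as follows. `pool_tiers`, `pick_target_pool()`, and the use of a negative value to mean "no tier" are assumptions for illustration; only the rule itself (read_tier for reads, write_tier for writes, write_tier winning when an op does both) comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the tiering fields of pg_pool_t.
// Assumption: a negative value means "no tier configured".
struct pool_tiers {
  int64_t read_tier = -1;
  int64_t write_tier = -1;
};

// Choose the pool an op is initially sent to.
int64_t pick_target_pool(int64_t base_pool, const pool_tiers &tiers,
                         bool is_read, bool is_write) {
  assert(is_read || is_write);  // every op reads, writes, or both
  int64_t target = base_pool;
  if (is_read && tiers.read_tier >= 0)
    target = tiers.read_tier;
  // Evaluated last, so an op that both reads and writes lands on write_tier.
  if (is_write && tiers.write_tier >= 0)
    target = tiers.write_tier;
  return target;
}
```

Ordering the two checks, rather than adding a separate "read and write" branch, is enough to make write_tier take precedence.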
Sage Weil [Tue, 27 Aug 2013 20:44:52 +0000 (13:44 -0700)]
mon/OSDMonitor: 'osd tier cache-mode <pool> <mode>'

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Thu, 29 Aug 2013 20:52:35 +0000 (13:52 -0700)]
Objecter: be careful about precalculated pgids

The only current user of the precalc_pgid field is list_objects. That's
fine, but we don't want new users to inadvertently appear and somehow
break the caching/tiering stuff by forcing us to go to the base pool
when we should be talking to somebody else. Add an assert to catch
these cases.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Thu, 29 Aug 2013 20:12:41 +0000 (13:12 -0700)]
Objecter: add an Op::target_oloc and use it instead of base_oloc in send_op()

For now we simply set target_oloc = base_oloc in recalc_op_target(), but
we will shortly be doing more interesting things with it there.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Thu, 29 Aug 2013 20:08:03 +0000 (13:08 -0700)]
Objecter: rename Op::oloc -> Op::base_oloc

We want to be able to target other pools for caching and tiering, so
we need to take an oloc from the client and translate it into an
actual target. Rename oloc to base_oloc to make clear which one it is.

Signed-off-by: Greg Farnum <greg@inktank.com>
João Eduardo Luís [Fri, 30 Aug 2013 17:36:07 +0000 (10:36 -0700)]
Merge pull request #530 from ceph/wip-monc-leak

mon/MonClient: release pending outgoing messages on shutdown

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Joao Eduardo Luis [Fri, 30 Aug 2013 17:05:33 +0000 (18:05 +0100)]
os: LevelDBStore: ignore ENOENT files when estimating store size

While iterating over the store files we race against leveldb, which may
be shuffling data around thus removing some files.

By ignoring files that have gone missing by the time we stat them, we
simply leave them out of the total; that's okay, since this is just an estimate.

Fixes: #6178
Signed-off-by: Joao Eduardo Luis <jecluis@gmail.com>
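The approach can be sketched with C++17 `std::filesystem`; the actual LevelDBStore code is not shown in this log and may use POSIX `stat()` directly. `estimate_dir_size()` is a hypothetical helper: it sums regular-file sizes, treats a file that disappears between listing and stat (ENOENT) as simply absent, and still reports any other error as a negative errno-style value.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper: estimate the on-disk size of a store directory.
int64_t estimate_dir_size(const fs::path &dir) {
  std::error_code ec;
  int64_t total = 0;
  for (fs::directory_iterator it(dir, ec), end; !ec && it != end;
       it.increment(ec)) {
    std::error_code fec;
    bool regular = it->is_regular_file(fec);
    if (fec == std::errc::no_such_file_or_directory)
      continue;  // removed concurrently (e.g. by compaction); skip it
    if (fec)
      return -fec.value();
    if (!regular)
      continue;
    uintmax_t size = fs::file_size(it->path(), fec);
    if (fec == std::errc::no_such_file_or_directory)
      continue;  // vanished between listing and stat; fine for an estimate
    if (fec)
      return -fec.value();
    total += static_cast<int64_t>(size);
  }
  if (ec)
    return -ec.value();  // e.g. the directory itself is missing
  return total;
}
```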
Sage Weil [Fri, 30 Aug 2013 16:41:29 +0000 (09:41 -0700)]
ceph-post-file: use mktemp instead of tempfile

tempfile is a Debian thing, apparently; mktemp is present everywhere.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 29 Aug 2013 23:34:42 +0000 (16:34 -0700)]
Merge pull request #559 from ceph/wip-osd-rollback

fixes a few osd dout bugs; make rados model behave with rollback

Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Thu, 29 Aug 2013 23:08:44 +0000 (16:08 -0700)]
ceph_test_rados: rollback bumps user_version

Sigh.  This doesn't make much intuitive sense to me, but this is how it
currently works.

Switch to using the async api while we are at it.

Signed-off-by: Sage Weil <sage@inktank.com>
Samuel Just [Thu, 29 Aug 2013 22:08:58 +0000 (15:08 -0700)]
PGLog: initialize writeout_from in PGLog constructor

Fixes: #6151
Backport: dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Introduced: f808c205c503f7d32518c91619f249466f84c4cf
Reviewed-by: Sage Weil <sage@inktank.com>
Sage Weil [Tue, 27 Aug 2013 20:43:09 +0000 (13:43 -0700)]
mon/OSDMonitor: 'osd pool tier <add|remove> <pool> <tierpool>'

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
Sage Weil [Tue, 27 Aug 2013 19:47:53 +0000 (12:47 -0700)]
osd/OSDMonitor: avoid polluting pending_inc on error for 'osd pool set ...'

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 26 Aug 2013 22:59:54 +0000 (15:59 -0700)]
osd_types: add pg_pool_t cache-related fields

We add fields sufficient to specify
* many pools have a tiering relationship with pool foo
* pool foo is a tier pool for pool bar
* the tiering relationship between foo and bar is specified
  by cache_mode
* client reads and writes for pool foo should be directed to
  pools bar and baz, respectively (where probably, but not
  necessarily, baz == bar or baz == foo).

This lets us specify very sophisticated caching policies on
the server side that all future clients can handle simply by
directing their messages as specified by the read_tier and
write_tier fields and by the (not-yet-implemented) redirect
replies from OSDs.

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
Sage Weil [Thu, 29 Aug 2013 21:28:11 +0000 (14:28 -0700)]
osd/ReplicatedPG: drop dout from object_context_destructor_callback

We don't hold the pg lock, so we cannot call dout here.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 29 Aug 2013 21:27:46 +0000 (14:27 -0700)]
osd/ReplicatedPG: remove debug lines from snapset_context get/put

The dout() prefix does get_osdmap(), which requires (and asserts) that we
hold the pg lock, but in some cases we do not, notably
ReplicatedPG::object_context_destructor_callback.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Thu, 29 Aug 2013 18:39:33 +0000 (11:39 -0700)]
Merge pull request #556 from ceph/wip-user-version

make ceph_test_rados / RadosModel validate the versions exposed by librados

Reviewed-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Sylvain Munaut [Thu, 29 Aug 2013 14:17:30 +0000 (16:17 +0200)]
rgw: Fix S3 auth when using response-* query string params

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Signed-off-by: Sylvain Munaut <s.munaut@whatever-company.com>
Gary Lowell [Thu, 22 Aug 2013 20:29:32 +0000 (13:29 -0700)]
ceph.spec.in:  remove trailing paren in previous commit

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Gary Lowell [Thu, 22 Aug 2013 18:07:16 +0000 (11:07 -0700)]
ceph.spec.in:  Don't invoke debug_package macro on centos.

If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default.  The build will fail when the package is installed
and the specfile also invokes the macro.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Yehuda Sadeh [Thu, 29 Aug 2013 00:54:26 +0000 (17:54 -0700)]
Merge pull request #361 from atwardowski/patch-1

Update adminops.rst add capabilities

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
Sage Weil [Wed, 28 Aug 2013 23:34:36 +0000 (16:34 -0700)]
ceph_test_rados: validate user_version

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 28 Aug 2013 23:29:16 +0000 (16:29 -0700)]
osd/ReplicatedPG: set version, user_version correctly on reads

Set the user version to the *current* object version, not the version
we would use if we were to modify it.  We move the assignments inside
the reply (read or error) block to make it more obvious which paths
are possible.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 28 Aug 2013 23:22:34 +0000 (16:22 -0700)]
messages/MOSDOpReply: fix user_version in reply (add missing braces)

Presumably a mismerge somewhere back around
de20997445803dca4225ed0dac1bad6a8a1e6512.

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 28 Aug 2013 22:41:31 +0000 (15:41 -0700)]
librados: add get_version64()

The C++ AioCompletion::get_version() method only returns 32-bits.  Sigh.

Add a get_version64() method that returns all 64-bits. Do not touch the
32-bit version to avoid breaking the ABI.

Backport: dumpling, cuttlefish
Signed-off-by: Sage Weil <sage@inktank.com>
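Why a 32-bit accessor is not enough can be shown with a small, hypothetical model. `version_holder` is not the librados AioCompletion class; its `version_32()` method only models the narrow return type described in the commit, and `version_64()` models the added 64-bit accessor. Once an object version exceeds 2^32 - 1, the 32-bit value silently wraps while the 64-bit one stays exact.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, not the librados API.
struct version_holder {
  uint64_t objver = 0;

  // Models a legacy accessor with a 32-bit return type: converting to an
  // unsigned 32-bit type keeps only the low 32 bits.
  uint32_t version_32() const { return static_cast<uint32_t>(objver); }

  // Models the added accessor: the full 64-bit value survives.
  uint64_t version_64() const { return objver; }
};
```

Adding a second, wider accessor instead of widening the existing return type is what keeps already-compiled callers working, which is the ABI concern the commit message raises.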
athanatos [Wed, 28 Aug 2013 21:10:37 +0000 (14:10 -0700)]
Merge pull request #550 from ceph/wip-6040

Wip 6040

Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Loic Dachary <loic@dachary.com>
Samuel Just [Tue, 27 Aug 2013 15:49:14 +0000 (08:49 -0700)]
PGLog: maintain writeout_from and trimmed

This way, we can avoid omap_rmkeyrange in the common append
and trim cases.

Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
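A simplified, hypothetical model of this bookkeeping follows: omap is modeled as a `std::map`, log versions as plain integers, and `log_sketch` is not the real PGLog. Instead of removing and rewriting a whole key range, the writer deletes only the keys recorded in `trimmed` and rewrites only entries at or after `writeout_from`; `write()` returns how many keys it touched.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical model of a log persisted as omap key/value pairs.
struct log_sketch {
  static constexpr uint64_t kClean = std::numeric_limits<uint64_t>::max();

  std::map<uint64_t, std::string> entries;  // version -> encoded entry
  uint64_t writeout_from = kClean;          // lowest version needing a write
  std::set<uint64_t> trimmed;               // versions trimmed since last write

  void append(uint64_t v, std::string e) {
    entries[v] = std::move(e);
    writeout_from = std::min(writeout_from, v);
  }

  // Drop every entry with version <= v, remembering which keys to delete.
  void trim_to(uint64_t v) {
    while (!entries.empty() && entries.begin()->first <= v) {
      trimmed.insert(entries.begin()->first);
      entries.erase(entries.begin());
    }
  }

  // Persist pending changes; returns the number of omap keys touched.
  std::size_t write(std::map<uint64_t, std::string> &omap) {
    std::size_t touched = 0;
    for (uint64_t v : trimmed) {
      omap.erase(v);
      ++touched;
    }
    for (auto it = entries.lower_bound(writeout_from); it != entries.end();
         ++it) {
      omap[it->first] = it->second;
      ++touched;
    }
    trimmed.clear();
    writeout_from = kClean;
    return touched;
  }
};
```

In this model, one append plus one trim costs two key operations regardless of log length, which is the kind of saving the commit describes for the common append and trim cases.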
Sage Weil [Wed, 28 Aug 2013 17:23:58 +0000 (10:23 -0700)]
doc/release-notes: v0.56.6 and .7 bobtail

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Wed, 28 Aug 2013 17:29:17 +0000 (10:29 -0700)]
Merge pull request #539 from dachary/master

doc: erasure code developer notes updates

João Eduardo Luís [Wed, 28 Aug 2013 17:08:31 +0000 (10:08 -0700)]
Merge pull request #552 from ceph/wip-4924-master

mon: discover mon addrs, names during election state too

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Sage Weil [Wed, 28 Aug 2013 16:50:11 +0000 (09:50 -0700)]
mon: discover mon addrs, names during election state too

Currently we only detect new mon addrs and names during the probing phase.
For non-trivial clusters, this means we can get into a sticky spot when
we discover enough peers to form a quorum, but not all of them, and the
undiscovered ones are enough to break the mon ranks and prevent an
election.

One way to work around this is to continue addr and name discovery during
the election.  We should also consider making the ranks less sensitive to
the undefined addrs; that is a separate change.

Fixes: #4924
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Tested-by: Bernhard Glomm <bernhard.glomm@ecologic.eu>
Sage Weil [Mon, 26 Aug 2013 23:57:58 +0000 (16:57 -0700)]
doc/dev/cache-pool: document cache pool management interface

Signed-off-by: Sage Weil <sage@inktank.com>
Sage Weil [Mon, 26 Aug 2013 22:11:43 +0000 (15:11 -0700)]
add CEPH_FEATURE_OSD_CACHEPOOL

Signed-off-by: Sage Weil <sage@inktank.com>
Gregory Farnum [Wed, 28 Aug 2013 16:15:36 +0000 (09:15 -0700)]
Merge pull request #549 from ceph/wip-6029

Make user_version a first-class citizen

Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Just <sam.just@inktank.com>
Samuel Just [Tue, 27 Aug 2013 14:27:26 +0000 (07:27 -0700)]
PGLog: don't maintain log_keys_debug if the config is disabled

Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Samuel Just [Tue, 27 Aug 2013 06:19:45 +0000 (23:19 -0700)]
PGLog: move the log size check after the early return

There really are STL implementations (like the one on my Ubuntu 12.04
machine) whose list::size() is linear in the size of the list.  That
assert, therefore, is quite expensive!

Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Greg Farnum [Wed, 28 Aug 2013 00:26:36 +0000 (17:26 -0700)]
Merge remote-tracking branch 'origin/master' into wip-6029

Conflicts:
src/librados/AioCompletionImpl.h

Greg Farnum [Tue, 27 Aug 2013 22:21:49 +0000 (15:21 -0700)]
doc: update to describe new OSD version support as it actually exists

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Wed, 28 Aug 2013 00:24:24 +0000 (17:24 -0700)]
ReplicatedPG: add OpContext::user_at_version

Set this up with the existing at_version member, but only increase
it for user_modify ops. Use this when logging the PG's user_version. In
order to maintain compatibility with old clients on classic pools, we
force user_version to follow at_version whenever it's updated.

Now that we have and are maintaining this PG user version, use it
for the user version on ops that get ENOENT back, when short-circuiting
replies as part of reply_op_error()[1], or when replying to repops
in eval_repop; further use it for the cls_current_version() function. This
is a small semantic change for that function, as previously it would
generally return the same value as the user would get sent back via
MOSDOpReply -- but I don't think it was something you could count on.
We now define it as being the user version of the PG at the start of the
op, and as a bonus it is defined even for read ops (the at_version is
only filled in on write operations).

[1]: We tweak PGLog to make it easier to retrieve both user and PG versions.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 27 Aug 2013 19:55:52 +0000 (12:55 -0700)]
MOSDOpReply: stop filling in replay_version from the MOSDOp to begin with

It's just asking for trouble.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 27 Aug 2013 21:06:49 +0000 (14:06 -0700)]
MOSDOpReply: switch to comprehensive instead of individual version setters

There's little point to updating versions individually when we can
do so en masse and avoid mistakes in duplication.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 27 Aug 2013 18:02:44 +0000 (11:02 -0700)]
MOSDOpReply: add enough fields to be backwards compatible.

The system we've been building up works out very nicely for new clients,
but they could not have interoperated with old clients that were only
referring to our replay_version. In order to deal with this, we add
a bad_replay_version to MOSDOpReply which is encoded where we used
to encode replay_version. bad_replay_version will follow the same semantics
as reassert_version used to (except that it is filled in on reads), but
is not accessible to new clients, who can see only our properly-controlled
replay_version and user_version. This will let old and new clients
interoperate correctly when communicating about watches, etc.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Wed, 28 Aug 2013 00:14:56 +0000 (17:14 -0700)]
osd: actually fill in user_version in pg_log_entry_t

We now require it when creating a pg_log_entry_t. The user_version
is the version which info.last_user_version should be set to
after the transaction is applied, which for everything except for
a user-modify op is going to be the version it was already at.
For now we are filling in the user-modify op's changing user_version
to be ctx->at_version.version.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Wed, 21 Aug 2013 18:26:28 +0000 (11:26 -0700)]
osd: add last_user_version to pg_info_t

We add a corresponding user_version to pg_log_entry_t, and the logic
to assign from one to the other and to recover last_user_version from
a master's log. We aren't yet setting it to anything, though.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Wed, 21 Aug 2013 00:11:14 +0000 (17:11 -0700)]
ReplicatedPG: remove OpContext::reply_user_version

ctx->new_obs.oi.user_version is initialized to ctx->obs.oi.user_version,
and for read ops it won't be changed. That means
reply_user_version == ctx->new_obs.oi.user_version in all cases, so we
don't need it.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Wed, 21 Aug 2013 00:13:53 +0000 (17:13 -0700)]
osd: switch object_info_t::user_version to be a version_t

We never expose the full eversion_t data to users, and do not want to.
However, we pull some tricks in the encode/decode functions to avoid
having to change the object_info_t disk format for this change.
When we can break compatibility, we should simplify this.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 20 Aug 2013 23:22:27 +0000 (16:22 -0700)]
ReplicatedPG: Fill in the MOSDOpReply's user_version

As part of this, rename OpContext::reply_version->reply_user_version.
The semantics that necessitate the reply_version are only for user versions,
so rename it for clarity. Then use the reply_user_version in
set_user_version() (if the op succeeded).
For now we use the PG version for ENOENT (preserving the previous
semantics), but that will get changed to the pg's user_version soon
as well.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 20 Aug 2013 23:18:18 +0000 (16:18 -0700)]
ReplicatedPG: set the replay version based on the at_version

The replay version is not for users to consume, so we don't want
to use the user_version for it.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 20 Aug 2013 20:55:54 +0000 (13:55 -0700)]
Objecter: expose MOSDOp's new user_version instead of the replay_version

We don't want users to ever see the replay_version, which is about
to become private RADOS data.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 20 Aug 2013 21:21:04 +0000 (14:21 -0700)]
Objecter: librados: mass switch from eversion_t to version_t

There are a lot of pointers throughout our request infrastructure used solely
for exporting the version to users. The interfaces we actually expose only
provide a uint64_t (leaving off eversion_t's epoch), and that's all we're
going to maintain in our new user_version scheme, so don't pretend we'll
have more in our internal interfaces.

I audited this pretty carefully; in particular:
- Op::objver is only used for passing data back to users via the calling
  functions (IoCtxImpl::last_objver, etc.).
- IoCtxImpl::last_objver is used only for the set_sync_op_version() call,
  which provides data only for the uint64_t get_last_version() and
  rados_get_last_version() calls.
- AioCompletionImpl::objver is used only for the uint64_t get_version() call.
- LingerOp::pobjver is used only for referencing things that are now version_t.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 20 Aug 2013 21:21:32 +0000 (14:21 -0700)]
Objecter: rename Op::version to Op::replay_version

This is used for replay, so let's be more precise!

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Wed, 28 Aug 2013 00:02:15 +0000 (17:02 -0700)]
MOSDOpReply: add user_version field

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 27 Aug 2013 22:16:29 +0000 (15:16 -0700)]
doc: include plan for new user_version support

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Thu, 22 Aug 2013 21:54:19 +0000 (14:54 -0700)]
ReplicatedPG: do not do a redundant set of ctx->new_obs.oi.version

We set this in the if below for writes, and for reads it doesn't need to
be updated (and isn't). Remove the confusing double-set so future code
inspectors don't get concerned there's a bug like I did.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Mon, 26 Aug 2013 21:38:30 +0000 (14:38 -0700)]
ReplicatedPG: remove long-dead branch

This was confusing the heck out of me when trying to figure out
why I was hitting an assert. So replace the if-else block with
a more appropriate assert and don't include any misleading calls
to prepare_transaction() from sub_op_modify().

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Wed, 28 Aug 2013 00:00:38 +0000 (17:00 -0700)]
MOSDOpReply: rename *_version() -> *_replay_version()

We have been returning the object's "user version" and using that
for replay, but that is in fact incorrect. In preparation for fixing
up the user version semantics, rename get_version to get_replay_version
and set_version to set_replay_version.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 27 Aug 2013 23:56:40 +0000 (16:56 -0700)]
MOSDOpReply: rename reassert_version -> replay_version

Because that's what it's for. reassert_version is a bit ambiguous.

Signed-off-by: Greg Farnum <greg@inktank.com>
Greg Farnum [Tue, 27 Aug 2013 22:08:28 +0000 (15:08 -0700)]
docs: document how the current OSD PG/object versions work

Signed-off-by: Greg Farnum <greg@inktank.com>
Sage Weil [Tue, 27 Aug 2013 21:02:26 +0000 (14:02 -0700)]
Merge pull request #548 from dmick/next

ceph.in: add to $PATH if needed regardless of LD_LIBRARY_PATH state

Reviewed-by: Sage Weil <sage@inktank.com>
Dan Mick [Tue, 27 Aug 2013 20:37:14 +0000 (13:37 -0700)]
ceph.in: add to $PATH if needed regardless of LD_LIBRARY_PATH state

Signed-off-by: Dan Mick <dan.mick@inktank.com>
athanatos [Tue, 27 Aug 2013 17:56:49 +0000 (10:56 -0700)]
Merge pull request #545 from dachary/wip-6117

SharedPtrRegistry: get_next must not delete while holding the lock

Reviewed-by: Samuel Just <sam.just@inktank.com>
John Wilkins [Tue, 27 Aug 2013 17:25:50 +0000 (10:25 -0700)]
doc: Updated to accurately reflect that upstart applies to a single node.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
Gary Lowell [Tue, 27 Aug 2013 16:53:12 +0000 (09:53 -0700)]
ceph.spec.in: radosgw package doesn't require mod_fcgi

Fixes #5702

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Tue, 27 Aug 2013 15:30:50 +0000 (08:30 -0700)]
librbd: fix debug print in aio_write

Reported-by: James Harper <james.harper@bendigoit.com.au>
Signed-off-by: Sage Weil <sage@inktank.com>
Roald J. van Loon [Tue, 27 Aug 2013 15:17:19 +0000 (08:17 -0700)]
cleanup: removed last references to g_conf from auth

Trivial cleanup. There were still 3 references to g_conf in CephxKeyServer.
Replaced them with cct->_conf.

Signed-off-by: Roald J. van Loon <roaldvanloon@gmail.com>
Loic Dachary [Tue, 27 Aug 2013 14:09:17 +0000 (16:09 +0200)]
SharedPtrRegistry: get_next must not delete while holding the lock

    bool get_next(const K &key, pair<K, VPtr> *next)

may indirectly delete the object pointed to by next->second when
doing:

    *next = make_pair(i->first, next_val);

and it will deadlock (EDEADLK) when

    void operator()(V *to_remove) {
      {
        Mutex::Locker l(parent->lock);

tries to acquire the lock, because the lock is already held. The fix
isolates the Mutex::Locker in its own block and sets the *next*
parameter outside of that block.
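A minimal C++ sketch of that pattern follows. It assumes a simplified registry and is not Ceph's SharedPtrRegistry: Registry, OnRemoval, lookup_or_create and size are hypothetical names, and std::mutex stands in for Ceph's Mutex. get_next does its lookup inside a locked block and assigns to *next only after that block ends. If the assignment drops the last reference to the old next->second, the deleter can then take the same lock without self-deadlocking.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical, simplified registry (not Ceph's SharedPtrRegistry).
// Entries are weak_ptrs; each object's deleter takes the registry lock
// to erase its own entry.
template <typename K, typename V>
class Registry {
 public:
  using VPtr = std::shared_ptr<V>;

 private:
  struct OnRemoval {
    Registry *parent;
    K key;
    void operator()(V *to_remove) {
      {
        // Self-deadlocks if the calling thread already holds lock_.
        std::lock_guard<std::mutex> l(parent->lock_);
        auto it = parent->contents_.find(key);
        // Erase only if the entry was not replaced by a fresh object.
        if (it != parent->contents_.end() && it->second.expired())
          parent->contents_.erase(it);
      }
      delete to_remove;
    }
  };

  std::mutex lock_;
  std::map<K, std::weak_ptr<V>> contents_;

 public:
  VPtr lookup_or_create(const K &key) {
    std::lock_guard<std::mutex> l(lock_);
    if (VPtr existing = contents_[key].lock()) return existing;
    VPtr created(new V(), OnRemoval{this, key});
    contents_[key] = created;
    return created;
  }

  // Finds the first live entry after `key`. Returns false if none exists.
  bool get_next(const K &key, std::pair<K, VPtr> *next) {
    assert(next != nullptr);
    std::pair<K, VPtr> found;
    {
      std::lock_guard<std::mutex> l(lock_);
      for (auto i = contents_.upper_bound(key); i != contents_.end(); ++i) {
        if (VPtr p = i->second.lock()) {
          found = std::make_pair(i->first, p);
          break;
        }
      }
    }  // lock_ released here
    if (!found.second) return false;
    // This may drop the last reference to the old next->second. Its
    // OnRemoval can take lock_ because this thread no longer holds it.
    *next = std::move(found);
    return true;
  }

  std::size_t size() {
    std::lock_guard<std::mutex> l(lock_);
    return contents_.size();
  }
};
```

For example, when `next` holds the only reference to the object for key 1, calling get_next(1, &next) moves `next` to key 2. The deleter for key 1 then erases its entry. In the buggy ordering, that assignment would run while the lock is still held.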

A test case demonstrating the problem is added to test_sharedptr_registry.cc.

http://tracker.ceph.com/issues/6117 fixes #6117

Signed-off-by: Loic Dachary <loic@dachary.org>
Loic Dachary [Mon, 26 Aug 2013 11:12:00 +0000 (13:12 +0200)]
doc: erasure code developer notes updates

* unify conventions to match those used by jerasure (data chunk = K,
  coding chunk = M, use coding instead of parity, use erasures instead
  of erased)

* wrap lines at 80 characters

* modify the descriptions to take into account that the chunk rank
  will be encoded in the pool name and not on a per-object basis

* remove the doxygen link to ErasureCodeInterface because it fails
  (see "doc: asphyxiate does not support class",
  http://tracker.ceph.com/issues/6115)

* only systematic codes are considered at this point (all jerasure
  techniques are systematic). Although the API could be extended to
  include non-systematic codes, that would probably be over-engineering
  at this point.

* add link to http://tracker.ceph.com/issues/6113
  ("add ceph osd pool create [name] [key=value]")

* update the plugin system description to match the proposed
  implementation http://tracker.ceph.com/issues/5877
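The following minimal C++ sketch illustrates the K/M convention and what "systematic" means. It is not jerasure's API or any of its techniques. encode_xor and recover_one are hypothetical helpers that implement the simplest systematic code: a single XOR coding chunk (M = 1), which tolerates exactly one erasure. Encoding leaves the K data chunks unchanged and appends the coding chunk, which is what makes the code systematic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Chunk = std::vector<std::uint8_t>;

// Encodes K equal-sized data chunks into K + 1 chunks. The first K
// output chunks are the data, unchanged (systematic). The last is the
// single (M = 1) coding chunk: the bytewise XOR of all data chunks.
std::vector<Chunk> encode_xor(const std::vector<Chunk> &data) {
  assert(!data.empty());
  std::vector<Chunk> out(data);  // systematic part: data copied verbatim
  Chunk coding(data[0].size(), 0);
  for (const Chunk &c : data) {
    assert(c.size() == coding.size());
    for (std::size_t b = 0; b < c.size(); ++b) coding[b] ^= c[b];
  }
  out.push_back(coding);
  return out;
}

// Recovers one erasure. The XOR of all K + 1 chunks is zero, so XORing
// every surviving chunk reproduces the missing one, whether it is a
// data chunk or the coding chunk.
Chunk recover_one(const std::vector<Chunk> &chunks, std::size_t erased) {
  assert(chunks.size() >= 2 && erased < chunks.size());
  const std::size_t len = chunks[erased == 0 ? 1 : 0].size();
  Chunk r(len, 0);
  for (std::size_t i = 0; i < chunks.size(); ++i) {
    if (i == erased) continue;  // the erased chunk's contents are not read
    for (std::size_t b = 0; b < len; ++b) r[b] ^= chunks[i][b];
  }
  return r;
}
```

With M = 1, any one of the K + 1 chunks can be rebuilt from the others. Reed-Solomon techniques such as those in jerasure generalize this idea to M > 1 coding chunks.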

http://tracker.ceph.com/issues/4929 refs #4929

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
Loic Dachary [Tue, 27 Aug 2013 11:58:33 +0000 (13:58 +0200)]
common: move SharedPtrRegistry test after t.join

The thread created to test SharedPtrRegistry race conditions updates a
value (ptr) that is tested by the main gtest thread but is not
protected by a lock. Instead of adding a lock, the main thread tests
the value after pthread_join() on the child thread.
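The minimal C++ sketch below shows why testing after the join is enough. It uses std::thread in place of the raw pthread API. std::thread::join gives the same guarantee as pthread_join(): the child's completion synchronizes with the return from join, so everything the child wrote is visible afterwards. read_after_join is a hypothetical helper, not the gtest code from test_sharedptr_registry.cc.

```cpp
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical stand-in for the test. The child thread updates `ptr`
// without any lock, as described above.
int read_after_join() {
  std::shared_ptr<int> ptr;
  std::thread t([&ptr] { ptr = std::make_shared<int>(42); });
  // Testing `ptr` here, while the child may still be writing it,
  // would be a data race.
  t.join();                // child's completion synchronizes-with join()
  assert(ptr != nullptr);  // safe: the child's write is now visible
  return *ptr;
}
```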

http://tracker.ceph.com/issues/6130 fixes #6130

Signed-off-by: Loic Dachary <loic@dachary.org>
Sage Weil [Tue, 27 Aug 2013 01:11:32 +0000 (18:11 -0700)]
Merge remote-tracking branch 'gh/next'

Sage Weil [Sat, 24 Aug 2013 21:04:09 +0000 (14:04 -0700)]
osd: install admin socket commands after signals

This lets us tell by the presence of the admin socket commands whether
a signal will make us shut down cleanly.  See #5924.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Sage Weil [Mon, 26 Aug 2013 20:19:27 +0000 (13:19 -0700)]
mon/DataHealthService: preserve compat of data stats dump

See 96621bdb004e539a0186fb592f44d51cf49f1c31.

Signed-off-by: Sage Weil <sage@inktank.com>