git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
12 years agotest_lock_fence.sh, rbdrw.py: rbd lock/fence test
Dan Mick [Mon, 26 Nov 2012 21:43:13 +0000 (13:43 -0800)]
test_lock_fence.sh, rbdrw.py: rbd lock/fence test

qa/workunits/rbd/test_lock_fence.sh runs using test/rbdrw.py

rbdrw.py creates an image, locks it, and runs an I/O loop;
test_lock_fence.sh runs it, waits, and then blacklists that client,
which causes rbdrw.py to get ESHUTDOWN on operations thereafter.
Currently doesn't work with rbd caching enabled.

rbd.py gets new exception type for ESHUTDOWN

Fixes: #3190
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
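The new exception type for ESHUTDOWN mentioned above can be pictured as an errno-to-exception mapping. The Python sketch below is illustrative only: `Error`, `ConnectionShutdown`, and `check_ret` are assumed names, not necessarily the ones rbd.py uses.

```python
import errno


class Error(Exception):
    """Base class for errors in this sketch (hypothetical name)."""


class ConnectionShutdown(Error):
    """Raised for ESHUTDOWN, e.g. after the client has been blacklisted."""


# Map errno values to exception types; anything unlisted falls back to Error.
ERRNO_TO_EXCEPTION = {errno.ESHUTDOWN: ConnectionShutdown}


def check_ret(ret):
    """Return non-negative results unchanged; raise for negative errno codes."""
    if ret >= 0:
        return ret
    exc_type = ERRNO_TO_EXCEPTION.get(-ret, Error)
    raise exc_type(f"operation failed with errno {-ret}")
```

A test loop like the one in rbdrw.py could then catch the specific type to tell "fenced" apart from any other I/O failure.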
12 years agoMerge remote-tracking branch 'gh/wip-4249-master'
Sage Weil [Tue, 26 Feb 2013 01:48:16 +0000 (17:48 -0800)]
Merge remote-tracking branch 'gh/wip-4249-master'

12 years agoMerge remote-tracking branch 'gh/wip-4252'
Sage Weil [Tue, 26 Feb 2013 01:41:07 +0000 (17:41 -0800)]
Merge remote-tracking branch 'gh/wip-4252'

12 years agomon: PaxosService: remove lingering uses of paxos getters and wait methods
Sage Weil [Sat, 23 Feb 2013 17:01:07 +0000 (09:01 -0800)]
mon: PaxosService: remove lingering uses of paxos getters and wait methods

We should use the PaxosService getters, setters, and wait methods when and
wherever possible.  These must have fallen through the cracks during the
merge.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-4147'
Sage Weil [Tue, 26 Feb 2013 00:49:37 +0000 (16:49 -0800)]
Merge remote-tracking branch 'gh/wip-4147'

12 years agodoc: Added subnet example and verbiage to network settings.
John Wilkins [Tue, 26 Feb 2013 00:29:57 +0000 (16:29 -0800)]
doc: Added subnet example and verbiage to network settings.

fixes: #4049

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Added content to remove REJECT rules from iptables.
John Wilkins [Tue, 26 Feb 2013 00:12:50 +0000 (16:12 -0800)]
doc: Added content to remove REJECT rules from iptables.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agotest_rbd: move flatten tests back into TestClone
Josh Durgin [Tue, 26 Feb 2013 00:09:26 +0000 (16:09 -0800)]
test_rbd: move flatten tests back into TestClone

They need the same setup, and it's easy enough to run specific
subtests. Making them a separate subclass accidentally duplicated
tests from TestClone.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoqa: enable watch-notify dependent test
Josh Durgin [Mon, 25 Feb 2013 23:59:48 +0000 (15:59 -0800)]
qa: enable watch-notify dependent test

This works now that watch-notify has been reworked a bit.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agotest_rbd: close image before removing it
Josh Durgin [Mon, 25 Feb 2013 23:55:36 +0000 (15:55 -0800)]
test_rbd: close image before removing it

This error was masked before by watch notify not differentiating
between watches from the same client with different cookies.
Reopen the image at the end of this test so teardown works.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agodoc: Added a small ref section for osd config reference.
John Wilkins [Mon, 25 Feb 2013 23:28:07 +0000 (15:28 -0800)]
doc: Added a small ref section for osd config reference.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Updated osd configuration reference.
John Wilkins [Mon, 25 Feb 2013 23:27:09 +0000 (15:27 -0800)]
doc: Updated osd configuration reference.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agosystest: restrict list error acceptance
Josh Durgin [Mon, 25 Feb 2013 23:02:50 +0000 (15:02 -0800)]
systest: restrict list error acceptance

Only ignore errors after the midway point if the midway_sem_post is
defined.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agosystest: fix race with pool deletion
Josh Durgin [Mon, 25 Feb 2013 22:55:34 +0000 (14:55 -0800)]
systest: fix race with pool deletion

The second test had pool deletion and object listing wait on the same
semaphore to connect and start. This led to errors sometimes when the
pool was deleted before it could be opened by the listing process. Add
another semaphore so the pool deletion happens only after the listing
has begun.

Fixes: #4147
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
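The ordering fix described above can be sketched with the extra semaphore alone. This is a minimal Python illustration, not the systest code; `listing_started`, `lister`, and `deleter` are hypothetical names.

```python
import threading


def run_ordered():
    """Return the event order when deletion waits for listing to begin."""
    events = []
    # Posted once the lister has opened the pool (the added semaphore).
    listing_started = threading.Semaphore(0)

    def lister():
        events.append("pool opened for listing")
        listing_started.release()  # signal: listing has begun

    def deleter():
        listing_started.acquire()  # block until listing has begun
        events.append("pool deleted")

    # Start the deleter first to show that start order does not matter.
    threads = [threading.Thread(target=deleter), threading.Thread(target=lister)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events
```

Because the deleter blocks on the semaphore, the pool can no longer disappear before the listing process has opened it.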
12 years agoqa: output rados test names as they run
Josh Durgin [Mon, 25 Feb 2013 22:09:41 +0000 (14:09 -0800)]
qa: output rados test names as they run

So we don't have to figure out which test is running from the output,
which can be difficult with the system tests.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrbd: remove unused internal method
Josh Durgin [Mon, 25 Feb 2013 20:12:57 +0000 (12:12 -0800)]
librbd: remove unused internal method

get_snap_size() has been replaced by get_image_size(snap_id) everywhere.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoMerge branch 'wip-4249' into wip-4249-master
Josh Durgin [Mon, 25 Feb 2013 20:05:16 +0000 (12:05 -0800)]
Merge branch 'wip-4249' into wip-4249-master

Make snap_rollback() only take a read lock on snap_lock, since
it does not modify snapshot-related fields.
Conflicts:
src/librbd/internal.cc

12 years agolibrbd: drop snap_lock before invalidating cache
Josh Durgin [Mon, 25 Feb 2013 19:33:48 +0000 (11:33 -0800)]
librbd: drop snap_lock before invalidating cache

Writeback will take the snap_lock, so read everything we need under it
before invalidating the cache. This avoids a recursive lock when writeback
uses snap_lock while snap_rollback() was holding it.

Remove a not-very-useful debugging message that depended on snap_lock being held.

Fixes: #4249
Backport: bobtail
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
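The locking pattern in this fix (read what is needed under the lock, drop it, then invalidate) can be sketched as follows. This is a simplified Python model under stated assumptions: the `Image` class, its fields, and the 1-second timeout used to surface a would-be deadlock are all hypothetical.

```python
import threading


class Image:
    def __init__(self):
        self.snap_lock = threading.Lock()  # non-reentrant, like a plain mutex
        self.snap_id = 7
        self.log = []

    def writeback(self):
        # Writeback needs snap_lock; if the caller still held it, this would hang.
        if not self.snap_lock.acquire(timeout=1):
            raise RuntimeError("deadlock: snap_lock already held")
        try:
            self.log.append("writeback")
        finally:
            self.snap_lock.release()

    def invalidate_cache(self):
        self.writeback()
        self.log.append("cache invalidated")

    def snap_rollback(self):
        with self.snap_lock:
            snap_id = self.snap_id  # read everything needed under the lock
        self.invalidate_cache()     # the lock has already been dropped here
        self.log.append(f"rolled back to snap {snap_id}")
```

Moving the `invalidate_cache()` call inside the `with` block reproduces the recursive-lock problem the commit describes.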
12 years agoMerge pull request #71 from dalgaaf/wip-da-sca-c_str
Sage Weil [Mon, 25 Feb 2013 17:10:49 +0000 (09:10 -0800)]
Merge pull request #71 from dalgaaf/wip-da-sca-c_str

fix some c_str() usage

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #72 from dalgaaf/wip-da-comp-sign-unsign
Sage Weil [Mon, 25 Feb 2013 16:53:22 +0000 (08:53 -0800)]
Merge pull request #72 from dalgaaf/wip-da-comp-sign-unsign

Monitor.cc: fix -Wsign-

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #70 from dalgaaf/wip-da-fix-spec
Sage Weil [Mon, 25 Feb 2013 16:52:26 +0000 (08:52 -0800)]
Merge pull request #70 from dalgaaf/wip-da-fix-spec

Add missing files to spec and debian files

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agodoc: Moved admonition to kernel mount.
John Wilkins [Mon, 25 Feb 2013 16:21:11 +0000 (08:21 -0800)]
doc: Moved admonition to kernel mount.

fixes: #4146

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Added verbiage to describe single host deadlocks.
John Wilkins [Mon, 25 Feb 2013 16:19:58 +0000 (08:19 -0800)]
doc: Added verbiage to describe single host deadlocks.

fixes: #3076

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agoMonitor.cc: fix -Wsign-compare 72/head
Danny Al-Gaaf [Mon, 25 Feb 2013 15:38:50 +0000 (16:38 +0100)]
Monitor.cc: fix -Wsign-compare

Fix -Wsign-compare, make 'i' unsigned int.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
12 years agodebian: add new files 70/head
Danny Al-Gaaf [Mon, 25 Feb 2013 14:36:37 +0000 (15:36 +0100)]
debian: add new files

Add new (installed) files to debian install files.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
12 years agoceph.spec.in: add new files
Danny Al-Gaaf [Mon, 25 Feb 2013 14:34:17 +0000 (15:34 +0100)]
ceph.spec.in: add new files

Add new files to spec file since they get installed.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
12 years agoClient.cc: don't pass c_str() if std::string is expected 71/head
Danny Al-Gaaf [Mon, 25 Feb 2013 14:28:37 +0000 (15:28 +0100)]
Client.cc: don't pass c_str() if std::string is expected

Don't pass c_str() to _lookup(). The function expects a std::string
as second parameter.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
12 years agoPaxos.h: fix dangerous use of c_str()
Danny Al-Gaaf [Mon, 25 Feb 2013 13:10:20 +0000 (14:10 +0100)]
Paxos.h: fix dangerous use of c_str()

No need to use c_str() in get_statename(), simply return a
std::string instead.

Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
12 years agodebian: make gdisk, parted requirements, not recommendations.
Sage Weil [Sun, 24 Feb 2013 21:31:06 +0000 (13:31 -0800)]
debian: make gdisk, parted requirements, not recommendations.

ceph-prepare-disk (and thus ceph-deploy) need this.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Sun, 24 Feb 2013 21:22:47 +0000 (13:22 -0800)]
Merge remote-tracking branch 'gh/next'

12 years agoMinor wording change.
Neil Levine [Sat, 23 Feb 2013 00:43:44 +0000 (00:43 +0000)]
Minor wording change.

Signed-off-by: Neil Levine <neil.levine@inktank.com>
12 years agoGrammar typo
Neil Levine [Sat, 23 Feb 2013 00:24:48 +0000 (00:24 +0000)]
Grammar typo

Signed-off-by: Neil Levine <neil.levine@inktank.com>
12 years agoChanges to the OS support, multi-data center, and hypervisor questions.
Neil Levine [Fri, 22 Feb 2013 22:41:09 +0000 (22:41 +0000)]
Changes to the OS support, multi-data center, and hypervisor questions.

Signed-off-by: Neil Levine <neil.levine@inktank.com>
12 years agomds: reencode MDSMap in MMDSMap if MDSENC feature is not present
Sage Weil [Sun, 24 Feb 2013 00:36:36 +0000 (16:36 -0800)]
mds: reencode MDSMap in MMDSMap if MDSENC feature is not present

In some cases the MMDSMap message from mon -> client passes from leader ->
peon -> client, and the leader doesn't encode with the correct feature
bits.  As with MMOSDMap, we reencode the nested MDSMap based on the
features if relevant bits are not present.

We forgot to include this with the mds encoding changes.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoqa/run_xfstests.sh: use $TESTDIR instead of /tmp/cephtest
Sage Weil [Sat, 23 Feb 2013 16:38:10 +0000 (08:38 -0800)]
qa/run_xfstests.sh: use $TESTDIR instead of /tmp/cephtest

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: an interval can't go readwrite if its acting is empty
Sage Weil [Thu, 21 Feb 2013 19:15:58 +0000 (11:15 -0800)]
osd: an interval can't go readwrite if its acting is empty

Let's not forget that min_size can be zero.

Fixes: #4159
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 4277265d99647c9fe950ba627e5d86234cfd70a9)

12 years agomkcephfs: create mon data dir prior to ceph-mon --mkfs
Sage Weil [Sat, 23 Feb 2013 00:24:18 +0000 (16:24 -0800)]
mkcephfs: create mon data dir prior to ceph-mon --mkfs

ceph-mon now expects this directory to already exist.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc: Added a lot of info to OSD troubleshooting.
John Wilkins [Fri, 22 Feb 2013 23:38:20 +0000 (15:38 -0800)]
doc: Added a lot of info to OSD troubleshooting.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Added mention of Admin Socket interface and brief description.
John Wilkins [Fri, 22 Feb 2013 23:37:03 +0000 (15:37 -0800)]
doc: Added mention of Admin Socket interface and brief description.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Changed title to OSD and PG, indicating both subjects are covered.
John Wilkins [Fri, 22 Feb 2013 23:35:24 +0000 (15:35 -0800)]
doc: Changed title to OSD and PG, indicating both subjects are covered.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Added references from monitoring OSD to troubleshooting OSD.
John Wilkins [Fri, 22 Feb 2013 23:34:38 +0000 (15:34 -0800)]
doc: Added references from monitoring OSD to troubleshooting OSD.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: set maxdepth to 2, so TOC isn't so long with new OSD troubleshooting.
John Wilkins [Fri, 22 Feb 2013 23:33:43 +0000 (15:33 -0800)]
doc: set maxdepth to 2, so TOC isn't so long with new OSD troubleshooting.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agoclient: use 4MB f_bsize and f_frsize for statfs
Sage Weil [Fri, 22 Feb 2013 23:15:27 +0000 (15:15 -0800)]
client: use 4MB f_bsize and f_frsize for statfs

Old stat(1) reports:

  Block size: 1048576    Fundamental block size: 1048576

and the df(1) arithmetic works out.  New stat(1) reports:

  Block size: 1048576    Fundamental block size: 4096

which is what we are shoving into statvfs, but we have the f_bsize and
f_frsize arithmetic swapped.  However, doing the *correct* reporting would
then break the old stat by making both sizes appear to be 4KB (or
whatever).

Sidestep the issue by making *both* values 4MB, which is both large enough
to report large FS sizes, and also the default stripe size and thus a
"reasonable" value to report for a block size.

Perhaps in the future, when we no longer care about old userland, we can
report the page size for f_bsize, which is probably the "most correct"
thing to do.

Fixes: #3794. See also #3793.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
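The arithmetic behind this sidestep can be sketched in a few lines. The Python below is an illustration only: `fill_statvfs` and `df_total_bytes` are hypothetical helpers, and the dictionary stands in for a struct statvfs.

```python
BLOCK_SHIFT = 22               # 4 MB == 1 << 22
BLOCK_SIZE = 1 << BLOCK_SHIFT


def fill_statvfs(total_bytes, free_bytes):
    """Report sizes in 4 MB units, using the same value for both block sizes."""
    return {
        "f_bsize": BLOCK_SIZE,
        "f_frsize": BLOCK_SIZE,
        "f_blocks": total_bytes >> BLOCK_SHIFT,
        "f_bfree": free_bytes >> BLOCK_SHIFT,
    }


def df_total_bytes(st, use_frsize=True):
    """Total size as a tool would compute it from either block-size field."""
    unit = st["f_frsize"] if use_frsize else st["f_bsize"]
    return st["f_blocks"] * unit
```

With both fields equal, a tool that multiplies the block count by f_bsize and one that multiplies by f_frsize arrive at the same total, which is why old and new userland agree.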
12 years agotest/librados/watch_notify: fix warning
Sage Weil [Fri, 22 Feb 2013 22:57:45 +0000 (14:57 -0800)]
test/librados/watch_notify: fix warning

In file included from test/librados/watch_notify.cc:8:0:
../src/gtest/include/gtest/gtest.h: In function ‘testing::AssertionResult testing::internal::CmpHelperEQ(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int, T2 = int]’:
../src/gtest/include/gtest/gtest.h:1300:30: instantiated from ‘static testing::AssertionResult testing::internal::EqHelper::Compare(const char*, const char*, const T1&, const T2&) [with T1 = long unsigned int, T2 = int, bool lhs_is_null_literal = false]’
test/librados/watch_notify.cc:67:224: instantiated from here
warning: ../src/gtest/include/gtest/gtest.h:1263:3: comparison between signed and unsigned integer expressions [-Wsign-compare]

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-object-corpus: re-update
Sage Weil [Fri, 22 Feb 2013 22:39:34 +0000 (14:39 -0800)]
ceph-object-corpus: re-update

This was set by 9af94eea209fc2555f66214f01f3edddc35d4209, then single
paxos merge, then accidentally reverted by the next commit
6cb53740f2c356768adfbd3cb55c007d187309d3.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoPG::proc_replica_log: oinfo.last_complete must be *before* first entry in omissing
Samuel Just [Fri, 22 Feb 2013 22:12:28 +0000 (14:12 -0800)]
PG::proc_replica_log: oinfo.last_complete must be *before* first entry in omissing

Fixes: #4189
Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-rbd-flatten-deadlock'
Sage Weil [Fri, 22 Feb 2013 22:23:45 +0000 (14:23 -0800)]
Merge remote-tracking branch 'gh/wip-rbd-flatten-deadlock'

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-objecter-fsx'
Sage Weil [Fri, 22 Feb 2013 22:16:07 +0000 (14:16 -0800)]
Merge remote-tracking branch 'gh/wip-objecter-fsx'

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-3403-4-rebase'
David Zafman [Fri, 22 Feb 2013 20:49:47 +0000 (12:49 -0800)]
Merge branch 'wip-3403-4-rebase'

Feature: #3403

Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years agoobjecter: don't resend linger ops unnecessarily
Josh Durgin [Fri, 22 Feb 2013 07:31:21 +0000 (23:31 -0800)]
objecter: don't resend linger ops unnecessarily

recalc_linger_op_target() was checking and then setting
linger_op->pgid and linger_op->active, but these were only set by
recalc_linger_op_target(). This was only called by handle_osd_map(),
so the first osdmap after a watch was established would cause a resend
of the watch. Analogous to the normal Op, set this information by
calling recalc_linger_op_target in send_linger().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoobjecter: initialize linger op snapid
Josh Durgin [Fri, 22 Feb 2013 07:22:59 +0000 (23:22 -0800)]
objecter: initialize linger op snapid

Since they are write ops now, it must be CEPH_NOSNAP or the OSD
returns EINVAL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoAdd test for list_watchers() C++ interface
David Zafman [Fri, 22 Feb 2013 05:45:06 +0000 (21:45 -0800)]
Add test for list_watchers() C++ interface

Signed-off-by: David Zafman <david.zafman@inktank.com>
12 years agoAdd listwatchers command to rados
David Zafman [Fri, 22 Feb 2013 01:59:17 +0000 (17:59 -0800)]
Add listwatchers command to rados

Signed-off-by: David Zafman <david.zafman@inktank.com>
12 years agoAdd ObjectReadOperation and IoCtx functions
David Zafman [Thu, 21 Feb 2013 23:20:08 +0000 (15:20 -0800)]
Add ObjectReadOperation and IoCtx functions

Signed-off-by: David Zafman <david.zafman@inktank.com>
12 years agolibrados: expose a list of watchers on an object
David Zafman [Fri, 22 Feb 2013 00:11:01 +0000 (16:11 -0800)]
librados: expose a list of watchers on an object

Add new op CEPH_OSD_OP_LIST_WATCHERS
Add Objecter handling

Signed-off-by: David Zafman <david.zafman@inktank.com>
12 years agoAdd rados_types.h header file
David Zafman [Fri, 22 Feb 2013 00:04:24 +0000 (16:04 -0800)]
Add rados_types.h header file

Signed-off-by: David Zafman <david.zafman@inktank.com>
12 years agoconfiguration parsing: give better error for missing =
Dan Mick [Fri, 22 Feb 2013 05:41:25 +0000 (21:41 -0800)]
configuration parsing: give better error for missing =

A ceph.conf line with "key" and no "= value" currently shows
"unexpected character while parsing putative key value,
at char N line M".  There's no reason it can't be clearer.

Fixes: #4229
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
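The kind of clearer diagnostic this commit is after might look like the following. It is a minimal Python sketch, not the ceph.conf parser: `parse_conf_line` and the wording of its error message are assumptions.

```python
def parse_conf_line(line, lineno):
    """Parse one 'key = value' line; return (key, value), or None if nothing to parse."""
    stripped = line.strip()
    if not stripped or stripped[0] in "#;":
        return None  # blank line or comment
    if stripped.startswith("["):
        return None  # section headers are out of scope for this sketch
    if "=" not in stripped:
        # Name the actual problem instead of reporting an unexpected character.
        raise ValueError(f"line {lineno}: key '{stripped}' has no '= value'")
    key, value = stripped.split("=", 1)
    return key.strip(), value.strip()
```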
12 years agoosd/PG: fix typo, missing -> omissing
Sage Weil [Fri, 22 Feb 2013 01:55:21 +0000 (17:55 -0800)]
osd/PG: fix typo, missing -> omissing

From ce7ffc34408bf32c66dc07e6f42d54b7ec489d41.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agotest_librbd_fsx: fix image closing
Josh Durgin [Fri, 22 Feb 2013 01:39:19 +0000 (17:39 -0800)]
test_librbd_fsx: fix image closing

Always close the image we opened in check_clone(), and check the
return code of the rbd_close() called before cloning.

Refs: #3958
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoobjecter: separate out linger_read() and linger_mutate()
Sage Weil [Thu, 21 Feb 2013 23:44:19 +0000 (15:44 -0800)]
objecter: separate out linger_read() and linger_mutate()

A watch is a mutation, while a notify is a read.  The mutations need to
pass in a proper snap context to be fully correct.

Also, make the WRITE flag implicit so the caller doesn't need to pass it
in.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: make watch OSDOp print sanely
Sage Weil [Thu, 21 Feb 2013 23:31:08 +0000 (15:31 -0800)]
osd: make watch OSDOp print sanely

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'next'
Sage Weil [Fri, 22 Feb 2013 01:30:46 +0000 (17:30 -0800)]
Merge branch 'next'

12 years agoceph_common.sh: fix iteration of items in ceph.conf
Sage Weil [Fri, 22 Feb 2013 01:29:58 +0000 (17:29 -0800)]
ceph_common.sh: fix iteration of items in ceph.conf

This broke in c8f528a4070dd3aa0b25c435c6234032aee39b21.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoceph-conf.rst: missing '=' in example network settings
Dan Mick [Fri, 22 Feb 2013 01:02:17 +0000 (17:02 -0800)]
ceph-conf.rst: missing '=' in example network settings

Signed-off-by: Dan Mick <dan.mick@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wsp.bobtail.2merge'
Sage Weil [Thu, 21 Feb 2013 23:45:36 +0000 (15:45 -0800)]
Merge remote-tracking branch 'gh/wsp.bobtail.2merge'

12 years agoPG::proc_replica_log: adjust oinfo.last_complete based on omissing
Samuel Just [Thu, 21 Feb 2013 23:31:36 +0000 (15:31 -0800)]
PG::proc_replica_log: adjust oinfo.last_complete based on omissing

Otherwise, search_for_missing may neglect to check the missing
set for some objects assuming that if the need version is
prior to last_complete, the replica must have it.

Fixes: #4994
Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge remote-tracking branch 'upstream/wip_clone_attrs'
Samuel Just [Thu, 21 Feb 2013 22:42:33 +0000 (14:42 -0800)]
Merge remote-tracking branch 'upstream/wip_clone_attrs'

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMDS: remove a few other unnecessary is_base() checks
Greg Farnum [Thu, 21 Feb 2013 22:30:42 +0000 (14:30 -0800)]
MDS: remove a few other unnecessary is_base() checks

We should let users remove xattrs as well as set them. ;) And
the check in handle_client_setlayout was totally useless -- perhaps
intended for setdirlayout?

This is a follow-on to 9f82ae60fac30391dfa9d17d2fc014bf9e21f387 and
should be taken wherever it goes.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agomds: allow xattrs on the root inode
Greg Farnum [Thu, 21 Feb 2013 22:21:08 +0000 (14:21 -0800)]
mds: allow xattrs on the root inode

This was previously disallowed because Once Upon a Time, the root
inode wasn't persisted to disk and was an entirely in-memory construct. But
it's safe now, and has been for a while.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agomds: use inode_t::layout for dir layout policy
Greg Farnum [Thu, 21 Feb 2013 17:22:00 +0000 (09:22 -0800)]
mds: use inode_t::layout for dir layout policy

This cherry-pick is going in the reverse direction of normal. That's
because this direction makes for the minimal change -- this patchset
is required to fix the loss of directory layouts we were previously
seeing, but fixing it requires changing the encoding versions. So we
wrote it on top of Bobtail and let it update the struct_v's as they existed
then. Note that we here change a few encoding versions in ways which are
NOT COMPATIBLE with previous development code (but not any releases). In
particular, development code introduced and this removes the
file_layout_policy_t, and some of the CInode and EMetaBlob encoding
struct_v values were used in development code to mean one thing, but
mean something different due to the Bobtail patch.

Remove the default_file_layout struct, which was just a ceph_file_layout,
and store it in the inode_t.  Rip out all the annoying code that put this
on the heap.

To aid in this usage, add a clear_layout() function to inode_t.

Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 36ed407e0f939a9bca57c3ffc0ee5608d50ab7ed)
Conflicts:

src/mds/CInode.cc
src/mds/CInode.h
src/mds/MDCache.cc
src/mds/Server.cc
src/mds/events/EMetaBlob.h
Cherry-pick-
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomds: parse ceph.*.layout vxattr key/value content
Sage Weil [Mon, 21 Jan 2013 05:53:37 +0000 (21:53 -0800)]
mds: parse ceph.*.layout vxattr key/value content

Use qi to parse a strictly formatted set of key/value pairs.  Be picky
about whitespace.  Any subset of recognized keys is allowed.  Parse the
same set of keys as the ceph.*.layout.* vxattrs.

Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 5551aa5b3b5c2e9e7006476b9cd8cc181d2c9a04)
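A strict key/value parse of this kind can be approximated as below. The commit uses qi; this Python sketch uses a regular expression instead, and the key set (`stripe_unit`, `stripe_count`, `object_size`, `pool`) and the single-space separator are assumptions not stated in the commit message.

```python
import re

KEYS = ("stripe_unit", "stripe_count", "object_size", "pool")  # assumed key set
_PAIR = re.compile(r"(%s)=(\S+)" % "|".join(KEYS))


def parse_layout(text):
    """Parse 'key=value key=value ...'; any subset of known keys is accepted."""
    result = {}
    for token in text.split(" "):  # exactly one space between pairs
        match = _PAIR.fullmatch(token)
        if match is None:
            raise ValueError(f"bad layout token: {token!r}")
        key, value = match.groups()
        # pool is kept as a string; the other fields must be integers
        result[key] = value if key == "pool" else int(value)
    return result
```

Being picky about whitespace falls out of the tokenization: a doubled space or spaces around `=` produce a token that fails to match and is rejected.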

12 years agoosdc/Objecter: unwatch is a mutation, not a read
Sage Weil [Thu, 21 Feb 2013 21:28:47 +0000 (13:28 -0800)]
osdc/Objecter: unwatch is a mutation, not a read

This was causing librados to unblock after the ACK on unwatch, which meant
that librbd users raced and tried to delete the image before the unwatch
change was committed, and got EBUSY.  See #3958.

The watch operation has a similar problem.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoFileStore::_clone: use _fsetattrs rather than _setattrs
Samuel Just [Thu, 21 Feb 2013 21:28:26 +0000 (13:28 -0800)]
FileStore::_clone: use _fsetattrs rather than _setattrs

The omap portion of the clone happened above in DBObjectMap::clone.
Only the fs stored attrs need to be explicitly copied.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoFileStore::_setattrs: use _fsetattrs
Samuel Just [Thu, 21 Feb 2013 21:26:56 +0000 (13:26 -0800)]
FileStore::_setattrs: use _fsetattrs

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoFileStore: add _fsetattrs
Samuel Just [Thu, 21 Feb 2013 21:26:40 +0000 (13:26 -0800)]
FileStore: add _fsetattrs

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoFileStore::_setattrs: only do omap operations if necessary
Samuel Just [Thu, 21 Feb 2013 21:25:49 +0000 (13:25 -0800)]
FileStore::_setattrs: only do omap operations if necessary

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoFileStore::_setattrs no need to grab an Index lock for the omap operations
Samuel Just [Thu, 21 Feb 2013 21:24:42 +0000 (13:24 -0800)]
FileStore::_setattrs no need to grab an Index lock for the omap operations

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge pull request #67 from jaharkes/content_length
Yehuda Sadeh [Thu, 21 Feb 2013 20:59:06 +0000 (12:59 -0800)]
Merge pull request #67 from jaharkes/content_length

Handle empty CONTENT_LENGTH environment variable.

12 years agoFix failing > 4MB range requests through radosgw S3 API.
Jan Harkes [Thu, 21 Feb 2013 20:17:38 +0000 (15:17 -0500)]
Fix failing > 4MB range requests through radosgw S3 API.

When a range request is made for more than rgw_get_obj_max_req_size
bytes the first returned chunk sets 'ret' to STATUS_PARTIAL_CONTENT and
all remaining chunks behave as if there is an error state and only
return a minimal header.

Fix this by passing STATUS_PARTIAL_CONTENT to set_req_state_err, but
leave the 'ret' member variable untouched.

Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit c83a01d4e8dcd26eec24c020c5b79fcfa4ae44a3)

12 years agoMerge pull request #66 from jaharkes/range_requests
Yehuda Sadeh [Thu, 21 Feb 2013 20:42:06 +0000 (12:42 -0800)]
Merge pull request #66 from jaharkes/range_requests

Fix failing > 4MB range requests through radosgw S3 API.

12 years agoHandle empty CONTENT_LENGTH environment variable. 67/head
Jan Harkes [Mon, 18 Feb 2013 21:15:36 +0000 (16:15 -0500)]
Handle empty CONTENT_LENGTH environment variable.

nginx seems to be providing a CONTENT_LENGTH environment variable with no data
when the request body is empty.

Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
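One way to tolerate an empty CONTENT_LENGTH is sketched below in Python. Treating the empty value as a zero-length body is an assumption made for illustration, and `content_length` is a hypothetical helper rather than radosgw code.

```python
def content_length(environ):
    """Return the request body length, treating a missing or empty value as 0."""
    raw = environ.get("CONTENT_LENGTH", "")
    if raw.strip() == "":
        return 0  # variable is present but carries no data
    return int(raw)
```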
12 years agoFix failing > 4MB range requests through radosgw S3 API. 66/head
Jan Harkes [Thu, 21 Feb 2013 20:17:38 +0000 (15:17 -0500)]
Fix failing > 4MB range requests through radosgw S3 API.

When a range request is made for more than rgw_get_obj_max_req_size
bytes the first returned chunk sets 'ret' to STATUS_PARTIAL_CONTENT and
all remaining chunks behave as if there is an error state and only
return a minimal header.

Fix this by passing STATUS_PARTIAL_CONTENT to set_req_state_err, but
leave the 'ret' member variable untouched.

Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu>
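The failure mode described above, where a status stored in 'ret' makes later chunks look like errors, can be modelled in a few lines. This Python sketch is a toy model: `send_chunks`, its flag, and the value of `STATUS_PARTIAL_CONTENT` are hypothetical stand-ins.

```python
STATUS_PARTIAL_CONTENT = 1  # hypothetical nonzero stand-in, not the real constant's value


def send_chunks(chunks, leave_ret_untouched):
    """Return the body bytes actually delivered for a multi-chunk range request."""
    ret = 0
    body = []
    for chunk in chunks:
        if ret != 0:
            continue  # treated as an error state: the chunk's data is dropped
        body.append(chunk)
        if not leave_ret_untouched:
            ret = STATUS_PARTIAL_CONTENT  # old behaviour: status stored in ret
    return b"".join(body)
```

With the flag off, only the first chunk is delivered; with it on, every chunk of the range is returned.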
12 years agoosd: an interval can't go readwrite if its acting is empty
Sage Weil [Thu, 21 Feb 2013 19:15:58 +0000 (11:15 -0800)]
osd: an interval can't go readwrite if its acting is empty

Let's not forget that min_size can be zero.

Fixes: #4159
Signed-off-by: Sage Weil <sage@inktank.com>
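The condition this commit tightens can be written as a one-line predicate. The Python sketch below is an illustration; `interval_may_go_rw` is a hypothetical name and the real check may consider more than these two inputs.

```python
def interval_may_go_rw(acting, min_size):
    """An interval can serve writes only with a non-empty acting set of at least min_size."""
    return len(acting) > 0 and len(acting) >= min_size
```

Without the explicit non-empty test, an empty acting set would pass whenever min_size is zero, since 0 >= 0.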
12 years agolibrbd: make sure racing flattens don't crash
Josh Durgin [Thu, 21 Feb 2013 19:26:45 +0000 (11:26 -0800)]
librbd: make sure racing flattens don't crash

The only way for a parent to disappear is a racing flatten completing,
or possibly in the future the image being forcibly removed. In either
case, continuing to flatten makes no sense, so stop early.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrbd: use rwlocks instead of mutexes for several fields
Josh Durgin [Thu, 21 Feb 2013 19:17:18 +0000 (11:17 -0800)]
librbd: use rwlocks instead of mutexes for several fields

Image metadata like snapshots, size, and parent is frequently read,
but rarely updated. During flatten, we were depending on the parent
lock to prevent the parent ImageCtx from disappearing out from under
us while we read from it. The copy-up path also needed the parent lock
to be able to read from the parent image, which led to a deadlock.

Convert parent_lock, snap_lock, and md_lock to RWLocks, and change
their use to read instead of exclusive locks where appropriate. The
main place exclusive locks are needed is in ictx_refresh, so this is
pretty simple. This fixes the deadlock, since parent_lock is only
needed for read access in both flatten and the copy-up operation.

cache_lock and refresh_lock are only really used for exclusive access,
so leave them as regular mutexes.

One downside to this is that there's no way to assert is_locked()
for RWLocks, so we'll have to be very careful about changing code
in the future.

Fixes: #3665
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agocommon: add lockers for RWLocks
Josh Durgin [Thu, 21 Feb 2013 19:15:41 +0000 (11:15 -0800)]
common: add lockers for RWLocks

This makes them easier to use, especially instead of existing mutexes.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
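The idea of scoped lockers for a read/write lock can be sketched with context managers. This Python model is an assumption-laden illustration: the `RWLock` class and its `read_locked`/`write_locked` helpers are hypothetical and do not mirror the C++ classes added here.

```python
import threading
from contextlib import contextmanager


class RWLock:
    """Many concurrent readers, or one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read_locked(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write_locked(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()
```

Two read holders can overlap, which is the property that lets flatten and copy-up both hold parent_lock for reading; the scoped form also guarantees release on every exit path.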
12 years agoMerge branch 'next'
Sage Weil [Thu, 21 Feb 2013 18:44:04 +0000 (10:44 -0800)]
Merge branch 'next'

Conflicts:
src/osd/ReplicatedPG.cc

12 years agoosd: clear recovery state on pg removal
Sage Weil [Thu, 21 Feb 2013 18:30:08 +0000 (10:30 -0800)]
osd: clear recovery state on pg removal

This ensures we release our in-progress recovery counters, which prevents
recovery from getting blocked indefinitely when a pool removal races with
recovery ops.

Fixes: #4217
Backport: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
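Why leaked counters block recovery can be shown with a toy throttle. This Python sketch is hypothetical throughout: `RecoveryState`, its per-PG bookkeeping, and the `max_active` limit are illustrative assumptions, not the OSD's data structures.

```python
class RecoveryState:
    def __init__(self, max_active):
        self.max_active = max_active
        self.active = {}  # pg id -> number of in-progress recovery ops

    def total_active(self):
        return sum(self.active.values())

    def start_op(self, pg):
        """Start a recovery op if the global limit allows it."""
        if self.total_active() >= self.max_active:
            return False
        self.active[pg] = self.active.get(pg, 0) + 1
        return True

    def on_pg_removed(self, pg):
        """Release any counters still held by a PG that is going away."""
        self.active.pop(pg, None)
```

If `on_pg_removed` were skipped, the removed PG's ops would count against the limit forever and no other PG could start recovering.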
12 years agotest: fix run-rbd-tests pool deletion
Josh Durgin [Thu, 21 Feb 2013 01:04:58 +0000 (17:04 -0800)]
test: fix run-rbd-tests pool deletion

Use the new safety check

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoceph-object-corpus: use temporary 'wsp.master.new' corpus until we get merged into...
Joao Eduardo Luis [Thu, 21 Feb 2013 18:29:36 +0000 (18:29 +0000)]
ceph-object-corpus: use temporary 'wsp.master.new' corpus until we get merged into master

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agoMerge branch 'wsp.bobtail.2merge' into wsp.bobtail.master
Joao Eduardo Luis [Thu, 21 Feb 2013 18:04:22 +0000 (18:04 +0000)]
Merge branch 'wsp.bobtail.2merge' into wsp.bobtail.master

Conflicts:
src/.gitignore
src/Makefile.am
src/include/ceph_features.h
src/mon/MDSMonitor.cc
src/mon/PGMonitor.cc

12 years agovstart.sh: Create mon data directory before --mkfs
Joao Eduardo Luis [Wed, 30 Jan 2013 16:04:36 +0000 (16:04 +0000)]
vstart.sh: Create mon data directory before --mkfs

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agotest: ObjectMap: add a generic leveldb store tool
Joao Eduardo Luis [Wed, 30 Jan 2013 17:54:11 +0000 (17:54 +0000)]
test: ObjectMap: add a generic leveldb store tool

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agomon: ceph-mon: convert an old monitor store to the new format
Joao Eduardo Luis [Wed, 20 Feb 2013 18:46:34 +0000 (18:46 +0000)]
mon: ceph-mon: convert an old monitor store to the new format

With the single-paxos patches we shifted from an approach with multiple
paxos instances (one for each paxos service) keeping their own versions
to a single paxos instance for all the paxos services, thus ending up
with a single global version for paxos.

With the release of v0.52, the monitor started tracking these global
versions, keeping them for the single purpose of making it possible to
convert the store to a single-paxos format.

This patch now introduces a mechanism to convert a GV-enabled store to
the single-paxos format store when the monitor is upgraded.

As we require the global versions to be present, we first check if the
store has the GV feature set: if it does not, we do not proceed;
otherwise we start the conversion.

At the end of the conversion, the monitor data directory will have a
brand new 'store.db' directory, where the key/value store lies,
alongside the old store.  This makes it possible to revert to a
previous monitor version if things go sideways, without jeopardizing the
data in the store.

The conversion is done during a rolling upgrade, without any
intervention by the user.  Fire up the new monitor version on an old
store, and the monitor itself will convert the store, trim any lingering
versions that might not be required, and proceed to start as expected.
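
The conversion decision described above can be sketched as follows.
This is only an illustration in Python, not the monitor's code: the
function name convert_store, the "gv" feature name, the 'data' file and
the dict-based model of the old store are all hypothetical.  Only the
'store.db' directory name comes from this message.

```python
import os


def convert_store(mon_data_dir, features, old_entries):
    """Write a new 'store.db' directory next to the old store.

    Hypothetical model: `features` is a set of feature names and
    `old_entries` maps keys to values read from the old store.
    Returns True if a conversion happened, False otherwise.
    """
    new_store = os.path.join(mon_data_dir, "store.db")
    if os.path.isdir(new_store):
        return False  # already converted, nothing to do
    if "gv" not in features:
        return False  # global versions missing: do not proceed
    os.mkdir(new_store)
    # The old store is never modified, so reverting to a previous
    # monitor version stays possible.
    with open(os.path.join(new_store, "data"), "w") as out:
        for key in sorted(old_entries):
            out.write(f"{key}={old_entries[key]}\n")
    return True
```

Calling convert_store() a second time on the same directory is a no-op
in this sketch, which mirrors the idea that the monitor simply starts
once the new store is in place.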

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years ago mon: Add an offline monitor store converter
Joao Eduardo Luis [Tue, 18 Sep 2012 15:10:39 +0000 (16:10 +0100)]
mon: Add an offline monitor store converter

This tool will convert an old monitor store format (bobtail) to the new
key/value store-backed, single-paxos format.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years ago os: LevelDBStore: scrap init() and create open() and create_and_open()
Joao Eduardo Luis [Mon, 17 Sep 2012 17:08:05 +0000 (18:08 +0100)]
os: LevelDBStore: scrap init() and create open() and create_and_open()

The init() function always implicitly created a new store if it was
missing.

This patch makes init() a private function that accepts a bool
specifying whether or not to create the store if it does not exist,
and adds two functions: open() and create_and_open().

open() will fail if the store we are trying to open does not exist;
create_and_open() keeps the previous behavior of init() and will create
the store if it does not exist before opening it.
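
The split between the two entry points can be illustrated with a small
Python stand-in.  The class KVStore, its directory-based notion of "the
store exists" and the create_if_missing parameter name are assumptions
made for this sketch; the real LevelDBStore is C++ and is not shown
here.

```python
import os


class KVStore:
    """Toy stand-in for a key/value store rooted at a directory."""

    def __init__(self, path):
        self.path = path
        self.is_open = False

    def _init(self, create_if_missing):
        # Private helper: the flag decides whether a missing store is
        # created or reported as an error.
        if not os.path.isdir(self.path):
            if not create_if_missing:
                raise FileNotFoundError(self.path)
            os.makedirs(self.path)
        self.is_open = True

    def open(self):
        """Open an existing store; fail if it does not exist."""
        self._init(create_if_missing=False)

    def create_and_open(self):
        """Create the store if it is missing, then open it."""
        self._init(create_if_missing=True)
```

With this shape, a caller that expects the store to be there uses
open() and gets an error instead of a silently created empty store.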

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
12 years ago mon: Monitor: Add monitor store synchronization support
Joao Eduardo Luis [Tue, 12 Feb 2013 13:25:16 +0000 (13:25 +0000)]
mon: Monitor: Add monitor store synchronization support

Synchronize two monitor stores when one of the monitors has diverged
significantly from the remaining monitor cluster.

This process roughly consists of the following steps:

  0. mon.X tries to join the cluster;
  1. mon.X verifies that it has diverged from the remaining cluster;
  2. mon.X asks the leader to sync;
  3. the leader allows mon.X to sync, pointing out a mon.Y from
     which mon.X should sync;
  4. mon.X asks mon.Y to sync;
  5. mon.Y sends its own store in one or more chunks;
  6. mon.X acks each received chunk; go to 5;
  7. mon.X receives the last chunk from mon.Y;
  8. mon.X informs the leader that it has finished synchronizing;
  9. the leader acks mon.X's finished sync;
 10. mon.X bootstraps and retries joining the cluster (goto 0.)
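
Steps 5 through 8 above can be sketched as a small Python simulation.
The function name sync_store, the event tuples and the chunk_size
parameter (counted in entries, not bytes) are hypothetical and exist
only for this illustration; the leader and the network are left out.

```python
def sync_store(provider_store, chunk_size=2):
    """Copy provider_store chunk by chunk and log the exchange.

    Returns the requester's resulting store and a list of events.
    """
    items = sorted(provider_store.items())
    # An empty store still produces one (empty) last chunk.
    chunks = [dict(items[i:i + chunk_size])
              for i in range(0, len(items), chunk_size)] or [{}]
    requester_store, log = {}, []
    for n, chunk in enumerate(chunks):
        last = n == len(chunks) - 1
        log.append(("chunk", n, last))  # step 5 (step 7 when last)
        requester_store.update(chunk)
        log.append(("ack", n))          # step 6
    log.append(("finish",))             # step 8
    return requester_store, log
```

For a three-entry store and chunk_size=2 the log holds two chunk/ack
pairs followed by a single finish event.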

This is the simplest and most straightforward process that can be hoped
for. However, things may go sideways at any time (monitors failing, for
instance), which could potentially lead to a corrupted monitor store.
There are, however, mechanisms at work to avoid such a scenario at any
step of the process.

Some of these mechanisms include:

 - aborting the sync if the leader fails or leadership changes;
 - state barriers on synchronization functions to keep stray/outdated
   messages from interfering with normal monitor behavior or an ongoing
   synchronization;
 - store clean-up before any synchronization process starts;
 - store clean-up if a sync process fails;
 - resuming sync from a different monitor mon.Z if mon.Y fails mid-sync;
 - several timeouts to guarantee that all the involved parties are still
   alive and participating in the sync effort;
 - request forwarding when mon.X contacts a monitor outside the quorum
   that might know who the leader is (or might know someone who does)
   [4].
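
Two of the mechanisms listed above, the state barrier and the store
clean-up, can be sketched in Python as below.  The class name
SyncRequester and the state names "NONE" and "CHUNKS" are hypothetical;
this is an assumption-laden illustration, not the monitor's state
machine.

```python
class SyncRequester:
    """Toy requester that only accepts chunks while a sync is active."""

    def __init__(self):
        self.state = "NONE"
        self.store = {}

    def start_sync(self):
        self.store.clear()   # clean-up before the sync starts
        self.state = "CHUNKS"

    def handle_chunk(self, chunk):
        if self.state != "CHUNKS":
            return False     # state barrier: drop stray/outdated message
        self.store.update(chunk)
        return True

    def abort_sync(self):
        self.store.clear()   # clean-up after a failed sync
        self.state = "NONE"
```

A chunk that arrives before start_sync() or after abort_sync() is
dropped, so a late message from an abandoned sync cannot touch the
store.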

Changes:
  - Adapt the MMonProbe message for the single-paxos approach, dropping
    the version map and using a lower and upper bound version instead.
  - Remove old slurp code.
  - Add 'sync force' command; 'sync_force' through the admin socket.

Notes:

[1] It's important to keep track of the paxos version at the time at
    which a store sync starts.  Given that after the sync we end up with
    the same state as the monitor we are synchronizing from, there is a
    chance that we might end up with an uncommitted paxos version if we
    are synchronizing with the leader (there's some paxos stashing done
    prior to commit on the leader).  By keeping track of the version at
    which the sync started, we can then tell the requester at which
    version it should cap its paxos store.

[2] Furthermore, the enforced paxos cap, described in [1], is even more
    important if we consider the need to reapply the paxos versions that
    were received during the sync, to make sure the paxos store is
    consistent.  If we happened to have some yet-uncommitted version in
    the store, we could end up applying it.

[3] What is described in [1] and [2]:

Fixes: #4026
Fixes: #4037
Fixes: #4040
[4] Whenever a given monitor mon.X is in the probing phase and notices
    that there is a mon.Y with a paxos version considerably higher than
    the one mon.X has, then mon.X will attempt to synchronize from
    mon.Y.  This is the basis for the store sync.  Although this
    generally holds, there is a chance that, by the time mon.Y handles
    the sync request from mon.X, mon.Y is already attempting a sync
    itself with some other mon.Z.  In this case, the appropriate thing
    for mon.Y to do is to forward mon.X's request to mon.Z, as mon.Z
    should be part of the quorum, know who the leader is or be the
    leader itself -- if not, at least it is guaranteed that mon.Z has a
    higher version than both mon.X and mon.Y, so it should be okay to
    sync from it.
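
The paxos cap from notes [1] and [2] and the request forwarding from
note [4] can be sketched in Python as below.  Both function names, the
dict of versions and the syncing_from map are hypothetical and only
serve the illustration; the loop check is an extra safety net of the
sketch, not something this message describes.

```python
def cap_paxos_versions(received_versions, version_at_sync_start):
    """Keep only paxos versions up to the one recorded at sync start.

    Anything newer may have been stashed but not yet committed on the
    provider, so it is dropped instead of being applied.
    """
    return {v: data for v, data in received_versions.items()
            if v <= version_at_sync_start}


def pick_sync_provider(target, syncing_from):
    """Follow forwarding until reaching a monitor that is not syncing.

    `syncing_from` maps a monitor name to the monitor it is currently
    syncing from; monitors absent from the map are not syncing.
    """
    seen = set()
    while target in syncing_from:
        if target in seen:
            raise ValueError("forwarding loop")
        seen.add(target)
        target = syncing_from[target]
    return target
```

If mon.Y is itself syncing from mon.Z, a request aimed at mon.Y ends up
at mon.Z; versions newer than the recorded cap are discarded.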

Fixes: #4162
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years ago message: MMonSync: Monitor Synchronization message
Joao Eduardo Luis [Mon, 9 Jul 2012 21:51:49 +0000 (22:51 +0100)]
message: MMonSync: Monitor Synchronization message

The monitor's synchronization process requires a specific message type
to carry the required information. Since this process differs
significantly from slurping, reusing the MMonProbe message is not an
option, as it would require major changes and, for all intents and
purposes, it would be far outside the scope of the MMonProbe message.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years ago mon: MonitorDBStore: add store iterators to obtain chunks for sync
Joao Eduardo Luis [Wed, 15 Aug 2012 14:35:39 +0000 (15:35 +0100)]
mon: MonitorDBStore: add store iterators to obtain chunks for sync

We created an interface specific to the MonitorDBStore, which can be used
to create iterators to obtain chunks for sync.

Two different iterators were defined: one that will iterate over the whole
store, focusing on the specified set of prefixes; another that will
iterate over only one specific prefix.
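
The two kinds of iterators can be sketched in Python over a store
modelled as a dict keyed by (prefix, key) tuples.  That model and the
function names iter_prefixes and iter_single_prefix are assumptions of
this sketch and do not reflect the MonitorDBStore interface itself.

```python
def iter_prefixes(store, prefixes):
    """Yield (prefix, key, value) for entries under the given prefixes."""
    wanted = set(prefixes)
    for prefix, key in sorted(store):
        if prefix in wanted:
            yield prefix, key, store[(prefix, key)]


def iter_single_prefix(store, prefix):
    """Yield (key, value) for entries under exactly one prefix."""
    for p, key in sorted(store):
        if p == prefix:
            yield key, store[(p, key)]
```

Both walk the store in sorted order, so a caller sees the entries of
one prefix grouped together.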

These two different iterators allow us to build the sync process in two
distinct phases: 1) obtain all key/value pairs for paxos and all paxos
services, bundle them in chunks and send them over the wire; and 2) obtain
all the paxos versions, bundle them in chunks and send them over the wire.

Also, we are currently considering a chunk to be (at most) 1 MB worth of
data, although it can be tuned using the 'mon_sync_max_payload_size'
option.
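
Bundling entries into size-limited chunks can be sketched in Python as
below.  The function name build_chunks is hypothetical, the default of
1024 * 1024 bytes is one reading of "1 MB", and counting only key and
value lengths is a simplification; how an oversized entry is treated is
also an assumption of the sketch.

```python
def build_chunks(entries, max_payload_size=1024 * 1024):
    """Group (key, value) byte pairs into chunks under a size limit.

    An entry larger than the limit still gets a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for key, value in entries:
        size = len(key) + len(value)
        # Close the current chunk if this entry would overflow it.
        if current and current_size + size > max_payload_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append((key, value))
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Lowering max_payload_size yields more, smaller chunks, which is the
effect one would expect from tuning the option mentioned above.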

mon: MonitorDBStore: add crc support when --mon-sync-debug is set

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years ago mon: Paxos: get rid of slurp-related code
Joao Eduardo Luis [Fri, 31 Aug 2012 17:39:27 +0000 (18:39 +0100)]
mon: Paxos: get rid of slurp-related code

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>