git.apps.os.sepia.ceph.com Git - ceph.git/log
12 years agoMerge pull request #169 from ceph/wip-rbd-diff
Sage Weil [Mon, 1 Apr 2013 18:26:16 +0000 (11:26 -0700)]
Merge pull request #169 from ceph/wip-rbd-diff

rbd incremental backup/restore

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years agotest: fix signed/unsigned comparison in test_cors
Greg Farnum [Mon, 1 Apr 2013 16:56:27 +0000 (09:56 -0700)]
test: fix signed/unsigned comparison in test_cors

Signed-off-by: Greg Farnum <greg@inktank.com>
Acked-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-mds'
Greg Farnum [Mon, 1 Apr 2013 16:31:37 +0000 (09:31 -0700)]
Merge branch 'wip-mds'

12 years agomds: bump the protocol version.
Greg Farnum [Mon, 1 Apr 2013 16:27:27 +0000 (09:27 -0700)]
mds: bump the protocol version.

We've changed quite a lot of the restart behavior, as well as one
of the message encodings. This is cheaper and easier than using feature bits,
and CephFS is still a tech preview or whatever, so let's cover them using this.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't roll back prepared table updates
Yan, Zheng [Sun, 31 Mar 2013 06:19:17 +0000 (14:19 +0800)]
mds: don't roll back prepared table updates

When the table server is recovering, it re-sends 'agree' messages for
prepared table updates. It is possible that a table client receives an
'agree' message before it commits the corresponding update. Don't
send a 'rollback' message back to the server in this case.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: clear scatter dirty if replica inode has no auth subtree
Yan, Zheng [Sun, 17 Mar 2013 03:13:38 +0000 (11:13 +0800)]
mds: clear scatter dirty if replica inode has no auth subtree

This avoids sending superfluous scatterlock state to recovering MDS

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't replicate purging dentry
Yan, Zheng [Fri, 15 Mar 2013 05:09:34 +0000 (13:09 +0800)]
mds: don't replicate purging dentry

open_remote_ino is racy; it's possible that someone deletes the inode's
last linkage while the MDS is discovering the inode.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: eval inodes with caps imported by cache rejoin message
Yan, Zheng [Sun, 17 Mar 2013 01:45:55 +0000 (09:45 +0800)]
mds: eval inodes with caps imported by cache rejoin message

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: try merging subtree after clear EXPORTBOUND
Yan, Zheng [Sat, 16 Mar 2013 13:43:17 +0000 (21:43 +0800)]
mds: try merging subtree after clear EXPORTBOUND

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: clear dirty inode rstat if import fails
Yan, Zheng [Sat, 16 Mar 2013 04:38:56 +0000 (12:38 +0800)]
mds: clear dirty inode rstat if import fails

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't open dirfrag while subtree is frozen
Yan, Zheng [Tue, 12 Mar 2013 12:51:43 +0000 (20:51 +0800)]
mds: don't open dirfrag while subtree is frozen

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: notify bystanders if export aborts
Yan, Zheng [Thu, 14 Mar 2013 03:57:16 +0000 (11:57 +0800)]
mds: notify bystanders if export aborts

So bystanders know the subtree is single auth earlier.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: fix export cancel notification
Yan, Zheng [Thu, 14 Mar 2013 04:24:54 +0000 (12:24 +0800)]
mds: fix export cancel notification

The comment says that if the importer is dead, bystanders think the
exporter is the only auth, as per mdcache->handle_mds_failure(). But
there is no such code in MDCache::handle_mds_failure().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: unfreeze subtree if import aborts in PREPPED state
Yan, Zheng [Thu, 14 Mar 2013 04:01:08 +0000 (12:01 +0800)]
mds: unfreeze subtree if import aborts in PREPPED state

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: check MDS peer's state through mdsmap
Yan, Zheng [Thu, 14 Mar 2013 03:23:48 +0000 (11:23 +0800)]
mds: check MDS peer's state through mdsmap

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: avoid double auth pin for file recovery
Yan, Zheng [Thu, 14 Mar 2013 02:11:31 +0000 (10:11 +0800)]
mds: avoid double auth pin for file recovery

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: add dirty imported dirfrag to LogSegment
Yan, Zheng [Tue, 12 Mar 2013 08:11:13 +0000 (16:11 +0800)]
mds: add dirty imported dirfrag to LogSegment

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: send lock action message when auth MDS is in proper state.
Yan, Zheng [Tue, 12 Mar 2013 08:51:53 +0000 (16:51 +0800)]
mds: send lock action message when auth MDS is in proper state.

For a rejoining object, don't send the lock ACK message because lock states
are still uncertain. The lock ACK may confuse the object's auth MDS and
trigger an assertion.

If the object's auth MDS is not active, just skip sending NUDGE, REQRDLOCK
and REQSCATTER messages. MDCache::handle_mds_recovery() will take care
of them.

Also defer the caps release message until clientreplay or active.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: issue caps when lock state in replica become SYNC
Yan, Zheng [Tue, 12 Mar 2013 08:19:26 +0000 (16:19 +0800)]
mds: issue caps when lock state in replica become SYNC

because client can request READ caps from non-auth MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: share inode max size after MDS recovers
Yan, Zheng [Tue, 12 Mar 2013 08:27:22 +0000 (16:27 +0800)]
mds: share inode max size after MDS recovers

The MDS may crash after journaling the new max size, but before sending
the new max size to the client. Later when the MDS recovers, the client
re-requests the new max size, but the MDS finds max size unchanged. So
the client waits for the new max size forever. This issue can be avoided
by checking the client cap's last_sent and sharing the inode max size if it
is zero (a reconnected cap's last_sent is zero).

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: take object's versionlock when rejoinning xlock
Yan, Zheng [Thu, 14 Mar 2013 12:56:27 +0000 (20:56 +0800)]
mds: take object's versionlock when rejoinning xlock

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: reqid for rejoinning authpin/wrlock need to be list
Yan, Zheng [Thu, 14 Mar 2013 12:29:53 +0000 (20:29 +0800)]
mds: reqid for rejoinning authpin/wrlock need to be list

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: handle linkage mismatch during cache rejoin
Yan, Zheng [Thu, 14 Mar 2013 12:06:27 +0000 (20:06 +0800)]
mds: handle linkage mismatch during cache rejoin

For MDS cluster, not all file system namespace operations that impact
multiple MDS use two phase commit. Some operations use dentry link/unlink
message to update replica dentry's linkage after they are committed by
the master MDS. It's possible the master MDS crashes after journaling an
operation, but before sending the dentry link/unlink messages. Later when
the MDS recovers and receives cache rejoin messages from the surviving
MDS, it will find linkage mismatch.

The original cache rejoin code does not properly handle the case that
dentry unlink messages were missing. Unlinked inodes were linked to stray
dentries. So the cache rejoin ack message needs to push replicas of these
stray dentries to the surviving MDS.

This patch also adds code that handles cache expiration in the middle of
cache rejoining.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: encode dirfrag base in cache rejoin ack
Yan, Zheng [Wed, 13 Mar 2013 12:58:26 +0000 (20:58 +0800)]
mds: encode dirfrag base in cache rejoin ack

The cache rejoin ack message already encodes the inode base; make it also
encode the dirfrag base. This allows the message to replicate stray dentries
like the MDentryUnlink message does. The function will be used by a later patch.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoMerge pull request #179 from ceph/wip-client-cond
Gregory Farnum [Mon, 1 Apr 2013 16:22:45 +0000 (09:22 -0700)]
Merge pull request #179 from ceph/wip-client-cond

client: always remove cond from list after waiting

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: include replica nonce in MMDSCacheRejoin::inode_strong
Yan, Zheng [Wed, 13 Mar 2013 12:47:11 +0000 (20:47 +0800)]
mds: include replica nonce in MMDSCacheRejoin::inode_strong

So the recovering MDS can properly handle cache expire messages.
Also increase the nonce value when sending the cache rejoin acks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Also update the MMDSCacheRejoin encoding to the new format.
Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agomon: OSDMonitor: only output warn/err messages if quotas are set > 0
Joao Eduardo Luis [Mon, 1 Apr 2013 16:14:15 +0000 (17:14 +0100)]
mon: OSDMonitor: only output warn/err messages if quotas are set > 0

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agomds: remove MDCache::rejoin_fetch_dirfrags()
Yan, Zheng [Wed, 13 Mar 2013 11:23:18 +0000 (19:23 +0800)]
mds: remove MDCache::rejoin_fetch_dirfrags()

In commit 77946dcdae (mds: fetch missing inodes from disk), I introduced
MDCache::rejoin_fetch_dirfrags(). But it basically duplicates the function
of MDCache::open_undef_dirfrags(), so just remove rejoin_fetch_dirfrags()
and make open_undef_dirfrags() also handle undefined inodes.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: fix MDS recovery involving cross authority rename
Yan, Zheng [Wed, 13 Mar 2013 10:56:27 +0000 (18:56 +0800)]
mds: fix MDS recovery involving cross authority rename

For an MDS cluster, a rename operation may involve multiple MDS. The
rename source's auth MDS may crash after some witness MDS have prepared
the rename but before the rename is committing. Later, when the MDS
recovers, its subtree map and linkages are different from the prepared
MDS'. This causes problems for both subtree resolve and cache rejoin.
The solution is, if the rename source's auth MDS fails, the prepared
witness MDS query the master MDS whether the operation is committing. If
it's not, they roll back the rename, then send resolve messages to the
recovering MDS.

Another similar case is that a prepared witness MDS crashes when the
rename source's auth MDS has prepared or is preparing the operation.
When the witness recovers, the master just delays sending the resolve
ack message until it commits the operation.

This patch also updates Server::handle_client_rename(), making the
preparation of the rename source's auth MDS the final step before
committing the rename.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: send resolve acks after master updates are safely logged
Yan, Zheng [Wed, 13 Mar 2013 08:54:58 +0000 (16:54 +0800)]
mds: send resolve acks after master updates are safely logged

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: send cache rejoin messages after gathering all resolves
Yan, Zheng [Thu, 14 Mar 2013 07:06:45 +0000 (15:06 +0800)]
mds: send cache rejoin messages after gathering all resolves

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't send MDentry{Link,Unlink} before receiving cache rejoin
Yan, Zheng [Fri, 15 Mar 2013 02:34:09 +0000 (10:34 +0800)]
mds: don't send MDentry{Link,Unlink} before receiving cache rejoin

The active MDS calls MDCache::rejoin_scour_survivor_replicas() when it
receives the cache rejoin message. The function will remove the objects
replicated by MDentry{Link,Unlink} from replica map.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: set resolve/rejoin gather MDS set in advance
Yan, Zheng [Thu, 14 Mar 2013 16:08:39 +0000 (00:08 +0800)]
mds: set resolve/rejoin gather MDS set in advance

An active MDS may receive a resolve/rejoin message before receiving
the mdsmap message that claims the MDS cluster is in resolving/rejoining
state. So instead of setting the gather MDS set when receiving the mdsmap,
set it in advance when detecting an MDS failure.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't send resolve message between active MDS
Yan, Zheng [Thu, 14 Mar 2013 04:27:51 +0000 (12:27 +0800)]
mds: don't send resolve message between active MDS

When the MDS cluster is resolving, the current behavior is to send a subtree
resolve message to all other MDS and wait for all other MDS' resolve messages.
The problem is that active MDS can have different subtree maps due to rename.
Besides, gathering active MDS' resolve messages is also racy. The only
function of these messages is to disambiguate other MDS' imports. We can
replace them with import finish notifications.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: compose and send resolve messages in batch
Yan, Zheng [Wed, 13 Mar 2013 08:23:30 +0000 (16:23 +0800)]
mds: compose and send resolve messages in batch

Resolve messages for all MDS are the same, so we can compose and
send them in batch.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't delay processing replica buffer in slave request
Yan, Zheng [Wed, 13 Mar 2013 06:05:21 +0000 (14:05 +0800)]
mds: don't delay processing replica buffer in slave request

Replicated objects need to be added into the cache immediately

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: unify slave request waiting
Yan, Zheng [Wed, 13 Mar 2013 02:28:58 +0000 (10:28 +0800)]
mds: unify slave request waiting

When requesting remote xlock or remote wrlock, the master request is
put into lock object's REMOTEXLOCK waiting queue. The problem is that
remote wrlock's target can be different from lock's auth MDS. When
the lock's auth MDS recovers, MDCache::handle_mds_recovery() may wake
incorrect request. So just unify slave request waiting, dispatch the
master request when receiving slave request reply.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomds: defer eval gather locks when removing replica
Yan, Zheng [Tue, 12 Mar 2013 12:24:52 +0000 (20:24 +0800)]
mds: defer eval gather locks when removing replica

Locks' states should not change between composing the cache rejoin ack
messages and sending the message. If Locker::eval_gather() is called
in MDCache::{inode,dentry}_remove_replica(), it may wake requests and
change locks' states.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: avoid sending duplicated table prepare/commit
Yan, Zheng [Sun, 31 Mar 2013 09:54:50 +0000 (17:54 +0800)]
mds: avoid sending duplicated table prepare/commit

This patch makes the table client defer sending table prepare/commit messages
until it receives the table server's 'ready' message. This avoids duplicated
table prepare/commit messages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: make sure table request id unique
Yan, Zheng [Sat, 16 Mar 2013 00:02:18 +0000 (08:02 +0800)]
mds: make sure table request id unique

When an MDS becomes active, the table server re-sends 'agree' messages
for old prepared requests. If the recovered MDS starts a new table request
at the same time, the new request's ID can happen to be the same as an old
prepared request's ID, because the current table client code assigns request
IDs from zero after the MDS restarts.

This patch makes the table server send 'ready' messages when table clients
become active or it becomes active itself. The 'ready' message updates the
table client's last_reqid to avoid request ID collisions. The message
also replaces the roles of the finish_recovery() and handle_mds_recovery()
callbacks for the table client.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
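
The request ID handling described in the two table client commits above can be modelled roughly as follows. This is a minimal Python sketch, not the MDS code: the class name `TableClientModel`, its fields other than `last_reqid`, and the idea that the 'ready' message carries the highest request ID known to the server are all assumptions made for illustration.

```python
class TableClientModel:
    """Toy model of a table client that waits for the server's 'ready' message."""

    def __init__(self):
        self.last_reqid = 0        # IDs restart from zero after an MDS restart
        self.server_ready = False  # becomes True once 'ready' is handled
        self.queued = []           # payloads waiting for 'ready' (hypothetical field)
        self.sent = []             # (reqid, payload) pairs handed to the server

    def prepare(self, payload):
        # Defer the request until the server has announced it is ready.
        if not self.server_ready:
            self.queued.append(payload)
            return
        self._send(payload)

    def handle_ready(self, server_last_reqid):
        # Assumption: 'ready' tells the client the highest ID already in use,
        # so new IDs cannot collide with old prepared requests.
        self.last_reqid = max(self.last_reqid, server_last_reqid)
        self.server_ready = True
        pending, self.queued = self.queued, []
        for payload in pending:
            self._send(payload)

    def _send(self, payload):
        self.last_reqid += 1
        self.sent.append((self.last_reqid, payload))
```

In this model a request issued before 'ready' is sent exactly once, and only after `last_reqid` has moved past the IDs of the old prepared requests.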
12 years agomds: consider MDS as recovered when it reaches clientreplay state.
Yan, Zheng [Mon, 25 Mar 2013 06:22:13 +0000 (14:22 +0800)]
mds: consider MDS as recovered when it reaches clientreplay state.

An MDS in clientreplay state already starts serving requests. This also
makes MDS::handle_mds_recovery() and MDS::recovery_done() match.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoclient: always remove cond from list after waiting 179/head
Sage Weil [Mon, 1 Apr 2013 16:12:44 +0000 (09:12 -0700)]
client: always remove cond from list after waiting

The signal method removes conds from the list after it signals.  That's
not okay if the cond triggers for some other reason; an invalid Cond*
will remain on the list and get signaled later.

Make the wait_on_list() helper remove it; use that in several callers;
explicitly do the removal in the remaining callers.

Change signal_cond_list() to not clear the list; rely on the signalees to
do that.  Audit all users and make sure they are either using the
wait_on_list() helper (which removes its Cond) or do the remove explicitly.

Backport some form of this: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
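
The ownership rule in this commit (the waiter, not the signaller, removes its entry) can be illustrated with a small Python sketch. It borrows the names `wait_on_list` and `signal_cond_list` from the message, but uses `threading.Event` as a stand-in for the client's Cond and adds a `timeout` parameter for illustration; it is a model of the pattern, not the client code.

```python
import threading
from typing import List


def wait_on_list(waiters: List[threading.Event], timeout: float) -> bool:
    """Register on the list, wait, and always deregister afterwards."""
    ev = threading.Event()
    waiters.append(ev)
    try:
        # Returns True if signalled, False if the wait ended for another
        # reason (here: a timeout).
        return ev.wait(timeout)
    finally:
        # The waiter removes its own entry, so no stale entry is left
        # behind to be signalled later.
        waiters.remove(ev)


def signal_cond_list(waiters: List[threading.Event]) -> None:
    """Wake every waiter without clearing the list."""
    for ev in list(waiters):  # iterate over a copy; waiters remove themselves
        ev.set()
```

Because removal happens in a `finally` block, the list is cleaned up whether the wait ends through a signal or for any other reason, which is the failure mode the commit message describes.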
12 years agolibrbd: fix size arg type for diff_iterate 169/head
Sage Weil [Mon, 1 Apr 2013 15:47:30 +0000 (08:47 -0700)]
librbd: fix size arg type for diff_iterate

Fixes build on 32-bit archs.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoPendingReleaseNotes: note about rbd progress output
Josh Durgin [Mon, 1 Apr 2013 07:15:40 +0000 (00:15 -0700)]
PendingReleaseNotes: note about rbd progress output

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agotest_librbd: add diff_iterate test including discard
Josh Durgin [Mon, 1 Apr 2013 06:13:51 +0000 (23:13 -0700)]
test_librbd: add diff_iterate test including discard

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd.py: add some missing functions
Josh Durgin [Sun, 31 Mar 2013 14:49:59 +0000 (07:49 -0700)]
rbd.py: add some missing functions

discard, flush, and striping info slipped through the cracks before,
but are useful and trivial to add.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrbd: add C and python bindings for diff_iterate
Josh Durgin [Sun, 31 Mar 2013 14:48:49 +0000 (07:48 -0700)]
librbd: add C and python bindings for diff_iterate

The python interface is a bit awkward since it maps directly
to the C interface, but it'll work well enough and not use
tons of memory.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrados: don't insert zero length extents in a diff
Josh Durgin [Mon, 1 Apr 2013 06:07:39 +0000 (23:07 -0700)]
librados: don't insert zero length extents in a diff

They're useless, and trigger an assert in interval_set::insert.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: add formatted output to diff command
Josh Durgin [Sun, 31 Mar 2013 00:32:08 +0000 (17:32 -0700)]
rbd: add formatted output to diff command

All the other commands that display information have this.
For consistency, add it to this command too.

Also switch the plain output to use a TextTable for better readability.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrbd: return -ENOENT from diff_iterate when the snap doesn't exist
Josh Durgin [Sun, 31 Mar 2013 00:28:35 +0000 (17:28 -0700)]
librbd: return -ENOENT from diff_iterate when the snap doesn't exist

This is a bit more helpful than -EINVAL.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: initialize random number generator for bench-write
Josh Durgin [Sun, 31 Mar 2013 00:27:25 +0000 (17:27 -0700)]
rbd: initialize random number generator for bench-write

Without this, the same seed is used each time, so multiple runs
of bench-write with the same parameters have the same I/O pattern.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
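
The effect of an unseeded generator can be shown with a short Python sketch. The helper `write_offsets` and its parameters are hypothetical and only mimic a random-offset write pattern; they are not taken from the rbd tool.

```python
import random


def write_offsets(count, image_size, io_size, seed=None):
    """Pick `count` io_size-aligned offsets inside an image of `image_size` bytes."""
    rng = random.Random(seed)  # seed=None seeds from the OS or the clock
    slots = image_size // io_size
    return [rng.randrange(slots) * io_size for _ in range(count)]


# A fixed seed reproduces the same I/O pattern on every run, which is what
# the commit avoids for repeated benchmark runs.
same_a = write_offsets(5, 1 << 20, 4096, seed=1)
same_b = write_offsets(5, 1 << 20, 4096, seed=1)
```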
12 years agolibrbd: change diff_iterate interface to be more C-friendly
Josh Durgin [Sun, 31 Mar 2013 00:25:18 +0000 (17:25 -0700)]
librbd: change diff_iterate interface to be more C-friendly

Use int instead of bool for the callback, and make it represent
whether the data exists, rather than the opposite, since callers
are likely to test for whether it's data instead of whether it's zeroes.

Change the return value to 0, since an int64_t will wrap around
for large reads, and there's no value in reporting the length
read when it will always be the length requested clipped to the
size of the image.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
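
The callback convention described here (an int flag that is nonzero when the extent holds data, and a return value of 0 on success) can be sketched in Python as below. `iterate_extents` is a hypothetical stand-in that walks a prepared list of extents; it does not call librbd, and stopping on a negative callback result is an assumption of this sketch.

```python
from typing import Callable, List, Tuple

# (offset, length, exists): exists is 1 for data, 0 for a known-zero extent.
Extent = Tuple[int, int, int]


def iterate_extents(extents: List[Extent],
                    cb: Callable[[int, int, int], int]) -> int:
    """Call cb for each extent; return 0 on success or the first negative result."""
    for offset, length, exists in extents:
        r = cb(offset, length, exists)
        if r < 0:
            return r  # assumption: a negative value from the callback aborts
    return 0


def summarize(extents: List[Extent]) -> Tuple[int, int, int]:
    """Return (return code, bytes of data, bytes known to be zero)."""
    totals = {"data": 0, "zero": 0}

    def cb(offset: int, length: int, exists: int) -> int:
        # Callers test the flag directly for "is this data?".
        totals["data" if exists else "zero"] += length
        return 0

    rc = iterate_extents(extents, cb)
    return rc, totals["data"], totals["zero"]
```

Returning 0 instead of a byte count matches the reasoning in the message: the caller already knows the requested length, so a count adds nothing and can overflow.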
12 years agorbd: remove alway-true else condition in import-diff
Josh Durgin [Sat, 30 Mar 2013 00:06:08 +0000 (17:06 -0700)]
rbd: remove alway-true else condition in import-diff

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: make diff banner length depend on the banner
Josh Durgin [Sat, 30 Mar 2013 00:03:59 +0000 (17:03 -0700)]
rbd: make diff banner length depend on the banner

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agomkcephfs: warn that mkcephfs is deprecated in favor of ceph-deploy
Neil Levine [Mon, 1 Apr 2013 15:53:24 +0000 (08:53 -0700)]
mkcephfs: warn that mkcephfs is deprecated in favor of ceph-deploy

Signed-off-by: Neil Levine <neil.levine@inktank.com>
12 years agorbd: fail import-diff if we reach the end of the stream sooner than expected
Josh Durgin [Fri, 29 Mar 2013 23:48:02 +0000 (16:48 -0700)]
rbd: fail import-diff if we reach the end of the stream sooner than expected

safe_read() just protects against EINTR, and may return less data than
requested if it reaches the end of the file. Use safe_read_exact() to
make sure we get the right amount of data.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
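
The difference between a plain read and an exact read can be sketched in Python. `read_exact` below is a hypothetical analogue of safe_read_exact() for file-like objects: it keeps reading until the requested length arrives and treats an early end of stream as an error.

```python
import io


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes from stream, or raise EOFError on a short stream."""
    buf = bytearray()
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            # End of stream reached before the record was complete.
            raise EOFError(f"expected {n} bytes, got {len(buf)}")
        buf.extend(chunk)
    return bytes(buf)


header = read_exact(io.BytesIO(b"abcdef"), 4)  # b"abcd"
```

A truncated diff stream then fails loudly instead of being applied with a partial record.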
12 years agorbd: complete progress for import-diff from stdin
Josh Durgin [Fri, 29 Mar 2013 23:44:50 +0000 (16:44 -0700)]
rbd: complete progress for import-diff from stdin

The diff format gives us a size, so unlike a normal import, we do update progress.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: fix else style in import-diff
Josh Durgin [Fri, 29 Mar 2013 23:41:42 +0000 (16:41 -0700)]
rbd: fix else style in import-diff

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: update progress as a diff is exported
Josh Durgin [Fri, 29 Mar 2013 23:35:20 +0000 (16:35 -0700)]
rbd: update progress as a diff is exported

This will be jumpy since changed extents probably aren't evenly
distributed, but it's better than nothing.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: remove unused argument from do_diff()
Josh Durgin [Fri, 29 Mar 2013 23:27:32 +0000 (16:27 -0700)]
rbd: remove unused argument from do_diff()

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agorbd: fix size change output
Sage Weil [Fri, 29 Mar 2013 15:20:22 +0000 (08:20 -0700)]
rbd: fix size change output

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd: send progress info to stderr, not stdout
Sage Weil [Fri, 29 Mar 2013 15:17:49 +0000 (08:17 -0700)]
rbd: send progress info to stderr, not stdout

This avoids interfering when export is sent to stdout.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd: include 'diff' command in man page
Sage Weil [Fri, 29 Mar 2013 04:48:51 +0000 (21:48 -0700)]
rbd: include 'diff' command in man page

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd: update man page for import-diff and export-diff
Sage Weil [Fri, 29 Mar 2013 04:47:12 +0000 (21:47 -0700)]
rbd: update man page for import-diff and export-diff

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd: prevent import-diff if start snapshot is not already present
Sage Weil [Fri, 29 Mar 2013 04:53:30 +0000 (21:53 -0700)]
rbd: prevent import-diff if start snapshot is not already present

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd: fail import-diff if end snap already exists
Sage Weil [Fri, 29 Mar 2013 04:29:55 +0000 (21:29 -0700)]
rbd: fail import-diff if end snap already exists

This will prevent a user from inadvertently applying a diff twice.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc/dev/rbd-diff: specify that metadata records come before data
Sage Weil [Fri, 29 Mar 2013 04:19:32 +0000 (21:19 -0700)]
doc/dev/rbd-diff: specify that metadata records come before data

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: implement image.snap_exists()
Sage Weil [Fri, 29 Mar 2013 04:17:21 +0000 (21:17 -0700)]
librbd: implement image.snap_exists()

This is a much more convenient way to tell if a snapshot already exists.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrados: move snap_set_diff to librados/
Sage Weil [Thu, 28 Mar 2013 23:44:23 +0000 (16:44 -0700)]
librados: move snap_set_diff to librados/

This is most closely related to the librados list_snaps API; move it there.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrados: cleanly define SNAP_HEAD, SNAP_DIR constants
Sage Weil [Thu, 28 Mar 2013 23:13:35 +0000 (16:13 -0700)]
librados: cleanly define SNAP_HEAD, SNAP_DIR constants

We were using the internal CEPH_NOSNAP and CEPH_SNAPDIR constants, and
defining a clone_info_t::HEAD (with a different value).  The docs were
referring to the internal constant names.

Instead, define librados constants (C and C++) with the same values as the
internal types.

Note that this changes the clone_info_t::HEAD value from -1 to -2 so that
it now matches the internal type.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrados: document list_snaps
Sage Weil [Thu, 28 Mar 2013 23:01:55 +0000 (16:01 -0700)]
librados: document list_snaps

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: drop unused elapsed calc for diff_iterate
Sage Weil [Thu, 28 Mar 2013 22:40:26 +0000 (15:40 -0700)]
librbd: drop unused elapsed calc for diff_iterate

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: diff_iterate fromsnapname after the end snap is also invalid
Sage Weil [Thu, 28 Mar 2013 22:17:19 +0000 (15:17 -0700)]
librbd: diff_iterate fromsnapname after the end snap is also invalid

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: document diff_iterate in header
Sage Weil [Thu, 28 Mar 2013 22:16:35 +0000 (15:16 -0700)]
librbd: document diff_iterate in header

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: uint64_t len for diff_iterate
Sage Weil [Thu, 28 Mar 2013 22:12:11 +0000 (15:12 -0700)]
librbd: uint64_t len for diff_iterate

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc/dev/rbd-diff: update incremental file format
Sage Weil [Thu, 28 Mar 2013 21:19:59 +0000 (14:19 -0700)]
doc/dev/rbd-diff: update incremental file format

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoqa: rbd/diff_continuous.sh: use non-standard striping
Sage Weil [Thu, 28 Mar 2013 21:13:46 +0000 (14:13 -0700)]
qa: rbd/diff_continuous.sh: use non-standard striping

Exercise the striping arithmetic by using non-standard striping that
varies between the parent and child.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: fix diff_iterate arithmetic for non-standard striping
Sage Weil [Thu, 28 Mar 2013 21:13:03 +0000 (14:13 -0700)]
librbd: fix diff_iterate arithmetic for non-standard striping

This code is confusing because we are moving back and forth between
image offsets, "buffer" offsets (image offsets relative to off), and
object offsets.  Fix the math.

Signed-off-by: Sage Weil <sage@inktank.com>
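
The three kinds of offsets named in this message can be kept apart with a small Python sketch. It covers only the simplest layout, where each object holds one contiguous `object_size` slice of the image; that restriction is an assumption of the sketch, and the non-standard striping this commit fixes needs a more involved mapping. The helper name `split_extent` is hypothetical.

```python
def split_extent(off: int, length: int, object_size: int):
    """Yield (object_no, object_off, buffer_off, chunk_len) for [off, off + length).

    off        - image offset where the request starts
    buffer_off - image offset relative to off
    object_off - offset inside the object that backs the chunk
    """
    pos = off
    end = off + length
    while pos < end:
        object_no, object_off = divmod(pos, object_size)
        # A chunk stops at the object boundary or at the end of the request.
        chunk_len = min(object_size - object_off, end - pos)
        yield object_no, object_off, pos - off, chunk_len
        pos += chunk_len


# A 10-byte request at image offset 6 with 8-byte objects spans two objects.
pieces = list(split_extent(6, 10, 8))
```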
12 years agoqa: rbd/diff_continuous.sh: base test off a clone
Sage Weil [Thu, 28 Mar 2013 04:26:54 +0000 (21:26 -0700)]
qa: rbd/diff_continuous.sh: base test off a clone

Get a bit of coverage on clones by starting with a clone.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd: implement simple 'diff' command
Sage Weil [Thu, 28 Mar 2013 16:26:43 +0000 (09:26 -0700)]
rbd: implement simple 'diff' command

Report extents allocated/changed, and whether they contain data or zeros.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: handle diff from clone
Sage Weil [Thu, 28 Mar 2013 16:26:11 +0000 (09:26 -0700)]
librbd: handle diff from clone

If we have a parent image, and the reference is from snap 0 (beginning of
time) we need to look at the diff on the parent from the beginning of time
and report that when we get an ENOENT.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd: send import debug noise to dout, not stdout
Sage Weil [Wed, 27 Mar 2013 18:07:53 +0000 (11:07 -0700)]
rbd: send import debug noise to dout, not stdout

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoqa: add rbd/diff_continuous.sh stress test
Sage Weil [Wed, 27 Mar 2013 06:16:54 +0000 (23:16 -0700)]
qa: add rbd/diff_continuous.sh stress test

Stress test that does io on an image while we are mirroring a diff from
earlier snaps to a second copy.  At the end, verify that all snaps have
matching content.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd: implement 'export-diff' and 'import-diff' commands
Sage Weil [Tue, 26 Mar 2013 20:32:28 +0000 (13:32 -0700)]
rbd: implement 'export-diff' and 'import-diff' commands

Export a diff of an image from a previous snapshot to a file (or stdout).

Import a diff and apply it to an image, and then create the ending
snapshot.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorbd: add --io-pattern <seq|rand> option to bench-write
Sage Weil [Tue, 26 Mar 2013 20:33:00 +0000 (13:33 -0700)]
rbd: add --io-pattern <seq|rand> option to bench-write

Write to random offsets instead of sequentially.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: implement diff_iterate
Sage Weil [Mon, 25 Mar 2013 21:14:50 +0000 (14:14 -0700)]
librbd: implement diff_iterate

Implement a diff_iterate() method that will iterate over an image and
report which extents vary between two snapshots (or a snapshot and the
head).  The callback gets an extent and a flag indicating whether it is
full of data or is known to be zero in the ending snapshot.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agolibrados: expose snapset seq via list_snaps
Sage Weil [Tue, 26 Mar 2013 04:02:37 +0000 (21:02 -0700)]
librados: expose snapset seq via list_snaps

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosdc/Objecter: prval optional for listsnaps
Sage Weil [Mon, 25 Mar 2013 04:18:46 +0000 (21:18 -0700)]
osdc/Objecter: prval optional for listsnaps

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: fix error codes for list-snaps
Sage Weil [Tue, 26 Mar 2013 15:53:09 +0000 (08:53 -0700)]
osd: fix error codes for list-snaps

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: fix clone snap list for list-snaps
Sage Weil [Tue, 26 Mar 2013 15:53:00 +0000 (08:53 -0700)]
osd: fix clone snap list for list-snaps

We need to return the list of snaps that each clone is defined for, not
the list of snaps we know may or may not exist globally over a similar
interval.  This requires looking at the clone's obc, unfortunately.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: wait for all clones on SNAPDIR requests
Sage Weil [Wed, 27 Mar 2013 01:07:07 +0000 (18:07 -0700)]
osd: wait for all clones on SNAPDIR requests

Wait for all clones to be present.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: direct reads on SNAPDIR to either head or snapdir
Sage Weil [Tue, 26 Mar 2013 17:31:19 +0000 (10:31 -0700)]
osd: direct reads on SNAPDIR to either head or snapdir

The list_snaps operation needs to look at the SnapSet, and is logically
querying all revisions of the object.  Make requests to SNAPDIR be
read-only, and grab the head or snapdir obc transparently (whichever one
exists).  This allows us to list snaps when, say, the head does not
exist, but there are in fact snaps.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: do not include snaps with head on list_snaps()
Sage Weil [Tue, 26 Mar 2013 04:18:47 +0000 (21:18 -0700)]
osd: do not include snaps with head on list_snaps()

If there is a sequence of snaps 1, 2, 3, 4, 5, and we have a clone
2 with [1,2], and the head reflects content at snap times [3,4,5], then
the snap_list should return

 clone 2 snaps [1,2]
 head snaps
 seq 2

because it never saw a write after snap 2, and therefore has the same
content currently as it did in snaps 3,4,5.  If the SnapSet on the
object lists snaps 3,4,5, and the head exists, it actually means the
object was deleted between 2 and 3, and was recreated after 5:

 clone 2 snaps [1,2]
 head snaps []
 seq 5

The key to telling the two situations apart is the seq number on the
SnapSet (now included in the list_snaps reply) that tells us when the
last update was.

Signed-off-by: Sage Weil <sage@inktank.com>
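
The two replies in this message can be told apart with a rule like the one below. This is a simplified Python model, assuming a reply is just a list of clones with their snaps, a head-exists flag, and the seq number; the function name `resolve_snap` is hypothetical and this is not the OSD or librados logic.

```python
from typing import List, Optional, Tuple

# (clone id, snaps that the clone is defined for)
Clone = Tuple[int, List[int]]


def resolve_snap(clones: List[Clone], head_exists: bool,
                 seq: int, snap: int) -> Optional[str]:
    """Return which revision holds the object's content at `snap`, or None."""
    for clone_id, snaps in clones:
        if snap in snaps:
            return f"clone {clone_id}"
    # No clone covers this snap. The head stands in only for snaps taken
    # after the last update the object saw (snap > seq).
    if head_exists and snap > seq:
        return "head"
    # Otherwise the object did not exist at that snap.
    return None


clones = [(2, [1, 2])]
untouched = resolve_snap(clones, True, 2, 4)  # seq 2: head is unchanged since snap 2
recreated = resolve_snap(clones, True, 5, 4)  # seq 5: deleted, then recreated after 5
```

With the same clone list, only the seq value changes the answer for snap 4, which is why the reply has to carry it.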
12 years agoosd: clean up some whitespace
Sage Weil [Tue, 26 Mar 2013 03:59:23 +0000 (20:59 -0700)]
osd: clean up some whitespace

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: include SnapSet seq in the list snaps response
Sage Weil [Tue, 26 Mar 2013 03:59:11 +0000 (20:59 -0700)]
osd: include SnapSet seq in the list snaps response

It is important to know the latest seq that the object has seen in order
to tell if a response like

 clone 2 snaps=[1,2]
 clone head snaps=[]

was untouched before a hypothetical snap 3, or deleted prior to snap 3,
and then recreated+modified after.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: make LIST_WATCHERS and LIST_SNAPS print nicely for OSDOp
Sage Weil [Mon, 25 Mar 2013 04:18:31 +0000 (21:18 -0700)]
osd: make LIST_WATCHERS and LIST_SNAPS print nicely for OSDOp

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agostrings: add 'list-watchers' to MOSDOp strings
Sage Weil [Mon, 25 Mar 2013 04:18:09 +0000 (21:18 -0700)]
strings: add 'list-watchers' to MOSDOp strings

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-cors-rebased'
Sage Weil [Mon, 1 Apr 2013 06:23:47 +0000 (23:23 -0700)]
Merge remote-tracking branch 'gh/wip-cors-rebased'

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: fix warning
Sage Weil [Mon, 1 Apr 2013 04:47:38 +0000 (21:47 -0700)]
rgw: fix warning

On a 64-bit arch, we still want to make sure it's a 32-bit value.  Gcc is
too smart for us to just cast; it will still warn on 32-bit arch that the
comparison is always true.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorgw: add missing include file
Yehuda Sadeh [Sun, 31 Mar 2013 07:25:13 +0000 (00:25 -0700)]
rgw: add missing include file

Add missing limits.h, needed for ULONG_MAX.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>