git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
12 years agotest: fix signed/unsigned comparison in test_cors
Greg Farnum [Mon, 1 Apr 2013 16:56:27 +0000 (09:56 -0700)]
test: fix signed/unsigned comparison in test_cors

Signed-off-by: Greg Farnum <greg@inktank.com>
Acked-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'wip-mds'
Greg Farnum [Mon, 1 Apr 2013 16:31:37 +0000 (09:31 -0700)]
Merge branch 'wip-mds'

12 years agomds: bump the protocol version.
Greg Farnum [Mon, 1 Apr 2013 16:27:27 +0000 (09:27 -0700)]
mds: bump the protocol version.

We've changed quite a lot of the restart behavior, as well as one
of the message encodings. This is cheaper and easier than using feature bits,
and CephFS is still a tech preview or whatever, so let's cover them using this.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't roll back prepared table updates
Yan, Zheng [Sun, 31 Mar 2013 06:19:17 +0000 (14:19 +0800)]
mds: don't roll back prepared table updates

When the table server is recovering, it re-sends 'agree' messages for
prepared table updates. It is possible that a table client receives an
'agree' message before it commits the corresponding update. Don't
send a 'rollback' message back to the server in this case.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: clear scatter dirty if replica inode has no auth subtree
Yan, Zheng [Sun, 17 Mar 2013 03:13:38 +0000 (11:13 +0800)]
mds: clear scatter dirty if replica inode has no auth subtree

This avoids sending superfluous scatterlock state to recovering MDS

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't replicate purging dentry
Yan, Zheng [Fri, 15 Mar 2013 05:09:34 +0000 (13:09 +0800)]
mds: don't replicate purging dentry

open_remote_ino is racy: it's possible that someone deletes the inode's
last linkage while the MDS is discovering the inode.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: eval inodes with caps imported by cache rejoin message
Yan, Zheng [Sun, 17 Mar 2013 01:45:55 +0000 (09:45 +0800)]
mds: eval inodes with caps imported by cache rejoin message

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: try merging subtree after clear EXPORTBOUND
Yan, Zheng [Sat, 16 Mar 2013 13:43:17 +0000 (21:43 +0800)]
mds: try merging subtree after clear EXPORTBOUND

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: clear dirty inode rstat if import fails
Yan, Zheng [Sat, 16 Mar 2013 04:38:56 +0000 (12:38 +0800)]
mds: clear dirty inode rstat if import fails

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't open dirfrag while subtree is frozen
Yan, Zheng [Tue, 12 Mar 2013 12:51:43 +0000 (20:51 +0800)]
mds: don't open dirfrag while subtree is frozen

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: notify bystanders if export aborts
Yan, Zheng [Thu, 14 Mar 2013 03:57:16 +0000 (11:57 +0800)]
mds: notify bystanders if export aborts

So bystanders know the subtree is single auth earlier.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: fix export cancel notification
Yan, Zheng [Thu, 14 Mar 2013 04:24:54 +0000 (12:24 +0800)]
mds: fix export cancel notification

The comment says that if the importer is dead, bystanders think the
exporter is the only auth, as per mdcache->handle_mds_failure(). But
there is no such code in MDCache::handle_mds_failure().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: unfreeze subtree if import aborts in PREPPED state
Yan, Zheng [Thu, 14 Mar 2013 04:01:08 +0000 (12:01 +0800)]
mds: unfreeze subtree if import aborts in PREPPED state

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: check MDS peer's state through mdsmap
Yan, Zheng [Thu, 14 Mar 2013 03:23:48 +0000 (11:23 +0800)]
mds: check MDS peer's state through mdsmap

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: avoid double auth pin for file recovery
Yan, Zheng [Thu, 14 Mar 2013 02:11:31 +0000 (10:11 +0800)]
mds: avoid double auth pin for file recovery

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: add dirty imported dirfrag to LogSegment
Yan, Zheng [Tue, 12 Mar 2013 08:11:13 +0000 (16:11 +0800)]
mds: add dirty imported dirfrag to LogSegment

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: send lock action message when auth MDS is in proper state.
Yan, Zheng [Tue, 12 Mar 2013 08:51:53 +0000 (16:51 +0800)]
mds: send lock action message when auth MDS is in proper state.

For a rejoining object, don't send a lock ACK message, because lock states
are still uncertain. The lock ACK may confuse the object's auth MDS and
trigger an assertion.

If the object's auth MDS is not active, just skip sending NUDGE, REQRDLOCK
and REQSCATTER messages. MDCache::handle_mds_recovery() will take care
of them.

Also defer the caps release message until clientreplay or active.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: issue caps when lock state in replica become SYNC
Yan, Zheng [Tue, 12 Mar 2013 08:19:26 +0000 (16:19 +0800)]
mds: issue caps when lock state in replica become SYNC

because client can request READ caps from non-auth MDS.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: share inode max size after MDS recovers
Yan, Zheng [Tue, 12 Mar 2013 08:27:22 +0000 (16:27 +0800)]
mds: share inode max size after MDS recovers

The MDS may crash after journaling the new max size, but before sending
the new max size to the client. Later when the MDS recovers, the client
re-requests the new max size, but the MDS finds max size unchanged. So
the client waits for the new max size forever. This issue can be avoided
by checking the client cap's last_sent and sharing the inode max size if
it is zero (a reconnected cap's last_sent is zero).

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: take object's versionlock when rejoinning xlock
Yan, Zheng [Thu, 14 Mar 2013 12:56:27 +0000 (20:56 +0800)]
mds: take object's versionlock when rejoinning xlock

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: reqid for rejoinning authpin/wrlock need to be list
Yan, Zheng [Thu, 14 Mar 2013 12:29:53 +0000 (20:29 +0800)]
mds: reqid for rejoinning authpin/wrlock need to be list

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: handle linkage mismatch during cache rejoin
Yan, Zheng [Thu, 14 Mar 2013 12:06:27 +0000 (20:06 +0800)]
mds: handle linkage mismatch during cache rejoin

In an MDS cluster, not all file system namespace operations that impact
multiple MDS use two phase commit. Some operations use dentry link/unlink
messages to update a replica dentry's linkage after they are committed by
the master MDS. It's possible the master MDS crashes after journaling an
operation, but before sending the dentry link/unlink messages. Later when
the MDS recovers and receives cache rejoin messages from the surviving
MDS, it will find linkage mismatch.

The original cache rejoin code does not properly handle the case where
dentry unlink messages were missing. Unlinked inodes were linked to stray
dentries, so the cache rejoin ack message needs to push replicas of these
stray dentries to the surviving MDS.

This patch also adds code that handles cache expiration in the middle of
cache rejoining.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: encode dirfrag base in cache rejoin ack
Yan, Zheng [Wed, 13 Mar 2013 12:58:26 +0000 (20:58 +0800)]
mds: encode dirfrag base in cache rejoin ack

The cache rejoin ack message already encodes the inode base; make it also
encode the dirfrag base. This allows the message to replicate stray dentries
like the MDentryUnlink message does. The function will be used by a later patch.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoMerge pull request #179 from ceph/wip-client-cond
Gregory Farnum [Mon, 1 Apr 2013 16:22:45 +0000 (09:22 -0700)]
Merge pull request #179 from ceph/wip-client-cond

client: always remove cond from list after waiting

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: include replica nonce in MMDSCacheRejoin::inode_strong
Yan, Zheng [Wed, 13 Mar 2013 12:47:11 +0000 (20:47 +0800)]
mds: include replica nonce in MMDSCacheRejoin::inode_strong

So the recovering MDS can properly handle cache expire messages.
Also increase the nonce value when sending the cache rejoin acks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Also update the MMDSCacheRejoin encoding to the new format.
Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agomon: OSDMonitor: only output warn/err messages if quotas are set > 0
Joao Eduardo Luis [Mon, 1 Apr 2013 16:14:15 +0000 (17:14 +0100)]
mon: OSDMonitor: only output warn/err messages if quotas are set > 0

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agomds: remove MDCache::rejoin_fetch_dirfrags()
Yan, Zheng [Wed, 13 Mar 2013 11:23:18 +0000 (19:23 +0800)]
mds: remove MDCache::rejoin_fetch_dirfrags()

In commit 77946dcdae (mds: fetch missing inodes from disk), I introduced
MDCache::rejoin_fetch_dirfrags(). But it basically duplicates the function
of MDCache::open_undef_dirfrags(), so just remove rejoin_fetch_dirfrags()
and make open_undef_dirfrags() also handle undefined inodes.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: fix MDS recovery involving cross authority rename
Yan, Zheng [Wed, 13 Mar 2013 10:56:27 +0000 (18:56 +0800)]
mds: fix MDS recovery involving cross authority rename

In an MDS cluster, a rename operation may involve multiple MDS. The
rename source's auth MDS may crash after some witness MDS have prepared
the rename but before the rename is committing. Later when the MDS
recovers, its subtree map and linkages are different from the prepared
MDS'. This causes problems for both subtree resolve and cache rejoin.
The solution is, if the rename source's auth MDS fails, the prepared
witness MDS query the master MDS whether the operation is committing. If
it's not, they roll back the rename, then send resolve messages to the
recovering MDS.

Another similar case is a prepared witness MDS crashing when the
rename source's auth MDS has prepared or is preparing the operation.
When the witness recovers, the master just delays sending the resolve
ack message until it commits the operation.

This patch also updates Server::handle_client_rename(). Make preparing
the rename source's auth MDS be the final step before committing the
rename.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: send resolve acks after master updates are safely logged
Yan, Zheng [Wed, 13 Mar 2013 08:54:58 +0000 (16:54 +0800)]
mds: send resolve acks after master updates are safely logged

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: send cache rejoin messages after gathering all resolves
Yan, Zheng [Thu, 14 Mar 2013 07:06:45 +0000 (15:06 +0800)]
mds: send cache rejoin messages after gathering all resolves

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't send MDentry{Link,Unlink} before receiving cache rejoin
Yan, Zheng [Fri, 15 Mar 2013 02:34:09 +0000 (10:34 +0800)]
mds: don't send MDentry{Link,Unlink} before receiving cache rejoin

The active MDS calls MDCache::rejoin_scour_survivor_replicas() when it
receives the cache rejoin message. The function will remove the objects
replicated by MDentry{Link,Unlink} from replica map.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: set resolve/rejoin gather MDS set in advance
Yan, Zheng [Thu, 14 Mar 2013 16:08:39 +0000 (00:08 +0800)]
mds: set resolve/rejoin gather MDS set in advance

An active MDS may receive a resolve/rejoin message before receiving
the mdsmap message that claims the MDS cluster is in resolving/rejoining
state. So instead of setting the gather MDS set when receiving the mdsmap,
set it in advance when detecting an MDS failure.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't send resolve message between active MDS
Yan, Zheng [Thu, 14 Mar 2013 04:27:51 +0000 (12:27 +0800)]
mds: don't send resolve message between active MDS

When the MDS cluster is resolving, the current behavior is to send a subtree
resolve message to all other MDS and wait for all other MDS' resolve messages.
The problem is that active MDS can have different subtree maps due to rename.
Besides, gathering active MDS' resolve messages is also racy. The only
function of these messages is to disambiguate other MDS' imports. We can
replace it with an import finish notification.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: compose and send resolve messages in batch
Yan, Zheng [Wed, 13 Mar 2013 08:23:30 +0000 (16:23 +0800)]
mds: compose and send resolve messages in batch

Resolve messages for all MDS are the same, so we can compose and
send them in batch.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: don't delay processing replica buffer in slave request
Yan, Zheng [Wed, 13 Mar 2013 06:05:21 +0000 (14:05 +0800)]
mds: don't delay processing replica buffer in slave request

Replicated objects need to be added into the cache immediately

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: unify slave request waiting
Yan, Zheng [Wed, 13 Mar 2013 02:28:58 +0000 (10:28 +0800)]
mds: unify slave request waiting

When requesting remote xlock or remote wrlock, the master request is
put into lock object's REMOTEXLOCK waiting queue. The problem is that
remote wrlock's target can be different from lock's auth MDS. When
the lock's auth MDS recovers, MDCache::handle_mds_recovery() may wake
incorrect request. So just unify slave request waiting, dispatch the
master request when receiving slave request reply.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agomds: defer eval gather locks when removing replica
Yan, Zheng [Tue, 12 Mar 2013 12:24:52 +0000 (20:24 +0800)]
mds: defer eval gather locks when removing replica

Locks' states should not change between composing the cache rejoin ack
messages and sending the message. If Locker::eval_gather() is called
in MDCache::{inode,dentry}_remove_replica(), it may wake requests and
change locks' states.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: avoid sending duplicated table prepare/commit
Yan, Zheng [Sun, 31 Mar 2013 09:54:50 +0000 (17:54 +0800)]
mds: avoid sending duplicated table prepare/commit

This patch makes the table client defer sending table prepare/commit messages
until it receives the table server's 'ready' message. This avoids duplicated
table prepare/commit messages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: make sure table request id unique
Yan, Zheng [Sat, 16 Mar 2013 00:02:18 +0000 (08:02 +0800)]
mds: make sure table request id unique

When an MDS becomes active, the table server re-sends 'agree' messages
for old prepared requests. If the recovered MDS starts a new table request
at the same time, the new request's ID can happen to be the same as an old
prepared request's ID, because the current table client code assigns request
IDs from zero after the MDS restarts.

This patch makes the table server send 'ready' messages when table clients
become active or when it becomes active itself. The 'ready' message updates
the table client's last_reqid to avoid request ID collisions. The message
also replaces the roles of the finish_recovery() and handle_mds_recovery()
callbacks for the table client.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
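
The request ID collision described in this commit, and the effect of the 'ready' message, can be sketched with a small Python model. This is not the Ceph table client: the class name, the method names and the idea that 'ready' carries the highest request ID known to the server are assumptions made for illustration; only last_reqid is named in the commit message.

```python
class TableClient:
    """Toy model of a table client's request ID allocation."""

    def __init__(self):
        # After a restart the counter starts over, as the commit describes.
        self.last_reqid = 0

    def handle_ready(self, server_max_reqid):
        # Hypothetical 'ready' handler: never hand out an ID that the
        # server may still hold for an old prepared request.
        self.last_reqid = max(self.last_reqid, server_max_reqid)

    def next_reqid(self):
        # Allocate the next request ID after the highest one known so far.
        self.last_reqid += 1
        return self.last_reqid
```

Without the call to handle_ready(), a restarted client would start again at 1 and could reuse the ID of a request the server still considers prepared.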
12 years agomds: consider MDS as recovered when it reaches clientreplay state.
Yan, Zheng [Mon, 25 Mar 2013 06:22:13 +0000 (14:22 +0800)]
mds: consider MDS as recovered when it reaches clientreplay state.

An MDS in clientreplay state has already started serving requests. This also
makes MDS::handle_mds_recovery() and MDS::recovery_done() match.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoclient: always remove cond from list after waiting 179/head
Sage Weil [Mon, 1 Apr 2013 16:12:44 +0000 (09:12 -0700)]
client: always remove cond from list after waiting

The signal method removes conds from the list after it signals.  That's
not okay if the cond triggers for some other reason; an invalid Cond*
will remain on the list and get signaled later.

Make the wait_on_list() helper remove it; use that in several callers;
explicitly do the removal in the remaining callers.

Change signal_cond_list() to not clear the list; rely on the signalees to
do that.  Audit all users and make sure they are either using the
wait_on_list() helper (which removes its Cond) or do the remove explicitly.

Backport some form of this: bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
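
The pattern described in this commit can be sketched as follows. This is a minimal Python model, not the Ceph client code: the names wait_on_list and signal_cond_list come from the commit message, but the signatures, the use of threading.Condition and the predicate argument are assumptions made for illustration.

```python
import threading


def wait_on_list(waiters, lock, predicate, timeout=None):
    """Wait on a fresh condition registered in `waiters`.

    The caller must hold `lock`. The condition is always removed from
    the list on the way out, whether it was signaled, timed out, or
    woke up for some other reason.
    """
    cond = threading.Condition(lock)
    waiters.append(cond)
    try:
        return cond.wait_for(predicate, timeout)
    finally:
        waiters.remove(cond)


def signal_cond_list(waiters):
    """Wake every waiter without clearing the list (caller holds the lock).

    Each waiter removes its own condition in wait_on_list().
    """
    for cond in list(waiters):
        cond.notify_all()
```

Because removal happens in the waiter's own finally clause, a condition that wakes up through a timeout cannot be left behind on the list and signaled later.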
12 years agomkcephfs: warn that mkcephfs is deprecated in favor of ceph-deploy
Neil Levine [Mon, 1 Apr 2013 15:53:24 +0000 (08:53 -0700)]
mkcephfs: warn that mkcephfs is deprecated in favor of ceph-deploy

Signed-off-by: Neil Levine <neil.levine@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-cors-rebased'
Sage Weil [Mon, 1 Apr 2013 06:23:47 +0000 (23:23 -0700)]
Merge remote-tracking branch 'gh/wip-cors-rebased'

Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: fix warning
Sage Weil [Mon, 1 Apr 2013 04:47:38 +0000 (21:47 -0700)]
rgw: fix warning

On a 64-bit arch, we still want to make sure it's a 32-bit value.  Gcc is
too smart for us to just cast; it will still warn on 32-bit arch that the
comparison is always true.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorgw: add missing include file
Yehuda Sadeh [Sun, 31 Mar 2013 07:25:13 +0000 (00:25 -0700)]
rgw: add missing include file

Add missing limits.h, needed for ULONG_MAX.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoMakefile.am: change some cors rules
Yehuda Sadeh [Sun, 31 Mar 2013 06:28:18 +0000 (23:28 -0700)]
Makefile.am: change some cors rules

The cors unit test should be a standalone test (not part of the make
unit tests) as it requires having a running gateway and needs input params
to run correctly.
Also update missing header files.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: fix a few warnings
Yehuda Sadeh [Sun, 31 Mar 2013 06:27:58 +0000 (23:27 -0700)]
rgw: fix a few warnings

Adjust data types

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: more cors fixes
Babu Shanmugam [Thu, 28 Mar 2013 05:05:01 +0000 (10:35 +0530)]
rgw: more cors fixes

Remove the check for read_cors_config in rgw_main.cc, and change the type of 'a' from long to unsigned, as max_age cannot be a negative integer

Modified the type of 'a' to unsigned long and used ULONG_MAX and strtol in rgw_cors_swift.h

Signed-off-by: Babu Shanmugam <anbu@enovance.com>
12 years agorgw: cors, style fixes, other fixes
Yehuda Sadeh [Wed, 27 Mar 2013 19:57:06 +0000 (12:57 -0700)]
rgw: cors, style fixes, other fixes

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agorgw: with CORS support
Babu Shanmugam [Tue, 5 Mar 2013 03:52:55 +0000 (09:22 +0530)]
rgw: with CORS support

With CORS test cases

1. Added license headers to the cors files
2. SWIFT POST metadata for cors will replace the old cors configuration
3. Fixed a bug in rgw_cors_swift.h

With Yehuda's review comments along with some fixes;
1. If the origin is allowed only for https, we should not approve the same host for http requests
2. Accounted for hostname situations like www.www.org, www.wowwww.com, or www.*
3. Replaced atoi with strtol
4. Have a centralized place for parsing host names, hence avoiding duplicates

Checked certain scenarios with Amazon S3 and made changes accordingly

With some fixes in rgw_cors.cc and str_list.cc

Removing the whitespace auto-append to the delimiters in get_str_list(), added white spaces delimiters in is_string_in_set()
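
The origin checks listed in the review comments above (scheme-sensitive matching and wildcard host names) could look roughly like the Python sketch below. The rgw implementation is not shown in this log, so everything here is an assumption made for illustration: the function name is_origin_allowed, the use of fnmatch-style globbing for wildcards, and the rule that a pattern without a scheme matches any scheme.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_origin_allowed(origin, allowed):
    """Check `origin` against a list of allowed origin patterns.

    A pattern with a scheme ('https://*.example.com') only matches that
    scheme; a bare host pattern ('www.*') matches any scheme.
    """
    parts = urlsplit(origin)
    if not parts.scheme or not parts.hostname:
        return False
    for pattern in allowed:
        if "://" in pattern:
            scheme, host_pattern = pattern.split("://", 1)
            if scheme.lower() != parts.scheme:
                # Allowed for another scheme only, e.g. https but not http.
                continue
        else:
            host_pattern = pattern
        if fnmatchcase(parts.hostname, host_pattern.lower()):
            return True
    return False
```

Parsing the origin in one place mirrors item 4 of the list: every caller sees the same scheme and host name split.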

12 years agomds: mark connection down when MDS fails
Yan, Zheng [Tue, 12 Mar 2013 08:58:08 +0000 (16:58 +0800)]
mds: mark connection down when MDS fails

So if the MDS restarts and uses the same address, it does not get
old messages.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: fix MDCache::adjust_bounded_subtree_auth()
Yan, Zheng [Tue, 12 Mar 2013 11:41:13 +0000 (19:41 +0800)]
mds: fix MDCache::adjust_bounded_subtree_auth()

There are cases that need to both create a new bound and swallow an
intervening subtree. For example: an MDS exports subtree A with bound B and
imports subtree B with bound C at the same time. The MDS crashes, exporting
subtree A fails, but importing subtree B succeeds. During recovery, the
MDS may create new bound C and swallow subtree B.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: process finished contexts in batch
Yan, Zheng [Thu, 31 Jan 2013 02:07:35 +0000 (10:07 +0800)]
mds: process finished contexts in batch

If there are several unstable locks in an inode, the current Locker::eval(CInode*,)
processes each lock's finished contexts separately. This may cause a very deep
call stack if finished contexts also call Locker::eval() on the same inode.
An extreme example is:

Locker::eval() wakes an open request. Server::handle_client_open() starts
a log entry, then calls Locker::issue_new_caps(). Locker::issue_new_caps()
calls Locker::eval() and wakes another request. The latter request also tries
starting a log entry.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
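
The batching idea in this commit can be sketched in Python: collect the finished contexts of every lock first, then run them in a single pass. This is a toy model rather than the Locker code; the Lock class, take_finished() and eval_locks() are hypothetical names introduced for the sketch.

```python
class Lock:
    """Toy lock that only tracks callbacks waiting to be run."""

    def __init__(self, name):
        self.name = name
        self.finished = []

    def take_finished(self):
        # Hand over the pending callbacks and leave the lock's list empty.
        done, self.finished = self.finished, []
        return done


def eval_locks(locks):
    """Collect finished contexts from every lock, then run them in one pass."""
    finished = []
    for lock in locks:
        finished.extend(lock.take_finished())
    for callback in finished:
        callback()
```

If a callback re-enters eval_locks() on the same locks, the nested call finds nothing left to run and returns immediately, so the remaining callbacks run from the outer call instead of from ever deeper stack frames.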
12 years agomds: preserve subtree bounds until slave commit
Yan, Zheng [Thu, 31 Jan 2013 02:37:11 +0000 (10:37 +0800)]
mds: preserve subtree bounds until slave commit

When replaying an operation that renames a directory inode to a non-auth subtree,
if the inode has subtree bounds, we should prevent them from being trimmed
until slave commit.

This patch also fixes a bug in ESlaveUpdate::replay(). EMetaBlob::replay()
should be called before MDCache::finish_uncommitted_slave_update().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoMerge pull request #175 from dachary/wip-4594
Sage Weil [Sun, 31 Mar 2013 01:22:01 +0000 (18:22 -0700)]
Merge pull request #175 from dachary/wip-4594

fix null character in object name triggering segfault

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agofix null character in object name triggering segfault 175/head
Loic Dachary [Sat, 30 Mar 2013 10:26:12 +0000 (11:26 +0100)]
fix null character in object name triggering segfault

Parsing \n in  lfn_parse_object_name is implemented with

  out->append('\0');

which segfaults when using libstdc++ and g++ version 4.6.3 on Debian
GNU/Linux. It is replaced with

  (*out) += '\0';

to avoid the buggy implicit conversion. There is no append(charT)
method in C++98 or C++11, which means the call relies on an implicit
conversion that is buggy. It would be better to rely on the
basic_string& operator+=(charT c); method as defined in ISO 14882-1998
(page 385) through ISO 14882-2012 (page 640).

A set of tests is added to generate and parse object names. They need
access to the private function lfn_parse_object_name because there is
no convenient protected method to exercise it. The tests contain a
LFNIndex derived class, TestWrapLFNIndex which is made a friend of
LFNIndex to gain access to the private methods.

http://tracker.ceph.com/issues/4594 refs #4594

Signed-off-by: Loic Dachary <loic@dachary.org>
12 years agoMerge branch 'wip-4490'
Sage Weil [Sat, 30 Mar 2013 01:02:15 +0000 (18:02 -0700)]
Merge branch 'wip-4490'

12 years agomon: OSDMonitor: add 'osd pool set-quota' command 174/head
Sage Weil [Sat, 30 Mar 2013 00:59:35 +0000 (17:59 -0700)]
mon: OSDMonitor: add 'osd pool set-quota' command

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agodoc: Added entries for Pool, PG, & CRUSH. Moved heartbeat link.
John Wilkins [Sat, 30 Mar 2013 00:38:48 +0000 (17:38 -0700)]
doc: Added entries for Pool, PG, & CRUSH. Moved heartbeat link.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Added heartbeat configuration settings.
John Wilkins [Sat, 30 Mar 2013 00:38:02 +0000 (17:38 -0700)]
doc: Added heartbeat configuration settings.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Moved PG info to separate page. Moved heartbeat to mon-osd doc.
John Wilkins [Sat, 30 Mar 2013 00:36:23 +0000 (17:36 -0700)]
doc: Moved PG info to separate page. Moved heartbeat to mon-osd doc.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Rewrote monitor configuration section.
John Wilkins [Sat, 30 Mar 2013 00:34:45 +0000 (17:34 -0700)]
doc: Rewrote monitor configuration section.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Moved to separate section for parallelism.
John Wilkins [Sat, 30 Mar 2013 00:32:47 +0000 (17:32 -0700)]
doc: Moved to separate section for parallelism.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Cleanup.
John Wilkins [Sat, 30 Mar 2013 00:32:00 +0000 (17:32 -0700)]
doc: Cleanup.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agoceph-disk list: say 'unknown cluster $UUID' when cluster is unknown
Sage Weil [Sat, 30 Mar 2013 00:30:28 +0000 (17:30 -0700)]
ceph-disk list: say 'unknown cluster $UUID' when cluster is unknown

This makes it clearer that an old osd is in fact old.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoconfig_opts: fix rgw_port comments to be plaintext
Greg Farnum [Sat, 30 Mar 2013 00:04:58 +0000 (17:04 -0700)]
config_opts: fix rgw_port comments to be plaintext

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years agoReplicatedPG: check for full if delta_stats.num_bytes > 0
Samuel Just [Fri, 29 Mar 2013 21:27:29 +0000 (14:27 -0700)]
ReplicatedPG: check for full if delta_stats.num_bytes > 0

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agomon: Monitor: check if 'pss' arg is !NULL on parse_pos_long()
Joao Eduardo Luis [Fri, 29 Mar 2013 15:28:51 +0000 (15:28 +0000)]
mon: Monitor: check if 'pss' arg is !NULL on parse_pos_long()

We already do it all throughout the function, but this one place didn't.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agocommon: util: add 'unit_to_bytesize()' function
Joao Eduardo Luis [Fri, 29 Mar 2013 15:27:35 +0000 (15:27 +0000)]
common: util: add 'unit_to_bytesize()' function

Takes a numerical value that may or may not contain a unit
modifier ('1024', '1K', '2M', ..., '1E') and returns the parsed size
in bytes.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
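
A conversion of this kind can be sketched in Python as below. This is not the C++ function added by the commit: the sketch borrows the name unit_to_bytesize, and it assumes binary multiples (1K is 1024 bytes), case-insensitive suffixes and a ValueError on malformed input, none of which is stated in the commit message.

```python
# Assumed mapping from unit suffix to a power-of-two shift.
_UNIT_SHIFTS = {"K": 10, "M": 20, "G": 30, "T": 40, "P": 50, "E": 60}


def unit_to_bytesize(value):
    """Parse '1024', '1K', '2M', ..., '1E' into a byte count.

    Assumes binary multiples (1K == 1024 bytes); raises ValueError on
    empty, negative, or otherwise malformed input.
    """
    text = value.strip()
    if not text:
        raise ValueError("empty size string")
    shift = _UNIT_SHIFTS.get(text[-1].upper())
    digits = text if shift is None else text[:-1]
    if not digits.isdigit():
        raise ValueError("invalid size string: %r" % value)
    return int(digits) << (shift or 0)
```

For example, unit_to_bytesize('2M') yields 2 * 1024 * 1024 under the binary-multiple assumption.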
12 years agoosd: osd_types: add pool quota related fields
Joao Eduardo Luis [Thu, 28 Mar 2013 19:24:51 +0000 (19:24 +0000)]
osd: osd_types: add pool quota related fields

12 years agoceph-disk: handle missing journal_uuid field gracefully
Sage Weil [Fri, 29 Mar 2013 20:59:04 +0000 (13:59 -0700)]
ceph-disk: handle missing journal_uuid field gracefully

Only lower if we know it's not None.

Signed-off-by: Sage Weil <sage@inktank.com>
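
The guard this commit describes ("only lower if we know it's not None") amounts to the following Python pattern. The helper name normalize_uuid is hypothetical and is not taken from ceph-disk.

```python
def normalize_uuid(value):
    """Return the UUID lowercased, or None when the field is missing."""
    if value is None:
        # Calling .lower() on None would raise AttributeError.
        return None
    return value.lower()
```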
12 years agoMerge remote branch 'origin/next'
Josh Durgin [Fri, 29 Mar 2013 19:58:01 +0000 (12:58 -0700)]
Merge remote branch 'origin/next'

12 years agoMerge pull request #170 from ceph/wip-rbd-aio-flush
Josh Durgin [Fri, 29 Mar 2013 20:20:32 +0000 (13:20 -0700)]
Merge pull request #170 from ceph/wip-rbd-aio-flush

Reviewed-by: Sage Weil <sage.weil@inktank.com>
12 years agolibrados: move snapc creation to caller for aio_operate 170/head
Josh Durgin [Fri, 29 Mar 2013 19:46:27 +0000 (12:46 -0700)]
librados: move snapc creation to caller for aio_operate

The common case already has a snapshot context, so avoid duplicating
it (copying a potentially large vector) in IoCtxImpl::aio_operate().

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoMerge pull request #166 from ceph/wip-disk-list
Sage Weil [Fri, 29 Mar 2013 19:24:47 +0000 (12:24 -0700)]
Merge pull request #166 from ceph/wip-disk-list

Wip disk list

Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agoclient: update cap->implemented when handling revoke
Yan, Zheng [Fri, 29 Mar 2013 18:23:27 +0000 (11:23 -0700)]
client: update cap->implemented when handling revoke

Fixes #4578

Tested-by: Noah Watkins <noahwatkins@gmail.com>
12 years agoMerge pull request #161 from dachary/wip-4560
athanatos [Fri, 29 Mar 2013 17:50:55 +0000 (10:50 -0700)]
Merge pull request #161 from dachary/wip-4560

unit tests for LFNIndex

12 years agomsgr: allow users to mark_down a NULL Connection*
Greg Farnum [Fri, 29 Mar 2013 17:39:56 +0000 (10:39 -0700)]
msgr: allow users to mark_down a NULL Connection*

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sam Just <sam.just@inktank.com>
12 years agoMerge pull request #150 from ceph/wip-4313
Sage Weil [Fri, 29 Mar 2013 17:24:53 +0000 (10:24 -0700)]
Merge pull request #150 from ceph/wip-4313

mon: ConfigKeyService: stash config keys on the monitor

Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agoMerge pull request #171 from Elbandi/master
Sage Weil [Fri, 29 Mar 2013 15:38:22 +0000 (08:38 -0700)]
Merge pull request #171 from Elbandi/master

Run wrap-and-sort and add git to build deps

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge pull request #172 from ceph/wip-ceph-json
Sage Weil [Fri, 29 Mar 2013 15:37:04 +0000 (08:37 -0700)]
Merge pull request #172 from ceph/wip-ceph-json

Wip ceph json

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agodebian: Add git to Build-Depends (need by check_version script) 171/head
Andras Elso [Fri, 29 Mar 2013 12:34:03 +0000 (13:34 +0100)]
debian: Add git to Build-Depends (need by check_version script)

Signed-off-by: Andras Elso <elso.andras@gmail.com>
12 years agodebian: Run wrap-and-sort from devscripts
Andras Elso [Fri, 29 Mar 2013 12:28:28 +0000 (13:28 +0100)]
debian: Run wrap-and-sort from devscripts

Signed-off-by: Andras Elso <elso.andras@gmail.com>
12 years agounit test LFNIndex::remove_object and LFNIndex::lfn_unlink 161/head
Loic Dachary [Thu, 28 Mar 2013 12:38:09 +0000 (13:38 +0100)]
unit test LFNIndex::remove_object and LFNIndex::lfn_unlink

When the object name is short, check that the corresponding file is
::unlink()ed. When the object name is long, there may be multiple files
with the same name, modulo the anti-collision number showing just before
the FILENAME_COOKIE. The following scenarios are tested:

 * there only is one file

 * there are multiple files and the last one is removed

 * there are multiple files and the last one is moved in place of the
   file that is to be removed

lfn_unlink and remove_object are tested together because
lfn_unlink is a private function and remove_object is a protected function
that does very little besides calling lfn_unlink.
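
The three scenarios can be sketched with a small in-memory model. This is
only an illustration, not the LFNIndex code: the function name, the list
representation, and the reading that the list position stands for the
anti-collision number are all assumptions.

```python
# Hypothetical in-memory model; names are illustrative, not LFNIndex's API.
def lfn_unlink_model(files, index):
    """Remove files[index] from a list of colliding long-name files.

    `files` is ordered by anti-collision number (0, 1, 2, ...). When the
    removed file is not the last one, the last file is moved into its slot,
    as described in the third scenario of the commit message.
    """
    if not 0 <= index < len(files):
        raise IndexError("no such file")
    last = len(files) - 1
    if index != last:
        files[index] = files[last]  # move the last file in place of the removed one
    files.pop()                     # drop the last slot
    return files


only_one = lfn_unlink_model(["a"], 0)               # there is only one file
last_removed = lfn_unlink_model(["a", "b", "c"], 2)  # the last one is removed
last_moved = lfn_unlink_model(["a", "b", "c"], 0)    # the last one is moved in place
print(only_one, last_removed, last_moved)
```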

http://tracker.ceph.com/issues/4560 refs #4560

Signed-off-by: Loic Dachary <loic@dachary.org>
12 years agoceph_json: add missing include file 172/head
Yehuda Sadeh [Thu, 28 Mar 2013 23:41:56 +0000 (16:41 -0700)]
ceph_json: add missing include file

Needed for LONG_MAX and friends

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoceph_json: add template specializations
Yehuda Sadeh [Thu, 28 Mar 2013 20:11:58 +0000 (13:11 -0700)]
ceph_json: add template specializations

Add missing template specializations for data types that
are needed for 32-bit compilation

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
12 years agoceph-disk: implement 'list' 166/head
Sage Weil [Fri, 29 Mar 2013 03:49:24 +0000 (20:49 -0700)]
ceph-disk: implement 'list'

This is based on Sandon's initial patch, but much-modified.

Mounts ceph data volumes temporarily to see what is inside.  Attempts to
associate journals with OSDs.

Resolves: #3120
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agoceph.spec.in: Add python-argparse dependency
Gary Lowell [Fri, 29 Mar 2013 00:14:33 +0000 (17:14 -0700)]
ceph.spec.in:  Add python-argparse dependency

The python-argparse package is needed by ceph-create-keys script.

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
12 years agomon: ConfigKeyService: stash config keys on the monitor 150/head
Joao Eduardo Luis [Fri, 1 Mar 2013 22:34:16 +0000 (22:34 +0000)]
mon: ConfigKeyService: stash config keys on the monitor

Building on the Single-Paxos and our existing k/v store that backs
the monitor, we now introduce a simple service so that the monitors
act as a generic k/v store available to the cluster, in which a user
can stash (and later obtain) configuration keys at his own discretion.

Users can put, get, delete, list and check for values using the
following commands:

 - ceph config-key put <key> [<value>]
  or
 - ceph config-key put <key> [-i <in-file>]
  with 'value' and 'in-file' being optional; if these are not specified,
  'put' will act as 'touch' if 'key' does not exist, or will overwrite
  the value of 'key' with a zero byte value (i.e., truncates the
  contents of the value to zero)

 - ceph config-key get <key>
  or
 - ceph config-key get <key> -o <out-file>

 - ceph config-key delete <key>

 - ceph config-key list [-o <out-file>]

 - ceph config-key exists <key>
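
The semantics of these commands can be sketched with a dictionary-backed
model. This is a rough illustration only: the class and method names are
hypothetical, and the behaviour of 'get' and 'delete' on a missing key is
an assumption, since the message does not describe it.

```python
class ConfigKeyStoreModel:
    """Hypothetical stand-in for the monitor-backed k/v store."""

    def __init__(self):
        self._kv = {}

    def put(self, key, value=b""):
        # With no value given, 'put' acts as 'touch' for a new key, or
        # overwrites an existing key with a zero-byte value.
        self._kv[key] = value

    def get(self, key):
        # Assumption: a missing key is reported as an error (KeyError here).
        return self._kv[key]

    def delete(self, key):
        # Assumption: deleting a missing key is a silent no-op.
        self._kv.pop(key, None)

    def exists(self, key):
        return key in self._kv

    def list(self):
        return sorted(self._kv)


store = ConfigKeyStoreModel()
store.put("foo", b"bar")
store.put("touched")   # created with a zero-byte value
store.put("foo")       # truncates the existing value to zero bytes
print(store.list(), store.get("foo"), store.exists("missing"))
```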

Fixes: #4313
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
12 years agoceph.spec.in: Move four scripts from sbin to usr/bin
Gary Lowell [Thu, 28 Mar 2013 23:12:33 +0000 (16:12 -0700)]
ceph.spec.in:  Move four scripts from sbin to usr/bin

The ceph-create-keys, ceph-disk, ceph-disk-activate, and
ceph-disk-prepare scripts are built in sbin, but debian installs
them into usr/bin, and several utilities look for them there.
This commit changes the RPM to install them in /usr/bin. (Bug #3921)

Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
12 years agoceph: propagate do_command()'s return value to user space
Joao Eduardo Luis [Thu, 28 Mar 2013 02:07:18 +0000 (02:07 +0000)]
ceph: propagate do_command()'s return value to user space

We were returning '1' regardless of what do_command() returned in case
of error.  This made it nearly useless to build tools that rely on command
error codes, and forced them to rely on error messages instead.

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit e91405d540ce11b9996e4977212553bd33afb3ed)

12 years agoceph: propagate do_command()'s return value to user space
Joao Eduardo Luis [Thu, 28 Mar 2013 02:07:18 +0000 (02:07 +0000)]
ceph: propagate do_command()'s return value to user space

We were returning '1' regardless of what do_command() returned in case
of error.  This made it nearly useless to build tools that rely on command
error codes, and forced them to rely on error messages instead.
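
A minimal sketch of the difference, not the actual ceph tool code. The
function names are hypothetical, and the negative errno-style return
values are an assumption; how the value is finally mapped to a process
exit status is not covered here.

```python
import errno


def do_command_model(failure=None):
    # Hypothetical stand-in: 0 on success, a negative errno-style code on error.
    return 0 if failure is None else -failure


def result_before(ret):
    # Old behaviour: every error collapses to the same value.
    return 1 if ret < 0 else 0


def result_after(ret):
    # New behaviour: the command's own return value is passed through.
    return ret


for failure in (errno.ENOENT, errno.EINVAL):
    ret = do_command_model(failure)
    print(result_before(ret), result_after(ret))
```

With the old behaviour a caller cannot tell the two failures apart; with
the new one the codes differ.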

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
12 years agoMerge pull request #168 from athanatos/wip_4471
Sage Weil [Thu, 28 Mar 2013 21:15:07 +0000 (14:15 -0700)]
Merge pull request #168 from athanatos/wip_4471

Wip 4471

Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoPG: update PGPool::name in PGPool::update 168/head
Samuel Just [Thu, 28 Mar 2013 21:09:17 +0000 (14:09 -0700)]
PG: update PGPool::name in PGPool::update

Fixes: #4471
Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG: use int64_t for pool id in PGPool
Samuel Just [Thu, 28 Mar 2013 21:01:45 +0000 (14:01 -0700)]
PG: use int64_t for pool id in PGPool

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge pull request #164 from dalgaaf/wip-da-fix-misc-1
Yehuda Sadeh [Thu, 28 Mar 2013 20:01:51 +0000 (13:01 -0700)]
Merge pull request #164 from dalgaaf/wip-da-fix-misc-1

some SCA related fixes

12 years agoOSD: flush pg osr on shutdown prior to put()
Samuel Just [Wed, 27 Mar 2013 18:32:24 +0000 (11:32 -0700)]
OSD: flush pg osr on shutdown prior to put()

Fixes: #4538
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agolibrbd: increment version for new functions is_complete() and aio_flush()
Josh Durgin [Wed, 27 Mar 2013 22:48:17 +0000 (15:48 -0700)]
librbd: increment version for new functions is_complete() and aio_flush()

This is done in a separate commit since the increased version number
should not be backported.

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrbd: add an async flush
Josh Durgin [Thu, 21 Mar 2013 23:04:10 +0000 (16:04 -0700)]
librbd: add an async flush

At this point it's a simple wrapper around the ObjectCacher or
librados.

This is needed for QEMU so that its main thread can continue while a
flush is occurring. Since this will be backported, don't update the
librbd version yet, just add a #define that QEMU and others can use to
detect the presence of aio_flush().

Refs: #3737
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
12 years agolibrbd: use the same IoCtx for each request
Josh Durgin [Wed, 27 Mar 2013 22:42:10 +0000 (15:42 -0700)]
librbd: use the same IoCtx for each request

Previously, we duplicated the IoCtx for each new request since they
could have a different snapshot context or read from a different
snapshot id. Since librados now supports setting these explicitly
for a given request, do that instead.

Since librados tracks outstanding requests on a per-IoCtx basis, this
also fixes a bug that caused flush() without caching to ignore
all the outstanding requests, since they were issued on separate,
duplicate IoCtxs.
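
The flush() problem can be illustrated with a toy model of per-context
request tracking. This is not librados or librbd code: the class, its
methods, and the snap_context argument are hypothetical and only mirror
the behaviour described above.

```python
class IoCtxModel:
    """Hypothetical model: each context tracks only its own outstanding requests."""

    def __init__(self):
        self.pending = []

    def dup(self):
        # A duplicate has its own, separate request tracking.
        return IoCtxModel()

    def aio_write(self, name, snap_context=None):
        self.pending.append((name, snap_context))

    def flush(self):
        """Complete everything tracked by this context; return how many."""
        done = len(self.pending)
        self.pending.clear()
        return done


base = IoCtxModel()

# Old pattern: one duplicate per request, so `base` never sees the requests.
dups = [base.dup() for _ in range(3)]
for i, d in enumerate(dups):
    d.aio_write("obj%d" % i)
old_flushed = base.flush()

# New pattern: same context, snapshot context set explicitly per request.
for i in range(3):
    base.aio_write("obj%d" % i, snap_context=i)
new_flushed = base.flush()

print(old_flushed, new_flushed)
```

In the old pattern the flush on the shared context finds nothing to wait
for, while the three requests are still outstanding on the duplicates.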

Signed-off-by: Josh Durgin <josh.durgin@inktank.com>