Yehuda Sadeh [Thu, 13 Dec 2012 23:52:34 +0000 (15:52 -0800)]
rgw: key indexes are only link to user info
Instead of keeping multiple copies of the user info,
we just treat the key index as a pointer to the actual
user info (indexed by uid). This helps with two issues:
first, it scales better as we don't need to update the
entire set of keys whenever we make any change. Second,
it helps with the uid index atomicity.
One point to keep in mind is that both the links and the
info can be cached, so the effect on performance is minimal.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: caleb miles <caleb.miles@inktank.com>
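The indirection described in this commit can be sketched roughly as follows. This is a minimal illustration, not the rgw code: UserInfo, UserStore, put_user, link_key and get_by_key are hypothetical names, and in-memory maps stand in for the stored objects. The point is that each key index entry holds only a uid, so updating the user info touches one record regardless of how many keys point at it.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified stand-ins; not the actual rgw types.
struct UserInfo {
  std::string uid;
  std::string display_name;
};

class UserStore {
  std::map<std::string, UserInfo> info_by_uid;    // single authoritative copy
  std::map<std::string, std::string> uid_by_key;  // key index: access key -> uid

public:
  // Create or update the one copy of the user info.
  void put_user(const UserInfo& info) { info_by_uid[info.uid] = info; }

  // A key index entry is only a link (the uid), never a copy of the info.
  void link_key(const std::string& access_key, const std::string& uid) {
    uid_by_key[access_key] = uid;
  }

  // Resolve key -> uid -> info. Returns false if either step fails.
  bool get_by_key(const std::string& access_key, UserInfo* out) const {
    std::map<std::string, std::string>::const_iterator link =
        uid_by_key.find(access_key);
    if (link == uid_by_key.end()) return false;
    std::map<std::string, UserInfo>::const_iterator info =
        info_by_uid.find(link->second);
    if (info == info_by_uid.end()) return false;  // dangling link
    *out = info->second;
    return true;
  }
};
```

In this sketch, changing a user's display name is a single put_user() call, and every key linked to that uid sees the new value on the next lookup.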
Dan Mick [Thu, 31 Jan 2013 01:33:09 +0000 (17:33 -0800)]
Validate format strings for CLS_ERR/CLS_LOG
cls_log needed __attribute__((format(printf, ...))) to allow the compiler
to cross-check format strings against their arguments. After adding that,
a number of fixups were needed for %ll conversions, along with a few changes
for missing arguments and similar problems uncovered by the checking.
Fixes: #3970
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
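As an illustration of the kind of checking this enables, here is a small sketch using the GCC/Clang format attribute. The helper demo_log and the function describe_size are hypothetical and are not the real cls_log declaration; the attribute's two numbers give the position of the format string and of the first variadic argument.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical logging helper; not the real cls_log declaration.
// format(printf, 1, 2): parameter 1 is the format string and the
// variadic arguments to check start at parameter 2.
std::string demo_log(const char *fmt, ...)
    __attribute__((format(printf, 1, 2)));

std::string demo_log(const char *fmt, ...) {
  char buf[256];
  va_list ap;
  va_start(ap, fmt);
  vsnprintf(buf, sizeof(buf), fmt, ap);
  va_end(ap);
  return std::string(buf);
}

std::string describe_size(unsigned long long size) {
  // With the attribute in place, writing "%d" here would produce a
  // -Wformat warning, since the argument is unsigned long long.
  return demo_log("size=%llu", size);
}
```

Without the attribute the compiler treats demo_log as an ordinary variadic function and a mismatched or missing argument goes unnoticed until run time.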
Alex Elder [Thu, 31 Jan 2013 12:47:59 +0000 (06:47 -0600)]
qa: update the rbd/concurrent.sh workunit
A few changes, now that a few rbd problems have been fixed.
First, the more substantive changes:
- Generate a source file, and compare what's read back from rbd
devices with the content of that file.
- Write to the rbd device such that the written data spans
an (assumed 4 MB) rbd object boundary, as well as starting
and ending on non-page-aligned offsets.
- Perform multiple reads on rbd devices: entirely within a range
before any written data; beginning before but ending within
written data; the exact written data (and validating what's
read); beginning within written data but ending after it;
reading after written data but within a written rbd object;
and reading from an unwritten rbd object.
- Have the sleep between iterations provide a non-integer value
to avoid zero (or quantized) delays.
Also, some changes that are a little less substantive (but possibly informative):
- Don't run with "set -x". It produces a ton of noise that is
not useful for this test. This is an exerciser, looking
really for system crashes during concurrent activity, and
knowing which commands were (concurrently) active isn't going
to help much in diagnosis.
- Create two more directories, used to track the degree of
concurrency (more or less) and the highest rbd id consumed.
Files whose names are numbers are touched in each, and the
highest at the end is the highest during the run. This gets
around issues passing environment info from sub-shells to the
top-level shell. As a bonus, it offers a better chance of
avoiding problems due to concurrent update.
- NAMESDIR is renamed NAMES_DIR, and it (and the others) is
set up in the setup() function.
- Increase the concurrency and iteration counts.
- Move the default definitions before the ceph secrets stuff
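The write placement and the six read cases listed above can be sketched as offset arithmetic. This is only an illustration under stated assumptions: the workunit itself is a shell script, and Range, make_write, make_reads, the 4096-byte page size and the specific offsets are all hypothetical choices, not values taken from concurrent.sh.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Byte range on the device; hypothetical helper type.
struct Range {
  uint64_t off;
  uint64_t len;
  uint64_t end() const { return off + len; }
};

const uint64_t OBJ_SIZE = 4ULL << 20;  // assumed 4 MB rbd object size
const uint64_t PAGE = 4096;            // assumed page size

// A write that straddles the first object boundary and is
// non-page-aligned at both ends.
Range make_write() {
  Range w;
  w.off = OBJ_SIZE - 3 * PAGE - 123;  // starts inside object 0
  w.len = 6 * PAGE + 357;             // ends inside object 1
  return w;
}

// The six read cases, in the order the commit message lists them.
std::vector<Range> make_reads(const Range& w) {
  std::vector<Range> r;
  r.push_back({0, PAGE});                   // entirely before written data
  r.push_back({w.off - PAGE, 2 * PAGE});    // begins before, ends within
  r.push_back({w.off, w.len});              // exactly the written data
  r.push_back({w.end() - PAGE, 2 * PAGE});  // begins within, ends after
  r.push_back({w.end() + PAGE, PAGE});      // after data, in a written object
  r.push_back({2 * OBJ_SIZE, PAGE});        // in an unwritten object
  return r;
}
```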
Danny Al-Gaaf [Wed, 30 Jan 2013 17:52:24 +0000 (18:52 +0100)]
PGMap: fix -Wsign-compare warning
Fix -Wsign-compare compiler warning:
mon/PGMap.cc: In member function 'void PGMap::apply_incremental
(CephContext*, const PGMap::Incremental&)':
mon/PGMap.cc:247:30: warning: comparison between signed and
unsigned integer expressions [-Wsign-compare]
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
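For readers unfamiliar with this warning, the following sketch shows the general pattern and one common way to silence it safely. It is not the code from mon/PGMap.cc; index_in_range is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper; not the code in mon/PGMap.cc.
bool index_in_range(int idx, const std::vector<int>& v) {
  // Writing `idx < v.size()` compares an int with a size_t and triggers
  // -Wsign-compare: the int is converted to unsigned before the
  // comparison, so a negative idx becomes a very large value.
  if (idx < 0) return false;
  // idx is known to be non-negative here, so the cast is value-preserving.
  return static_cast<std::size_t>(idx) < v.size();
}
```

An alternative, where the surrounding code allows it, is to declare the index with an unsigned type so both operands already match.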
Dan Mick [Wed, 30 Jan 2013 02:41:20 +0000 (18:41 -0800)]
mds/Server.cc: fix warring assert.h's
The new include of boost/lexical_cast.hpp apparently drags in the system
assert.h (on quantal and squeeze at least), overriding our carefully set up
assert.h; re-include our file afterwards to restore it.
Fixes: #3957
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Dan Mick [Tue, 29 Jan 2013 23:18:53 +0000 (15:18 -0800)]
init-ceph: make ulimit -n be part of daemon command
ulimit -n from 'max open files' was being set only on the machine
running /etc/init.d/ceph. It needs to be added to the commands to
start the daemons, and run both locally and remotely.
Verified by examining /proc/<pid>/limits on local and remote hosts
Fixes: #3900
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Loïc Dachary <loic@dachary.org>
Reviewed-by: Gary Lowell <gary.lowell@inktank.com>
Sage Weil [Fri, 18 Jan 2013 06:00:42 +0000 (22:00 -0800)]
mds: open mydir after replay
In certain cases, we may replay the journal and not end up with the
dirfrag for mydir open. This is fine--we just need to open it up and
fetch it below.
Alex Elder [Tue, 29 Jan 2013 21:51:13 +0000 (15:51 -0600)]
qa: add rbd/concurrent workunit
This defines a new workunit shell script that performs a bunch of
rbd operations concurrently in order to exercise code paths and
catch reference count and bad pointer problems.
Sam Lang [Tue, 29 Jan 2013 17:28:00 +0000 (11:28 -0600)]
mds: Send created ino in journaled_reply
The MDS avoids sending an early reply if a request
triggered inode allocation (no preallocated inodes yet).
For create, this prevented the created ino from being
sent back to the client, which is used to indicate
creation (as opposed to already existing) of the file.
This commit fixes the issue by adding the created ino
to the journaled (safe) reply.
Sam Lang [Tue, 29 Jan 2013 16:18:29 +0000 (10:18 -0600)]
client: Don't use geteuid/gid for fuse ll_create
Fixes a bug in ll_create where files that already exist at the MDS
don't get the created flag set on reply. This causes a permissions
check, which fails because geteuid/getegid are 0/0 for ll_create.
Yan, Zheng [Sun, 27 Jan 2013 07:14:55 +0000 (15:14 +0800)]
mds: fix 'discover' handling in the rejoin stage
If the MDS is in the resolve stage, the current MDCache::handle_discover() only handles
'discover' from MDSs from which it has already gotten a rejoin acknowledgement. This can
cause a circular wait, because MDCache::rejoin_gather_finish() fetches reconnected
inodes before sending rejoin acknowledgements, and fetching a reconnected inode may
trigger a 'discover'. The fix is to not delay handling 'discover' from MDSs that are
also in the rejoin stage.
Yan, Zheng [Sat, 19 Jan 2013 01:24:12 +0000 (09:24 +0800)]
mds: fetch missing inodes from disk
The problem with fetching missing inodes from replicas is that replicated inodes
do not have up-to-date rstat and fragstat. So just fetch missing inodes from
disk.
Yan, Zheng [Fri, 18 Jan 2013 14:54:02 +0000 (22:54 +0800)]
mds: move variables special to rename into MDRequest::more
My previous patches added two pointers (ambiguous_auth_inode and
auth_pin_freeze) to class Mutation. Both are used by cross-authority
rename, and both point to the renamed inode. Later patches
need to add more rename-specific state to MDRequest, so just move them
into MDRequest::more.
Yan, Zheng [Mon, 21 Jan 2013 02:04:03 +0000 (10:04 +0800)]
mds: don't journal opened non-auth inode
If we journal an opened non-auth inode, then during journal replay the corresponding
entry will add non-auth objects to the cache. But the MDS does not journal all
subsequent modifications (rmdir, rename) to these non-auth objects, so the code
that manages the cache and subtrees may get confused. Besides, non-auth objects will
be trimmed at the resolve stage.
Yan, Zheng [Wed, 16 Jan 2013 11:58:49 +0000 (19:58 +0800)]
mds: don't replace existing slave request
The MDS may receive a client request, but find there is an existing
slave request. This means another MDS is handling the same request, so
we should not replace the slave request with a new client request;
just forward the request.
The client request may include embedded cap releases; we need to process
them even if the request is forwarded.
Yan, Zheng [Wed, 16 Jan 2013 11:38:38 +0000 (19:38 +0800)]
mds: always use {push,pop}_projected_linkage to change linkage
The current code skips using {push,pop}_projected_linkage to modify a replica
dentry's linkage. This confuses EMetaBlob::add_dir_context() and makes
it record an out-of-date path when TO_ROOT mode is used. This patch changes
the code to always use {push,pop}_projected_linkage to modify a dentry's
linkage. It makes sure MDCache::create_subtree_map() records a correct and
up-to-date subtree map.
Yan, Zheng [Sat, 19 Jan 2013 01:49:04 +0000 (09:49 +0800)]
mds: send resolve messages after all MDS reach resolve stage
The current code sends resolve messages whenever the resolving MDS set changes.
There is no need to send resolve messages when some MDSs leave the
resolve stage. Sending messages while some MDSs are replaying is also
not very useful.
Yan, Zheng [Fri, 18 Jan 2013 11:41:48 +0000 (19:41 +0800)]
mds: split reslove into two sub-stages
The resolve stage serves to disambiguate the fate of uncommitted slave
updates and to resolve subtree authority. The MDS sends a resolve message
that claims subtree authority immediately when the resolve stage is entered.
When receiving a resolve message, the MDS also processes it immediately.
This may cause problems if there are uncommitted slave renames and some of
them need to be rolled back later, because a slave rename rollback may modify
the subtree map.
The fix is to split resolve into two sub-stages. The first sub-stage serves
to disambiguate slave updates and do the slave commit or rollback. After
the first sub-stage finishes, the MDS sends resolve messages that claim
subtree authority to other MDSs and processes received resolve messages.
Yan, Zheng [Sat, 19 Jan 2013 05:00:29 +0000 (13:00 +0800)]
mds: fix slave rename rollback
The main issue with the old slave rename rollback code is that it assumes
all affected objects are in the cache. The assumption is not true
when the MDS does a rollback in the resolve stage. This patch removes the
assumption and makes Server::do_rename_rollback() check each object
individually and roll back the change.
Yan, Zheng [Sat, 19 Jan 2013 04:57:31 +0000 (12:57 +0800)]
mds: preserve non-auth/unlinked objects until slave commit
The MDS should not trim objects in a non-auth subtree immediately after
replaying a slave rename, because the slave rename may require rollback
later and these objects are needed for the rollback.
Yan, Zheng [Fri, 18 Jan 2013 06:08:45 +0000 (14:08 +0800)]
mds: force journal straydn for rename if necessary
A rename may overwrite an empty directory inode and move it into the stray
directory. An MDS that has an auth subtree beneath the overwritten directory
needs to journal the stray dentry when handling the rename slave request.
Yan, Zheng [Fri, 18 Jan 2013 02:47:21 +0000 (10:47 +0800)]
mds: fix "had dentry linked to wrong inode" warning
The reason for the "had dentry linked to wrong inode" warning is that
Server::_rename_prepare() adds the destdir to the EMetaBlob before
adding the straydir. So during MDS recovery, the destdir is replayed
first, and the old inode is directly replaced by the source inode.
We can avoid the warning by adding the straydir first.
Yan, Zheng [Sat, 19 Jan 2013 00:30:23 +0000 (08:30 +0800)]
mds: don't set xlocks on dentries done when early reply rename
_rename_finish() does not send dentry link/unlink messages to replicas.
We should prevent dentries that are modified by the rename operation
from getting new replicas while the rename operation is committing.
So don't set xlocks on dentries "done".