Sage Weil [Sat, 3 Mar 2012 23:39:06 +0000 (15:39 -0800)]
mds: don't force in->first == dn->first
The fullbit sets it now. For multiversion inodes, its "first" can be in
the future, since this dentry may not have changed when the inode was
cowed in place. (OTOH, the dentry cannot have changed without the inode
also having changed.)
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
Yan, Zheng [Sat, 8 Dec 2012 14:43:32 +0000 (22:43 +0800)]
mds: fix race between send_dentry_link() and cache expire
The MDentryLink message can race with cache expire. When it arrives at
the target MDS, it's possible there is no corresponding dentry in
the cache. If this race happens, we should expire the replica inode
encoded in the MDentryLink message. But to expire an inode, the MDS
needs to know which subtree the inode belongs to, so modify the
MDentryLink message to include this information.
Yan, Zheng [Mon, 10 Dec 2012 07:43:44 +0000 (15:43 +0800)]
mds: fix file existing check in Server::handle_client_openc()
Creating a new file needs to be handled by the directory fragment's auth
MDS, while opening an existing file in write mode needs to be handled by
the corresponding inode's auth MDS. If a file is a remote link, its parent
directory fragment's auth MDS can be different from the corresponding
inode's auth MDS. So which MDS handles a create-file request can depend
on whether the corresponding file already exists.
handle_client_openc() calls rdlock_path_xlock_dentry() at the very
beginning. It always assumes the request needs to be handled by the
directory fragment's auth MDS. When handling a create-file request,
if the file already exists and is remotely linked to a non-auth inode,
handle_client_openc() falls back to handle_client_open(), which
forwards the request because this MDS is not the inode's auth MDS.
When the request then arrives at the inode's auth MDS,
rdlock_path_xlock_dentry() is called again and forwards the request
back.
Yan, Zheng [Sun, 9 Dec 2012 05:03:41 +0000 (13:03 +0800)]
mds: don't retry readdir request after issuing caps
If a remote linkage without an inode is encountered after some caps are
issued, Server::handle_client_readdir() should send the reply to the
client immediately instead of retrying the request after opening
the remote dentry. This is because the MDS may want to revoke these
caps before the MDS succeeds in opening the remote dentry.
Yan, Zheng [Sat, 8 Dec 2012 16:53:28 +0000 (00:53 +0800)]
mds: take export lock set before sending MExportDirDiscover
Migrator::export_dir() only checks whether it can lock the export lock
set but does not take the lock set. So someone else can change the path
to the exporting dir and confuse Migrator::handle_export_discover().
Yan, Zheng [Sun, 9 Dec 2012 05:06:33 +0000 (13:06 +0800)]
mds: re-issue caps after importing caps
The imported caps may prevent unstable locks from entering stable
states. So we should call Locker::eval_gather() with parameter
"first" set to true after caps are imported.
Yan, Zheng [Sat, 8 Dec 2012 07:07:53 +0000 (15:07 +0800)]
mds: fix error handling in MDCache::handle_discover_reply()
The error handling code in MDCache::handle_discover_reply() has two
main issues. First, MDCache::handle_discover_reply() does not wake
waiters if dir_auth_hint in the reply message is equal to its own
nodeid. This can happen if a discover races with subtree importing.
Second, it checks for the existence of a cached directory fragment to
decide whether it should take waiters from the inode or from the
directory fragment. The check is unreliable because subtree importing
can add directory fragments to the cache.
Yan, Zheng [Sat, 8 Dec 2012 05:59:38 +0000 (13:59 +0800)]
mds: set want_base_dir to false for MDCache::discover_ino()
When a frozen inode is encountered, MDCache::handle_discover() sends
the reply immediately if the reply message is not empty. When handling
"discover ino" requests, the reply message always contains the base
directory fragment. But the requestor already has the base directory
fragment, so the only effect of the reply message is to wake the
requestor and make it send the same "discover ino" request again. So
the requestor keeps sending "discover ino" requests but can't make any
progress. The fix is to set want_base_dir to false for
MDCache::discover_ino(). After setting want_base_dir to false, we also
need to update the code that handles "discover ino" errors.
This patch also removes unused error handling code for flag_error_dn.
Yan, Zheng [Fri, 7 Dec 2012 07:59:56 +0000 (15:59 +0800)]
mds: no bloom filter for replica dir
We should delete a dir fragment's bloom filter after exporting the dir
fragment to another MDS. Otherwise the residual bloom filter may cause
problems if the MDS imports the dir fragment later.
Yan, Zheng [Thu, 6 Dec 2012 01:28:46 +0000 (09:28 +0800)]
mds: properly mark dirfrag dirty
If predirty_journal_parents() does not propagate changes in dir's
fragstat into corresponding inode's dirstat, it should mark the
inode as dirfrag dirty. This happens when we modify dir fragments
that are auth subtree roots.
Yan, Zheng [Fri, 30 Nov 2012 00:53:33 +0000 (08:53 +0800)]
mds: allow handle_client_readdir() to fetch freezing dir
At that point, the request has already auth pinned and locked some
objects. So CDir::fetch() should ignore the can_auth_pin check and
continue to fetch the freezing dir.
Sage Weil [Mon, 24 Dec 2012 03:59:04 +0000 (19:59 -0800)]
Merge branch 'wip-create-layout'
Reviewed-by: Greg Farnum <greg@inktank.com>
The functional tests for the create operations should add and specify non-default
pools, but we don't have a set of library methods to do that yet (to interact with
the monitor).
Dan Mick [Fri, 21 Dec 2012 03:53:07 +0000 (19:53 -0800)]
import_export.sh: sparse import export
Add tests for:
- sparse import makes expected sparse images
- sparse export makes expected sparse files
- sparse import from stdin also creates sparse images
- import from partially-sparse file leads to partially-sparse image
- import from stdin with zeros leads to sparse
- export from zeros-image to file leads to sparse file
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Dan Mick [Sat, 8 Dec 2012 06:57:06 +0000 (22:57 -0800)]
rbd: harder-working sparse import from stdin
Try to accumulate image-sized blocks when importing from stdin, even if
each read is shorter than requested; if we get a full block, and it's
all zeroes, we can seek and make a sparse output file.
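The accumulate-then-seek idea can be sketched as follows. This is a
minimal Python illustration, not the rbd import code: the function name
sparse_copy and its parameters are hypothetical, and it assumes a
blocking source whose read() returns an empty bytes object at EOF.

```python
import os


def sparse_copy(src, dst, block_size):
    """Copy src to dst, seeking over full all-zero blocks instead of writing them.

    Returns the number of bytes actually written (holes are not counted).
    """
    zero_block = bytes(block_size)
    written = 0
    while True:
        # Accumulate a full block even if individual reads come back short,
        # as they can when reading from a pipe.
        buf = bytearray()
        while len(buf) < block_size:
            chunk = src.read(block_size - len(buf))
            if not chunk:
                break  # EOF
            buf += chunk
        if not buf:
            break
        if len(buf) == block_size and bytes(buf) == zero_block:
            # Full block of zeroes: skip ahead and leave a hole.
            dst.seek(block_size, os.SEEK_CUR)
        else:
            dst.write(buf)
            written += len(buf)
        if len(buf) < block_size:
            break  # a short final block means we hit EOF
    # For a regular file, this sets the length when the data ends in a hole.
    dst.truncate(dst.tell())
    return written
```

Only full blocks are tested for zeroes, so a short trailing block is
always written as data.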
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Sage Weil [Sat, 22 Dec 2012 00:47:50 +0000 (16:47 -0800)]
osd: fix pg stat msgs vs timeout
We can get a pattern like so:
- new mon session
- after say 120 seconds, we decide to send a stats msg
- outstanding_pg_stats is finally true, we immediately time out (30 second
grace), and reconnect to a new mon
-> repeat
The problem is that we don't reset the last_sent timestamp when we send.
Or that we do this check after sending instead of before. Fix both.
This should resolve issue #3661, where osds that don't have pgs
updating are not sending stats messages to the mon to check in, and are
eventually getting marked down as a result.
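The two fixes (reset the timestamp on send, check the timeout before
sending) can be sketched like this. This is a simplified Python model
under stated assumptions, not the OSD code: the class StatsReporter and
its fields are hypothetical, and time is passed in explicitly.

```python
class StatsReporter:
    """Toy model of an OSD sending pg stats to a mon with a timeout."""

    def __init__(self, grace=30.0):
        self.grace = grace        # seconds to wait for an ack
        self.last_sent = 0.0      # time of the last stats message
        self.outstanding = False  # True while a message awaits an ack
        self.reconnects = 0       # how often we gave up on the mon

    def tick(self, now, have_stats):
        # Check for a timeout *before* sending, and only against a
        # message that is actually outstanding.
        if self.outstanding and now - self.last_sent > self.grace:
            self.reconnects += 1
            self.outstanding = False
        if have_stats and not self.outstanding:
            self.outstanding = True
            self.last_sent = now  # reset the timestamp on every send

    def ack(self):
        self.outstanding = False
```

With the timestamp reset on send, a first message sent 120 seconds into
a session is measured against its own send time rather than the session
start, so it does not time out immediately.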
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
John Wilkins [Sat, 22 Dec 2012 00:07:27 +0000 (16:07 -0800)]
doc: Updated the Configuration File section.
- Replaced ceph.conf with Ceph configuration to clarify
when running multiple clusters on the same hardware.
- Added a [client] entry so people know it can be set too.
- Updated existing auth example.
- Added an authentication section with a link to the cephx guide.
- Added section for running multiple clusters. Per Tommi.
Signed-off-by: John Wilkins <john.wilkins@inktank.com>
Sage Weil [Fri, 21 Dec 2012 21:44:19 +0000 (13:44 -0800)]
monc: only warn about missing keyring if we fail to authenticate
This avoids the situation where a librados or other user with the default
of 'cephx,none' and no keyring is authenticating against a cluster that
requires only 'none', and an annoying warning is generated every time.
Now we only print a helpful message if we actually failed.
Sage Weil [Fri, 21 Dec 2012 06:01:34 +0000 (22:01 -0800)]
osd: clear scrub state if queued scrub doesn't start
We set SCRUBBING when we queue a pg for scrub. If we dequeue and
call scrub() but abort for some reason (!active, degraded, etc.), clear
that state bit.
Bug is easily reproduced with 'ceph osd scrub N' during cluster startup
when PGs are peering; some PGs can get left in the scrubbing state.
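A minimal sketch of the state handling, in Python for brevity. The PG
class and function names here are hypothetical stand-ins, not the OSD
code; the point is only that every early return clears the bit that
queueing set.

```python
class PG:
    """Toy placement group with just the fields this sketch needs."""

    def __init__(self, active=True, degraded=False):
        self.active = active
        self.degraded = degraded
        self.scrubbing = False


def queue_scrub(pg):
    # The SCRUBBING bit is set as soon as the pg is queued.
    pg.scrubbing = True


def scrub(pg):
    """Start a queued scrub; return True if it actually started."""
    if not pg.active or pg.degraded:
        # Aborting: clear the bit so the pg is not stuck "scrubbing".
        pg.scrubbing = False
        return False
    # ... the real scrub work would start here ...
    return True
```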
Add ceph osd ls to help; make help for ceph osd tell N bench look
more like injectargs, which says <osd-id or *> to make it clear you
can benchmark all osds simultaneously
Sage Weil [Thu, 20 Dec 2012 21:48:06 +0000 (13:48 -0800)]
log: fix flush/signal race
We need to signal the cond in the same interval where we hold the lock
*and* modify the queue. Otherwise, we can have a race like:
queue has 1 item, max is 1.
A: enter submit_entry, signal cond, wait on condition
B: enter submit_entry, signal cond, wait on condition
C: flush wakes up, flushes 1 previous item
A: retakes lock, enqueues something, exits
B: retakes lock, condition fails, waits
-> C is never woken up as there are 2 items waiting
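The rule "signal while holding the lock, after modifying the queue" can
be sketched with Python's threading.Condition. This is an illustration
under assumptions, not the Ceph Log class: BoundedLog, the two condition
variables, and the stop() method are hypothetical names.

```python
import threading
from collections import deque


class BoundedLog:
    """Bounded queue with one flusher thread and many submitters."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.queue = deque()
        self.lock = threading.Lock()
        # Both conditions share one lock, so predicate checks and
        # queue changes happen in the same critical section.
        self.cond_flusher = threading.Condition(self.lock)
        self.cond_submit = threading.Condition(self.lock)
        self.flushed = []
        self.stopping = False

    def submit_entry(self, entry):
        with self.lock:
            while len(self.queue) >= self.max_items:
                self.cond_submit.wait()
            self.queue.append(entry)
            # Signal only after the queue has changed, still under the lock.
            self.cond_flusher.notify()

    def flush_loop(self):
        with self.lock:
            while True:
                while not self.queue and not self.stopping:
                    self.cond_flusher.wait()
                if not self.queue and self.stopping:
                    return
                self.flushed.extend(self.queue)
                self.queue.clear()
                self.cond_submit.notify_all()

    def stop(self):
        with self.lock:
            self.stopping = True
            self.cond_flusher.notify()
```

Because the flusher rechecks the queue under the same lock before it
waits, an enqueue can never slip in between its check and its wait.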
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
Samuel Just [Thu, 20 Dec 2012 21:23:27 +0000 (13:23 -0800)]
OSD,ReplicatedPG: do not track notifies on the session
handle_notify_timeout and remove_notify currently do not clean up this
state, leaving dangling Notification* pointers. Further, we only use this
mapping in unwatch in order to determine which notifies to update. We can
accomplish the same thing by iterating through the obc->notifs mapping,
since all notifications relevant for a given watch would have been for
the same obc as the watch.
Yehuda Sadeh [Wed, 19 Dec 2012 18:21:57 +0000 (10:21 -0800)]
rgw: don't try to assign content type if not found
Fixes: #3648
A NULL pointer cannot be assigned to an STL string. This is only
relevant to swift, when uploading an object without specifying a
content type, and when the suffix cannot be determined.
Yehuda Sadeh [Thu, 20 Dec 2012 00:59:43 +0000 (16:59 -0800)]
rgw: don't initialize keystone if not set up
Fixes: #3653
No need to initialize keystone, including the keystone
revocation thread, which was verbose if keystone was
not set up. This removes some useless errors from the
log.
Yehuda Sadeh [Wed, 19 Dec 2012 22:34:53 +0000 (14:34 -0800)]
rgw: remove useless configurable, fix swift auth error handling
Fixes: #3649
No need to have an extra configurable to use keystone. Use keystone
whenever a keystone url has been specified. Also, fix bad error
handling that turned a failure to authenticate into successfully
authenticating a bad user.
Sage Weil [Wed, 19 Dec 2012 03:21:24 +0000 (19:21 -0800)]
ceph: report error string to stderr, not stdout
If we return an error, send the message to stderr. This makes things
more easily scriptable because error messages won't take the place of
expected output.
mon: OSDMonitor: add option 'mon_max_pool_pg_num' and limit 'pg_num' accordingly
Instead of having a hardcoded default, use a configurable one. It is
limited to 65536 until future testing guarantees there are no side
effects of increasing it past this value, but by being adjustable the
user still has the freedom to specify whatever maximum value they want.
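The check amounts to comparing the requested pg_num against a
configurable ceiling. A small Python sketch, assuming the limit is
inclusive; check_pg_num and its parameter names are hypothetical and
only the 65536 default comes from the message above.

```python
# Default taken from the commit message; the option name in Ceph is
# 'mon_max_pool_pg_num'.
DEFAULT_MON_MAX_POOL_PG_NUM = 65536


def check_pg_num(pg_num, max_pool_pg_num=DEFAULT_MON_MAX_POOL_PG_NUM):
    """Return pg_num if it is within the configured limit, else raise."""
    if pg_num <= 0:
        raise ValueError("pg_num must be positive")
    if pg_num > max_pool_pg_num:  # assumption: the limit itself is allowed
        raise ValueError(
            "pg_num %d exceeds mon_max_pool_pg_num %d"
            % (pg_num, max_pool_pg_num)
        )
    return pg_num
```

A user who raises the configured maximum gets the larger ceiling without
any code change, for example check_pg_num(100000, max_pool_pg_num=131072).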
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>