git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
12 years agoLFNIndex: fix move_subdir comments
Samuel Just [Tue, 11 Dec 2012 01:45:02 +0000 (17:45 -0800)]
LFNIndex: fix move_subdir comments

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoHashIndex: fix typo in reset_attr documentation
Samuel Just [Tue, 11 Dec 2012 01:40:10 +0000 (17:40 -0800)]
HashIndex: fix typo in reset_attr documentation

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoHashIndex: init exists in col_split_level and reset_attr
Samuel Just [Tue, 11 Dec 2012 01:39:13 +0000 (17:39 -0800)]
HashIndex: init exists in col_split_level and reset_attr

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPrioritizedQueue: increment ret when removing items from list
Samuel Just [Tue, 11 Dec 2012 01:31:44 +0000 (17:31 -0800)]
PrioritizedQueue: increment ret when removing items from list

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPrioritizedQueue: move if check out of loop in filter_list_pairs
Samuel Just [Tue, 11 Dec 2012 01:30:59 +0000 (17:30 -0800)]
PrioritizedQueue: move if check out of loop in filter_list_pairs

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSD: store current pg epoch in info and load at that epoch
Samuel Just [Thu, 6 Dec 2012 01:05:38 +0000 (17:05 -0800)]
OSD: store current pg epoch in info and load at that epoch

Prior to split, this did not matter.  With split, however, it's
crucial that a pg go through advance_pg() for the map causing
the split.  During operation, a PG lags the OSD superblock
epoch.  If the OSD dies after the OSD epoch passes the split
but before the pg epoch passes the split, the PG will be
reloaded at the OSD epoch and won't see the split operation.
The PG collection might after that point contain incorrect
objects which should have been split into a child.

Signed-off-by: Samuel Just <sam.just@inktank.com>
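The reasoning in the commit above can be sketched as a single predicate. This is a minimal illustration, not the OSD's code: `pg_will_process_split`, its parameters, and the `epoch_t` width are all hypothetical names chosen for the example. It assumes that a reloaded PG replays every map strictly after the epoch it was loaded at, up to the OSD's current epoch.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = std::uint32_t;  // assumed width, for illustration only

// After a restart, a PG replays every map in (load_epoch, osd_epoch].
// It only runs the split logic if the split map falls in that range.
bool pg_will_process_split(epoch_t load_epoch, epoch_t osd_epoch,
                           epoch_t split_epoch) {
  assert(load_epoch <= osd_epoch);  // a PG never runs ahead of the OSD
  return split_epoch > load_epoch && split_epoch <= osd_epoch;
}
```

With a split at epoch 12 and the OSD at epoch 15, loading the PG at its own stored epoch 10 returns true, while loading it at the OSD epoch 15 returns false, which is the case the commit guards against.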
12 years agoOSD: account for split in project_pg_history
Samuel Just [Thu, 29 Nov 2012 01:14:11 +0000 (17:14 -0800)]
OSD: account for split in project_pg_history

split causes a new interval.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG: update info.last_update_started in split_into
Samuel Just [Wed, 21 Nov 2012 22:10:51 +0000 (14:10 -0800)]
PG: update info.last_update_started in split_into

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSDMonitor: require --allow-experimental-feature to increase pg_num
Samuel Just [Tue, 20 Nov 2012 20:16:44 +0000 (12:16 -0800)]
OSDMonitor: require --allow-experimental-feature to increase pg_num

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG: set child up/acting in split_into
Samuel Just [Tue, 20 Nov 2012 03:58:43 +0000 (19:58 -0800)]
PG: set child up/acting in split_into

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSD: do _remove_pg in add_newly_split_pg if pool is gone
Samuel Just [Mon, 19 Nov 2012 03:24:00 +0000 (19:24 -0800)]
OSD: do _remove_pg in add_newly_split_pg if pool is gone

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd/: dirty info and log on child during split
Samuel Just [Tue, 13 Nov 2012 22:48:54 +0000 (14:48 -0800)]
osd/: dirty info and log on child during split

Otherwise, the log may not get written out.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd/: mark info.stats as invalid after split, fix in scrub
Samuel Just [Thu, 20 Sep 2012 01:41:34 +0000 (18:41 -0700)]
osd/: mark info.stats as invalid after split, fix in scrub

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG: split ops for child objects into child
Samuel Just [Thu, 20 Sep 2012 03:15:04 +0000 (20:15 -0700)]
PG: split ops for child objects into child

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSD: add initial split support
Samuel Just [Wed, 21 Nov 2012 00:47:49 +0000 (16:47 -0800)]
OSD: add initial split support

PGs are split after updating to the map on which they split.
OSD::activate_map populates the set of currently "splitting"
pgs.  Messages for those pgs are delayed until the split
is complete.  We add the newly split children to pg_map
once the transaction populating their on-disk state completes.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPG: add split_into to populate child members
Samuel Just [Wed, 21 Nov 2012 00:45:56 +0000 (16:45 -0800)]
PG: add split_into to populate child members

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd/: splitting a pg now triggers a new interval
Samuel Just [Thu, 20 Sep 2012 00:30:43 +0000 (17:30 -0700)]
osd/: splitting a pg now triggers a new interval

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPrioritizedQueue: allow caller to get items removed by removed_by_filter
Samuel Just [Sat, 10 Nov 2012 02:37:44 +0000 (18:37 -0800)]
PrioritizedQueue: allow caller to get items removed by removed_by_filter

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agomon/OSDMonitor: enable split in Monitor
Samuel Just [Wed, 19 Sep 2012 16:23:21 +0000 (09:23 -0700)]
mon/OSDMonitor: enable split in Monitor

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoPGMonitor,OSD: don't send creates on split
Samuel Just [Wed, 12 Sep 2012 16:38:05 +0000 (09:38 -0700)]
PGMonitor,OSD: don't send creates on split

Splits will be handled when the map update effecting the split is
processed for the splitting pg on each OSD.  This will mesh
with the pg history which will place the new pg at the current
positions of the splitting pg.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSD: dispatch_context only discard transaction if contexts empty
Samuel Just [Thu, 20 Sep 2012 20:55:48 +0000 (13:55 -0700)]
OSD: dispatch_context only discard transaction if contexts empty

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSD: don't wait for superblock writes in handle_osd_map
Samuel Just [Thu, 20 Sep 2012 21:45:00 +0000 (14:45 -0700)]
OSD: don't wait for superblock writes in handle_osd_map

Instead, pass the pinned maps into a Context and clear the
cache after the transaction is applied.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoos/: Add CollectionIndex::prep_delete
Samuel Just [Sun, 18 Nov 2012 02:18:23 +0000 (18:18 -0800)]
os/: Add CollectionIndex::prep_delete

If an unlink is interrupted between removing the file
and updating the subdir attribute, the attribute will
overestimate the number of files in the directory.  This
is by design: at worst we will merge the collection later
than intended, but closing the gap would require a second
subdir xattr update.  However, this can in extreme cases
result in a collection with subdirectories but no objects.
FileStore::_destroy_collection would therefore see an
erroneous -ENOTEMPTY.

prep_delete allows the CollectionIndex implementation to
clean up state prior to removal.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoos/: Add CollectionIndex failure injection
Samuel Just [Fri, 16 Nov 2012 22:26:32 +0000 (14:26 -0800)]
os/: Add CollectionIndex failure injection

Several pieces of HashIndex involve multi-step operations
which are sensitive to OSD crashes.  This patch introduces
failure injection to force retries from various points in
the LFNIndex helper methods to be used with store_test.cc.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agotest/store_test: add simple tests for collection_split
Samuel Just [Tue, 11 Sep 2012 04:47:09 +0000 (21:47 -0700)]
test/store_test: add simple tests for collection_split

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoos/: add filestore collection_split
Samuel Just [Tue, 11 Sep 2012 04:46:38 +0000 (21:46 -0700)]
os/: add filestore collection_split

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoOSD: ignore queries on now deleted pools
Samuel Just [Wed, 5 Dec 2012 19:11:10 +0000 (11:11 -0800)]
OSD: ignore queries on now deleted pools

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
12 years agoMerge remote-tracking branch 'origin/wip-mds' into next
Greg Farnum [Wed, 5 Dec 2012 00:48:09 +0000 (16:48 -0800)]
Merge remote-tracking branch 'origin/wip-mds' into next

Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agoMerge branch 'wip-filestore' into next
Sage Weil [Tue, 4 Dec 2012 23:05:18 +0000 (15:05 -0800)]
Merge branch 'wip-filestore' into next

Reviewed-by: Sam Just <sam.just@inktank.com>
12 years agoMerge branch 'wip-msgr-delay-queue' into next
Sage Weil [Tue, 4 Dec 2012 22:52:22 +0000 (14:52 -0800)]
Merge branch 'wip-msgr-delay-queue' into next

12 years agomds: journal remote inode's projected parent
Yan, Zheng [Tue, 4 Dec 2012 08:09:48 +0000 (16:09 +0800)]
mds: journal remote inode's projected parent

Server::_rename_prepare() adds remote inode's parent instead of
projected parent to the journal. So during journal replay, the
journal entry for the rename operation will wrongly revert the
remote inode's projected rename. This issue can be reproduced by:

 touch file1
 ln file1 file2
 rm file1
 mv file2 file3

After journal replay, file1 reappears and directory's fragstat
gets corrupted.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: don't create bloom filter for incomplete dir
Yan, Zheng [Tue, 4 Dec 2012 08:09:47 +0000 (16:09 +0800)]
mds: don't create bloom filter for incomplete dir

Creating a bloom filter for an incomplete dir that was added by log
replay will confuse subsequent dir lookups and can create a null
dentry for an existing file. The erroneous null dentry confuses the
fragstat accounting and causes an undeletable empty directory.

The fix is to check whether the dir is complete before creating the
bloom filter. For the MDCache::trim_non_auth{,_subtree} cases, just do
not call CDir::add_to_bloom, because a bloom filter is useless for a
replica.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agoPG: remove last_epoch_started asserts in proc_primary_info
Samuel Just [Tue, 4 Dec 2012 19:36:58 +0000 (11:36 -0800)]
PG: remove last_epoch_started asserts in proc_primary_info

These asserts are valid for a uniform cluster, but they won't hold
for a replica running a version without the info.last_epoch_started
patch.

Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agomds: journal remote inode's projected parent
Yan, Zheng [Tue, 4 Dec 2012 08:09:48 +0000 (16:09 +0800)]
mds: journal remote inode's projected parent

Server::_rename_prepare() adds remote inode's parent instead of
projected parent to the journal. So during journal replay, the
journal entry for the rename operation will wrongly revert the
remote inode's projected rename. This issue can be reproduced by:

 touch file1
 ln file1 file2
 rm file1
 mv file2 file3

After journal replay, file1 reappears and directory's fragstat
gets corrupted.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: don't create bloom filter for incomplete dir
Yan, Zheng [Tue, 4 Dec 2012 08:09:47 +0000 (16:09 +0800)]
mds: don't create bloom filter for incomplete dir

Creating a bloom filter for an incomplete dir that was added by log
replay will confuse subsequent dir lookups and can create a null
dentry for an existing file. The erroneous null dentry confuses the
fragstat accounting and causes an undeletable empty directory.

The fix is to check whether the dir is complete before creating the
bloom filter. For the MDCache::trim_non_auth{,_subtree} cases, just do
not call CDir::add_to_bloom, because a bloom filter is useless for a
replica.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agoMerge remote-tracking branch 'gh/wip-mds' into next
Sage Weil [Tue, 4 Dec 2012 13:27:59 +0000 (05:27 -0800)]
Merge remote-tracking branch 'gh/wip-mds' into next

12 years agologrotate: do not spam stdout
Sage Weil [Tue, 4 Dec 2012 13:25:52 +0000 (05:25 -0800)]
logrotate: do not spam stdout

Avoid anything on stdout that will generate cron emails for people.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge branch 'master' of http://github.com/ceph/ceph
Gary Lowell [Tue, 4 Dec 2012 05:31:14 +0000 (21:31 -0800)]
Merge branch 'master' of http://github.com/ceph/ceph

12 years agoMerge branch 'next'
Gary Lowell [Tue, 4 Dec 2012 05:29:52 +0000 (21:29 -0800)]
Merge branch 'next'

12 years agodoc: Added a striping section for Architecture.
John Wilkins [Tue, 4 Dec 2012 04:48:02 +0000 (20:48 -0800)]
doc: Added a striping section for Architecture.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agov0.55 v0.55
Gary Lowell [Tue, 4 Dec 2012 03:08:35 +0000 (19:08 -0800)]
v0.55

12 years agoceph.spec.in: Add SLES and remove Fedora from debug package list.
Gary Lowell [Tue, 4 Dec 2012 03:06:42 +0000 (19:06 -0800)]
ceph.spec.in: Add SLES and remove Fedora from debug package list.

12 years agoMerge branch 'next'
Sage Weil [Mon, 3 Dec 2012 23:33:29 +0000 (15:33 -0800)]
Merge branch 'next'

12 years agotest_rados_api_misc: fix dup rmmkey test
Sage Weil [Mon, 3 Dec 2012 23:29:56 +0000 (15:29 -0800)]
test_rados_api_misc: fix dup rmmkey test

We now expect ENOENT as of 9961640f76a950c674c0e7cc2453931088c63fd7
again.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agodoc: Fixed many hyperlinks, a few typos, and some minor clarifications.
John Wilkins [Mon, 3 Dec 2012 20:22:37 +0000 (12:22 -0800)]
doc: Fixed many hyperlinks, a few typos, and some minor clarifications.

fixes: #3564

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agodoc: Clarified example for root user.
John Wilkins [Mon, 3 Dec 2012 18:48:10 +0000 (10:48 -0800)]
doc: Clarified example for root user.

Signed-off-by: John Wilkins <john.wilkins@inktank.com>
12 years agoconfig: we still want osd_thread_recovery_timeout
Sage Weil [Mon, 3 Dec 2012 11:56:15 +0000 (03:56 -0800)]
config: we still want osd_thread_recovery_timeout

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoconfig: Remove unused options
Sam Lang [Sat, 1 Dec 2012 22:59:36 +0000 (16:59 -0600)]
config: Remove unused options

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agoclient: Fix ceph_mount() when subdir is specified
Sam Lang [Thu, 29 Nov 2012 20:32:32 +0000 (14:32 -0600)]
client: Fix ceph_mount() when subdir is specified

If a subdirectory is specified to ceph_mount, the
root inode does not have an ino of CEPH_INO_ROOT, so
cwd will never find root and eventually hits
an assertion in in->get_first_parent().  This fix uses
the inode stored in the root member instead, ensuring
that we stop wherever the mount is rooted.

Signed-off-by: Sam Lang <sam.lang@inktank.com>
12 years agoosd: EINVAL on unknown TMAP op code
Sage Weil [Wed, 28 Nov 2012 04:04:34 +0000 (20:04 -0800)]
osd: EINVAL on unknown TMAP op code

The old/slow implementation did this, but the optimized version did
not.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: use TMAP_RMSLOPPY op when removing dentries
Sage Weil [Wed, 28 Nov 2012 04:43:38 +0000 (20:43 -0800)]
mds: use TMAP_RMSLOPPY op when removing dentries

After replay, we don't know if the dentry removal has already been
committed.  Use a sloppy removal so that we succeed even if we are
repeating the operation.

Conveniently, the previous implementation (pre v0.55) silently ignored
tmap op codes it did not understand, which means this new RMSLOPPY will
be interpreted the same as an actual RMSLOPPY.  That means a v0.55
mds can run against an older osd (say, argonaut) without problems.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: add TMAP_RMSLOPPY op
Sage Weil [Wed, 28 Nov 2012 04:00:19 +0000 (20:00 -0800)]
osd: add TMAP_RMSLOPPY op

Remove a key, but succeed if key already does not exist.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: ENOENT on TMAP_RM on non-existent key
Sage Weil [Wed, 28 Nov 2012 04:02:54 +0000 (20:02 -0800)]
osd: ENOENT on TMAP_RM on non-existent key

This reverts 29fae494d0b1459c8bb934d42446e0ada7355402 and fixes the
alternate implementation added by 8e91d00b52808aa1a4e3a838deda34a439.
librbd relies on the ENOENT return value.

Reported-by: Dan Mick <dan.mick@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
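The three TMAP behaviours described in the commits above (ENOENT for TMAP_RM on a missing key, success for TMAP_RMSLOPPY on a missing key, EINVAL for an unknown op code) can be sketched against a plain in-memory map. This is only an illustration under stated assumptions: `TmapOp`, `apply_tmap_op`, and the use of `std::map` are hypothetical and do not reflect the OSD's encoding or its actual constants.

```cpp
#include <cerrno>
#include <map>
#include <string>

// Hypothetical op codes for illustration; not Ceph's actual constants.
enum class TmapOp { Set, Rm, RmSloppy, Unknown };

// Apply one op to an in-memory key/value map.
// Returns 0 on success or a negative errno value.
int apply_tmap_op(std::map<std::string, std::string>& m, TmapOp op,
                  const std::string& key, const std::string& val = "") {
  switch (op) {
  case TmapOp::Set:
    m[key] = val;
    return 0;
  case TmapOp::Rm:
    // Strict removal: a missing key is an error the caller can observe.
    return m.erase(key) ? 0 : -ENOENT;
  case TmapOp::RmSloppy:
    // Sloppy removal: succeed whether or not the key existed.
    m.erase(key);
    return 0;
  default:
    // Unknown op codes are rejected instead of being silently ignored.
    return -EINVAL;
  }
}
```

Repeating a strict `Rm` on the same key yields `-ENOENT` the second time, while repeating `RmSloppy` keeps returning 0, which is what makes it safe to reissue after journal replay.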
12 years agoos/JournalingObjectStore: applied_seq -> max_applied_seq
Sage Weil [Sun, 2 Dec 2012 15:31:49 +0000 (07:31 -0800)]
os/JournalingObjectStore: applied_seq -> max_applied_seq

Rename applied_seq to max_applied_seq, since it is a bound; there may be
seq's < max_applied_seq that are not applied.  This aligns the naming with
max_applying_seq.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoos/FileStore: only wait for applying ops to complete before commit
Sage Weil [Sun, 2 Dec 2012 15:29:46 +0000 (07:29 -0800)]
os/FileStore: only wait for applying ops to complete before commit

We can have a large number of operations in the op_wq waiting to be applied
to the fs.  Currently, when we want to commit, we wait for them *all* to
apply.  This can take a very long time (the default queue length is 500
operations!).

Instead, mark an Op as started ("applying") when the thread pool actually
starts to apply it.  At that point, only wait for applying ops to complete.
We let any threads with an op seq < max_applying_seq begin as well so that
we have a proper ordering/barrier.  When those flush, applied_seq will ==
max_applying_seq, and that becomes the committing_seq value.

Note that 'applied_seq' is still maintained, but serves no real purpose
except to populate our asserts with sanity checks.  max_applying_seq serves
the purpose applied_seq used to.

This removes one unnecessary source of latency associated with fs
commits.

Signed-off-by: Sage Weil <sage@inktank.com>
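The commit rule described above can be sketched with a small bookkeeping class. This is a single-threaded illustration with no locking or thread pool, and it is not FileStore's implementation: `ApplyTracker` and its method names are hypothetical, and only `max_applying_seq` and the committing value are taken from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical tracker; names mirror the commit message, not FileStore's code.
class ApplyTracker {
  std::set<std::uint64_t> queued;    // ops still waiting in the work queue
  std::set<std::uint64_t> applying;  // ops a worker has started but not finished
  std::uint64_t max_applying_seq = 0;

public:
  void queue_op(std::uint64_t seq) { queued.insert(seq); }

  // A worker picks the op up: from here on a commit has to wait for it.
  void start_apply(std::uint64_t seq) {
    queued.erase(seq);
    applying.insert(seq);
    if (seq > max_applying_seq) max_applying_seq = seq;
  }

  void finish_apply(std::uint64_t seq) { applying.erase(seq); }

  // A commit waits for ops that are applying, plus any queued op with
  // seq < max_applying_seq (those must still run to preserve ordering).
  // Queued ops beyond max_applying_seq do not hold the commit up.
  bool commit_must_wait() const {
    if (!applying.empty()) return true;
    return !queued.empty() && *queued.begin() < max_applying_seq;
  }

  // Only meaningful once nothing at or below max_applying_seq is pending.
  std::uint64_t committing_seq() const {
    assert(!commit_must_wait());
    return max_applying_seq;
  }
};
```

If ops 1 through 5 are queued and only 1 and 3 have started, the commit waits for 1, 2, and 3 and then commits at seq 3, leaving 4 and 5 in the queue.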
12 years agoosd: fix RepModify when past last_peering_reset
Sage Weil [Sun, 2 Dec 2012 03:15:18 +0000 (19:15 -0800)]
osd: fix RepModify when past last_peering_reset

If we apply or commit a RepModify from a previous peering interval, we need
to free it.

This fixes 'slow request' messages when in fact client requests are not
delayed, and plugs the related memory leak.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'origin/wip-coverity'
Yehuda Sadeh [Sun, 2 Dec 2012 03:32:07 +0000 (19:32 -0800)]
Merge remote-tracking branch 'origin/wip-coverity'

12 years agoMerge remote-tracking branch 'gh/next'
Sage Weil [Sun, 2 Dec 2012 02:23:02 +0000 (18:23 -0800)]
Merge remote-tracking branch 'gh/next'

12 years agoOutputDataSocket: fix uninit var
Sage Weil [Sat, 1 Dec 2012 05:43:06 +0000 (21:43 -0800)]
OutputDataSocket: fix uninit var

CID 745933 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
At (2): Non-static class member "data_size" is not initialized in this constructor nor in any functions that it calls.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agorgw: fix uninit var
Sage Weil [Sat, 1 Dec 2012 05:41:54 +0000 (21:41 -0800)]
rgw: fix uninit var

CID 745935 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
At (2): Non-static class member "expiration" is not initialized in this constructor nor in any functions that it calls.

At (2): Non-static class member "min_len" is not initialized in this constructor nor in any functions that it calls.
At (4): Non-static class member "max_len" is not initialized in this constructor nor in any functions that it calls.
At (6): Non-static class member "ret" is not initialized in this constructor nor in any functions that it calls.
At (8): Non-static class member "len" is not initialized in this constructor nor in any functions that it calls.
At (10): Non-static class member "ofs" is not initialized in this constructor nor in any functions that it calls.
At (12): Non-static class member "supplied_md5_b64" is not initialized in this constructor nor in any functions that it calls.
At (14): Non-static class member "supplied_etag" is not initialized in this constructor nor in any functions that it calls.
CID 745934 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
At (16): Non-static class member "data_pending" is not initialized in this constructor nor in any functions that it calls.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agotest/osdc/FakeWriteback: fix uninit var
Sage Weil [Sat, 1 Dec 2012 05:39:05 +0000 (21:39 -0800)]
test/osdc/FakeWriteback: fix uninit var

CID 745936 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
At (2): Non-static class member "m_off" is not initialized in this constructor nor in any functions that it calls.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agoosd: fix missing unlock; simplify
Sage Weil [Sat, 1 Dec 2012 05:35:20 +0000 (21:35 -0800)]
osd: fix missing unlock; simplify

Instead of a special-case exit, just skip the con replacement.  Continue
on to mark the old con down.

CID 745920 (#1 of 1): Missing unlock (LOCK)
At (8): Returning without unlocking "this->heartbeat_lock._m".

Signed-off-by: Sage Weil <sage@inktank.com>
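The shape of the fix above (no special-case early return, so the lock is released on every path) can be sketched with a scoped lock. This is an illustration only: `HeartbeatSketch`, `handle_reset`, and the two counters are hypothetical, and `std::mutex` with `std::lock_guard` is swapped in for whatever locking primitives the OSD code actually uses.

```cpp
#include <mutex>

// Hypothetical stand-in for the heartbeat state touched on a connection reset.
struct HeartbeatSketch {
  std::mutex heartbeat_lock;
  int replaced = 0;     // times a replacement connection was installed
  int marked_down = 0;  // times the old connection was marked down

  void handle_reset(bool have_new_con) {
    std::lock_guard<std::mutex> guard(heartbeat_lock);
    if (have_new_con) {
      ++replaced;       // otherwise just skip the replacement step
    }
    ++marked_down;      // always continue on to mark the old con down
  }                     // guard releases heartbeat_lock on every exit path
};
```

Because there is a single exit and the guard owns the lock, a "returning without unlocking" path of the kind Coverity reported cannot occur in this structure.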
12 years agomds: fix freeze inode deadlock
Yan, Zheng [Mon, 19 Nov 2012 02:43:46 +0000 (10:43 +0800)]
mds: fix freeze inode deadlock

CInode::freeze_inode() is used in the case of cross authority rename.
Server::handle_slave_rename_prep() calls it to wait for all other
operations on source inode to complete. This happens after all locks
for the rename operation are acquired. But to acquire locks, we need
to auth pin the locks' parent objects first. So there is an ABBA deadlock
if someone auth pins the source inode after locks for rename are
acquired and before Server::handle_slave_rename_prep() is called.
The fix is to freeze and auth pin the source inode at the same time.

This patch introduces CInode::freeze_auth_pin(), which waits for all
other MDRequests to release auth pins, then changes the inode to
FROZENAUTHPIN state. This state prevents other MDRequests from
getting new auth pins.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: use rdlock_try() when checking NULL dentry
Yan, Zheng [Mon, 19 Nov 2012 02:43:48 +0000 (10:43 +0800)]
mds: use rdlock_try() when checking NULL dentry

Use rdlock_try() instead of can_read() when path_traverse encounters
a NULL dentry. This can partly avoid infinitely waiting for the
dentry to become readable when the dentry is a replica.

Strictly speaking, using rdlock_try() is still not enough because the
auth MDS may drop the REQRDLOCK message in some cases.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: allow open_remote_ino() to open xlocked dentry
Yan, Zheng [Mon, 19 Nov 2012 02:43:47 +0000 (10:43 +0800)]
mds: allow open_remote_ino() to open xlocked dentry

discover_ino() has a parameter want_xlocked. The parameter indicates
whether the remote discover handler can proceed when an xlocked dentry
is encountered. open_remote_ino() uses discover_ino() to find a non-auth
inode, but always sets 'want_xlocked' to false. This may cause a
deadlock in some corner cases. For example:

We rename an inode's primary dentry to one of its remote dentries and
send a slave request to one witness MDS. But before the slave request
reaches the witness MDS, the inode is trimmed from the witness MDS'
cache. Then when the slave request arrives, open_remote_ino() will
be called while traversing the destpath. open_remote_ino() calls
discover_ino() with 'want_xlocked=false' to find the inode.
discover_ino() sends an MDiscover message to the inode's authority MDS.
The handler of the MDiscover message finds the inode's primary dentry
is xlocked and it sleeps.

The fix is to add a parameter 'want_xlocked' to open_remote_ino() and
make open_remote_ino() pass the parameter to discover_ino().

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: fix assertion in handle_cache_expire
Yan, Zheng [Mon, 19 Nov 2012 02:43:45 +0000 (10:43 +0800)]
mds: fix assertion in handle_cache_expire

During export, it's possible to get cache expire messages in
DISCOVERING, FREEZING and PREPPING state.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: fix open_remote_inode race
Yan, Zheng [Mon, 19 Nov 2012 02:43:44 +0000 (10:43 +0800)]
mds: fix open_remote_inode race

discover_ino() may return -ENOENT if it races with other FS activities,
so use C_MDC_RetryOpenRemoteIno instead of C_MDC_OpenRemoteIno as the
onfinish callback.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: consider revoking caps in imported caps as issued
Yan, Zheng [Mon, 19 Nov 2012 02:43:43 +0000 (10:43 +0800)]
mds: consider revoking caps in imported caps as issued

The clients may have already sent the caps release message to the exporting
MDS, so the importing MDS waits for the release message forever.
Considering revoking caps as issued avoids this issue.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: drop locks if requiring auth pinning new objects.
Yan, Zheng [Mon, 19 Nov 2012 02:43:42 +0000 (10:43 +0800)]
mds: drop locks if requiring auth pinning new objects.

Locker::acquire_locks() skips auth pinning a replica object if we only
request a rdlock and the lock is read-lockable. To get all locks,
we may call Locker::acquire_locks() several times, and locks in replica
objects may become not read-lockable between calls. So it is
possible that we need to auth pin new objects after already taking some
locks.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: don't forward client request from MDS
Yan, Zheng [Mon, 19 Nov 2012 02:43:40 +0000 (10:43 +0800)]
mds: don't forward client request from MDS

Forwarding a client request that came from an MDS will trigger an
assertion in MDS::forward_message_mds(). An MDS only sends client requests
for stray migration/reintegration, so it's safe to drop them.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: call eval() after caps are exported
Yan, Zheng [Mon, 19 Nov 2012 02:43:39 +0000 (10:43 +0800)]
mds: call eval() after caps are exported

For an inode that has just changed authority, the new auth MDS may want
to change a lock in the inode from 'sync' to 'lock' state before caps
are exported. The lock in the replica can then remain in 'sync->lock'
state because client caps prevent it from transitioning to 'lock' state.
So we should call eval() after clearing client caps.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: clear lock flushed if replica is waiting for AC_LOCKFLUSHED
Yan, Zheng [Mon, 19 Nov 2012 02:43:38 +0000 (10:43 +0800)]
mds: clear lock flushed if replica is waiting for AC_LOCKFLUSHED

So eval_gather() will not skip calling scatter_writebehind(),
otherwise the replica lock may be in flushing state forever.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: Don't acquire replica object's versionlock
Yan, Zheng [Mon, 19 Nov 2012 02:43:37 +0000 (10:43 +0800)]
mds: Don't acquire replica object's versionlock

Both CInode and CDentry's versionlocks are of type LocalLock.
Acquiring LocalLock in replica object is useless and problematic.
For example, if two requests try acquiring a replica object's
versionlock, the first request succeeds, the second request
is added to wait queue. Later when the first request finishes,
MDCache::request_drop_foreign_locks() finds the lock's parent is
non-auth, it skips waking requests in the wait queue. So the
second request hangs.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomds: allow try_eval to eval unstable locks in freezing object
Yan, Zheng [Mon, 19 Nov 2012 02:43:36 +0000 (10:43 +0800)]
mds: allow try_eval to eval unstable locks in freezing object

Unstable locks hold auth_pins on the object, which prevents the freezing
object from becoming frozen and then unfreezing. So try_eval() should not
wait for a freezing object.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
12 years agomsg/Pipe: flush delayed messages when stealing/failing pipes
Sage Weil [Sat, 1 Dec 2012 04:23:52 +0000 (20:23 -0800)]
msg/Pipe: flush delayed messages when stealing/failing pipes

If we are failing a pipe, flush the incoming messages before we try to
reconnect.  Similarly, flush queued messages on an existing pipe before we
replace it.  This ensures that when we get a socket failure and reconnect
the delayed messages are handled in the normal fashion.

Specifically, it fixes a situation like:

 - read msg, update in_seq etc.
 - delay msg
 - pipe faults
 - peer reconnects, we replace existing pipe, discard delayed msgs
 - peer resends msgs
 - we discard, because they are < in_seq

Signed-off-by: Sage Weil <sage@inktank.com>
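The failure sequence listed above can be reproduced in a small model. This is a sketch under stated assumptions, not the messenger's code: `PipeSketch`, `Msg`, and the method names are hypothetical, and a resend is treated as a duplicate when its sequence number is at or below `in_seq`.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

struct Msg { std::uint64_t seq; };

// Hypothetical stand-in for a Pipe with a delay queue.
struct PipeSketch {
  std::uint64_t in_seq = 0;               // highest seq read off the wire
  std::deque<Msg> delayed;                // read, but dispatch is delayed
  std::vector<std::uint64_t> dispatched;  // seqs handed to the dispatcher

  // Returns false when the message is dropped as an already-seen resend.
  bool read_msg(const Msg& m, bool delay) {
    if (m.seq <= in_seq) return false;
    in_seq = m.seq;
    if (delay) delayed.push_back(m);
    else dispatched.push_back(m.seq);
    return true;
  }

  // Deliver everything sitting in the delay queue, in order.
  void flush_delayed() {
    for (const Msg& m : delayed) dispatched.push_back(m.seq);
    delayed.clear();
  }

  // On a fault or replacement, flush instead of discarding: the peer's
  // resends will be dropped because in_seq has already advanced.
  void on_fault_or_replace() { flush_delayed(); }
};
```

If `on_fault_or_replace()` cleared the queue instead of flushing it, message 1 in the usage below would be lost for good, since the resend is rejected by the `in_seq` check.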
12 years agorbd: report striping as a feature in rbd info
Dan Mick [Sat, 1 Dec 2012 06:46:36 +0000 (22:46 -0800)]
rbd: report striping as a feature in rbd info

Fixes: #3549
Signed-off-by: Dan Mick <dan.mick@inktank.com>
12 years agoceph-osd: put g_ceph_context before exit
Samuel Just [Sat, 1 Dec 2012 01:57:35 +0000 (17:57 -0800)]
ceph-osd: put g_ceph_context before exit

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoReplicatedPG: only increment active_scrub on primary for final push
Samuel Just [Fri, 30 Nov 2012 22:04:53 +0000 (14:04 -0800)]
ReplicatedPG: only increment active_scrub on primary for final push

We only queue the _applied_recovered_object callback on the primary for the
final push.  It is this callback which decrements active_pushes.  It's ok to
not increment active_pushes for the intermediate pushes since these only affect
a temp file.

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-osd-msgr'
Sage Weil [Fri, 30 Nov 2012 20:12:23 +0000 (12:12 -0800)]
Merge remote-tracking branch 'gh/wip-osd-msgr'

12 years agoOSDService: make messengers private
Samuel Just [Fri, 30 Nov 2012 19:20:41 +0000 (11:20 -0800)]
OSDService: make messengers private

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agoosd/: make OSDService messenger helpers return ConnectionRef
Samuel Just [Fri, 30 Nov 2012 19:08:55 +0000 (11:08 -0800)]
osd/: make OSDService messenger helpers return ConnectionRef

Signed-off-by: Samuel Just <sam.just@inktank.com>
12 years agomon: PaxosService: cancel proposal timer after election
Joao Eduardo Luis [Fri, 30 Nov 2012 17:16:35 +0000 (17:16 +0000)]
mon: PaxosService: cancel proposal timer after election

Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
12 years agoMerge remote-tracking branch 'gh/wip-mds-ls2'
Sage Weil [Fri, 30 Nov 2012 16:26:25 +0000 (08:26 -0800)]
Merge remote-tracking branch 'gh/wip-mds-ls2'

Reviewed-by: Greg Farnum <greg@inktank.com>
12 years agodoc: update kernel recs
Sage Weil [Fri, 30 Nov 2012 01:28:36 +0000 (17:28 -0800)]
doc: update kernel recs

Mention which stable kernels we recommend.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agologrotate: fix rotation
David Zafman [Fri, 30 Nov 2012 02:07:20 +0000 (18:07 -0800)]
logrotate: fix rotation

Fixes: #3554
Always reload with Upstart because in some configs the init.d script doesn't work

Signed-off-by: David Zafman <david.zafman@inktank.com>
12 years agomds: assert segments not empty in get_current_segment()
Sage Weil [Thu, 29 Nov 2012 05:21:15 +0000 (21:21 -0800)]
mds: assert segments not empty in get_current_segment()

Only one caller can tolerate no segments; make a new
peek_current_segment() for them.

Motivated by paranoia tracking down a crash during client unmount, but
it wasn't this.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: be explicit about MDRequest killed state
Sage Weil [Thu, 29 Nov 2012 05:19:37 +0000 (21:19 -0800)]
mds: be explicit about MDRequest killed state

Set the killed flag and use that instead of inferring things from
the session xlist.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: drop redundant mdr->committing = true
Sage Weil [Thu, 29 Nov 2012 05:16:31 +0000 (21:16 -0800)]
mds: drop redundant mdr->committing = true

journal_and_reply() does this.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years agomds: fix request_kill()
Sage Weil [Thu, 29 Nov 2012 05:19:01 +0000 (21:19 -0800)]
mds: fix request_kill()

Only request_cleanup() if the request isn't already committing.  If it
is, wait for it to commit before we clean up.

It might fix all of #3531, #3210, #1947, and #1548.  Maybe.

Signed-off-by: Sage Weil <sage@inktank.com>
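
A rough illustration of the rule described in this entry, combined with the explicit killed flag from the MDRequest entry above: cleanup is deferred while a request is committing. This is a minimal sketch under stated assumptions, not the MDS code. The `Request` struct, its `cleaned_up` field, and the `commit_finished()` hook are hypothetical stand-ins.

```cpp
// Hypothetical, simplified request state; the real MDRequest is far richer.
struct Request {
  bool committing = false;  // a journal commit is in flight
  bool killed = false;      // explicit flag rather than inferred state
  bool cleaned_up = false;  // set once cleanup has run
};

// Stand-in for request_cleanup(): releases whatever the request holds.
void request_cleanup(Request& r) { r.cleaned_up = true; }

// Kill a request: clean up immediately only if no commit is in progress.
void request_kill(Request& r) {
  r.killed = true;
  if (!r.committing)
    request_cleanup(r);
}

// Hypothetical commit-completion hook: finish a deferred kill.
void commit_finished(Request& r) {
  r.committing = false;
  if (r.killed && !r.cleaned_up)
    request_cleanup(r);
}
```

With this shape, killing a committing request only marks it; the cleanup runs once when the commit completes.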
12 years ago Revert "osd: fix leak of heartbeat con on reset"
Sage Weil [Fri, 30 Nov 2012 01:03:30 +0000 (17:03 -0800)]
Revert "osd: fix leak of heartbeat con on reset"

This reverts commit b31a99abda75b9170a5805b02944a0c0c78245b7.

12 years ago client: only dump cache on umount if we time out
Sage Weil [Fri, 30 Nov 2012 00:45:52 +0000 (16:45 -0800)]
client: only dump cache on umount if we time out

We don't want to dump the cache every time an item is trimmed and the
mount_cond gets signaled; this can make umount crazy-slow when logging is
turned up.

Instead, only dump if we wait 5 seconds without making any progress on
shrinking the cache.

Signed-off-by: Sage Weil <sage@inktank.com>
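
The behaviour described in this entry can be sketched with a timed condition-variable wait: dump only when a full timeout passes without the cache shrinking. This is an illustrative sketch, not the client code. `CacheWaiter`, `dump_cache()`, `trim_one()`, and the `dumps` counter are hypothetical names, and `cond` merely plays the role of mount_cond.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for the client cache; not Ceph's Client class.
struct CacheWaiter {
  std::mutex lock;
  std::condition_variable cond;   // plays the role of mount_cond
  std::size_t cache_size = 0;
  int dumps = 0;                  // counts how often the cache was dumped

  void dump_cache() { ++dumps; }  // placeholder for the expensive dump

  // Block until the cache is empty. Dump only when a whole timeout
  // elapses without the cache getting smaller.
  void wait_for_empty(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> l(lock);
    while (cache_size > 0) {
      const std::size_t before = cache_size;
      const bool progressed =
          cond.wait_for(l, timeout, [&] { return cache_size < before; });
      if (!progressed)
        dump_cache();
    }
  }

  // Called by whoever trims an item; wakes the waiter.
  void trim_one() {
    {
      std::lock_guard<std::mutex> l(lock);
      if (cache_size > 0)
        --cache_size;
    }
    cond.notify_all();
  }
};
```

A caller would use something like `waiter.wait_for_empty(std::chrono::seconds(5))` to match the 5-second interval mentioned above; wakeups caused by ordinary trimming then never trigger a dump.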
12 years ago msg/Pipe: release dispatch throttle on delayed queue discard
Sage Weil [Thu, 29 Nov 2012 18:49:47 +0000 (10:49 -0800)]
msg/Pipe: release dispatch throttle on delayed queue discard

This avoids leaking into the throttle and deadlocking.

Signed-off-by: Sage Weil <sage@inktank.com>
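
To see why a discard has to release the throttle, consider this minimal sketch. The `Throttle` and `DelayedQueue` types here are hypothetical and much simpler than Ceph's; in particular `take()` fails instead of blocking, whereas a blocking take is what would turn a leak into a deadlock.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical byte-budget throttle; not Ceph's Throttle class.
struct Throttle {
  std::size_t max;
  std::size_t current = 0;
  explicit Throttle(std::size_t m) : max(m) {}

  // Reserve budget; fail if that would exceed the limit.
  bool take(std::size_t n) {
    if (current + n > max) return false;
    current += n;
    return true;
  }

  // Hand budget back.
  void put(std::size_t n) { current -= n; }
};

// Hypothetical delayed queue that remembers each message's reserved size.
struct DelayedQueue {
  Throttle& throttle;
  std::deque<std::size_t> sizes;
  explicit DelayedQueue(Throttle& t) : throttle(t) {}

  bool enqueue(std::size_t n) {
    if (!throttle.take(n)) return false;
    sizes.push_back(n);
    return true;
  }

  // Drop everything, returning each message's budget to the throttle.
  void discard() {
    for (std::size_t n : sizes) throttle.put(n);
    sizes.clear();
  }
};
```

If `discard()` only cleared the queue, the reserved budget would never come back and later messages could not be admitted.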
12 years ago msg/Pipe: start delay thread *after* we know peer type
Sage Weil [Tue, 27 Nov 2012 23:58:09 +0000 (15:58 -0800)]
msg/Pipe: start delay thread *after* we know peer type

At end of connect(), or end of accept().

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago msg/Pipe: drop queue helpers
Sage Weil [Tue, 27 Nov 2012 23:27:18 +0000 (15:27 -0800)]
msg/Pipe: drop queue helpers

There is a single caller; these only obfuscate.

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago msg/Pipe: refactor msgr delays
Sage Weil [Tue, 27 Nov 2012 23:36:11 +0000 (15:36 -0800)]
msg/Pipe: refactor msgr delays

- move all delay state into a single class
- create thread once and only once per Pipe
- adjust debug levels
- discard messages at the appropriate times

Signed-off-by: Sage Weil <sage@inktank.com>
12 years ago msgr: add a delay_until queue that is used to delay deliveries.
Greg Farnum [Tue, 27 Nov 2012 18:05:47 +0000 (10:05 -0800)]
msgr: add a delay_until queue that is used to delay deliveries.

Its life-cycle matches that of delay_queue, and the delayed_delivery
function respects it. For now queue_received is just setting it to
delay everything by 1 second.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years ago msgr: clear out the delay queue when stop()ing
Greg Farnum [Tue, 27 Nov 2012 17:44:19 +0000 (09:44 -0800)]
msgr: clear out the delay queue when stop()ing

After some brief thought, I believe deleting any messages in the
delay queue is correct -- we are trying to simulate line delays
in delivery and so anything still in the queue has supposedly
not arrived yet. So delete them when we stop the Pipe for
any reason.

Signed-off-by: Greg Farnum <greg@inktank.com>
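
The reasoning in this entry, together with the delay_until queue from the entry above, can be modelled in a few lines. This is a sketch under stated assumptions rather than the Pipe code: `DelayQueue`, `pop_due()`, and `discard_all()` are hypothetical names, messages are plain strings, and the delay is a fixed interval so that arrival order equals release order.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical model of a delay queue; messages are plain strings here.
struct DelayQueue {
  // Each entry pairs a release time (the "delay_until" value) with a message.
  std::deque<std::pair<Clock::time_point, std::string>> entries;

  // Queue a received message, delaying it by a fixed interval
  // (1 second by default, as in the delay_until entry).
  void queue_received(std::string msg, Clock::time_point now,
                      std::chrono::milliseconds delay = std::chrono::seconds(1)) {
    entries.emplace_back(now + delay, std::move(msg));
  }

  // Return the messages whose release time has passed, in arrival order.
  std::vector<std::string> pop_due(Clock::time_point now) {
    std::vector<std::string> due;
    while (!entries.empty() && entries.front().first <= now) {
      due.push_back(std::move(entries.front().second));
      entries.pop_front();
    }
    return due;
  }

  // On stop: anything still queued has "not arrived yet", so drop it.
  std::size_t discard_all() {
    const std::size_t n = entries.size();
    entries.clear();
    return n;
  }
};
```

Passing the current time in explicitly keeps the sketch deterministic; a real implementation would read the clock inside a delivery thread.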
12 years ago msgr: move the delay queue initialization into start_reader
Greg Farnum [Tue, 27 Nov 2012 19:02:07 +0000 (11:02 -0800)]
msgr: move the delay queue initialization into start_reader

The Pipe doesn't know the peer type in the constructor. It
doesn't always know in start_reader either, so this needs more work,
but at least it knows more frequently than it did.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years ago messenger: add the shell of a system to delay incoming Message delivery
Greg Farnum [Wed, 21 Nov 2012 18:54:06 +0000 (10:54 -0800)]
messenger: add the shell of a system to delay incoming Message delivery

When ms_inject_delay_type matches that of the incoming Connection,
the Pipe sets up a delay queue that it shuttles all Messages through.
This lets us check cleanup and some notification code but doesn't
actually generate any delays.

Signed-off-by: Greg Farnum <greg@inktank.com>
12 years ago rgw: treat lack of swift token as anonymous user access
Yehuda Sadeh [Fri, 30 Nov 2012 00:04:41 +0000 (16:04 -0800)]
rgw: treat lack of swift token as anonymous user access

Fixes: #3534
If a swift token hasn't been provided, set user as anonymous.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>