git.apps.os.sepia.ceph.com Git - ceph.git/log
ceph.git
11 years agoReplicatedPG::_scrub: don't bail early for snapdir 1580/head
Samuel Just [Wed, 2 Apr 2014 17:11:02 +0000 (10:11 -0700)]
ReplicatedPG::_scrub: don't bail early for snapdir

Fixes: #7937
Signed-off-by: Samuel Just <sam.just@inktank.com>
11 years agoosd/ReplicatedPG: continue scrub logic when snapset.head_exists doesn't match
Sage Weil [Tue, 1 Apr 2014 18:02:42 +0000 (11:02 -0700)]
osd/ReplicatedPG: continue scrub logic when snapset.head_exists doesn't match

The 'continue' will cause more damage/noise than continuing because the
next_clone value won't be updated properly.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoosd/ReplicatedPG: handle snapdir properly during scrub
Sage Weil [Tue, 1 Apr 2014 18:04:47 +0000 (11:04 -0700)]
osd/ReplicatedPG: handle snapdir properly during scrub

Handle snapdir similarly to how head is treated when updating the
next_clone info.  Also, add a warning when we have a snapdir object and
head_exists == true (the converse of the existing check).

Fixes: #7937
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1575 from jdurgin/wip-librbd-cp-sparse
Sage Weil [Tue, 1 Apr 2014 01:15:12 +0000 (18:15 -0700)]
Merge pull request #1575 from jdurgin/wip-librbd-cp-sparse

librbd: skip zeroes when copying an image

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoosd/PG: debug cached_removed_snaps changes
Sage Weil [Mon, 31 Mar 2014 22:29:00 +0000 (15:29 -0700)]
osd/PG: debug cached_removed_snaps changes

See #7915.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1574 from mikenel/master
Sage Weil [Mon, 31 Mar 2014 22:19:57 +0000 (15:19 -0700)]
Merge pull request #1574 from mikenel/master

Add ceph-client-debug and jerasure shared objects to RPM spec file.

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agolibrbd: skip zeroes when copying an image 1575/head
Josh Durgin [Mon, 31 Mar 2014 21:53:31 +0000 (14:53 -0700)]
librbd: skip zeroes when copying an image

This is the simple coarse-grained solution, but it works well in
common cases like a small base image resized with a bunch of empty
space at the end. Finer-grained sparseness can be copied by using rbd
{export,import}-diff.

Fixes: #6257
Signed-off-by: Josh Durgin <josh.durgin@inktank.com>
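
The coarse-grained approach described in the commit above can be sketched
roughly as follows. This is an illustration only, not librbd's code: the
function name copy_skipping_zeroes, the chunk_size parameter, and the use of
plain bytes objects are all assumptions made for the example.

```python
def copy_skipping_zeroes(src: bytes, chunk_size: int = 4):
    """Return the (offset, data) writes needed to copy src.

    Hypothetical helper: chunks that consist entirely of zero bytes are
    skipped, so the destination stays sparse in those ranges.
    """
    writes = []
    for off in range(0, len(src), chunk_size):
        chunk = src[off:off + chunk_size]
        # Coarse-grained check: skip only if the whole chunk is zero.
        if chunk.count(0) == len(chunk):
            continue
        writes.append((off, chunk))
    return writes
```

A chunk that is only partly zero is still written whole, which is the sense
in which this approach is coarse-grained.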
11 years agoMerge pull request #1573 from ceph/wip-7912
Sage Weil [Mon, 31 Mar 2014 21:59:38 +0000 (14:59 -0700)]
Merge pull request #1573 from ceph/wip-7912

mon/PGMap: clear pool sum when last pg is deleted

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoMerge pull request #1556 from ceph/wip-7888
Sage Weil [Mon, 31 Mar 2014 21:57:57 +0000 (14:57 -0700)]
Merge pull request #1556 from ceph/wip-7888

msgr: add new ping/ping reply to use in place of keepalive

Reviewed-by: Greg Farnum <greg@inktank.com>
11 years agoMerge pull request #1566 from ceph/wip-fuse-access
Sage Weil [Mon, 31 Mar 2014 21:56:13 +0000 (14:56 -0700)]
Merge pull request #1566 from ceph/wip-fuse-access

fuse: implement 'access' low level function

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoqa/workunits/cephtool/test.sh: test 'osd pg-temp ...'
Sage Weil [Sat, 29 Mar 2014 17:46:16 +0000 (10:46 -0700)]
qa/workunits/cephtool/test.sh: test 'osd pg-temp ...'

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agomon/OSDMonitor: clear primary_temp on osd pg_temp updates
Sage Weil [Sat, 29 Mar 2014 00:54:17 +0000 (17:54 -0700)]
mon/OSDMonitor: clear primary_temp on osd pg_temp updates

Until the OSD and the MOSDPGTemp messages encode primary_temp updates,
assume that any pg_temp update will clear primary_temp.

Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Ilya Dryomov <ilya.dryomov@inktank.com>
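
The rule stated in the commit above (any pg_temp update clears primary_temp)
might look like this in miniature. The dictionaries and the function name
apply_pg_temp_update are hypothetical stand-ins for the monitor's pending
OSDMap state, not the actual OSDMonitor code.

```python
def apply_pg_temp_update(pg_temp, primary_temp, pgid, osds):
    """Apply a pg_temp update for pgid and drop any primary_temp for it.

    pg_temp maps pgid -> list of OSD ids; primary_temp maps pgid -> OSD id.
    An empty osds list removes the pg_temp mapping.
    """
    if osds:
        pg_temp[pgid] = list(osds)
    else:
        pg_temp.pop(pgid, None)
    # Assumption taken from the commit message: the update carries no
    # primary_temp information, so the old value cannot be trusted.
    primary_temp.pop(pgid, None)
```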
11 years agoOSDMonitor: add 'mon osd allow primary temp' bool option
Ilya Dryomov [Fri, 28 Mar 2014 16:28:44 +0000 (18:28 +0200)]
OSDMonitor: add 'mon osd allow primary temp' bool option

By default, we don't send out maps with primary_temp mappings because
there is no infrastructure in place that would make sure that the
entire cluster knows about primary_temp.  Add an option to allow
primary_temp mappings, for development purposes.

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoOSDMonitor: add 'osd primary-temp ...' command
Ilya Dryomov [Fri, 28 Mar 2014 16:28:44 +0000 (18:28 +0200)]
OSDMonitor: add 'osd primary-temp ...' command

ceph osd primary-temp <pgid> [<osd>]

Examples:

ceph osd primary-temp 0.2 4 # set primary_temp mapping for 0.2 to osd4
ceph osd primary-temp 0.2   # remove primary_temp mapping for 0.2

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoOSDMonitor: add 'osd pg-temp ...' command
Ilya Dryomov [Fri, 28 Mar 2014 16:28:44 +0000 (18:28 +0200)]
OSDMonitor: add 'osd pg-temp ...' command

ceph osd pg-temp <pgid> [<osd1> [<osd2> ...]]

Examples:

ceph osd pg-temp 0.2 0 1 2 # set pg_temp mapping for 0.2 to osds [0,1,2]
ceph osd pg-temp 0.2 3     # set pg_temp mapping for 0.2 to osds [3]
ceph osd pg-temp 0.2       # remove pg_temp mapping for 0.2

Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years agojava/test: ceph.file.layout xattr is still not there now
Greg Farnum [Mon, 31 Mar 2014 20:17:22 +0000 (13:17 -0700)]
java/test: ceph.file.layout xattr is still not there now

b8ea65694faf59f12f285a65dc21753dab20ba11 tried to fix this, but
missed a spot.

Signed-off-by: Greg Farnum <greg@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1567 from ceph/wip-7849
Josh Durgin [Mon, 31 Mar 2014 19:40:35 +0000 (12:40 -0700)]
Merge pull request #1567 from ceph/wip-7849

ceph-conf: don't create log files

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agoMerge pull request #1570 from dachary/wip-gitignore
Josh Durgin [Mon, 31 Mar 2014 19:36:07 +0000 (12:36 -0700)]
Merge pull request #1570 from dachary/wip-gitignore

.gitignore: add examples/librados files

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agoAdd ceph-client-debug and jerasure shared objects to RPM spec file. 1574/head
Michael Nelson [Mon, 31 Mar 2014 19:35:56 +0000 (12:35 -0700)]
Add ceph-client-debug and jerasure shared objects to RPM spec file.

11 years agomon/PGMap: clear pool sum when last pg is deleted 1573/head
Sage Weil [Mon, 31 Mar 2014 17:42:23 +0000 (10:42 -0700)]
mon/PGMap: clear pool sum when last pg is deleted

Use the x.0 pg as a sentinel for the existence of the pool.  Note that we
have to clean it up in two paths: apply_incremental (which is actually
deprecated) and the normal PGMonitor refresh.

Fixes: #7912
Signed-off-by: Sage Weil <sage@inktank.com>
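
A minimal sketch of the sentinel idea in the commit above, assuming pgids are
"pool.seed" strings and per-pg stats are plain integers. The names remove_pg,
pg_stats and pool_sums are hypothetical and do not mirror PGMap's real
members.

```python
def remove_pg(pg_stats, pool_sums, pgid):
    """Remove one pg and keep the per-pool sum consistent."""
    stat = pg_stats.pop(pgid, None)
    if stat is None:
        return
    pool, _, seed = pgid.partition(".")
    if seed == "0":
        # x.0 is treated as the sentinel: once it is gone the pool is
        # considered deleted, so the whole pool sum is cleared.
        pool_sums.pop(pool, None)
    elif pool in pool_sums:
        pool_sums[pool] -= stat
```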
11 years agoMerge pull request #1572 from ceph/wip-ec-profile-idempotent
Sage Weil [Mon, 31 Mar 2014 17:18:58 +0000 (10:18 -0700)]
Merge pull request #1572 from ceph/wip-ec-profile-idempotent

mon: make 'ceph osd erasure-code-profile set ...' idempotent

Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
11 years agomon: make 'ceph osd erasure-code-profile set ...' idempotent 1572/head
Sage Weil [Mon, 31 Mar 2014 17:01:43 +0000 (10:01 -0700)]
mon: make 'ceph osd erasure-code-profile set ...' idempotent

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoqa/workunits/rados/test_alloc_hint: fix erasure syntax
Sage Weil [Mon, 31 Mar 2014 16:14:36 +0000 (09:14 -0700)]
qa/workunits/rados/test_alloc_hint: fix erasure syntax

This changed recently.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1571 from kotnik/docufix
Loic Dachary [Mon, 31 Mar 2014 14:27:43 +0000 (16:27 +0200)]
Merge pull request #1571 from kotnik/docufix

Small glossary typo fix

Reviewed-by: Loic Dachary <loic@dachary.org>
11 years agodoc: fix typos in glossary 1571/head
Nikola Kotur [Mon, 31 Mar 2014 14:24:17 +0000 (16:24 +0200)]
doc: fix typos in glossary

11 years ago.gitignore: add examples/librados files 1570/head
Loic Dachary [Mon, 31 Mar 2014 09:30:10 +0000 (11:30 +0200)]
.gitignore: add examples/librados files

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agoMerge pull request #1564 from dachary/wip-erasure-code-doc
Sage Weil [Sun, 30 Mar 2014 16:20:58 +0000 (09:20 -0700)]
Merge pull request #1564 from dachary/wip-erasure-code-doc

doc: updates to the erasure code development docs

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1568 from dachary/wip-jerasure-warnings
Sage Weil [Sun, 30 Mar 2014 16:17:21 +0000 (09:17 -0700)]
Merge pull request #1568 from dachary/wip-jerasure-warnings

erasure-code: update jerasure / gf-complete submodules

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1569 from dachary/wip-ssse3
Sage Weil [Sun, 30 Mar 2014 16:16:44 +0000 (09:16 -0700)]
Merge pull request #1569 from dachary/wip-ssse3

autotools: s/ssse3/sse3/ typo

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoautotools: s/ssse3/sse3/ typo 1569/head
Loic Dachary [Sun, 30 Mar 2014 15:57:22 +0000 (17:57 +0200)]
autotools: s/ssse3/sse3/ typo

Reported-by: Justin Erenkrantz <justin@erenkrantz.com>
Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agoerasure-code: update jerasure / gf-complete submodules 1568/head
Loic Dachary [Sun, 30 Mar 2014 09:07:46 +0000 (11:07 +0200)]
erasure-code: update jerasure / gf-complete submodules

For compilation warning patches.

http://tracker.ceph.com/issues/7909 Fixes #7909

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agodoc: pgbackend dev doc outdated notice 1564/head
Loic Dachary [Sat, 29 Mar 2014 10:30:42 +0000 (11:30 +0100)]
doc: pgbackend dev doc outdated notice

* Warn the reader that the implementation is ahead and may differ
* Update the links to the Firefly branch
* Remove links to issues used during development to avoid confusion

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agodoc: update jerasure plugin
Loic Dachary [Sat, 29 Mar 2014 10:29:22 +0000 (11:29 +0100)]
doc: update jerasure plugin

* The parameters come from the erasure code profile
* Add a note about the upstream submodules gf-complete / jerasure

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agodoc: erasure code developer notes updates
Loic Dachary [Sat, 29 Mar 2014 10:27:00 +0000 (11:27 +0100)]
doc: erasure code developer notes updates

Update the introduction to explain erasure code profiles. Remove
obsolete explanations about partial writes etc. Remove links to tickets
used during development. Update permalinks to be closer to
Firefly (v0.78).

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agofuse: implement 'access' low level function 1566/head
Yan, Zheng [Sun, 30 Mar 2014 01:21:57 +0000 (09:21 +0800)]
fuse: implement 'access' low level function

Add an empty 'access' function to fuse low level functions. This
allows us to use ceph-fuse with fuse_default_permissions = false.
'fuse_default_permissions = false' can significantly improve the
speed of creating/removing a large number of files.

When fuse_default_permissions is true, the fuse kernel module sends
a getattr request whenever the kernel needs to check a directory's
permission. getattr (STAT_CAP_INODE_ALL) can be very slow if the
directory was just modified.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agoosd/ReplicatedPG: fix cache tier scrub again
Sage Weil [Sun, 30 Mar 2014 05:28:13 +0000 (22:28 -0700)]
osd/ReplicatedPG: fix cache tier scrub again

This condition was flipped from commit eb71924ea27e78d97bd45674ef5e6a7f
and the test case in c3292e48483d861148322590ea1f05afd28cc2d3 still didn't
catch it.  (It does now.)

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoceph_test_rados_api_tier: improve promote+scrub test
Sage Weil [Sun, 30 Mar 2014 05:27:04 +0000 (22:27 -0700)]
ceph_test_rados_api_tier: improve promote+scrub test

We need to have multiple clones with some different patterns of
missing-ness.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoceph-conf: use global_pre_init to avoid starting logging 1567/head
Sage Weil [Sun, 30 Mar 2014 04:52:09 +0000 (21:52 -0700)]
ceph-conf: use global_pre_init to avoid starting logging

This avoids starting up logging, which is not appropriate when we are
examining the config state and not actually starting up the entity in
question.

Fixes: #7849
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoglobal: separate first half of global_init into global_pre_init
Sage Weil [Sun, 30 Mar 2014 04:51:20 +0000 (21:51 -0700)]
global: separate first half of global_init into global_pre_init

The pre_init now captures enough to create the g_ceph_context and parse
and initialize the in-memory config.  However, we don't

 - fiddle with signal handlers
 - init lockdep
 - call config observers (which starts up logging)

Signed-off-by: Sage Weil <sage@inktank.com>
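
The split described in the commit above can be pictured with a toy two-phase
initializer. Everything here is an assumption for illustration: the dict-based
context, the "key=value" argument format, and the step names only echo the
list in the commit message and are not Ceph's real API.

```python
def global_pre_init(args):
    """First half: create the context and parse the in-memory config."""
    config = dict(arg.split("=", 1) for arg in args)
    return {"config": config, "started": []}


def global_init(args):
    """Full init: pre-init plus the side effects that pre-init avoids."""
    ctx = global_pre_init(args)
    ctx["started"] += ["signal_handlers", "lockdep", "config_observers"]
    return ctx
```

A tool that only inspects configuration would call the first function and
never trigger the observers that start logging.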
11 years agoMerge pull request #1520 from ceph/wip-multimds
Sage Weil [Sun, 30 Mar 2014 04:27:08 +0000 (21:27 -0700)]
Merge pull request #1520 from ceph/wip-multimds

Wip multimds

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1563 from dachary/wip-sse-i386
Sage Weil [Sun, 30 Mar 2014 00:28:36 +0000 (17:28 -0700)]
Merge pull request #1563 from dachary/wip-sse-i386

erasure-code: do not attempt to compile SSE4 on i386

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoceph_test_rados_api_tier: improve cache tier + scrub test 1565/head
Sage Weil [Sat, 29 Mar 2014 23:58:38 +0000 (16:58 -0700)]
ceph_test_rados_api_tier: improve cache tier + scrub test

Create lots of objects and make *some* of them be missing clones but not
all.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoosd/ReplicatedPG: tolerate trailing missing clones on cache tiers
Sage Weil [Sat, 29 Mar 2014 23:57:48 +0000 (16:57 -0700)]
osd/ReplicatedPG: tolerate trailing missing clones on cache tiers

I missed this case in eb71924ea27e78d97bd45674ef5e6a7fce30932f.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agojava/test: ceph.file.layout xattr is not there now
Sage Weil [Sat, 29 Mar 2014 21:25:47 +0000 (14:25 -0700)]
java/test: ceph.file.layout xattr is not there now

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoqa/workunits/fs/misc/layout_vxattrs: ceph.file.layout is not listed
Sage Weil [Sat, 29 Mar 2014 21:23:21 +0000 (14:23 -0700)]
qa/workunits/fs/misc/layout_vxattrs: ceph.file.layout is not listed

As of 08a3d6bd428c5e78dd4a10e6ee97540f66f9729c.  A similar change was made
in the kernel.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1548 from ceph/wip-7880
Sage Weil [Sat, 29 Mar 2014 16:20:07 +0000 (09:20 -0700)]
Merge pull request #1548 from ceph/wip-7880

mds: properly propagate dirty dirstat to auth inode

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agomds: find approximal bounds when adjusting subtree auth 1520/head
Yan, Zheng [Sat, 29 Mar 2014 02:36:12 +0000 (10:36 +0800)]
mds: find approximal bounds when adjusting subtree auth

When finishing exporting a subtree, the exporter MDS drops locks and
sends an MExportDirFinish message to the importer MDS. The bounds of
the subtree can get fragmented by a third party before the importer MDS
receives the MExportDirFinish message. So the importer MDS can add
inaccurate bounds to the EImportFinish event.

The fix is to find approximate bounds when finishing ambiguous imports.

11 years agodoc: erasure-code development complete
Loic Dachary [Sat, 29 Mar 2014 10:25:59 +0000 (11:25 +0100)]
doc: erasure-code development complete

Remove the note explaining that it is not yet available.

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agoerasure-code: do not attempt to compile SSE4 on i386 1563/head
Loic Dachary [Sat, 29 Mar 2014 09:34:29 +0000 (10:34 +0100)]
erasure-code: do not attempt to compile SSE4 on i386

SSE4 is only unavailable on older CPUs. Although the compiler could
probably generate the code, there is no point in doing so. The SSE4.1,
SSE4.2 and PCLMUL CPU features are only tested if the target CPU is
AMD64 or x86_64.

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agomds: commit new dirfrag before splitting it
Yan, Zheng [Fri, 28 Mar 2014 17:53:15 +0000 (01:53 +0800)]
mds: commit new dirfrag before splitting it

Commit 6e013cd6 (properly set COMPLETE flag when merging dirfrags)
tries to solve the issue that a new dirfrag's COMPLETE flag gets lost
if the MDS splits the new dirfrag and the fragment operation then gets
rolled back. It records the original dirfrag's COMPLETE flag when the
EFragment PREPARE event is encountered. If the fragment operation
needs to roll back, the COMPLETE flag is journaled in the corresponding
EFragment ROLLBACK event. This is problematic when the ROLLBACK
event and the "mkdir" event belong to different log segments. After
the log segment that contains the "mkdir" event is trimmed, the
dirfrag cannot be considered complete.

The fix is to commit the new dirfrag before splitting it. After the dirfrag
is committed to the object store, losing the COMPLETE flag is not a big deal.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agorbd.cc: yes, cover formatted output as well. sigh.
Dan Mick [Sat, 29 Mar 2014 01:10:43 +0000 (18:10 -0700)]
rbd.cc: yes, cover formatted output as well.  sigh.

Fixes: #7577
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Dan Mick <dan.mick@inktank.com>
11 years agoMerge pull request #1562 from onlyjob/debian
Sage Weil [Sat, 29 Mar 2014 01:13:05 +0000 (18:13 -0700)]
Merge pull request #1562 from onlyjob/debian

init: fix OSD startup issue

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoRevert "ceph-conf: do not log"
Sage Weil [Sat, 29 Mar 2014 01:08:32 +0000 (18:08 -0700)]
Revert "ceph-conf: do not log"

This reverts commit acc31e75a3e7115c00f9980609948455e3b2d49e.

11 years agoRevert "ceph-conf: no admin_socket"
Sage Weil [Sat, 29 Mar 2014 01:08:09 +0000 (18:08 -0700)]
Revert "ceph-conf: no admin_socket"

This reverts commit 72715b235a0daee7ab8e5cd3ab6e415de2939df9.

This breaks the ceph cli, which uses ceph-conf --show-config-value ... to
get the admin socket.

11 years agoinit: fix OSD startup issue 1562/head
Dmitry Smirnov [Sat, 29 Mar 2014 00:59:24 +0000 (11:59 +1100)]
init: fix OSD startup issue

On machines hosting both a MON and OSDs, the OSDs start shortly after the MON
at boot, but the MON needs time to become operational. The OSDs then fail to
start because the short timeout does not give them enough time to establish
communication with the cluster. This is even more likely when other monitors
are down, which is not unusual when servers reboot after a power failure.
Increasing the timeout significantly improves the chances of a successful OSD
start.

Signed-off-by: Dmitry Smirnov <onlyjob@member.fsf.org>
11 years agorbd.cc: tolerate lack of NUL-termination on block_name_prefix
Dan Mick [Wed, 26 Mar 2014 00:09:48 +0000 (17:09 -0700)]
rbd.cc: tolerate lack of NUL-termination on block_name_prefix

Fixes: #7577
Signed-off-by: Dan Mick <dan.mick@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1552 from ceph/wip-7902
Sage Weil [Sat, 29 Mar 2014 00:19:23 +0000 (17:19 -0700)]
Merge pull request #1552 from ceph/wip-7902

osd/PG: fix choose_acting revert to up case

Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years agoMerge pull request #1560 from ceph/wip-7903
Sage Weil [Fri, 28 Mar 2014 23:59:02 +0000 (16:59 -0700)]
Merge pull request #1560 from ceph/wip-7903

Wip 7903

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agomon/MonClient: use keepalive2 to verify the mon session is live 1556/head
Sage Weil [Fri, 28 Mar 2014 04:33:21 +0000 (21:33 -0700)]
mon/MonClient: use keepalive2 to verify the mon session is live

Verify that the mon is responding by checking the keepalive2 reply
timestamp.  We cannot rely solely on TCP timing out and returning an
error.

Fixes: #7888
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agomsgr: add KEEPALIVE2 feature
Sage Weil [Fri, 28 Mar 2014 04:09:13 +0000 (21:09 -0700)]
msgr: add KEEPALIVE2 feature

This is similar to KEEPALIVE, except a timestamp is also exchanged.  It is
sent with the KEEPALIVE, and then returned with the ACK.  The last
received stamp is stored in the Connection so that it can be queried for
liveness.  Since all of the users of keepalive are already regularly
triggering a keepalive, they can check the liveness at the same time.

See #7888.

Signed-off-by: Sage Weil <sage@inktank.com>
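
The exchange described in the commit above could be modelled as below. This is
a sketch under stated assumptions, not the messenger implementation: the
Connection class, the dict-shaped messages, and the timeout-based is_live
check are all hypothetical.

```python
class Connection:
    """Holds the last timestamp echoed back by the peer."""

    def __init__(self):
        self.last_keepalive_ack = None

    def handle_keepalive2_ack(self, stamp):
        self.last_keepalive_ack = stamp


def send_keepalive2(now):
    # The sender attaches its current time to the keepalive.
    return {"tag": "KEEPALIVE2", "stamp": now}


def peer_ack(msg):
    # The peer returns the same stamp with its ACK.
    return {"tag": "KEEPALIVE2_ACK", "stamp": msg["stamp"]}


def is_live(conn, now, timeout):
    """True if an ACK stamp was seen within the last `timeout` seconds."""
    if conn.last_keepalive_ack is None:
        return False
    return now - conn.last_keepalive_ack <= timeout
```

Because callers already send keepalives on a timer, they can run a check like
is_live in the same tick instead of waiting for TCP to report an error.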
11 years agoMerge pull request #1558 from ceph/wip-7837
Sage Weil [Fri, 28 Mar 2014 23:01:20 +0000 (16:01 -0700)]
Merge pull request #1558 from ceph/wip-7837

ReplicatedPG: include pending_attrs when reseting attrs in WRITEFULL

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1557 from ceph/wip-7867
Josh Durgin [Fri, 28 Mar 2014 22:08:07 +0000 (15:08 -0700)]
Merge pull request #1557 from ceph/wip-7867

client: fix assert(!unclean) due to readahead vs close race

Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
11 years agoclient: pin Inode during readahead 1557/head
Sage Weil [Thu, 27 Mar 2014 04:52:00 +0000 (21:52 -0700)]
client: pin Inode during readahead

Make sure the Inode does not go away while a readahead is in progress.  In
particular:

 - read_async
   - start a readahead
   - get actual read from cache, return
 - close/release
   - call ObjectCacher::release_set() and get unclean > 0, assert

Fixes: #7867
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
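
The race in the commit above comes down to reference counting: a readahead in
flight must hold its own reference so that close/release cannot free the
Inode underneath it. A toy model follows, in which the Inode class, its
refs/freed fields, and start_readahead are hypothetical names rather than the
client's real types.

```python
class Inode:
    def __init__(self):
        self.refs = 1       # reference held by the open file handle
        self.freed = False

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True


def start_readahead(inode):
    """Pin the inode and return the completion that unpins it."""
    inode.get()

    def on_finish():
        inode.put()

    return on_finish
```

With the pin in place, closing the file while the readahead is outstanding
only drops the handle's reference; the inode is freed when the completion
runs.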
11 years agoosdc/ObjectCacher: call read completion even when no target buffer
Sage Weil [Fri, 28 Mar 2014 19:34:07 +0000 (12:34 -0700)]
osdc/ObjectCacher: call read completion even when no target buffer

If we do not assemble a target bl, we still want to return a valid return
code with the number of bytes read ahead so that the C_RetryRead completion
will see this as a finish and call the caller's provided Context.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1553 from ceph/wip-7874
Sage Weil [Fri, 28 Mar 2014 21:07:50 +0000 (14:07 -0700)]
Merge pull request #1553 from ceph/wip-7874

ReplicatedPG: disable clone subsets for cache pools

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1554 from ceph/wip-7828
Sage Weil [Fri, 28 Mar 2014 21:06:24 +0000 (14:06 -0700)]
Merge pull request #1554 from ceph/wip-7828

ReplicatedPG:: s/_delete_head/_delete_oid, adjust head_exists iff is_hea...

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agoMerge pull request #1555 from ceph/wip-7835
Sage Weil [Fri, 28 Mar 2014 21:05:41 +0000 (14:05 -0700)]
Merge pull request #1555 from ceph/wip-7835

ReplicatedPG::make_writeable: fill in ssc on clone

Reviewed-by: Sage Weil <sage@inktank.com>
11 years agorgw: move max_chunk_size initialization 1560/head
Yehuda Sadeh [Fri, 28 Mar 2014 21:05:00 +0000 (14:05 -0700)]
rgw: move max_chunk_size initialization

RGWRados::initialize() is not called when doing
RGWRados::get_raw_storage_provider(). This was the culprit for issue

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agorgw: only look at prefetched data if we actually prefetched
Yehuda Sadeh [Fri, 28 Mar 2014 20:25:47 +0000 (13:25 -0700)]
rgw: only look at prefetched data if we actually prefetched

Fixes: #7903
If we didn't prefetch data, we can't rely on the data actually being
there. In that case just move on and read the object.

Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
11 years agoosd/PG: fix choose_acting revert to up case 1552/head
Sage Weil [Fri, 28 Mar 2014 20:10:06 +0000 (13:10 -0700)]
osd/PG: fix choose_acting revert to up case

If we decide to revert back to up, we need to

1- return false, so that we go into the NeedActingChange state, and
2- actually ask for that change.

It's too fugly to try to jump down to the existing queue_want_pg_temp
call 100+ lines down in this function, so just do it here.  We already
know that we are requesting to clear the pg_temp.

Fixes: #7902
Backport: emperor, dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
11 years agomds: don't trim non-auth root inode/dirfrag
Yan, Zheng [Thu, 27 Mar 2014 21:42:34 +0000 (05:42 +0800)]
mds: don't trim non-auth root inode/dirfrag

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: include authority of the overwrited inode in rename witnesses
Yan, Zheng [Wed, 26 Mar 2014 15:03:56 +0000 (23:03 +0800)]
mds: include authority of the overwrited inode in rename witnesses

The rename operation needs to adjust the overwritten inode's link count.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: don't increase nlink when rollback stray reintegration
Yan, Zheng [Wed, 26 Mar 2014 06:28:26 +0000 (14:28 +0800)]
mds: don't increase nlink when rollback stray reintegration

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: allow sending MMDSFindIno to MDS who is in clientreplay state
Yan, Zheng [Wed, 26 Mar 2014 10:55:19 +0000 (18:55 +0800)]
mds: allow sending MMDSFindIno to MDS who is in clientreplay state

This is needed because MDCache::kick_find_ino_peers() is called when an
MDS enters the clientreplay state.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: fix negative dirstat assertion
Yan, Zheng [Wed, 26 Mar 2014 02:15:15 +0000 (10:15 +0800)]
mds: fix negative dirstat assertion

When splitting a dirfrag, the delta dirstat is always added to the first new
dirfrag. Before the delta dirstat is propagated to the inode, unlinking files
from the remaining dirfrags can cause a negative inode dirstat.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: fix stack overflow caused by nested dispatch
Yan, Zheng [Wed, 26 Mar 2014 01:51:23 +0000 (09:51 +0800)]
mds: fix stack overflow caused by nested dispatch

Commit bc3325b37 fixes a stack overflow bug that happens when replaying
client requests. A similar stack overflow can happen when processing
finished contexts.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: don't clear scatter dirty when cache rejoin ack is received
Yan, Zheng [Mon, 24 Mar 2014 08:47:04 +0000 (16:47 +0800)]
mds: don't clear scatter dirty when cache rejoin ack is received

The auth MDS has received the dirty scatterlock state, but it hasn't
journaled the dirty state yet. The log segment that marked the
scatterlock dirty needs to be preserved. Therefore, we can't clear
the dirty flag of the scatterlock.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: explicitly set nonce for imported dirfrag
Yan, Zheng [Sun, 23 Mar 2014 10:18:19 +0000 (18:18 +0800)]
mds: explicitly set nonce for imported dirfrag

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: skip non-opened session when flushing client sessions
Yan, Zheng [Sun, 23 Mar 2014 05:32:00 +0000 (13:32 +0800)]
mds: skip non-opened session when flushing client sessions

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: fix null pointer dereference in MDCache::rejoin_send_rejoins()
Yan, Zheng [Sun, 23 Mar 2014 00:02:08 +0000 (08:02 +0800)]
mds: fix null pointer dereference in MDCache::rejoin_send_rejoins()

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: journal EFragment::OP_COMMIT before drop locks
Yan, Zheng [Sat, 22 Mar 2014 12:26:34 +0000 (20:26 +0800)]
mds: journal EFragment::OP_COMMIT before drop locks

Dropping locks can dispatch other requests. These requests can submit
log entries.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: fix CInode::get_approx_dirfrag
Yan, Zheng [Fri, 21 Mar 2014 23:38:22 +0000 (07:38 +0800)]
mds: fix CInode::get_approx_dirfrag

Return NULL if there is no opened dirfrag.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: don't trim ambiguous import dirfrags
Yan, Zheng [Sun, 23 Mar 2014 12:07:35 +0000 (20:07 +0800)]
mds: don't trim ambiguous import dirfrags

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: trim empty non-auth dirfrags
Yan, Zheng [Sun, 23 Mar 2014 09:47:05 +0000 (17:47 +0800)]
mds: trim empty non-auth dirfrags

Fragmenting a non-auth dirfrag results in several smaller dirfrags. Some
of the resulting dirfrags can be empty and are not needed to connect
to the auth subtree.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: trim non-auth inode with remote parents
Yan, Zheng [Fri, 21 Mar 2014 23:31:07 +0000 (07:31 +0800)]
mds: trim non-auth inode with remote parents

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: properly journal fragment rollback
Yan, Zheng [Fri, 21 Mar 2014 15:16:00 +0000 (23:16 +0800)]
mds: properly journal fragment rollback

If dirfrags are subtree roots, mark the dirfragtreelock as scattered
dirty, otherwise journal the dirfragtree change.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: fix CDir::WAIT_ANY_MASK
Yan, Zheng [Fri, 21 Mar 2014 01:50:41 +0000 (09:50 +0800)]
mds: fix CDir::WAIT_ANY_MASK

Make sure CDir::WAIT_ANY_MASK includes MDSCacheObject::WAIT_UNFREEZE.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: avoid journaling non-auth opened inode
Yan, Zheng [Thu, 20 Mar 2014 05:05:31 +0000 (13:05 +0800)]
mds: avoid journaling non-auth opened inode

An inode being exported still has the AUTH bit set while the EExport
event is being journaled.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: handle race between cache rejoin and fragmenting
Yan, Zheng [Thu, 20 Mar 2014 03:30:46 +0000 (11:30 +0800)]
mds: handle race between cache rejoin and fragmenting

MDCache::handle_cache_expire() ignores mismatched dirfrags. This is
OK during normal operation because the MDS doesn't trim replica inodes
whose dirfrags are likely being fragmented (see commit 22535340).

During recovery, the recovering MDS can receive a survivor MDS's cache
expire message before it sends cache rejoin acks. In this case,
there can still be mismatched dirfrags, but nothing prevents the
survivor MDS from trimming the inodes of these mismatched dirfrags. So
there can be unconnected dirfrags when the recovering MDS sends cache
rejoin acks.

The fix is, when a mismatched dirfrag is encountered during recovery,
to check whether the inode of the dirfrag is still replicated to the sender
MDS. If the inode is not replicated, remove the sender MDS from the replica
maps of all child dirfrags.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agomds: handle interaction between slave rollback and fragmenting
Yan, Zheng [Wed, 19 Mar 2014 11:56:26 +0000 (19:56 +0800)]
mds: handle interaction between slave rollback and fragmenting

For slave rename and rmdir events, the MDS needs to preserve the non-auth
dirfrag where the renamed inode originally lived until the slave commit
event is encountered. The current method to handle this is to use
MDCache::uncommitted_slave_rename_olddir to track any non-auth dirfrag that
needs to be preserved. This method does not work well if any preserved
dirfrag gets fragmented by a log event (such as ESubtreeMap) between the
slave prepare event and the slave commit event.

The fix is to track the inode of the dirfrag instead of directly tracking
the dirfrag that needs to be preserved.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agoMerge pull request #1549 from dachary/wip-doc
Sage Weil [Fri, 28 Mar 2014 15:23:46 +0000 (08:23 -0700)]
Merge pull request #1549 from dachary/wip-doc

doc: fix typos in tiering dev doc

11 years agodoc: fix typos in tiering dev doc 1549/head
Loic Dachary [Fri, 28 Mar 2014 13:01:53 +0000 (14:01 +0100)]
doc: fix typos in tiering dev doc

Signed-off-by: Loic Dachary <loic@dachary.org>
11 years agomds: properly propagate dirty dirstat to auth inode 1548/head
Yan, Zheng [Fri, 28 Mar 2014 04:57:29 +0000 (12:57 +0800)]
mds: properly propagate dirty dirstat to auth inode

Propagate dirty dirstat to freezing auth inode if the inode is
already auth pinned by the Mutation. Otherwise the dirstat can
be propagated to inode after client changes inode's mtime.

Fixes: #7880
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
11 years agoMerge pull request #1547 from ceph/wip-cache-scrub
Samuel Just [Fri, 28 Mar 2014 00:14:34 +0000 (17:14 -0700)]
Merge pull request #1547 from ceph/wip-cache-scrub

osd: improve scrub checks on clones; tolerate missing clones on cache pools

Fixes: #7885
Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years agoPipe: rename keepalive->send_keepalive
Greg Farnum [Wed, 26 Mar 2014 22:58:10 +0000 (15:58 -0700)]
Pipe: rename keepalive->send_keepalive

Signed-off-by: Greg Farnum <greg@inktank.com>
11 years agoMerge branch 'wip-7875'
Sage Weil [Thu, 27 Mar 2014 23:39:36 +0000 (16:39 -0700)]
Merge branch 'wip-7875'

Reviewed-by: Samuel Just <sam.just@inktank.com>
11 years agomon/OSDMonitor: require OSD_CACHEPOOL feature before using tiering features
Sage Weil [Thu, 27 Mar 2014 23:39:01 +0000 (16:39 -0700)]
mon/OSDMonitor: require OSD_CACHEPOOL feature before using tiering features

The OSDs need to support this feature before we allow users to turn it
on.  This is similar to what the erasure pool support does.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agomon/OSDMonitor: prevent setting hit_set unless all OSDs support it
Sage Weil [Thu, 27 Mar 2014 23:38:46 +0000 (16:38 -0700)]
mon/OSDMonitor: prevent setting hit_set unless all OSDs support it

We are using OSD_CACHEPOOL as a proxy for the support for the tiering
OSDMap infrastructure.

Signed-off-by: Sage Weil <sage@inktank.com>
11 years agoosd/ReplicatedPG: tolerate missing clones in cache pools 1547/head
Sage Weil [Thu, 27 Mar 2014 22:12:25 +0000 (15:12 -0700)]
osd/ReplicatedPG: tolerate missing clones in cache pools

A few cases:

- As we are working through the list, if we see a clone that is lower than
  the next one we were expecting, we should be able to skip them.
- If we see a head, we can skip all of the rest of the clones.
- If we get to the end and next_clone was set, we can ignore it.

Signed-off-by: Sage Weil <sage@inktank.com>
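
The three cases listed in the commit above can be sketched as a single walk
over the clones of one object. This is a simplified model, not the
ReplicatedPG::_scrub code: it assumes clone ids are integers visited in
descending order, and the names scrub_clones, expected, seen and
is_cache_pool are hypothetical.

```python
def scrub_clones(expected, seen, is_cache_pool):
    """Compare the clones a snapset expects with those actually scrubbed.

    Both lists are in descending clone-id order (an assumption of this
    sketch). Missing clones are tolerated on cache pools and reported
    as errors otherwise.
    """
    errors = []
    idx = 0
    for clone in seen:
        # The clone we see is lower than the one we expected next:
        # everything in between is missing.
        while idx < len(expected) and expected[idx] > clone:
            if not is_cache_pool:
                errors.append(f"missing clone {expected[idx]}")
            idx += 1
        if idx < len(expected) and expected[idx] == clone:
            idx += 1
        else:
            errors.append(f"unexpected clone {clone}")
    # Reaching the end (for example, on seeing the next head) with
    # expected clones left over: those are missing as well.
    for missing in expected[idx:]:
        if not is_cache_pool:
            errors.append(f"missing clone {missing}")
    return errors
```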
11 years agoosd/ReplicatedPG: improve clone vs head checking
Sage Weil [Thu, 27 Mar 2014 20:51:15 +0000 (13:51 -0700)]
osd/ReplicatedPG: improve clone vs head checking

- notice when we are missing a clone (that isn't at the end of the list)
- notice when we are missing a clone on the last object in the scrub map
- do not assert when we are missing a clone

There is still more we could do to improve this (like noticing one missing
clone but still checking the others), but we'll leave that aside for just
a moment...

Signed-off-by: Sage Weil <sage@inktank.com>