John Spray [Tue, 22 Jul 2014 01:08:08 +0000 (02:08 +0100)]
mds: handle replaying old format journals
To get back to the reformatting procedure that otherwise
occurs during MDLog::open, introduce an MDLog::reopen call
that MDS can use in the standbyreplay->standby transition
for the special case where the journal is old.
Fixes: #8869
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray [Mon, 21 Jul 2014 17:50:07 +0000 (18:50 +0100)]
mds: refactor MDS boot
* Make boot_start private.
* Define boot stages in enum, replace int with type.
* Merge steps 0 and 1; step 0 always fell through to 1.
* starting_done was only ever reached by a fall through
from the previous step, so call it directly from there.
John Spray [Thu, 17 Jul 2014 23:44:38 +0000 (00:44 +0100)]
mds: separate inode recovery queue from MDCache
Refactor to:
* have somewhere to put some logic for doing
background recovery in future.
* trim a few lines from the oversized MDCache.cc
wherever we can.
John Spray [Mon, 28 Jul 2014 14:32:12 +0000 (15:32 +0100)]
tools/cephfs: fuller header in dump/undump
There were two problems here:
* write_pos was modified through an undump/dump cycle,
because it was probed during recovery.
* stream format was being forgotten.
Raise the clone probability to 8% and lower the probability of flatten
to 2%. This should give us longer parent chains (before this we would
usually have one parent, and even then only for a few ops).
Truncate base images after they have been cloned from to cover more
code paths and make sure that clients look at snapshot parent_overlap
(i.e. parent_overlap of the base image at the time the snapshot was
taken) and not that of the base image (i.e. parent_overlap of the base
image as of now).
librbd: make rbd_get_parent_info() accept NULL out params
The C++ version of rbd_get_parent_info() allows passing NULL for parent
image name, image name and snapshot name out parameters. Make the C API do
the same both for consistency and to make it easier to check whether
the image at hand has a parent or not.
Currently, for pools with different rules, "ceph df" cannot report the
correct available space for each pool. For a detailed assessment of the
bug, please refer to bug report #8943.
This patch fixes the bug so that "ceph df" works correctly.
Haomai Wang [Sun, 27 Jul 2014 05:37:49 +0000 (13:37 +0800)]
Only write bufferhead when it's dirty
Buffer heads in the TX state should be skipped because they are already in
flight; only dirty buffer heads need to be written. Buffer heads in both the
TX and dirty states should still be waited on until they are flushed.
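The rule above can be sketched as a small model. This is an illustrative Python sketch, not the ObjectCacher code: `BufferHead`, `flush` and the state constants are hypothetical names chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical state names modelled on the commit message.
CLEAN, DIRTY, TX = "clean", "dirty", "tx"


@dataclass
class BufferHead:
    name: str
    state: str


def flush(buffer_heads):
    """Return (to_write, to_wait) for one flush pass.

    Only DIRTY buffer heads are submitted. TX buffer heads are already
    in flight, so they are waited on but never written a second time.
    """
    to_write, to_wait = [], []
    for bh in buffer_heads:
        if bh.state == DIRTY:
            to_write.append(bh.name)
            bh.state = TX            # the write is now in flight
            to_wait.append(bh.name)
        elif bh.state == TX:
            to_wait.append(bh.name)  # wait only, do not resubmit
    return to_write, to_wait
```

Calling `flush` twice in a row on the same list submits nothing the second time, which is the property the commit is after.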
Haomai Wang [Mon, 14 Jul 2014 06:27:17 +0000 (14:27 +0800)]
Add rbdcache max dirty object option
Librbd calculates the maximum number of dirty objects from rbd_cache_max_size,
which is not suitable for every case. If the user sets the image order to 24,
the calculated value is too small in practice. This increases the overhead of
the trim call, which is made on every read/write op.
Make this an option for tuning; by default the value is still calculated.
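A rough sketch of the "option overrides the calculated default" behaviour follows. The commit message does not give the formula, so the plain quotient used for the default is an assumption made for illustration, and `max_dirty_objects` and `configured` are hypothetical names.

```python
MiB = 1 << 20


def max_dirty_objects(cache_bytes, order, configured=0):
    """Pick the object limit for the cache.

    configured > 0 : use the operator-supplied value as is.
    configured == 0: fall back to a value derived from the cache size.
    The plain quotient below is an assumption, not the librbd formula.
    """
    if configured > 0:
        return configured
    object_bytes = 1 << order   # the image order is log2 of the object size
    return cache_bytes // object_bytes
```

Under this assumption a 32 MiB cache yields 8 objects at order 22 but only 2 at order 24, which shows why a large order leaves very little room before trimming starts.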
Haomai Wang [Mon, 14 Jul 2014 06:32:57 +0000 (14:32 +0800)]
Reduce ObjectCacher flush overhead
A flush op in ObjectCacher iterates over the whole active object set, and
each dirty object may also own several BufferHeads. If the object set is
large, this consumes too much time.
Use dirty_bh instead to reduce the overhead. Now only dirty BufferHeads are
checked.
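The idea of keeping a separate index of dirty buffers, so that a flush never scans clean data, can be sketched like this. It is a toy Python model: the `Cache` class and its methods are hypothetical, and only the name `dirty_bh` comes from the commit message.

```python
class Cache:
    """Toy cache that tracks dirty buffers in a dedicated set."""

    def __init__(self):
        self.objects = {}      # object name -> list of buffer names
        self.dirty_bh = set()  # (object, buffer) pairs that are dirty

    def _add(self, obj, buf):
        bufs = self.objects.setdefault(obj, [])
        if buf not in bufs:
            bufs.append(buf)

    def read(self, obj, buf):
        # A read caches the buffer but leaves it clean.
        self._add(obj, buf)

    def write(self, obj, buf):
        # A write caches the buffer and records it as dirty.
        self._add(obj, buf)
        self.dirty_bh.add((obj, buf))

    def flush(self):
        # Cost is proportional to the number of dirty buffers,
        # not to the number of cached objects.
        flushed = sorted(self.dirty_bh)
        self.dirty_bh.clear()
        return flushed
```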
The dirty_or_tx list is used by flush_set, which means we can
resubmit new IOs for writes that are already in progress. This
has a compounding effect that overwhelms the OSDs with dup IOs
and stalls out the client.
See, for example, the failures in this run:
/a/sage-2014-07-25_17:14:20-fs-wip-msgr-testing-basic-plana
The fix is probably pretty simple, but reverting for now to make
the tests pass.
John Spray [Thu, 17 Jul 2014 12:15:45 +0000 (13:15 +0100)]
mds: fix journal reformat failure in standbyreplay
In the 0.82 release, standbyreplay MDS daemons would try
to reformat the journal if they saw an older version on
disk, where this should have only been done by the active
MDS for the rank. Depending on timing, this could cause
fatal corruption of the journal.
This change handles the following cases:
* only do reformat if not in standbyreplay (else raise EAGAIN
to keep trying until an active mds reformats it)
* if journal header goes away while in standbyreplay then raise
EAGAIN (handle rewrite happening in background)
* if journal version is greater than the max supported, suicide
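The three cases listed above can be sketched as a single decision function. This is a simplified Python illustration: `Action`, `journal_action` and the use of one `current_version` value as both the reformat target and the maximum supported version are assumptions, not the MDLog code.

```python
from enum import Enum


class Action(Enum):
    REPLAY = "replay"      # header is current, carry on
    REFORMAT = "reformat"  # rewrite the old journal in the new format
    EAGAIN = "eagain"      # back off and retry later
    SUICIDE = "suicide"    # journal is newer than this daemon understands


def journal_action(on_disk_version, current_version, standby_replay):
    """Decide what to do after trying to read the journal header.

    on_disk_version is None when the header could not be found.
    """
    if on_disk_version is None:
        if standby_replay:
            # The active MDS may be rewriting the journal in the background.
            return Action.EAGAIN
        # This case is not covered by the commit message; raising is an assumption.
        raise LookupError("journal header missing")
    if on_disk_version > current_version:
        return Action.SUICIDE
    if on_disk_version < current_version:
        # Only the active MDS for the rank may reformat.
        return Action.EAGAIN if standby_replay else Action.REFORMAT
    return Action.REPLAY
```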
Sage Weil [Fri, 25 Jul 2014 21:48:10 +0000 (14:48 -0700)]
osd/ReplicatedPG: requeue cache full waiters if no longer writeback
If the cache is full and we block some requests, and we then change the
cache_mode to something else (say, forward), the full waiters don't get
requeued until the cache becomes un-full. In the meantime, however, later
requests will get processed and redirected, breaking the op ordering.
Fix this by requeueing any full waiters if we see that the cache_mode is
not writeback.
Fixes: #8931
Signed-off-by: Sage Weil <sage@redhat.com>
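The ordering problem and the fix described in this commit can be modelled with two queues. This is a minimal Python sketch: the `PG` class, its fields and `set_cache_mode` are hypothetical names, and the real ReplicatedPG logic is more involved.

```python
from collections import deque


class PG:
    """Toy placement group with an op queue and a cache-full wait list."""

    def __init__(self):
        self.cache_mode = "writeback"
        self.cache_full = False
        self.op_queue = deque()
        self.full_waiters = deque()
        self.processed = []

    def submit(self, op):
        self.op_queue.append(op)

    def process(self):
        while self.op_queue:
            op = self.op_queue.popleft()
            if self.cache_mode == "writeback" and self.cache_full:
                self.full_waiters.append(op)  # block until un-full
            else:
                self.processed.append(op)     # handled or redirected

    def set_cache_mode(self, mode):
        self.cache_mode = mode
        if mode != "writeback" and self.full_waiters:
            # Requeue blocked ops at the front, in their original order,
            # so that later ops cannot overtake them.
            self.op_queue.extendleft(reversed(self.full_waiters))
            self.full_waiters.clear()
```

Without the requeue in `set_cache_mode`, an op submitted after the mode change would be processed while the earlier ops were still parked on the wait list.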
Sage Weil [Fri, 25 Jul 2014 20:17:32 +0000 (13:17 -0700)]
common/RefCountedObject: fix use-after-free in debug print
We could race with another thread that deletes this right after we call
dec(). Our access of cct would then become a use-after-free. Valgrind
managed to turn this up.
Copy it into a local variable before the dec() to be safe, and move the
dout line below to make this possibility explicit and obvious in the code.
Fixes: #8442
Backport: firefly
Data pools might have strict write alignment requirements. Use pool
alignment info when setting the max_chunk_size for the write.
Sage Weil [Fri, 25 Jul 2014 16:20:20 +0000 (09:20 -0700)]
osd: fix bad Message* defer in C_SendMap and send_map_on_destruct
We were carrying a bare Message*, which could get freed if the op was
canceled (or possibly completed). Instead, just stash the entity_name_t,
the only piece we need. The Connection is properly ref counted so no
worries there.
Fixes: #8926
Signed-off-by: Sage Weil <sage@redhat.com>
Ma Jianpeng [Wed, 23 Jul 2014 17:10:38 +0000 (10:10 -0700)]
os/FileJournal: Update the journal header when closing journal
When closing the journal, check must_write_header and update the
journal header if must_write_header is already set.
This avoids a needless journal replay after restarting the OSD.
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Sage Weil [Thu, 24 Jul 2014 01:25:53 +0000 (18:25 -0700)]
osd/ReplicatedPG: observe INCOMPLETE_CLONES in is_present_clone()
We cannot assume that just because cache_mode is NONE that we will have
all clones present; check for the absence of the INCOMPLETE_CLONES flag
here too.
Sage Weil [Thu, 24 Jul 2014 01:24:51 +0000 (18:24 -0700)]
osd/ReplicatedPG: observe INCOMPLETE_CLONES when doing clone subsets
During recovery, we can clone subsets if we know that all clones will be
present. We skip this on caching pools because they may not be; do the
same when INCOMPLETE_CLONES is set.
Sage Weil [Thu, 24 Jul 2014 01:23:56 +0000 (18:23 -0700)]
osd/ReplicatedPG: do not complain about missing clones when INCOMPLETE_CLONES is set
When scrubbing, do not complain about missing clones when we are in a
caching mode *or* when the INCOMPLETE_CLONES flag is set. Both are
indicators that we may be missing clones and that that is okay.
Fixes: #8882
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Thu, 24 Jul 2014 01:21:38 +0000 (18:21 -0700)]
osd/osd_types: add pg_pool_t FLAG_INCOMPLETE_CLONES
Set a flag on the pg_pool_t when we change cache_mode to NONE. This
is because object promotion may promote heads without all of the clones,
and when we switch the cache_mode back those objects may remain. Do
this on any cache_mode change (to or from NONE) to capture legacy
pools that were set up before this flag existed.
Sage Weil [Sat, 19 Jul 2014 06:16:09 +0000 (23:16 -0700)]
os/LFNIndex: only consider alt xattr if nlink > 1
If we are doing a lookup and the main xattr fails, we'll check if there is an
alt xattr. If it exists, but the nlink on the inode is only 1, we will
kill the xattr. This cleans up the mess left over by an incomplete
lfn_unlink operation.
This resolves the problem with an lfn_link to a second long name that
hashes to the same short_name: we will ignore the old name the moment the
old link goes away.
Fixes: #8701
Signed-off-by: Sage Weil <sage@redhat.com>
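The lookup rule from this commit can be sketched as follows. It is an illustrative Python model: `Inode`, `matches` and the xattr keys "lfn" and "alt" are hypothetical stand-ins for the real LFNIndex attributes.

```python
class Inode:
    """Toy inode with a link count and the two name xattrs."""

    def __init__(self, nlink, lfn, alt=None):
        self.nlink = nlink
        self.xattrs = {"lfn": lfn}
        if alt is not None:
            self.xattrs["alt"] = alt


def matches(inode, long_name):
    """Return True if long_name refers to this inode."""
    if inode.xattrs.get("lfn") == long_name:
        return True
    if "alt" not in inode.xattrs:
        return False
    if inode.nlink > 1:
        # A second link really exists, so the alt name is still valid.
        return inode.xattrs["alt"] == long_name
    # Only one link is left: the alt xattr is debris from an incomplete
    # unlink, so remove it and ignore the old name.
    del inode.xattrs["alt"]
    return False
```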
Sage Weil [Sat, 19 Jul 2014 00:28:18 +0000 (17:28 -0700)]
os/LFNIndex: remove alt xattr after unlink
After we unlink, if the nlink on the inode is still non-zero, remove the
alt xattr. We can *only* do this after the rename or unlink operation
because we don't want to leave a file system link in place without the
matching xattr; hence the fsync_dir() call.
Note that this might leak an alt xattr if we happen to fail after the
rename/unlink but before the removexattr is committed. We'll fix that
next.
Sage Weil [Sat, 19 Jul 2014 00:09:07 +0000 (17:09 -0700)]
os/LFNIndex: handle long object names with multiple links (i.e., rename)
When we rename an object (collection_move_rename) to a different name, and
the name is long, we run into problems because the lfn xattr can only track
a single long name linking to the inode. For example, suppose we have
foobar -> foo_123_0 (attr: foobar) where foobar hashes to 123.
At first, collection_add could only link a file to another file in a
different collection with the same name. Allowing collection_move_rename
to rename the file, however, means that we have to convert:
col1/foobar -> foo_123_0 (attr: foobar)
to
col1/foobaz -> foo_234_0 (attr: foobaz)
This is a problem because if we link, reset xattr, unlink we end up with
col1/foobar -> foo_123_0 (attr: foobaz)
if we restart after we reset the attr. This will cause the initial foobar
lookup to fail since the attr doesn't match, and the file won't be able to be
looked up.
Fix this by allowing *two* (long) names to link to the same inode. If we
lfn_link a second (different) name, move the previous name to the "alt"
xattr and set the new name. (This works because link is always followed
by unlink.) On lookup, check either xattr.
Don't even bother to remove the alt xattr on unlink. This works as long
as the old name and new name don't hash to the same shortname and end up
in the same LFN chain. (Don't worry, we'll fix that next.)
Fixes part of #8701
Signed-off-by: Sage Weil <sage@redhat.com>
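The two-name scheme described in this commit can be sketched with a plain dictionary standing in for the inode's xattrs. The function names mirror the commit message, but the bodies and the keys "lfn" and "alt" are assumptions made for illustration.

```python
def lfn_link(xattrs, long_name):
    """Record long_name on the inode, keeping the previous name as alt."""
    current = xattrs.get("lfn")
    if current is not None and current != long_name:
        # Linking a second, different name: keep the old one reachable.
        xattrs["alt"] = current
    xattrs["lfn"] = long_name


def lfn_lookup(xattrs, long_name):
    """A name matches if either xattr holds it."""
    return long_name in (xattrs.get("lfn"), xattrs.get("alt"))
```

With this in place, a restart between the link and the unlink leaves both foobar and foobaz resolvable, instead of leaving foobar pointing at an inode whose attribute no longer matches.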
Dan Mick [Thu, 3 Jul 2014 23:08:44 +0000 (16:08 -0700)]
Fix/add missing dependencies:
- rbd-fuse depends on librados2/librbd1
- ceph-devel depends on specific releases of libs and libcephfs_jni1
- librbd1 depends on librados2
- python-ceph does not depend on libcephfs1
Sage Weil [Tue, 22 Jul 2014 20:16:11 +0000 (13:16 -0700)]
osd/ReplicatedPG: greedily take write_lock for copyfrom finish, snapdir
In the cases where we are taking a write lock and are careful
enough that we know we should succeed (i.e., we assert(got)),
use the get_write_greedy() variant that skips the checks for
waiters (be they ops or backfill) that are normally necessary
to avoid starvation. We don't care about starvation here
because our op is already in-progress and can't easily be
aborted, and new ops won't start because they do make those
checks.
Fixes: #8889
Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Tue, 22 Jul 2014 20:11:42 +0000 (13:11 -0700)]
osd: allow greedy get_write() for ObjectContext locks
Several lockers need to take a write lock for an operation
that is already in progress, and they know it is safe to do
so. In particular, they need to skip the starvation checks
(op waiters, backfill waiting).
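The difference between the normal and the greedy variant can be sketched as follows. This is a simplified Python model: `RWState`, its fields and the `greedy` parameter are hypothetical names patterned on the commit message, not the ObjectContext implementation.

```python
class RWState:
    """Toy read/write lock state with starvation checks."""

    def __init__(self):
        self.state = "none"            # "none", "read" or "write"
        self.count = 0
        self.waiters = []              # ops queued behind this lock
        self.backfill_waiting = False

    def get_write(self, greedy=False):
        if not greedy and (self.waiters or self.backfill_waiting):
            # Normal path: do not starve whoever is already waiting.
            return False
        if self.state == "read":
            return False               # readers hold the lock
        self.state = "write"
        self.count += 1
        return True
```

New ops use the default path and so still yield to waiters. Only an op that is already in progress passes `greedy=True`.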