John Spray [Tue, 22 Jul 2014 01:08:08 +0000 (02:08 +0100)]
mds: handle replaying old format journals
To get back to the reformatting procedure that otherwise
occurs during MDLog::open, introduce an MDLog::reopen call
that MDS can use in the standbyreplay->standby transition
for the special case where the journal is old.
Fixes: #8869
Signed-off-by: John Spray <john.spray@redhat.com>
John Spray [Mon, 21 Jul 2014 17:50:07 +0000 (18:50 +0100)]
mds: refactor MDS boot
* Make boot_start private.
* Define boot stages in enum, replace int with type.
* Merge steps 0 and 1; 0 always fell through to 1.
* starting_done was only ever reached by a fall through
from the previous step, so call it directly from there.
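As a rough illustration of the enum-based boot stages described above, here is a minimal Python sketch. It is not the MDS code: the stage names, the NEXT_STEP table and run_boot are hypothetical, chosen only to show typed stages replacing bare ints and the completion hook being called directly from the last stage.

```python
from enum import Enum


class BootStep(Enum):
    # Hypothetical stage names; the commit only says the stages become an enum.
    INITIAL = "initial"          # stands in for the merged former steps 0 and 1
    OPEN_ROOT = "open_root"
    PREPARE_LOG = "prepare_log"
    REPLAY_DONE = "replay_done"


# Successor of each stage; None marks the last stage.
NEXT_STEP = {
    BootStep.INITIAL: BootStep.OPEN_ROOT,
    BootStep.OPEN_ROOT: BootStep.PREPARE_LOG,
    BootStep.PREPARE_LOG: BootStep.REPLAY_DONE,
    BootStep.REPLAY_DONE: None,
}


def run_boot(start=BootStep.INITIAL):
    """Walk the stages in order and return the names visited."""
    if not isinstance(start, BootStep):
        # A bare int is rejected, which is the point of using a type.
        raise TypeError("start must be a BootStep")
    trace = []
    step = start
    while step is not None:
        trace.append(step.name)
        step = NEXT_STEP[step]
    # The last stage invokes the completion hook directly instead of
    # falling through to a separate numbered step.
    trace.append("starting_done")
    return trace
```

Calling run_boot() walks all four stages and then records the completion hook; passing a plain integer raises TypeError.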
John Spray [Thu, 17 Jul 2014 23:44:38 +0000 (00:44 +0100)]
mds: separate inode recovery queue from MDCache
Refactor to:
* have somewhere to put some logic for doing
background recovery in future.
* trim a few lines from the oversized MDCache.cc
wherever we can.
John Spray [Mon, 28 Jul 2014 14:32:12 +0000 (15:32 +0100)]
tools/cephfs: fuller header in dump/undump
There were two problems here:
* write_pos was modified through an undump/dump cycle,
because it was probed during recovery.
* stream format was being forgotten.
Raise the clone probability to 8% and lower the flatten probability
to 2%. This should give us longer parent chains (before this we would
usually have one parent, and even then only for a few ops).
Truncate base images after they have been cloned from to cover more
code paths and make sure that clients look at snapshot parent_overlap
(i.e. parent_overlap of the base image at the time the snapshot was
taken) and not that of the base image (i.e. parent_overlap of the base
image as of now).
librbd: make rbd_get_parent_info() accept NULL out params
The C++ version of rbd_get_parent_info() allows passing NULL for parent
image name, image name and snapshot name out parameters. Make C API do
the same both for consistency and to make it easier to check whether
the image at hand has a parent or not.
Currently, for pools with different rules, "ceph df" cannot report the
correct available space for each pool. For a detailed assessment of the
bug, please refer to bug report #8943.
This patch fixes the bug so that "ceph df" works correctly.
Haomai Wang [Sun, 27 Jul 2014 05:37:49 +0000 (13:37 +0800)]
Only write bufferhead when it's dirty
A bh in the TX state should be skipped because it is already in flight; we
only need to write dirty bhs. Both TX and dirty bhs should still be waited
on until they are flushed.
Haomai Wang [Mon, 14 Jul 2014 06:27:17 +0000 (14:27 +0800)]
Add rbdcache max dirty object option
Librbd calculates the max dirty object count from rbd_cache_max_size, which
is not suitable for every case. If the user sets image order 24, the
calculated result is too small in practice. This increases the overhead of
the trim call, which is made on each read/write op.
Now we make it an option for tuning; by default this value is calculated.
Haomai Wang [Mon, 14 Jul 2014 06:32:57 +0000 (14:32 +0800)]
Reduce ObjectCacher flush overhead
A flush op in ObjectCacher iterates over the whole active object set, and
each dirty object may also own several BufferHeads. If the object set is
large, this consumes too much time.
Use dirty_bh instead to reduce overhead. Now only dirty BufferHeads are
checked.
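The difference between the two flush strategies can be sketched as follows. This is a simplified Python model, not ObjectCacher itself: the BufferHead and Cache classes, the string states, and the visit counters are assumptions made for illustration; only the idea of keeping a separate dirty_bh set comes from the commit message.

```python
class BufferHead:
    def __init__(self, oid, offset, state):
        self.oid = oid
        self.offset = offset
        self.state = state  # "clean", "dirty" or "tx" in this toy model


class Cache:
    def __init__(self):
        self.objects = {}      # oid -> list of BufferHead
        self.dirty_bh = set()  # only the dirty BufferHeads

    def add(self, bh):
        self.objects.setdefault(bh.oid, []).append(bh)
        if bh.state == "dirty":
            self.dirty_bh.add(bh)

    def flush_scan_all(self):
        """Old approach: visit every BufferHead of every object."""
        visited, to_write = 0, []
        for bhs in self.objects.values():
            for bh in bhs:
                visited += 1
                if bh.state == "dirty":
                    to_write.append(bh)
        return to_write, visited

    def flush_dirty_only(self):
        """New approach: visit only what is tracked in dirty_bh."""
        to_write = sorted(self.dirty_bh, key=lambda b: (b.oid, b.offset))
        return to_write, len(to_write)


def build_example():
    cache = Cache()
    for i in range(100):
        cache.add(BufferHead(i, 0, "clean"))
    for i in (7, 42, 99):
        cache.add(BufferHead(i, 4096, "dirty"))
    return cache
```

In the example cache, both methods select the same three BufferHeads, but the full scan touches 103 entries while the dirty-only pass touches 3.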
The dirty_or_tx list is used by flush_set, which means we can
resubmit new IOs for writes that are already in progress. This
has a compounding effect that overwhelms the OSDs with dup IOs
and stalls out the client.
See, for example, the failures in this run:
/a/sage-2014-07-25_17:14:20-fs-wip-msgr-testing-basic-plana
The fix is probably pretty simple, but reverting for now to make
the tests pass.
John Spray [Thu, 17 Jul 2014 12:15:45 +0000 (13:15 +0100)]
mds: fix journal reformat failure in standbyreplay
In the 0.82 release, standbyreplay MDS daemons would try
to reformat the journal if they saw an older version on
disk, where this should have only been done by the active
MDS for the rank. Depending on timing, this could cause
fatal corruption of the journal.
This change handles the following cases:
* only do reformat if not in standbyreplay (else raise EAGAIN
to keep trying until an active mds reformats it)
* if journal header goes away while in standbyreplay then raise
EAGAIN (handle rewrite happening in background)
* if journal version is greater than the max supported, suicide
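The three cases above can be read as a small decision function. The sketch below is Python and purely illustrative: the function name, its parameters and the string results are hypothetical, "retry" stands for raising EAGAIN, and the handling of a missing header outside standbyreplay is an assumption because the commit message does not describe that case.

```python
def journal_open_action(header_present, on_disk_format, max_supported, standby_replay):
    """Return one of 'replay', 'reformat', 'retry', 'suicide' or 'error'."""
    if not header_present:
        if standby_replay:
            # The header may vanish while the active MDS rewrites the
            # journal in the background, so try again later (EAGAIN).
            return "retry"
        # Assumption: not covered by the commit message.
        return "error"
    if on_disk_format > max_supported:
        # The journal is newer than this daemon understands.
        return "suicide"
    if on_disk_format < max_supported:
        # Old format: only the active MDS may reformat; a standbyreplay
        # daemon keeps retrying until that has happened.
        return "retry" if standby_replay else "reformat"
    return "replay"
```

For example, an old-format journal yields "reformat" on an active MDS and "retry" on a standbyreplay daemon.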
Sage Weil [Fri, 25 Jul 2014 21:48:10 +0000 (14:48 -0700)]
osd/ReplicatedPG: requeue cache full waiters if no longer writeback
If the cache is full, we block some requests, and then we change the
cache_mode to something else (say, forward), the full waiters don't get
requeued until the cache becomes un-full. In the meantime, however, later
requests will get processed and redirected, breaking the op ordering.
Fix this by requeueing any full waiters if we see that the cache_mode is
not writeback.
Fixes: #8931
Signed-off-by: Sage Weil <sage@redhat.com>
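To make the ordering problem concrete, here is a toy Python model of the full-waiter queue. It is a sketch under stated assumptions, not ReplicatedPG: the CacheTier class, its fields and the run helper are hypothetical, and "processed" simply records the order in which ops get past the full-cache check.

```python
from collections import deque


class CacheTier:
    def __init__(self, requeue_on_mode_change):
        self.cache_mode = "writeback"
        self.full = True
        self.full_waiters = deque()
        self.processed = []
        self.requeue_on_mode_change = requeue_on_mode_change

    def _requeue_full_waiters(self):
        # Release blocked ops in their original arrival order.
        while self.full_waiters:
            self.processed.append(self.full_waiters.popleft())

    def handle(self, op):
        if self.requeue_on_mode_change and self.cache_mode != "writeback":
            # The fix: blocked ops go first once we are no longer writeback.
            self._requeue_full_waiters()
        if self.cache_mode == "writeback" and self.full:
            self.full_waiters.append(op)
            return
        self.processed.append(op)


def run(requeue_on_mode_change):
    tier = CacheTier(requeue_on_mode_change)
    tier.handle("a")             # blocked: cache is full
    tier.handle("b")             # blocked: cache is full
    tier.cache_mode = "forward"  # mode changes while the cache is still full
    tier.handle("c")
    return tier.processed
```

With the requeue enabled, run(True) yields the ops in arrival order; without it, run(False) lets "c" overtake the two blocked ops.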
Sage Weil [Fri, 25 Jul 2014 20:17:32 +0000 (13:17 -0700)]
common/RefCountedObject: fix use-after-free in debug print
We could race with another thread that deletes this right after we call
dec(). Our access of cct would then become a use-after-free. Valgrind
managed to turn this up.
Copy it into a local variable before the dec() to be safe, and move the
dout line below to make this possibility explicit and obvious in the code.
Fixes: #8442
Backport: firefly
Data pools might have strict write alignment requirements. Use pool
alignment info when setting the max_chunk_size for the write.
Sage Weil [Fri, 25 Jul 2014 16:20:20 +0000 (09:20 -0700)]
osd: fix bad Message* defer in C_SendMap and send_map_on_destruct
We were carrying a bare Message*, which could get freed if the op was
canceled (or possibly completed). Instead, just stash the entity_name_t,
the only piece we need. The Connection is properly ref counted so no
worries there.
Fixes: #8926
Signed-off-by: Sage Weil <sage@redhat.com>
Ma Jianpeng [Wed, 23 Jul 2014 17:10:38 +0000 (10:10 -0700)]
os/FileJournal: Update the journal header when closing journal
When closing the journal, check must_write_header and update the
journal header if must_write_header is already set.
This avoids needless journal replay after restarting the osd.
Signed-off-by: Ma Jianpeng <jianpeng.ma@intel.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Sage Weil [Thu, 24 Jul 2014 01:25:53 +0000 (18:25 -0700)]
osd/ReplicatedPG: observe INCOMPLETE_CLONES in is_present_clone()
We cannot assume that just because cache_mode is NONE that we will have
all clones present; check for the absence of the INCOMPLETE_CLONES flag
here too.
Sage Weil [Thu, 24 Jul 2014 01:24:51 +0000 (18:24 -0700)]
osd/ReplicatedPG: observed INCOMPLETE_CLONES when doing clone subsets
During recovery, we can clone subsets if we know that all clones will be
present. We skip this on caching pools because they may not be; do the
same when INCOMPLETE_CLONES is set.
Sage Weil [Thu, 24 Jul 2014 01:23:56 +0000 (18:23 -0700)]
osd/ReplicatedPG: do not complain about missing clones when INCOMPLETE_CLONES is set
When scrubbing, do not complain about missing clones when we are in a
caching mode *or* when the INCOMPLETE_CLONES flag is set. Both are
indicators that we may be missing clones and that that is okay.
Fixes: #8882
Signed-off-by: Sage Weil <sage@redhat.com>
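The scrub rule in this commit reduces to a single predicate: missing clones are tolerated when the pool is in a caching mode or when INCOMPLETE_CLONES is set. A minimal Python sketch, with hypothetical function names and a string cache_mode standing in for the real pool fields:

```python
def may_be_missing_clones(cache_mode, incomplete_clones):
    """True when absent clones are expected and should not be reported."""
    return cache_mode != "none" or incomplete_clones


def scrub_missing_clone_errors(expected, present, cache_mode, incomplete_clones):
    """Return error strings for clones that should exist but do not."""
    if may_be_missing_clones(cache_mode, incomplete_clones):
        return []
    missing = sorted(set(expected) - set(present))
    return ["missing clone %s" % c for c in missing]
```

A pool with cache_mode "none" and the flag clear reports every absent clone; either condition alone silences the report.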
Sage Weil [Thu, 24 Jul 2014 01:21:38 +0000 (18:21 -0700)]
osd/osd_types: add pg_pool_t FLAG_COMPLETE_CLONES
Set a flag on the pg_pool_t when we change cache_mode from NONE. This
is because object promotion may promote heads without all of the clones,
and when we switch the cache_mode back those objects may remain. Do
this on any cache_mode change (to or from NONE) to capture legacy
pools that were set up before this flag existed.