mds: properly update rstat/fragstat when fragmentating directory
The stale rstat/dirstat check in CDir::merge() is wrong. dirfrag's
rstat/fragstat is stale if the accounted rstat/fragstat version isn't
equal to inode's rstat/dirstat version.
For CDir::split(), no need to worry about if the rstat/fragstat of the
origin dirfrag is stale. If it's stale, the rstat/fragstat of the
resulting dirfrags are stale too.
Yan, Zheng [Fri, 4 Oct 2013 07:10:37 +0000 (15:10 +0800)]
mds: delete orphan dirfrags during MDS recovers
This patch make the MDS use following steps to fragmentate directory.
1. freeze the old dirfrags
2. journal EFragment::OP_PREPARE
3. store the new dirfrags
4. journal EFragment::OP_COMMIT
5. delete the old dirfrags
6. journal EFragment::OP_FINISH
The newly introduced event EFragment::OP_FINISH indicates that all orphan
frags have been deleted. The new process guarantees that orphan frags can
be properly deleted if the MDS crashes while fragmentating directory.
Fragments with different 'bits' can be merged into one fragment. So we can't
use 'basefrag' and 'bits' to calculate the original fragments when rolling
back a merge operation.
We also can't rely on MDCache::adjust_dir_fragments() to restore the original
fragments' statistics during rolling back. This is because fragments are not
always complete in the rollback case.
mds: start internal MDS request for fragmentating directory
Start internal MDS request for fragmentating directory operation. With
MDS request, we can easily acquire locks required by the fragmentating
directory operation. (The old way to get locks is 'try lock' style,
which is not reliable)
David Zafman [Fri, 27 Sep 2013 00:42:13 +0000 (17:42 -0700)]
common, os, osd: Use common functions for safe file reading and writing
Add new safe_read_file() and safe_write_file() to update files atomically
Used instead of original OSD::read_meta(), OSD::write_meta() they are based on
Used by read_superblock() and write_superblock()
Used by write_version_stamp() and version_stamp_is_valid()
Fixes: #6422 Signed-off-by: David Zafman <david.zafman@inktank.com>
* Update to the current state of the ghobject implementaiton and the fact
that they encode the shard_t Although the pool also contains the shard
id, it is less relevant to understand the implementation.
* Update with the erasure code plugin infrastructure and the example
plugin now in master.
* Move jerasure to a separate page to be expanded and link it from the
toc
* Kill the partial read and writes notes as it will probably not be
implemented in the near future. Kill some of the notes because they
are no longer relevant.
Sage Weil [Tue, 1 Oct 2013 21:21:40 +0000 (14:21 -0700)]
osd: remove magical tmap -> omap conversion
This is incomplete and unfortunately unusable in its current state:
- it would only set USES_TMAP for old encoded object_info_t and tmapput,
but would NOT set it for tmapup
- a config option turned that off by default.
That means that the mds conversion from tmap -> omap won't be able to use
this because any existing cluster has tmap objects without the USES_TMAP
flag set. And we don't want to unconditionally try a tmap->omap conversion
on omap operations because there are lots of existing librados users out
there that will be negatively impacted by this.
Instead, the MDS will need to handle this conversion on the client side by
reading either tmap or omap objects and explicitly rewriting the content
with omap (while truncating the tmap data away).
Sage Weil [Wed, 2 Oct 2013 00:04:44 +0000 (17:04 -0700)]
osd: add ISDIRTY, UNDIRTY rados operations
ISDIRTY will query whether the dirty flag is set on an object. UNDIRTY
will explicitly clear it. Note that a user doing so will likely run amok
with the caching code.
Sage Weil [Tue, 1 Oct 2013 19:12:55 +0000 (12:12 -0700)]
osd/ReplicatedPG: update all find_object_context() users to handle whiteouts
In each case, we treat the whiteout as if we got an ENOENT.
We do not change the semantics of bool exists to avoid breaking lots of
potentially fragile code. We are only interested in changing the
user-visible behavior of the object, not the way it is internally stored
or managed.
This will likely be refined as we grow acutal users for whiteoutes in the
pool caching code.
Sage Weil [Tue, 1 Oct 2013 16:28:29 +0000 (09:28 -0700)]
osdc/ObjectCacher: limit writeback IOs generated while holding lock
While analyzing a log from Mike Dawson I saw a long stall while librbd's
objectcacher was starting lots (many hundreds) of IOs. Limit the amount of
time we spend doing this at a time to allow IO replies to be processed so
that the cache remains responsive.
I'm not sure this warrants a tunable (which we would need to add for both
libcephfs and librbd).
Yehuda Sadeh [Mon, 26 Aug 2013 18:16:08 +0000 (11:16 -0700)]
rgw: quiet down warning message
Fixes: #6123
We don't want to know about failing to read region map info
if it's not found, only if failed on some other error. In
any case it's just a warning.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
Dan Mick [Fri, 27 Sep 2013 05:24:37 +0000 (22:24 -0700)]
ceph_argparse.py: clean up error reporting when required param missing
Treat "need 1, got 0" as a special case, and change the message to
"missing required parameter <x>". Also, when failing for that reason,
print the command concise description and its helptext.
Fixes: #6384 Signed-off-by: Dan Mick <dan.mick@inktank.com>
Fixes: #6444
Backport: dumpling
If pool creation fails (e.g., due to -EEXIST) then we leak the
completion object. Earlier we couldn't just drop the reference, as
librados have already removed the internal completion object. This fix
drop the completion reference even if got an error, which is now
possible.
librados: pool async create / delete does not delete completion handle
Backport: dumpling
The pool async delete / create function used to delete the internal
completion object. However, caller still holds the allocated completion
object, which it can't drop a reference to (as it'd try to deallocate
the already freed internal object). This fix removes the internal object
deletion, a following commit will fix a related leak (#6444) by having
the application (radosgw) drop the reference even if got an error.
Objecter: add "honor_cache_redirects" flag covering cache settings
When set to false, we do not redirect based on the cache_pool data
in the OSDMap. We'll use this so the OSDs can actually fetch data
into the cache pools on promotion! Signed-off-by: Greg Farnum <greg@inktank.com>
Sage Weil [Thu, 5 Sep 2013 04:29:11 +0000 (21:29 -0700)]
common/crc32c_intel_fast: avoid reading partial trailing word
The optimized intel code reads in word-sized chunks, knowing that the
allocator will only hand out memory in word-sized increments. This makes
valgrind unhappy. Whitelisting doesn't work because for some reason there
is no caller context (probably because of some interaction with yasm?).
Instead, just use the baseline code for the last few bytes. This should
not be significant.
Dan Mick [Fri, 27 Sep 2013 01:00:31 +0000 (18:00 -0700)]
ceph_argparse.py, cephtool/test.sh: fix blacklist with no nonce
It's legal to give a CephEntityAddr to osd blacklist with no nonce,
so allow it in the valid() method; also add validation of any nonce
given that it's a long >= 0.
Also fix comment on CephEntityAddr type description in MonCommands.h,
and add tests for invalid nonces (while fixing the existing tests to remove
the () around expect_false args).
Fixes: #6425 Signed-off-by: Dan Mick <dan.mick@inktank.com>