Note that this can happen if we fail to reconnect to an MDS during its
reconnect interval. If that happens, we probably have inodes in our
cache with no caps and things are generally not going to work very well.
This is but one step in improving the situation.
Separate out the two methods since they share little/no behavior.
majianpeng [Thu, 1 Aug 2013 03:19:02 +0000 (11:19 +0800)]
ceph: Update FUSE_USE_VERSION from 26 to 30.
Compilation failed with this error:
>In file included from /usr/local/include/fuse/fuse.h:19:0,
> from client/fuse_ll.cc:17:
>/usr/local/include/fuse/fuse_common.h:474:4: error: #error only API
>version 30 or greater is supported
Update FUSE_USE_VERSION from 26 to 30.
Yan, Zheng [Fri, 9 Aug 2013 05:43:54 +0000 (13:43 +0800)]
client: trim deleted inode
A previous patch makes the MDS send a notification to clients when an inode
is deleted. On receiving such a notification, we invalidate any
dentry linked to the deleted inode. If there is no other reference to
the inode, the inode gets trimmed.
For the cephfs fuse client, we use fuse_lowlevel_notify_inval_entry() or
fuse_lowlevel_notify_delete() to notify the kernel to trim the deleted
inode. (This is not completely reliable because we play unlink/link
tricks when handling MDS replies. It's difficult to keep the user space
cache and kernel dcache in sync.)
David Zafman [Fri, 27 Sep 2013 00:42:13 +0000 (17:42 -0700)]
common, os, osd: Use common functions for safe file reading and writing
Add new safe_read_file() and safe_write_file() to update files atomically.
Used instead of the original OSD::read_meta() and OSD::write_meta(), on which they are based.
Used by read_superblock() and write_superblock().
Used by write_version_stamp() and version_stamp_is_valid().
Fixes: #6422
Signed-off-by: David Zafman <david.zafman@inktank.com>
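The atomic update described above is commonly done by writing a temporary file and renaming it over the target. The following is a minimal Python sketch of that pattern, not the C++ code from this commit: the function names are borrowed from the commit message, but the signatures, the temporary-file naming, and the directory fsync are assumptions, and it assumes a POSIX filesystem.

```python
import os
import tempfile


def safe_write_file(base, name, data):
    """Hypothetical helper: atomically replace base/name with `data` (bytes)."""
    # Create the temporary file in the same directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(prefix=name + ".", suffix=".tmp", dir=base)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data must be on disk before the rename
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, os.path.join(base, name))
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(base, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def safe_read_file(base, name, max_len):
    """Hypothetical helper: read at most `max_len` bytes of base/name."""
    with open(os.path.join(base, name), "rb") as f:
        return f.read(max_len)
```

A caller such as a superblock or version-stamp writer would call safe_write_file() with the full new content each time rather than editing the file in place.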
* Update to the current state of the ghobject implementation and the fact
that they encode the shard_t. Although the pool also contains the shard
id, it is less relevant to understanding the implementation.
* Update with the erasure code plugin infrastructure and the example
plugin now in master.
* Move jerasure to a separate page to be expanded and link it from the
toc
* Kill the notes on partial reads and writes, as these will probably not be
implemented in the near future. Kill some of the other notes because they
are no longer relevant.
Sage Weil [Tue, 1 Oct 2013 21:21:40 +0000 (14:21 -0700)]
osd: remove magical tmap -> omap conversion
This is incomplete and unfortunately unusable in its current state:
- it would only set USES_TMAP for old encoded object_info_t and tmapput,
but would NOT set it for tmapup
- a config option turned that off by default.
That means that the mds conversion from tmap -> omap won't be able to use
this because any existing cluster has tmap objects without the USES_TMAP
flag set. And we don't want to unconditionally try a tmap->omap conversion
on omap operations because there are lots of existing librados users out
there that will be negatively impacted by this.
Instead, the MDS will need to handle this conversion on the client side by
reading either tmap or omap objects and explicitly rewriting the content
with omap (while truncating the tmap data away).
Sage Weil [Wed, 2 Oct 2013 00:04:44 +0000 (17:04 -0700)]
osd: add ISDIRTY, UNDIRTY rados operations
ISDIRTY will query whether the dirty flag is set on an object. UNDIRTY
will explicitly clear it. Note that a user doing so will likely run afoul
of the caching code.
Sage Weil [Tue, 1 Oct 2013 19:12:55 +0000 (12:12 -0700)]
osd/ReplicatedPG: update all find_object_context() users to handle whiteouts
In each case, we treat the whiteout as if we got an ENOENT.
We do not change the semantics of bool exists to avoid breaking lots of
potentially fragile code. We are only interested in changing the
user-visible behavior of the object, not the way it is internally stored
or managed.
This will likely be refined as we grow actual users for whiteouts in the
pool caching code.
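As a rough illustration of the rule above, the sketch below models a lookup that reports a whiteout to the caller as ENOENT while leaving the internal exists flag alone. The ObjectState class, the dict-based store, and lookup_for_user() are hypothetical stand-ins, not the ReplicatedPG code.

```python
import errno


class ObjectState:
    """Hypothetical stand-in for per-object state kept by the OSD."""

    def __init__(self, exists, whiteout=False):
        self.exists = exists      # internal meaning is left unchanged
        self.whiteout = whiteout  # set when the object is a whiteout


def lookup_for_user(store, oid):
    """Return (errno-style result, state) as a client-facing caller sees it."""
    obs = store.get(oid)
    # A whiteout still "exists" internally, but the user-visible answer is
    # the same as for an object that was never there.
    if obs is None or not obs.exists or obs.whiteout:
        return -errno.ENOENT, None
    return 0, obs
```

Keeping the check at the user-facing lookup, rather than flipping exists, mirrors the stated goal of changing only user-visible behavior.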
Sage Weil [Tue, 1 Oct 2013 16:28:29 +0000 (09:28 -0700)]
osdc/ObjectCacher: limit writeback IOs generated while holding lock
While analyzing a log from Mike Dawson I saw a long stall while librbd's
objectcacher was starting lots (many hundreds) of IOs. Limit the amount of
time we spend doing this at a time to allow IO replies to be processed so
that the cache remains responsive.
I'm not sure this warrants a tunable (which we would need to add for both
libcephfs and librbd).
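The idea of bounding the time spent starting IOs per lock hold can be sketched as follows. This is an illustrative Python model, not the ObjectCacher code: flush_dirty(), the deque of dirty items, the start_io callback, and the 5 ms default budget are all assumptions.

```python
import threading
import time
from collections import deque


def flush_dirty(lock, dirty, start_io, budget_seconds=0.005):
    """Start writeback for every item in `dirty`, in time-bounded batches.

    Returns how many times the lock was taken. `budget_seconds` is a
    hypothetical hard-coded limit, since the commit adds no tunable.
    """
    passes = 0
    while True:
        with lock:
            passes += 1
            deadline = time.monotonic() + budget_seconds
            while dirty:
                start_io(dirty.popleft())
                if time.monotonic() >= deadline:
                    break  # budget spent: stop this batch
            if not dirty:
                return passes
        # The lock is released here, so threads handling IO replies get a
        # chance to take it before the next batch starts.
```

With a zero budget each pass starts exactly one IO; with a generous budget everything is started in a single pass.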
Yehuda Sadeh [Mon, 26 Aug 2013 18:16:08 +0000 (11:16 -0700)]
rgw: quiet down warning message
Fixes: #6123
We don't want to warn about failing to read the region map info
if it's not found, only if the read failed with some other error. In
any case it's just a warning.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Dan Mick [Fri, 27 Sep 2013 05:24:37 +0000 (22:24 -0700)]
ceph_argparse.py: clean up error reporting when required param missing
Treat "need 1, got 0" as a special case, and change the message to
"missing required parameter <x>". Also, when failing for that reason,
print the command concise description and its helptext.
Fixes: #6384
Signed-off-by: Dan Mick <dan.mick@inktank.com>
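The special case described above might look roughly like the sketch below. It is not the ceph_argparse.py code: check_arg_count() and its parameters are hypothetical, and the wording of the general-case message is invented for illustration. Only the "missing required parameter" text comes from the commit message.

```python
def check_arg_count(name, needed, got, concise, helptext):
    """Return an error string if too few values were given, else None."""
    if got >= needed:
        return None
    if needed == 1 and got == 0:
        # The special case: say what is missing instead of counting.
        msg = "missing required parameter {0}".format(name)
    else:
        # Placeholder wording for the general case (an assumption).
        msg = "need {0} values for {1}, got {2}".format(needed, name, got)
    # Follow the message with the command's concise form and its help text.
    return "\n".join([msg, "{0} :  {1}".format(concise, helptext)])
```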
Fixes: #6444
Backport: dumpling
If pool creation fails (e.g., due to -EEXIST) then we leak the
completion object. Earlier we couldn't just drop the reference, as
librados had already removed the internal completion object. This fix
drops the completion reference even if we got an error, which is now
possible.
librados: pool async create / delete does not delete completion handle
Backport: dumpling
The pool async delete / create functions used to delete the internal
completion object. However, the caller still holds the allocated completion
object, which it can't drop a reference to (as it'd try to deallocate
the already freed internal object). This fix removes the internal object
deletion; a following commit will fix a related leak (#6444) by having
the application (radosgw) drop the reference even if it got an error.
Objecter: add "honor_cache_redirects" flag covering cache settings
When set to false, we do not redirect based on the cache_pool data
in the OSDMap. We'll use this so the OSDs can actually fetch data
into the cache pools on promotion!
Signed-off-by: Greg Farnum <greg@inktank.com>
Sage Weil [Thu, 5 Sep 2013 04:29:11 +0000 (21:29 -0700)]
common/crc32c_intel_fast: avoid reading partial trailing word
The optimized intel code reads in word-sized chunks, knowing that the
allocator will only hand out memory in word-sized increments. This makes
valgrind unhappy. Whitelisting doesn't work because for some reason there
is no caller context (probably because of some interaction with yasm?).
Instead, just use the baseline code for the last few bytes. This should
not be significant.
Dan Mick [Fri, 27 Sep 2013 01:00:31 +0000 (18:00 -0700)]
ceph_argparse.py, cephtool/test.sh: fix blacklist with no nonce
It's legal to give a CephEntityAddr to osd blacklist with no nonce,
so allow it in the valid() method; also validate that any nonce
given is a long >= 0.
Also fix comment on CephEntityAddr type description in MonCommands.h,
and add tests for invalid nonces (while fixing the existing tests to remove
the () around expect_false args).
Fixes: #6425
Signed-off-by: Dan Mick <dan.mick@inktank.com>
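The nonce rule above (optional, but a non-negative integer when present) can be sketched as below. This is not the CephEntityAddr.valid() code: valid_entity_addr() is a hypothetical function, it handles only IPv4 with an optional port as a simplification, and Python 3's int stands in for long.

```python
import ipaddress


def valid_entity_addr(s):
    """Accept ip[:port][/nonce]; the nonce is optional but must be >= 0."""
    addr, slash, nonce = s.partition("/")
    if slash:
        try:
            if int(nonce) < 0:
                return False
        except ValueError:
            return False  # empty or non-numeric nonce
    host, colon, port = addr.partition(":")
    try:
        ipaddress.IPv4Address(host)
        if colon and not 0 <= int(port) <= 65535:
            return False
    except ValueError:
        return False
    return True
```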
We now put CEPH_ARGS in the actual args we parse in python, which are passed
to rados piecemeal later. This lets you put things like --id ... in there
that need to be parsed before librados is initialized.
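A minimal sketch of folding CEPH_ARGS into the parsed arguments follows. It is an illustration only: effective_args() is a hypothetical helper, and the use of shlex for splitting and the prepend order are assumptions rather than details taken from the commit.

```python
import argparse
import os
import shlex


def effective_args(argv, environ=None):
    """Return the CEPH_ARGS words followed by the command-line words."""
    environ = os.environ if environ is None else environ
    env_args = shlex.split(environ.get("CEPH_ARGS", ""))
    return env_args + list(argv)


# Early options such as --id can now be picked up before any cluster
# handle is created; everything else is left for later stages.
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument("--id")
early, remaining = parser.parse_known_args(
    effective_args(["osd", "stat"], {"CEPH_ARGS": "--id admin"})
)
```

Here early.id is "admin" and remaining holds the words that were not consumed by the early parser.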