Sage Weil [Thu, 3 Oct 2013 23:30:29 +0000 (16:30 -0700)]
common/bloom_filter: drop raw_table_size_ member
We were storing table_size_ and raw_table_size_, where one is the size in
bits and the other is the size in bytes. This is silly. Store only the
size in bytes.
Also, bytes are always 8 bits, so use bit shifts and drop some of that
silliness too.
Move the member declarations to the top of the class so you read them
before the methods.
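
A minimal sketch of the idea, not the actual bloom_filter class: only the size in bytes is stored, and the bit count and bit positions are derived with shifts. The class name `BitTable` and its methods are hypothetical, chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a bit table; not the real Ceph bloom_filter.
class BitTable {
  // Member declared at the top so it is read before the methods.
  // Its size is the single source of truth: the table size in bytes.
  std::vector<std::uint8_t> table_;

public:
  explicit BitTable(std::size_t size_bytes) : table_(size_bytes, 0) {}

  std::size_t size_bytes() const { return table_.size(); }

  // A byte is always 8 bits, so bits = bytes << 3.
  std::size_t size_bits() const { return table_.size() << 3; }

  // bit >> 3 selects the byte, bit & 7 selects the bit inside that byte.
  void set(std::size_t bit) {
    table_[bit >> 3] |= static_cast<std::uint8_t>(1u << (bit & 7));
  }

  bool test(std::size_t bit) const {
    return ((table_[bit >> 3] >> (bit & 7)) & 1u) != 0;
  }
};
```

With a 4-byte table, `size_bits()` yields 32 and bit 13 lands in byte 1, bit 5.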
David Zafman [Mon, 30 Sep 2013 22:53:35 +0000 (15:53 -0700)]
common, os: Perform xattr handling based on detected fs type
In FileStore::_detect_fs() store discovered filesystem type in m_fs_type
Add per-filesystem filestore_max_inline_xattr_size_* variants
Add per-filesystem filestore_max_inline_xattrs_* variants
New function set_xattr_limits_via_conf()
Set m_filestore_max_inline_xattr_size based on override or fs type
Set m_filestore_max_inline_xattrs based on override or fs type
Handle conf change of any relevant value by calling set_xattr_limits_via_conf()
Change filestore_max_inline_xattr_size to override if non-zero
Change filestore_max_inline_xattrs to override if non-zero
Fixes: #6143
Signed-off-by: David Zafman <david.zafman@inktank.com>
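
The override-or-fs-type selection described above can be sketched as follows. This is an illustration under assumptions, not the FileStore code: the `FsType` values, the `XattrConf` struct, and `pick_inline_xattr_size()` are hypothetical names, and the commit message does not say which filesystems get their own variant.

```cpp
#include <cstddef>

// Hypothetical filesystem categories; the real set is not given in the log.
enum class FsType { Xfs, Btrfs, Other };

// Hypothetical view of the relevant config values.
struct XattrConf {
  std::size_t max_inline_xattr_size;        // 0 means "no override"
  std::size_t max_inline_xattr_size_xfs;    // per-filesystem variants
  std::size_t max_inline_xattr_size_btrfs;
  std::size_t max_inline_xattr_size_other;
};

// A non-zero generic value overrides; otherwise the detected fs type decides.
std::size_t pick_inline_xattr_size(const XattrConf& conf, FsType fs) {
  if (conf.max_inline_xattr_size != 0)
    return conf.max_inline_xattr_size;
  switch (fs) {
    case FsType::Xfs:
      return conf.max_inline_xattr_size_xfs;
    case FsType::Btrfs:
      return conf.max_inline_xattr_size_btrfs;
    default:
      return conf.max_inline_xattr_size_other;
  }
}
```

The same shape would apply to the xattr count limit. Re-running the selection whenever a relevant config value changes keeps the effective limit in sync.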
Sage Weil [Fri, 4 Oct 2013 04:27:36 +0000 (21:27 -0700)]
osd/ReplicatedPG: fix null deref on rollback_to whiteout check
Bring this whole if/else chain up one level so that we can capture both
ENOENT and whiteout in the same case. (And don't dereference the
pointer when we know it is NULL.)
Fixes: #6474
Signed-off-by: Sage Weil <sage@inktank.com>
mon: Monitor: drop client msg if no session exists and msg is not MAuth
If the sender is not a monitor and does not have a session yet, it must
first authenticate with the cluster. Therefore, its first message to the
monitor must be an MAuth. If not, we assume it's a stray message and
just drop it.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
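
The drop rule reads naturally as a single predicate. The sketch below is an assumption-laden illustration, not the Monitor dispatch code: `MsgType`, `Incoming`, and `should_drop()` are hypothetical names.

```cpp
// Hypothetical message classification; only "is it an MAuth" matters here.
enum class MsgType { Auth, Other };

// Hypothetical summary of what the dispatcher knows about a message.
struct Incoming {
  MsgType type;
  bool from_monitor;  // sender is another monitor
  bool has_session;   // a session already exists for this connection
};

// Drop when a non-monitor peer without a session sends anything but MAuth.
bool should_drop(const Incoming& m) {
  return !m.from_monitor && !m.has_session && m.type != MsgType::Auth;
}
```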
mon: MonmapMonitor: make 'ceph mon add' idempotent
MonMap changes lead to bootstraps. Callbacks waiting for a proposal to
finish can have several fates, depending on what happens: finished, rerun
or aborted.
In the case of a bootstrap right after a monmap change, callbacks are
rerun. Since we queued the message that led to the monmap change on this
queue, rerunning it instead of finishing it means we try to perform the
same modification twice. The second attempt tries to modify already
existing state, and we return just that: whatever you're attempting to do
has already been done.
This patch makes 'ceph mon add' completely idempotent. If one tries to
add an already existing monitor (i.e., same name, same ip:port), one
simply gets a 'monitor foo added', with return 0, no matter how many
times one runs the command.
Fixes: #5896
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
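
A minimal sketch of the idempotent behaviour, assuming a monmap modelled as a name-to-address map. `add_monitor()` is a hypothetical helper, and returning -EEXIST for a conflicting entry is an assumption; the commit message only specifies the result for an identical entry.

```cpp
#include <cerrno>
#include <map>
#include <string>

// Hypothetical model: monitor name -> "ip:port".
using MonMap = std::map<std::string, std::string>;

// Returns 0 if the monitor was added or already exists with the same
// name and address. Returns -EEXIST (assumed) if the name or the address
// is already taken by a different entry.
int add_monitor(MonMap& monmap, const std::string& name,
                const std::string& addr) {
  auto it = monmap.find(name);
  if (it != monmap.end())
    return it->second == addr ? 0 : -EEXIST;  // same entry: nothing to do
  for (const auto& entry : monmap) {
    if (entry.second == addr)
      return -EEXIST;  // address in use under another name
  }
  monmap.emplace(name, addr);
  return 0;
}
```

Because a rerun of the same request finds the identical entry and returns 0, it no longer matters whether the callback was finished or rerun after the bootstrap.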
Note that this can happen if we fail to reconnect to an MDS during its
reconnect interval. If that happens, we probably have inodes in our
cache with no caps and things are generally not going to work very well.
This is but one step in improving the situation.
Separate out the two methods since they share little/no behavior.
majianpeng [Thu, 1 Aug 2013 03:19:02 +0000 (11:19 +0800)]
ceph: Update FUSE_USE_VERSION from 26 to 30.
Compilation failed with this error:
>In file included from /usr/local/include/fuse/fuse.h:19:0,
> from client/fuse_ll.cc:17:
>/usr/local/include/fuse/fuse_common.h:474:4: error: #error only API
>version 30 or greater is supported
Update FUSE_USE_VERSION from 26 to 30.
Yan, Zheng [Fri, 9 Aug 2013 05:43:54 +0000 (13:43 +0800)]
client: trim deleted inode
The previous patch makes the MDS send a notification to clients when an
inode is deleted. When receiving such a notification, we invalidate any
dentry link to the deleted inode. If there is no other reference to
the inode, the inode gets trimmed.
For the cephfs fuse client, we use fuse_lowlevel_notify_inval_entry() or
fuse_lowlevel_notify_delete() to notify the kernel to trim the deleted
inode. (This is not completely reliable because we play unlink/link
tricks when handling MDS replies. It's difficult to keep the user space
cache and kernel dcache in sync.)
Sage Weil [Fri, 20 Sep 2013 00:57:14 +0000 (17:57 -0700)]
common/bloom_filter: unit tests
Fun facts:
- fpp = false positive probability
- fpp is a function of insert count only
- at .1% fpp, we pay about 2 bytes per insert
- at 1-2% fpp, we pay about 1 byte per insert
- at 15% fpp, we pay about .5 bytes per insert
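
These figures are close to what the textbook sizing rule for an optimally configured Bloom filter predicts: bits per element = -ln(fpp) / (ln 2)^2. The sketch below evaluates that rule. It is the standard formula, not code from the unit tests, and `bytes_per_insert()` is a hypothetical helper name.

```cpp
#include <cmath>

// Textbook estimate for an optimally sized Bloom filter:
//   bits per inserted element = -ln(fpp) / (ln 2)^2
// Divided by 8 to express the cost in bytes.
double bytes_per_insert(double fpp) {
  const double ln2 = std::log(2.0);
  return -std::log(fpp) / (ln2 * ln2) / 8.0;
}
```

Under this rule, 0.1% costs roughly 1.8 bytes per insert, 1% to 2% costs roughly 1.2 to 1.0 bytes, and 15% costs roughly 0.5 bytes, in line with the list above.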
David Zafman [Fri, 27 Sep 2013 00:42:13 +0000 (17:42 -0700)]
common, os, osd: Use common functions for safe file reading and writing
Add new safe_read_file() and safe_write_file() to update files atomically
Used in place of the original OSD::read_meta() and OSD::write_meta(), on which they are based
Used by read_superblock() and write_superblock()
Used by write_version_stamp() and version_stamp_is_valid()
Fixes: #6422
Signed-off-by: David Zafman <david.zafman@inktank.com>
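
One common way to update a file atomically on POSIX systems is to write a temporary file, fsync it, and rename it over the target. The sketch below shows that pattern. It is an assumption about the general technique, not the body of safe_write_file(); the helper name `atomic_write_file()` and the `.tmp` suffix are hypothetical.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <string>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper. Returns 0 on success or a negative errno value.
int atomic_write_file(const std::string& dir, const std::string& name,
                      const std::string& data) {
  const std::string final_path = dir + "/" + name;
  const std::string tmp_path = final_path + ".tmp";

  int fd = ::open(tmp_path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0)
    return -errno;

  // Clean up the temporary file; errno is passed in before it can change.
  auto fail = [&](int saved_errno) {
    if (fd >= 0)
      ::close(fd);
    ::unlink(tmp_path.c_str());
    return -saved_errno;
  };

  std::size_t off = 0;
  while (off < data.size()) {
    ssize_t n = ::write(fd, data.data() + off, data.size() - off);
    if (n < 0) {
      if (errno == EINTR)
        continue;  // interrupted, retry the same chunk
      return fail(errno);
    }
    off += static_cast<std::size_t>(n);
  }

  // Make the data durable before it becomes visible under the final name.
  if (::fsync(fd) < 0)
    return fail(errno);
  if (::close(fd) < 0) {
    int e = errno;
    fd = -1;
    return fail(e);
  }
  fd = -1;

  // rename() replaces the target atomically on the same filesystem.
  if (std::rename(tmp_path.c_str(), final_path.c_str()) < 0)
    return fail(errno);

  // Best effort: persist the directory entry as well.
  int dfd = ::open(dir.c_str(), O_RDONLY);
  if (dfd >= 0) {
    ::fsync(dfd);
    ::close(dfd);
  }
  return 0;
}
```

A reader sees either the old contents or the new contents, never a partially written file, because the final name only ever points at a fully written and synced file.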
* Update to the current state of the ghobject implementation and the fact
that they encode the shard_t. Although the pool also contains the shard
id, it is less relevant to understanding the implementation.
* Update with the erasure code plugin infrastructure and the example
plugin now in master.
* Move jerasure to a separate page to be expanded and link it from the
toc
* Kill the partial read and write notes, as they will probably not be
implemented in the near future. Kill some of the other notes because they
are no longer relevant.
Greg Farnum [Tue, 1 Oct 2013 23:41:22 +0000 (16:41 -0700)]
ReplicatedPG: copy: use CopyCallback instead of CopyOp in OpContext
In order to make this happen, we make the switch to generate the complete
transaction in the generic copy code and save it into the Callback. Then
in finish_copy() we just take that transaction and prepend it to the existing
transaction.
With that change, and by making use of the existing CopyCallback data,
we no longer need to access the CopyOp from the OpContext, so we can remove it.
Hurray, the pipelines are now independent!
Sage Weil [Tue, 1 Oct 2013 21:21:40 +0000 (14:21 -0700)]
osd: remove magical tmap -> omap conversion
This is incomplete and unfortunately unusable in its current state:
- it would only set USES_TMAP for old encoded object_info_t and tmapput,
but would NOT set it for tmapup
- a config option turned that off by default.
That means that the mds conversion from tmap -> omap won't be able to use
this because any existing cluster has tmap objects without the USES_TMAP
flag set. And we don't want to unconditionally try a tmap->omap conversion
on omap operations because there are lots of existing librados users out
there that will be negatively impacted by this.
Instead, the MDS will need to handle this conversion on the client side by
reading either tmap or omap objects and explicitly rewriting the content
with omap (while truncating the tmap data away).
Sage Weil [Wed, 2 Oct 2013 00:04:44 +0000 (17:04 -0700)]
osd: add ISDIRTY, UNDIRTY rados operations
ISDIRTY will query whether the dirty flag is set on an object. UNDIRTY
will explicitly clear it. Note that a user doing so will likely run afoul
of the caching code.
We implement enough of the CopyFromCallback that CopyOp no longer needs
a direct reference to the OpContext, so we remove it and replace all
references with calls to cop->cb->complete().