Jeff Layton [Mon, 24 Oct 2016 14:03:01 +0000 (10:03 -0400)]
client: add an optional Inode ** parm to ceph_readdirplus_r
Ganesha needs an inode reference in addition to the attributes when it
calls readdirplus. Other callers, however, don't need an inode reference.
We could just take one universally and pass it to the callback, but most
callers don't need that reference and would need to put it in the
callback. That's cumbersome and mutex-thrashy.
So, we need to fix the readdir engine to only conditionally take this
extra reference, when the callback will actually use it. Add a bool to
readdir_r_cb that defaults to false and indicates that the caller wants
an inode reference for each dentry returned. When that bool is true
we'll pass a pointer to the inode to the callback after taking a
reference. Otherwise, NULL is passed to the callback.
Next, add a return double pointer arg to ceph_readdirplus_r that
indicates whether the caller wants an inode reference and where to put
the pointer to the inode. Almost all callers will set that to NULL, but
ganesha can set it to a non-NULL value to get the inode reference that
it wants on each call.
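
A minimal sketch of the conditional-reference idea described above. The
types and names here (Inode, Dentry, readdir_sketch, getref) are
hypothetical stand-ins, not the real Client code: the walker takes a
reference only when the caller asks for one and otherwise passes a null
pointer to the callback.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the real Ceph client types.
struct Inode {
  int nref = 0;
  void get() { ++nref; }
  void put() { --nref; }
};

struct Dentry {
  std::string name;
  Inode *inode;
};

// The callback gets the entry name and, only if requested, a referenced inode.
using readdir_cb_t = std::function<void(const std::string &, Inode *)>;

// Walk the entries. A reference is taken only when getref is true, so callers
// that don't want one never pay for the get/put pair.
inline void readdir_sketch(std::vector<Dentry> &entries, const readdir_cb_t &cb,
                           bool getref = false) {
  for (auto &dn : entries) {
    Inode *in = nullptr;
    if (getref) {
      dn.inode->get();  // the callback's owner must put() this later
      in = dn.inode;
    }
    cb(dn.name, in);
  }
}
```

With getref left at its default, the callback sees a null pointer for every
entry and no reference counts change.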
Jeff Layton [Mon, 24 Oct 2016 14:03:01 +0000 (10:03 -0400)]
client: pass want and flags to readdir_r_cb
...so we can ensure that we have the necessary caps when filling out
the ceph_statx for each dentry's inode. In order to only do this when
completely necessary, we have want default to 0 and the flags default
to AT_NO_ATTR_SYNC. The only codepath where we pass in a non-default
set of args there is ceph_readdirplus_r as it's the only codepath that
cares about fields in the ceph_statx that aren't immutable.
For now, since we have no support for requesting caps during a readdir
call, we simply issue getattrs prior to calling fill_statx. If we
already have the necessary caps, or are doing a lazy statx then this
becomes a no-op.
Note too that I _think_ the MDS will recall caps on the entries when
satisfying a readdir, so we avoid calling getattr when we're populating
the ceph_statx out of a just-acquired readdir response.
Jeff Layton [Mon, 24 Oct 2016 14:03:00 +0000 (10:03 -0400)]
client: change args to ceph_readdirplus_r
Make ceph_readdirplus_r take a ceph_statx, a want and a flags parm. With
this, we can allow applications to express an interest in a subset of the
attributes, and can allow for a "lazy" readdirplus.
Drop the stmask. It ends up returning the caps that the client holds on
the inode. That's not well defined, and we can now express that in a
better way via the stx_mask, which applications can use to tell which
fields in the ceph_statx are actually valid.
For now, the want mask is ignored. I don't see a way to ask for a set
of caps in a ceph readdir request on the wire. Maybe we could add that?
Jeff Layton [Mon, 24 Oct 2016 14:03:00 +0000 (10:03 -0400)]
client: convert ceph_ll_link to UserPerm and remove struct stat parameter
The main user of this API (ganesha) doesn't do anything with the
returned attributes, so there's no real point in returning them
there.
Also, we're not guaranteed to have any caps on the target inode
after the link operation, so in the case of FUSE (which does
require the post-op attributes) we should really do a getattr
to get the latest attributes.
Jeff Layton [Mon, 24 Oct 2016 14:02:59 +0000 (10:02 -0400)]
client: convert ceph_ll_walk to use ceph_statx
In addition to acquiring the right caps for the requested attributes, we
can also do a path walk that terminates on an existing symlink without
following the link now.
Jeff Layton [Mon, 24 Oct 2016 14:02:59 +0000 (10:02 -0400)]
client: switch arguments on ceph_ll_lookup to use ceph_statx and UserPerm
For now, we leave the old ->ll_lookup method in place, as FUSE needs
it. We could do a ceph_statx -> stat conversion, but that's extra
copies and I don't think we want the perf hit in FUSE.
Jeff Layton [Mon, 24 Oct 2016 14:02:58 +0000 (10:02 -0400)]
client: allow UserPerm constructor to populate gid list
Add args for the gids_count and gids list, and give them default
values so that callers can populate it correctly. We'll need this
for ganesha so it can populate the UserPerm from a RPC AUTH object.
Note that the gidlist pointer must be valid for the lifetime of
the created object!
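
A minimal sketch of why the lifetime note matters, assuming a
hypothetical UserPermSketch class (not the real UserPerm) that borrows
the gid list pointer instead of copying it. The defaulted arguments let
existing callers construct it without a list.

```cpp
#include <cassert>
#include <vector>

// Hypothetical illustration only; the real UserPerm may differ in detail.
class UserPermSketch {
public:
  explicit UserPermSketch(int uid = -1, int gid = -1, int ngids = 0,
                          const unsigned *gidlist = nullptr)
    : uid_(uid), gid_(gid), gid_count_(ngids), gids_(gidlist) {}

  int uid() const { return uid_; }
  int gid() const { return gid_; }
  int gid_count() const { return gid_count_; }

  // Reads through the borrowed pointer, so the caller's array must still be
  // alive whenever this is called.
  bool in_group(unsigned g) const {
    for (int i = 0; i < gid_count_; ++i) {
      if (gids_[i] == g)
        return true;
    }
    return false;
  }

private:
  int uid_;
  int gid_;
  int gid_count_;
  const unsigned *gids_;  // borrowed, never copied or freed here
};
```

In this sketch the caller keeps the backing array (for example a
std::vector) alive for as long as the object is in use.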
Sage Weil [Sun, 23 Oct 2016 23:40:57 +0000 (18:40 -0500)]
msg: adjust byte_throttler from Message::encode
Normally we never call encode on a message that has a byte_throttler set
because we only use it for messages we received. However, for forwarded
messages that we clear_payload() before resending, we *do* reencode, and in
that case we need to retake the appropriate number of bytes from the
throttler--just like we release them in clear_payload().
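
A minimal sketch of the accounting symmetry, using hypothetical
ByteThrottle and MessageSketch types rather than the real Throttle and
Message classes: clear_payload() gives the bytes back, so a later
encode() has to take them again.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical byte counter standing in for a throttler.
struct ByteThrottle {
  std::size_t in_use = 0;
  void take(std::size_t n) { in_use += n; }
  void put(std::size_t n) {
    assert(n <= in_use);
    in_use -= n;
  }
};

// Hypothetical message with a decoded body and an encoded payload.
struct MessageSketch {
  std::string body;
  std::string payload;
  ByteThrottle *byte_throttler = nullptr;

  // Drop the encoded form and release its bytes to the throttler.
  void clear_payload() {
    if (byte_throttler)
      byte_throttler->put(payload.size());
    payload.clear();
  }

  // Re-create the encoded form. If a throttler is attached, retake the bytes
  // so the eventual release stays balanced.
  void encode() {
    if (!payload.empty())
      return;  // already encoded and already accounted for
    payload = body;
    if (byte_throttler)
      byte_throttler->take(payload.size());
  }
};
```

Without the take in encode(), the second clear_payload() would release
bytes that were never charged.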
Sage Weil [Sat, 22 Oct 2016 18:01:34 +0000 (14:01 -0400)]
messages/MForward: reencode forwarded message if target has differing features
This ensures we reencode the payload with the
appropriate set of features if the client, us, or the
target do not have identical features. Otherwise we
may forward an encoding with more features than the
target can handle.
Sage Weil [Fri, 21 Oct 2016 19:42:19 +0000 (15:42 -0400)]
os/bluestore: clear extent map on object removal
Clear the ExtentMap (especially shards, etc.) when an object is removed.
Otherwise, if we recreate it, we will have stale state (like the shards
vector or inline_bl) that is bogus.
Sage Weil [Fri, 21 Oct 2016 16:25:08 +0000 (12:25 -0400)]
mon/OSDMonitor: encode OSDMap::Incremental with same features as OSDMap
The Incremental encode stashes encode_features, which is
what we use later to reencode the updated OSDMap. Use
the same features so that the encoding will match!
John Spray [Thu, 6 Oct 2016 10:20:47 +0000 (12:20 +0200)]
messages: fix out of range assertion
When clang uses an 8 bit type for the enum, it
complains (out of range) if comparing <256,
and complains (tautological) if comparing <=256.
Avoid this by explicitly making the enum a
uint8_t, and just asserting that it has
that size at the point that we assume so for
the encoding (in case someone modifies the
type definition without checking how it is used).
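
A minimal sketch of the pattern, using a hypothetical enum
(op_kind_sketch_t) rather than the one touched by this commit: fix the
underlying type, then assert the size at the point where the encoding
relies on it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enum for illustration. The explicit underlying type means the
// compiler no longer picks a width for us.
enum op_kind_sketch_t : uint8_t {
  KIND_A = 0,
  KIND_B = 1,
  KIND_C = 2,
};

// Encode the enum as a single byte. The static_assert fails the build if
// someone later widens the type without revisiting this encoding.
inline uint8_t encode_kind(op_kind_sketch_t k) {
  static_assert(sizeof(op_kind_sketch_t) == sizeof(uint8_t),
                "encoding assumes a one-byte enum");
  return static_cast<uint8_t>(k);
}
```

Because the check is on the type's size and not on a value range, it
avoids both the out-of-range and the tautological comparison warnings.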
Yan, Zheng [Fri, 21 Oct 2016 03:38:44 +0000 (11:38 +0800)]
mds: fix CDir::log_mark_dirty()
CDir::log_mark_dirty() moves the dirfrag to the current log segment's dirty
dirfrag list, but it does not submit any log event. Old log segments
(that include events which dirty the dirfrag) may get expired before
the dirfrag gets committed. If the MDS crashes, the changes in expired
log segments get lost.
Yan, Zheng [Wed, 19 Oct 2016 15:41:43 +0000 (23:41 +0800)]
mds: avoid wrapping contexts during logging
For each log event, the MDS allocates two extra contexts: one for
marking the op tracker event, and one for updating the log's safe_pos after
executing the finish context. This is suboptimal.
This patch defines MDSLogContextBase for log event context.
MDSLogContextBase::complete() function can do the extra jobs.
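
A minimal sketch of the idea, with hypothetical names (LogSketch,
LogContextBaseSketch, RecordingContext) and simplified ownership; the
real MDSLogContextBase may order or implement these steps differently.
The base class's complete() does the bookkeeping itself, so no wrapper
contexts need to be allocated around the finish step.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the log state that the extra contexts updated.
struct LogSketch {
  uint64_t safe_pos = 0;
  int tracker_marks = 0;
};

class LogContextBaseSketch {
public:
  LogContextBaseSketch(LogSketch *log, uint64_t write_pos)
    : log_(log), write_pos_(write_pos) {}
  virtual ~LogContextBaseSketch() = default;

  // One entry point performs all three jobs in order.
  void complete(int r) {
    ++log_->tracker_marks;        // stand-in for marking the op tracker event
    finish(r);                    // the subclass's actual completion work
    log_->safe_pos = write_pos_;  // advance safe_pos only after finish ran
  }

protected:
  virtual void finish(int r) = 0;
  LogSketch *log_;
  uint64_t write_pos_;
};

// Example subclass that records what it observed when finish() ran.
class RecordingContext : public LogContextBaseSketch {
public:
  using LogContextBaseSketch::LogContextBaseSketch;
  int result = -1;
  uint64_t safe_pos_seen = 0;

protected:
  void finish(int r) override {
    result = r;
    safe_pos_seen = log_->safe_pos;
  }
};
```

The sketch leaves out self-deletion and error handling to keep the
ordering easy to see.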