Jeff Layton [Wed, 1 Jun 2022 20:13:12 +0000 (16:13 -0400)]
[DO NOT MERGE] libceph: vet page pointers passed to sendpage_ok
We've seen some crashes in teuthology that look like we were passed a
bogus page pointer from the upper layers. Dump some info about the
pointer if it turns out to be clearly bad.
URL: https://tracker.ceph.com/issues/55818 Signed-off-by: Jeff Layton <jlayton@kernel.org>
ceph: prevent snapshots to be created in encrypted locked directories
With snapshot names encryption we can not allow snapshots to be created in
locked directories because the names wouldn't be encrypted. This patch
forces the directory to be unlocked to allow a snapshot to be created.
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
ceph: add support for handling encrypted snapshot names
When creating a snapshot, the .snap directories for every subdirectory will
show the snapshot name in the "long format":
# mkdir .snap/my-snap
# ls my-dir/.snap/
_my-snap_1099511627782
Encrypted snapshots will need to be able to handle these snapshot names by
encrypting/decrypting only the snapshot part of the string ('my-snap').
Also, since the MDS prevents snapshot names to be bigger than 240 characters
it is necessary to adapt CEPH_NOHASH_NAME_MAX to accommodate this extra
limitation.
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Since filenames in encrypted directories are already encrypted and shown
as a base64-encoded string when the directory is locked, snapshot names
should show a similar behaviour.
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
ceph: invalidate pages when doing direct/sync writes
When doing a direct/sync write, we need to invalidate the page cache in
the range being written to. If we don't do this, the cache will include
invalid data as we just did a write that avoided the page cache.
In the event that invalidation fails, just ignore the error. That likely
just means that we raced with another task doing a buffered write, in
which case we want to leave the page intact anyway.
[ jlayton: minor comment update ]
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 10 Jan 2022 16:58:03 +0000 (11:58 -0500)]
ceph: add encryption support to writepage
Allow writepage to issue encrypted writes. Extend out the requested size
and offset to cover complete blocks, and then encrypt and write them to
the OSDs.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 27 Jan 2021 18:13:45 +0000 (13:13 -0500)]
ceph: add read/modify/write to ceph_sync_write
When doing a synchronous write on an encrypted inode, we have no
guarantee that the caller is writing crypto block-aligned data. When
that happens, we must do a read/modify/write cycle.
First, expand the range to cover complete blocks. If we had to change
the original pos or length, issue a read to fill the first and/or last
pages, and fetch the version of the object from the result.
We then copy data into the pages as usual, encrypt the result and issue
a write prefixed by an assertion that the version hasn't changed. If it has
changed then we restart the whole thing again.
If there is no object at that position in the file (-ENOENT), we prefix
the write on an exclusive create of the object instead.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 25 Jan 2021 18:47:30 +0000 (13:47 -0500)]
ceph: align data in pages in ceph_sync_write
Encrypted files will need to be dealt with in block-sized chunks and
once we do that, the way that ceph_sync_write aligns the data in the
bounce buffer won't be acceptable.
Change it to align the data the same way it would be aligned in the
pagecache.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 7 Jan 2022 14:30:42 +0000 (09:30 -0500)]
ceph: don't use special DIO path for encrypted inodes
Eventually I want to merge the synchronous and direct read codepaths,
possibly via new netfs infrastructure. For now, the direct path is not
crypto-enabled, so use the sync read/write paths instead.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Tue, 30 Nov 2021 18:38:42 +0000 (13:38 -0500)]
ceph: disable copy offload on encrypted inodes
If we have an encrypted inode, then the client will need to re-encrypt
the contents of the new object. Disable copy offload to or from
encrypted inodes.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 22 Feb 2021 21:28:16 +0000 (16:28 -0500)]
libceph: allow ceph_osdc_new_request to accept a multi-op read
Currently we have some special-casing for multi-op writes, but in the
case of a read, we can't really handle it. All of the current multi-op
callers call it with CEPH_OSD_FLAG_WRITE set.
Have ceph_osdc_new_request check for CEPH_OSD_FLAG_READ and if it's set,
allocate multiple reply ops instead of multiple request ops. If neither
flag is set, return -EINVAL.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Wed, 12 Jan 2022 01:46:48 +0000 (09:46 +0800)]
ceph: add truncate size handling support for fscrypt
This will transfer the encrypted last block contents to the MDS
along with the truncate request only when the new size is smaller
and not aligned to the fscrypt BLOCK size. When the last block is
located in the file hole, the truncate request will only contain
the header.
The MDS could fail to do the truncate if there has another client
or process has already updated the RADOS object which contains
the last block, and will return -EAGAIN, then the kclient needs
to retry it. The RMW will take around 50ms, and will let it retry
20 times for now.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Fri, 5 Nov 2021 14:22:13 +0000 (22:22 +0800)]
ceph: add __ceph_sync_read helper support
Turn the guts of ceph_sync_read into a new helper that takes an inode
and an offset instead of a kiocb struct, and make ceph_sync_read call
the helper as a wrapper.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 5 Nov 2021 14:22:11 +0000 (22:22 +0800)]
ceph: handle fscrypt fields in cap messages from MDS
Handle the new fscrypt_file and fscrypt_auth fields in cap messages. Use
them to populate new fields in cap_extra_info and update the inode with
those values.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 5 Nov 2021 14:22:10 +0000 (22:22 +0800)]
ceph: get file size from fscrypt_file when present in inode traces
When we get an inode trace from the MDS, grab the fscrypt_file field if
the inode is encrypted, and use it to populate the i_size field instead
of the regular inode size field.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 5 Nov 2021 14:22:07 +0000 (22:22 +0800)]
libceph: add CEPH_OSD_OP_ASSERT_VER support
...and record the user_version in the reply in a new field in
ceph_osd_request, so we can populate the assert_ver appropriately.
Shuffle the fields a bit too so that the new field fits in an
existing hole on x86_64.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 31 Mar 2021 17:57:14 +0000 (13:57 -0400)]
ceph: make ceph_get_name decrypt filenames
When we do a lookupino to the MDS, we get a filename in the trace.
ceph_get_name uses that name directly, so we must properly decrypt
it before copying it to the name buffer.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 3 Sep 2020 17:31:10 +0000 (13:31 -0400)]
ceph: create symlinks with encrypted and base64-encoded targets
When creating symlinks in encrypted directories, encrypt and
base64-encode the target with the new inode's key before sending to the
MDS.
When filling a symlinked inode, base64-decode it into a buffer that
we'll keep in ci->i_symlink. When get_link is called, decrypt the buffer
into a new one that will hang off i_link.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Mon, 10 Aug 2020 13:33:41 +0000 (09:33 -0400)]
ceph: add support to readdir for encrypted filenames
Once we've decrypted the names in a readdir reply, we no longer need the
crypttext, so overwrite them in ceph_mds_reply_dir_entry with the
unencrypted names. Then in both ceph_readdir_prepopulate() and
ceph_readdir() we will use the dencrypted name directly.
[ jlayton: convert some BUG_ONs into error returns ]
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Mon, 14 Mar 2022 02:28:35 +0000 (10:28 +0800)]
ceph: add ceph_encode_encrypted_dname() helper
Add a new helper that basically calls ceph_encode_encrypted_fname, but
with a qstr pointer instead of a dentry pointer. This will make it
simpler to decrypt names in a readdir reply, before we have a dentry.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Mon, 14 Mar 2022 02:28:34 +0000 (10:28 +0800)]
ceph: pass the request to parse_reply_info_readdir()
Instead of passing just the r_reply_info to the readdir reply parser,
pass the request pointer directly instead. This will facilitate
implementing readdir on fscrypted directories.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 26 Mar 2021 16:26:18 +0000 (12:26 -0400)]
ceph: add helpers for converting names for userland presentation
Define a new ceph_fname struct that we can use to carry information
about encrypted dentry names. Add helpers for working with these
objects, including ceph_fname_to_usr which formats an encrypted filename
for userland presentation.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 7 Aug 2020 19:47:17 +0000 (15:47 -0400)]
ceph: make d_revalidate call fscrypt revalidator for encrypted dentries
If we have a dentry which represents a no-key name, then we need to test
whether the parent directory's encryption key has since been added. Do
that before we test anything else about the dentry.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 28 Mar 2022 20:18:38 +0000 (16:18 -0400)]
ceph: set DCACHE_NOKEY_NAME in atomic open
Atomic open can act as a lookup if handed a dentry that is negative on
the MDS. Ensure that we set DCACHE_NOKEY_NAME on the dentry in
atomic_open, if we don't have the key for the parent. Otherwise, we can
end up validating the dentry inappropriately if someone later adds a
key.
Reviewed-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 14 Jan 2021 15:39:22 +0000 (10:39 -0500)]
ceph: send altname in MClientRequest
In the event that we have a filename longer than CEPH_NOHASH_NAME_MAX,
we'll need to hash the tail of the filename. The client however will
still need to know the full name of the file if it has a key.
To support this, the MClientRequest field has grown a new alternate_name
field that we populate with the full (binary) crypttext of the filename.
This is then transmitted to the clients in readdir or traces as part of
the dentry lease.
Add support for populating this field when the filenames are very long.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 7 Aug 2020 13:28:31 +0000 (09:28 -0400)]
ceph: add encrypted fname handling to ceph_mdsc_build_path
Allow ceph_mdsc_build_path to encrypt and base64 encode the filename
when the parent is encrypted and we're sending the path to the MDS.
In most cases, we just encrypt the filenames and base64 encode them,
but when the name is longer than CEPH_NOHASH_NAME_MAX, we use a similar
scheme to fscrypt proper, and hash the remaning bits with sha256.
When doing this, we then send along the full crypttext of the name in
the new alternate_name field of the MClientRequest. The MDS can then
send that along in readdir responses and traces.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
ceph: add base64 endcoding routines for encrypted names
The base64url encoding used by fscrypt includes the '_' character, which
may cause problems in snapshot names (if the name starts with '_').
Thus, use the base64 encoding defined for IMAP mailbox names (RFC 3501),
which uses '+' and ',' instead of '-' and '_'.
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Tue, 28 Jul 2020 13:58:43 +0000 (09:58 -0400)]
ceph: add fscrypt ioctls
We gate most of the ioctls on MDS feature support. The exception is the
key removal and status functions that we still want to work if the MDS's
were to (inexplicably) lose the feature.
For the set_policy ioctl, we take Fs caps to ensure that nothing can
create files in the directory while the ioctl is running. That should
be enough to ensure that the "empty_dir" check is reliable.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 11 Jan 2021 17:35:48 +0000 (12:35 -0500)]
ceph: decode alternate_name in lease info
Ceph is a bit different from local filesystems, in that we don't want
to store filenames as raw binary data, since we may also be dealing
with clients that don't support fscrypt.
We could just base64-encode the encrypted filenames, but that could
leave us with filenames longer than NAME_MAX. It turns out that the
MDS doesn't care much about filename length, but the clients do.
To manage this, we've added a new "alternate name" field that can be
optionally added to any dentry that we'll use to store the binary
crypttext of the filename if its base64-encoded value will be longer
than NAME_MAX. When a dentry has one of these names attached, the MDS
will send it along in the lease info, which we can then store for
later usage.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com>
Jeff Layton [Tue, 8 Sep 2020 13:47:40 +0000 (09:47 -0400)]
ceph: implement -o test_dummy_encryption mount option
Add support for the test_dummy_encryption mount option. This allows us
to test the encrypted codepaths in ceph without having to manually set
keys, etc.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 27 Jul 2020 14:16:09 +0000 (10:16 -0400)]
ceph: fscrypt_auth handling for ceph
Most fscrypt-enabled filesystems store the crypto context in an xattr,
but that's problematic for ceph as xatts are governed by the XATTR cap,
but we really want the crypto context as part of the AUTH cap.
Because of this, the MDS has added two new inode metadata fields:
fscrypt_auth and fscrypt_file. The former is used to hold the crypto
context, and the latter is used to track the real file size.
Parse new fscrypt_auth and fscrypt_file fields in inode traces. For now,
we don't use fscrypt_file, but fscrypt_auth is used to hold the fscrypt
context.
Allow the client to use a setattr request for setting the fscrypt_auth
field. Since this is not a standard setattr request from the VFS, we add
a new field to __ceph_setattr that carries ceph-specific inode attrs.
Have the set_context op do a setattr that sets the fscrypt_auth value,
and get_context just return the contents of that field (since it should
always be available).
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 5 Aug 2020 18:50:48 +0000 (14:50 -0400)]
ceph: make ceph_msdc_build_path use ref-walk
Encryption potentially requires allocation, at which point we'll need to
be in a non-atomic context. Convert ceph_msdc_build_path to take dentry
spinlocks and references instead of using rcu_read_lock to walk the
path.
This is slightly less efficient, and we may want to eventually allow
using RCU when the leaf dentry isn't encrypted.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 26 Aug 2020 17:11:00 +0000 (13:11 -0400)]
ceph: preallocate inode for ops that may create one
When creating a new inode, we need to determine the crypto context
before we can transmit the RPC. The fscrypt API has a routine for getting
a crypto context before a create occurs, but it requires an inode.
Change the ceph code to preallocate an inode in advance of a create of
any sort (open(), mknod(), symlink(), etc). Move the existing code that
generates the ACL and SELinux blobs into this routine since that's
mostly common across all the different codepaths.
In most cases, we just want to allow ceph_fill_trace to use that inode
after the reply comes in, so add a new field to the MDS request for it
(r_new_inode).
The async create codepath is a bit different though. In that case, we
want to hash the inode in advance of the RPC so that it can be used
before the reply comes in. If the call subsequently fails with
-EJUKEBOX, then just put the references and clean up the as_ctx. Note
that with this change, we now need to regenerate the as_ctx when this
occurs, but it's quite rare for it to happen.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Sat, 26 Feb 2022 11:33:03 +0000 (06:33 -0500)]
ceph: add new mount option to enable sparse reads
Add a new mount option that has the client issue sparse reads instead of
normal ones. The callers now preallocate an sparse extent buffer that
the libceph receive code can populate and hand back after the operation
completes.
After a successful sparse read, we can't use the req->r_result value to
determine the amount of data "read", so instead we set the received
length to be from the end of the last extent in the buffer. Any
interstitial holes will have been filled by the receive code.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 23 Mar 2022 16:17:15 +0000 (12:17 -0400)]
libceph: support sparse reads on msgr2 secure codepath
Add a new init_sgs_pages helper that populates the scatterlist from
an arbitrary point in an array of pages.
Change setup_message_sgs to take an optional pointer to an array of
pages. If that's set, then the scatterlist will be set using that
array instead of the cursor.
When given a sparse read on a secure connection, decrypt the data
in-place rather than into the final destination, by passing it the
in_enc_pages array.
After decrypting, run the sparse_read state machine in a loop, copying
data from the decrypted pages until it's complete.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 11 Feb 2022 16:38:02 +0000 (11:38 -0500)]
libceph: add sparse read support to OSD client
Have get_reply check for the presence of sparse read ops in the
request and set the sparse_read boolean in the msg. That will queue the
messenger layer to use the sparse read codepath instead of the normal
data receive.
Add a new sparse_read operation for the OSD client, driven by its own
state machine. The messenger will repeatedly call the sparse_read
operation, and it will pass back the necessary info to set up to read
the next extent of data, while zero-filling the sparse regions.
The state machine will stop at the end of the last extent, and will
attach the extent map buffer to the ceph_osd_req_op so that the caller
can use it.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Tue, 25 Jan 2022 13:26:31 +0000 (08:26 -0500)]
libceph: add sparse read support to msgr2 crc state machine
Add support for a new sparse_read ceph_connection operation. The idea is
that the client driver can define this operation use it to do special
handling for incoming reads.
The alloc_msg routine will look at the request and determine whether the
reply is expected to be sparse. If it is, then we'll dispatch to a
different set of state machine states that will repeatedly call the
driver's sparse_read op to get length and placement info for reading the
extent map, and the extents themselves.
This necessitates adding some new field to some other structs:
- The msg gets a new bool to track whether it's a sparse_read request.
- A new field is added to the cursor to track the amount remaining in the
current extent. This is used to cap the read from the socket into the
msg_data
- Handing a revoke with all of this is particularly difficult, so I've
added a new data_len_remain field to the v2 connection info, and then
use that to skip that much on a revoke. We may want to expand the use of
that to the normal read path as well, just for consistency's sake.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 16 Mar 2022 19:23:00 +0000 (15:23 -0400)]
libceph: define struct ceph_sparse_extent and add some helpers
When the OSD sends back a sparse read reply, it contains an array of
these structures. Define the structure and add a couple of helpers for
dealing with them.
Also add a place in struct ceph_osd_req_op to store the extent buffer,
and code to free it if it's populated when the req is torn down.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 14 Mar 2022 19:47:18 +0000 (15:47 -0400)]
libceph: add spinlock around osd->o_requests
In a later patch, we're going to need to search for a request in
the rbtree, but taking the o_mutex is inconvenient as we already
hold the con mutex at the point where we need it.
Add a new spinlock that we take when inserting and erasing entries from
the o_requests tree. Search of the rbtree can be done with either the
mutex or the spinlock, but insertion and removal requires both.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Fri, 27 May 2022 04:39:17 +0000 (12:39 +0800)]
ceph: choose auth MDS for getxattr with the Xs caps
And for the 'Xs' caps for getxattr we will also choose the auth MDS,
because the MDS side code is buggy due to setxattr won't notify the
replica MDSes when the values changed and the replica MDS will return
the old values. Though we will fix it in MDS code, but this still
makes sense for old ceph.
URL: https://tracker.ceph.com/issues/55331 Signed-off-by: Xiubo Li <xiubli@redhat.com>
Xiubo Li [Thu, 26 May 2022 05:21:31 +0000 (13:21 +0800)]
ceph: add session already open notify support
If the connection was accidently closed due to the socket issue or
something else the clients will try to open the opened sessions, the
MDSes will send the session open reply one more time if the clients
support the notify feature.
When the clients retry to open the sessions the s_seq will be 0 as
default, we need to update it anyway.
URL: https://tracker.ceph.com/issues/53911 Signed-off-by: Xiubo Li <xiubli@redhat.com>
Jeff Layton [Wed, 25 May 2022 10:11:00 +0000 (06:11 -0400)]
libceph: drop last_piece flag from ceph_msg_data_cursor
ceph_msg_data_next is always passed a NULL pointer for this field. Some
of the "next" operations look at it in order to determine the length,
but we can just take the min of the data on the page or cursor->resid.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Tue, 10 May 2022 01:47:01 +0000 (09:47 +0800)]
ceph: wait the first reply of inflight async unlink
In async unlink case the kclient won't wait for the first reply
from MDS and just drop all the links and unhash the dentry and then
succeeds immediately.
For any new create/link/rename,etc requests followed by using the
same file names we must wait for the first reply of the inflight
unlink request, or the MDS possibly will fail these following
requests with -EEXIST if the inflight async unlink request was
delayed for some reasons.
And the worst case is that for the none async openc request it will
successfully open the file if the CDentry hasn't been unlinked yet,
but later the previous delayed async unlink request will remove the
CDenty. That means the just created file is possiblly deleted later
by accident.
We need to wait for the inflight async unlink requests to finish
when creating new files/directories by using the same file names.
[ xiubli: fixed the build warning reported by kernel test rotot]
URL: https://tracker.ceph.com/issues/55332 Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com>
Luís Henriques [Tue, 24 May 2022 16:06:27 +0000 (17:06 +0100)]
ceph: use correct index when encoding client supported features
Feature bits have to be encoded into the correct locations. This hasn't
been an issue so far because the only hole in the feature bits was in bit
10 (CEPHFS_FEATURE_RECLAIM_CLIENT), which is located in the 2nd byte. When
adding more bits that go beyond the this 2nd byte, the bug will show up.
[xiubli: remove the incorrect comment for CEPHFS_FEATURES_CLIENT_SUPPORTED]
Fixes: 9ba1e224538a ("ceph: allocate the correct amount of extra bytes for the session features") Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com>
Jeff Layton [Fri, 13 May 2022 14:23:25 +0000 (10:23 -0400)]
[DO NOT MERGE] mm: BUG if filemap_alloc_folio gives us a folio with a non-NULL ->private
We've seen some instances where we call __filemap_get_folio and get back
one with a ->private value that is non-NULL. Let's have the allocator
bug if that happens.
For now, let's just put this into the testing kernel. We can let Willy
decide if he wants it in mainline.
URL: https://tracker.ceph.com/issues/55421 Cc: Matthew Wilcox <willy@infradead.org> Cc: Xiubo Li <xiubli@redhat.com> Cc: Luís Henriques <lhenriques@suse.de> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Tue, 1 Sep 2020 16:56:42 +0000 (12:56 -0400)]
fscrypt: add fscrypt_context_for_new_inode
Most filesystems just call fscrypt_set_context on new inodes, which
usually causes a setxattr. That's a bit late for ceph, which can send
along a full set of attributes with the create request.
Doing so allows it to avoid race windows that where the new inode could
be seen by other clients without the crypto context attached. It also
avoids the separate round trip to the server.
Refactor the fscrypt code a bit to allow us to create a new crypto
context, attach it to the inode, and write it to the buffer, but without
calling set_context on it. ceph can later use this to marshal the
context into the attributes we send along with the create request.
Acked-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 8 Jan 2021 20:34:38 +0000 (15:34 -0500)]
fscrypt: export fscrypt_fname_encrypt and fscrypt_fname_encrypted_size
For ceph, we want to use our own scheme for handling filenames that are
are longer than NAME_MAX after encryption and Base64 encoding. This
allows us to have a consistent view of the encrypted filenames for
clients that don't support fscrypt and clients that do but that don't
have the key.
Currently, fs/crypto only supports encrypting filenames using
fscrypt_setup_filename, but that also handles encoding nokey names. Ceph
can't use that because it handles nokey names in a different way.
Export fscrypt_fname_encrypt. Rename fscrypt_fname_encrypted_size to
__fscrypt_fname_encrypted_size and add a new wrapper called
fscrypt_fname_encrypted_size that takes an inode argument rather than a
pointer to a fscrypt_policy union.
Acked-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 31 Mar 2022 20:29:00 +0000 (16:29 -0400)]
fs: change test in inode_insert5 for adding to the sb list
inode_insert5 currently looks at I_CREATING to decide whether to insert
the inode into the sb list. This test is a bit ambiguous, as I_CREATING
state is not directly related to that list.
This test is also problematic for some upcoming ceph changes to add
fscrypt support. We need to be able to allocate an inode using new_inode
and insert it into the hash later iff we end up using it, and doing that
now means that we double add it and corrupt the list.
What we really want to know in this test is whether the inode is already
in its superblock list, and then add it if it isn't. Have it test for
list_empty instead and ensure that we always initialize the list by
doing it in inode_init_once. It's only ever removed from the list with
list_del_init, so that should be sufficient.
Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jeff Layton [Mon, 23 May 2022 17:22:09 +0000 (13:22 -0400)]
MAINTAINERS: move myself from ceph "Maintainer" to "Reviewer"
Xiubo has graciously volunteered to take over for me as the Linux cephfs
client maintainer. Make it official by changing myself to be a
"Reviewer" for libceph and ceph.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Acked-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Luís Henriques [Mon, 23 May 2022 16:09:51 +0000 (17:09 +0100)]
ceph: fix decoding of client session messages flags
The cephfs kernel client started to show the message:
ceph: mds0 session blocklisted
when mounting a filesystem. This is due to the fact that the session
messages are being incorrectly decoded: the skip needs to take into
account the 'len'.
While there, fixed some whitespaces too.
Cc: stable@vger.kernel.org Fixes: e1c9788cb397 ("ceph: don't rely on error_string to validate blocklisted session.") Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Wed, 18 May 2022 14:49:26 +0000 (22:49 +0800)]
ceph: switch TASK_INTERRUPTIBLE to TASK_KILLABLE
If the task is placed in the TASK_INTERRUPTIBLE state it will sleep
until either something explicitly wakes it up, or a non-masked signal
is received. Switch to TASK_KILLABLE to avoid the noises.
Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Colin Ian King [Wed, 18 May 2022 08:55:08 +0000 (09:55 +0100)]
ceph: remove redundant variable ino
Variable ino is being assigned a value that is never read. The variable
and assignment are redundant, remove it.
Cleans up clang scan build warning:
warning: Although the value stored to 'ino' is used in the enclosing
expression, the value is never actually read from 'ino'
[deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Wed, 27 Apr 2022 06:14:41 +0000 (14:14 +0800)]
ceph: try to queue a writeback if revoking fails
If the pagecaches writeback just finished and the i_wrbuffer_ref
reaches zero it will try to trigger ceph_check_caps(). But if just
before ceph_check_caps() the i_wrbuffer_ref could be increased
again by mmap/cache write, then the Fwb revoke will fail.
We need to try to queue a writeback in this case instead of
triggering the writeback by BDI's delayed work per 5 seconds.
URL: https://tracker.ceph.com/issues/46904
URL: https://tracker.ceph.com/issues/55377 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When doing a mount using as base a directory that has 'max_bytes' quotas
statfs uses that value as the total; if a subdirectory is used instead,
the same 'max_bytes' too in statfs, unless there is another quota set.
Unfortunately, if this subdirectory only has the 'max_files' quota set,
then statfs uses the filesystem total. Fix this by making sure we only
lookup realms that contain the 'max_bytes' quota.
Cc: Ryan Taylor <rptaylor@uvic.ca>
URL: https://tracker.ceph.com/issues/55090 Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The ceph_netfs_issue_op_inline() will send a getattr(Fsr) request to
mds.1.
4, then the mds.1 will request the rd lock for CInode::filelock from
the auth mds.0, the mds.0 will do the CInode::filelock state transation
from excl --> sync, but it need to revoke the Fxwb caps back from the
clients.
While the kernel client has aleady held the Fwb caps and waiting for
the getattr(Fsr).
It's deadlock!
URL: https://tracker.ceph.com/issues/55377 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Sun, 24 Apr 2022 09:35:53 +0000 (17:35 +0800)]
ceph: redirty the page for writepage on failure
When run out of memories we should redirty the page before failing
the writepage. Or we will hit BUG_ON(folio_get_private(folio)) in
ceph_dirty_folio().
URL: https://tracker.ceph.com/issues/55421 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Thu, 21 Apr 2022 03:26:40 +0000 (11:26 +0800)]
ceph: try to choose the auth MDS if possible for getattr
If any 'x' caps is issued we can just choose the auth MDS instead
of the random replica MDSes. Because only when the Locker is in
LOCK_EXEC state will the loner client could get the 'x' caps. And
if we send the getattr requests to any replica MDS it must auth pin
and tries to rdlock from the auth MDS, and then the auth MDS need
to do the Locker state transition to LOCK_SYNC. And after that the
lock state will change back.
This cost much when doing the Locker state transition and usually
will need to revoke caps from clients.
URL: https://tracker.ceph.com/issues/55240 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Wed, 20 Apr 2022 05:13:02 +0000 (13:13 +0800)]
ceph: disable updating the atime since cephfs won't maintain it
Since CephFS makes no attempt to maintain atime, we shouldn't
try to update it in mmap and generic read cases and ignore updating
it in direct and sync read cases.
And even we update it in mmap and generic read cases we will drop
it and won't sync it to MDS. And we are seeing the atime will be
updated and then dropped to the floor again and again.
Xiubo Li [Tue, 19 Apr 2022 00:58:49 +0000 (08:58 +0800)]
ceph: flush the mdlog for filesystem sync
Before waiting for a request's safe reply, we will send the mdlog flush
request to the relevant MDS. And this will also flush the mdlog for all
the other unsafe requests in the same session, so we can record the last
session and no need to flush mdlog again in the next loop. But there
still have cases that it may send the mdlog flush requst twice or more,
but that should be not often.
Rename wait_unsafe_requests() to
flush_mdlog_and_wait_mdsc_unsafe_requests() to make it more
descriptive.
[xiubli: fold in MDS request refcount leak fix from Jeff]
URL: https://tracker.ceph.com/issues/55284
URL: https://tracker.ceph.com/issues/55411 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Mon, 11 Apr 2022 01:59:09 +0000 (09:59 +0800)]
ceph: fix statx AT_STATX_DONT_SYNC vs AT_STATX_FORCE_SYNC check
From the posix and the initial statx supporting commit comments,
the AT_STATX_DONT_SYNC is a lightweight stat and the
AT_STATX_FORCE_SYNC is a heaverweight one. And also checked all
the other current usage about these two flags they are all doing
the same, that is only when the AT_STATX_FORCE_SYNC is not set
and the AT_STATX_DONT_SYNC is set will they skip sync retriving
the attributes from storage.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Thu, 7 Apr 2022 05:12:42 +0000 (13:12 +0800)]
ceph: no need to invalidate the fscache twice
Fixes: 400e1286c0ec3 ("ceph: conversion to new fscache API") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jakob Koschel [Thu, 31 Mar 2022 21:53:29 +0000 (23:53 +0200)]
ceph: replace usage of found with dedicated list iterator variable
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.
To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean.
This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.
Jakob Koschel [Thu, 31 Mar 2022 21:53:28 +0000 (23:53 +0200)]
ceph: use dedicated list iterator variable
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.
To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable.
Xiubo Li [Wed, 30 Mar 2022 04:21:12 +0000 (12:21 +0800)]
ceph: update the dlease for the hashed dentry when removing
The MDS will always refresh the dentry lease when removing the files
or directories. And if the dentry is still hashed, we can update
the dentry lease and no need to do the lookup from the MDS later.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Wed, 30 Mar 2022 06:39:33 +0000 (14:39 +0800)]
ceph: stop retrying the request when exceeding 256 times
The type of 'r_attempts' in kernel 'ceph_mds_request' is 'int',
while in 'ceph_mds_request_head' the type of 'num_retry' is '__u8'.
So in case the request retries exceeding 256 times, the MDS will
receive a incorrect retry seq.
In this case it's ususally a bug in MDS and continue retrying the
request makes no sense. For now let's limit it to 256. In future
this could be fixed in ceph code, so avoid using the hardcode here.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Tue, 29 Mar 2022 04:48:01 +0000 (12:48 +0800)]
ceph: stop forwarding the request when exceeding 256 times
The type of 'num_fwd' in ceph 'MClientRequestForward' is 'int32_t',
while in 'ceph_mds_request_head' the type is '__u8'. So in case
the request bounces between MDSes exceeding 256 times, the client
will get stuck.
In this case it's ususally a bug in MDS and continue bouncing the
request makes no sense.
URL: https://tracker.ceph.com/issues/55130 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luís Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Xiubo Li [Mon, 28 Mar 2022 02:25:35 +0000 (10:25 +0800)]
ceph: remove unused CEPH_MDS_LEASE_RELEASE related code
The ceph_mdsc_lease_release() has been removed by commit 8aa152c77890
(ceph: remove ceph_mdsc_lease_release). ceph_mdsc_lease_send_msg will
never be called with CEPH_MDS_LEASE_RELEASE.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Jakob Koschel [Thu, 24 Mar 2022 07:20:50 +0000 (08:20 +0100)]
rbd: replace usage of found with dedicated list iterator variable
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.
To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean.
This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.
Venky Shankar [Thu, 10 Mar 2022 14:34:19 +0000 (09:34 -0500)]
ceph: allow ceph.dir.rctime xattr to be updatable
`rctime' has been a pain point in cephfs due to its buggy
nature - inconsistent values reported and those sorts.
Fixing rctime is non-trivial needing an overall redesign
of the entire nested statistics infrastructure.
As a workaround, PR
http://github.com/ceph/ceph/pull/37938
allows this extended attribute to be manually set. This allows
users to "fixup" inconsistent rctime values. While this sounds
messy, its probably the wisest approach allowing users/scripts
to workaround buggy rctime values.
The above PR enables Ceph MDS to allow manually setting
rctime extended attribute with the corresponding user-land
changes. We may as well allow the same to be done via kclient
for parity.