Luís Henriques [Thu, 25 Aug 2022 13:31:31 +0000 (09:31 -0400)]
ceph: prevent snapshots to be created in encrypted locked directories
With snapshot names encryption we can not allow snapshots to be created in
locked directories because the names wouldn't be encrypted. This patch
forces the directory to be unlocked to allow a snapshot to be created.
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Luís Henriques [Thu, 25 Aug 2022 13:31:29 +0000 (09:31 -0400)]
ceph: add support for handling encrypted snapshot names
When creating a snapshot, the .snap directories for every subdirectory will
show the snapshot name in the "long format":
# mkdir .snap/my-snap
# ls my-dir/.snap/
_my-snap_1099511627782
Encrypted snapshots will need to be able to handle these snapshot names by
encrypting/decrypting only the snapshot part of the string ('my-snap').
Also, since the MDS prevents snapshot names to be bigger than 240 characters
it is necessary to adapt CEPH_NOHASH_NAME_MAX to accommodate this extra
limitation.
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Luís Henriques [Thu, 25 Aug 2022 13:31:28 +0000 (09:31 -0400)]
ceph: add support for encrypted snapshot names
Since filenames in encrypted directories are already encrypted and shown
as a base64-encoded string when the directory is locked, snapshot names
should show a similar behaviour.
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Luís Henriques [Thu, 25 Aug 2022 13:31:27 +0000 (09:31 -0400)]
ceph: invalidate pages when doing direct/sync writes
When doing a direct/sync write, we need to invalidate the page cache in
the range being written to. If we don't do this, the cache will include
invalid data as we just did a write that avoided the page cache.
In the event that invalidation fails, just ignore the error. That likely
just means that we raced with another task doing a buffered write, in
which case we want to leave the page intact anyway.
[ jlayton: minor comment update ]
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 25 Aug 2022 13:31:25 +0000 (09:31 -0400)]
ceph: add encryption support to writepage
Allow writepage to issue encrypted writes. Extend out the requested size
and offset to cover complete blocks, and then encrypt and write them to
the OSDs.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 25 Aug 2022 13:31:21 +0000 (09:31 -0400)]
ceph: add read/modify/write to ceph_sync_write
When doing a synchronous write on an encrypted inode, we have no
guarantee that the caller is writing crypto block-aligned data. When
that happens, we must do a read/modify/write cycle.
First, expand the range to cover complete blocks. If we had to change
the original pos or length, issue a read to fill the first and/or last
pages, and fetch the version of the object from the result.
We then copy data into the pages as usual, encrypt the result and issue
a write prefixed by an assertion that the version hasn't changed. If it has
changed then we restart the whole thing again.
If there is no object at that position in the file (-ENOENT), we prefix
the write on an exclusive create of the object instead.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 25 Aug 2022 13:31:20 +0000 (09:31 -0400)]
ceph: align data in pages in ceph_sync_write
Encrypted files will need to be dealt with in block-sized chunks and
once we do that, the way that ceph_sync_write aligns the data in the
bounce buffer won't be acceptable.
Change it to align the data the same way it would be aligned in the
pagecache.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 25 Aug 2022 13:31:19 +0000 (09:31 -0400)]
ceph: don't use special DIO path for encrypted inodes
Eventually I want to merge the synchronous and direct read codepaths,
possibly via new netfs infrastructure. For now, the direct path is not
crypto-enabled, so use the sync read/write paths instead.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 25 Aug 2022 13:31:18 +0000 (09:31 -0400)]
ceph: disable copy offload on encrypted inodes
If we have an encrypted inode, then the client will need to re-encrypt
the contents of the new object. Disable copy offload to or from
encrypted inodes.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 25 Aug 2022 13:31:16 +0000 (09:31 -0400)]
libceph: allow ceph_osdc_new_request to accept a multi-op read
Currently we have some special-casing for multi-op writes, but in the
case of a read, we can't really handle it. All of the current multi-op
callers call it with CEPH_OSD_FLAG_WRITE set.
Have ceph_osdc_new_request check for CEPH_OSD_FLAG_READ and if it's set,
allocate multiple reply ops instead of multiple request ops. If neither
flag is set, return -EINVAL.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Thu, 25 Aug 2022 13:31:15 +0000 (09:31 -0400)]
ceph: add truncate size handling support for fscrypt
This will transfer the encrypted last block contents to the MDS
along with the truncate request only when the new size is smaller
and not aligned to the fscrypt BLOCK size. When the last block is
located in the file hole, the truncate request will only contain
the header.
The MDS could fail to do the truncate if there has another client
or process has already updated the RADOS object which contains
the last block, and will return -EAGAIN, then the kclient needs
to retry it. The RMW will take around 50ms, and will let it retry
20 times for now.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Thu, 25 Aug 2022 13:31:12 +0000 (09:31 -0400)]
ceph: add __ceph_sync_read helper support
Turn the guts of ceph_sync_read into a new helper that takes an inode
and an offset instead of a kiocb struct, and make ceph_sync_read call
the helper as a wrapper.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 25 Aug 2022 13:31:09 +0000 (09:31 -0400)]
ceph: handle fscrypt fields in cap messages from MDS
Handle the new fscrypt_file and fscrypt_auth fields in cap messages. Use
them to populate new fields in cap_extra_info and update the inode with
those values.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 25 Aug 2022 13:31:08 +0000 (09:31 -0400)]
ceph: get file size from fscrypt_file when present in inode traces
When we get an inode trace from the MDS, grab the fscrypt_file field if
the inode is encrypted, and use it to populate the i_size field instead
of the regular inode size field.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 25 Aug 2022 13:31:05 +0000 (09:31 -0400)]
libceph: add CEPH_OSD_OP_ASSERT_VER support
...and record the user_version in the reply in a new field in
ceph_osd_request, so we can populate the assert_ver appropriately.
Shuffle the fields a bit too so that the new field fits in an
existing hole on x86_64.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 24 Aug 2022 13:24:42 +0000 (09:24 -0400)]
ceph: increment i_version when doing a setattr with caps
When the client has enough caps to satisfy a setattr locally without
having to talk to the server, we currently do the setattr without
incrementing the change attribute.
Ensure that if the ctime changes locally, then the change attribute
does too.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com>
Xiubo Li [Thu, 11 Aug 2022 05:00:53 +0000 (13:00 +0800)]
ceph: no need to wait for transition RDCACHE|RD -> RD
For write when trying to get the Fwb caps we need to keep waiting
on transition from WRBUFFER|WR -> WR to avoid a new WR sync write
from going before a prior buffered writeback happens.
While for read there is no need to wait on transition from
RDCACHE|RD -> RD, and we can just exclude the revoking caps and
force to sync read.
Jeff Layton [Wed, 31 Mar 2021 17:57:14 +0000 (13:57 -0400)]
ceph: make ceph_get_name decrypt filenames
When we do a lookupino to the MDS, we get a filename in the trace.
ceph_get_name uses that name directly, so we must properly decrypt
it before copying it to the name buffer.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 3 Sep 2020 17:31:10 +0000 (13:31 -0400)]
ceph: create symlinks with encrypted and base64-encoded targets
When creating symlinks in encrypted directories, encrypt and
base64-encode the target with the new inode's key before sending to the
MDS.
When filling a symlinked inode, base64-decode it into a buffer that
we'll keep in ci->i_symlink. When get_link is called, decrypt the buffer
into a new one that will hang off i_link.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Mon, 10 Aug 2020 13:33:41 +0000 (09:33 -0400)]
ceph: add support to readdir for encrypted filenames
Once we've decrypted the names in a readdir reply, we no longer need the
crypttext, so overwrite them in ceph_mds_reply_dir_entry with the
unencrypted names. Then in both ceph_readdir_prepopulate() and
ceph_readdir() we will use the dencrypted name directly.
[ jlayton: convert some BUG_ONs into error returns ]
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Mon, 14 Mar 2022 02:28:35 +0000 (10:28 +0800)]
ceph: add ceph_encode_encrypted_dname() helper
Add a new helper that basically calls ceph_encode_encrypted_fname, but
with a qstr pointer instead of a dentry pointer. This will make it
simpler to decrypt names in a readdir reply, before we have a dentry.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Mon, 14 Mar 2022 02:28:34 +0000 (10:28 +0800)]
ceph: pass the request to parse_reply_info_readdir()
Instead of passing just the r_reply_info to the readdir reply parser,
pass the request pointer directly instead. This will facilitate
implementing readdir on fscrypted directories.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 26 Mar 2021 16:26:18 +0000 (12:26 -0400)]
ceph: add helpers for converting names for userland presentation
Define a new ceph_fname struct that we can use to carry information
about encrypted dentry names. Add helpers for working with these
objects, including ceph_fname_to_usr which formats an encrypted filename
for userland presentation.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 7 Aug 2020 19:47:17 +0000 (15:47 -0400)]
ceph: make d_revalidate call fscrypt revalidator for encrypted dentries
If we have a dentry which represents a no-key name, then we need to test
whether the parent directory's encryption key has since been added. Do
that before we test anything else about the dentry.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 28 Mar 2022 20:18:38 +0000 (16:18 -0400)]
ceph: set DCACHE_NOKEY_NAME in atomic open
Atomic open can act as a lookup if handed a dentry that is negative on
the MDS. Ensure that we set DCACHE_NOKEY_NAME on the dentry in
atomic_open, if we don't have the key for the parent. Otherwise, we can
end up validating the dentry inappropriately if someone later adds a
key.
Reviewed-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 14 Jan 2021 15:39:22 +0000 (10:39 -0500)]
ceph: send altname in MClientRequest
In the event that we have a filename longer than CEPH_NOHASH_NAME_MAX,
we'll need to hash the tail of the filename. The client however will
still need to know the full name of the file if it has a key.
To support this, the MClientRequest field has grown a new alternate_name
field that we populate with the full (binary) crypttext of the filename.
This is then transmitted to the clients in readdir or traces as part of
the dentry lease.
Add support for populating this field when the filenames are very long.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 7 Aug 2020 13:28:31 +0000 (09:28 -0400)]
ceph: add encrypted fname handling to ceph_mdsc_build_path
Allow ceph_mdsc_build_path to encrypt and base64 encode the filename
when the parent is encrypted and we're sending the path to the MDS.
In most cases, we just encrypt the filenames and base64 encode them,
but when the name is longer than CEPH_NOHASH_NAME_MAX, we use a similar
scheme to fscrypt proper, and hash the remaning bits with sha256.
When doing this, we then send along the full crypttext of the name in
the new alternate_name field of the MClientRequest. The MDS can then
send that along in readdir responses and traces.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
ceph: add base64 endcoding routines for encrypted names
The base64url encoding used by fscrypt includes the '_' character, which
may cause problems in snapshot names (if the name starts with '_').
Thus, use the base64 encoding defined for IMAP mailbox names (RFC 3501),
which uses '+' and ',' instead of '-' and '_'.
Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Tue, 28 Jul 2020 13:58:43 +0000 (09:58 -0400)]
ceph: add fscrypt ioctls
We gate most of the ioctls on MDS feature support. The exception is the
key removal and status functions that we still want to work if the MDS's
were to (inexplicably) lose the feature.
For the set_policy ioctl, we take Fs caps to ensure that nothing can
create files in the directory while the ioctl is running. That should
be enough to ensure that the "empty_dir" check is reliable.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 11 Jan 2021 17:35:48 +0000 (12:35 -0500)]
ceph: decode alternate_name in lease info
Ceph is a bit different from local filesystems, in that we don't want
to store filenames as raw binary data, since we may also be dealing
with clients that don't support fscrypt.
We could just base64-encode the encrypted filenames, but that could
leave us with filenames longer than NAME_MAX. It turns out that the
MDS doesn't care much about filename length, but the clients do.
To manage this, we've added a new "alternate name" field that can be
optionally added to any dentry that we'll use to store the binary
crypttext of the filename if its base64-encoded value will be longer
than NAME_MAX. When a dentry has one of these names attached, the MDS
will send it along in the lease info, which we can then store for
later usage.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com>
Jeff Layton [Tue, 8 Sep 2020 13:47:40 +0000 (09:47 -0400)]
ceph: implement -o test_dummy_encryption mount option
Add support for the test_dummy_encryption mount option. This allows us
to test the encrypted codepaths in ceph without having to manually set
keys, etc.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com>
Jeff Layton [Mon, 27 Jul 2020 14:16:09 +0000 (10:16 -0400)]
ceph: fscrypt_auth handling for ceph
Most fscrypt-enabled filesystems store the crypto context in an xattr,
but that's problematic for ceph as xatts are governed by the XATTR cap,
but we really want the crypto context as part of the AUTH cap.
Because of this, the MDS has added two new inode metadata fields:
fscrypt_auth and fscrypt_file. The former is used to hold the crypto
context, and the latter is used to track the real file size.
Parse new fscrypt_auth and fscrypt_file fields in inode traces. For now,
we don't use fscrypt_file, but fscrypt_auth is used to hold the fscrypt
context.
Allow the client to use a setattr request for setting the fscrypt_auth
field. Since this is not a standard setattr request from the VFS, we add
a new field to __ceph_setattr that carries ceph-specific inode attrs.
Have the set_context op do a setattr that sets the fscrypt_auth value,
and get_context just return the contents of that field (since it should
always be available).
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 1 Jul 2022 10:30:13 +0000 (06:30 -0400)]
ceph: use osd_req_op_extent_osd_iter for netfs reads
The netfs layer has already pinned the pages involved before calling
issue_op, so we can just pass down the iter directly instead of calling
iov_iter_get_pages_alloc.
Instead of having to allocate a page array, use CEPH_MSG_DATA_ITER and
pass it the iov_iter directly to clone.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com>
Jeff Layton [Fri, 1 Jul 2022 10:30:12 +0000 (06:30 -0400)]
libceph: add new iov_iter-based ceph_msg_data_type and ceph_osd_data_type
Add an iov_iter to the unions in ceph_msg_data and ceph_msg_data_cursor.
Instead of requiring a list of pages or bvecs, we can just use an
iov_iter directly, and avoid extra allocations.
We assume that the pages represented by the iter are pinned such that
they shouldn't incur page faults, which is the case for the iov_iters
created by netfs.
While working on this, Al Viro informed me that he was going to change
iov_iter_get_pages to auto-advance the iterator as that pattern is more
or less required for ITER_PIPE anyway. We emulate that here for now by
advancing in the _next op and tracking that amount in the "lastlen"
field.
In the event that _next is called twice without an intervening
_advance, we revert the iov_iter by the remaining lastlen before
calling iov_iter_get_pages.
Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Xiubo Li <xiubli@redhat.com>
Jeff Layton [Wed, 5 Aug 2020 18:50:48 +0000 (14:50 -0400)]
ceph: make ceph_msdc_build_path use ref-walk
Encryption potentially requires allocation, at which point we'll need to
be in a non-atomic context. Convert ceph_msdc_build_path to take dentry
spinlocks and references instead of using rcu_read_lock to walk the
path.
This is slightly less efficient, and we may want to eventually allow
using RCU when the leaf dentry isn't encrypted.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 26 Aug 2020 17:11:00 +0000 (13:11 -0400)]
ceph: preallocate inode for ops that may create one
When creating a new inode, we need to determine the crypto context
before we can transmit the RPC. The fscrypt API has a routine for getting
a crypto context before a create occurs, but it requires an inode.
Change the ceph code to preallocate an inode in advance of a create of
any sort (open(), mknod(), symlink(), etc). Move the existing code that
generates the ACL and SELinux blobs into this routine since that's
mostly common across all the different codepaths.
In most cases, we just want to allow ceph_fill_trace to use that inode
after the reply comes in, so add a new field to the MDS request for it
(r_new_inode).
The async create codepath is a bit different though. In that case, we
want to hash the inode in advance of the RPC so that it can be used
before the reply comes in. If the call subsequently fails with
-EJUKEBOX, then just put the references and clean up the as_ctx. Note
that with this change, we now need to regenerate the as_ctx when this
occurs, but it's quite rare for it to happen.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Sat, 26 Feb 2022 11:33:03 +0000 (06:33 -0500)]
ceph: add new mount option to enable sparse reads
Add a new mount option that has the client issue sparse reads instead of
normal ones. The callers now preallocate an sparse extent buffer that
the libceph receive code can populate and hand back after the operation
completes.
After a successful sparse read, we can't use the req->r_result value to
determine the amount of data "read", so instead we set the received
length to be from the end of the last extent in the buffer. Any
interstitial holes will have been filled by the receive code.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 23 Mar 2022 16:17:15 +0000 (12:17 -0400)]
libceph: support sparse reads on msgr2 secure codepath
Add a new init_sgs_pages helper that populates the scatterlist from
an arbitrary point in an array of pages.
Change setup_message_sgs to take an optional pointer to an array of
pages. If that's set, then the scatterlist will be set using that
array instead of the cursor.
When given a sparse read on a secure connection, decrypt the data
in-place rather than into the final destination, by passing it the
in_enc_pages array.
After decrypting, run the sparse_read state machine in a loop, copying
data from the decrypted pages until it's complete.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 11 Feb 2022 16:38:02 +0000 (11:38 -0500)]
libceph: add sparse read support to OSD client
Have get_reply check for the presence of sparse read ops in the
request and set the sparse_read boolean in the msg. That will queue the
messenger layer to use the sparse read codepath instead of the normal
data receive.
Add a new sparse_read operation for the OSD client, driven by its own
state machine. The messenger will repeatedly call the sparse_read
operation, and it will pass back the necessary info to set up to read
the next extent of data, while zero-filling the sparse regions.
The state machine will stop at the end of the last extent, and will
attach the extent map buffer to the ceph_osd_req_op so that the caller
can use it.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Tue, 25 Jan 2022 13:26:31 +0000 (08:26 -0500)]
libceph: add sparse read support to msgr2 crc state machine
Add support for a new sparse_read ceph_connection operation. The idea is
that the client driver can define this operation use it to do special
handling for incoming reads.
The alloc_msg routine will look at the request and determine whether the
reply is expected to be sparse. If it is, then we'll dispatch to a
different set of state machine states that will repeatedly call the
driver's sparse_read op to get length and placement info for reading the
extent map, and the extents themselves.
This necessitates adding some new field to some other structs:
- The msg gets a new bool to track whether it's a sparse_read request.
- A new field is added to the cursor to track the amount remaining in the
current extent. This is used to cap the read from the socket into the
msg_data
- Handing a revoke with all of this is particularly difficult, so I've
added a new data_len_remain field to the v2 connection info, and then
use that to skip that much on a revoke. We may want to expand the use of
that to the normal read path as well, just for consistency's sake.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 16 Mar 2022 19:23:00 +0000 (15:23 -0400)]
libceph: define struct ceph_sparse_extent and add some helpers
When the OSD sends back a sparse read reply, it contains an array of
these structures. Define the structure and add a couple of helpers for
dealing with them.
Also add a place in struct ceph_osd_req_op to store the extent buffer,
and code to free it if it's populated when the req is torn down.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Mon, 14 Mar 2022 19:47:18 +0000 (15:47 -0400)]
libceph: add spinlock around osd->o_requests
In a later patch, we're going to need to search for a request in
the rbtree, but taking the o_mutex is inconvenient as we already
hold the con mutex at the point where we need it.
Add a new spinlock that we take when inserting and erasing entries from
the o_requests tree. Search of the rbtree can be done with either the
mutex or the spinlock, but insertion and removal requires both.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 25 May 2022 10:11:00 +0000 (06:11 -0400)]
libceph: drop last_piece flag from ceph_msg_data_cursor
ceph_msg_data_next is always passed a NULL pointer for this field. Some
of the "next" operations look at it in order to determine the length,
but we can just take the min of the data on the page or cursor->resid.
Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Fri, 13 May 2022 14:23:25 +0000 (10:23 -0400)]
[DO NOT MERGE] mm: BUG if filemap_alloc_folio gives us a folio with a non-NULL ->private
We've seen some instances where we call __filemap_get_folio and get back
one with a ->private value that is non-NULL. Let's have the allocator
bug if that happens.
For now, let's just put this into the testing kernel. We can let Willy
decide if he wants it in mainline.
URL: https://tracker.ceph.com/issues/55421 Cc: Matthew Wilcox <willy@infradead.org> Cc: Xiubo Li <xiubli@redhat.com> Cc: Luís Henriques <lhenriques@suse.de> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Yury Norov [Fri, 12 Aug 2022 05:34:25 +0000 (22:34 -0700)]
radix-tree: replace gfp.h inclusion with gfp_types.h
Radix tree header includes gfp.h for __GFP_BITS_SHIFT only. Now we
have gfp_types.h for this.
Fixes powerpc allmodconfig build:
In file included from include/linux/nodemask.h:97,
from include/linux/mmzone.h:17,
from include/linux/gfp.h:7,
from include/linux/radix-tree.h:12,
from include/linux/idr.h:15,
from include/linux/kernfs.h:12,
from include/linux/sysfs.h:16,
from include/linux/kobject.h:20,
from include/linux/pci.h:35,
from arch/powerpc/kernel/prom_init.c:24:
include/linux/random.h: In function 'add_latent_entropy':
>> include/linux/random.h:25:46: error: 'latent_entropy' undeclared (first use in this function); did you mean 'add_latent_entropy'?
25 | add_device_randomness((const void *)&latent_entropy, sizeof(latent_entropy));
| ^~~~~~~~~~~~~~
| add_latent_entropy
include/linux/random.h:25:46: note: each undeclared identifier is reported only once for each function it appears in
Reported-by: kernel test robot <lkp@intel.com> CC: Andy Shevchenko <andriy.shevchenko@linux.intel.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 14 Aug 2022 20:03:53 +0000 (13:03 -0700)]
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs lseek fix from Al Viro:
"Fix proc_reg_llseek() breakage. Always had been possible if somebody
left NULL ->proc_lseek, became a practical issue now"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
take care to handle NULL ->proc_lseek()
Linus Torvalds [Sun, 14 Aug 2022 16:22:11 +0000 (09:22 -0700)]
Merge tag 'perf-tools-fixes-for-v6.0-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull more perf tool updates from Arnaldo Carvalho de Melo:
- 'perf c2c' now supports ARM64, adjust its output to cope with
differences with what is in x86_64. Now go find false sharing on
ARM64 (at least Neoverse) as well!
- Refactor the JSON processing, making the output more compact and thus
reducing the size of the resulting perf binary
- Improvements for 'perf offcpu' profiling, including tracking child
processes
- Update Intel JSON metrics and events files for broadwellde,
broadwellx, cascadelakex, haswellx, icelakex, ivytown, jaketown,
knightslanding, sapphirerapids, skylakex and snowridgex
- Add 'perf stat' JSON output and a 'perf test' entry for it
- Ignore memfd and anonymous mmap events if jitdump present
- Fix an error handling path in 'parse_perf_probe_command()'
- Fixes for the guest Intel PT tracing patchkit in the 1st batch of
this merge window
- Print debuginfod queries if -v option is used, to explain delays in
processing when debuginfo servers are enabled to fetch DSOs with
richer symbol tables
- Improve error message for 'perf record -p not_existing_pid'
- Fix openssl and libbpf feature detection
- Add PMU pai_crypto event description for IBM z16 on 'perf list'
- Fix typos and duplicated words on comments in various places
* tag 'perf-tools-fixes-for-v6.0-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (81 commits)
perf test: Refactor shell tests allowing subdirs
perf vendor events: Update events for snowridgex
perf vendor events: Update events and metrics for skylakex
perf vendor events: Update metrics for sapphirerapids
perf vendor events: Update events for knightslanding
perf vendor events: Update metrics for jaketown
perf vendor events: Update metrics for ivytown
perf vendor events: Update events and metrics for icelakex
perf vendor events: Update events and metrics for haswellx
perf vendor events: Update events and metrics for cascadelakex
perf vendor events: Update events and metrics for broadwellx
perf vendor events: Update metrics for broadwellde
perf jevents: Fold strings optimization
perf jevents: Compress the pmu_events_table
perf metrics: Copy entire pmu_event in find metric
perf pmu-events: Hide the pmu_events
perf pmu-events: Don't assume pmu_event is an array
perf pmu-events: Move test events/metrics to JSON
perf test: Use full metric resolution
perf pmu-events: Hide pmu_events_map
...
Linus Torvalds [Sun, 14 Aug 2022 15:48:13 +0000 (08:48 -0700)]
Merge tag 'powerpc-6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Ensure we never emit lwarx with EH=1 on 32-bit, because some 32-bit
CPUs trap on it rather than ignoring it as they should.
- Fix ftrace when building with clang, which was broken by some
refactoring.
- A couple of other minor fixes.
Thanks to Christophe Leroy, Naveen N. Rao, Nick Desaulniers, Ondrej
Mosnacek, Pali Rohár, Russell Currey, and Segher Boessenkool.
* tag 'powerpc-6.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/kexec: Fix build failure from uninitialised variable
powerpc/ppc-opcode: Fix PPC_RAW_TW()
powerpc64/ftrace: Fix ftrace for clang builds
powerpc: Make eh value more explicit when using lwarx
powerpc: Don't hide eh field of lwarx behind a macro
powerpc: Fix eh field when calling lwarx on PPC32
Linus Torvalds [Sun, 14 Aug 2022 00:31:18 +0000 (17:31 -0700)]
Merge tag '5.20-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6
Pull more cifs updates from Steve French:
- two fixes for stable, one for a lock length miscalculation, and
another fixes a lease break timeout bug
- improvement to handle leases, allows the close timeout to be
configured more safely
- five restructuring/cleanup patches
* tag '5.20-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Do not access tcon->cfids->cfid directly from is_path_accessible
cifs: Add constructor/destructors for tcon->cfid
SMB3: fix lease break timeout when multiple deferred close handles for the same file.
smb3: allow deferred close timeout to be configurable
cifs: Do not use tcon->cfid directly, use the cfid we get from open_cached_dir
cifs: Move cached-dir functions into a separate file
cifs: Remove {cifs,nfs}_fscache_release_page()
cifs: fix lock length calculation
David Howells [Wed, 10 Aug 2022 17:52:47 +0000 (18:52 +0100)]
afs: Enable multipage folio support
Enable multipage folio support for the afs filesystem.
Support has already been implemented in netfslib, fscache and cachefiles
and in most of afs, but I've waited for Matthew Wilcox's latest folio
changes.
Note that it does require a change to afs_write_begin() to return the
correct subpage. This is a "temporary" change as we're working on
getting rid of the need for ->write_begin() and ->write_end()
completely, at least as far as network filesystems are concerned - but
it doesn't prevent afs from making use of the capability.
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: kafs-testing@auristor.com Cc: Marc Dionne <marc.dionne@auristor.com> Cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/lkml/2274528.1645833226@warthog.procyon.org.uk/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 13 Aug 2022 21:38:22 +0000 (14:38 -0700)]
Merge tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Ingo Molnar:
"Misc timer fixes:
- fix a potential use-after-free bug in posix timers
- correct a prototype
- address a build warning"
* tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-cpu-timers: Cleanup CPU timers before freeing them during exec
time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64
posix-timers: Make do_clock_gettime() static
Linus Torvalds [Sat, 13 Aug 2022 21:24:12 +0000 (14:24 -0700)]
Merge tag 'x86-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"Fix the 'IBPB mitigated RETBleed' mode of operation on AMD CPUs (not
turned on by default), which also need STIBP enabled (if available) to
be '100% safe' on even the shortest speculation windows"
* tag 'x86-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Enable STIBP for IBPB mitigated RETBleed
Linus Torvalds [Sat, 13 Aug 2022 21:00:45 +0000 (14:00 -0700)]
Merge tag 'ntb-5.20' of https://github.com/jonmason/ntb
Pull NTB updates from Jon Mason:
"Non-Transparent Bridge updates.
Fix of heap data and clang warnings, support for a new Intel NTB
device, and NTB EndPoint Function (EPF) support and the various fixes
for that"
* tag 'ntb-5.20' of https://github.com/jonmason/ntb:
MAINTAINERS: add PCI Endpoint NTB drivers to NTB files
NTB: EPF: Tidy up some bounds checks
NTB: EPF: Fix error code in epf_ntb_bind()
PCI: endpoint: pci-epf-vntb: reduce several globals to statics
PCI: endpoint: pci-epf-vntb: fix error handle in epf_ntb_mw_bar_init()
PCI: endpoint: Fix Kconfig dependency
NTB: EPF: set pointer addr to null using NULL rather than 0
Documentation: PCI: extend subheading underline for "lspci output" section
Documentation: PCI: Use code-block block for scratchpad registers diagram
Documentation: PCI: Add specification for the PCI vNTB function device
PCI: endpoint: Support NTB transfer between RC and EP
NTB: epf: Allow more flexibility in the memory BAR map method
PCI: designware-ep: Allow pci_epc_set_bar() update inbound map address
ntb: intel: add GNR support for Intel PCIe gen5 NTB
NTB: ntb_tool: uninitialized heap data in tool_fn_write()
ntb: idt: fix clang -Wformat warnings
Linus Torvalds [Sat, 13 Aug 2022 20:50:11 +0000 (13:50 -0700)]
Merge tag 'xfs-5.20-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull more xfs updates from Darrick Wong:
"There's not a lot this time around, just the usual bug fixes and
corrections for missing error returns.
- Return error codes from block device flushes to userspace
- Fix a deadlock between reclaim and mount time quotacheck
- Fix an unnecessary ENOSPC return when doing COW on a filesystem
with severe free space fragmentation
- Fix a miscalculation in the transaction reservation computations
for file removal operations"
* tag 'xfs-5.20-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix inode reservation space for removing transaction
xfs: Fix false ENOSPC when performing direct write on a delalloc extent in cow fork
xfs: fix intermittent hang during quotacheck
xfs: check return codes when flushing block devices
Linus Torvalds [Sat, 13 Aug 2022 20:41:48 +0000 (13:41 -0700)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"Mostly small bug fixes and trivial updates.
The major new core update is a change to the way device, target and
host reference counting is done to try to make it more robust (this
change has soaked for a while to try to winkle out any bugs)"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: pm8001: Fix typo 'the the' in comment
scsi: megaraid_sas: Remove redundant variable cmd_type
scsi: FlashPoint: Remove redundant variable bm_int_st
scsi: zfcp: Fix missing auto port scan and thus missing target ports
scsi: core: Call blk_mq_free_tag_set() earlier
scsi: core: Simplify LLD module reference counting
scsi: core: Make sure that hosts outlive targets
scsi: core: Make sure that targets outlive devices
scsi: ufs: ufs-pci: Correct check for RESET DSM
scsi: target: core: De-RCU of se_lun and se_lun acl
scsi: target: core: Fix race during ACL removal
scsi: ufs: core: Correct ufshcd_shutdown() flow
scsi: ufs: core: Increase the maximum data buffer size
scsi: lpfc: Check the return value of alloc_workqueue()
Linus Torvalds [Sat, 13 Aug 2022 20:37:36 +0000 (13:37 -0700)]
Merge tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request
- print nvme connect Linux error codes properly (Amit Engel)
- fix the fc_appid_store return value (Christoph Hellwig)
- fix a typo in an error message (Christophe JAILLET)
- add another non-unique identifier quirk (Dennis P. Kliem)
- check if the queue is allocated before stopping it in nvme-tcp
(Maurizio Lombardi)
- restart admin queue if the caller needs to restart queue in
nvme-fc (Ming Lei)
- use kmemdup instead of kmalloc + memcpy in nvme-auth (Zhang
Xiaoxu)
- __alloc_disk_node() error handling fix (Rafael)
* tag 'block-6.0-2022-08-12' of git://git.kernel.dk/linux-block:
block: Do not call blk_put_queue() if gendisk allocation fails
nvme-pci: add NVME_QUIRK_BOGUS_NID for ADATA XPG GAMMIX S70
nvme-tcp: check if the queue is allocated before stopping it
nvme-fabrics: Fix a typo in an error message
nvme-fabrics: parse nvme connect Linux error codes
nvmet-auth: use kmemdup instead of kmalloc + memcpy
nvme-fc: fix the fc_appid_store return value
nvme-fc: restart admin queue if the caller needs to restart queue
Linus Torvalds [Sat, 13 Aug 2022 20:28:54 +0000 (13:28 -0700)]
Merge tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
- Regression fix for this merge window, fixing a wrong order of
arguments for io_req_set_res() for passthru (Dylan)
- Fix for the audit code leaking context memory (Peilin)
- Ensure that provided buffers are memcg accounted (Pavel)
- Correctly handle short zero-copy sends (Pavel)
- Sparse warning fixes for the recvmsg multishot command (Dylan)
- Error handling fix for passthru (Anuj)
- Remove randomization of struct kiocb fields, to avoid it growing in
size if re-arranged in such a fashion that it grows more holes or
padding (Keith, Linus)
- Small series improving type safety of the sqe fields (Stefan)
* tag 'io_uring-6.0-2022-08-13' of git://git.kernel.dk/linux-block:
io_uring: add missing BUILD_BUG_ON() checks for new io_uring_sqe fields
io_uring: make io_kiocb_to_cmd() typesafe
fs: don't randomize struct kiocb fields
io_uring: consistently make use of io_notif_to_data()
io_uring: fix error handling for io_uring_cmd
io_uring: fix io_recvmsg_prep_multishot sparse warnings
io_uring/net: send retry for zerocopy
io_uring: mem-account pbuf buckets
audit, io_uring, io-wq: Fix memory leak in io_sq_thread() and io_wqe_worker()
io_uring: pass correct parameters to io_req_set_res
Carsten Haitzler [Fri, 12 Aug 2022 12:16:28 +0000 (13:16 +0100)]
perf test: Refactor shell tests allowing subdirs
This is a prelude to adding more tests to shell tests and in order to
support putting those tests into subdirectories, I need to change the
test code that scans/finds and runs them.
To support subdirs I have to recurse so it's time to refactor the code
to allow this and centralize the shell script finding into one location
and only one single scan that builds a list of all the found tests in
memory instead of it being duplicated in 3 places.
This code also optimizes things like knowing the max width of desciption
strings (as we can do that while we scan instead of a whole new pass of
opening files).
It also more cleanly filters scripts to see only *.sh files thus
skipping random other files in directories like *~ backup files, other
random junk/data files that may appear and the scripts must be
executable to make the cut (this ensures the script lib dir is not seen
as scripts to run).
This avoids perf test running previous older versions of test scripts
that are editor backup files as well as skipping perf.data files that
may appear and so on.
Reviewed-by: Leo Yan <leo.yan@linaro.org> Signed-off-by: Carsten Haitzler <carsten.haitzler@arm.com> Tested-by: Leo Yan <leo.yan@linaro.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Leach <mike.leach@linaro.org> Cc: Suzuki Poulouse <suzuki.poulose@arm.com> Cc: coresight@lists.linaro.org Link: https://lore.kernel.org/r/20220812121641.336465-2-carsten.haitzler@foss.arm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>