git-server-git.apps.pok.os.sepia.ceph.com Git - ceph-client.git/log
5 years ago libceph: remove osdtimeout option entirely
Ilya Dryomov [Fri, 22 Jan 2021 15:50:42 +0000 (16:50 +0100)]
libceph: remove osdtimeout option entirely

Commit 83aff95eb9d6 ("libceph: remove 'osdtimeout' option") deprecated
osdtimeout over 8 years ago, but it is still recognized.  Let's remove
it entirely.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
5 years ago libceph: deprecate [no]cephx_require_signatures options
Ilya Dryomov [Fri, 22 Jan 2021 14:41:14 +0000 (15:41 +0100)]
libceph: deprecate [no]cephx_require_signatures options

These options were introduced in 3.19 with support for message signing
and are rather useless, as explained in commit a51983e4dd2d ("libceph:
add nocephx_sign_messages option").  Deprecate them.

In case there is someone out there with a cluster that lacks support
for MSG_AUTH feature (very unlikely but has to be considered since we
haven't formally raised the bar from argonaut to bobtail yet), make
nocephx_sign_messages also waive MSG_AUTH requirement.  This is probably
how it should have been done in the first place -- if we aren't going
to sign, requiring the signing feature makes no sense.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
5 years ago ceph: allow queueing cap/snap handling after putting cap references
Jeff Layton [Thu, 10 Dec 2020 19:39:26 +0000 (14:39 -0500)]
ceph: allow queueing cap/snap handling after putting cap references

Testing with the fscache overhaul has triggered some lockdep warnings
about circular lock dependencies involving page_mkwrite and the
mmap_lock. It'd be better to do the "real work" without the mmap lock
being held.

Change the skip_checking_caps parameter in __ceph_put_cap_refs to an
enum, and use that to determine whether to queue check_caps, do it
synchronously or not at all. Change ceph_page_mkwrite to do a
ceph_put_cap_refs_async().

Signed-off-by: Jeff Layton <jlayton@kernel.org>
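
The three-way choice described in the commit above (check synchronously, queue the check, or skip it) can be modelled in a few lines. This is a minimal Python sketch, not the kernel code: the enum member names, `put_cap_refs()` and the list used as a work queue are all hypothetical stand-ins chosen for illustration.

```python
from enum import Enum, auto


class PutCapRefsMode(Enum):
    """Hypothetical stand-in for the enum that replaces skip_checking_caps."""
    SYNC = auto()      # run the cap check immediately
    ASYNC = auto()     # queue the cap check to run later, outside the caller's locks
    NO_CHECK = auto()  # do not check caps at all


def put_cap_refs(mode, check_caps, work_queue):
    """Model of the post-put step: decide what to do about check_caps.

    check_caps is any callable; work_queue is a plain list standing in
    for a kernel workqueue. Returns a label describing the action taken.
    """
    if mode is PutCapRefsMode.SYNC:
        check_caps()
        return "checked"
    if mode is PutCapRefsMode.ASYNC:
        work_queue.append(check_caps)
        return "queued"
    return "skipped"
```

In this model a caller that holds an inconvenient lock, as ceph_page_mkwrite does with the mmap lock, would pass the ASYNC mode so that the real work runs later from the queue.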
5 years ago ceph: clean up inode work queueing
Jeff Layton [Fri, 9 Oct 2020 18:24:34 +0000 (14:24 -0400)]
ceph: clean up inode work queueing

Add a generic function for taking an inode reference, setting the I_WORK
bit and queueing i_work. Turn the ceph_queue_* functions into static
inline wrappers that pass in the right bit.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
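
The pattern in the commit above, one generic helper plus thin wrappers that each pass a different bit, might look like the following. This is a Python sketch under stated assumptions: the bit names, the `Inode` class and the list-based queue are hypothetical, and the real helper's handling of already-queued work is not modelled.

```python
# Hypothetical bit numbers; the actual I_WORK bit names are not given in the log.
WORK_WRITEBACK = 0
WORK_INVALIDATE_PAGES = 1
WORK_VMTRUNCATE = 2


class Inode:
    """Toy inode carrying only a reference count and a pending-work bitmask."""

    def __init__(self):
        self.refcount = 1
        self.work_mask = 0


def queue_inode_work(inode, work_bit, work_queue):
    """Generic helper: take a reference, set the work bit, queue the inode."""
    inode.refcount += 1
    inode.work_mask |= 1 << work_bit
    work_queue.append(inode)


# Thin wrappers that only select the bit, mirroring the static inline wrappers.
def queue_writeback(inode, work_queue):
    queue_inode_work(inode, WORK_WRITEBACK, work_queue)


def queue_invalidate(inode, work_queue):
    queue_inode_work(inode, WORK_INVALIDATE_PAGES, work_queue)


def queue_vmtruncate(inode, work_queue):
    queue_inode_work(inode, WORK_VMTRUNCATE, work_queue)
```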
5 years ago ceph: fix flush_snap logic after putting caps
Jeff Layton [Thu, 10 Dec 2020 18:35:46 +0000 (13:35 -0500)]
ceph: fix flush_snap logic after putting caps

A primary reason for skipping ceph_check_caps after putting the
references was to avoid the locking in ceph_check_caps during a
reconnect. __ceph_put_cap_refs can still call ceph_flush_snaps in that
case though, and that takes many of the same inconvenient locks.

Fix the logic in __ceph_put_cap_refs to skip flushing snaps when the
skip_checking_caps flag is set.

Fixes: e64f44a88465 ("ceph: skip checking caps when session reconnecting and releasing reqs")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
5 years ago libceph: add osd op counter metric support
Xiubo Li [Tue, 10 Nov 2020 14:19:37 +0000 (22:19 +0800)]
libceph: add osd op counter metric support

The logic is the same as in osdc/Objecter.cc in user-space Ceph.

URL: https://tracker.ceph.com/issues/48053
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
5 years ago [DO NOT MERGE] rbd: bump RBD_MAX_PARENT_CHAIN_LEN to 128
Ilya Dryomov [Sat, 20 Feb 2016 17:26:57 +0000 (18:26 +0100)]
[DO NOT MERGE] rbd: bump RBD_MAX_PARENT_CHAIN_LEN to 128

Bump RBD_MAX_PARENT_CHAIN_LEN from 16 to 128 to avoid fsx failures.

(The alternative is changing fsx to flatten unconditionally when the
limit of 16 is reached, which is ugly and not needed for librbd.)

5 years ago Merge branch 'dhowells/fscache-next'
Jeff Layton [Sat, 6 Feb 2021 11:51:26 +0000 (06:51 -0500)]
Merge branch 'dhowells/fscache-next'

Merge in David Howells' fscache-next branch, which contains the new
netfs infrastructure and the patches to allow ceph to use it.

5 years ago fscache: rectify minor kernel-doc issues
Lukas Bulwahn [Thu, 4 Feb 2021 07:56:24 +0000 (08:56 +0100)]
fscache: rectify minor kernel-doc issues

The command './scripts/kernel-doc -none include/linux/fscache.h' reports
some minor mismatches of the kernel-doc and function signature, which are
easily resolved.

Rectify the kernel-doc, such that no issues remain for fscache.h.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago netfs: Fix kerneldoc on netfs_subreq_terminated()
David Howells [Wed, 3 Feb 2021 11:14:09 +0000 (11:14 +0000)]
netfs: Fix kerneldoc on netfs_subreq_terminated()

Fix the kerneldoc on netfs_subreq_terminated() to describe the second
function argument.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Fix error handling in afs_req_issue_op()
David Howells [Tue, 2 Feb 2021 17:40:36 +0000 (17:40 +0000)]
afs: Fix error handling in afs_req_issue_op()

Fix error handling in afs_req_issue_op() by calling
netfs_subreq_terminated() rather than simply storing the error in
subreq->error.  The netfs function must be called to wake up anyone
waiting.

Fixes: 751551a7a74a ("afs: Use new fscache read helper API")
Signed-off-by: David Howells <dhowells@redhat.com>
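
Why storing the error alone is not enough can be seen in a small model. The Python sketch below is an illustration, not the netfs API: `Subrequest` and `terminate()` are hypothetical stand-ins, with `terminate()` playing the role the commit assigns to netfs_subreq_terminated(), and a `threading.Event` standing in for whatever the waiter sleeps on.

```python
import threading


class Subrequest:
    """Toy subrequest: an error slot plus an event that waiters block on."""

    def __init__(self):
        self.error = 0
        self.done = threading.Event()


def terminate(subreq, error):
    """Record the outcome and wake anyone waiting on the subrequest.

    Hypothetical stand-in for the termination call; merely assigning
    subreq.error would leave waiters blocked forever.
    """
    subreq.error = error
    subreq.done.set()
```

In this model, assigning `subreq.error` directly leaves `done` unset, so a waiter calling `subreq.done.wait()` never returns; going through `terminate()` both records the error and releases the waiter.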
5 years ago netfs: Fix various bits of error handling
David Howells [Tue, 2 Feb 2021 17:37:51 +0000 (17:37 +0000)]
netfs: Fix various bits of error handling

Fix some bits of error handling to do with failing to fully slice up a read
request.  This would typically be due to ENOMEM occurring somewhere.

 (1) In netfs_rreq_submit_slice():

     (a) When slicing fails, put the subrequest on the list so that it will
       be cleaned up later and can be examined by the assessment and page
       unlock code.

     (b) A subrequest must always hold a ref on the main request struct,
       even if we fail to slice it, as the ref will be released
       unconditionally on the last put.

     (c) The error code from the subreq should be saved (this is presumed
       to have been set by the cache or the netfs).

 (2) In netfs_rreq_unlock():

     (a) If we run out of subreqs whilst going through the loop, just
       unlock all remaining pages without marking them PG_uptodate.

     (b) In netfs_rreq_unlock(), we should be checking pg_failed, not
       subreq_failed, to see if a page is fully read (a page may be
       contributed to by multiple subreqs, and subreq_failed is the state
       of only the last one).

 (3) netfs_readpage() and netfs_write_begin() should return EIO if
     insufficient data was read and no other error is recorded.

Fixes: 467ef3015ee4 ("netfs: Provide readahead and readpage netfs helpers")
Signed-off-by: David Howells <dhowells@redhat.com>
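
Points (2)(a) and (2)(b) above concern per-page state when several subrequests contribute to one page. The following Python sketch models that accounting under stated assumptions: `pages_uptodate()` is a hypothetical helper, subrequests are non-overlapping `(start, length, failed)` byte ranges, and a fixed 4096-byte page size is assumed.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


def pages_uptodate(num_pages, subreqs):
    """Return one bool per page: True if the page may be marked up to date.

    subreqs is a list of non-overlapping (start, length, failed) byte ranges.
    A page is up to date only if every subrequest touching it succeeded
    (the per-page pg_failed idea) and the subrequests cover it completely
    (pages left over when the subrequests run out are not marked).
    """
    result = []
    for index in range(num_pages):
        page_start = index * PAGE_SIZE
        page_end = page_start + PAGE_SIZE
        pg_failed = False
        covered = 0
        for start, length, failed in subreqs:
            lo = max(start, page_start)
            hi = min(start + length, page_end)
            if lo < hi:  # this subrequest contributes to the page
                covered += hi - lo
                pg_failed = pg_failed or failed
        result.append(not pg_failed and covered == PAGE_SIZE)
    return result
```

With subrequests `(0, 2048, True)`, `(2048, 2048, False)` and `(4096, 4096, False)` over three pages, looking only at the last contributor to page 0 would wrongly treat it as fully read; accumulating the failure per page avoids that.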
5 years ago ceph: fix an oops in error handling in ceph_netfs_issue_op
Jeff Layton [Tue, 2 Feb 2021 13:10:41 +0000 (08:10 -0500)]
ceph: fix an oops in error handling in ceph_netfs_issue_op

Dan reported a potential oops in the cleanup if ceph_osdc_new_request
returns an error. Eliminate the unneeded initialization of "req" and
then just set it to NULL in the case where it holds an ERR_PTR.

Also, drop the unneeded NULL check before calling
ceph_osdc_put_request.

Fixes: 1cf7fdf52d5a ("ceph: convert readpage to fscache read helper")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Suggested-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Dan Carpenter <dan.carpenter@oracle.com>
5 years ago Merge branch 'ceph-netfs-lib' of https://git.kernel.org/pub/scm/linux/kernel/git...
David Howells [Thu, 28 Jan 2021 12:47:02 +0000 (12:47 +0000)]
Merge branch 'ceph-netfs-lib' of https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux into fscache-next

Merge Jeff Layton's Ceph changes for the new netfs helper library.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago Merge branch 'fscache-netfs-lib' into fscache-next
David Howells [Thu, 28 Jan 2021 12:36:25 +0000 (12:36 +0000)]
Merge branch 'fscache-netfs-lib' into fscache-next

Merge the core netfs helper library, the code required to support it in
fscache and cachefiles and the AFS changes to use it.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago ceph: convert ceph_readpages to ceph_readahead
Jeff Layton [Thu, 9 Jul 2020 18:43:23 +0000 (14:43 -0400)]
ceph: convert ceph_readpages to ceph_readahead

Convert ceph_readpages to ceph_readahead and make it use
netfs_readahead. With this we can rip out a lot of the old
readpage/readpages infrastructure.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
5 years ago ceph: plug write_begin into read helper
Jeff Layton [Fri, 5 Jun 2020 14:43:21 +0000 (10:43 -0400)]
ceph: plug write_begin into read helper

Convert ceph_write_begin to use the netfs_write_begin helper. Most of
the ops we need for it are already in place from the readpage conversion
but we do add a new check_write_begin op since ceph needs to be able to
vet whether there is an incompatible writeback already in flight before
reading in the page.

With this, we can also remove the old ceph_do_readpage helper.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
5 years ago ceph: convert readpage to fscache read helper
Jeff Layton [Mon, 1 Jun 2020 14:10:21 +0000 (10:10 -0400)]
ceph: convert readpage to fscache read helper

Have the ceph KConfig select NETFS_SUPPORT. Add a new netfs ops
structure and the operations for it. Convert ceph_readpage to use
the new netfs_readpage helper.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
5 years ago ceph: fix fscache invalidation
Jeff Layton [Thu, 21 Jan 2021 23:05:37 +0000 (18:05 -0500)]
ceph: fix fscache invalidation

Ensure that we invalidate the fscache whenever we invalidate the
pagecache.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
5 years ago ceph: rework PageFsCache handling
Jeff Layton [Thu, 21 Jan 2021 21:27:14 +0000 (16:27 -0500)]
ceph: rework PageFsCache handling

With the new fscache API, the PageFsCache bit now indicates that the
page is being written to the cache and shouldn't be modified or released
until it's finished.

Change releasepage and invalidatepage to wait on that bit before
returning.

Also define FSCACHE_USE_NEW_IO_API so that we opt into the new fscache
API.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
5 years ago ceph: disable old fscache readpage handling
Jeff Layton [Thu, 21 Jan 2021 17:32:05 +0000 (12:32 -0500)]
ceph: disable old fscache readpage handling

With the new netfs read helper functions, we won't need a lot of this
infrastructure as it handles the pagecache pages itself. Rip out the
read handling for now, and much of the old infrastructure that deals in
individual pages.

The cookie handling is mostly unchanged, however.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
5 years ago afs: Use new fscache read helper API
David Howells [Thu, 6 Feb 2020 14:22:29 +0000 (14:22 +0000)]
afs: Use new fscache read helper API

Make AFS use the new fscache read helpers to implement the VM read
operations:

 - afs_readpage() now hands off responsibility to fscache_readpage().

 - afs_readpages() is gone and replaced with afs_readahead().

 - afs_readahead() just hands off responsibility to fscache_readahead().

These make use of the cache if a cookie is supplied, otherwise just call
the ->issue_op() method a sufficient number of times to complete the entire
request.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Use the fs operation ops to handle FetchData completion
David Howells [Fri, 18 Sep 2020 08:11:15 +0000 (09:11 +0100)]
afs: Use the fs operation ops to handle FetchData completion

Use the 'success' and 'aborted' afs_operations_ops methods and add a
'failed' method to handle the completion of an AFS.FetchData,
AFS.FetchData64 or YFS.FetchData64 RPC operation rather than directly
calling the done func pointed to by the afs_read struct from the call
delivery handler.

This means the done function will be called back on error also, not just on
successful completion.

This allows motion towards asynchronous data reception on data fetch calls
and allows any error to be handed off to the fscache read helper in the
same place as a successful completion.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Prepare for use of THPs
David Howells [Tue, 20 Oct 2020 08:33:45 +0000 (09:33 +0100)]
afs: Prepare for use of THPs

As a prelude to supporting transparent huge pages, use thp_size() and
similar rather than PAGE_SIZE/SHIFT.

Further, try and frame everything in terms of file positions and lengths
rather than page indices and numbers of pages.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Extract writeback extension into its own function
David Howells [Fri, 30 Oct 2020 10:01:09 +0000 (10:01 +0000)]
afs: Extract writeback extension into its own function

Extract writeback extension into its own function to break up the writeback
function a bit.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Wait on PG_fscache before modifying/releasing a page
David Howells [Thu, 6 Feb 2020 14:22:28 +0000 (14:22 +0000)]
afs: Wait on PG_fscache before modifying/releasing a page

PG_fscache is going to be used to indicate that a page is being written to
the cache, and that the page should not be modified or released until it's
finished.

Make afs_invalidatepage() and afs_releasepage() wait for it.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Use ITER_XARRAY for writing
David Howells [Thu, 6 Feb 2020 14:22:28 +0000 (14:22 +0000)]
afs: Use ITER_XARRAY for writing

Use a single ITER_XARRAY iterator to describe the portion of a file to be
transmitted to the server rather than generating a series of small
ITER_BVEC iterators on the fly.  This will make it easier to implement AIO
in afs.

In theory we could maybe use one giant ITER_BVEC, but that means
potentially allocating a huge array of bio_vec structs (max 256 per page)
when in fact the pagecache already has a structure listing all the relevant
pages (radix_tree/xarray) that can be walked over.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Set up the iov_iter before calling afs_extract_data()
David Howells [Thu, 6 Feb 2020 14:22:28 +0000 (14:22 +0000)]
afs: Set up the iov_iter before calling afs_extract_data()

afs_extract_data() sets up a temporary iov_iter and passes it to AF_RXRPC
each time it is called to describe the remaining buffer to be filled.

Instead:

 (1) Put an iterator in the afs_call struct.

 (2) Set the iterator for each marshalling stage to load data into the
     appropriate places.  A number of convenience functions are provided to
     this end (eg. afs_extract_to_buf()).

     This iterator is then passed to afs_extract_data().

 (3) Use the new ITER_MAPPING iterator when reading data to load directly
     into the inode's pages without needing to create a list of them.

This will allow O_DIRECT calls to be supported in future patches.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Log remote unmarshalling errors
David Howells [Sun, 7 Jun 2020 20:50:29 +0000 (21:50 +0100)]
afs: Log remote unmarshalling errors

Log unmarshalling errors reported by the peer (ie. it can't parse what we
sent it).  Limit the maximum number of messages to 3.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Don't truncate iter during data fetch
David Howells [Thu, 6 Feb 2020 14:22:28 +0000 (14:22 +0000)]
afs: Don't truncate iter during data fetch

Don't truncate the iterator to correspond to the actual data size when
fetching the data from the server - rather, pass the length we want to read
to rxrpc.

This will allow the clear-after-read code in future to simply clear the
remaining iterator capacity rather than having to reinitialise the
iterator.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Move key to afs_read struct
David Howells [Thu, 6 Feb 2020 14:22:27 +0000 (14:22 +0000)]
afs: Move key to afs_read struct

Stash the key used to authenticate read operations in the afs_read struct.
This will be necessary to reissue the operation against the server if a
read from the cache fails in upcoming cache changes.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Print the operation debug_id when logging an unexpected data version
David Howells [Thu, 22 Oct 2020 13:38:15 +0000 (14:38 +0100)]
afs: Print the operation debug_id when logging an unexpected data version

Print the afs_operation debug_id when logging an unexpected change in the
data version.  This allows the logged message to be matched against
tracelines.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Pass page into dirty region helpers to provide THP size
David Howells [Wed, 28 Oct 2020 14:23:46 +0000 (14:23 +0000)]
afs: Pass page into dirty region helpers to provide THP size

Pass a pointer to the page being accessed into the dirty region helpers so
that the size of the page can be determined in case it's a transparent huge
page.

This also required the page to be passed into the afs_page_dirty trace
point - so there's no need to specifically pass in the index or private
data as these can be retrieved directly from the page struct.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago afs: Disable use of the fscache I/O routines
David Howells [Mon, 10 Feb 2020 10:00:22 +0000 (10:00 +0000)]
afs: Disable use of the fscache I/O routines

Disable use of the fscache I/O routines by the AFS filesystem.  It's about
to transition to passing iov_iters down and fscache is about to have its
I/O path converted to use iov_iter, so all that needs to change.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org

5 years ago fscache, cachefiles: Add alternate API to use kiocb for read/write to cache
David Howells [Mon, 18 Jan 2021 09:48:48 +0000 (09:48 +0000)]
fscache, cachefiles: Add alternate API to use kiocb for read/write to cache

Add an alternate API by which the cache can be accessed through a kiocb,
doing async DIO, rather than using the current API that tells the cache
where all the pages are.

The new API is intended to be used in conjunction with the netfs helper
library.  A filesystem must pick one or the other and not mix them.

Filesystems wanting to use the new API must #define FSCACHE_USE_NEW_IO_API
before #including the header.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago netfs: Define an interface to talk to a cache
David Howells [Thu, 6 Feb 2020 14:22:24 +0000 (14:22 +0000)]
netfs: Define an interface to talk to a cache

Add an interface to the netfs helper library for reading data from the
cache instead of downloading it from the server and support for writing
data just downloaded or cleared to the cache.

The API passes an iov_iter to the cache read/write routines to indicate the
data/buffer to be used.  This is done using the ITER_XARRAY type to provide
direct access to the netfs inode's pagecache.

When the netfs's ->begin_cache_operation() method is called, this must fill
in the cache_resources in the netfs_read_request struct, including the
netfs_cache_ops used by the helper lib to talk to the cache.  The helper
lib does not directly access the cache.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago netfs: Add write_begin helper
David Howells [Tue, 22 Sep 2020 10:06:07 +0000 (11:06 +0100)]
netfs: Add write_begin helper

Add a helper to do the pre-reading work for the netfs write_begin address
space op.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago netfs: Gather stats
David Howells [Tue, 3 Nov 2020 11:32:41 +0000 (11:32 +0000)]
netfs: Gather stats

Gather statistics from the netfs interface that can be exported through a
seqfile.  This is intended to be called by a later patch when viewing
/proc/fs/fscache/stats.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago netfs: Add tracepoints
David Howells [Fri, 18 Sep 2020 08:25:13 +0000 (09:25 +0100)]
netfs: Add tracepoints

Add three tracepoints to track the activity of the read helpers:

 (1) netfs/netfs_read

     This logs entry to the read helpers and also expansion of the range in
     a readahead request.

 (2) netfs/netfs_rreq

     This logs the progress of netfs_read_request objects which track
     read requests.  A read request may be a compound of multiple
     subrequests.

 (3) netfs/netfs_sreq

     This logs the progress of netfs_read_subrequest objects, which track
     the contributions from various sources to a read request.

Signed-off-by: David Howells <dhowells@redhat.com>
5 years ago netfs: Provide readahead and readpage netfs helpers
David Howells [Wed, 13 May 2020 16:41:20 +0000 (17:41 +0100)]
netfs: Provide readahead and readpage netfs helpers

Add a pair of helper functions:

 (*) netfs_readahead()
 (*) netfs_readpage()

to do the work of handling a readahead or a readpage, where the page(s)
that form part of the request may be split between the local cache, the
server or just require clearing, and may be single pages and transparent
huge pages.  This is all handled within the helper.

Note that while both will read from the cache if there is data present,
only netfs_readahead() will expand the request beyond what it was asked to
do, and only netfs_readahead() will write back to the cache.

netfs_readpage(), on the other hand, is synchronous and only fetches the
page (which might be a THP) it is asked for.

The netfs gives the helper parameters from the VM, the cache cookie it
wants to use (or NULL) and a table of operations (only one of which is
mandatory):

 (*) expand_readahead() [optional]

     Called to allow the netfs to request an expansion of a readahead
     request to meet its own alignment requirements.  This is done by
     changing rreq->start and rreq->len.

 (*) clamp_length() [optional]

     Called to allow the netfs to cut down a subrequest to meet its own
     boundary requirements.  If it does this, the helper will generate
     additional subrequests until the full request is satisfied.

 (*) is_still_valid() [optional]

     Called to find out if the data just read from the cache has been
     invalidated and must be reread from the server.

 (*) issue_op() [required]

     Called to ask the netfs to issue a read to the server.  The subrequest
     describes the read.  The read request holds information about the file
     being accessed.

     The netfs can cache information in rreq->netfs_priv.

     Upon completion, the netfs should set the error and transferred
     values, may also set FSCACHE_SREQ_CLEAR_TAIL, and should then call
     fscache_subreq_terminated().

 (*) done() [optional]

     Called after the pages have been unlocked.  The read request is still
     pinning the file and mapping and may still be pinning pages with
     PG_fscache.  rreq->error indicates any error that has been
     accumulated.

 (*) cleanup() [optional]

     Called when the helper is disposing of a finished read request.  This
     allows the netfs to clear rreq->netfs_priv.

Netfs support is enabled with CONFIG_NETFS_SUPPORT=y.  It will be built
even if CONFIG_FSCACHE=n and in this case much of it should be optimised
away, allowing the filesystem to use it even when caching is disabled.

Signed-off-by: David Howells <dhowells@redhat.com>
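
The arithmetic behind expand_readahead() and clamp_length() in the commit above can be sketched as follows. This is a Python illustration under stated assumptions, not the helper library's code: the function names, the alignment policy and the boundary policy are hypothetical examples of what a netfs might choose.

```python
def expand_readahead(start, length, align):
    """Round a request outward to align-byte boundaries.

    Models a netfs adjusting the request's start and length to meet its
    own alignment requirement. Returns the new (start, length).
    """
    new_start = start - (start % align)
    end = start + length
    new_end = -(-end // align) * align  # round end up to a multiple of align
    return new_start, new_end - new_start


def slice_request(start, length, boundary):
    """Cut a request into subrequests that never cross a boundary multiple.

    Models the effect of clamping each subrequest: when one is cut short,
    further subrequests are generated until the full range is covered.
    Returns a list of (start, length) tuples.
    """
    subreqs = []
    pos = start
    end = start + length
    while pos < end:
        next_boundary = (pos // boundary + 1) * boundary
        sub_len = min(end, next_boundary) - pos
        subreqs.append((pos, sub_len))
        pos += sub_len
    return subreqs
```

The slices always sum to the original length, which corresponds to the statement that additional subrequests are generated until the full request is satisfied.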
5 years ago Linux 5.11-rc5
Linus Torvalds [Mon, 25 Jan 2021 00:47:14 +0000 (16:47 -0800)]
Linux 5.11-rc5

5 years ago Merge tag 'sh-for-5.11' of git://git.libc.org/linux-sh
Linus Torvalds [Sun, 24 Jan 2021 21:52:02 +0000 (13:52 -0800)]
Merge tag 'sh-for-5.11' of git://git.libc.org/linux-sh

Pull arch/sh updates from Rich Felker:
 "Cleanup and warning fixes"

* tag 'sh-for-5.11' of git://git.libc.org/linux-sh:
  sh/intc: Restore devm_ioremap() alignment
  sh: mach-sh03: remove duplicate include
  arch: sh: remove duplicate include
  sh: Drop ARCH_NR_GPIOS definition
  sh: Remove unused HAVE_COPY_THREAD_TLS macro
  sh: remove CONFIG_IDE from most defconfig
  sh: mm: Convert to DEFINE_SHOW_ATTRIBUTE
  sh: intc: Convert to DEFINE_SHOW_ATTRIBUTE
  arch/sh: hyphenate Non-Uniform in Kconfig prompt
  sh: dma: fix kconfig dependency for G2_DMA

5 years ago Merge tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block
Linus Torvalds [Sun, 24 Jan 2021 20:30:14 +0000 (12:30 -0800)]
Merge tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Still need a final cancelation fix that isn't quite done done,
  expected in the next day or two. That said, this contains:

   - Wakeup fix for IOPOLL requests

   - SQPOLL split close op handling fix

   - Ensure that any use of io_uring fd itself is marked as inflight

   - Short non-regular file read fix (Pavel)

   - Fix up bad false positive warning (Pavel)

   - SQPOLL fixes (Pavel)

   - In-flight removal fix (Pavel)"

* tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
  io_uring: account io_uring internal files as REQ_F_INFLIGHT
  io_uring: fix sleeping under spin in __io_clean_op
  io_uring: fix short read retries for non-reg files
  io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
  io_uring: fix skipping disabling sqo on exec
  io_uring: fix uring_flush in exit_files() warning
  io_uring: fix false positive sqo warning on flush
  io_uring: iopoll requests should also wake task ->in_idle state

5 years ago Merge tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block
Linus Torvalds [Sun, 24 Jan 2021 20:24:35 +0000 (12:24 -0800)]
Merge tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - NVMe pull request from Christoph:
      - fix a status code in nvmet (Chaitanya Kulkarni)
      - avoid double completions in nvme-rdma/nvme-tcp (Chao Leng)
      - fix the CMB support to cope with NVMe 1.4 controllers (Klaus Jensen)
      - fix PRINFO handling in the passthrough ioctl (Revanth Rajashekar)
      - fix a double DMA unmap in nvme-pci

 - lightnvm error path leak fix (Pan)

 - MD pull request from Song:
      - Flush request fix (Xiao)

* tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
  lightnvm: fix memory leak when submit fails
  nvme-pci: fix error unwind in nvme_map_data
  nvme-pci: refactor nvme_unmap_data
  md: Set prev_flush_start and flush_bio in an atomic way
  nvmet: set right status on error in id-ns handler
  nvme-pci: allow use of cmb on v1.4 controllers
  nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout
  nvme-rdma: avoid request double completion for concurrent nvme_rdma_timeout
  nvme: check the PRINFO bit before deciding the host buffer length

5 years ago Merge branch 'akpm' (patches from Andrew)
Linus Torvalds [Sun, 24 Jan 2021 20:16:34 +0000 (12:16 -0800)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "18 patches.

  Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
  memory-failure, and highmem), ubsan, proc, and MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  MAINTAINERS: add a couple more files to the Clang/LLVM section
  proc_sysctl: fix oops caused by incorrect command parameters
  powerpc/mm/highmem: use __set_pte_at() for kmap_local()
  mips/mm/highmem: use set_pte() for kmap_local()
  mm/highmem: prepare for overriding set_pte_at()
  sparc/mm/highmem: flush cache and TLB
  mm: fix page reference leak in soft_offline_page()
  ubsan: disable unsigned-overflow check for i386
  kasan, mm: fix resetting page_alloc tags for HW_TAGS
  kasan, mm: fix conflicts with init_on_alloc/free
  kasan: fix HW_TAGS boot parameters
  kasan: fix incorrect arguments passing in kasan_add_zero_shadow
  kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
  mm: fix numa stats for thp migration
  mm: memcg: fix memcg file_dirty numa stat
  mm: memcg/slab: optimize objcg stock draining
  mm: fix initialization of struct page for holes in memory layout
  x86/setup: don't remove E820_TYPE_RAM for pfn 0

5 years ago Merge tag 'char-misc-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Linus Torvalds [Sun, 24 Jan 2021 19:26:46 +0000 (11:26 -0800)]
Merge tag 'char-misc-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc driver fixes from Greg KH:
 "Here are some small char/misc driver fixes for 5.11-rc5:

   - habanalabs driver fixes

   - phy driver fixes

   - hwtracing driver fixes

   - rtsx cardreader driver fix

  All of these have been in linux-next with no reported issues"

* tag 'char-misc-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  misc: rtsx: init value of aspm_enabled
  habanalabs: disable FW events on device removal
  habanalabs: fix backward compatibility of idle check
  habanalabs: zero pci counters packet before submit to FW
  intel_th: pci: Add Alder Lake-P support
  stm class: Fix module init return on allocation failure
  habanalabs: prevent soft lockup during unmap
  habanalabs: fix reset process in case of failures
  habanalabs: fix dma_addr passed to dma_mmap_coherent
  phy: mediatek: allow compile-testing the dsi phy
  phy: cpcap-usb: Fix warning for missing regulator_disable
  PHY: Ingenic: fix unconditional build of phy-ingenic-usb

5 years ago Merge tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 24 Jan 2021 19:05:48 +0000 (11:05 -0800)]
Merge tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are some small driver core fixes for 5.11-rc5 that resolve some
  reported problems:

   - revert of a -rc1 patch that was causing problems with some machines

   - device link device name collision problem fix (busses only have to
     name devices unique to their bus, not unique to all busses)

   - kernfs splice bugfixes to resolve firmware loading problems for
     Qualcomm systems.

   - other tiny driver core fixes for minor issues reported.

  All of these have been in linux-next with no reported problems"

* tag 'driver-core-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Fix device link device name collision
  driver core: Extend device_is_dependent()
  kernfs: wire up ->splice_read and ->splice_write
  kernfs: implement ->write_iter
  kernfs: implement ->read_iter
  Revert "driver core: Reorder devices on successful probe"
  Driver core: platform: Add extra error check in devm_platform_get_irqs_affinity()
  drivers core: Free dma_range_map when driver probe failed

5 years ago Merge tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh...
Linus Torvalds [Sun, 24 Jan 2021 19:02:01 +0000 (11:02 -0800)]
Merge tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging/IIO driver fixes from Greg KH:
 "Here are some IIO driver fixes for 5.11-rc5 to resolve some reported
  problems.

  Nothing major, just a few small fixes, all of these have been in
  linux-next for a while and full details are in the shortlog"

* tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  iio: sx9310: Fix semtech,avg-pos-strength setting when > 16
  iio: common: st_sensors: fix possible infinite loop in st_sensors_irq_thread
  iio: ad5504: Fix setting power-down state
  counter:ti-eqep: remove floor
  drivers: iio: temperature: Add delay after the addressed reset command in mlx90632.c
  iio: adc: ti_am335x_adc: remove omitted iio_kfifo_free()
  dt-bindings: iio: accel: bma255: Fix bmc150/bmi055 compatible
  iio: sx9310: Off by one in sx9310_read_thresh()

5 years ago Merge tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Linus Torvalds [Sun, 24 Jan 2021 18:56:45 +0000 (10:56 -0800)]
Merge tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial fixes from Greg KH:
 "Here are three small tty/serial fixes for 5.11-rc5 to resolve reported
  problems:

   - two patches to fix up writing to ttys with splice

   - mvebu-uart driver fix for reported problem

  All of these have been in linux-next with no reported problems"

* tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  tty: fix up hung_up_tty_write() conversion
  tty: implement write_iter
  serial: mvebu-uart: fix tx lost characters at power off

5 years agoMerge tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Linus Torvalds [Sun, 24 Jan 2021 18:54:54 +0000 (10:54 -0800)]
Merge tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
 "Here are some small USB driver fixes for 5.11-rc5.  They resolve:

   - xhci issues for some reported problems

   - ehci driver issue for one specific device

   - USB gadget fixes for some reported problems

   - cdns3 driver fixes for issues reported

   - MAINTAINERS file update

   - thunderbolt minor fix

  All of these have been in linux-next with no reported issues"

* tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: bdc: Make bdc pci driver depend on BROKEN
  xhci: tegra: Delay for disabling LFPS detector
  xhci: make sure TRB is fully written before giving it to the controller
  usb: udc: core: Use lock when write to soft_connect
  USB: gadget: dummy-hcd: Fix errors in port-reset handling
  usb: gadget: aspeed: fix stop dma register setting.
  USB: ehci: fix an interrupt calltrace error
  ehci: fix EHCI host controller initialization sequence
  MAINTAINERS: update Peter Chen's email address
  thunderbolt: Drop duplicated 0x prefix from format string
  MAINTAINERS: Update address for Cadence USB3 driver
  usb: cdns3: imx: improve driver .remove API
  usb: cdns3: imx: fix can't create core device the second time issue
  usb: cdns3: imx: fix writing read-only memory issue

5 years agoMAINTAINERS: add a couple more files to the Clang/LLVM section
Nathan Chancellor [Sun, 24 Jan 2021 05:02:21 +0000 (21:02 -0800)]
MAINTAINERS: add a couple more files to the Clang/LLVM section

The K: entry should ensure that Nick and I always get CC'd on patches that
touch these files but it is better to be explicit rather than implicit.

Link: https://lkml.kernel.org/r/20210114004059.2129921-1-natechancellor@gmail.com
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoproc_sysctl: fix oops caused by incorrect command parameters
Xiaoming Ni [Sun, 24 Jan 2021 05:02:16 +0000 (21:02 -0800)]
proc_sysctl: fix oops caused by incorrect command parameters

process_sysctl_arg() does not check whether val is empty before invoking
strlen(val).  If a command line parameter is incorrectly configured and
val is empty, an oops is triggered.

For example, if "hung_task_panic=1" is incorrectly written as
"hung_task_panic", an oops is triggered.  The call stack is as follows:
    Kernel command line: .... hung_task_panic
    ......
    Call trace:
    __pi_strlen+0x10/0x98
    parse_args+0x278/0x344
    do_sysctl_args+0x8c/0xfc
    kernel_init+0x5c/0xf4
    ret_from_fork+0x10/0x30

To fix it, check whether "val" is empty when "param" is a sysctl field.
An error code is returned in the failure branch, and the error is logged
by parse_args().
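
A minimal userspace sketch of the guard described above, assuming that an
"empty" val means the NULL pointer handed over when a parameter has no
"=value" part.  sysctl_arg_len() is a hypothetical stand-in for
illustration, not the kernel's process_sysctl_arg():

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the sysctl branch of the argument handler.
 * Returns 0 and stores the value length in *len on success, or -EINVAL
 * when the parameter was given without a value (val == NULL).
 */
static int sysctl_arg_len(const char *param, const char *val, size_t *len)
{
	assert(param != NULL && len != NULL);

	/* The guard: never call strlen() on a missing value. */
	if (val == NULL)
		return -EINVAL;

	*len = strlen(val);
	return 0;
}
```

With this guard, a bare "hung_task_panic" is rejected with -EINVAL instead
of reaching strlen().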

Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com
Fixes: 3db978d480e2843 ("kernel/sysctl: support setting sysctl parameters from kernel command line")
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agopowerpc/mm/highmem: use __set_pte_at() for kmap_local()
Thomas Gleixner [Sun, 24 Jan 2021 05:02:11 +0000 (21:02 -0800)]
powerpc/mm/highmem: use __set_pte_at() for kmap_local()

The original PowerPC highmem mapping function used __set_pte_at() to
denote that the mapping is per CPU.  This got lost with the conversion
to the generic implementation.

Override the default map function.

Link: https://lkml.kernel.org/r/20210112170411.281464308@linutronix.de
Fixes: 47da42b27a56 ("powerpc/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomips/mm/highmem: use set_pte() for kmap_local()
Thomas Gleixner [Sun, 24 Jan 2021 05:02:07 +0000 (21:02 -0800)]
mips/mm/highmem: use set_pte() for kmap_local()

set_pte_at() on MIPS invokes update_cache() which might recurse into
kmap_local().

Use set_pte() like the original MIPS highmem implementation did.

Link: https://lkml.kernel.org/r/20210112170411.187513575@linutronix.de
Fixes: a4c33e83bca1 ("mips/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Paul Cercueil <paul@crapouillou.net>
Reported-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm/highmem: prepare for overriding set_pte_at()
Thomas Gleixner [Sun, 24 Jan 2021 05:02:02 +0000 (21:02 -0800)]
mm/highmem: prepare for overriding set_pte_at()

The generic kmap_local() map function uses set_pte_at(), but MIPS requires
set_pte() and PowerPC wants __set_pte_at().

Provide arch_kmap_local_set_pte() and default it to set_pte_at().

Link: https://lkml.kernel.org/r/20210112170411.056306194@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agosparc/mm/highmem: flush cache and TLB
Thomas Gleixner [Sun, 24 Jan 2021 05:01:57 +0000 (21:01 -0800)]
sparc/mm/highmem: flush cache and TLB

Patch series "mm/highmem: Fix fallout from generic kmap_local
conversions".

The kmap_local conversion wrecked sparc, mips and powerpc as it missed
some of the details in the original implementation.

This patch (of 4):

The recent conversion to the generic kmap_local infrastructure failed to
assign the proper pre/post map/unmap flush operations for sparc.

Sparc requires a cache flush before map/unmap and a TLB flush afterwards.

Link: https://lkml.kernel.org/r/20210112170136.078559026@linutronix.de
Link: https://lkml.kernel.org/r/20210112170410.905976187@linutronix.de
Fixes: 3293efa97807 ("sparc/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: fix page reference leak in soft_offline_page()
Dan Williams [Sun, 24 Jan 2021 05:01:52 +0000 (21:01 -0800)]
mm: fix page reference leak in soft_offline_page()

The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.

Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.

When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page()
pfn the kernel hangs at dax-device shutdown due to a leaked reference.

Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoubsan: disable unsigned-overflow check for i386
Arnd Bergmann [Sun, 24 Jan 2021 05:01:48 +0000 (21:01 -0800)]
ubsan: disable unsigned-overflow check for i386

Building ubsan kernels even for compile-testing introduced these
warnings in my randconfig environment:

  crypto/blake2b_generic.c:98:13: error: stack frame size of 9636 bytes in function 'blake2b_compress' [-Werror,-Wframe-larger-than=]
  static void blake2b_compress(struct blake2b_state *S,

  crypto/sha512_generic.c:151:13: error: stack frame size of 1292 bytes in function 'sha512_generic_block_fn' [-Werror,-Wframe-larger-than=]
  static void sha512_generic_block_fn(struct sha512_state *sst, u8 const *src,

  lib/crypto/curve25519-fiat32.c:312:22: error: stack frame size of 2180 bytes in function 'fe_mul_impl' [-Werror,-Wframe-larger-than=]
  static noinline void fe_mul_impl(u32 out[10], const u32 in1[10], const u32 in2[10])

  lib/crypto/curve25519-fiat32.c:444:22: error: stack frame size of 1588 bytes in function 'fe_sqr_impl' [-Werror,-Wframe-larger-than=]
  static noinline void fe_sqr_impl(u32 out[10], const u32 in1[10])

Further testing showed that this is caused by
-fsanitize=unsigned-integer-overflow, but is isolated to the 32-bit x86
architecture.

The one in blake2b immediately overflows the 8KB stack area on this
architecture, so it is better to ensure this never happens by disabling
the option for 32-bit x86.

Link: https://lkml.kernel.org/r/20210112202922.2454435-1-arnd@kernel.org
Link: https://lore.kernel.org/lkml/20201230154749.746641-1-arnd@kernel.org/
Fixes: d0a3ac549f38 ("ubsan: enable for all*config builds")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Marco Elver <elver@google.com>
Cc: George Popescu <georgepope@android.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokasan, mm: fix resetting page_alloc tags for HW_TAGS
Andrey Konovalov [Sun, 24 Jan 2021 05:01:43 +0000 (21:01 -0800)]
kasan, mm: fix resetting page_alloc tags for HW_TAGS

A previous commit added resetting KASAN page tags to
kernel_init_free_pages() to avoid false-positives due to accesses to
metadata with the hardware tag-based mode.

That commit did reset page tags before the metadata access, but didn't
restore them after.  As a result, KASAN fails to detect bad accesses
to page_alloc allocations on some configurations.

Fix this by recovering the tag after the metadata access.

Link: https://lkml.kernel.org/r/02b5bcd692e912c27d484030f666b350ad7e4ae4.1611074450.git.andreyknvl@google.com
Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokasan, mm: fix conflicts with init_on_alloc/free
Andrey Konovalov [Sun, 24 Jan 2021 05:01:38 +0000 (21:01 -0800)]
kasan, mm: fix conflicts with init_on_alloc/free

A few places where SLUB accesses object's data or metadata were missed
in a previous patch.  This leads to false positives with hardware
tag-based KASAN when bulk allocations are used with init_on_alloc/free.

Fix the false-positives by resetting pointer tags during these accesses.

(The kasan_reset_tag call is removed from slab_alloc_node, as it's added
 into maybe_wipe_obj_freeptr.)

Link: https://linux-review.googlesource.com/id/I50dd32838a666e173fe06c3c5c766f2c36aae901
Link: https://lkml.kernel.org/r/093428b5d2ca8b507f4a79f92f9929b35f7fada7.1610731872.git.andreyknvl@google.com
Fixes: aa1ef4d7b3f67 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokasan: fix HW_TAGS boot parameters
Andrey Konovalov [Sun, 24 Jan 2021 05:01:34 +0000 (21:01 -0800)]
kasan: fix HW_TAGS boot parameters

The initially proposed KASAN command line parameters are redundant.

This change drops the complex "kasan.mode=off/prod/full" parameter and
adds a simpler kill switch "kasan=off/on" instead.  The new parameter
together with the already existing ones provides a cleaner way to
express the same set of features.

The full set of parameters with this change:

  kasan=off/on             - whether KASAN is enabled
  kasan.fault=report/panic - whether to only print a report or also panic
  kasan.stacktrace=off/on  - whether to collect alloc/free stack traces

Default values:

  kasan=on
  kasan.fault=report
  kasan.stacktrace=on  (if CONFIG_DEBUG_KERNEL=y)
  kasan.stacktrace=off (otherwise)

Link: https://linux-review.googlesource.com/id/Ib3694ed90b1e8ccac6cf77dfd301847af4aba7b8
Link: https://lkml.kernel.org/r/4e9c4a4bdcadc168317deb2419144582a9be6e61.1610736745.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Marco Elver <elver@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokasan: fix incorrect arguments passing in kasan_add_zero_shadow
Lecopzer Chen [Sun, 24 Jan 2021 05:01:29 +0000 (21:01 -0800)]
kasan: fix incorrect arguments passing in kasan_add_zero_shadow

kasan_remove_zero_shadow() shall use the original virtual address, start
and size, instead of the shadow address.

Link: https://lkml.kernel.org/r/20210103063847.5963-1-lecopzer@gmail.com
Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN")
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agokasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
Lecopzer Chen [Sun, 24 Jan 2021 05:01:25 +0000 (21:01 -0800)]
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow

While testing kasan_populate_early_shadow and kasan_remove_zero_shadow,
if the shadow start and end addresses in kasan_remove_zero_shadow() are
not aligned to PMD_SIZE, the remaining unaligned PTEs won't be removed.

In the test case for kasan_remove_zero_shadow():

    shadow_start: 0xffffffb802000000, shadow end: 0xffffffbfbe000000

    3-level page table:
      PUD_SIZE: 0x40000000 PMD_SIZE: 0x200000 PAGE_SIZE: 4K

0xffffffbf80000000 ~ 0xffffffbfbdf80000 will not be removed because in
kasan_remove_pud_table(), kasan_pmd_table(*pud) is true but the next
address is 0xffffffbfbdf80000 which is not aligned to PUD_SIZE.

Correctly, this should fall back to the next level,
kasan_remove_pmd_table(), but the control flow always continues and
skips the unaligned part.

Fix by correcting the condition for the case where next and addr are not
both aligned.
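
The alignment test behind this can be sketched in plain C.  This is a
userspace illustration, not the kernel code: is_aligned() and
can_clear_whole_pud() are hypothetical helpers, and the sizes and addresses
are the ones quoted in the test case above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PUD_SIZE UINT64_C(0x40000000)	/* from the 3-level example */
#define PMD_SIZE UINT64_C(0x200000)

/* True when addr is a multiple of size; size must be a power of two. */
static bool is_aligned(uint64_t addr, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return (addr & (size - 1)) == 0;
}

/*
 * A whole PUD entry may only be cleared when the range [addr, next)
 * covers it completely, i.e. both ends are PUD_SIZE aligned.  Otherwise
 * the walk has to descend to the PMD level instead of skipping the range.
 */
static bool can_clear_whole_pud(uint64_t addr, uint64_t next)
{
	return is_aligned(addr, PUD_SIZE) && is_aligned(next, PUD_SIZE);
}
```

For the range starting at 0xffffffbf80000000 with next at
0xffffffbfbdf80000, the start is PUD aligned but the end is not, so the
helper reports that the entry cannot be cleared as a whole.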

Link: https://lkml.kernel.org/r/20210103135621.83129-1-lecopzer@gmail.com
Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN")
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoMerge tag 'irq_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 18:24:20 +0000 (10:24 -0800)]
Merge tag 'irq_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Borislav Petkov:

 - Fix a kernel panic in mips-cpu due to invalid irq domain hierarchy.

 - Fix to not lose IPIs on bcm2836.

 - Fix for a bogus marking of ITS devices as shared due to an
   uninitialized stack variable.

 - Clear a phantom interrupt on qcom-pdc to unblock suspend.

 - Small cleanups, warning and build fixes.

* tag 'irq_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  genirq: Export irq_check_status_bit()
  irqchip/mips-cpu: Set IPI domain parent chip
  irqchip/pruss: Simplify the TI_PRUSS_INTC Kconfig
  irqchip/loongson-liointc: Fix build warnings
  driver core: platform: Add extra error check in devm_platform_get_irqs_affinity()
  irqchip/bcm2836: Fix IPI acknowledgement after conversion to handle_percpu_devid_irq
  irqchip/irq-sl28cpld: Convert comma to semicolon
  genirq/msi: Initialize msi_alloc_info before calling msi_domain_prepare_irqs()

5 years agoMerge tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 18:17:03 +0000 (10:17 -0800)]
Merge tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool fixes from Borislav Petkov:

 - Adjust objtool to handle a recent binutils change to not generate
   unused symbols anymore.

 - Revert the fail-the-build-on-fatal-errors objtool strategy for now
   due to the ever-increasing matrix of supported toolchains/plugins and
   them causing too many such fatal errors currently.

 - Do not add empty symbols to objdump's rbtree to accommodate clang
   removing section symbols.

* tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Don't fail on missing symbol table
  objtool: Don't fail the kernel build on fatal errors
  objtool: Don't add empty symbols to the rbtree

5 years agoMerge tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 18:09:20 +0000 (10:09 -0800)]
Merge tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Borislav Petkov:

 - Correct the marking of kthreads which are supposed to run on a
   specific, single CPU vs such which are affine to only one CPU, mark
   per-cpu workqueue threads as such and make sure that marking
   "survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.

 - A fix to not push away tasks on CPUs coming online.

 - Have workqueue CPU hotplug code use cpu_possible_mask when breaking
   affinity on CPU offlining so that pending workers can finish on newly
   arrived onlined CPUs too.

 - Dump tasks which haven't vacated a CPU which is currently being
   unplugged.

 - Register a special scale invariance callback which gets called on
   resume from RAM to read out APERF/MPERF after resume and thus make
   the schedutil scaling governor more precise.

* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Relax the set_cpus_allowed_ptr() semantics
  sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
  sched: Prepare to use balance_push in ttwu()
  workqueue: Restrict affinity change to rescuer
  workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
  kthread: Extract KTHREAD_IS_PER_CPU
  sched: Don't run cpu-online with balance_push() enabled
  workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
  sched/core: Print out straggler tasks in sched_cpu_dying()
  x86: PM: Register syscore_ops for scale invariance

5 years agoMerge tag 'timers_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 17:58:38 +0000 (09:58 -0800)]
Merge tag 'timers_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Borislav Petkov:

 - Fix an integer overflow in the NTP RTC synchronization which led to
   the latter happening every 2 seconds instead of the intended every 11
   minutes.

 - Get rid of now unused get_seconds().
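
The size of the error in the first item can be reproduced with plain
arithmetic.  The sketch below assumes the 11-minute period was computed in
nanoseconds in a 32-bit unsigned type; period_ns_32() and period_ns_64()
are hypothetical helpers for illustration, not the kernel's code:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000u

/* Period in nanoseconds computed in 32 bits: wraps modulo 2^32. */
static uint32_t period_ns_32(uint32_t minutes)
{
	return minutes * 60u * NSEC_PER_SEC;
}

/* Same period computed in 64 bits: no wrap for sane inputs. */
static uint64_t period_ns_64(uint32_t minutes)
{
	/* Keep the product far below the 64-bit limit. */
	assert(minutes <= 1000000u);
	return (uint64_t)minutes * 60u * NSEC_PER_SEC;
}
```

In 64 bits, 11 minutes is 660,000,000,000 ns.  Truncated to 32 bits this
wraps to 2,870,003,712 ns, a little under 3 seconds, which is in line with
the short interval reported above.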

* tag 'timers_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ntp: Fix RTC synchronization on 32-bit platforms
  timekeeping: Remove unused get_seconds()

5 years agoMerge tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 24 Jan 2021 17:46:05 +0000 (09:46 -0800)]
Merge tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

 - Add a new Intel model number for Alder Lake

 - Differentiate which aspects of the FPU state get saved/restored when
   the FPU is used in-kernel and fix a boot crash on K7 due to early
   MXCSR access before CR4.OSFXSR is even set.

 - A couple of noinstr annotation fixes

 - Correct die ID setting on AMD for users of topology information which
   need the correct die ID

 - A SEV-ES fix to handle string port IO to/from kernel memory properly

* tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/cpu: Add another Alder Lake CPU to the Intel family
  x86/mmx: Use KFPU_387 for MMX string operations
  x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
  x86/topology: Make __max_die_per_package available unconditionally
  x86: __always_inline __{rd,wr}msr()
  x86/mce: Remove explicit/superfluous tracing
  locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP
  locking/lockdep: Cure noinstr fail
  x86/sev: Fix nonistr violation
  x86/entry: Fix noinstr fail
  x86/cpu/amd: Set __max_die_per_package on AMD
  x86/sev-es: Handle string port IO to kernel memory properly

5 years agoMerge tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 24 Jan 2021 17:40:51 +0000 (09:40 -0800)]
Merge tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - Fix a bad interaction between the scv handling and the fallback L1D
   flush, which could lead to user register corruption. Only affects
   people using scv (~no one) on machines with old firmware that are
   missing the L1D flush.

 - Two small selftest fixes.

Thanks to Eirik Fuller, Libor Pechacek, Nicholas Piggin, Sandipan Das,
and Tulio Magno Quites Machado Filho.

* tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: fix scv entry fallback flush vs interrupt
  selftests/powerpc: Only test lwm/stmw on big endian
  selftests/powerpc: Fix exit status of pkey tests

5 years agoMerge tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sun, 24 Jan 2021 17:35:28 +0000 (09:35 -0800)]
Merge tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull misc fixes from Christian Brauner:

 - Jann reported sparse complaints because of a missing __user
   annotation in a helper we added way back when we added
   pidfd_send_signal() to avoid compat syscall handling. Fix it.

 - Yanfei replaces a reference in a comment to the _do_fork() helper I
   removed a while ago with a reference to the new kernel_clone()
   replacement

 - Alexander Guril added a simple coding style fix

* tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  kthread: remove comments about old _do_fork() helper
  Kernel: fork.c: Fix coding style: Do not use {} around single-line statements
  signal: Add missing __user annotation to copy_siginfo_from_user_any

5 years agoMerge tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 24 Jan 2021 17:27:14 +0000 (09:27 -0800)]
Merge tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "An important signal handling patch for stable, and two small cleanup
  patches"

* tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: do not fail __smb_send_rqst if non-fatal signals are pending
  fs/cifs: Simplify bool comparison.
  fs/cifs: Assign boolean values to a bool variable

5 years agomm: fix numa stats for thp migration
Shakeel Butt [Sun, 24 Jan 2021 05:01:15 +0000 (21:01 -0800)]
mm: fix numa stats for thp migration

Currently the kernel is not correctly updating the numa stats for
NR_FILE_PAGES and NR_SHMEM on THP migration.  Fix that.

For NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING, there is currently no need
to handle THP migration, as the kernel does not yet have write support
for file THP.  To be more future proof, this patch adds THP support for
those stats as well.

Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com
Fixes: e71769ae52609 ("mm: enable thp migration for shmem thp")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: memcg: fix memcg file_dirty numa stat
Shakeel Butt [Sun, 24 Jan 2021 05:01:11 +0000 (21:01 -0800)]
mm: memcg: fix memcg file_dirty numa stat

The kernel updates the per-node NR_FILE_DIRTY stats on page migration
but not the memcg numa stats.

That was not an issue until recently, when commit 5f9a4f4a7096 ("mm:
memcontrol: add the missing numa_stat interface for cgroup v2") exposed
numa stats for the memcg.

So fix the file_dirty per-memcg numa stat.

Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com
Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: memcg/slab: optimize objcg stock draining
Roman Gushchin [Sun, 24 Jan 2021 05:01:07 +0000 (21:01 -0800)]
mm: memcg/slab: optimize objcg stock draining

Imran Khan reported a 16% regression in hackbench results caused by the
commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
instead of pages").  The regression is noticeable in the case of a
consecutive allocation of several relatively large slab objects, e.g.
skb's.  As soon as the amount of stocked bytes exceeds PAGE_SIZE,
drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads
to a number of atomic operations in page_counter_uncharge().

The corresponding call graph is below (provided by Imran Khan):

  |__alloc_skb
  |    |
  |    |__kmalloc_reserve.isra.61
  |    |    |
  |    |    |__kmalloc_node_track_caller
  |    |    |    |
  |    |    |    |slab_pre_alloc_hook.constprop.88
  |    |    |     obj_cgroup_charge
  |    |    |    |    |
  |    |    |    |    |__memcg_kmem_charge
  |    |    |    |    |    |
  |    |    |    |    |    |page_counter_try_charge
  |    |    |    |    |
  |    |    |    |    |refill_obj_stock
  |    |    |    |    |    |
  |    |    |    |    |    |drain_obj_stock.isra.68
  |    |    |    |    |    |    |
  |    |    |    |    |    |    |__memcg_kmem_uncharge
  |    |    |    |    |    |    |    |
  |    |    |    |    |    |    |    |page_counter_uncharge
  |    |    |    |    |    |    |    |    |
  |    |    |    |    |    |    |    |    |page_counter_cancel
  |    |    |    |
  |    |    |    |
  |    |    |    |__slab_alloc
  |    |    |    |    |
  |    |    |    |    |___slab_alloc
  |    |    |    |    |
  |    |    |    |slab_post_alloc_hook

Instead of directly uncharging the accounted kernel memory, it's
possible to refill the generic page-sized per-cpu stock.  It's a
much faster operation, especially on a default hierarchy.  As a bonus,
__memcg_kmem_uncharge_page() will also get faster, so the freeing of
page-sized kernel allocations (e.g.  large kmallocs) will become faster.

A similar change has been done earlier for the socket memory by the
commit 475d0487a2ad ("mm: memcontrol: use per-cpu stocks for socket
memory uncharging").

Link: https://lkml.kernel.org/r/20210106042239.2860107-1-guro@fb.com
Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agomm: fix initialization of struct page for holes in memory layout
Mike Rapoport [Sun, 24 Jan 2021 05:01:02 +0000 (21:01 -0800)]
mm: fix initialization of struct page for holes in memory layout

There could be struct pages that are not backed by actual physical
memory.  This can happen when the actual memory bank is not a multiple
of SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.

Such pages are currently initialized using the init_unavailable_mem()
function, which iterates through PFNs in holes in memblock.memory; if
there is a struct page corresponding to a PFN, the fields of this page
are set to default values and the page is marked as Reserved.

init_unavailable_mem() does not take into account the zone and node the
page belongs to and sets both zone and node links in struct page to zero.

On a system that has firmware reserved holes in a zone above ZONE_DMA,
for instance in the configuration below:

# grep -A1 E820 /proc/iomem
7a17b000-7a216fff : Unknown E820 type
7a217000-7bffffff : System RAM

an unset zone link in struct page will trigger

VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link
in struct page) in the same pageblock.

Update init_unavailable_mem() to use zone constraints defined by an
architecture to properly set up the zone link and use the node ID of the
adjacent range in memblock.memory to set the node link.
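
As a rough illustration of deriving the zone link from zone constraints
rather than leaving it at zero, the sketch below looks up the zone whose
PFN range contains a given PFN.  It is a userspace model, not the kernel
implementation: struct model_zone and model_zone_for_pfn() are
hypothetical names, and any zone boundaries passed to it are assumptions
of the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one zone's PFN range. */
struct model_zone {
	const char *name;
	unsigned long start_pfn;  /* first PFN in the zone */
	unsigned long end_pfn;    /* one past the last PFN */
};

/* Return the index of the zone whose range contains pfn, or -1 if none. */
static int model_zone_for_pfn(const struct model_zone *zones, size_t nr,
			      unsigned long pfn)
{
	assert(zones != NULL);
	for (size_t i = 0; i < nr; i++) {
		if (pfn >= zones[i].start_pfn && pfn < zones[i].end_pfn)
			return (int)i;
	}
	return -1;
}
```

Assuming 4 KiB pages with ZONE_DMA covering the first 16 MiB and
ZONE_DMA32 extending to 4 GiB, the hole at 0x7a17b000 from the example
above is PFN 0x7a17b and falls into the second zone, not zone 0.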

Link: https://lkml.kernel.org/r/20210111194017.22696-3-rppt@kernel.org
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agox86/setup: don't remove E820_TYPE_RAM for pfn 0
Mike Rapoport [Sun, 24 Jan 2021 05:00:57 +0000 (21:00 -0800)]
x86/setup: don't remove E820_TYPE_RAM for pfn 0

Patch series "mm: fix initialization of struct page for holes in memory layout", v3.

Commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions
rather that check each PFN") exposed several issues with the memory map
initialization and these patches fix those issues.

Initially there were crashes during compaction that Qian Cai reported
back in April [1].  It seemed back then that the problem was fixed, but
a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an
additional discussion at [3].

[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org

This patch (of 2):

The first 4Kb of memory is a BIOS owned area and to avoid its allocation
for the kernel it was not listed in e820 tables as memory.  As a result,
pfn 0 was never recognised by the generic memory management and it is
part of neither node 0 nor ZONE_DMA.

If set_pfnblock_flags_mask() were ever called for the pageblock
corresponding to the first 2Mbytes of memory, having pfn 0 outside of
ZONE_DMA would trigger

VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);

Along with reserving the first 4Kb in e820 tables, the first several pages
are reserved with memblock in several places during setup_arch().  These
reservations are enough to ensure the kernel does not touch the BIOS area
and it is not necessary to remove E820_TYPE_RAM for pfn 0.

Remove the update of e820 table that changes the type of pfn 0 and move
the comment describing why it was done to trim_low_memory_range() that
reserves the beginning of the memory.

Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agoio_uring: account io_uring internal files as REQ_F_INFLIGHT
Jens Axboe [Sat, 23 Jan 2021 22:49:31 +0000 (15:49 -0700)]
io_uring: account io_uring internal files as REQ_F_INFLIGHT

We need to actively cancel anything that introduces a potential circular
loop, where io_uring holds a reference to itself. If the file in question
is an io_uring file, then add the request to the inflight list.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoio_uring: fix sleeping under spin in __io_clean_op
Pavel Begunkov [Sun, 24 Jan 2021 15:08:14 +0000 (15:08 +0000)]
io_uring: fix sleeping under spin in __io_clean_op

[   27.629441] BUG: sleeping function called from invalid context at fs/file.c:402
[   27.631317] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1012, name: io_wqe_worker-0
[   27.633220] 1 lock held by io_wqe_worker-0/1012:
[   27.634286]  #0: ffff888105e26c98 (&ctx->completion_lock){....}-{2:2}, at: __io_req_complete.part.102+0x30/0x70
[   27.649249] Call Trace:
[   27.649874]  dump_stack+0xac/0xe3
[   27.650666]  ___might_sleep+0x284/0x2c0
[   27.651566]  put_files_struct+0xb8/0x120
[   27.652481]  __io_clean_op+0x10c/0x2a0
[   27.653362]  __io_cqring_fill_event+0x2c1/0x350
[   27.654399]  __io_req_complete.part.102+0x41/0x70
[   27.655464]  io_openat2+0x151/0x300
[   27.656297]  io_issue_sqe+0x6c/0x14e0
[   27.660991]  io_wq_submit_work+0x7f/0x240
[   27.662890]  io_worker_handle_work+0x501/0x8a0
[   27.664836]  io_wqe_worker+0x158/0x520
[   27.667726]  kthread+0x134/0x180
[   27.669641]  ret_from_fork+0x1f/0x30

Instead of cleaning files on overflow, move overflow cancellation back
into io_uring_cancel_files(). Previously it was racy to clear the
REQ_F_OVERFLOW flag, but we got rid of that flag, so cancellation can be
done through repeated attempts targeting all matching requests.

Reported-by: Abaci <abaci@linux.alibaba.com>
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoMerge branch 'mtd/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Linus Torvalds [Sat, 23 Jan 2021 20:02:58 +0000 (12:02 -0800)]
Merge branch 'mtd/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux

Pull mtd fixes from Miquel Raynal.

* 'mtd/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
  mtd: rawnand: omap: Use BCH private fields in the specific OOB layout
  mtd: spinand: Fix MTD_OPS_AUTO_OOB requests
  mtd: rawnand: intel: check the mtd name only after setting the variable
  mtd: rawnand: nandsim: Fix the logic when selecting Hamming soft ECC engine
  mtd: rawnand: gpmi: fix dst bit offset when extracting raw payload

5 years agoMerge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 23 Jan 2021 19:43:02 +0000 (11:43 -0800)]
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:
 "Another bunch of driver fixes"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: sprd: depend on COMMON_CLK to fix compile tests
  Revert "i2c: imx: Remove unused .id_table support"
  i2c: octeon: check correct size of maximum RECV_LEN packet
  i2c: tegra: Create i2c_writesl_vi() to use with VI I2C for filling TX FIFO
  i2c: bpmp-tegra: Ignore unknown I2C_M flags
  i2c: tegra: Wait for config load atomically while in ISR

5 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 23 Jan 2021 19:35:02 +0000 (11:35 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Twelve minor fixes, all in drivers or doc.

  Most of the fixes are pretty obvious (although we had two goes to get
  the UFS sysfs doc right) and the biggest change is in the ufs driver
  which they've extensively tested"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: ibmvfc: Set default timeout to avoid crash during migration
  scsi: target: tcmu: Fix use-after-free of se_cmd->priv
  scsi: fnic: Fix memleak in vnic_dev_init_devcmd2
  scsi: libfc: Avoid invoking response handler twice if ep is already completed
  scsi: scsi_transport_srp: Don't block target in failfast state
  scsi: docs: ABI: sysfs-driver-ufs: Rectify table formatting
  scsi: ufs: Fix tm request when non-fatal error happens
  scsi: ufs: Fix livelock of ufshcd_clear_ua_wluns()
  scsi: ibmvfc: Fix missing cast of ibmvfc_event pointer to u64 handle
  scsi: ufs: ufshcd-pltfrm depends on HAS_IOMEM
  scsi: megaraid_sas: Fix MEGASAS_IOC_FIRMWARE regression
  scsi: docs: ABI: sysfs-driver-ufs: Add DeepSleep power mode

5 years agoMerge tag 'linux-kselftest-kunit-fixes-5.11-rc5' of git://git.kernel.org/pub/scm...
Linus Torvalds [Sat, 23 Jan 2021 19:25:33 +0000 (11:25 -0800)]
Merge tag 'linux-kselftest-kunit-fixes-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kunit fixes from Shuah Khan:
 "Five fixes to the kunit tool and documentation from Daniel Latypov and
  David Gow"

* tag 'linux-kselftest-kunit-fixes-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: tool: move kunitconfig parsing into __init__, make it optional
  kunit: tool: fix minor typing issue with None status
  kunit: tool: surface and address more typing issues
  Documentation: kunit: include example of a parameterized test
  kunit: tool: Fix spelling of "diagnostic" in kunit_parser

5 years agocifs: do not fail __smb_send_rqst if non-fatal signals are pending
Ronnie Sahlberg [Wed, 20 Jan 2021 22:22:48 +0000 (08:22 +1000)]
cifs: do not fail __smb_send_rqst if non-fatal signals are pending

RHBZ 1848178

The original intent of returning an error in this function
in the patch:
  "CIFS: Mask off signals when sending SMB packets"
was to avoid interrupting packet send in the middle of
sending the data (and thus breaking an SMB connection),
but we also don't want to fail the request for non-fatal
signals even before we have had a chance to try to
send it (the reported problem could be reproduced e.g.
by exiting a child process when the parent process was in
the midst of calling futimens to update a file's timestamps).

In addition, since the signal may remain pending when we enter the
sending loop, we may end up not sending the whole packet before
TCP buffers become full. In this case the code returns -EINTR
but what we need here is to return -ERESTARTSYS instead to
allow system calls to be restarted.
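
The two checks described above can be modelled with a small userspace
sketch.  This is not the cifs code: struct model_signals and both helper
functions are hypothetical, and the return value used for the fatal-signal
case is an assumption of the sketch.  MODEL_ERESTARTSYS mirrors the
numeric value of the kernel-internal ERESTARTSYS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ERESTARTSYS 512  /* same value as the kernel's ERESTARTSYS */

/* Hypothetical snapshot of the sender's signal state. */
struct model_signals {
	bool pending;  /* some signal is pending */
	bool fatal;    /* a fatal signal is pending */
};

/* Before sending: only a fatal signal aborts the request (assumed rc). */
static int model_check_before_send(const struct model_signals *sig)
{
	assert(!sig->fatal || sig->pending);  /* fatal implies pending */
	return sig->fatal ? -MODEL_ERESTARTSYS : 0;
}

/* After sending: a partial send with a signal pending asks for a restart. */
static int model_check_after_send(const struct model_signals *sig,
				  size_t sent, size_t total)
{
	if (sig->pending && sent != total)
		return -MODEL_ERESTARTSYS;
	return 0;
}
```

With a non-fatal signal pending (such as the one delivered when a child
exits), the request is still attempted, and only a partial send yields the
restartable error.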

Fixes: b30c74c73c78 ("CIFS: Mask off signals when sending SMB packets")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
5 years agoMerge tag 'for-5.11/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Fri, 22 Jan 2021 22:31:00 +0000 (14:31 -0800)]
Merge tag 'for-5.11/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - Fix DM integrity crash if "recalculate" used without "internal_hash"

 - Fix DM integrity "recalculate" support to prevent recalculating
   checksums if we use internal_hash or journal_hash with a key (e.g.
   HMAC). Use of crypto as a means to prevent malicious corruption
   requires further changes and was never a design goal for
   dm-integrity's primary usecase of detecting accidental corruption.

 - Fix a benign dm-crypt copy-and-paste bug introduced as part of a fix
   that was merged for 5.11-rc4.

 - Fix DM core's dm_get_device() to avoid filesystem lookup to get block
   device (if possible).

* tag 'for-5.11/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: avoid filesystem lookup in dm_get_dev_t()
  dm crypt: fix copy and paste bug in crypt_alloc_req_aead
  dm integrity: conditionally disable "recalculate" feature
  dm integrity: fix a crash if "recalculate" used without "internal_hash"

5 years agoMerge tag 'perf-tools-fixes-v5.11-2-2021-01-22' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Fri, 22 Jan 2021 21:55:00 +0000 (13:55 -0800)]
Merge tag 'perf-tools-fixes-v5.11-2-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux

Pull more perf tools fixes from Arnaldo Carvalho de Melo:

 - Fix id index used in Intel PT for heterogeneous systems

 - Fix overrun issue in 'perf script' for dynamically-allocated PMU type
   number

 - Fix 'perf stat' metrics containing the 'duration_time' synthetic
   event

 - Fix system PMU 'perf stat' metrics

* tag 'perf-tools-fixes-v5.11-2-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  perf script: Fix overrun issue for dynamically-allocated PMU type number
  perf metricgroup: Fix system PMU metrics
  perf metricgroup: Fix for metrics containing duration_time
  perf evlist: Fix id index for heterogeneous systems

5 years agoMerge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Linus Torvalds [Fri, 22 Jan 2021 21:51:17 +0000 (13:51 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Catalin Marinas:

 - Correctly mask out bits 63:60 in a kernel tag check fault address
   (specified as unknown by the architecture). Previously they were just
   zeroed but for kernel pointers they need to be all ones.

 - Fix a panic (unexpected kernel BRK exception) caused by kprobes being
   reentered due to an interrupt.

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: kprobes: Fix Uexpected kernel BRK exception at EL1
  kasan, arm64: fix pointer tags in KASAN reports

5 years agoMerge tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client
Linus Torvalds [Fri, 22 Jan 2021 21:47:25 +0000 (13:47 -0800)]
Merge tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A patch to zero out sensitive cryptographic data and two minor
  cleanups prompted by the fact that a bunch of code was moved in this
  cycle"

* tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client:
  libceph: fix "Boolean result is used in bitwise operation" warning
  libceph, ceph: disambiguate ceph_connection_operations handlers
  libceph: zero out session key and connection secret

5 years agoMerge tag 'fixes-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt...
Linus Torvalds [Fri, 22 Jan 2021 21:45:52 +0000 (13:45 -0800)]
Merge tag 'fixes-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock

Pull typo fix from Mike Rapoport:
 "Fix typo in comment of memblock_phys_alloc_try_nid()"

* tag 'fixes-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
  mm/memblock: Fix typo in comment of memblock_phys_alloc_try_nid()

5 years agoMerge tag 'mmc-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Linus Torvalds [Fri, 22 Jan 2021 21:43:42 +0000 (13:43 -0800)]
Merge tag 'mmc-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:
 "MMC core:
   - Fix initialization of block size when ext_csd isn't present

  MMC host:
   - sdhci-brcmstb: Fix mmc timeout errors on S5 suspend
   - sdhci-of-dwcmshc: Fix request accessing RPMB
   - sdhci-xenon: Fix 1.8v regulator stabilization"

* tag 'mmc-v5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: core: don't initialize block size from ext_csd if not present
  mmc: sdhci-brcmstb: Fix mmc timeout errors on S5 suspend
  mmc: sdhci-xenon: fix 1.8v regulator stabilization
  mmc: sdhci-of-dwcmshc: fix rpmb access

5 years agoMerge tag 'platform-drivers-x86-v5.11-2' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Fri, 22 Jan 2021 21:38:40 +0000 (13:38 -0800)]
Merge tag 'platform-drivers-x86-v5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86

Pull x86 platform driver fixes from Hans de Goede:
 "A small collection of bug-fixes and model-specific quirks"

* tag 'platform-drivers-x86-v5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
  platform/x86: thinkpad_acpi: Add P53/73 firmware to fan_quirk_table for dual fan control
  platform/x86: hp-wmi: Don't log a warning on HPWMI_RET_UNKNOWN_COMMAND errors
  platform/x86: intel-vbtn: Drop HP Stream x360 Convertible PC 11 from allow-list
  platform/x86: ideapad-laptop: Disable touchpad_switch for ELAN0634
  platform/x86: amd-pmc: Fix CONFIG_DEBUG_FS check
  platform/x86: thinkpad_acpi: correct palmsensor error checking
  platform/x86: intel-vbtn: Support for tablet mode on Dell Inspiron 7352
  platform/x86: touchscreen_dmi: Add swap-x-y quirk for Goodix touchscreen on Estar Beauty HD tablet
  platform/x86: i2c-multi-instantiate: Don't create platform device for INT3515 ACPI nodes
  platform/surface: SURFACE_PLATFORMS should depend on ACPI
  platform/surface: surface_gpe: Fix non-PM_SLEEP build warnings
  tools/power/x86/intel-speed-select: Set higher of cpuinfo_max_freq or base_frequency
  tools/power/x86/intel-speed-select: Set scaling_max_freq to base_frequency

5 years agoio_uring: fix short read retries for non-reg files
Pavel Begunkov [Thu, 21 Jan 2021 12:01:08 +0000 (12:01 +0000)]
io_uring: fix short read retries for non-reg files

Sockets and other non-regular files may actually expect short reads to
happen, so don't retry reads for them. Because non-reg files don't set
FMODE_BUF_RASYNC, no second/retry do_read() is done for them, and we can
filter out those cases after the first do_read() attempt with ret > 0.
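
A userspace analogue of this policy, using plain read(2) instead of
io_uring internals, might look like the sketch below.  The helper names
fd_is_regular() and read_retry_regular_only() are hypothetical; the point
is only that a short first read is retried for regular files and returned
as-is for everything else.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* True if fd refers to a regular file. */
static bool fd_is_regular(int fd)
{
	struct stat st;

	return fstat(fd, &st) == 0 && S_ISREG(st.st_mode);
}

/*
 * Read up to len bytes.  A short first read is retried only for regular
 * files; sockets, pipes and the like get the short count back at once.
 */
static ssize_t read_retry_regular_only(int fd, void *buf, size_t len)
{
	char *p = buf;
	ssize_t ret = read(fd, p, len);

	assert(buf != NULL);
	if (ret <= 0 || (size_t)ret == len || !fd_is_regular(fd))
		return ret;

	size_t done = (size_t)ret;
	while (done < len) {
		ret = read(fd, p + done, len - done);
		if (ret <= 0)
			break;  /* EOF or error: report what we have */
		done += (size_t)ret;
	}
	return (ssize_t)done;
}
```

Reading 8 bytes from a pipe that holds only 3 returns 3 immediately
instead of blocking in a retry.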

Cc: stable@vger.kernel.org # 5.9+
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoio_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
Jens Axboe [Tue, 19 Jan 2021 17:10:54 +0000 (10:10 -0700)]
io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state

IORING_OP_CLOSE is special in terms of cancelation, since it has an
intermediate state where we've removed the file descriptor but haven't
closed the file yet. For that reason, it's currently marked with
IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op
is always run even if canceled, to prevent leaving us with a live file
but an fd that is gone. However, with SQPOLL, since a cancel request
doesn't carry any resources on behalf of the request being canceled, if
we cancel before any of the close op has been run, we can end up with
io-wq not having the ->files assigned. This can result in the following
oops reported by Joseph:

BUG: kernel NULL pointer dereference, address: 00000000000000d8
PGD 800000010b76f067 P4D 800000010b76f067 PUD 10b462067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1788 Comm: io_uring-sq Not tainted 5.11.0-rc4 #1
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:__lock_acquire+0x19d/0x18c0
Code: 00 00 8b 1d fd 56 dd 08 85 db 0f 85 43 05 00 00 48 c7 c6 98 7b 95 82 48 c7 c7 57 96 93 82 e8 9a bc f5 ff 0f 0b e9 2b 05 00 00 <48> 81 3f c0 ca 67 8a b8 00 00 00 00 41 0f 45 c0 89 04 24 e9 81 fe
RSP: 0018:ffffc90001933828 EFLAGS: 00010002
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000000d8
RBP: 0000000000000246 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff888106e8a140 R15: 00000000000000d8
FS:  0000000000000000(0000) GS:ffff88813bd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d8 CR3: 0000000106efa004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 lock_acquire+0x31a/0x440
 ? close_fd_get_file+0x39/0x160
 ? __lock_acquire+0x647/0x18c0
 _raw_spin_lock+0x2c/0x40
 ? close_fd_get_file+0x39/0x160
 close_fd_get_file+0x39/0x160
 io_issue_sqe+0x1334/0x14e0
 ? lock_acquire+0x31a/0x440
 ? __io_free_req+0xcf/0x2e0
 ? __io_free_req+0x175/0x2e0
 ? find_held_lock+0x28/0xb0
 ? io_wq_submit_work+0x7f/0x240
 io_wq_submit_work+0x7f/0x240
 io_wq_cancel_cb+0x161/0x580
 ? io_wqe_wake_worker+0x114/0x360
 ? io_uring_get_socket+0x40/0x40
 io_async_find_and_cancel+0x3b/0x140
 io_issue_sqe+0xbe1/0x14e0
 ? __lock_acquire+0x647/0x18c0
 ? __io_queue_sqe+0x10b/0x5f0
 __io_queue_sqe+0x10b/0x5f0
 ? io_req_prep+0xdb/0x1150
 ? mark_held_locks+0x6d/0xb0
 ? mark_held_locks+0x6d/0xb0
 ? io_queue_sqe+0x235/0x4b0
 io_queue_sqe+0x235/0x4b0
 io_submit_sqes+0xd7e/0x12a0
 ? _raw_spin_unlock_irq+0x24/0x30
 ? io_sq_thread+0x3ae/0x940
 io_sq_thread+0x207/0x940
 ? do_wait_intr_irq+0xc0/0xc0
 ? __ia32_sys_io_uring_enter+0x650/0x650
 kthread+0x134/0x180
 ? kthread_create_worker_on_cpu+0x90/0x90
 ret_from_fork+0x1f/0x30

Fix this by moving the IO_WQ_WORK_NO_CANCEL until _after_ we've modified
the fdtable. Canceling before this point is totally fine, and running
it in the io-wq context _after_ that point is also fine.

For 5.12, we'll handle this internally and get rid of the no-cancel
flag, as IORING_OP_CLOSE is the only user of it.

Cc: stable@vger.kernel.org
Fixes: b5dba59e0cf7 ("io_uring: add support for IORING_OP_CLOSE")
Reported-by: "Abaci <abaci@linux.alibaba.com>"
Reviewed-and-tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 years agoarm64: kprobes: Fix Uexpected kernel BRK exception at EL1
Qais Yousef [Fri, 22 Jan 2021 11:09:09 +0000 (11:09 +0000)]
arm64: kprobes: Fix Uexpected kernel BRK exception at EL1

I was hitting the below panic continuously when attaching kprobes to
scheduler functions

[  159.045212] Unexpected kernel BRK exception at EL1
[  159.053753] Internal error: BRK handler: f2000006 [#1] PREEMPT SMP
[  159.059954] Modules linked in:
[  159.063025] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.11.0-rc4-00008-g1e2a199f6ccd #56
[rt-app] <notice> [1] Exiting.[  159.071166] Hardware name: ARM Juno development board (r2) (DT)
[  159.079689] pstate: 600003c5 (nZCv DAIF -PAN -UAO -TCO BTYPE=--)

[  159.085723] pc : 0xffff80001624501c
[  159.089377] lr : attach_entity_load_avg+0x2ac/0x350
[  159.094271] sp : ffff80001622b640
[rt-app] <notice> [0] Exiting.[  159.097591] x29: ffff80001622b640 x28: 0000000000000001
[  159.105515] x27: 0000000000000049 x26: ffff000800b79980

[  159.110847] x25: ffff00097ef37840 x24: 0000000000000000
[  159.116331] x23: 00000024eacec1ec x22: ffff00097ef12b90
[  159.121663] x21: ffff00097ef37700 x20: ffff800010119170
[rt-app] <notice> [11] Exiting.[  159.126995] x19: ffff00097ef37840 x18: 000000000000000e
[  159.135003] x17: 0000000000000001 x16: 0000000000000019
[  159.140335] x15: 0000000000000000 x14: 0000000000000000
[  159.145666] x13: 0000000000000002 x12: 0000000000000002
[  159.150996] x11: ffff80001592f9f0 x10: 0000000000000060
[  159.156327] x9 : ffff8000100f6f9c x8 : be618290de0999a1
[  159.161659] x7 : ffff80096a4b1000 x6 : 0000000000000000
[  159.166990] x5 : ffff00097ef37840 x4 : 0000000000000000
[  159.172321] x3 : ffff000800328948 x2 : 0000000000000000
[  159.177652] x1 : 0000002507d52fec x0 : ffff00097ef12b90
[  159.182983] Call trace:
[  159.185433]  0xffff80001624501c
[  159.188581]  update_load_avg+0x2d0/0x778
[  159.192516]  enqueue_task_fair+0x134/0xe20
[  159.196625]  enqueue_task+0x4c/0x2c8
[  159.200211]  ttwu_do_activate+0x70/0x138
[  159.204147]  sched_ttwu_pending+0xbc/0x160
[  159.208253]  flush_smp_call_function_queue+0x16c/0x320
[  159.213408]  generic_smp_call_function_single_interrupt+0x1c/0x28
[  159.219521]  ipi_handler+0x1e8/0x3c8
[  159.223106]  handle_percpu_devid_irq+0xd8/0x460
[  159.227650]  generic_handle_irq+0x38/0x50
[  159.231672]  __handle_domain_irq+0x6c/0xc8
[  159.235781]  gic_handle_irq+0xcc/0xf0
[  159.239452]  el1_irq+0xb4/0x180
[  159.242600]  rcu_is_watching+0x28/0x70
[  159.246359]  rcu_read_lock_held_common+0x44/0x88
[  159.250991]  rcu_read_lock_any_held+0x30/0xc0
[  159.255360]  kretprobe_dispatcher+0xc4/0xf0
[  159.259555]  __kretprobe_trampoline_handler+0xc0/0x150
[  159.264710]  trampoline_probe_handler+0x38/0x58
[  159.269255]  kretprobe_trampoline+0x70/0xc4
[  159.273450]  run_rebalance_domains+0x54/0x80
[  159.277734]  __do_softirq+0x164/0x684
[  159.281406]  irq_exit+0x198/0x1b8
[  159.284731]  __handle_domain_irq+0x70/0xc8
[  159.288840]  gic_handle_irq+0xb0/0xf0
[  159.292510]  el1_irq+0xb4/0x180
[  159.295658]  arch_cpu_idle+0x18/0x28
[  159.299245]  default_idle_call+0x9c/0x3e8
[  159.303265]  do_idle+0x25c/0x2a8
[  159.306502]  cpu_startup_entry+0x2c/0x78
[  159.310436]  secondary_start_kernel+0x160/0x198
[  159.314984] Code: d42000c0 aa1e03e9 d42000c0 aa1e03e9 (d42000c0)

After a bit of head scratching and debugging it turned out that it is
due to kprobe handler being interrupted by a tick that causes us to go
into (I think another) kprobe handler.

The culprit was kprobe_breakpoint_ss_handler() returning DBG_HOOK_ERROR
which leads to the Unexpected kernel BRK exception.

Reverting commit ba090f9cafd5 ("arm64: kprobes: Remove redundant
kprobe_step_ctx") seemed to fix the problem for me.

Further analysis showed that kcb->kprobe_status is set to
KPROBE_REENTER when the error occurs. By teaching
kprobe_breakpoint_ss_handler() to handle this status I can no longer
reproduce the problem.
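
The effect of accepting the extra status can be shown with a small model
of the handler's decision.  This is a sketch, not the arm64 code: the
MODEL_* values, enum model_hook_result and model_ss_handler() are
hypothetical, and the set of accepted statuses is passed in so that the
old and new behaviour can be compared.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values for this sketch; the kernel has its own definitions. */
#define MODEL_KPROBE_HIT_SS   (1u << 1)
#define MODEL_KPROBE_REENTER  (1u << 2)

enum model_hook_result { MODEL_HOOK_HANDLED, MODEL_HOOK_ERROR };

/*
 * Model of a single-step breakpoint handler: the exception is claimed only
 * if a probe is running, its status is one of the accepted ones, and the
 * faulting address is the expected single-step slot.
 */
static enum model_hook_result
model_ss_handler(bool probe_running, unsigned int status,
		 unsigned int accepted, bool addr_matches)
{
	assert(accepted != 0);
	if (probe_running && (status & accepted) && addr_matches)
		return MODEL_HOOK_HANDLED;
	return MODEL_HOOK_ERROR;  /* surfaces as an unexpected BRK */
}
```

With only the single-step status accepted, a reentered probe falls through
to the error path; adding the reenter status to the accepted set lets the
handler claim the exception.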

Fixes: ba090f9cafd5 ("arm64: kprobes: Remove redundant kprobe_step_ctx")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/r/20210122110909.3324607-1-qais.yousef@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
5 years agosched: Relax the set_cpus_allowed_ptr() semantics
Peter Zijlstra [Sat, 16 Jan 2021 10:56:37 +0000 (11:56 +0100)]
sched: Relax the set_cpus_allowed_ptr() semantics

Now that we have KTHREAD_IS_PER_CPU to denote the critical per-cpu
tasks to retain during CPU offline, we can relax the warning in
set_cpus_allowed_ptr(). Any spurious kthread that wants to get on at
the last minute will get pushed off before it can run.

During CPU online, meanwhile, there is no harm, and actual benefit, in
allowing kthreads back on early: it simplifies hotplug code and fixes
a number of outstanding races.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103507.240724591@infradead.org
5 years agosched: Fix CPU hotplug / tighten is_per_cpu_kthread()
Peter Zijlstra [Tue, 12 Jan 2021 10:28:16 +0000 (11:28 +0100)]
sched: Fix CPU hotplug / tighten is_per_cpu_kthread()

Prior to commit 1cf12e08bc4d ("sched/hotplug: Consolidate task
migration on CPU unplug") we'd leave any task on the dying CPU and
break affinity and force them off at the very end.

This scheme had to change in order to enable migrate_disable(). One
cannot wait for migrate_disable() to complete while stuck in
stop_machine(). Furthermore, since we need at the very least: idle,
hotplug and stop threads at any point before stop_machine, we can't
break affinity and/or push those away.

Under the assumption that all per-cpu kthreads are sanely handled by
CPU hotplug, the new code no longer breaks affinity or migrates any of
them (which then includes the critical ones above).

However, there's an important difference between per-cpu kthreads and
kthreads that happen to have a single CPU affinity which is lost. The
latter class very much relies on the forced affinity breaking and
migration semantics previously provided.

Use the new kthread_is_per_cpu() infrastructure to tighten
is_per_cpu_kthread() and fix the hot-unplug problems stemming from the
change.
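
The distinction can be sketched in a few lines of userspace C.  This is a
model only: struct model_task and the MODEL_* flag values are
hypothetical, and the per-cpu mark is folded into the same flags word for
brevity, which is a simplification made by this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this sketch only. */
#define MODEL_PF_KTHREAD          (1u << 0)
#define MODEL_KTHREAD_IS_PER_CPU  (1u << 1)

struct model_task {
	unsigned int flags;
	int nr_cpus_allowed;
};

/* Loose test: any kthread with single-CPU affinity qualifies. */
static bool model_is_per_cpu_kthread_loose(const struct model_task *t)
{
	assert(t->nr_cpus_allowed > 0);
	return (t->flags & MODEL_PF_KTHREAD) && t->nr_cpus_allowed == 1;
}

/* Tightened test: the kthread must also carry the explicit per-cpu mark. */
static bool model_is_per_cpu_kthread_tight(const struct model_task *t)
{
	return model_is_per_cpu_kthread_loose(t) &&
	       (t->flags & MODEL_KTHREAD_IS_PER_CPU);
}
```

A kthread that merely happens to be affine to one CPU passes the loose
test but fails the tightened one, so it remains eligible for forced
affinity breaking and migration.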

Fixes: 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103507.102416009@infradead.org
5 years agosched: Prepare to use balance_push in ttwu()
Peter Zijlstra [Wed, 20 Jan 2021 14:05:41 +0000 (15:05 +0100)]
sched: Prepare to use balance_push in ttwu()

In preparation for using the balance_push state in ttwu() we need it to
provide a reliable and consistent state.

The immediate problem is that rq->balance_callback gets cleared on every
schedule() and then re-set in the balance_push_callback() itself. This
is not a reliable signal, so add a variable that stays set during the
entire time.

Also move setting it before the synchronize_rcu() in
sched_cpu_deactivate(), such that we get guaranteed visibility to
ttwu(), which is a preempt-disable region.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.966069627@infradead.org
5 years agoworkqueue: Restrict affinity change to rescuer
Peter Zijlstra [Fri, 15 Jan 2021 18:08:36 +0000 (19:08 +0100)]
workqueue: Restrict affinity change to rescuer

create_worker() will already set the right affinity using
kthread_bind_mask(); this means only the rescuer will need to change
its affinity.

However, while in cpu-hot-unplug a regular task is not allowed to run
on online&&!active as it would be pushed away quite aggressively. We
need KTHREAD_IS_PER_CPU to survive in that environment.

Therefore set the affinity after getting that magic flag.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.826629830@infradead.org
5 years agoworkqueue: Tag bound workers with KTHREAD_IS_PER_CPU
Peter Zijlstra [Tue, 12 Jan 2021 10:26:49 +0000 (11:26 +0100)]
workqueue: Tag bound workers with KTHREAD_IS_PER_CPU

Mark the per-cpu workqueue workers as KTHREAD_IS_PER_CPU.

Workqueues have unfortunate semantics in that per-cpu workers are not
flushed and parked by default during hotplug; however, a subset does a
manual flush on hotplug and relies hard on them for correctness.

Therefore play silly games..

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.693465814@infradead.org
5 years agokthread: Extract KTHREAD_IS_PER_CPU
Peter Zijlstra [Tue, 12 Jan 2021 10:24:04 +0000 (11:24 +0100)]
kthread: Extract KTHREAD_IS_PER_CPU

There is a need to distinguish genuine per-cpu kthreads from kthreads
that happen to have a single CPU affinity.

Genuine per-cpu kthreads are kthreads that are CPU affine for
correctness; these will obviously have PF_KTHREAD set, but must also
have PF_NO_SETAFFINITY set, lest userspace modify their affinity and
ruin things.

However, these two things are not sufficient; PF_NO_SETAFFINITY is
also set on other tasks that have their affinities controlled through
other means, like for instance workqueues.

Therefore another bit is needed; it turns out kthread_create_per_cpu()
already has such a bit: KTHREAD_IS_PER_CPU, which is used to make
kthread_park()/kthread_unpark() work correctly.

Expose this flag and remove the implicit setting of it from
kthread_create_on_cpu(); the io_uring usage of it seems dubious at
best.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.557620262@infradead.org
5 years agosched: Don't run cpu-online with balance_push() enabled
Peter Zijlstra [Fri, 15 Jan 2021 17:17:45 +0000 (18:17 +0100)]
sched: Don't run cpu-online with balance_push() enabled

We don't need to push away tasks when we come online; mark the push
complete right before the CPU dies.

XXX hotplug state machine has trouble with rollback here.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210121103506.415606087@infradead.org