Jeff Layton [Tue, 30 Jul 2019 15:52:16 +0000 (11:52 -0400)]
ceph: new tracepoints when adding and removing caps
Add support for two new tracepoints surrounding the adding/updating and
removing of caps from the cache. To support this, we also add new functions
for printing cap strings a'la ceph_cap_string().
Jeff Layton [Wed, 27 Nov 2019 17:06:14 +0000 (12:06 -0500)]
ceph: attempt to do async create when possible
With the Octopus release, the MDS will hand out directoy create caps.
If we have Fxc caps on the directory, and complete directory information
or a known negative dentry, then we can return without waiting on the
reply, allowing the open() call to return very quickly to userland.
We use the normal ceph_fill_inode() routine to fill in the inode, so we
have to gin up some reply inode information with what we'd expect a
newly-created inode to have. The client assumes that it has a full set
of caps on the new inode, and that the MDS will revoke them when there
is conflicting access.
This functionality is gated on the enable_async_dirops module option,
along with async unlinks.
Jeff Layton [Thu, 2 Jan 2020 12:11:38 +0000 (07:11 -0500)]
ceph: copy layout, max_size and truncate_size on successful sync create
It doesn't do much good to do an asynchronous create unless we can do
I/O to it before the create reply comes in. That means we need layout
info the new file before we've gotten the response from the MDS.
All files created in a directory will initially inherit the same layout,
so copy off the requisite info from the first synchronous create in the
directory. Save it in the same fields in the directory inode, as those
are otherwise unsed for dir inodes. This means we need to be a bit
careful about only updating layout info on non-dir inodes.
Also, zero out the layout when we drop Dc caps in the dir.
Jeff Layton [Mon, 2 Dec 2019 18:47:57 +0000 (13:47 -0500)]
ceph: add flag to delegate an inode number for async create
In order to issue an async create request, we need to send an inode
number when we do the request, but we don't know which to which MDS
we'll be issuing the request.
Add a new r_req_flag that tells the request sending machinery to
grab an inode number from the delegated set, and encode it into the
request. If it can't get one then have it return -ECHILD. The
requestor can then reissue a synchronous request.
Jeff Layton [Fri, 15 Nov 2019 16:51:55 +0000 (11:51 -0500)]
ceph: decode interval_sets for delegated inos
Starting in Octopus, the MDS will hand out caps that allow the client
to do asynchronous file creates under certain conditions. As part of
that, the MDS will delegate ranges of inode numbers to the client.
Add the infrastructure to decode these ranges, and stuff them into an
xarray for later consumption by the async creation code.
Because the xarray code currently only handles unsigned long indexes,
and those are 32-bits on 32-bit arches, we only enable the decoding when
running on a 64-bit arch.
Luis Henriques [Wed, 8 Jan 2020 10:03:53 +0000 (10:03 +0000)]
ceph: use copy-from2 op in copy_file_range
Instead of using the copy-from operation, switch copy_file_range to the
new copy-from2 operation, which allows to send the truncate_seq and
truncate_size parameters.
If an OSD does not support the copy-from2 operation it will return
-EOPNOTSUPP. In that case, the kernel client will stop trying to do
remote object copies for this fs client and will always use the generic
VFS copy_file_range.
Signed-off-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Tue, 2 Apr 2019 19:35:56 +0000 (15:35 -0400)]
ceph: perform asynchronous unlink if we have sufficient caps
The MDS is getting a new lock-caching facility that will allow it
to cache the necessary locks to allow asynchronous directory operations.
Since the CEPH_CAP_FILE_* caps are currently unused on directories,
we can repurpose those bits for this purpose.
When performing an unlink, if we have Fx on the parent directory,
and CEPH_CAP_DIR_UNLINK (aka Fr), and we know that the dentry being
removed is the primary link, then then we can fire off an unlink
request immediately and don't need to wait on reply before returning.
In that situation, just fix up the dcache and link count and return
immediately after issuing the call to the MDS. This does mean that we
need to hold an extra reference to the inode being unlinked, and extra
references to the caps to avoid races. Those references are put and
error handling is done in the r_callback routine.
If the operation ends up failing, then set a writeback error on the
directory inode that can be fetched later by an fsync on the dir.
Since this feature is very new, also add a new module parameter to
enable and disable it (default is disabled).
Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Thu, 4 Apr 2019 12:05:38 +0000 (08:05 -0400)]
ceph: register MDS request with dir inode from the start
When the unsafe reply to a request comes in, the request is put on the
r_unsafe_dir inode's list. In future patches, we're going to need to
wait on requests that may not have gotten an unsafe reply yet.
Change __register_request to put the entry on the dir inode's list when
the pointer is set in the request, and don't check the
CEPH_MDS_R_GOT_UNSAFE flag when unregistering it.
The only place that uses this list today is fsync codepath, and with
the coming changes, we'll want to wait on all operations whether it has
gotten an unsafe reply or not.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Jeff Layton [Wed, 3 Apr 2019 17:16:01 +0000 (13:16 -0400)]
ceph: hold extra reference to r_parent over life of request
Currently, we just assume that it will stick around by virtue of the
submitter's reference, but later patches will allow the syscall to
return early and we can't rely on that reference at that point.
Take an extra reference to the inode when setting r_parent and release
it when releasing the request.
Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Arnd Bergmann [Tue, 7 Jan 2020 21:01:04 +0000 (22:01 +0100)]
rbd: work around -Wuninitialized warning
gcc -O3 warns about a dummy variable that is passed
down into rbd_img_fill_nodata without being initialized:
drivers/block/rbd.c: In function 'rbd_img_fill_nodata':
drivers/block/rbd.c:2573:13: error: 'dummy' is used uninitialized in this function [-Werror=uninitialized]
fctx->iter = *fctx->pos;
Since this is a dummy, I assume the warning is harmless, but
it's better to initialize it anyway and avoid the warning.
Fixes: mmtom ("init/Kconfig: enable -O3 for all arches") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Have the client ignore the extra slashes in the server path when
mounting. This will also canonicalize the path, so that identical mounts
can be consilidated.
Regardless of the internal treatment of these paths, the kernel still
stores the original string including the leading '/' for presentation
to userland.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Jeff Layton [Wed, 11 Dec 2019 20:21:24 +0000 (15:21 -0500)]
ceph: don't clear I_NEW until inode metadata is fully populated
Currently, we could have an open-by-handle (or NFS server) call
into the filesystem and start working with an inode before it's
properly filled out.
Don't clear I_NEW until we have filled out the inode, and discard it
properly if that fails. Note that we occasionally take an extra
reference to the inode to ensure that we don't put the last reference in
discard_new_inode, but rather leave it for ceph_async_iput.
Xiubo Li [Mon, 9 Dec 2019 12:47:15 +0000 (07:47 -0500)]
ceph: retry the same mds later after the new session is opened
If max_mds > 1 and a request is submitted that chooses a random mds
rank, and the relating session is not opened yet, the request will wait
until the session has been opened and resend again.
Every time the request goes through __do_request, it will release the
req->session first and choose a random one again, which may be a
completely different rank than the one it just waited on.
In the worst case, it will open all the mds sessions one by one just
before the request can be successfully sent out.
URL: https://tracker.ceph.com/issues/43270 Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Wed, 11 Dec 2019 01:29:40 +0000 (20:29 -0500)]
ceph: check availability of mds cluster on mount after wait timeout
If all the MDS daemons are down for some reason, then the first mount
attempt will fail with EIO after the mount request times out. A mount
attempt will also fail with EIO if all of the MDS's are laggy.
This patch changes the code to return -EHOSTUNREACH in these situations
and adds a pr_info error message to help the admin determine the cause.
URL: https://tracker.ceph.com/issues/4386 Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Fri, 6 Dec 2019 03:35:51 +0000 (22:35 -0500)]
ceph: keep the session state until it is released
When reconnecting the session but if it is denied by the MDS due
to client was in blacklist or something else, kclient will receive
a session close reply, and we will never see the important log:
"ceph: mds%d reconnect denied"
And with the confusing log:
"ceph: handle_session mds0 close 0000000085804730 state ??? seq 0"
Let's keep the session state until its memories is released.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Wed, 4 Dec 2019 06:27:18 +0000 (01:27 -0500)]
ceph: fix possible long time wait during umount
During umount, if there has no any unsafe request in the mdsc and
some requests still in-flight and not got reply yet, and if the
rest requets are all safe ones, after that even all of them in mdsc
are unregistered, the umount must wait until after mount_timeout
seconds anyway.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Xiubo Li [Tue, 26 Nov 2019 12:24:21 +0000 (07:24 -0500)]
ceph: fix mdsmap cluster available check based on laggy number
In case the max_mds > 1 in MDS cluster and there is no any standby
MDS and all the max_mds MDSs are in up:active state, if one of the
up:active MDSs is dead, the m->m_num_laggy in kclient will be 1.
Then the mount will fail without considering other healthy MDSs.
There manybe some MDSs still "in" the cluster but not in up:active
state, we will ignore them. Only when all the up:active MDSs in
the cluster are laggy will treat the cluster as not be available.
In case decreasing the max_mds, the cluster will not stop the extra
up:active MDSs immediately and there will be a latency. During it
the up:active MDS number will be larger than the max_mds, so later
the m_info memories will 100% be reallocated.
Here will pick out the up:active MDSs as the m_num_mds and allocate
the needed memories once.
Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
We print session's refcount in debug message inside
ceph_put_mds_session() and get_session(), so don't
have to print it in con_get()/__ceph_lookup_mds_session()/con_put().
Linus Torvalds [Sun, 5 Jan 2020 19:15:31 +0000 (11:15 -0800)]
Merge tag 'riscv/for-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
"Several fixes for RISC-V:
- Fix function graph trace support
- Prefix the CSR IRQ_* macro names with "RV_", to avoid collisions
with macros elsewhere in the Linux kernel tree named "IRQ_TIMER"
- Use __pa_symbol() when computing the physical address of a kernel
symbol, rather than __pa()
- Mark the RISC-V port as supporting GCOV
One DT addition:
- Describe the L2 cache controller in the FU540 DT file
One documentation update:
- Add patch acceptance guideline documentation"
* tag 'riscv/for-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
Documentation: riscv: add patch acceptance guidelines
riscv: prefix IRQ_ macro names with an RV_ namespace
clocksource: riscv: add notrace to riscv_sched_clock
riscv: ftrace: correct the condition logic in function graph tracer
riscv: dts: Add DT support for SiFive L2 cache controller
riscv: gcov: enable gcov for RISC-V
riscv: mm: use __pa_symbol for kernel symbols
Formalize, in kernel documentation, the patch acceptance policy for
arch/riscv. In summary, it states that as maintainers, we plan to
only accept patches for new modules or extensions that have been
frozen or ratified by the RISC-V Foundation.
We've been following these guidelines for the past few months. In the
meantime, we've received quite a bit of feedback that it would be
helpful to have these guidelines formally documented.
Based on a suggestion from Matthew Wilcox, we also add a link to this
file to Documentation/process/index.rst, to make this document easier
to find. The format of this document has also been changed to align
to the format outlined in the maintainer entry profiles, in accordance
with comments from Jon Corbet and Dan Williams.
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com> Reviewed-by: Palmer Dabbelt <palmerdabbelt@google.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Krste Asanovic <krste@berkeley.edu> Cc: Andrew Waterman <waterman@eecs.berkeley.edu> Cc: Matthew Wilcox <willy@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jonathan Corbet <corbet@lwn.net>
Paul Walmsley [Fri, 20 Dec 2019 11:09:49 +0000 (03:09 -0800)]
riscv: prefix IRQ_ macro names with an RV_ namespace
"IRQ_TIMER", used in the arch/riscv CSR header file, is a sufficiently
generic macro name that it's used by several source files across the
Linux code base. Some of these other files ultimately include the
arch/riscv CSR include file, causing collisions. Fix by prefixing the
RISC-V csr.h IRQ_ macro names with an RV_ prefix.
Fixes: a4c3733d32a72 ("riscv: abstract out CSR names for supervisor vs machine mode") Reported-by: Olof Johansson <olof@lixom.net> Acked-by: Olof Johansson <olof@lixom.net> Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Zong Li [Mon, 23 Dec 2019 08:46:14 +0000 (16:46 +0800)]
clocksource: riscv: add notrace to riscv_sched_clock
When enabling ftrace graph tracer, it gets the tracing clock in
ftrace_push_return_trace(). Eventually, it invokes riscv_sched_clock()
to get the clock value. If riscv_sched_clock() isn't marked with
'notrace', it will call ftrace_push_return_trace() and cause infinite
loop.
Signed-off-by: Zong Li <zong.li@sifive.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
[paul.walmsley@sifive.com: cleaned up patch description] Fixes: 92e0d143fdef ("clocksource/drivers/riscv_timer: Provide the sched_clock") Cc: stable@vger.kernel.org Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>
Linus Torvalds [Sun, 5 Jan 2020 03:38:51 +0000 (19:38 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"17 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
hexagon: define ioremap_uc
ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less
ocfs2: call journal flush to mark journal as empty after journal recovery when mount
mm/hugetlb: defer freeing of huge pages if in non-task context
mm/gup: fix memory leak in __gup_benchmark_ioctl
mm/oom: fix pgtables units mismatch in Killed process message
fs/posix_acl.c: fix kernel-doc warnings
hexagon: work around compiler crash
hexagon: parenthesize registers in asm predicates
fs/namespace.c: make to_mnt_ns() static
fs/nsfs.c: include headers for missing declarations
fs/direct-io.c: include fs/internal.h for missing prototype
mm: move_pages: return valid node id in status if the page is already on the target node
memcg: account security cred as well to kmemcg
kcov: fix struct layout for kcov_remote_arg
mm/zsmalloc.c: fix the migrated zspage statistics.
mm/memory_hotplug: shrink zones when offlining memory
Linus Torvalds [Sun, 5 Jan 2020 03:28:30 +0000 (19:28 -0800)]
Merge tag 'apparmor-pr-2020-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
Pull apparmor fixes from John Johansen:
- performance regression: only get a label reference if the fast path
check fails
- fix aa_xattrs_match() may sleep while holding a RCU lock
- fix bind mounts aborting with -ENOMEM
* tag 'apparmor-pr-2020-01-04' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
apparmor: fix aa_xattrs_match() may sleep while holding a RCU lock
apparmor: only get a label reference if the fast path check fails
apparmor: fix bind mounts aborting with -ENOMEM
John Johansen [Thu, 2 Jan 2020 13:31:22 +0000 (05:31 -0800)]
apparmor: fix aa_xattrs_match() may sleep while holding a RCU lock
aa_xattrs_match() is unfortunately calling vfs_getxattr_alloc() from a
context protected by an rcu_read_lock. This can not be done as
vfs_getxattr_alloc() may sleep regardles of the gfp_t value being
passed to it.
Fix this by breaking the rcu_read_lock on the policy search when the
xattr match feature is requested and restarting the search if a policy
changes occur.
Fixes: 8e51f9087f40 ("apparmor: Add support for attaching profiles via xattr, presence and value") Reported-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: John Johansen <john.johansen@canonical.com>
Linus Torvalds [Sat, 4 Jan 2020 22:16:57 +0000 (14:16 -0800)]
Merge tag 'mips_fixes_5.5_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Paul Burton:
"A collection of MIPS fixes:
- Fill the struct cacheinfo shared_cpu_map field with sensible
values, notably avoiding issues with perf which was unhappy in the
absence of these values.
- A boot fix for Loongson 2E & 2F machines which was fallout from
some refactoring performed this cycle.
- A Kconfig dependency fix for the Loongson CPU HWMon driver.
- A couple of VDSO fixes, ensuring gettimeofday() behaves
appropriately for kernel configurations that don't include support
for a clocksource the VDSO can use & fixing the calling convention
for the n32 & n64 VDSOs which would previously clobber the $gp/$28
register.
- A build fix for vmlinuz compressed images which were
inappropriately building with -fsanitize-coverage despite not being
part of the kernel proper, then failing to link due to the missing
__sanitizer_cov_trace_pc() function.
- A couple of eBPF JIT fixes, including disabling it for MIPS32 due
to a large number of issues with the code generated there &
reflecting ISA dependencies in Kconfig to enforce that systems
which don't support the JIT must include the interpreter"
* tag 'mips_fixes_5.5_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: Avoid VDSO ABI breakage due to global register variable
MIPS: BPF: eBPF JIT: check for MIPS ISA compliance in Kconfig
MIPS: BPF: Disable MIPS32 eBPF JIT
MIPS: Prevent link failure with kcov instrumentation
MIPS: Kconfig: Use correct form for 'depends on'
mips: Fix gettimeofday() in the vdso library
MIPS: Fix boot on Fuloong2 systems
mips: cacheinfo: report shared CPU map
Gang He [Sat, 4 Jan 2020 21:00:22 +0000 (13:00 -0800)]
ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less
Because ocfs2_get_dlm_debug() function is called once less here, ocfs2
file system will trigger the system crash, usually after ocfs2 file
system is unmounted.
This system crash is caused by a generic memory corruption, these crash
backtraces are not always the same, for exapmle,
This regression problem was introduced by commit e581595ea29c ("ocfs: no
need to check return value of debugfs_create functions").
Link: http://lkml.kernel.org/r/20191225061501.13587-1-ghe@suse.com Fixes: e581595ea29c ("ocfs: no need to check return value of debugfs_create functions") Signed-off-by: Gang He <ghe@suse.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> [5.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kai Li [Sat, 4 Jan 2020 21:00:18 +0000 (13:00 -0800)]
ocfs2: call journal flush to mark journal as empty after journal recovery when mount
If journal is dirty when mount, it will be replayed but jbd2 sb log tail
cannot be updated to mark a new start because journal->j_flag has
already been set with JBD2_ABORT first in journal_init_common.
When a new transaction is committed, it will be recored in block 1
first(journal->j_tail is set to 1 in journal_reset). If emergency
restart happens again before journal super block is updated
unfortunately, the new recorded trans will not be replayed in the next
mount.
The following steps describe this procedure in detail.
1. mount and touch some files
2. these transactions are committed to journal area but not checkpointed
3. emergency restart
4. mount again and its journals are replayed
5. journal super block's first s_start is 1, but its s_seq is not updated
6. touch a new file and its trans is committed but not checkpointed
7. emergency restart again
8. mount and journal is dirty, but trans committed in 6 will not be
replayed.
This exception happens easily when this lun is used by only one node.
If it is used by multi-nodes, other node will replay its journal and its
journal super block will be updated after recovery like what this patch
does.
ocfs2_recover_node->ocfs2_replay_journal.
The following jbd2 journal can be generated by touching a new file after
journal is replayed, and seq 15 is the first valid commit, but first seq
is 13 in journal super block.
The following is journal recovery log when recovering the upper jbd2
journal when mount again.
syslog:
ocfs2: File system on device (252,1) was not unmounted cleanly, recovering it.
fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 0
fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 1
fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 2
fs/jbd2/recovery.c:(jbd2_journal_recover, 278): JBD2: recovery, exit status 0, recovered transactions 13 to 13
Due to first commit seq 13 recorded in journal super is not consistent
with the value recorded in block 1(seq is 14), journal recovery will be
terminated before seq 15 even though it is an unbroken commit, inode 8257802 is a new file and it will be lost.
Link: http://lkml.kernel.org/r/20191217020140.2197-1-li.kai4@h3c.com Signed-off-by: Kai Li <li.kai4@h3c.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Changwei Ge <gechangwei@live.cn> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both the hugetbl_lock and the subpool lock can be acquired in
free_huge_page(). One way to solve the problem is to make both locks
irq-safe. However, Mike Kravetz had learned that the hugetlb_lock is
held for a linear scan of ALL hugetlb pages during a cgroup reparentling
operation. So it is just too long to have irq disabled unless we can
break hugetbl_lock down into finer-grained locks with shorter lock hold
times.
Another alternative is to defer the freeing to a workqueue job. This
patch implements the deferred freeing by adding a free_hpage_workfn()
work function to do the actual freeing. The free_huge_page() call in a
non-task context saves the page to be freed in the hpage_freelist linked
list in a lockless manner using the llist APIs.
The generic workqueue is used to process the work, but a dedicated
workqueue can be used instead if it is desirable to have the huge page
freed ASAP.
Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of
llist APIs which simplfy the code.
Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Navid Emamdoost [Sat, 4 Jan 2020 21:00:12 +0000 (13:00 -0800)]
mm/gup: fix memory leak in __gup_benchmark_ioctl
In the implementation of __gup_benchmark_ioctl() the allocated pages
should be released before returning in case of an invalid cmd. Release
pages via kvfree().
[akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1] Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com Fixes: 714a3a1ebafe ("mm/gup_benchmark.c: add additional pinning methods") Signed-off-by: Navid Emamdoost <navid.emamdoost@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ilya Dryomov [Sat, 4 Jan 2020 21:00:09 +0000 (13:00 -0800)]
mm/oom: fix pgtables units mismatch in Killed process message
pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
As everything else is printed in kB, I chose to fix the value rather than
the string.
Before:
[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
...
[ 1878] 1000 1878 217253 151144 1269760 0 0 python
...
Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0
After:
[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
...
[ 1436] 1000 1436 217253 151890 1294336 0 0 python
...
Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com Fixes: 70cb6d267790 ("mm/oom: add oom_score_adj and pgtables to Killed process message") Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Edward Chron <echron@arista.com> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Sat, 4 Jan 2020 21:00:05 +0000 (13:00 -0800)]
fs/posix_acl.c: fix kernel-doc warnings
Fix kernel-doc warnings in fs/posix_acl.c.
Also fix one typo (setgit -> setgid).
fs/posix_acl.c:647: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode'
fs/posix_acl.c:647: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode'
fs/posix_acl.c:647: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode'
Link: http://lkml.kernel.org/r/29b0dc46-1f28-a4e5-b1d0-ba2b65629779@infradead.org Fixes: 073931017b49d ("posix_acl: Clear SGID bit when setting file permissions") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Biggers [Sat, 4 Jan 2020 20:59:52 +0000 (12:59 -0800)]
fs/nsfs.c: include headers for missing declarations
Include linux/proc_fs.h and fs/internal.h to address the following
'sparse' warnings:
fs/nsfs.c:41:32: warning: symbol 'ns_dentry_operations' was not declared. Should it be static?
fs/nsfs.c:145:5: warning: symbol 'open_related_ns' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20191209234822.156179-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Biggers [Sat, 4 Jan 2020 20:59:49 +0000 (12:59 -0800)]
fs/direct-io.c: include fs/internal.h for missing prototype
Include fs/internal.h to address the following 'sparse' warning:
fs/direct-io.c:591:5: warning: symbol 'sb_init_dio_done_wq' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20191209234544.128302-1-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is because the status is not set if the page is already on the
target node, but move_pages() should return valid status as long as it
succeeds. The valid status may be errno or node id.
We can't simply initialize status array to zero since the pages may be
not on node 0. Fix it by updating status with node id which the page is
already on.
Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com Fixes: a49bd4d71637 ("mm, numa: rework do_pages_move") Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Felix Abecassis <fabecassis@nvidia.com> Tested-by: Felix Abecassis <fabecassis@nvidia.com> Suggested-by: Michal Hocko <mhocko@suse.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> [4.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Sat, 4 Jan 2020 20:59:43 +0000 (12:59 -0800)]
memcg: account security cred as well to kmemcg
The cred_jar kmem_cache is already memcg accounted in the current kernel
but cred->security is not. Account cred->security to kmemcg.
Recently we saw high root slab usage on our production and on further
inspection, we found a buggy application leaking processes. Though that
buggy application was contained within its memcg but we observe much
more system memory overhead, couple of GiBs, during that period. This
overhead can adversely impact the isolation on the system.
One source of high overhead we found was cred->security objects, which
have a lifetime of at least the life of the process which allocated
them.
Link: http://lkml.kernel.org/r/20191205223721.40034-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Chris Down <chris@chrisdown.name> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make the layout of kcov_remote_arg the same for 32-bit and 64-bit code.
This makes it more convenient to write userspace apps that can be
compiled into 32-bit or 64-bit binaries and still work with the same
64-bit kernel.
Also use proper __u32 types in uapi headers instead of unsigned ints.
Link: http://lkml.kernel.org/r/9e91020876029cfefc9211ff747685eba9536426.1575638983.git.andreyknvl@google.com Fixes: eec028c9386ed1a ("kcov: remote coverage support") Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Marco Elver <elver@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Felipe Balbi <balbi@kernel.org> Cc: Chunfeng Yun <chunfeng.yun@mediatek.com> Cc: "Jacky . Cao @ sony . com" <Jacky.Cao@sony.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chanho Min [Sat, 4 Jan 2020 20:59:36 +0000 (12:59 -0800)]
mm/zsmalloc.c: fix the migrated zspage statistics.
When zspage is migrated to the other zone, the zone page state should be
updated as well, otherwise the NR_ZSPAGE for each zone shows wrong
counts including proc/zoneinfo in practice.
Link: http://lkml.kernel.org/r/1575434841-48009-1-git-send-email-chanho.min@lge.com Fixes: 91537fee0013 ("mm: add NR_ZSMALLOC to vmstat") Signed-off-by: Chanho Min <chanho.min@lge.com> Signed-off-by: Jinsuk Choi <jjinsuk.choi@lge.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [4.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/memory_hotplug: shrink zones when offlining memory
We currently try to shrink a single zone when removing memory. We use
the zone of the first page of the memory we are removing. If that
memmap was never initialized (e.g., memory was never onlined), we will
read garbage and can trigger kernel BUGs (due to a stale pointer):
Instead, shrink the zones when offlining memory or when onlining failed.
Introduce and use remove_pfn_range_from_zone(() for that. We now
properly shrink the zones, even if we have DIMMs whereby
- Some memory blocks fall into no zone (never onlined)
- Some memory blocks fall into multiple zones (offlined+re-onlined)
- Multiple memory blocks that fall into different zones
Drop the zone parameter (with a potential dubious value) from
__remove_pages() and __remove_section().
Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319] Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 4 Jan 2020 18:49:15 +0000 (10:49 -0800)]
Merge tag 'dmaengine-fix-5.5-rc5' of git://git.infradead.org/users/vkoul/slave-dma
Pull dmaengine fixes from Vinod Koul:
"A bunch of fixes for:
- uninitialized dma_slave_caps access
- virt-dma use after free in vchan_complete()
- driver fixes for ioat, k3dma and jz4780"
* tag 'dmaengine-fix-5.5-rc5' of git://git.infradead.org/users/vkoul/slave-dma:
ioat: ioat_alloc_ring() failure handling.
dmaengine: virt-dma: Fix access after free in vchan_complete()
dmaengine: k3dma: Avoid null pointer traversal
dmaengine: dma-jz4780: Also break descriptor chains on JZ4725B
dmaengine: Fix access to uninitialized dma_slave_caps
Linus Torvalds [Sat, 4 Jan 2020 18:41:08 +0000 (10:41 -0800)]
Merge tag 'media/v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- some fixes at CEC core to comply with HDMI 2.0 specs and fix some
border cases
- a fix at the transmission logic of the pulse8-cec driver
- one alignment fix on a data struct at ipu3 when built with 32 bits
* tag 'media/v5.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: intel-ipu3: Align struct ipu3_uapi_awb_fr_config_s to 32 bytes
media: pulse8-cec: fix lost cec_transmit_attempt_done() call
media: cec: check 'transmit_in_progress', not 'transmitting'
media: cec: avoid decrementing transmit_queue_sz if it is 0
media: cec: CEC 2.0-only bcast messages were ignored
Linus Torvalds [Fri, 3 Jan 2020 20:20:21 +0000 (12:20 -0800)]
Merge tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes for btrfs:
- blkcg accounting problem with compression that could stall writes
- setting up blkcg bio for compression crashes due to NULL bdev
pointer
- fix possible infinite loop in writeback for nocow files (here
possible means almost impossible, 13 things that need to happen to
trigger it)"
* tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: fix infinite loop during nocow writeback due to race
btrfs: fix compressed write bio blkcg attribution
btrfs: punt all bios created in btrfs_submit_compressed_write()
This makes it hard to extract a useable coredump for global init
from a kernel crashdump because by the time we panic exit_mm() will
have already released global init's mm. We now panic slightly
earlier. This has been a problem in certain environments such as
Android.
- Fix a race in assigning and reading taskstats for thread-groups
with more than one thread.
This patch has been waiting for quite a while since people
disagreed on what the correct fix was at first"
* tag 'for-linus-2020-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
exit: panic before exit_mm() on global init exit
taskstats: fix data-race
Linus Torvalds [Fri, 3 Jan 2020 19:13:50 +0000 (11:13 -0800)]
Merge tag 'powerpc-5.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"Two more powerpc fixes for 5.5:
- One commit to fix a build error when CONFIG_JUMP_LABEL=n,
introduced by our recent fix to is_shared_processor().
- A commit marking some SLB related functions as notrace, as tracing
them triggers warnings.
Thanks to Jason A Donenfeld"
* tag 'powerpc-5.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/spinlocks: Include correct header for static key
powerpc/mm: Mark get_slice_psize() & slice_addr_is_low() as notrace
Linus Torvalds [Fri, 3 Jan 2020 19:10:31 +0000 (11:10 -0800)]
Merge tag 'sound-5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Nothing to worry at this stage but all nice small changes:
- A regression fix for AMD GPU detection in HD-audio
- A long-standing sleep-in-atomic fix for an ice1724 device
- Usual suspects, the device-specific quirks for HD- and USB-audio"
* tag 'sound-5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda/realtek - Enable the bass speaker of ASUS UX431FLC
ALSA: ice1724: Fix sleep-in-atomic in Infrasonic Quartet support code
ALSA: hda/realtek - Add Bass Speaker and fixed dac for bass speaker
ALSA: hda - Apply sync-write workaround to old Intel platforms, too
ALSA: hda/hdmi - fix atpx_present when CLASS is not VGA
ALSA: usb-audio: fix set_format altsetting sanity check
ALSA: hda/realtek - Add headset Mic no shutup for ALC283
ALSA: usb-audio: set the interface format after resume on Dell WD19
The problem is the check introduced to for_each_hstate() loop that
should skip default_hstate_idx. Since it doesn't update 'i' counter,
all subsequent huge page sizes are skipped as well.
Fixes: 8fc312b32b25 ("mm/hugetlbfs: fix error handling when setting up mounts") Signed-off-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ard Biesheuvel [Mon, 30 Dec 2019 14:07:47 +0000 (15:07 +0100)]
kbuild/deb-pkg: annotate libelf-dev dependency as :native
Cross compiling the x86 kernel on a non-x86 build machine produces
the following error when CONFIG_UNWINDER_ORC is enabled, regardless
of whether libelf-dev is installed or not.
dpkg-checkbuilddeps: error: Unmet build dependencies: libelf-dev
dpkg-buildpackage: warning: build dependencies/conflicts unsatisfied; aborting
dpkg-buildpackage: warning: (Use -d flag to override.)
Since this is a build time dependency for a build tool, we need to
depend on the native version of libelf-dev so add the appropriate
annotation.
Prior to commit 858805b336be ("kbuild: add $(BASH) to run scripts with
bash-extension"), this shell script was almost always run by bash since
bash is usually installed on the system by default.
Now, this script is run by sh, which might be a symlink to dash. On such
distributions, the following code emits an error:
local dev=`LC_ALL=C ls -l "${location}"`
You can reproduce the build error, for example by setting
CONFIG_INITRAMFS_SOURCE="/dev".
GEN usr/initramfs_data.cpio.gz
./usr/gen_initramfs_list.sh: 131: local: 1: bad variable name
make[1]: *** [usr/Makefile:61: usr/initramfs_data.cpio.gz] Error 2
This is because `LC_ALL=C ls -l "${location}"` contains spaces.
Surrounding it with double-quotes fixes the error.
Fixes: 858805b336be ("kbuild: add $(BASH) to run scripts with bash-extension") Reported-by: Jory A. Pratt <anarchy@gentoo.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Sakari Ailus [Wed, 6 Nov 2019 11:57:07 +0000 (12:57 +0100)]
media: intel-ipu3: Align struct ipu3_uapi_awb_fr_config_s to 32 bytes
A struct that needs to be aligned to 32 bytes has a size of 28. Increase
the size to 32.
This makes elements of arrays of this struct aligned to 32 as well, and
other structs where members are aligned to 32 mixing
ipu3_uapi_awb_fr_config_s as well as other types.
Fixes: commit dca5ef2aa1e6 ("media: staging/intel-ipu3: remove the unnecessary compiler flags") Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Tested-by: Bingbu Cao <bingbu.cao@intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Yunfeng Ye [Tue, 17 Dec 2019 12:22:57 +0000 (20:22 +0800)]
agp: remove unused variable arqsz in agp_3_5_enable()
This patch fix the following warning:
drivers/char/agp/isoch.c: In function ‘agp_3_5_enable’:
drivers/char/agp/isoch.c:322:13: warning: variable ‘arqsz’ set but not
used [-Wunused-but-set-variable]
u32 isoch, arqsz;
^~~~~
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Yunfeng Ye [Tue, 17 Dec 2019 12:21:37 +0000 (20:21 +0800)]
agp: remove unused variable mcapndx
This patch fix the following warning:
drivers/char/agp/isoch.c: In function ‘agp_3_5_isochronous_node_enable’:
drivers/char/agp/isoch.c:87:5: warning: variable ‘mcapndx’ set but not
used [-Wunused-but-set-variable]
u8 mcapndx;
^~~~~~~
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Linus Torvalds [Fri, 3 Jan 2020 00:46:30 +0000 (16:46 -0800)]
Merge tag 'gcc-plugins-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc-plugins fix from Kees Cook:
"Build flexibility fix: allow builds to disable plugins even when
plugins available (Arnd Bergmann)"
* tag 'gcc-plugins-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
gcc-plugins: make it possible to disable CONFIG_GCC_PLUGINS again
Linus Torvalds [Fri, 3 Jan 2020 00:42:10 +0000 (16:42 -0800)]
Merge tag 'seccomp-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull seccomp fixes from Kees Cook:
"Fixes for seccomp_notify_ioctl uapi sanity from Sargun Dhillon.
The bulk of this is fixing the surrounding samples and selftests so
that seccomp can correctly validate the seccomp_notify_ioctl buffer as
being initially zeroed.
Summary:
- Fix samples and selftests to zero passed-in buffer
- Enforce zeroed buffer checking
- Verify buffer sanity check in selftest"
* tag 'seccomp-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
selftests/seccomp: Catch garbage on SECCOMP_IOCTL_NOTIF_RECV
seccomp: Check that seccomp_notif is zeroed out by the user
selftests/seccomp: Zero out seccomp_notif
samples/seccomp: Zero out members based on seccomp_notif_sizes
Paul Burton [Thu, 2 Jan 2020 04:50:38 +0000 (20:50 -0800)]
MIPS: Avoid VDSO ABI breakage due to global register variable
Declaring __current_thread_info as a global register variable has the
effect of preventing GCC from saving & restoring its value in cases
where the ABI would typically do so.
To quote GCC documentation:
> If the register is a call-saved register, call ABI is affected: the
> register will not be restored in function epilogue sequences after the
> variable has been assigned. Therefore, functions cannot safely return
> to callers that assume standard ABI.
When our position independent VDSO is built for the n32 or n64 ABIs all
functions it exposes should be preserving the value of $gp/$28 for their
caller, but in the presence of the __current_thread_info global register
variable GCC stops doing so & simply clobbers $gp/$28 when calculating
the address of the GOT.
In cases where the VDSO returns success this problem will typically be
masked by the caller in libc returning & restoring $gp/$28 itself, but
that is by no means guaranteed. In cases where the VDSO returns an error
libc will typically contain a fallback path which will now fail
(typically with a bad memory access) if it attempts anything which
relies upon the value of $gp/$28 - eg. accessing anything via the GOT.
One fix for this would be to move the declaration of
__current_thread_info inside the current_thread_info() function,
demoting it from global register variable to local register variable &
avoiding inadvertently creating a non-standard calling ABI for the VDSO.
Unfortunately this causes issues for clang, which doesn't support local
register variables as pointed out by commit fe92da0f355e ("MIPS: Changed
current_thread_info() to an equivalent supported by both clang and GCC")
which introduced the global register variable before we had a VDSO to
worry about.
Instead, fix this by continuing to use the global register variable for
the kernel proper but declare __current_thread_info as a simple extern
variable when building the VDSO. It should never be referenced, and will
cause a link error if it is. This resolves the calling convention issue
for the VDSO without having any impact upon the build of the kernel
itself for either clang or gcc.
Signed-off-by: Paul Burton <paulburton@kernel.org> Fixes: ebb5e78cc634 ("MIPS: Initial implementation of a VDSO") Reported-by: Jason A. Donenfeld <Jason@zx2c4.com> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Tested-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christian Brauner <christian.brauner@canonical.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: <stable@vger.kernel.org> # v4.4+ Cc: linux-mips@vger.kernel.org Cc: linux-kernel@vger.kernel.org
Linus Torvalds [Fri, 3 Jan 2020 00:39:51 +0000 (16:39 -0800)]
Merge tag 'pstore-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore bug fixes from Kees Cook:
- always reset circular buffer state when writing new dump (Aleksandr
Yashkin)
- fix rare error-path memory leak (Kees Cook)
* tag 'pstore-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
pstore/ram: Write new dumps to start of recycled zones
pstore/ram: Fix error-path memory leak in persistent_ram_new() callers
This reverts commit 8243186f0cc7 ("fs: remove ksys_dup()") and the
subsequent fix for it in commit 2d3145f8d280 ("early init: fix error
handling when opening /dev/console").
Trying to use filp_open() and f_dupfd() instead of pseudo-syscalls
caused more trouble than what is worth it: it requires accessing vfs
internals and it turns out there were other bugs in it too.
In particular, the file reference counting was wrong - because unlike
the original "open+2*dup" sequence it used "filp_open+3*f_dupfd" and
thus had an extra leaked file reference.
That in turn then caused odd problems with Androidx86 long after boot
becaue of how the extra reference to the console kept the session active
even after all file descriptors had been closed.
Arnd Bergmann [Wed, 11 Dec 2019 13:39:28 +0000 (14:39 +0100)]
gcc-plugins: make it possible to disable CONFIG_GCC_PLUGINS again
I noticed that randconfig builds with gcc no longer produce a lot of
ccache hits, unlike with clang, and traced this back to plugins
now being enabled unconditionally if they are supported.
I am now working around this by adding
export CCACHE_COMPILERCHECK=/usr/bin/size -A %compiler%
to my top-level Makefile. This changes the heuristic that ccache uses
to determine whether the plugins are the same after a 'make clean'.
However, it also seems that being able to just turn off the plugins is
generally useful, at least for build testing it adds noticeable overhead
but does not find a lot of bugs additional bugs, and may be easier for
ccache users than my workaround.
Sargun Dhillon [Mon, 30 Dec 2019 20:38:11 +0000 (12:38 -0800)]
selftests/seccomp: Catch garbage on SECCOMP_IOCTL_NOTIF_RECV
This adds logic to the user_notification_basic test to set a member
of struct seccomp_notif to an invalid value to ensure that the kernel
returns EINVAL if any of the struct seccomp_notif members are set to
invalid values.
Signed-off-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20191230203811.4996-1-sargun@sargun.me Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
Sargun Dhillon [Sun, 29 Dec 2019 06:24:50 +0000 (22:24 -0800)]
seccomp: Check that seccomp_notif is zeroed out by the user
This patch is a small change in enforcement of the uapi for
SECCOMP_IOCTL_NOTIF_RECV ioctl. Specifically, the datastructure which
is passed (seccomp_notif) must be zeroed out. Previously any of its
members could be set to nonsense values, and we would ignore it.
This ensures all fields are set to their zero value.
Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com> Acked-by: Tycho Andersen <tycho@tycho.ws> Link: https://lore.kernel.org/r/20191229062451.9467-2-sargun@sargun.me Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
Sargun Dhillon [Sun, 29 Dec 2019 06:24:49 +0000 (22:24 -0800)]
selftests/seccomp: Zero out seccomp_notif
The seccomp_notif structure should be zeroed out prior to calling the
SECCOMP_IOCTL_NOTIF_RECV ioctl. Previously, the kernel did not check
whether these structures were zeroed out or not, so these worked.
This patch zeroes out the seccomp_notif data structure prior to calling
the ioctl.
Signed-off-by: Sargun Dhillon <sargun@sargun.me> Reviewed-by: Tycho Andersen <tycho@tycho.ws> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20191229062451.9467-1-sargun@sargun.me Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
Sargun Dhillon [Mon, 30 Dec 2019 20:35:03 +0000 (12:35 -0800)]
samples/seccomp: Zero out members based on seccomp_notif_sizes
The sizes by which seccomp_notif and seccomp_notif_resp are allocated are
based on the SECCOMP_GET_NOTIF_SIZES ioctl. This allows for graceful
extension of these datastructures. If userspace zeroes out the
datastructure based on its version, and it is lagging behind the kernel's
version, it will end up sending trailing garbage. On the other hand,
if it is ahead of the kernel version, it will write extra zero space,
and potentially cause corruption.
Signed-off-by: Sargun Dhillon <sargun@sargun.me> Suggested-by: Tycho Andersen <tycho@tycho.ws> Link: https://lore.kernel.org/r/20191230203503.4925-1-sargun@sargun.me Fixes: fec7b6690541 ("samples: add an example of seccomp user trap") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
pstore/ram: Write new dumps to start of recycled zones
The ram_core.c routines treat przs as circular buffers. When writing a
new crash dump, the old buffer needs to be cleared so that the new dump
doesn't end up in the wrong place (i.e. at the end).
The solution to this problem is to reset the circular buffer state before
writing a new Oops dump.
Signed-off-by: Aleksandr Yashkin <a.yashkin@inango-systems.com> Signed-off-by: Nikolay Merinov <n.merinov@inango-systems.com> Signed-off-by: Ariel Gilman <a.gilman@inango-systems.com> Link: https://lore.kernel.org/r/20191223133816.28155-1-n.merinov@inango-systems.com Fixes: 896fc1f0c4c6 ("pstore/ram: Switch to persistent_ram routines") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
John Johansen [Wed, 18 Dec 2019 19:04:07 +0000 (11:04 -0800)]
apparmor: only get a label reference if the fast path check fails
The common fast path check can be done under rcu_read_lock() and
doesn't need a reference count on the label. Only take a reference
count if entering the slow path.
Fixes reported hackbench regression
- sha1 79e178a57dae ("Merge tag 'apparmor-pr-2019-12-03' of
git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor")
hackbench -l (256000/#grp) -g #grp
128 groups 19.679 ±0.90%
- previous sha1 01d1dff64662 ("Merge tag 's390-5.5-2' of
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux")
hackbench -l (256000/#grp) -g #grp
128 groups 3.1689 ±3.04%
Reported-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Fixes: bce4e7e9c45e ("apparmor: reduce rcu_read_lock scope for aa_file_perm mediation") Signed-off-by: John Johansen <john.johansen@canonical.com>