ceph-client.git
4 years agoceph: fix end offset in truncate_inode_pages_range call ceph-for-5.3-rc1
Luis Henriques [Mon, 1 Jul 2019 17:16:34 +0000 (18:16 +0100)]
ceph: fix end offset in truncate_inode_pages_range call

Commit e450f4d1a5d6 ("ceph: pass inclusive lend parameter to
filemap_write_and_wait_range()") fixed the end offset parameter used to
call filemap_write_and_wait_range and invalidate_inode_pages2_range.
Unfortunately it missed truncate_inode_pages_range, introducing a
regression that is easily detected by xfstest generic/130.

The problem is that when doing direct IO it is possible that an extra page
is truncated from the page cache when the end offset is page aligned.
This can cause data loss if that page hasn't been sync'ed to the OSDs.

While there, change code to use PAGE_ALIGN macro instead.

Cc: stable@vger.kernel.org
Fixes: e450f4d1a5d6 ("ceph: pass inclusive lend parameter to filemap_write_and_wait_range()")
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: use generic_delete_inode() for ->drop_inode
Luis Henriques [Fri, 5 Jul 2019 16:14:56 +0000 (17:14 +0100)]
ceph: use generic_delete_inode() for ->drop_inode

ceph_drop_inode() implementation is not any different from the generic
function, thus there's no point in keeping it around.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: use ceph_evict_inode to cleanup inode's resource
Yan, Zheng [Sun, 2 Jun 2019 01:45:38 +0000 (09:45 +0800)]
ceph: use ceph_evict_inode to cleanup inode's resource

remove_session_caps() relies on __wait_on_freeing_inode(), to wait for
freeing inode to remove its caps. But VFS wakes freeing inode waiters
before calling destroy_inode().

Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/40102
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: initialize superblock s_time_gran to 1
Luis Henriques [Thu, 27 Jun 2019 13:51:22 +0000 (14:51 +0100)]
ceph: initialize superblock s_time_gran to 1

Having granularity set to 1us results in having inode timestamps with a
accurancy different from the fuse client (i.e. atime, ctime and mtime will
always end with '000').  This patch normalizes this behaviour and sets the
granularity to 1.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Sage Weil <sage@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoMAINTAINERS: take over for Zheng as CephFS kernel client maintainer
Jeff Layton [Wed, 26 Jun 2019 10:56:17 +0000 (06:56 -0400)]
MAINTAINERS: take over for Zheng as CephFS kernel client maintainer

Zheng wants to be able to spend more time working on the MDS, so I've
volunteered to take over for him as the CephFS kernel client maintainer.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Acked-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agorbd: setallochint only if object doesn't exist
Ilya Dryomov [Wed, 19 Jun 2019 13:45:27 +0000 (15:45 +0200)]
rbd: setallochint only if object doesn't exist

setallochint is really only useful on object creation.  Continue
hinting unconditionally if object map cannot be used.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: support for object-map and fast-diff
Ilya Dryomov [Wed, 5 Jun 2019 17:25:11 +0000 (19:25 +0200)]
rbd: support for object-map and fast-diff

Speed up reads, discards and zeroouts through RBD_OBJ_FLAG_MAY_EXIST
and RBD_OBJ_FLAG_NOOP_FOR_NONEXISTENT based on object map.

Invalid object maps are not trusted, but still updated.  Note that we
never iterate, resize or invalidate object maps.  If object-map feature
is enabled but object map fails to load, we just fail the requester
(either "rbd map" or I/O, by way of post-acquire action).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agorbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()
Ilya Dryomov [Mon, 17 Jun 2019 13:29:49 +0000 (15:29 +0200)]
rbd: call rbd_dev_mapping_set() from rbd_dev_image_probe()

Snapshot object map will be loaded in rbd_dev_image_probe(), so we need
to know snapshot's size (as opposed to HEAD's size) sooner.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agolibceph: export osd_req_op_data() macro
Ilya Dryomov [Fri, 14 Jun 2019 16:00:19 +0000 (18:00 +0200)]
libceph: export osd_req_op_data() macro

We already have one exported wrapper around it for extent.osd_data and
rbd_object_map_update_finish() needs another one for cls.request_data.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
4 years agolibceph: change ceph_osdc_call() to take page vector for response
Ilya Dryomov [Fri, 14 Jun 2019 16:16:51 +0000 (18:16 +0200)]
libceph: change ceph_osdc_call() to take page vector for response

This will be used for loading object map.  rbd_obj_read_sync() isn't
suitable because object map must be accessed through class methods.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
4 years agolibceph: bump CEPH_MSG_MAX_DATA_LEN (again)
Ilya Dryomov [Mon, 8 Apr 2019 12:16:05 +0000 (14:16 +0200)]
libceph: bump CEPH_MSG_MAX_DATA_LEN (again)

This time for rbd object map.  Object maps are limited in size to
256000000 objects, two bits per object.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: new exclusive lock wait/wake code
Ilya Dryomov [Thu, 6 Jun 2019 15:14:49 +0000 (17:14 +0200)]
rbd: new exclusive lock wait/wake code

rbd_wait_state_locked() is built around rbd_dev->lock_waitq and blocks
rbd worker threads while waiting for the lock, potentially impacting
other rbd devices.  There is no good way to pass an error code into
image request state machines when acquisition fails, hence the use of
RBD_DEV_FLAG_BLACKLISTED for everything and various other issues.

Introduce rbd_dev->acquiring_list and move acquisition into image
request state machine.  Use rbd_img_schedule() for kicking and passing
error codes.  No blocking occurs while waiting for the lock, but
rbd_dev->lock_rwsem is still held across lock, unlock and set_cookie
calls.

Always acquire the lock on "rbd map" to avoid associating the latency
of acquiring the lock with the first I/O request.

A slight regression is that lock_timeout is now respected only if lock
acquisition is triggered by "rbd map" and not by I/O.  This is somewhat
compensated by the fact that we no longer block if the peer refuses to
release lock -- I/O is failed with EROFS right away.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: quiescing lock should wait for image requests
Ilya Dryomov [Thu, 30 May 2019 14:07:48 +0000 (16:07 +0200)]
rbd: quiescing lock should wait for image requests

Syncing OSD requests doesn't really work.  A single image request may
be comprised of multiple object requests, each of which can go through
a series of OSD requests (original, copyups, etc).  On top of that, the
OSD cliest may be shared with other rbd devices.

What we want is to ensure that all in-flight image requests complete.
Introduce rbd_dev->running_list and block in RBD_LOCK_STATE_RELEASING
until that happens.  New OSD requests may be started during this time.

Note that __rbd_img_handle_request() acquires rbd_dev->lock_rwsem only
if need_exclusive_lock() returns true.  This avoids a deadlock similar
to the one outlined in the previous commit between unlock and I/O that
doesn't require lock, such as a read with object-map feature disabled.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: lock should be quiesced on reacquire
Ilya Dryomov [Thu, 30 May 2019 09:15:23 +0000 (11:15 +0200)]
rbd: lock should be quiesced on reacquire

Quiesce exclusive lock at the top of rbd_reacquire_lock() instead
of only when ceph_cls_set_cookie() fails.  This avoids a deadlock on
rbd_dev->lock_rwsem.

If rbd_dev->lock_rwsem is needed for I/O completion, set_cookie can
hang ceph-msgr worker thread if set_cookie reply ends up behind an I/O
reply, because, like lock and unlock requests, set_cookie is sent and
waited upon with rbd_dev->lock_rwsem held for write.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: introduce copyup state machine
Ilya Dryomov [Thu, 13 Jun 2019 15:44:08 +0000 (17:44 +0200)]
rbd: introduce copyup state machine

Both write and copyup paths will get more complex with object map.
Factor copyup code out into a separate state machine.

While at it, take advantage of obj_req->osd_reqs list and issue empty
and current snapc OSD requests together, one after another.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: rename rbd_obj_setup_*() to rbd_obj_init_*()
Ilya Dryomov [Fri, 31 May 2019 13:11:26 +0000 (15:11 +0200)]
rbd: rename rbd_obj_setup_*() to rbd_obj_init_*()

These functions don't allocate and set up OSD requests anymore.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: move OSD request allocation into object request state machines
Ilya Dryomov [Wed, 12 Jun 2019 16:33:31 +0000 (18:33 +0200)]
rbd: move OSD request allocation into object request state machines

Following submission, move initial OSD request allocation into object
request state machines.  Everything that has to do with OSD requests is
now handled inside the state machine, all __rbd_img_fill_request() has
left is initialization.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: factor out __rbd_osd_setup_discard_ops()
Ilya Dryomov [Wed, 29 May 2019 15:31:37 +0000 (17:31 +0200)]
rbd: factor out __rbd_osd_setup_discard_ops()

With obj_req->xferred removed, obj_req->ex.oe_off and obj_req->ex.oe_len
can be updated if required for alignment.  Previously the new offset and
length weren't stored anywhere beyond rbd_obj_setup_discard().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: factor out rbd_osd_setup_copyup()
Ilya Dryomov [Wed, 29 May 2019 14:53:14 +0000 (16:53 +0200)]
rbd: factor out rbd_osd_setup_copyup()

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: introduce obj_req->osd_reqs list
Ilya Dryomov [Mon, 27 May 2019 09:41:36 +0000 (11:41 +0200)]
rbd: introduce obj_req->osd_reqs list

Since the dawn of time it had been assumed that a single object request
spawns a single OSD request.  This is already impacting copyup: instead
of sending empty and current snapc copyups together, we wait for empty
snapc OSD request to complete in order to reassign obj_req->osd_req
with current snapc OSD request.  Looking further, updating potentially
hundreds of snapshot object maps serially is a non-starter.

Replace obj_req->osd_req pointer with obj_req->osd_reqs list.  Use
osd_req->r_private_item for linkage.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agolibceph: rename r_unsafe_item to r_private_item
Ilya Dryomov [Mon, 8 Jul 2019 10:50:09 +0000 (12:50 +0200)]
libceph: rename r_unsafe_item to r_private_item

This list item remained from when we had safe and unsafe replies
(commit vs ack).  It has since become a private list item for use by
clients.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agorbd: introduce image request state machine
Ilya Dryomov [Thu, 16 May 2019 13:06:56 +0000 (15:06 +0200)]
rbd: introduce image request state machine

Make it possible to schedule image requests on a workqueue.  This fixes
parent chain recursion added in the previous commit and lays the ground
for exclusive lock wait/wake improvements.

The "wait for pending subrequests and report first nonzero result" code
is generalized to be used by object request state machine.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: move OSD request submission into object request state machines
Ilya Dryomov [Tue, 14 May 2019 19:06:07 +0000 (21:06 +0200)]
rbd: move OSD request submission into object request state machines

Start eliminating asymmetry where the initial OSD request is allocated
and submitted from outside the state machine, making error handling and
restarts harder than they could be.  This commit deals with submission,
a commit that deals with allocation will follow.

Note that this commit adds parent chain recursion on the submission
side:

  rbd_img_request_submit
    rbd_obj_handle_request
      __rbd_obj_handle_request
        rbd_obj_handle_read
          rbd_obj_handle_write_guard
            rbd_obj_read_from_parent
              rbd_img_request_submit

This will be fixed in the next commit.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: get rid of RBD_OBJ_WRITE_{FLAT,GUARD}
Ilya Dryomov [Tue, 14 May 2019 18:45:38 +0000 (20:45 +0200)]
rbd: get rid of RBD_OBJ_WRITE_{FLAT,GUARD}

In preparation for moving OSD request allocation and submission into
object request state machines, get rid of RBD_OBJ_WRITE_{FLAT,GUARD}.
We would need to start in a new state, whether the request is guarded
or not.  Unify them into RBD_OBJ_WRITE_OBJECT and pass guard info
through obj_req->flags.

While at it, make our ENOENT handling a little more precise: only hide
ENOENT when it is actually expected, that is on delete.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: replace obj_req->tried_parent with obj_req->read_state
Ilya Dryomov [Wed, 8 May 2019 11:35:57 +0000 (13:35 +0200)]
rbd: replace obj_req->tried_parent with obj_req->read_state

Make rbd_obj_handle_read() look like a state machine and get rid of
the necessity to patch result in rbd_obj_handle_request(), completing
the removal of obj_req->xferred and img_req->xferred.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agorbd: get rid of obj_req->xferred, obj_req->result and img_req->xferred
Ilya Dryomov [Sat, 11 May 2019 14:21:49 +0000 (16:21 +0200)]
rbd: get rid of obj_req->xferred, obj_req->result and img_req->xferred

obj_req->xferred and img_req->xferred don't bring any value.  The
former is used for short reads and has to be set to obj_req->ex.oe_len
after that and elsewhere.  The latter is just an aggregate.

Use result for short reads (>=0 - number of bytes read, <0 - error) and
pass it around explicitly.  No need to store it in obj_req.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
4 years agoceph: don't NULL terminate virtual xattrs
Jeff Layton [Mon, 24 Jun 2019 11:32:21 +0000 (07:32 -0400)]
ceph: don't NULL terminate virtual xattrs

The convention with xattrs is to not store the termination with string
data, given that it returns the length. This is how setfattr/getfattr
operate.

Most of ceph's virtual xattr routines use snprintf to plop the string
directly into the destination buffer, but snprintf always NULL
terminates the string. This means that if we send the kernel a buffer
that is the exact length needed to hold the string, it'll end up
truncated.

Add a ceph_fmt_xattr helper function to format the string into an
on-stack buffer that should always be large enough to hold the whole
thing and then memcpy the result into the destination buffer. If it does
turn out that the formatted string won't fit in the on-stack buffer,
then return -E2BIG and do a WARN_ONCE().

Change over most of the virtual xattr routines to use the new helper. A
couple of the xattrs are sourced from strings however, and it's
difficult to know how long they'll be. Just have those memcpy the result
in place after verifying the length.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: return -ERANGE if virtual xattr value didn't fit in buffer
Jeff Layton [Thu, 13 Jun 2019 19:17:00 +0000 (15:17 -0400)]
ceph: return -ERANGE if virtual xattr value didn't fit in buffer

The getxattr manpage states that we should return ERANGE if the
destination buffer size is too small to hold the value.
ceph_vxattrcb_layout does this internally, but we should be doing
this for all vxattrs.

Fix the only caller of getxattr_cb to check the returned size
against the buffer length and return -ERANGE if it doesn't fit.
Drop the same check in ceph_vxattrcb_layout and just rely on the
caller to handle it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: make getxattr_cb return ssize_t
Jeff Layton [Mon, 24 Jun 2019 11:39:18 +0000 (07:39 -0400)]
ceph: make getxattr_cb return ssize_t

The getxattr_cb functions return size_t, which is unsigned and then
cast that value to int and then ssize_t before returning it. While all
of this works, it relies on implicit casting rules for signed/unsigned
conversions.

Change getxattr_cb to return ssize_t to better conform with what the
caller actually wants. Also, remove some suspicious casts.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: more precise CEPH_CLIENT_CAPS_PENDING_CAPSNAP
Yan, Zheng [Thu, 20 Jun 2019 04:09:08 +0000 (12:09 +0800)]
ceph: more precise CEPH_CLIENT_CAPS_PENDING_CAPSNAP

Client uses this flag to tell mds if there is more cap snap need to
flush. It's mainly for the case that client needs to re-send cap/snap
flushes after mds failover, but CEPH_CAP_ANY_FILE_WR on corresponding
inodes are all released before mds failover.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: kick flushing and flush snaps before sending normal cap message
Yan, Zheng [Thu, 20 Jun 2019 08:09:37 +0000 (16:09 +0800)]
ceph: kick flushing and flush snaps before sending normal cap message

Otherwise client may send cap flush messages in wrong order.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: clear CEPH_I_KICK_FLUSH flag inside __kick_flushing_caps()
Yan, Zheng [Thu, 20 Jun 2019 08:00:31 +0000 (16:00 +0800)]
ceph: clear CEPH_I_KICK_FLUSH flag inside __kick_flushing_caps()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: increment change_attribute on local changes
Jeff Layton [Thu, 6 Jun 2019 12:57:27 +0000 (08:57 -0400)]
ceph: increment change_attribute on local changes

We don't set SB_I_VERSION on ceph since we need to manage it ourselves,
so we must increment it whenever we update the file times.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: handle change_attr in cap messages
Jeff Layton [Thu, 6 Jun 2019 12:06:40 +0000 (08:06 -0400)]
ceph: handle change_attr in cap messages

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: add change_attr field to ceph_inode_info
Jeff Layton [Thu, 6 Jun 2019 11:29:23 +0000 (07:29 -0400)]
ceph: add change_attr field to ceph_inode_info

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoiversion: add a routine to update a raw value with a larger one
Jeff Layton [Wed, 5 Jun 2019 21:24:22 +0000 (17:24 -0400)]
iversion: add a routine to update a raw value with a larger one

Under ceph, clients can be independently updating iversion themselves,
while working under comprehensive sets of caps on an inode. In that
situation we always want to prefer the largest value of a change
attribute. Add a new function that will update a raw value with a larger
one, but otherwise leave it alone.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: allow querying of STATX_BTIME in ceph_getattr
Jeff Layton [Wed, 5 Jun 2019 16:08:32 +0000 (12:08 -0400)]
ceph: allow querying of STATX_BTIME in ceph_getattr

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: turn on CEPH_FEATURE_MSG_ADDR2
Jeff Layton [Fri, 31 May 2019 16:24:22 +0000 (12:24 -0400)]
libceph: turn on CEPH_FEATURE_MSG_ADDR2

Now that the client can handle either address formatting, advertise to
the peer that we can support it.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: handle btime in cap messages
Jeff Layton [Wed, 29 May 2019 16:23:14 +0000 (12:23 -0400)]
ceph: handle btime in cap messages

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: add btime field to ceph_inode_info
Jeff Layton [Wed, 29 May 2019 15:19:42 +0000 (11:19 -0400)]
ceph: add btime field to ceph_inode_info

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: rename ceph_encode_addr to ceph_encode_banner_addr
Jeff Layton [Mon, 17 Jun 2019 13:24:31 +0000 (09:24 -0400)]
libceph: rename ceph_encode_addr to ceph_encode_banner_addr

...ditto for the decode function. We only use these functions to fix
up banner addresses now, so let's name them more appropriately.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: use TYPE_LEGACY for entity addrs instead of TYPE_NONE
Jeff Layton [Mon, 17 Jun 2019 10:57:25 +0000 (06:57 -0400)]
libceph: use TYPE_LEGACY for entity addrs instead of TYPE_NONE

Going forward, we'll have different address types so let's use
the addr2 TYPE_LEGACY for internal tracking rather than TYPE_NONE.

Also, make ceph_pr_addr print the address type value as well.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: fix decode_locker to use ceph_decode_entity_addr
Jeff Layton [Tue, 4 Jun 2019 19:17:32 +0000 (15:17 -0400)]
ceph: fix decode_locker to use ceph_decode_entity_addr

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: have MDS map decoding use entity_addr_t decoder
Jeff Layton [Tue, 4 Jun 2019 15:26:36 +0000 (11:26 -0400)]
ceph: have MDS map decoding use entity_addr_t decoder

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: correctly decode ADDR2 addresses in incremental OSD maps
Jeff Layton [Tue, 4 Jun 2019 19:10:44 +0000 (15:10 -0400)]
libceph: correctly decode ADDR2 addresses in incremental OSD maps

Given the new format, we have to decode the addresses twice. Once to
skip past the new_up_client field, and a second time to collect the
addresses.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: fix watch_item_t decoding to use ceph_decode_entity_addr
Jeff Layton [Tue, 4 Jun 2019 18:35:18 +0000 (14:35 -0400)]
libceph: fix watch_item_t decoding to use ceph_decode_entity_addr

While we're in there, let's also fix up the decoder to do proper
bounds checking.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: switch osdmap decoding to use ceph_decode_entity_addr
Jeff Layton [Mon, 3 Jun 2019 19:08:13 +0000 (15:08 -0400)]
libceph: switch osdmap decoding to use ceph_decode_entity_addr

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: ADDR2 support for monmap
Jeff Layton [Fri, 31 May 2019 19:32:28 +0000 (15:32 -0400)]
libceph: ADDR2 support for monmap

Switch the MonMap decoder to use the new decoding routine for
entity_addr_t's.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: add ceph_decode_entity_addr
Jeff Layton [Mon, 3 Jun 2019 18:45:16 +0000 (14:45 -0400)]
libceph: add ceph_decode_entity_addr

Add a function for decoding an entity_addr_t. Once
CEPH_FEATURE_MSG_ADDR2 is enabled, the server daemons will start
encoding entity_addr_t differently.

Add a new helper function that can handle either format.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: fix sa_family just after reading address
Jeff Layton [Tue, 4 Jun 2019 17:13:48 +0000 (13:13 -0400)]
libceph: fix sa_family just after reading address

It doesn't make sense to leave it undecoded until later.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: remove request from waiting list before unregister
Yan, Zheng [Fri, 14 Jun 2019 02:55:05 +0000 (10:55 +0800)]
ceph: remove request from waiting list before unregister

Link: https://tracker.ceph.com/issues/40339
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: don't blindly unregister session that is in opening state
Yan, Zheng [Mon, 10 Jun 2019 07:45:09 +0000 (15:45 +0800)]
ceph: don't blindly unregister session that is in opening state

handle_cap_export() may add placeholder caps to session that is in
opening state. These caps' session pointer become wild after session get
unregistered.

The fix is not to unregister session in opening state during mds failovers,
just let client to reconnect later when mds is recovered.

Link: https://tracker.ceph.com/issues/40190
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: fix infinite loop in get_quota_realm()
Yan, Zheng [Fri, 31 May 2019 12:19:18 +0000 (20:19 +0800)]
ceph: fix infinite loop in get_quota_realm()

get_quota_realm() enters infinite loop if quota inode has no caps.
This can happen after client gets evicted.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Luis Henriques <lhenriques@suse.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: add selinux support
Yan, Zheng [Sun, 26 May 2019 08:27:56 +0000 (16:27 +0800)]
ceph: add selinux support

When creating new file/directory, use security_dentry_init_security() to
prepare selinux context for the new inode, then send openc/mkdir request
to MDS, together with selinux xattr.

security_dentry_init_security() only supports single security module and
only selinux has dentry_init_security hook. So only selinux is supported
for now. We can add support for other security modules once kernel has a
generic version of dentry_init_security()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: rename struct ceph_acls_info to ceph_acl_sec_ctx
Yan, Zheng [Sun, 26 May 2019 07:35:39 +0000 (15:35 +0800)]
ceph: rename struct ceph_acls_info to ceph_acl_sec_ctx

Also rename ceph_release_acls_info() to ceph_release_acl_sec_ctx().
And move their definitions to different files. This is preparation
for security label support.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: fix debug print format in __set_xattr()
Yan, Zheng [Mon, 27 May 2019 08:15:41 +0000 (16:15 +0800)]
ceph: fix debug print format in __set_xattr()

name is not '\0' terminated.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: fix warning PTR_ERR_OR_ZERO can be used
Hariprasad Kelam [Sat, 25 May 2019 09:15:59 +0000 (14:45 +0530)]
ceph: fix warning PTR_ERR_OR_ZERO can be used

change1: fix below warning  reported by coccicheck

/fs/ceph/export.c:371:33-39: WARNING: PTR_ERR_OR_ZERO can be used

change2: typecasted PTR_ERR_OR_ZERO to long as dout expecting long

Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: hold i_ceph_lock when removing caps for freeing inode
Yan, Zheng [Thu, 23 May 2019 03:01:37 +0000 (11:01 +0800)]
ceph: hold i_ceph_lock when removing caps for freeing inode

ceph_d_revalidate(, LOOKUP_RCU) may call __ceph_caps_issued_mask()
on a freeing inode.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: ensure d_name/d_parent stability in ceph_mdsc_lease_send_msg()
Yan, Zheng [Thu, 23 May 2019 02:45:24 +0000 (10:45 +0800)]
ceph: ensure d_name/d_parent stability in ceph_mdsc_lease_send_msg()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: use READ_ONCE to access d_parent in RCU critical section
Yan, Zheng [Thu, 23 May 2019 02:22:55 +0000 (10:22 +0800)]
ceph: use READ_ONCE to access d_parent in RCU critical section

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: fix dir_lease_is_valid()
Yan, Zheng [Wed, 22 May 2019 09:26:27 +0000 (17:26 +0800)]
ceph: fix dir_lease_is_valid()

It should call __ceph_dentry_dir_lease_touch() under dentry->d_lock.
Besides, ceph_dentry(dentry) can be NULL when called by LOOKUP_RCU
d_revalidate()

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: close race between d_name_cmp() and update_dentry_lease()
Yan, Zheng [Wed, 22 May 2019 13:49:44 +0000 (21:49 +0800)]
ceph: close race between d_name_cmp() and update_dentry_lease()

d_name_cmp() and update_dentry_lease() lock and unlock dentry->d_lock
respectively. Dentry may get renamed between them. The fix is moving
the dentry name compare into update_dentry_lease().

This patch introduce two version of update_dentry_lease(). One version
is for the case that parent inode is locked. It does not need to check
parent/target inode and dentry name. Another version is for the case
that parent inode is not locked. It checks parent/target inode and
dentry name after locking dentry->d_lock.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: fix improper use of smp_mb__before_atomic()
Andrea Parri [Mon, 20 May 2019 17:23:58 +0000 (19:23 +0200)]
ceph: fix improper use of smp_mb__before_atomic()

This barrier only applies to the read-modify-write operations; in
particular, it does not apply to the atomic64_set() primitive.

Replace the barrier with an smp_mb().

Fixes: fdd4e15838e59 ("ceph: rework dcache readdir")
Reported-by: "Paul E. McKenney" <paulmck@linux.ibm.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: fix "ceph.dir.rctime" vxattr value
David Disseldorp [Wed, 15 May 2019 14:56:39 +0000 (16:56 +0200)]
ceph: fix "ceph.dir.rctime" vxattr value

The vxattr value incorrectly places a "09" prefix to the nanoseconds
field, instead of providing it as a zero-pad width specifier after '%'.

Fixes: 3489b42a72a4 ("ceph: fix three bugs, two in ceph_vxattrcb_file_layout()")
Link: https://tracker.ceph.com/issues/39943
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: remove unused vxattr length helpers
David Disseldorp [Thu, 18 Apr 2019 12:15:49 +0000 (14:15 +0200)]
ceph: remove unused vxattr length helpers

ceph_listxattr() now calculates the length of vxattrs dynamically, so
these helpers, which incorrectly ignore vxattr.exists_cb(), can be
removed.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: fix listxattr vxattr buffer length calculation
David Disseldorp [Thu, 18 Apr 2019 12:15:48 +0000 (14:15 +0200)]
ceph: fix listxattr vxattr buffer length calculation

ceph_listxattr() incorrectly returns a length based on the static
ceph_vxattrs_name_size() value, which only takes into account whether
vxattrs are hidden, ignoring vxattr.exists_cb().

When filling the xattr buffer ceph_listxattr() checks VXATTR_FLAG_HIDDEN
and vxattr.exists_cb(). If both are false, we return an incorrect
(oversize) length.

Fix this behaviour by always calculating the vxattrs length at runtime,
taking both vxattr.hidden and vxattr.exists_cb() into account.

This bug is only exposed with the new "ceph.snap.btime" vxattr, as all
other vxattrs with a non-null exists_cb also carry VXATTR_FLAG_HIDDEN.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: add ceph.snap.btime vxattr
David Disseldorp [Thu, 18 Apr 2019 12:15:47 +0000 (14:15 +0200)]
ceph: add ceph.snap.btime vxattr

The ceph.snap.btime virtual xattr provides the snapshot creation (birth)
time in $secs.$nsecs format.

Link: https://tracker.ceph.com/issues/38838
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: carry snapshot creation time with inodes
David Disseldorp [Thu, 18 Apr 2019 12:15:46 +0000 (14:15 +0200)]
ceph: carry snapshot creation time with inodes

MDS InodeStat v3 wire structures include a trailing snapshot creation
time member. Unmarshall this and retain it for a future vxattr.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: clean up ceph.dir.pin vxattr name sizeof()
David Disseldorp [Thu, 18 Apr 2019 12:15:45 +0000 (14:15 +0200)]
ceph: clean up ceph.dir.pin vxattr name sizeof()

.name_size should use the same string as .name.

Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoceph: silence a checker warning in mdsc_show()
Dan Carpenter [Thu, 9 May 2019 10:11:25 +0000 (13:11 +0300)]
ceph: silence a checker warning in mdsc_show()

The problem is that if ceph_mdsc_build_path() fails then we set "path"
to NULL and the "pathlen" variable is uninitialized.  Then we call
ceph_mdsc_free_path(path, pathlen) to clean up.  Since "path" is NULL,
the function is a no-op but Smatch and UBSan still complain that
"pathlen" is uninitialized.

This patch doesn't change run time, it just silence the warnings.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agolibceph: remove ceph_get_direct_page_vector()
Christoph Hellwig [Wed, 1 May 2019 20:40:32 +0000 (16:40 -0400)]
libceph: remove ceph_get_direct_page_vector()

This function is entirely unused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
4 years agoLinux 5.2
Linus Torvalds [Sun, 7 Jul 2019 22:41:56 +0000 (15:41 -0700)]
Linux 5.2

4 years agoMerge tag 'for-linus-20190706' of git://git.kernel.dk/linux-block
Linus Torvalds [Sat, 6 Jul 2019 18:48:39 +0000 (11:48 -0700)]
Merge tag 'for-linus-20190706' of git://git.kernel.dk/linux-block

Pull block fix from Jens Axboe:
 "Just a single fix for a patch from Greg KH, which reportedly break
  block debugfs locations for certain setups. Trivial enough that I
  think we should include it now, rather than wait and release 5.2 with
  it, since it's a regression in this series"

* tag 'for-linus-20190706' of git://git.kernel.dk/linux-block:
  blk-mq: fix up placement of debugfs directory of queue files

4 years agoMerge tag 'mips_fixes_5.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips...
Linus Torvalds [Sat, 6 Jul 2019 17:32:12 +0000 (10:32 -0700)]
Merge tag 'mips_fixes_5.2_2' of git://git./linux/kernel/git/mips/linux

Pull MIPS fixes from Paul Burton:
 "A few more MIPS fixes:

   - Fix a silly typo in virt_addr_valid which led to completely bogus
     behavior (that happened to stop tripping up hardened usercopy
     despite being broken).

   - Fix UART parity setup on AR933x systems.

   - A build fix for non-Linux build machines.

   - Have the 'all' make target build DTBs, primarily to fit in with the
     behavior of scripts/package/builddeb.

   - Handle an execution hazard in TLB exceptions that use KScratch
     registers, which could inadvertently clobber the $1 register on
     some generally higher-end out-of-order CPUs.

   - A MAINTAINERS update to fix the path to the NAND driver for Ingenic
     systems"

* tag 'mips_fixes_5.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MAINTAINERS: Correct path to moved files
  MIPS: Add missing EHB in mtc0 -> mfc0 sequence.
  MIPS: have "plain" make calls build dtbs for selected platforms
  MIPS: fix build on non-linux hosts
  MIPS: ath79: fix ar933x uart parity mode
  MIPS: Fix bounds check virt_addr_valid

4 years agoMerge tag 'dmaengine-fix-5.2' of git://git.infradead.org/users/vkoul/slave-dma
Linus Torvalds [Sat, 6 Jul 2019 17:06:37 +0000 (10:06 -0700)]
Merge tag 'dmaengine-fix-5.2' of git://git.infradead.org/users/vkoul/slave-dma

Pull dmaengine fixes from Vinod Koul:

 - bam_dma fix for completed descriptor count

 - fix for imx-sdma remove BD_INTR for channel0 and use-after-free on
   probe error path

 - endian bug fix in jz4780 IRQ handler

* tag 'dmaengine-fix-5.2' of git://git.infradead.org/users/vkoul/slave-dma:
  dmaengine: qcom: bam_dma: Fix completed descriptors count
  dmaengine: imx-sdma: remove BD_INTR for channel0
  dmaengine: imx-sdma: fix use-after-free on probe error path
  dmaengine: jz4780: Fix an endian bug in IRQ handler

4 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 6 Jul 2019 16:56:20 +0000 (09:56 -0700)]
Merge tag 'scsi-fixes' of git://git./linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "Two iscsi fixes.

  One for an oops in the client which can be triggered by the server
  authentication protocol and the other in the target code which causes
  data corruption"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: iscsi: set auth_protocol back to NULL if CHAP_A value is not supported
  scsi: target/iblock: Fix overrun in WRITE SAME emulation

4 years agoMerge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sat, 6 Jul 2019 16:53:08 +0000 (09:53 -0700)]
Merge branch 'fixes' of git://git./linux/kernel/git/viro/vfs

Pull vfs fixlet from Al Viro:
 "Fix bogus default y in Kconfig (VALIDATE_FS_PARSER)

  That thing should not be turned on by default, especially since it's
  not quiet in case it finds no problems. Geert has sent the obvious fix
  quite a few times, but it fell through the cracks"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: VALIDATE_FS_PARSER should default to n

4 years agoblk-mq: fix up placement of debugfs directory of queue files
Greg Kroah-Hartman [Sat, 6 Jul 2019 15:50:32 +0000 (17:50 +0200)]
blk-mq: fix up placement of debugfs directory of queue files

When the blk-mq debugfs file creation logic was "cleaned up" it was
cleaned up too much, causing the queue file to not be created in the
correct location.  Turns out the check for the directory being present
is needed as if that has not happened yet, the files should not be
created, and the function will be called later on in the initialization
code so that the files can be created in the correct location.

Fixes: 6cfc0081b046 ("blk-mq: no need to check return value of debugfs_create functions")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: linux-block@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoRevert "mm: page cache: store only head pages in i_pages"
Linus Torvalds [Sat, 6 Jul 2019 02:55:18 +0000 (19:55 -0700)]
Revert "mm: page cache: store only head pages in i_pages"

This reverts commit 5fd4ca2d84b249f0858ce28cf637cf25b61a398f.

Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
__delete_from_swap_cache() to trigger:

   page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
   anon
   flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
   raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
   raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
   page dumped because: VM_BUG_ON_PAGE(entry != page)
   page->mem_cgroup:ffff978433ace000
   ------------[ cut here ]------------
   kernel BUG at mm/swap_state.c:170!
   invalid opcode: 0000 [#1] SMP NOPTI
   CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
   Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
   RIP: 0010:__delete_from_swap_cache+0x20d/0x240
   Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
   RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
   RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
   RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
   RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
   R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
   R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
   FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
   Call Trace:
    delete_from_swap_cache+0x46/0xa0
    try_to_free_swap+0xbc/0x110
    swap_writepage+0x13/0x70
    pageout.isra.0+0x13c/0x350
    shrink_page_list+0xc14/0xdf0
    shrink_inactive_list+0x1e5/0x3c0
    shrink_node_memcg+0x202/0x760
    shrink_node+0xe0/0x470
    balance_pgdat+0x2d1/0x510
    kswapd+0x220/0x420
    kthread+0xfb/0x130
    ret_from_fork+0x22/0x40

and it's not immediately obvious why it happens.  It's too late in the
rc cycle to do anything but revert for now.

Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
4 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sat, 6 Jul 2019 02:13:24 +0000 (19:13 -0700)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "x86 bugfix patches and one compilation fix for ARM"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: arm64/sve: Fix vq_present() macro to yield a bool
  KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC
  KVM: nVMX: Change KVM_STATE_NESTED_EVMCS to signal vmcs12 is copied from eVMCS
  KVM: nVMX: Allow restore nested-state to enable eVMCS when vCPU in SMM
  KVM: x86: degrade WARN to pr_warn_ratelimited

4 years agoMerge tag 'mtd/fixes-for-5.2-final' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 6 Jul 2019 02:07:57 +0000 (19:07 -0700)]
Merge tag 'mtd/fixes-for-5.2-final' of git://git./linux/kernel/git/mtd/linux

Pull mtf fixes from Miquel Raynal:

 - Fix the memory organization structure of a Macronix SPI-NAND chip.

 - Fix a build dependency wrongly described.

 - Fix the sunxi NAND driver for A23/A33 SoCs by (a) reverting the
   faulty commit introducing broken DMA support and (b) applying another
   commit bringing working DMA support.

* tag 'mtd/fixes-for-5.2-final' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
  mtd: rawnand: sunxi: Add A23/A33 DMA support with extra MBUS configuration
  Revert "mtd: rawnand: sunxi: Add A23/A33 DMA support"
  mtd: rawnand: ingenic: Fix ingenic_ecc dependency
  mtd: spinand: Fix max_bad_eraseblocks_per_lun info in memorg

4 years agoMerge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sat, 6 Jul 2019 02:04:57 +0000 (19:04 -0700)]
Merge branch 'i2c/for-current' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixlet from Wolfram Sang:
 "I2C has a MAINTAINERS update which will be benfitial for developers,
  so let's add it right away"

* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: tegra: Add Dmitry as a reviewer

4 years agoMerge tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux
Linus Torvalds [Sat, 6 Jul 2019 02:00:37 +0000 (19:00 -0700)]
Merge tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Two more quick bugfixes for nfsd: fixing a regression causing mount
  failures on high-memory machines and fixing the DRC over RDMA"

* tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux:
  nfsd: Fix overflow causing non-working mounts on 1 TB machines
  svcrdma: Ignore source port when computing DRC hash

4 years agomtd: rawnand: sunxi: Add A23/A33 DMA support with extra MBUS configuration
Miquel Raynal [Mon, 8 Apr 2019 07:41:46 +0000 (09:41 +0200)]
mtd: rawnand: sunxi: Add A23/A33 DMA support with extra MBUS configuration

Allwinner NAND controllers can make use of DMA to enhance the I/O
throughput thanks to ECC pipelining. DMA handling with A23/A33 NAND IP
is a bit different than with the older SoCs, hence the introduction of
a new compatible to handle:
* the differences between register offsets,
* the burst length change from 4 to minimum 8,
* manage SRAM accesses through MBUS with extra configuration.

Fixes: c49836f05aa1 ("mtd: rawnand: sunxi: Add A23/A33 DMA support")
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
4 years agoRevert "mtd: rawnand: sunxi: Add A23/A33 DMA support"
Miquel Raynal [Fri, 5 Jul 2019 09:25:29 +0000 (11:25 +0200)]
Revert "mtd: rawnand: sunxi: Add A23/A33 DMA support"

This reverts commit c49836f05aa15282f7280e06ede3f6f8a6324833.

The commit is wrong and its approach actually does not work. Let's
revert it in order to add the feature with a clean patch.

Fixes: c49836f05aa1 ("mtd: rawnand: sunxi: Add A23/A33 DMA support")
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
4 years agoi2c: tegra: Add Dmitry as a reviewer
Dmitry Osipenko [Sun, 23 Jun 2019 17:46:55 +0000 (20:46 +0300)]
i2c: tegra: Add Dmitry as a reviewer

I'm contributing to Tegra's upstream development in general and happened
to review the Tegra's I2C patches for awhile because I'm actively using
upstream kernel on all of my Tegra-powered devices and initially some of
the submitted patches were getting my attention since they were causing
problems. Recently Wolfram Sang asked whether I'm interested in becoming
a reviewer for the driver and I don't mind at all.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
[wsa: ack was expressed by Thierry Reding in a mail thread]
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
4 years agofs: VALIDATE_FS_PARSER should default to n
Geert Uytterhoeven [Mon, 1 Apr 2019 11:53:57 +0000 (13:53 +0200)]
fs: VALIDATE_FS_PARSER should default to n

CONFIG_VALIDATE_FS_PARSER is a debugging tool to check that the parser
tables are vaguely sane.  It was set to default to 'Y' for the moment to
catch errors in upcoming fs conversion development.

Make sure it is not enabled by default in the final release of v5.1.

Fixes: 31d921c7fb969172 ("vfs: Add configuration parser helpers")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
4 years agoKVM: arm64/sve: Fix vq_present() macro to yield a bool
Zhang Lei [Wed, 3 Jul 2019 17:42:50 +0000 (18:42 +0100)]
KVM: arm64/sve: Fix vq_present() macro to yield a bool

The original implementation of vq_present() relied on aggressive
inlining in order for the compiler to know that the code is
correct, due to some const-casting issues.  This was causing sparse
and clang to complain, while GCC compiled cleanly.

Commit 0c529ff789bc addressed this problem, but since vq_present()
is no longer a function, there is now no implicit casting of the
returned value to the return type (bool).

In set_sve_vls(), this uncast bit value is compared against a bool,
and so may spuriously compare as unequal when both are nonzero.  As
a result, KVM may reject valid SVE vector length configurations as
invalid, and vice versa.

Fix it by forcing the returned value to a bool.

Signed-off-by: Zhang Lei <zhang.lei@jp.fujitsu.com>
Fixes: 0c529ff789bc ("KVM: arm64: Implement vq_present() as a macro")
Signed-off-by: Dave Martin <Dave.Martin@arm.com> [commit message rewrite]
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
4 years agodmaengine: qcom: bam_dma: Fix completed descriptors count
Sricharan R [Fri, 28 Jun 2019 12:09:46 +0000 (17:39 +0530)]
dmaengine: qcom: bam_dma: Fix completed descriptors count

One space is left unused in circular FIFO to differentiate
'full' and 'empty' cases. So take that in to account while
counting for the descriptors completed.

Fixes the issue reported here,
https://lkml.org/lkml/2019/6/18/669

Cc: stable@vger.kernel.org
Reported-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Signed-off-by: Sricharan R <sricharan@codeaurora.org>
Tested-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
4 years agodmaengine: imx-sdma: remove BD_INTR for channel0
Robin Gong [Fri, 21 Jun 2019 08:23:06 +0000 (16:23 +0800)]
dmaengine: imx-sdma: remove BD_INTR for channel0

It is possible for an irq triggered by channel0 to be received later
after clks are disabled once firmware loaded during sdma probe. If
that happens then clearing them by writing to SDMA_H_INTR won't work
and the kernel will hang processing infinite interrupts. Actually,
don't need interrupt triggered on channel0 since it's pollling
SDMA_H_STATSTOP to know channel0 done rather than interrupt in
current code, just clear BD_INTR to disable channel0 interrupt to
avoid the above case.
This issue was brought by commit 1d069bfa3c78 ("dmaengine: imx-sdma:
ack channel 0 IRQ in the interrupt handler") which didn't take care
the above case.

Fixes: 1d069bfa3c78 ("dmaengine: imx-sdma: ack channel 0 IRQ in the interrupt handler")
Cc: stable@vger.kernel.org #5.0+
Signed-off-by: Robin Gong <yibin.gong@nxp.com>
Reported-by: Sven Van Asbroeck <thesven73@gmail.com>
Tested-by: Sven Van Asbroeck <thesven73@gmail.com>
Reviewed-by: Michael Olbrich <m.olbrich@pengutronix.de>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
4 years agodmaengine: imx-sdma: fix use-after-free on probe error path
Sven Van Asbroeck [Mon, 24 Jun 2019 14:07:31 +0000 (10:07 -0400)]
dmaengine: imx-sdma: fix use-after-free on probe error path

If probe() fails anywhere beyond the point where
sdma_get_firmware() is called, then a kernel oops may occur.

Problematic sequence of events:
1. probe() calls sdma_get_firmware(), which schedules the
   firmware callback to run when firmware becomes available,
   using the sdma instance structure as the context
2. probe() encounters an error, which deallocates the
   sdma instance structure
3. firmware becomes available, firmware callback is
   called with deallocated sdma instance structure
4. use after free - kernel oops !

Solution: only attempt to load firmware when we're certain
that probe() will succeed. This guarantees that the firmware
callback's context will remain valid.

Note that the remove() path is unaffected by this issue: the
firmware loader will increment the driver module's use count,
ensuring that the module cannot be unloaded while the
firmware callback is pending or running.

Signed-off-by: Sven Van Asbroeck <TheSven73@gmail.com>
Reviewed-by: Robin Gong <yibin.gong@nxp.com>
[vkoul: fixed braces for if condition]
Signed-off-by: Vinod Koul <vkoul@kernel.org>
4 years agodmaengine: jz4780: Fix an endian bug in IRQ handler
Dan Carpenter [Mon, 24 Jun 2019 13:49:40 +0000 (16:49 +0300)]
dmaengine: jz4780: Fix an endian bug in IRQ handler

The "pending" variable was a u32 but we cast it to an unsigned long
pointer when we do the for_each_set_bit() loop.  The problem is that on
big endian 64bit systems that results in an out of bounds read.

Fixes: 4e4106f5e942 ("dmaengine: jz4780: Fix transfers being ACKed too soon")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
4 years agoMerge tag 'drm-fixes-2019-07-05-1' of git://anongit.freedesktop.org/drm/drm
Linus Torvalds [Fri, 5 Jul 2019 05:10:30 +0000 (14:10 +0900)]
Merge tag 'drm-fixes-2019-07-05-1' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
 "I skipped last week because there wasn't much worth doing, this week
  got a few more fixes in.

  amdgpu:
   - default register value change
   - runpm regression fix
   - fan control fix

  i915:
   - fix Ironlake regression

  panfrost:
   - fix a double free

  virtio:
   - fix a locking bug

  imx:
   - crtc disable fixes"

* tag 'drm-fixes-2019-07-05-1' of git://anongit.freedesktop.org/drm/drm:
  drm/imx: only send event on crtc disable if kept disabled
  drm/imx: notify drm core before sending event during crtc disable
  drm/i915/ringbuffer: EMIT_INVALIDATE *before* switch context
  drm/amdgpu/gfx9: use reset default for PA_SC_FIFO_SIZE
  drm/amdgpu: Don't skip display settings in hwmgr_resume()
  drm/amd/powerplay: use hardware fan control if no powerplay fan table
  drm/panfrost: Fix a double-free error
  drm/etnaviv: add missing failure path to destroy suballoc
  drm/virtio: move drm_connector_update_edid_property() call

4 years agoMerge tag 'imx-drm-fixes-2019-07-04' of git://git.pengutronix.de/git/pza/linux into...
Dave Airlie [Fri, 5 Jul 2019 02:54:48 +0000 (12:54 +1000)]
Merge tag 'imx-drm-fixes-2019-07-04' of git://git.pengutronix.de/git/pza/linux into drm-fixes

drm/imx: fix stale vblank timestamp after a modeset

This series fixes stale vblank timestamps in the first event sent after
a crtc was disabled. The core now is notified via drm_crtc_vblank_off
before sending the last pending event in atomic_disable. If the crtc is
reenabled right away during to a modeset, the event is not sent at all,
as the next vblank will take care of it.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Philipp Zabel <p.zabel@pengutronix.de>
Link: https://patchwork.freedesktop.org/patch/msgid/1562237119.6641.16.camel@pengutronix.de
4 years agoMerge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Linus Torvalds [Fri, 5 Jul 2019 04:31:19 +0000 (13:31 +0900)]
Merge branch 'linus' of git://git./linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:
 "This fixes two memory leaks and a list corruption bug"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: user - prevent operating on larval algorithms
  crypto: cryptd - Fix skcipher instance memory leak
  lib/mpi: Fix karactx leak in mpi_powm

4 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Fri, 5 Jul 2019 02:39:56 +0000 (11:39 +0900)]
Merge branch 'akpm' (patches from Andrew)

Merge more fixes from Andrew Morton:
 "5 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  swap_readpage(): avoid blk_wake_io_task() if !synchronous
  devres: allow const resource arguments
  mm/vmscan.c: prevent useless kswapd loops
  fs/userfaultfd.c: disable irqs for fault_pending and event locks
  mm/page_alloc.c: fix regression with deferred struct page init

4 years agoMerge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Linus Torvalds [Fri, 5 Jul 2019 02:35:45 +0000 (11:35 +0900)]
Merge tag 'armsoc-fixes' of git://git./linux/kernel/git/soc/soc

Pull ARM SoC fixes from Olof Johansson:
 "Likely our final small batch of fixes for 5.2:

   - Some fixes for USB on davinci, regressions were due to the recent
     conversion of the OCHI driver to use GPIO regulators

   - A fixup of kconfig dependencies for a TI irq controller

   - A switch of armada-38x to avoid dropped characters on uart, caused
     by switch of base inherited platform description earlier this year"

* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
  ARM: davinci: da830-evm: fix GPIO lookup for OHCI
  ARM: davinci: omapl138-hawk: add missing regulator constraints for OHCI
  ARM: davinci: da830-evm: add missing regulator constraints for OHCI
  soc: ti: fix irq-ti-sci link error
  ARM: dts: armada-xp-98dx3236: Switch to armada-38x-uart serial node

4 years agoMerge tag 'dax-fix-5.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm...
Linus Torvalds [Fri, 5 Jul 2019 02:32:11 +0000 (11:32 +0900)]
Merge tag 'dax-fix-5.2-rc8' of git://git./linux/kernel/git/nvdimm/nvdimm

Pull dax fix from Dan Williams:
 "A single dax fix that has been soaking awaiting other fixes under
  discussion to join it. As it is getting late in the cycle lets proceed
  with this fix and save follow-on changes for post-v5.3-rc1.

   - Fix xarray entry association for mixed mappings"

* tag 'dax-fix-5.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: Fix xarray entry association for mixed mappings

4 years agoMerge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Fri, 5 Jul 2019 02:21:36 +0000 (11:21 +0900)]
Merge branch 'fixes' of git://git./linux/kernel/git/viro/vfs

Pull do_move_mount() fix from Al Viro:
 "Regression fix"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: move_mount: reject moving kernel internal mounts

4 years agoswap_readpage(): avoid blk_wake_io_task() if !synchronous
Oleg Nesterov [Thu, 4 Jul 2019 22:14:49 +0000 (15:14 -0700)]
swap_readpage(): avoid blk_wake_io_task() if !synchronous

swap_readpage() sets waiter = bio->bi_private even if synchronous = F,
this means that the caller can get the spurious wakeup after return.

This can be fatal if blk_wake_io_task() does
set_current_state(TASK_RUNNING) after the caller does
set_special_state(), in the worst case the kernel can crash in
do_task_dead().

Link: http://lkml.kernel.org/r/20190704160301.GA5956@redhat.com
Fixes: 0619317ff8baa2d ("block: add polled wakeup task helper")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>