]> git.apps.os.sepia.ceph.com Git - ceph-client.git/log
ceph-client.git
2 years agoblock/rnbd: introduce rnbd_access_modes
Guoqing Jiang [Wed, 24 May 2023 07:00:21 +0000 (15:00 +0800)]
block/rnbd: introduce rnbd_access_modes

Add one new array (marked with __maybe_unused to prevent gcc warning about
"defined but not used" with W=1), then we can remove rnbd_access_mode_str
and rnbd-common.c accordingly.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20230524070026.2932-4-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock/rnbd-srv: remove unused header
Guoqing Jiang [Wed, 24 May 2023 07:00:20 +0000 (15:00 +0800)]
block/rnbd-srv: remove unused header

No need to include it since none of macros in limits.h are
used by rnbd-srv.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230524070026.2932-3-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock/rnbd: kill rnbd_flags_supported
Guoqing Jiang [Wed, 24 May 2023 07:00:19 +0000 (15:00 +0800)]
block/rnbd: kill rnbd_flags_supported

This routine is not called since added. Then the two flags
(RNBD_OP_LAST and RNBD_F_ALL) can be removed too after kill
rnbd_flags_supported.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20230524070026.2932-2-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: fix rootwait= again
Christoph Hellwig [Fri, 9 Jun 2023 05:17:37 +0000 (07:17 +0200)]
block: fix rootwait= again

The previous rootwait fix added an -EINVAL return to a completely
bogus superflous branch, fix this.

Fixes: 1341c7d2ccf4 ("block: fix rootwait=")
Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Fabio Estevam <festevam@gmail.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20230609051737.328930-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agopktcdvd: Sort headers
Andy Shevchenko [Fri, 10 Mar 2023 16:45:49 +0000 (18:45 +0200)]
pktcdvd: Sort headers

Sort the headers in alphabetic order in order to ease
the maintenance for this part.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-10-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agopktcdvd: Get rid of redundant 'else'
Andy Shevchenko [Fri, 10 Mar 2023 16:45:48 +0000 (18:45 +0200)]
pktcdvd: Get rid of redundant 'else'

In the snippets like the following

if (...)
return / goto / break / continue ...;
else
...

the 'else' is redundant. Get rid of it.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-9-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agopktcdvd: Use put_unaligned_be16() and get_unaligned_be16()
Andy Shevchenko [Fri, 10 Mar 2023 16:45:47 +0000 (18:45 +0200)]
pktcdvd: Use put_unaligned_be16() and get_unaligned_be16()

This makes the driver code slightly better to understand.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20230310164549.22133-8-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agopktcdvd: Use DEFINE_SHOW_ATTRIBUTE() to simplify code
Andy Shevchenko [Fri, 10 Mar 2023 16:45:46 +0000 (18:45 +0200)]
pktcdvd: Use DEFINE_SHOW_ATTRIBUTE() to simplify code

Use DEFINE_SHOW_ATTRIBUTE() helper macro to simplify the code.
No functional change.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-7-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agopktcdvd: Drop redundant castings for sector_t
Andy Shevchenko [Fri, 10 Mar 2023 16:45:45 +0000 (18:45 +0200)]
pktcdvd: Drop redundant castings for sector_t

Since the commit 72deb455b5ec ("block: remove CONFIG_LBDAF")
the sector_t is always 64-bit type, no need to cast anymore.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-6-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agopktcdvd: Get rid of pkt_seq_show() forward declaration
Andy Shevchenko [Fri, 10 Mar 2023 16:45:44 +0000 (18:45 +0200)]
pktcdvd: Get rid of pkt_seq_show() forward declaration

The code can be neater without forward declarations.
Get rid of pkt_seq_show() forward declaration. This
will also allow futher cleanups to be cleaner.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-5-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agopktcdvd: use sysfs_emit() to instead of scnprintf()
Andy Shevchenko [Fri, 10 Mar 2023 16:45:43 +0000 (18:45 +0200)]
pktcdvd: use sysfs_emit() to instead of scnprintf()

Follow the advice of the Documentation/filesystems/sysfs.rst and show()
should only use sysfs_emit() or sysfs_emit_at() when formatting the
value to be returned to user space.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agopktcdvd: replace sscanf() by kstrtoul()
Andy Shevchenko [Fri, 10 Mar 2023 16:45:42 +0000 (18:45 +0200)]
pktcdvd: replace sscanf() by kstrtoul()

The checkpatch.pl warns: "Prefer kstrto<type> to single variable sscanf".
Fix the code accordingly.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20230310164549.22133-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agopktcdvd: Get rid of custom printing macros
Andy Shevchenko [Fri, 10 Mar 2023 16:45:41 +0000 (18:45 +0200)]
pktcdvd: Get rid of custom printing macros

We may use traditional dev_*() macros instead of custom ones
provided by the driver.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230310164549.22133-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: fix rootwait=
Christoph Hellwig [Wed, 7 Jun 2023 13:57:46 +0000 (15:57 +0200)]
block: fix rootwait=

Failures to look up the gendisk must return -ENODEV so that rootwait
retries the lookup instead of -EINVAL which exits early.

Fixes: cf056a431215 ("init: improve the name_to_dev_t interface")
Reported-by: Fabio Estevam <festevam@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Fabio Estevam <festevam@gmail.com>
Link: https://lore.kernel.org/r/20230607135746.92995-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-cgroup: Reinit blkg_iostat_set after clearing in blkcg_reset_stats()
Waiman Long [Tue, 6 Jun 2023 18:07:24 +0000 (14:07 -0400)]
blk-cgroup: Reinit blkg_iostat_set after clearing in blkcg_reset_stats()

When blkg_alloc() is called to allocate a blkcg_gq structure
with the associated blkg_iostat_set's, there are 2 fields within
blkg_iostat_set that requires proper initialization - blkg & sync.
The former field was introduced by commit 3b8cc6298724 ("blk-cgroup:
Optimize blkcg_rstat_flush()") while the later one was introduced by
commit f73316482977 ("blk-cgroup: reimplement basic IO stats using
cgroup rstat").

Unfortunately those fields in the blkg_iostat_set's are not properly
re-initialized when they are cleared in v1's blkcg_reset_stats(). This
can lead to a kernel panic due to NULL pointer access of the blkg
pointer. The missing initialization of sync is less problematic and
can be a problem in a debug kernel due to missing lockdep initialization.

Fix these problems by re-initializing them after memory clearing.

Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Fixes: f73316482977 ("blk-cgroup: reimplement basic IO stats using cgroup rstat")
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230606180724.2455066-1-longman@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-ioc: fix recursive spin_lock/unlock_irq() in ioc_clear_queue()
Yu Kuai [Tue, 6 Jun 2023 01:14:38 +0000 (09:14 +0800)]
blk-ioc: fix recursive spin_lock/unlock_irq() in ioc_clear_queue()

Recursive spin_lock/unlock_irq() is not safe, because spin_unlock_irq()
will enable irq unconditionally:

spin_lock_irq queue_lock -> disable irq
spin_lock_irq ioc->lock
spin_unlock_irq ioc->lock -> enable irq
/*
 * AA dead lock will be triggered if current context is preempted by irq,
 * and irq try to hold queue_lock again.
 */
spin_unlock_irq queue_lock

Fix this problem by using spin_lock/unlock() directly for 'ioc->lock'.

Fixes: 5a0ac57c48aa ("blk-ioc: protect ioc_destroy_icq() by 'queue_lock'")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230606011438.3743440-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonbd: Add the maximum limit of allocated index in nbd_dev_add
Zhong Jinghua [Mon, 5 Jun 2023 12:21:59 +0000 (20:21 +0800)]
nbd: Add the maximum limit of allocated index in nbd_dev_add

If the index allocated by idr_alloc greater than MINORMASK >> part_shift,
the device number will overflow, resulting in failure to create a block
device.

Fix it by imiting the size of the max allocation.

Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230605122159.2134384-1-zhongjinghua@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-ioprio: Introduce promote-to-rt policy
Hou Tao [Fri, 28 Apr 2023 07:44:04 +0000 (15:44 +0800)]
blk-ioprio: Introduce promote-to-rt policy

Since commit a78418e6a04c ("block: Always initialize bio IO priority on
submit"), bio->bi_ioprio will never be IOPRIO_CLASS_NONE when calling
blkcg_set_ioprio(), so there will be no way to promote the io-priority
of one cgroup to IOPRIO_CLASS_RT, because bi_ioprio will always be
greater than or equals to IOPRIO_CLASS_RT.

It seems possible to call blkcg_set_ioprio() first then try to
initialize bi_ioprio later in bio_set_ioprio(), but this doesn't work
for bio in which bi_ioprio is already initialized (e.g., direct-io), so
introduce a new promote-to-rt policy to promote the iopriority of bio to
IOPRIO_CLASS_RT if the ioprio is not already RT.

For none-to-rt policy, although it doesn't work now, but considering
that its purpose was also to override the io-priority to RT and allowing
for a smoother transition, just keep it and treat it as an alias of
the promote-to-rt policy.

Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://lore.kernel.org/r/20230428074404.280532-1-houtao@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-iocost: use spin_lock_irqsave in adjust_inuse_and_calc_cost
Li Nan [Sat, 27 May 2023 09:19:04 +0000 (17:19 +0800)]
blk-iocost: use spin_lock_irqsave in adjust_inuse_and_calc_cost

adjust_inuse_and_calc_cost() use spin_lock_irq() and IRQ will be enabled
when unlock. DEADLOCK might happen if we have held other locks and disabled
IRQ before invoking it.

Fix it by using spin_lock_irqsave() instead, which can keep IRQ state
consistent with before when unlock.

  ================================
  WARNING: inconsistent lock state
  5.10.0-02758-g8e5f91fd772f #26 Not tainted
  --------------------------------
  inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
  kworker/2:3/388 [HC0[0]:SC0[0]:HE0:SE1] takes:
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
  ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390
  {IN-HARDIRQ-W} state was registered at:
    __lock_acquire+0x3d7/0x1070
    lock_acquire+0x197/0x4a0
    __raw_spin_lock_irqsave
    _raw_spin_lock_irqsave+0x3b/0x60
    bfq_idle_slice_timer_body
    bfq_idle_slice_timer+0x53/0x1d0
    __run_hrtimer+0x477/0xa70
    __hrtimer_run_queues+0x1c6/0x2d0
    hrtimer_interrupt+0x302/0x9e0
    local_apic_timer_interrupt
    __sysvec_apic_timer_interrupt+0xfd/0x420
    run_sysvec_on_irqstack_cond
    sysvec_apic_timer_interrupt+0x46/0xa0
    asm_sysvec_apic_timer_interrupt+0x12/0x20
  irq event stamp: 837522
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] __raw_spin_unlock_irqrestore
  hardirqs last  enabled at (837521): [<ffffffff84b9419d>] _raw_spin_unlock_irqrestore+0x3d/0x40
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] __raw_spin_lock_irq
  hardirqs last disabled at (837522): [<ffffffff84b93fa3>] _raw_spin_lock_irq+0x43/0x50
  softirqs last  enabled at (835852): [<ffffffff84e00558>] __do_softirq+0x558/0x8ec
  softirqs last disabled at (835845): [<ffffffff84c010ff>] asm_call_irq_on_stack+0xf/0x20

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(&bfqd->lock);
    <Interrupt>
      lock(&bfqd->lock);

   *** DEADLOCK ***

  3 locks held by kworker/2:3/388:
   #0: ffff888107af0f38 ((wq_completion)kthrotld){+.+.}-{0:0}, at: process_one_work+0x742/0x13f0
   #1: ffff8881176bfdd8 ((work_completion)(&td->dispatch_work)){+.+.}-{0:0}, at: process_one_work+0x777/0x13f0
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: spin_lock_irq
   #2: ffff888118c00c28 (&bfqd->lock){?.-.}-{2:2}, at: bfq_bio_merge+0x141/0x390

  stack backtrace:
  CPU: 2 PID: 388 Comm: kworker/2:3 Not tainted 5.10.0-02758-g8e5f91fd772f #26
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
  Workqueue: kthrotld blk_throtl_dispatch_work_fn
  Call Trace:
   __dump_stack lib/dump_stack.c:77 [inline]
   dump_stack+0x107/0x167
   print_usage_bug
   valid_state
   mark_lock_irq.cold+0x32/0x3a
   mark_lock+0x693/0xbc0
   mark_held_locks+0x9e/0xe0
   __trace_hardirqs_on_caller
   lockdep_hardirqs_on_prepare.part.0+0x151/0x360
   trace_hardirqs_on+0x5b/0x180
   __raw_spin_unlock_irq
   _raw_spin_unlock_irq+0x24/0x40
   spin_unlock_irq
   adjust_inuse_and_calc_cost+0x4fb/0x970
   ioc_rqos_merge+0x277/0x740
   __rq_qos_merge+0x62/0xb0
   rq_qos_merge
   bio_attempt_back_merge+0x12c/0x4a0
   blk_mq_sched_try_merge+0x1b6/0x4d0
   bfq_bio_merge+0x24a/0x390
   __blk_mq_sched_bio_merge+0xa6/0x460
   blk_mq_sched_bio_merge
   blk_mq_submit_bio+0x2e7/0x1ee0
   __submit_bio_noacct_mq+0x175/0x3b0
   submit_bio_noacct+0x1fb/0x270
   blk_throtl_dispatch_work_fn+0x1ef/0x2b0
   process_one_work+0x83e/0x13f0
   process_scheduled_works
   worker_thread+0x7e3/0xd80
   kthread+0x353/0x470
   ret_from_fork+0x1f/0x30

Fixes: b0853ab4a238 ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Li Nan <linan122@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230527091904.3001833-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: mark early_lookup_bdev as __init
Christoph Hellwig [Wed, 31 May 2023 12:55:35 +0000 (14:55 +0200)]
block: mark early_lookup_bdev as __init

early_lookup_bdev is now only used during the early boot code as it
should, so mark it __init to not waste run time memory on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-25-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomtd: block2mtd: don't call early_lookup_bdev after the system is running
Christoph Hellwig [Wed, 31 May 2023 12:55:34 +0000 (14:55 +0200)]
mtd: block2mtd: don't call early_lookup_bdev after the system is running

early_lookup_bdev is supposed to only be called from the early boot
code, but mdtblock_early_get_bdev is called as a general fallback when
lookup_bdev fails, which is problematic because early_lookup_bdev
bypasses all normal path based permission checking, and might cause
problems with certain container environments renaming devices.

Switch to only call early_lookup_bdev when block2mtd is built-in and the
system state in not running yet.

Note that this strictly speaking changes the kernel ABI as the PARTUUID=
and PARTLABEL= style syntax is now not available during a running
systems.  They never were intended for that, but this breaks things
we'll have to figure out a way to make them available again.  But if
avoidable in any way I'd rather avoid that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/r/20230531125535.676098-24-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomtd: block2mtd: factor the early block device open logic into a helper
Christoph Hellwig [Wed, 31 May 2023 12:55:33 +0000 (14:55 +0200)]
mtd: block2mtd: factor the early block device open logic into a helper

Simplify add_device a bit by splitting out the cumbersome early boot logic
into a separate helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Link: https://lore.kernel.org/r/20230531125535.676098-23-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoPM: hibernate: don't use early_lookup_bdev in resume_store
Christoph Hellwig [Wed, 31 May 2023 12:55:32 +0000 (14:55 +0200)]
PM: hibernate: don't use early_lookup_bdev in resume_store

resume_store is a sysfs attribute written during normal kernel runtime,
and it should not use the early_lookup_bdev API that bypasses all normal
path based permission checking, and might cause problems with certain
container environments renaming devices.

Switch to lookup_bdev, which does a normal path lookup instead, and fall
back to trying to parse a numeric dev_t just like early_lookup_bdev did.

Note that this strictly speaking changes the kernel ABI as the PARTUUID=
and PARTLABEL= style syntax is now not available during a running
systems.  They never were intended for that, but this breaks things
we'll have to figure out a way to make them available again.  But if
avoidable in any way I'd rather avoid that.

Fixes: 421a5fa1a6cf ("PM / hibernate: use name_to_dev_t to parse resume")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-22-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodm: only call early_lookup_bdev from early boot context
Christoph Hellwig [Wed, 31 May 2023 12:55:31 +0000 (14:55 +0200)]
dm: only call early_lookup_bdev from early boot context

early_lookup_bdev is supposed to only be called from the early boot
code, but dm_get_device calls it as a general fallback when lookup_bdev
fails, which is problematic because early_lookup_bdev bypasses all normal
path based permission checking, and might cause problems with certain
container environments renaming devices.

Switch to only call early_lookup_bdev when dm is built-in and the system
state in not running yet.  This means it is still available when tables
are constructed by dm-init.c from the kernel command line, but not
otherwise.

Note that this strictly speaking changes the kernel ABI as the PARTUUID=
and PARTLABEL= style syntax is now not available during a running
systems.  They never were intended for that, but this breaks things
we'll have to figure out a way to make them available again.  But if
avoidable in any way I'd rather avoid that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-21-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodm: remove dm_get_dev_t
Christoph Hellwig [Wed, 31 May 2023 12:55:30 +0000 (14:55 +0200)]
dm: remove dm_get_dev_t

Open code dm_get_dev_t in the only remaining caller, and propagate the
exact error code from lookup_bdev and early_lookup_bdev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-20-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodm: open code dm_get_dev_t in dm_init_init
Christoph Hellwig [Wed, 31 May 2023 12:55:29 +0000 (14:55 +0200)]
dm: open code dm_get_dev_t in dm_init_init

dm_init_init is called from early boot code, and thus lookup_bdev
will never succeed.  Just open code that call to early_lookup_bdev
instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodm-snap: simplify the origin_dev == cow_dev check in snapshot_ctr
Christoph Hellwig [Wed, 31 May 2023 12:55:28 +0000 (14:55 +0200)]
dm-snap: simplify the origin_dev == cow_dev check in snapshot_ctr

Use the block_device acquired in dm_get_device for the check instead
of doing an extra lookup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: move more code to early-lookup.c
Christoph Hellwig [Wed, 31 May 2023 12:55:27 +0000 (14:55 +0200)]
block: move more code to early-lookup.c

blk_lookup_devt is only used by code in early-lookup.c, so move it
there.

printk_all_partitions and it's helper bdevt_str are only used by the
early init code in init/do_mounts.c, so they should go there as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: move the code to do early boot lookup of block devices to block/
Christoph Hellwig [Wed, 31 May 2023 12:55:26 +0000 (14:55 +0200)]
block: move the code to do early boot lookup of block devices to block/

Create a new block/early-lookup.c to keep the early block device lookup
code instead of having this code sit with the early mount code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: clear root_wait on all invalid root= strings
Christoph Hellwig [Wed, 31 May 2023 12:55:25 +0000 (14:55 +0200)]
init: clear root_wait on all invalid root= strings

Instead of only clearing root_wait in devt_from_partuuid when the UUID
format was invalid, do that in parse_root_device for all strings that
failed to parse.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: improve the name_to_dev_t interface
Christoph Hellwig [Wed, 31 May 2023 12:55:24 +0000 (14:55 +0200)]
init: improve the name_to_dev_t interface

name_to_dev_t has a very misleading name, that doesn't make clear
it should only be used by the early init code, and also has a bad
calling convention that doesn't allow returning different kinds of
errors.  Rename it to early_lookup_bdev to make the use case clear,
and return an errno, where -EINVAL means the string could not be
parsed, and -ENODEV means it the string was valid, but there was
no device found for it.

Also stub out the whole call for !CONFIG_BLOCK as all the non-block
root cases are always covered in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: move the nfs/cifs/ram special cases out of name_to_dev_t
Christoph Hellwig [Wed, 31 May 2023 12:55:23 +0000 (14:55 +0200)]
init: move the nfs/cifs/ram special cases out of name_to_dev_t

The nfs/cifs/ram special case only needs to be parsed once, and only in
the boot code.  Move them out of name_to_dev_t and into
prepare_namespace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: factor the root_wait logic in prepare_namespace into a helper
Christoph Hellwig [Wed, 31 May 2023 12:55:22 +0000 (14:55 +0200)]
init: factor the root_wait logic in prepare_namespace into a helper

The root_wait logic is a bit obsfucated right now.  Expand it and move it
into a helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: handle ubi/mtd root mounting like all other root types
Christoph Hellwig [Wed, 31 May 2023 12:55:21 +0000 (14:55 +0200)]
init: handle ubi/mtd root mounting like all other root types

Assign a Root_Generic magic value for UBI/MTD root and handle the root
mounting in mount_root like all other root types.  Besides making the
code more clear this also means that UBI/MTD root can be used together
with an initrd (not that anyone should care).

Also factor parsing of the root name into a helper now that it can
be easily done and will get more complicated with subsequent patches.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: don't remove the /dev/ prefix from error messages
Christoph Hellwig [Wed, 31 May 2023 12:55:20 +0000 (14:55 +0200)]
init: don't remove the /dev/ prefix from error messages

Remove the code that drops the /dev/ prefix from root_device_name, which
is only used for error messages when mounting the root device fails.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: pass root_device_name explicitly
Christoph Hellwig [Wed, 31 May 2023 12:55:19 +0000 (14:55 +0200)]
init: pass root_device_name explicitly

Instead of declaring root_device_name as a global variable pass it as an
argument to the functions using it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: refactor mount_root
Christoph Hellwig [Wed, 31 May 2023 12:55:18 +0000 (14:55 +0200)]
init: refactor mount_root

Provide stubs for all the lower level mount helpers, and just switch
on ROOT_DEV in the main function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: rename mount_block_root to mount_root_generic
Christoph Hellwig [Wed, 31 May 2023 12:55:17 +0000 (14:55 +0200)]
init: rename mount_block_root to mount_root_generic

mount_block_root is also used to mount non-block file systems, so give
it a better name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoinit: remove pointless Root_* values
Christoph Hellwig [Wed, 31 May 2023 12:55:16 +0000 (14:55 +0200)]
init: remove pointless Root_* values

Remove all unused defines, and just use the expanded versions for
the SCSI disk majors.

I've decided to keep Root_RAM0 even if it could be expanded as there
is a lot of special casing for it in the init code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531125535.676098-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoPM: hibernate: move finding the resume device out of software_resume
Christoph Hellwig [Wed, 31 May 2023 12:55:15 +0000 (14:55 +0200)]
PM: hibernate: move finding the resume device out of software_resume

software_resume can be called either from an init call in the boot code,
or from sysfs once the system has finished booting, and the two
invocation methods this can't race with each other.

For the latter case we did just parse the suspend device manually, while
the former might not have one.  Split software_resume so that the search
only happens for the boot case, which also means the special lockdep
nesting annotation can go away as the system transition mutex can be
taken a little later and doesn't have the sysfs locking nest inside it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoPM: hibernate: remove the global snapshot_test variable
Christoph Hellwig [Wed, 31 May 2023 12:55:14 +0000 (14:55 +0200)]
PM: hibernate: remove the global snapshot_test variable

Passing call dependent variable in global variables is a huge
antipattern.  Fix it up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoPM: hibernate: factor out a helper to find the resume device
Christoph Hellwig [Wed, 31 May 2023 12:55:13 +0000 (14:55 +0200)]
PM: hibernate: factor out a helper to find the resume device

Split the logic to find the resume device out software_resume and into
a separate helper to start unwindig the convoluted goto logic.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20230531125535.676098-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodriver core: return bool from driver_probe_done
Christoph Hellwig [Wed, 31 May 2023 12:55:12 +0000 (14:55 +0200)]
driver core: return bool from driver_probe_done

bool is the most sensible return value for a yes/no return.  Also
add __init as this funtion is only called from the early boot code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20230531125535.676098-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoext4: wire up the ->mark_dead holder operation for log devices
Christoph Hellwig [Thu, 1 Jun 2023 09:44:59 +0000 (11:44 +0200)]
ext4: wire up the ->mark_dead holder operation for log devices

Implement a set of holder_ops that shut down the file system when the
block device used as log device is removed undeneath the file system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-17-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoext4: wire up sops->shutdown
Christoph Hellwig [Thu, 1 Jun 2023 09:44:58 +0000 (11:44 +0200)]
ext4: wire up sops->shutdown

Wire up the shutdown method to shut down the file system when the
underlying block device is marked dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-16-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoext4: split ext4_shutdown
Christoph Hellwig [Thu, 1 Jun 2023 09:44:57 +0000 (11:44 +0200)]
ext4: split ext4_shutdown

Split ext4_shutdown into a low-level helper that will be reused for
implementing the shutdown super operation and a wrapper for the ioctl
handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoxfs: wire up the ->mark_dead holder operation for log and RT devices
Christoph Hellwig [Thu, 1 Jun 2023 09:44:56 +0000 (11:44 +0200)]
xfs: wire up the ->mark_dead holder operation for log and RT devices

Implement a set of holder_ops that shut down the file system when the
block device used as log or RT device is removed undeneath the file
system.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoxfs: wire up sops->shutdown
Christoph Hellwig [Thu, 1 Jun 2023 09:44:55 +0000 (11:44 +0200)]
xfs: wire up sops->shutdown

Wire up the shutdown method to shut down the file system when the
underlying block device is marked dead.  Add a new message to
clearly distinguish this shutdown reason from other shutdowns.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agofs: add a method to shut down the file system
Christoph Hellwig [Thu, 1 Jun 2023 09:44:54 +0000 (11:44 +0200)]
fs: add a method to shut down the file system

Add a new ->shutdown super operation that can be used to tell the file
system to shut down, and call it from newly created holder ops when the
block device under a file system shuts down.

This only covers the main block device for "simple" file systems using
get_tree_bdev / mount_bdev.  File systems their own get_tree method
or opening additional devices will need to set up their own
blk_holder_ops.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: add a mark_dead holder operation
Christoph Hellwig [Thu, 1 Jun 2023 09:44:53 +0000 (11:44 +0200)]
block: add a mark_dead holder operation

Add a mark_dead method to blk_holder_ops that is called from blk_mark_disk_dead
to notify the holder that the block device it is using has been marked dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: introduce holder ops
Christoph Hellwig [Thu, 1 Jun 2023 09:44:52 +0000 (11:44 +0200)]
block: introduce holder ops

Add a new blk_holder_ops structure, which is passed to blkdev_get_by_* and
installed in the block_device for exclusive claims.  It will be used to
allow the block layer to call back into the user of the block device for
thing like notification of a removed device or a device resize.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: remove blk_drop_partitions
Christoph Hellwig [Thu, 1 Jun 2023 09:44:51 +0000 (11:44 +0200)]
block: remove blk_drop_partitions

There is only a single caller left, so fold the loop into that.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: delete partitions later in del_gendisk
Christoph Hellwig [Thu, 1 Jun 2023 09:44:50 +0000 (11:44 +0200)]
block: delete partitions later in del_gendisk

Delay dropping the block_devices for partitions in del_gendisk until
after the call to blk_mark_disk_dead, so that we can implementat
notification of removed devices in blk_mark_disk_dead.

This requires splitting a lower-level drop_partition helper out of
delete_partition and using that from del_gendisk, while having a
common loop for the whole device and partitions that calls
remove_inode_hash, fsync_bdev and __invalidate_device before the
call to blk_mark_disk_dead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: unhash the inode earlier in delete_partition
Christoph Hellwig [Thu, 1 Jun 2023 09:44:49 +0000 (11:44 +0200)]
block: unhash the inode earlier in delete_partition

Move the call to remove_inode_hash to the beginning of delete_partition,
as we want to prevent opening a block_device that is about to be removed
ASAP.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: avoid repeated work in blk_mark_disk_dead
Christoph Hellwig [Thu, 1 Jun 2023 09:44:48 +0000 (11:44 +0200)]
block: avoid repeated work in blk_mark_disk_dead

Check if GD_DEAD is already set in blk_mark_disk_dead, and don't
duplicate the work already done.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: consolidate the shutdown logic in blk_mark_disk_dead and del_gendisk
Christoph Hellwig [Thu, 1 Jun 2023 09:44:47 +0000 (11:44 +0200)]
block: consolidate the shutdown logic in blk_mark_disk_dead and del_gendisk

blk_mark_disk_dead does very similar work a a section of del_gendisk:

 - set the GD_DEAD flag
 - set the capacity to zero
 - start a queue drain

but del_gendisk also sets QUEUE_FLAG_DYING on the queue if it is owned by
the disk, sets the capacity to zero before starting the drain, and both
with sending a uevent and kernel message for this fake capacity change.

Move the exact logic from the more heavily used del_gendisk into
blk_mark_disk_dead and then call blk_mark_disk_dead from del_gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: turn bdev_lock into a mutex
Christoph Hellwig [Thu, 1 Jun 2023 09:44:46 +0000 (11:44 +0200)]
block: turn bdev_lock into a mutex

There is no reason for this lock to spin, and being able to sleep under
it will come in handy soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: refactor bd_may_claim
Christoph Hellwig [Thu, 1 Jun 2023 09:44:45 +0000 (11:44 +0200)]
block: refactor bd_may_claim

The long if/else chain obsfucates the actual logic.  Tidy it up to be
more structured.  Also drop the whole argument, as it can be trivially
derived from bdev using bdev_whole, and having the bdev_whole in the
function makes it easier to follow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: factor out a bd_end_claim helper from blkdev_put
Christoph Hellwig [Thu, 1 Jun 2023 09:44:44 +0000 (11:44 +0200)]
block: factor out a bd_end_claim helper from blkdev_put

Move all the logic to release an exclusive claim into a helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20230601094459.1350643-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodrbd: stop defining __KERNEL_SYSCALLS__
Christoph Hellwig [Thu, 1 Jun 2023 15:16:46 +0000 (17:16 +0200)]
drbd: stop defining __KERNEL_SYSCALLS__

__KERNEL_SYSCALLS__ hasn't been needed since Linux 2.6.19 so stop
defining it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20230601151646.1386867-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoublk: add control command of UBLK_U_CMD_GET_FEATURES
Ming Lei [Sat, 3 Jun 2023 04:06:01 +0000 (12:06 +0800)]
ublk: add control command of UBLK_U_CMD_GET_FEATURES

Add control command of UBLK_U_CMD_GET_FEATURES for returning driver's
feature set or capability.

This way can simplify userspace for maintaining compatibility because
userspace doesn't need to send command to one device for querying driver
feature set any more. Such as, with the queried feature set, userspace
can choose to use:

- UBLK_CMD_GET_DEV_INFO2 or UBLK_CMD_GET_DEV_INFO,
- UBLK_U_CMD_* or UBLK_CMD_*

Userspace code:
https://github.com/ming1/ubdsrv/commits/features-cmd

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20230603040601.775227-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: Replace all non-returning strlcpy with strscpy
Azeem Shaikh [Tue, 30 May 2023 15:56:08 +0000 (15:56 +0000)]
block: Replace all non-returning strlcpy with strscpy

strlcpy() reads the entire source buffer first.
This read may exceed the destination size limit.
This is both inefficient and can lead to linear read
overflows if a source string is not NUL-terminated [1].
In an effort to remove strlcpy() completely [2], replace
strlcpy() here with strscpy().
No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy
[2] https://github.com/KSPP/linux/issues/89

Signed-off-by: Azeem Shaikh <azeemshaikh38@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20230530155608.272266-1-azeemshaikh38@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblk-ioc: protect ioc_destroy_icq() by 'queue_lock'
Yu Kuai [Wed, 31 May 2023 07:34:35 +0000 (15:34 +0800)]
blk-ioc: protect ioc_destroy_icq() by 'queue_lock'

Currently, icq is tracked by both request_queue(icq->q_node) and
task(icq->ioc_node), and ioc_clear_queue() from elevator exit is not
safe because it can access the list without protection:

ioc_clear_queue ioc_release_fn
 lock queue_lock
 list_splice
 /* move queue list to a local list */
 unlock queue_lock
 /*
  * lock is released, the local list
  * can be accessed through task exit.
  */

lock ioc->lock
while (!hlist_empty)
 icq = hlist_entry
 lock queue_lock
  ioc_destroy_icq
   delete icq->ioc_node
 while (!list_empty)
  icq = list_entry()    list_del icq->q_node
  /*
   * This is not protected by any lock,
   * list_entry concurrent with list_del
   * is not safe.
   */

 unlock queue_lock
unlock ioc->lock

Fix this problem by protecting list 'icq->q_node' by queue_lock from
ioc_clear_queue().

Reported-and-tested-by: Pradeep Pragallapati <quic_pragalla@quicinc.com>
Link: https://lore.kernel.org/lkml/20230517084434.18932-1-quic_pragalla@quicinc.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230531073435.2923422-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: mark bio_add_folio as __must_check
Johannes Thumshirn [Wed, 31 May 2023 11:50:43 +0000 (04:50 -0700)]
block: mark bio_add_folio as __must_check

Now that all callers of bio_add_folio() check the return value, mark it as
__must_check.

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/381360a45ac3684120cfbe1e07685e9c36256e47.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agofs: iomap: use bio_add_folio_nofail where possible
Johannes Thumshirn [Wed, 31 May 2023 11:50:42 +0000 (04:50 -0700)]
fs: iomap: use bio_add_folio_nofail where possible

When the iomap buffered-io code can't add a folio to a bio, it allocates a
new bio and adds the folio to that one. This is done using bio_add_folio(),
but doesn't check for errors.

As adding a folio to a newly created bio can't fail, use the newly
introduced bio_add_folio_nofail() function.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/58fa893c24c67340a63323f09a179fefdca07f2a.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: add bio_add_folio_nofail
Johannes Thumshirn [Wed, 31 May 2023 11:50:41 +0000 (04:50 -0700)]
block: add bio_add_folio_nofail

Just like for bio_add_pages() add a no-fail variant for bio_add_folio().

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/924dff4077812804398ef84128fb920507fa4be1.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: mark bio_add_page as __must_check
Johannes Thumshirn [Wed, 31 May 2023 11:50:40 +0000 (04:50 -0700)]
block: mark bio_add_page as __must_check

Now that all users of bio_add_page check for the return value, mark
bio_add_page as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/7ae4a902e08fe2e90c012ee07aeb35d4aae28373.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodm-crypt: use __bio_add_page to add single page to clone bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:39 +0000 (04:50 -0700)]
dm-crypt: use __bio_add_page to add single page to clone bio

crypt_alloc_buffer() already allocates enough entries in the clone bio's
vector, so adding a page to the bio can't fail. Use __bio_add_page() to
reflect this.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/f9a4dee5e81389fd70ffc442da01006538e55aca.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomd: raid1: check if adding pages to resync bio fails
Johannes Thumshirn [Wed, 31 May 2023 11:50:38 +0000 (04:50 -0700)]
md: raid1: check if adding pages to resync bio fails

Check if adding pages to resync bio fails and if bail out.

As the comment above suggests this cannot happen, WARN if it actually
happens. Technically __bio_add_pages() would be sufficient here, but
asserting the pages actually get added to the bio is preferred.

This way we can mark bio_add_pages as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/33aea4c271220dc9bcab58c4b7bec478c1511142.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomd: raid1: use __bio_add_page for adding single page to bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:37 +0000 (04:50 -0700)]
md: raid1: use __bio_add_page for adding single page to bio

The sync request code uses bio_add_page() to add a page to a newly created bio.
bio_add_page() can fail, but the return value is never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/6cf7f66c6e646231200d025dfd5f2d3ae75c8fe5.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomd: check for failure when adding pages in alloc_behind_master_bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:36 +0000 (04:50 -0700)]
md: check for failure when adding pages in alloc_behind_master_bio

alloc_behind_master_bio() can possibly add multiple pages to a bio, but it
is not checking for the return value of bio_add_page() if adding really
succeeded.

Check if the page adding succeeded and if not bail out.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/827aa12d44ebf3f50b41b47f5cedc0f80179f2c1.1685532726.git.johannes.thumshirn@wdc.com
[axboe: fold in s/free_page/put_page fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agofloppy: use __bio_add_page for adding single page to bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:35 +0000 (04:50 -0700)]
floppy: use __bio_add_page for adding single page to bio

The floppy code uses bio_add_page() to add a page to a newly created bio.
bio_add_page() can fail, but the return value is never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/33c445a3b431270c72d9be03d5da1b08ae983920.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agozram: use __bio_add_page for adding single page to bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:34 +0000 (04:50 -0700)]
zram: use __bio_add_page for adding single page to bio

The zram writeback code uses bio_add_page() to add a page to a newly
created bio. bio_add_page() can fail, but the return value is never
checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/cfd141dd7773315879a126f2aa81b7f698bc0e10.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agozonefs: use __bio_add_page for adding single page to bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:33 +0000 (04:50 -0700)]
zonefs: use __bio_add_page for adding single page to bio

The zonefs superblock reading code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is
never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/04c9978ccaa0fc9871cd4248356638d98daccf0c.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agogfs2: use __bio_add_page for adding single page to bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:32 +0000 (04:50 -0700)]
gfs2: use __bio_add_page for adding single page to bio

The GFS2 superblock reading code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is never
checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/087c67d4e4973f949d3519c1e4822784ce583c5a.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agojfs: logmgr: use __bio_add_page to add single page to bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:31 +0000 (04:50 -0700)]
jfs: logmgr: use __bio_add_page to add single page to bio

The JFS IO code uses bio_add_page() to add a page to a newly created bio.
bio_add_page() can fail, but the return value is never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/9fb5ed86d19f6e0b6f64dfc4109a48ff8ff24497.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomd: raid5: use __bio_add_page to add single page to new bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:30 +0000 (04:50 -0700)]
md: raid5: use __bio_add_page to add single page to new bio

The raid5-ppl submission code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is never
checked. For adding consecutive pages, the return is actually checked and
a new bio is allocated if adding the page fails.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/27e6bcd762354bff74602e89159cdd12ae3d1fa9.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomd: raid5-log: use __bio_add_page to add single page
Johannes Thumshirn [Wed, 31 May 2023 11:50:29 +0000 (04:50 -0700)]
md: raid5-log: use __bio_add_page to add single page

The raid5 log metadata submission code uses bio_add_page() to add a page
to a newly created bio. bio_add_page() can fail, but the return value is
never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/832a810d6c9e71f88b0a39cb076a8c70e8bcb821.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomd: use __bio_add_page to add single page
Johannes Thumshirn [Wed, 31 May 2023 11:50:28 +0000 (04:50 -0700)]
md: use __bio_add_page to add single page

The md-raid superblock writing code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is never
checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Signed-of_-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/ca196f5e650e318106dbb4496eb6cbac4bc800bd.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agofs: buffer: use __bio_add_page to add single page to bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:27 +0000 (04:50 -0700)]
fs: buffer: use __bio_add_page to add single page to bio

The buffer_head submission code uses bio_add_page() to add a page to a
newly created bio. bio_add_page() can fail, but the return value is never
checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Gou Hao <gouhao@uniontech.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/84ff2dcbe81b258a73ad900adb5266e208b61a4d.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodm: dm-zoned: use __bio_add_page for adding single metadata page
Johannes Thumshirn [Wed, 31 May 2023 11:50:26 +0000 (04:50 -0700)]
dm: dm-zoned: use __bio_add_page for adding single metadata page

dm-zoned uses bio_add_page() for adding a single page to a freshly created
metadata bio.

Use __bio_add_page() instead as adding a single page to a new bio is
always guaranteed to succeed.

This brings us a step closer to marking bio_add_page() __must_check

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/55a0c8dad7550379647873b579dc7cfbe0191f96.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agodrbd: use __bio_add_page to add page to bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:25 +0000 (04:50 -0700)]
drbd: use __bio_add_page to add page to bio

The drbd code only adds a single page to a newly created bio. So use
__bio_add_page() to add the page which is guaranteed to succeed in this
case.

This brings us closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/435007afac14f3766455559059d21843771fae53.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoswap: use __bio_add_page to add page to bio
Johannes Thumshirn [Wed, 31 May 2023 11:50:24 +0000 (04:50 -0700)]
swap: use __bio_add_page to add page to bio

The swap code only adds a single page to a newly created bio. So use
__bio_add_page() to add the page which is guaranteed to succeed in this
case.

This brings us closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/5bdafd9de806b2dab92302b30eb7a3a5f10c37d9.1685532726.git.johannes.thumshirn@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: Use iov_iter_extract_pages() and page pinning in direct-io.c
David Howells [Fri, 26 May 2023 21:41:42 +0000 (22:41 +0100)]
block: Use iov_iter_extract_pages() and page pinning in direct-io.c

Change the old block-based direct-I/O code to use iov_iter_extract_pages()
to pin user pages or leave kernel pages unpinned rather than taking refs
when submitting bios.

This makes use of the preceding patches to not take pins on the zero page
(thereby allowing insertion of zero pages in with pinned pages) and to get
additional pins on pages, allowing an extracted page to be used in multiple
bios without having to re-extract it.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230526214142.958751-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomm: Provide a function to get an additional pin on a page
David Howells [Fri, 26 May 2023 21:41:41 +0000 (22:41 +0100)]
mm: Provide a function to get an additional pin on a page

Provide a function to get an additional pin on a page that we already have
a pin on.  This will be used in fs/direct-io.c when dispatching multiple
bios to a page we've extracted from a user-backed iter rather than redoing
the extraction.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20230526214142.958751-3-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agomm: Don't pin ZERO_PAGE in pin_user_pages()
David Howells [Fri, 26 May 2023 21:41:40 +0000 (22:41 +0100)]
mm: Don't pin ZERO_PAGE in pin_user_pages()

Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
to it from the page tables and make unpin_user_page*() correspondingly
ignore a ZERO_PAGE when unpinning.  We don't want to risk overrunning a
zero page's refcount as we're only allowed ~2 million pins on it -
something that userspace can conceivably trigger.

Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christoph Hellwig <hch@infradead.org>
cc: David Hildenbrand <david@redhat.com>
cc: Lorenzo Stoakes <lstoakes@gmail.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jan Kara <jack@suse.cz>
cc: Jeff Layton <jlayton@kernel.org>
cc: Jason Gunthorpe <jgg@nvidia.com>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: Hillf Danton <hdanton@sina.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-kernel@vger.kernel.org
cc: linux-mm@kvack.org
Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20230526214142.958751-2-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: constify the whole_disk device_attribute
Thomas Weißschuh [Tue, 30 May 2023 17:10:00 +0000 (19:10 +0200)]
block: constify the whole_disk device_attribute

The struct is never modified so it can be const.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20230419-const-partition-v3-4-4e14e48be367@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: constify struct part_attr_group
Thomas Weißschuh [Tue, 30 May 2023 17:09:59 +0000 (19:09 +0200)]
block: constify struct part_attr_group

The struct is never modified so it can be const.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20230419-const-partition-v3-3-4e14e48be367@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: constify struct part_type part_type
Thomas Weißschuh [Tue, 30 May 2023 17:09:58 +0000 (19:09 +0200)]
block: constify struct part_type part_type

The struct is never modified so it can be const.

Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20230419-const-partition-v3-2-4e14e48be367@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: constify partition prober array
Thomas Weißschuh [Tue, 30 May 2023 17:09:57 +0000 (19:09 +0200)]
block: constify partition prober array

The array is never modified so it can be const.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202304191640.SkNk7kVN-lkp@intel.com/
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Link: https://lore.kernel.org/r/20230419-const-partition-v3-1-4e14e48be367@weissschuh.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: convert bio_map_user_iov to use iov_iter_extract_pages
David Howells [Mon, 22 May 2023 20:57:44 +0000 (21:57 +0100)]
block: convert bio_map_user_iov to use iov_iter_extract_pages

This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.

The pages need to be pinned for DIO rather than having refs taken on them
to prevent VM copy-on-write from malfunctioning during a concurrent fork()
(the result of the I/O could otherwise end up being visible to/affected by
the child process).

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-7-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
David Howells [Mon, 22 May 2023 20:57:43 +0000 (21:57 +0100)]
block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages

This will pin pages or leave them unaltered rather than getting a ref on
them as appropriate to the iterator.

The pages need to be pinned for DIO rather than having refs taken on them to
prevent VM copy-on-write from malfunctioning during a concurrent fork() (the
result of the I/O could otherwise end up being affected by/visible to the
child process).

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-6-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: Add BIO_PAGE_PINNED and associated infrastructure
David Howells [Mon, 22 May 2023 20:57:42 +0000 (21:57 +0100)]
block: Add BIO_PAGE_PINNED and associated infrastructure

Add BIO_PAGE_PINNED to indicate that the pages in a bio are pinned
(FOLL_PIN) and that the pin will need removing.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-5-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic
Christoph Hellwig [Mon, 22 May 2023 20:57:41 +0000 (21:57 +0100)]
block: Replace BIO_NO_PAGE_REF with BIO_PAGE_REFFED with inverted logic

Replace BIO_NO_PAGE_REF with a BIO_PAGE_REFFED flag that has the inverted
meaning is only set when a page reference has been acquired that needs to
be released by bio_release_pages().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Jens Axboe <axboe@kernel.dk>
cc: Jan Kara <jack@suse.cz>
cc: Matthew Wilcox <willy@infradead.org>
cc: Logan Gunthorpe <logang@deltatee.com>
cc: linux-block@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-4-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock: Fix bio_flagged() so that gcc can better optimise it
David Howells [Mon, 22 May 2023 20:57:40 +0000 (21:57 +0100)]
block: Fix bio_flagged() so that gcc can better optimise it

Fix bio_flagged() so that multiple instances of it, such as:

if (bio_flagged(bio, BIO_PAGE_REFFED) ||
    bio_flagged(bio, BIO_PAGE_PINNED))

can be combined by the gcc optimiser into a single test in assembly
(arguably, this is a compiler optimisation issue[1]).

The missed optimisation stems from bio_flagged() comparing the result of
the bitwise-AND to zero.  This results in an out-of-line bio_release_page()
being compiled to something like:

   <+0>:     mov    0x14(%rdi),%eax
   <+3>:     test   $0x1,%al
   <+5>:     jne    0xffffffff816dac53 <bio_release_pages+11>
   <+7>:     test   $0x2,%al
   <+9>:     je     0xffffffff816dac5c <bio_release_pages+20>
   <+11>:    movzbl %sil,%esi
   <+15>:    jmp    0xffffffff816daba1 <__bio_release_pages>
   <+20>:    jmp    0xffffffff81d0b800 <__x86_return_thunk>

However, the test is superfluous as the return type is bool.  Removing it
results in:

   <+0>:     testb  $0x3,0x14(%rdi)
   <+4>:     je     0xffffffff816e4af4 <bio_release_pages+15>
   <+6>:     movzbl %sil,%esi
   <+10>:    jmp    0xffffffff816dab7c <__bio_release_pages>
   <+15>:    jmp    0xffffffff81d0b7c0 <__x86_return_thunk>

instead.

Also, the MOVZBL instruction looks unnecessary[2] - I think it's just
're-booling' the mark_dirty parameter.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
cc: Jens Axboe <axboe@kernel.dk>
cc: linux-block@vger.kernel.org
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108370
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108371
Link: https://lore.kernel.org/r/167391056756.2311931.356007731815807265.stgit@warthog.procyon.org.uk/
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-3-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoiomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing
David Howells [Mon, 22 May 2023 20:57:39 +0000 (21:57 +0100)]
iomap: Don't get an reference on ZERO_PAGE for direct I/O block zeroing

ZERO_PAGE can't go away, no need to hold an extra reference.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-fsdevel@vger.kernel.org
Reviewed-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230522205744.2825689-2-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoMerge branch 'for-6.5/splice' into for-6.5/block
Jens Axboe [Wed, 24 May 2023 14:42:22 +0000 (08:42 -0600)]
Merge branch 'for-6.5/splice' into for-6.5/block

Merge splice bits as subsequent block cleanups and improvements for DIO
depend on them.

* for-6.5/splice: (31 commits)
  splice: kdoc for filemap_splice_read() and copy_splice_read()
  iov_iter: Kill ITER_PIPE
  splice: Remove generic_file_splice_read()
  splice: Use filemap_splice_read() instead of generic_file_splice_read()
  cifs: Use filemap_splice_read()
  trace: Convert trace/seq to use copy_splice_read()
  zonefs: Provide a splice-read wrapper
  xfs: Provide a splice-read wrapper
  orangefs: Provide a splice-read wrapper
  ocfs2: Provide a splice-read wrapper
  ntfs3: Provide a splice-read wrapper
  nfs: Provide a splice-read wrapper
  f2fs: Provide a splice-read wrapper
  ext4: Provide a splice-read wrapper
  ecryptfs: Provide a splice-read wrapper
  ceph: Provide a splice-read wrapper
  afs: Provide a splice-read wrapper
  9p: Add splice_read wrapper
  net: Make sock_splice_read() use copy_splice_read() by default
  tty, proc, kernfs, random: Use copy_splice_read()
  ...

2 years agosplice: kdoc for filemap_splice_read() and copy_splice_read()
David Howells [Mon, 22 May 2023 13:50:18 +0000 (14:50 +0100)]
splice: kdoc for filemap_splice_read() and copy_splice_read()

Provide kerneldoc comments for filemap_splice_read() and
copy_splice_read().

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Christian Brauner <brauner@kernel.org>
cc: Christoph Hellwig <hch@lst.de>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steve French <smfrench@gmail.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-32-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoiov_iter: Kill ITER_PIPE
David Howells [Mon, 22 May 2023 13:50:17 +0000 (14:50 +0100)]
iov_iter: Kill ITER_PIPE

The ITER_PIPE-type iterator was only used by generic_file_splice_read() and
that has been replaced and removed.  This leaves ITER_PIPE unused - so
remove it too.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-31-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agosplice: Remove generic_file_splice_read()
David Howells [Mon, 22 May 2023 13:50:16 +0000 (14:50 +0100)]
splice: Remove generic_file_splice_read()

Remove generic_file_splice_read() as it has been replaced with calls to
filemap_splice_read() and copy_splice_read().

With this, ITER_PIPE is no longer used.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
cc: Jens Axboe <axboe@kernel.dk>
cc: Steve French <smfrench@gmail.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: David Hildenbrand <david@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: linux-mm@kvack.org
cc: linux-block@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20230522135018.2742245-30-dhowells@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>