REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path. To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly. Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level
bio submission code that a punt to the cgroup submission helper
is required.
Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs, mm: remove the punt_to_cgroup field in struct writeback_control
punt_to_cgroup is only used by extent_write_locked_range, but that
function also directly controls the bio flags for the actual submission.
Remove th punt_to_cgroup field, and just set REQ_CGROUP_PUNT directly
in extent_write_locked_range.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: also use kthread_associate_blkcg for uncompressible ranges
submit_one_async_extent needs to use submit_one_async_extent no matter
if the range it handles ends up beeing compressed or not as the deadlock
risk due to cgroup thottling is the same. Call kthread_associate_blkcg
earlier to cover submit_uncompressed_range case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: move kthread_associate_blkcg out of btrfs_submit_compressed_write
btrfs_submit_compressed_write should not have to care if it is called
from a helper thread or not. Move the kthread_associate_blkcg handling
into submit_one_async_extent, as that is the one caller that needs it.
Also move the assignment of REQ_CGROUP_PUNT into cow_file_range_async,
as that is the routine that sets up the helper thread offload.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 30 Mar 2023 14:39:03 +0000 (15:39 +0100)]
btrfs: correctly calculate delayed ref bytes when starting transaction
When starting a transaction, we are assuming the number of bytes used for
each delayed ref update matches the number of bytes used for each item
update, that is the return value of:
However that is not correct when we are using the free space tree, as we
need to multiply that value by 2, since delayed ref updates need to modify
the free space tree besides the extent tree.
So fix this by using btrfs_calc_delayed_ref_bytes() to get the correct
number of bytes used for delayed ref updates.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 30 Mar 2023 14:39:02 +0000 (15:39 +0100)]
btrfs: make btrfs_block_rsv_full() check more boolean when starting transaction
When starting a transaction we are comparing the result of a call to
btrfs_block_rsv_full() with 0, but the function returns a boolean. While
in practice it is not incorrect, as 0 is equivalent to false, it makes it
a bit odd and less readable. So update the check to not compare against 0
and instead use the logical not (!) operator.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Tue, 28 Mar 2023 05:19:57 +0000 (14:19 +0900)]
btrfs: split partial dio bios before submit
If an application is doing direct io to a btrfs file and experiences a
page fault reading from the write buffer, iomap will issue a partial
bio, and allow the fs to keep going. However, there was a subtle bug in
this code path in the btrfs dio iomap implementation that led to the
partial write ending up as a gap in the file's extents and to be read
back as zeros.
The sequence of events in a partial write, lightly summarized and
trimmed down for brevity is as follows:
==== WRITING TASK ====
btrfs_direct_write
__iomap_dio_write
iomap_iter
btrfs_dio_iomap_begin # create full ordered extent
iomap_dio_bio_iter
bio_iov_iter_get_pages # page fault; partial read
submit_bio # partial bio
iomap_iter
btrfs_dio_iomap_end
btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
# submit to finish_ordered_fn wq
fault_in_iov_iter_readable # btrfs_direct_write detects partial write
__iomap_dio_write
iomap_iter
btrfs_dio_iomap_begin # create second partial ordered extent
iomap_dio_bio_iter
bio_iov_iter_get_pages # read all of remainder
submit_bio # partial bio with all of remainder
iomap_iter
btrfs_dio_iomap_end # nothing exciting to do with ordered io
==== DIO ENDIO ====
== FIRST PARTIAL BIO ==
btrfs_dio_end_io
btrfs_mark_ordered_io_finished # bytes_left > 0
# don't submit to finish_ordered_fn wq
== SECOND PARTIAL BIO ==
btrfs_dio_end_io
btrfs_mark_ordered_io_finished # bytes_left == 0
# submit to finish_ordered_fn wq
==== BTRFS FINISH ORDERED WQ ====
== FIRST PARTIAL BIO ==
btrfs_finish_ordered_io # called by dio_iomap_end_io, sees
# BTRFS_ORDERED_IOERR, just drops the
# ordered_extent
==SECOND PARTIAL BIO==
btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
# extents, csums, etc...
The essence of the problem is that while btrfs_direct_write and iomap
properly interact to submit all the correct bios, there is insufficient
logic in the btrfs dio functions (btrfs_dio_iomap_begin,
btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
ensure that every bio is at least a part of a completed ordered_extent.
And it is completing an ordered_extent that results in crucial
functionality like writing out a file extent for the range.
More specifically, btrfs_dio_end_io treats the ordered extent as
unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it.
Thus, the finish io work doesn't result in file extents, csums, etc.
In the aftermath, such a file behaves as though it has a hole in it,
instead of the purportedly written data.
We considered a few options for fixing the bug:
1. treat the partial bio as if we had truncated the file, which would
result in properly finishing it.
2. split the ordered extent when submitting a partial bio.
3. cache the ordered extent across calls to __iomap_dio_rw in
iter->private, so that we could reuse it and correctly apply
several bios to it.
I had trouble with 1, and it felt the most like a hack, so I tried 2
and 3. Since 3 has the benefit of also not creating an extra file
extent, and avoids an ordered extent lookup during bio submission, it
felt like the best option. However, that turned out to re-introduce a
deadlock which this code discarding the ordered_extent between faults
was meant to fix in the first place. (Link to an explanation of the
deadlock below.)
Therefore, go with fix 2, which requires a bit more setup work but fixes
the corruption without introducing the deadlock, which is fundamentally
caused by the ordered extent existing when we attempt to fault in a
range that overlaps with it.
Put succinctly, what this patch does is: when we submit a dio bio, check
if it is partial against the ordered extent stored in dio_data, and if it
is, extract the ordered_extent that matches the bio exactly out of the
larger ordered_extent. Keep the remaining ordered_extent around in dio_data
for cancellation in iomap_end.
Thanks to Josef, Christoph, and Filipe with their help figuring out the
bug and the fix.
Boris Burkov [Tue, 28 Mar 2023 05:19:56 +0000 (14:19 +0900)]
btrfs: don't split NOCOW extent_maps in btrfs_extract_ordered_extent
NOCOW writes just overwrite an existing extent map, which thus should
not be split in btrfs_extract_ordered_extent. The NOCOW case can't
currently happen as btrfs_extract_ordered_extent is only used on zoned
devices that do not support NOCOW writes, but this will change soon.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: split from a larger patch, wrote a commit log ] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
To prepare for a new caller that already has the ordered_extent
available, change btrfs_extract_ordered_extent to take an argument
for it. Add a wrapper for the bio case that still has to do the
lookup (for now).
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: simplify extent map splitting and rename split_zoned_em
split_zoned_em is only ever asked to split out the beginning of an extent
map. Change it to only take a len to split out instead of a pre and post
region.
Also rename the function to split_extent_map as there is nothing zoned
device specific about it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: fold btrfs_clone_ordered_extent into btrfs_split_ordered_extent
The function btrfs_clone_ordered_extent is very specific to the usage in
btrfs_split_ordered_extent. Now that only a single call to
btrfs_clone_ordered_extent is left, just fold it into
btrfs_split_ordered_extent to make the operation more clear.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: sink parameter len to btrfs_split_ordered_extent
btrfs_split_ordered_extent is only ever asked to split out the beginning
of an ordered_extent (i.e. post == 0). Change it to only take a len to
split out, and switch it to allocate the new extent for the beginning,
as that helps with callers that want to keep a pointer to the
ordered_extent that it is stealing from.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: simplify splitting logic in btrfs_extract_ordered_extent
btrfs_extract_ordered_extent is always used to split an ordered_extent
and extent_map into two parts, so it doesn't need to deal with a three
way split.
Simplify it by only allowing for a single split point, and always split
out the beginning of the extent, as that is what we'll later need to
be able to hold on to a reference to the original ordered_extent that
the first part is split off for submission.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: move ordered_extent internal sanity checks into btrfs_split_ordered_extent
Move the three checks that are about ordered extent internal sanity
checking into btrfs_split_ordered_extent instead of doing them in the
higher level btrfs_extract_ordered_extent routine.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Tue, 28 Mar 2023 05:19:49 +0000 (14:19 +0900)]
btrfs: stash ordered extent in dio_data during iomap dio
While it is not feasible for an ordered extent to survive across the
calls btrfs_direct_write makes into __iomap_dio_rw, it is still helpful
to stash it on the dio_data in between creating it in iomap_begin and
finishing it in either end_io or iomap_end.
The specific use I have in mind is that we can check if a particular bio
is partial in submit_io without unconditionally looking up the ordered
extent. This is a preparatory patch for a later patch which does just
that.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Tue, 28 Mar 2023 05:19:48 +0000 (14:19 +0900)]
btrfs: pass flags as unsigned long to btrfs_add_ordered_extent
The ordered_extent flags are declared as unsigned long, so pass them as
such to btrfs_add_ordered_extent.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: split from a larger patch ] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Tue, 28 Mar 2023 05:19:47 +0000 (14:19 +0900)]
btrfs: add function to create and return an ordered extent
Currently, btrfs_add_ordered_extent allocates a new ordered extent, adds
it to the rb_tree, but doesn't return a referenced pointer to the
caller. There are cases where it is useful for the creator of a new
ordered_extent to hang on to such a pointer, so add a new function
btrfs_alloc_ordered_extent which is the same as
btrfs_add_ordered_extent, except it takes an additional reference count
and returns a pointer to the ordered_extent. Implement
btrfs_add_ordered_extent as btrfs_alloc_ordered_extent followed by
dropping the new reference and handling the IS_ERR case.
The type of flags in btrfs_alloc_ordered_extent and
btrfs_add_ordered_extent is changed from unsigned int to unsigned long
so it's unified with the other ordered extent functions.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: use __bio_add_page to add single a page in rbio_add_io_sector
The btrfs raid56 sector submission code uses bio_add_page() to add a
page to a newly created bio. bio_add_page() can fail, but the return
value is never checked.
Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.
This brings us a step closer to marking bio_add_page() as __must_check.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: use __bio_add_page for adding a single page in repair_one_sector
The btrfs repair bio submission code uses bio_add_page() to add a page
to a newly created bio. bio_add_page() can fail, but the return value is
never checked.
Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.
This brings us a step closer to marking bio_add_page() as __must_check.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Mon, 27 Mar 2023 09:53:10 +0000 (17:53 +0800)]
btrfs: use test_and_clear_bit() in wait_dev_flush()
The function wait_dev_flush() tests for the BTRFS_DEV_STATE_FLUSH_SENT
bit and then clears it separately. Instead, use test_and_clear_bit().
Though we don't need to do the atomic test and clear, it's following a
common pattern.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Mon, 27 Mar 2023 09:53:09 +0000 (17:53 +0800)]
btrfs: change wait_dev_flush() return type to bool
The flush error code is maintained in btrfs_device::last_flush_error, so
there is no point in returning it in wait_dev_flush() when it is not being
used. Instead, we can return a boolean value.
Note that even though btrfs_device::last_flush_error may not be used, we
will keep it for now.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Mon, 27 Mar 2023 09:53:07 +0000 (17:53 +0800)]
btrfs: move last_flush_error to write_dev_flush and wait_dev_flush
We parallelize the flush command across devices using our own code,
write_dev_flush() sends the flush command to each device and
wait_dev_flush() waits for the flush to complete on all devices. Errors
from each device are recorded at device->last_flush_error and reset to
BLK_STS_OK in write_dev_flush() and to the error, if any, in
wait_dev_flush(). These functions are called from barrier_all_devices().
This patch consolidates the use of device->last_flush_error in
write_dev_flush() and wait_dev_flush() to remove it from
barrier_all_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:14:00 +0000 (11:14 +0000)]
btrfs: simplify exit paths of btrfs_evict_inode()
Instead of using two labels at btrfs_evict_inode() for exiting depending
on whether we need to delete the inode items and orphan or some error
happened, we can use a single exit label if we initialize the block
reserve to NULL, since btrfs_free_block_rsv() ignores a NULL block reserve
pointer. So just do that. It will also make an upcoming change simpler by
avoiding one extra error label.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:59 +0000 (11:13 +0000)]
btrfs: calculate the right space for delayed refs when updating global reserve
When updating the global block reserve, we account for the 6 items needed
by an unlink operation and the 6 delayed references for each one of those
items. However the calculation for the delayed references is not correct
in case we have the free space tree enabled, as in that case we need to
touch the free space tree as well and therefore need twice the number of
bytes. So use the btrfs_calc_delayed_ref_bytes() helper to calculate the
number of bytes need for the delayed references at
btrfs_update_global_block_rsv().
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:58 +0000 (11:13 +0000)]
btrfs: use a constant for the number of metadata units needed for an unlink
Instead of hard coding the number of metadata units for an unlink operation
in a couple places, define a macro and use it instead. This eliminates the
problem of one place getting out of sync with the other, such as recently
fixed by the previous patch in the series ("btrfs: fix calculation of the
global block reserve's size").
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:57 +0000 (11:13 +0000)]
btrfs: fix calculation of the global block reserve's size
At btrfs_update_global_block_rsv(), we are assuming an unlink operation
uses 5 metadata units, but that's not true anymore, it uses 6 since the
commit bca4ad7c0b54 ("btrfs: reserve correct number of items for unlink
and rmdir"). So update the code and comments to consider 6 units.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:56 +0000 (11:13 +0000)]
btrfs: calculate correct amount of space for delayed reference when evicting
When evicting an inode, we are incorrectly calculating the amount of space
required for a single delayed reference in case the free space tree is
enabled. We have to multiply by 2 the result of
btrfs_calc_insert_metadata_size(). We should be calculating according to
the size update and space release of the delayed block reserve logic at
btrfs_update_delayed_refs_rsv() and btrfs_delayed_refs_rsv_release().
Fix this by using the btrfs_calc_delayed_ref_bytes() helper at
evict_refill_and_join() instead of btrfs_calc_insert_metadata_size().
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:55 +0000 (11:13 +0000)]
btrfs: add helper to calculate space for delayed references
Instead of duplicating the logic for calculating how much space is
required for a given number of delayed references, add an inline helper
to encapsulate that logic and use it everywhere we are calculating the
space required.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:54 +0000 (11:13 +0000)]
btrfs: constify fs_info argument for the reclaim items calculation helpers
Now that btrfs_calc_insert_metadata_size() can take a const fs_info
argument, make the fs_info argument of calc_reclaim_items_nr() and of
calc_delayed_refs_nr() const as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:53 +0000 (11:13 +0000)]
btrfs: constify fs_info argument of the metadata size calculation helpers
The fs_info argument of the helpers btrfs_calc_insert_metadata_size() and
btrfs_calc_metadata_size() is not modified so it can be const. This will
also allow a new helper function in one of the next patches to have its
fs_info argument as const.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:52 +0000 (11:13 +0000)]
btrfs: accurately calculate number of delayed refs when flushing
When flushing a limited number of delayed references (FLUSH_DELAYED_REFS_NR
state), we are assuming each delayed reference is holding a number of bytes
matching the needed space for inserting for a single metadata item (the
result of btrfs_calc_insert_metadata_size()). That is not correct when
using the free space tree, as in that case we have to multiply that value
by 2 since we need to touch the free space tree as well. This is the same
computation as we do at btrfs_update_delayed_refs_rsv() and at
btrfs_delayed_refs_rsv_release().
So correct the computation for the amount of delayed references we need to
flush in case we have the free space tree. This does not fix a functional
issue, instead it makes the flush code flush less delayed references, only
the minimum necessary to satisfy a ticket.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:51 +0000 (11:13 +0000)]
btrfs: calculate the right space for a single delayed ref when refilling
When refilling the delayed block reserve we are incorrectly computing the
amount of bytes for a single delayed reference if the free space tree is
being used. In that case we should double the calculated amount.
Everywhere else we compute the correct amount, like when updating the
delayed block reserve, at btrfs_update_delayed_refs_rsv(), or when
releasing space from the delayed block reserve, at
btrfs_delayed_refs_rsv_release().
So fix btrfs_delayed_refs_rsv_refill() to multiply the amount of bytes for
a single delayed reference by two in case the free space tree is used.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:50 +0000 (11:13 +0000)]
btrfs: don't throttle on delayed items when evicting deleted inode
During inode eviction, if we are truncating a deleted inode, we don't add
delayed items for our inode, so there's no need to throttle on delayed
items on each iteration of the loop that truncates inode items from its
subvolume tree. But we dirty extent buffers from its subvolume tree, so
we only need to throttle on btree inode dirty pages.
So use btrfs_btree_balance_dirty_nodelay() in the loop that truncates
inode items.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:49 +0000 (11:13 +0000)]
btrfs: remove obsolete delayed ref throttling logic when truncating items
We have this logic encapsulated in btrfs_should_throttle_delayed_refs()
where we try to estimate if running the current amount of delayed
references we have will take more than half a second, and if so, the
caller btrfs_should_throttle_delayed_refs() should do something to
prevent more and more delayed refs from being accumulated.
This logic was added in commit 0a2b2a844af6 ("Btrfs: throttle delayed
refs better") and then further refined in commit a79b7d4b3e81 ("Btrfs:
async delayed refs"). The idea back then was that the caller of
btrfs_should_throttle_delayed_refs() would release its transaction
handle (by calling btrfs_end_transaction()) when that function returned
true, then btrfs_end_transaction() would trigger an async job to run
delayed references in a workqueue, and later start/join a transaction
again and do more work.
However we don't run delayed references asynchronously anymore, that
was removed in commit db2462a6ad3d ("btrfs: don't run delayed refs in
the end transaction logic"). That makes the logic that tries to estimate
how long we will take to run our current delayed references, at
btrfs_should_throttle_delayed_refs(), pointless as we don't take any
action to run delayed references anymore. We do have other type of
throttling, which consists of checking the size and reserved space of
the delayed and global block reserves, as well as if fluhsing delayed
references for the current transaction was already started, etc - this
is all done by btrfs_should_end_transaction(), and the only user of
btrfs_should_throttle_delayed_refs() does periodically call
btrfs_should_end_transaction().
So remove btrfs_should_throttle_delayed_refs() and the infrastructure
that keeps track of the average time used for running delayed references,
as well as adapting btrfs_truncate_inode_items() to call
btrfs_check_space_for_delayed_refs() instead.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:48 +0000 (11:13 +0000)]
btrfs: simplify variables in btrfs_block_rsv_refill()
At btrfs_block_rsv_refill(), there's no point in initializing the
'num_bytes' variable to 0 and then, after taking the block reserve's
spinlock, initializing it to the value of the 'min_reserved' parameter.
So just get rid of the 'num_bytes' local variable and rename the
'min_reserved' parameter to 'num_bytes'.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:47 +0000 (11:13 +0000)]
btrfs: remove redundant counter check at btrfs_truncate_inode_items()
At btrfs_truncate_inode_items(), in the while loop when we decide that we
are going to delete an item, it's pointless to check that 'pending_del_nr'
is non-zero in an else clause because the corresponding if statement is
checking if 'pending_del_nr' has a value of zero. So just remove that
condition from the else clause.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:46 +0000 (11:13 +0000)]
btrfs: count extents before taking inode's spinlock when reserving metadata
When reserving metadata space for delalloc (and direct IO too), at
btrfs_delalloc_reserve_metadata(), there's no need to count the number of
extents while holding the inode's spinlock, since that does not require
access to any field of the inode.
This section of code can be called concurrently, when we have direct IO
writes against different file ranges that don't increase the inode's
i_size, so it's beneficial to shorten the critical section by counting
the number of extents before taking the inode's spinlock.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:45 +0000 (11:13 +0000)]
btrfs: remove bytes_used argument from btrfs_make_block_group()
The only caller of btrfs_make_block_group() always passes 0 as the value
for the bytes_used argument, so remove it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:44 +0000 (11:13 +0000)]
btrfs: collapse should_end_transaction() into btrfs_should_end_transaction()
The function should_end_transaction() is very short and only has one
caller, which is btrfs_should_end_transaction(). So move the code from
should_end_transaction() into btrfs_should_end_transaction().
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_should_throttle_delayed_refs() returns 1 or 2 in case the
delayed refs should be throttled, however the only caller (inode eviction
and truncation path) does not care about those two different conditions,
it treats the return value as a boolean. This allows us to remove one of
the conditions in btrfs_should_throttle_delayed_refs() and change its
return value from 'int' to 'bool'. So just do that.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:42 +0000 (11:13 +0000)]
btrfs: initialize ret to -ENOSPC at __reserve_bytes()
At space-info.c:__reserve_bytes(), instead of initializing 'ret' to 0 when
it's declared and then shortly after set it to -ENOSPC under the space
info's spinlock, initialize it to -ENOSPC when declaring it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:41 +0000 (11:13 +0000)]
btrfs: update flush method assertion when reserving space
When reserving space, at space-info.c:__reserve_bytes(), we assert that
either the current task is not holding a transacion handle, or, if it is,
that the flush method is not BTRFS_RESERVE_FLUSH_ALL. This is because that
flush method can trigger transaction commits, and therefore could lead to
a deadlock.
However there are other 2 flush methods that can trigger transaction
commits:
So update the assertion to check the flush method is also not one those
two methods if the current task is holding a transaction handle.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:40 +0000 (11:13 +0000)]
btrfs: update documentation for BTRFS_RESERVE_FLUSH_EVICT flush method
The BTRFS_RESERVE_FLUSH_EVICT flush method can also commit transactions,
see the definition of the evict_flush_states const array at space-info.c,
but the documentation for it at space-info.h does not mention it.
So update the documentation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:39 +0000 (11:13 +0000)]
btrfs: remove check for NULL block reserve at btrfs_block_rsv_check()
The block reserve passed to btrfs_block_rsv_check() is never NULL, so
remove the check. In case it can ever become NULL in the future, then
we'll get a pretty obvious and clear NULL pointer dereference crash and
stack trace.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:38 +0000 (11:13 +0000)]
btrfs: pass a bool size update argument to btrfs_block_rsv_add_bytes()
At btrfs_delayed_refs_rsv_refill(), we are passing a value of 0 to the
'update_size' argument of btrfs_block_rsv_add_bytes(), which is defined
as a boolean. Functionally this is fine because a 0 is, implicitly,
converted to a boolean false value. However it's easier to read an
explicit 'false' value, so just pass 'false' instead of 0.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:13:37 +0000 (11:13 +0000)]
btrfs: pass a bool to btrfs_block_rsv_migrate() at evict_refill_and_join()
The last argument of btrfs_block_rsv_migrate() is a boolean, but we are
passing an integer, with a value of 1, to it at evict_refill_and_join().
While this is not a bug, due to type conversion, it's a lot more clear to
simply pass the boolean true value instead. So just do that.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 21 Mar 2023 11:37:04 +0000 (11:37 +0000)]
btrfs: remove btrfs_lru_cache_is_full() inline function
It's not used anywhere at the moment, but it was used in earlier version
of a patch that removed its use in the second version. So just remove
btrfs_lru_cache_is_full().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: simplify adding pages in btrfs_add_compressed_bio_pages
btrfs_add_compressed_bio_pages is needlessly complicated. Instead
of iterating over the logic disk offset just to add pages to the bio
use a simple offset starting at 0, which also removes most of the
claiming. Additionally __bio_add_pages already takes care of the
assert that the bio is always properly sized, and btrfs_submit_bio
called right after asserts that the bio size is non-zero.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: move the bi_sector assignment out of btrfs_add_compressed_bio_pages
Adding pages to a bio has nothing to do with the sector. Move the
assignment to the two callers in preparation for cleaning up
btrfs_add_compressed_bio_pages.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Mon, 13 Mar 2023 07:46:54 +0000 (16:46 +0900)]
btrfs: sysfs: relax bg_reclaim_threshold for debugging purposes
Currently, /sys/fs/btrfs/<UUID>/bg_reclaim_threshold is limited to 0
(disable) or [50 .. 100]%, so we need to fill 50% of a device to start the
auto reclaim process. It is cumbersome to do so when we want to shake out
possible race issues of normal write vs reclaim.
Relax the threshold check under the BTRFS_DEBUG option.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: make btrfs_split_bio work on struct btrfs_bio
btrfs_split_bio expects a btrfs_bio as argument and always allocates one.
Type both the orig_bio argument and the return value as struct btrfs_bio
to improve type safety.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: simplify finding the inode in submit_one_bio
struct btrfs_bio now has an always valid inode pointer that can be used
to find the inode in submit_one_bio, so use that and initialize all
variables for which it is possible at declaration time.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: pass a btrfs_bio to btrfs_submit_compressed_read
btrfs_submit_compressed_read expects the bio passed to it to be embedded
into a btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_bio expects the bio passed to it to be embedded into a
btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove unused members from struct btrfs_encoded_read_private
The inode and file_offset members in struct btrfs_encoded_read_private
are unused, so remove them.
Last used in commit 7959bd441176 ("btrfs: remove the start argument to
check_data_csum and export") and commit 7609afac6775 ("btrfs: handle
checksum validation and repair at the storage layer").
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 1 Mar 2023 20:47:08 +0000 (21:47 +0100)]
btrfs: locking: use atomic for DREW lock writers
The DREW lock uses percpu variable to track lock counters and for that
it needs to allocate the structure. In btrfs_read_tree_root() or
btrfs_init_fs_root() this may add another error case or requires the
NOFS scope protection.
One way is to preallocate the structure as was suggested in
https://lore.kernel.org/linux-btrfs/20221214021125.28289-1-robbieko@synology.com/
We may avoid the allocation altogether if we don't use the percpu
variables but an atomic for the writer counter. This should not make any
difference, the DREW lock is used for truncate and NOCOW writes along
with other IO operations.
The percpu counter for writers has been there since the original commit 8257b2dc3c1a1057 "Btrfs: introduce btrfs_{start, end}_nocow_write() for
each subvolume". The reason could be to avoid hammering the same
cacheline from all the readers but then the writers do that anyway.
Anand Jain [Thu, 2 Mar 2023 13:30:48 +0000 (21:30 +0800)]
btrfs: remove redundant clearing of NODISCARD
If no discard mount option is specified, including the NODISCARD option,
we make the async discard the default option then we don't have to call
the clear_opt again to clear the NODISCARD flag. Though this makes no
difference, that the call is redundant has been pointed out several
times so we better remove it.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS_FEATURE_INCOMPAT_SUPP is defined twice, once under
CONFIG_BTRFS_DEBUG and once without it, resulting in repetitive code. The
reason for this is to add experimental features under CONFIG_BTRFS_DEBUG.
To avoid repetitive code, add a common list BTRFS_FEATURE_INCOMPAT_SUPP_STABLE,
and append experimental features only under CONFIG_BTRFS_DEBUG.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 16 Jan 2023 07:04:13 +0000 (15:04 +0800)]
btrfs: scrub: remove root and csum_root arguments from scrub_simple_mirror()
We don't need to pass the roots as arguments, reading them from the
rb-tree is cheap. Thus there is really not much need to pre-fetch it
and pass it all the way from scrub_stripe().
And we already have more than enough arguments in scrub_simple_mirror()
and scrub_simple_stripe(), it's better to remove them and only grab
those roots in scrub_simple_mirror().
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The variable @path is no longer passed into any call sites after commit 18d30ab96149 ("btrfs: scrub: use scrub_simple_mirror() to handle RAID56
data stripe scrub"), thus we can remove the variable completely.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 22 Feb 2023 07:47:40 +0000 (15:47 +0800)]
btrfs: do not use replace target device as an extra mirror
[BUG]
Currently btrfs can use dev-replace device as an extra mirror for
read-repair. But it can lead to NODATASUM corruption in the following
case:
There is a RAID1 data chunk, and dev-replace is running from
dev2 to dev0.
|//| = Replaced data
X X+1MB X+2MB
Dev 2: | | | <- Source dev
Dev 0: |///////| | <- Target dev
Then a read on dev 2 X+2MB happens.
And something wrong happened inside devid 2, causing an -EIO.
In that case, read-repair would try the next mirror, and since we can
use target device as an extra mirror, we will use that mirror instead.
But unfortunately since the read is beyond the current replace cursor,
we should not trust it at all, what we get would be just uninitialized
garbage.
But if this read is for NODATASUM range, then we just trust them and
cause data corruption.
[CAUSE]
We used to have some checks to make sure we only return such extra
mirror when the range is before our left cursor.
The first commit introducing this behavior is ad6d620e2a57 ("Btrfs:
allow repair code to include target disk when searching mirrors").
But later a fix, 22ab04e814f4 ("Btrfs: fix race between device replace
and chunk allocation") changed the behavior, to always let
btrfs_map_block() include the extra mirror to address a race in
dev-replace which can cause missing writes to target device.
This means, we lose the tracking of cursor for the extra mirror, thus
can lead to above corruption.
[FIX]
The extra mirror is never a reliable one, at the beginning of
dev-replace, the reliability is zero, while only at the end of the
replace it's a fully reliable mirror.
We either do the complex tracking, or never trust it.
IMHO it's much easier to maintain if we don't trust it at all, and the
extra mirror can only benefit for a limited period of time (during
replace).
Thus this patch would completely remove the ability to use target device
as an extra mirror.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 28 Feb 2023 00:44:30 +0000 (08:44 +0800)]
btrfs: open_ctree() error handling cleanup
Currently open_ctree() still uses two variables for error handling, err
and ret. This can be confusing and missing some errors and does not
conform to current coding style.
This patch will fix the problems by:
- Use only ret for error handling
- Add proper ret assignment
Originally we rely on the default value (-EINVAL) of err to handle
errors, but that doesn't really reflects the error.
This will change it use the correct error number for the following
call sites:
btrfs: cleanup the main loop in btrfs_lookup_bio_sums
Introduce a bio_offset variable for the current offset into the bio
instead of recalculating it over and over. Remove the now only used
once search_len and sector_offset variables, and reduce the scope for
count and cur_disk_bytenr.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to search for a file offset in a bio, it is now always
provided in bbio->file_offset (set at bio allocation time since 0d495430db8d ("btrfs: set bbio->file_offset in alloc_new_bio")). Just
use that with the offset into the bio.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: sink calc_bio_boundaries into its only caller
Nowadays calc_bio_boundaries() is a relatively simple function that only
guarantees the one bio equals to one ordered extent rule for uncompressed
Zone Append bios.
Sink it into it's only caller alloc_new_bio().
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
bio_add_page always adds either the entire range passed to it or nothing.
Based on that btrfs_bio_add_page can only return a length smaller than
the passed in one when hitting the ordered extent limit, which can only
happen for writes. Given that compressed writes never even use this code
path, this means that all the special cases for compressed extent offset
handling are dead code.
Reflow submit_extent_page to take advantage of this by inlining
btrfs_bio_add_page and handling the ordered extent limit by decrementing
it for each added range and thus significantly simplifying the loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Different loop iterations in btrfs_bio_add_page not only have the same
contiguity parameters, but also any non-initial operation operates on a
fresh bio anyway.
Factor out the contiguity check into a new btrfs_bio_is_contig and only
call it once in submit_extent_page before descending into the
bio_add_page loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: simplify the error handling in __extent_writepage_io
Remove the has_error and saved_ret variables, and just jump to a goto
label for error handling from the only place returning an error from the
main loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
submit_extent_page always returns 0 since commit d5e4377d5051 ("btrfs:
split zone append bios in btrfs_submit_bio"). Change it to a void return
type and remove all the unreachable error handling code in the callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove the compress_type argument to submit_extent_page
Update the compress_type in the btrfs_bio_ctrl after forcing out the
previous bio in btrfs_do_readpage, so that alloc_new_bio can just use
the compress_type member in struct btrfs_bio_ctrl instead of passing the
same information redundantly as a function argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: rename the this_bio_flag variable in btrfs_do_readpage
Rename this_bio_flag to compress_type to match the surrounding code
and better document the intent. Also use the proper enum type instead
of unsigned long.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: move the compress_type check out of btrfs_bio_add_page
The compress_type can only change on a per-extent basis. So instead of
checking it for every page in btrfs_bio_add_page, do the check once in
btrfs_do_readpage, which is the only caller of btrfs_bio_add_page and
submit_extent_page that deals with compressed extents.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing down the wbc pointer the deep call chain, just
add it to the btrfs_bio_ctrl structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove the sync_io flag in struct btrfs_bio_ctrl
The sync_io flag is equivalent to wbc->sync_mode == WB_SYNC_ALL, so
just check for that and remove the separate flag.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The bio op and flags never change over the life time of a bio_ctrl,
so move it in there instead of passing it down the deep call chain
all the way down to alloc_new_bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove the force_bio_submit to submit_extent_page
If force_bio_submit, submit_extent_page simply calls submit_one_bio as
the first thing. This can just be moved to the only caller that sets
force_bio_submit to true.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: don't set force_bio_submit in read_extent_buffer_subpage
When read_extent_buffer_subpage calls submit_extent_page, it does
so on a freshly initialized btrfs_bio_ctrl structure that can't have
a valid bio to submit. Clear the force_bio_submit parameter to false
as there is nothing to submit.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Fri, 24 Feb 2023 03:31:26 +0000 (11:31 +0800)]
btrfs: open code btrfs_bin_search()
btrfs_bin_search() is a simple wrapper that searches for the whole slots
by calling btrfs_generic_bin_search() with the starting slot/first_slot
preset to 0.
This simple wrapper can be open coded as btrfs_bin_search().
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
One can see that all the reads are submitted to devid 1, even if we have
specified "-r" option to avoid reading from the source device.
[CAUSE]
The dev-replace read mode is only set but not followed by scrub code at
all. In fact, only common read path is properly following the read
mode, but scrub itself has its own read path, thus not following the
mode.
[FIX]
Here we enhance scrub_find_good_copy() to also follow the read mode.
The idea is pretty simple, in the first loop, we avoid the following
devices:
- Missing devices
This is the existing condition
- The source device if the replace wants to avoid it.
And if above loop found no candidate (e.g. replace a single device),
then we discard the 2nd condition, and try again.
Since we're here, also enhance the function scrub_find_good_copy() by:
- Remove the forward declaration
- Makes it return int
To indicates errors, e.g. no good mirror found.
- Add extra error messages
Now with the same trace, "btrfs replace start -r" works as expected:
btrfs: fold finish_compressed_bio_write into btrfs_finish_compressed_write_work
Fold finish_compressed_bio_write into its only caller as there is no
reason to keep them separate.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: don't clear page->mapping in btrfs_free_compressed_pages
No one ever set ->mapping on these pages, so don't bother clearing it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: factor out a btrfs_free_compressed_pages helper
Share the code to free the compressed pages and the array to hold them
into a common helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: factor out a btrfs_add_compressed_bio_pages helper
Factor out a common helper to add the compressed_bio pages to the
bio that is shared by the compressed read and write path.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: use the bbio file offset in add_ra_bio_pages
struct btrfs_bio now has a file_offset field set up by all submitters.
Use that value combined with the bio size in add_ra_bio_pages to
calculate the last offset in the bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: use the bbio file offset in btrfs_submit_compressed_read
struct btrfs_bio now has a file_offset field set up by all submitters.
Use that in btrfs_submit_compressed_read instead of recalculating the
value.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove redundant free_extent_map in btrfs_submit_compressed_read
em can't be non-NULL after the free_extent_map label. Also remove
the now pointless clearing of em to NULL after freeing it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: embed a btrfs_bio into struct compressed_bio
Embed a btrfs_bio into struct compressed_bio. This avoids potential
(so far theoretical) deadlocks due to nesting of btrfs_bioset allocations
for the original read bio and the compressed bio, and avoids an extra
memory allocation in the I/O path.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 17 Feb 2023 05:37:03 +0000 (13:37 +0800)]
btrfs: replace btrfs_io_context::raid_map with a fixed u64 value
In btrfs_io_context structure, we have a pointer raid_map, which
indicates the logical bytenr for each stripe.
But considering we always call sort_parity_stripes(), the result
raid_map[] is always sorted, thus raid_map[0] is always the logical
bytenr of the full stripe.
So why we waste the space and time (for sorting) for raid_map?
This patch will replace btrfs_io_context::raid_map with a single u64
number, full_stripe_start, by:
- Replace btrfs_io_context::raid_map with full_stripe_start
- Replace call sites using raid_map[0] to use full_stripe_start
- Replace call sites using raid_map[i] to compare with nr_data_stripes.
The benefits are:
- Less memory wasted on raid_map
It's sizeof(u64) * num_stripes vs sizeof(u64).
It'll always save at least one u64, and the benefit grows larger with
num_stripes.
- No more weird alloc_btrfs_io_context() behavior
As there is only one fixed size + one variable length array.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 7 Feb 2023 04:26:14 +0000 (12:26 +0800)]
btrfs: use an efficient way to represent source of duplicated stripes
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
num_tgtdevs = 1
tgtdev_map[0] = 0 <- Means stripes[0] is not involved in replace.
tgtdev_map[1] = 3 <- Means stripes[1] is involved in replace,
and it's duplicated to stripes[3].
tgtdev_map[2] = 0 <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing for
dev-replace, there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
replace_nr_stripes = 1
replace_stripe_src = 1 <- Means stripes[1] is involved in replace.
thus the extra stripe is a copy of
stripes[1]
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() go two different paths for
WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 7 Feb 2023 04:26:13 +0000 (12:26 +0800)]
btrfs: reduce type width of btrfs_io_contexts
That structure is our ultimate object for all __btrfs_map_block()
related functions. We have some hard to understand members, like
tgtdev_map, but without any comments.
This patch will improve the situation:
- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
tgtdev_map[]
Especially for the last two members, add a dedicated (thus very long)
comments for them, with example to explain it.
- Shrink those int members to u16.
In fact our on-disk format is only using u16 for num_stripes, thus
no need to use int at all.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 17 Feb 2023 05:36:59 +0000 (13:36 +0800)]
btrfs: reduce div64 calls by limiting the number of stripes of a chunk to u32
There are quite some div64 calls inside btrfs_map_block() and its
variants.
Such calls are for @stripe_nr, where @stripe_nr is the number of
stripes before our logical bytenr inside a chunk.
However we can eliminate such div64 calls by just reducing the width of
@stripe_nr from 64 to 32.
This can be done because our chunk size limit is already 10G, with fixed
stripe length 64K.
Thus a U32 is definitely enough to contain the number of stripes.
With such width reduction, we can get rid of slower div64, and extra
warning for certain 32bit arch.
This patch would do:
- Add a new tree-checker chunk validation on chunk length
Make sure no chunk can reach 256G, which can also act as a bitflip
checker.
- Reduce the width from u64 to u32 for @stripe_nr variables
- Replace unnecessary div64 calls with regular modulo and division
32bit division and modulo are much faster than 64bit operations, and
we are finally free of the div64 fear at least in those involved
functions.
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>