Jason Dillaman [Fri, 16 Oct 2020 15:25:39 +0000 (11:25 -0400)]
journal: possible race condition between flush and append callback
When notifying the journal recorder of an overflow or if the object
close request has completed due to no more in-flight IO, it was
possible for a race between a flush request and the processing of
an append completion to attempt to kick off duplicate notifications.
Since the overflowed and closed callbacks are properly protected from
duplicates, use a counter instead of a boolean to track possible
in-flight handler callbacks.
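A minimal sketch of the counter idea, written in Python rather than the C++ of the journal code. The names `HandlerTracker`, `schedule`, `finish` and `idle` are hypothetical. The sketch only shows why a boolean cannot represent two queued notifications, while a counter can.

```python
import threading


class HandlerTracker:
    """Hypothetical illustration: count queued handler callbacks."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0  # a bool here could not represent two queued callbacks

    def schedule(self):
        # Reached from both the flush path and the append-completion path.
        with self._lock:
            self._in_flight += 1

    def finish(self):
        # Called once a queued handler callback has run.
        with self._lock:
            assert self._in_flight > 0
            self._in_flight -= 1

    def idle(self):
        with self._lock:
            return self._in_flight == 0


tracker = HandlerTracker()
tracker.schedule()  # flush request queues a notification
tracker.schedule()  # append completion queues a second one
tracker.finish()    # first callback completes
still_busy = not tracker.idle()  # a boolean flag would already read "idle" here
tracker.finish()
```

With a boolean, the first completed callback would clear the flag while the second callback was still queued.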
Fixes: https://tracker.ceph.com/issues/47880
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Kefu Chai [Fri, 16 Oct 2020 14:07:50 +0000 (22:07 +0800)]
crimson/common: schedule action only if the future is not available
otherwise we could call do_until() recursively if we have other tasks
which need to preempt the reactor and the current future's state is
actually always available.
Kefu Chai [Fri, 16 Oct 2020 06:11:52 +0000 (14:11 +0800)]
crimson/common: do not take from a future twice
before this change, in our specialization of seastar::do_until(),
we access `f` after calling `f.get()`. this is not correct, as `f.get()`
actually moves `f._state` away and detaches the associated promise, if any.
so we cannot call `f._then()` anymore after calling `f.get()`, because
`f._then()` schedules `f` by detaching the future from the promise and
attaching the scheduled task to the promise. but `future_base::detach_promise()`
does not check `_promise` before accessing it, hence the segfault.
after this change, the order of the checks is rearranged so that
`f.get()` is called at the end. we also use `f.get0()` to be more
explicit, as we are accessing the only element of the returned
value.
Adam C. Emerson [Thu, 15 Oct 2020 16:03:13 +0000 (12:03 -0400)]
Merge pull request #37660 from adamemerson/wip-datalog-fix
cls/fifo: Switch use CLS_ERR for errors
rgw/fifo: Fix a few missed return value assignments
rgw/fifo: Add some error logging
rgw/fifo: Catch two instances journaling a new part
rgw/fifo: Use unique_ptr and explicit release for callbacks
Reviewed-by: J. Eric Ivancich <ivancich@redhat.com>
Matthew Oliver [Mon, 10 Aug 2020 04:46:21 +0000 (04:46 +0000)]
pick_address: Warn and continue when you find at least 1 IPv4 or IPv6 address
Currently, if you specify a single public or cluster network yet have
both `ms bind ipv4` and `ms bind ipv6` set, daemons crash when they can't
find both IPs from the same network:
unable to find any IPv4 address in networks '2001:db8:11d::/120' interfaces ''
And rightly so: of course it can't find an IPv4 address in an IPv6
network.
This patch adds a new helper method, networks_address_family_coverage,
that takes the list of networks and returns a bitmap of the address
families supported.
We then check whether enough networks are defined; if not, we warn and
then continue.
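The coverage check can be sketched roughly as follows, in Python rather than the C++ of the patch. Only the helper name comes from the commit message. The signature, the `IPV4`/`IPV6` bit values and the `missing_families` function are assumptions made for illustration.

```python
import ipaddress

IPV4 = 0x1  # hypothetical bit values, not Ceph's actual constants
IPV6 = 0x2


def networks_address_family_coverage(networks):
    """Return a bitmap of the address families present in a list of CIDR strings."""
    coverage = 0
    for net in networks:
        version = ipaddress.ip_network(net.strip(), strict=False).version
        coverage |= IPV4 if version == 4 else IPV6
    return coverage


def missing_families(networks, bind_ipv4, bind_ipv6):
    """Hypothetical helper: families that are enabled but have no network defined."""
    wanted = (IPV4 if bind_ipv4 else 0) | (IPV6 if bind_ipv6 else 0)
    return wanted & ~networks_address_family_coverage(networks)


# The failing configuration from the message: one IPv6 network, both binds enabled.
missing = missing_families(["2001:db8:11d::/120"], bind_ipv4=True, bind_ipv6=True)
if missing & IPV4:
    print("warning: no IPv4 network defined, continuing with IPv6 only")
```

Returning a bitmap lets the caller report exactly which family is uncovered instead of failing on the first lookup.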
Also update the network-config-ref to mention having to define
addresses of both address families for the cluster and/or public
networks, and add a warning that `ms bind ipv4` is enabled by default,
which is easy to miss, thereby enabling dual stack when you may only be
expecting single-stack IPv6.
There is also a drive-by fix for a `note` that wasn't being displayed due
to missing RST syntax.
Signed-off-by: Matthew Oliver <moliver@suse.com>
Fixes: https://tracker.ceph.com/issues/46845
Fixes: https://tracker.ceph.com/issues/39711
Yan, Zheng [Fri, 7 Aug 2020 15:58:19 +0000 (23:58 +0800)]
mds: distribute dirfrags for ephemeral distributed directory
Distribute dirfrags instead of the individual dir inodes inside the
ephemeral distributed dir. Distributing dirfrags can limit the number of
subtrees created by the ephemeral dist pin.
This patch also unifies the code that handles export pins and ephemeral pins.
Jason Dillaman [Mon, 5 Oct 2020 18:04:14 +0000 (14:04 -0400)]
librbd: support preprocessing source object data prior to deep-copy
Let object dispatch layers potentially mutate the data read from the
source image prior to issuing the actual deep-copy operations against
the destination image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The write-ops now only store write vs zero ops, and the type of
zero operation is not decided until the actual op is sent. This will
make the state machine compatible with the copyup process hook.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Thu, 24 Sep 2020 19:15:23 +0000 (15:15 -0400)]
librbd: support preprocessing parent data prior to copyup
Let object dispatch layers potentially mutate the copyup data read
from the parent prior to issuing the actual copyup operation. This
can allow for a layer like the crypto layer to re-encrypt the parent
image data using the child image's encryption keys, for example.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Wed, 23 Sep 2020 19:57:20 +0000 (15:57 -0400)]
librbd: new hook for pre-processing copyup data
This will permit the crypto layer to properly encrypt and potentially
align the sparse copyup data prior to it being written. It passes
potentially multiple sets of data in one pass to permit the deep-copy
state machine to utilize the same API and allow the crypto layer to
potentially handle layered alignment issues.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Wed, 30 Sep 2020 01:24:23 +0000 (21:24 -0400)]
librbd: rename SnapshotExtent to SparseExtent
The processing of copyup needs to be able to denote data extents that
are potentially zeroed or included in the associated bufferlist. By
renaming the type, it can be re-used for this second purpose.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 29 Sep 2020 00:35:38 +0000 (20:35 -0400)]
librbd: copyup state machine should always issue a sparse-read
When reading from the parent, always keep the data in a sparse
extent-map format. The forthcoming copyup preprocessing hook will
want to pass a set of sparse image-extent data.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 29 Sep 2020 00:04:48 +0000 (20:04 -0400)]
librbd: switch remaining uses of ExtentMap to Extents
The neorados API already requires the vector-based approach vs
the map-based approach. Now the remaining sparse-read functionality
has been switched to use the consistent approach.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Fri, 25 Sep 2020 14:40:32 +0000 (10:40 -0400)]
librbd: deep-copy should update object-map before writing to object
For the original use-case of RBD mirroring it was (maybe) more
acceptable to write to the object before updating the object map
because an interrupted sync will be retried. However, when using
the deep-copy object copy state machine as part of copyup, it's
more likely that the object-map has the potential to become
out-of-sync with reality if it's updated after the object is
written.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 13 Oct 2020 01:34:25 +0000 (21:34 -0400)]
librbd: update AioCompletion return value before evaluating pending count
If the pending count is decremented before the return value is updated,
there is a possibility of two ASIO threads concurrently decrementing the
pending count down from 2 -> 1 -> 0. In the second thread (the one that
performs the final decrement from 1 -> 0), it can finalize the completion
before the first thread has had a chance to update the return value.
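The ordering described above can be sketched as follows, in Python rather than the C++ of librbd. The `Completion` class and its members are hypothetical. The only point illustrated is that each thread publishes its result before it decrements the pending count, so the thread that reaches zero always sees every result.

```python
import threading


class Completion:
    """Hypothetical stand-in for a completion shared by several requests."""

    def __init__(self, pending):
        self._lock = threading.Lock()
        self._pending = pending
        self.rval = 0
        self.finalized_rval = None

    def complete_request(self, r):
        # Step 1: publish the return value first (keep the first error).
        if r < 0:
            with self._lock:
                if self.rval == 0:
                    self.rval = r
        # Step 2: only then drop the pending count.
        with self._lock:
            self._pending -= 1
            last = self._pending == 0
        # Whoever performs the final decrement finalizes the completion.
        if last:
            self.finalized_rval = self.rval


completion = Completion(pending=2)
threads = [
    threading.Thread(target=completion.complete_request, args=(r,))
    for r in (-5, 0)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the two steps were swapped, the thread finalizing the completion could read `rval` before the failing thread had stored its error.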
Fixes: https://tracker.ceph.com/issues/47847
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Venky Shankar [Fri, 9 Oct 2020 11:06:45 +0000 (07:06 -0400)]
tests/pybind/cephfs: cleanup xattrs before starting tests
Some xattr tests do not fully clean up the xattrs they set. Subsequent
tests may expect xattrs to be absent, for example when setting an xattr,
removing it, and then checking that the xattr list is empty. Such a test
may fail if earlier tests do not clean up xattrs, especially on root.
So, clean up xattrs on root before starting tests. Other directories
are removed anyway, so we do not have to bother about those.
Venky Shankar [Tue, 25 Aug 2020 01:48:53 +0000 (21:48 -0400)]
mds: restrict setting/removing certain xattrs in ceph namespace
Since not all ceph.* xattrs need to be virtual (stored in the inode
structure), allow only certain xattrs (ceph.mirror.info) to be
persisted in xattr_map. Other ceph.* xattrs which do not pass
the virtual xattr check are rejected.
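A rough sketch of the resulting classification, in Python rather than the MDS's C++. The two sets are illustrative and non-exhaustive, the body of `is_ceph_vxattr` is an assumption, and `classify_xattr` with its return strings is hypothetical.

```python
# Illustrative, non-exhaustive names treated as virtual in this sketch.
VIRTUAL_XATTRS = {"ceph.dir.layout", "ceph.quota.max_bytes"}
# ceph.* names that may be persisted in xattr_map, per the commit message.
PERSISTED_CEPH_XATTRS = {"ceph.mirror.info"}


def is_ceph_vxattr(name):
    """Assumed check: does the name map to a field of the inode structure?"""
    return name in VIRTUAL_XATTRS


def classify_xattr(name):
    """Hypothetical helper: decide how a set/remove request is treated."""
    if not name.startswith("ceph."):
        return "regular"    # outside the ceph namespace
    if is_ceph_vxattr(name):
        return "virtual"    # backed by the inode structure
    if name in PERSISTED_CEPH_XATTRS:
        return "persisted"  # stored in xattr_map
    return "rejected"       # unknown ceph.* name


for name in ("user.comment", "ceph.dir.layout", "ceph.mirror.info", "ceph.bogus"):
    print(name, "->", classify_xattr(name))
```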
Venky Shankar [Wed, 26 Aug 2020 12:55:51 +0000 (08:55 -0400)]
mds: introduce ceph.mirror.info virtual xattr
This is a compound xattr with the xattr value being fixed in
format. MDS stores this xattr as two (since the xattr value
right now just has two components or compounds) separate entries
in the xattr_map. This is done to avoid bloating the xattr value
if more "compounds" are added.
You may ask, why do it this way rather than having the application
(cephfs-mirror daemon in this case) just set each xattr one
after the other? Well, we lose xattr consistency (from the
application point-of-view) -- an application crash (bug!) or
an ENOSPC in the server could leave M out of N xattrs (M < N)
on-disk.
With the compound xattr operation done on the server side,
journaling ensures that either all (N) xattrs are available or
none are available after recovering from a crash.
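The all-or-nothing behaviour can be sketched as follows, in Python rather than the MDS's C++. The value format (`key=value` pairs separated by spaces), the component names `cluster_id` and `fs_id`, and the function `apply_compound_xattr` are assumptions for illustration. The staged dictionary stands in for the journaled update that provides atomicity in the MDS.

```python
def apply_compound_xattr(xattr_map, prefix, value, expected_keys):
    """Split a compound value into separate entries, applying all or none."""
    staged = {}
    for part in value.split():
        key, sep, val = part.partition("=")
        if not sep or key not in expected_keys:
            raise ValueError(f"malformed component: {part!r}")
        staged[f"{prefix}.{key}"] = val
    if len(staged) != len(expected_keys):
        raise ValueError("missing components")
    # Single commit point: nothing is visible unless every component parsed.
    xattr_map.update(staged)


keys = {"cluster_id", "fs_id"}  # hypothetical component names
xattrs = {}
apply_compound_xattr(xattrs, "ceph.mirror.info", "cluster_id=abc fs_id=7", keys)

untouched = {}
try:
    apply_compound_xattr(untouched, "ceph.mirror.info", "cluster_id=abc", keys)
except ValueError:
    pass  # the incomplete value is rejected and no entry is written
```

Setting each entry with a separate client call would have no such commit point, which is the M out of N problem described above.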
Venky Shankar [Tue, 25 Aug 2020 01:26:12 +0000 (21:26 -0400)]
mds: introduce is_ceph_vxattr() helper
Not all ceph.* xattrs are virtual -- virtual in the sense
that such xattrs have entries in the inode structure (inode_t)
rather than in xattr_map. There could be cases where an xattr
is in ceph namespace but does not necessarily need to be stored
for each inode -- so filter such xattrs individually rather than
treating all ceph.* xattrs as virtual.