Kotresh HR [Mon, 10 Mar 2025 17:53:32 +0000 (23:23 +0530)]
mds: Load all referent inodes during path traverse
Path traverse/lookup is expected to look up all the referent inodes
of a hardlink. Any operation involving a hardlinked
inode needs to construct the SnapContext, which contains the snapshots
of all the hardlinks. This is done by accessing the snaprealms
of all the hardlinks using the referent inodes. Hence, the
lookup of any hardlink is always expected to look up
all the referent inodes.
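As a rough illustration of why every referent inode has to be loaded, here is a toy Python model, not MDS code. The class names and `build_snap_context` are hypothetical, and the sketch assumes that each referent inode gives access to the snaprealm of the directory holding that hardlink and that the resulting context lists snapshot IDs newest first.

```python
from dataclasses import dataclass, field


@dataclass
class SnapRealm:
    # Snapshot IDs visible in this realm (toy representation).
    snaps: set = field(default_factory=set)


@dataclass
class ReferentInode:
    ino: int
    realm: SnapRealm  # realm of the directory that holds this hardlink


@dataclass
class PrimaryInode:
    ino: int
    realm: SnapRealm
    referents: list = field(default_factory=list)


def build_snap_context(primary):
    """Union the snapshots of the primary and of every hardlink's realm."""
    snaps = set(primary.realm.snaps)
    for ref in primary.referents:
        snaps |= ref.realm.snaps
    ordered = sorted(snaps, reverse=True)  # newest first (assumed ordering)
    seq = ordered[0] if ordered else 0
    return seq, ordered
```

If one referent inode were missing from `primary.referents`, the snapshots of that hardlink's directory would silently drop out of the result, which is the situation the lookup rule is meant to avoid.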
Kotresh HR [Fri, 7 Mar 2025 12:06:33 +0000 (17:36 +0530)]
mon: Make use_global_snaprealm and allow_referent_inodes dependent
We can't disable using global snaprealm without enabling
referent inodes feature and we can't disable referent inodes
without disabling global snaprealm. The patch adds this dependency.
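One possible reading of this dependency, sketched in Python: the only combination that is rejected is the global snaprealm being off while referent inodes are also off. The function name and the error text are hypothetical and this is not the monitor's actual check.

```python
def validate_fs_flags(use_global_snaprealm: bool,
                      allow_referent_inodes: bool) -> None:
    """Reject the one flag combination that would leave hardlinks
    without any way to find their snapshots (assumed invariant)."""
    if not use_global_snaprealm and not allow_referent_inodes:
        raise ValueError(
            "global snaprealm can only be disabled while "
            "referent inodes are enabled"
        )
```

Under this reading, both "disable use_global_snaprealm" and "disable allow_referent_inodes" are validated against the state the other flag would be left in.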
Kotresh HR [Fri, 7 Mar 2025 10:36:42 +0000 (16:06 +0530)]
mds: Add fs option use_global_snaprealm
The fs option use_global_snaprealm is introduced
to handle the upgrade scenario of the referent inode feature [1].
This option is enabled by default, which preserves
the existing behaviour, i.e. snapshots use the
global snaprealm. On new filesystems, this option is used
along with 'allow_referent_inodes' to
disable the global snaprealm.
Kotresh HR [Tue, 4 Mar 2025 16:18:57 +0000 (21:48 +0530)]
mds: Fix MMDSCacheRejoin::dn_strong version check failure
The MMDSCacheRejoin::dn_strong structure inside the MMDSCacheRejoin
class is not versioned, but the referent inodes feature
requires adding a new variable to 'dn_strong'. Since
it's not versioned, the ceph-object-corpus encode/decode
test (src//readable.sh) fails because backward
compatibility is broken.
This patch fixes the versioning issue of MMDSCacheRejoin::dn_strong
by introducing a new struct MMDSCacheRejoin::dn_strong_new
which encapsulates 'dn_strong' and the new variable and is versioned.
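The general idea of wrapping an unversioned payload in a versioned envelope can be sketched as follows. This is a toy Python encoder, not Ceph's encoding code: the field layout (`ino`, `remote_ino`, `referent_ino`), the header format and the function names are all illustrative assumptions. The header carries a version, a compat version and a payload length, so a decoder can skip trailing fields it does not know about.

```python
import struct

HEADER = struct.Struct("<BBI")  # version, compat version, payload length
BODY = struct.Struct("<QQQ")    # illustrative fields: ino, remote_ino, referent_ino


def encode_dn_strong_new(ino, remote_ino, referent_ino,
                         version=1, compat=1, extra=b""):
    # 'extra' stands in for fields a future version might append.
    payload = BODY.pack(ino, remote_ino, referent_ino) + extra
    return HEADER.pack(version, compat, len(payload)) + payload


def decode_dn_strong_new(buf, supported=1):
    version, compat, length = HEADER.unpack_from(buf, 0)
    if compat > supported:
        raise ValueError("encoding is too new for this decoder")
    end = HEADER.size + length
    if end > len(buf) or length < BODY.size:
        raise ValueError("truncated or malformed buffer")
    ino, remote_ino, referent_ino = BODY.unpack_from(buf, HEADER.size)
    fields = {"ino": ino, "remote_ino": remote_ino,
              "referent_ino": referent_ino}
    # Return the end offset so unknown trailing fields are skipped.
    return fields, end
```

Because the length is recorded up front, a buffer written by a newer encoder with extra trailing fields still decodes, as long as its compat version is one the decoder supports.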
Kotresh HR [Wed, 5 Mar 2025 11:07:51 +0000 (16:37 +0530)]
mds: Fix mdcache rejoin invented referent inode
During mds cache rejoin, the referent inode could
be invented and marked CInode::STATE_REJOINUNDEF.
The same inode is fetched from the disk later and
the state should be cleared. This was missing for
the referent inode and hence was causing the following
crash.
Kotresh HR [Sun, 2 Mar 2025 09:07:30 +0000 (14:37 +0530)]
qa: referent inodes - unlink, stray_reintegration
The following tests are borrowed from the existing test_strays,
with the stray perf counter numbers adjusted for referent inodes.
Also added validation of the referent_inodes list.
Also added an unlink test and an ALLOW_REFERENT_INODES flag
to cephfs_test_case. If the ALLOW_REFERENT_INODES flag is
set in any test class, referent inodes are enabled for all
the tests in the class.
The stray perf counter validation in the above tests doesn't match
when referent inodes are enabled, because a referent inode becomes a stray
whenever a hardlink is deleted or a stray is reintegrated, so referent
inodes are disabled for those tests. The same tests for referent inodes are added at
qa/tasks/cephfs/test_referent.py, adjusting the perf counters for
referent inodes.
Kotresh HR [Sat, 1 Mar 2025 19:09:41 +0000 (00:39 +0530)]
qa: Add function to fetch inode from metadata pool
Add a function to fetch the inode from the metadata
pool. This is done by fetching the omap value from
the directory object and decoding it to the in-memory
inode format using ceph-dencoder.
KotreshHR [Tue, 9 May 2023 03:12:40 +0000 (08:42 +0530)]
tools/ceph-dencoder: Add an option stray_okay
The ceph-dencoder tool fails to decode if
stray data is present at the end. Add an option
'stray_okay' to force the decode even if stray
data is present at the end.
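The behaviour described here can be illustrated with a small Python decoder. This is only a sketch of the concept, not ceph-dencoder's implementation; `decode_u32_pair` and its fixed two-integer layout are hypothetical.

```python
import struct

PAIR = struct.Struct("<II")  # toy layout: two little-endian 32-bit integers


def decode_u32_pair(buf: bytes, stray_okay: bool = False):
    a, b = PAIR.unpack_from(buf, 0)
    leftover = len(buf) - PAIR.size
    # By default, bytes left over after decoding are treated as an error.
    if leftover and not stray_okay:
        raise ValueError(f"stray data at end of buffer: {leftover} bytes")
    return a, b
```

With `stray_okay=True` the trailing bytes are ignored and the decoded fields are returned anyway.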
Validates that the 'cephfs-data-scan' tool recovers
hardlink during disaster recovery of metadata
pool from the data pool correctly as hardlink with
referent inode.
Kotresh HR [Thu, 27 Feb 2025 19:24:43 +0000 (00:54 +0530)]
qa/test_backtrace: Validate remote_inode xattr is stored
Validate 'remote_inode' xattr is stored on the
referent inode upon hardlink creation if the
referent inodes feature is enabled. The value
of the xattr would be the inode number of the
primary inode.
Also validate that the referent inode is not
created on hardlink creation if the feature
is disabled.
Kotresh HR [Thu, 27 Feb 2025 18:44:18 +0000 (00:14 +0530)]
tools/cephfs-data-scan: Recover referent_inode list
The referent_inode list on the primary inode needs to
be built on recovery. This PR adds the capability to the
'scan_links' sub command of the cephfs-data-scan tool
for the same.
The 'remote_inode' xattr is stored on the referent inode in
data pool. This patch uses this xattr to build the hardlink
dentry as part of the sub command 'scan_inodes' of the
cephfs-data-scan.
Kotresh HR [Thu, 27 Feb 2025 14:28:24 +0000 (19:58 +0530)]
mds: Fix straydn race between unlink/rename linkmerge
Handle race between unlink of secondary hardlink (say hl_file1) and
linkmerge rename, triggered by unlink of primary (say file1) on to
the same secondary hardlink (hl_file1).
1. ln file1 hl_file1
2. unlink hl_file1
3. prepares straydn for referent inode
4. wait for the locks
Kotresh HR [Wed, 26 Feb 2025 12:27:32 +0000 (17:57 +0530)]
mds/rename: Handle referent inode rollback
If the destination dentry exists and is referent
remote, then the rollback should take care of
restoring the referent inode and adding the
referent inode number back to the referent_inode
list of the primary/real inode. This patch takes
care of the same.
Kotresh HR [Wed, 26 Feb 2025 11:06:45 +0000 (16:36 +0530)]
mds/rename: Handle source dentry being referent remote
This code handles the rename case where the source
dentry is referent remote. It's very similar to
source dentry being primary. In multi-mds case,
this involves exporting/importing the source referent
inode based on the auth of the srcdnl and destdnl.
This can also happen without the linkmerge case.
A different file can be renamed to the existing
referent remote dentry. This patch handles the same.
Also, as part of it, it removes the corresponding
referent inode from the list of primary/real inode.
Kotresh HR [Tue, 4 Mar 2025 03:40:53 +0000 (09:10 +0530)]
mds/rename: Handle referent remote linkmerge case
For a file having hardlinks, when the primary file is
removed before the secondary, stray reintegration
makes any secondary link available in the cache the
primary. This is referred to as linkmerge.
When the linkmerge happens, the referent inode of the
secondary needs to be removed, otherwise the
referent inode object leaks in the data pool. This PR
handles the linkmerge scenario and removes the referent
inode of the secondary during the linkmerge.
Also, this patch does the following.
- removes the corresponding referent inode from the
referent_inodes list of primary/real inode
- adds is_referent_remote() check in required places
for linkmerge in the rename path.
Kotresh HR [Tue, 25 Feb 2025 12:31:34 +0000 (18:01 +0530)]
multi-mds/unlink: Handle rollback of referent_inodes list
The referent_inodes list needs to be rolled back if the
unlink operation involving multi-mds fails due to mds
going down. This patch takes care of the same.
Kotresh HR [Tue, 25 Feb 2025 12:27:41 +0000 (17:57 +0530)]
multi-mds/unlink: Referent inode - reverse link mgmt
On unlink of the secondary hardlink with referent
inode, remove the referent inode from the referent_inodes
list of the primary/real inode. This patch handles the
multi-mds scenario where the auth of the primary/real
inode is different from the dentry auth of the secondary
hardlink (also the referent inode auth)
Kotresh HR [Tue, 25 Feb 2025 11:23:39 +0000 (16:53 +0530)]
multi-mds/unlink: Unlink referent inode on dentry replicas
If the dentry being deleted is a secondary hardlink with a
referent inode and is replicated, the MDSs holding dentry replicas
get notified to unlink the inode. The referent inode
needs to be unlinked in this case. This patch takes care
of handling the same in 'handle_dentry_unlink'.
Also, the straydn is created with remote referent to
remove the referent inode. The straydn needs to be
sent to straydn replicas and dentry replicas of hardlink
which gets handled in 'handle_dentry_unlink'. This patch
contains this change as well.
Kotresh HR [Tue, 25 Feb 2025 10:52:07 +0000 (16:22 +0530)]
multi-mds/unlink: Remove referent inode on unlink
On the unlink of the secondary hardlink with the referent
inode, remove the referent inode. This patch takes
care of multi-mds scenario where the primary/real
inode auth is a different mds than the secondary dentry
auth (also the referent inode auth).
Kotresh HR [Tue, 25 Feb 2025 07:36:23 +0000 (13:06 +0530)]
mds/unlink: Remove referent inode on unlink
The referent inode needs to be removed as part
of the unlink operation on the secondary hardlink.
So prepare a straydn for the referent inode
and remove it. This patch takes care of the single
mds case.
Kotresh HR [Mon, 24 Feb 2025 21:22:42 +0000 (02:52 +0530)]
multimds: Consistent view of referent inode list
Encode/Decode the referent inode list with the CEPH_LOCK_ILINK
lock along with ctime and nlink. This gives all the inode's
replica MDSs a consistent view of the referent inode list.
Kotresh HR [Mon, 24 Feb 2025 21:02:27 +0000 (02:32 +0530)]
multi-mds/link: Handle rollback for referent_inodes list
The referent inode added to the list on the primary/real
inode upon hardlink creation needs to be rolled back properly
on failure. This patch takes care of the same.
Kotresh HR [Mon, 24 Feb 2025 10:32:36 +0000 (16:02 +0530)]
multi-mds/link: Reverse link primary inode to hardlink
Reverse link primary/real inode to the hardlink using
the referent inode. This patch takes care of link with
multiple active MDSs. So the primary inode would
be auth on one mds and the new hardlink dentry being
created would be auth on a different mds.
This is done as below.
1. Primary inode (CInode) maintains the referent_inodes list
2. Upon referent inode creation on auth mds of dentry being
created, the referent inode number is sent to auth mds of
the primary/real inode using MMDSPeerRequest::OP_LINKPREP.
The auth mds of primary/real inode adds it to the
referent_inodes list
Kotresh HR [Mon, 24 Feb 2025 11:07:05 +0000 (16:37 +0530)]
multi-mds/link: Send referent inode to dentry_replicas
On hardlink creation with multiple mdses, the
hardlink dentry being created could already exist.
In such cases, the change in inode is notified to
dentry replicas using MDCache::send_dentry_link.
If it's a referent remote, send the referent inode
to the dentry replicas and link it correctly.
Kotresh HR [Mon, 24 Feb 2025 08:12:44 +0000 (13:42 +0530)]
multi-mds/link: Create referent inode and store backtrace
On hardlink creation, create a referent inode (CInode)
and store backtrace for the hardlink. This patch
takes care of multiple active mds. So the primary inode
would be auth on one mds and the new hardlink dentry
being created would be auth on different mds.
Kotresh HR [Mon, 24 Feb 2025 05:54:19 +0000 (11:24 +0530)]
mds/link: Reverse link primary inode to hardlink
Reverse link primary/real inode to the hardlink using
the referent inode. This patch takes care of single
mds link. This is done as below.
1. Primary inode (CInode) maintains the referent_inodes list
2. Upon referent inode creation, the same is added to the
referent_inodes list on the primary/real inode.
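The two steps above can be sketched as a small in-memory model. This is a toy Python illustration under stated assumptions, not the MDS implementation: the `Inode` class and the `link`/`unlink` helpers are hypothetical, and `unlink` mirrors what the unlink commits in this series describe (removing the referent inode number from the list on the primary).

```python
class Inode:
    def __init__(self, ino, remote_ino=None):
        self.ino = ino
        self.remote_ino = remote_ino  # set only on referent inodes
        self.referent_inodes = []     # maintained only on the primary inode
        self.nlink = 1


def link(primary, new_ino):
    """Create a referent inode for a new hardlink and reverse link it."""
    referent = Inode(new_ino, remote_ino=primary.ino)
    primary.referent_inodes.append(referent.ino)
    primary.nlink += 1
    return referent


def unlink(primary, referent):
    """Drop the hardlink: remove the reverse link from the primary."""
    primary.referent_inodes.remove(referent.ino)
    primary.nlink -= 1
```

The result is a two-way association: each referent inode points at the primary through `remote_ino`, and the primary can enumerate all of its hardlinks through `referent_inodes`.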
Kotresh HR [Fri, 21 Feb 2025 13:48:57 +0000 (19:18 +0530)]
mds/rejoin: Don't fetch the dir if already complete during rejoin
In MDCache::open_undef_inodes_dirfrags, it can happen that
the same dir is added twice to the fetch queue, causing it to be fetched
twice and to hit the assert that the dir is not complete on the
second fetch. So don't fetch if the dir is complete.
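A minimal sketch of the guard, assuming a synchronous toy model (real fetches in the MDS are asynchronous). The `Dir` class and `fetch_all` are hypothetical names used only for illustration.

```python
class Dir:
    def __init__(self, name):
        self.name = name
        self.complete = False
        self.fetches = 0

    def fetch(self):
        # Stands in for the assert that a dir being fetched is not complete.
        if self.complete:
            raise RuntimeError(f"fetch on already complete dir {self.name}")
        self.fetches += 1
        self.complete = True


def fetch_all(queue):
    for d in queue:
        if d.complete:
            continue  # queued twice; the first fetch already completed it
        d.fetch()
```

With the `complete` check in place, queueing the same directory twice results in a single fetch instead of a failed assertion.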
Kotresh HR [Fri, 21 Feb 2025 12:34:03 +0000 (18:04 +0530)]
mds/rejoin: Handle referent inode on MDSCache rejoin
This broadly involves the following changes.
1. Add 'referent_ino' in the struct 'dn_strong' and required
encoding/decoding of the same. Noticed that the
MMDSCacheRejoin message actually isn't versioned yet.
There is a tracker [1] open for it. For now, CEPH_MDS_PROTOCOL
is bumped up as usual.
2. The following functions need a change to construct the
in-memory referent inode from the inode number.
MDCache::rejoin_walk
- add_strong_dentry, pass referent inode number to build dn_strong
MDCache::handle_cache_rejoin_strong
- Construct referent inode from inode number if not found in memory and add_remote_dentry
MDCache::handle_cache_rejoin_ack
- Bad linkage check!, construct referent inode
MDCache::rejoin_send_acks
- add_strong_dentry, pass referent inode number to build dn_strong
MDCache::handle_cache_rejoin_weak
- add_strong_dentry, pass referent inode number as 0 to build dn_strong as it's weak rejoin
Kotresh HR [Wed, 19 Feb 2025 19:12:57 +0000 (00:42 +0530)]
mds/journal: Replay referent remote dentry
Add capability to EMetaBlob::replay to replay
the referent remote dentry journal. The replay
would take care of both remote dentry and
referent remote dentry.
Kotresh HR [Wed, 19 Feb 2025 18:48:12 +0000 (00:18 +0530)]
mds/journal: Journal referent remote dentry
Add machinery to journal the referent remote dentry.
The call to metablob's add_remote_dentry is just
adding remote dentry and not referent remote dentry
yet. This will be fixed as part of operations like
link, unlink, rename which makes use of referent
inodes.
Kotresh HR [Wed, 19 Feb 2025 16:44:30 +0000 (22:14 +0530)]
mds: Make referent inodes an optional feature
Hardlink referent inode plumbing work continues.
Add a fs option 'allow_referent_inodes'. Enabling the option
allows the creation of referent inodes for the hardlinks
to store backtrace and vice versa. The option is disabled
by default so that it doesn't affect existing filesystems.
The option is introduced to handle upgrade, as there is no
easy way to trace all the hardlinks of a file on existing
file systems to create referent inodes and update backtrace
online. So the feature is enabled only on new filesystems.
Kotresh HR [Tue, 18 Feb 2025 18:49:14 +0000 (00:19 +0530)]
mds: Commit referent inode to disk
Hardlink referent inode plumbing work continues.
This patch adds the capability to commit the
referent inode created (would be part of hardlink
operation) to the disk.
The referent inode is marked with 'r'. If the
recovery tools recover the referent inode,
it's marked with 'R'
Kotresh HR [Tue, 4 Mar 2025 06:19:46 +0000 (11:49 +0530)]
mds: Store list of hardlinks on the inode of primary link
Hardlink referent inode plumbing continues. In order to
solve the snapshot hardlink lookup problem, i.e., to be
able to identify all the hardlinks from a given file,
we need to maintain the list of referent inodes
associated with the hardlinks on the primary link. Now we
have a two-way link between the primary link and the corresponding
hardlinks, i.e., the hardlink points to the primary link and the
primary link points to all the hardlinks.
Kotresh HR [Tue, 4 Mar 2025 03:31:57 +0000 (09:01 +0530)]
mds: Store remote inode number in referent inode
Hardlink referent inode plumbing continues.
The remote inode information contains the following.
1. remote_ino - remote inode number
2. d_type
3. alternate_name
With the introduction of the referent inode as a full-blown
CInode for the hardlink, this information needs
to be part of the inode. Hence add a 'remote_ino' field
to the inode. The existing 'd_type' and 'alternate_name'
fields could be used.
Kotresh HR [Tue, 18 Feb 2025 11:17:53 +0000 (16:47 +0530)]
mds: Hardlink referent inode plumbing work
The linkage_t struct changes:
- Add referent_ino(inodeno_t) and referent_inode(CInode *) to the
linkage_t. These two fields are the core and identify the
linkage as referent remote.
- Add 'is_referent_remote()' to check if it's referent remote
and modify 'is_remote()', 'is_null()' check accordingly.
- Add functions to access these fields.
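To make the linkage states concrete, here is a Python sketch of how the predicates might relate to one another. The exact conditions are an assumption (the commit message does not spell them out): this sketch treats null, primary, remote and referent remote as mutually exclusive. `Linkage` is a hypothetical stand-in for linkage_t.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Linkage:
    inode: Optional[object] = None           # primary inode, if any
    remote_ino: int = 0                      # inode number of the remote target
    referent_ino: int = 0                    # inode number of the referent inode
    referent_inode: Optional[object] = None  # in-memory referent inode

    def _has_referent(self) -> bool:
        return self.referent_ino != 0 or self.referent_inode is not None

    def is_primary(self) -> bool:
        return self.remote_ino == 0 and self.inode is not None

    def is_referent_remote(self) -> bool:
        return (self.remote_ino != 0 and self.inode is None
                and self._has_referent())

    def is_remote(self) -> bool:
        # Assumed: a plain remote linkage carries no referent information.
        return (self.remote_ino != 0 and self.inode is None
                and not self._has_referent())

    def is_null(self) -> bool:
        return (self.remote_ino == 0 and self.inode is None
                and not self._has_referent())
```

For example, `Linkage(remote_ino=5, referent_ino=9)` is referent remote but not remote under these assumptions, while `Linkage(remote_ino=5)` is a plain remote linkage.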
Add the following new functions to link referent inode.
1. CDentry::push_projected_linkage - with referent_inode
2. CDir::link_null_referent_inode
3. CDir::link_referent_inode
Modify the following existing functions to link/unlink referent inode
1. CDentry::link_remote
2. CDentry::unlink_remote
3. CDentry::pop_projected_linkage
4. CDentry::CDentry
5. CDir::add_remote_dentry
6. CDir::link_inode_work
7. CDir::unlink_inode_work
* refs/pull/61321/head:
qa: update require-osd-release to tentacle
tools/monmaptool: bump new cluster version to X
doc/dev/release-checklists: remove ceph-container task
script/ceph-release-notes: add squid/tentacle
doc/dev/release-checklists:: mark task complete
doc/dev/release-checklist: add nightlies task
doc/dev/release-checklists: update ceph-build for tentacle
doc/dev/release-checklists: note redmine is done
qa: update to tentacle
doc/dev/release-checklist: question telemetry tentacle test
osd/OSDMap: update to tentacle
qa/workunits/cephtool/test: update to tentacle
mon/OSDMonitor: update to tentacle
common/options/global.yaml.in: update for tentacle
mon/MgrMonitor: update for tentacle
qa/standalone/mon/misc: update for tentacle
doc: update compatset for tentacle
doc: no deprecated features
include/ceph_features: add SERVER_TENTACLE feature bit
cephadm,ceph-volume: update to tentacle
doc/dev/release-checklist: add backport-create-issue
script: update backport-resolve-issue to tentacle
*: add constants and release names
ceph_release: update to tentacle
librbd: bump version
CMakeLists.txt: update VERSION
doc: remove obsolete checklist item
doc: reset for tentacle
Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Reviewed-by: Bill Scales <bill_scales@uk.ibm.com>
Reviewed-by: Adam King <adking@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
* refs/pull/60746/head:
client: skip unexpected command replies
mgr: indicate map message is acked instead of unhandled
osdc/Objecter: convert to ms_dispatch2 for ack
client: indicate maps are acked not processed
msg: add alternate statuses for ms_dispatch2 handling
tools/cephfs_mirror: do not process maps with fast dispatch
doc: add docs for volumes interface for charmap
qa: add tests for subvolume charmap settings
pybind/mgr/volumes: wire up charmap for subvol/subvolgroup
pybind/mgr: send MDS commands through cephfs client
pybind/cephfs: wire up mds_command2
mgr: add module method to send notifications
libcephfs: add mds_command2 for asynchronous commands
mgr: excise CephFS client from mgr C++ base
mgr: use std namespace
doc: add docs for CephFS charmap config
qa: add charmap tests
qa: add helpful exceptions for attr changes
qa: ignore libicu leaks
client: add wrappings for charmap manipulation of dentry names
client: add dir_result_t::dentry::print
win32: add libicu Windows build
CMakeLists: add boost::locale dependency for client
install-deps: unconditionally install boost libraries
test/libcephfs: update root operation return values
client: refactor all path traversals through path_walk
test/libcephfs: test parallel creates
test/libcephfs: add test for lookup failure after readdir
client: init dentry shared_gen with invalid value
client: add _lookup debugging
client: remove redundant check
client: dump InodeStat from mds
mds: encode optmetadata in InodeStat sent to clients
mds: check client features for charmap
mds: add client feature bit for charmap
mds: wire up vxattr for changing charmap
mds: inherit charmap on mkdir
mds,include: add charmap optmetadata
mds,include: add inode_t optional metadata
client: hide alternate_name from API
client: move alternate_name once
client: optimize alternate_name passing to helper
client: relocate definition
client: print dentry with alternate_name on dump
client: move inode dump to print method
mds: add debugging for encoding lease stat
mds: make encode_lease a proper method
mds: add fscrypt metadata for inode stat size
client: use DentryRef for ref counting in MetaRequest
client: add DentryRef
client: add helper for determining if a perm check is necessary
client: cache client_permissions config
client: add debugging for conf changes
client: sort configs
client/UserPerm: add print method
client: note mount parameters in debug log
client: print stat mode in octal
common: add missing op string
include/filepath: add empty path check
Matan Breizman [Sun, 2 Mar 2025 08:42:45 +0000 (08:42 +0000)]
cmake/modules/BuildISAL.cmake: set no-integrated-as on clang only
This option is only relevant to clang; gcc will fail with:
```
CMake Error at ceph/build/src/erasure-code/isa/isal_ext-prefix/src/isal_ext-stamp/isal_ext-configure-Debug-impl.cmake:19 (message):
Command failed (77):
```
J. Eric Ivancich [Fri, 28 Feb 2025 19:22:53 +0000 (14:22 -0500)]
doc/rgw: update dynamic resharding docs to reflect recent changes
The documentation on dynamic resharding is updated to include a) a
description of reducing the number of shards, b) related configuration
options, and c) the radosgw-admin sub-command to set a minimum number
of shards for a specific bucket.
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
rgw: add radosgw-admin sub-command to set-min-shards for a bucket
There is now a mechanism to set the minimum number of shards when a
bucket is created, and dynamic resharding adheres to that
setting. This adds the ability to modify the minimum shard count that
exists within the bucket layout of the bucket instance
object. Example:
J. Eric Ivancich [Wed, 15 Jan 2025 16:26:59 +0000 (11:26 -0500)]
rgw: allow per-bucket minimum number of shards
Dynamic resharding can now reduce the number of shards. The code
currently has a hard-coded value of 11 as the minimum number of shards
dynamic resharding can reshard to. There may be cases where the user
wants to set an alternate minimum, such as when they have a sense of
how many objects the bucket will eventually hold.
This PR builds off of https://github.com/ceph/ceph/pull/61269 .
That PR allows the user to specify an initial number of shards during
bucket creation. This PR then takes that number to be the minimum and
saves it in the layout field of the bucket instance object
(RGWBucketInfo).
When dynamic resharding is triggered, it will use that stored value as
the minimum number of shards for resharding.
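The effect on the resharding target can be sketched as a simple clamp. This is an illustrative Python function, not RGW code: `clamp_reshard_target` is a hypothetical name, and falling back to the previously hard-coded 11 when no per-bucket minimum is stored is an assumption.

```python
from typing import Optional

DEFAULT_MIN_SHARDS = 11  # the hard-coded minimum mentioned above


def clamp_reshard_target(suggested: int,
                         bucket_min_shards: Optional[int] = None) -> int:
    """Never let dynamic resharding go below the bucket's minimum."""
    # Assumed fallback: use the old hard-coded value if none is stored.
    minimum = (bucket_min_shards if bucket_min_shards is not None
               else DEFAULT_MIN_SHARDS)
    return max(suggested, minimum)
```

For a bucket created with 50 initial shards, a shrink suggestion of 3 would be raised back to 50, while a growth suggestion of 100 passes through unchanged.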
Signed-off-by: J. Eric Ivancich <ivancich@redhat.com>
as the system monotonic clock is used when the container is used
in Scrub implementation, and on some kernels there are rare cases
where the monotonic clock can go backwards, we need to tolerate
such events.