osd: Set recovery specific Ceph options for mclock profiles to work.
Set and disable relevant recovery specific Ceph options for mclock
profiles to work as expected. Broadly,
- Set low value for recovery min cost
- High values for max active recoveries and max backfills
- Disable recovery sleep{_hdd, _ssd, _hybrid}
osd: Handle configuration changes to mclock config options
Handle configuration changes to the mclock cost per io, the max
capacity options and the mclock profile. Handle the case where the
profile is changed from a built-in profile to the custom profile.
osd: Add mclock profile infrastructure and implement mclock profiles
Define config options to specify the cost per io for an osd (hdd & ssd).
- osd_mclock_cost_per_io_msec
- osd_mclock_cost_per_io_msec_hdd
- osd_mclock_cost_per_io_msec_ssd
Define config options to set max osd capacity (hdd & ssd) to be allocated
between clients of dmclock namely,
- osd_mclock_max_capacity_iops
- osd_mclock_max_capacity_iops_hdd
- osd_mclock_max_capacity_iops_ssd
Define config option "osd_mclock_profile" to specify the built-in profile
to enable.
Also, set the number of op shards used in the osd within the mclock
scheduler. This is necessary to calculate the per shard limits
within the mclock scheduler.
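As a rough illustration of why the shard count matters, the sketch below
splits the configured OSD capacity evenly across op shards. The function
name, the even split, and the rule that a non-zero generic option overrides
the device-specific one are assumptions made for this example; only the
option names come from this commit.

```python
def per_shard_capacity_iops(conf, is_rotational, num_shards):
    """Split the OSD's max IOPS capacity evenly across op shards.

    Hypothetical helper: the even split is an assumed model, not the
    scheduler's actual code.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # Assumption: a non-zero generic option overrides the per-device one.
    capacity = conf.get("osd_mclock_max_capacity_iops", 0.0)
    if not capacity:
        key = ("osd_mclock_max_capacity_iops_hdd" if is_rotational
               else "osd_mclock_max_capacity_iops_ssd")
        capacity = conf[key]
    return capacity / num_shards


# Example values are made up for illustration.
example_conf = {
    "osd_mclock_max_capacity_iops_hdd": 300.0,
    "osd_mclock_max_capacity_iops_ssd": 20000.0,
}
hdd_share = per_shard_capacity_iops(example_conf, True, 5)
ssd_share = per_shard_capacity_iops(example_conf, False, 8)
```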
With the above information, enable the specified mclock profile by
calling the appropriate method to set the profile specific mclock
parameters and Ceph options.
Prior to enqueuing an op, the scheduler performs a calculation to scale
up or down the cost associated with the OpSchedulerItem. This calculation
is based on the existing item cost, the max osd capacity provided
and an additional cost factor based on underlying device type (hdd/ssd).
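The commit message does not give the formula, so the following is only a
hypothetical sketch of how the three inputs named above could be combined.
The function name, the 4 KiB reference I/O size, and the arithmetic itself
are assumptions for illustration, not the scheduler's implementation.

```python
def scale_item_cost(item_cost_bytes, max_capacity_iops, cost_per_io_msec,
                    ref_io_bytes=4096):
    """Hypothetical cost scaling; every step here is an assumed model."""
    if max_capacity_iops <= 0:
        raise ValueError("max_capacity_iops must be positive")
    # Express the item as a number of reference-sized I/Os (at least one).
    io_units = max(1.0, item_cost_bytes / ref_io_bytes)
    # Time budget per I/O implied by the configured max capacity.
    time_per_io_msec = 1000.0 / max_capacity_iops
    # Device-type factor: configured per-I/O cost relative to that budget.
    device_factor = cost_per_io_msec / time_per_io_msec
    # Never let the scaled cost drop below 1.
    return max(1, round(io_units * device_factor))
```

Under this model a larger item or a higher per-I/O cost scales the cost up,
while a lower configured capacity scales it down.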
Sunny Kumar [Fri, 15 Jan 2021 02:53:05 +0000 (02:53 +0000)]
osd: handle ceph specific config changes for the mclock scheduler
The Ceph options below are set automatically
when the mclock scheduler is enabled with a built-in profile.
In this case the user cannot modify these
options at runtime.
a. osd_async_recovery_min_cost
b. osd_recovery_max_active{_hdd,_ssd}
c. osd_max_backfills
d. osd_recovery_sleep{_hdd,_ssd,_hybrid}
If the custom profile is enabled for the mclock scheduler,
the user can modify these parameters.
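The rule above can be sketched as a simple guard. This is an illustrative
model only: the helper name is hypothetical, and treating the unsuffixed
osd_recovery_max_active and osd_recovery_sleep options as locked alongside
their suffixed variants is an assumption.

```python
# Options from the list above, with the {_hdd,_ssd,...} suffixes expanded.
# Assumption: the unsuffixed base options are locked as well.
LOCKED_OPTIONS = {
    "osd_async_recovery_min_cost",
    "osd_recovery_max_active",
    "osd_recovery_max_active_hdd",
    "osd_recovery_max_active_ssd",
    "osd_max_backfills",
    "osd_recovery_sleep",
    "osd_recovery_sleep_hdd",
    "osd_recovery_sleep_ssd",
    "osd_recovery_sleep_hybrid",
}


def can_modify_at_runtime(option, op_queue, mclock_profile):
    """Return True if `option` may be changed by the user at runtime."""
    if option not in LOCKED_OPTIONS:
        return True
    using_mclock = op_queue == "mclock_scheduler"
    # Only the custom profile leaves these options under user control.
    return not (using_mclock and mclock_profile != "custom")
```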
Patrick Donnelly [Mon, 11 Jan 2021 21:39:27 +0000 (13:39 -0800)]
client: wire up alternate_name
Here we're exposing a public Client::walk (aka path_walk) so that the
user can inspect dentries (not something normally possible in POSIX).
We're going to skip exposing such an interface in libcephfs since
there's no reason to do that (who would use it?) except for testing.
Instead, a follow-up PR will add Client tests (for the first time, yay!)
that will exercise this code.
Ideally we'd also expose alternate_name via readdir results but
that is a bit more complicated since dirents do not normally refer to
external memory. So, just rely on Client::walk for testing for now.
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Fri, 15 Jan 2021 03:55:43 +0000 (19:55 -0800)]
mds: fix alternate_name durability
This is a collection of fixes to Xiubo's prior work. Namely:
- Add new mds_alternate_name_max option to limit the size of
alternate_name. Otherwise a Client could trick the MDS into creating
an alternate_name of any size!
- Clean up how alternate_name is assigned to CDentry. In the general
case, this should be assigned as part of creating the dentry. We want
this value to be immutable for the life of the dentry, even in the
very special case of rename(2) where the destination dentry already
exists. We explicitly check (after discussion with Jeff) that the
target dentry alternate_name already matches what the rename RPC is
giving.
- The MDS is now properly journaling the alternate_name.
- The MDS rejoin phase is properly transmitting each dentry's
alternate_name. I've discovered that this MMDSCacheRejoin message
actually wasn't versioned, which I've raised in a tracker [1]. In the
meantime, we'll just bump CEPH_MDS_PROTOCOL as usual.
[1] https://tracker.ceph.com/issues/48886
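A minimal sketch of the first two checks described above: a size limit and
the rename-target match. The function names and the use of ValueError are
hypothetical; only the mds_alternate_name_max option name comes from this
commit.

```python
def validate_alternate_name(alternate_name: bytes, max_len: int) -> None:
    """Reject an alternate_name larger than the configured limit.

    `max_len` stands in for the value of mds_alternate_name_max.
    """
    if len(alternate_name) > max_len:
        raise ValueError(
            f"alternate_name is {len(alternate_name)} bytes, limit is {max_len}")


def rename_target_matches(existing: bytes, requested: bytes) -> bool:
    """For rename(2) onto an existing dentry, the stored alternate_name
    must already equal the one carried by the rename request."""
    return existing == requested
```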
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Xiubo Li [Mon, 21 Sep 2020 04:55:31 +0000 (12:55 +0800)]
mds: add alternate_name feature support for dentries
This adds "alternate_name" support for filenames and saves
an alternate name for each dentry. This alternate name is not used in
path lookup or for any other usual file system purpose. The name is
simply an added blob of metadata on the dentry that is distributed to
clients so that "long" file names may be supported for clients which
require them. In the case of an fscrypt enhanced kernel mount driver,
the long name may be the ciphertext (exceeding FILENAME_MAX) of a long
file name.
Because this affects only files with long file names, the use of this
feature should be rare but could be common for some unusual
applications.
The client mount should first check the CEPHFS_FEATURE_ALTERNATE_NAME
feature bit to determine whether the MDS supports this feature.
The alternate_name is transmitted as part of the message payload in
MClientRequest when setting the alternate_name. The LeaseStat structure
in MClientReply contains the alternate_name.
When executing a metadata mutation RPC, the client will set the
alternate_name (if it exists) as part of the operation. The MDS will
pick that up and set it on the new or mutated dentry.
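The client-side flow described above might look roughly like this. The bit
position, the dict-shaped request, and the choice to silently omit the
alternate_name when the MDS lacks the feature are all assumptions for
illustration; the real client builds an MClientRequest.

```python
# Hypothetical bit position, chosen only for this example.
CEPHFS_FEATURE_ALTERNATE_NAME = 1 << 15


def build_mutation_request(op, dentry_name, alternate_name, mds_features):
    """Attach alternate_name only if one exists and the MDS advertises support."""
    request = {"op": op, "name": dentry_name}
    mds_supports = bool(mds_features & CEPHFS_FEATURE_ALTERNATE_NAME)
    if alternate_name and mds_supports:
        request["alternate_name"] = alternate_name
    return request
```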
Fixes: https://tracker.ceph.com/issues/47162
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Xiubo Li [Wed, 23 Sep 2020 02:54:34 +0000 (10:54 +0800)]
mds: add new CDentry tags support
This will add new tag 'i' for inode and new tag 'l' for remote link
for the CDentry. And at the same time will add one proper dentry
version, which will be helpful to add new features/members in future
for the CDentry.
Fixes: https://tracker.ceph.com/issues/47162
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Adam Kupczyk [Thu, 24 Sep 2020 10:54:46 +0000 (12:54 +0200)]
BlueStore: Add block_cache logic to column family definition
Generalized the original onode column family cache into a generic feature.
Now 2 options are possible for each column family:
1) use the generic block cache but apply different BlockBasedTableOptions
2) create a separate block cache; it is applied to all shards of the column family
This is done by specifying the special option "block_cache" in the CF definition:
Example: O(3,0-13)=block_cache={high_ratio=1.000}
"block_cache" accepts all BlockBasedTableOptions options with additions:
-"type" - binned_lru/lru/clock (default: ceph_options.rocksdb_cache_type)
-"size" - e.g.: 100M (default: ceph_options.rocksdb_cache_size)
-"high_ratio" - e.g.: 0.75 (default: 0.0)
If any of the above is set, a new block cache is created; otherwise the default is used.
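The selection rule can be sketched as follows. This models only the
decision and the defaults listed above; the function name, the dict-based
inputs, and the "separate"/"shared" labels are hypothetical, and parsing of
the CF definition string is left out.

```python
CACHE_KEYS = ("type", "size", "high_ratio")


def resolve_block_cache(block_cache_opts, ceph_options):
    """Decide whether a column family gets its own block cache.

    Returns ("separate", settings) if any cache-specific key is present,
    otherwise ("shared", None), meaning the default cache is used.
    """
    if not any(key in block_cache_opts for key in CACHE_KEYS):
        return ("shared", None)
    settings = {
        "type": block_cache_opts.get("type", ceph_options["rocksdb_cache_type"]),
        "size": block_cache_opts.get("size", ceph_options["rocksdb_cache_size"]),
        "high_ratio": float(block_cache_opts.get("high_ratio", 0.0)),
    }
    return ("separate", settings)
```

With the example definition above, {"high_ratio": "1.000"} yields a separate
cache whose type and size fall back to the Ceph-wide options.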
Thomas Goirand [Fri, 15 Jan 2021 09:50:05 +0000 (10:50 +0100)]
common/ipaddr: Allow binding on lo
Commit 5cf0fa872231f4eaf8ce6565a04ed675ba5b689b solves the issue that
the osd can't restart after setting a virtual local loopback IP. However,
that commit also prevents setting up BGP-to-the-host over unnumbered IPv6
link-local, where OSDs are typically bound to the lo interface.
To solve this, this single-character patch checks against "lo:" to
match only virtual interfaces instead of anything that starts with "lo".
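The difference between the two prefix checks is easy to see in isolation.
The sketch below is illustrative only; the function names are hypothetical
and the real check lives in the C++ code under common/ipaddr.

```python
def matches_old_check(ifname: str) -> bool:
    """Previous behaviour: anything starting with "lo" is matched."""
    return ifname.startswith("lo")


def matches_new_check(ifname: str) -> bool:
    """New behaviour: only virtual loopback aliases such as "lo:0" match."""
    return ifname.startswith("lo:")
```

With the new check the plain "lo" interface is no longer matched, so an OSD
can bind to an address configured on it.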
Fixes: https://tracker.ceph.com/issues/48893
Signed-off-by: Thomas Goirand <zigo@debian.org>
Kefu Chai [Tue, 12 Jan 2021 14:24:42 +0000 (22:24 +0800)]
doc/_ext: parse prefix and args for command signature
unlike the commands defined by C++, the commands defined by python
now use the "prefix" and "args" properties of elements in the COMMAND class
attribute to define their command and arguments. the "cmd" property is
still available for the ceph CLI.
but the ceph_command sphinx extension should now use "prefix" and "args"
for printing out the usage of commands implemented using python. in
this change, the extension is updated to read "prefix" and "args"
properties of commands defined by python modules.
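a rough sketch of the idea, assuming a command entry is a dict with
"prefix" and "args" keys. the helper name and the sample entry are
hypothetical, and the real extension does more than join two strings.

```python
def command_usage(entry: dict) -> str:
    """Build a usage string from a python-defined command entry."""
    prefix = entry["prefix"]
    args = entry.get("args", "")
    # Join prefix and argument description, dropping the trailing space
    # when the command takes no arguments.
    return f"{prefix} {args}".strip()


# Made-up entry for illustration only.
sample = {"prefix": "hello", "args": "name=who,type=CephString"}
usage = command_usage(sample)
```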
peng jiaqi [Tue, 5 Jan 2021 03:15:26 +0000 (11:15 +0800)]
mgr: fix deadlock in ActivePyModules::get_osdmap()
In function "ActivePyModules::get_osdmap()", we do not read or write the
"ActivePyModules" object, so it is safe to drop the
"ActivePyModules::lock" lock there. This also avoids other threads waiting
on "ActivePyModules::lock".
Kefu Chai [Sun, 27 Dec 2020 14:42:33 +0000 (22:42 +0800)]
pybind/mgr/mgr_module: cast string to enum when collecting kwargs
as the parameters of handlers are properly typed, they expect
enum parameters. let's cast a string parameter to an enum if the callee
claims that it wants an enum.
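a minimal sketch of this kind of cast, driven by the handler's type
annotations. the Format enum, the handler, and collect_kwargs() are
hypothetical names for this example and not the mgr_module code.

```python
import enum
import inspect


class Format(enum.Enum):
    plain = "plain"
    json = "json"


def collect_kwargs(func, raw):
    """Cast string values to enums where the callee's annotation asks for one."""
    params = inspect.signature(func).parameters
    kwargs = {}
    for name, value in raw.items():
        annotation = params[name].annotation
        wants_enum = (isinstance(annotation, type)
                      and issubclass(annotation, enum.Enum))
        if wants_enum and isinstance(value, str):
            # Enum lookup by value, e.g. Format("json") -> Format.json
            value = annotation(value)
        kwargs[name] = value
    return kwargs


def handler(fmt: Format, name: str):
    return (fmt, name)
```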
Kefu Chai [Wed, 13 Jan 2021 11:28:48 +0000 (19:28 +0800)]
pybind/mgr/mgr_module: preserve the signature of wrapped func
before this change, the annotations of the func wrapped by
CLICheckNonemptyFileInput and returns_command_result are lost after being
wrapped with this decorator. after this change, the parameter annotations
are preserved. these annotations are used for generating the argdesc in a
later commit.
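for illustration, functools.wraps is one standard way to keep annotations
on a wrapped function. whether this commit uses it is not stated here, and
the decorator and handler below are hypothetical stand-ins.

```python
import functools
import inspect


def returns_result(func):
    """Hypothetical decorator that wraps the return value in a result tuple."""
    @functools.wraps(func)  # copies __annotations__ and sets __wrapped__
    def wrapper(*args, **kwargs):
        return (0, str(func(*args, **kwargs)), "")
    return wrapper


@returns_result
def pool_size(pool: str, detail: bool = False) -> int:
    return 3


# inspect.signature() follows __wrapped__, so the original parameters show up.
param_names = list(inspect.signature(pool_size).parameters)
```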
Kefu Chai [Fri, 15 Jan 2021 01:27:26 +0000 (09:27 +0800)]
mgr/cephadm: s/yield_fixture/fixture/
silences pytest warning. it complained:
1: cephadm/tests/fixtures.py:68
1: /var/ssd/ceph/src/pybind/mgr/cephadm/tests/fixtures.py:68: PytestDeprecationWarning: @pytest.yield_fixture is deprecated.
1: Use @pytest.fixture instead; they are the same.
1: @pytest.yield_fixture()
Patrick Donnelly [Thu, 14 Jan 2021 16:09:01 +0000 (08:09 -0800)]
Merge PR #34858 into master
* refs/pull/34858/head:
mds: do not allow the service to run on Windows
win32*.sh: fix SKIP_TESTS flag
include/win32/win32_errno.h: Add missing include
client: Port CephFS client to Windows
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>