Yaarit Hatuka [Tue, 7 Dec 2021 18:30:56 +0000 (18:30 +0000)]
mgr/telemetry: add command to list all collections
List all collections, their current enrollment state, status, default,
and description, with:
$ ceph telemetry collection ls
NAME                ENROLLED  STATUS  DEFAULT  DESC
basic_base          TRUE      ON      ON       Basic information about the cluster (capacity, number and type of daemons, version, etc.)
basic_mds_metadata  TRUE      ON      ON       MDS metadata
crash_base          TRUE      ON      ON       Information about daemon crashes (daemon type and version, backtrace, etc.)
device_base         TRUE      ON      ON       Information about device health metrics
ident_base          TRUE      OFF     OFF      User-provided identifying information about the cluster
perf_perf           TRUE      OFF     OFF      Information about performance counters of the cluster
Please note:
NAME:
=====
Collection name; prefix indicates the channel the collection belongs to.
ENROLLED:
=========
Signifies the collections that were available in the module when the
user last opted in to telemetry. Please note: even if a collection is
'enrolled', its metrics are reported only if its channel is enabled.
STATUS:
=======
Indicates whether the collection metrics are reported; this is
determined by the status (enabled / disabled) of the channel the
collection belongs to, along with the enrollment status of the
collection.
DEFAULT:
========
The default status (enabled / disabled) of the channel the collection
belongs to.
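The following minimal Python sketch shows how the STATUS column relates to the other columns. It assumes that STATUS is simply ENROLLED combined with the state of the channel named by the collection's prefix. The helper names and the channel states (read off the sample output above) are hypothetical. This is not the telemetry module's code.

```python
# Hypothetical illustration: derive the STATUS column of
# `ceph telemetry collection ls` from enrollment and channel state.

def channel_of(collection: str) -> str:
    """The prefix before the first underscore names the channel."""
    return collection.split("_", 1)[0]


def collection_status(collection: str, enrolled: set, channels: dict) -> bool:
    """Reported only if the collection is enrolled AND its channel is enabled."""
    return collection in enrolled and channels.get(channel_of(collection), False)


# Channel states consistent with the sample output above (assumed).
channels = {"basic": True, "crash": True, "device": True,
            "ident": False, "perf": False}
enrolled = {"basic_base", "basic_mds_metadata", "crash_base",
            "device_base", "ident_base", "perf_perf"}

for name in sorted(enrolled):
    status = "ON" if collection_status(name, enrolled, channels) else "OFF"
    print(f"{name:<20} {status}")
```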
Yaarit Hatuka [Tue, 23 Nov 2021 21:28:47 +0000 (21:28 +0000)]
mgr/telemetry: add preview-device and preview-all commands
`ceph telemetry show` shows a sample cluster report if the user is
opted in to telemetry. The report is composed of the collections the
user is opted in to. To preview a report composed of the most recent
collections available, use `ceph telemetry preview`.
The device channel is not included in the cluster report, since it is
sent to a different endpoint. Therefore, if the user is opted in to
telemetry and the device channel is enabled, the device report is shown
with `ceph telemetry show-device`. If not, it can still be previewed
with `ceph telemetry preview-device`.
If telemetry is on and the device channel is enabled, both reports can
be reviewed with `ceph telemetry show-all`; otherwise, use
`ceph telemetry preview-all`.
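The choice between the `show*` and `preview*` commands described above can be summarized as a small decision function. This hypothetical Python helper only encodes the rules in this message. It does not call Ceph.

```python
# Hypothetical helper encoding the rules above for choosing between the
# `show*` and `preview*` telemetry subcommands.

def report_command(opted_in: bool, device_enabled: bool, which: str) -> str:
    """Return the `ceph telemetry` subcommand for which in {"cluster", "device", "all"}."""
    if which == "cluster":
        return "show" if opted_in else "preview"
    if which == "device":
        # The device report goes to a separate endpoint, so it also
        # depends on the device channel being enabled.
        return "show-device" if opted_in and device_enabled else "preview-device"
    if which == "all":
        return "show-all" if opted_in and device_enabled else "preview-all"
    raise ValueError(f"unknown report type: {which!r}")


print("ceph telemetry", report_command(True, False, "all"))
```

`ceph telemetry preview` is also useful when opted in, because it shows a report built from the most recent collections rather than only the enrolled ones. The helper does not model that case.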
Yaarit Hatuka [Tue, 23 Nov 2021 17:11:38 +0000 (17:11 +0000)]
mgr/telemetry: add command to list all channels
List all channels, their current state, default, and description, with:
$ ceph telemetry channel ls
NAME    ENABLED  DEFAULT  DESC
basic   ON       ON       Share basic cluster information (size, version)
ident   OFF      OFF      Share a user-provided description and/or contact email for the cluster
crash   ON       ON       Share metadata about Ceph daemon crashes (version, stack traces, etc)
device  ON       ON       Share device health metrics (e.g., SMART data, minus potentially identifying info like serial numbers)
perf    ON       OFF      Share perf counter metrics summed across the whole cluster
Yaarit Hatuka [Tue, 23 Nov 2021 00:12:10 +0000 (00:12 +0000)]
mgr/telemetry: add commands to enable/disable channels
Currently, we enable/disable a telemetry channel via the CLI with:
`ceph config set mgr mgr/telemetry/channel_basic true`
`ceph config set mgr mgr/telemetry/channel_crash false`
We can now do this with:
`ceph telemetry enable channel basic`
`ceph telemetry disable channel crash`
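Scripts that still use the config-key form may find it helpful to see the two styles written out as argument lists. The helper names are hypothetical. The sketch only builds the commands shown above and does not execute them.

```python
# Illustration of the two equivalent ways, listed above, to toggle a channel.

def old_style(name: str, enable: bool) -> list:
    """Config-option form: ceph config set mgr mgr/telemetry/channel_<name> true|false"""
    value = "true" if enable else "false"
    return ["ceph", "config", "set", "mgr", f"mgr/telemetry/channel_{name}", value]


def new_style(name: str, enable: bool) -> list:
    """Dedicated command form: ceph telemetry enable|disable channel <name>"""
    verb = "enable" if enable else "disable"
    return ["ceph", "telemetry", verb, "channel", name]


for name, enable in [("basic", True), ("crash", False)]:
    print(" ".join(old_style(name, enable)), "  ==  ", " ".join(new_style(name, enable)))
```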
Yaarit Hatuka [Mon, 15 Nov 2021 16:53:59 +0000 (16:53 +0000)]
mgr/telemetry: introduce new design for adding new data
The current design requires increasing the telemetry revision each time
we add new data to the report. As a result, users need to re-opt-in to
telemetry. This new design makes it possible to add new data to the
report while letting users keep sending only what they already opted in
to; hence no re-opt-in is required. If users wish to report the new
data as well, they need to re-opt-in and enable any new channels.
Also, move the perf histogram formatting into a function, so it can be
used by both the `show` and `preview` commands.
Fix the get_report call in the dashboard to use get_report_locked.
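Below is a toy Python model of the enrollment behavior described above. It assumes that opting in enrolls exactly the collections available at that moment, and that a collection is reported only while it is enrolled. Channel gating is left out for brevity. The class is hypothetical and is not taken from the module.

```python
# Hypothetical model: opting in enrolls the collections available at that
# time; collections added later are not reported until the user re-opts in.

class Telemetry:
    def __init__(self, available):
        self.available = set(available)  # collections the module knows about
        self.enrolled = set()            # collections the user opted in to

    def opt_in(self):
        self.enrolled = set(self.available)

    def add_collection(self, name):
        # New data in a later release: no revision bump, no forced re-opt-in.
        self.available.add(name)

    def reported(self):
        return sorted(self.enrolled & self.available)


t = Telemetry({"basic_base", "crash_base"})
t.opt_in()
t.add_collection("basic_mds_metadata")
print(t.reported())  # ['basic_base', 'crash_base']
t.opt_in()           # re-opt-in to report the new data as well
print(t.reported())  # ['basic_base', 'basic_mds_metadata', 'crash_base']
```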
Laura Flores [Tue, 4 Jan 2022 22:54:33 +0000 (22:54 +0000)]
mgr/telemetry: add the rocksdb version number to telemetry
Capturing the RocksDB version number in Telemetry would allow us to check that users are running the appropriate RocksDB version for their Ceph cluster. For instance, if a user is running a Pacific cluster but their RocksDB version is meant for Nautilus, that might be a problem.
It is structured as "rocksdb_stats" --> "version" in anticipation of more stats that will be added under "rocksdb_stats".
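The nesting described above can be sketched as a Python dict. Only the `"rocksdb_stats"` object containing `"version"` comes from this message. The version value is a placeholder, and the fragment's position within the full report is not specified here.

```python
import json

# Hypothetical report fragment; "X.Y.Z" is a placeholder, not a real value.
fragment = {
    "rocksdb_stats": {
        "version": "X.Y.Z",
        # future RocksDB stats would be added as sibling keys of "version"
    }
}

# A consumer can read the version defensively, since older reports lack it.
version = fragment.get("rocksdb_stats", {}).get("version")
print(json.dumps(fragment, indent=2))
print(version)
```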
osd: Display scheduler specific info when dumping an OpSchedulerItem
Implement logic to dump information relevant to the scheduler type being
employed when dumping details about an OpSchedulerItem. For example, the
'priority' field is relevant for the 'wpq' scheduler, but for the
'mclock_scheduler', the 'qos_cost' field gives more information during
debugging.
Two additional fields, 'qos_cost' and 'is_qos_request', are introduced
in the OpSchedulerItem class. These are mainly used to facilitate
dumping of relevant information depending on the scheduler type. The
interesting points are when an item is enqueued and dequeued.
For the 'mclock_scheduler', the 'class_id' and the 'qos_cost' fields are
dumped during the enqueue and dequeue ops, respectively. For the 'wpq'
scheduler, things remain the same as before.
An additional benefit of this change is that the type of scheduler used
for a given shard can be identified immediately from what is dumped in
the debug messages.
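Here is a Python model of the dump behavior described above. The real OpSchedulerItem is C++. The field types, the sample values, and the assumption that `is_qos_request` is set for items queued under `mclock_scheduler` are illustrative guesses and do not come from the OSD code.

```python
from dataclasses import dataclass


# Hypothetical Python stand-in for the C++ OpSchedulerItem.
@dataclass
class OpSchedulerItem:
    priority: int
    class_id: int
    qos_cost: int
    is_qos_request: bool  # assumed: True when queued under mclock_scheduler


def dump_fields(item: OpSchedulerItem, event: str) -> dict:
    """Fields worth logging on "enqueue" or "dequeue", per scheduler type."""
    if item.is_qos_request:             # mclock_scheduler
        if event == "enqueue":
            return {"class_id": item.class_id}
        return {"qos_cost": item.qos_cost}
    return {"priority": item.priority}  # wpq: unchanged, priority is relevant


# Made-up sample values.
mclock_item = OpSchedulerItem(priority=63, class_id=2, qos_cost=4096, is_qos_request=True)
wpq_item = OpSchedulerItem(priority=63, class_id=0, qos_cost=0, is_qos_request=False)
print(dump_fields(mclock_item, "enqueue"), dump_fields(mclock_item, "dequeue"))
print(dump_fields(wpq_item, "enqueue"))
```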
Ilya Dryomov [Fri, 7 Jan 2022 12:31:08 +0000 (13:31 +0100)]
test/librbd: make diff-iterate clone tests exercise fast-diff mode
The fast-diff feature wasn't propagated to the clone, so these tests
were exercising the slow list_snaps path regardless of the RBD_FEATURES
value supplied to ceph_test_librbd.
Ilya Dryomov [Wed, 5 Jan 2022 19:24:40 +0000 (20:24 +0100)]
librbd: restore diff-iterate include_parent functionality in fast-diff mode
Commit 4429ed4f3f4c ("librbd: switch diff iterate API to use new snaps
list dispatch methods") removed the recursive execute() call. The new
list_snaps method does indeed handle parent diffs internally, but it is
not used in fast-diff mode. Nothing changed there -- we still need to
load the parent object map, calculate parent object_diff_state, etc.
yuval Lifshitz [Mon, 13 Dec 2021 18:56:20 +0000 (20:56 +0200)]
rgw/notifications: add cloudevents support to HTTP endpoint
Follow the CloudEvents HTTP protocol binding spec:
https://github.com/cloudevents/spec/blob/v1.0.1/http-protocol-binding.md
and, more specifically, the AWS S3 adapter spec:
https://github.com/cloudevents/spec/blob/main/cloudevents/adapters/aws-s3.md
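The sketch below shows how an S3-style notification record might map onto CloudEvents HTTP headers in binary content mode. The `ce-` header prefix and `specversion` "1.0" come from the HTTP protocol binding. The attribute mapping reflects our reading of the aws-s3 adapter and should be checked against that spec. The function and the sample record are hypothetical and are not RGW's implementation.

```python
# Sketch only: CloudEvents binary-mode HTTP headers for an S3 event record.
# The attribute mapping is our reading of the aws-s3 adapter, NOT RGW code.

def cloudevents_headers(record: dict) -> dict:
    s3 = record["s3"]
    resp = record["responseElements"]
    return {
        "ce-specversion": "1.0",
        "ce-type": "com.amazonaws.s3." + record["eventName"],
        "ce-source": ".".join([record["eventSource"], record["awsRegion"],
                               s3["bucket"]["name"]]),
        "ce-id": resp["x-amz-request-id"] + "." + resp["x-amz-id-2"],
        "ce-time": record["eventTime"],
        "ce-subject": s3["object"]["key"],
    }


# Made-up sample record, trimmed to the fields used above.
sample = {
    "eventName": "ObjectCreated:Put",
    "eventSource": "ceph:s3",
    "awsRegion": "default",
    "eventTime": "2021-12-13T18:56:20.000Z",
    "responseElements": {"x-amz-request-id": "req-1", "x-amz-id-2": "id2-1"},
    "s3": {"bucket": {"name": "mybucket"}, "object": {"key": "photo.jpg"}},
}

for name, value in cloudevents_headers(sample).items():
    print(f"{name}: {value}")
```

In binary mode, the event record itself would travel as the HTTP request body alongside these headers.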
Sage Weil [Thu, 6 Jan 2022 13:54:45 +0000 (08:54 -0500)]
Merge PR #44054 into master
* refs/pull/44054/head:
doc/rados/operations: document pg_num_max
mgr: set max of 32 pgs for .mgr pool
mgr/dashboard: expect pg_num_max property for pools
mon/OSDMonitor: add option --pg-num_max arg for create pool
mon/OSDMonitor: disallow setting pg_num < min or > max
mgr/pg_autoscaler: apply pg_num_max
mon: add pg_num_max pool property
Reviewed-by: Neha Ojha <nojha@redhat.com>
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
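This hypothetical Python sketch illustrates the pg_num bound checks and autoscaler clamping listed above. Treating 0 as "no bound" is an assumption. The real autoscaler also rounds to powers of two, which is omitted here.

```python
# Hypothetical sketch of the pg_num_min / pg_num_max rules.

def validate_pg_num(pg_num: int, pg_num_min: int = 0, pg_num_max: int = 0) -> None:
    """Reject pg_num outside [min, max]; 0 means the bound is unset (assumed)."""
    if pg_num_min and pg_num < pg_num_min:
        raise ValueError(f"pg_num {pg_num} < pg_num_min {pg_num_min}")
    if pg_num_max and pg_num > pg_num_max:
        raise ValueError(f"pg_num {pg_num} > pg_num_max {pg_num_max}")


def autoscale_target(ideal: int, pg_num_min: int = 0, pg_num_max: int = 0) -> int:
    """Clamp the autoscaler's ideal PG count to the pool's bounds."""
    if pg_num_min:
        ideal = max(ideal, pg_num_min)
    if pg_num_max:
        ideal = min(ideal, pg_num_max)
    return ideal


# Per this merge, the .mgr pool is capped at 32 PGs.
print(autoscale_target(128, pg_num_max=32))  # 32
```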
Ilya Dryomov [Tue, 4 Jan 2022 19:38:35 +0000 (20:38 +0100)]
librbd: diff-iterate reports incorrect offsets in fast-diff mode
If rbd_diff_iterate2() is called on an image offset that doesn't
correspond to an object boundary, the callback is invoked with an
incorrect image offset. For example, assuming a fully allocated
image, a diff request for 806354944~57344 results in offs=807403520,
len=57344, exists=true invocation, which is ahead by 1048576 bytes.
This occurs only in fast-diff mode: for a diff request on an image
with the fast-diff feature disabled, or with the whole_object parameter
set to false, the invocation is correct.
This bug goes back to the introduction of fast-diff mode in commit 6d5b969d4206 ("librbd: add diff_iterate2 to API").
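The following worked check uses the numbers above and assumes the default 4 MiB RBD object size, since the message does not state the object size. Under that assumption, the request starts 1 MiB into object 192, and the reported offset is ahead by exactly that in-object offset. This describes only the arithmetic; it does not claim to reproduce the faulty code path.

```python
# Worked check of the reported numbers, assuming a 4 MiB object size (order 22).
OBJECT_SIZE = 4 * 1024 * 1024

req_off, req_len = 806354944, 57344
reported_off = 807403520

obj_no, in_obj_off = divmod(req_off, OBJECT_SIZE)
print(obj_no, in_obj_off)      # 192 1048576
print(reported_off - req_off)  # 1048576

# The request fits inside a single object...
assert req_off + req_len <= (obj_no + 1) * OBJECT_SIZE
# ...and the error equals the in-object offset, i.e. the callback lands at
# object start + 2 * in-object offset instead of object start + in-object offset.
assert reported_off == obj_no * OBJECT_SIZE + 2 * in_obj_off
```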
The nfs upgrade renames the nfs spec from `nfs.ganesha-{service_id}`
to `nfs.{service_id}`. Previously, we relied on the orphan-daemon check
to remove the old `nfs.ganesha-{service_id}` daemons. This does not work
reliably, because serve() sometimes tries to deploy the new daemons
before the old ones are cleaned up, which causes a port conflict that
breaks the upgrade.
Fixes: https://tracker.ceph.com/issues/53424
Signed-off-by: Sebastian Wagner <sewagner@redhat.com>
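The ordering constraint implied above can be sketched as follows. The sketch derives the new service name, then removes the old daemons before deploying the new ones so that the two never compete for the same port. The function names and step tuples are hypothetical. This is not cephadm's code.

```python
# Hypothetical sketch of the rename-and-replace ordering for the nfs upgrade.

OLD_PREFIX = "nfs.ganesha-"


def new_service_name(old: str) -> str:
    """nfs.ganesha-{service_id} -> nfs.{service_id}"""
    if not old.startswith(OLD_PREFIX):
        raise ValueError(f"not an old-style nfs service name: {old!r}")
    return "nfs." + old[len(OLD_PREFIX):]


def upgrade_steps(old_services: list) -> list:
    """Remove each old service's daemons before deploying its renamed successor."""
    steps = []
    for old in old_services:
        steps.append(("remove_daemons", old))
        steps.append(("deploy", new_service_name(old)))
    return steps


for step in upgrade_steps(["nfs.ganesha-foo"]):
    print(step)
```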
John Bent [Wed, 5 Jan 2022 16:04:40 +0000 (09:04 -0700)]
README.md: Update README.md to add link to tracker.ceph.com
I searched the existing documentation for a link to the tracker and had a hard time finding it. Others like me might appreciate having it displayed prominently.
Added a link to https://tracker.ceph.com/projects/ceph, because the https://tracker.ceph.com/ landing page is mostly blank and it is not obvious how to get to the issues from there.