Ilya Dryomov [Thu, 3 Dec 2020 10:24:32 +0000 (11:24 +0100)]
qa: krbd_stable_pages_required.sh: move to stable_writes attribute
bdi/stable_pages_required attribute was deprecated in 5.10 and now
always returns 0. The replacement is queue/stable_writes. (It is
also writeable, so we can simplify these test cases somewhat in the
future.)
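A minimal sketch of how a check could read the new attribute from Python. Only the `queue/stable_writes` path comes from the message above; the helper name and the `sysfs_root` parameter are hypothetical and exist so the function can be pointed at a fake tree.

```python
from pathlib import Path


def stable_writes_enabled(dev: str, sysfs_root: str = "/sys/block") -> bool:
    """Return True if queue/stable_writes reads as 1 for the given device."""
    attr = Path(sysfs_root) / dev / "queue" / "stable_writes"
    # The attribute file holds "0" or "1" followed by a newline.
    return attr.read_text().strip() == "1"
```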
Sebastian Wagner [Fri, 15 Jan 2021 12:13:35 +0000 (13:13 +0100)]
mgr/cephadm: try again calling ceph-volume without --filter-for-batch
Fixes: https://tracker.ceph.com/issues/48870
This deals with a cephadm upgrade issue:
1. user calls `ceph orch upgrade`
2. mgr/cephadm calls `ceph orch config set mgr.x container_image <new-container>`
3. standby mgr gets upgraded
4. mgr failover to new mgr
5. mgr/cephadm calls `_refresh_host_devices`
6. `_refresh_host_devices` calls `ceph orch config get osd container_image`.
But this returns the old image
7. `_refresh_host_devices` calls `ceph-volume ... --filter-for-batch`
with an image that doesn't support `filter-for-batch`
The idea is to simply retry calling ceph-volume inventory without `--filter-for-batch`
(also removed `out` being used without being declared)
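The retry described above can be sketched roughly as follows. This is not the mgr/cephadm code: the `run` callable, the function name and the use of `subprocess.CalledProcessError` to signal failure are assumptions made for illustration.

```python
import subprocess
from typing import Callable, List


def run_inventory(run: Callable[[List[str]], str]) -> str:
    """Call ceph-volume inventory, retrying without --filter-for-batch."""
    base = ["ceph-volume", "inventory"]
    try:
        return run(base + ["--filter-for-batch"])
    except subprocess.CalledProcessError:
        # An older image may not know the flag; fall back to the plain call.
        return run(base)
```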
Sage Weil [Wed, 27 Jan 2021 21:44:21 +0000 (15:44 -0600)]
python-common: fix test_datetime_to_str_2 on non-UTC hosts
The old test parsed to a datetime without a tz, which was interpreted as
the local time zone when rendering back to a string. Specify that it's a
UTC datetime so that behavior is consistent regardless of the test host
timezone.
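A small illustration of the difference, using only the standard library. The timestamp is arbitrary and is not taken from the test in question.

```python
from datetime import datetime, timezone

# Parsing without a zone yields a naive datetime (tzinfo is None).
naive = datetime.strptime("2019-04-24T17:06:53.039991", "%Y-%m-%dT%H:%M:%S.%f")

# Attaching UTC explicitly makes later conversions independent of the host TZ;
# astimezone() on a naive value would assume the local zone instead.
aware = naive.replace(tzinfo=timezone.utc)

rendered = aware.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
```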
Kamoltat [Thu, 7 Jan 2021 15:39:19 +0000 (15:39 +0000)]
mgr/pg_autoscaler: avoid scale-down until there is pressure
The autoscaler will start out by scaling each
pool to have a full complement of pgs
and will only decrease it when other pools need more due to
increased usage.
Introduced a unit test that tests only the
function get_final_pg_target_and_ratio() which
deals with the distribution of pgs amongst the
pools
Edited workunit script to reflect the change
of how pgs are calculated and distributed.
Ilya Dryomov [Wed, 20 Jan 2021 15:00:18 +0000 (16:00 +0100)]
qa/suites/krbd: add msgr2 modes to most subsuites
basic, rbd and rbd-nomount subsuites are expanded to run with each
of ms_mode=legacy, ms_mode=crc and ms_mode=secure. This increases
the total number of jobs in the suite from 100 to 220.
fsx, singleton and thrash subsuites choose ms_mode at random (from
the above plus ms_mode=prefer-crc).
Ilya Dryomov [Mon, 18 Jan 2021 12:49:49 +0000 (13:49 +0100)]
krbd: add support for msgr2
Recognize ms_mode map option and filter initial monitor addresses
accordingly: if ms_mode is not given or ms_mode=legacy, discard v2
addresses, otherwise discard v1 addresses.
Note that nothing was discarded (i.e. v2 addresses were passed to
the kernel) previously. The intent was to preserve that behaviour
in case ms_mode is not given, so that the kernel default could be changed
in the future. However, it turns out that mount.ceph helper has
been misguidedly discarding v2 addresses since commit eae01275134e
("mount.ceph: fork a child to get info from local configuration"),
so that ship has sailed.
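The filtering rule from the first paragraph can be sketched like this. The real change lives in the krbd C++ code; the function name and the assumption that each address is a string prefixed with `v1:` or `v2:` are for illustration only.

```python
from typing import List, Optional


def filter_mon_addrs(addrs: List[str], ms_mode: Optional[str]) -> List[str]:
    """Keep v1 addresses for legacy or unset ms_mode, v2 addresses otherwise."""
    wanted = "v1:" if ms_mode in (None, "legacy") else "v2:"
    return [a for a in addrs if a.startswith(wanted)]
```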
Sage Weil [Tue, 19 Jan 2021 22:49:08 +0000 (16:49 -0600)]
mgr/cephadm: raise HEALTH_WARN when cephadm daemon in 'error' state
If cephadm daemons are not happy we should raise a warning. Aside from
being an important part of the user experience, this will also help us
catch teuthology test errors.
cephadm: silence "Failed to evict container" log msg
Right now, we're printing some evil looking messages in the log:
```
systemd[1]: Starting Ceph mgr.node2.ankmgz for ...
podman[32354]: Error: no container with name or ID ceph-... found: no such container
bash[32363]: Error: Failed to evict container: "": Failed to find container "ceph-..." in state: no container with name or ID ceph-... found: no such container
bash[32363]: Error: no container with ID or name "ceph-..." found: no such container
```
Also, the unit.run command already removes the container. No need
for ExecStartPre to do the same.
Conflicts:
qa/tasks/mgr/dashboard/helper.py
qa/tasks/mgr/dashboard/test_auth.py
src/pybind/mgr/dashboard/controllers/__init__.py
src/pybind/mgr/dashboard/controllers/auth.py
src/pybind/mgr/dashboard/controllers/saml2.py
src/pybind/mgr/dashboard/frontend/cypress/integration/orchestrator/01-hosts.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/orchestrator/02-hosts-inventory.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/orchestrator/03-inventory.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/orchestrator/04-osds.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/ui/language.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/ui/navigation.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/package-lock.json
src/pybind/mgr/dashboard/frontend/package.json
src/pybind/mgr/dashboard/frontend/src/app/app.module.ts
src/pybind/mgr/dashboard/frontend/src/app/core/navigation/dashboard-help/dashboard-help.component.ts
- Adopting the changes from the master branch, ignoring a few e2e changes
as a few files don't exist in octopus.
move all of the etag verifier initialization into a helper function.
none of the errors there should be fatal and fail the download, they
should just turn etag verification off
rgw: rgw_sync_obj_etag_verify accounts for compressed multipart uploads
the etag verifier for multipart uploads uses the manifest to get the
logical offsets for each part. but when compression is enabled, those
are offsets into the compressed data. use the source object's compression
info to translate those compressed part offsets back to their original
offsets
Prasad Krishnan [Fri, 6 Mar 2020 05:08:05 +0000 (05:08 +0000)]
[RGW][Multisite] Add multisite verifier support for MPU objects
The Etag for MPU objects is calculated using a method different from how
it is done for atomic objects. This patch makes use of the RGWObjManifest
to determine the parts in the source cluster and re-computes the ETag in
a similar fashion at the destination cluster during multisite sync for
verification.
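For readers unfamiliar with how multipart ETags differ, here is a sketch of the commonly used S3-style scheme: the MD5 of the concatenated per-part MD5 digests, followed by the part count. It is an illustration under that assumption, not the RGW implementation, and the function name is hypothetical.

```python
import hashlib
from typing import Iterable


def multipart_etag(parts: Iterable[bytes]) -> str:
    """Compute an S3-style multipart ETag from the raw data of each part."""
    # Hash every part on its own, keeping the binary digests.
    digests = [hashlib.md5(p).digest() for p in parts]
    # Hash the concatenation of those digests and append the part count.
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"
```

Because the result depends on where the part boundaries fall, the verifier needs the part layout from the manifest rather than just the object data.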
Prasad Krishnan [Sun, 23 Feb 2020 06:09:49 +0000 (11:39 +0530)]
RGW:Multisite: Verify if the synced object is identical to source
Introduce an option 'rgw_copy_verify_object' which verifies that the object
copied from a remote cluster through multisite sync is identical to the
source object. This is done by generating the MD5 checksum of the data
being copied and comparing it to the ETag stored as part of the object's
attributes.
Venky Shankar [Fri, 21 Aug 2020 14:07:37 +0000 (10:07 -0400)]
mgr/volumes: maintain per subvolume trash directory
PR https://github.com/ceph/ceph/pull/36472 introduces changes
that disallow nested snapshots in a subtree (subvolume)
and renames across subvolumes. This affects asynchronous purge
in mgr/volumes as subvolumes are moved to a trash directory for
asynchronous deletion by purge threads.
To work around this, start maintaining a subvolume specific
trash directory. Use the trash directory as an index to the
subvolume specific trash directory entry.
This changes subvolume deletion logic which currently relies
on `--retain-snapshots` flag to decide if the subvolume user
directory should get purged or the subvolume base directory
itself. Deleting a subvolume moves the user facing directory
to its specific trash directory. Purge threads take care of
deleting user facing directories (in trash) and the subvolume
base directory if required (when certain conditions are met).
Ken Dreyer [Thu, 3 Dec 2020 17:48:06 +0000 (10:48 -0700)]
mgr/prometheus: don't store exception as e
Python's logging module's exception() method will log the full exception
and stack trace for us, so we do not need to store the exception in the
"e" variable here.
Boris Ranto [Wed, 25 Nov 2020 09:27:25 +0000 (10:27 +0100)]
mgr/prometheus: Use mgr.release_name for always on modules
The host_version is not populated properly in the early stages of ceph
mgr start up process. We can use mgr.release_name instead. It is more
stable and it provides the data even if mgr_map does not contain the
versions yet.
Paul Cuzner [Thu, 8 Oct 2020 03:30:56 +0000 (16:30 +1300)]
mgr/prometheus: Add healthcheck metric for SLOW_OPS
SLOW_OPS is triggered by op tracker, and generates a health
alert but healthchecks do not create metrics for prometheus to
use as alert triggers. This change adds SLOW_OPS metric, and
provides a simple means to extend to other relevant health
checks in the future.
If extracting the value from the health check message fails,
we log an error and remove the metric from the metric set. In
addition the metric description has changed to better reflect
the scenarios where SLOW_OPS can be triggered.
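The extract-or-drop behaviour can be sketched as below. The message shape ("<N> slow ops, ..."), the metric key and both function names are assumptions for illustration; the module's actual parsing may differ.

```python
import logging
import re
from typing import Dict, Optional

logger = logging.getLogger("healthcheck_sketch")


def slow_ops_count(summary: str) -> Optional[int]:
    """Pull the leading count out of a '<N> slow ops, ...' message."""
    m = re.match(r"(\d+) slow ops", summary)
    return int(m.group(1)) if m else None


def update_slow_ops(metrics: Dict[str, int], summary: str) -> None:
    """Set the SLOW_OPS metric, or drop it if the value cannot be extracted."""
    count = slow_ops_count(summary)
    if count is None:
        logger.error("unable to extract SLOW_OPS value from %r", summary)
        metrics.pop("SLOW_OPS", None)
    else:
        metrics["SLOW_OPS"] = count
```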
Jason Dillaman [Fri, 18 Dec 2020 15:14:13 +0000 (10:14 -0500)]
librbd: drop explicit masking of implicit feature bits
Now that the create image state machine is handling the masking
of implicit features, all callers to the state machine can skip
the need to perform the masking themselves.
Jason Dillaman [Fri, 18 Dec 2020 14:55:30 +0000 (09:55 -0500)]
librbd/image: mask out all implicit features when creating an image
This will ensure that all paths to the create image state machine
properly handle this condition. Previously, it was up to the callers
of the state machine to clear the implicit feature bits.
Fixes: https://tracker.ceph.com/issues/48647
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit f52f78caca6f9743e75c8289771375f5f582300a)
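Masking in one central place amounts to clearing a fixed set of bits before the requested features are used. The sketch below shows the idea; the constant names, bit positions and choice of which features count as implicit are hypothetical and are not taken from librbd.

```python
# Illustrative feature bits (hypothetical values for this sketch).
FEATURE_LAYERING = 1 << 0
FEATURE_EXCLUSIVE_LOCK = 1 << 2
FEATURE_MIGRATING = 1 << 9       # treated as implicit here
FEATURE_NON_PRIMARY = 1 << 10    # treated as implicit here

IMPLICIT_FEATURES = FEATURE_MIGRATING | FEATURE_NON_PRIMARY


def mask_implicit(features: int) -> int:
    """Clear implicit feature bits so callers do not have to."""
    return features & ~IMPLICIT_FEATURES
```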