Jason Dillaman [Mon, 8 Feb 2021 16:53:28 +0000 (11:53 -0500)]
rbd-mirror: don't prune older mirror snapshots when pruning incomplete snapshot
Since we normally prune in order, we need to ensure that we don't prune older
snapshots when we need to delete an incomplete mirror snapshot since the
older snapshot might be the only remaining mirror snapshot.
Jason Dillaman [Tue, 8 Dec 2020 19:16:49 +0000 (14:16 -0500)]
librbd/deep_copy: added new migrating flag to object copy
The migration operation and the copyup state machine will set
this flag when attempting to perform a deep-copy due to a
live-migration.
This flag will prevent a possible race condition between the
start of the object deep-copy when migration was enabled and
the writing portion of the deep-copy when migration might
have completed via external means.
Fixes: https://tracker.ceph.com/issues/45694 Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 1baba64e213cb808804796575d3f7969cf37a3c6)
Jason Dillaman [Fri, 5 Feb 2021 15:41:30 +0000 (10:41 -0500)]
librbd/deep-copy: object-copy state machine must update object map
If there was no data to copy, the object-copy state machine was bypassing
the object-map update states and prematurely completing. Since the
object-map is default-initialized to all non-existent objects, this results
in incorrect state for OBJECT_EXISTS_CLEAN objects.
Jason Dillaman [Fri, 25 Sep 2020 14:40:32 +0000 (10:40 -0400)]
librbd: deep-copy should update object-map before writing to object
For the original use-case of RBD mirroring it was (maybe) more
acceptable to write to the object before updating the object map
because an interrupted sync will be retried. However, when using
the deep-copy object copy state machine as part of copyup, it's
more likely that the object-map has the potential to become
out-of-sync with reality if it's updated after the object is
written.
Jason Dillaman [Thu, 28 Jan 2021 23:30:16 +0000 (18:30 -0500)]
librbd/object_map: diff state machine should track object existence
The deep-copy snapshot-create state machine initializes the object-map
state to non-existent for all objects. There was an assumption that the
deep-copy object-copy state machine would always update the object map
but that was being skipped for clean objects as an optimization. This
change will support a future commit to run the object-copy state machine
for existing objects.
Mykola Golub [Wed, 2 Dec 2020 09:41:13 +0000 (09:41 +0000)]
test/librbd: print difference if deep-copy or migration test fails
It may appear to be useful to track the sporadic test failures
observed on jenkins, not reproducible locally.
Previously it was disabled because the output could be too
large. But after the hexdump was improved to skip repeating bytes
the output will hopefully be much smaller.
Kefu Chai [Wed, 3 Jun 2020 01:39:26 +0000 (09:39 +0800)]
qa/tasks/vstart_runner: do not teardown test_path if "create-cluster-only"
otherwise we could be removing a "None" directory when tearing down the cluster,
and have following failure:
Exception ignored in: <bound method LocalContext.__del__ of <__main__.LocalContext object at 0x7f99fd4a6cc0>>
Traceback (most recent call last):
File "../qa/tasks/vstart_runner.py", line 1189, in __del__
shutil.rmtree(self.teuthology_config['test_path'])
File "/tmp/tmp.mmM2ugspuR/venv/lib/python3.6/shutil.py", line 477, in rmtree
onerror(os.lstat, path, sys.exc_info())
File "/tmp/tmp.mmM2ugspuR/venv/lib/python3.6/shutil.py", line 475, in rmtree
orig_st = os.lstat(path)
TypeError: lstat: path should be string, bytes or os.PathLike, not NoneType
Ilya Dryomov [Thu, 3 Dec 2020 10:24:32 +0000 (11:24 +0100)]
qa: krbd_stable_pages_required.sh: move to stable_writes attribute
bdi/stable_pages_required attribute was deprecated in 5.10 and now
always returns 0. The replacement is queue/stable_writes. (It is
also writeable, so we can simplify these test cases somewhat in the
future.)
Sebastian Wagner [Fri, 15 Jan 2021 12:13:35 +0000 (13:13 +0100)]
mgr/cephadm: try again calling ceph-volume without --filter-for-batch
Fixes: https://tracker.ceph.com/issues/48870
This deals with a cephadm upgrade issue:
1. user calls `ceph orch upgrade`
2. mgr/cephadm calls `ceph orch config set mgr.x container_image <new-container>`
3. standby mgr gets upgraded
4. mgr failover to new mgr
5. mgr/cephadm calls `_refresh_host_devices`
6. `_refresh_host_devices` calls` ceph orch config get osd container_image`.
But this returns the old image
7. `_refresh_host_devices` calls `ceph-volume ... --filter-for-batch`
with an image that doesn't support `filter-for-batch`
The idea is to simply retiry calling ceph-volume inventory without `--filter-for-batch`
(also removed `out` being used without being declared)
Sage Weil [Wed, 27 Jan 2021 21:44:21 +0000 (15:44 -0600)]
python-common: fix test_datetime_to_str_2 on non-UTC hosts
The old test parsed to a datetime without a tz, which was interpreted as
the local time zone when rendering back to a string. Specify that it's a
UTC datetime so that behavior is consistent regardless of the test host
timezone.
Kamoltat [Thu, 7 Jan 2021 15:39:19 +0000 (15:39 +0000)]
mgr/pg_autoscaler: avoid scale-down until there is pressure
The autoscaler will start out with scaling each
pools to have a full complements of pgs from the start
and will only decrease it when pools need more due to
increased usage.
Introduced a unit test that tests only the
function get_final_pg_target_and_ratio() which
deals with the distrubtion of pgs amongst the
pools
Edited workunit script to reflect the change
of how pgs are calculated and distrubted.
Ilya Dryomov [Wed, 20 Jan 2021 15:00:18 +0000 (16:00 +0100)]
qa/suites/krbd: add msgr2 modes to most subsuites
basic, rbd and rbd-nomount subsuites are expanded to run with each
of ms_mode=legacy, ms_mode=crc and ms_mode=secure. This increases
the total number of jobs in the suite from 100 to 220.
fsx, singleton and thrash subsuites choose ms_mode at random (from
the above plus ms_mode=prefer-crc).
Ilya Dryomov [Mon, 18 Jan 2021 12:49:49 +0000 (13:49 +0100)]
krbd: add support for msgr2
Recognize ms_mode map option and filter initial monitor addresses
accordingly: if ms_mode is not given or ms_mode=legacy, discard v2
addresses, otherwise discard v1 addresses.
Note that nothing was discarded (i.e. v2 addresses were passed to
the kernel) previously. The intent was to preserve that behaviour
in case ms_mode is not given, allowing to change the kernel default
in the future. However, it turns out that mount.ceph helper has
been misguidedly discarding v2 addresses since commit eae01275134e
("mount.ceph: fork a child to get info from local configuration"),
so that ship has sailed.
Sage Weil [Tue, 19 Jan 2021 22:49:08 +0000 (16:49 -0600)]
mgr/cephadm: raise HEALTH_WARN when cephadm daemon in 'error' state
If cephadm daemons are not happy we should raise a warning. Aside from
being an important part of the user experience, this will also help us
catch teuthology test errors.
cephadm: silence "Failed to evict container" log msg
Right now, we're printing some evil looking messages in the log:
```
systemd[1]: Starting Ceph mgr.node2.ankmgz for ...
podman[32354]: Error: no container with name or ID ceph-... found: no such container
bash[32363]: Error: Failed to evict container: "": Failed to find container "ceph-..." in state: no container with name or ID ceph-... found: no such container
bash[32363]: Error: no container with ID or name "ceph-..." found: no such container
````
Also, the unit.run command already removes the container. No need
for ExecStartPre to do the same.
Conflicts:
qa/tasks/mgr/dashboard/helper.py
qa/tasks/mgr/dashboard/test_auth.py
src/pybind/mgr/dashboard/controllers/__init__.py
src/pybind/mgr/dashboard/controllers/auth.py
src/pybind/mgr/dashboard/controllers/saml2.py
src/pybind/mgr/dashboard/frontend/cypress/integration/orchestrator/01-hosts.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/orchestrator/02-hosts-inventory.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/orchestrator/03-inventory.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/orchestrator/04-osds.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/ui/language.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/cypress/integration/ui/navigation.e2e-spec.ts
src/pybind/mgr/dashboard/frontend/package-lock.json
src/pybind/mgr/dashboard/frontend/package.json
src/pybind/mgr/dashboard/frontend/src/app/app.module.ts
src/pybind/mgr/dashboard/frontend/src/app/core/navigation/dashboard-help/dashboard-help.component.ts
- Adopting the changes from the master branch, ignoring few e2e changes
as few files doesn't exist in octopus.
move all of the etag verifier initialization into a helper function.
none of the errors there should be fatal and fail the download, they
should just turn etag verification off
rgw: rgw_sync_obj_etag_verify accounts for compressed multipart uploads
the etag verifier for multipart uploads uses the manifest to get the
logical offsets for each part. but when compression is enabled, those
are offsets into the compressed data. use the source object's compression
info to translate those compressed part offsets back to their original
offsets