Paul Cuzner [Tue, 19 Oct 2021 00:07:02 +0000 (13:07 +1300)]
mgr/prometheus: add test cases and validation using tox
Focus all tests inside a tests directory, and use pytest/tox to
validate the overall content. The tox tests also use
promtool, if available, to provide rule checks and unit test runs.
In addition to these checks, a validate_rules script provides the
format and content checks against all rules; it is also
called via tox (but can be run independently too).
Paul Cuzner [Thu, 16 Sep 2021 23:24:29 +0000 (11:24 +1200)]
mgr/prometheus: track individual healthchecks as metrics
This patch creates a health history object maintained in
the module's kvstore. The history and the current health
checks are used to create a metric per healthcheck whilst
also providing a history feature. Two new commands are added:
ceph healthcheck history ls
ceph healthcheck history clear
In addition to the new commands, the additional metrics
have been used to update the Prometheus alerts.
Fixes: https://tracker.ceph.com/issues/52638
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
(cherry picked from commit e0dfc02063ef40cf6a1dc6e3080d0a856ceff050)
Conflicts:
doc/mgr/prometheus.rst
- Adapting doc to master.
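As an illustration of the approach (not the actual module code), a health-history object kept in the mgr module's KV store and turned into per-healthcheck metrics might look roughly like this; the class name, KV key and metric shape are assumptions:
```
# Sketch only: persist every healthcheck ever seen in the MgrModule KV
# store and expose one gauge-like value per check.
import json
import time


class HealthHistory:
    KV_KEY = "health_history"          # illustrative key name

    def __init__(self, mgr):
        self.mgr = mgr
        raw = mgr.get_store(self.KV_KEY)            # MgrModule KV store read
        self.checks = json.loads(raw) if raw else {}

    def update(self, active_checks):
        # remember every check we have ever seen, with a last-seen timestamp
        for name in active_checks:
            self.checks[name] = {"last_seen": time.time()}
        self.mgr.set_store(self.KV_KEY, json.dumps(self.checks))

    def as_metrics(self, active_checks):
        # 1 if the check is currently firing, 0 if it only appears in history
        return {name: (1 if name in active_checks else 0)
                for name in self.checks}
```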
Aashish Sharma [Mon, 8 Nov 2021 07:31:02 +0000 (13:01 +0530)]
mgr/dashboard: Cluster Expansion - Review Section: fixes and improvements
Ensure "Storage capacity" keeps the "Description : Value" approach ("Number of devices: X" and "Raw Capacity: Y" in different lines).Correct issue with "host by services" host count
- Remove npm-force-resolutions: no resolution is needed anymore, and it modifies package-lock.json every time it is run (stripping the last empty line).
- Add .npmrc: save exact version by default; do not launch audit report when installing.
Fixes: https://tracker.ceph.com/issues/48005
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit f08c0db689dc6bd29323ac03a91c69e2fe7365a2)
Conflicts:
src/pybind/mgr/dashboard/frontend/package-lock.json
- Accept version from master branch.
src/pybind/mgr/dashboard/frontend/package.json
- Accept version from master branch.
This is redundant and makes nsenter throw messages like the following:
```
Failed to find sysfs mount point
dev/block/11:0/holders/: opendir failed: Not a directory
dev/block/252:0/holders/: opendir failed: Not a directory
dev/block/253:0/holders/: opendir failed: Not a directory
dev/block/252:1/holders/: opendir failed: Not a directory
dev/block/253:1/holders/: opendir failed: Not a directory
dev/block/252:2/holders/: opendir failed: Not a directory
dev/block/253:2/holders/: opendir failed: Not a directory
dev/block/252:3/holders/: opendir failed: Not a directory
dev/block/253:3/holders/: opendir failed: Not a directory
dev/block/252:16/holders/: opendir failed: Not a directory
dev/block/252:32/holders/: opendir failed: Not a directory
dev/block/252:48/holders/: opendir failed: Not a directory
dev/block/252:64/holders/: opendir failed: Not a directory
```
Sage Weil [Tue, 5 Oct 2021 16:06:09 +0000 (11:06 -0500)]
qa/tasks/nvme_loop: set up nvme_loop on scratch_devs
Using an nvme loop device makes the LVs look like "real" disks,
which means we can exercise all of the normal code paths for
provisioning, deprovisioning, and zapping.
ceph-volume should run pv/vg/lv commands in the host namespace rather than
running them inside the container in order to avoid lvm metadata corruption.
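A minimal sketch of the idea, assuming the container is privileged and shares the host PID namespace so that PID 1 belongs to the host (the helper name is hypothetical):
```
# Illustrative sketch only: prefix LVM commands with nsenter so they run
# against the host's namespaces instead of the container's.
import subprocess


def run_on_host(cmd):
    # -t 1: enter the namespaces of PID 1 (the host); -m: mount, -n: net, -i: ipc
    full_cmd = ["nsenter", "-t", "1", "-m", "-n", "-i"] + cmd
    return subprocess.run(full_cmd, check=True, capture_output=True, text=True)


# e.g. list physical volumes using the host's LVM metadata
print(run_on_host(["pvs"]).stdout)
```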
Jianpeng Ma [Wed, 8 Sep 2021 01:51:19 +0000 (09:51 +0800)]
librbd: Read requests need the exclusive lock when the pwl-cache is enabled.
TestLibRBD.TestFUA describes the following workload:
a) write/read the same image w/ pwl-cache:
write_image = open(image_name);
read_image = open(image_name);
b) the i/o workload is:
write(write_image)
The write needs the exclusive lock, so it acquires EXLOCK.
read(read_image)
In ExclusiveLock<I>::init(), the first read also needs EXLOCK,
so it requests it. write_image releases EXLOCK (which flushes
its data to the OSDs and removes its cache). read_image initializes
the pwl-cache; the read first goes to the pwl-cache, misses, and then
reads from the OSDs.
write(write_image)
The write needs EXLOCK and requests it. This makes read_image drop
its empty cache. write_image initializes its cache pool and writes the
data to the cache.
read(read_image)
In send_set_require_lock(), only writes are marked as needing EXLOCK.
So the read does not request EXLOCK and reads stale data from the OSDs.
Because the second read does not need EXLOCK, write_image never releases
EXLOCK (which would flush the dirty data to the OSDs and shut down the
pwl-cache). As a result, the second read does not return the latest data.
So reads should also need EXLOCK when the pwl-cache is enabled.
Fixes: https://tracker.ceph.com/issues/51438
Tested-by: Feng Hualong <hualong.feng@intel.com>
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 621facb6e66ce92ca36d566c78bc065a9666639e)
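For readers more familiar with the Python bindings, a rough rendering of the interleaving the test exercises (pool and image names are placeholders, the image is assumed to already exist, and the pwl cache is assumed to be enabled in the client configuration):
```
# Rough sketch of the workload: two handles to the same image, with writes
# and reads interleaved so the exclusive lock bounces between them.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")             # assumed pool name

write_image = rbd.Image(ioctx, "test-image")  # two handles to the same image
read_image = rbd.Image(ioctx, "test-image")

write_image.write(b"A" * 4096, 0)  # write path takes the exclusive lock
read_image.read(0, 4096)           # first read also requests the lock
write_image.write(b"B" * 4096, 0)  # lock moves back; data lands in the pwl cache
stale = read_image.read(0, 4096)   # without the fix, this may not see the "B" data
```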
Jianpeng Ma [Mon, 1 Nov 2021 00:33:23 +0000 (08:33 +0800)]
librbd: send FLUSH_SOURCE_INTERNAL when doing copy/deep_copy.
copy/deep_copy use the object_map to judge whether an object exists.
W/ the librbd pwl cache, a plain flush can't flush data to the OSDs, which
is what changes the objectmap state. So we should send the flush w/
FLUSH_SOURCE_INTERNAL to make the data flush to the OSDs.
Fixes: https://tracker.ceph.com/issues/53057
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit a2ae83f8aab18933eae77cf3034b740082a39e4f)
Jianpeng Ma [Mon, 29 Nov 2021 07:16:21 +0000 (15:16 +0800)]
librbd/cache/pwl: Use BlockGuard to control the order of overlapping ops when flushing to the OSDs.
During testing we met some inconsistent-data problems. The test case
mainly uses a write, then a discard, to detect data consistency.
W/o pwl, write/discard are synchronous ops: after the write, the data is
already located on the OSDs. But w/ pwl, we use the asynchronous API to send
ops to the OSDs.
Although we make sure of the send order, the send order doesn't guarantee the
completion order. This means pwl keeps the order of write/discard, but it
doesn't keep the same semantics as the synchronous API: w/ pwl, synchronous
becomes asynchronous. For normal ops that's not a problem, but if
consecutive commands overlap, it makes the data inconsistent.
So we use BlockGuard to solve this issue.
Fixes: https://tracker.ceph.com/issues/49876
Fixes: https://tracker.ceph.com/issues/53108
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 8e8f3ef516e98da011f3086f8e78a2fa261293ed)
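Conceptually, the guard only serializes in-flight operations whose extents overlap; a small Python sketch of that idea (this is not the librbd implementation, which is asynchronous C++):
```
# Concept sketch: block an operation until no in-flight extent overlaps it,
# while non-overlapping operations proceed concurrently.
import threading


class BlockGuard:
    def __init__(self):
        self._lock = threading.Condition()
        self._in_flight = []                      # list of (offset, length)

    @staticmethod
    def _overlaps(a, b):
        return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]

    def acquire(self, offset, length):
        ext = (offset, length)
        with self._lock:
            # wait until no in-flight extent overlaps this one
            while any(self._overlaps(ext, other) for other in self._in_flight):
                self._lock.wait()
            self._in_flight.append(ext)

    def release(self, offset, length):
        with self._lock:
            self._in_flight.remove((offset, length))
            self._lock.notify_all()
```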
The backport for the lvm migrate feature in pacific was merged after the
get_first_*() refactor backport.
So we still have some old references to `get_single_lv()`.
Yaarit Hatuka [Wed, 25 Aug 2021 02:12:08 +0000 (02:12 +0000)]
rpm, debian: move smartmontools and nvme-cli to ceph-base
We wish to be able to scrape SMART and NVMe metrics from OSD and MON
nodes. For this we require / recommend smartmontools and nvme-cli
dependencies for both the ceph-osd and ceph-mon packages. However, the
sudoers file (which is required for invoking `smartctl` by user 'ceph')
was installed only in the ceph-osd package. Since different packages
cannot own the same file, and because we want to be able to scrape from
every daemon, we move the dependencies and the sudoers installation to
ceph-base. For generalization, we rename:
sudoers.d/ceph-osd-smartctl -> sudoers.d/ceph-smartctl
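For illustration only, this is roughly how a daemon running as user 'ceph' might call smartctl through sudo once the rule ships with ceph-base; the helper and flags shown here are an assumption, not the exact scraper code:
```
# Hypothetical sketch: invoke smartctl via sudo as the 'ceph' user and parse
# its JSON output (requires smartmontools 7+ for --json support).
import json
import subprocess


def smart_report(device):
    result = subprocess.run(["sudo", "smartctl", "--json", "-a", device],
                            capture_output=True, text=True)
    return json.loads(result.stdout) if result.stdout else {}


# e.g. print the overall health status of /dev/sda
print(smart_report("/dev/sda").get("smart_status"))
```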
Neha Ojha [Mon, 9 Aug 2021 14:35:01 +0000 (14:35 +0000)]
qa/suites/rados/perf/ceph.yaml: remove rgw
This is no longer required because we removed the cosbench workloads in fd350fd0150a2d4072f055658c20314a435a19ba. Removing it is also required to prevent
failures like the following, or failures from any other changes that break the rgw task:
```
2021-08-06T20:13:25.812 INFO:teuthology.orchestra.run.smithi060.stderr:curl: (7) Failed to connect to smithi060.front.sepia.ceph.com port 80: Connection refused
2021-08-06T20:15:33.813 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_04c2febe7099917d97a71271f17abb5710030132/teuthology/contextutil.py", line 31, in nested
vars.append(enter())
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/home/teuthworker/src/github.com_ceph_ceph-c_3c0f8c8164075af7aac4d1f2805d3f4580709461/qa/tasks/rgw.py", line 191, in start_rgw
wait_for_radosgw(url, remote)
File "/home/teuthworker/src/github.com_ceph_ceph-c_3c0f8c8164075af7aac4d1f2805d3f4580709461/qa/tasks/util/rgw.py", line 94, in wait_for_radosgw
assert exit_status == 0
AssertionError
```
mgr/dashboard: use -f for npm ci to skip fsevents error
Fixes: https://tracker.ceph.com/issues/52507
Signed-off-by: Duncan Bellamy <dunk@denkimushi.com>
(cherry picked from commit cd2b26f653ddedf0ed1b937cfaf8bcf7aaf48ce6)
Conflicts:
src/pybind/mgr/dashboard/CMakeLists.txt
- In master this file was moved to the frontend folder. Since it's not
done in pacific, the changes were just made here.
Alfonso Martínez [Wed, 24 Nov 2021 14:36:50 +0000 (15:36 +0100)]
mgr/dashboard: upgrade Cypress to the latest stable version
- Remove unneeded dependency that was causing UI performance issues: zone.js
- Ignore 'ResizeObserver loop limit exceeded' error.
- run-frontend-e2e-tests.sh refactoring: create rgw dashboard user through
'ceph dashboard set-rgw-credentials' and use it on rgw buckets' tests.
Fixes: https://tracker.ceph.com/issues/53357
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit 3e4e29590aa1742fc3b44d21389325a13cca8199)
Conflicts:
src/pybind/mgr/dashboard/frontend/package-lock.json
- Regenerate file to align to pacific.
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
Nizamudeen A [Thu, 18 Nov 2021 07:13:39 +0000 (12:43 +0530)]
mgr/dashboard: fix flaky inventory e2e test
When the line `inventory.getTableCount('total').should('be.eq', totalDiskCount);`
is executed, the table is not yet loaded properly, and hence
getTableCount returns 0 on the first try; on the second try it passes
since the table has loaded. But in the orch e2es the retries are set to 0, and I
am not sure it makes sense to set them to 1. Anyway, I am adapting the
test a bit to expect the count to be equal to totalDiskCount so that the
test will wait a bit.
Avan Thakkar [Tue, 9 Nov 2021 21:37:33 +0000 (03:07 +0530)]
mgr/dashboard: provisioned values are misleading in RBD image table
Fixes: https://tracker.ceph.com/issues/46617
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Adding a hint in the image table similar to the one in rbd-details.
Alfonso Martínez [Wed, 17 Nov 2021 12:18:26 +0000 (13:18 +0100)]
mgr/dashboard: NFS non-existent files cleanup
After https://github.com/ceph/ceph/pull/42526 and https://github.com/ceph/ceph/pull/43725 merges,
the following files do not exist but there were still references to them:
- src/pybind/mgr/dashboard/services/ganesha.py
- qa/tasks/mgr/dashboard/test_ganesha.py
The following files were renamed but there were still references to old names:
- src/pybind/mgr/dashboard/controllers/nfsganesha.py: nfsganesha.py --> nfs.py
- src/pybind/mgr/dashboard/tests/test_ganesha.py: test_ganesha.py --> test_nfs.py
Other changes in qa/suites/rados/dashboard/tasks/dashboard.yaml:
- Add missing task: tasks.mgr.dashboard.test_api
- Sort dashboard tasks alphabetically.
Fixes: https://tracker.ceph.com/issues/53123
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit 045d2d0f7656e8524bbb32b5d9c230ca1f9b8d1c)