Nizamudeen A [Thu, 30 Dec 2021 08:28:58 +0000 (13:58 +0530)]
mgr/dashboard: stabilizing the cephadm dashboard e2e
Reorder the tests and add some more tests to verify that the cluster is
healthy before proceeding to complex tasks like maintenance and host
drain.
Nizamudeen A [Mon, 20 Dec 2021 09:14:29 +0000 (14:44 +0530)]
mgr/dashboard: fix timeout error in dashboard cephadm e2e job
1. Fix the timeout error happening in the dashboard e2e job
2. Take care of the flaky force maintenance check
Most of the time our test times out while searching for an item
in the table. This is because `.clear().type()` sometimes does not clear the
content of the search field, so wrong data ends up in the search field and
the search runs against this wrong name. To avoid this, the search field is
now explicitly cleared before typing.
Aashish Sharma [Mon, 8 Nov 2021 07:31:02 +0000 (13:01 +0530)]
mgr/dashboard: Cluster Expansion - Review Section: fixes and improvements
Ensure "Storage capacity" keeps the "Description : Value" approach ("Number of devices: X" and "Raw Capacity: Y" in different lines).Correct issue with "host by services" host count
- Remove npm-force-resolutions: no resolution is needed anymore, and it modifies package-lock.json every time it is run (stripping the last empty line).
- Add .npmrc: save exact versions by default; do not launch an audit report when installing.
Fixes: https://tracker.ceph.com/issues/48005
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit f08c0db689dc6bd29323ac03a91c69e2fe7365a2)
Conflicts:
src/pybind/mgr/dashboard/frontend/package-lock.json
- Accept version from master branch.
src/pybind/mgr/dashboard/frontend/package.json
- Accept version from master branch.
This is redundant and makes nsenter throw messages like the following:
```
Failed to find sysfs mount point
dev/block/11:0/holders/: opendir failed: Not a directory
dev/block/252:0/holders/: opendir failed: Not a directory
dev/block/253:0/holders/: opendir failed: Not a directory
dev/block/252:1/holders/: opendir failed: Not a directory
dev/block/253:1/holders/: opendir failed: Not a directory
dev/block/252:2/holders/: opendir failed: Not a directory
dev/block/253:2/holders/: opendir failed: Not a directory
dev/block/252:3/holders/: opendir failed: Not a directory
dev/block/253:3/holders/: opendir failed: Not a directory
dev/block/252:16/holders/: opendir failed: Not a directory
dev/block/252:32/holders/: opendir failed: Not a directory
dev/block/252:48/holders/: opendir failed: Not a directory
dev/block/252:64/holders/: opendir failed: Not a directory
```
Sage Weil [Tue, 5 Oct 2021 16:06:09 +0000 (11:06 -0500)]
qa/tasks/nvme_loop: set up nvme_loop on scratch_devs
Using an nvme loop device makes the LVs look like "real" disks,
which means we can exercise all of the normal code paths for
provisioning, deprovisioning, and zapping.
ceph-volume should run pv/vg/lv commands in the host namespace rather than
inside the container, in order to avoid LVM metadata corruption.
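As a rough sketch of that idea (not the actual ceph-volume implementation; the helper name and the /rootfs/proc path are assumptions for illustration), LVM commands can be prefixed with nsenter so they run against the host's namespaces:
```
import subprocess

# Hypothetical sketch: run an LVM command (pvs/vgs/lvs, pvcreate, ...) in the
# host's namespaces instead of the container's. Assumes the host's /proc is
# visible inside the container at /rootfs/proc (illustrative path only).
def run_lvm_on_host(lvm_cmd):
    host_init = '/rootfs/proc/1'
    nsenter_prefix = [
        'nsenter',
        f'--mount={host_init}/ns/mnt',  # host mount namespace (sees host devices)
        f'--ipc={host_init}/ns/ipc',    # host IPC namespace (LVM locking)
        f'--net={host_init}/ns/net',
        f'--uts={host_init}/ns/uts',
    ]
    return subprocess.run(nsenter_prefix + lvm_cmd,
                          capture_output=True, text=True, check=True)

# Example: list physical volumes exactly as the host sees them.
# run_lvm_on_host(['pvs', '--reportformat', 'json'])
```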
Jianpeng Ma [Wed, 8 Sep 2021 01:51:19 +0000 (09:51 +0800)]
librbd: Read requests need exclusive-lock when pwl-cache is enabled.
TestLibRBD.TestFUA describes the following workload:
a) write/read the same image w/ pwl-cache
   write_image = open(image_name);
   read_image = open(image_name);
b) the I/O workload is:
   write(write_image)
     The write needs the exclusive lock, so it acquires it.
   read(read_image)
     In ExclusiveLock<I>::init(), the first read also needs the exclusive lock,
     so it acquires it. write_image releases the lock (which flushes its data
     to the OSDs and removes its cache). read_image initializes its pwl-cache;
     the first read misses in the cache and is then served from the OSDs.
   write(write_image)
     The write needs the exclusive lock and acquires it again. This makes
     read_image drop its (empty) cache. write_image initializes its cache pool
     and writes the data into the cache.
   read(read_image)
     In send_set_require_lock(), only writes are set to require the exclusive
     lock, so the read does not acquire it and dirtily reads from the OSDs.
Because the second read does not need the exclusive lock, write_image never
releases it, i.e. it never flushes its dirty data to the OSDs and never shuts
down its pwl-cache. As a result the second read does not return the latest
data. Therefore reads should also require the exclusive lock when the
pwl-cache is enabled.
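For context, here is a minimal sketch of the two-handle workload above using the Python rbd bindings (the real test is the C++ TestLibRBD.TestFUA; conffile, pool and image names are assumptions):
```
import rados
import rbd

# Sketch of the write/read-same-image workload described above. Assumes a
# reachable cluster and an 'rbd' pool; with pwl-cache enabled and without the
# fix, the second read could return stale data.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
rbd.RBD().create(ioctx, 'fua-test', 16 * 1024 * 1024)

write_image = rbd.Image(ioctx, 'fua-test')   # writer handle
read_image = rbd.Image(ioctx, 'fua-test')    # reader handle

write_image.write(b'A' * 4096, 0)                  # writer takes the exclusive lock
assert read_image.read(0, 4096) == b'A' * 4096     # first read also takes the lock

write_image.write(b'B' * 4096, 0)                  # writer re-acquires the lock
assert read_image.read(0, 4096) == b'B' * 4096     # without the fix: may still see 'A'

for img in (read_image, write_image):
    img.close()
ioctx.close()
cluster.shutdown()
```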
Fixes: https://tracker.ceph.com/issues/51438
Tested-by: Feng Hualong <hualong.feng@intel.com>
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 621facb6e66ce92ca36d566c78bc065a9666639e)
Jianpeng Ma [Mon, 1 Nov 2021 00:33:23 +0000 (08:33 +0800)]
librbd: send FLUSH_SOURCE_INTERNAL when doing copy/deep_copy.
copy/deep_copy use the object map to judge whether an object exists.
With the librbd pwl cache, a regular flush does not flush the data to the
OSDs, which is what updates the object map state. So we should send the flush
with FLUSH_SOURCE_INTERNAL to make the data flush all the way to the OSDs.
Fixes: https://tracker.ceph.com/issues/53057
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit a2ae83f8aab18933eae77cf3034b740082a39e4f)
Jianpeng Ma [Mon, 29 Nov 2021 07:16:21 +0000 (15:16 +0800)]
librbd/cache/pwl: Use BlockGuard to control the ordering of overlapping ops when flushing to OSDs.
During testing we hit some data-inconsistency problems. The test case
mainly uses a write followed by a discard to detect inconsistent data.
W/o pwl, write/discard are synchronous ops: after the write returns, the data
is already on the OSDs. But w/ pwl, we use the asynchronous API to send ops to
the OSDs. Although we make sure of the send order, send order does not
guarantee completion order. So pwl keeps the order of write/discard, but it
does not keep the same semantics as the synchronous API: w/ pwl, synchronous
ops become asynchronous. For normal ops this is not a problem, but for
consecutive commands that overlap it can make the data inconsistent.
So we use BlockGuard to solve this issue.
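A minimal sketch of that write-then-discard consistency pattern with the Python rbd bindings (pool/image names and sizes are assumptions; the actual tests live in the librbd test suites):
```
import rados
import rbd

# Overlapping write followed by discard on the same range: with pwl the ops
# are forwarded to the OSDs asynchronously, so send order alone does not
# guarantee completion order for the overlapping range.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')
rbd.RBD().create(ioctx, 'pwl-order-test', 8 * 1024 * 1024)

with rbd.Image(ioctx, 'pwl-order-test') as image:
    image.write(b'X' * 8192, 0)   # write 8 KiB at offset 0
    image.discard(0, 4096)        # discard the first 4 KiB, overlapping the write
    image.flush()
    # Expected (consistent) result: first 4 KiB zeroed, the rest intact.
    assert image.read(0, 4096) == b'\0' * 4096
    assert image.read(4096, 4096) == b'X' * 4096

ioctx.close()
cluster.shutdown()
```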
Fixes: https://tracker.ceph.com/issues/49876
Fixes: https://tracker.ceph.com/issues/53108
Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com>
(cherry picked from commit 8e8f3ef516e98da011f3086f8e78a2fa261293ed)
The backport of the lvm migrate feature in pacific was merged after the
get_first_*() refactor backport, so we still have some old references to
`get_single_lv()`.
Yaarit Hatuka [Wed, 25 Aug 2021 02:12:08 +0000 (02:12 +0000)]
rpm, debian: move smartmontools and nvme-cli to ceph-base
We wish to be able to scrape SMART and NVMe metrics from OSD and MON
nodes. For this we require / recommend smartmontools and nvme-cli
dependencies for both the ceph-osd and ceph-mon packages. However, the
sudoers file (which is required for invoking `smartctl` by user 'ceph')
was installed only in the ceph-osd package. Since different packages
cannot own the same file, and because we want to be able to scrape from
every daemon, we move the dependencies and the sudoers installation to
ceph-base. For generalization, we rename:
sudoers.d/ceph-osd-smartctl -> sudoers.d/ceph-smartctl
Neha Ojha [Mon, 9 Aug 2021 14:35:01 +0000 (14:35 +0000)]
qa/suites/rados/perf/ceph.yaml: remove rgw
This is no longer required because the cosbench workloads were removed in fd350fd0150a2d4072f055658c20314a435a19ba. Removing it is also required to prevent
failures like the following, or any other changes that break the rgw task:
```
2021-08-06T20:13:25.812 INFO:teuthology.orchestra.run.smithi060.stderr:curl: (7) Failed to connect to smithi060.front.sepia.ceph.com port 80: Connection refused
2021-08-06T20:15:33.813 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_04c2febe7099917d97a71271f17abb5710030132/teuthology/contextutil.py", line 31, in nested
vars.append(enter())
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "/home/teuthworker/src/github.com_ceph_ceph-c_3c0f8c8164075af7aac4d1f2805d3f4580709461/qa/tasks/rgw.py", line 191, in start_rgw
wait_for_radosgw(url, remote)
File "/home/teuthworker/src/github.com_ceph_ceph-c_3c0f8c8164075af7aac4d1f2805d3f4580709461/qa/tasks/util/rgw.py", line 94, in wait_for_radosgw
assert exit_status == 0
AssertionError
```
mgr/dashboard: use -f for npm ci to skip fsevents error
Fixes: https://tracker.ceph.com/issues/52507
Signed-off-by: Duncan Bellamy <dunk@denkimushi.com>
(cherry picked from commit cd2b26f653ddedf0ed1b937cfaf8bcf7aaf48ce6)
Conflicts:
src/pybind/mgr/dashboard/CMakeLists.txt
- In master this file was moved to the frontend folder. Since that was not
done in pacific, just made the changes here.
Alfonso Martínez [Wed, 24 Nov 2021 14:36:50 +0000 (15:36 +0100)]
mgr/dashboard: upgrade Cypress to the latest stable version
- Remove unneeded dependency that was causing UI performance issues: zone.js
- Ignore 'ResizeObserver loop limit exceeded' error.
- run-frontend-e2e-tests.sh refactoring: create the rgw dashboard user through
'ceph dashboard set-rgw-credentials' and use it in the rgw buckets tests.
Fixes: https://tracker.ceph.com/issues/53357
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit 3e4e29590aa1742fc3b44d21389325a13cca8199)
Conflicts:
src/pybind/mgr/dashboard/frontend/package-lock.json
- Regenerate file to align to pacific.
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
Nizamudeen A [Thu, 18 Nov 2021 07:13:39 +0000 (12:43 +0530)]
mgr/dashboard: fix flaky inventory e2e test
When the line `inventory.getTableCount('total').should('be.eq', totalDiskCount);`
is executed, the table is sometimes not loaded yet, so getTableCount returns 0
on the first try; on the second try it passes because the table has loaded.
But in the orch e2e tests the retries are set to 0, and I am not sure it makes
sense to set them to 1. Instead, I am adapting the test a bit to expect the
count to become equal to totalDiskCount, so that the test will wait a bit.
Avan Thakkar [Tue, 9 Nov 2021 21:37:33 +0000 (03:07 +0530)]
mgr/dashboard: provisioned values are misleading in RBD image table
Add a hint to the image table similar to the one in rbd-details.
Fixes: https://tracker.ceph.com/issues/46617
Signed-off-by: Avan Thakkar <athakkar@redhat.com>
Alfonso Martínez [Wed, 17 Nov 2021 12:18:26 +0000 (13:18 +0100)]
mgr/dashboard: NFS non-existent files cleanup
After https://github.com/ceph/ceph/pull/42526 and https://github.com/ceph/ceph/pull/43725 were merged,
the following files no longer exist, but there were still references to them:
- src/pybind/mgr/dashboard/services/ganesha.py
- qa/tasks/mgr/dashboard/test_ganesha.py
The following files were renamed but there were still references to old names:
- src/pybind/mgr/dashboard/controllers/nfsganesha.py: nfsganesha.py --> nfs.py
- src/pybind/mgr/dashboard/tests/test_ganesha.py: test_ganesha.py --> test_nfs.py
Other changes in qa/suites/rados/dashboard/tasks/dashboard.yaml:
- Add missing task: tasks.mgr.dashboard.test_api
- Sort dashboard tasks alphabetically.
Fixes: https://tracker.ceph.com/issues/53123
Signed-off-by: Alfonso Martínez <almartin@redhat.com>
(cherry picked from commit 045d2d0f7656e8524bbb32b5d9c230ca1f9b8d1c)
Adam Kupczyk [Sat, 13 Nov 2021 10:28:18 +0000 (11:28 +0100)]
os/bluestore: Fix omap upgrade to per-pg scheme
This is a fix for a regression introduced by the fix to the omap upgrade: https://github.com/ceph/ceph/pull/43687
The problem was that we always skipped the first omap entry.
This worked fine for objects that have an omap header key.
For objects without a header key we skipped the first actual omap key.
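Schematically (not the actual BlueStore code; the names below are made up for illustration), the regression amounted to unconditionally dropping the first key on the assumption that it is the omap header:
```
# Schematic illustration only -- not the BlueStore implementation.
def upgraded_keys_buggy(keys):
    # keys: omap keys of one object, with the header key first *if present*.
    return keys[1:]                      # bug: drops the first real key when
                                         # the object has no header key

def upgraded_keys_fixed(keys, header_key):
    # Only skip the first entry when it really is the header key.
    start = 1 if keys and keys[0] == header_key else 0
    return keys[start:]
```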
Fixes: https://tracker.ceph.com/issues/53307
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
(cherry picked from commit 65a3f374aa1c57c5bb9401e57dab98a643b4360a)