Sage Weil [Thu, 12 Mar 2020 03:39:32 +0000 (22:39 -0500)]
Merge PR #33817 into octopus
* refs/pull/33817/head:
mgr/dashboard: Adapt tests to new DriveGroupSpec
fixup mgr/test_orchestrator: validate drive group matches anything.
mgr/orch: CLI: No Tracebacks for ServiceSpecValidationError
mgr/test_orchestrator: validate drive group matches anything.
python-common: don't run flake8 on tests.
python-common: Add support for legacy serialization format for Drive Groups
doc: Move Move ServiceSpec to python-common
python-common: Add `host_pattern` to `PlacementSpec.from_string()`
cephadm: add host_pattern to supported scheduling
python-common: Joined ServiceSpec and DriveGroupSpec from_json()
python-common: Make DriveGroupSpec a sub type of ServiceSpec
pybind/mgr: Move ServiceSpec to python-common: Fix imports
python-common, orch: Move ServiceSpec to python-common: Fix imports
python-common, orch: Move ServiceSpec tests to python-common
python-common: Move ServiceSpec to python-common: fix linting
python-common, orch: Move ServiceSpec (+deps) to python-common
Sage Weil [Fri, 6 Mar 2020 23:43:33 +0000 (17:43 -0600)]
cephadm: update unit.* atomically
Some of these are run as bash scripts, which means that updating them
can lead to the running bash picking up at a weird position mid-script
when it goes to the next command. This produces weird errors like
bash[9321]: /var/lib/ceph/f1758250-639e-11ea-9a42-001a4aab830c/mon.c/unit.run: line 2: -to-stderr=true: command not found
Ernesto Puerta [Wed, 4 Mar 2020 20:31:50 +0000 (21:31 +0100)]
mgr/dashboard: add feature toggle for NFS
- New NFS feature toggle added.
- Fixed regression which broke FeatureToggles in the main menu.
- Added extensive unit testing to NavigationComponent
- Added ng-mocks to improve test isolation and performance (~7x, 139s to
20s)
- Added ng-bullet package to improve unit testing performance (2x, 20s
to 9s)
- Added Rxjs Schedulers to implement NgZone.runOutsideAngular in a Rxjs
friendly way (based on
https://stackoverflow.com/questions/43121400/run-ngrx-effect-outside-of-angulars-zone-to-prevent-timeout-in-protractor)
Minor issues found from exhaustive unit testing:
- Missing permissions in Cluster menu:
- `permissions.log.read` and `permissions.prometheus.read`
- Missing classes:
- Block -> Images: `tc_submenuitem_block_images`
- Block -> iSCSI: `tc_submenuitem_block_iscsi`
- Typos:
- class `tc_submenuitem_hosts` assigned to OSD menu (instead of
`tc_submenuitem_osd`)
- `tc_menuitem_cephs` -> `tc_menuitem_cephfs`
Minor changes:
- Previously, Cluster Map -> CRUSH Map required both OSD and Host
permissions. Now it only requires OSD permissions (there are no
references to hosts in CRUSH Map page). Nevertheless, all system roles
setting OSD permission also set Host's, so no impact expected.
- Previously, Cluster -> Monitoring menu was hidden when both Prometheus
or Alertmanager weren't configured. Now it's displayed and when clicked
on it shows the helper banner indicating that further configuration is
required. This change removes the dependency on the PrometheusService.
Sage Weil [Wed, 11 Mar 2020 13:55:51 +0000 (08:55 -0500)]
Merge PR #33830 into octopus
* refs/pull/33830/head:
qa/tasks/cephadm: no default mon|mgr|crash service specs
qa/suites/rados/cephadm/upgrade: upgrade start point that supports the no-spec option
cephadm: create initial mon and mgr service specs too
cephadm: no need to pregenerate a crash key for the bootstrap host
mgr/cephadm: do not complain when we don't have enough hosts
mgr/cephadm: remove orphan daemons
mgr/cephadm: report size=0 for fabricated ServiceDescription
mgr/cephadm: safety check to prevent removing all mon|mgr daemons
mgr/cephadm: prevent scaling mon|mgr below count=1
mgr/cephadm: do not remove daemons from remove_service
Sage Weil [Wed, 11 Mar 2020 12:12:11 +0000 (07:12 -0500)]
Merge PR #33620 into octopus
* refs/pull/33620/head:
mgr/dashboard: Crush rule modal
mgr/dashboard: Preserve rule selection on pool type change
mgr/dashboard: Crush rule is only send during replicated pool creation
mgr/dashboard: Explicit returns in pool form
mgr/dashboard: Removes fork join in pool form
mgr/dashboard: Hide ECP actions during ec pool edit
mgr/dashboard: Pool form erasure/replicated boolean
mgr/dashboard: Change pool info API endpoint
mgr/dashboard: Moves ECP info endpoint to UI-API
Kefu Chai [Wed, 11 Mar 2020 08:08:51 +0000 (16:08 +0800)]
cephadm: add "assert foo is not None" for mypy check
it's legit to pass file objects to fcntl(), but `Popen.stdout` and
`Popen.stderr` properies are not necessarily file objects -- they could be None.
this cannot be deduced at compile-time. even we can ensure this,
as we do pass `subprocess.PIPE` to the constructor. so mypy just
complains at seeing this:
```
cephadm:429: error: Argument 1 to "fcntl" has incompatible type "Optional[IO[Any]]"; expected "Union[int, HasFileno]"
cephadm:430: error: Argument 1 to "fcntl" has incompatible type "Optional[IO[Any]]"; expected "Union[int, HasFileno]"
cephadm:431: error: Argument 1 to "fcntl" has incompatible type "Optional[IO[Any]]"; expected "Union[int, HasFileno]"
cephadm:432: error: Argument 1 to "fcntl" has incompatible type "Optional[IO[Any]]"; expected "Union[int, HasFileno]"
cephadm:455: error: Item "None" of "Optional[IO[Any]]" has no attribute "fileno"
cephadm:465: error: Item "None" of "Optional[IO[Any]]" has no attribute "fileno"
cephadm:475: error: Item "None" of "Optional[IO[Any]]" has no attribute "fileno"
```
to silence this warning, insert `assert process.stdout is not None`
before accessing `process.stdout` to appease the strict optional
checking of mypy.
Jason Dillaman [Tue, 10 Mar 2020 17:31:34 +0000 (13:31 -0400)]
librbd: race condition in image watcher notification callback
If a refresh is in-progress when a header update notification is
received, the notification was previously incorrectly dropped.
This prevented rbd-mirror's snapshot-based mirroring replayer from
detecting updates in some cases.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 9 Mar 2020 22:32:03 +0000 (18:32 -0400)]
librbd: re-use mirror promote state machine when disabling
The promote state machine will handle remove the non-primary
feature bit and will ensure an interrupted disable operation
doesn't leave things in an inconsistent state.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 9 Mar 2020 21:04:27 +0000 (17:04 -0400)]
librbd: enable/disable implicit non-primary feature bit
When promoted to primary, disable the non-primary feature bit and
when demoted (or created non-primary), enable the non-primary feature
bit. This will prevent all non rbd-mirror RBD clients from modifying
the RBD image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Mon, 9 Mar 2020 20:49:07 +0000 (16:49 -0400)]
rbd-mirror: permit R/W operations against non-primary image
With the non-primary feature bit is enabled, mask-out the read-only
feature bit that will be set in the refresh image state machine if
the image has that feature bit set. This will ensure that only the
rbd-mirror daemon will be able to modify a non-primary image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 3 Mar 2020 20:17:52 +0000 (15:17 -0500)]
librbd: track reason why ImageCtx is read-only
This will be utilized by the RefreshRequest state machine to flag the image
as read-only if the new RBD_FEATURE_NON_PRIMARY feature is enabled. Also
allow that flag to be masked out by rbd-mirror daemon to permit IO and
operations against a non-primary image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Jason Dillaman [Tue, 3 Mar 2020 20:01:35 +0000 (15:01 -0500)]
librbd: new RBD_FEATURE_NON_PRIMARY to prevent R/W IO
When a snapshot-based image is non-primary, we will need to use
this implicit feature to ensure that writes and maintenance
operations cannot be performed against the image.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Sage Weil [Tue, 10 Mar 2020 22:20:48 +0000 (17:20 -0500)]
Merge PR #33771 into octopus
* refs/pull/33771/head:
common/ceph_timer: Pass reference to waited time on stack
common/ceph_timer: Add test
common/ceph_timer: Use unique_function, allowing noncopyable events
common/ceph_timer: Couple cleanups
common/ceph_timer: Fix namespaces
common/ceph_timer: Add missing includes
common/ceph_timer.h: Don't indent contents of a namespace
Sage Weil [Tue, 10 Mar 2020 14:28:57 +0000 (09:28 -0500)]
cephadm: bootstrap: wait for mgr to restart after enabling a module
It was possible to enable a module (mon updates mgrmap) and then
do a mgr command and have that command reach the mgr before it got the
latest mgrmap and restarted.
Fixes: https://tracker.ceph.com/issues/44531 Signed-off-by: Sage Weil <sage@redhat.com>
Sage Weil [Mon, 9 Mar 2020 18:39:04 +0000 (13:39 -0500)]
mgr/cephadm: do not complain when we don't have enough hosts
This gets rid of INFO level log events like
2020-03-09T13:37:20.980993-0500 mgr.x [WRN] Failed to apply mds.foo spec ServiceSpec({'placement': PlacementSpec(count:2), 'service_type': 'mds', 'service_id': 'foo'}): List of host candidates is empty
Sage Weil [Mon, 9 Mar 2020 01:38:59 +0000 (20:38 -0500)]
mgr/cephadm: fix upgrade order
Create two variables, CEPH_TYPES and CEPH_UPGRADE_ORDER. In reality they
are both the same, but this way the meaning is clear, and they lists
won't get out of sync (they should always have the same elements).
Sage Weil [Mon, 9 Mar 2020 17:26:06 +0000 (12:26 -0500)]
ceph.in: only shut down rados on clean exit
If we exit due to a timeout, then calling rados shutdown can lead to all
sorts of problems, because we may still have another thread that is
trying to call rados_connect and/or do some work, and rados_connect
and rados_shutdown don't (and can't!) really behave well when racing
against each other.
Note that shutdown here isn't that important--the process is about to
exit anyway. It's only useful to exercise the shutdown code path more
often.
Fixes: https://tracker.ceph.com/issues/44526 Signed-off-by: Sage Weil <sage@redhat.com>
Adam C. Emerson [Fri, 6 Mar 2020 03:14:47 +0000 (22:14 -0500)]
common/ceph_timer: Pass reference to waited time on stack
std::condition_variable::wait_until takes a const reference to a
time_point. It may access this reference after relinquishing the
mutex, creating a potential use-after-free error if the first event is
shut down.
So, just copy the time onto the stack, so we have a reference that
won't disappear.
https://tracker.ceph.com/issues/44373
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>