Kefu Chai [Sat, 1 Oct 2022 10:03:30 +0000 (18:03 +0800)]
crimson/osd: drop redundant code
this change is a cleanup.
we already update `superblock` with the latest `boot_epoch` and
osdmap's epoch in `OSD::handle_osd_map()`, and `committed_osd_maps()` is
called at the end of this function. `shutdown()` is called in
`committed_osd_maps()` when we are marked down and stopped by the monitor,
so these assignment statements are no-ops. instead, we should stop the whole osd
service. let's leave that for another commit.
Kefu Chai [Sat, 1 Oct 2022 09:43:00 +0000 (17:43 +0800)]
crimson/osd: use abort_source in stop_signal
before this change, `stop_signal::wait()` waits until it receives
`SIGTERM` or `SIGINT`, but we also need to stop the service at the
request of the monitor, or when a serious health condition is detected.
an `abort_source` allows the server to request an abort by
itself. also, as the single source of truth for stopping, `stop_signal` will be
able to notify its subscribers so they abort any "blocking"
calls which might prevent or delay the stop process.
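The idea of one abort source that every stop path feeds into can be sketched in standard C++. This is a minimal sketch, not Seastar code: `AbortSource` and `StopSignal` are hypothetical stand-ins for `seastar::abort_source` and the vendored `stop_signal`, and the sketch assumes that both `SIGINT`/`SIGTERM` and an in-process request (for instance from a monitor handler) trigger the same abort, which notifies each subscriber once.

```cpp
#include <cassert>
#include <condition_variable>
#include <csignal>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for seastar::abort_source (standard C++ only).
class AbortSource {
 public:
  using Callback = std::function<void()>;

  // Registers `cb` to run when an abort is requested. If the abort has
  // already been requested, `cb` runs right away.
  void subscribe(Callback cb) {
    std::unique_lock lock{mutex_};
    if (!aborted_) {
      callbacks_.push_back(std::move(cb));
      return;
    }
    lock.unlock();
    cb();
  }

  // Requests an abort; only the first call notifies the subscribers.
  void request_abort() {
    std::vector<Callback> callbacks;
    {
      std::lock_guard lock{mutex_};
      if (aborted_) return;
      aborted_ = true;
      callbacks.swap(callbacks_);
    }
    cv_.notify_all();
    for (auto& cb : callbacks) cb();  // run outside the lock
  }

  bool abort_requested() const {
    std::lock_guard lock{mutex_};
    return aborted_;
  }

  // Blocks until request_abort() has been called.
  void wait() {
    std::unique_lock lock{mutex_};
    cv_.wait(lock, [this] { return aborted_; });
  }

 private:
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  bool aborted_ = false;
  std::vector<Callback> callbacks_;
};

// Hypothetical stop_signal: signals and in-process stop requests both
// funnel into the same AbortSource.
class StopSignal {
 public:
  // Call from the normal signal-dispatch path (as a reactor would), not
  // from an async-signal handler, since this sketch locks a mutex.
  void on_signal(int signo) {
    if (signo == SIGINT || signo == SIGTERM) {
      abort_source_.request_abort();
    }
  }
  // Lets the server stop itself without signalling its own process.
  void request_stop() { abort_source_.request_abort(); }
  void wait() { abort_source_.wait(); }
  AbortSource& abort_source() { return abort_source_; }

 private:
  AbortSource abort_source_;
};
```

Because every stop path converges on one abort source, subscribers do not need to know whether the stop came from a signal or from the monitor.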
Kefu Chai [Sat, 1 Oct 2022 08:30:43 +0000 (16:30 +0800)]
crimson/osd: vendor stop_signal.h
this change copies stop_signal.h from seastar/apps/lib/stop_signal.hh
into our project so we can customize it in a following commit: we will
need to add an `abort_source` member to it.
the class `stop_signal` works great for us. but under some
circumstances, we also need to stop the crimson server programmatically
from the server itself, for instance, at the request of the monitor. in that
case, `stop_signal` cannot fulfill our needs, as the only things which
can stop it are SIGTERM and SIGINT. we could send SIGINT to ourselves to
unblock `stop_signal::wait()`, but it would be better to leverage
`abort_source` for this purpose, for two reasons:
* `abort_source` allows us to cancel a "blocking" op.
we have a couple of background tasks in crimson, like `AdminSocket`'s
accept and handle task, which could have been stopped by an
`abort_source`. but now it checks the `stop_gate` before accepting
an incoming connection. this pattern cannot always be repeated,
because we cannot *abort* a "blocking" task. for instance, we use
a homebrew `tri_mutex` to protect the read/write consistency of
an obc. what if the server needs to stop while a request is
still trying to acquire the lock? with the help of the `abort_source`,
we should be able to subscribe to the `abort_source`
and set an abort exception for each of the waiters. the same applies
to other "blocking" calls like `shared_promise::get_shared_future()`:
this allows us to abort the waiter before the promise is resolved.
this could be handy if we need to abort a potentially long-running op
which should be aborted when the monitor wants to
kill the osd instance.
* `abort_source` allows us to abort the services programmatically.
this is the use case mentioned at the beginning of this commit
message.
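The first reason, failing "blocking" waiters instead of leaving them stuck, can be sketched as a waiter list whose abort hook resumes every pending waiter with an exception. This is a hypothetical illustration, not crimson's `tri_mutex`: `WaitQueue` and `AbortRequested` are made-up names, the sketch assumes a single-threaded, callback-style waiter list (as on one Seastar shard), and `abort_all()` is assumed to be what would be registered as an `abort_source` subscriber.

```cpp
#include <cassert>
#include <deque>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Delivered to waiters that are still queued when a stop is requested.
struct AbortRequested : std::runtime_error {
  AbortRequested() : std::runtime_error("abort requested") {}
};

// Hypothetical waiter list, similar in spirit to the one a lock such as
// tri_mutex keeps. Single-threaded, like code running on one shard.
class WaitQueue {
 public:
  // A waiter is resumed with nullptr when granted, or with an exception.
  using Waiter = std::function<void(std::exception_ptr)>;

  // Queues a waiter; if an abort already happened, fails it immediately.
  void wait(Waiter w) {
    if (aborted_) {
      w(std::make_exception_ptr(AbortRequested{}));
      return;
    }
    waiters_.push_back(std::move(w));
  }

  // Grants the resource to the oldest waiter; returns false if none.
  bool wake_one() {
    if (waiters_.empty()) return false;
    Waiter w = std::move(waiters_.front());
    waiters_.pop_front();
    w(nullptr);
    return true;
  }

  // Meant to be registered as an abort_source subscriber: rather than
  // leaving waiters blocked forever, fail each of them with AbortRequested.
  void abort_all() {
    aborted_ = true;
    auto pending = std::exchange(waiters_, {});
    auto ex = std::make_exception_ptr(AbortRequested{});
    for (auto& w : pending) w(ex);
  }

 private:
  bool aborted_ = false;
  std::deque<Waiter> waiters_;
};
```

Failing the waiters with an exception, rather than silently dropping them, lets each caller unwind through its normal error path while the osd stops.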
This PR rewrites the front matter in the "Erasure Code"
section of the RADOS documentation. Previously, the information
in this section was syntactically confused. I have also fleshed
out the distinction between erasure coding and replication.
Josh Durgin [Fri, 11 Mar 2022 20:17:54 +0000 (15:17 -0500)]
doc/governance: update based on review and CLT discussions
Clarified some parts of the council section that were discussed previously,
specifying the number of members and a staggered term.
Added a bit more about the steering committee, proposing that the meetings
could be split between tactical (3/4 weeks) and strategic (monthly),
and still be open to anyone to join the discussion, but with voting
restricted to members.
Removed the meeting section since that belongs more in a separate
place, like the ceph.io website.
Josh Durgin [Fri, 15 Oct 2021 15:36:07 +0000 (11:36 -0400)]
doc/governance: add proposed structure
This is Sage's summary of the Ceph leadership team discussions around
this topic. There are still many details to be worked out; this is just one
concrete proposal as a basis for further discussion.