Do this outside the standard tick interval as it needs to be driven more
frequently to keep up with client workloads that generate a lot of
capabilities.
Fixes: https://tracker.ceph.com/issues/41141
Fixes: https://tracker.ceph.com/issues/41140
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Patrick Donnelly [Fri, 23 Aug 2019 23:16:16 +0000 (16:16 -0700)]
Merge PR #28855 into master
* refs/pull/28855/head:
doc: document scrub summary in ceph status output
test: extend scrub control test to validate mds task status
mds: send scrub state changes to cluster log.
mds: periodically send mds scrub status to ceph manager
mgr, mon: allow normal ceph services to register with manager
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Sage Weil [Fri, 23 Aug 2019 16:25:28 +0000 (11:25 -0500)]
Merge PR #28727 into master
* refs/pull/28727/head:
test/crimson: resolve name collision
test: switch to ldout; let users specify mon debug level
test: add new ElectionLogic unit test framework
elector: const-ify a bunch of functions
elector: swap order of parameters in ElectionLogic::receive_propose
elector: Update Elector and ElectionLogic function documentation
elector: persist the epoch in bump_epoch()
elector: make some more ElectionLogic members private
elector: fix privacy and restore dout in Elector
elector: don't clear peer_info in bump_epoch()
elector: split ElectionLogic into its own compilation unit
elector: move all the elector callouts into the Elector
elector: make ElectionLogic private to Elector; undo most public shenanigans
elector: create declare_standlone_victory in Elector/Logic for Monitor
elector: make ElectionLogic::declare_victory private
elector: route _bump_epoch through the interface-to-be
elector: rename handle_propose_logic -> receive_propose
elector: hoist handle_victory into ElectionLogic
elector: hoist handle_ack into ElectionLogic
elector: hoist victory into ElectionLogic
elector: hoist expire into ElectionLogic
elector: hoist start into ElectionLogic
elector: hoist participating into ElectionLogic
elector: hoist init into ElectionLogic
elector: hoist defer into ElectionLogic
elector: split handle_propose in two and hoist into ElectionLogic
elector: hoist bump_epoch into ElectionLogic
elector: store accessors for ElectionLogic
elector: hoist Elector data bits out into a new ElectionLogic class
mon: Rearrange Paxos::dispatch to be a little cleaner
Reviewed-by: Brad Hubbard <bhubbard@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
Changcheng Liu [Fri, 2 Aug 2019 03:23:08 +0000 (11:23 +0800)]
msg/async/rdma: implement function to prefetch buffers
The original RDMAConnectedSocketImpl::read both read data from buffers and
prefetched data into buffers for the next round of reading. This made the
logic a little complex and the code hard to read.
In this patch:
1) Add the private API RDMAConnectedSocketImpl::buffer_prefetch to
prefetch data into buffers at the head of read_buffers.
2) Drop one call to notify() to reduce context switches.
There is no need to notify the upper layer to read data while the current
read operation hasn't finished yet.
3) Simplify the RDMAConnectedSocketImpl::read implementation.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Wed, 19 Jun 2019 07:53:08 +0000 (15:53 +0800)]
msg/async/rdma: remove redundant code
1. The three bits below are meaningless in the pollfd::events field:
POLLERR, POLLHUP, and POLLNVAL.
2. QueuePair::pd is initialized in the initializer list.
There's no need to assign the same value to it again.
3. Remove the never used function Chunk::set_bound
4. Remove the never used function Chunk::set_offset
5. Remove the never used function QueuePair::is_error
6. Remove SimplePolicyMessenger used vars
7. Remove the socket_fd() interface since it's never used.
All data write/read is based on ConnectedSocketImpl::fd,
so there's no need to expose socket_fd.
8. Remove RDMAServerSocketImpl::get_fd, which is not used.
BTW, RDMAServerSocketImpl::fd serves the same purpose as get_fd.
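Item 1 can be checked with a small sketch like the one below. It assumes a
POSIX system; the helper name revents_after_writer_close is made up for
illustration and is not part of the RDMA code. It asks poll() for POLLIN only
and returns whatever revents the kernel reports once the pipe's writer is gone.

```cpp
#include <poll.h>
#include <unistd.h>

// Hypothetical helper: poll the read end of a pipe after its write end
// has been closed, requesting only POLLIN in pollfd::events.
// Returns the reported revents, or -1 if pipe() or poll() failed.
inline int revents_after_writer_close() {
  int fds[2];
  if (pipe(fds) != 0)
    return -1;
  close(fds[1]);            // hang up the writer side

  struct pollfd pfd;
  pfd.fd = fds[0];
  pfd.events = POLLIN;      // POLLERR/POLLHUP/POLLNVAL are not requested
  pfd.revents = 0;

  int rc = poll(&pfd, 1, 0);  // zero timeout: just sample the state
  close(fds[0]);
  return rc < 0 ? -1 : pfd.revents;
}
```

The hang-up condition is reported in revents even though it was never set in
events, which is why setting those bits in events adds nothing.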
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Wed, 7 Aug 2019 05:33:37 +0000 (13:33 +0800)]
msg/async/rdma: rename variable to improve readability
Device::binding_port
1. port_id is more meaningful compared to i as variable name.
2. start port_id from 1 instead of 0.
PoolAllocator::malloc
1. make the relationship among buffer/chunk/block/memory_region clear with
new variable names.
2. define each variable where it's first used.
RDMAConnectedSocketImpl::submit
1. replace "need_reserve_bytes" with "wait_copy_len", which stands for the
memory that is waiting to be copied into the chunk.
2. replace "copy_it" with "copy_start", which stands for the start iterator to be copied.
3. replace "total" with "total_copied", which stands for the memory that has been copied.
allocate huge page
1. use "HUGE_PAGE_SIZE_2MB" for 2MB page alignment.
2. use "ALIGN_TO_PAGE_2MB" to align the request size to 2MB.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Mon, 1 Jul 2019 02:41:18 +0000 (10:41 +0800)]
msg/async/rdma: make clear to get mem_info address
The parameter "block" points to the mem_info::chunks space. It's not
clear what "reinterpret_cast<mem_info *>(block) - 1;" does.
Instead, take the mem_info::chunks address and subtract the member's offset
from the start of the struct to get the mem_info address.
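A minimal sketch of the offset-based lookup, assuming a simplified layout.
The struct fields and the helper name mem_info_from_block are illustrative
stand-ins, not the actual Ceph definitions.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-ins for the allocator's bookkeeping types (assumed layout).
struct Chunk {
  uint32_t offset;
  uint32_t bound;
};

struct mem_info {
  void *ctx;        // owning pool, illustrative only
  unsigned nbufs;   // number of chunks in this block, illustrative only
  Chunk chunks[1];  // start of the chunk area that "block" points to
};

// Hypothetical helper: recover the enclosing mem_info from a pointer to its
// chunks member by subtracting the member's offset from the struct start.
inline mem_info *mem_info_from_block(void *block) {
  char *p = reinterpret_cast<char *>(block) - offsetof(mem_info, chunks);
  return reinterpret_cast<mem_info *>(p);
}
```

Unlike "reinterpret_cast<mem_info *>(block) - 1", this form does not rely on
sizeof(mem_info) being equal to the offset of the chunks member.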
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Mon, 1 Jul 2019 02:27:45 +0000 (10:27 +0800)]
msg/async/rdma: use different strategy to reset read/write chunk
When releasing a read chunk to the pool, chunk::offset and chunk::bound
should be reset to zero. For a write chunk, it's better to reset
chunk::offset to zero and chunk::bound to the chunk length, which means that
[offset, bound) is writable.
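The two reset strategies could look roughly like this. The Chunk struct here
is a reduced, hypothetical version with only the fields needed for the
example, and the function names are made up.

```cpp
#include <cstdint>

// Reduced, hypothetical Chunk: only the fields this example needs.
struct Chunk {
  uint32_t bytes = 0;   // total capacity of the chunk
  uint32_t offset = 0;  // start of the valid/writable range
  uint32_t bound = 0;   // end of the valid/writable range
};

// Read chunk: nothing has been received yet, so the range is empty.
inline void reset_read_chunk(Chunk &c) {
  c.offset = 0;
  c.bound = 0;
}

// Write chunk: the whole chunk, [0, bytes), is available for writing.
inline void reset_write_chunk(Chunk &c) {
  c.offset = 0;
  c.bound = c.bytes;
}

// Bytes in [offset, bound).
inline uint32_t range_size(const Chunk &c) {
  return c.bound - c.offset;
}
```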
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Thu, 27 Jun 2019 05:19:58 +0000 (13:19 +0800)]
msg/async/rdma: cosmetics initialize ibv_send_wr* var
API usage:
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr)
Input Parameters:
qp struct ibv_qp from ibv_create_qp
wr first work request (WR)
Output Parameters:
bad_wr pointer to first rejected WR
Return Value:
0 on success, -1 on error.
If the call fails, errno will be set to indicate the reason for the failure.
To avoid checking a wrong value after the call, it's better to initialize
the ibv_send_wr* variable to nullptr.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
1. There's no need to get stack & dispatcher from RDMAStack again
since RDMAWorker has stored the value.
2. cache the Infiniband object to be used in local scope.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Thu, 27 Jun 2019 03:03:44 +0000 (11:03 +0800)]
msg/async/rdma: refine Chunk construction function
1. All values are initialized in the constructor.
This makes it easy to construct a Chunk object in the
PoolAllocator::malloc function.
2. For a read chunk, member bound is initialized to 0.
3. For a send chunk, member bound is initialized to the full space size.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
After reading one chunk, the chunk can be pushed into the buffer list if its
effective content size is not zero. In this case, it also means that the
caller has got the required read length. Then all the following chunks will
be pushed into the buffer list since their effective content size is not zero.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Keep same logic:
1. If parameter block_size is zero, then allocate all the free chunks
to parameter std::vector<Chunk*> &chunks, i.e.
chunk_buffer_number = free_chunks.size()
2. If parameter block_size is not zero, then allocate the requested or
all the free chunks to parameter std::vector<Chunk*> &chunks.
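The rule above can be sketched as a small counting function. The name
chunks_to_allocate, the chunk_size parameter, and the round-up from bytes to
chunks are assumptions made for this example; the message itself only states
the zero and non-zero cases.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: how many chunks to hand out for a request.
// chunk_size must be greater than zero.
inline std::size_t chunks_to_allocate(std::size_t block_size,
                                      std::size_t chunk_size,
                                      std::size_t free_chunks) {
  if (block_size == 0)
    return free_chunks;  // case 1: take every free chunk

  // case 2: chunks needed to cover block_size bytes (rounded up, assumed),
  // capped by what is actually free.
  std::size_t wanted = (block_size + chunk_size - 1) / chunk_size;
  return std::min(wanted, free_chunks);
}
```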
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Thu, 13 Jun 2019 11:04:40 +0000 (19:04 +0800)]
msg/async/rdma: use Chunk::get_size to get chunk size
Remove the Chunk::over interface and add the Chunk::get_size interface.
1) The meaning of the "over" function name is not clear.
2) Some places need to know the current chunk's effective content size.
3) "Chunk::over()" could be replaced by "Chunk::get_size() == 0"
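A rough sketch of the replacement, assuming the effective content size is the
distance between offset and bound. That definition and the is_drained helper
are assumptions for illustration, not the actual Ceph code.

```cpp
#include <cstdint>

// Reduced, hypothetical Chunk with only the fields this example needs.
struct Chunk {
  uint32_t offset = 0;
  uint32_t bound = 0;

  // Effective content size, assumed here to be the bytes in [offset, bound).
  uint32_t get_size() const { return bound - offset; }
};

// What a call site that used over() might check instead.
inline bool is_drained(const Chunk &c) {
  return c.get_size() == 0;
}
```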
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Thu, 13 Jun 2019 10:34:44 +0000 (18:34 +0800)]
msg/async/rdma: separate Device construction if rdma_cm is used
If ms_async_rdma_cm is false, there's no need to call the API
rdma_get_device. If rdma_get_device is called, the devices remain
open while librdmacm is loaded. This is not what we want when
ms_async_rdma_cm is false.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Fri, 28 Jun 2019 06:26:41 +0000 (14:26 +0800)]
msg/async/rdma: fix error argument to get right qp state
1. It's wrong to use "-1" as the argument to query the queue pair state.
In the rdma library, ibv_query_qp calls ibv_cmd_query_qp to query the
queue pair state. If "-1" is used as attr_mask, ibv_cmd_query_qp
returns the error EOPNOTSUPP, which means the query failed.
2. In class QueuePair, is_error() can use the member function get_state()
to get the queue pair state.
3. It's better to use qp_state as the queue pair state, according to
the ibv_query_qp manual page.
struct ibv_qp_attr {
enum ibv_qp_state qp_state; /* Current QP state */
enum ibv_qp_state cur_qp_state; /* Current QP state - irrelevant for ibv_query_qp */
...
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Mon, 3 Jun 2019 05:31:09 +0000 (13:31 +0800)]
msg/async/rdma: export RDMAV_HUGEPAGES_SAFE before ibv_fork_init
In the rdma-core library, ibv_fork_init checks the environment variable
RDMAV_HUGEPAGES_SAFE to decide whether huge pages are usable in the system.
It doesn't make sense to export RDMAV_HUGEPAGES_SAFE after
calling ibv_fork_init.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Changcheng Liu [Mon, 3 Jun 2019 05:00:22 +0000 (13:00 +0800)]
msg/async/rdma: use ibv_port_attr object type in Port class
1. Avoid manual memory management by not using a pointer to operate
on the allocated space; otherwise it could leak memory.
2. Since the member type has been changed in class Device, the
member access operator "." is needed to access the sub-members of the
object.
3. There's no need to consider the experimental API of ibv_query_port.
So, merge ibv_query_port in the prolog.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
alfonsomthd [Thu, 22 Aug 2019 13:33:02 +0000 (15:33 +0200)]
mgr/dashboard: run-backend-api-tests.sh CI improvements
As there is now a jenkins job to run this script
(see https://github.com/ceph/ceph-build/pull/1351),
this refactoring adapts the script to run in a jenkins job as well as locally.
xie xingguo [Wed, 21 Aug 2019 08:34:26 +0000 (16:34 +0800)]
osd/osd_type: disable incremental recovery for legacy missing item
which is important to let us talk with pre-octopus osds and
make sure the pg_missing_items created before Octopus can be
correctly (fully) recovered too.
xie xingguo [Wed, 21 Aug 2019 02:33:42 +0000 (10:33 +0800)]
osd/PGLog: trigger full recovery for divergent missing objects
They might have a dirty/invalid log history (and hence an invalid
clean_regions as well), and there is no easy way to deduce the
complete clean_regions portion.
For simplicity (and correctness), disable potential incremental recovery
mode for these objects.
xie xingguo [Tue, 20 Aug 2019 02:43:43 +0000 (10:43 +0800)]
osd/PGLog: disable incremental recovery for pre-kraken versions
Since kraken, we always persist the missing set explicitly
(see https://github.com/ceph/ceph/pull/10334), and manually
building the missing set is only meaningful for compatibility
with pre-kraken versions.
For safety, explicitly disable incremental recovery if we have
to completely re-build the missing set at boot up.
Patrick Donnelly [Wed, 21 Aug 2019 17:57:15 +0000 (10:57 -0700)]
Merge PR #28378 into master
* refs/pull/28378/head:
qa/tasks: introduce Thrasher base class
qa/tasks: Fix typo
qa/tasks: manage thrashers
qa/tasks: start DaemonWatchdog when ceph starts
qa/tasks: make watch and bark handle more daemons
qa/tasks: move DaemonWatchdog to new file
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
Mykola Golub [Wed, 21 Aug 2019 14:04:47 +0000 (15:04 +0100)]
journal: fix race between player shut down and cache rebalance
25a23364 was supposed to fix this race, but it was not enough:
there was still a window between when `prefetch` is queued for
execution in handle_cache_rebalanced and when it is actually executed,
during which shut_down can be called and completed.
Jos Collin [Mon, 5 Aug 2019 10:52:10 +0000 (16:22 +0530)]
qa/tasks: introduce Thrasher base class
* Introduced a Thrasher base class.
* Updated thrashers to inherit from Thrasher.
* Replaced the magic variable e with Thrasher.exception as per the discussion.
The exception variable is now set by default, since the thrashers inherit
from the Thrasher class.
Fixes: https://github.com/ceph/ceph/pull/28378#discussion_r309337928
Fixes: https://tracker.ceph.com/issues/41133
Signed-off-by: Jos Collin <jcollin@redhat.com>