The RADOS instance in the manager (python core) has lately been getting
blocklisted. Manager modules (such as mgr/volumes) use a pool of
CephFS connection handles, each of which registers its addrs on
connection initialization (via register_client()). Shutting down a
connection unregisters the respective address.
Some history:
This machinery was introduced to work around the case where ceph-mgr
was getting evicted by ceph mds due to unclean shutdown (ceph-mgr
python modules do not clean up gracefully on exit), causing teuthology
tests to fail. The workaround was to blocklist these clients by including
the client addrs in the manager beacon message sent to the monitor (for
inclusion in the OSDMap), but this was not sufficient to solve
the eviction issue altogether. In the end, whitelisting the cluster log
warning was the more convenient way to work around the test failures.
However, the whole register/unregister machinery is still in use.
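Unregistering a client could end up in an endless loop: the removal
loop assigned the iterator returned by erase() and then still
incremented it, which can step past the end of the matching range.
The endless loop made ceph manager unresponsive, thereby getting it
blocklisted in the midst of normal operation. Erase the matching entry
and return right away instead.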
Fixes: http://tracker.ceph.com/issues/47329
Signed-off-by: Venky Shankar <vshankar@redhat.com>
auto itp = clients.equal_range(std::string(name));
for (auto it = itp.first; it != itp.second; ++it) {
if (it->second == addrs) {
- it = clients.erase(it);
+ clients.erase(it);
+ return;
}
}
}
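
For reference, the erase-and-return pattern from the hunk above can be tried outside of Ceph with the minimal sketch below. It assumes the registry is a std::multimap keyed by client name; the ClientRegistry class, its size() and count() helpers, and the use of std::string in place of Ceph's address type are hypothetical stand-ins for illustration, not the actual ceph-mgr code.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the registry touched by this patch. The real
// code stores Ceph address vectors; a plain std::string is assumed here.
class ClientRegistry {
public:
  // Record one (name, addrs) pair; duplicate names are allowed.
  void register_client(const std::string& name, const std::string& addrs) {
    clients.emplace(name, addrs);
  }

  // Remove at most one entry matching (name, addrs).
  void unregister_client(const std::string& name, const std::string& addrs) {
    auto itp = clients.equal_range(name);
    for (auto it = itp.first; it != itp.second; ++it) {
      if (it->second == addrs) {
        clients.erase(it);
        return;  // `it` is invalid after erase(); never increment it
      }
    }
  }

  std::size_t size() const { return clients.size(); }
  std::size_t count(const std::string& name) const {
    return clients.count(name);
  }

private:
  std::multimap<std::string, std::string> clients;
};
```

Returning immediately means the erased iterator is never reused, so the loop condition is only ever evaluated on valid iterators. If several identical entries had to be removed in one call, the usual alternative is to assign the result of erase() and skip the increment for that iteration.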