During the collect phase we verify that each peon has versions that
overlap or are contiguous with ours (and can therefore be caught up with
some series of transactions). However, we *also* assimilate any new
states we get from those peers, and that may move our own first_committed
forward. This means that an early responder might originally have been
contiguous with us, but a later one moved us forward, so by the time the
round finished the early responder was no longer contiguous. This leads
to a crash on that peon when it gets our first begin message.
For example:
 - we have 10..20
 - first peon has 5..15
   - ok!
 - second peon has 18..30
   - we apply this state
 - we are now 18..30
 - we finish the round
   - send commit to first peon (empty; we aren't contiguous)
   - send no commit to second peon (we match)
 - we send a begin for state 31
   - first peon crashes (its last_committed is still 15)
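The sequence above can be replayed with a small model. This is only a
sketch: Range, contiguous() and run_collect() are hypothetical names, not
part of the Paxos code, and "we apply this state" is modeled as simply
adopting the newer peer's range, as the example does.

```cpp
#include <cstdint>
#include <vector>

using version_t = std::uint64_t;

// Hypothetical stand-in for a monitor's committed range.
struct Range {
  version_t first_committed;
  version_t last_committed;
};

// Simplified reading of "overlapping or contiguous": the two ranges
// overlap or are directly adjacent. The real check may be stricter.
bool contiguous(const Range& a, const Range& b) {
  return a.first_committed <= b.last_committed + 1 &&
         b.first_committed <= a.last_committed + 1;
}

struct RoundResult {
  Range leader;            // leader's range when the round finishes
  bool all_ok_on_arrival;  // did every peon pass when it responded?
};

// Replays a collect round: each peon is checked against the leader's
// range at the moment it responds, then a newer state is assimilated.
RoundResult run_collect(Range us, const std::vector<Range>& peons) {
  bool all_ok = true;
  for (const Range& p : peons) {
    if (!contiguous(us, p)) all_ok = false;
    if (p.last_committed > us.last_committed) us = p;  // assumption: adopt range
  }
  return RoundResult{us, all_ok};
}
```

Starting from 10..20 with peons 5..15 and 18..30, every response passes
on arrival, yet the leader ends the round at 18..30, which is no longer
contiguous with the first peon.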
Prevent this by checking at the end of the round whether we are still
contiguous with every peon. If not, bootstrap. This is similar to the
check we do above, but reversed, to make sure *we* aren't too far ahead
of *them*.
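A minimal standalone sketch of that end-of-round test, assuming a map
from peon rank to last_committed like peer_last_committed in the patch.
find_stale_peon() is a hypothetical helper; the real code logs, drops the
message reference and calls mon->bootstrap() rather than returning a
rank.

```cpp
#include <cstdint>
#include <map>

using version_t = std::uint64_t;

// Returns the rank of the first peon whose last_committed is below our
// first_committed, or -1 if every peon can still be caught up.
// The first_committed > 1 guard mirrors the condition in the patch.
int find_stale_peon(const std::map<int, version_t>& peer_last_committed,
                    version_t first_committed) {
  for (std::map<int, version_t>::const_iterator p = peer_last_committed.begin();
       p != peer_last_committed.end();
       ++p) {
    if (p->second < first_committed && first_committed > 1)
      return p->first;  // caller would bootstrap here
  }
  return -1;
}
```

With peers {1: 15, 2: 30} and first_committed 18 this returns 1, the
first peon from the example; with first_committed 10 it returns -1.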
Fixes: #9053
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 3e5ce5f0dcec9bbe9ed4a6b41758ab7802614810)
       mon->timer.cancel_event(collect_timeout_event);
       collect_timeout_event = 0;
-      // share committed values?
+      // is everyone contiguous and up to date?
       for (map<int,version_t>::iterator p = peer_last_committed.begin();
            p != peer_last_committed.end();
            ++p) {
+        if (p->second < first_committed && first_committed > 1) {
+          dout(5) << __func__
+                  << " peon " << p->first
+                  << " last_committed (" << p->second
+                  << ") is too low for our first_committed (" << first_committed
+                  << ") -- bootstrap!" << dendl;
+          last->put();
+          mon->bootstrap();
+          return;
+        }
         if (p->second < last_committed) {
           // share committed values
           dout(10) << " sending commit to mon." << p->first << dendl;