Wido saw a pg go active, but an activate log+info update crossed paths with
a pg_notify info, and the primary overwrote its shiny new updated info
with the stale old info from the replica. Don't do that. It causes
problems down the line. In this case, we got
osd/OSD.cc: In function 'void OSD::generate_backlog(PG*)':
osd/OSD.cc:3863: FAILED assert(!pg->is_active())
1: (ThreadPool::worker()+0x28f) [0x5b08ff]
2: (ThreadPool::WorkThread::entry()+0xd) [0x4edb8d]
3: (Thread::_entry_func(void*)+0xa) [0x46892a]
4: (()+0x69ca) [0x7f889ff249ca]
5: (clone()+0x6d) [0x7f889f1446cd]
on the replica because it was active but the primary was restarting peering
due to the bad info.
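
For illustration only, the rule the change below enforces can be sketched
as a small standalone model. The Info and PG structs, the NotifyResult
enum, and handle_notify_info() are hypothetical stand-ins invented for
this sketch, not the real OSD types; only the three-way decision mirrors
the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-in for the per-peer pg info.
struct Info {
  std::uint64_t last_update;
};

// Hypothetical, simplified stand-in for a PG as seen by the primary.
struct PG {
  bool active = false;
  std::map<int, Info> peer_info;  // keyed by peer osd id
  bool is_active() const { return active; }
};

enum class NotifyResult { Duplicate, KeptOurs, Stored };

// Decide what to do with info for this pg arriving in a notify from `from`.
NotifyResult handle_notify_info(PG& pg, int from, const Info& incoming) {
  auto it = pg.peer_info.find(from);
  if (it != pg.peer_info.end() &&
      it->second.last_update == incoming.last_update) {
    // Same last_update as what we already hold: nothing to do.
    return NotifyResult::Duplicate;
  }
  if (it != pg.peer_info.end() && pg.is_active()) {
    // The pg is active, so the info we hold was refreshed at activation.
    // The notify may have crossed paths with that update; keep ours.
    return NotifyResult::KeptOurs;
  }
  // First info from this peer, or the pg is not active yet: record it.
  pg.peer_info[from] = incoming;
  return NotifyResult::Stored;
}

// Replays the race described above against the model.
void example() {
  PG pg;
  Info stale{10};
  Info fresh{20};
  assert(handle_notify_info(pg, 1, stale) == NotifyResult::Stored);
  pg.peer_info[1] = fresh;  // activation updates the primary's copy
  pg.active = true;
  assert(handle_notify_info(pg, 1, stale) == NotifyResult::KeptOurs);
  assert(pg.peer_info[1].last_update == 20);
}
```

The sketch compares only last_update, as the patched condition does; the
real info carries more state than this model shows.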
       if (pg->peer_info.count(from) &&
           pg->peer_info[from].last_update == it->last_update) {
-        dout(10) << *pg << " got dup osd" << from << " info " << *it << dendl;
+        dout(10) << *pg << " got dup osd" << from << " info " << *it << ", identical to ours" << dendl;
+      } else if (pg->peer_info.count(from) &&
+                 pg->is_active()) {
+        dout(10) << *pg << " got dup osd" << from << " info " << *it
+                 << " but pg is active, keeping our info " << pg->peer_info[from]
+                 << dendl;
       } else {
         dout(10) << *pg << " got osd" << from << " info " << *it << dendl;
         pg->peer_info[from] = *it;