Wednesday, August 1, 2018

Ceph Outage with OSD Heartbeat Failures on Hammer (0.94.6)

Symptoms

  • The cluster went down after 24 OSDs were added and marked in simultaneously.
  • This was an erasure coded (10+5) RGW cluster on Hammer.
  • All the OSDs started failing, and eventually 50% of them were down.
  • Manual efforts to bring them back up failed, and we saw heartbeat failures in the OSD logs.
  • Each OSD was consuming ~15 GB of RAM, and OSDs were hitting out-of-memory (OOM) errors.
2018-07-18 08:58:12.794311 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.55 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
2018-07-18 08:58:12.794315 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.57 since back 2018-07-18 08:45:42.452510 front 2018-07-18 08:45:42.452510 (cutoff 2018-07-18 08:57:12.794247)
2018-07-18 08:58:12.794321 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.82 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
  • OSD maps were out of sync:
2018-07-18 08:56:52.668789 7f4886d7b700  0 -- 10.33.49.153:6816/505502 >> 10.33.213.157:6801/2707 pipe(0x7f4a4f39d000 sd=26 :13251 s=1 pgs=233 cs=2 l=0 c=0x7f4a4f1b8980).connect claims to be 10.33.213.157:6801/1003787 not 10.33.213.157:6801/2707 - wrong node!
  • Each OSD had ~3,000 threads, most of them in a sleeping state.
  • A GDB backtrace of all threads showed that most of the active ones were just SimpleMessenger Pipe readers (see the sketch after this list).
  • At first we suspected a memory leak in the Ceph code.
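
For reference, this is roughly how the thread pile-up can be inspected on an OSD node; the OSD id (127) and the pgrep pattern are only examples and may need adjusting to your ceph-osd command line:

# Find the PID of a suspect OSD (adjust the pattern to your setup)
$ OSD_PID=$(pgrep -f 'ceph-osd.*-i 127')
# Thread count of the process (~3000 in our case)
$ grep Threads /proc/$OSD_PID/status
# Dump a backtrace of every thread for offline analysis (needs gdb and debug symbols)
$ sudo gdb -p $OSD_PID --batch -ex 'thread apply all bt' > osd.127.threads.txt
# Most of the active threads showed up as SimpleMessenger Pipe readers
$ grep -c 'Pipe::reader' osd.127.threads.txt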

Band-aid Fixes

  • Setting norebalance, norecover, and nobackfill (rough command sketches for these fixes follow the list)

  • Adding swap memory on the OSD nodes

  • Tuning heartbeat interval

  • Tuning OSD map sync and setting noout and nodown to let OSDs catch up on their maps:

$ sudo ceph daemon osd.148 status
{
    "cluster_fsid": "621d76ce-a208-42d6-a15b-154fcb09xcrt",
    "osd_fsid": "09650e4c-723e-45e0-b2ef-5b6d11a6da03",
    "whoami": 148,
    "state": "booting",
    "oldest_map": 156518,
    "newest_map": 221059,
    "num_pgs": 1295
}
  • Tuning OSD map cache size to 20
  • Finding processes other than Ceph
    • Processes consuming network, CPU, and RAM
    • Killing them
  • Starting OSDs one by one - that worked for us :-)
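
A rough sketch of the flag and tuning commands above; the heartbeat values are illustrative, and injectargs only reaches OSDs that are currently up (the same options go into ceph.conf under [osd] to survive restarts):

# Stop data movement while the cluster is unstable
$ sudo ceph osd set norebalance
$ sudo ceph osd set norecover
$ sudo ceph osd set nobackfill
# Keep flapping OSDs from being marked out/down while they catch up on maps
$ sudo ceph osd set noout
$ sudo ceph osd set nodown
# Relax heartbeat timing and shrink the map cache at runtime (values are illustrative)
$ sudo ceph tell osd.* injectargs '--osd-heartbeat-grace 60 --osd-heartbeat-interval 12'
$ sudo ceph tell osd.* injectargs '--osd-map-cache-size 20'
# (Each flag is removed later with: sudo ceph osd unset <flag>)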
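
The swap band-aid is plain Linux, nothing Ceph-specific; size and path here are arbitrary:

# Create and enable a 16 GB swap file (use dd instead of fallocate if the filesystem does not support it)
$ sudo fallocate -l 16G /swapfile
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
# Verify
$ swapon -s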
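
Hunting down non-Ceph resource hogs and bringing OSDs back one at a time looked roughly like this; the interface name, PID, and OSD id are placeholders, and the start command depends on your init system:

# Per-interface throughput and per-process network usage (nethogs, if installed)
$ sar -n DEV 1 5
$ sudo nethogs eth0
# Memory and CPU hogs
$ ps aux --sort=-%mem | head
$ ps aux --sort=-%cpu | head
# Kill the offending non-Ceph process (PID is a placeholder)
$ sudo kill -9 <PID>
# Then start OSDs one at a time and wait for each to finish booting
$ sudo /etc/init.d/ceph start osd.148    # or: systemctl start ceph-osd@148 / start ceph-osd id=148
$ sudo ceph daemon osd.148 status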

RCA

  • The major culprit was a rogue process that was consuming massive network bandwidth on the OSD nodes.
  • With little bandwidth left, many messenger threads were simply stuck waiting.
  • SimpleMessenger threads do blocking (synchronous) I/O and wait until their data gets through.
  • That is one of the reasons an OSD ended up with ~3,000 threads and ~15 GB of memory.
  • With the network saturated, OSD heartbeat messages were blocked as well, so OSDs were either committing suicide or dying of OOM (see the sketch below).
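
In hindsight, the saturated link and the stuck messenger pipes can be confirmed straight from an OSD node with standard tools, roughly like this:

# NIC throughput per second; rxkB/s + txkB/s pinned near line rate means a saturated link
$ sar -n DEV 1 3
# Established ceph-osd sockets with a non-draining Send-Q are messenger pipes stuck waiting
$ sudo ss -tnp | grep ceph-osd | awk '$3 > 0'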

