Symptoms
- The cluster went down after 24 OSDs were added and marked in simultaneously.
- This was an erasure-coded (10+5) RGW cluster running Ceph Hammer.
- All the OSDs started failing, and eventually about 50% of them were down.
- Each OSD was consuming ~15 GB of RAM, and OSDs were hitting out-of-memory (OOM) errors.
- Manual attempts to bring them back up failed, and the OSD logs showed heartbeat failures like these:
    2018-07-18 08:58:12.794311 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.55 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
    2018-07-18 08:58:12.794315 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.57 since back 2018-07-18 08:45:42.452510 front 2018-07-18 08:45:42.452510 (cutoff 2018-07-18 08:57:12.794247)
    2018-07-18 08:58:12.794321 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.82 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
- OSD maps were out of sync, and peers were logging address mismatches ("wrong node"):
    2018-07-18 08:56:52.668789 7f4886d7b700 0 -- 10.33.49.153:6816/505502 >> 10.33.213.157:6801/2707 pipe(0x7f4a4f39d000 sd=26 :13251 s=1 pgs=233 cs=2 l=0 c=0x7f4a4f1b8980).connect claims to be 10.33.213.157:6801/1003787 not 10.33.213.157:6801/2707 - wrong node!
- Each OSD had ~3000 threads, most of them in a sleeping state.
- Using GDB to take a backtrace of all threads, we found that most of the active threads were just SimpleMessenger Pipe readers (see the sketch after this list).
- We initially suspected a memory leak in the Ceph code.
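A rough sketch of how we looked at the threads. The OSD id (127), the pgrep pattern, and the output path are illustrative; it assumes gdb is installed on the OSD node and that briefly pausing an already-wedged OSD is acceptable.

```sh
# Find the PID of one OSD (osd.127 is an example id; adjust the pattern for your host).
OSD_PID=$(pgrep -f 'ceph-osd .*-i 127' | head -n 1)

# Rough thread count - this was ~3000 per OSD in our case.
ls /proc/$OSD_PID/task | wc -l

# Backtrace of every thread, dumped to a file for offline inspection.
# Note: gdb stops the process while it collects the traces.
gdb -p "$OSD_PID" -batch -ex 'thread apply all bt' > /tmp/osd.127-threads.txt

# Most of the active threads showed up in SimpleMessenger Pipe::reader()/Pipe::writer().
grep -c 'Pipe::reader' /tmp/osd.127-threads.txt
```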
Band-aid Fixes
- Set the norebalance, norecover, and nobackfill flags (rough command sketches for these steps are collected after this list).
- Add swap memory on the OSD nodes.
- Tune the heartbeat intervals.
- Tune OSD map sync and set noout and nodown so the OSDs could catch up on their maps:
    $ sudo ceph daemon osd.148 status
    {
        "cluster_fsid": "621d76ce-a208-42d6-a15b-154fcb09xcrt",
        "osd_fsid": "09650e4c-723e-45e0-b2ef-5b6d11a6da03",
        "whoami": 148,
        "state": "booting",
        "oldest_map": 156518,
        "newest_map": 221059,
        "num_pgs": 1295
    }
- Tune the OSD map cache size down to 20.
- Find non-Ceph processes consuming network, CPU, and RAM, and kill them.
- Start the OSDs one by one - that worked for us :-)
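The cluster flags came first. A minimal sketch using the standard `ceph osd set`/`unset` flags, run from any node with admin credentials.

```sh
# Stop data movement so the struggling OSDs are not hammered by backfill/recovery traffic.
ceph osd set norebalance
ceph osd set norecover
ceph osd set nobackfill

# Keep flapping OSDs from being marked out or down while they sync their maps.
ceph osd set noout
ceph osd set nodown

# Once the cluster is healthy again, remove each of these with 'ceph osd unset <flag>'.
```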
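Swap bought the OOM-ing OSDs some headroom. A sketch, assuming a few spare GB on the root disk of each OSD node; the file path and the 8 GB size are arbitrary.

```sh
# Create and enable an 8 GB swap file on an OSD node.
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify the new swap is active.
swapon -s
```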
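Heartbeat and map-cache tuning was pushed into the running OSDs with `injectargs`; the option names are standard (Hammer-era) Ceph options, but the values below are illustrative, not a recommendation.

```sh
# Give peers more time before they report an OSD dead over the saturated network.
ceph tell osd.* injectargs '--osd_heartbeat_grace 60 --osd_heartbeat_interval 12'

# Shrink the in-memory OSD map cache (we went down to 20) to cut RAM usage while the
# OSDs chewed through the tens of thousands of map epochs they were behind on.
ceph tell osd.* injectargs '--osd_map_cache_size 20'

# Down OSDs cannot answer injectargs, so also add the same settings to the [osd]
# section of ceph.conf; that keeps them across restarts as well.
```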
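Hunting the non-Ceph resource hogs was plain Linux work; the tool choice below (ps, nethogs) is ours, and eth0 is an example interface name.

```sh
# Biggest RAM and CPU consumers that are not ceph-osd.
ps aux --sort=-%mem | grep -v ceph-osd | head -n 15
ps aux --sort=-%cpu | grep -v ceph-osd | head -n 15

# Per-process network usage (nethogs must be installed; replace eth0 with your NIC).
sudo nethogs eth0

# Kill the offender once identified; escalate to SIGKILL only if SIGTERM is ignored.
sudo kill <pid>
```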
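OSDs were then brought back one at a time, letting each finish booting and syncing maps before starting the next. A sketch, assuming sysvinit-managed Ceph as on our Hammer nodes; the OSD id list and the polling interval are placeholders.

```sh
# Example OSD ids on one host; on systemd hosts use 'systemctl start ceph-osd@<id>' instead.
for id in 148 149 150; do
    sudo /etc/init.d/ceph start osd.$id

    # Wait until the OSD reports itself active via its admin socket before moving on.
    until sudo ceph daemon osd.$id status | grep -q '"state": "active"'; do
        sleep 30   # meanwhile, keep an eye on 'ceph -s'
    done
done
```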
RCA
- The main culprit was a rogue process consuming massive network bandwidth on the OSD nodes (two quick checks for this are sketched after this list).
- With the bandwidth gone, many messenger threads were left waiting for the network.
- SimpleMessenger Pipe threads do synchronous I/O and block until their reads and writes get through.
- That is why a single OSD ended up with ~3000 threads and ~15 GB of memory consumed.
- With the network saturated, OSD heartbeats were blocked as well, so OSDs were either committing suicide on internal timeouts or being killed by the OOM killer.
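Two checks that back this up are sketched below: interface-level saturation, and where the OSD threads are actually waiting in the kernel. The OSD id and the pgrep pattern are examples; sar (sysstat) is assumed to be installed.

```sh
# 1. Interface throughput in 1-second samples: the NICs sat near line rate even
#    though Ceph itself was making no progress.
sar -n DEV 1 5

# 2. Kernel wait channels of one OSD's threads: on a wedged OSD nearly all of them
#    sat in socket send/receive waits - consistent with SimpleMessenger's blocking
#    Pipe I/O piling up on a saturated link.
OSD_PID=$(pgrep -f 'ceph-osd .*-i 127' | head -n 1)   # osd.127 is an example id
for t in /proc/$OSD_PID/task/*; do cat "$t/wchan"; echo; done | sort | uniq -c | sort -rn | head
```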
References
- OSD Map Sync tips: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10187.html