Symptoms
- The cluster went down after 24 OSDs were added and marked in simultaneously.
- This was an erasure-coded (10+5) RGW cluster running Ceph Hammer.
- All the OSDs started failing, and eventually about 50% of them were down.
- Each OSD was consuming ~15 GB of RAM, and OSDs were hitting out-of-memory (OOM) errors.
- Manual attempts to bring them back up failed, and the OSD logs showed heartbeat failures like these:
    2018-07-18 08:58:12.794311 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.55 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
    2018-07-18 08:58:12.794315 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.57 since back 2018-07-18 08:45:42.452510 front 2018-07-18 08:45:42.452510 (cutoff 2018-07-18 08:57:12.794247)
    2018-07-18 08:58:12.794321 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.82 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
- OSD maps were out of sync, and peers were logging address mismatches ("wrong node"):
    2018-07-18 08:56:52.668789 7f4886d7b700 0 -- 10.33.49.153:6816/505502 >> 10.33.213.157:6801/2707 pipe(0x7f4a4f39d000 sd=26 :13251 s=1 pgs=233 cs=2 l=0 c=0x7f4a4f1b8980).connect claims to be 10.33.213.157:6801/1003787 not 10.33.213.157:6801/2707 - wrong node!
- Each OSD had ~3000 threads, most of them in a sleeping state.
- Using GDB to take a backtrace of all threads, we found that most of the active threads were just SimpleMessenger Pipe readers (see the sketch after this list).
- We initially suspected a memory leak in the Ceph code.
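A rough sketch of how we looked at the threads. The OSD id (127), the pgrep pattern, and the output path are illustrative; it assumes gdb is installed on the OSD node and that briefly pausing an already-wedged OSD is acceptable.

```sh
# Find the PID of one OSD (osd.127 is an example id; adjust the pattern for your host).
OSD_PID=$(pgrep -f 'ceph-osd .*-i 127' | head -n 1)

# Rough thread count - this was ~3000 per OSD in our case.
ls /proc/$OSD_PID/task | wc -l

# Backtrace of every thread, dumped to a file for offline inspection.
# Note: gdb stops the process while it collects the traces.
gdb -p "$OSD_PID" -batch -ex 'thread apply all bt' > /tmp/osd.127-threads.txt

# Most of the active threads showed up in SimpleMessenger Pipe::reader()/Pipe::writer().
grep -c 'Pipe::reader' /tmp/osd.127-threads.txt
```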
Band-aid Fixes
- Set the norebalance, norecover, and nobackfill flags (rough command sketches for these steps are collected after this list).
- Add swap memory on the OSD nodes.
- Tune the heartbeat intervals.
- Tune OSD map sync and set noout and nodown so the OSDs could catch up on their maps:
    $ sudo ceph daemon osd.148 status
    {
        "cluster_fsid": "621d76ce-a208-42d6-a15b-154fcb09xcrt",
        "osd_fsid": "09650e4c-723e-45e0-b2ef-5b6d11a6da03",
        "whoami": 148,
        "state": "booting",
        "oldest_map": 156518,
        "newest_map": 221059,
        "num_pgs": 1295
    }
- Tune the OSD map cache size down to 20.
- Find non-Ceph processes consuming network, CPU, and RAM, and kill them.
- Start the OSDs one by one - that worked for us :-)
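The cluster flags came first. A minimal sketch using the standard `ceph osd set`/`unset` flags, run from any node with admin credentials.

```sh
# Stop data movement so the struggling OSDs are not hammered by backfill/recovery traffic.
ceph osd set norebalance
ceph osd set norecover
ceph osd set nobackfill

# Keep flapping OSDs from being marked out or down while they sync their maps.
ceph osd set noout
ceph osd set nodown

# Once the cluster is healthy again, remove each of these with 'ceph osd unset <flag>'.
```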
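Swap bought the OOM-ing OSDs some headroom. A sketch, assuming a few spare GB on the root disk of each OSD node; the file path and the 8 GB size are arbitrary.

```sh
# Create and enable an 8 GB swap file on an OSD node.
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify the new swap is active.
swapon -s
```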
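Heartbeat and map-cache tuning was pushed into the running OSDs with `injectargs`; the option names are standard (Hammer-era) Ceph options, but the values below are illustrative, not a recommendation.

```sh
# Give peers more time before they report an OSD dead over the saturated network.
ceph tell osd.* injectargs '--osd_heartbeat_grace 60 --osd_heartbeat_interval 12'

# Shrink the in-memory OSD map cache (we went down to 20) to cut RAM usage while the
# OSDs chewed through the tens of thousands of map epochs they were behind on.
ceph tell osd.* injectargs '--osd_map_cache_size 20'

# Down OSDs cannot answer injectargs, so also add the same settings to the [osd]
# section of ceph.conf; that keeps them across restarts as well.
```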
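Hunting the non-Ceph resource hogs was plain Linux work; the tool choice below (ps, nethogs) is ours, and eth0 is an example interface name.

```sh
# Biggest RAM and CPU consumers that are not ceph-osd.
ps aux --sort=-%mem | grep -v ceph-osd | head -n 15
ps aux --sort=-%cpu | grep -v ceph-osd | head -n 15

# Per-process network usage (nethogs must be installed; replace eth0 with your NIC).
sudo nethogs eth0

# Kill the offender once identified; escalate to SIGKILL only if SIGTERM is ignored.
sudo kill <pid>
```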
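OSDs were then brought back one at a time, letting each finish booting and syncing maps before starting the next. A sketch, assuming sysvinit-managed Ceph as on our Hammer nodes; the OSD id list and the polling interval are placeholders.

```sh
# Example OSD ids on one host; on systemd hosts use 'systemctl start ceph-osd@<id>' instead.
for id in 148 149 150; do
    sudo /etc/init.d/ceph start osd.$id

    # Wait until the OSD reports itself active via its admin socket before moving on.
    until sudo ceph daemon osd.$id status | grep -q '"state": "active"'; do
        sleep 30   # meanwhile, keep an eye on 'ceph -s'
    done
done
```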
RCA
- The main culprit was a rogue process consuming massive network bandwidth on the OSD nodes (two quick checks for this are sketched after this list).
- With the bandwidth gone, many messenger threads were left waiting for the network.
- SimpleMessenger Pipe threads do synchronous I/O and block until their reads and writes get through.
- That is why a single OSD ended up with ~3000 threads and ~15 GB of memory consumed.
- With the network saturated, OSD heartbeats were blocked as well, so OSDs were either committing suicide on internal timeouts or being killed by the OOM killer.
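Two checks that back this up are sketched below: interface-level saturation, and where the OSD threads are actually waiting in the kernel. The OSD id and the pgrep pattern are examples; sar (sysstat) is assumed to be installed.

```sh
# 1. Interface throughput in 1-second samples: the NICs sat near line rate even
#    though Ceph itself was making no progress.
sar -n DEV 1 5

# 2. Kernel wait channels of one OSD's threads: on a wedged OSD nearly all of them
#    sat in socket send/receive waits - consistent with SimpleMessenger's blocking
#    Pipe I/O piling up on a saturated link.
OSD_PID=$(pgrep -f 'ceph-osd .*-i 127' | head -n 1)   # osd.127 is an example id
for t in /proc/$OSD_PID/task/*; do cat "$t/wchan"; echo; done | sort | uniq -c | sort -rn | head
```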
References
- OSD Map Sync tips: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg10187.html