Problem
We started to see multiple near-full index OSDs in a Luminous cluster.
$ sudo ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
351 ssd 0.14999 0.79999 152G 123G 29727M 81.03 3.25 18 osd.351
The near-full warning is triggered for OSDs that have used more than 85% of their capacity (the default nearfull ratio).
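To confirm which OSDs are flagged and which thresholds the cluster is actually using, the following checks are handy (a quick sketch; output omitted):
# OSDs currently flagged as nearfull/backfillfull/full
$ sudo ceph health detail | grep -i full
# Ratios stored in the OSDMap (Luminous defaults: 0.85 / 0.90 / 0.95)
$ sudo ceph osd dump | grep ratio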
Environment
- The cluster runs a Ceph 12.2.5.2 build on Debian 9.
- Index OSDs run on two 355G SSD-backed volumes per instance.
- Ceph OSDs use LVM volumes.
- The RGW index pool uses dedicated OSDs; no other pool is placed on them.
- Traffic inflow was business as usual (BAU).
Problem Analysis
The index OSDs are meant to store RGW bucket index information. Index entries are small (~500B per entry for user objects) and are stored in OMAP. BlueStore writes small objects to the WAL first and later flushes them to OMAP, while larger objects are written directly to the BlueStore block device.
The threshold that separates small and large writes is an OSD tunable:
# ceph daemon osd.351 config show|grep min_alloc
"bluestore_min_alloc_size": "0",
"bluestore_min_alloc_size_hdd": "65536",
"bluestore_min_alloc_size_ssd": "16384",
BlueStore uses a default minimum allocation size of 16KB for SSDs. Choosing a higher or lower value is a trade-off between lower write amplification and lower fragmentation.
Coming back to the index OSD layout: we use a large DB for OMAP and a small block device, since most objects are small and end up in OMAP. OMAP can use the block device for a spillover; a spillover means the OMAP DB is full and reserves extra space on the BlueStore block device.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 254:0 0 10G 0 disk
`-vda1 254:1 0 10G 0 part /
vdb 254:16 0 355G 0 disk
|-vdbVG-wal--vdb 253:0 0 2G 0 lvm
|-vdbVG-db--vdb 253:1 0 200G 0 lvm
`-vdbVG-block--vdb 253:2 0 153G 0 lvm
vdc 254:32 0 355G 0 disk
|-vdcVG-wal--vdc 253:3 0 2G 0 lvm
|-vdcVG-db--vdc 253:4 0 200G 0 lvm
`-vdcVG-block--vdc 253:5 0 153G 0 lvm
Our index OSDs have a 2G WAL, a 200G DB (OMAP), and a 153G block device.
How is OSD utilization calculated?
OSD utilization is measured by the usage of the block device. That means the block device was being consumed, which should not happen: the OMAP DB is large enough to hold all the index data, and since the entries are small we did not expect any other use of the block device. The conclusion was that BlueStore was using the block device to store OMAP spillover data.
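A quick way to confirm this without taking the OSD apart is the BlueFS perf counters; the counter names below are the ones this Luminous build exposes, so treat this as a sketch and double-check them on your version:
# BlueFS space accounting on a suspect OSD (run on the OSD host)
$ sudo ceph daemon osd.351 perf dump | grep -E '(db|slow)_(total|used)_bytes'
# A non-zero slow_used_bytes means RocksDB data has spilled over to the block device.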
We also noticed a problem with the auto-detection of the media type:
2018-11-09 15:53:49.918135 7f9b51df1e00 1 bdev(0x55936bd64480 /var/lib/ceph/osd/ceph-70/block) open size 164278304768 (0x263fc00000, 152 GB) block_size 4096 (4096 B) rotational
2018-11-09 15:53:49.918620 7f9b51df1e00 1 bdev(0x55936bd65200 /dev/vdcVG/db-vdc) open size 214748364800 (0x3200000000, 200 GB) block_size 4096 (4096 B) rotational
Both devices are SSD-backed, so the media type should have been logged as non-rotational.
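To cross-check the detection, compare the kernel's view of the device with what the OSD reports about itself (the exact metadata field names can vary a bit between releases, so this is only a sketch):
# Kernel view: 0 = non-rotational (SSD), 1 = rotational (HDD)
$ cat /sys/block/vdb/queue/rotational
# OSD view
$ sudo ceph osd metadata 351 | grep -i rotational
# Note: virtio block devices (vd*) often advertise themselves as rotational
# unless the flag is overridden, which would explain the log above.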
Verifying OMAP spillover
BlueStore data is not directly browsable, but the ceph-bluestore-tool utility can export the BlueFS part of a BlueStore OSD. Since the OMAP DB is 200G, we need a mount point large enough to hold it, so we stopped the OSD on another disk and used that whole disk as the export target for BlueFS.
# Zap the disk
$ sgdisk -Z /dev/vdb
# Create a file system
$ mkfs.xfs /dev/vdb -K -f
# Mount it
$ mkdir -p /mnt/osd34
$ mount /dev/vdb /mnt/osd34
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 254:0 0 10G 0 disk
`-vda1 254:1 0 10G 0 part /
vdb 254:16 0 355G 0 disk /mnt/osd34
vdc 254:32 0 355G 0 disk
|-vdcVG-wal--vdc 253:0 0 2G 0 lvm
|-vdcVG-db--vdc 253:1 0 200G 0 lvm
`-vdcVG-block--vdc 253:2 0 153G 0 lvm
# Get BlueFS data
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-34 --out-dir /mnt/osd34
From the exported data we can find out the details of space consumption.
/mnt/osd34# du -sh -l
119G
/mnt/osd34/db# du -sh .
28G
/mnt/osd34/db# ls -lh|less
total 28G
-rw-r--r-- 1 root root 66M Nov 12 16:25 072637.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072638.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072639.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072640.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072641.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072642.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072643.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 074380.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 078646.sst
Let’s check the status of db.slow. The db.slow directory is the space reserved by OMAP on the BlueStore block device; it holds the spillover data.
/mnt/osd34/db.slow# du -sh .
91G
/mnt/osd34/db.slow# ls -lh|less
total 91G
-rw-r--r-- 1 root root 66M Nov 12 16:27 041139.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 041423.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 042097.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 043530.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 044022.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 046002.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 046615.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 047100.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 048891.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 048892.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 049678.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 052509.sst
Why did the spillover happen while the DB was hardly 15% utilized?
BlueStore OMAP is based on RocksDB, which uses an LSM-tree structure to store data as key-value pairs. The LSM tree has multiple levels, referred to as L0...Lmax, and each level has an upper limit on its size. For L0 and L1 the limit is 256MB; L2 is 10x L1, L3 is 10x L2, and so on. Let’s take a look at the RocksDB config in our index OSD:
# Add the following to /etc/ceph/ceph.conf and restart the OSD.
# Alternatively, you can set it through the admin socket.
debug rocksdb = 20/20
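The admin-socket route changes it at runtime without a restart; adjust the OSD id to your own:
# Raise RocksDB log verbosity on a running OSD
$ sudo ceph daemon osd.34 config set debug_rocksdb 20/20
# or remotely, via injectargs
$ sudo ceph tell osd.34 injectargs '--debug_rocksdb=20/20'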
$ vim ceph-osd.34.log
2018-11-13 23:00:49.402483 7fcd1c906e00 0 set rocksdb option compaction_readahead_size = 2097152
2018-11-13 23:00:49.402498 7fcd1c906e00 0 set rocksdb option compression = kNoCompression
2018-11-13 23:00:49.402502 7fcd1c906e00 0 set rocksdb option max_write_buffer_number = 4
2018-11-13 23:00:49.402505 7fcd1c906e00 0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-11-13 23:00:49.402530 7fcd1c906e00 0 set rocksdb option max_write_buffer_number = 4
2018-11-13 23:00:49.402532 7fcd1c906e00 0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-11-13 23:00:49.402540 7fcd1c906e00 0 set rocksdb option write_buffer_size = 268435456
2018-11-13 23:00:49.402550 7fcd1c906e00 10 rocksdb: do_open db_path db size 204010946560
2018-11-13 23:00:49.402553 7fcd1c906e00 10 rocksdb: do_open db_path db.slow size 156064389529
2018-11-13 23:00:49.404399 7fcd1c906e00 4 rocksdb: Options.target_file_size_base: 67108864
2018-11-13 23:00:49.404400 7fcd1c906e00 4 rocksdb: Options.target_file_size_multiplier: 1
2018-11-13 23:00:49.404407 7fcd1c906e00 4 rocksdb: Options.max_bytes_for_level_base: 268435456
2018-11-13 23:00:49.404409 7fcd1c906e00 4 rocksdb: Options.level_compaction_dynamic_level_bytes: 0
2018-11-13 23:00:49.404410 7fcd1c906e00 4 rocksdb: Options.max_bytes_for_level_multiplier: 10.000000
The above logs tell us:
- target_file_size_base implies an SST file size of ~67MB (67108864 bytes).
- All levels use the same SST file size (target_file_size_multiplier is 1); only the number of files per level varies.
- max_bytes_for_level_base sets the size of the base level; it is 256MB.
- max_bytes_for_level_multiplier defines the growth factor between levels; for example, L2 is 10 times the size of L1.
So in our OSD, we will have the following levels:
L0: 256MB
L1: 256MB
L2: 2.5GB
L3: 25GB
L4: 250GB
We can hold all levels up to L3 in our 200GB DB. L4 (250GB) spills over to db.slow, because a level has to fit completely in the space provided. The total size of L0+L1+L2+L3 is ~28GB, which matches the usage we measured in the ‘db’ directory earlier.
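The level sizes follow directly from max_bytes_for_level_base and max_bytes_for_level_multiplier; a quick sanity check of the arithmetic:
# Level size = base * multiplier^(level-1), with base=256MB and multiplier=10 (values from the OSD log above)
$ base=$((256*1024*1024)); mult=10
$ for lvl in 1 2 3 4; do echo "L$lvl: $(( base * mult ** (lvl-1) / 1024 / 1024 )) MB"; done
L1: 256 MB
L2: 2560 MB
L3: 25600 MB
L4: 256000 MB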
Let’s understand the RocksDB compaction logs:
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L0 1/0 194.99 MB 1.6 0.0 0.0 0.0 66.8 66.8 0.0 1.0 0.0 99.1 691 318 2.172 0 0
L1 4/0 162.73 MB 0.6 115.2 66.6 48.5 115.1 66.6 0.0 1.7 70.9 70.8 1664 185 8.995 509M 262K
L2 45/0 2.39 GB 1.0 110.5 12.0 98.4 110.3 11.9 54.7 9.2 71.3 71.2 1587 196 8.096 474M 472K
L3 385/0 23.04 GB 1.0 154.4 25.9 128.5 146.3 17.8 41.4 5.6 71.3 67.6 2216 347 6.387 769M 28M
L4 1398/0 86.51 GB 0.3 469.4 65.9 403.5 396.4 -7.1 0.0 6.0 84.4 71.2 5699 513 11.109 1493M 374M
Sum 1833/0 112.29 GB 0.0 849.5 170.5 679.0 834.9 155.9 96.0 12.5 73.4 72.1 11857 1559 7.605 3246M 403M
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0
The above log shows that L4 has 1398 SST files. The score column indicates the need for compaction: if it is greater than one, compaction is required. In total there are 1833 SST files in the current RocksDB.
Every time an SST file is generated, RocksDB emits a log like the following:
2018-11-14 06:19:03.159109 7fcd07533700 4 rocksdb: [default] [JOB 3056] Generated table #103329: 176109 keys, 69620849 bytes
2018-11-14 06:19:03.159133 7fcd07533700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1542156543159124, "cf_name": "default", "job": 3056, "event": "table_file_creation", "file_number": 103329, "file_size": 69620849, "table_properties": {"data_size": 67109663, "index_size": 1918260, "filter_size": 591939, "raw_key_size": 22863898, "raw_average_key_size": 129, "raw_value_size": 57166437, "raw_average_value_size": 324, "num_data_blocks": 17578, "num_entries": 176109, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": "0"}}
The above log gives us the average key and value sizes, 129 bytes and 324 bytes, along with the entry count: 176109 entries in a ~67MB table (a quick arithmetic cross-check follows below). Our investigation so far concluded that we indeed had a lot of genuine data in the index OSDs, so we decided to find out what data we were storing.
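As a rough cross-check of those numbers (plain arithmetic on the figures from the table_file_creation event, nothing Ceph-specific):
# On-disk bytes per entry vs. raw key+value bytes per entry
$ echo $(( 69620849 / 176109 ))
395
$ echo $(( 129 + 324 ))
453
The per-entry footprint on disk comes out slightly below the raw key+value sum, which is plausible given how RocksDB's block-based table format encodes keys within blocks.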
What is in an SST file?
An SST file is a sorted sequence of keys and can be dumped using the sst_dump utility, which ships with the Ceph sources (pick the exact Ceph source version that is installed on the OSD). We have to build the sst_dump tool first:
$ cd ceph/src/rocksdb/
$ make sst_dump -j4
The keys are easier to work with when printed in hex:
ceph/src/rocksdb$ ./sst_dump --file=../../../085894.sst --command=scan --read_num=5 --output_hex
from [] to []
Process ../../../085894.sst
Sst file format: block-based
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436336231373937313039393438633736343337623138663434626238' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E5235626664363362313739373130393934386337363433376231386634346262380B280000000000000105037A0000000116F2000000000000BA55BF5BA0E0B2102000000064656466316331633765616433313337636636626463326238616439303337340F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646616F200000000000000000000000000000000000001010400000014820B2882E01F3500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31343836343235352E333035383436380000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436396133396238353834646166306238353034666233633365366431' seq:0, type:1 => 080349010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E5235626664363961333962383538346461663062383530346662336333653664318B260000000000000105037A0000000164F10000000000005268BE5BF2576F2A2000000066323461343235333138346165613531626266383630343864343237396265620F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646664F100000000000000000000000000000000000001010400000014828B26822B623400000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31393631363135302E3534343338370000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461306137323432373535313534363361366164636661653065316631' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461306137323432373535313534363361366164636661653065316631D4270000000000000105037A0000000162EC0000000000000C5DBE5BB63F0F192000000039326264363564326363653162333664663562383434616561343732313763370F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646662EC0000000000000000000000000000000000000101040000001482D427822A413500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353033383236372E323836353536380000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461626334636536363739663132353035306264653366383865636237' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E52356266646162633463653636373966313235303530626465336638386563623780250000000000000105037A00000001DBEE0000000000007C86BE5BC1B7231B2000000037653637336561626638383437613130636163346434353661373466633065610F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F706466DBEE0000000000000000000000000000000000000101040000001482802582D4A43500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353033373237372E323931363231390000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666530373461396331386362376632393964326462623639333135343231' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E52356266653037346139633138636237663239396432646262363933313534323169130000000000000105037A00000001A7F1000000000000C80CBF5B8ABCFB2B2000000030336663643132646435613139623539613137343661643035643261316663380F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F706466A7F10000000000000000000000000000000000000101040000001482691382BA703500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353032373932322E323939383234340000000000000000000000000000
The output is in the form of (key, seq, type, value). To make sense of the keys, we can use Python to translate the hex into ASCII.
./ceph/src/rocksdb$ python
Python 2.7.9 (default, Sep 25 2018, 20:42:16)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> '4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436336231373937313039393438633736343337623138663434626238'.decode("hex")
'M\x00\x00\x00\x00\x00\x00\x0e\xca\x9d.FA non retail/pdf/ABCD_hello_there_ABCD/ABCD_plane_international_FANR5bfd63b1797109948c76437b18f44bb8'
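On hosts without Python 2, plain xxd does the same hex-to-ASCII translation (a convenience only; substitute the actual hex key you want to inspect):
$ echo '<hex key from sst_dump>' | xxd -r -p | strings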
So the keys are valid and we are holding genuine data in our index OSDs. We can also cross-check by listing the index pool objects directly from RADOS, including their OMAP keys and values.
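For example (the pool name below is the usual default for an RGW index pool and may differ in your deployment; the object name is a placeholder):
# List a few bucket index shard objects
$ sudo rados -p default.rgw.buckets.index ls | head -5
# Dump the OMAP keys / values of one index shard object
$ sudo rados -p default.rgw.buckets.index listomapkeys <index-object-name>
$ sudo rados -p default.rgw.buckets.index listomapvals <index-object-name>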
How to reduce the data size?
One option is running a manual compaction on the current RocksDB, but that alone would not help: the compaction stats show that, in spite of compaction, the data itself was huge. We also explored RocksDB's dynamic level sizing by setting level_compaction_dynamic_level_bytes to 1, but it is currently disabled and unsupported in Ceph. We can modify the RocksDB configuration using the following in /etc/ceph/ceph.conf, keeping in mind that most of these options are effectively immutable after OSD creation.
bluestore rocksdb options = "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152",
# We can get this info from the admin socket
# ceph --admin-daemon /var/run/ceph/ceph-osd.xx.asok config show|grep rocksdb
"debug_rocksdb": "20/20",
"bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152",
Why are we getting so much data?
We had the following suspects:
- Rogue client
- Uncontrolled data ingestion by an account
- Any activity in RGW that generated index data
We did not find anything suspicious in the RGW logs (enabled using debug rgw = 20/20), and all PUT requests looked sane and valid. However, there were a few log entries around bucket resharding. The OSD logs (enabled using debug osd = 20/20) also showed many requests for RGW bucket reshard operations.
ceph-rgw-log.1.gz:2018-11-14 10:14:13.602194 7f30e6d66700 0 check_bucket_shards: resharding needed: stats.num_objects=268919 shard max_objects=200000
ceph-rgw-.log.1.gz:2018-11-14 10:14:13.680114 7f30c2d1e700 0 check_bucket_shards: resharding needed: stats.num_objects=268919 shard max_objects=200000
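To see how close each bucket is to the per-shard threshold, radosgw-admin provides a limit check (run from a node with RGW admin credentials; output omitted here):
$ sudo radosgw-admin bucket limit check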
A quick search around bucket index resharding turned up a tracker: http://tracker.ceph.com/issues/34307. Dynamic bucket resharding kicks in when a bucket exceeds 100k entries per shard and splits the index into more, smaller shards. However, it leaves the old bucket index behind, and that causes the data to pile up. Every reshard therefore duplicated index data, and that was the cause of our near-full OSDs.
"rgw_max_objs_per_shard": "100000",
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.asok config show|grep dynamic
"rgw_dynamic_resharding": "true",
We disabled dynamic bucket resharding, and data growth on the index OSDs returned to normal.
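For reference, a sketch of how the change can be applied; the RGW instance name in the config section is a placeholder, and the persistent change needs an RGW restart:
# Persistent: in /etc/ceph/ceph.conf under the RGW client section, then restart radosgw
[client.rgw.<instance>]
rgw dynamic resharding = false
# Runtime only (until the next restart): via the RGW admin socket
$ sudo ceph --admin-daemon /var/run/ceph/ceph-client.rgw.asok config set rgw_dynamic_resharding false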
References
- https://www.slideshare.net/VikhyatUmrao/cephalocon-apac-china/15
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023141.html
- http://tracker.ceph.com/issues/34307
- http://tracker.ceph.com/issues/24082
- https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool
- https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
- https://github.com/facebook/rocksdb/wiki/Delete-Stale-Files
- https://github.com/facebook/rocksdb/wiki/How-we-keep-track-of-live-SST-files
- https://github.com/facebook/rocksdb/wiki/Benchmarking-tools
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029851.html
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025283.html
- https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
- https://github.com/facebook/rocksdb/commit/65a9cd616876c7a1204e1a50990400e4e1f61d7e
- https://tracker.ceph.com/issues/23510
Written with StackEdit.