Wednesday, November 21, 2018

Bluestore Internals

## Bluestore Discussions
If WAL is full what would happen? Would writes block?

It never blocks; it will always just spill over onto the next fastest
device (wal -> db -> main). Note that there is no value to a db partition
if it is on the same device as the main partition.

Would a drastic (quick) action to correct a too-small DB partition
(impacting performance) be to destroy the OSD and rebuild it with a
larger DB partition?
Yes.

I would check your running Ceph clusters and calculate the number of objects per OSD:
total objects / num OSDs * 3
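
As a rough illustration, here is a minimal sketch of that rule of thumb; the cluster numbers below are made up, and the * 3 is assumed to be the replication factor of the pool.

# Rule-of-thumb estimate of objects per OSD (hypothetical cluster numbers;
# the * 3 is assumed to be the replication factor).
total_objects = 120000000   # e.g. from `ceph df`
num_osds = 350

objects_per_osd = total_objects / float(num_osds) * 3
print("~{:,.0f} objects per OSD".format(objects_per_osd))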

For the moment though, having multiple (4)
256MB WAL buffers appears to give us the best performance despite
resulting in large memtables, so 1-2GB for the WAL is right.

A tool to gather complete Ceph cluster information
https://github.com/42on/ceph-collect

Bluestore onode size is ~24KB for an average object size of 2.8MB in RBD, so the object count and onode overhead per TB can be estimated.
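
A back-of-the-envelope sketch using those figures (assuming 1 TiB of stored RBD data and taking the 24KB onode and 2.8MB average object size above as givens):

# Estimate object count and onode metadata overhead per TiB of RBD data,
# using the ~24KB onode size and ~2.8MB average object size quoted above.
avg_object_size = 2.8 * 1024 ** 2   # bytes
onode_size = 24 * 1024              # bytes
data_per_tib = 1024 ** 4            # 1 TiB of stored data

objects_per_tib = data_per_tib / avg_object_size
onode_bytes = objects_per_tib * onode_size
print("objects per TiB : {:,.0f}".format(objects_per_tib))
print("onode GB per TiB: {:.1f}".format(onode_bytes / 1024.0 ** 3))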

## Reference
http://ceph-users.ceph.narkive.com/8uPMEXNz/bluestore-osd-data-wal-db


Thursday, November 15, 2018

Updating systemctl limits on Debian

  1. Become root.

  2. vi /lib/systemd/system/ceph-osd@.service

  3. Change the LimitNPROC and LimitNOFILE values to the following (extracted from ulimit -a):

[Service]  
LimitNOFILE=78452  
LimitNPROC=80248

  4. $ sudo systemctl daemon-reload

  5. Restart the OSDs.

$ cat update-systemctl.sh  
for ip in $(cat ip.list)  
do  
scp ceph-osd@.service $ip:/tmp
ssh $ip sudo cp /tmp/ceph-osd@.service /lib/systemd/system/ceph-osd@.service  
ssh $ip sudo systemctl daemon-reload  
ssh $ip sudo systemctl restart ceph\*.service ceph\*.target  
done
$ cat ceph-osd@.service  
[Unit]  
Description=Ceph object storage daemon osd.%i  
After=network-online.target local-fs.target time-sync.target ceph-mon.target  
Wants=network-online.target local-fs.target time-sync.target  
PartOf=ceph-osd.target

[Service]  
LimitNOFILE=78452  
LimitNPROC=80248  
EnvironmentFile=-/etc/default/ceph  
Environment=CLUSTER=ceph  
ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph  
ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i  
ExecReload=/bin/kill -HUP $MAINPID  
ProtectHome=true  
ProtectSystem=full  
PrivateTmp=true  
TasksMax=infinity  
Restart=on-failure  
StartLimitInterval=30min  
StartLimitBurst=0  
RestartSec=20s

[Install]  
WantedBy=ceph-osd.target
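
To confirm the new limits actually took effect, a quick check is to read /proc/<pid>/limits for each running ceph-osd. A minimal sketch (it assumes pidof is available on the host and the OSDs have already been restarted):

# Print the effective NOFILE/NPROC limits of every running ceph-osd process.
import subprocess

pids = subprocess.check_output(["pidof", "ceph-osd"]).decode().split()
for pid in pids:
    with open("/proc/%s/limits" % pid) as f:
        for line in f:
            if "Max open files" in line or "Max processes" in line:
                print("pid %s: %s" % (pid, line.strip()))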


How to debug RocksDB issues in Bluestore


Problem

We started to see multiple near-full index OSDs in a Luminous cluster.

$ sudo ceph osd df tree
  ID    CLASS WEIGHT    REWEIGHT SIZE   USE     AVAIL   %USE  VAR  PGS TYPE NAME
  351   ssd   0.14999  0.79999   152G   123G   29727M   81.03 3.25  18                 osd.351

The near-full message is triggered for OSDs that are more than 85% used.
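
The 85% threshold comes from the cluster's nearfull ratio. A quick way to confirm what the cluster is using (a sketch; on Luminous the ratios are part of the OSD map, so `ceph osd dump` prints them, though the exact output format may vary):

# Print the full/backfillfull/nearfull ratios from the OSD map.
import subprocess

osd_dump = subprocess.check_output(["ceph", "osd", "dump"]).decode()
for line in osd_dump.splitlines():
    if "ratio" in line:
        print(line)   # expect e.g. "nearfull_ratio 0.85"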

Environment

  • The cluster runs Ceph 12.2.5.2 on Debian 9.
  • Index OSDs run on two 350G SSDs.
  • Ceph OSDs use LVM volumes.
  • The RGW index pool uses dedicated OSDs; no other pool uses these OSDs.
  • Traffic inflow was business as usual (BAU).

Problem Analysis

The index OSDs are meant to store RGW bucket index information. Index data consists primarily of small entries (~500B per entry for user objects) and is stored in the OMAP. Bluestore stores small objects in the WAL and later flushes them to the OMAP; larger objects are written directly to the Bluestore block.

The configuration for small and large objects is an OSD tunable:

# ceph daemon osd.351 config show|grep min_alloc
    "bluestore_min_alloc_size": "0",
    "bluestore_min_alloc_size_hdd": "65536",
    "bluestore_min_alloc_size_ssd": "16384",

Bluestore uses a default minimum allocation (block) size of 16KB for SSDs. Choosing a larger or smaller value is a tradeoff between lower write amplification and lower fragmentation.
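
To make the tradeoff concrete, here is a minimal sketch (not Bluestore's actual allocator logic) of how much space a single small write consumes at different allocation sizes:

# Space consumed by one small write at different bluestore_min_alloc_size
# values: larger units waste more space per small write, smaller units mean
# more extents to track (fragmentation).  Simplified model only.
import math

def allocated_bytes(write_size, min_alloc):
    return int(math.ceil(write_size / float(min_alloc))) * min_alloc

write_size = 500  # a ~500B index-style entry, if it landed on the block
for min_alloc in (4096, 16384, 65536):
    print("min_alloc=%5d -> %5d bytes allocated for a %dB write"
          % (min_alloc, allocated_bytes(write_size, min_alloc), write_size))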

Coming back to the index OSD layout, we use a large DB for the OMAP and a small block (since most of the objects are small, they should land in the OMAP). The OMAP can use the block for spillover: a spillover means the OMAP is full and reserves extra space in the Bluestore block.

# lsblk
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                254:0    0   10G  0 disk
`-vda1             254:1    0   10G  0 part /
vdb                254:16   0  355G  0 disk
|-vdbVG-wal--vdb   253:0    0    2G  0 lvm
|-vdbVG-db--vdb    253:1    0  200G  0 lvm
`-vdbVG-block--vdb 253:2    0  153G  0 lvm
vdc                254:32   0  355G  0 disk
|-vdcVG-wal--vdc   253:3    0    2G  0 lvm
|-vdcVG-db--vdc    253:4    0  200G  0 lvm
`-vdcVG-block--vdc 253:5    0  153G  0 lvm

pSA-e index OSDs have a 2G WAL, 200G OMAP, and 153G block.

How is OSD utilization calculated?

OSD utilization is measured by usage of the block. That means the block was being consumed, which should not happen, because the OMAP (DB) partition is large enough to hold all the data. Since the data entries are small, we did not expect any other use of the block, so Bluestore had to be using the block to store OMAP spillover data.
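
A quick cross-check of the %USE column against the osd.351 row shown earlier:

# Values copied from the `ceph osd df tree` output above for osd.351.
size_gb = 152.0     # SIZE
use_gb = 123.0      # USE
avail_mb = 29727.0  # AVAIL

print("computed %%USE : %.1f" % (use_gb / size_gb * 100))       # ~80.9, close to 81.03
print("USE + AVAIL   : %.1f GB" % (use_gb + avail_mb / 1024))   # ~152 GB, the block size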

There is a problem with auto-detection of media type:

2018-11-09 15:53:49.918135 7f9b51df1e00 1 bdev(0x55936bd64480 /var/lib/ceph/osd/ceph-70/block) open size 164278304768 (0x263fc00000, 152 GB) block_size 4096 (4096 B) rotational

2018-11-09 15:53:49.918620 7f9b51df1e00 1 bdev(0x55936bd65200 /dev/vdcVG/db-vdc) open size 214748364800 (0x3200000000, 200 GB) block_size 4096 (4096 B) rotational

It should have logged media type as non-rotational.

Verifying OMAP spillover

Bluestore data is not directly browsable. The ceph-bluestore-tool utility can export the BlueFS part of a Bluestore OSD. Since the OMAP DB size is 200G, we need a destination file system that can hold it, so we stopped the OSD on another disk and used that whole disk as the export target for BlueFS.

# Zap the disk
$ sgdisk -Z /dev/vdb
 
# Create a file system
$ mkfs.xfs /dev/vdb -K -f
 
# Mount it
$ mkdir -p /mnt/osd34
$ mount /dev/vdb /mnt/osd34

# lsblk
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                254:0    0   10G  0 disk
`-vda1             254:1    0   10G  0 part /
vdb                254:16   0  355G  0 disk /mnt/osd34
vdc                254:32   0  355G  0 disk
|-vdcVG-wal--vdc   253:0    0    2G  0 lvm
|-vdcVG-db--vdc    253:1    0  200G  0 lvm
`-vdcVG-block--vdc 253:2    0  153G  0 lvm
 
# Get BlueFS data
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-34 --out-dir /mnt/osd34

We can find out the details of space consumption.

/mnt/osd34# du -sh -l
119G
 
/mnt/osd34/db# du -sh .
28G
 
/mnt/osd34/db# ls -lh|less
total 28G
-rw-r--r-- 1 root root  66M Nov 12 16:25 072637.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072638.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072639.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072640.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072641.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072642.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072643.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 074380.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 078646.sst

Let’s check db.slow. The db.slow directory is the space reserved by the OMAP inside the Bluestore block; it holds the spillover data.

/mnt/osd34/db.slow# du -sh .
91G
 
/mnt/osd34/db.slow# ls -lh|less
total 91G
-rw-r--r-- 1 root root  66M Nov 12 16:27 041139.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 041423.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 042097.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 043530.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 044022.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 046002.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 046615.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 047100.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 048891.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 048892.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 049678.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 052509.sst

Why did the spillover happen while the DB was hardly 15% utilized?

Bluestore's OMAP is based on RocksDB, which uses an LSM tree to store data as key-value pairs. The LSM tree has multiple levels, known as L0...Lmax, and each level has an upper limit on its size. Here, L0 and L1 are 256MB each, L2 is 10 * L1, L3 is 10 * L2, and so on. Let's take a look at the RocksDB config in our index OSD:

# Add the following to /etc/ceph/ceph.conf and restart the OSD.
# Alternatively, you can set it through the admin socket.
debug rocksdb = 20/20

$vim ceph-osd.34.log
2018-11-13 23:00:49.402483 7fcd1c906e00  0  set rocksdb option compaction_readahead_size = 2097152
2018-11-13 23:00:49.402498 7fcd1c906e00  0  set rocksdb option compression = kNoCompression
2018-11-13 23:00:49.402502 7fcd1c906e00  0  set rocksdb option max_write_buffer_number = 4
2018-11-13 23:00:49.402505 7fcd1c906e00  0  set rocksdb option min_write_buffer_number_to_merge = 1

2018-11-13 23:00:49.402530 7fcd1c906e00  0  set rocksdb option max_write_buffer_number = 4
2018-11-13 23:00:49.402532 7fcd1c906e00  0  set rocksdb option min_write_buffer_number_to_merge = 1
2018-11-13 23:00:49.402540 7fcd1c906e00  0  set rocksdb option write_buffer_size = 268435456

2018-11-13 23:00:49.402550 7fcd1c906e00 10 rocksdb: do_open db_path db size 204010946560
2018-11-13 23:00:49.402553 7fcd1c906e00 10 rocksdb: do_open db_path db.slow size 156064389529
 
2018-11-13 23:00:49.404399 7fcd1c906e00  4 rocksdb:                   Options.target_file_size_base: 67108864
2018-11-13 23:00:49.404400 7fcd1c906e00  4 rocksdb:             Options.target_file_size_multiplier: 1
2018-11-13 23:00:49.404407 7fcd1c906e00  4 rocksdb:                Options.max_bytes_for_level_base: 268435456
2018-11-13 23:00:49.404409 7fcd1c906e00  4 rocksdb: Options.level_compaction_dynamic_level_bytes: 0
2018-11-13 23:00:49.404410 7fcd1c906e00  4 rocksdb:          Options.max_bytes_for_level_multiplier: 10.000000

The above logs tell us:

  1. target_file_size_base sets the SST file size: ~67MB (67108864 bytes).
  2. All levels use the same SST file size (target_file_size_multiplier is 1); only the number of files per level varies.
  3. max_bytes_for_level_base sets the size of the base level: 256MB here.
  4. max_bytes_for_level_multiplier defines the size multiplier between levels: L2 is 10 times L1.

So in our OSD, we will have the following levels:

L0: 256MB
L1: 256MB
L2: 2.5GB
L3: 25GB
L4: 250GB

We can hold every level up to L3 in our 200GB DB. L4 spills over to db.slow because a level has to fit completely in the space provided. The total size of L0+L1+L2+L3 is ~28GB, which matches the 28G we saw consumed in the exported ‘db’ directory.
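
A sketch of the arithmetic behind this, using max_bytes_for_level_base = 256MB and multiplier = 10 from the log, a 200GB DB, and the simplifying assumption that a level stays on the DB device only if the whole level still fits:

# Derive the LSM level sizes and where each level lands (db vs db.slow),
# under the simplifying assumption described above.
base = 256 * 1024 ** 2        # max_bytes_for_level_base
multiplier = 10               # max_bytes_for_level_multiplier
db_size = 200 * 1024 ** 3     # 200GB DB (OMAP) partition

sizes = [base, base] + [base * multiplier ** i for i in range(1, 4)]  # L0..L4
cumulative = 0
for level, size in enumerate(sizes):
    cumulative += size
    target = "db" if cumulative <= db_size else "db.slow (spillover)"
    print("L%d: %6.2f GB -> %s" % (level, size / float(1024 ** 3), target))
print("L0+L1+L2+L3 = %.0f GB" % (sum(sizes[:4]) / float(1024 ** 3)))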

Let’s understand the RocksDB compaction logs:

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      1/0   194.99 MB   1.6      0.0     0.0      0.0      66.8     66.8       0.0   1.0      0.0     99.1       691       318    2.172       0      0
  L1      4/0   162.73 MB   0.6    115.2    66.6     48.5     115.1     66.6       0.0   1.7     70.9     70.8      1664       185    8.995    509M   262K
  L2     45/0    2.39 GB   1.0    110.5    12.0     98.4     110.3     11.9      54.7   9.2     71.3     71.2      1587       196    8.096    474M   472K
  L3    385/0   23.04 GB   1.0    154.4    25.9    128.5     146.3     17.8      41.4   5.6     71.3     67.6      2216       347    6.387    769M    28M
  L4   1398/0   86.51 GB   0.3    469.4    65.9    403.5     396.4     -7.1       0.0   6.0     84.4     71.2      5699       513   11.109   1493M   374M
 Sum   1833/0   112.29 GB   0.0    849.5   170.5    679.0     834.9    155.9      96.0  12.5     73.4     72.1     11857      1559    7.605   3246M   403M
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0

The above log shows that L4 has 1398 SST files. The Score column indicates whether compaction is needed: a value above 1 means compaction is required. There are 1833 SST files in total in the current RocksDB.

Every time an SST file is generated, RocksDB emits a log line like the following:

2018-11-14 06:19:03.159109 7fcd07533700  4 rocksdb: [default] [JOB 3056] Generated table #103329: 176109 keys, 69620849 bytes
2018-11-14 06:19:03.159133 7fcd07533700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1542156543159124, "cf_name": "default", "job": 3056, "event": "table_file_creation", "file_number": 103329, "file_size": 69620849, "table_properties": {"data_size": 67109663, "index_size": 1918260, "filter_size": 591939, "raw_key_size": 22863898, "raw_average_key_size": 129, "raw_value_size": 57166437, "raw_average_value_size": 324, "num_data_blocks": 17578, "num_entries": 176109, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": "0"}}

The above log gives us the average key and value sizes: 129 bytes and 324 bytes. There are 176109 entries in this table and the table size is ~67MB. Our investigation so far concluded that we really did have a lot of data in the index OSDs, so we set out to find what data we were storing.
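
For log files with many such lines, the numbers are easy to pull out programmatically, since everything after EVENT_LOG_v1 is JSON. A small sketch (using an abbreviated copy of the line above):

# Extract entry count and average key/value sizes from a RocksDB
# table_file_creation event (JSON payload after "EVENT_LOG_v1").
import json

line = ('2018-11-14 06:19:03.159133 7fcd07533700 4 rocksdb: EVENT_LOG_v1 '
        '{"time_micros": 1542156543159124, "cf_name": "default", "job": 3056, '
        '"event": "table_file_creation", "file_number": 103329, '
        '"file_size": 69620849, "table_properties": {"raw_average_key_size": 129, '
        '"raw_average_value_size": 324, "num_entries": 176109}}')

event = json.loads(line.split("EVENT_LOG_v1", 1)[1])
props = event["table_properties"]
print("entries=%d, avg key=%dB, avg value=%dB, file=%.0fMB"
      % (props["num_entries"], props["raw_average_key_size"],
         props["raw_average_value_size"], event["file_size"] / 1e6))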

What is in the SST file

An SST file is a sorted sequence of keys and can be dumped using the sst_dump utility, which ships with the Ceph sources. Check out the exact Ceph source version installed on the OSD; we have to build the sst_dump tool from it.

$ cd ceph/src/rocksdb/

$ make sst_dump -j4

The keys are better printed in hex.

ceph/src/rocksdb$ ./sst_dump --file=../../../085894.sst --command=scan --read_num=5 --output_hex
from [] to []
Process ../../../085894.sst
Sst file format: block-based
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436336231373937313039393438633736343337623138663434626238' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E5235626664363362313739373130393934386337363433376231386634346262380B280000000000000105037A0000000116F2000000000000BA55BF5BA0E0B2102000000064656466316331633765616433313337636636626463326238616439303337340F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646616F200000000000000000000000000000000000001010400000014820B2882E01F3500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31343836343235352E333035383436380000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436396133396238353834646166306238353034666233633365366431' seq:0, type:1 => 080349010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E5235626664363961333962383538346461663062383530346662336333653664318B260000000000000105037A0000000164F10000000000005268BE5BF2576F2A2000000066323461343235333138346165613531626266383630343864343237396265620F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646664F100000000000000000000000000000000000001010400000014828B26822B623400000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31393631363135302E3534343338370000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461306137323432373535313534363361366164636661653065316631' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461306137323432373535313534363361366164636661653065316631D4270000000000000105037A0000000162EC0000000000000C5DBE5BB63F0F192000000039326264363564326363653162333664663562383434616561343732313763370F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646662EC0000000000000000000000000000000000000101040000001482D427822A413500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353033383236372E323836353536380000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461626334636536363739663132353035306264653366383865636237' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E52356266646162633463653636373966313235303530626465336638386563623780250000000000000105037A00000001DBEE0000000000007C86BE5BC1B7231B2000000037653637336561626638383437613130636163346434353661373466633065610F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F706466DBEE0000000000000000000000000000000000000101040000001482802582D4A43500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353033373237372E323931363231390000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666530373461396331386362376632393964326462623639333135343231' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E52356266653037346139633138636237663239396432646262363933313534323169130000000000000105037A00000001A7F1000000000000C80CBF5B8ABCFB2B2000000030336663643132646435613139623539613137343661643035643261316663380F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F706466A7F10000000000000000000000000000000000000101040000001482691382BA703500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353032373932322E323939383234340000000000000000000000000000

The output is in the form of (key, seq, type, value). To make sense of the keys, we can use Python to translate hex to ASCII.

./ceph/src/rocksdb$ python
Python 2.7.9 (default, Sep 25 2018, 20:42:16)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> '4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436336231373937313039393438633736343337623138663434626238'.decode("hex")
'M\x00\x00\x00\x00\x00\x00\x0e\xca\x9d.FA non retail/pdf/ABCD_hello_there_ABCD/ABCD_plane_international_FANR5bfd63b1797109948c76437b18f44bb8'

So the keys are valid and we have genuine data in our index OSDs. We can also list the index pool data, including omap keys and values, directly from RADOS.
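
For reference, one way to do that cross-check with the rados CLI, as a sketch (the pool name below is a placeholder; use your RGW index pool):

# List bucket-index objects in the RGW index pool and count the omap keys of
# the first one.  Pool name is an assumption; adjust for your cluster.
import subprocess

pool = "default.rgw.buckets.index"
objs = subprocess.check_output(["rados", "-p", pool, "ls"]).decode().split()
print("%d bucket index objects in %s" % (len(objs), pool))

if objs:
    keys = subprocess.check_output(
        ["rados", "-p", pool, "listomapkeys", objs[0]]).decode().splitlines()
    print("%s holds %d omap entries" % (objs[0], len(keys)))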

How to reduce the data size?

One option is to run a manual compaction on the current RocksDB, but that would not help: the compaction stats above imply that, compaction or not, the data itself was huge. We also looked at RocksDB's dynamic level sizing by setting level_compaction_dynamic_level_bytes to 1, but that is currently disabled and unsupported in Ceph. RocksDB options can be modified with the following entry in /etc/ceph/ceph.conf, although most of them are effectively immutable after OSD creation.

bluestore rocksdb options = "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152"
 
# We can get this info from the admin socket
# ceph --admin-daemon /var/run/ceph/ceph-osd.xx.asok config show|grep rocksdb
    "debug_rocksdb": "20/20",
    "bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152",

Why are we getting so much data?

We had the following suspects:

  • Rogue client
  • Uncontrolled data ingestion by an account
  • Any activity in RGW that generated index data

We did not find anything suspicious in the RGW logs (enabled using debug rgw = 20/20); all PUT requests looked sane and valid. However, there were a few log entries around bucket resharding. The OSD logs (enabled using debug osd = 20/20) also showed many requests for RGW bucket reshard operations.

ceph-rgw-log.1.gz:2018-11-14 10:14:13.602194 7f30e6d66700  0 check_bucket_shards: resharding needed: stats.num_objects=268919 shard max_objects=200000
ceph-rgw-.log.1.gz:2018-11-14 10:14:13.680114 7f30c2d1e700  0 check_bucket_shards: resharding needed: stats.num_objects=268919 shard max_objects=200000

A quick search around bucket index resharding turned up a tracker issue: http://tracker.ceph.com/issues/34307. Dynamic bucket resharding kicks in for a bucket that has more than 100k entries per shard and splits the index across a larger number of smaller shards. It leaves the old bucket index behind, though, and that causes data to pile up. Every reshard duplicated index data, which was the cause of our near-full OSDs.

"rgw_max_objs_per_shard": "100000",
 
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.asok config show|grep dynamic
    "rgw_dynamic_resharding": "true",

We disabled dynamic bucket resharding and OSD data growth returned to normal.
