Free Threads
Distributed Systems, Large Scale Architecture, File System, Linux and more
Monday, December 10, 2018
Why use Base64 Encoding?
What is Base64 encoding?
- Given a stream of bits, Base64 encodes every 6 bits as one character from a set of 2^6 (64) characters.
- Example: "ABCD". The ASCII codes are 65, 66, 67, 68.
- As 8-bit bytes: [01000001][01000010][01000011][01000100]
- Base64 picks six consecutive bits at a time:
- 010000|010100|001001|000011|010001|00xxxx, where xxxx is zero padding
- The encoded string is QUJDRA== (the trailing == marks the padding).
Why use base64 encoding?
- Transferring binary data in URLs
- Transferring binary data such as images as text
- Transmitting and storing text that might cause delimiter collision. Example: a random string followed by a delimiter (_) and a pattern, where the code searches for the delimiter to separate out the pattern. The _ can appear in the generated random string too, so encoding the random string in Base64 avoids the collision.
- Embedding an image in XML
A minimal encoding example in Go is shown below.
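The following is a small Go sketch of the encoding above, using the standard encoding/base64 package. StdEncoding uses the standard 64-character table; URLEncoding is the URL-safe variant relevant to the first bullet.

package main

import (
    "encoding/base64"
    "fmt"
)

func main() {
    data := []byte("ABCD") // ASCII 65 66 67 68

    // Standard alphabet: prints "QUJDRA==" (the == is padding).
    fmt.Println(base64.StdEncoding.EncodeToString(data))

    // URL-safe alphabet ('-' and '_' instead of '+' and '/'),
    // handy when the encoded string is embedded in a URL.
    fmt.Println(base64.URLEncoding.EncodeToString(data))

    // Decoding restores the original bytes.
    decoded, err := base64.StdEncoding.DecodeString("QUJDRA==")
    fmt.Println(string(decoded), err)
}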
Friday, December 7, 2018
Golang Runtime and Concurrency
- Golang uses a user-space component (the runtime) linked into the executable.
- The runtime was originally written in C; since Go 1.5 it is written mostly in Go, with some assembly.
- It implements the scheduler, goroutine management, and OS-thread management.
- Per Go process, there is a maximum limit on OS threads.
- The Go runtime schedules N goroutines onto M OS threads (see the sketch below).
- At any instant, a goroutine runs on exactly one thread, though it may migrate between threads over its lifetime.
- A goroutine can get blocked (e.g. on a syscall) and block its OS thread too; the runtime then hands the remaining goroutines to another thread.
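A small sketch of the N-goroutines-on-M-threads model: GOMAXPROCS bounds how many OS threads execute Go code at once, while the number of live goroutines can be far larger.

package main

import (
    "fmt"
    "runtime"
    "sync"
)

func main() {
    // Number of OS threads allowed to execute Go code simultaneously.
    fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            runtime.Gosched() // yield so other goroutines get scheduled
        }()
    }
    // Far more goroutines than OS threads are alive at this point.
    fmt.Println("goroutines:", runtime.NumGoroutine())
    wg.Wait()
}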
Tuesday, December 4, 2018
Time based Key Expiry in Redis
https://redis.io/commands/expire
Redis can expire keys after a configured time-to-live, set with the EXPIRE command. It is useful for building features such as rate limiting.
There are various rate-limiting implementations; one is linked below, followed by a small sketch in Go.
https://github.com/redislabsdemo/RateLimiter/tree/master/src/com/redislabs/metering/ratelimiter
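A minimal fixed-window rate-limiter sketch built on INCR and EXPIRE. It assumes the go-redis v8 client (github.com/go-redis/redis/v8) and a Redis server on localhost:6379; the key naming and the one-minute window are illustrative choices, not part of the linked demo.

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/go-redis/redis/v8" // assumed client library
)

// allow returns true while the caller has made at most `limit` requests
// in the current one-minute window. The counter key expires on its own.
func allow(ctx context.Context, rdb *redis.Client, id string, limit int64) (bool, error) {
    key := fmt.Sprintf("ratelimit:%s:%s", id, time.Now().Format("15:04"))
    n, err := rdb.Incr(ctx, key).Result()
    if err != nil {
        return false, err
    }
    if n == 1 {
        // First hit in this window: let Redis expire the counter for us.
        rdb.Expire(ctx, key, time.Minute)
    }
    return n <= limit, nil
}

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
    ok, err := allow(ctx, rdb, "user42", 100)
    fmt.Println("allowed:", ok, "err:", err)
}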
Written with StackEdit.
Sample Go Code
package main

import (
    "fmt"
    "time"
)

func say(s string) {
    for i := 0; i < 5; i++ {
        time.Sleep(100 * time.Millisecond)
        fmt.Println(s)
    }
}

func main() {
    say("world")
    go say("hello")
    // say("abcd")
}
- Why does "hello" not get printed? say("world") runs synchronously and prints five times, but main returns right after go say("hello") schedules the goroutine; the program exits when main returns, so the new goroutine never gets a chance to print. A fixed version is sketched below.
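A minimal fix, assuming the goal is simply to let the goroutine finish: make main wait with a sync.WaitGroup (sleeping in main would also "work" but is racy).

package main

import (
    "fmt"
    "sync"
    "time"
)

func say(s string, wg *sync.WaitGroup) {
    defer wg.Done()
    for i := 0; i < 5; i++ {
        time.Sleep(100 * time.Millisecond)
        fmt.Println(s)
    }
}

func main() {
    var wg sync.WaitGroup
    wg.Add(1)
    go say("hello", &wg)
    wg.Wait() // block main until the goroutine finishes
}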
Written with StackEdit.
Monday, December 3, 2018
Sample git conf
#### Put this in your ~/.gitconfig or ~/.config/git/config
[user]
name = Your Full Name
email = your@email.tld
[color]
# Enable colors in color-supporting terminals
ui = auto
[alias]
st = status
ci = commit
lg = log --graph --date=relative --pretty=tformat:'%Cred%h%Creset -%C(auto)%d%Creset %s %Cgreen(%an %ad)%Creset'
oops = commit --amend --no-edit
review-local = "!git lg @{push}.."
# Or, pre-2.5, as we didn't differentiate push and upstream in shorthands:
# review-local = lg @{upstream}..
[core]
# Don't paginate output by default
pager = cat
#
# Out of luck: on Windows w/o msysGit? You may have Notepad++…
# editor = 'C:/Program Files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin
#
# If you want to use Sublime Text 2's subl wrapper:
# editor = subl -w
#
# Or Atom, perhaps:
# editor = atom --wait
#
# Sublime Text 2 on Windows:
# editor = 'c:/Program Files (x86)/Sublime Text 2/sublime_text.exe' -w
#
# Sublime Text 3 on Windows:
# editor = 'c:/Program Files/Sublime Text 3/subl.exe' -w
#
# Don't consider trailing space change as a cause for merge conflicts
whitespace = -trailing-space
[diff]
# Use better, descriptive initials (c, i, w) instead of a/b.
mnemonicPrefix = true
# Show renames/moves as such
renames = true
# When using --word-diff, assume --word-diff-regex=.
wordRegex = .
# Display submodule-related information (commit listings)
submodule = log
[fetch]
# Auto-fetch submodule changes (sadly, won't auto-update)
recurseSubmodules = on-demand
[grep]
# Consider most regexes to be ERE
extendedRegexp = true
[log]
# Use abbrev SHAs whenever possible/relevant instead of full 40 chars
abbrevCommit = true
# Automatically --follow when given a single path
follow = true
[merge]
# Display common-ancestor blocks in conflict hunks
conflictStyle = diff3
[mergetool]
# Clean up backup files created by merge tools on tool exit
keepBackup = false
# Clean up temp files created by merge tools on tool exit
keepTemporaries = false
# Put the temp files in a dedicated dir anyway
writeToTemp = true
# Auto-accept file prompts when launching merge tools
prompt = false
[pull]
# This is GREAT… when you know what you're doing and are careful
# not to pull --no-rebase over a local line containing a true merge.
# rebase = true
# WARNING! This option, which does away with the one gotcha of
# auto-rebasing on pulls, is only available from 1.8.5 onwards.
rebase = preserve
[push]
# Default push should only push the current branch to its push target, regardless of its remote name
default = upstream
# When pushing, also push tags whose commit-ishs are now reachable upstream
followTags = true
[rerere]
# If, like me, you like rerere, uncomment these
# autoupdate = true
# enabled = true
[status]
# Display submodule rev change summaries in status
submoduleSummary = true
# Recursively traverse untracked directories to display all contents
showUntrackedFiles = all
[color "branch"]
# Blue on black is hard to read in git branch -vv: use cyan instead
upstream = cyan
[tag]
# Sort tags as version numbers whenever applicable, so 1.10.2 is AFTER 1.2.0.
sort = version:refname
[versionsort]
prereleaseSuffix = -pre
prereleaseSuffix = .pre
prereleaseSuffix = -beta
prereleaseSuffix = .beta
prereleaseSuffix = -rc
prereleaseSuffix = .rc
Reference https://gist.github.com/tdd/470582
Written with StackEdit.
Wednesday, November 21, 2018
Bluestore Internals
Bluestore Discussions
If the WAL is full, what would happen? Would writes block?
It never blocks; it will always just spill over onto the next fastest
device (wal -> db -> main). Note that there is no value to a db partition
if it is on the same device as the main partition.
Would a drastic (quick) way to correct a too-small DB partition (impacting performance) be to destroy the OSD and rebuild it with a larger DB partition?
Yes
I would check your running Ceph clusters and calculate the number of objects per OSD:
total objects / num osd * 3
For the moment though, having multiple (4)
256MB WAL buffers appears to give us the best performance despite
resulting in large memtables, so 1-2GB for the WAL is right.
A tool to gather complete Ceph cluster information
https://github.com/42on/ceph-collect
Bluestore onode size is about 24 KB for an average object size of 2.8 MB in RBD, so the average object count and onode overhead per TB can be calculated; a rough sketch of that arithmetic follows.
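A back-of-the-envelope sketch of the calculation in Go. The 2.8 MB and 24 KB figures come from the discussion above and are approximations.

package main

import "fmt"

func main() {
    const (
        tib        = float64(1 << 40) // bytes per TiB of stored data
        avgObject  = 2.8 * (1 << 20)  // ~2.8 MiB average RBD object
        onodeBytes = 24.0 * (1 << 10) // ~24 KiB of onode metadata per object
    )
    objects := tib / avgObject
    metadata := objects * onodeBytes
    fmt.Printf("objects per TiB: %.0f\n", objects)
    fmt.Printf("onode metadata per TiB: %.1f GiB\n", metadata/(1<<30))
}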
Reference
http://ceph-users.ceph.narkive.com/8uPMEXNz/bluestore-osd-data-wal-db
Written with StackEdit.
Thursday, November 15, 2018
Updating systemctl limits on Debian
- Become root.
- vi /lib/systemd/system/ceph-osd@.service
Change the values for processes and open files to the following (extracted from ulimit -a):
[Service]
LimitNOFILE=78452
LimitNPROC=80248
- $ sudo systemctl daemon-reload
- Restart the OSDs.
$ cat update-systemctl.sh
for ip in $(cat ip.list)
do
scp ceph-osd@.service $ip:/tmp
ssh $ip sudo cp /tmp/ceph-osd@.service /lib/systemd/system/ceph-osd@.service
ssh $ip sudo systemctl daemon-reload
ssh $ip sudo systemctl start ceph\*.service ceph\*.target
done
$ cat ceph-osd@.service
[Unit]
Description=Ceph object storage daemon osd.%i
After=network-online.target local-fs.target time-sync.target ceph-mon.target
Wants=network-online.target local-fs.target time-sync.target
PartOf=ceph-osd.target
[Service]
LimitNOFILE=78452
LimitNPROC=80248
EnvironmentFile=-/etc/default/ceph
Environment=CLUSTER=ceph
ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i
ExecReload=/bin/kill -HUP $MAINPID
ProtectHome=true
ProtectSystem=full
PrivateTmp=true
TasksMax=infinity
Restart=on-failure
StartLimitInterval=30min
StartLimitBurst=0
RestartSec=20s
[Install]
WantedBy=ceph-osd.target
Written with StackEdit.
How to debug RocksDB issues in Bluestore
Problem
We started to see multiple near-full index OSDs in a Luminous cluster.
$ sudo ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS TYPE NAME
351 ssd 0.14999 0.79999 152G 123G 29727M 81.03 3.25 18 osd.351
The near-full message is triggered for OSDs that have used more than 85% of their capacity.
Environment
- It runs a Ceph 12.2.5.2 build on Debian 9.
- Index OSDs run on two 350G SSD instances.
- Ceph OSDs use LVM volumes.
- The RGW index pool uses isolated OSDs, which means no other pool uses these OSDs.
- Traffic inflow was business as usual (BAU).
Problem Analysis
The index OSDs are meant to store RGW bucket index information. Index data is primarily small (~500 B per entry for user objects) and gets stored in the OMAP. Bluestore stores small objects in the WAL and later flushes them to the OMAP; larger objects are written directly to the Bluestore block.
The configuration for small and large objects is an OSD tunable:
# ceph daemon osd.351 config show|grep min_alloc
"bluestore_min_alloc_size": "0",
"bluestore_min_alloc_size_hdd": "65536",
"bluestore_min_alloc_size_ssd": "16384",
Bluestore uses a default 16 KB minimum allocation size for SSDs. The choice between a higher and a lower size is a tradeoff between lower write amplification and lower fragmentation.
Coming back to the index OSD layout, we use a large DB for the OMAP and a small block (since most objects are small, they get placed in the OMAP). The OMAP can use the block for a spillover: a spillover means the OMAP DB is full and reserves extra space in the Bluestore block.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 254:0 0 10G 0 disk
`-vda1 254:1 0 10G 0 part /
vdb 254:16 0 355G 0 disk
|-vdbVG-wal--vdb 253:0 0 2G 0 lvm
|-vdbVG-db--vdb 253:1 0 200G 0 lvm
`-vdbVG-block--vdb 253:2 0 153G 0 lvm
vdc 254:32 0 355G 0 disk
|-vdcVG-wal--vdc 253:3 0 2G 0 lvm
|-vdcVG-db--vdc 253:4 0 200G 0 lvm
`-vdcVG-block--vdc 253:5 0 153G 0 lvm
The pSA-e index OSDs have a 2G WAL, a 200G OMAP DB, and a 153G block.
How is OSD utilization calculated?
OSD utilization is measured by usage of the block. This means the block was getting consumed, which should not happen because the OMAP DB is large enough to hold all the data. Since the data entries are small, we did not expect any other use of the block; so Bluestore was using the block to store the OMAP's spillover data.
There is a problem with auto-detection of media type:
2018-11-09 15:53:49.918135 7f9b51df1e00 1 bdev(0x55936bd64480 /var/lib/ceph/osd/ceph-70/block) open size 164278304768 (0x263fc00000, 152 GB) block_size 4096 (4096 B) rotational
2018-11-09 15:53:49.918620 7f9b51df1e00 1 bdev(0x55936bd65200 /dev/vdcVG/db-vdc) open size 214748364800 (0x3200000000, 200 GB) block_size 4096 (4096 B) rotational
It should have logged media type as non-rotational.
Verifying the OMAP spillover
Bluestore data is not browsable. The ceph-bluestore-tool utility can export the BlueFS part of a Bluestore OSD. Since the OMAP DB size is 200G, we need a destination that can hold it, so we stopped the OSD on another disk and used that whole disk as the mount point for the BlueFS export.
# Zap the disk
$ sgdisk -Z /dev/vdb
# Create a file system
$ mkfs.xfs /dev/vdb -K -f
# Mount it
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 254:0 0 10G 0 disk
`-vda1 254:1 0 10G 0 part /
vdb 254:16 0 355G 0 disk /mnt/osd34
vdc 254:32 0 355G 0 disk
|-vdcVG-wal--vdc 253:0 0 2G 0 lvm
|-vdcVG-db--vdc 253:1 0 200G 0 lvm
`-vdcVG-block--vdc 253:2 0 153G 0 lvm
# Get BlueFS data
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-34 --out-dir /mnt/osd34
We can find out the details of space consumption.
/mnt/osd34# du -sh -l
119G
/mnt/osd34/db# du -sh .
28G
/mnt/osd34/db# ls -lh|less
total 28G
-rw-r--r-- 1 root root 66M Nov 12 16:25 072637.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072638.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072639.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072640.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072641.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072642.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 072643.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 074380.sst
-rw-r--r-- 1 root root 66M Nov 12 16:25 078646.sst
Let's check the status of db.slow. The db.slow directory is space reserved by the OMAP in the Bluestore block; it holds the spillover data.
/mnt/osd34/db.slow# du -sh .
91G
/mnt/osd34/db.slow# ls -lh|less
total 91G
-rw-r--r-- 1 root root 66M Nov 12 16:27 041139.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 041423.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 042097.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 043530.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 044022.sst
-rw-r--r-- 1 root root 66M Nov 12 16:27 046002.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 046615.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 047100.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 048891.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 048892.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 049678.sst
-rw-r--r-- 1 root root 66M Nov 12 16:28 052509.sst
Why did the spillover happen while the DB was hardly 15% utilized?
Bluestore's OMAP is based on RocksDB, which uses an LSM-tree structure to store data as key-value pairs. The LSM tree has multiple levels, known as L0...Lmax, and each level has an upper limit on its size. Here L0 and L1 are 256 MB each; the next level, L2, is 10 × L1, L3 is 10 × L2, and so on. Let's take a look at the RocksDB config in our index OSD:
# Add the following to the /etc/ceph/ceph.conf and restart OSD.
Alternatively, you can set it through admin socket.
debug rocksdb = 20/20
$vim ceph-osd.34.log
2018-11-13 23:00:49.402483 7fcd1c906e00 0 set rocksdb option compaction_readahead_size = 2097152
2018-11-13 23:00:49.402498 7fcd1c906e00 0 set rocksdb option compression = kNoCompression
2018-11-13 23:00:49.402502 7fcd1c906e00 0 set rocksdb option max_write_buffer_number = 4
2018-11-13 23:00:49.402505 7fcd1c906e00 0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-11-13 23:00:49.402530 7fcd1c906e00 0 set rocksdb option max_write_buffer_number = 4
2018-11-13 23:00:49.402532 7fcd1c906e00 0 set rocksdb option min_write_buffer_number_to_merge = 1
2018-11-13 23:00:49.402540 7fcd1c906e00 0 set rocksdb option write_buffer_size = 268435456
2018-11-13 23:00:49.402550 7fcd1c906e00 10 rocksdb: do_open db_path db size 204010946560
2018-11-13 23:00:49.402553 7fcd1c906e00 10 rocksdb: do_open db_path db.slow size 156064389529
2018-11-13 23:00:49.404399 7fcd1c906e00 4 rocksdb: Options.target_file_size_base: 67108864
2018-11-13 23:00:49.404400 7fcd1c906e00 4 rocksdb: Options.target_file_size_multiplier: 1
2018-11-13 23:00:49.404407 7fcd1c906e00 4 rocksdb: Options.max_bytes_for_level_base: 268435456
2018-11-13 23:00:49.404409 7fcd1c906e00 4 rocksdb: Options.level_compaction_dynamic_level_bytes: 0
2018-11-13 23:00:49.404410 7fcd1c906e00 4 rocksdb: Options.max_bytes_for_level_multiplier: 10.000000
The above logs tell us:
- target_file_size_base implies an SST file size of 67 MB (64 MiB).
- All levels use the same SST file size (target_file_size_multiplier is 1); only the number of files per level varies.
- max_bytes_for_level_base sets the size of the base level; it is 256 MB here.
- max_bytes_for_level_multiplier defines the growth factor per level, so L2 is 10 times L1.
So in our OSD, we will have the following levels:
L0: 256MB
L1: 256MB
L2: 2.5GB
L3: 25GB
L4: 250GB
We can hold all levels up to L3 in our 200 GB DB, but L4 will spill over to db.slow because a level has to fit completely in the space provided. The total size of L0+L1+L2+L3 is ~28 GB, which matches the usage we saw in the OMAP 'db' directory. A small sketch of this arithmetic follows.
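A quick Go sketch of the level-size arithmetic, using max_bytes_for_level_base = 256 MB and max_bytes_for_level_multiplier = 10 from the options above, and our 200 GB 'db' partition. L0 is bounded by the write buffers and is left out; the sizes are approximate.

package main

import "fmt"

func main() {
    const (
        levelBase  = 256 << 20 // max_bytes_for_level_base
        multiplier = 10        // max_bytes_for_level_multiplier
        dbBytes    = 200 << 30 // fast 'db' partition size
    )
    size := int64(levelBase)
    cumulative := int64(0)
    for level := 1; level <= 4; level++ {
        cumulative += size
        location := "fits in db"
        if size > dbBytes || cumulative > dbBytes {
            location = "spills to db.slow"
        }
        fmt.Printf("L%d: %7.1f GB (cumulative %7.1f GB) -> %s\n",
            level, gb(size), gb(cumulative), location)
        size *= multiplier
    }
}

func gb(b int64) float64 { return float64(b) / (1 << 30) }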
Let’s understand the RocksDB compaction logs:
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
L0 1/0 194.99 MB 1.6 0.0 0.0 0.0 66.8 66.8 0.0 1.0 0.0 99.1 691 318 2.172 0 0
L1 4/0 162.73 MB 0.6 115.2 66.6 48.5 115.1 66.6 0.0 1.7 70.9 70.8 1664 185 8.995 509M 262K
L2 45/0 2.39 GB 1.0 110.5 12.0 98.4 110.3 11.9 54.7 9.2 71.3 71.2 1587 196 8.096 474M 472K
L3 385/0 23.04 GB 1.0 154.4 25.9 128.5 146.3 17.8 41.4 5.6 71.3 67.6 2216 347 6.387 769M 28M
L4 1398/0 86.51 GB 0.3 469.4 65.9 403.5 396.4 -7.1 0.0 6.0 84.4 71.2 5699 513 11.109 1493M 374M
Sum 1833/0 112.29 GB 0.0 849.5 170.5 679.0 834.9 155.9 96.0 12.5 73.4 72.1 11857 1559 7.605 3246M 403M
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0
The above log shows that L4 has 1398 SST files. The score column indicates the need for compaction: if it is more than one, compaction is required. There are 1833 SST files in total in the current RocksDB.
Every time an SST file is generated, RocksDB emits the following log:
2018-11-14 06:19:03.159109 7fcd07533700 4 rocksdb: [default] [JOB 3056] Generated table #103329: 176109 keys, 69620849 bytes
2018-11-14 06:19:03.159133 7fcd07533700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1542156543159124, "cf_name": "default", "job": 3056, "event": "table_file_creation", "file_number": 103329, "file_size": 69620849, "table_properties": {"data_size": 67109663, "index_size": 1918260, "filter_size": 591939, "raw_key_size": 22863898, "raw_average_key_size": 129, "raw_value_size": 57166437, "raw_average_value_size": 324, "num_data_blocks": 17578, "num_entries": 176109, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": "0"}}
The above log gives us the average key and value sizes, which are 129 bytes and 324 bytes respectively. There are 176109 entries in this table and the table size is ~67 MB. Our investigation so far concluded that we indeed had a lot of data in the index OSDs, so we decided to find out what data we were storing.
What is in the SST file?
An SST file is a sorted sequence of keys and can be dumped using the sst_dump utility, which is available in the Ceph sources. Pick the exact Ceph source version that is installed on the OSD; we have to build the sst_dump tool ourselves.
$ cd ceph/src/rocksdb/
$ make sst_dump -j4
The keys are better printed in hex.
ceph/src/rocksdb$ ./sst_dump --file=../../../085894.sst --command=scan --read_num=5 --output_hex
from [] to []
Process ../../../085894.sst
Sst file format: block-based
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436336231373937313039393438633736343337623138663434626238' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E5235626664363362313739373130393934386337363433376231386634346262380B280000000000000105037A0000000116F2000000000000BA55BF5BA0E0B2102000000064656466316331633765616433313337636636626463326238616439303337340F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646616F200000000000000000000000000000000000001010400000014820B2882E01F3500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31343836343235352E333035383436380000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436396133396238353834646166306238353034666233633365366431' seq:0, type:1 => 080349010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E5235626664363961333962383538346461663062383530346662336333653664318B260000000000000105037A0000000164F10000000000005268BE5BF2576F2A2000000066323461343235333138346165613531626266383630343864343237396265620F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646664F100000000000000000000000000000000000001010400000014828B26822B623400000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31393631363135302E3534343338370000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461306137323432373535313534363361366164636661653065316631' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461306137323432373535313534363361366164636661653065316631D4270000000000000105037A0000000162EC0000000000000C5DBE5BB63F0F192000000039326264363564326363653162333664663562383434616561343732313763370F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646662EC0000000000000000000000000000000000000101040000001482D427822A413500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353033383236372E323836353536380000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461626334636536363739663132353035306264653366383865636237' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E52356266646162633463653636373966313235303530626465336638386563623780250000000000000105037A00000001DBEE0000000000007C86BE5BC1B7231B2000000037653637336561626638383437613130636163346434353661373466633065610F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F706466DBEE0000000000000000000000000000000000000101040000001482802582D4A43500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353033373237372E323931363231390000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666530373461396331386362376632393964326462623639333135343231' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E52356266653037346139633138636237663239396432646262363933313534323169130000000000000105037A00000001A7F1000000000000C80CBF5B8ABCFB2B2000000030336663643132646435613139623539613137343661643035643261316663380F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F706466A7F10000000000000000000000000000000000000101040000001482691382BA703500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353032373932322E323939383234340000000000000000000000000000
The output is in the form of (key, seq, type, value). To make sense of the keys, we can use Python to translate hex to ASCII.
./ceph/src/rocksdb$ python
Python 2.7.9 (default, Sep 25 2018, 20:42:16)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> '4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436336231373937313039393438633736343337623138663434626238'.decode("hex")
'M\x00\x00\x00\x00\x00\x00\x0e\xca\x9d.FA non retail/pdf/ABCD_hello_there_ABCD/ABCD_plane_international_FANR5bfd63b1797109948c76437b18f44bb8'
So the keys are valid and we do have genuine data in our index OSDs. We can also list the index pool data, including omap keys and values, directly from RADOS.
How to get the data size reduced?
One way is to run a manual compaction on the current RocksDB. But that would not help: the compaction stats above implied that, in spite of compaction, the data itself was huge. We tried to explore dynamic compaction by setting level_compaction_dynamic_level_bytes to 1, but it is currently disabled and unsupported in Ceph. We can modify the RocksDB config using the following in /etc/ceph/ceph.conf, although most of these configs are immutable after OSD creation.
bluestore rocksdb options = "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152",
# We can get this info from the admin socket
# ceph --admin-daemon /var/run/ceph/ceph-osd.xx.asok config show|grep rocksdb
"debug_rocksdb": "20/20",
"bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152",
Why are we getting so much data?
There were the following suspects:
- Rogue client
- Uncontrolled data ingestion by an account
- Any activity in RGW that generated index data
We did not find anything suspicious in the RGW logs (enabled using debug rgw = 20/20), and all PUT requests looked sane and valid. However, there were a few logs around bucket resharding. The OSD logs (enabled using debug osd = 20/20) also showed many requests for RGW bucket reshard operations.
ceph-rgw-log.1.gz:2018-11-14 10:14:13.602194 7f30e6d66700 0 check_bucket_shards: resharding needed: stats.num_objects=268919 shard max_objects=200000
ceph-rgw-.log.1.gz:2018-11-14 10:14:13.680114 7f30c2d1e700 0 check_bucket_shards: resharding needed: stats.num_objects=268919 shard max_objects=200000
A quick search around bucket index resharding turned up a tracker: http://tracker.ceph.com/issues/34307. Dynamic bucket resharding kicks in for a bucket that has more than 100k entries per shard (rgw_max_objs_per_shard) and creates smaller shards. However, it leaves the old bucket index behind, and that causes data pileup. Every reshard therefore duplicated data, and that was the cause of our near-full OSDs.
"rgw_max_objs_per_shard": "100000",
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.asok config show|grep dynamic
"rgw_dynamic_resharding": "true",
We disabled dynamic bucket resharding (rgw_dynamic_resharding = false) and OSD data growth returned to normal.
References
- https://www.slideshare.net/VikhyatUmrao/cephalocon-apac-china/15
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023141.html
- http://tracker.ceph.com/issues/34307
- http://tracker.ceph.com/issues/24082
- https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool
- https://github.com/facebook/rocksdb/wiki/Leveled-Compaction
- https://github.com/facebook/rocksdb/wiki/Delete-Stale-Files
- https://github.com/facebook/rocksdb/wiki/How-we-keep-track-of-live-SST-files
- https://github.com/facebook/rocksdb/wiki/Benchmarking-tools
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029851.html
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-March/025283.html
- https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
- https://github.com/facebook/rocksdb/commit/65a9cd616876c7a1204e1a50990400e4e1f61d7e
- https://tracker.ceph.com/issues/23510
Written with StackEdit.
Tuesday, September 18, 2018
A utility to dump block device data in Linux
Many a time it is necessary to read raw disk blocks, for example when debugging data corruption or a corrupted magic block. xxd is a simple and lightweight utility to dump a device.
# xxd /dev/vdb|less
00000000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000100: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000110: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000120: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000150: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000160: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000170: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000180: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000190: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001a0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001b0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001c0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001d0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001e0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
000001f0: 0000 0000 0000 0000 0000 0000 0000 0000 ................
00000200: 4c41 4245 4c4f 4e45 0100 0000 0000 0000 LABELONE........
00000210: ddf0 c3e3 2000 0000 4c56 4d32 2030 3031 .... ...LVM2 001
00000220: 5352 7661 7344 6162 7964 4d36 6569 6a64 SRvasDabydM6eijd
00000230: 6833 4338 3462 5532 536c 444b 564c 7943 h3C84bU2SlDKVLyC
00000240: 0000 0000 2000 0000 0000 1000 0000 0000 .... ...........
Another utility is debugfs, which can inspect a disk that holds a valid (ext) file system.
Written with StackEdit.
Friday, September 14, 2018
General Script to run Linux Shell Commands
# Pick one loop: a numeric range or a list of hosts from a file.
#for i in {0..24}
for i in $(cat meta.osd.ip)
do
  echo "$i"
  # Uncomment the action you need:
  #sudo ceph osd purge $i --yes-i-really-mean-it
  #ssh -q -o "StrictHostKeyChecking no" $i sudo reboot
done
Written with StackEdit.
Thursday, September 13, 2018
OSD on Debian Jessie : No cluster conf found
- Reference Tracker
The problem is in the ceph-disk code. ceph-disk prepare shows the following log on the destination node:
# ceph-disk -v prepare /dev/vdb
command: Running command:
/usr/bin/ceph-osd --cluster=None --show-config-value=fsid
The value of cluster is None, and that is incorrect. It must be ceph.
# /usr/bin/ceph-osd --cluster=None --show-config-value=fsid
00000000-0000-0000-0000-000000000000
# /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
68eabcd3-a4fd-4c80-9e9c-56577d841234
The code change involved the following in /usr/sbin/ceph-disk:
prepare_parser.add_argument(
    '--cluster',
    metavar='NAME', default='ceph',  # added the parameter `default`
    help='cluster name to assign this disk to',
)
- diff output
< metavar='NAME',
---
> metavar='NAME', default='ceph',
Written with StackEdit.
Wednesday, September 5, 2018
Write CRUSH rule for a Cluster
CRUSH rules are described as follows:
{
"rule_id": 1,
"rule_name": "replicated_ruleset_hdd",
"ruleset": 1,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -481,
"item_name": "hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "jbod"
},
{
"op": "emit"
}
]
},
Visualize the CRUSH rule as a way to traverse a tree; the 'steps' section decides the traversal. The above rule takes the root 'hdd' and picks num_replicas buckets of type 'jbod'; it then picks one leaf (OSD) in each of the chosen buckets.
The leaves are the items of type 0 in the CRUSH types list.
Get the current CRUSH rules
$ sudo ceph osd getcrushmap -o crush.org
The above command gives us a compiled version of the CRUSH map. To make changes, we need to decompile it first.
$ sudo crushtool -d /tmp/crush.org -o crush.org.d
Make changes and recompile the /tmp/crush.org
$ sudo crushtool -c crush.org.d -o crush.new.c
Test the new rule
# Find out incorrect mappings from the new rule
$ sudo crushtool -i <compiled crush file> --test --show-bad-mappings
# Find out behavior of a random placement
$ sudo crushtool --test -i /tmp/crush.org --show-utilization --rule 3 --num-rep=3 --simulate
# Find out behavior of the new CRUSH rule placement
$ sudo crushtool --test -i /tmp/crush.org --show-utilization --rule 3 --num-rep=3
Sample output
device 1334: stored : 4 expected : 2.26049
device 1335: stored : 2 expected : 2.26049
device 1336: stored : 3 expected : 2.26049
device 1337: stored : 2 expected : 2.26049
device 1338: stored : 1 expected : 2.26049
device 1339: stored : 2 expected : 2.26049
device 1340: stored : 2 expected : 2.26049
References
- https://blog.dachary.org/2017/04/18/faster-ceph-crush-computation-with-smaller-buckets/
- https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/storage_strategies_guide/crush_administration
- https://blog.dachary.org/2013/12/09/testing-a-ceph-crush-map/
Written with StackEdit.
Thursday, August 30, 2018
Restarting RGW service
sudo systemctl start ceph-radosgw@rgw.`hostname -s`
sudo systemctl enable ceph-radosgw@rgw.`hostname -s`
systemctl reset-failed ceph-mgr@{{ ansible_hostname }}
Written with StackEdit.
Friday, August 17, 2018
Notes on RGW System Object State
The RGW raw object store has the following structure:
// rgw/rgw_rados.h
struct RGWRawObjState {
rgw_raw_obj obj;
bool has_attrs{false};
bool exists{false};
uint64_t size{0};
ceph::real_time mtime;
uint64_t epoch;
bufferlist obj_tag;
bool has_data{false};
bufferlist data;
bool prefetch_data{false};
uint64_t pg_ver{0};
/* important! don't forget to update copy constructor */
RGWObjVersionTracker objv_tracker;
map<string, bufferlist> attrset;
RGWRawObjState() {}
};
Written with StackEdit.
Notes on Ceph RADOS paper
Why PGs?
PGs enable a balanced distribution of objects. Without PGs, there are two ways to distribute objects:
a. Mirroring on other OSDs
b. Keeping a copy of the object on all the nodes in the cluster (declustering)
PGs provide a way to replicate a set of objects in a fault-tolerant manner.
Written with StackEdit.
Thursday, August 16, 2018
Notes on RGW Manifest
RGW maintains a manifest of each object. The class RGWObjManifest implements the details of object head and tail placement.
The manifest is written as XATTRs along with RGWRados::Object::Write::_do_write_meta().
/**
* Write/overwrite an object to the bucket storage.
* bucket: the bucket to store the object in
* obj: the object name/key
* data: the object contents/value
* size: the amount of data to write (data must be this long)
* accounted_size: original size of data before compression, encryption
* mtime: if non-NULL, writes the given mtime to the bucket storage
* attrs: all the given attrs are written to bucket storage for the given object
* exclusive: create object exclusively
* Returns: 0 on success, -ERR# otherwise.
*/
Written with StackEdit.
Monday, August 6, 2018
Notes on YouTube Architecture
- Web servers are usually not the bottleneck
- Caching levels:
- Database
- Serialized Python objects
- HTML pages
- Videos are sharded across the cluster to share load.
- Instead of single-process Apache, lighttpd was used because it is multi-process.
- Serving thumbnails is challenging; a thumbnail is a ~5 KB image.
- DB sharding is the key.
Written with StackEdit.
Sunday, August 5, 2018
Lifecycle of a URL access on a Browser
Step 1
- DNS lookup
- Browser cache
- OS cache
- Router cache
- ISP DNS lookup
Step 2
- Connection setup
- Three-way TCP handshake
Step 3
- Browser sends HTTP request
- GET
- POST (for auth and form submission)
Step 4
- Server handles the request and prepares a response.
- Server could be a web server (Apache, IIS)
- Handler parses the request header
- Handler can be coded in ASP, PHP, etc.
Step 5
- Server responds with HTTP header having status code
Step 6
- Browser gets HTML data
- It renders tags
- Fetches images, CSS using GET (usually cached)
Written with StackEdit.