Monday, December 10, 2018

Why use Base64 Encoding?

## What is Base64 encoding?

  • Given a stream of bytes, it encodes every 6 bits of input as one character from a 64-character alphabet (2^6 = 64 characters).
  • Example: “ABCD”, whose ASCII codes are 65, 66, 67, 68.
  • [01000001][01000010][01000011][01000100]
  • Base64 picks six continuous bits at a time:
  • 010000 | 010100 | 001001 | 000011 | 010001 | 00xxxx, where xxxx is zero padding
  • That maps to QUJDRA, and the output is padded with “==” to QUJDRA== (see the sketch below).
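A quick way to check the worked example above, using Go's standard encoding/base64 package:

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// 4 input bytes (32 bits) become ceil(32/6) = 6 base64 characters,
	// and "==" padding rounds the output up to a multiple of 4.
	fmt.Println(base64.StdEncoding.EncodeToString([]byte("ABCD"))) // QUJDRA==
}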

Why use base64 encoding?

  • Transferring binary data in URLs
  • Transferring binary data, such as images, as text
  • Transmitting and storing text that might cause delimiter collision.
    • Example: a random string followed by a delimiter (_) and a pattern, where the code searches for the delimiter to separate the pattern.
    • The _ can appear in the generated random string too.
    • Encoding the random string in base64 avoids such a collision (see the sketch below).
  • Embedding an image in XML
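A minimal sketch of the delimiter-collision case above (the token and delimiter are illustrative): the standard base64 alphabet contains only A-Z, a-z, 0-9, '+', '/' and '=', so an encoded token can never contain the '_' separator.

package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

func main() {
	// A generated token that happens to contain the delimiter itself.
	token := "abc_def"
	pattern := "pattern-42"

	// Encode the token before joining: the encoded form cannot contain '_'.
	record := base64.StdEncoding.EncodeToString([]byte(token)) + "_" + pattern

	// The consumer can now split safely on the first '_'.
	parts := strings.SplitN(record, "_", 2)
	decoded, _ := base64.StdEncoding.DecodeString(parts[0])
	fmt.Println(string(decoded), parts[1]) // abc_def pattern-42
}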

Friday, December 7, 2018

Golang Runtime and Concurrency

  • Go links a user-space component (the runtime) into every executable.
  • The runtime is written in Go and assembly (it was C in early releases, before Go 1.5).
  • It implements the scheduler, goroutine management, and OS-thread management.
  • Each Go process has a maximum limit on the number of OS threads.
  • The Go runtime schedules N goroutines onto M OS threads (see the sketch below).
  • At any given time, a goroutine runs on exactly one thread.
  • A goroutine can block (e.g. on a syscall), which blocks its OS thread too; the runtime keeps running other goroutines on other threads.
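A small sketch of the N:M model (just observable numbers, not runtime internals): many goroutines are multiplexed by the scheduler onto a handful of OS threads, and GOMAXPROCS bounds how many threads execute Go code simultaneously.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// M: the number of threads that may run Go code at once.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// N: goroutines are cheap user-space tasks managed by the runtime.
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
		}()
	}
	fmt.Println("goroutines alive:", runtime.NumGoroutine())
	wg.Wait()
}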


Tuesday, December 4, 2018

Time based Key Expiry in Redis

https://redis.io/commands/expire
EXPIRE is a useful feature to expire keys after a specified time-to-live (refreshing the TTL on access gives idle-based expiry). We can use it to build interesting features such as rate limiting.

There are various rate limiting implementations.
https://github.com/redislabsdemo/RateLimiter/tree/master/src/com/redislabs/metering/ratelimiter
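As an illustration (not code from the linked repo), here is a minimal fixed-window rate limiter built on INCR plus EXPIRE. It assumes the go-redis v6 client (github.com/go-redis/redis) and a Redis server on localhost:6379.

package main

import (
	"fmt"
	"time"

	"github.com/go-redis/redis"
)

// allow reports whether the caller identified by key has made fewer than
// limit requests in the current window.
func allow(c *redis.Client, key string, limit int64, window time.Duration) (bool, error) {
	n, err := c.Incr(key).Result()
	if err != nil {
		return false, err
	}
	if n == 1 {
		// First hit in this window: start the expiry clock on the counter.
		c.Expire(key, window)
	}
	return n <= limit, nil
}

func main() {
	c := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ok, err := allow(c, "ratelimit:user42", 100, time.Minute)
	fmt.Println(ok, err)
}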

Written with StackEdit.

Sample Go Code

package main

import (
	"fmt"
	"time"
)

func say(s string) {
	for i := 0; i < 5; i++ {
		time.Sleep(100 * time.Millisecond)
		fmt.Println(s)
	}
}

func main() {
	say("world")
	go say("hello")
	// say("abcd")
}

  • Why does “hello” not get printed? say("world") runs synchronously and finishes first; go say("hello") then starts a goroutine, but main returns immediately afterwards and the process exits before the goroutine gets a chance to print.
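One way to make “hello” appear is to wait for the goroutine before main returns, e.g. with a sync.WaitGroup:

package main

import (
	"fmt"
	"sync"
	"time"
)

func say(s string) {
	for i := 0; i < 5; i++ {
		time.Sleep(100 * time.Millisecond)
		fmt.Println(s)
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		say("hello")
	}()
	say("world")
	wg.Wait() // block main until the goroutine finishes
}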

Written with StackEdit.

Monday, December 3, 2018

Sample git conf

#### Put this in your ~/.gitconfig or ~/.config/git/config
[user]
	name = Your Full Name
	email = your@email.tld
[color]
	# Enable colors in color-supporting terminals
	ui = auto
[alias]
	st = status
	ci = commit
	lg = log --graph --date=relative --pretty=tformat:'%Cred%h%Creset -%C(auto)%d%Creset %s %Cgreen(%an %ad)%Creset'
	oops = commit --amend --no-edit
	review-local = "!git lg @{push}.."
	# Or pre 2.5, as we didn't differentiate push and upstream in shorthands:
	# review-local = lg @{upstream}..
[core]
	# Don't paginate output by default
	pager = cat
	#
	# Out of luck: on Windows w/o msysGit? You may have Notepad++…
	# editor = 'C:/Program Files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin
	# 
	# If you want to use Sublime Text 2's subl wrapper:
	# editor = subl -w
	# 
	# Or Atom, perhaps:
	# editor = atom --wait
	# 
	# Sublime Text 2 on Windows:
	# editor = 'c:/Program Files (x86)/Sublime Text 2/sublime_text.exe' -w
	# 
	# Sublime Text 3 on Windows:
	# editor = 'c:/Program Files/Sublime Text 3/subl.exe' -w
	#
	# Don't consider trailing space change as a cause for merge conflicts
	whitespace = -trailing-space
[diff]
	# Use better, descriptive initials (c, i, w) instead of a/b.
	mnemonicPrefix = true
	# Show renames/moves as such
	renames = true
	# When using --word-diff, assume --word-diff-regex=.
	wordRegex = .
	# Display submodule-related information (commit listings)
	submodule = log
[fetch]
	# Auto-fetch submodule changes (sadly, won't auto-update)
	recurseSubmodules = on-demand
[grep]
	# Consider most regexes to be ERE
	extendedRegexp = true
[log]
	# Use abbrev SHAs whenever possible/relevant instead of full 40 chars
	abbrevCommit = true
	# Automatically --follow when given a single path
	follow = true
[merge]
	# Display common-ancestor blocks in conflict hunks
	conflictStyle = diff3
[mergetool]
	# Clean up backup files created by merge tools on tool exit
	keepBackup = false
	# Clean up temp files created by merge tools on tool exit
	keepTemporaries = false
	# Put the temp files in a dedicated dir anyway
	writeToTemp = true
	# Auto-accept file prompts when launching merge tools
	prompt = false
[pull]
	# This is GREAT… when you know what you're doing and are careful
	# not to pull --no-rebase over a local line containing a true merge.
	# rebase = true
	# WARNING! This option, which does away with the one gotcha of
	# auto-rebasing on pulls, is only available from 1.8.5 onwards.
	rebase = preserve
[push]
	# Default push should only push the current branch to its push target, regardless of its remote name
	default = upstream
	# When pushing, also push tags whose commit-ishs are now reachable upstream
	followTags = true
[rerere]
	# If, like me, you like rerere, uncomment these
	# autoupdate = true
	# enabled = true
[status]
	# Display submodule rev change summaries in status
	submoduleSummary = true
	# Recursively traverse untracked directories to display all contents
	showUntrackedFiles = all
[color "branch"]
	# Blue on black is hard to read in git branch -vv: use cyan instead
	upstream = cyan
[tag]
	# Sort tags as version numbers whenever applicable, so 1.10.2 is AFTER 1.2.0.
	sort = version:refname
[versionsort]
	prereleaseSuffix = -pre
	prereleaseSuffix = .pre
	prereleaseSuffix = -beta
	prereleaseSuffix = .beta
	prereleaseSuffix = -rc
	prereleaseSuffix = .rc


Reference https://gist.github.com/tdd/470582

Written with StackEdit.

Wednesday, November 21, 2018

Bluestore Internals

## Bluestore Discussions
If the WAL is full, what would happen? Would writes block?

It never blocks; it will always just spill over onto the next fastest
device (wal -> db -> main). Note that there is no value to a db partition
if it is on the same device as the main partition.

Would a drastic (quick) way to correct a too-small DB partition
(impacting performance) be to destroy the OSD and rebuild it with a
larger DB partition?
Yes

I would check your running Ceph clusters and calculate the number of objects per OSD:
total objects / num osd * 3

For the moment though, having multiple (4)
256MB WAL buffers appears to give us the best performance despite
resulting in large memtables, so 1-2GB for the WAL is right.

A tool to gather complete Ceph cluster information
https://github.com/42on/ceph-collect

Bluestore onode size is ~24KB for an average object size of 2.8MB in RBD. From the average object size, the object count and onode overhead per TB can be calculated.
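A back-of-the-envelope sketch of that calculation (just arithmetic on the numbers quoted above, not Ceph code):

package main

import "fmt"

func main() {
	const (
		avgObjectSize = 2.8 * (1 << 20) // ~2.8 MB average RBD object
		onodeSize     = 24 * (1 << 10)  // ~24 KB of onode metadata per object
		terabyte      = 1 << 40
	)
	objectsPerTB := float64(terabyte) / avgObjectSize
	fmt.Printf("objects per TB: ~%.0f\n", objectsPerTB)
	fmt.Printf("onode metadata per TB: ~%.1f GB\n", objectsPerTB*onodeSize/(1<<30))
}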

Reference
http://ceph-users.ceph.narkive.com/8uPMEXNz/bluestore-osd-data-wal-db

Written with StackEdit.

Thursday, November 15, 2018

Updating systemctl limits on Debian

  1. Become root

  2. vi /lib/systemd/system/ceph-osd@.service

Change the values of proc and files to the following (extracted from ulimit -a).

[Service]  
LimitNOFILE=78452  
LimitNPROC=80248
  3. $ sudo systemctl daemon-reload

  4. Restart OSDs

$ cat update-systemctl.sh  
for ip in $(cat ip.list)  
do  
scp ceph-osd@.service $ip:/tmp
ssh $ip sudo cp /tmp/ceph-osd@.service /lib/systemd/system/ceph-osd@.service  
ssh $ip sudo systemctl daemon-reload  
ssh $ip sudo systemctl start ceph\*.service ceph\*.target  
done
$ cat ceph-osd@.service  
[Unit]  
Description=Ceph object storage daemon osd.%i  
After=network-online.target local-fs.target time-sync.target ceph-mon.target  
Wants=network-online.target local-fs.target time-sync.target  
PartOf=ceph-osd.target

[Service]  
LimitNOFILE=78452  
LimitNPROC=80248  
EnvironmentFile=-/etc/default/ceph  
Environment=CLUSTER=ceph  
ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph  
ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i  
ExecReload=/bin/kill -HUP $MAINPID  
ProtectHome=true  
ProtectSystem=full  
PrivateTmp=true  
TasksMax=infinity  
Restart=on-failure  
StartLimitInterval=30min  
StartLimitBurst=0  
RestartSec=20s

[Install]  
WantedBy=ceph-osd.target


Written with StackEdit.

How to debug RocksDB issues in Bluestore


Problem

We started to see multiple near-full index OSDs in a Luminous cluster.

$ sudo ceph osd df tree
  ID    CLASS WEIGHT    REWEIGHT SIZE   USE     AVAIL   %USE  VAR  PGS TYPE NAME
  351   ssd   0.14999  0.79999   152G   123G   29727M   81.03 3.25  18                 osd.351

The near-full message is triggered for OSDs that have used more than 85% of their capacity.

Environment

  • The cluster runs a Ceph 12.2.5.2 build on Debian 9.
  • Index OSDs run on two 350G SSD instances.
  • Ceph OSDs use LVM volumes.
  • The RGW index pool uses isolated OSDs, which means no other pool uses these OSDs.
  • Traffic inflow was BAU (business as usual).

Problem Analysis

The index OSDs are meant to store RGW bucket index information. Index data is primarily small (~500B per entry for user objects) and gets stored in the OMAP. Bluestore stores small objects in the WAL and later flushes them to the OMAP (DB), while larger objects are written directly to the Bluestore block.

The configuration for small and large objects is an OSD tunable:

# ceph daemon osd.351 config show|grep min_alloc
    "bluestore_min_alloc_size": "0",
    "bluestore_min_alloc_size_hdd": "65536",
    "bluestore_min_alloc_size_ssd": "16384",

Bluestore uses a default minimum allocation size of 16KB for SSDs. The choice between a higher and a lower size is a tradeoff between lower write amplification and lower fragmentation.

Coming back to the index OSD layout, we use a large DB for the OMAP and a small block device (since most objects are small, they get placed in the OMAP). The OMAP can use the block device for spillover: a spillover means the OMAP is full and needs extra space, which it reserves in the Bluestore block.

# lsblk
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                254:0    0   10G  0 disk
`-vda1             254:1    0   10G  0 part /
vdb                254:16   0  355G  0 disk
|-vdbVG-wal--vdb   253:0    0    2G  0 lvm
|-vdbVG-db--vdb    253:1    0  200G  0 lvm
`-vdbVG-block--vdb 253:2    0  153G  0 lvm
vdc                254:32   0  355G  0 disk
|-vdcVG-wal--vdc   253:3    0    2G  0 lvm
|-vdcVG-db--vdc    253:4    0  200G  0 lvm
`-vdcVG-block--vdc 253:5    0  153G  0 lvm

The pSA-e index OSDs have a 2G WAL, a 200G DB (OMAP), and a 153G block.

How is OSD utilization calculated?

OSD utilization is measured by usage of the block device. That means the block was being consumed, which should not happen because the OMAP (DB) partition is large enough to hold all the data. Since the data entries are small, we did not expect any other usage of the block, so Bluestore had to be using the block to store the OMAP's spillover data.

There is a problem with auto-detection of media type:

2018-11-09 15:53:49.918135 7f9b51df1e00 1 bdev(0x55936bd64480 /var/lib/ceph/osd/ceph-70/block) open size 164278304768 (0x263fc00000, 152 GB) block_size 4096 (4096 B) rotational

2018-11-09 15:53:49.918620 7f9b51df1e00 1 bdev(0x55936bd65200 /dev/vdcVG/db-vdc) open size 214748364800 (0x3200000000, 200 GB) block_size 4096 (4096 B) rotational

It should have logged media type as non-rotational.

Verifying OMAP spillover

Bluestore data is not directly browseable. A utility, ceph-bluestore-tool, can export the BlueFS part of a Bluestore OSD. Since the OMAP DB size is 200G, we need a mount point that can hold it. We stopped the OSD on another disk and used that full disk as the destination for the BlueFS export.

# Zap the disk
$ sgdisk -Z /dev/vdb
 
# Create a file system
$ mkfs.xfs /dev/vdb -K -f
 
# Mount it

# lsblk
NAME               MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                254:0    0   10G  0 disk
`-vda1             254:1    0   10G  0 part /
vdb                254:16   0  355G  0 disk /mnt/osd34
vdc                254:32   0  355G  0 disk
|-vdcVG-wal--vdc   253:0    0    2G  0 lvm
|-vdcVG-db--vdc    253:1    0  200G  0 lvm
`-vdcVG-block--vdc 253:2    0  153G  0 lvm
 
# Get BlueFS data
$ ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-34 --out-dir /mnt/osd34

We can find out the details of space consumption.

/mnt/osd34# du -sh -l
119G
 
/mnt/osd34/db# du -sh .
28G
 
/mnt/osd34/db# ls -lh|less
total 28G
-rw-r--r-- 1 root root  66M Nov 12 16:25 072637.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072638.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072639.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072640.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072641.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072642.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 072643.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 074380.sst
-rw-r--r-- 1 root root  66M Nov 12 16:25 078646.sst

Let’s check db.slow: it is space the OMAP reserves in the Bluestore block, and it holds the spillover data.

/mnt/osd34/db.slow# du -sh .
91G
 
/mnt/osd34/db.slow# ls -lh|less
total 91G
-rw-r--r-- 1 root root  66M Nov 12 16:27 041139.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 041423.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 042097.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 043530.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 044022.sst
-rw-r--r-- 1 root root  66M Nov 12 16:27 046002.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 046615.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 047100.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 048891.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 048892.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 049678.sst
-rw-r--r-- 1 root root  66M Nov 12 16:28 052509.sst
Why did the spillover happen while the DB was hardly 15% utilized?

Bluestore's OMAP is based on RocksDB, which uses an LSM tree structure to store data as key-value pairs. The LSM tree has multiple levels, known as L0...Lmax, and each level has an upper limit on its size. At L0 and L1 the limit is 256MB; L2 is 10x L1, and L3 is 10x L2. Let's take a look at the RocksDB config in our index OSD:

# Add the following to /etc/ceph/ceph.conf and restart the OSD.
# Alternatively, you can set it through the admin socket.
debug rocksdb = 20/20

$vim ceph-osd.34.log
2018-11-13 23:00:49.402483 7fcd1c906e00  0  set rocksdb option compaction_readahead_size = 2097152
2018-11-13 23:00:49.402498 7fcd1c906e00  0  set rocksdb option compression = kNoCompression
2018-11-13 23:00:49.402502 7fcd1c906e00  0  set rocksdb option max_write_buffer_number = 4
2018-11-13 23:00:49.402505 7fcd1c906e00  0  set rocksdb option min_write_buffer_number_to_merge = 1

2018-11-13 23:00:49.402530 7fcd1c906e00  0  set rocksdb option max_write_buffer_number = 4
2018-11-13 23:00:49.402532 7fcd1c906e00  0  set rocksdb option min_write_buffer_number_to_merge = 1
2018-11-13 23:00:49.402540 7fcd1c906e00  0  set rocksdb option write_buffer_size = 268435456

2018-11-13 23:00:49.402550 7fcd1c906e00 10 rocksdb: do_open db_path db size 204010946560
2018-11-13 23:00:49.402553 7fcd1c906e00 10 rocksdb: do_open db_path db.slow size 156064389529
 
2018-11-13 23:00:49.404399 7fcd1c906e00  4 rocksdb:                   Options.target_file_size_base: 67108864
2018-11-13 23:00:49.404400 7fcd1c906e00  4 rocksdb:             Options.target_file_size_multiplier: 1
2018-11-13 23:00:49.404407 7fcd1c906e00  4 rocksdb:                Options.max_bytes_for_level_base: 268435456
2018-11-13 23:00:49.404409 7fcd1c906e00  4 rocksdb: Options.level_compaction_dynamic_level_bytes: 0
2018-11-13 23:00:49.404410 7fcd1c906e00  4 rocksdb:          Options.max_bytes_for_level_multiplier: 10.000000

The above logs tell:

  1. target_file_size_base implies an SST file size of 67MB.
  2. All levels use the same SST file size (target_file_size_multiplier is 1); only the number of files varies.
  3. max_bytes_for_level_base sets the size of the base level (L1): 256 MB.
  4. max_bytes_for_level_multiplier defines the growth factor between levels: L2 is 10 times L1, and so on.

So in our OSD, we will have the following levels:

L0: 256MB
L1: 256MB
L2: 2.5GB
L3: 25GB
L4: 250GB

We can hold all levels up to L3 in our 200GB DB. L4 spills over to db.slow because a level has to fit completely in the space provided. The total size of L0+L1+L2+L3 is ~28GB, which matches the observed usage of our OMAP ‘db’ partition.
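A rough sketch of that arithmetic (an approximation of the RocksDB level targets, not how BlueFS actually places files): compute each level's target size from the options above and check how much fits in the 200GB ‘db’ partition.

package main

import "fmt"

func main() {
	const (
		levelBase  = 256 << 20 // max_bytes_for_level_base (L1 target)
		multiplier = 10        // max_bytes_for_level_multiplier
		dbSize     = 200 << 30 // size of the OMAP 'db' partition
	)

	total := int64(levelBase) // count L0 at roughly the same 256MB
	size := int64(levelBase)
	for level := 1; level <= 4; level++ {
		if level > 1 {
			size *= multiplier
		}
		fmt.Printf("L%d: %7.1f GB, cumulative %7.1f GB, fits in db: %v\n",
			level, float64(size)/(1<<30), float64(total+size)/(1<<30), total+size <= dbSize)
		total += size
	}
}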

Let’s understand the RocksDB compaction logs:

** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      1/0   194.99 MB   1.6      0.0     0.0      0.0      66.8     66.8       0.0   1.0      0.0     99.1       691       318    2.172       0      0
  L1      4/0   162.73 MB   0.6    115.2    66.6     48.5     115.1     66.6       0.0   1.7     70.9     70.8      1664       185    8.995    509M   262K
  L2     45/0    2.39 GB   1.0    110.5    12.0     98.4     110.3     11.9      54.7   9.2     71.3     71.2      1587       196    8.096    474M   472K
  L3    385/0   23.04 GB   1.0    154.4    25.9    128.5     146.3     17.8      41.4   5.6     71.3     67.6      2216       347    6.387    769M    28M
  L4   1398/0   86.51 GB   0.3    469.4    65.9    403.5     396.4     -7.1       0.0   6.0     84.4     71.2      5699       513   11.109   1493M   374M
 Sum   1833/0   112.29 GB   0.0    849.5   170.5    679.0     834.9    155.9      96.0  12.5     73.4     72.1     11857      1559    7.605   3246M   403M
 Int      0/0    0.00 KB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000       0      0

The above log shows that L4 has 1398 SST files. The score column indicates the need for compaction: if it is more than one, compaction is required. There are 1833 SST files in total in the current RocksDB.

Every time an SST file is generated, RocksDB emits the following log:

2018-11-14 06:19:03.159109 7fcd07533700  4 rocksdb: [default] [JOB 3056] Generated table #103329: 176109 keys, 69620849 bytes
2018-11-14 06:19:03.159133 7fcd07533700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1542156543159124, "cf_name": "default", "job": 3056, "event": "table_file_creation", "file_number": 103329, "file_size": 69620849, "table_properties": {"data_size": 67109663, "index_size": 1918260, "filter_size": 591939, "raw_key_size": 22863898, "raw_average_key_size": 129, "raw_value_size": 57166437, "raw_average_value_size": 324, "num_data_blocks": 17578, "num_entries": 176109, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "0", "kMergeOperands": "0"}}

The above log helps us understand the average key and value sizes, which are 129 bytes and 324 bytes. There are 176109 entries in this table and the table size is ~67MB. Our investigation so far concluded that we indeed had a lot of data in the index OSDs, so we decided to find out what data we were storing.

What is in the SST file

An SST file is a sorted sequence of keys and can be dumped using the sst_dump utility, which is available in the Ceph sources. Pick the exact Ceph source version that was installed on the OSD; we have to build the sst_dump tool ourselves.

$ cd ceph/src/rocksdb/

$ make sst_dump -j4

The keys are better printed in hex.

ceph/src/rocksdb$ ./sst_dump --file=../../../085894.sst --command=scan --read_num=5 --output_hex
from [] to []
Process ../../../085894.sst
Sst file format: block-based
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436336231373937313039393438633736343337623138663434626238' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E5235626664363362313739373130393934386337363433376231386634346262380B280000000000000105037A0000000116F2000000000000BA55BF5BA0E0B2102000000064656466316331633765616433313337636636626463326238616439303337340F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646616F200000000000000000000000000000000000001010400000014820B2882E01F3500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31343836343235352E333035383436380000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436396133396238353834646166306238353034666233633365366431' seq:0, type:1 => 080349010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E5235626664363961333962383538346461663062383530346662336333653664318B260000000000000105037A0000000164F10000000000005268BE5BF2576F2A2000000066323461343235333138346165613531626266383630343864343237396265620F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646664F100000000000000000000000000000000000001010400000014828B26822B623400000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31393631363135302E3534343338370000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461306137323432373535313534363361366164636661653065316631' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461306137323432373535313534363361366164636661653065316631D4270000000000000105037A0000000162EC0000000000000C5DBE5BB63F0F192000000039326264363564326363653162333664663562383434616561343732313763370F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F70646662EC0000000000000000000000000000000000000101040000001482D427822A413500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353033383236372E323836353536380000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666461626334636536363739663132353035306264653366383865636237' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E52356266646162633463653636373966313235303530626465336638386563623780250000000000000105037A00000001DBEE0000000000007C86BE5BC1B7231B2000000037653637336561626638383437613130636163346434353661373466633065610F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F706466DBEE0000000000000000000000000000000000000101040000001482802582D4A43500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353033373237372E323931363231390000000000000000000000000000
'4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666530373461396331386362376632393964326462623639333135343231' seq:0, type:1 => 08034A010000610000004641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E52356266653037346139633138636237663239396432646262363933313534323169130000000000000105037A00000001A7F1000000000000C80CBF5B8ABCFB2B2000000030336663643132646435613139623539613137343661643035643261316663380F00000061706C2D696E766F6963652D7064660F00000041504C2D496E766F696365504446730F0000006170706C69636174696F6E2F706466A7F10000000000000000000000000000000000000101040000001482691382BA703500000034366635333738352D646637372D343433652D626263652D3732386166343562396333622E31353032373932322E323939383234340000000000000000000000000000

The output is in the form of (key, seq, type, value). To make sense of the keys, we can use Python to translate hex to ASCII.

./ceph/src/rocksdb$ python
Python 2.7.9 (default, Sep 25 2018, 20:42:16)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> '4D0000000000000ECA9D2E4641206E6F6E2072657461696C2F7064662F46414E525F73616C65735F696E766F6963655F46414E522F46414E525F73616C65735F696E766F6963655F46414E523562666436336231373937313039393438633736343337623138663434626238'.decode("hex")
'M\x00\x00\x00\x00\x00\x00\x0e\xca\x9d.FA non retail/pdf/ABCD_hello_there_ABCD/ABCD_plane_international_FANR5bfd63b1797109948c76437b18f44bb8'

So the keys are valid and we have genuine data in our index OSDs. We can also list the index pool data, including keys and values, directly from RADOS.

How to reduce the data size?

One way is to run manual compaction on the current RocksDB, but that would not help: the default compaction stats implied that, in spite of compaction, the data itself was huge. We tried to explore dynamic compaction in RocksDB by setting level_compaction_dynamic_level_bytes to 1, but it is currently disabled and unsupported in Ceph. We can modify the RocksDB config using the following in /etc/ceph/ceph.conf; most of these configs are immutable after OSD creation.

bluestore rocksdb options =  "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152",
 
# We can get this info from the admin socket
# ceph --admin-daemon /var/run/ceph/ceph-osd.xx.asok config show|grep rocksdb
    "debug_rocksdb": "20/20",
    "bluestore_rocksdb_options": "compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152",

Why are we getting so much data?

The following were the suspects:

  • Rogue client
  • Uncontrolled data ingestion by an account
  • Any activity in RGW that generated index data

We did not find anything suspicious in the RGW logs (enabled using debug rgw = 20/20), and all PUT requests looked sane and valid. However, there were a few logs around bucket resharding. The OSD logs (enabled using debug osd = 20/20) also showed many requests for the RGW reshard bucket.

ceph-rgw-log.1.gz:2018-11-14 10:14:13.602194 7f30e6d66700  0 check_bucket_shards: resharding needed: stats.num_objects=268919 shard max_objects=200000
ceph-rgw-.log.1.gz:2018-11-14 10:14:13.680114 7f30c2d1e700  0 check_bucket_shards: resharding needed: stats.num_objects=268919 shard max_objects=200000

A quick search around bucket index resharding turned up a tracker: http://tracker.ceph.com/issues/34307. Dynamic bucket resharding happens for a bucket that has more than 100k entries and creates smaller shards. However, it leaves the old bucket index behind, which causes data pileup. Every reshard duplicated index data, and that was the cause of our near-full OSDs.

"rgw_max_objs_per_shard": "100000",
 
# ceph --admin-daemon /var/run/ceph/ceph-client.rgw.asok config show|grep dynamic
    "rgw_dynamic_resharding": "true",

We disabled dynamic bucket resharding and OSD data growth returned to normal.


Written with StackEdit.

Tuesday, September 18, 2018

A utility to dump block devices data in Linux

Many times it is necessary to read raw disk blocks, for example while investigating data corruption or a corrupted magic block.
xxd is a simple and lightweight utility to dump a device.

# xxd /dev/vdb|less
00000000: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000010: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000100: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000120: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000150: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000160: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000170: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000180: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000190: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000200: 4c41 4245 4c4f 4e45 0100 0000 0000 0000  LABELONE........
00000210: ddf0 c3e3 2000 0000 4c56 4d32 2030 3031  .... ...LVM2 001
00000220: 5352 7661 7344 6162 7964 4d36 6569 6a64  SRvasDabydM6eijd
00000230: 6833 4338 3462 5532 536c 444b 564c 7943  h3C84bU2SlDKVLyC
00000240: 0000 0000 2000 0000 0000 1000 0000 0000  .... ...........

Another utility is debugfs, which can inspect a disk that contains a valid file system.
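For completeness, a minimal xxd-like sketch in Go using encoding/hex (the device path is just an example; reading it needs root):

package main

import (
	"encoding/hex"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("/dev/vdb") // example device; adjust as needed
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// Dump only the first 512 bytes (one sector) in hexdump -C style.
	buf := make([]byte, 512)
	n, err := f.Read(buf)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(hex.Dump(buf[:n]))
}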

Written with StackEdit.

Friday, September 14, 2018

General Script to run Linux Shell Commands

#!/bin/bash
# Template: uncomment the loop source and the command you need.
#for i in {0..24}
for i in $(cat meta.osd.ip)
do
    echo "$i"   # placeholder; replace with one of the commands below
    #sudo ceph osd purge $i --yes-i-really-mean-it
    #ssh -q -o "StrictHostKeyChecking no" $i sudo reboot
done

Written with StackEdit.

Thursday, September 13, 2018

OSD on Debian Jessie : No cluster conf found

The problem is in the ceph-disk code.
ceph-disk prepare shows the following log on the destination node:

# ceph-disk -v prepare /dev/vdb
command: Running command: 
/usr/bin/ceph-osd --cluster=None --show-config-value=fsid

The value of cluster is None, which is incorrect; it must be ceph.

# /usr/bin/ceph-osd --cluster=None --show-config-value=fsid
00000000-0000-0000-0000-000000000000
# /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
68eabcd3-a4fd-4c80-9e9c-56577d841234

The code change involved the following in /usr/sbin/ceph-disk:

prepare_parser.add_argument(
    '--cluster',
    metavar='NAME', default='ceph',  # added the parameter `default`
    help='cluster name to assign this disk to',
)

  • diff output
<         metavar='NAME',
---
>         metavar='NAME', default='ceph',

Written with StackEdit.

Wednesday, September 5, 2018

Write CRUSH rule for a Cluster

CRUSH rules are described as follows:

        {
            "rule_id": 1,
            "rule_name": "replicated_ruleset_hdd",
            "ruleset": 1,
            "type": 1,
            "min_size": 1,
            "max_size": 10,
            "steps": [
                {
                    "op": "take",
                    "item": -481,
                    "item_name": "hdd"
                },
                {
                    "op": "chooseleaf_firstn",
                    "num": 0,
                    "type": "jbod"
                },
                {
                    "op": "emit"
                }
            ]
        },

Visualize the CRUSH rule as a way to traverse a tree; the ‘steps’ section decides the traversal. The above rule takes the root ‘hdd’ and picks num_replicas buckets of type ‘jbod’. Next, it descends to one leaf (e.g. an OSD) in each chosen bucket.
The leaves are defined by type 0 in the CRUSH types list.

Get the current CRUSH rules

$ sudo ceph osd getcrushmap -o crush.org
The above command gets us the compiled CRUSH map. To make changes, we first need to decompile it.

$ sudo crushtool -d /tmp/crush.org -o crush.org.d

Make changes to crush.org.d and recompile it:
$ sudo crushtool -c crush.org.d -o crush.new.c

Test the new rule

# Find out incorrect mappings from the new rule
$ sudo crushtool -i <compiled crush file> --test --show-bad-mappings

# Find out behavior of a random placement
$ sudo crushtool --test -i /tmp/crush.org --show-utilization --rule 3 --num-rep=3 --simulate

# Find out behavior of the new CRUSH rule placement
$ sudo crushtool --test -i /tmp/crush.org --show-utilization --rule 3 --num-rep=3
Sample output
  device 1334:           stored : 4      expected : 2.26049
  device 1335:           stored : 2      expected : 2.26049
  device 1336:           stored : 3      expected : 2.26049
  device 1337:           stored : 2      expected : 2.26049
  device 1338:           stored : 1      expected : 2.26049
  device 1339:           stored : 2      expected : 2.26049
  device 1340:           stored : 2      expected : 2.26049


Written with StackEdit.

Thursday, August 30, 2018

Restarting RGW service

sudo systemctl start ceph-radosgw@rgw.`hostname -s`
sudo systemctl enable ceph-radosgw@rgw.`hostname -s`
systemctl reset-failed ceph-mgr@{{ ansible_hostname }}

Written with StackEdit.

Friday, August 17, 2018

Notes on RGW System Object State

The RGW raw object state has the following structure:

// rgw/rgw_rados.h
struct RGWRawObjState {
  rgw_raw_obj obj;
  bool has_attrs{false};
  bool exists{false};
  uint64_t size{0};
  ceph::real_time mtime;
  uint64_t epoch;
  bufferlist obj_tag;
  bool has_data{false};
  bufferlist data;
  bool prefetch_data{false};
  uint64_t pg_ver{0};

  /* important! don't forget to update copy constructor */

  RGWObjVersionTracker objv_tracker;

  map<string, bufferlist> attrset;
  RGWRawObjState() {}
  // ... (remaining members and methods omitted)
};

Written with StackEdit.

Notes on Ceph RADOS paper

Why PGs?
PGs enable a balanced distribution of objects. Without PGs, there are two ways to distribute objects:
a. Mirroring on other OSDs
b. Keeping a copy of an object on all the nodes in the cluster (declustering)

PGs provide a way to replicate a set of objects in a fault-tolerant manner.

Written with StackEdit.

Thursday, August 16, 2018

Notes on RGW Manifest

RGW maintains a manifest of each object. The class RGWObjManifest implements the details of object head and tail placement.
The manifest is written as XATTRs in RGWRados::Object::Write::_do_write_meta().

/**
 * Write/overwrite an object to the bucket storage.
 * bucket: the bucket to store the object in
 * obj: the object name/key
 * data: the object contents/value
 * size: the amount of data to write (data must be this long)
 * accounted_size: original size of data before compression, encryption
 * mtime: if non-NULL, writes the given mtime to the bucket storage
 * attrs: all the given attrs are written to bucket storage for the given object
 * exclusive: create object exclusively
 * Returns: 0 on success, -ERR# otherwise.
 */

Written with StackEdit.

Monday, August 6, 2018

Notes on YouTube Architecture

  • Web servers are usually not the bottleneck
  • Caching levels:
    • Database
    • Serialized Python objects
    • HTML pages
  • Videos are sharded across the cluster to share load.
  • Instead of single-process Apache, lighttpd was used because it is multi-process.
  • Serving thumbnails is challenging; a thumbnail is a ~5KB image.
    • DB sharding is the key.

Written with StackEdit.

Sunday, August 5, 2018

Lifecycle of a URL access on a Browser

Step 1

  • DNS lookup
    • Browser cache
    • OS cache
    • Router cache
    • ISP DNS lookup

Step 2

  • Connection setup
    • 3-way TCP handshake (SYN, SYN-ACK, ACK)

Step 3

  • Browser sends HTTP request
    • GET
    • POST (for auth and form submission)

Step 4

  • Server handles the request and prepares a response.
    • Server could be a web server (Apache, IIS)
    • Handler parses the request header
    • Handlers can be written in ASP, PHP, etc.

Step 5

  • Server responds with HTTP header having status code

Step 6

  • Browser gets HTML data
  • It renders tags
  • Fetches images, CSS using GET (usually cached)
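A small client-side sketch of steps 1 to 5 using Go's standard library (example.com is just a placeholder host):

package main

import (
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Step 1: DNS lookup (the OS resolver applies its own caching).
	addrs, err := net.LookupHost("example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println("resolved to:", addrs)

	// Steps 2-3: net/http sets up the TCP (and TLS) connection and sends the GET.
	resp, err := http.Get("https://example.com/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Step 5: the response starts with a status line from the server.
	fmt.Println("status:", resp.Status)
}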

Written with StackEdit.