Thursday, August 30, 2018

Restarting RGW service

sudo systemctl start ceph-radosgw@rgw.`hostname -s`
sudo systemctl enable ceph-radosgw@rgw.`hostname -s`
To clear a previously failed unit state (this line is templated from an Ansible task):
sudo systemctl reset-failed ceph-mgr@{{ ansible_hostname }}

Written with StackEdit.

Friday, August 17, 2018

Notes on RGW System Object State

The RGW raw object state has the following structure:

// rgw/rgw_rados.h
struct RGWRawObjState {
  rgw_raw_obj obj;                  // raw object location (pool, oid)
  bool has_attrs{false};
  bool exists{false};
  uint64_t size{0};
  ceph::real_time mtime;
  uint64_t epoch;
  bufferlist obj_tag;
  bool has_data{false};
  bufferlist data;                  // prefetched object data, if any
  bool prefetch_data{false};
  uint64_t pg_ver{0};

  /* important! don't forget to update copy constructor */

  RGWObjVersionTracker objv_tracker;

  map<string, bufferlist> attrset;  // object xattrs
  RGWRawObjState() {}
};
Written with StackEdit.

Notes on Ceph RADOS paper

Why PGs?
PGs enable a balanced distribution of objects. Without PGs, there are two extremes for distributing an object's replicas:
a. Mirroring: replicate one OSD's contents wholesale onto other OSDs
b. Complete declustering: spread each object's replicas across all the nodes in the cluster

PGs strike a balance between the two: each PG groups a set of objects that is replicated together on a small set of OSDs, which gives fault-tolerant replication without per-object placement metadata. A sketch of the object-to-PG mapping follows.
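
As a rough illustration (this uses a toy hash, not Ceph's rjenkins hash, and pg_num is an assumed pool setting), an object maps to a PG by hashing its name modulo pg_num; CRUSH then maps the PG to an ordered set of OSDs:

#include <stdio.h>

/* Toy stand-in for Ceph's object-name hash (Ceph actually uses rjenkins). */
static unsigned toy_hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

int main(void) {
    const char *obj = "myobject";
    unsigned pg_num = 128;                 /* assumed pool pg_num */
    unsigned pg = toy_hash(obj) % pg_num;  /* object -> PG */
    /* CRUSH (not shown) maps (pool, pg) -> an ordered list of OSDs */
    printf("object '%s' -> pg %u\n", obj, pg);
    return 0;
}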

Written with StackEdit.

Thursday, August 16, 2018

Notes on RGW Manifest

RGW maintains a manifest for each object. The class RGWObjManifest implements the details of head and tail object placement.
The manifest is written as an xattr alongside the object metadata in RGWRados::Object::Write::_do_write_meta( ).

/**
 * Write/overwrite an object to the bucket storage.
 * bucket: the bucket to store the object in
 * obj: the object name/key
 * data: the object contents/value
 * size: the amount of data to write (data must be this long)
 * accounted_size: original size of data before compression, encryption
 * mtime: if non-NULL, writes the given mtime to the bucket storage
 * attrs: all the given attrs are written to bucket storage for the given object
 * exclusive: create object exclusively
 * Returns: 0 on success, -ERR# otherwise.
 */

Written with StackEdit.

Monday, August 6, 2018

Notes on YouTube Architecture

  • Web servers are usually not the bottleneck
  • Caching levels:
    • Database
    • Serialized Python objects
    • HTML pages
  • Videos are sharded across the cluster to spread load.
  • Apache was replaced with lighttpd, which was later moved from single-process to multi-process operation to handle more load.
  • Serving thumbnails is challenging: a thumbnail is only a ~5 KB image, so there are huge numbers of tiny objects to serve.
    • DB sharding was the key fix.

Written with StackEdit.

Sunday, August 5, 2018

Lifecycle of a URL access on a Browser

Step 1

  • DNS lookup
    • Browser cache
    • OS cache
    • Router cache
    • ISP DNS lookup

Step 2

  • Connection setup
    • 3-way TCP handshake

Step 3

  • Browser sends HTTP request
    • GET
    • POST (for auth and form submission)

Step 4

  • The server handles the request and prepares a response.
    • The server could be a web server (Apache, IIS)
    • A handler parses the request headers
    • Handlers can be written in ASP, PHP, etc.

Step 5

  • The server responds with an HTTP header carrying a status code

Step 6

  • The browser receives the HTML data
  • It renders the tags
  • It fetches images and CSS via GET (usually served from cache)
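
For illustration, here is a minimal exchange against a hypothetical host (request first, response headers below; the names and sizes are made up):

GET /index.html HTTP/1.1
Host: www.example.com

HTTP/1.1 200 OK
Content-Type: text/html
Content-Length: 1270

<html> ... </html>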

Written with StackEdit.

Saturday, August 4, 2018

Notes on UNIX Pipes

  • A pipe( ) call creates two file descriptors: a read end and a write end.
  • In $ ls -l | more, if ls writes more than more can handle, the writer blocks until the reader drains the pipe.
  • If there is nothing to read, a read on the pipe blocks.
  • A pipe is a kernel resource; it is basically just a buffer.
  • A write to a pipe whose read end is closed fails and raises SIGPIPE; a read from a pipe whose write end is closed returns EOF.
  • lseek( ) does not work on pipes; they are not seekable (the call fails with ESPIPE).

How does piping of shell commands work?

  • A process has its own local file descriptor table (FDT). This table keeps track of files opened by the process and their FDs.
    Using the dup() system call, we can create a new entry in the process FDT. Starting from:

FD  File name
10  /tmp/xx

After calling dup() on FD 10, we get a new entry with a new FD:

FD  File name
10  /tmp/xx
11  /tmp/xx

Both entries map to the same slot in the global file table, loosely analogous to a soft link. Now we can close FD 10 and continue operations on FD 11, or keep both of them open.

dup() returns the lowest available FD in the process. To exploit this fact, we can close one of the standard FDs (0, 1, 2) so that the next call to dup() returns that FD.

int fd[2];
int p = pipe(fd);
close(1);    // frees up FD 1 (stdout)
dup(fd[1]);  // dup returns the lowest available FD, i.e. 1,
             // which now aliases the pipe's write end

FD  File
1   pipe (write end)

Now any write that was supposed to go to FD 1 (stdout) goes into the pipe instead.
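
The same plumbing is how a shell wires up ls -l | more. Below is a minimal sketch using dup2() (equivalent to the close()/dup() trick above); error handling is mostly omitted:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    if (pipe(fd) == -1) { perror("pipe"); exit(1); }

    if (fork() == 0) {          /* child 1: runs ls -l, writing into the pipe */
        dup2(fd[1], 1);         /* stdout -> pipe write end */
        close(fd[0]); close(fd[1]);
        execlp("ls", "ls", "-l", (char *)NULL);
        _exit(127);             /* reached only if exec fails */
    }
    if (fork() == 0) {          /* child 2: runs more, reading from the pipe */
        dup2(fd[0], 0);         /* stdin <- pipe read end */
        close(fd[0]); close(fd[1]);
        execlp("more", "more", (char *)NULL);
        _exit(127);
    }
    close(fd[0]); close(fd[1]); /* parent must close both ends */
    wait(NULL); wait(NULL);
    return 0;
}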

Written with StackEdit.

Notes on Signals in UNIX

  • Pressing a key generates an interrupt, and the kernel has a keyboard interrupt handler module. The handler sends a signal to all processes associated with the terminal.
  • With the classic signal() semantics, a handler's disposition is reset to the default once the signal is delivered. So the handler re-installs itself:
    void handler(int signo)
    {
        signal(SIGINT, handler);  /* re-install for the next SIGINT */
    }
  • Common Signals to know:
    • SIGSYS: Incorrect usage of a system call
    • SIGCHLD
    • SIGALRM
  • SIGALRM is also used to implement sleep(): the process schedules an alarm and then waits indefinitely for a signal. To create such a wait, the pause() function is used.
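
A minimal sketch of such a sleep, built on alarm() and pause() (it ignores any other SIGALRM user in the process, and has the classic race that real implementations avoid with sigsuspend()):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_alarm(int signo) { (void)signo; /* only needed to interrupt pause() */ }

static void my_sleep(unsigned seconds) {
    signal(SIGALRM, on_alarm);  /* install a handler so SIGALRM doesn't kill us */
    alarm(seconds);             /* schedule SIGALRM after 'seconds' */
    pause();                    /* block until any signal arrives */
}

int main(void) {
    printf("sleeping 2s...\n");
    my_sleep(2);
    printf("done\n");
    return 0;
}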

Written with StackEdit.

Notes on Linux Process Management

  • All processes have a PID and a process group ID. The group leader is your shell (terminal).
  • So, for a process to qualify as background, we change its process group using setpgrp().
  • nohup ./a.out &
  • chmod 4750 file1.txt
    • The leading 4 sets the SUID bit. An ordinary user can run this file with the privileges of the actual owner of file1.txt. This is useful for allowing an ordinary user to run commands otherwise accessible only to, e.g., root.
      $ chmod u+s file

Written with StackEdit.

Notes on exec( ) system call

  • The new program retains the old process’s file descriptor table; open FDs survive exec.
  • A printf() issued just before exec might not produce output because the stdio buffers are not flushed.
    • Use fflush() before the exec.
  • The exec’ed program also gets access to environ.
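
A small sketch of the buffering pitfall (when stdout is redirected to a file it is fully buffered, so the text below would be lost without the fflush()):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("before exec");  /* sits in the stdio buffer (no newline) */
    fflush(stdout);         /* without this line, the text can vanish on exec */
    execlp("echo", "echo", "after exec", (char *)NULL);
    perror("execlp");       /* reached only if exec fails */
    return 1;
}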

Written with StackEdit.

Notes on fork()

  • We can avoid zombies by having the parent process call wait( ):
wait(int *p);  /* *p receives the child's exit status, so we can
                  tell whether the child terminated normally */
  • The child process has its own copy of globals too.
  • The child gets a copy of the parent's file descriptor table, with FDs for all files open in the parent; both copies refer to the same open files.
  • fread() is a buffered read (stdio pulls in a block, e.g. 1024 bytes, at a time); we call fflush() to empty the buffer on the write side. This is more efficient because reads and writes stay in memory until a block fills.
  • read() is a low-level, byte-wise read op. There is no buffering.

The following code writes the string twice to the file. Why? At the time of fork(), the string is still sitting in the stdio buffer, so the parent and the child each flush their own copy of the buffer at exit.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main()
{
  char *p = "hello world";
  FILE *fp = fopen("test", "w");
  fwrite(p, strlen(p), 1, fp);  /* buffered: bytes stay in the stdio buffer */
  fork();                       /* child inherits a copy of the buffer */
  return 0;                     /* both processes flush at exit */
}

  • fork() also copies the environment variables of the parent to the child. Type $ set to list env vars.
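
A minimal fork()/wait() sketch showing how the parent reads the child's return code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();
    if (pid == 0)
        exit(42);                  /* child terminates with a code */
    int status;
    wait(&status);                 /* parent reaps the child (no zombie left) */
    if (WIFEXITED(status))
        printf("child exited with %d\n", WEXITSTATUS(status));
    return 0;
}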

Written with StackEdit.

Thursday, August 2, 2018

dyld: Library not loaded: /usr/local/opt/python/Frameworks/Python.framework/Versions/3.6/Python

I started getting this error after trying to install macvim as follows:

brew install macvim --override-system-vim

The error string is as follows:

$ vi linkedlist.cc
dyld: Library not loaded: /usr/local/opt/python/Frameworks/Python.framework/Versions/3.6/Python
  Referenced from: /usr/local/bin/vim
  Reason: image not found
Abort trap: 6

I checked shared libs for vim using otool.

$ otool -L /usr/local/bin/vim
/usr/local/bin/vim:
 /usr/lib/libncurses.5.4.dylib (compatibility version 5.4.0, current version 5.4.0)
 /usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
 /System/Library/Frameworks/AppKit.framework/Versions/C/AppKit
              (compatibility version 45.0.0, current version 1561.40.112)
 /usr/local/opt/lua/lib/liblua.5.3.dylib 
               (compatibility version 5.3.0, current version 5.3.4)
 /usr/local/opt/perl/lib/perl5/5.26.1/darwin-thread-multi-2level/CORE/libperl.dylib
               (compatibility version 5.26.0, current version 5.26.1)
 /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1252.50.4)
 /usr/lib/libutil.dylib (compatibility version 1.0.0, current version 1.0.0)
 /usr/local/opt/python/Frameworks/Python.framework/Versions/3.6/Python 
              (compatibility version 3.6.0, current version 3.6.0)
 /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
               (compatibility version 150.0.0, current version 1452.23.0)
 /usr/local/opt/ruby/lib/libruby.2.5.dylib (compatibility version 2.5.0, current version 2.5.1)
 /usr/lib/libobjc.A.dylib (compatibility version 1.0.0, current version 228.0.0)
 /System/Library/Frameworks/CoreServices.framework/Versions/A/CoreServices 
                (compatibility version 1.0.0, current version 822.31.0)
 /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation 
                         (compatibility version 300.0.0, current version 1452.23.0)

It appears that the binary referenced shared library versions that no longer matched what was installed.

To solve the problem, I upgraded the installed packages.

$ brew update
Already up-to-date.

$ brew upgrade

vim started working after the upgrade.

Written with StackEdit.

Wednesday, August 1, 2018

Ceph Outage with OSDs Heartbeat failure on Hammer (0.94.6)

Symptoms

  • The cluster went down after 24 OSDs were added and marked in simultaneously.
  • This was an erasure-coded (10+5) RGW cluster on Hammer.
  • OSDs started failing, and eventually 50% of the OSDs were down.
  • Manual efforts to bring them up failed, and we saw heartbeat failures in the OSD logs.
  • Each OSD was consuming ~15 GB of RAM, and OSDs were hitting out-of-memory errors:
2018-07-18 08:58:12.794311 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.55 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)

2018-07-18 08:58:12.794315 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.57 since back 2018-07-18 08:45:42.452510 front 2018-07-18 08:45:42.452510 (cutoff 2018-07-18 08:57:12.794247)

2018-07-18 08:58:12.794321 7f4aa0925700 -1 osd.127 206901 heartbeat_check: no reply from osd.82 since back 2018-07-18 08:45:13.647493 front 2018-07-18 08:45:13.647493 (cutoff 2018-07-18 08:57:12.794247)
  • OSD maps were out of sync:

2018-07-18 08:56:52.668789 7f4886d7b700 0 -- 10.33.49.153:6816/505502 >> 10.33.213.157:6801/2707 pipe(0x7f4a4f39d000 sd=26 :13251 s=1 pgs=233 cs=2 l=0 c=0x7f4a4f1b8980).connect claims to be 10.33.213.157:6801/1003787 not 10.33.213.157:6801/2707 - wrong node!

  • Each OSD had ~3000 threads, most of them in a sleeping state.
  • Using GDB to get a backtrace of all threads, we found that most of the active threads were just Simple Messenger pipe readers.
  • We suspected a memory leak in the Ceph code.

Band-aid Fixes

  • Setting norebalance, norecover, nobackfill (the commands are listed after this section)

  • Adding swap memory on the OSD nodes

  • Tuning the heartbeat interval

  • Tuning OSD map sync and setting noout, nodown to let OSDs sync their maps. A booting OSD that lagged on maps looked like this:

$ sudo ceph daemon osd.148 status
{
    "cluster_fsid": "621d76ce-a208-42d6-a15b-154fcb09xcrt",
    "osd_fsid": "09650e4c-723e-45e0-b2ef-5b6d11a6da03",
    "whoami": 148,
    "state": "booting",
    "oldest_map": 156518,
    "newest_map": 221059,
    "num_pgs": 1295
}
  • Tuning OSD map cache size to 20
  • Finding non-Ceph processes consuming network, CPU, and RAM, and killing them
  • Starting OSDs one by one - that worked for us :-)
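
For reference, the flags mentioned above were set with the standard CLI:

$ ceph osd set norebalance
$ ceph osd set norecover
$ ceph osd set nobackfill
$ ceph osd set noout
$ ceph osd set nodown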

RCA

  • The major culprit was a rogue process consuming massive network bandwidth on the OSD nodes.
  • Since network bandwidth was scarce, many messenger threads were left waiting.
  • The Simple Messenger threads do synchronous I/O and wait until their data gets through.
  • That is one reason an OSD had ~3000 threads and consumed ~15 GB of memory.
  • With the network saturated, OSD heartbeat messages were blocked too, so OSDs were either committing suicide (heartbeat timeout) or dying of OOM.

Written with StackEdit.