Innards of Tar

The La Brea Carpets

I’ve been working with tar files a lot lately and I haven’t been able to find a good example of what a tar file looks like, byte-by-byte. The specification is the best reference I’ve found for how tar files are structured, but it isn’t exactly friendly. Here’s an interactive breakdown of what tar files look like on the inside.

First, we’ll make a directory and some files:

$ mkdir tar_test
$ cd tar_test
~/tar_test$ mkdir subdir0 subdir1 subdir2
~/tar_test$ echo file0 > file0
~/tar_test$ echo file0 > subdir1/file0
~/tar_test$ echo file0 > subdir2/file0

Feel free to put whatever files you want in here; it’s a pretty easy-to-understand format. If you’re feeling frisky, add some symlinks.

Now tar them up:

~/tar_test$ tar cvvf tar_test.tar *
-rw-r----- k/k     6 2014-05-15 16:29 file0
drwxr-x--- k/k     0 2014-05-15 16:29 subdir0/
drwxr-x--- k/k     0 2014-05-15 16:30 subdir1/
-rw-r----- k/k     6 2014-05-15 16:30 subdir1/file0
drwxr-x--- k/k     0 2014-05-15 16:30 subdir2/
-rw-r----- k/k     6 2014-05-15 16:30 subdir2/file0

And check out your tar file to make sure everything looks alright:

~/tar_test$ tar tf tar_test.tar
file0
subdir0/
subdir1/
subdir1/file0
subdir2/
subdir2/file0

Tar files are organized into blocks of 512 bytes. Basically, the format of a tar file is:

Block #   Description
0         Header
1         Content
2         Header
3         Content

If the content is longer than one block, it’ll be rounded up (so if you have a 1300-byte file, the tar entry will look like Header-Content-Content-Content). If an entry has no content (e.g., a directory or symbolic link) it only takes up one block. So, our tar file looks like:

Block #   Description
0         Header for file0
1         Content of file0
2         Header for subdir0
3         Header for subdir1
4         Header for subdir1/file0
5         Content of subdir1/file0
6         Header for subdir2
7         Header for subdir2/file0
8         Content of subdir2/file0
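
The rounding rule is easy to sketch in C. The name tar_entry_blocks is something I made up for illustration; this is just the arithmetic, not code from any tar implementation:

```c
#include <stddef.h>

#define TAR_BLOCK_SIZE 512

// Hypothetical helper: number of 512-byte blocks one archive entry takes up,
// i.e. one header block plus the content rounded up to whole blocks.
size_t tar_entry_blocks(size_t content_bytes) {
  size_t content_blocks = (content_bytes + TAR_BLOCK_SIZE - 1) / TAR_BLOCK_SIZE;
  return 1 + content_blocks;
}
```

With this, tar_entry_blocks(1300) is 4 (Header-Content-Content-Content), a 6-byte file takes 2 blocks, and a directory (no content) takes 1.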

Nine 512-byte blocks add up to 4.5KB, but if we ls -lh the .tar, we get something bigger:

~/tar_test$ ls -lh tar_test.tar 
-rw-r----- 1 k k 10K May 16 15:19 tar_test.tar

There’s always an extra 1KB of 0s tacked onto the end of a .tar’s content as a footer, which brings us to 5.5KB. On top of that, the archive is padded out to a multiple of an implementation-dependent size (called the blocksize, which is different from the 512-byte blocks discussed above). On my Linux machine, tar creates the 10KB archive shown above; on my OS X machine, it’s only 5.5KB.
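
Here is a rough sketch of that size calculation. The helper name is hypothetical, and I’m assuming GNU tar’s default blocking factor of 20 (10,240-byte records) for the Linux number. A factor of 1 happens to reproduce the OS X number, but that is my guess at how that tar pads, not something I’ve checked:

```c
#include <stddef.h>

#define TAR_BLOCK_SIZE 512

// Hypothetical helper: on-disk size of an archive holding `entry_blocks`
// header/content blocks, plus the two all-zero footer blocks, padded up to
// a whole record of `blocking_factor` blocks.
size_t tar_archive_bytes(size_t entry_blocks, size_t blocking_factor) {
  size_t record = blocking_factor * TAR_BLOCK_SIZE;
  size_t unpadded = (entry_blocks + 2) * TAR_BLOCK_SIZE;
  return (unpadded + record - 1) / record * record;
}
```

For our nine blocks, tar_archive_bytes(9, 20) gives 10240 (the 10K that ls shows) and tar_archive_bytes(9, 1) gives 5632 (5.5KB).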

Now we’re going to really look at the contents of the tar file, using hexdump. 512 bytes is 0x200 in hexadecimal, so every multiple of 0x200 is the start of a new block in the archive.

~/tar_test$ hexdump -C tar_test.tar | more

You can see that the archive starts with the first entry’s filename:

00000000  66 69 6c 65 30 00 00 00  00 00 00 00 00 00 00 00  |file0...........|

Hexdump elides all-zero portions of the file, so the next interesting bit is the rest of the header:

00000060  00 00 00 00 30 30 30 30  36 34 30 00 30 36 30 31  |....0000640.0601|
00000070  34 35 34 00 30 30 31 31  36 31 30 00 30 30 30 30  |454.0011610.0000|
00000080  30 30 30 30 30 30 36 00  31 32 33 33 35 32 32 31  |0000006.12335221|
00000090  36 36 35 00 30 31 31 33  33 32 00 20 30 00 00 00  |665.011332. 0...|

Here’s what those numbers are (you can look up these fields in the pax spec). Note that they are ASCII digits (the byte value of ‘0’ is 0x30) and that each number is written in octal:

Value         Field
0000640       mode
0601454       uid
0011610       gid
00000000006   size
12335221665   mtime
011332        chksum
0             typeflag
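
Since every numeric field is an octal string, turning one into a number takes only a few lines. This is a minimal sketch with a made-up function name, and it does no error handling:

```c
#include <stddef.h>

// Hypothetical helper: parse an octal ASCII field of at most `len` bytes,
// as found in a tar header. Parsing stops at the first byte that is not an
// octal digit (normally the terminating NUL or space).
unsigned long long tar_parse_octal(const char *field, size_t len) {
  unsigned long long value = 0;
  size_t i = 0;
  while (i < len && field[i] == ' ') {
    i++;  // tolerate leading spaces
  }
  for (; i < len && field[i] >= '0' && field[i] <= '7'; i++) {
    value = value * 8 + (unsigned long long)(field[i] - '0');
  }
  return value;
}
```

Fed the fields above, the mode comes out as 0640 (rw-r-----), the uid as 197420, the gid as 5000, the size as 6 bytes, and the mtime as 1400185781 seconds since the epoch, which is the afternoon of 2014-05-15.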

Typeflag is the most interesting field here: it indicates the type of file (0 for normal files, 5 for directories). It can also be “x” to indicate an “extended header.” Extended headers are used to define your own fields or override fields in the header. For example, the header said that the mtime was 12335221665, but we could override that in an extended header with mtime=12345678901. If you have an extended header, the entry ends up taking an extra kilobyte of storage: one block for the extended header, and one block for a “normal” header, which is identical to the initial header except that it contains the actual file type instead of “x”. So you’d have:

Block #   Description
0         Header for file0 (typeflag=x)
1         Extended header of key=value pairs of attributes for file0
2         Header for file0 (typeflag=0)
3         Content of file0

The next part of the header is for links, so it’s all 0 for these normal files and directories. Then you finish up the header with:

00000100  00 75 73 74 61 72 20 20  00 6b 00 00 00 00 00 00  |.ustar  .k......|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000120  00 00 00 00 00 00 00 00  00 6b 00 00 00 00 00 00  |.........k......|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

“ustar” is a “magic” string that gives the tar format. The “k”s are my username and group name.
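
Putting the whole header together, the layout can be written down as a C struct. The field names and sizes follow the ustar header in the pax spec; the struct itself is just my sketch (a real tar implementation might well read the block differently), and the last field is my own name for the unused tail of the block:

```c
#include <stddef.h>

// One 512-byte ustar header block. Every field is plain ASCII, and the
// numeric fields (mode, uid, gid, size, mtime, chksum) are octal strings.
struct ustar_header {
  char name[100];      // offset 0x000: file name
  char mode[8];        // offset 0x064
  char uid[8];         // offset 0x06c
  char gid[8];         // offset 0x074
  char size[12];       // offset 0x07c
  char mtime[12];      // offset 0x088
  char chksum[8];      // offset 0x094
  char typeflag;       // offset 0x09c: '0' file, '5' directory, 'x' extended
  char linkname[100];  // offset 0x09d: link target, all zeros otherwise
  char magic[6];       // offset 0x101: "ustar"
  char version[2];     // offset 0x107
  char uname[32];      // offset 0x109: user name
  char gname[32];      // offset 0x129: group name
  char devmajor[8];    // offset 0x149
  char devminor[8];    // offset 0x151
  char prefix[155];    // offset 0x159
  char padding[12];    // unused bytes that fill the block up to 512
};
```

The offsets line up with the hexdump: the mode starts at 0x64, the typeflag sits at 0x9c, and the two “k”s are at 0x109 and 0x129. As far as I know, “ustar” followed by two spaces and a NUL (what the dump shows) is how GNU tar fills the magic and version fields; the pax spec has “ustar”, a NUL, and then “00”.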

At 0x200 is the actual file content:

00000200  66 69 6c 65 30 0a 00 00  00 00 00 00 00 00 00 00  |file0...........|

Then at 0x400, the next block (subdir0’s header) starts:

00000400  73 75 62 64 69 72 30 2f  00 00 00 00 00 00 00 00  |subdir0/........|

This is what tar looks like “under the covers.” It’s a lot more sparse than I thought it’d be, but I guess that’s where gzip comes in.

The Basics of Signal Handling

Signals are one of the most basic ways programs can receive messages from the outside world. I’ve found limited tutorial-type documentation on them, so this post covers how to set them up and some debugging techniques.

The easiest way to get a feel for signal handling is to play with a simple C program, like this:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void my_handler(int signum) {
  const char msg[] = "Signal handler got signal\n";
  // sizeof msg counts the trailing NUL byte, so write one byte less
  write(STDOUT_FILENO, msg, sizeof msg - 1);
}

int main(int argc, char *argv[]) {
  printf("PID: %d\n", getpid());

  // Set up signal handler
  struct sigaction action = {0};
  action.sa_handler = &my_handler;
  sigaction(SIGINT, &action, NULL);

  while (1) {
    pause();
  }
  return 0;
}

Compile and run and try hitting Ctrl-C a few times:

$ gcc signals.c -o signals
$ ./signals 
PID: 11152
^CSignal handler got signal
^CSignal handler got signal
^CSignal handler got signal

Each signal calls the signal handler we set up.

If you attach strace (system call tracer) and then hit Ctrl-C in the terminal running ./signals again, you can see each signal coming in:

$ strace -p 11152 -e trace=none -e signal=all
Process 11152 attached - interrupt to quit
--- SIGINT (Interrupt) @ 0 (0) ---
--- SIGINT (Interrupt) @ 0 (0) ---
--- SIGINT (Interrupt) @ 0 (0) ---

As we can’t kill it with Ctrl-C, we can use kill to shut down ./signals:

$ kill 11152

kill defaults to sending a SIGTERM, which we’re not handling (yet). You could add a handler for it by adding the line sigaction(SIGTERM, &action, NULL); but then we’d have to kill -9 the process to kill it (which is two extra characters of typing) so I’m leaving SIGTERM unhandled.

Ignoring Signals

There are also ways to make your program not even receive signals: ignoring and blocking them (which are subtly different). To ignore a signal, change sa_handler to SIG_IGN:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void my_handler(int signum) {
  const char msg[] = "Signal handler got signal\n";
  // sizeof msg counts the trailing NUL byte, so write one byte less
  write(STDOUT_FILENO, msg, sizeof msg - 1);
}

int main(int argc, char *argv[]) {
  printf("PID: %d\n", getpid());

  // Ignore SIGINT instead of handling it
  struct sigaction action = {0};
  action.sa_handler = SIG_IGN;
  sigaction(SIGINT, &action, NULL);

  while (1) {
    pause();
  }
  return 0;
}

Now recompile and run and hit Ctrl-C. You’ll get something like this:

$ ./signals
PID: 86579
^C^C^C^C^C

If you attach strace, you’ll see that ./signals isn’t even receiving the SIGINTs.

You can see the signals a program is ignoring by looking at /proc/PID/status:

$ cat /proc/86579/status
Name:   signals
State:  S (sleeping)
Tgid:   86579
Pid:    86579
PPid:   30493
TracerPid:      0
Uid:    197420  197420  197420  197420
Gid:    5000    5000    5000    5000
FDSize: 256
Groups: 4 20 24 25 44 46 104 128 499 5000 5001 5762 74990 75209 77056 78700 79910 79982 
VmPeak:     4280 kB
VmSize:     4160 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:       352 kB
VmRSS:       352 kB
VmData:       48 kB
VmStk:       136 kB
VmExe:         4 kB
VmLib:      1884 kB
VmPTE:        28 kB
VmSwap:        0 kB
Threads:        1
SigQ:   0/192723
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000002
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   ffffffff
Cpus_allowed_list:      0-31
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        2
nonvoluntary_ctxt_switches:     3

SigIgn is a hexadecimal number and it has kind of a weird format: it’s a bitmask in which the bit for each ignored signal’s number is set. So, for SIGINT (signal 2), the second bit is set: 0x2. For SIGTERM (signal 15), the 15th bit is set: 0100_0000_0000_0000 in binary or 0x4000 in hexadecimal. So, if you were ignoring both SIGINT and SIGTERM, SigIgn would look like: 0000000000004002.

SigCgt is for signals that are being caught by the program and SigBlk is for signals that are being blocked.
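
If you don’t feel like counting bits by hand, here’s a tiny sketch that checks one of these masks for a signal. The function name is made up, and it assumes you’ve already turned the hex string from /proc/PID/status into a number (strtoull with base 16 would do that):

```c
// Hypothetical helper: returns 1 if signal number `signum` is set in a mask
// taken from a SigIgn, SigBlk or SigCgt line, and 0 otherwise.
// Signal N lives in bit N-1 (counting the lowest bit as bit 0).
int mask_has_signal(unsigned long long mask, int signum) {
  if (signum < 1 || signum > 64) {
    return 0;
  }
  return (int)((mask >> (signum - 1)) & 1ULL);
}
```

So mask_has_signal(0x4002, 2) and mask_has_signal(0x4002, 15) are both 1, while mask_has_signal(0x2, 15) is 0.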

Blocking Signals

What if you want your program to handle any signals that come in, just do it later? You might have a critical section where you don’t want to be interrupted, but afterwards you want to know what came in. That’s where blocking signals comes in handy.

You can block signals using sigprocmask:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void my_handler(int signum) {
  const char msg[] = "Signal handler got signal\n";
  // sizeof msg counts the trailing NUL byte, so write one byte less
  write(STDOUT_FILENO, msg, sizeof msg - 1);
}

int main(int argc, char *argv[]) {
  printf("PID: %d\n", getpid());

  // Set up signal handler
  struct sigaction action = {0};
  action.sa_handler = &my_handler;
  sigaction(SIGINT, &action, NULL);

  printf("Blocking signals...\n");
  sigset_t sigset;
  sigemptyset(&sigset);
  sigaddset(&sigset, SIGINT);
  sigprocmask(SIG_BLOCK, &sigset, NULL);

  // Critical section
  sleep(5);

  printf("Unblocking signals...\n");
  sigprocmask(SIG_UNBLOCK, &sigset, NULL);

  while (1) {
    pause();
  }
  return 0;
}

First we create a sigset_t which can hold a set of signals. We empty out the set with a call to sigemptyset and add a signal element: SIGINT. (There are a bunch of other set ops you can use to modify sigset_t, if necessary.)

If you compile and run this and try Ctrl-C-ing while signals are blocked, one signal will be “let through” when the signals are unblocked:

$ ./signals
PID: 86791
Blocking signals...
^C^C^C^C^C^CUnblocking signals...
Signal handler got signal
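
You can watch the same thing happen without a keyboard. The sketch below is my own illustration with made-up names: it sends itself SIGUSR1 twice while the signal is blocked, checks that it is pending, and then unblocks it. I’m assuming Linux behavior here, where ordinary signals are not queued, so the two sends collapse into one delivery:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t handler_runs = 0;

static void note_signal(int signum) {
  (void)signum;
  handler_runs++;
}

// Hypothetical demo: returns 1 if SIGUSR1 raised while blocked is held back
// and then delivered exactly once after unblocking, 0 otherwise.
int blocked_signal_is_deferred(void) {
  struct sigaction action;
  sigset_t set, pending;
  int held;

  memset(&action, 0, sizeof action);
  action.sa_handler = note_signal;
  sigemptyset(&action.sa_mask);
  if (sigaction(SIGUSR1, &action, NULL) != 0) {
    return 0;
  }

  sigemptyset(&set);
  sigaddset(&set, SIGUSR1);
  sigprocmask(SIG_BLOCK, &set, NULL);

  handler_runs = 0;
  raise(SIGUSR1);  // sent twice while blocked...
  raise(SIGUSR1);
  sigpending(&pending);
  held = (handler_runs == 0) && (sigismember(&pending, SIGUSR1) == 1);

  sigprocmask(SIG_UNBLOCK, &set, NULL);  // ...delivered once, right here
  return held && handler_runs == 1;
}
```

Calling blocked_signal_is_deferred() from main should give 1 on Linux: the handler does not run during the “critical section” and runs a single time as soon as the signal is unblocked.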

One common use for this is blocking signals while the signal handler is running. That way you can have your signal handler modify some non-atomic state (say, a counter of how many signals have come in) in a safe way.

However, suppose we called sigprocmask in the signal handler. There will always be a race condition: another signal could come in before we’ve called sigprocmask! So sigaction takes a mask of signals it should block while the handler is executing:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

void my_handler(int signum) {
  const char msg[] = "Signal handler got signal\n";
  // sizeof msg counts the trailing NUL byte, so write one byte less
  write(STDOUT_FILENO, msg, sizeof msg - 1);
}

int main(int argc, char *argv[]) {
  printf("PID: %d\n", getpid());

  // Set up signal handler
  struct sigaction action = {0};
  action.sa_handler = &my_handler;
  // Signals to block while my_handler is running
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGINT);
  sigaddset(&mask, SIGTERM);
  action.sa_mask = mask;
  sigaction(SIGINT, &action, NULL);

  while (1) {
    pause();
  }
  return 0;
}

Here, we’re masking both SIGINT and SIGTERM: if either of these signals comes in while my_handler is running, they’ll be blocked until it completes.

Inheritance

Ignored and blocked signals are inherited when you fork a program. Thus, if a program isn’t responding to signals the way that you expect, it might be the fault of whoever forked it. Also, if you want your program to handle signals in certain ways you should explicitly set that rather than depending on the default.

If you want to use a signal’s default behavior (which is usually “terminate the program”), you can use SIG_DFL as the sa_handler.
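
Here’s a small sketch of the inheritance part, in case you want to see it for yourself. The function name is made up; it ignores SIGINT, forks, and has the child report (through its exit status) whether SIGINT is still ignored on its side:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical demo: returns 1 if a forked child still has SIGINT ignored,
// 0 if it does not, and -1 if something went wrong along the way.
int child_inherits_ignore(void) {
  struct sigaction ignore, previous, in_child;
  int status = 0;
  pid_t pid;

  memset(&ignore, 0, sizeof ignore);
  ignore.sa_handler = SIG_IGN;
  sigemptyset(&ignore.sa_mask);
  if (sigaction(SIGINT, &ignore, &previous) != 0) {
    return -1;
  }

  pid = fork();
  if (pid == 0) {
    // Child: ask for the current disposition without changing it
    if (sigaction(SIGINT, NULL, &in_child) != 0) {
      _exit(2);
    }
    _exit(in_child.sa_handler == SIG_IGN ? 0 : 1);
  }

  // Parent: put the old disposition back, then collect the child's answer
  sigaction(SIGINT, &previous, NULL);
  if (pid < 0 || waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
    return -1;
  }
  if (WEXITSTATUS(status) == 2) {
    return -1;
  }
  return WEXITSTATUS(status) == 0;
}
```

The child never touches SIGINT’s disposition itself, so a result of 1 means it picked up SIG_IGN purely from its parent.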

What you can do in a signal handler

You might notice that I’m using write in the signal handlers above, instead of the somewhat more friendly printf. This is because there is only a small set of “async-signal-safe” functions you can call in a signal handler and printf isn’t one of them. There is a list of functions you can call on the signal(7) man page. A few examples that often come up: you cannot heap-allocate memory, buffer output, or mess with locks.

If you call any unsafe functions in a signal handler, the behavior is undefined (meaning it might work fine, or it might make your car blow up).
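
One common way to stay inside those limits is to do almost nothing in the handler: store to a volatile sig_atomic_t flag and let the normal part of the program do the real work later. A minimal sketch (all names here are mine, not from any library):

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

// Written by the handler, read by the rest of the program
static volatile sig_atomic_t got_signal = 0;

static void flag_handler(int signum) {
  got_signal = signum;  // a plain store: no allocation, no locks, no stdio
}

// Hypothetical helper: install flag_handler for `signum`; returns 0 on success
int install_flag_handler(int signum) {
  struct sigaction action;
  memset(&action, 0, sizeof action);
  action.sa_handler = flag_handler;
  sigemptyset(&action.sa_mask);
  return sigaction(signum, &action, NULL);
}

// Hypothetical helper: called from normal code, returns the last signal
// number seen (or 0 if none) and clears the flag
int take_signal(void) {
  int seen = got_signal;
  got_signal = 0;
  return seen;
}
```

A main loop can then call take_signal() whenever it’s convenient and use printf, malloc and friends freely, since by then it’s no longer running inside the handler. (A signal arriving between the read and the clear in take_signal would be lost; blocking the signal around those two lines would close that gap.)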

Edit: thanks to Vincent Bernat, who mentioned this in the comments.

––thursday #5: diagnosing high readahead

Having readahead set too high can slow your database to a crawl. This post discusses why that is and how you can diagnose it.

The #1 sign that readahead is too high is that MongoDB isn’t using as much RAM as it should be. If you’re running Mongo Monitoring Service (MMS), take a look at the “resident” size on the “memory” chart. Resident memory can be thought of as “the amount of space MongoDB ‘owns’ in RAM.” Therefore, if MongoDB is the only thing running on a machine, we want resident size to be as high as possible. On the chart below, resident is ~3GB:

Is 3GB good or bad? Well, it depends on the machine. If the machine only has 3.5GB of RAM, I’d be pretty happy with 3GB resident. However, if the machine has, say, 15GB of RAM, then we’d like close to 15GB of the data to be in there (the “mapped” field is (sort of) data size, so I’m assuming we have 60GB of data).

Assuming we’re accessing a lot of this data, we’d expect MongoDB’s resident set size to be 15GB, but it’s only 3GB. If we try turning down readahead, the resident size jumps to 15GB and our app starts going faster. But why is this?

Let’s take an example: suppose all of our docs are 512 bytes in size (readahead is set in 512-byte increments, called sectors, so 1 doc = 1 sector makes the math easier). If we have 60GB of data then we have ~120 million documents (60GB of data/(512 bytes/doc)). The 15GB of RAM on this machine should be able to hold ~30 million documents.

Our application accesses documents randomly across our data set, so we’d expect MongoDB to eventually “own” (have resident) all 15GB of RAM, as 1) it’s the only thing running and 2) it’ll eventually fetch at least 15GB of the data.

Now, let’s set our readahead to 100 (100 512-byte sectors, aka 100 documents): blockdev --setra 100. What happens when we run our application?

Picture our disk as looking like this, where each o is a document:

...
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
... // keep going for millions more o's

Let’s say our app requests a document. We’ll mark it with “x” to show that the OS has pulled it into memory:

...
ooooooooooooooooooooooooo
ooooxoooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

See it on the third line there? But that’s not the only doc that’s pulled into memory: readahead is set to 100 so the next 99 documents are pulled into memory, too:

...
ooooooooooooooooooooooooo
ooooxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

Now we have 100 docs in memory, but remember that our application is accessing documents randomly: the likelihood that the next document we access is in that block of 100 docs is almost nil. At this point, there’s 50KB of data in RAM (512 bytes * 100 docs = 51,200 bytes) and MongoDB’s resident size has only increased by 512 bytes (1 doc).

Our app will keep bouncing around the disk, reading docs from here and there and filling up memory with docs MongoDB never asked for until RAM is completely full of junk that’s never been used. Then, it’ll start evicting things to make room for new junk as our app continues to make requests.

Working this out, there’s a 25% chance of our app requesting a doc that’s already in memory (15GB of RAM/60GB of data), so 75% of the requests are going to go to disk. Say we’re doing 2 requests a sec. Then 1 hour of requests is 2 requests * 3600 seconds/hour = 7200 requests, 5400 of which are going to disk (.75 * 7200). If each request pulls back 50KB, that’s 270MB read from disk/hour. If we set readahead to 0, so that each request pulls back a single 512-byte doc, we’ll have under 3MB read from disk/hour.
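
If you want to plug in your own numbers, the arithmetic fits in one small function. This is only a sketch of the back-of-the-envelope model above (the name and parameters are mine), and it counts a readahead of 0 as reading the one 512-byte sector that was actually requested:

```c
#include <stdint.h>

#define SECTOR_BYTES 512

// Hypothetical helper: bytes read from disk per hour when `miss_percent`
// of requests miss RAM and every miss pulls in `sectors_per_read` sectors.
uint64_t disk_bytes_per_hour(uint64_t requests_per_sec, uint64_t miss_percent,
                             uint64_t sectors_per_read) {
  uint64_t requests = requests_per_sec * 3600;
  uint64_t misses = requests * miss_percent / 100;
  return misses * sectors_per_read * SECTOR_BYTES;
}
```

With 2 requests a second and 75% of them missing RAM, that’s 0.75 × 7200 = 5400 misses an hour: disk_bytes_per_hour(2, 75, 100) comes to 276,480,000 bytes (about 270MB), while disk_bytes_per_hour(2, 75, 1) is 2,764,800 bytes (under 3MB).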

Which brings us to the next symptom of a too-high readahead: unexpectedly high disk IO. Because most of the data we want isn’t in memory, we keep having to go to disk, dragging shopping-carts full of junk into RAM, perpetuating the high disk io/low resident mem cycle.

The general takeaway is that a DB is not a “normal” workload for an OS. The default settings may screw you over.

––thursday #4: blockdev

Disk IO is slow. You just won’t believe how vastly, hugely, mind-bogglingly slow it is. I mean, you may think your network is slow, but that’s just peanuts to disk IO.

The image below helps visualize how slow (post continues below).

(Originally found on Hacker News and inspired by Gustavo Duarte’s blog.)

The kernel knows how slow the disk is and tries to be smart about accessing it. It not only reads the data you requested, it also returns a bit more. This way, if you’re reading through a file or watching a movie (sequential access), your system doesn’t have to go to disk as frequently because you’re pulling more data back than you strictly requested each time.

You can see how far the kernel reads ahead using the blockdev tool:

$ sudo blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     80026361856   /dev/sda
rw   256   512  4096       2048     80025223168   /dev/sda1
rw   256   512  4096          0   2000398934016   /dev/sdb
rw   256   512  1024       2048        98566144   /dev/sdb1
rw   256   512  4096     194560      7999586304   /dev/sdb2
rw   256   512  4096   15818752     19999490048   /dev/sdb3
rw   256   512  4096   54880256   1972300152832   /dev/sdb4

Readahead is listed in the “RA” column. As you can see, I have two disks (sda and sdb) with readahead set to 256 on each. But what unit is that 256? Bytes? Kilobytes? Dolphins? If we look at the man page for blockdev, it says:

$ man blockdev
...
       --setra N
              Set readahead to N 512-byte sectors.
...

This means that my readahead is 512 bytes*256=131072 or 128KB. That means that, whenever I read from disk, the disk is actually reading at least 128KB of data, even if I only requested a few bytes.
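
The conversion is trivial, but it’s easy to be off by a factor of 512 when you’re reading someone else’s settings, so here it is spelled out (the function name is just for illustration):

```c
#include <stdint.h>

// Hypothetical helper: convert a blockdev RA value, which is counted in
// 512-byte sectors, into bytes.
uint64_t readahead_bytes(uint64_t ra_sectors) {
  return ra_sectors * 512;
}
```

readahead_bytes(256) is 131072 (128KB), and a setting of 65536 works out to 33554432 bytes, or 32MB.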

So what value should you set your readahead to? Please don’t set it to a number you find online without understanding the consequences. If you Google for “blockdev setra”, the first result uses blockdev --setra 65536, which translates to 32MB of readahead. That means that, whenever you read from disk, the disk is actually doing 32MB worth of work. Please do not set your readahead this high if you’re doing a lot of random-access reads and writes, as all of the extra IO can slow things down a lot (and if you’re low on memory, you’ll be forcing the kernel to fill up your RAM with data you won’t need).

Getting a good readahead value can help disk IO issues to some extent, but if you are using MongoDB (in particular), please consider your typical document size and access patterns before changing your blockdev settings. I’m not recommending any particular value because what’s perfect for one application/machine can be death for another.

I’m really enjoying these –thursday posts because every week people have commented with different/better/interesting ways of doing what I talked about (or ways of telling the difference between stalagmites and stalactites), which is really cool. So I’m throwing this out there: how would you figure out what a good readahead setting is? Next week I’m planning to do iostat for –thursday which should cover this a bit, but please leave a comment if you have any ideas.

––thursday #2: diff ‘n patch

I’m trying something new: every Thursday I’ll do a short post on how to do something with the command line.

I always seem to either create or apply patches in the wrong direction. It’s like stalagmites vs. stalactites, which I struggled with until I heard the mnemonic: “Stalagmites might hang from the ceiling… but they don’t.”

Moving right along, you can use diff to get line-by-line changes between any two files. Generally I use git diff because I’m dealing with a git repo, so that’s what I’ll use here.

Let’s get a diff of MongoDB between version 2.0.2 and 2.0.3.

$ git clone https://github.com/mongodb/mongo.git
$ cd mongo
$ git diff r2.0.2..r2.0.3 > mongo.patch

This takes all of the changes between 2.0.2 and 2.0.3 (r2.0.2..r2.0.3) and dumps them into a file called mongo.patch (that’s the > mongo.patch part).

Now, let’s get the code from 2.0.2 and apply mongo.patch, effectively making it 2.0.3 (this is kind of a silly example but if you’re still with me after the stalagmite thing, I assume you don’t mind silly examples):

$ git checkout r2.0.2
Note: checking out 'r2.0.2'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 514b122... BUMP 2.0.2
$ 
$ patch -p1 < mongo.patch

What intuitive syntax!

What does the -p1 mean? How many forward slashes to remove from the path in the patch, of course.

To take an example, if you look at the last 11 lines of the patch, you can see that it is the diff for the file that changes the version number. It looks like this:

$ tail -n 11 mongo.patch
--- a/util/version.cpp
+++ b/util/version.cpp
@@ -38,7 +38,7 @@ namespace mongo {
      *      1.2.3-rc4-pre-
      * If you really need to do something else you'll need to fix _versionArray()
      */
-    const char versionString[] = "2.0.2";
+    const char versionString[] = "2.0.3";
 
     // See unit test for example outputs
     static BSONArray _versionArray(const char* version){

Note the a/util/version.cpp and b/util/version.cpp. These indicate the file the patch should be applied to, but there are no a or b directories in the MongoDB repository. The a and b prefixes indicate that one is the previous version and one is the new version. And -p says how many slashes to strip from this path. An example may make this clearer:

  • -p0: “apply this patch to a/util/version.cpp” (which doesn’t exist)
  • -p1: “apply this patch to util/version.cpp” ← bingo, that’s what we want
  • -p2: “apply this patch to version.cpp” (which doesn’t exist)

So, we use -p1, because that makes the patch’s paths match the actual directory structure. If someone sent you a patch with a path like /home/bob/bobsStuff/foo.txt and your name is not Bob (you’re just trying to patch foo.txt), you’d probably want to use -p4.
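
If it helps to see the stripping as code, here’s a rough C sketch of what -pN does to a path. This is my own simplified imitation, not patch’s actual implementation (for one thing, it doesn’t treat runs of adjacent slashes specially):

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper mimicking patch's -pN: skip the shortest prefix of
// `path` that contains `n` slashes. Returns a pointer into `path`, or NULL
// if the path has fewer than `n` slashes.
const char *strip_components(const char *path, int n) {
  while (n > 0) {
    const char *slash = strchr(path, '/');
    if (slash == NULL) {
      return NULL;
    }
    path = slash + 1;
    n--;
  }
  return path;
}
```

strip_components("a/util/version.cpp", 1) gives "util/version.cpp", and strip_components("/home/bob/bobsStuff/foo.txt", 4) gives "foo.txt", matching the examples above.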

On the plus side, if you’re using patches generated by git, they’re super-easy to apply. Git chose the intuitive verb “apply” to patch a file. If you have a patch generated by git diff, you can patch your current tree with:

$ git apply mongo.patch

So, aside from the stupid choice of verbiage, that is generally easier.

Did I miss anything? Get anything wrong? Got a suggestion for next week? Leave a comment below and let me know!

––thursday #1: screen

I’m trying something new: every Thursday I’ll go over how to do something with the command line. Let me know what you think.

If you are using a modern-ish browser, you probably use tabs to keep multiple things open at once: your email, your calendar, whatever you’re actually doing, etc. You can do the same thing with the shell using screen: in a single terminal, you can compile a program while you’re editing a file and watching another process out of the corner of your eye.

Note that screen is super handy when SSH’d into a box. SSH in once, then start screen and open up all of the windows you need.

Using screen

To start up screen, run:

$ screen

Now your shell will clear and screen will give you a welcome message.


Screen version 4.00.03jw4 (FAU) 2-May-06

Copyright (c) 1993-2002 Juergen Weigert, Michael Schroeder
Copyright (c) 1987 Oliver Laumann

...




                          [Press Space or Return to end.]

As it says at the bottom, just hit Return to clear the welcome message. Now you’ll see an empty prompt and you can start working normally.

Let’s say we have three things we want to do:

  1. Run top
  2. Edit a file
  3. Tail a log

Go ahead and start up top:

$ top

Well, now we need to edit a file but top’s using the shell. What to do now? Just create a new window. While top is still running, hit ^A c (I’m using ^A as shorthand for Control-a, so this means “hit Control-a, then hit c”) to create a new window. The new window gets put right on top of the old one, so you’ll see a fresh shell and be at the prompt again. But where did top go? Not to worry, it’s still there. We can switch back to it with ^A n or ^A p (next or previous window).

Now we can start up our editor and begin editing a file. But now we want to tail a file, so we create another new window with ^A c and run our tail -f filename. We can continue to use ^A n and ^A p to switch between the three things we’re doing (and open more windows as necessary).

Availability

screen seems pretty ubiquitous: it has been on every Linux machine I’ve ever tried running it on and even OS X (although it may be part of Xcode, I haven’t checked).

Note for Emacs Users

^A is an annoying escape key, as it is also the go-to-beginning-of-line shortcut in Emacs (and the shell). To fix this, create a .screenrc file and add one line to change this to something else:

# use ^T
escape ^Tt
# or ^Y
escape ^Yy

The escape sequence is 3 characters: caret, T, and t. (It is not using the single special character “^T”.) The traditional escape key is actually Ctrl-^, as the caret is the one character Emacs doesn’t use for anything. In a .screenrc file, this results in the rather bizarre string:

escape ^^^^

…which makes sense when you think about it, but looks a bit weird.

Odds and Ends

As long as you’re poking at the .screenrc file, you might want to turn off the welcome message, too:

startup_message off

Run ^A ? anytime for help, or check out the manual’s list of default bindings.

Did I miss anything? Get anything wrong? Got a suggestion for next week? Leave a comment below and let me know!

Playing with Virtual Memory

Linux: the developer's personal gentleman

When you run a process, it needs some memory to store things: its heap, its stack, and any libraries it’s using. Linux provides and cleans up memory for your process like an extremely conscientious butler. You can (and generally should) just let Linux do its thing, but it’s a good idea to understand the basics of what’s going on.

One easy way (I think) to understand this stuff is to actually look at what’s going on using the pmap command. pmap shows you memory information for a given process.

For example, let’s take a really simple C program that prints its own process id (PID) and pauses:

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main() {
  printf("run `pmap %d`\n", getpid());
  pause();
}

Save this as mem_munch.c. Now compile and run it with:

$ gcc mem_munch.c -o mem_munch
$ ./mem_munch
run `pmap 25681`
 

The PID you get will probably be different than mine (25681).

At this point, the program will “hang.” This is because of the pause() function, and it’s exactly what we want. Now we can look at the memory for this process at our leisure.

Open up a new shell and run pmap, replacing the PID below with the one mem_munch gave you:

$ pmap 25681
25681:   ./mem_munch
0000000000400000      4K r-x--  /home/user/mem_munch
0000000000600000      4K r----  /home/user/mem_munch
0000000000601000      4K rw---  /home/user/mem_munch
00007fcf5af88000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
00007fcf5b112000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
00007fcf5b311000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
00007fcf5b315000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
00007fcf5b316000     24K rw---    [ anon ]
00007fcf5b31c000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
00007fcf5b512000     12K rw---    [ anon ]
00007fcf5b539000     12K rw---    [ anon ]
00007fcf5b53c000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
00007fcf5b53d000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
00007fff7efd8000    132K rw---    [ stack ]
00007fff7efff000      4K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total             3984K

This output is how memory “looks” to the mem_munch process. If mem_munch asks the operating system for 00007fcf5af88000, it will get libc. If it asks for 00007fcf5b31c000, it will get the ld library.

This output is a bit dense and abstract, so let’s look at how some more familiar memory usage shows up. Change our program to put some memory on the stack and some on the heap, then pause.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

int main() {
  int on_stack, *on_heap;

  // local variables are stored on the stack
  on_stack = 42;
  printf("stack address: %p\n", (void*)&on_stack);

  // malloc allocates heap memory
  on_heap = (int*)malloc(sizeof(int));
  printf("heap address: %p\n", (void*)on_heap);

  printf("run `pmap %d`\n", getpid());
  pause();
}

Now compile and run it:

$ ./mem_munch 
stack address: 0x7fff497670bc
heap address: 0x1b84010
run `pmap 11972`

Again, your exact numbers will probably be different than mine.

Before you kill mem_munch, run pmap on it:

$ pmap 11972
11972:   ./mem_munch
0000000000400000      4K r-x--  /home/user/mem_munch
0000000000600000      4K r----  /home/user/mem_munch
0000000000601000      4K rw---  /home/user/mem_munch
0000000001b84000    132K rw---    [ anon ]
00007f3ec4d98000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec4f22000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec5121000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec5125000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
00007f3ec5126000     24K rw---    [ anon ]
00007f3ec512c000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
00007f3ec5322000     12K rw---    [ anon ]
00007f3ec5349000     12K rw---    [ anon ]
00007f3ec534c000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
00007f3ec534d000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
00007fff49747000    132K rw---    [ stack ]
00007fff497bb000      4K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total             4116K

Note that there’s a new entry between the final mem_munch section and libc-2.13.so. What could that be?


# from pmap
0000000001b84000 132K rw--- [ anon ]
# from our program
heap address: 0x1b84010

The addresses are almost the same. That block ([ anon ]) is the heap. (pmap labels blocks of memory that aren’t backed by a file [ anon ]. We’ll get into what being “backed by a file” means in a sec.)

The second thing to notice:


# from pmap
00007fff49747000 132K rw--- [ stack ]
# from our program
stack address: 0x7fff497670bc

And there’s your stack!

One other important thing to notice: this is how memory “looks” to your program, not how memory is actually laid out on your physical hardware. Look at how much memory mem_munch has to work with. According to pmap, its mappings run from address 0x0000000000400000 all the way up to 0xffffffffff600000 (well, the part a program can actually use ends at 0x00007fffffffffff; beyond that is special). For those of you playing along at home, the full 64-bit range is about 16 million terabytes, and even the usable part is 128 terabytes. That’s a lot of memory. (If your computer has that kind of memory, please leave your address and times you won’t be at home.)

So, the amount of memory the program can address is kind of ridiculous. Why does the computer do this? Well, lots of reasons, but one important one is that this means you can address more memory than you actually have on the machine and let the operating system take care of making sure the right stuff is in memory when you try to access it.

Memory Mapped Files

Memory mapping a file basically tells the operating system to load the file so the program can access it as an array of bytes. Then you can treat a file like an in-memory array.

For example, let’s make a (pretty stupid) random number generator by creating a file full of random numbers, then mmap-ing it and reading off random numbers.

First, we’ll create a big file called random (note that this creates a 1GB file, so make sure you have the disk space and be patient, it’ll take a little while to write):

$ dd if=/dev/urandom bs=1024 count=1000000 of=/home/user/random
1000000+0 records in
1000000+0 records out
1024000000 bytes (1.0 GB) copied, 123.293 s, 8.3 MB/s
$ ls -lh random
-rw-r--r-- 1 user user 977M 2011-08-29 16:46 random

Now we’ll mmap random and use it to generate random numbers.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main() {
  char *random_bytes;
  FILE *f;
  int offset = 0;

  // open "random" for reading
  f = fopen("/home/user/random", "r");
  if (!f) {
    perror("couldn't open file");
    return -1;
  }

  // we want to inspect memory before mapping the file
  printf("run `pmap %d`, then press <enter>", (int)getpid());
  getchar();

  // map the first billion bytes of the file, read-only
  random_bytes = mmap(0, 1000000000, PROT_READ, MAP_SHARED, fileno(f), 0);

  if (random_bytes == MAP_FAILED) {
    perror("error mapping the file");
    return -1;
  }

  // stop before reading past the end of the mapping
  while (offset + sizeof(int) <= 1000000000) {
    printf("random number: %d (press <enter> for next number)", *(int*)(random_bytes+offset));
    getchar();

    offset += 4;
  }
}

If we run this program, we’ll get something like:

$ ./mem_munch 
run `pmap 12727`, then press <enter>

The program hasn’t done anything yet, so the output of running pmap will basically be the same as it was above (I’ll omit it for brevity). However, if we continue running mem_munch by pressing enter, our program will mmap random.

Now if we run pmap it will look something like:

$ pmap 12727
12727:   ./mem_munch
0000000000400000      4K r-x--  /home/user/mem_munch
0000000000600000      4K r----  /home/user/mem_munch
0000000000601000      4K rw---  /home/user/mem_munch
000000000147d000    132K rw---    [ anon ]
00007fe261c6f000 976564K r--s-  /home/user/random
00007fe29d61c000   1576K r-x--  /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d7a6000   2044K -----  /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d9a5000     16K r----  /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d9a9000      4K rw---  /lib/x86_64-linux-gnu/libc-2.13.so
00007fe29d9aa000     24K rw---    [ anon ]
00007fe29d9b0000    132K r-x--  /lib/x86_64-linux-gnu/ld-2.13.so
00007fe29dba6000     12K rw---    [ anon ]
00007fe29dbcc000     16K rw---    [ anon ]
00007fe29dbd0000      4K r----  /lib/x86_64-linux-gnu/ld-2.13.so
00007fe29dbd1000      8K rw---  /lib/x86_64-linux-gnu/ld-2.13.so
00007ffff29b2000    132K rw---    [ stack ]
00007ffff29de000      4K r-x--    [ anon ]
ffffffffff600000      4K r-x--    [ anon ]
 total           980684K

This is very similar to before, but with an extra line (the one for /home/user/random), which kicks up virtual memory usage a bit (from 4MB to 980MB).

However, let’s re-run pmap with the -x option. This shows the resident set size (RSS): only 4KB of random are resident. Resident memory is memory that’s actually in RAM. There’s very little of random in RAM because we’ve only accessed the very start of the file, so the OS has only pulled the first bit of the file from disk into memory.

$ pmap -x 12727
12727:   ./mem_munch
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000       0       4       0 r-x--  mem_munch
0000000000600000       0       4       4 r----  mem_munch
0000000000601000       0       4       4 rw---  mem_munch
000000000147d000       0       4       4 rw---    [ anon ]
00007fe261c6f000       0       4       0 r--s-  random
00007fe29d61c000       0     288       0 r-x--  libc-2.13.so
00007fe29d7a6000       0       0       0 -----  libc-2.13.so
00007fe29d9a5000       0      16      16 r----  libc-2.13.so
00007fe29d9a9000       0       4       4 rw---  libc-2.13.so
00007fe29d9aa000       0      16      16 rw---    [ anon ]
00007fe29d9b0000       0     108       0 r-x--  ld-2.13.so
00007fe29dba6000       0      12      12 rw---    [ anon ]
00007fe29dbcc000       0      16      16 rw---    [ anon ]
00007fe29dbd0000       0       4       4 r----  ld-2.13.so
00007fe29dbd1000       0       8       8 rw---  ld-2.13.so
00007ffff29b2000       0      12      12 rw---    [ stack ]
00007ffff29de000       0       4       0 r-x--    [ anon ]
ffffffffff600000       0       0       0 r-x--    [ anon ]
----------------  ------  ------  ------
total kB          980684     508     100

If the virtual memory size (the Kbytes column) is all 0s for you, don’t worry about it. That’s a bug in Debian/Ubuntu’s -x option. The total is correct, it just doesn’t display correctly in the breakdown.

You can see that the resident set size, the amount that’s actually in memory, is tiny compared to the virtual memory. Your program can access any memory within a billion bytes of 0x00007fe261c6f000, but if it accesses anything past 4KB, it’ll probably have to go to disk for it*.

What if we modify our program so it reads the whole file/array of bytes?

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main() {
  char *random_bytes;
  FILE *f;
  int offset = 0;

  // open "random" for reading
  f = fopen("/home/user/random", "r");
  if (!f) {
    perror("couldn't open file");
    return -1;
  }

  random_bytes = mmap(0, 1000000000, PROT_READ, MAP_SHARED, fileno(f), 0);

  if (random_bytes == MAP_FAILED) {
    perror("error mapping the file");
    return -1;
  }

  for (offset = 0; offset < 1000000000; offset += 4) {
    // volatile keeps the compiler from optimizing the read away
    volatile int i = *(int*)(random_bytes+offset);
    (void)i;

    // to show we're making progress
    if (offset % 1000000 == 0) {
      printf(".");
      fflush(stdout);
    }
  }

  // at the end, wait for signal so we can check mem
  printf("\ndone, run `pmap -x %d`\n", (int)getpid());
  pause();
}

Now the resident set size is almost the same as the virtual memory size:

$ pmap -x 5378
5378:   ./mem_munch
Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000       0       4       4 r-x--  mem_munch
0000000000600000       0       4       4 r----  mem_munch
0000000000601000       0       4       4 rw---  mem_munch
0000000002271000       0       4       4 rw---    [ anon ]
00007fc2aa333000       0  976564       0 r--s-  random
00007fc2e5ce0000       0     292       0 r-x--  libc-2.13.so
00007fc2e5e6a000       0       0       0 -----  libc-2.13.so
00007fc2e6069000       0      16      16 r----  libc-2.13.so
00007fc2e606d000       0       4       4 rw---  libc-2.13.so
00007fc2e606e000       0      16      16 rw---    [ anon ]
00007fc2e6074000       0     108       0 r-x--  ld-2.13.so
00007fc2e626a000       0      12      12 rw---    [ anon ]
00007fc2e6290000       0      16      16 rw---    [ anon ]
00007fc2e6294000       0       4       4 r----  ld-2.13.so
00007fc2e6295000       0       8       8 rw---  ld-2.13.so
00007fff037e6000       0      12      12 rw---    [ stack ]
00007fff039c9000       0       4       0 r-x--    [ anon ]
ffffffffff600000       0       0       0 r-x--    [ anon ]
----------------  ------  ------  ------
total kB          980684  977072     104

Now if we access any part of the file, it will be in RAM already. (Probably. Until something else kicks it out.) So, our program can access a gigabyte of memory, but the operating system can lazily load it into RAM as needed.

And that’s why your virtual memory is so damn high when you’re running MongoDB.

Left as an exercise to the reader: try running pmap on a mongod process before it’s done anything, once you’ve done a couple operations, and once it’s been running for a long time.

* This isn’t strictly true**. The kernel actually says, “If they want the first N bytes, they’re probably going to want some more of the file” so it’ll load, say, the first dozen KB of the file into memory but only tell the process about 4KB. When your program tries to access memory that is already in RAM, but that the process didn’t know was in RAM, it’s called a minor page fault (as opposed to a major page fault, when it actually has to hit disk to load new info).

** This note is also not strictly true. In fact, the whole file will probably be in memory before you map anything because you just wrote the thing with dd. So you’ll just be doing minor page faults as your program “discovers” it.

Installing Linux on a MacBook Air

It’s not a clean victory, but I got Linux onto my MacBook Air.

When I first got my Air, I launched the Ubuntu install disk and followed the instructions on the Ubuntu wiki. Unfortunately, these instructions are apparently for the MacBook Air 1,1, and I had a MacBook Air 2,1. The Linux kernel froze in the middle of initializing.

After a couple, ahem, weeks of playing around with kernel parameters, I got it to a point where I realized it was Ubuntu, not Linux, that was screwing up, so I decided to try some other distro. I got a Debian network install CD (the full install is 31 CDs!) and tried it. It booted into the installer fine, and started merrily installing the system. I suddenly realized I had a doctor’s appointment, and had a terrible premonition that, by the time I got back, something would have gone wrong.

My premonition was correct. When I returned, the CD had stopped working. I checked it for errors, and it was fine. However, every time I started the computer now, the CD drive would make an ominous clicking noise and pop open. If I held it closed, it would make a downright alarming snapping noise. And rEFIt couldn’t even recognize it.

So, I installed VMWare Fusion on the Mac partition, and installed Linux on that. I’m trying to look on the bright side: I get OS X power management, wireless, and sound with a Linux environment.