Innards of Tar

The La Brea Carpets

I’ve been working with tar files a lot lately and I haven’t been able to find a good example of what a tar file looks like, byte-by-byte. The specification is the best reference I’ve found for how tar files are structured, but it isn’t exactly friendly. Here’s an interactive breakdown of what tar files look like on the inside.

First, we’ll make a directory and some files:

$ mkdir tar_test
$ cd tar_test
~/tar_test$ mkdir subdir0 subdir1 subdir2
~/tar_test$ echo file0 > file0
~/tar_test$ echo file0 > subdir1/file0
~/tar_test$ echo file0 > subdir2/file0

Feel free to put whatever files you want in here; it’s a pretty easy-to-understand format. If you’re feeling frisky, add some symlinks.

Now tar them up:

~/tar_test$ tar cvvf tar_test.tar *
-rw-r----- k/k     6 2014-05-15 16:29 file0
drwxr-x--- k/k     0 2014-05-15 16:29 subdir0/
drwxr-x--- k/k     0 2014-05-15 16:30 subdir1/
-rw-r----- k/k     6 2014-05-15 16:30 subdir1/file0
drwxr-x--- k/k     0 2014-05-15 16:30 subdir2/
-rw-r----- k/k     6 2014-05-15 16:30 subdir2/file0

And check out your tar file to make sure everything looks alright:

~/tar_test$ tar tf tar_test.tar
file0
subdir0/
subdir1/
subdir1/file0
subdir2/
subdir2/file0

Tar files are organized into blocks of 512 bytes. Basically, the format of a tar file is:

Block #  Description
0        Header
1        Content
2        Header
3        Content

If the content is longer than one block, it’ll be rounded up (so if you have a 1300-byte file, the tar entry will look like Header-Content-Content-Content). If an entry has no content (e.g., a directory or symbolic link) it only takes up one block. So, our tar file looks like:

Block #  Description
0        Header for file0
1        Content of file0
2        Header for subdir0
3        Header for subdir1
4        Header for subdir1/file0
5        Content of subdir1/file0
6        Header for subdir2
7        Header for subdir2/file0
8        Content of subdir2/file0
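
If you’d rather let a script do the rounding, here’s a rough Python sketch of the layout rule described above. The function names (content_blocks, layout) are my own invention for illustration, and the sizes are just the ones from the tar listing; directories are assumed to carry no content.

```python
BLOCK = 512  # tar block size in bytes

def content_blocks(size):
    """Number of 512-byte content blocks needed for `size` bytes of data."""
    return (size + BLOCK - 1) // BLOCK

def layout(entries):
    """Return (first block number, block count, name) for each entry."""
    rows, block = [], 0
    for name, size in entries:
        count = 1 + content_blocks(size)  # one header block plus content
        rows.append((block, count, name))
        block += count
    return rows

# Sizes taken from the listing above; directories have size 0.
entries = [
    ("file0", 6), ("subdir0/", 0), ("subdir1/", 0),
    ("subdir1/file0", 6), ("subdir2/", 0), ("subdir2/file0", 6),
]
for first, count, name in layout(entries):
    print(f"{first:>2}  {count} block(s)  {name}")
```

The same helper gives three content blocks for the 1300-byte example, since 1300 bytes doesn’t fit in two.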

Nine 512-byte blocks add up to 4.5KB, but if we ls -lh the .tar, we get something bigger:

~/tar_test$ ls -lh tar_test.tar 
-rw-r----- 1 k k 10K May 16 15:19 tar_test.tar

There’s always an extra 1KB of 0s tacked onto the end of a .tar’s content as a footer, and there’s an implementation-dependent size that tars are padded out to (called the blocksize, which is different from the 512-byte blocks discussed above). On my Linux machine, tar creates the 10KB archive shown above; on my OS X machine, it’s only 5.5KB.
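
Here’s a small Python sketch of that arithmetic, using the nine blocks (0 through 8) from the table above. The archive_size helper is hypothetical. GNU tar’s default blocking factor is 20 blocks per record; the factor of 1 is only my assumption for reproducing the 5.5KB figure, not something I’ve checked in the OS X tar source.

```python
BLOCK = 512

def archive_size(data_blocks, blocking_factor):
    """Bytes on disk: data blocks plus two zero blocks, padded to a full record."""
    record = BLOCK * blocking_factor
    raw = (data_blocks + 2) * BLOCK      # the 1KB footer is two zero blocks
    return -(-raw // record) * record    # round up to a whole record

print(archive_size(9, 20))  # blocking factor 20: 10240 bytes
print(archive_size(9, 1))   # blocking factor 1: 5632 bytes
```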

Now we’re going to really look at the contents of the tar file, using hexdump. 512 bytes is 0x200 in hexadecimal, so every multiple of 0x200 in the offset column is the start of a new block in the archive.

~/tar_test$ hexdump -C tar_test.tar | more

You can see that the archive starts with the first entry’s filename:

00000000  66 69 6c 65 30 00 00 00  00 00 00 00 00 00 00 00  |file0...........|

Hexdump elides all-zero portions of the file, so the next interesting bit is the rest of the header:

00000060  00 00 00 00 30 30 30 30  36 34 30 00 30 36 30 31  |....0000640.0601|
00000070  34 35 34 00 30 30 31 31  36 31 30 00 30 30 30 30  |454.0011610.0000|
00000080  30 30 30 30 30 30 36 00  31 32 33 33 35 32 32 31  |0000006.12335221|
00000090  36 36 35 00 30 31 31 33  33 32 00 20 30 00 00 00  |665.011332. 0...|

Here’s what the numbers you’re seeing mean (you can look up these fields in the pax spec):

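0000640
Mode (note that these are ASCII digits, not binary values: the byte value of ‘0’ is 0x30. The numeric fields are octal, so this is mode 640, or rw-r-----)
0601454
UID
0011610
GID
00000000006
Size (octal, so 6 bytes)
12335221665
mtime (seconds since the epoch, in octal)
011332
chksum
0
typeflag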
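
To poke at these fields without squinting at hex, here’s a minimal Python sketch that slices them out of a header block. It builds a one-file archive in memory with the standard tarfile module, so the UID, GID, and checksum won’t match my dump. The FIELDS table and the field helper are names I made up; the offsets and lengths are the ustar ones.

```python
import io
import tarfile

# Build a one-file archive in memory so the example is self-contained.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
    info = tarfile.TarInfo("file0")
    info.size = 6
    info.mode = 0o640
    info.mtime = 0o12335221665  # the mtime from the dump above
    tar.addfile(info, io.BytesIO(b"file0\n"))

header = buf.getvalue()[:512]

# (offset, length) of each field in the header block
FIELDS = {
    "name": (0, 100), "mode": (100, 8), "uid": (108, 8), "gid": (116, 8),
    "size": (124, 12), "mtime": (136, 12), "chksum": (148, 8),
    "typeflag": (156, 1),
}

def field(block, name):
    """Return a header field as text, without its NUL/space padding."""
    offset, length = FIELDS[name]
    return block[offset:offset + length].rstrip(b"\0 ").decode("ascii")

for name in FIELDS:
    print(f"{name:8} {field(header, name)}")
print("size in bytes:", int(field(header, "size"), 8))
```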

Typeflag is the most interesting field here: it indicates the type of file (0 for normal files, 5 for directories). It can also be “x” to indicate an “extended header.” Extended headers are used to define your own fields or override fields in the header. For example, the header said that the mtime was 12335221665, but we could override that in an extended header with mtime=12345678901. If you have an extended header, the entry ends up taking an extra kilobyte of storage: one block for a header with typeflag “x” and one block for the key=value pairs. These are followed by a “normal” header, which is identical to the initial header except that it contains the actual file type instead of “x”. So you’d have:

Block #  Description
0        Header for file0 (typeflag=x)
1        Extended header of key=value pairs of attributes for file0
2        Header for file0 (typeflag=0)
3        Content of file0
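
You can reproduce that four-block layout with Python’s tarfile module, which writes an extended header for an entry when its pax_headers dict is non-empty. This is only a sketch: “comment” is a standard pax keyword, but the value is made up, and other tar implementations may name or order things differently.

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    info = tarfile.TarInfo("file0")
    info.size = 6
    # A made-up attribute, just to force an extended header for this entry.
    info.pax_headers = {"comment": "hello"}
    tar.addfile(info, io.BytesIO(b"file0\n"))

data = buf.getvalue()
blocks = [data[i:i + 512] for i in range(0, 4 * 512, 512)]

print(blocks[0][156:157])       # typeflag of the first header
print(blocks[1].rstrip(b"\0"))  # extended header records
print(blocks[2][156:157])       # typeflag of the "normal" header
print(blocks[3].rstrip(b"\0"))  # file content
```

Each record in the second block is written as a length, a space, key=value, and a newline, where the length counts the whole record, including its own digits.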

The next part of the header is for links, so it’s all 0 for these normal files and directories. Then you finish up the header with:

00000100  00 75 73 74 61 72 20 20  00 6b 00 00 00 00 00 00  |.ustar  .k......|
00000110  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000120  00 00 00 00 00 00 00 00  00 6b 00 00 00 00 00 00  |.........k......|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

“ustar” is a “magic” string that gives the tar format. The “k”s are my username and group name.
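
Here’s a short Python sketch that pulls those three fields out of a header. Within the header block, the magic (plus version) starts at offset 257, the user name at 265, and the group name at 297. The archive is built in memory, and I’m assuming GNU format because the two spaces after “ustar” in my dump are what GNU tar writes.

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
    info = tarfile.TarInfo("file0")
    info.uname = info.gname = "k"  # user and group name, as in the dump
    tar.addfile(info)              # size 0, so no content is needed

header = buf.getvalue()[:512]
magic = header[257:265]            # magic plus version, 8 bytes
uname = header[265:297].rstrip(b"\0").decode("ascii")
gname = header[297:329].rstrip(b"\0").decode("ascii")
print(magic, uname, gname)
```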

At 0x200 is the actual file content:

00000200  66 69 6c 65 30 0a 00 00  00 00 00 00 00 00 00 00  |file0...........|

Then at 0x400, the next block (subdir0’s header) starts:

00000400  73 75 62 64 69 72 30 2f  00 00 00 00 00 00 00 00  |subdir0/........|
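
Putting the pieces together, here’s a minimal Python sketch of walking an archive block by block: read a header, skip its content, and stop at the first all-zero block. The list_entries function is hypothetical and deliberately naive. It ignores extended headers, long names, and checksums, and it rebuilds the example tree in memory instead of reading tar_test.tar.

```python
import io
import tarfile

BLOCK = 512

def list_entries(data):
    """Yield (block number, name, typeflag, size) for each header found."""
    offset = 0
    while offset + BLOCK <= len(data):
        header = data[offset:offset + BLOCK]
        if header == b"\0" * BLOCK:  # start of the all-zero footer
            break
        name = header[0:100].rstrip(b"\0").decode("utf-8")
        size = int(header[124:135].decode("ascii"), 8)  # 11 octal digits
        typeflag = header[156:157].decode("ascii")
        yield offset // BLOCK, name, typeflag, size
        # Skip the header plus the content, rounded up to whole blocks.
        offset += BLOCK * (1 + (size + BLOCK - 1) // BLOCK)

names = ("file0", "subdir0", "subdir1", "subdir1/file0",
         "subdir2", "subdir2/file0")
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
    for name in names:
        info = tarfile.TarInfo(name)
        if name.endswith("file0"):
            info.size = 6
            tar.addfile(info, io.BytesIO(b"file0\n"))
        else:
            info.type = tarfile.DIRTYPE  # tarfile appends the trailing "/"
            tar.addfile(info)

for row in list_entries(buf.getvalue()):
    print(row)
```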

This is what tar looks like “under the covers.” It’s a lot more sparse than I thought it’d be, but I guess that’s where gzip comes in.

4 thoughts on “Innards of Tar”

    1. My only experience with BSD tar is through libarchive, so other libraries might be different. At least with libarchive’s version, notable differences are: the tar file has a blocksize of 512, so there’s only 1KB of padding at the end (the footer) and, in older versions, every single entry has an extended header because they all include some BSD-specific flags: SCHILY.dev, SCHILY.ino, and SCHILY.nlink. This ends up adding an extra 1KB per entry, as each entry has the extended header format shown above.
