Replica Set Internals Bootcamp Part IV: Syncing

I’ve been doing replica set “bootcamps” for new hires. The bootcamp mainly focuses on applying this knowledge to debugging replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

I’m going to do one subject per post; we’ll see how many I can get through. I’m punting on reconfig for now because it’s finicky to write about.

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Syncing

When a secondary is operating normally, it chooses a member to sync from (more on that below) and starts pulling operations from the source’s local.oplog.rs collection. When it gets an op (call it W, for “write”), it does three things:

  1. Applies the op
  2. Writes the op to its own oplog (also local.oplog.rs)
  3. Requests the next op

If the db crashes between 1 & 2 and then comes back up, it’ll think it hasn’t applied W yet, so it’ll re-apply it. Luckily (i.e., due to massive amounts of hard work), oplog ops are idempotent: you can apply W once, twice, or a thousand times and you’ll end up with the same document.

For example, if you have a doc that looks like {counter:1} and you do an update like {$inc:{counter:1}} on the primary, you’ll end up with {counter:2} and the oplog will store {$set:{counter:2}}. The secondaries will replicate that instead of the $inc.
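
You can watch this translation happen from the shell. A quick sketch, run against a test set’s primary (this assumes the write went to the test database, and the oplog entry shown is abridged):

> db.foo.insert({_id : 1, counter : 1})
> db.foo.update({_id : 1}, {$inc : {counter : 1}})
> // the newest oplog entry stores the result of the $inc, not the $inc itself
> db.getSiblingDB("local").oplog.rs.find().sort({$natural : -1}).limit(1)
{ "ts" : ..., "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 1 }, "o" : { "$set" : { "counter" : 2 } } }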

w

To ensure a write is present on, say, two members, you can do:

> db.runCommand({getLastError:1, w:2})

Syntax varies by language (consult your API docs), but it’s always an option for writes. The way this works is kind of cool.

Suppose you have a member called primary and another member syncing from it, called secondary. How does primary know where secondary is synced to? Well, secondary is querying primary's oplog for more results. So, if secondary requests an op written at 3pm, primary knows secondary has replicated all ops written before 3pm.

So, it goes like:

  1. Do a write on primary.
  2. The write is written to the oplog on primary, with a field “ts” saying the write occurred at time t.
  3. {getLastError:1,w:2} is called on primary. primary has done the write, so it is just waiting for one more server to get the write (w:2).
  4. secondary queries the oplog on primary and gets the op.
  5. secondary applies the op from time t.
  6. secondary requests ops with {ts:{$gt:t}} from primary's oplog.
  7. primary updates that secondary has applied up to t because it is requesting ops > t.
  8. getLastError notices that primary and secondary both have the write, so w:2 is satisfied and it returns.
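
In the shell, you can also pass wtimeout (in milliseconds) so the call gives up instead of blocking forever when no secondary catches up. A sketch:

> db.foo.insert({x : 1})
> // wait up to five seconds for one more member to get the write
> db.runCommand({getLastError : 1, w : 2, wtimeout : 5000})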

Starting up

When you start up a node, it takes a look at its local.oplog.rs collection and finds the latest entry in there. This is called the lastOpTimeWritten and it’s the latest op this secondary has applied.

You can always use this shell helper to get the current last op:

> rs.debug.getLastOpWritten()

The “ts” field is the last optime written.
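
If you don’t have that helper handy, you can get the same information by querying the oplog directly: the newest entry sorts last in natural (insertion) order, so reverse the sort and take one.

> db.getSiblingDB("local").oplog.rs.find().sort({$natural : -1}).limit(1)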

If a member starts up and there are no entries in the oplog, it will begin the initial sync process, which is beyond the scope of this post.

Once it has the last op time, it will choose a target to sync from.

Who to sync from

As of 2.0, servers automatically sync from whoever is “nearest” based on average ping time. So, if you bring up a new member it starts sending out heartbeats to all the other members and averaging how long it takes to get a response. Once it has a decent picture of the world, it’ll decide who to sync from using the following algorithm:

for each member that is healthy:
    if member[state] == PRIMARY
        add to set of possible sync targets

    if member[lastOpTimeWritten] > our[lastOpTimeWritten]
        add to set of possible sync targets

sync target = member with the min ping time from the possible sync targets

The definition of “member that is healthy” has changed somewhat over the versions, but generally you can think of it as a “normal” member: a primary or secondary. In 2.0, “healthy” debatably includes slave delayed nodes.

You can see who a server is syncing from by running db.adminCommand({replSetGetStatus:1}) and looking at the “syncingTo” field (only present on secondaries). Yes, yes, it probably should have been syncingFrom. Backwards compatibility sucks.
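
For example, on a secondary (the hostname here is made up):

> rs.status().syncingTo
otherHost:27017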

Chaining Slaves

The algorithm for choosing a sync target means that slave chaining is semi-automatic: start up a server in a data center and it’ll (probably) sync from a server in the same data center, minimizing WAN traffic. (Note that you can’t end up with a sync loop, i.e., A syncing from B and B syncing from A, because a secondary can only sync from another secondary with a strictly higher optime.)

One cool thing to implement was making w work with slave chaining. If A is syncing from B and B is syncing from C, how does C know where A is synced to? The way this works is that it builds on the existing oplog-reading protocol.

When A starts syncing from B (or any server starts syncing from another server), it sends a special “handshake” message that basically says, “Hi, I’m A and I’ll be syncing from your oplog. Please track this connection for w purposes.”

When B gets this message, it says, “Hmm, I’m not primary, so let me forward that along to the member I’m syncing from.” So it opens a new connection to C and says, “Pretend I’m A, I’ll be syncing from your oplog on A's behalf.” Note that B now has two connections open to C: one for itself and one for A.

Whenever A requests more ops from B, B sends the ops from its oplog and then forwards a dummy request to C along “A's” connection to C. A doesn’t even need to be able to connect directly to C.

A <====> B <====> C
         B <----> C  (on A's behalf)

<====> is a “real” sync connection. The connection between B and C on A's behalf is called a “ghost” connection (<---->).

On the plus side, this minimizes network traffic. On the negative side, the absolute time it takes a write to get to all members is higher.

Coming soon to a replica set near you…

In 2.2, there will be a new command, replSetSyncFrom, that lets you change who a member is syncing from, bypassing the “choosing a sync target” logic.

> db.adminCommand({replSetSyncFrom:"otherHost:27017"})

Night of the Living Dead Ops

MongoDB users often ask about the “killed” field in db.currentOp() output. For example, if you’ve run db.killOp(), you might see something like:

> db.currentOp()
{
	"inprog" : [
		{
			"opid" : 3062962,
			"active" : true,
			"lockType" : "write",
			"waitingForLock" : false,
			"secs_running" : 32267,
			"op" : "update",
			"ns" : "httpdb.servers",
			"query" : {
				"_id" : "150.237.88.189"
			},
			"client" : "127.0.0.1:50416",
			"desc" : "conn",
			"threadId" : "0x2900c400",
			"connectionId" : 74,
			"killed" : true,
			"numYields" : 0
		},
		{
			"opid" : 3063051,
			"active" : false,
			"lockType" : "read",
			"waitingForLock" : true,
			"op" : "query",
			"ns" : "",
			"query" : {
				"count" : "servers",
				"query" : {
					"code" : {
						"$gte" : 200
					}
				}
			},
			"client" : "127.0.0.1:30736",
			"desc" : "conn",
			"threadId" : "0x29113700",
			"connectionId" : 191,
			"killed" : true,
			"numYields" : 0
		}
        ]
}

The operation looks dead… it has killed:true, right? But you can run db.currentOp() again and again and the op doesn’t go away, even though it’s been “killed.” So what’s up with that?

Chainsaws: the kill -9 of the living dead

It has to do with how MongoDB handles killed operations. When you run db.killOp(3062962), MongoDB looks up operation 3062962 in a hashtable and sets its killed field to true. However, the code running that op gets to decide whether to even check that field and how to deal with it appropriately.

There are basically three ways MongoDB ops handle getting killed:

  • Ones that die when they yield whatever lock they’re holding. This means that if the op never yields (note that numYields is 0 in the example above), it will never be killed.
  • Ones that can be killed at certain checkpoints. For example, index builds happen in multiple stages and check killed between stages. (Many commands do this, too.)
  • Ones that cannot be killed at all. For example, rsSync, the name for the op applying replication, falls into this category. There are some sharding commands that can’t be killed, too.

There is no kill -9 equivalent in MongoDB (other than kill -9-ing the server itself): if an op doesn’t know how to safely kill itself, it won’t die until it’s good and ready. Which means that you can have a “killed” op in db.currentOp() output for a long time. killed might be better named killRequested.

Also, if you kill an operation before it acquires a lock, it’ll generally start executing anyway (e.g., op 3063051 above). For example, try opening up a shell and making the db hold the writelock for 10 minutes:

> db.eval("sleep(10*60*1000)")

While that’s running, in another shell, try doing an insert (which will block, as the db cannot do any writes while the db.eval() is holding the writelock).

> db.foo.insert({x:1})

Now, in a third shell, kill the insert we just did (before the 10 minutes elapse):

> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 25,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,
                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
                                
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "numYields" : 0
                }
        ]
}
> // get the opId of the insert from currentOp
> db.killOp(455949)
{ "info" : "attempting to kill op" }
> // run currentOp again to see that killed:true
> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 221,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,
                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
                                
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "killed" : true,
                        "numYields" : 0
                }
        ]
}

If you wait 10 minutes for the db.eval() to finish, then do a find on db.foo, you’ll see that {x:1} was actually inserted anyway. This is because the op’s lifecycle looks something like:

  • Wait for lock
  • Acquire lock!
  • Start running
  • Yield lock
  • Check for killed

So it’ll run a bit before dying (if it can be killed at all), which may produce unintuitive results.

--thursday #4: blockdev

Disk IO is slow. You just won’t believe how vastly, hugely, mind-bogglingly slow it is. I mean, you may think your network is slow, but that’s just peanuts to disk IO.

(The original post included an image here visualizing just how slow disk IO is, found on Hacker News and inspired by Gustavo Duarte’s blog.)

The kernel knows how slow the disk is and tries to be smart about accessing it. It not only reads the data you requested, it also returns a bit more. This way, if you’re reading through a file or watching a movie (sequential access), your system doesn’t have to go to disk as frequently because you’re pulling more data back than you strictly requested each time.

You can see how far the kernel reads ahead using the blockdev tool:

$ sudo blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     80026361856   /dev/sda
rw   256   512  4096       2048     80025223168   /dev/sda1
rw   256   512  4096          0   2000398934016   /dev/sdb
rw   256   512  1024       2048        98566144   /dev/sdb1
rw   256   512  4096     194560      7999586304   /dev/sdb2
rw   256   512  4096   15818752     19999490048   /dev/sdb3
rw   256   512  4096   54880256   1972300152832   /dev/sdb4

Readahead is listed in the “RA” column. As you can see, I have two disks (sda and sdb) with readahead set to 256 on each. But what unit is that 256? Bytes? Kilobytes? Dolphins? If we look at the man page for blockdev, it says:

$ man blockdev
...
       --setra N
              Set readahead to N 512-byte sectors.
...

This means that my readahead is 256 sectors × 512 bytes = 131072 bytes, or 128KB. That means that, whenever I read from disk, the disk is actually reading at least 128KB of data, even if I only requested a few bytes.
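
You can also query and set readahead per-device (substitute your own device, and see the warning below before picking a number):

$ sudo blockdev --getra /dev/sda
256
$ # set readahead to 32 sectors (16KB)
$ sudo blockdev --setra 32 /dev/sda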

So what value should you set your readahead to? Please don’t set it to a number you find online without understanding the consequences. If you Google for “blockdev setra”, the first result uses blockdev --setra 65536, which translates to 32MB of readahead. That means that, whenever you read from disk, the disk is actually doing 32MB worth of work. Please do not set your readahead this high if you’re doing a lot of random-access reads and writes, as all of the extra IO can slow things down a lot (and if you’re low on memory, you’ll be forcing the kernel to fill up your RAM with data you won’t need).

Getting a good readahead value can help disk IO issues to some extent, but if you are using MongoDB (in particular), please consider your typical document size and access patterns before changing your blockdev settings. I’m not recommending any particular value because what’s perfect for one application/machine can be death for another.

I’m really enjoying these --thursday posts because every week people have commented with different/better/interesting ways of doing what I talked about (or ways of telling the difference between stalagmites and stalactites), which is really cool. So I’m throwing this out there: how would you figure out what a good readahead setting is? Next week I’m planning to do iostat for --thursday, which should cover this a bit, but please leave a comment if you have any ideas.

--thursday #3: a handy git prompt

Everyone says to use git branches early and often, but I inevitably lose track of what branch I’m on. My workflow generally goes something like:

  1. Check out a branch
  2. Lunch!
  3. Get back to my desk and make an emergency bug fix
  4. Commit emergency fix
  5. Suddenly realize I’m not on the branch I meant to be on, optimally (but not always) before I try to push

To keep track, I’ve modified my command prompt to display what I’m working on using __git_ps1, which prints a nice “you are here” string for your prompt:

$ __git_ps1
 (master)
$ git checkout myTag123
$ __git_ps1
 ((myTag123))

(Newlines added for clarity; it actually displays as ” (master)$ git checkout…”, which is generally what you want so your prompt doesn’t contain a newline.) I’m not sure what’s up with all the ‘()’s, but it does let you distinguish between branches (‘(branchname)’) and tags (‘((tagname))’).

__git_ps1 will even show you if you’re in the middle of a bisect or a merge, which is another common workflow for me:

  1. Merge something and there are a ton of conflicts
  2. Boldly decide to deal with it tomorrow and go home for the day
  3. The next day, write an emergency patch for something
  4. Check git status
  5. Suddenly I remember that I’m in the middle of a merge
  6. Do unpleasant surgery to get my git repo back to a sane state

If you tend to fall into a similar workflow, I highly recommend modifying your prompt to something like:

$ PS1='\w$(__git_ps1)> '
~/gitroot/mongo (v2.0)> 

And suddenly life is better, which is what Linux is all about.

You can run PS1=... on a per-shell basis (or export PS1 so that subshells inherit it), or add that line to your ~/.bashrc file to get that prompt in all future shells.

Finally, __git_ps1 isn’t a binary in your path, it’s a function (so it won’t show up if you run which __git_ps1). You can see its source by running type __git_ps1.
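
If __git_ps1 isn’t defined in your shell at all, you probably just need to source the file git ships it in. A sketch for your ~/.bashrc; the path below is a guess and varies by distro, so locate git-prompt.sh or git-completion.bash on your system first:

# load git's prompt helper if we don't already have it
# (the path is an assumption -- check where your distro puts it)
if ! type __git_ps1 >/dev/null 2>&1 && [ -f /etc/bash_completion.d/git ]; then
    . /etc/bash_completion.d/git
fi
PS1='\w$(__git_ps1)> '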

Note: unfortunately this doesn’t work for me on OS X, so I think it might be Linux-only.

Debugging Sexism

The word of the day is sexism: Shanley Kane tweeted the CTO of Geekli.st, asking them to take down a video of a woman in a Geekli.st top and underwear dancing around. The subsequent tweets were captured in a long Storify and Geekli.st subsequently issued a public apology.

The interesting thing about this, to me, is how often well-meaning geeks react badly when someone says that they did something sexist, racist, homophobic, etc.

Let’s say someone says to you, in a public forum, “This thing you did is sexist.” What should you do?

First, if your immediate reaction is to say, “No it’s not!” don’t give in to your immediate reaction. That seems to be what most people start off with and it doesn’t help anything.

The reason you’re probably so eager to say it’s not sexist is because you’re not sexist. (Let’s assume, if you’re reading this blog, that you are correct: you are not sexist.)

So, if you aren’t sexist, why can’t you flat-out say “You are wrong”? Well, for starters, just because you’re not sexist doesn’t mean you never do anything sexist. Besides, they’re saying “This thing you did is sexist” but what they mean is “I feel this thing you did is sexist.” And even if you’re Simone de Beauvoir and Margaret Atwood and Buffy the Vampire Slayer rolled into one, you cannot tell them their feelings are wrong (well, you can, but you’ll look like an ass). What should you do, then?

You should try to make them feel better and try to avoid hurting them in the future.

  • Step 1: try to make them feel better. The best way to do this is through an apology. A real one, not an “I’m sorry this offended you”-type apology. A better choice: “I’m sorry that I did that.” Now follow up with, “What can I do to make this better?”
  • Step 2: try to avoid hurting them in the future. Obviously, everyone makes mistakes. You can’t never make one again, even if you and your company were publicly humiliated the first time around. The important thing is to try to learn from it. Google around and read about the issues women and minorities are concerned about in geek culture. Even if you don’t agree with everything everyone writes (I certainly don’t), you will at least know what the issues are.

If someone tells you they’re offended, you should take that seriously. If they’re speaking up, there are probably many others who are silently offended.

--thursday #2: diff ‘n patch

I’m trying something new: every Thursday I’ll do a short post on how to do something with the command line.

I always seem to either create or apply patches in the wrong direction. It’s like stalagmites vs. stalactites, which I struggled with until I heard the mnemonic: “Stalagmites might hang from the ceiling… but they don’t.”

Moving right along, you can use diff to get line-by-line changes between any two files. Generally I use git diff because I’m dealing with a git repo, so that’s what I’ll use here.

Let’s get a diff of MongoDB between version 2.0.2 and 2.0.3.

$ git clone git://github.com/mongodb/mongo.git
$ cd mongo
$ git diff r2.0.2..r2.0.3 > mongo.patch

This takes all of the changes between 2.0.2 and 2.0.3 (r2.0.2..r2.0.3) and dumps them into a file called mongo.patch (that’s the > mongo.patch part).

Now, let’s get the code from 2.0.2 and apply mongo.patch, effectively making it 2.0.3 (this is kind of a silly example but if you’re still with me after the stalagmite thing, I assume you don’t mind silly examples):

$ git checkout r2.0.2
Note: checking out 'r2.0.2'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at 514b122... BUMP 2.0.2
$ 
$ patch -p1 < mongo.patch

What intuitive syntax!

What does the -p1 mean? How many forward slashes to remove from the path in the patch, of course.

To take an example, if you look at the last 11 lines of the patch, you can see that it is the diff for the file that changes the version number. It looks like this:

$ tail -n 11 mongo.patch
--- a/util/version.cpp
+++ b/util/version.cpp
@@ -38,7 +38,7 @@ namespace mongo {
      *      1.2.3-rc4-pre-
      * If you really need to do something else you'll need to fix _versionArray()
      */
-    const char versionString[] = "2.0.2";
+    const char versionString[] = "2.0.3";
 
     // See unit test for example outputs
     static BSONArray _versionArray(const char* version){

Note the a/util/version.cpp and b/util/version.cpp. These indicate the file the patch should be applied to, but there are no a or b directories in the MongoDB repository. The a and b prefixes indicate that one is the previous version and one is the new version. And -p says how many slashes to strip from this path. An example may make this clearer:

  • -p0: “apply this patch to a/util/version.cpp” (which doesn’t exist)
  • -p1: “apply this patch to util/version.cpp” ← bingo, that’s what we want
  • -p2: “apply this patch to version.cpp” (which doesn’t exist)

So, we use -p1, because that makes the patch’s paths match the actual directory structure. If someone sent you a patch where the paths look like /home/bob/bobsStuff/foo.txt and your name is not Bob (you’re just trying to patch foo.txt), you’d probably want to use -p4.
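
If you’re not sure which -p level you need, GNU patch’s --dry-run flag lets you test one without touching any files:

$ patch -p1 --dry-run < mongo.patch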

On the plus side, if you’re using patches generated by git, they’re super-easy to apply. Git chose the intuitive verb “apply” to patch a file. If you have a patch generated by git diff, you can patch your current tree with:

$ git apply mongo.patch

So, aside from the stupid choice of verbiage, that is generally easier.

Did I miss anything? Get anything wrong? Got a suggestion for next week? Leave a comment below and let me know!

--thursday #1: screen

I’m trying something new: every Thursday I’ll go over how to do something with the command line. Let me know what you think.

If you are using a modern-ish browser, you probably use tabs to keep multiple things open at once: your email, your calendar, whatever you’re actually doing, etc. You can do the same thing with the shell using screen: in a single terminal, you can compile a program while you’re editing a file and watching another process out of the corner of your eye.

Note that screen is super handy when SSH’d into a box. SSH in once, then start screen and open up all of the windows you need.

Using screen

To start up screen, run:

$ screen

Now your shell will clear and screen will give you a welcome message.


Screen version 4.00.03jw4 (FAU) 2-May-06

Copyright (c) 1993-2002 Juergen Weigert, Michael Schroeder
Copyright (c) 1987 Oliver Laumann

...




                          [Press Space or Return to end.]

As it says at the bottom, just hit Return to clear the welcome message. Now you’ll see an empty prompt and you can start working normally.

Let’s say we have three things we want to do:

  1. Run top
  2. Edit a file
  3. Tail a log

Go ahead and start up top:

$ top

Well, now we need to edit a file but top's using the shell. What to do now? Just create a new window. While top is still running, hit ^A c (I’m using ^A as shorthand for Control-a, so this means “hit Control-a, then hit c”) to create a new window. The new window gets put right on top of the old one, so you’ll see a fresh shell and be at the prompt again. But where did top go? Not to worry, it’s still there. We can switch back to it with ^A n or ^A p (next or previous window).

Now we can start up our editor and begin editing a file. But now we want to tail a file, so we create another new window with ^A c and run our tail -f filename. We can continue to use ^A n and ^A p to switch between the three things we’re doing (and open more windows as necessary).
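
One more binding worth knowing: ^A d detaches from the session, leaving everything in it running. This is what makes screen so pleasant over SSH: you can log out, then come back later and pick up where you left off.

$ screen -ls    # list your sessions (output varies)
$ screen -r     # reattach to the detached session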

Availability

screen seems pretty ubiquitous: it has been on every Linux machine I’ve ever tried running it on, and even OS X (although it may be part of Xcode, I haven’t checked).

Note for Emacs Users

^A is an annoying escape key, as it is also the go-to-beginning-of-line shortcut in Emacs (and the shell). To fix this, create a .screenrc file and add one line to change this to something else:

# use ^T
escape ^Tt
# or ^Y
escape ^Yy

The escape sequence is 3 characters: caret, T, and t. (It is not using the single special character “^T”.) The traditional escape key is actually Ctrl-^, as the caret is the one character Emacs doesn’t use for anything. In a .screenrc file, this results in the rather bizarre string:

escape ^^^^

…which makes sense when you think about it, but looks a bit weird.

Odds and Ends

As long as you’re poking at the .screenrc file, you might want to turn off the welcome message, too:

startup_message off

Run ^A ? anytime for help, or check out the manual’s list of default bindings.

Did I miss anything? Get anything wrong? Got a suggestion for next week? Leave a comment below and let me know!

Replica Set Internals Bootcamp: Part II – Creating a Set

I’ve been doing replica set “bootcamps” for new hires. The bootcamp mainly focuses on applying this knowledge to debugging replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

I’m going to do one subject per post; we’ll see how many I can get through.

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Initializing a Set

Suppose you are creating a new set. You start up some mongods with no data and the replSet option. When a server starts up with the replSet option, the first thing it does is check its local.system.replset collection for a replica set config it can load, but it won’t find one (because there’s no data). Since it cannot find a config, it goes into the EMPTYCONFIG state. If you look in the logs, you’ll see messages about EMPTYCONFIG.

Tue Mar  6 12:24:35 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:24:35 [rsStart] replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Tue Mar  6 12:24:45 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:24:55 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:25:05 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:25:15 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:25:25 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)

Now suppose you send the replSetInitiate command to one of the members. replSetInitiate can either take a custom configuration or generate a config automatically. If you do not pass in a config, the server will try to figure out what its hostname is and generate a config using that.

Note: on EC2, the server always chooses the wrong hostname, so you have to pass in the config you want.
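
In that case, pass rs.initiate() the config you want. A sketch, with a stand-in hostname:

> rs.initiate({
    _id : "foo",
    members : [
        {_id : 0, host : "myhostname.example.com:30000"}
    ]
})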

Once the server has a config document, either passed in or generated, it’ll make sure it can reach every node specified (all members must be up and reachable to initiate), then store the config in the local.system.replset collection. Finally, it begins sending heartbeat requests to the other members of the set.

The Log: Step-by-Step

Being able to interpret log files is critical to knowing what’s going on, so let’s walk through a sample log from running rs.initiate(). Here’s the whole log; feel free to skip over it for now. We’ll take it piece-by-piece below.

Tue Mar  6 12:45:00 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:45:02 [initandlisten] connection accepted from 127.0.0.1:38839 #2 (2 connections now open)
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate admin command received from client
Tue Mar  6 12:45:08 [conn2] replSet info initiate : no configuration specified.  Using a default configuration for the set
Tue Mar  6 12:45:08 [conn2] replSet created this configuration for initiation : { _id: "foo", members: [ { _id: 0, host: "ubuntu:30000" } ] }
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate config object parses ok, 1 members specified
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate all members seem up
Tue Mar  6 12:45:08 [conn2] ******
Tue Mar  6 12:45:08 [conn2] creating replication oplog of size: 49085MB...
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.ns, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] creating directory /datadir/_tmp
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.ns, size: 16MB,  took 0.016 secs
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.0, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.0, size: 2047MB,  took 0.016 secs
Tue Mar  6 12:45:08 [conn2] datafileheader::init initializing /datadir/local.0 n:0
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.1, filling with zeroes...
...
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.22 n:22
Tue Mar  6 12:45:09 [FileAllocator] allocating new datafile /datadir/local.23, filling with zeroes...
Tue Mar  6 12:45:09 [FileAllocator] done allocating datafile /datadir/local.23, size: 2047MB,  took 0.042 secs
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.23 n:23
Tue Mar  6 12:45:10 [conn2] ******
Tue Mar  6 12:45:10 [conn2] replSet info saving a newer config version to local.system.replset
Tue Mar  6 12:45:10 [conn2] replSet saveConfigLocally done
Tue Mar  6 12:45:10 [conn2] replSet replSetInitiate config now saved locally.  Should come online in about a minute.
Tue Mar  6 12:45:10 [conn2] command admin.$cmd command: { replSetInitiate: undefined } ntoreturn:1 keyUpdates:0 reslen:196 2802ms
Tue Mar  6 12:45:10 [rsStart] replSet load config ok from self
Tue Mar  6 12:45:10 [rsStart] replSet I am ubuntu:30000
Tue Mar  6 12:45:10 [rsStart] replSet STARTUP2
Tue Mar  6 12:45:10 [rsSync] replSet SECONDARY
Tue Mar  6 12:45:25 [rsSync] waiting for 2 pings from other members before syncing
Tue Mar  6 12:45:26 [rsMgr] replSet info electSelf 0
Tue Mar  6 12:45:26 [rsMgr] replSet PRIMARY

Now, let’s go through it line-by-line. First, we start out in the EMPTYCONFIG state, as described above.

Tue Mar  6 12:45:00 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)

Now the shell connects to the server and runs rs.initiate() (replSetInitiate), which apparently takes me 6 seconds to type (take that, Mavis Beacon).

Tue Mar  6 12:45:02 [initandlisten] connection accepted from 127.0.0.1:38839 #2 (2 connections now open)
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate admin command received from client

Because we didn’t give a config, the server tries to generate one:

Tue Mar  6 12:45:08 [conn2] replSet info initiate : no configuration specified.  Using a default configuration for the set
Tue Mar  6 12:45:08 [conn2] replSet created this configuration for initiation : { _id: "foo", members: [ { _id: 0, host: "ubuntu:30000" } ] }

Now it does some sanity checks on this config (e.g., making sure all members are up and reachable). This matters more if we gave it a config, but sometimes things are so screwed up you can’t reach yourself.

Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate config object parses ok, 1 members specified
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate all members seem up

Now it calculates the size of the oplog and allocates it between two lines of ******.

Tue Mar  6 12:45:08 [conn2] ******
Tue Mar  6 12:45:08 [conn2] creating replication oplog of size: 49085MB...
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.ns, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] creating directory /datadir/_tmp
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.ns, size: 16MB,  took 0.016 secs
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.0, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.0, size: 2047MB,  took 0.016 secs
Tue Mar  6 12:45:08 [conn2] datafileheader::init initializing /datadir/local.0 n:0
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.1, filling with zeroes...
...
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.22 n:22
Tue Mar  6 12:45:09 [FileAllocator] allocating new datafile /datadir/local.23, filling with zeroes...
Tue Mar  6 12:45:09 [FileAllocator] done allocating datafile /datadir/local.23, size: 2047MB,  took 0.042 secs
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.23 n:23
Tue Mar  6 12:45:10 [conn2] ******

It doesn’t actually print the “…” part, I just omitted 21 files being preallocated.

Now it stores the configuration in the local.system.replset collection.

Tue Mar  6 12:45:10 [conn2] replSet info saving a newer config version to local.system.replset
Tue Mar  6 12:45:10 [conn2] replSet saveConfigLocally done
Tue Mar  6 12:45:10 [conn2] replSet replSetInitiate config now saved locally.  Should come online in about a minute.

It seldom takes a minute for it to come online, but this way you’ll be pleasantly surprised.

replSetInitiate is now done running. It will often be logged, because slow operations are logged and allocating the oplog usually takes a while.

Tue Mar  6 12:45:10 [conn2] command admin.$cmd command: { replSetInitiate: undefined } ntoreturn:1 keyUpdates:0 reslen:196 2802ms

Now it actually loads this configuration as the replica set config we want to use:

Tue Mar  6 12:45:10 [rsStart] replSet load config ok from self

You won’t see this next line if you’re running 2.0, I added it for 2.2 because I was sick of having to decipher which server a log was from:

Tue Mar  6 12:45:10 [rsStart] replSet I am ubuntu:30000

Yay, now config has been loaded, making our state STARTUP2 (the “config has been loaded, starting up RS threads” state):

Tue Mar  6 12:45:10 [rsStart] replSet STARTUP2

At this point, it can go straight into PRIMARY state, but it doesn’t. We could fix this, but it hasn’t been a priority, so it goes into SECONDARY state briefly before becoming PRIMARY:

Tue Mar  6 12:45:10 [rsSync] replSet SECONDARY

A “real” secondary (we’re about to become primary, so we don’t count) will gather some stats about who’s best to sync from before it chooses a target:

Tue Mar  6 12:45:25 [rsSync] waiting for 2 pings from other members before syncing

Good luck with that, buddy, you’re the only one in the config.

Aha, we’ve noticed that we can become primary:

Tue Mar  6 12:45:26 [rsMgr] replSet info electSelf 0
Tue Mar  6 12:45:26 [rsMgr] replSet PRIMARY

And now we’re all set.

Propagating the Config

If we add members or provide an initial config with multiple members, the initial member’s heartbeat requests are sent to a bunch of servers in EMPTYCONFIG state. When these other servers receive the heartbeat request, they notice “Hey, I don’t have any config and this guy seems to think I’m part of his set, so I should check in with him.”

These EMPTYCONFIG members will then request a config from the member that requested the heartbeat. Seeing themselves in the config, they’ll save it to their own local.system.replset collection, then start sending out their own heartbeats. At that point, they’ll move into RECOVERING state and, once they’re synced to the original member, they’ll turn into secondaries (unless they’re arbiters, of course, but same general idea).

One quirk of initiating a set is that no member may have any data except the one you’re sending replSetInitiate to. After the set is initialized you can add members containing data, but at most one member can have data when you initialize the set.

Confused?

Feel free to ask questions in the comments below. This is a loving, caring bootcamp (as bootcamps go).

git-oh-$#!†

I’ve learned a lot about git, usually in a hurry after I mess up and have to fix it. Here are some basic techniques I’ve learned that may help a git beginner.

Fixing your ungood code

Let’s say you’re a clerk working for an Orwellian Ministry of Truth and you find out the new chocolate ration is 20 grams. You look at tonight’s newscast transcript, newscast.txt, which says:

...and in other news, the Ministry of Plenty has announced that chocolate 
rations will be held steady at 40 grams for the forthcoming quarter...

Well, that won’t do. We modify newscast.txt to say:

...and in other news, the Ministry of Plenty has announced that chocolate 
rations will be cut to 20 grams for the forthcoming quarter...

And we commit:

$ git add newscast.txt
$ git commit -m "Fixed chocolate ration"
[master 9772a49] Fixed chocolate ration
 1 files changed, 1 insertions(+), 1 deletions(-)

As you’re about to push, your supervisor, who has been hovering, shrieks: “Do you want to be an unperson? Don’t say they’ve lowered a ration!”

So, we’ll modify the file again:

...and in other news, the Ministry of Plenty has announced that chocolate 
rations will be raised to 20 grams for the forthcoming quarter...

Now we’ll add this change, as though we were going to make a new commit:

$ git add newscast.txt
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ..." to unstage)
#
#	modified:   newscast.txt
#

Running git commit --amend sweeps everything in our staging area into the last commit:

$ # this opens your text editor, in case you want to change the message
$ git commit --amend
[master 04ce65d] Fixed chocolate ration
 1 files changed, 1 insertions(+), 1 deletions(-)

Now you just have one commit with the “correct” changes.

We have always been at war with Eastasia

Sometimes you have a mistake so severe that you don’t even want it to exist in the repository’s history anymore. If you haven’t pushed yet, you can use git reset to unwrite any history that happened after a given commit.

Let’s say we messed something up and we want to get back to 72ecbbda47c0712846312849bab1fb458cd9fffb:

$ git reset --hard 72ecbbda47c0712846312849bab1fb458cd9fffb

And you’re back to 72ecbbda47c0712846312849bab1fb458cd9fffb, like whatever happened never actually happened.

git reset isn’t just good for revisionist history, it’s also nice when you have a merge conflict, some deleted files, a strange thing with branches, and you don’t even know what’s going on: you just want to go home. A hard reset can get you out of almost any problem.
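
For instance, to throw away every uncommitted change, staged or not, and get back to your last commit:

$ git reset --hard HEAD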

However, once you’ve pushed, resetting isn’t an option. If you reset a public repo on a “real” project, you’ll be unpersoned so fast your head will spin.

When you’ve already pushed a crimethink

Let’s say you made a mistake (as above), but have already pushed to a remote repository. Now you have to either fix the commit or remove it.

In the example above, it’s easy enough to fix: change “cut the chocolate ration” to “raised the chocolate ration” and you’re done, so you might as well just push a new commit with the fix.

However, sometimes a fix will take longer (e.g., you need to research a plausible story to explain away inconsistencies). You don’t want anyone getting confused in the meantime, so if you cannot immediately correct history, you should politely back it out. This is where git revert comes in.

First, look up the hash of the commit you want to undo.

$ git log -1 # -1 means only show the latest 1 commit
commit 72ecbbda47c0712846312849bab1fb458cd9fffb
Author: Kristina 
Date:   Thu Feb 23 09:07:58 2012 -0500

    Airplane was now invented by the Party

Now, revert that hash:

$ git revert 72ecbbda47c0712846312849bab1fb458cd9fffb

git revert will pop up your editor and you can mess with the revert message, if you want.

Changing airplane's invention date is going to take more work than anticipated

This reverts commit 72ecbbda47c0712846312849bab1fb458cd9fffb.

# Please enter the commit message for your changes. Lines starting
# with '#' will be ignored, and an empty message aborts the commit.
# On branch master
# Changes to be committed:
#   (use "git reset HEAD ..." to unstage)
#
#       modified:   newscast.txt
#

Save and exit; git creates the revert commit, and you can push it to the remote repo.

$ git revert 72ecbbda47c0712846312849bab1fb458cd9fffb
[master 84e6f13] Changing airplane's invention date is going to take more work than anticipated
 1 files changed, 0 insertions(+), 1 deletions(-)

You can revert any commit from any time in history. Note that this creates an “uncommit” in addition to the original commit: it’s essentially creating an “opposite diff” to cancel out the original commit. So the original commit is still there.
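
If you look at the history afterwards, you’ll see both the original commit and its revert, something like:

$ git log --oneline -2
84e6f13 Changing airplane's invention date is going to take more work than anticipated
72ecbbd Airplane was now invented by the Party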

Ignorance is strength, but just in case…

Some resources you may find helpful, when learning git:

  • git gui: invaluable for staging commits
  • gitk: easier to read than git log when examining commit history
  • GitReady: most useful site I’ve found for recipes of common tasks.

And remember, use a full sentence describing your change in commit messages. Big Brother is watching.

The Comments Conundrum

One of the most common questions I see about MongoDB schema design is:

I have a collection of blog posts and each post has an array of comments. How do I get…
…all comments by a given author
…the most recent comments
…the most popular commenters?

And so on. The answer to this has always been “Well, you can’t do that on the server side…” You can either do it on the client side or store comments in their own collection. What you really want is the ability to treat embedded documents like a “real” collection.

The aggregation pipeline gives you this ability by letting you “unwind” arrays into separate documents, then doing whatever else you need to do in subsequent pipeline operators.

For example…

Note: you must be running at least version 2.1.0 of MongoDB to use the aggregation pipeline.

Getting all comments by Serious Cat

Serious Cat’s comments are scattered between post documents, so there wasn’t a good way of querying for just those embedded documents. Now there is.

Let’s assume we want each comment by Serious Cat, along with the title and url of the post Serious Cat was commenting on. So, the steps we need to take are:

  1. Extract the fields we want (title, url, comments)
  2. Unwind the comments field: make each comment into a “real” document
  3. Query our new “comments collection” for “Serious Cat”

Using the aggregation pipeline, this looks like:

> db.runCommand({aggregate: "posts", pipeline: [
{
   // extract the fields 
   $project: {
        title : 1,
        url : 1,
        comments : 1
    }
},
{
    // explode the "comments" array into separate documents
    $unwind: "$comments"
},
{
    // query like a boss
    $match: {"comments.author" : "Serious Cat"}
}]})

Now, this works well for something like a blog, where you have human-generated (small) data. If you’ve got gigs of comments to go through, you probably want to filter out as many as possible (e.g., with $match or $limit) before sending it to the “everything-in-memory” parts of the pipeline.
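
For example, for the Serious Cat query above, you could match before unwinding so that only posts containing at least one Serious Cat comment get exploded (the second $match is still needed to drop other people’s comments on those posts):

> db.runCommand({aggregate: "posts", pipeline: [
{
    // only pass along posts with at least one comment by Serious Cat
    $match: {"comments.author" : "Serious Cat"}
},
{
    $project: { title : 1, url : 1, comments : 1 }
},
{
    // explode the "comments" array into separate documents
    $unwind: "$comments"
},
{
    // now drop the other commenters' comments on those posts
    $match: {"comments.author" : "Serious Cat"}
}]})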

Getting the most recent comments

Let’s assume our site lists the 10 most recent comments across all posts, with links back to the posts they appeared on, e.g.,

  1. Great post! -Jerry (February 2nd, 2012) from This is a Great Post
  2. What does batrachophagous mean? -Fred (February 2nd, 2012) from Fun with Crosswords
  3. Where can I get discount Prada shoes? -Tom (February 1st, 2012) from Rant about Spam

To extract these comments from a collection of posts, you could do something like:

> db.runCommand({aggregate: "posts", pipeline: [
{
   // extract the fields
   $project: {
        title : 1,
        url : 1,
        comments : 1
    }
},
{
    // explode "comments" array into separate documents
    $unwind: "$comments"
},
{
    // sort newest first
    $sort: {
        "comments.date" : -1
    }
},
{
    // get the 10 newest
    $limit: 10
}]})

Let’s take a moment to look at what $unwind does to a sample document.

Suppose you have a document that looks like this after the $project:

{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : [
        {text : "Where can I get discount Prada shoes?", ...},
        {text : "First!", ...},
        {text : "I hate spam, too!", ...},
        {text : "I love spam.", ...}
    ]
}

Then, after unwinding the comments field, you’d have four documents, one per comment. Note that each unwound comments field holds a single subdocument, not an array:

{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : {text : "Where can I get discount Prada shoes?", ...}
}
{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : {text : "First!", ...}
}
{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : {text : "I hate spam, too!", ...}
}
{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : {text : "I love spam.", ...}
}

Then we $sort, $limit, and Bob’s your uncle.

Rank commenters by popularity

Suppose we allow users to upvote comments and we want to see who the most popular commenters are.

The steps we want to take are:

  1. Project out the fields we need (similar to above)
  2. Unwind the comments array (similar to above)
  3. Group by author, summing votes (this adds up the votes on all of each author’s comments)
  4. Sort authors to find the most popular commenters

Using the pipeline, this would look like:

> db.runCommand({aggregate: "posts", pipeline: [
{
   // extract the fields we'll need
   $project: {
        title : 1,
        url : 1,
        comments : 1
    }
},
{
    // explode "comments" array into separate documents
    $unwind: "$comments"
},
{
    // count up votes by author
    $group : {
        _id : "$comments.author",
        popularity : {$sum : "$comments.votes"}
    }
},
{
    // sort by the new popular field
    $sort: {
        "popularity" : -1
    }
}]})

As I mentioned before, there are a couple of downsides to using the aggregation pipeline: a lot of the pipeline is done in-memory and can be very CPU- and memory-intensive. However, used judiciously, it gives you a lot more freedom to mush around your embedded documents.