How MongoDB’s Journaling Works

I was working on a section on the gooey innards of journaling for The Definitive Guide, but then I realized it’s an implementation detail that most people won’t care about. However, I had all of these nice diagrams just lying around.

Good idea, Patrick!

So, how does journaling work? Your disk has your data files and your journal files, which we’ll represent like this:

When you start up mongod, it maps your data files to a shared view. Basically, the operating system says: “Okay, your data file is 2,000 bytes on disk. I’ll map that to memory address 1,000,000-1,002,000. So, if you read the memory at memory address 1,000,042, you’ll be getting the 42nd byte of the file.” (Also, the data won’t necessarily be loaded until you actually access that memory.)

This memory is still backed by the file: if you make changes in memory, the operating system will flush these changes to the underlying file. This is basically how mongod works without journaling: it asks the operating system to flush in-memory changes every 60 seconds.

However, with journaling, mongod makes a second mapping, this one to a private view. Incidentally, this is why enabling journaling doubles the amount of virtual memory mongod uses.

Note that the private view is not connected to the data file, so the operating system cannot flush any changes from the private view to disk.

Now, when you do a write, mongod writes this to the private view.

mongod will then write this change to the journal file, creating a little description of which bytes in which file changed.

The journal appends each change description it gets.

At this point, the write is safe. If mongod crashes, the journal can replay the change, even though it hasn’t made it to the data file yet.

The journal will then replay this change on the shared view.

Then mongod remaps the shared view to the private view. This prevents the private view from getting too “dirty” (having too many changes from the shared view it was mapped from).

Finally, at a glacial speed compared to everything else, the shared view will be flushed to disk. By default, mongod requests that the OS do this every 60 seconds.
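
To tie the steps together, here’s the whole write path as rough pseudocode. This is just a sketch of the description above; none of these names exist in the actual server code:

function journaledWrite(change) {
    privateView.apply(change);          // the write goes to the private view first
    journal.append(describe(change));   // note which bytes in which file changed
    // at this point the write is durable: after a crash, the journal can replay it
    sharedView.apply(change);           // the journal replays the change on the shared view
    remapPrivateView(sharedView);       // periodically, so the private view doesn't get too "dirty"
    // ...and, roughly every 60 seconds, the OS flushes the shared view to the data files
}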

And that’s how journaling works. Thanks to Richard, who gave the best explanation of this I’ve heard (Richard is going to be teaching an online course on MongoDB this fall, if you’re interested in more wisdom from the source).

How to Make Your First MongoDB Commit

10gen is hiring a lot of people straight out of college, so I thought this guide would be useful.

Basically, the idea is: you have found and fixed a bug (so you’ve cloned the mongo repository, created a branch named SERVER-1234, and committed your change on it). You’ve had your fix code-reviewed (this page is only accessible to 10gen wiki accounts). Now you’re ready to submit your change, to be used and enjoyed by millions (no pressure). But how do you get it into the main repo?

Basically, this is the idea: there’s the main MongoDB repo on Github, which you don’t have access to (yet):

However, you can make your own copy of the repo, which you do have access to:

So, you can put your change in your repo and then ask one of the developers to merge it in, using a pull request.

That’s the 1000-foot overview. Here’s how you do it, step-by-step:

  1. Create a Github account.
  2. Go to the MongoDB repository and hit the “Fork” button.

  3. Now, if you go to https://www.github.com/yourUsername/mongo, you’ll see that you have a copy of the repository (replace yourUsername with the username you chose in step 1). Now you have this setup:

  4. Add this repository as a remote locally:
    $ git remote add me git@github.com:yourUsername/mongo.git
    

    Now you have this:

  5. Now push your change from your local repo to your Github repo:
    $ git push me SERVER-1234
    
  6. Now you have to make a pull request. Visit your fork on Github and click the “Pull Request” button.

  7. This will pull up Github’s pull request interface. Make sure you have the right branch and the right commits.

  8. Hit “Send pull request” and you’re done!

Call for Schemas

The Return of the Mongoose Lemur

I just started working on MongoDB: The Definitive Guide, 2nd Edition! I’m planning to add:

  • Lots of ops info
  • Real-world schema design examples
  • Coverage of new features since 2010… so quite a few

However, I need your help on the schema design part! I want to include some real-world schemas people have used and why they worked (or didn’t). If you’re working on something 1) interesting and 2) non-confidential and you’d like to either share or get some free advice (or both), please email me (kristina at 10gen dot com) or leave a comment below. I’ll set up a little interview with you.

I am particularly looking for “cool” projects (video games, music, TV, sports), recognizable companies (Fortune 50 & HackerNews 500*), and geek elite (Linux development, research labs, robots, etc.). However, if you’re working on something you think is interesting that doesn’t fall into any of those categories, I’d love to hear about it!

* There isn’t really a HackerNews 500; I mean projects that people in the tech world recognize and think are pretty cool (DropBox, Github, etc.).

Controlling Collection Distribution

Shard tagging is a new feature in MongoDB version 2.2.0. It’s supposed to force writes to go to a local data center, but it can also be used to pin a collection to a shard or set of shards.

Note: to try this out, you’ll have to use 2.2.0-rc0 or greater.

To play with this feature, first you’ll need to spin up a sharded cluster:

> sharding = new ShardingTest({shards:3,chunksize:1})

This command will start up 3 shards, a config server, and a mongos. It’ll also start spewing out the logs from all the servers into stdout, so I recommend putting this shell aside and using a different one from here on in.

Start up a new shell and connect to the mongos (defaults to port 30999) and create some sharded collections and data to play with:

> // remember, different shell
> conn = new Mongo("localhost:30999")
> db = conn.getDB("villains")
>
> // shard db
> sh.enableSharding("villains")
>
> // shard collections
> sh.shardCollection("villains.joker", {jokes:1});
> sh.shardCollection("villains.two-face", {luck:1});
> sh.shardCollection("villains.poison ivy", {flora:1});
> 
> // add data
> for (var i=0; i<100000; i++) { db.joker.insert({jokes: Math.random(), count: i, time: new Date()}); }
> for (var i=0; i<100000; i++) { db["two-face"].insert({luck: Math.random(), count: i, time: new Date()}); }
> for (var i=0; i<100000; i++) { db["poison ivy"].insert({flora: Math.random(), count: i, time: new Date()}); }

Now we have 3 shards and 3 villains. If you look at where the chunks are, you should see that they’re pretty evenly spread out amongst the shards:

> use config
> db.chunks.find({ns: "villains.joker"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
> db.chunks.find({ns: "villains.two-face"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
> db.chunks.find({ns: "villains.poison ivy"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }

Or, as Harley would say, “Puddin’.”

However, villains tend to not play well with others, so we’d like to separate the collections: 1 villain per shard. Our goal:

Shard       Namespace
shard0000   “villains.joker”
shard0001   “villains.two-face”
shard0002   “villains.poison ivy”

To accomplish this, we’ll use tags. A tag describes a property of a shard, any property (they’re very flexible). So, you might tag a shard as “fast” or “slow” or “east coast” or “rackspace”.

In this example, we want to mark a shard as belonging to a certain villain, so we’ll add villains’ nicknames as tags.

> sh.addShardTag("shard0000", "mr. j")
> sh.addShardTag("shard0001", "harv")
> sh.addShardTag("shard0002", "ivy")

This says, “put any chunks tagged ‘mr. j’ on shard0000.”

The second thing we have to do is to make a rule, “For all chunks created in the villains.joker collection, give them the tag ‘mr. j’.” To do this, we can use the addTagRange helper:

> sh.addTagRange("villains.joker", {jokes:MinKey}, {jokes:MaxKey}, "mr. j")

This says, “Mark every chunk in villains.joker with the ‘mr. j’ tag” (MinKey is negative infinity, MaxKey is positive infinity, so all of the chunks fall in this range).

Now let’s do the same thing for the other two collections:

> sh.addTagRange("villains.two-face", {luck:MinKey}, {luck:MaxKey}, "harv")
> sh.addTagRange("villains.poison ivy", {flora:MinKey}, {flora:MaxKey}, "ivy")

Now wait a couple of minutes (it takes a little while for it to rebalance) and then look at the chunks for these collections.

> use config
> db.chunks.find({ns: "villains.joker"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
> db.chunks.find({ns: "villains.two-face"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
> db.chunks.find({ns: "villains.poison ivy"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }

Scaling with Tags

Obviously, Two-Face isn’t very happy with this arrangement and immediately requests two servers for his data. We can move the Joker and Poison Ivy’s collections to one shard and expand Harvey’s to two by manipulating tags:

> // move Poison Ivy to shard0000
> sh.addShardTag("shard0000", "ivy")
> sh.removeShardTag("shard0002", "ivy")
>
> // expand Two-Face to shard0002
> sh.addShardTag("shard0002", "harv")

Now if you wait a couple minutes and look at the chunks, you’ll see that Two-Face’s collection is distributed across 2 shards and the other two collections are on shard0000.

> db.chunks.find({ns: "villains.poison ivy"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
> db.chunks.find({ns: "villains.two-face"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }

“Bad heads, you get EBS.”

However, this still isn’t quite right for Harvey: he’d like one shard to be good and one to be bad. Let’s say we take advantage of Amazon’s new offering and replace shard0002 with SSDs. Then we divide up the traffic: send 50% of Harvey’s writes to the SSD shard and 50% to the spinning disk shard. First, we’ll add tags to the shards, describing them:

> sh.addShardTag("shard0001", "spinning")
> sh.addShardTag("shard0002", "ssd")

The value of the “luck” field is between 0 and 1, so we want to say, “If luck >= .5, send to the SSD.”

> sh.addTagRange("villains.two-face", {luck:MinKey}, {luck:.5}, "spinning")
> sh.addTagRange("villains.two-face", {luck:.5}, {luck:MaxKey}, "ssd")

Now “bad luck” docs will be written to the slow disk and “good luck” documents will be written to SSD.

As we add new servers, we can control what kind of load they get. Tagging gives operators a ton of control over what collections go where.

Finally, I wrote a small script that adds a “home” method to collections to pin them to a single tag. Example usage:

> // load the script
> load("batman.js")
> // put foo on bar
> db.foo.home("bar")
> // put baz on bar
> db.baz.home("bar")
> // move foo to bat
> db.foo.home("bat")

Enjoy!

Replica Set Internals Part V: Initial Sync

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this knowledge to debugging replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

The Initial Sync Processes

When you add a brand new member to the set, it needs to copy over all of the data before it’s ready to be a “real” member. This process is called initial syncing and there are, essentially, 7 steps:

  1. Check the oplog. If it is not empty, this node does not initial sync, it just starts syncing normally. If the oplog is empty, then initial sync is necessary, continue to step #2:
  2. Get the latest oplog time from the source member: call this time start.
  3. Clone all of the data from the source member to the destination member.
  4. Build indexes on destination.
  5. Get the latest oplog time from the sync target, which is called minValid.
  6. Apply the sync target’s oplog from start to minValid.
  7. Become a “normal” member (transition into secondary state).
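
In pseudocode, that process looks roughly like this (a sketch of the list above, not the real code; none of these helper names exist in the server):

function initialSync(source, me) {
    if (!me.oplogIsEmpty()) {
        return;                              // 1. there's an oplog: just sync normally
    }
    var start = source.latestOplogTime();    // 2.
    cloneAllData(source, me);                // 3.
    buildIndexes(me);                        // 4.
    var minValid = source.latestOplogTime(); // 5.
    applyOplog(source, me, start, minValid); // 6.
    me.becomeSecondary();                    // 7.
}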

Note that this process only checks the oplog. You could have a petabyte of data on the server, but if there’s no oplog, the new member will delete all of it and initial sync. Members depend on the oplog to know where to sync from.

So, suppose we’re starting up a new member with no data on it. MongoDB checks the oplog, sees that it doesn’t even exist, and begins the initial sync process.
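
Incidentally, you can look at the oplog (or confirm its absence) yourself from the shell; it lives in the local database:

> use local
> db.oplog.rs.count()
> db.printReplicationInfo()   // the oplog's size and time range, if it exists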

Copying Data

The code for this couldn’t be much simpler; in pseudocode, it’s basically:

for each db on sourceServer:
    for each collection in db:
        for each doc in db.collection.find():
             destinationServer.getDB(db).getCollection(collection).insert(doc)

One of the issues with syncing is that it has to touch all of the source’s data, so if you’ve been carefully cultivating a working set on sourceServer, it’ll pretty much be destroyed.

There are benefits to initial syncing, though: it effectively compacts your data on the new secondary. As all it’s doing is inserts, it’ll use pretty much the minimum amount of space. Some users actually use rotating resyncs to keep their data compact.

On the downside, initial sync doesn’t consider padding factor, so if that’s important to your application, the new server will have to build up the right padding factor over time.

Syncing on a Live System

The tricky part of initial syncing is that we’re trying to copy an (often) massive amount of data off of a live system. New data will be written while this copy is taking place, so it’s a bit like trying to copy a tree over six months.

By the time you’re done, the data you copied first might have changed significantly on the source. The copy on the destination might not be… exactly what you’d expect.

That is what the oplog replay step is for: to get your data to a consistent state. Oplog ops are idempotent: they can be applied multiple times and yield the same answer. Thus, so long as we apply all of the writes at least once (remember, they may or may not have been applied on the source before the copy), we’ll end up with a consistent picture when we’re done.

minValid (as mentioned in the list above) is the first timestamp where our new DB is in a consistent state: it may be behind the other members, but its data matches exactly how the other servers looked at some point in time.

Some examples of idempotency, as most people haven’t seen it since college:

// three idempotent functions:
function idemp1(doc) {
   doc.x = doc.x + 0;
}

function idemp2(doc) {
   doc.x = doc.x * 1;
}

function idemp3(doc) {
   // this is what replication does: it turns stuff like "$inc 4 by 1" into "$set to 5"
   doc.x = 5;
}

// two non-idempotent functions
function nonIdemp1(doc) {
   doc.x = doc.x + 1;
}

function nonIdemp2(doc) {
   doc.x = Math.random();
}

No matter how many times you call the idempotent functions, the value of doc.x will be the same (as long as you call them at least once).

Building Indexes

In 2.0, indexes were created on the secondary as part of the cloning step, but in 2.2, we moved index creation to after the oplog application. This is because of an interesting edge case. Let’s say we have a collection representing the tree above and we have a unique index on leaf height: no two leaves are at exactly the same height. So, pretend we have a document that looks like this:

{
    "_id" : 123,
    ...,
    "height" : 76.3
}

The cloner copies this doc from the source server to the destination server and moves on. On the source, we remove this leaf from the tree because of, I dunno, high winds.

> db.tree.remove({_id:123})

However, the cloner has already copied the leaf, so it doesn’t notice this change. Now another leaf might grow at this height. Let’s say leaf #10,012 grows to this height on the source.

> db.tree.update({_id:10012}, {$set : {height : 76.3}}) 

Now, when the cloner gets to document #10012, it’ll copy it to the destination server. Now there are two documents with the same height field in the destination collection, so when it tries to create a unique index on “height”, the index creation will fail!

So, we moved the index creation to after the oplog application. That way, we’re always building the index on a consistent data set, so it should always succeed.

There are a couple of other edge cases like that which have been super-fun to track down, which you can look up in Jira if you’re interested.

Restoring from Backup

Often, initial sync is too slow for people. If you want to get a secondary up and running as fast as possible, the best way to do so is to skip initial sync altogether and restore from backup. To restore from backup:

  1. Find a secondary you like the looks of.
  2. Either shut it down or fsync+lock it (see the sketch after this list). Basically, get its data files into a clean state, so nothing is writing to them while you’re copying them.
  3. Copy the data files to your destination server. Make sure you get all of your data if you’re using any symlinks or anything.
  4. Start back up or unlock the source server.
  5. Point the destination server at the data you copied and start it up.
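
For step 2, if you’d rather not shut the secondary down, the fsync+lock part looks roughly like this (run it on the member you’re copying from):

> db.fsyncLock()     // flush everything to disk and block writes while you copy the files
> // ...copy the data files at the filesystem level...
> db.fsyncUnlock()   // let writes through again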

As there is already an oplog, it will not need to initial sync. It will begin syncing from another member’s oplog immediately when it starts up and usually catch up quite quickly.

Note that mongodump/mongorestore actually does not work very well as a “restoring from backup” strategy because it doesn’t give you an oplog. You can create one on your own and prime it, but it’s more work and more fiddly than just copying files. There is a feature request for mongorestore to be able to prime the oplog automatically, but it won’t be in 2.2.

P.S. Trees were done with an awesome program I recently discovered called ArtRage, which I highly recommend to anyone who likes painting/drawing. It “feels” like real paint.

Good Night, Westley: Time-To-Live Collections

In The Princess Bride, every night the Dread Pirate Roberts tells Westley: “Good night, Westley. Good work. Sleep well. I’ll most likely kill you in the morning.”

Let’s say the Dread Pirate Roberts wants to optimize this process, so he stores prisoners in a database. When he captures Westley, he can put:

> db.prisoners.insert({
... name: "Westley",
... sentenceStart: new Date()
... })

However, now he has to run some sort of cron job that runs all the time in order to kill everyone who needs killing and keep his database up-to-date.

Enter time-to-live (TTL) collections. TTL collections are going to be released in MongoDB 2.2 and they’re collections where documents expire in a more controlled way than with capped collections.

What the Dread Pirate Roberts can do is:

> db.prisoners.ensureIndex({sentenceStart: 1}, {expireAfterSeconds: 24*60*60})

Now, MongoDB will regularly comb this index looking for docs to expire (so it’s actually more of a TTL index than a TTL collection).

Let’s try it out ourselves. You’ll need to download version 2.1.2 or higher to use this feature. Start up the mongod and run the following in the Mongo shell:

> db.prisoners.ensureIndex({sentenceStart: 1}, {expireAfterSeconds: 30})

We’re on a schedule here, so our pirate ship is more brutal: death after 30 seconds. Let’s take aboard a prisoner and watch him die.

> var start = new Date()
> db.prisoners.insert({name: "Haggard Richard", sentenceStart: start})
> while (true) { 
... var count = db.prisoners.count(); 
... print("# of prisoners: " + count + " (" + (new Date() - start) + "ms)");
... if (count == 0) 
...      break; 
... sleep(4000); }
# of prisoners: 1 (2020ms)
# of prisoners: 1 (6021ms)
# of prisoners: 1 (10022ms)
# of prisoners: 1 (14022ms)
# of prisoners: 1 (18023ms)
# of prisoners: 1 (22024ms)
# of prisoners: 1 (26024ms)
# of prisoners: 0 (30025ms)

…and he’s gone.

Edited to add: Stennie pointed out that the TTL job only runs once a minute, so YMMV on when Westley gets bumped.

Conversely, let’s say we want to play the “maybe I’ll kill you tomorrow” game and keep bumping Westley’s expiration date. We can do that by updating the TTL-indexed field:

> db.prisoners.insert({name: "Westley", sentenceStart: new Date()})
> for (i=0; i<60; i++) { db.prisoners.update({name: "Westley"}, {$set: {sentenceStart: new Date()}}); sleep(1000); }
> db.prisoners.count()
1

…and Westley’s still there, even though it’s been more than 30 seconds.

Once he gets promoted and becomes the Dread Pirate Roberts himself, he can remove himself from the execution rotation by changing his sentenceStart field to a non-date (or removing it altogether):

> db.prisoners.update({name: "Westley"}, {$unset : {sentenceStart: 1}});

When not on pirate ships, developers generally use TTL collections for sessions and other cache expiration problems. If ye be wanting a less grim introduction to MongoDB’s TTL collections, there are some docs on it in the manual.
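
For example, a session cache with a 30-minute idle timeout might look something like this (the collection and field names are just made up for illustration):

> db.sessions.ensureIndex({lastActivity: 1}, {expireAfterSeconds: 30*60})
> db.sessions.insert({user: "buttercup", lastActivity: new Date()})
> // every time the user does something, bump lastActivity to keep the session alive
> db.sessions.update({user: "buttercup"}, {$set: {lastActivity: new Date()}})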

Replica Set Internals Bootcamp Part III: Reconfiguring

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this knowledge to debugging replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Reconfiguring Prerequisites

One of the goals is to not let you reconfigure yourself into a corner (e.g., end up with all arbiters), so reconfig tries to make sure that a primary could be elected with the new config. Basically, we go through each node, tally up how many votes there will be, and check that a majority of those voters is up (the reconfig logic sends out heartbeats to find out).

Also, the member you send the reconfig to has to be able to become primary in the new setup. It doesn’t have to become primary, but its priority has to be greater than 0. So, you can’t have all of the members have a priority of 0.

The reconfig also checks the version number, set name, and that nothing is going to an illegal state (e.g., arbiter-to-non-arbiter, upping the priority on a slave delayed node, and so on).

One thing to note is that you can change hostnames in a reconfig. If you’re using localhost for a single-node set and want to change it to an externally resolvable hostname so you can add some other members, you can just change the member’s hostname from localhost to someHostname and reconfig (so long as someHostname resolves, of course).
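
That reconfig might look something like this (someHostname, as above, is a stand-in for whatever your box is actually called):

> cfg = rs.conf()
> cfg.members[0].host = "someHostname:27017"
> rs.reconfig(cfg)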

Additive Reconfiguration vs. Full Reconfigs

Once the reconfiguration has been checked for correctness, MongoDB checks to see if this is a simple reconfig or a full reconfig. A simple reconfig adds a new node. Anything else is a full reconfig.

A simple reconfig starts a new heartbeat thread for the new member and it’s done.

A full reconfig clears all state. This means that the current primary closes all connections. All the current heartbeat threads are stopped and a new heartbeat thread for each member is started. The old config is replaced by the new config. Then the member formerly known as primary becomes primary again.

We definitely take a scorched-earth approach to reconfiguring. If you are, say, changing the priority of a node from 0 to 1, it would make more sense to change that field than to tear down the whole old config. However, we didn’t want to miss an edge case, so we went with better safe than sorry. Reconfig is considered a “slow” operation anyway, so we’ll generally make the tradeoff of slower and safer.

Propagation of Reconfiguration

Even if you have a node that is behind on replication or slave delayed, reconfiguration will propagate almost immediately. How? New configs are communicated via heartbeat.

Suppose you have 2 nodes, A and B.

You run a reconfig on A, changing the version number from 6 to 7.

B sends a heartbeat request to A, which includes a field stating that B‘s version number is 6.

When A gets that heartbeat request, it will see that B’s config version is less than its own, so it’ll send back its config (at version 7) as part of its heartbeat response.

When B sees that new config, it’ll load it (making the same checks for validity that A did originally) and follow the same procedure described above.

Force reconfig to the face.

Forcing Reconfig

Despite the checks made by reconfig, users sometimes get into a situation where they don’t have a primary. They’d permanently lose a couple servers or a data center and suddenly be stuck with a bunch of secondaries and no way to reconfig. So, in 2.0, we added a force:true option to reconfig, which allowed it to be run on a secondary. That is all that force:true does. Sometimes people complain that force:true wouldn’t let them load an invalid configuration. Indeed, it won’t. force:true does not relax any of the other reconfig constraints. You still have to pass in a valid config. You can just pass it to a secondary.
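
So a forced reconfig looks just like a normal one, except that you run it on a secondary with the force option (which members you keep in the config is up to you; this is only an illustration):

> // on one of the surviving secondaries
> cfg = rs.conf()
> cfg.members = [cfg.members[0], cfg.members[1]]   // keep only the members that still exist
> rs.reconfig(cfg, {force: true})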

Why is my version number 6,203,493?

When you force-reconfigure a set, it adds a random (big) number to the version, which can be unnerving. Why does the version number jump by thousands? Suppose that we have a network partition and force-reconfigure the set on both sides of the partition. If we ended up with both sides having a config version of 8 and the set got reconnected, then everyone would assume they were in sync (everyone has a config version of 8, no problems here!) and you’d have half of your nodes with one config and half with another. By adding a random number to the version on reconfig, it’s very probable that one “side” will have a higher version number than the other. When the network is fixed, whichever side has a higher version number will “win” and your set will end up with a consistent config.

It might not end up choosing the config you want, but some config is better than the set puttering along happily with two primaries (or something stupid like that). Basically, if shenanigans happen during a network partition, check your config after the network is healthy again.

Removing Nodes and Sharding

I’d just like to rant for a second: removing nodes sucks! You’d think it would be so easy, right? Just take the node out of the config and boom, done. It turns out it’s a total nightmare. Not only do you have to stop all of the replication stuff happening on the removed node, you have to stop everything the rest of the set is doing with that node (e.g., syncing from it).

You also have to change the way the removed node reports itself so that mongos won’t try to update a set’s config from a node that’s been removed. And you can’t just shut it down because people want to be able to play around and do rs.add("foo"); rs.remove("foo"); rs.add("foo") so you have to be able to entirely shut down the replica set’s interaction with the removed node, but in any way that can be restarted on a dime.

Basically, there are a lot of edge cases around removing nodes, so if you want to be on the safe side, shut down a node before removing it from the set. However, Eric Milkie has done a lot of awesome work on removing nodes for 2.2, so it should be getting better.

--thursday #5: diagnosing high readahead

Having readahead set too high can slow your database to a crawl. This post discusses why that is and how you can diagnose it.

The #1 sign that readahead is too high is that MongoDB isn’t using as much RAM as it should be. If you’re running Mongo Monitoring Service (MMS), take a look at the “resident” size on the “memory” chart. Resident memory can be thought of as “the amount of space MongoDB ‘owns’ in RAM.” Therefore, if MongoDB is the only thing running on a machine, we want resident size to be as high as possible. On the chart below, resident is ~3GB:

Is 3GB good or bad? Well, it depends on the machine. If the machine only has 3.5GB of RAM, I’d be pretty happy with 3GB resident. However, if the machine has, say, 15GB of RAM, then we’d like at least 15GB of the data to be in there (the “mapped” field is (sort of) data size, so I’m assuming we have 60GB of data).

Assuming we’re accessing a lot of this data, we’d expect MongoDB’s resident set size to be 15GB, but it’s only 3GB. If we turn down readahead, the resident size jumps to 15GB and our app starts going faster. But why is this?

Let’s take an example: suppose all of our docs are 512 bytes in size (readahead is set in 512-byte increments, called sectors, so 1 doc = 1 sector makes the math easier). If we have 60GB of data then we have ~120 million documents (60GB of data/(512 bytes/doc)). The 15GB of RAM on this machine should be able to hold ~30 million documents.

Our application accesses documents randomly across our data set, so we’d expect MongoDB to eventually “own” (have resident) all 15GB of RAM, as 1) it’s the only thing running and 2) it’ll eventually fetch at least 15GB of the data.

Now, let’s set our readahead to 100 (100 512-byte sectors, aka 100 documents): blockdev --setra 100. What happens when we run our application?

Picture our disk as looking like this, where each o is a document:

...
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
... // keep going for millions more o's

Let’s say our app requests a document. We’ll mark it with “x” to show that the OS has pulled it into memory:

...
ooooooooooooooooooooooooo
ooooxoooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

See it on the third line there? But that’s not the only doc that’s pulled into memory: readahead is set to 100 so the next 99 documents are pulled into memory, too:

...
ooooooooooooooooooooooooo
ooooxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

Is your OS returning this with every document?

Now we have 100 docs in memory, but remember that our application is accessing documents randomly: the likelihood that the next document we access is in that block of 100 docs is almost nil. At this point, there’s 50KB of data in RAM (512 bytes * 100 docs = 51,200 bytes) and MongoDB’s resident size has only increased by 512 bytes (1 doc).

Our app will keep bouncing around the disk, reading docs from here and there and filling up memory with docs MongoDB never asked for, until RAM is completely full of junk that’s never been used. Then, it’ll start evicting things to make room for new junk as our app continues to make requests.

Working this out, there’s a 25% chance of our app requesting a doc that’s already in memory, so 75% of the requests are going to go to disk. Say we’re doing 2 requests a sec. Then 1 hour of requests is 2 requests * 3600 seconds/hour = 7200 requests, 5400 of which are going to disk (.75 * 7200). If each disk request pulls back 50KB, that’s 270MB read from disk/hour. If we set readahead to 0, each of those requests only pulls back 512 bytes, so we’d read less than 3MB from disk/hour.

Which brings us to the next symptom of a too-high readahead: unexpectedly high disk IO. Because most of the data we want isn’t in memory, we keep having to go to disk, dragging shopping-carts full of junk into RAM, perpetuating the high disk io/low resident mem cycle.

The general takeaway is that a DB is not a “normal” workload for an OS. The default settings may screw you over.

Night of the Living Dead Ops

MongoDB users often ask about the “killed” field in db.currentOp() output. For example, if you’ve run db.killOp(), you might see something like:

> db.currentOp()
{
	"inprog" : [
		{
			"opid" : 3062962,
			"active" : true,
			"lockType" : "write",
			"waitingForLock" : false,
			"secs_running" : 32267,
			"op" : "update",
			"ns" : "httpdb.servers",
			"query" : {
				"_id" : "150.237.88.189"
			},
			"client" : "127.0.0.1:50416",
			"desc" : "conn",
			"threadId" : "0x2900c400",
			"connectionId" : 74,
			"killed" : true,
			"numYields" : 0
		},
		{
			"opid" : 3063051,
			"active" : false,
			"lockType" : "read",
			"waitingForLock" : true,
			"op" : "query",
			"ns" : "",
			"query" : {
				"count" : "servers",
				"query" : {
					"code" : {
						"$gte" : 200
					}
				}
			},
			"client" : "127.0.0.1:30736",
			"desc" : "conn",
			"threadId" : "0x29113700",
			"connectionId" : 191,
			"killed" : true,
			"numYields" : 0
		}
        ]
}

The operation looks dead… it has killed:true, right? But you can run db.currentOp() again and again and the op doesn’t go away, even though it’s been “killed.” So what’s up with that?

Chainsaws: the kill -9 of living dead

It has to do with how MongoDB handles killed operations. When you run db.killOp(3062962), MongoDB looks up operation 3062962 in a hashtable and sets its killed field to true. However, the code running that op gets to decide whether to even check that field and how to deal with it appropriately.

There are basically three ways MongoDB ops handle getting killed:

  • Ones that die when they yield whatever lock they’re holding. This means that if the op never yields (note that numYields is 0 in the example above), it will never be killed.
  • Ones that can be killed at certain checkpoints. For example, index builds happen in multiple stages and check killed between stages. (Many commands do this, too.)
  • Ones that cannot be killed at all. For example, rsSync, the name for the op applying replication, falls into this category. There are some sharding commands that can’t be killed, too.

There is no kill -9 equivalent in MongoDB (other than kill -9-ing the server itself): if an op doesn’t know how to safely kill itself, it won’t die until it’s good and ready. Which means that you can have a “killed” op in db.currentOp() output for a long time. killed might be better named killRequested.
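
If you want to see which ops have a kill pending, you can filter currentOp’s output in the shell:

> db.currentOp().inprog.forEach(function(op) {
...     if (op.killed) print("kill requested for op " + op.opid + " (active: " + op.active + ")");
... })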

Also, if you kill an operation before it acquires a lock, it’ll generally start executing anyway (e.g., op 3063051 above). For example, try opening up a shell and making the db hold the writelock for 10 minutes:

> db.eval("sleep(10*60*1000)")

While that’s running, in another shell, try doing an insert (which will block, as the db cannot do any writes while the db.eval() is holding the writelock).

> db.foo.insert({x:1})

Now, in a third shell, kill the insert we just did (before the 10 minutes elapse):

> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 25,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,
                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
                                
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "numYields" : 0
                }
        ]
}
> // get the opId of the insert from currentOp
> db.killOp(455949)
{ "info" : "attempting to kill op" }
> // run currentOp again to see that killed:true
> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 221,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,
                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
                                
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "killed" : true,
                        "numYields" : 0
                }
        ]
}

If you wait 10 minutes for the db.eval() to finish, then do a find on db.foo, you’ll see that {x:1} was actually inserted anyway. This is because the op’s lifecycle looks something like:

  • Wait for lock
  • Acquire lock!
  • Start running
  • Yield lock
  • Check for killed

So it’ll run a bit before dying (if it can be killed at all), which may produce unintuitive results.

Replica Set Internals Bootcamp: Part II – Creating a Set

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this knowledge to debugging replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

I’m going to do one subject per post; we’ll see how many I can get through.

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Initializing a Set

Suppose you are creating a new set. You start up some mongods with no data and the replSet option. When a server starts up with the replSet option, the first thing it does is check its local.system.replset collection for a replica set config it can load, but it won’t find one (because there’s no data). Since it cannot find a config, it goes into the EMPTYCONFIG state. If you look in the logs, you’ll see messages about EMPTYCONFIG.

Tue Mar  6 12:24:35 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:24:35 [rsStart] replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Tue Mar  6 12:24:45 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:24:55 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:25:05 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:25:15 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:25:25 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)

Now suppose you send the replSetInitiate command to one of the members. replSetInitiate can either take a custom configuration or generate a config automatically. If you do not pass in a config, the server will try to figure out what its hostname is and generate a config using that.

Note: on EC2, the server always chooses the wrong hostname, so you have to pass in the config you want.
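
Passing in a config explicitly looks like this (the hostnames here are obviously placeholders):

> rs.initiate({
...     _id: "foo",
...     members: [
...         {_id: 0, host: "ec2-host-1.example.com:30000"},
...         {_id: 1, host: "ec2-host-2.example.com:30000"},
...         {_id: 2, host: "ec2-host-3.example.com:30000"}
...     ]
... })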

Once the server has a config document, either passed in or generated, it’ll make sure it can reach every node specified (all members must be up and reachable to initiate), then store the config in the local.system.replset collection. Finally, it begins sending heartbeat requests to the other members of the set.

The Log: Step-by-Step

Being able to interpret log files is critical to knowing what’s going on, so let’s walk through a sample log from running rs.initiate(). Here’s the whole log, feel free to skip over this for now. We’ll take it piece-by-piece below.

Tue Mar  6 12:45:00 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Tue Mar  6 12:45:02 [initandlisten] connection accepted from 127.0.0.1:38839 #2 (2 connections now open)
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate admin command received from client
Tue Mar  6 12:45:08 [conn2] replSet info initiate : no configuration specified.  Using a default configuration for the set
Tue Mar  6 12:45:08 [conn2] replSet created this configuration for initiation : { _id: "foo", members: [ { _id: 0, host: "ubuntu:30000" } ] }
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate config object parses ok, 1 members specified
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate all members seem up
Tue Mar  6 12:45:08 [conn2] ******
Tue Mar  6 12:45:08 [conn2] creating replication oplog of size: 49085MB...
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.ns, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] creating directory /datadir/_tmp
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.ns, size: 16MB,  took 0.016 secs
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.0, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.0, size: 2047MB,  took 0.016 secs
Tue Mar  6 12:45:08 [conn2] datafileheader::init initializing /datadir/local.0 n:0
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.1, filling with zeroes...
...
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.22 n:22
Tue Mar  6 12:45:09 [FileAllocator] allocating new datafile /datadir/local.23, filling with zeroes...
Tue Mar  6 12:45:09 [FileAllocator] done allocating datafile /datadir/local.23, size: 2047MB,  took 0.042 secs
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.23 n:23
Tue Mar  6 12:45:10 [conn2] ******
Tue Mar  6 12:45:10 [conn2] replSet info saving a newer config version to local.system.replset
Tue Mar  6 12:45:10 [conn2] replSet saveConfigLocally done
Tue Mar  6 12:45:10 [conn2] replSet replSetInitiate config now saved locally.  Should come online in about a minute.
Tue Mar  6 12:45:10 [conn2] command admin.$cmd command: { replSetInitiate: undefined } ntoreturn:1 keyUpdates:0 reslen:196 2802ms
Tue Mar  6 12:45:10 [rsStart] replSet load config ok from self
Tue Mar  6 12:45:10 [rsStart] replSet I am ubuntu:30000
Tue Mar  6 12:45:10 [rsStart] replSet STARTUP2
Tue Mar  6 12:45:10 [rsSync] replSet SECONDARY
Tue Mar  6 12:45:25 [rsSync] waiting for 2 pings from other members before syncing
Tue Mar  6 12:45:26 [rsMgr] replSet info electSelf 0
Tue Mar  6 12:45:26 [rsMgr] replSet PRIMARY

Now, to understand this line-by-line. First, we start out in the EMPTYCONFIG state, as described above.

Tue Mar  6 12:45:00 [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)

Now the shell connects to the server and runs rs.initiate() (replSetInitiate), which apparently takes me 6 seconds to type (take that, Mavis Beacon).

Tue Mar  6 12:45:02 [initandlisten] connection accepted from 127.0.0.1:38839 #2 (2 connections now open)
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate admin command received from client

Because we didn’t give a config, the server tries to generate one:

Tue Mar  6 12:45:08 [conn2] replSet info initiate : no configuration specified.  Using a default configuration for the set
Tue Mar  6 12:45:08 [conn2] replSet created this configuration for initiation : { _id: "foo", members: [ { _id: 0, host: "ubuntu:30000" } ] }

Now it does some sanity checks on this config (e.g., making sure all members are up and reachable). This matters more if we gave it a config, but sometimes things are so screwed up you can’t reach yourself.

Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate config object parses ok, 1 members specified
Tue Mar  6 12:45:08 [conn2] replSet replSetInitiate all members seem up

Now it calculates the size of the oplog and allocates it between two lines of ******.

Tue Mar  6 12:45:08 [conn2] ******
Tue Mar  6 12:45:08 [conn2] creating replication oplog of size: 49085MB...
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.ns, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] creating directory /datadir/_tmp
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.ns, size: 16MB,  took 0.016 secs
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.0, filling with zeroes...
Tue Mar  6 12:45:08 [FileAllocator] done allocating datafile /datadir/local.0, size: 2047MB,  took 0.016 secs
Tue Mar  6 12:45:08 [conn2] datafileheader::init initializing /datadir/local.0 n:0
Tue Mar  6 12:45:08 [FileAllocator] allocating new datafile /datadir/local.1, filling with zeroes...
...
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.22 n:22
Tue Mar  6 12:45:09 [FileAllocator] allocating new datafile /datadir/local.23, filling with zeroes...
Tue Mar  6 12:45:09 [FileAllocator] done allocating datafile /datadir/local.23, size: 2047MB,  took 0.042 secs
Tue Mar  6 12:45:09 [conn2] datafileheader::init initializing /datadir/local.23 n:23
Tue Mar  6 12:45:10 [conn2] ******

It doesn’t actually print the “…” part, I just omitted 21 files being preallocated.

Now it stores the configuration in the local.system.replset collection.

Tue Mar  6 12:45:10 [conn2] replSet info saving a newer config version to local.system.replset
Tue Mar  6 12:45:10 [conn2] replSet saveConfigLocally done
Tue Mar  6 12:45:10 [conn2] replSet replSetInitiate config now saved locally.  Should come online in about a minute.

It seldom takes a minute for it to come online, but this way you’ll be pleasantly surprised.

replSetInitiate is now done running. It will often be logged, because slow operations are logged and allocating the oplog usually takes a while.

Tue Mar  6 12:45:10 [conn2] command admin.$cmd command: { replSetInitiate: undefined } ntoreturn:1 keyUpdates:0 reslen:196 2802ms

Now it actually loads this configuration as the replica set config we want to use:

Tue Mar  6 12:45:10 [rsStart] replSet load config ok from self

You won’t see this next line if you’re running 2.0, I added it for 2.2 because I was sick of having to decipher which server a log was from:

Tue Mar  6 12:45:10 [rsStart] replSet I am ubuntu:30000

Yay, now config has been loaded, making our state STARTUP2 (the “config has been loaded, starting up RS threads” state):

Tue Mar  6 12:45:10 [rsStart] replSet STARTUP2

At this point, it can go straight into PRIMARY state, but it doesn’t. We could fix this, but it hasn’t been a priority, so it goes into SECONDARY state briefly before becoming PRIMARY:

Tue Mar  6 12:45:10 [rsSync] replSet SECONDARY

A “real” secondary (we’re about to become primary, so we don’t count) will gather some stats about who’s best to sync from before it chooses a target:

Tue Mar  6 12:45:25 [rsSync] waiting for 2 pings from other members before syncing

Good luck with that, buddy, you’re the only one in the config.

Aha, we’ve noticed that we can become primary:

Tue Mar  6 12:45:26 [rsMgr] replSet info electSelf 0
Tue Mar  6 12:45:26 [rsMgr] replSet PRIMARY

And now we’re all set.

Propagating the Config

If we add members or provide an initial config with multiple members, the initial member’s heartbeat requests are sent to a bunch of servers in EMPTYCONFIG state. When these other servers receive the heartbeat request, they notice “Hey, I don’t have any config and this guy seems to think I’m part of his set, so I should check in with him.”

These EMPTYCONFIG members will then request a config from the member that requested the heartbeat. Seeing themselves in the config, they’ll save it to their own local.system.replset collection, then start sending out their own heartbeats. At that point, they’ll move into RECOVERING state and, once they’re synced to the original member, they’ll turn into secondaries (unless they’re arbiters, of course, but same general idea).

One quirk of initiating a set is that every member of the set must not have any data, other than the one you’re sending replSetInitiate to. After the set is initialized you can add members containing data, but at most one member can have data when you initialize the set.
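
Adding those members afterwards is just rs.add() (or rs.addArb() for arbiters) run from the member you initiated, e.g. (hostnames are placeholders):

> rs.add("ubuntu:30001")
> rs.addArb("ubuntu:30002")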

Confused?

Feel free to ask questions in the comments below. This is a loving, caring bootcamp (as bootcamps go).