Replica Set Internals Part V: Initial Sync

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

The Initial Sync Processes

When you add a brand new member to the set, it needs to copy over all of the data before it’s ready to be a “real” member. This process is called initial syncing and there are, essentially, 7 steps:

  1. Check the oplog. If it is not empty, this node does not initial sync, it just starts syncing normally. If the oplog is empty, then initial sync is necessary, continue to step #2:
  2. Get the latest oplog time from the source member: call this time start.
  3. Clone all of the data from the source member to the destination member.
  4. Build indexes on destination.
  5. Get the latest oplog time from the sync target, which is called minValid.
  6. Apply the sync target’s oplog from start to minValid.
  7. Become a “normal” member (transition into secondary state).

Note that this process only checks the oplog. You could have a petabyte of data on the server, but if there’s no oplog, the new member will delete all of it and initial sync. Members depend on the oplog to know where to sync from.

So, suppose we’re starting up a new member with no data on it. MongoDB checks the oplog, sees that it doesn’t even exist, and begins the initial sync process.
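
You can do the same check yourself from the shell; if this query comes back empty, the member has no oplog and will initial sync:

> use local
> db.oplog.rs.find().sort({$natural: -1}).limit(1)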

Copying Data

The code for this couldn’t be much simpler, in pseudo code it is basically:

for each db on sourceServer:
    for each collection in db:
        for each doc in db.collection.find():
            destinationServer.getDB(db).getCollection(collection).insert(doc)
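
If you want to see roughly the same thing as runnable mongo shell JS, it looks something like this (the hostnames are placeholders, and the real cloner batches documents, skips the local database, and handles errors):

> var source = new Mongo("sourceServer:27017"), dest = new Mongo("destinationServer:27017")
> source.getDBs().databases.forEach(function(d) {
...     if (d.name == "local") return;
...     var srcDB = source.getDB(d.name);
...     srcDB.getCollectionNames().forEach(function(collName) {
...         srcDB.getCollection(collName).find().forEach(function(doc) {
...             dest.getDB(d.name).getCollection(collName).insert(doc);
...         });
...     });
... })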

One of the issues with syncing is that it has to touch all of the source’s data, so if you’ve been carefully cultivating a working set on sourceServer, it’ll pretty much be destroyed.

There are benefits to initial syncing, though: it effectively compacts your data on the new secondary. Since all it's doing is inserts, the new member ends up using pretty much the minimum amount of space. Some users actually use rotating resyncs to keep their data compact.

On the downside, initial sync doesn’t consider padding factor, so if that’s important to your application, the new server will have to build up the right padding factor over time.

Syncing on a Live System

The tricky part of initial syncing is that we're trying to copy an (often) massive amount of data off of a live system. New data will be written while this copy is taking place, so it's a bit like trying to copy a tree over six months.

By the time you’re done, the data you copied first might have changed significantly on the source. The copy on the destination might not be… exactly what you’d expect.

That is what the oplog replay step is for: to get your data to a consistent state. Oplog ops are idempotent: they can be applied multiple times and yield the same result. Thus, as long as we apply all of the writes at least once (remember, they may or may not have been applied on the source before the copy), we'll end up with a consistent picture when we're done.


minValid (as mentioned in the list above) is the first timestamp where our new DB is in a consistent state: it may be behind the other members, but its data matches exactly how the other servers looked at some point in time.
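
(An implementation aside: minValid has to survive a restart in the middle of a sync, so it's persisted in the local database. In the 2.x code it lives in a collection called replset.minvalid, but that's an internal detail you shouldn't rely on.)

> use local
> db.replset.minvalid.findOne()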

Some examples of idempotency, as most people haven’t seen it since college:

// three idempotent functions:
function idemp1(doc) {
   doc.x = doc.x + 0;
}

function idemp2(doc) {
   doc.x = doc.x * 1;
}

function idemp3(doc) {
   // this is what replication does: it turns stuff like "$inc 4 by 1" into "$set to 5"
   doc.x = 5;
}

// two non-idempotent functions
function nonIdemp1(doc) {
   doc.x = doc.x + 1;
}

function nonIdemp2(doc) {
   doc.x = Math.random();
}

No matter how many times you call the idempotent functions, the value of doc.x will be the same (as long as you call them at least once).

Building Indexes

In 2.0, indexes were created on the secondary as part of the cloning step, but in 2.2, we moved index creation to after the oplog application. This is because of an interesting edge case. Let’s say we have a collection representing the tree above and we have a unique index on leaf height: no two leaves are at exactly the same height. So, pretend we have a document that looks like this:

{
    "_id" : 123,
    ...,
    "height" : 76.3
}

The cloner copies this doc from the source server to the destination server and moves on. On the source, we remove this leaf from the tree because of, I dunno, high winds.

> db.tree.remove({_id:123})

However, the cloner has already copied the leaf, so it doesn't notice this change. Now another leaf might grow at this height. Let's say leaf #10,012 grows to this height on the source.

> db.tree.update({_id:10012}, {$set : {height : 76.3}}) 

When the cloner gets to document #10,012, it'll copy it to the destination server. Now there are two documents with the same height field in the destination collection, so when it tries to create a unique index on “height”, the index creation will fail!

So, we moved the index creation to after the oplog application. That way, we’re always building the index on a consistent data set, so it should always succeed.
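
You can reproduce the failure by hand: build a unique index on a collection that already contains duplicate heights and you'll get something like the following (the exact error text varies a bit by version):

> db.tree.ensureIndex({height: 1}, {unique: true})
> db.getLastError()
E11000 duplicate key error index: test.tree.$height_1  dup key: { : 76.3 }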

There are a couple of other edge cases like that which have been super-fun to track down, which you can look up in Jira if you’re interested.

Restoring from Backup

Often, initial sync is too slow for people. If you want to get a secondary up and running as fast as possible, the best way to do so is to skip initial sync altogether and restore from backup. To restore from backup:

  1. Find a secondary you like the looks of.
  2. Either shut it down or fsync+lock it. Basically, get its data files into a clean state, so nothing is writing to them while you’re copying them.
  3. Copy the data files to your destination server. Make sure you get all of your data if you’re using any symlinks or anything.
  4. Start back up or unlock the source server.
  5. Point the destination server at the data you copied and start it up.

As there is already an oplog, it will not need to initial sync. It will begin syncing from another member’s oplog immediately when it starts up and usually catch up quite quickly.
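
For step 2, the fsync+lock part looks like this on the source secondary (recent shells also have db.fsyncLock()/db.fsyncUnlock() helpers that wrap the same commands):

> db.runCommand({fsync: 1, lock: 1})
> // copy the data files somewhere safe, then unlock:
> db.$cmd.sys.unlock.findOne()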

Note that mongodump/mongorestore actually does not work very well as a “restoring from backup” strategy because it doesn’t give you an oplog. You can create one on your own and prime it, but it’s more work and more fiddly than just copying files. There is a feature request for mongorestore to be able to prime the oplog automatically, but it won’t be in 2.2.

P.S. Trees were done with an awesome program I recently discovered called ArtRage, which I highly recommend to anyone who likes painting/drawing. It “feels” like real paint.

Good Night, Westley: Time-To-Live Collections

In The Princess Bride, every night the Dread Pirate Roberts tells Westley: “Good night, Westley. Good work. Sleep well. I’ll most likely kill you in the morning.”

Let’s say the Dread Pirate Roberts wants to optimize this process, so he stores prisoners in a database. When he captures Westley, he can put:

> db.prisoners.insert({
... name: "Westley",
... sentenceStart: new Date()
... })

However, now he has to run some sort of cron job that runs all the time in order to kill everyone who needs killing and keep his database up-to-date.

Enter time-to-live (TTL) collections. TTL collections are going to be released in MongoDB 2.2 and they’re collections where documents expire in a more controlled way than with capped collections.

What the Dread Pirate Roberts can do is:

> db.prisoners.ensureIndex({sentenceStart: 1}, {expireAfterSeconds: 24*60*60})

Now, MongoDB will regularly comb this index looking for docs to expire (so it’s actually more of a TTL index than a TTL collection).

Let’s try it out ourselves. You’ll need to download version 2.1.2 or higher to use this feature. Start up the mongod and run the following in the Mongo shell:

> db.prisoners.ensureIndex({sentenceStart: 1}, {expireAfterSeconds: 30})

We’re on a schedule here, so our pirate ship is more brutal: death after 30 seconds. Let’s take aboard a prisoner and watch him die.

> var start = new Date()
> db.prisoners.insert({name: "Haggard Richard", sentenceStart: start})
> while (true) { 
... var count = db.prisoners.count(); 
... print("# of prisoners: " + count + " (" + (new Date() - start) + "ms)");
... if (count == 0) 
...      break; 
... sleep(4000); }
# of prisoners: 1 (2020ms)
# of prisoners: 1 (6021ms)
# of prisoners: 1 (10022ms)
# of prisoners: 1 (14022ms)
# of prisoners: 1 (18023ms)
# of prisoners: 1 (22024ms)
# of prisoners: 1 (26024ms)
# of prisoners: 0 (30025ms)

…and he’s gone.

Edited to add: Stennie pointed out that the TTL job only runs once a minute, so YMMV on when Westley gets bumped.

Conversely, let’s say we want to play the “maybe I’ll kill you tomorrow” game and keep bumping Westley’s expiration date. We can do that by updating the TTL-indexed field:

> db.prisoners.insert({name: "Westley", sentenceStart: new Date()})
> for (i = 0; i < 60; i++) {
... db.prisoners.update({name: "Westley"}, {$set: {sentenceStart: new Date()}});
... sleep(1000);
... }
> db.prisoners.count()
1

…and Westley’s still there, even though it’s been more than 30 seconds.

Once he gets promoted and becomes the Dread Pirate Roberts himself, he can remove himself from the execution rotation by changing his sentenceStart field to a non-date (or removing it altogether):

> db.prisoners.update({name: "Westley"}, {$unset : {sentenceStart: 1}});

When not on pirate ships, developers generally use TTL collections for sessions and other cache expiration problems. If ye be wanting a less grim introduction to MongoDB’s TTL collections, there are some docs on it in the manual.
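
A session-expiration setup looks just like the pirate ship, only less grim (the collection and field names here are made up):

> db.sessions.ensureIndex({lastActivity: 1}, {expireAfterSeconds: 60*60})
> db.sessions.insert({user: "inigo", lastActivity: new Date()})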

The Snail Crawls On…

A bit of housekeeping: I’ve changed domain names, now this site is kchodorow.com, not snailinaturtleneck.com. Old links should still work, they’ll just be permanent redirects to the new domain. Please let me know if you come across any issues!

Why the change?

kchodorow.com sounds more professional and is shorter. Also, I’m after fame and glory here (in a very niche market) and having people remember “that snail blog” instead of my name doesn’t help.

So why was it Snail in a Turtleneck in the first place?

I started this site as a place to put up cartoons and it seemed like a cute name for my comics. Since the site's inception, the content has gotten more and more technical, and it's now basically just a programming blog (although I still dream of being the Hyperbole-and-a-Half of tech blogging).

I also slapped a new coat of paint on the site design, so if you startle easily, don’t worry that it looks like a completely different site under a completely different domain.

Replica Set Internals Bootcamp Part III: Reconfiguring

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Reconfiguring Prerequisites

One of the goals is to not let you reconfigure yourself into a corner (e.g., end up with all arbiters), so reconfig tries to make sure that a primary could be elected with the new config. Basically, we go through each node, tally up how many votes there will be, and check that a majority of those voters are up (the reconfig logic sends out heartbeats to check).

Also, the member you send the reconfig to has to be able to become primary in the new setup. It doesn’t have to become primary, but its priority has to be greater than 0. So, you can’t have all of the members have a priority of 0.

The reconfig also checks the version number, set name, and that nothing is going to an illegal state (e.g., arbiter-to-non-arbiter, upping the priority on a slave delayed node, and so on).

One thing to note is that you can change hostnames in a reconfig. If you’re using localhost for a single-node set and want to change it to an externally resolvable hostname so you can add some other members, you can just change the member’s hostname from localhost to someHostname and reconfig (so long as someHostname resolves, of course).
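
In the shell, that kind of hostname change is just a fetch-modify-reconfig (the member index and someHostname here are whatever applies to your set):

> var cfg = rs.conf()
> cfg.members[0].host = "someHostname:27017"
> rs.reconfig(cfg)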

Additive Reconfiguration vs. Full Reconfigs

Once the reconfiguration has been checked for correctness, MongoDB checks to see if this is a simple reconfig or a full reconfig. A simple reconfig adds a new node. Anything else is a full reconfig.

A simple reconfig starts a new heartbeat thread for the new member and it’s done.

A full reconfig clears all state. This means that the current primary closes all connections. All the current heartbeat threads are stopped and a new heartbeat thread for each member is started. The old config is replaced by the new config. Then the member formerly known as primary becomes primary again.

We definitely take a scorched-earth approach to reconfiguring. If you are, say, changing the priority of a node from 0 to 1, it would make more sense to change that field than to tear down the whole old config. However, we didn’t want to miss an edge case, so we went with better safe than sorry. Reconfig is considered a “slow” operation anyway, so we’ll generally make the tradeoff of slower and safer.

Propagation of Reconfiguration

Even if you have a node that is behind on replication or slave delayed, reconfiguration will propagate almost immediately. How? New configs are communicated via heartbeat.

Suppose you have 2 nodes, A and B.

You run a reconfig on A, changing the version number from 6 to 7.

B sends a heartbeat request to A, which includes a field stating that B's version number is 6.

When A gets that heartbeat request, it will see that B's config version is less than its own, so it'll send back its config (at version 7) as part of its heartbeat response.

When B sees that new config, it’ll load it (making the same checks for validity that A did originally) and follow the same procedure described above.
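
You can watch this happen by checking the config version on each member; once the new config has propagated, every member will report the same number:

> rs.conf().version
7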


Forcing Reconfig

Despite the checks made by reconfig, users sometimes get into a situation where they don’t have a primary. They’d permanently lose a couple servers or a data center and suddenly be stuck with a bunch of secondaries and no way to reconfig. So, in 2.0, we added a force:true option to reconfig, which allowed it to be run on a secondary. That is all that force:true does. Sometimes people complain that force:true wouldn’t let them load an invalid configuration. Indeed, it won’t. force:true does not relax any of the other reconfig constraints. You still have to pass in a valid config. You can just pass it to a secondary.
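
In the shell, a forced reconfig looks like this (run it against a reachable secondary):

> var cfg = rs.conf()
> // trim cfg.members down to the servers you actually have left, then:
> rs.reconfig(cfg, {force: true})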

Why is my version number 6,203,493?

When you force-reconfigure a set, it adds a random (big) number to the version, which can be unnerving. Why does the version number jump by thousands? Suppose that we have a network partition and force-reconfigure the set on both sides of the partition. If we ended up with both sides having a config version of 8 and the set got reconnected, then everyone would assume they were in sync (everyone has a config version of 8, no problems here!) and you’d have half of your nodes with one config and half with another. By adding a random number to the version on reconfig, it’s very probable that one “side” will have a higher version number than the other. When the network is fixed, whichever side has a higher version number will “win” and your set will end up with a consistent config.

It might not end up choosing the config you want, but some config is better than the set puttering along happily with two primaries (or something stupid like that). Basically, if shenanigans happen during a network partition, check your config after the network is healthy again.

Removing Nodes and Sharding

I’d just like to rant for a second: removing nodes sucks! You’d think it would be so easy, right? Just take the node out of the config and boom, done. It turns out it’s a total nightmare. Not only do you have to stop all of the replication stuff happening on the removed node, you have to stop everything the rest of the set is doing with that node (e.g., syncing from it).

You also have to change the way the removed node reports itself so that mongos won’t try to update a set’s config from a node that’s been removed. And you can’t just shut it down, because people want to be able to play around and do rs.add("foo"); rs.remove("foo"); rs.add("foo"), so you have to be able to entirely shut down the replica set’s interaction with the removed node, but in a way that can be restarted on a dime.

Basically, there are a lot of edge cases around removing nodes, so if you want to be on the safe side, shut down a node before removing it from the set. However, Eric Milkie has done a lot of awesome work on removing nodes for 2.2, so it should be getting better.
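
In shell terms, the cautious removal sequence is (the hostname here is a placeholder):

> // on the member you're removing:
> use admin
> db.shutdownServer()
> // then, on the primary:
> rs.remove("removedHost:27017")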

10 Kindle Apps for the Non-Existent Developer API

The Kindle should have a developer API. Ereaders could be revolutionizing the way people read, but right now they’re like paperbacks without the nice book smell.

I’ve heard a lot of people say, “the Kindle isn’t powerful enough for apps.” Poppycock. I’m not talking about using it to play Angry Birds, I’m talking about stuff a calculator could zoom through and would actually improve the reading experience.

So I present 10 apps that would be super-useful, require few resources, and (in some cases) increase profits:

  1. A “more content” button for magazines. If I’m reading a good magazine, I’d love to be able to get $10 more of content when I’m done. It’d be like giving a rat a lever that dispenses pellets. Yum, reading pellets.
  2. Renaming books. Apparently my workflow is defective, because I’ll often end up with 6 titles named “Imported PDF”, and there is no way to distinguish the one I want other than opening each PDF until I find it. If I could just rename the damn things…
  3. Support for other organizational schemes. Some people like tags (like whoever wrote Gmail, apparently) and everyone else likes hierarchical folders. I hate tags, I want things neatly tucked away in Sci Fi/Nebula Awards/Short Stories, not a franken-tag like “Sci Fi – Nebula Awards – Short Stories” (okay, it’s equivalent, but I hate tags).
  4. In technical books, how often is there a diagram that you keep flipping back and forth to for the next 10 to 15 pages? It would be nice to be able to “pin” it to the top of the screen as you read all of the text related to the diagram.
  5. Goodreads integration. When I finish a book, I want to rate it and have it automatically added to my “read” shelf at Goodreads.
  6. Related to above: recommendations when I finish a book and rate it. If I just rated it five stars, show me other books people who loved this book liked. If I rated it one star, show me books people who hated this book liked.
  7. Related to above (again): list my Amazon recommendations inline with my list of books. This would be a money-spinner for them, I think, because Amazon’s recommendation engine is freakishly accurate (except when it gets thrown out of whack by holiday shopping). If I was looking at my Kindle and saw a list of books I really wanted to read a click away… well, I’d be much poorer.
  8. Make looking up words pluggable to different search engines (Wikipedia, Urban Dictionary, D&D Compendium, etc.). I was recently reading “Crime and Punishment” and came across the term “yellow ticket.” The built-in dictionary knew what “yellow” was, and it knew what “ticket” was, but that didn’t help a whole lot (answer: an id given to Russian prostitutes).
  9. Update support. Technical books especially can benefit from this: O’Reilly has been working to do multiple quick releases of ebooks so that they can be updated as the technology changes. Imagine if you’re opening up your well-thumbed copy of Scaling MongoDB and a dialog pops up: “Version 1.2 of Scaling MongoDB is available, covering changes in MongoDB 2.2. Would you like to download? [Yes/No]”. However, the support just isn’t there on the device side. (And a new version of Scaling MongoDB isn’t available yet, sorry.)
  10. Metrics. As an author, I would love to know how long it took someone to read a page, how many times they came back to it, and when they put the book down and went to do something else. Authors have never been able to get this level of feedback before and I think it would revolutionize writing. Basic user tracking would be amazing.

I’m not sure why Amazon doesn’t have a dev API, but I’d imagine that part of the reason is that most publishers would not like it. However, I think Amazon is big enough to crush them into submission. I hope that they will hurry up and do so.

If anyone has any ideas on how to get Amazon to implement a developer API, please comment!

P.S. I know about the API here, but that’s essentially for Angry-Birds-type apps. I’m looking for an API that lets you mess with the reader.

--thursday #5: diagnosing high readahead

Having readahead set too high can slow your database to a crawl. This post discusses why that is and how you can diagnose it.

The #1 sign that readahead is too high is that MongoDB isn’t using as much RAM as it should be. If you’re running MongoDB Monitoring Service (MMS), take a look at the “resident” size on the “memory” chart. Resident memory can be thought of as “the amount of space MongoDB ‘owns’ in RAM.” Therefore, if MongoDB is the only thing running on a machine, we want resident size to be as high as possible. Suppose resident is ~3GB on our example machine.
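
If you're not running MMS, you can get the same numbers from the shell; db.serverStatus().mem reports resident and mapped sizes in megabytes:

> db.serverStatus().mem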

Is 3GB good or bad? Well, it depends on the machine. If the machine only has 3.5GB of RAM, I’d be pretty happy with 3GB resident. However, if the machine has, say, 15GB of RAM, then we’d like at least 15GB of the data to be in there (the “mapped” field is (sort of) data size, so I’m assuming we have 60GB of data).

Assuming we’re accessing a lot of this data, we’d expect MongoDB’s resident set size to be 15GB, but it’s only 3GB. If we turn down readahead, the resident size jumps to 15GB and our app starts going faster. But why is this?

Let’s take an example: suppose all of our docs are 512 bytes in size (readahead is set in 512-byte increments, called sectors, so 1 doc = 1 sector makes the math easier). If we have 60GB of data then we have ~120 million documents (60GB of data/(512 bytes/doc)). The 15GB of RAM on this machine should be able to hold ~30 million documents.

Our application accesses documents randomly across our data set, so we’d expect MongoDB to eventually “own” (have resident) all 15GB of RAM, as 1) it’s the only thing running and 2) it’ll eventually fetch at least 15GB of the data.

Now, let’s set our readahead to 100 (100 512-byte sectors, aka 100 documents): blockdev --setra 100 /dev/sda. What happens when we run our application?

Picture our disk as looking like this, where each o is a document:

...
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
... // keep going for millions more o's

Let’s say our app requests a document. We’ll mark it with “x” to show that the OS has pulled it into memory:

...
ooooooooooooooooooooooooo
ooooxoooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

See it on the third line there? But that’s not the only doc that’s pulled into memory: readahead is set to 100 so the next 99 documents are pulled into memory, too:

...
ooooooooooooooooooooooooo
ooooxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

Is your OS returning this with every document?

Now we have 100 docs in memory, but remember that our application is accessing documents randomly: the likelihood that the next document we access is in that block of 100 docs is almost nil. At this point, there’s 50KB of data in RAM (512 bytes * 100 docs = 51,200 bytes) and MongoDB’s resident size has only increased by 512 bytes (1 doc).

Our app will keep bouncing around the disk, reading docs from here and there and filling up memory with data MongoDB never asked for, until RAM is completely full of junk that’s never been used. Then, it’ll start evicting things to make room for new junk as our app continues to make requests.

Working this out, there’s a 25% chance of our app requesting a doc that’s already in memory, so 75% of the requests are going to go to disk. Say we’re doing 2 requests a second. Then 1 hour of requests is 2 requests * 3600 seconds/hour = 7200 requests, 5400 of which are going to disk (.75 * 7200). If each of those requests pulls back 50KB, that’s roughly 265MB read from disk/hour. If readahead were 0 (each miss pulling back just the 512-byte doc), we’d read under 3MB from disk/hour.
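
Working that arithmetic out in the shell (these are just the example's numbers):

> var hitRate = 15/60
> var diskReqsPerHour = (1 - hitRate) * 2 * 3600
> diskReqsPerHour
5400
> diskReqsPerHour * 50 / 1024      // MB/hour from disk with readahead 100 (50KB per miss)
263.671875
> diskReqsPerHour * 0.5 / 1024     // MB/hour from disk with readahead 0 (512 bytes per miss)
2.63671875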

Which brings us to the next symptom of a too-high readahead: unexpectedly high disk IO. Because most of the data we want isn’t in memory, we keep having to go to disk, dragging shopping-carts full of junk into RAM, perpetuating the high disk io/low resident mem cycle.

The general takeaway is that a DB is not a “normal” workload for an OS. The default settings may screw you over.

Replica Set Internals Bootcamp Part IV: Syncing

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

I’m going to do one subject per post, we’ll see how many I can get through. I’m punting on reconfig for now because it’s finicky to write about.

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Syncing

When a secondary is operating normally, it chooses a member to sync from (more on that below) and starts pulling operations from the source’s local.oplog.rs collection. When it gets an op (call it W, for “write”), it does three things:

  1. Applies the op
  2. Writes the op to its own oplog (also local.oplog.rs)
  3. Requests the next op

If the db crashes between 1 & 2 and then comes back up, it’ll think it hasn’t applied W yet, so it’ll re-apply it. Luckily (i.e., due to massive amounts of hard work), oplog ops are idempotent: you can apply W once, twice, or a thousand times and you’ll end up with the same document.

For example, if you have a doc that looks like {counter:1} and you do an update like {$inc:{counter:1}} on the primary, you’ll end up with {counter:2} and the oplog will store {$set:{counter:2}}. The secondaries will replicate that instead of the $inc.
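
You can see this by peeking at the last entry in the oplog right after such an update; it'll look something like this (the ts and h values here are made up):

> use local
> db.oplog.rs.find().sort({$natural: -1}).limit(1)
{ "ts" : Timestamp(1341861600, 1), "h" : NumberLong("1234567890"), "op" : "u",
  "ns" : "test.foo", "o2" : { "_id" : 1 }, "o" : { "$set" : { "counter" : 2 } } }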

w

To ensure a write is present on, say, two members, you can do:

> db.runCommand({getLastError: 1, w: 2})

Syntax varies based on language, consult your API docs, but it’s always an option for writes. The way this works is kind of cool.

Suppose you have a member called primary and another member syncing from it, called secondary. How does primary know where secondary is synced to? Well, secondary is querying primary's oplog for more results. So, if secondary requests an op written at 3pm, primary knows secondary has replicated all ops written before 3pm.

So, it goes like:

  1. Do a write on primary.
  2. Write is written to the oplog on primary, with a field “ts” saying the write occurred at time t.
  3. {getLastError:1,w:2} is called on primary. primary has done the write, so it is just waiting for one more server to get the write (w:2).
  4. secondary queries the oplog on primary and gets the op.
  5. secondary applies the op from time t.
  6. secondary requests ops with {ts:{$gt:t}} from primary's oplog.
  7. primary updates that secondary has applied up to t because it is requesting ops > t.
  8. getLastError notices that primary and secondary both have the write, so w:2 is satisfied and it returns.
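
One practical note: if no second member ever gets the write (say it's down), {getLastError:1, w:2} will just block, so you usually pair it with a timeout in milliseconds:

> db.runCommand({getLastError: 1, w: 2, wtimeout: 5000})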

Starting up

When you start up a node, it takes a look at its local.oplog.rs collection and finds the latest entry in there. This is called the lastOpTimeWritten and it’s the latest op this secondary has applied.

You can always use this shell helper to get the current last op in the shell:

> rs.debug.getLastOpWritten()

The “ts” field is the last optime written.

If a member starts up and there are no entries in the oplog, it will begin the initial sync process, which is beyond the scope of this post.

Once it has the last op time, it will choose a target to sync from.

Who to sync from

As of 2.0, servers automatically sync from whoever is “nearest” based on average ping time. So, if you bring up a new member it starts sending out heartbeats to all the other members and averaging how long it takes to get a response. Once it has a decent picture of the world, it’ll decide who to sync from using the following algorithm:

for each member that is healthy:
    if member[state] == PRIMARY
        add to set of possible sync targets

    if member[lastOpTimeWritten] > our[lastOpTimeWritten]
        add to set of possible sync targets

sync target = member with the min ping time from the possible sync targets

The definition of “member that is healthy” has changed somewhat over the versions, but generally you can think of it as a “normal” member: a primary or secondary. In 2.0, “healthy” debatably includes slave delayed nodes.

You can see who a server is syncing from by running db.adminCommand({replSetGetStatus:1}) and looking at the “syncingTo” field (only present on secondaries). Yes, yes, it probably should have been syncingFrom. Backwards compatibility sucks.

Chaining Slaves

The algorithm for choosing a sync target means that slave chaining is semi-automatic: start up a server in a data center and it’ll (probably) sync from a server in the same data center, minimizing WAN traffic. (Note that you can’t end up with a sync loop, i.e., A syncing from B and B syncing from A, because a secondary can only sync from another secondary with a strictly higher optime.)

One cool thing to implement was making w work with slave chaining. If A is syncing from B and B is syncing from C, how does C know where A is synced to? The way this works is that it builds on the existing oplog-reading protocol.

When A starts syncing from B (or any server starts syncing from another server), it sends a special “handshake” message that basically says, “Hi, I’m A and I’ll be syncing from your oplog. Please track this connection for w purposes.”

When B gets this message, it says, “Hmm, I’m not primary, so let me forward that along to the member I’m syncing from.” So it opens a new connection to C and says “Pretend I’m A, I’ll be syncing from your oplog on A's behalf.” Note that B now has two connections open to C, one for itself and one for A.

Whenever A requests more ops from B, B sends the ops from its oplog and then forwards a dummy request to C along “A's” connection to C. A doesn’t even need to be able to connect directly to C.

A <====> B <====> C
         B <----> C   (B's second connection to C, on A's behalf)

<====> is a “real” sync connection. The connection between B and C on A’s behalf is called a “ghost” connection (<---->).

On the plus side, this minimizes network traffic. On the negative side, the absolute time it takes a write to get to all members is higher.

Coming soon to a replica set near you…

In 2.2, there will be a new command, replSetSyncFrom, that lets you change who a member is syncing from, bypassing the “choosing a sync target” logic.

> db.adminCommand({replSetSyncFrom:"otherHost:27017"})

Night of the Living Dead Ops

MongoDB users often ask about the “killed” field in db.currentOp() output. For example, if you’ve run db.killOp(), you might see something like:

> db.currentOp()
{
	"inprog" : [
		{
			"opid" : 3062962,
			"active" : true,
			"lockType" : "write",
			"waitingForLock" : false,
			"secs_running" : 32267,
			"op" : "update",
			"ns" : "httpdb.servers",
			"query" : {
				"_id" : "150.237.88.189"
			},
			"client" : "127.0.0.1:50416",
			"desc" : "conn",
			"threadId" : "0x2900c400",
			"connectionId" : 74,
			"killed" : true,
			"numYields" : 0
		},
		{
			"opid" : 3063051,
			"active" : false,
			"lockType" : "read",
			"waitingForLock" : true,
			"op" : "query",
			"ns" : "",
			"query" : {
				"count" : "servers",
				"query" : {
					"code" : {
						"$gte" : 200
					}
				}
			},
			"client" : "127.0.0.1:30736",
			"desc" : "conn",
			"threadId" : "0x29113700",
			"connectionId" : 191,
			"killed" : true,
			"numYields" : 0
		}
        ]
}

The operation looks dead… it has killed:true, right? But you can run db.currentOp() again and again and the op doesn’t go away, even though it’s been “killed.” So what’s up with that?

Chainsaws: the kill -9 of living dead

It has to do with how MongoDB handles killed operations. When you run db.killOp(3062962), MongoDB looks up operation 3062962 in a hashtable and sets its killed field to true. However, the code running that op gets to decide whether to even check that field and how to deal with it appropriately.

There are basically three ways MongoDB ops handle getting killed:

  • Ones that die when they yield whatever lock they’re holding. This means that if the op never yields (note that numYields is 0 in the example above), it will never be killed.
  • Ones that can be killed at certain checkpoints. For example, index builds happen in multiple stages and check killed between stages. (Many commands do this, too.)
  • Ones that cannot be killed at all. For example, rsSync, the name for the op applying replication, falls into this category. There are some sharding commands that can’t be killed, too.

There is no kill -9 equivalent in MongoDB (other than kill -9-ing the server itself): if an op doesn’t know how to safely kill itself, it won’t die until it’s good and ready. Which means that you can have a “killed” op in db.currentOp() output for a long time. killed might be better named killRequested.

Also, if you kill an operation before it acquires a lock, it’ll generally start executing anyway (e.g., op 3063051 above). For example, try opening up a shell and make the db hold the writelock for 10 minutes:

> db.eval("sleep(10*60*1000)")

While that’s running, in another shell, try doing an insert (which will block, as the db cannot do any writes while the db.eval() is holding the writelock).

> db.foo.insert({x:1})

Now, in a third shell, kill the insert we just did (before the 10 minutes elapse):

> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 25,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,
                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
                                
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "numYields" : 0
                }
        ]
}
> // get the opId of the insert from currentOp
> db.killOp(455949)
{ "info" : "attempting to kill op" }
> // run currentOp again to see that killed:true
> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 221,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,
                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
                                
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "killed" : true,
                        "numYields" : 0
                }
        ]
}

If you wait 10 minutes for the db.eval() to finish, then do a find on db.foo, you’ll see that {x:1} was actually inserted anyway. This is because the op’s lifecycle looks something like:

  • Wait for lock
  • Acquire lock!
  • Start running
  • Yield lock
  • Check for killed

So it’ll run a bit before dying (if it can be killed at all), which may produce unintuitive results.

--thursday #4: blockdev

Disk IO is slow. You just won’t believe how vastly, hugely, mind-bogglingly slow it is. I mean, you may think your network is slow, but that’s just peanuts to disk IO.

There’s a great image visualizing just how slow it is, originally found on Hacker News and inspired by Gustavo Duarte’s blog.

The kernel knows how slow the disk is and tries to be smart about accessing it. It not only reads the data you requested, it also returns a bit more. This way, if you’re reading through a file or watching a movie (sequential access), your system doesn’t have to go to disk as frequently because you’re pulling more data back than you strictly requested each time.

You can see how far the kernel reads ahead using the blockdev tool:

$ sudo blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     80026361856   /dev/sda
rw   256   512  4096       2048     80025223168   /dev/sda1
rw   256   512  4096          0   2000398934016   /dev/sdb
rw   256   512  1024       2048        98566144   /dev/sdb1
rw   256   512  4096     194560      7999586304   /dev/sdb2
rw   256   512  4096   15818752     19999490048   /dev/sdb3
rw   256   512  4096   54880256   1972300152832   /dev/sdb4

Readahead is listed in the “RA” column. As you can see, I have two disks (sda and sdb) with readahead set to 256 on each. But what unit is that 256? Bytes? Kilobytes? Dolphins? If we look at the man page for blockdev, it says:

$ man blockdev
...
       --setra N
              Set readahead to N 512-byte sectors.
...

This means that my readahead is 512 bytes*256=131072 or 128KB. That means that, whenever I read from disk, the disk is actually reading at least 128KB of data, even if I only requested a few bytes.

So what value should you set your readahead to? Please don’t set it to a number you find online without understanding the consequences. If you Google for “blockdev setra”, the first result uses blockdev --setra 65536, which translates to 32MB of readahead. That means that, whenever you read from disk, the disk is actually doing 32MB worth of work. Please do not set your readahead this high if you’re doing a lot of random-access reads and writes, as all of the extra IO can slow things down a lot (and if you’re low on memory, you’ll be forcing the kernel to fill up your RAM with data you won’t need).
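
If you do want to experiment, setting and re-checking readahead looks like this (the value 32, i.e. 16KB, is purely illustrative, not a recommendation):

$ sudo blockdev --setra 32 /dev/sda
$ sudo blockdev --getra /dev/sda
32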

Getting a good readahead value can help disk IO issues to some extent, but if you are using MongoDB (in particular), please consider your typical document size and access patterns before changing your blockdev settings. I’m not recommending any particular value because what’s perfect for one application/machine can be death for another.

I’m really enjoying these --thursday posts because every week people have commented with different/better/interesting ways of doing what I talked about (or ways of telling the difference between stalagmites and stalactites), which is really cool. So I’m throwing this out there: how would you figure out what a good readahead setting is? Next week I’m planning to do iostat for --thursday, which should cover this a bit, but please leave a comment if you have any ideas.