Kristina Chodorow's Blog

Writing MongoDB: The Definitive Guide

MongoDB: The Definitive Guide is now available in bookstores everywhere! (Or at least on Amazon.) Please pick up a copy!

Some interesting things I learned about the process of publishing:

There are professional indexers who write the index.

This amazes me, because we had to proofread our index and I’ve never been so bored in my life. These people must have the exact opposite personality I do. And, in our case, they spelled “Ruby gems” as “Ruby germs.”

Blog posts are a better length

In 500 words, I can edit and polish something until it’s a shimmering jewel of a, uh, blog post. It’s really hard to make a hundred thousand words even have a reasonable flow, never mind be “perfect.”

Illustrations will be assimilated.

When we submitted the manuscript, I had (the night before) whipped up the illustrations in Photoshop that looked like this:

my2 — Every document is a beautiful snowflake (because they're all unique)

At the final stage of the editing process, these all got replaced by O’Reilly illustrations, which looked a lot more professional.

I’m pretty impressed by how well they matched what I was going for, but wish I hadn’t spent so long making those damn snowflakes.

An advance is an advance on sales.

In retrospect, I should have realized this, but I never really thought about it before. If O’Reilly advanced us $100,000 (they didn’t), that just means we wouldn’t get any royalty checks until people bought enough books to give us $100k in royalties. So, essentially, authors write books for free. This kind of amazes me.

All and all, it was really fun and I’d do it again in a heartbeat. In the future, I wouldn’t stick to the schedule quite as rigorously. At the beginning, O’Reilly gave us the following timeline:

3 months = 2 chapters
6 months = first half
9 months = whole book

I write best when I splorch down everything that comes to me as fast as possible and then edit it fifty times. So next time I’d do:

3 months = book of crap
6 months = semi-literate book
9 months = great American (technical) novel.

Andrew suggested we do the National Novel Writing Month, so now I’m trying to think of another thing to write about. I’ll probably do a MongoDB book, but not sure what yet…

Choose your own adventure: MongoDB crash recovery edition

Suppose your application is happily talking to MongoDB and your laptop battery runs out. Or your server bursts into flame. Or velociraptors attack your data center. What now?

To bring your server back up, read through the text until you get to a bold question. Click on the answer that best matches your situation to see the instructions. When you’ve finished an “adventure,” there’ll be a link to bring you back to the top (here).

Is your server physically okay?

Yes, it was just a power outage or something, I’m sure it’s fine.
No, the velociraptors sure did a number on it.

Recovering a physically damaged server.

This is beyond the scope of this article. Get a new server and then…

Do you have a backup?

Yes, I take backups every hour. Doesn’t everyone?
No, what kind of weenie takes backups?

Don’t recover.

If you didn’t do any writes during the session that shutdown uncleanly (this has happened to people), your data is fine. Remove the mongod.lock file and start your database back up.

Try another adventure.

Seriously?

Okay, fine, but it’s really big.
Seriously, I’ve got YouTube over here.

Recover from a backup.

If you have a recent backup, recovery is easy. Remove the entire data directory, replace with the backup. Start the database.

Try another adventure.

Did you do any writes during your last session?

Duh, it’s a database.
No, I was just reading data.

Single server “recovery”

If you have a single instance that shut down uncleanly, you may lose data! Use this as a painful learning experience and always run with replication in the future.

Since you only have this one copy of your data, you’ll have to repair whatever is there. Remove the mongod.lock file and start the database with –repair and any other options you usually use (if you usually use a config file or dbpath, put that in). repair has no way of knowing where mongod put data unless you tell it. It can’t repair data unless it can find it.

Please do not just remove the mongod.lock file and start the database back up. If you’ve got corrupt data, the database will start up fine but you’ll start getting really weird errors when you try to read data. The annoying mongod.lock file is there for a reason!

repair will make a full copy of the uncorrupted data (make sure you have enough disk space!) and remove any corrupted data. It can take a while for large data sets because it looks at every field of every document.

Note that MongoDB keeps a sort of “table of contents” at the beginning of your collection. If it was updating its table of contents when the unclean shutdown happened, it might not be able to find a lot of your data. This is how the “I lost all my documents!” horror stories happen. Run with replication, people!

Better luck next time.

Are you on EBS?

Yes.
No.

You ran with replication!

Thank you, you get a lollipop! There are lots of ways to recover with various levels of swiftness and ease, but first you need a master. If you are running a replica set (with or without sharding), you don’t need to do anything, the promotion will happen automatically (and you don’t need to change anything in your application, that will failover automatically, too).

If you’re not running a replica set, shut down your slave and restart it with –master. Point your application to the new master.

When you start back up the server that crashed, the way you should start it depends on if you’re using master-slave or replica sets. If you’re using master-slave, start your database back up with –slave and –source pointing to the new master. If you’re running a replica set, just start it with the same arguments you used before.

Are you in a hurry?

Yes, I need this server back up yesterday.
No, I’m willing to wait if it’s easier.

Recover quickly without a backup nor messing with your currently up servers.

Here’s where things stand: you have data at point A and you want to get it to point B. If you don’t have a backup, you’re going to have to create a snapshot of whatever’s at A and send it to B. To take a snapshot, you’ll have to make sure the files at A are in a consistent state, so you’ll have to suck it up and fsync and lock it. Or you can use replication, but that’ll take longer.

Next time, make a backup.

Now that you’ve thought it over…

Are you willing to make a server read-only for a bit?

Yes, if necessary.
No, I’m not.

Recover via file system snapshot.

This is generally super-fast, but it might not be supported by your filesystem.

If you’re running on EBS or using ZFS, you can take a file system snapshot of the new master and put it on the server that crashed. Then, start up mongod.

Try another adventure.

How big is your data?

I can move it around without too much pain.
Not easy to move.

Recover via replication.

This way is the easiest, but it’s also the slowest.

Remove everything in the data directory and start the database back up again. It’ll resync all of the data from the new master.

Try another adventure.

How about XFS (or some other files system that lets you take snapshots)?

Yes.
No.

Recover with –fastsync.

If you don’t mind making your new master read-only for a bit, you can get your other server back up pretty quickly and easily. First, fsync and lock the master, take a dump its files (or a snapshot, as described above) and put them on the server that went down. Start back up with –fastsync and unlock the master.

Try another adventure.

Pre-create the oplog.

If you have hundreds of gigabytes of data, syncing from scratch may not be practical and the amount of data might be too big to throw around in backups. This way is trickier, but faster than syncing from scratch (unless you’re using ext4, where this won’t give you any added benefit).

Wipe the data directory, then pre-create the local.* files. Make them ~20% of your data size, so if you have 100GB, make 20GB of local files:

for i in {0..10}; do
      echo $i
      head -c 2146435072 /dev/zero > /data/db/local.$i
done

Now start mongod up with an oplog size a bit smaller than the one you just created, e.g., –oplog 17000. It’ll still have to resync, but it’ll cut down on the file preallocation time.

Try another adventure.

Were you running with replication?

Yes, I’m sane.
No, I also enjoy skydiving without parachutes.

Recover via postal service.

If your data is unmovable, it’s unmovable. If you really have that much data, you can get pretty good data transfer rates by priority shipping it on disks. (And don’t feel too ridiculous, even Google does this, sometimes.)

Try another adventure.

Oh, the Mistakes I’ve Seen

lorax — *A slow database is easily fixed*
*If you make good choices of fields indexed.*
*Sometimes the answer is simpler still,*
*A quick code change may fit the bill.*

I’ll be giving an O’Reilly webcast, Scaling with MongoDB, on Friday (9/17). Please sign up if you’re interested in learning some more advanced optimization than what this post gets into. This webcast is, in part, to pimp MongoDB: The Definitive Guide, which will be coming out next week!

These are a few basic tips on making your application better/faster/stronger without knowing anything about indexes or sharding.

Connecting

Connecting to the database is a (relatively) expensive operation. Try to minimize the number of times you connect and disconnect: use persistent connections or connection pooling (depending on your language).

To not waste connections, you have to know what your driver is doing. I see a lot of code like this in PHP:

$connection = new Mongo();
$connection->connect();

What this does is:

The constructor connects to the database.
connect() sees that you’re already connected, assumes you want to reset the connection.
Disconnects from the database.
Connects again.

Gah! You just doubled your execution time.

ObjectIds

ObjectIds seem to make people vaguely uncomfortable, so they convert their ObjectIds into strings (the macaroni and cheese of data types). The problem is, an ObjectId takes up 12 bytes but its string representation takes up 29 bytes (almost two and a half times bigger). The lesson: suck it up and eat your spinachy ObjectIds. You’ll learn to like ’em.

Also, an ObjectId won’t sneakily convert itself into a string on the fly. I see a lot of code like:

id = new ObjectId();
db.foo.insert({"_id" : new ObjectId(id)});
// or, even sillier
db.foo.insert({"_id" : new ObjectId(id.toString())});

If you created an ObjectId and haven’t messed with it, it’s still an ObjectId.

Numbers vs. Strings

MongoDB is type-sensitive and it’s important to use the correct type: numbers for numeric values and strings for strings.

If you have large numbers and you save them as strings (“1234567890” instead of 1234567890), MongoDB may slow down as it strcmps the entire length of the number instead of doing a quicker numeric comparison. Also, “12” is going to be sorted as less than “9”, because MongoDB will use string, not numeric, comparison on the values. This can lead to some surprising results.

Driver-specific

Find out if you’re driver is particularly weaknesses (or strengths). For instance, the Perl driver is one of the fastest drivers, but it sucks at decoding Date types (Perl’s DateTime objects take a long time to create). So, if you want fast Perl programs, avoid dates like the plague or you’ll be puttering along with the Ruby programmers. (Just kidding, Rubyists! Sort of.)

The most important thing is to get to know your language’s documentation and ask if you have any questions.

A Quick Intro to mongosniff

Writing an application on top of a framework on top of a driver on top of the database is a bit like playing telephone: you say “insert foo” and the database says “purple monkey dishwasher.” mongosniff lets you see exactly what the database is hearing and saying.

It comes with the binary distribution, so if you have mongod you should have mongosniff.

To try it out, first start up an instance of mongod normally:

$ ./mongod

When you start up mongosniff you have to tell it to listen on the loopback (localhost) interface. This interface is usually called “lo”, but my Mac calls it “lo0”, so run ifconfig to make sure you have the name right. Now run:

$ sudo ./mongosniff --source NET lo
sniffing... 27017

Note the “sudo”: this has never worked for me from my user account, probably because of some stupid network permissions thing.

Now start up the mongo shell and try inserting something:

> db.foo.insert({x:1})

If you look at mongosniff‘s output, you’ll see:

127.0.0.1:57856  -->> 127.0.0.1:27017 test.foo  62 bytes  id:430131ca   1124151754
        insert: { _id: ObjectId('4c7fb007b5d697849addc650'), x: 1.0 }
127.0.0.1:57856  -->> 127.0.0.1:27017 test.$cmd  76 bytes  id:430131cb  1124151755
        query: { getlasterror: 1.0, w: 1.0 }  ntoreturn: -1 ntoskip: 0
127.0.0.1:27017  <<--  127.0.0.1:57856   76 bytes  id:474447bf  1195657151 - 1124151755
        reply n:1 cursorId: 0
        { err: null, n: 0, wtime: 0, ok: 1.0 }

There are three requests here, all for one measly insert. Dissecting the first request, we can learn:

source -->> destination: Our client, mongo in this case, is running on port 57856 and has sent a message to the database (127.0.0.1:27017).
db.collection: This request is for the “test” database’s “foo” collection.
length bytes: The length of the request is 62 bytes. This can be handy to know if your requests are edging towards the maximum request length (16 MB).
id:hex-id id: This is the request id in hexadecimal and decimal (in case you don’t have a computer handy, apparently). Every request to and from the database has a unique id associated with it for tax purposes.
op: content: This is the actual meat of the request: we’re inserting this document. Notice that it’s inserting the float value 1.0, even though we typed 1 in the shell. This is because JavaScript only has one number type, so every number typed in the shell is converted to a double.

The next request in the mongosniff output is a database command: it checks to make sure the insert succeeded (the shell always does safe inserts).

The last message sniffed is a little different: it is going from the database to the shell. It is the database response to the getlasterror command. It shows that there was only one document returned (reply n:1) and that there are no more results waiting at the database (cursorId: 0). If this were a “real” query and there was another batch of results to be sent from the database, cursorId would be non-zero.

Hopefully this will help some people decipher what the hell is going on!

Return of the Mongo Mailbag

On the mongodb-user mailing list last week, someone asked (basically):

I have 4 servers and I want two shards. How do I set it up?

A lot of people have been asking questions about configuring replica sets and sharding, so here’s how to do it in nitty-gritty detail.

The Architecture

Prerequisites: if you aren’t too familiar with replica sets, see my blog post on them. The rest of this post won’t make much sense unless you know what an arbiter is. Also, you should know the basics of sharding.

Each shard should be a replica set, so we’ll need two replica sets (we’ll call them foo and bar). We want our cluster to be okay if one machine goes down or gets separated from the herd (network partition), so we’ll spread out each set among the available machines. Replica sets are color-coded and machines are imaginatively named server1-4.

Each replica set has two hosts and an arbiter. This way, if a server goes down, no functionality is lost (and there won’t be two masters on a single server).

To set this up, run:

server1

$ mkdir -p ~/dbs/foo ~/dbs/bar
$ ./mongod --dbpath ~/dbs/foo --replSet foo
$ ./mongod --dbpath ~/dbs/bar --port 27019 --replSet bar --oplogSize 1

server2

$ mkdir -p ~/dbs/foo
$ ./mongod --dbpath ~/dbs/foo --replSet foo

server3

$ mkdir -p ~/dbs/foo ~/dbs/bar
$ ./mongod --dbpath ~/dbs/foo --port 27019 --replSet foo --oplogSize 1
$ ./mongod --dbpath ~/dbs/bar --replSet bar

server4

$ mkdir -p ~/dbs/bar
$ ./mongod --dbpath ~/dbs/bar --replSet bar

Note that arbiters have an oplog size of 1. By default, oplog size is ~5% of your hard disk, but arbiters don’t need to hold any data so that’s a huge waste of space.

Putting together the replica sets

Now, we’ll start up our two replica sets. Start the mongo shell and type:

> db = connect("server1:27017/admin")
connecting to: server1:27017
admin
> rs.initiate({"_id" : "foo", "members" : [
... {"_id" : 0, "host" : "server1:27017"},
... {"_id" : 1, "host" : "server2:27017"},
... {"_id" : 2, "host" : "server3:27019", arbiterOnly : true}]})
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}
> db = connect("server3:27017/admin")
connecting to: server3:27017
admin
> rs.initiate({"_id" : "bar", "members" : [
... {"_id" : 0, "host" : "server3:27017"},
... {"_id" : 1, "host" : "server4:27017"},
... {"_id" : 2, "host" : "server1:27019", arbiterOnly : true}]})
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}

Okay, now we have two replica set running. Let’s create a cluster.

Setting up Sharding

Since we’re trying to set up a system with no single points of failure, we’ll use three configuration servers. We can have as many mongos processes as we want (one on each appserver is recommended), but we’ll start with one.

server2

$ mkdir ~/dbs/config
$ ./mongod --dbpath ~/dbs/config --port 20000

server3

$ mkdir ~/dbs/config
$ ./mongod --dbpath ~/dbs/config --port 20000

server4

$ mkdir ~/dbs/config
$ ./mongod --dbpath ~/dbs/config --port 20000
$ ./mongos --configdb server2:20000,server3:20000,server4:20000 --port 30000

Now we’ll add our replica sets to the cluster. Connect to the mongos and and run the addshard command:

> mongos = connect("server4:30000/admin")
connecting to: server4:30000
admin
> mongos.runCommand({addshard : "foo/server1:27017,server2:27017"})
{ "shardAdded" : "foo", "ok" : 1 }
> mongos.runCommand({addshard : "bar/server3:27017,server4:27017"})
{ "shardAdded" : "bar", "ok" : 1 }

Edit: you must list all of the non-arbiter hosts in the set for now. This is very lame, because given one host, mongos really should be able to figure out everyone in the set, but for now you have to list them.

Tada! As you can see, you end up with one “foo” shard and one “bar” shard. (I actually added that functionality on Friday, so you’ll have to download a nightly to get the nice names. If you’re using an older version, your shards will have the thrilling names “shard0000” and “shard0001”.)

Now you can connect to “server4:30000” in your application and use it just like a “normal” mongod. If you want to add more mongos processes, just start them up with the same configdb parameter used above.

History of MongoDB

At LinuxCon, a guy took a look at me and said, “So, MongoDB was developed by college students, huh?” No, it was not. I couldn’t distribute my way out of a paper bag, which is why I’m not designing the database.

A lot of people have been curious about where MongoDB came from, so here is the (very non-official) history, present, and future:

In the Beginning…

MongoDB was created by the founders of DoubleClick. Since leaving DoubleClick, they founded a number of startups and kept running into the same scaling problems over and over. They decided to try to create an application stack that would scale out easily, as companies everywhere seemed to be running into the same issues.

In Fall 2007, they founded 10gen and started working on an application platform for the cloud, similar to Google App Engine. The 10gen engine’s main language was server-side JavaScript, so the scalable database they were designing for it (proto-MongoDB) was also JavaScript-y.

The 10gen appengine was called ed (for Eliot and Dwight) and the database was called p (for platform). In the summer of 2008, someone decided that they needed real names, so they came up with Babble for the app engine and MongoDB for the database. The name “Mongo” was, originally, from Blazing Saddles (it was back-named to humongous). I hated the name from the start and knew that it was slang for “mongoloid.” I sent an email to the list, no one responded, so I gave up. Now I know to make a ruckus.

Correction 07/01/2013: Dwight commented below, I was mistaken about the original of the name “Mongo:”

Actually the name choice really does come from the word humongous. A couple years earlier a naming consultant showed me a list of 50 names for potential companies and consumer web products, and it was in that deck, and the point made was that it connoted “big”. But as you say some folks joked about the movie reference when we gave that name to the db, and I didn’t elaborate on the logic behind the naming at the time I would guess. I certainly didn’t in my mind make a negative association about the name at the time; my last encounter with it before that point was probably in Shrek 2. I knew it was campy but it was just a piece of the tech stack at first, not then a big standalone product and technology as it is today. Of course I now know that in some parts of the world it’s an odd choice — apologies about that to those of you in those locales.

The problem was, no one cared about Google App Engine and certainly no one cared about 10gen’s app engine. Developers would say, “well, the database is cool, but blech, app engine.”

After a year of work and practically no users, we ripped the database out of the app engine and open sourced them. Immediately, MongoDB started getting users. We saw the IRC channel creep up from 20 users to 30 to 40… (as of this writing there are 250 people in the room).

The Present

We have a large and growing number of community contributors and 10gen has hired a bunch of incredible programmers, including a former Oracle kernel dev (who worked on some of the first distributed systems in the world) and a guy who worked on Google’s BigTable.

compare_chart — Number of contributors to the core server

The last year and a half has been incredibly cool. Not only are thousands of people using our programs, but people are building things on top of them, such as Casbah, Morphia, MongoMapper, Mongoose, CandyGram, MongoKit, Mongoid,Ming, MongoEngine, Pymongo-Bongo, ActiveMongo, Morph, and MongoRecord (very, very incomplete list). People have also been integrating it with various existing projects, such as Drupal, Doctrine, Django, ActiveRecord, Lighttd, and NGINX (again, there are tons of others). The community has also written dozens of drivers for everything from C# to Erlang to Go.

And a couple of sites are using it.

In a nutshell:

GUIs: We decided early on not to create a GUI for MongoDB and let the community sort one out, which has had mixed consequences: there are now over a dozen to choose from! (We’re still hoping it’ll settle down.)
Books: There are now at least four MongoDB books in the works.
The user list: This has grown from us (four people at 10gen) to over 2,500 users.
Packages: We have packages for tons of Linux/UNIX distributions, including Ubuntu, Debian, CentOS, Fedora, ArchLinux, etc.
Documentation: There are dozens of users adding documentation and translating it into French, Spanish, Portuguese, German, Chinese, Japanese, Italian, Russian, and Serbian.
Monitoring: People have created plugins for Munin, Ganglia, Nagios, Cacti, and a few others.
Twitter: Over 5,000 Twitter followers.
Consulting: Hashrocket, LightCube, Squeejee and Mijix provide MongoDB consulting.
Hosting: MongoHQ, MongoMachine are Mongo-specific, EngineYard, Dreamhost, ServerBeach, and Media Temple support it.

The Future

In the next major release (1.8) we’re planning to add singe-server durability and faster aggregation tools. There are already over 150 feature requests scheduled for 1.8 (never mind bug fixes), so obviously not everything is going to make it in. If there’s a feature you’d like to see, make sure you vote for it at jira.mongodb.org!

And 10gen is growing (we’ve just opened a California office). If you’re looking for a job where you can work on a really awesome open source project with some very brilliant programmers, 10gen is hiring.

If it quacks like a RDBMS…

MongoDB feels a lot like a relational database: you can think of documents as rows, do ad hoc queries, and create indexes on fields. There are, however, a ton of differences due to the data model, scalability considerations, and MongoDB’s youth. This can lead to some not-so-pleasant surprises for users. We (the developers) try to document the differences, but there are a few often-overlooked assumptions:

MongoDB assumes that you have a 64-bit machine.

You are limited to ~2GB of data on a 32-bit machine. This is annoying for a lot of people who develop on 32-bit machines. There’ll be a solution for this at some point, but it’s not high on our priority list because people don’t run 32-bit servers in production. (Okay, on rare occasions they do… but MongoDB is only 2 years old, give it a few more releases, we’ll support it eventually!) Speaking of things we’ll support eventually…

MongoDB assumes that you’re using a little-endian system.

Honestly, I assume this, too. When I hear about developers using PPC and Sparc, I picture a “Primitive Computing” diorama at the Natural History Museum.

On the plus side, all of the drivers work on big-endian systems, so you can run the database remotely and still do development on your old-school machine.

MongoDB assumes that you have more than one server.

Again, this is one of those things that’s a “duh” for production but bites people in the ass in development. MongoDB developers have worked super hard on replication, but that only helps if you have more than one server. (Single server durability is in the works for this fall.)

MongoDB assumes you want fast/unsafe, but lets you do slow/safe.

This design decision has turned out to be one of the most controversial we’ve made and has caused the most criticism. We try to make it clear in the documentation, but some people never notice that there’s a “safe” option for writes (that defaults to false), and then get very pissed when something wasn’t written.

MongoDB developers assume you’ll complain if something goes wrong.

This isn’t about the database per se, but the core developers are available on IRC, the mailing list, the bug tracker, and Twitter. Most of us subscribe to Google Alerts, Google Blog Search, the #mongodb hashtag, and so on. We try to make sure everyone gets an answer to their question and we’ll often fix bugs within a few hours of them being reported.

So, hopefully this will save some people some pain.

Buying an Mahattan Co-op

Today, Andrew and I close on an apartment in Chelsea.

About a year and a half ago, I subscribed to a feed from streeteasy.com, which is a really awesome site. They do custom RSS feeds for any search, so I subscribed to the places in our price range in the neighborhoods we were interested in.

On May 16th, we finally found an apartment we really liked (pictured above). Then the fun began. While I was at a PHP conference in Chicago (TEK·X), I called Charles Schwab and got pre-approved for a mortgage. Then we made an offer. The sellers made a counter offer, we made a counter-counter-offer, and everyone agreed before I got home. So far so good, now we had to get a lawyer to draw up the contract. We had no lawyer, but Andrew’s parents had one, so we asked him to represent us. We initially forgot to ask him how much he charged, but eventually remembered (because we’re real grownups like that).

Once we had a contract, the sellers’ lawyer spelled my name wrong, so they had to redo the contract. Once it actually had my name on it, everyone signed it and we sent the first half of the downpayment to the seller’s attorney.

The next step was the mortgage: we had to get “for real” approved, not just be pre-approved. We inquired with a few other banks and ended up getting a great rate from Wells Fargo. We filled out the first of many stupidly long applications and sent them copies of all of our financial statements for the last couple of months (savings, checking, investments, paystubs, retirement accounts, etc.).

At this point we figured we’d probably be breaking lease on our current place, so we told our current landlord (Citi-Urban Management) that we’d be moving out on August 16th (we were scheduled to close on the 2nd, so this gave us a few extra weeks).

Once we had a commitment letter from Wells Fargo, we could start filling out the co-op application. You see, in NYC you don’t buy an apartment, you buy shares in the corporation that owns the building. This means that the building can make you jump through hoops and balance biscuits on your nose, This coop wanted us to send: 1 personal statement, 3 personal letters of recommendation, 1 business letter of recommendation, a filed-out application form, another copy of all of our financial statements for the last few months, the last two years of tax returns, checks for move-in fees and deposits, checks for credit checks, employment verification contact information, copies of our paystubs, copies of our drivers licenses, a copy of the “house rules” that we had to initial on every page, and a statement that we’d be getting homeowners insurance. The sad part is that there was more, I’ve just forgotten a bunch of the items.

We sent everything in and the board called us in for an interview on Friday, July 30th. We sat in a basement next to the laundry room and they asked us why we wanted to own a place, whether I’d be able to get a job if 10gen went out of business (somehow they weren’t as concerned with Google going out of business), if we could make websites (Danger! Danger, Will Robinson!), and if we were planning on getting a joint checking account. After an hour, they let us go.

On Monday, August 2nd, the real estate agent called to let us know that we were in. Now all we had to do was actually close. Unfortunately, the sellers now wanted to put off closing as long as possible so that they could find a new place and we got a call from our landlord telling us that they had found a person to take our lease who would be moving in on the 13th, so we had to be out by then. Our lawyer bullied and guilted everyone into agreeing to close on the 16th, leaving us homeless for a scant week (we moved out on the 10th because I had to go to Boston and give a talk at LinuxCon on the 11th). Luckily, Andrew’s (effective) godparents own a bed-and-breakfast in the city and let us stay there.

During this week, the person we’d been dealing with at Wells Fargo went on vacation and the temp guy didn’t have any of our information and wasn’t sure if we’d be able to make the closing. Last week, as I freaked out in Boston, Andrew managed to bully and guilt him into doing his damn job.

Finally, today, we got a check for the rest of the downpayment, did a final walkthrough of the apartment, and spent 3 hours signing papers. Tonight, we sleep on a borrowed air mattress on the floor of our new apartment!

Sharding and Replica Sets Illustrated

This post assumes you know what replica sets and sharding are.

Step 1: Don’t use sharding

Seriously. Almost no one needs it. If you were at the point where you needed to partition your MySQL database, you’ve probably got a long ways to go before you’ll need to partition MongoDB (we scoff at billions of rows).

Run MongoDB as a replica set. When you really need the extra capacity then, and only then, start sharding. Why?

You have to choose a shard key. If you know the characteristics of your system before you choose a shard key, you can save yourself a world of pain.
Sharding adds complexity: you have to keep track of more machines and processes.
Premature optimization is the root of all evil. If you application isn’t running fast, is it CPU-bound or network-bound? Do you have too many indexes? Too few? Are they being hit by your queries? Check (at least) all of these causes, first.

Using Sharding

A shard is defined as one or more servers with one master. Thus, a shard could be a single mongod (bad idea), a master-slave setup (better idea), or a replica set (best idea).

Let’s say we have three shards and each one is a replica set. For three shards, you’ll want a minimum of 3 servers (the general rule is: minimum of N servers for N shards). We’ll do the bare minimum on replica sets, too: a master, primary, and arbiter for each set.

Mugs are MongoDB processes. So, we have three replica sets:

rss — "teal", "green", and "blue" replica sets

“M” stands for “master,” “S” stands for “slave,” and “A” stands for “arbiter.” We also have config servers:

…and mongos processes:

Now, let’s stick these processes on servers (serving trays are servers). Each master needs to do a lot, so let’s give each primary its own server.

Now we can put a slave and arbiter on each box, too.

Note how we mix things up: no replica set is housed on a single server, so that if a server goes down, the set can fail over to a different server and be fine.

Now we can add the three configuration servers and two mongos processes. mongos processes are usually put on the appserver, but they’re pretty lightweight so we’ll stick a couple on here.

A bit crowded, but possible!

In case of emergency…

Let’s say we drop a tray. CRASH! With this setup, your data is safe (as long as you were using w) and the cluster loses no functionality (in terms of reads and writes).

Chunks will not be able to migrate (because one of the config servers is down), so a shard may become bloated if the config server is down for a very long time.

Network partitions and losing two server are bigger problems, so you should have more than three servers if you actually want great availability.

Let’s start and configure all 14 processes at once!

Or not. I was going to go through the command to set this whole thing up but… it’s really long and finicky and I’ve already done it in other posts. So, if you’re interested, check out my posts on setting up replica sets and sharding.

Combining the two is left as an exercise for the reader.

Part 3: Replica Sets in the Wild

Make way for ducklings — A primary with 8 secondaries

This post assumes that you know what replica sets are and some of the basic syntax.
In part 1, we set up a replica set from scratch, but real life is messier: you might want to migrate dev servers into production, add new slaves, prioritize servers, change things on the fly… that’s what this post covers.

Before we get started…

Replica sets don’t like localhost. They’re willing to go along with it… kinda, sorta… but it often causes issues. You can get around these issues by using the hostname instead. On Linux, you can find your hostname by running the hostname command:

$ hostname
wooster

On Windows, you have to download Linux or something. I haven’t really looked into it.

From here on out, I’ll be using my hostname instead of localhost.

Starting up with Data

This is pretty much the same as starting up without data, except you should backup your data before you get started (you should always backup your data before you mess around with your server configuration).

If, pre-replica-set, you were starting your server with something like:

$ ./mongod

…to turn it into the first member of a replica set, you’d shut it down and start it back up with the –replset option:

$ ./mongod --replSet unicomplex

Now, initialize the set with the one server (so far):

> rs.initiate()
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}

Adding Slaves

You should always run MongoDB with slaves, so let’s add some.

Start your slave with the usual options you use, as well as –replSet. So, for example, we could do:

$ ./mongod --dbpath ~/dbs/slave1 --port 27018 --replSet unicomplex

Now, we add this slave to the replica set. Make sure db is connected to wooster:27017 (the primary server) and run:

> rs.add("wooster:27018")
{"ok" : 1}

Repeat as necessary to add more slaves.

Adding an Arbiter

This is very similar to adding a slave. In 1.6.x, when you start up the arbiter, you should give it the option –oplogSize 1. This way the arbiter won’t be wasting any space. (In 1.7.4+, the arbiter will not allocate an oplog automatically.)

$ ./mongod --dbpath ~/dbs/arbiter --port 27019 --replSet unicomplex --oplogSize 1

Now add it to the set. You can specify that this server should be an arbiter by calling rs.addArb:

> rs.addArb("wooster:27019")
{"ok" : 1}

Demoting a Primary

Suppose our company has the following servers available:

Gazillion dollar super machine
EC2 instance
iMac we found on the street

Through an accident of fate, the iMac becomes primary. We can force it to become a slave by running the step down command:

> imac = connect("imac.example.com/admin")
connecting to: imac.example.com/admin
admin
> imac.runCommand({"replSetStepDown" : 1})
{"ok" : 1}

Now the iMac will be a slave.

Setting Priorities

It’s likely that we never want the iMac to be a master (we’ll just use it for backup). You can force this by setting its priority to 0. The higher a server’s priority, the more likely it is to become master if the current master fails. Right now, the only options are 0 (can’t be master) or 1 (can be master), but in the future you’ve be able to have a nice gradation of priorities.

So, let’s get into the nitty-gritty of replica sets and change the iMac’s priority to 0. To change the configuration, we connect to the master and edit its configuration:

> config = rs.conf()
{
        "_id" : "unicomplex",
        "version" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "prod.example.com:27017"
                },
                {
                        "_id" : 1,
                        "host" : "ec2.example.com:27017"
                },
                {
                        "_id" : 2,
                        "host" : "imac.example.com:27017"
                }
        ]
}

Now, we have to do two things: 1) set the iMac’s priority to 0 and 2) update the configuration version. The new version number is always the old version number plus one. (It’s 1 right now so the next version is 2. If we change the config again, it’ll be 3, etc.)

> config.members[2].priority = 0
0
> config.version += 1
2

Finally, we tell the replica set that we have a new configuration for it.

> use admin
switched to db admin
> db.runCommand({"replSetReconfig" : config})
{"ok" : 1}

All configuration changes must happen on the master. They are propagated out to the slaves from there. Now you can kill any server and the iMac will never become master.

This configuration stuff is a bit finicky to do from the shell right now. In the future, most people will probably just use a GUI to configure their sets and mess with server settings.

Next up: how to hook this up with sharding to get a fault-tolerant distributed database.