If it quacks like an RDBMS…

It might be a turtle duck.
MongoDB feels a lot like a relational database: you can think of documents as rows, do ad hoc queries, and create indexes on fields. There are, however, a ton of differences due to the data model, scalability considerations, and MongoDB’s youth. This can lead to some not-so-pleasant surprises for users. We (the developers) try to document the differences, but there are a few often-overlooked assumptions:

MongoDB assumes that you have a 64-bit machine.

You are limited to ~2GB of data on a 32-bit machine. This is annoying for a lot of people who develop on 32-bit machines. There’ll be a solution for this at some point, but it’s not high on our priority list because people don’t run 32-bit servers in production. (Okay, on rare occasions they do… but MongoDB is only 2 years old, give it a few more releases, we’ll support it eventually!) Speaking of things we’ll support eventually…
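
The ~2GB figure falls out of address-space math: mongod memory-maps its data files, so on a 32-bit machine everything has to fit in the process’s virtual address space. A back-of-the-envelope sketch in Python (the “reserved” figure is a rough assumption, not a measurement):

```python
# A 32-bit process has 4 GiB of virtual addresses. mongod memory-maps its data
# files into that space, which it shares with the kernel's reservation, the
# binary, the heap, and thread stacks.
address_space = 2 ** 32      # 4 GiB total
reserved = 2 * 2 ** 30       # ~2 GiB not available for data files (ballpark)

usable_for_data = address_space - reserved
print(usable_for_data // 2 ** 30)  # -> 2 (GiB), hence the ~2GB limit
```

On a 64-bit machine the virtual address space is effectively unlimited, which is why the limit disappears there.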

MongoDB assumes that you’re using a little-endian system.

Honestly, I assume this, too. When I hear about developers using PPC and Sparc, I picture a “Primitive Computing” diorama at the Natural History Museum.

On the plus side, all of the drivers work on big-endian systems, so you can run the database remotely and still do development on your old-school machine.

MongoDB assumes that you have more than one server.

Again, this is one of those things that’s a “duh” for production but bites people in the ass in development. MongoDB developers have worked super hard on replication, but that only helps if you have more than one server. (Single server durability is in the works for this fall.)

MongoDB assumes you want fast/unsafe, but lets you do slow/safe.

This design decision has turned out to be one of the most controversial we’ve made and has caused the most criticism. We try to make it clear in the documentation, but some people never notice that there’s a “safe” option for writes (that defaults to false), and then get very pissed when something wasn’t written.
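
For example, in the mongo shell you can check a write yourself by asking for the last error afterwards (the drivers’ “safe” option does this round trip for you automatically; the exact spelling of the option varies by driver):

```
> db.foo.insert({"x" : 1})
> db.runCommand({"getLastError" : 1})
```

If “err” comes back null, the write made it to the server; otherwise you see the error that the default fire-and-forget mode would have silently swallowed.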

MongoDB developers assume you’ll complain if something goes wrong.

This isn’t about the database per se, but the core developers are available on IRC, the mailing list, the bug tracker, and Twitter. Most of us subscribe to Google Alerts, Google Blog Search, the #mongodb hashtag, and so on. We try to make sure everyone gets an answer to their question and we’ll often fix bugs within a few hours of them being reported.

So, hopefully this will save some people some pain.

Buying a Manhattan Co-op

Today, Andrew and I close on an apartment in Chelsea.

About a year and a half ago, I subscribed to a feed from streeteasy.com, which is a really awesome site. They do custom RSS feeds for any search, so I subscribed to the places in our price range in the neighborhoods we were interested in.

On May 16th, we finally found an apartment we really liked (pictured above). Then the fun began. While I was at a PHP conference in Chicago (TEK·X), I called Charles Schwab and got pre-approved for a mortgage. Then we made an offer. The sellers made a counter offer, we made a counter-counter-offer, and everyone agreed before I got home. So far so good; now we had to get a lawyer to draw up the contract. We had no lawyer, but Andrew’s parents had one, so we asked him to represent us. We initially forgot to ask him how much he charged, but eventually remembered (because we’re real grownups like that).

Once we had a contract, the sellers’ lawyer spelled my name wrong, so they had to redo the contract. Once it actually had my name on it, everyone signed it and we sent the first half of the downpayment to the sellers’ attorney.

The next step was the mortgage: we had to get “for real” approved, not just be pre-approved. We inquired with a few other banks and ended up getting a great rate from Wells Fargo. We filled out the first of many stupidly long applications and sent them copies of all of our financial statements for the last couple of months (savings, checking, investments, paystubs, retirement accounts, etc.).

At this point we figured we’d probably be breaking lease on our current place, so we told our current landlord (Citi-Urban Management) that we’d be moving out on August 16th (we were scheduled to close on the 2nd, so this gave us a few extra weeks).

Once we had a commitment letter from Wells Fargo, we could start filling out the co-op application. You see, in NYC you don’t buy an apartment, you buy shares in the corporation that owns the building. This means that the building can make you jump through hoops and balance biscuits on your nose. This co-op wanted us to send: 1 personal statement, 3 personal letters of recommendation, 1 business letter of recommendation, a filled-out application form, another copy of all of our financial statements for the last few months, the last two years of tax returns, checks for move-in fees and deposits, checks for credit checks, employment verification contact information, copies of our paystubs, copies of our driver’s licenses, a copy of the “house rules” that we had to initial on every page, and a statement that we’d be getting homeowners insurance. The sad part is that there was more; I’ve just forgotten a bunch of the items.

We sent everything in and the board called us in for an interview on Friday, July 30th. We sat in a basement next to the laundry room and they asked us why we wanted to own a place, whether I’d be able to get a job if 10gen went out of business (somehow they weren’t as concerned with Google going out of business), if we could make websites (Danger! Danger, Will Robinson!), and if we were planning on getting a joint checking account. After an hour, they let us go.

On Monday, August 2nd, the real estate agent called to let us know that we were in. Now all we had to do was actually close. Unfortunately, the sellers now wanted to put off closing as long as possible so that they could find a new place, and we got a call from our landlord telling us that they had found a person to take our lease who would be moving in on the 13th, so we had to be out by then. Our lawyer bullied and guilted everyone into agreeing to close on the 16th, leaving us homeless for a scant week (we moved out on the 10th because I had to go to Boston and give a talk at LinuxCon on the 11th). Luckily, Andrew’s (effective) godparents own a bed-and-breakfast in the city and let us stay there.

During this week, the person we’d been dealing with at Wells Fargo went on vacation and the temp guy didn’t have any of our information and wasn’t sure if we’d be able to make the closing. Last week, as I freaked out in Boston, Andrew managed to bully and guilt him into doing his damn job.

Finally, today, we got a check for the rest of the downpayment, did a final walkthrough of the apartment, and spent 3 hours signing papers. Tonight, we sleep on a borrowed air mattress on the floor of our new apartment!

Sharding and Replica Sets Illustrated

This post assumes you know what replica sets and sharding are.

Step 1: Don’t use sharding

Seriously. Almost no one needs it. If you were at the point where you needed to partition your MySQL database, you’ve probably got a long way to go before you’ll need to partition MongoDB (we scoff at billions of rows).

Run MongoDB as a replica set. When you really need the extra capacity, then (and only then) start sharding. Why?

  1. You have to choose a shard key. If you know the characteristics of your system before you choose a shard key, you can save yourself a world of pain.
  2. Sharding adds complexity: you have to keep track of more machines and processes.
  3. Premature optimization is the root of all evil. If your application isn’t running fast, is it CPU-bound or network-bound? Do you have too many indexes? Too few? Are they being hit by your queries? Check (at least) all of these causes first.

Using Sharding

A shard is defined as one or more servers with one master. Thus, a shard could be a single mongod (bad idea), a master-slave setup (better idea), or a replica set (best idea).

Let’s say we have three shards and each one is a replica set. For three shards, you’ll want a minimum of 3 servers (the general rule is: minimum of N servers for N shards). We’ll do the bare minimum on replica sets, too: a master, a slave, and an arbiter for each set.

Mugs are MongoDB processes. So, we have three replica sets:

"teal", "green", and "blue" replica sets

“M” stands for “master,” “S” stands for “slave,” and “A” stands for “arbiter.” We also have config servers:

A config server

…and mongos processes:

A mongos process

Now, let’s stick these processes on servers (serving trays are servers). Each master needs to do a lot, so let’s give each one its own server.

Now we can put a slave and arbiter on each box, too.

Note how we mix things up: no replica set is housed on a single server, so that if a server goes down, the set can fail over to a different server and be fine.

Now we can add the three configuration servers and two mongos processes. mongos processes are usually put on the appserver, but they’re pretty lightweight so we’ll stick a couple on here.

A bit crowded, but possible!

In case of emergency…

Let’s say we drop a tray. CRASH! With this setup, your data is safe (as long as you were using the “w” write-concern option) and the cluster loses no functionality (in terms of reads and writes).
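
“w” is a getLastError option that makes a write wait until it has replicated to w servers before returning. For example, from the shell (wtimeout caps the wait, in milliseconds):

```
> db.runCommand({"getLastError" : 1, "w" : 2, "wtimeout" : 500})
```

With "w" : 2, the write exists on at least two servers before the command returns, so losing a single tray can’t lose the write.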

Chunks will not be able to migrate (because one of the config servers is down), so a shard may become bloated if the config server is down for a very long time.

Network partitions and losing two servers are bigger problems, so you should have more than three servers if you actually want great availability.

Let’s start and configure all 14 processes at once!

Or not.  I was going to go through the command to set this whole thing up but… it’s really long and finicky and I’ve already done it in other posts. So, if you’re interested, check out my posts on setting up replica sets and sharding.

Combining the two is left as an exercise for the reader.

Part 3: Replica Sets in the Wild

A primary with 8 secondaries

This post assumes that you know what replica sets are and some of the basic syntax.
In part 1, we set up a replica set from scratch, but real life is messier: you might want to migrate dev servers into production, add new slaves, prioritize servers, change things on the fly… that’s what this post covers.

Before we get started…

Replica sets don’t like localhost. They’re willing to go along with it… kinda, sorta… but it often causes issues. You can get around these issues by using the hostname instead. On Linux, you can find your hostname by running the hostname command:

$ hostname

On Windows, you have to download Linux or something. I haven’t really looked into it.

From here on out, I’ll be using my hostname instead of localhost.

Starting up with Data

This is pretty much the same as starting up without data, except you should back up your data before you get started (you should always back up your data before you mess around with your server configuration).

If, pre-replica-set, you were starting your server with something like:

$ ./mongod

…to turn it into the first member of a replica set, you’d shut it down and start it back up with the --replSet option:

$ ./mongod --replSet unicomplex

Now, initialize the set with the one server (so far):

> rs.initiate()
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}

Adding Slaves

You should always run MongoDB with slaves, so let’s add some.

Start your slave with the usual options you use, as well as --replSet. So, for example, we could do:

$ ./mongod --dbpath ~/dbs/slave1 --port 27018 --replSet unicomplex

Now, we add this slave to the replica set. Make sure db is connected to wooster:27017 (the primary server) and run:

> rs.add("wooster:27018")
{"ok" : 1}

Repeat as necessary to add more slaves.

Adding an Arbiter

This is very similar to adding a slave. In 1.6.x, when you start up the arbiter, you should give it the option --oplogSize 1. This way the arbiter won’t be wasting any space. (In 1.7.4+, the arbiter will not allocate an oplog automatically.)

$ ./mongod --dbpath ~/dbs/arbiter --port 27019 --replSet unicomplex --oplogSize 1

Now add it to the set. You can specify that this server should be an arbiter by calling rs.addArb:

> rs.addArb("wooster:27019")
{"ok" : 1}

Demoting a Primary

Suppose our company has the following servers available:

  1. Gazillion dollar super machine
  2. EC2 instance
  3. iMac we found on the street

Through an accident of fate, the iMac becomes primary. We can force it to become a slave by running the step down command:

> imac = connect("imac.example.com/admin")
connecting to: imac.example.com/admin
> imac.runCommand({"replSetStepDown" : 1})
{"ok" : 1}

Now the iMac will be a slave.

Setting Priorities

It’s likely that we never want the iMac to be a master (we’ll just use it for backup). You can force this by setting its priority to 0. The higher a server’s priority, the more likely it is to become master if the current master fails. Right now, the only options are 0 (can’t be master) or 1 (can be master), but in the future you’ll be able to have a nice gradation of priorities.
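
As a sketch of what priority 0 means, here’s the rule in Python (an illustration only, not the actual election code):

```python
def electable(members):
    """Members that could become master: reachable, and priority > 0."""
    return [m for m in members if m["up"] and m["priority"] > 0]

members = [
    {"host": "prod.example.com", "priority": 1, "up": False},  # just crashed
    {"host": "ec2.example.com",  "priority": 1, "up": True},
    {"host": "imac.example.com", "priority": 0, "up": True},   # backup only
]

# Only the EC2 instance is eligible; the iMac still votes, but never wins.
print([m["host"] for m in electable(members)])  # -> ['ec2.example.com']
```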

So, let’s get into the nitty-gritty of replica sets and change the iMac’s priority to 0. To change the configuration, we connect to the master and edit its configuration:

> config = rs.conf()
{
        "_id" : "unicomplex",
        "version" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "prod.example.com:27017"
                },
                {
                        "_id" : 1,
                        "host" : "ec2.example.com:27017"
                },
                {
                        "_id" : 2,
                        "host" : "imac.example.com:27017"
                }
        ]
}
Now, we have to do two things: 1) set the iMac’s priority to 0 and 2) update the configuration version. The new version number is always the old version number plus one. (It’s 1 right now so the next version is 2. If we change the config again, it’ll be 3, etc.)

> config.members[2].priority = 0
> config.version += 1

Finally, we tell the replica set that we have a new configuration for it.

> use admin
switched to db admin
> db.runCommand({"replSetReconfig" : config})
{"ok" : 1}

All configuration changes must happen on the master. They are propagated out to the slaves from there. Now you can kill any server and the iMac will never become master.

This configuration stuff is a bit finicky to do from the shell right now. In the future, most people will probably just use a GUI to configure their sets and mess with server settings.

Next up: how to hook this up with sharding to get a fault-tolerant distributed database.

Replica Sets Part 2: What are Replica Sets?

The US's current primary

If you want to jump right into trying out replica sets, see Part 1: Master-Slave is so 2009.

Replica sets are basically just master-slave with automatic failover.

The idea is: you have a master and one or more slaves. If the master goes down, one of the slaves will automatically become the new master. The database drivers will always find the master so, if the master they’re using goes down, they’ll automatically figure out who the new master is and send writes to that server. This is much easier to manage (and faster) than manually failing over to a slave.
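
A toy sketch of the driver’s side of this (real drivers are more careful about retries and discovery, but the logic is the same):

```python
def send_write(servers, doc):
    """Find whichever server is currently master and send the write to it."""
    primary = next(s for s in servers if s["ismaster"])
    primary["data"].append(doc)
    return primary["host"]

servers = [
    {"host": "a.example.com:27017", "ismaster": True,  "data": []},
    {"host": "b.example.com:27017", "ismaster": False, "data": []},
]

print(send_write(servers, {"x": 1}))  # -> 'a.example.com:27017'

# The master goes down and the slave is elected; the driver just re-discovers
# who is master and keeps going.
servers[0]["ismaster"] = False
servers[1]["ismaster"] = True
print(send_write(servers, {"x": 2}))  # -> 'b.example.com:27017'
```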

So, you have a pool of servers with one primary (the master) and N secondaries (slaves). If the primary crashes or disappears, the other servers will hold an election to choose a new primary.


A server has to get a majority of the total votes in the set to be elected, not just a majority of the votes cast. This means that, if we have 50 servers and each server has 1 vote (the default; later posts will show how to change the number of votes a server gets), a server needs at least 26 votes to become primary. If no one gets 26 votes, no one becomes primary. The set can still handle reads, but not writes (as there’s no master).
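
The majority rule is just integer arithmetic; a quick sketch:

```python
def votes_needed(total_votes):
    """A strict majority of all votes in the set, not just of votes cast."""
    return total_votes // 2 + 1

print(votes_needed(50))  # -> 26, matching the example above
print(votes_needed(3))   # -> 2: a master plus an arbiter can out-vote one slave
print(votes_needed(2))   # -> 2: with two servers, one server alone never wins
```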

Part 3 will cover demoting a primary

If a server gets 26 or more votes, it becomes primary. All future writes will be directed to it until it loses an election, blows up, gets caught breaking into the DNC, etc.

The original primary is still part of the set. If you bring it back up, it will become a secondary server (until it gets the majority of votes again).

Three’s a crowd (in a good way)

One complication with this voting system is that you can’t just have a master and a slave.

If you just set up a master and a slave, the system has a total of 2 votes, so a server needs both votes to be elected master (1 is not a majority). If one server goes down, the other server only has 1 out of 2 votes, so it will become (or stay) a slave. If the network is partitioned, suddenly the master doesn’t have a majority of the votes (it only has its own 1 vote), so it’ll be demoted to a slave. The slave also doesn’t have a majority of the votes, so it’ll stay a slave (so you’d end up with two slaves until the servers can reach each other again).

It’s a waste, though, to have two servers up and no master, so replica sets have a number of ways of avoiding this situation. One of the simplest and most versatile is using an arbiter: a special server that exists to resolve disputes. It doesn’t serve any data to the outside world; it’s just a voter (it can even be on the same machine as another server, it’s very lightweight). In part 1, localhost:27019 was the arbiter.

So, let’s say we set up a master, a slave, and an arbiter, each with 1 vote (3 votes total). If we put the master and arbiter in one data center and the slave in another, a network partition leaves the master with a majority of the votes (its own plus the arbiter’s); the slave only has 1. If the master fails and the network is not partitioned, the arbiter can vote for the slave, promoting it to master.

With this three-server setup, we get sensible, robust failover.

Next up: dynamically configuring your replica set. In part 1, we fully specified everything on startup. In the real world you’ll want to be able to add servers dynamically and change the configuration.

Replica Sets Part 1: Master-Slave is so 2009

Replica sets are really cool and can be customized out the wazoo, so I’ll be doing a couple of posts on them (I have three written so far and I think I have a few more in there). If there’s any replica-set-related topic you’d like to see covered, please let me know and I’ll make sure to get to it.

This post shows how to do the “Hello, world” of replica sets. I was going to start with a post explaining what they are, but coding is more fun than reading. For now, all you have to know is that they’re master-slave with automatic failover.

Make sure you have version 1.5.7 or better of the database before trying out the code below.

Step 1: Choose a name for your set.

This is just organizational, so choose whatever. I’ll be using “unicomplex” for my example.

Step 2: Create the data directories.

We need a data directory for each server we’ll be starting:

$ mkdir -p ~/dbs/borg1 ~/dbs/borg2 ~/dbs/arbiter

Step 3: Start the servers.

We’ll start up our three servers:

$ ./mongod --dbpath ~/dbs/borg1 --port 27017 --replSet unicomplex/
$ ./mongod --dbpath ~/dbs/borg2 --port 27018 --replSet unicomplex/
$ ./mongod --dbpath ~/dbs/arbiter --port 27019 --replSet unicomplex/

Step 4: Initialize the set.

Now you have to tell the set, “hey, you exist!” Start up the mongo shell and run:

MongoDB shell version: 1.5.7
connecting to: test
> rs.initiate({"_id" : "unicomplex", "members" : [
... {"_id" : 0, "host" : "localhost:27017"}, 
... {"_id" : 1, "host" : "localhost:27018"}, 
... {"_id" : 2, "host" : "localhost:27019", "arbiterOnly" : true}]})
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1

rs is a global variable that holds a bunch of useful replica set functions.

The message says it’ll be online in about a minute, but it’s always been ~5 seconds for me. Once one of the logs reports that the set has elected a primary, your replica set is ready to go!

Playing with the set

One of the servers will be master, the other is a slave. You can figure out which is which by running the isMaster command in the shell:

> db.isMaster()
{
        "ismaster" : true,
        "secondary" : false,
        "hosts" : [
                "localhost:27017",
                "localhost:27018"
        ],
        "arbiters" : [
                "localhost:27019"
        ],
        "ok" : 1
}

If db isn’t primary, the server that is will be listed in the “primary” field:

> db.isMaster()
{
        "ismaster" : false,
        "secondary" : true,
        "hosts" : [
                "localhost:27017",
                "localhost:27018"
        ],
        "arbiters" : [
                "localhost:27019"
        ],
        "primary" : "localhost:27018",
        "ok" : 1
}

Now, try killing the primary server. Wait a couple seconds and you’ll see the other (non-arbiter) server be elected primary.

Once there’s a new primary, restart the mongod you just killed. You’ll see it join in the fray, though not become master (there’s already a master, so it won’t rock the boat). After a few seconds, kill the current master. Now the old master will become master again!

It’s pretty fun to play with this, bringing them up and down and watching the mastership go back and forth (or maybe I’m easily amused).

Inserting and querying data

By default, slaves are for backup only, but you can also use them for queries (reads) if you set the “slave ok” flag. Connect to each of the servers and set this flag:

> db.getMongo().setSlaveOk()
> borg2 = connect("localhost:27018/test")
connecting to: localhost:27018/test
> borg2.getMongo().setSlaveOk()

Now you can insert, update, and remove data on the master and read the changes on the slave.

On Monday: the “why” behind what we just did.

MongoDB backups & corn on the cob in 10 minutes

Last night, I discovered that you can make corn on the cob in about 5 minutes, which is so cool. You can also back up your MongoDB database in about 5 minutes (depending on size!), so I figured I’d combine the two.

You’ll need:

  • 1 MongoDB server you want to back up
  • 1 external drive for the backup
  • 2 ears of unshucked corn
  • 2 tablespoons of butter
  • 4 tablespoons of grated Parmesan cheese
  • 1/2 teaspoon of cayenne pepper


  1. Cook the ears of corn in the microwave for 4 minutes (in their husks) on a damp paper towel.
  2. Start a MongoDB shell and run:
    > use admin
    > db.runCommand({"fsync" : 1, "lock" : 1})
    {
            "info" : "now locked against writes, use db.$cmd.sys.unlock.findOne() to unlock",
            "ok" : 1
    }

    This flushes everything to disk and prevents more writes from happening. If you’re in production, you should do this on a slave (so you don’t prevent people from writing to the master). When you unlock it, the slave will start doing writes again and catch up with the master.

  3. Copy your database files to external storage:
    $ cp -R /data/db /mnt/usb/backup
  4. Your corn is probably done now. Take it out of the microwave and cover it in a towel to let it cool.
  5. In your database shell, run:
    > db.$cmd.sys.unlock.findOne()
    { "ok" : 1, "info" : "unlock requested" }
  6. If it’s cool enough, shuck your corn (the husks and silk should come off very easily, everything practically fell off for me).
  7. Roll the corn around in butter until it’s nice and coated. Then cover it with Parmesan cheese and sprinkle a little cayenne over it.
  8. Optional: eat corn over your laptop, safe in the knowledge that you have a recent backup.

I got this recipe from Tabla (the corn, not the database backup). There was a BBQ day at Madison Square Park and they’re too ritzy to serve BBQ, so they made this. It is really good.

Managing your Mongo horde with genghis-khan

I have been working on a sharding GUI for the past few months (on and off). It’s starting to look pretty cool, so I figured I’d give people a sneak peek.  No, it’s not available yet, sorry.

Basically, genghis-khan is a simple web server that connects to your cluster and gives you tons of information about it. You just open up a page in your browser.

The main view just shows you the current operations on your mongos processes, config servers, and shards. This, for example, is a test cluster on my machine with nothing much happening:

Two mongos processes, three config servers, and two shards

This is basically the output of mongostat.  There isn’t much to do on this screen, so let’s move on to the “databases” tab.

This lists all of the databases you have.  If a database has sharding enabled, you can shard its collections with the “add a sharded collection” form. You can also see what all of the sharded collections in this database are sharded by. (For instance, you can see that the foo.blog.authors collection is sharded by the name field.)

If a database doesn’t have sharding enabled yet, a big “shard the X database” button appears:

You can probably guess what that does.

The final tab is the “shards” tab, which shows what data is on what shard.

You can see that there are 4 chunks in the foo.blog.authors collection, all on one shard. MongoDB will balance the chunks if we give it a few minutes, but we can move them around ourselves by dragging a chunk to a different shard:

A chunk being dragged from shard0 to shard1

The result

If we wait around a bit, MongoDB will finish balancing for us and we end up with roughly the same number of chunks on each shard.

We can use the form at the top of the page to add a new shard to the cluster, and optionally name it (by default it’ll be called “shardN”) and set a max size:

A second later, our new shard pops up.

Again, if we wait around a bit, our data will balance itself.

Request for Comments/Feature Requests

So, right now you can use genghis-khan to:

  • View stats
  • View shards
  • View chunks
  • Add shards
  • View databases
  • Shard databases
  • View sharded collections
  • Shard collections

Anyone have any features they’d like to see?  I can’t promise anything, but I’d love to hear people’s suggestions.

I Never Thought I’d Be On a Book

I’ve pretty much disappeared for the last few weeks because I’ve been finishing up MongoDB: The Definitive Guide, now available for pre-sale! It’s a comprehensive reference to MongoDB which should be useful for everyone, from a beginner who has never touched the database before to a core MongoDB developer (or so two have claimed… I’m a bit skeptical).

This book covers everything from getting started with MongoDB to developing your app to sending it into production safely and securely. Mike and I have been helping users with MongoDB for a couple years now and this is a compilation of answers to the most common questions, warnings about the usual traps people fall into, and comprehensive coverage of important or interesting subjects.

Also, for regular readers of my blog: remember that zero-points-of-failure sharding setup post I promised? It’s in the book. In fact, pretty much every great technical topic I’ve thought of in the last 6 months has gone into the book, so go get your copy now!