MapReduce – The Fanfiction

MapReduce is really cool, useful, and powerful, but a lot of people find it hard to wrap their heads around. This post is a fairly silly, non-technical explanation using Star Trek.

The Enterprise found a new planet, as it tends to do.

Kirk wanted to beam down immediately and start surveying the planet, but Spock told him to wait a moment. “It usually takes us one hour to survey a planet, correct, Captain? In less than 5 minutes, I can calculate whether the chance of encountering friendly alien females outweighs the risk of attack by brain-eating monsters.”

“Interesting idea, Spock,” said Kirk.  “Go ahead.”

The Data

“Logically,” thought Spock, “if we can survey a whole planet in one hour, we can survey 1/16th of a planet in 3.75 minutes.”  Spock divided the planet into 16 equal-size pieces and summoned 16 red shirts.

“You’ll be beamed down to the surface of the planet with this special data collection device called an ‘emitter.’ If you see a brain-eating monster, you press the ‘brain-eating monster’ button on your emitter. If you see an attractive female alien, you press the ‘hot alien chick’ button. Press either, neither, or both buttons, as your situation requires.”

The Map Step

The 16 red shirts were beamed down to the 16 parts of the planet. As they found things, they would press the buttons on their emitters.

Back on the Enterprise, Spock started getting lots of data pairs that looked like:

| type                 | location |
|----------------------|----------|
| Brain-eating monster | 2        |
| Hot alien chick      | 7        |
| Brain-eating monster | 14       |
| Brain-eating monster | 7        |

The Reduce Step

“Computer,” Spock said.  “Initialize a counter to 0 for each new type you get.  Then, for every subsequent data pair with the same type, increment that counter.”

“I dinnae understand,” said Scotty.  “What’s that, then?”

“I basically told the computer to initialize two variables, ‘Brain-eating monster’ and ‘Hot alien chick’, setting them both to zero. Every time the computer gets a ‘Brain-eating monster’ emit, it increments the ‘Brain-eating monster’ variable. Every time it gets a ‘Hot alien chick’ emit, it increments the ‘Hot alien chick’ variable.”

“Ah, I see,” said Scotty.  “But don’t you lose the location information?”

“Yes,” replied Spock.  “But I don’t actually care about location for this readout.  If I wanted the location, I could give the computer a slightly more complicated algorithm, but right now I just want the count.”

The Result

After 3.75 minutes, Spock beamed up the red shirts who were still alive and presented his findings to Kirk: “There are brain-eating monsters on 7/8ths of the planet, Captain. 1/16th of the planet has hot alien chicks.”

“Excellent work, Spock,” said Kirk. “Let’s boldly go somewhere else.”

And so they did.

Captain’s log, stardate 1419.7 (a.k.a. a summary of what we did)

  1. Goal – To generate a report on a planet.
  2. Data – 16 pieces of land with various attributes. Each piece of land could be represented by a JSON object such as:
    {
        "location" : 5,
        "contains" : ["Brain-eating monsters", "rocks", "poison gas"]
    }
  3. Map – Send attributes for each piece of data back to the processor. In JSON, each emit would look something like:
    {
        "Brain-eating monsters" : 5
    }
  4. Reduce – Sum up the data, grouping by type (see the sketch below).
  5. Result – How much of each attribute is on the planet.
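
Since the Enterprise’s computer presumably doesn’t run Python, here’s a minimal sketch of the same map/reduce pattern in Python. The data is made up, and this mimics the idea rather than any particular framework:

# each red shirt's report: one document per piece of land (hypothetical data)
planet = [
    {"location": 2, "contains": ["Brain-eating monsters"]},
    {"location": 7, "contains": ["Hot alien chicks", "Brain-eating monsters"]},
    {"location": 14, "contains": ["Brain-eating monsters"]},
]

# map step: each piece of land emits a (type, location) pair per attribute
def map_land(piece):
    for attribute in piece["contains"]:
        yield (attribute, piece["location"])

# reduce step: initialize a counter to 0 for each new type, then increment
counts = {}
for piece in planet:
    for attribute, location in map_land(piece):
        counts[attribute] = counts.get(attribute, 0) + 1

print(counts)  # {'Brain-eating monsters': 3, 'Hot alien chicks': 1}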

Further reading: Kyle Banker has an excellent (and more technical) explanation of MapReduce.

Bug Reporting: A How-To

This type of bug report drives me nuts:

You have a critical bug! This crashes Java!

for (int i=0; i<10; i++) {
     cursor.next();
}

(I’ve never gotten this exact report and I’m not picking on anyone, it’s a composite.)

This doesn’t crash for me.  It doesn’t even compile for me because the variable “cursor” isn’t defined. If you’re going to use a variable (a function, a framework, etc.) in a code snippet, you have to define it. Let’s try again:

Mongo m = new Mongo();
DB db = m.getDB("bar");
DBCollection coll = db.getCollection("foo");
DBCursor cursor = coll.find();

for (int i=0; i<10; i++) {
     cursor.next();
}

Better! But this is probably crashing because of something in your database.  Unless it crashes regardless of dataset, you need to send me the data that makes it crash.  The basic rule is:

The faster I can reproduce your bug, the faster I can fix it.
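
If the data lives in MongoDB, mongodump makes this painless. For the example above (database bar, collection foo), that would be something like:

$ mongodump --db bar --collection foo

Then zip up the resulting dump directory and attach it to the report.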

Some other tips for submitting bug reports:

  1. Make sure to include information about your environment.  The more the merrier.
  2. If I ask for log messages, please send me the entire log.  If it’s been running for days and the log’s a zillion lines long, send everything from around the time the error happened (before, during, after).  Please, please, please don’t skim the logs, extract a single suspicious-looking line, and send it to me.  I promise that I’m not going to be mad about having to wade through a couple hundred (or thousand) lines to find what I’m looking for.  I would rather quickly skim a bunch of extra info than pry the logs, line by line, from your clutches.
  3. If you are running on a non-traditional setup (e.g., a Linux kernel you modified yourself on a mainframe from 1972), it would be super helpful if you can give me access to your machine.  If your bug is platform-specific to ENIAC, it’s doubtful I’m going to be able to figure out what’s going wrong on my MacBook Air.

And, of course, flatter the developer’s ego when they’ve fixed the bug.  Not applicable for me, of course, but other developers like it when you give them a little “Thanks, you’re awesome!” biscuit when they’ve fixed your bug.

To my users: thank you to everyone who has ever filed a bug report.  I’m really, really happy that you’re using my software and that you care enough about it to submit a bug, instead of just giving up. Seriously, thank you. I have to give a shout-out to Perl developers in particular, here.  More than half the time, people reporting bugs in the MongoDB Perl driver actually include a patch in the bug report that fixes it!  I love you guys.

Sleepy.Mongoose: A MongoDB HTTP Interface

The first half of the MongoDB book is due this week, so I wrote a REST interface for Mongo (I’m a prolific procrastinator).  Anyway, it’s called Sleepy.Mongoose and it’s available at https://github.com/10gen-labs/sleepy.mongoose.

Installing Sleepy.Mongoose

  1. Install MongoDB.
  2. Install the Python driver:
    $ easy_install pymongo
  3. Download Sleepy.Mongoose.
  4. From the mongoose directory, run:
    $ python httpd.py

You’ll see something that looks like:

=================================
|      MongoDB REST Server      |
=================================

listening for connections on http://localhost:27080

Using Sleepy.Mongoose

First, we’re just going to ping Sleepy.Mongoose to make sure it’s awake. You can use curl:

$ curl 'http://localhost:27080/_hello'

and it’ll send back a Star Wars quote.

To really use the interface, we need to connect to a database server. To do this, we post our database server address to the URI “/_connect” (all actions start with an underscore):

$ curl --data server=localhost:27017 'http://localhost:27080/_connect'

This connects to the database running at localhost:27017.

Now let’s insert something into a collection.

$ curl --data 'docs=[{"x":1}]' 'http://localhost:27080/foo/bar/_insert'

This will insert the document {"x" : 1} into the foo database’s bar collection. If we open up the JavaScript shell (mongo), we can see the document we just added:

> use foo
> db.bar.find()
{ "_id" : ObjectId("4b7edc9a1d41c8137e000000"), "x" : 1 }

But why bother opening the shell when we can query with curl?

$ curl -X GET 'http://localhost:27080/foo/bar/_find'
{"ok": 1, "results": [{"x": 1, "_id": {"$oid": "4b7edc9a1d41c8137e000000"}}], "id": 0}

Note that queries are GET requests, whereas the other requests up to this point have been POSTs (well, _hello can be either).

A query returns three fields:

  • “ok”, which will be 1 if the query succeeded and 0 otherwise
  • “results”, an array of documents from the db
  • “id”, an identifier for that particular query

In this case, “id” is irrelevant, as we only have one document in the collection. If we had a bunch, though, we could use the id to get more results (by default, _find only returns the first 15 matching documents, although this is configurable). This will probably be clearer with an example, so let’s add some more documents to see how this works:

$ curl --data 'docs=[{"x":2},{"x":3}]' 'http://localhost:27080/foo/bar/_insert'
{"ok" : 1}

Now we have three documents. Let’s do a query and ask for it to return one result at a time:

$ curl -X GET 'http://localhost:27080/foo/bar/_find?batch_size=1'
{"ok": 1, "results": [{"x": 1, "_id": {"$oid": "4b7edc9a1d41c8137e000000"}}], "id": 1}

The only difference between this query and the one above is the “?batch_size=1” which means “send one document back.” Notice that the cursor id is 1 now, too (not 0). To get the next result, we can do:

$ curl -X GET 'http://localhost:27080/foo/bar/_more?id=1&batch_size=1'
{"ok": 1, "results": [{"x": 2, "_id": {"$oid": "4b7ee0731d41c8137e000001"}}], "id": 1}
$ curl -X GET 'http://localhost:27080/foo/bar/_more?id=1&batch_size=1'
{"ok": 1, "results": [{"x": 3, "_id": {"$oid": "4b7ee0731d41c8137e000002"}}], "id": 1}

Now let’s remove a document:

$ curl --data 'criteria={"x":2}' 'http://localhost:27080/foo/bar/_remove'
{"ok" : 1}

Now if we do a _find, it only returns two documents:

$ curl -X GET 'http://localhost:27080/foo/bar/_find'
{"ok": 1, "results": [{"x": 1, "_id": {"$oid": "4b7edc9a1d41c8137e000000"}}, {"x": 3, "_id": {"$oid": "4b7ee0731d41c8137e000002"}}], "id": 2}

And finally, updates:

$ curl --data 'criteria={"x":1}&newobj={"$inc":{"x":1}}' 'http://localhost:27080/foo/bar/_update'

Let’s do a _find to see the updated object, this time using the criteria {“x”:2}. To put this in a URL, we need to escape the special characters. You can do this by copy-pasting the string into any JavaScript interpreter (Rhino, SpiderMonkey, mongo, Firebug, Chrome’s dev tools) as follows:

> escape('{"x":2}')
%7B%22x%22%3A2%7D
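
If Python is handier for you than JavaScript, urllib does the same job:

>>> from urllib.parse import quote
>>> quote('{"x":2}')
'%7B%22x%22%3A2%7D'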

And now we can use that in our URL:

$ curl -X GET 'http://localhost:27080/foo/bar/_find?criteria=%7B%22x%22%3A2%7D'
{"ok": 1, "results": [{"x": 2, "_id": {"$oid": "4b7edc9a1d41c8137e000000"}}], "id": 0}

If you’re looking to go beyond the basic CRUD, there’s more documentation in the wiki.

This code is super-alpha. Comments, questions, suggestions, patches, and forks are all welcome.

Note: Sleepy.Mongoose is an offshoot of something I’m actually supposed to be working on: a JavaScript API we’re going to use to make an awesome sharding tool. Administering your cluster will be a point-and-click interface. You’ll be able to see how everything is doing, drag and drop chunks, visually split collections… it’s going to be so cool.

“Introduction to MongoDB” Video

This is the video of the talk I gave last Sunday at the NoSQL Devroom at FOSDEM. It’s about why MongoDB was created, what it’s good at (and a bit about what it’s not good for), the basic syntax, and how sharding and replication work; it covers a lot of ground.

You can also go to Parleys.com to see the video with my slides next to it.


St. Clementine’s Day

The night before Valentine’s Day, I got Andrew a crate of clementines (they’re already gone). Yesterday, The Doghouse Diaries ran a comic that fit the occasion perfectly.

I came down with a cold on Friday, and neither of us wanted to do anything for Valentine’s Day, so we ended up playing Legend of Zelda for most of it. When we got hungry, we started looking through the cabinets, and I saw some dried apricots that reminded me of some chocolates Andrew got me for Christmas.

“For future reference,” I said, “I loved those chocolate-covered apricots.  They were so good.”

“You want to make some now?” he asked.

Now, I was sick and tired, and that sounded a lot like work. But it was so freakin’ easy. And awesome. And delicious. And come on, how much more romantic can you get? Here’s how to do it yourself:

  1. Take a lump of semisweet baker’s chocolate about twice the size of a Hershey’s bar
  2. Stick it in a bowl in the microwave for 30 seconds
  3. Mush it up and dump in a bunch of dried fruit/nuts/whatever
  4. Spoon each chocolate-covered item onto a sheet of parchment
  5. Stick the parchment in the fridge

An hour later, we had my favorite chocolates.

Then we watched Star Wars.

My life is awesome.  Thank you for being so wonderful, Andrew.

Mongo Mailbag #2: Updating GridFS Files

Welcome to week two of Mongo Mailbag, where I take a question from the Mongo mailing list and answer it in more detail. If you have a question you’d like to see answered in excruciating detail, feel free to email it to me.

Is it possible (with the PHP driver) to storeBytes into GridFS (for example CSS data), and later change that data?!

I get some strange behavior when passing an existing _id value in the $extra array of MongoGridFS::storeBytes, sometimes Apache (under Windows) crashes when reloading the file, sometimes it doesn’t seem to be updated at all.

So I wonder, is it even possible to update files in GridFS?! 🙂

-Wouter

If you already understand GridFS, feel free to skip to the last section. For everyone else…

Intro to GridFS

GridFS is the standard way MongoDB drivers handle files: a protocol that allows you to save arbitrarily large files to the database. It’s not the only way, and it’s not necessarily the best way, but it’s the built-in way that all of the drivers support. This means that you can use GridFS to save a file in Ruby and then retrieve it using Perl, and vice versa.

Why would you want to store files in the database? Well, it can be handy for a number of reasons:

  • If you set up replication, you’ll have automatic backups of your files.
  • You can keep millions of files in one (logical) directory… something most filesystems either won’t allow or aren’t good at.
  • You can keep information associated with the file (who’s edited it, download count, description, etc.) right with the file itself.
  • You can easily access info from random sections of large files, another thing traditional file tools aren’t good at.

There are some limitations, too:

  • You can’t have an arbitrary number of files per document… it’s one file, one document.
  • You must use a specific naming scheme for the collections involved: prefix.files and prefix.chunks (by default prefix is “fs”: fs.files and fs.chunks).

If you have complex requirements for your files (e.g., YouTube), you’d probably want to come up with your own protocol for file storage. However, for most applications, GridFS is a good solution.

How it Works

GridFS breaks large files into manageable chunks. It saves the chunks to one collection (fs.chunks) and then metadata about the file to another collection (fs.files). When you query for the file, GridFS queries the chunks collection and returns the file one piece at a time.
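
You can watch this happen from Python with pymongo’s gridfs module. A quick sketch (the database name, filename, and contents here are made up):

import gridfs
from pymongo import MongoClient

db = MongoClient().test
fs = gridfs.GridFS(db)  # uses the default "fs" prefix

# save a file: GridFS splits the bytes into chunks for us
file_id = fs.put(b"x" * 10000000, filename="big.bin")

# the metadata lives in fs.files...
print(db.fs.files.find_one({"_id": file_id}))

# ...and the bytes live in fs.chunks, one document per chunk,
# tied back to the file by files_id and ordered by n
print(db.fs.chunks.count_documents({"files_id": file_id}))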

Here are some common questions about GridFS:

Q: Why not just save the whole file in a single document?
A: MongoDB has a 4MB cap on document size.
Q: That’s inconvenient, why?
A: It’s an arbitrary limit, mostly to prevent bad schema design.
Q: But in this case it would be so handy!
A: Not really. Imagine you’re storing a 20GB file. Do you really want to return the whole thing at once? That means 20GB of memory will be used whenever you query for that document. Do you even have that much memory? Do you want it taken up by a single request?
Q: Well, no.
A: The nice thing about GridFS is that it streams the data back to the client, so you never need more than 4MB of memory.
Q: Now I know.
A: And knowing is half the battle.
Together: G.I. Joe!

Answer the Damn Question

Back to Wouter’s question: changing the metadata is easy. If we wanted to add, say, a “permissions” field, we could run the following PHP code:

$files = $db->fs->files;
$files->update(array("filename" => "installer.bin"), array('$set' => array("permissions" => "555")));

// or, equivalently, from the MongoGridFS object:

$grid->update(array("filename" => "installer.bin"), array('$set' => array("permissions" => "555")));

Updating the file itself, which is what Wouter is actually asking about, is significantly more complex. If we want to update the binary data, we’ll need to reach into the chunks collection and update every document associated with the file. (Edit: unless you’re using the C# driver! See Sam Corder’s comment below.) It would look something like:

// a handle to the new version of the file's contents (path is an example)
$file = fopen("newVersion.bin", "rb");

// get the target file's chunks, in order
$chunks = $db->fs->chunks;
$cursor = $chunks->find(array("files_id" => $fileId))->sort(array("n" => 1));

$newLength = 0;

foreach ($cursor as $chunk) {
    // read in a string of bytes from the new version of the file
    $bindata = fread($file, MongoGridFS::$chunkSize);
    $newLength += strlen($bindata);

    // put the new version's contents in this chunk
    $chunk["data"] = new MongoBinData($bindata);

    // update the chunks collection with this new chunk
    $chunks->save($chunk);
}

// update the file length metadata (necessary for retrieving the file)
$db->fs->files->update(array("_id" => $fileId), array('$set' => array("length" => $newLength)));

The code above doesn’t handle a bunch of cases (what if the new file has a different number of chunks than the old one?), and anything beyond this basic scenario gets irritatingly complex. Rather than updating individual chunks, you should probably just remove the GridFS file and save it again; it’ll end up taking about the same amount of time and be less error-prone.
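
For instance, here’s what remove-and-resave looks like in Python with pymongo’s gridfs (a sketch; the paths and filenames are stand-ins):

import gridfs
from pymongo import MongoClient

db = MongoClient().test
fs = gridfs.GridFS(db)

# read in the new version of the file (hypothetical path)
with open("installer-v2.bin", "rb") as f:
    new_contents = f.read()

# find the old file, then replace it wholesale
old = db.fs.files.find_one({"filename": "installer.bin"})
fs.delete(old["_id"])  # removes the fs.files doc and all of its chunks
fs.put(new_contents, filename="installer.bin")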

FOSDEM: Some Pictures

This picture was taken by outerthought at FOSDEM. People look fairly interested 🙂 There was a guy on the other side of the room who was asleep the whole time, but he was old and I tried not to look at him. You can see I’m all super-professional in my XKCD “I’m not slacking off, my code’s compiling” T-shirt.

Given that I’m in Belgium, I sneaked a few Magrittes into my slideshow; they just seemed to call out “consistency” and “transactions” to me.

Andrew and I actually tried to visit the Magritte museum today, but we couldn’t find it. We walked all around, circled the block… I figure it must be in his old house and it was closed, because it was indistinguishable from every other townhouse on the street. Annoying.

FOSDEM

I gave a talk at FOSDEM (Free and Open Source Developers European Meetup) this morning: “Introduction to MongoDB”. It went pretty well, I think. Slides are up at scribd.com and it was recorded, so the video for it should be somewhere soon (I’ll update when I find out where).

The trip across the Atlantic was interesting. It was so bumpy that one of the stewardesses serving drinks fell over, and the captain announced, “All flight attendants, take your seats with your seat belts fastened!” Takeoff had been delayed to fix something and, in addition to the plane regularly dropping a couple dozen feet in altitude, the engine kept making funny noises, so I was pretty sure we were going to go down. I considered putting my shoes on, but I decided that the last thing I wanted while floating on a raft in the North Atlantic was wet shoes. The plane pulled through, though, and eventually we got to Belgium. So, no excellent story (or fiery death) for me.

One of the cool things about being here was that I got to meet chx, a Drupal developer. He’s helping integrate MongoDB and Drupal 7. We have been wanting to send him some schwag, but he’s on an extended visit to relatives who live on a hill that the postman can’t climb. However, I knew he was going to be here, so I carried a mug all the way to Belgium and got to give it to him. Mongo devs: neither snow nor sleet nor the gloom of 3,000 miles of ocean keeps these swift couriers from delivering mugs. Woot.

My talk was at 10am (4am New York time… ugh). Andrew and I went to the conference’s cafeteria beforehand so I could get some coffee. It was… interesting. I have a theory on how Belgians make coffee: they brew a pot of coffee, and then let it sit on a burner until only a cup is left in the pot. Then they serve you that cup. Now, I am grateful, because I managed to drink it (with the help of a chocolate croissant) and it kept me upright for my talk, but I am glad I live in a country where people like their coffee watery.

Andrew and I are at what I think is the Belgian equivalent of a diner, having some coffee and beer. I feel like a total philistine, but I can’t actually tell the difference between Belgian beer and a decent American beer. Obviously more data points are necessary; I’ll be looking into it further tonight.

Giving talks is fun, but stressful. I feel like my whole body is relaxing now. I’m looking forward to sleeping at least 12 hours tonight.