Part 3: Replica Sets in the Wild

A primary with 8 secondaries

This post assumes that you know what replica sets are and some of the basic syntax.
In part 1, we set up a replica set from scratch, but real life is messier: you might want to migrate dev servers into production, add new slaves, prioritize servers, change things on the fly… that’s what this post covers.

Before we get started…

Replica sets don’t like localhost. They’re willing to go along with it… kinda, sorta… but it often causes issues. You can get around these issues by using the hostname instead. On Linux, you can find your hostname by running the hostname command:

$ hostname
wooster

On Windows, you have to download Linux or something. I haven’t really looked into it.

From here on out, I’ll be using my hostname instead of localhost.

Starting up with Data

This is pretty much the same as starting up without data, except you should backup your data before you get started (you should always backup your data before you mess around with your server configuration).

If, pre-replica-set, you were starting your server with something like:

$ ./mongod

…to turn it into the first member of a replica set, you’d shut it down and start it back up with the –replset option:

$ ./mongod --replSet unicomplex

Now, initialize the set with the one server (so far):

> rs.initiate()
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}

Adding Slaves

You should always run MongoDB with slaves, so let’s add some.

Start your slave with the usual options you use, as well as –replSet. So, for example, we could do:

$ ./mongod --dbpath ~/dbs/slave1 --port 27018 --replSet unicomplex

Now, we add this slave to the replica set. Make sure db is connected to wooster:27017 (the primary server) and run:

> rs.add("wooster:27018")
{"ok" : 1}

Repeat as necessary to add more slaves.

Adding an Arbiter

This is very similar to adding a slave. In 1.6.x, when you start up the arbiter, you should give it the option –oplogSize 1. This way the arbiter won’t be wasting any space. (In 1.7.4+, the arbiter will not allocate an oplog automatically.)

$ ./mongod --dbpath ~/dbs/arbiter --port 27019 --replSet unicomplex --oplogSize 1

Now add it to the set. You can specify that this server should be an arbiter by calling rs.addArb:

> rs.addArb("wooster:27019")
{"ok" : 1}

Demoting a Primary

Suppose our company has the following servers available:

  1. Gazillion dollar super machine
  2. EC2 instance
  3. iMac we found on the street

Through an accident of fate, the iMac becomes primary. We can force it to become a slave by running the step down command:

> imac = connect("imac.example.com/admin")
connecting to: imac.example.com/admin
admin
> imac.runCommand({"replSetStepDown" : 1})
{"ok" : 1}

Now the iMac will be a slave.

Setting Priorities

It’s likely that we never want the iMac to be a master (we’ll just use it for backup). You can force this by setting its priority to 0. The higher a server’s priority, the more likely it is to become master if the current master fails. Right now, the only options are 0 (can’t be master) or 1 (can be master), but in the future you’ve be able to have a nice gradation of priorities.

So, let’s get into the nitty-gritty of replica sets and change the iMac’s priority to 0. To change the configuration, we connect to the master and edit its configuration:

> config = rs.conf()
{
        "_id" : "unicomplex",
        "version" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "prod.example.com:27017"
                },
                {
                        "_id" : 1,
                        "host" : "ec2.example.com:27017"
                },
                {
                        "_id" : 2,
                        "host" : "imac.example.com:27017"
                }
        ]
}

Now, we have to do two things: 1) set the iMac’s priority to 0 and 2) update the configuration version. The new version number is always the old version number plus one. (It’s 1 right now so the next version is 2. If we change the config again, it’ll be 3, etc.)

> config.members[2].priority = 0
0
> config.version += 1
2

Finally, we tell the replica set that we have a new configuration for it.

> use admin
switched to db admin
> db.runCommand({"replSetReconfig" : config})
{"ok" : 1}

All configuration changes must happen on the master. They are propagated out to the slaves from there. Now you can kill any server and the iMac will never become master.

This configuration stuff is a bit finicky to do from the shell right now. In the future, most people will probably just use a GUI to configure their sets and mess with server settings.

Next up: how to hook this up with sharding to get a fault-tolerant distributed database.

Replica Sets Part 2: What are Replica Sets?

The US's current primary

If you want to jump right into trying out replica sets, see Part 1: Master-Slave is so 2009.

Replica sets are basically just master-slave with automatic failover.

The idea is: you have a master and one or more slaves. If the master goes down, one of the slaves will automatically become the new master. The database drivers will always find the master so, if the master they’re using goes down, they’ll automatically figure out who the new master is and send writes to that server. This is much easier to manage (and faster) than manually failing over to a slave.

So, you have a pool of servers with one primary (the master) and N secondaries (slaves). If the primary crashes or disappears, the other servers will hold an election to choose a new primary.

Elections

A server has to get a majority of the total votes to be elected, not just a majority. This means that, if we have 50 servers and each server has 1 vote (the default, later posts will show how to change the number of votes a server gets), a server needs at least 26 votes to become primary. If no one gets 26 votes, no one becomes primary. The set can still handle reads, but not writes (as there’s no master).

Part 3 will cover demoting a primary

If a server gets 26 or more votes, it will becomes primary. All future writes will be directed to it, until it loses an election, blows up, gets caught breaking into the DNC, etc.

The original primary is still part of the set. If you bring it back up, it will become a secondary server (until it gets the majority of votes again).

Three’s a crowd (in a good way)

One complication with this voting system is that you can’t just have a master and a slave.

If you just set up a master and a slave, the system has a total of 2 votes, so a server needs both votes to be elected master (1 is not a majority). If one server goes down, the other server only has 1 out of 2 votes, so it will become (or stay) a slave. If the network is partitioned, suddenly the master doesn’t have a majority of the votes (it only has its own 1 vote), so it’ll be demoted to a slave. The slave also doesn’t have a majority of the votes, so it’ll stay a slave (so you’d end up with two slaves until the servers can reach each other again).

It’s a waste, though, to have two servers and no master up, so replica sets have a number of ways of avoiding this situation. One of the simplest and most versatile ways is using an arbiter, a special server that exists to resolves disputes. It doesn’t serve any data to the outside world, it’s just a voter (it can even be on the same machine as another server, it’s very lightweight). In part 1, localhost:27019 was the arbiter.

So, let’s say we set up a master, a slave, and an arbiter, each with 1 vote (total of 3 votes). Then, if we have the master and arbiter in one data center and the slave in another, if a network partition occurs, the master still has a majority of votes (master+arbiter). The slave only has 1 vote. If the master fails and the network is not partitioned, the arbiter can vote for the slave, promoting it to master.

With this three-server setup, we get sensible, robust failover.

Next up: dynamically configuring your replica set. In part 1, we fully specified everything on startup. In the real world you’ll want to be able to add servers dynamically and change the configuration.

Replica Sets Part 1: Master-Slave is so 2009

Replica sets are really cool and can be customized out the wazoo, so I’ll be doing a couple of posts on them (I have three written so far and I think I have a few more in there). If there’s any replica-set-related topic you’d like to see covered, please let me know and I’ll make sure to get to it.

This post shows how to do the “Hello, world” of replica sets. I was going to start with a post explaining what they are, but coding is more fun than reading. For now, all you have to know is that they’re master-slave with automatic failover.

Make sure you have version 1.5.7 or better of the database before trying out the code below.

Step 1: Choose a name for your set.

This is just organizational, so choose whatever. I’ll be using “unicomplex” for my example.

Step 2: Create the data directories.

We need a data directory for each server we’ll be starting:

$ mkdir -p ~/dbs/borg1 ~/dbs/borg2 ~/dbs/arbiter

Step 3: Start the servers.

We’ll start up our three servers:

$ ./mongod --dbpath ~/dbs/borg1 --port 27017 --replSet unicomplex/
$ ./mongod --dbpath ~/dbs/borg2 --port 27018 --replSet unicomplex/
$ ./mongod --dbpath ~/dbs/arbiter --port 27019 --replSet unicomplex/

Step 4: Initialize the set.

Now you have to tell the set, “hey, you exist!” Start up the mongo shell and run:

MongoDB shell version: 1.5.7
connecting to: test
> rs.initiate({"_id" : "unicomplex", "members" : [
... {"_id" : 0, "host" : "localhost:27017"}, 
... {"_id" : 1, "host" : "localhost:27018"}, 
... {"_id" : 2, "host" : "localhost:27019", "arbiterOnly" : true}]})
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}

rs is a global variable that holds a bunch of useful replica set functions.

The message says it’ll be online in about a minute, but it’s always been ~5 seconds for me. Once you see the following line in one of the logs:

replSet PRIMARY

…your replica set is ready to go!

Playing with the set

One of the servers will be master, the other is a slave. You can figure out which is which by running the isMaster command in the shell:

> db.isMaster()
{
        "ismaster" : true,
        "secondary" : false,
        "hosts" : [
                "localhost:27017",
                "localhost:27018",
        ],
        "arbiters" : [
                "localhost:27019"
        ],
        "ok" : 1
}

If db isn’t primary, the server that is will be listed in the “primary” field:

> db.isMaster()
{
        "ismaster" : false,
        "secondary" : true,
        "hosts" : [
                "localhost:27017",
                "localhost:27018",
        ],
        "arbiters" : [
                "localhost:27019"
        ],
        "primary" : "localhost:27018",
        "ok" : 1
}

Now, try killing the primary server. Wait a couple seconds and you’ll see the other (non-arbiter) server be elected primary.

Once there’s a new primary, restart the mongod you just killed. You’ll see it join in the fray, though not become master (there’s already a master, so it won’t rock the boat). After a few seconds, kill the current master. Now the old master will become master again!

It’s pretty fun to play with this, bringing them up and down and watching the mastership go back and forth (or maybe I’m easily amused).

Inserting and querying data

By default, slaves are for backup only, but you can also use them for queries (reads) if you set the “slave ok” flag. Connect to each of the servers and set this flag:

> db.getMongo().setSlaveOk()
> borg2 = connect("localhost:27018/test")
connecting to: localhost:27018/test
test
> borg2.getMongo().setSlaveOk()

Now you can insert, update, and remove data on the master and read the changes on the slave.

On Monday: the “why” behind what we just did.

MongoDB backups & corn on the cob in 10 minutes

Last night, I discovered that you can make corn on the cob in about 5 minutes, which is so cool. You can also backup your MongoDB database in about 5 minutes (depending on size!), so I figured I’d combine the two.

You’ll need:

  • 1 MongoDB server you want to back up
  • 1 external drive for the backup
  • 2 ears of unshucked corn
  • 2 tablespoons of butter
  • 4 tablespoons of grated Parmesan cheese
  • 1/2 teaspoon of cayenne pepper

Directions:

  1. Cook the ears of corn in the microwave for 4 minutes (in their husks) on a damp paper towel.
  2. Start a MongoDB shell and run:
    > use admin
    > db.runCommand({"fsync" : 1, "lock" : 1})
    {
            "info" : "now locked against writes, use db.$cmd.sys.unlock.findOne() to unlock",
            "ok" : 1
    }
    

    This flushes everything to disk and prevents more writes from happening. If you’re in production, you should do this on a slave (so you don’t prevent people from writing to the master). When you unlock it, the slave will start doing writes again and catch up with the master.

  3. Copy your database files to external storage:
    $ cp -R /data/db /mnt/usb/backup
    
  4. Your corn is probably done now. Take it out of the microwave and cover it in a towel to let it cool.
  5. In your database shell, run:
    > db.$cmd.sys.unlock.findOne()
    { "ok" : 1, "info" : "unlock requested" }
    
  6. If it’s cool enough, shuck your corn (the husks and silk should come off very easily, everything practically fell off for me).
  7. Roll the corn around in butter until it’s nice and coated. Then cover it with Parmesan cheese and sprinkle a little cayenne over it.
  8. Optional: eat corn over your laptop, safe in the knowledge that you have a recent backup.

I got this recipe from Tabla (the corn, not the database backup). There was a BBQ day at Madison Square Park and they’re too ritzy to serve BBQ, so they made this. It is really good.

Managing your Mongo horde with genghis-khan

I have been working on a sharding GUI for the past few months (on and off). It’s starting to look pretty cool, so I figured I’d give people a sneak peak.  No, it’s not available yet, sorry.

Basically, genghis-khan is a simple web server that connects to your cluster and gives you tons of information about it. You just open up a page in your browser.

The main view just shows you the current operations on your mongos processes, config servers, and shards. This, for example, is a test cluster on my machine with nothing much happening:

Two mongos processes, three config servers, and two shards

This is basically the output of mongostat.  There isn’t much to do on this screen, so let’s move on to the “databases” tab.

This lists all of the databases you have.  If a database has sharding enabled, you can shard its collections with the “add a sharded collection” form. You can also see what all of the sharded collections in this database are sharded by. (For instance, you can see that the foo.blog.authors collection is sharded by the name field.)

If a database doesn’t have sharding enabled yet, a big “shard the X database” button appears:

You can probably guess what that does.

The final tab is the “shards” tab, which shows what data is on what shard.

You can see that there are 4 chunks in the foo.blog.authors collection, all on one shard. MongoDB will balance the chunks if we give it a few minutes, but we can move them around ourselves by dragging a chunk to a different shard:

A chunk being dragged from shard0 to shard1

The result

If we wait around a bit, MongoDB will finish balancing for us and we end up with an even number of chunks on each shard.

We can use the form at the top of the page to add a new shard to the cluster, and optionally name it (by default it’ll be called “shardN“) and set a max size:

A second later, our new shard pops up.

Again, if we wait around a bit, our data will balance itself.

Request for Comments/Feature Requests

So, right now you can use genghis-khan to:

  • View stats
  • View shards
  • View chunks
  • Add shards
  • View databases
  • Shard databases
  • View sharded collections
  • Shard collections

Anyone have any features they’d like to see?  I can’t promise anything, but I’d love to hear people’s suggestions.

I Never Thought I’d Be On a Book

I’ve pretty much disappeared for the last few weeks because I’ve been finishing up MongoDB: The Definitive Guide, now available for pre-sale! It’s a comprehensive reference to MongoDB which should be useful for everyone, from a beginner who has never touched the database before to a core MongoDB developer (or so two have claimed… I’m a bit skeptical).

This book covers everything from getting started with MongoDB to developing your app to sending it into production safely and securely. Mike and I have been helping users with MongoDB for a couple years now and this is a compilation of answers to the most common questions, warnings about the usual traps people fall into, and comprehensive coverage of important or interesting subjects.

Also, for regular readers of my blog: remember that zero-points-of-failure sharding setup post I promised? It’s in the book. In fact, pretty much every great technical topic I’ve thought of in the last 6 months has gone into the book, so go get your copy now!

With a name like Mongo, it has to be good

I just got back from MongoSF, which was awesome. Over 200 Mongo geeks, three tracks, and language-specific workshops all day.

I got to the conference early and took a picture of the venue... an hour later it was packed.

The highlight, for me, was Eliot Horowitz’s talk on sharding. He set up a MongoDB cluster of 25 large EC2 instances and started hammering them.

He pulled up an incredibly snazzy sharding GUI (okay, I wrote it, it’s not actually that snazzy) that displayed a row of stats for each shard. Each shard had about the same level of operations per second, so you could see that Mongo was doing pretty well distributing the data across the shards.

Eliot scrolled down to the bottom of the stats table where it showed the total number of operations per second across all the shards. The cluster was doing 8 million operations per second.

8,000,000 operations per second.

The whole audience burst into applause.

That’s about 320,000 operations per second per box, which is about what would be expected for Mongo on a powerful server (like a large EC2 instance).

Cool things about this:

  • If your app needs more than 8 million ops/sec, you can just add more machines and Mongo will take care of redistributing the load.
  • Your app doesn’t need to know if it’s talking to 1, 25, or 7,000 servers. You can focus on writing your app and let Mongo focus on scaling it.
Speaker's dinner the night before

If you missed out on MongoSF, there are a bunch of other Mongo conferences coming up: MongoNY at the end of May and MongoUK and MongoFR in June.

There must be 50 ways to start your Mongo

This blog post covers four major ones:

Feel free to jump to the ones that interest you (for instance, sharding).

Just start up a database, Ace

Starting up a vanilla MongoDB instance is super easy, it just needs a port it can listen on and a directory where it can save your info. By default, Mongo listens on port 27017, which should work fine (it’s not a very commonly used port). We’ll create a new directory for database files:

$ mkdir -p ~/dbs/mydb # -p creates parent directories if they don't exist

And then start up our database:

$ cd 
$ bin/mongod --dbpath ~/dbs/mydb

…and you’ll see a bunch of output:

$ bin/mongod --dbpath ~/dbs/mydb
Fri Apr 23 11:59:07 Mongo DB : starting : pid = 9831 port = 27017 dbpath = /data/db/ master = 0 slave = 0  32-bit 

** NOTE: when using MongoDB 32 bit, you are limited to about 2 gigabytes of data
**       see http://blog.mongodb.org/post/137788967/32-bit-limitations for more

Fri Apr 23 11:59:07 db version v1.5.1-pre-, pdfile version 4.5
Fri Apr 23 11:59:07 git version: f86d93fd949777d5fbe00bf9784ec0947d6e75b9
Fri Apr 23 11:59:07 sys info: Linux ubuntu 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:54:29 UTC 2009 i686 BOOST_LIB_VERSION=1_38
Fri Apr 23 11:59:07 waiting for connections on port 27017
Fri Apr 23 11:59:07 web admin interface listening on port 28017

Now, Mongo will “freeze” like this, which confuses some people. Don’t worry, it’s just waiting for requests. You’re all set to go.

Set up a slave, Maeve

As we’re running master and slave on the same machine, they’ll need separate ports. We’ll use port 10000 for the master and 20000 for the slave. We also need separate directories for data, so we’ll create those:

$ mkdir ~/dbs/master ~/dbs/slave

Now we start the master database:

$ bin/mongod --master --port 10000 --dbpath ~/dbs/master

And then the slave, in a different terminal:

$ bin/mongod --slave --port 20000 --dbpath ~/dbs/slave --source localhost:10000

The “source” option specifies where the master is that the slave should replicate data from.

Now, if we want to add another slave, we need to go though the herculean effort of choosing a port and creating a new directory:

$ mkdir ~/dbs/slave2
$ bin/mongod --slave --port 20001 --dbpath ~/dbs/slave2 --source localhost:10000

Tada! Two slaves, one master. For more information on master-slave, see the core docs on it and my previous post.

This example puts the master server and slave server on the same machine, but people generally have a master on one machine and a slave on another. It works fine to put them on a single machine, it just defeats the point of a bit.

Get auto-failover… Rover

Okay, so there aren’t many people named Rover, but you come up with a rhyme for “auto-failover” (I tried “replica”, too).

Replica pairs are cool because it’s like master-slave, but you get automatic failover: if the master becomes unavailable, the slave will become a master. So, it’s basically the same as master-slave, but the servers know about each other and there is, optionally, an arbiter server that doesn’t do anything other than resolve “disputes” over who is master.

When could the arbiter come it in handy? Suppose the master’s network cable is pulled. The server still thinks it’s master, but no one else knows it’s there. The slave becomes master and the rest of the world goes along happily. When the master’s network cable gets plugged back in, now both servers think they’re master! In this case, the arbiter steps in and gently informs the master who’s behind in the times that he is now a slave.

You don’t have to set up an arbiter, but we will since it’s good practice:

$ mkdir ~/dbs/arbiter ~/dbs/replica1 ~/dbs/replica2
$ bin/mongod --port 50000 --dbpath ~/dbs/arbiter

Now, in separate terminals, you start each of the replicas:

$ bin/mongod --port 60000 --dbpath ~/dbs/replica1 --pairwith localhost:60001 --arbiter localhost:50000

And then the other one:

$ bin/mongod --port 60001 --dbpath ~/dbs/replica2 --pairwith localhost:60000 --arbiter localhost:50000

After they’ve been running for a bit, try killing (Ctrl-C) one, then restarting it, then killing the other one, back and forth.

For more information on replica pairs, see the core docs.

What’s this? Replica pairs are evolving! *voop* *voop* *voop*

Replica pairs have evolved into… replica sets! Well, okay, they haven’t yet, but they’re coming soon. Then you’ll be able to have an arbitrary number of servers in the auto-failover ring.

Make a new cluster, Buster

For the grand finale, sharding. Sharding is how you distribute data with Mongo. If you don’t know what sharding is, check out my previous post explaining how it works.

First of all, download the latest 1.5.x nightly build from the website. Sharding is changing rapidly, you want the latest and greatest here, not stable.

We’re going to be creating a three-node cluster. So, same as ever, create your database directories. We want one directory for the cluster configuration and three directories for our shards (nodes):

$ mkdir ~/dbs/config ~/dbs/shard1 ~/dbs/shard2 ~/dbs/shard3

The config server keeps track of what’s where, so we need to start that up first:

$ bin/mongod --configsvr --port 70000 --dbpath ~/dbs/config

The mongos is just a request router that runs on top of the config server. It doesn’t even need a data directory, we just tell it where to look for the configuration:

$ bin/mongos --configdb localhost:70000

Note the “s”: the router is called “mongos”, not “mongod”. We haven’t specified a port for it, so it’ll listen on the default port (27017).

Okay! Now, we need to set up our shards. Start these each up in separate terminals:

$ bin/mongod --shardsvr --port 71000 --dbpath ~/dbs/shard1
$ bin/mongod --shardsvr --port 71001 --dbpath ~/dbs/shard2
$ bin/mongod --shardsvr --port 71002 --dbpath ~/dbs/shard3

mongos doesn’t actually know about the shards yet, you need to tell it to add these servers to the cluster. The easiest way is to fire up a mongo shell:

$ bin/mongo
MongoDB shell version: 1.5.1-pre-
url: test
connecting to: test
type "help" for help
>

Now, we add each shard to the cluster:

> db = connect("localhost:70000/admin");
connecting to: localhost:70000
admin
> db.runCommand({addshard : "localhost:71000", allowLocal : true})
{
    "added" : "localhost:71000",
    "ok" : 1
}
> db.runCommand({addshard : "localhost:71001", allowLocal : true})
{
    "added" : "localhost:71001",
    "ok" : 1
}
> db.runCommand({addshard : "localhost:71002", allowLocal : true})
{
    "added" : "localhost:71002",
    "ok" : 1
}
>

mongos expects shards to be on remote machines and by default won’t allow you to add local shards (i.e., shards with “localhost” in the name). Since we’re just playing around, we specify “allowLocal” to override this behavior. (Note that “addshard” IS NOT camel-case, and allowLocal IS camel-case, because we’re consistent like that.)

Congratulations, you’re running a distributed database!

What do you do now? Well, use it just like a normal database! Connect to “localhost:27017” and proceed normally (or, as normally as possible… please report any bugs to our bugtracker!). Try the tutorial (since you’ve already got the shell open) or connect through your favorite driver and play around.

Connecting to mongos should be an identical experience to connecting to a normal Mongo server. Behind the scenes, it splits up your requests/data across the shards so you can concentrate on making your application, not scaling it.


P.S. Obviously, this example setup is full of single points of failure, but that’s completely avoidable. I can go over how to set up distributed MongoDB with zero single points of failure in a follow-up post, if people are interested.

Once and Future Presentations

On Monday, I gave a presentation on MongoDB to the San Francisco MySQL user group.  It was a lot of fun, you can watch the recording on ustream:

http://www.ustream.tv/flash/live/1/3708550Streaming Video by Ustream.TV

Apparently the audio was buzzy (I haven’t actually listened to it myself yet).

The audience especially enjoyed this slide about MySQL’s current situation:

One of the guys told me that he was scrambling to take a picture of it but I went to the next slide too fast, so here it is in all it’s glory.

Thanks to everyone at the MySQL meetup for being so awesome, I had a great time!

Future Talks

April 30th: I’ll be in California again, giving a talk called “Map/reduce, geospatial indexing, and other cool features” at MongoSF

May 18-21: I’ll be in Chicago at Tek·X. I’ll be doing a regular session, “MongoDB for Mobile Applications“, and a tutorial on switching apps from MySQL to MongoDB (assuming no knowledge of MongoDB).