Replica Set Internals Bootcamp Part III: Reconfiguring

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Reconfiguring Prerequisites

One of the goals is to not let you reconfigure yourself into a corner (e.g., end up with all arbiters), so reconfig tries to make sure that a primary could be elected with the new config. Basically, we go through each node, tally up how many votes there will be, and check that a majority of those votes is up (the reconfig logic sends out heartbeats to see which members are reachable).

Also, the member you send the reconfig to has to be able to become primary in the new setup. It doesn’t have to become primary, but its priority has to be greater than 0. So, you can’t have all of the members have a priority of 0.

The reconfig also checks the version number, set name, and that nothing is going to an illegal state (e.g., arbiter-to-non-arbiter, upping the priority on a slave delayed node, and so on).

One thing to note is that you can change hostnames in a reconfig. If you’re using localhost for a single-node set and want to change it to an externally resolvable hostname so you can add some other members, you can just change the member’s hostname from localhost to someHostname and reconfig (so long as someHostname resolves, of course).

Additive Reconfiguration vs. Full Reconfigs

Once the reconfiguration has been checked for correctness, MongoDB checks to see if this is a simple reconfig or a full reconfig. A simple reconfig adds a new node. Anything else is a full reconfig.

A simple reconfig starts a new heartbeat thread for the new member and it’s done.

A full reconfig clears all state. This means that the current primary closes all connections. All the current heartbeat threads are stopped and a new heartbeat thread for each member is started. The old config is replaced by the new config. Then the member formerly known as primary becomes primary again.

We definitely take a scorched-earth approach to reconfiguring. If you are, say, changing the priority of a node from 0 to 1, it would make more sense to change that field than to tear down the whole old config. However, we didn’t want to miss an edge case, so we went with better safe than sorry. Reconfig is considered a “slow” operation anyway, so we’ll generally make the tradeoff of slower and safer.

Propagation of Reconfiguration

Even if you have a node that is behind on replication or slave delayed, reconfiguration will propagate almost immediately. How? New configs are communicated via heartbeat.

Suppose you have 2 nodes, A and B.

You run a reconfig on A, changing the version number from 6 to 7.

B sends a heartbeat request to A, which includes a field stating that B’s version number is 6.

When A gets that heartbeat request, it sees that B’s config version is less than its own, so it sends back its config (at version 7) as part of its heartbeat response.

When B sees that new config, it’ll load it (making the same checks for validity that A did originally) and follow the same procedure described above.
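
The exchange above can be modeled in a few lines. This is a toy sketch, not the server's actual heartbeat code or message format:

```javascript
// A node with a newer config piggybacks it on its heartbeat response; the
// requester validates and loads it (the full-reconfig path described above).
function heartbeatResponse(me, requesterVersion) {
  const response = { ok: 1, version: me.config.version };
  if (requesterVersion < me.config.version) {
    response.config = me.config; // attach the newer config
  }
  return response;
}

function onHeartbeatResponse(me, response) {
  if (response.config && response.config.version > me.config.version) {
    me.config = response.config; // validate + load, as described above
  }
}

const A = { config: { version: 7, members: ["A", "B"] } };
const B = { config: { version: 6, members: ["A", "B"] } };

// B heartbeats A, reporting version 6; A replies with its version-7 config.
onHeartbeatResponse(B, heartbeatResponse(A, B.config.version));
// B is now at version 7, one heartbeat later -- no replication involved
```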

Force reconfig to the face.

Forcing Reconfig

Despite the checks made by reconfig, users sometimes get into a situation where they don’t have a primary. They’d permanently lose a couple servers or a data center and suddenly be stuck with a bunch of secondaries and no way to reconfig. So, in 2.0, we added a force:true option to reconfig, which allowed it to be run on a secondary. That is all that force:true does. Sometimes people complain that force:true wouldn’t let them load an invalid configuration. Indeed, it won’t. force:true does not relax any of the other reconfig constraints. You still have to pass in a valid config. You can just pass it to a secondary.

Why is my version number 6,203,493?

When you force-reconfigure a set, it adds a random (big) number to the version, which can be unnerving. Why does the version number jump by thousands? Suppose that we have a network partition and force-reconfigure the set on both sides of the partition. If we ended up with both sides having a config version of 8 and the set got reconnected, then everyone would assume they were in sync (everyone has a config version of 8, no problems here!) and you’d have half of your nodes with one config and half with another. By adding a random number to the version on reconfig, it’s very probable that one “side” will have a higher version number than the other. When the network is fixed, whichever side has a higher version number will “win” and your set will end up with a consistent config.

It might not end up choosing the config you want, but some config is better than the set puttering along happily with two primaries (or something stupid like that). Basically, if shenanigans happen during a network partition, check your config after the network is healthy again.
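
The version bump is easy to sketch. The exact formula below is made up for illustration; the point is just that the increment is large and random:

```javascript
function forceReconfigVersion(currentVersion) {
  // bump by a large random number, not just +1
  return currentVersion + 1 + Math.floor(Math.random() * 100000);
}

// Both sides of a partition force-reconfig from version 7. With a plain +1
// they'd both land on 8 and never notice they disagree; with a random bump
// a tie is very unlikely, and the higher version wins on reconnect.
const side1 = forceReconfigVersion(7);
const side2 = forceReconfigVersion(7);
const winningVersion = Math.max(side1, side2);
```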

Removing Nodes and Sharding

I’d just like to rant for a second: removing nodes sucks! You’d think it would be so easy, right? Just take the node out of the config and boom, done. It turns out it’s a total nightmare. Not only do you have to stop all of the replication stuff happening on the removed node, you have to stop everything the rest of the set is doing with that node (e.g., syncing from it).

You also have to change the way the removed node reports itself so that mongos won’t try to update a set’s config from a node that’s been removed. And you can’t just shut it down, because people want to be able to play around and do rs.add("foo"); rs.remove("foo"); rs.add("foo"), so you have to be able to entirely shut down the replica set’s interaction with the removed node, but in a way that can be restarted on a dime.

Basically, there are a lot of edge cases around removing nodes, so if you want to be on the safe side, shut down a node before removing it from the set. However, Eric Milkie has done a lot of awesome work on removing nodes for 2.2, so it should be getting better.

Replica Set Internals Bootcamp: Part I – Elections


I’m going to do one subject per post, we’ll see how many I can get through.


The most obvious feature of replica sets is their ability to elect a new primary, so the first thing we’ll cover is this election process.

Replica Set Elections

Let’s say we have a replica set with 3 members: X, Y, and Z. Every two seconds, each server sends out a heartbeat request to the other members of the set. So, if we wait a few seconds, X sends out heartbeats to Y and Z. They respond with information about their current situation: the state they’re in (primary/secondary), if they are eligible to become primary, their current clock time, etc.

X receives this info and updates its “map” of the set: if members have come up or gone down, changed state, and how long the roundtrip took.

At this point, if X’s map changed, X will check a couple of things: if X is primary and a member went down, it will make sure it can still reach a majority of the set. If it cannot, it’ll demote itself to a secondary.


There is one wrinkle with X demoting itself: in MongoDB, writes default to fire-and-forget. Thus, if people are doing fire-and-forget writes on the primary and it steps down, they might not realize X is no longer primary and keep sending writes to it. The secondary-formerly-known-as-primary will be like, “I’m a secondary, I can’t write that!” But because the writes don’t get a response on the client, the client wouldn’t know.

Technically, we could say, “well, they should use safe writes if they care,” but that seems dickish. So, when a primary is demoted, it also closes all connections to clients so that they will get a socket error when they send the next message. All of the client libraries know to re-check who is primary if they get an error. Thus, they’ll be able to find who the new primary is and not accidentally send an endless stream of writes to a secondary.
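
The demotion logic above can be sketched like this. It's illustrative only, not the actual server code:

```javascript
function canReachMajority(me) {
  // ourselves plus every member our heartbeat map says is up
  const up = 1 + me.map.filter((m) => m.up).length;
  return up > (me.map.length + 1) / 2;
}

function onMapChange(me) {
  if (me.state === "PRIMARY" && !canReachMajority(me)) {
    me.state = "SECONDARY";
    // close client connections so fire-and-forget writers get a socket
    // error and go looking for the new primary
    me.clientConnections.forEach((conn) => conn.close());
    me.clientConnections = [];
  }
}

const X = {
  state: "PRIMARY",
  map: [{ host: "Y", up: false }, { host: "Z", up: false }],
  clientConnections: [{ close() {} }],
};

onMapChange(X); // X sees only itself (1 of 3 members): it steps down
```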


Anyway, getting back to the heartbeats: if X is a secondary, it’ll occasionally check if it should elect itself, even if its map hasn’t changed. First, it’ll do a sanity check: does another member think it’s primary? Does X think it’s already primary? Is X ineligible for election? If it fails any of the basic questions, it’ll continue puttering along as is.

If it seems as though a new primary is needed, X will proceed to the first step in election: it sends a message to Y and Z, telling them “I am considering running for primary, can you advise me on this matter?”

When Y and Z get this message, they quickly check their world view. Do they already know of a primary? Do they have more recent data than X? Does anyone they know of have more recent data than X? They run through a huge list of sanity checks and, if everything seems satisfactory, they tentatively reply “go ahead.” If they find a reason that X cannot be elected, they’ll reply “stop the election!”

If X receives any “stop the election!” messages, it cancels the election and goes back to life as a secondary.

If everyone says “go ahead,” X continues with the second (and final) phase of the election process.

For the second phase, X sends out a second message that is basically, “I am formally announcing my candidacy.” At this point, Y and Z make a final check: do all of the conditions that held true before still hold? If so, they allow X to take their election lock and send back a vote. The election lock prevents them from voting for another candidate for 30 seconds.

If one of the checks doesn’t pass the second time around (fairly unusual, at least in 2.0), they send back a veto. If anyone vetoes, the election fails.

Suppose that Y votes for X and Z vetoes X. At that point, Y’s election lock is taken: it cannot vote in another election for 30 seconds. That means that, if Z wants to run for primary, it had better be able to get X’s vote. That said, it should be able to if Z is a viable candidate: it’s not like the members hold grudges (except for Y, for 30 seconds).

If no one vetoes and the candidate receives votes from a majority of the set, the candidate becomes primary.


Feel free to ask questions in the comments below. This is a loving, caring bootcamp (as bootcamps go).

Trying Out Replica Set Priorities

Respect my prioritah!

As promised in an earlier post, replica set priorities for MongoDB are now committed and will be available in 1.9.0, which should be coming out soon.

Priorities allow you to give weights to the servers, saying, “I want server X to be primary whenever possible.” Priorities can range from 0.0-100.0.

To use priorities, download the latest binaries (“Development Release (Unstable) – Nightly”) from the MongoDB site. You have to have all members of the set running 1.9.0- or higher, as older versions object strongly to priorities other than 0 and 1.

Once you have the latest code installed, start up your three servers (A, B, and C) and create the replica set:

> // on server A
> rs.initiate()
> rs.add("B")
> rs.add("C")

Suppose we want B to be the preferred primary, and C is a backup server that could be primary if we really need it. So, we can adjust the priorities:

> config = rs.conf()
> // the default priority is 1
> B = config.members[1]
> B.priority = 2
> C = config.members[2]
> C.priority = .5
> // always increment the version number when reconfiguring
> config.version++
> rs.reconfig(config)

In a few seconds, A will step down and B will take over as primary.

Protip: the actual values of the priorities don’t matter, it’s just the relative values: B > A > C. B could have a priority of 100 and C could have a priority of .00001 and the set would behave exactly the same.


Some questions I’ve gotten (based on coworkers + the 12 hours it’s been committed):

What if A steps down and B is behind? Won’t I lose data?

No. A will only step down if B is within 10 seconds of synced. Once A steps down, B will sync until it’s up-to-date and (only then) become master.

Okay, but I want B to be primary now. Can I force that?

Yes, now-ish. Run this on A:

> db.adminCommand({replSetStepDown : 300, force : true})

This forces A to step down immediately (for 300 seconds). B will sync to it until it is up-to-date, then become primary.

I forgot to upgrade one of my servers before setting crazy priorities and now it’s complaining. What do I do?

Shut it down and restart it with 1.9.0. Setting non-1/0 priorities won’t harm anything with earlier versions, it just won’t work.

So, when should I use votes?

Almost never! Please ignore votes (in all versions, not just 1.9.0), unless: 1) you’re trying to have a replica set with more than 7 members or 2) you want to wedge your replica set into a primary-less state.

Implementing Replica Set Priorities

Replica set priorities will, very shortly, be allowed to vary between 0.0 and 100.0. The member with the highest priority that can reach a majority of the set will be elected master. (The change is done and works, but is being held up by 1.8.0… look for it after that release.) Implementing priorities was kind of an interesting problem, so I thought people might be interested in how it works. Following in the grand distributed system lit tradition I’m using the island nation of Replicos to demonstrate.

Replicos is a nation of islands that elect a primary island, called the Grand Poobah, to lead them. Each island casts a vote (or votes) and the island that gets a majority of the votes wins poobahship. If no one gets a majority (out of the total number of votes), no one becomes Grand Poobah. The islands can only communicate via carrier seagull.

Healthy configurations of islands, completely connected via carrier seagulls.

However, due to a perpetual war with the neighboring islands of Entropos, seagulls are often shot down mid-flight, disrupting communications between the Replicos islands until new seagulls can be trained.

The people of Replicos have realized that some islands are better at Poobah-ing than others. Their goal is to elect the island with the highest Poobah-ing ability that can reach a majority of the other islands. If all of the seagulls can make it to their destinations and back, electing a Poobah becomes trivial: an island sends a message saying they want to be Grand Poobah and everyone votes for them or says “I’m better at Poobah-ing, I should be Poobah.” However, it becomes tricky when you throw the Entropos Navy into the mix.

So, let’s say Entropos has shot down a bunch of seagulls, leaving us with only three seagulls:

The island with .5 Poobah ability should be elected leader (the island with 1 Poobah ability can’t reach a majority of the set). But how can .5 know that it should be Poobah? It knows 1.0 exists, so theoretically it could ask the islands it can reach to ask 1.0 if it wants to be Poobah, but it’s a pain to pass messages through multiple islands (takes longer, more chances of failure, a lot more edge cases to check), so we’d like to be able to elect a Poobah using only the directly reachable islands, if possible.

One possibility might be for the islands to send a response indicating whether they are connected to an island with a higher Poobah ability. In the case above, this would work (only one island is connected to an island with higher Poobah ability, so it can’t have a majority), but what about this case:

Every island, other than .5, is connected to a 1.0, but .5 should be the one elected! So, suppose we throw in a bit more information (which island of higher priority can be reached) and let the island trying to elect itself figure things out? Well, that doesn’t quite work, what if both .5 and 1.0 can reach a majority, but not the same one?

Conclusion: the Poobah-elect can’t figure this out on their own, everyone needs to work together.

Preliminaries: define an island to be Poohable if it has any aptitude for Poobah-ing and can reach a majority of the set. An island is not Poohable if it has no aptitude for Poobah-ing and/or cannot reach a majority of the set. Islands can be more or less Poohable, depending on their aptitude for Poobah-ing.

Every node knows whether or not it, itself, is Poohable: it knows its leadership abilities and if it can reach a majority of the islands. If more than one island (say islands A and B) is Poohable, then there must be at least one island that can reach both A and B [Proof at the bottom].

Let’s have each island keep a list of “possible Poobahs.” So, say we have an island A, that starts out with an empty list. If A is Poohable, it’ll add itself to the list (if it stops being Poohable, it’ll remove itself from the list). Now, whenever A communicates with another island, the other island will either say “add me to your list” or “remove me from your list,” depending on whether it is currently Poohable or not. Every other island does the same, so now each island has a list of the Poohable islands it can reach.

Now, say island X tries to elect itself master. It contacts all of the islands it can reach for votes. Each of the voting islands checks its list: if it has an entry on it that is more Poohable than X, it’ll send a veto. Otherwise X can be elected master. If you check the situations above (and any other situation) you can see that Poohability works, due to the strength of the guarantee that a Poobah must be able to reach a majority of the set.
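
The voting rule can be sketched in island terms. This is illustrative only; the real implementation tracks much more state:

```javascript
// A voter vetoes a candidate if anything on its "possible Poobahs" list
// out-Poobahs the candidate.
function shouldVeto(voter, candidate) {
  return voter.possiblePoobahs.some((isle) => isle.ability > candidate.ability);
}

const strong = { name: "1.0", ability: 1.0 };
const weak = { name: "0.5", ability: 0.5 };
const voter = { name: "0.0", ability: 0, possiblePoobahs: [] };

// Case 1: strong can't reach a majority, so it never told this voter to
// list it; only weak is a possible Poobah.
voter.possiblePoobahs = [weak];
const electWeak = !shouldVeto(voter, weak); // true: weak can be elected

// Case 2: strong is Poohable and reachable from this voter.
voter.possiblePoobahs = [weak, strong];
const vetoed = shouldVeto(voter, weak); // true: strong should be Poobah instead
```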

Proof: suppose a replica set has n members and a node A can reach a majority of the set (at least ⌊n/2+1⌋) and a node B can reach a majority of the set (again, ⌊n/2+1⌋). If the sets of members A and B can reach are disjoint, then there must be ⌊n/2+1⌋+⌊n/2+1⌋ = at least n+1 members in the set. Therefore the set of nodes that A can reach and the set of nodes that B can reach are not disjoint.

Resizing Your Oplog

The MongoDB replication oplog is, by default, 5% of your free disk space. The theory behind this is that, if you’re writing 5% of your disk space every x amount of time, you’re going to run out of disk in 19x time. However, this doesn’t hold true for everyone; sometimes you’ll need a larger oplog. Some common cases:

  • Applications that delete almost as much data as they create.
  • Applications that do lots of in-place updates, which consume oplog entries but not disk space.
  • Applications that do lots of multi-updates or remove lots of documents at once. These multi-document operations have to be “exploded” into separate entries for each document in the oplog, so that the oplog remains idempotent.

If you fall into one of these categories, you might want to think about allocating a bigger oplog to start out with. (Or, if you have a read-heavy application that only does a few writes, you might want a smaller oplog.) However, what if your application is already running in production when you realize you need to change the oplog size?

Usually if you’re having oplog size problems, you want to change the oplog size on the master. To change its oplog, we need to “quarantine” it so it can’t reach the other members (and your application), change the oplog size, then un-quarantine it.

To start the quarantine, shut down the master. Restart it without the --replSet option on a different port. So, for example, if I was starting MongoDB like this:

$ mongod --replSet foo # default port

I would restart it with:

$ mongod --port 10000

Replica set members look at the last entry of the oplog to see where to start syncing from. So, we want to do the following:

  1. Save the latest insert in the oplog.
  2. Resize the oplog
  3. Put the entry we saved in the new oplog.

So, the process is:

1. Save the latest insert in the oplog.

> use local
switched to db local
> // "i" is short for "insert"
> var last ={op : "i"}).sort(
... {$natural : -1}).limit(1).next()

Note that we are saving the last insert here. If there have been other operations since that insert (deletes, updates, commands), that’s fine, the oplog is designed to be able to replay ops multiple times. We don’t want to use deletes or updates as a checkpoint because those could have $s in their keys, and $s cannot be inserted into user collections.

2. Resize the oplog

First, back up the existing oplog, just in case:

$ mongodump --db local --collection '' --port 10000

Drop the collection, and recreate it to be the size that you want:

> // size is in bytes
> db.runCommand({create : "", capped : true, size : 1900000})
{ "ok" : 1 }

3. Put the entry we saved in the new oplog.

> // "last" is the entry we saved in step 1
>

Making this server primary again

Now shut down the database and start it up again with --replSet on the correct port. Once it is a secondary, connect to the current primary and ask it to step down so you can have your old primary back (in 1.9+, you can use priorities to force a certain member to be preferentially primary and skip this step: it’ll automatically switch back to being primary ASAP).

> rs.stepDown(10000)
// you'll get some error messages because reconfiguration 
// causes the db to drop all connections

Your oplog is now the correct size.

Edit: as Graham pointed out in the comments, you should do this on each machine that could become primary.

How to Use Replica Set Rollbacks

Rollin' rollin' rollin', keep that oplog rollin'

If you’re using replica sets, you can get into a situation where you have conflicting data. MongoDB will roll back conflicting data, but it never throws it out.

Let’s take an example, say you have three servers: A (arbiter), B, and C. You initialize A, B, and C:

$ mongo B:27017/foo
> rs.initiate()
> rs.add("C:27017")
{ "ok" : 1 }
> rs.addArb("A:27017")
{ "ok" : 1 }

Now do a couple of writes to the master (say it’s B).

> B = connect("B:27017/foo")
>{_id : 1})
>{_id : 2})
>{_id : 3})

Then C gets disconnected (if you’re trying this out, you can just hit Ctrl-C—in real life, this might be caused by a network partition). B handles some more writes:

>{_id : 4})
>{_id : 5})
>{_id : 6})

Now B gets disconnected. C gets reconnected and the arbiter elects it master, so it starts handling writes.

> C = connect("C:27017/foo")
>{_id : 7})
>{_id : 8})
>{_id : 9})

But now B gets reconnected. B has data that C doesn’t have and C has data that B doesn’t have! What to do? MongoDB chooses to roll back B’s data, since it’s “further behind” (B’s latest timestamp is before C’s latest timestamp).

If we query the databases after the millisecond or so it takes to roll back, they’ll be the same:

{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
{ "_id" : 7 }
{ "_id" : 8 }
{ "_id" : 9 }
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
{ "_id" : 7 }
{ "_id" : 8 }
{ "_id" : 9 }

Note that the data B wrote and C didn’t is gone. However, if you look in B’s data directory, you’ll see a rollback directory:

$ ls /data/db
journal  local.0  local.1  local.ns  mongod.lock  rollback  foo.0  foo.1  foo.ns  _tmp
$ ls /data/db/rollback

If you look in the rollback directory, there will be a file for each rollback MongoDB has done. You can examine what was rolled back with the bsondump utility (comes with MongoDB):

$ bsondump
{ "_id" : 4 }
{ "_id" : 5 }
{ "_id" : 6 }
Wed Jan 19 13:33:32      3 objects found

If these won’t conflict with your existing data, you can add them back to the collection with mongorestore.

$ mongorestore -d foo -c bar 
connected to:
Wed Jan 19 13:36:27
Wed Jan 19 13:36:27      going into namespace []
Wed Jan 19 13:36:27      3 objects found

Note that you need to specify -d foo and -c bar to get it into the correct collection. If it would conflict, you could restore it into another collection and do a more delicate merge operation.

Now, if you do a find, you’ll get all of the documents:

{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
{ "_id" : 7 }
{ "_id" : 8 }
{ "_id" : 9 }
{ "_id" : 4 }
{ "_id" : 5 }
{ "_id" : 6 }

Hopefully this sort of thing can tide most people over until MongoDB supports multi-master.

Replication Internals

Displacer beast... seemed related (it's sort of in two places at the same time).

This is the first in a three-part series on how replication works.

Replication gives you hot backups, read scaling, and all sorts of other goodness. If you know how it works you can get a lot more out of it, from how it should be configured to what you should monitor to using it directly in your applications. So, how does it work?

MongoDB’s replication is actually very simple: the master keeps a collection that describes writes and the slaves query that collection. This collection is called the oplog (short for “operation log”).

The oplog

Each write (insert, update, or delete) creates a document in the oplog collection, so long as replication is enabled (MongoDB won’t bother keeping an oplog if replication isn’t on). So, to see the oplog in action, start by running the database with the --replSet option:

$ ./mongod --replSet funWithOplogs

Now, when you do operations, you’ll be able to see them in the oplog. Let’s start out by initializing our replica set:

> rs.initiate()

Now if we query the oplog you’ll see this operation:

> use local
switched to db local
>
{
    "ts" : { "t" : 1286821527000, "i" : 1 },
    "h" : NumberLong(0),
    "op" : "n",
    "ns" : "",
    "o" : { "msg" : "initiating set" }
}

This is just an informational message for the slave, it isn’t a “real” operation. Breaking this down, it contains the following fields:

  • ts: the time this operation occurred.
  • h: a unique ID for this operation. Each operation will have a different value in this field.
  • op: the write operation that should be applied to the slave. n indicates a no-op, this is just an informational message.
  • ns: the database and collection affected by this operation. Since this is a no-op, this field is left blank.
  • o: the actual document representing the op. Since this is a no-op, this field is pretty useless.
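
A slave applying these entries can be sketched by switching on the op field. This is a minimal illustration (using a Map as a stand-in collection); real oplog application is far more involved:

```javascript
function applyOp(collection, entry) {
  switch (entry.op) {
    case "i": // insert the document in "o"
      collection.set(entry.o._id, entry.o);
      break;
    case "u": { // "o2" selects the doc, "o" holds the modifications
      const doc = collection.get(entry.o2._id);
      if (doc && entry.o.$set) Object.assign(doc, entry.o.$set);
      break;
    }
    case "d": // delete by the criteria in "o"
      collection.delete(entry.o._id);
      break;
    case "n": // no-op: informational only
      break;
  }
}

const coll = new Map();
applyOp(coll, { op: "i", ns: "", o: { _id: 1, x: 1 } });
applyOp(coll, { op: "u", ns: "", o2: { _id: 1 }, o: { $set: { y: 1 } } });
applyOp(coll, { op: "d", ns: "", o: { _id: 1 } });
// replaying the delete is harmless -- oplog entries are idempotent
applyOp(coll, { op: "d", ns: "", o: { _id: 1 } });
```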

To see some real oplog messages, we’ll need to do some writes. Let’s do a few simple ones in the shell:

> use test
switched to db test
>{x:1}, {$set : {y:1}})
>{x:2}, {$set : {y:1}}, true)

Now look at the oplog:

> use local
switched to db local
>
{ "ts" : { "t" : 1286821527000, "i" : 1 }, "h" : NumberLong(0), "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : { "t" : 1286821977000, "i" : 1 }, "h" : NumberLong("1722870850266333201"), "op" : "i", "ns" : "", "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d"), "x" : 1 } }
{ "ts" : { "t" : 1286821984000, "i" : 1 }, "h" : NumberLong("1633487572904743924"), "op" : "u", "ns" : "", "o2" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") }, "o" : { "$set" : { "y" : 1 } } }
{ "ts" : { "t" : 1286821993000, "i" : 1 }, "h" : NumberLong("5491114356580488109"), "op" : "i", "ns" : "", "o" : { "_id" : ObjectId("4cb3586928ce78a2245fbd57"), "x" : 2, "y" : 1 } }
{ "ts" : { "t" : 1286821996000, "i" : 1 }, "h" : NumberLong("243223472855067144"), "op" : "d", "ns" : "", "b" : true, "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") } }

You can see that each operation now has an ns: “”. There are also three operations represented (the op field), corresponding to the three types of writes mentioned earlier: i for inserts, u for updates, and d for deletes.

The o field now contains the document to insert or the criteria to update and remove. Notice that, for the update, there are two o fields (o and o2). o2 gives the update criteria and o gives the modifications (equivalent to update()’s second argument).

Using this information

MongoDB doesn’t yet have triggers, but applications could hook into this collection if they’re interested in doing something every time a document is deleted (or updated, or inserted, etc.) Part three of this series will elaborate on this idea.

Next up: what the oplog is and how syncing works.

Return of the Mongo Mailbag

On the mongodb-user mailing list last week, someone asked (basically):

I have 4 servers and I want two shards. How do I set it up?

A lot of people have been asking questions about configuring replica sets and sharding, so here’s how to do it in nitty-gritty detail.

The Architecture

Prerequisites: if you aren’t too familiar with replica sets, see my blog post on them. The rest of this post won’t make much sense unless you know what an arbiter is. Also, you should know the basics of sharding.

Each shard should be a replica set, so we’ll need two replica sets (we’ll call them foo and bar). We want our cluster to be okay if one machine goes down or gets separated from the herd (network partition), so we’ll spread out each set among the available machines. Replica sets are color-coded and machines are imaginatively named server1-4.

Each replica set has two hosts and an arbiter. This way, if a server goes down, no functionality is lost (and there won’t be two masters on a single server).

To set this up, run:


On server1:

$ mkdir -p ~/dbs/foo ~/dbs/bar
$ ./mongod --dbpath ~/dbs/foo --replSet foo
$ ./mongod --dbpath ~/dbs/bar --port 27019 --replSet bar --oplogSize 1


On server2:

$ mkdir -p ~/dbs/foo
$ ./mongod --dbpath ~/dbs/foo --replSet foo


On server3:

$ mkdir -p ~/dbs/foo ~/dbs/bar
$ ./mongod --dbpath ~/dbs/foo --port 27019 --replSet foo --oplogSize 1
$ ./mongod --dbpath ~/dbs/bar --replSet bar


On server4:

$ mkdir -p ~/dbs/bar
$ ./mongod --dbpath ~/dbs/bar --replSet bar

Note that arbiters have an oplog size of 1. By default, oplog size is ~5% of your hard disk, but arbiters don’t need to hold any data so that’s a huge waste of space.

Putting together the replica sets

Now, we’ll start up our two replica sets. Start the mongo shell and type:

> db = connect("server1:27017/admin")
connecting to: server1:27017
> rs.initiate({"_id" : "foo", "members" : [
... {"_id" : 0, "host" : "server1:27017"},
... {"_id" : 1, "host" : "server2:27017"},
... {"_id" : 2, "host" : "server3:27019", arbiterOnly : true}]})
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}
> db = connect("server3:27017/admin")
connecting to: server3:27017
> rs.initiate({"_id" : "bar", "members" : [
... {"_id" : 0, "host" : "server3:27017"},
... {"_id" : 1, "host" : "server4:27017"},
... {"_id" : 2, "host" : "server1:27019", arbiterOnly : true}]})
{
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}

Okay, now we have two replica sets running. Let’s create a cluster.

Setting up Sharding

Since we’re trying to set up a system with no single points of failure, we’ll use three configuration servers. We can have as many mongos processes as we want (one on each appserver is recommended), but we’ll start with one.


On server2:

$ mkdir ~/dbs/config
$ ./mongod --dbpath ~/dbs/config --port 20000


On server3:

$ mkdir ~/dbs/config
$ ./mongod --dbpath ~/dbs/config --port 20000


On server4:

$ mkdir ~/dbs/config
$ ./mongod --dbpath ~/dbs/config --port 20000
$ ./mongos --configdb server2:20000,server3:20000,server4:20000 --port 30000

Now we’ll add our replica sets to the cluster. Connect to the mongos and run the addshard command:

> mongos = connect("server4:30000/admin")
connecting to: server4:30000
> mongos.runCommand({addshard : "foo/server1:27017,server2:27017"})
{ "shardAdded" : "foo", "ok" : 1 }
> mongos.runCommand({addshard : "bar/server3:27017,server4:27017"})
{ "shardAdded" : "bar", "ok" : 1 }

Edit: you must list all of the non-arbiter hosts in the set for now. This is very lame, because given one host, mongos really should be able to figure out everyone in the set, but for now you have to list them.

Tada! As you can see, you end up with one “foo” shard and one “bar” shard. (I actually added that functionality on Friday, so you’ll have to download a nightly to get the nice names. If you’re using an older version, your shards will have the thrilling names “shard0000” and “shard0001”.)

Now you can connect to “server4:30000” in your application and use it just like a “normal” mongod. If you want to add more mongos processes, just start them up with the same configdb parameter used above.

Sharding and Replica Sets Illustrated

This post assumes you know what replica sets and sharding are.

Step 1: Don’t use sharding

Seriously. Almost no one needs it. If you were at the point where you needed to partition your MySQL database, you’ve probably got a long ways to go before you’ll need to partition MongoDB (we scoff at billions of rows).

Run MongoDB as a replica set. When you really need the extra capacity then, and only then, start sharding. Why?

  1. You have to choose a shard key. If you know the characteristics of your system before you choose a shard key, you can save yourself a world of pain.
  2. Sharding adds complexity: you have to keep track of more machines and processes.
  3. Premature optimization is the root of all evil. If your application isn’t running fast, is it CPU-bound or network-bound? Do you have too many indexes? Too few? Are they being hit by your queries? Check (at least) all of these causes, first.

Using Sharding

A shard is defined as one or more servers with one master. Thus, a shard could be a single mongod (bad idea), a master-slave setup (better idea), or a replica set (best idea).

Let’s say we have three shards and each one is a replica set. For three shards, you’ll want a minimum of 3 servers (the general rule is: minimum of N servers for N shards). We’ll do the bare minimum on replica sets, too: a master, a slave, and an arbiter for each set.

Mugs are MongoDB processes. So, we have three replica sets:

"teal", "green", and "blue" replica sets

“M” stands for “master,” “S” stands for “slave,” and “A” stands for “arbiter.” We also have config servers:

A config server

…and mongos processes:

A mongos process

Now, let’s stick these processes on servers (serving trays are servers). Each master needs to do a lot, so let’s give each primary its own server.

Now we can put a slave and arbiter on each box, too.

Note how we mix things up: no replica set is housed on a single server, so that if a server goes down, the set can fail over to a different server and be fine.
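The "mix things up" rule can be sketched as a simple round-robin assignment. This is purely illustrative Python (not something MongoDB does for you); the shard colors and roles are the ones from the pictures:

```python
# Stagger each set's members across the servers so that losing one
# server takes out at most one member of any set (illustrative only).
shards = ["teal", "green", "blue"]
roles = ["master", "slave", "arbiter"]
servers = {i: [] for i in range(3)}

for s, shard in enumerate(shards):
    for r, role in enumerate(roles):
        # shift each role by the shard index so sets spread out
        servers[(s + r) % 3].append((shard, role))
```

With this shift, every server ends up hosting one master, one slave, and one arbiter, each from a different set.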

Now we can add the three configuration servers and two mongos processes. mongos processes are usually put on the appserver, but they’re pretty lightweight so we’ll stick a couple on here.

A bit crowded, but possible!

In case of emergency…

Let’s say we drop a tray. CRASH! With this setup, your data is safe (as long as you were using w to wait for replication) and the cluster loses no functionality (in terms of reads and writes).

Chunks will not be able to migrate (because one of the config servers is down), so a shard may become bloated if the config server is down for a very long time.

Network partitions and losing two servers are bigger problems, so you should have more than three servers if you actually want great availability.

Let’s start and configure all 14 processes at once!

Or not.  I was going to go through the command to set this whole thing up but… it’s really long and finicky and I’ve already done it in other posts. So, if you’re interested, check out my posts on setting up replica sets and sharding.

Combining the two is left as an exercise for the reader.

Part 3: Replica Sets in the Wild

A primary with 8 secondaries

This post assumes that you know what replica sets are and some of the basic syntax.
In part 1, we set up a replica set from scratch, but real life is messier: you might want to migrate dev servers into production, add new slaves, prioritize servers, change things on the fly… that’s what this post covers.

Before we get started…

Replica sets don’t like localhost. They’re willing to go along with it… kinda, sorta… but it often causes issues. You can get around these issues by using the hostname instead. On Linux, you can find your hostname by running the hostname command:

$ hostname

On Windows, you have to download Linux or something. I haven’t really looked into it.

From here on out, I’ll be using my hostname instead of localhost.

Starting up with Data

This is pretty much the same as starting up without data, except you should back up your data before you get started (you should always back up your data before you mess around with your server configuration).

If, pre-replica-set, you were starting your server with something like:

$ ./mongod

…to turn it into the first member of a replica set, you’d shut it down and start it back up with the --replSet option:

$ ./mongod --replSet unicomplex

Now, initialize the set with the one server (so far):

> rs.initiate()
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1

Adding Slaves

You should always run MongoDB with slaves, so let’s add some.

Start your slave with the usual options you use, as well as --replSet. So, for example, we could do:

$ ./mongod --dbpath ~/dbs/slave1 --port 27018 --replSet unicomplex

Now, we add this slave to the replica set. Make sure db is connected to wooster:27017 (the primary server) and run:

> rs.add("wooster:27018")
{"ok" : 1}

Repeat as necessary to add more slaves.

Adding an Arbiter

This is very similar to adding a slave. In 1.6.x, when you start up the arbiter, you should give it the option --oplogSize 1. This way the arbiter won’t be wasting any space. (In 1.7.4+, the arbiter will not allocate an oplog automatically.)

$ ./mongod --dbpath ~/dbs/arbiter --port 27019 --replSet unicomplex --oplogSize 1

Now add it to the set. You can specify that this server should be an arbiter by calling rs.addArb:

> rs.addArb("wooster:27019")
{"ok" : 1}

Demoting a Primary

Suppose our company has the following servers available:

  1. Gazillion dollar super machine
  2. EC2 instance
  3. iMac we found on the street

Through an accident of fate, the iMac becomes primary. We can force it to become a slave by running the step down command:

> imac = connect("")
connecting to:
> imac.runCommand({"replSetStepDown" : 1})
{"ok" : 1}

Now the iMac will be a slave.

Setting Priorities

It’s likely that we never want the iMac to be a master (we’ll just use it for backup). You can force this by setting its priority to 0. The higher a server’s priority, the more likely it is to become master if the current master fails. Right now, the only options are 0 (can’t be master) or 1 (can be master), but in the future you’ll be able to have a nice gradation of priorities.
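In other words, the election rule at this point is binary. A tiny Python sketch of it (the hostnames are stand-ins for the three machines above):

```python
# Sketch: with 1.6-era priorities (0 or 1), a priority-0 member can
# never become master; any priority-1 member can.
def electable(member):
    return member.get("priority", 1) > 0

members = [
    {"host": "supermachine", "priority": 1},
    {"host": "ec2",          "priority": 1},
    {"host": "imac",         "priority": 0},
]
candidates = [m["host"] for m in members if electable(m)]
```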

So, let’s get into the nitty-gritty of replica sets and change the iMac’s priority to 0. To change the configuration, we connect to the master and edit its configuration:

> config = rs.conf()
{
        "_id" : "unicomplex",
        "version" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "host" : ""
                },
                {
                        "_id" : 1,
                        "host" : ""
                },
                {
                        "_id" : 2,
                        "host" : ""
                }
        ]
}

Now, we have to do two things: 1) set the iMac’s priority to 0 and 2) update the configuration version. The new version number is always the old version number plus one. (It’s 1 right now so the next version is 2. If we change the config again, it’ll be 3, etc.)

> config.members[2].priority = 0
> config.version += 1

Finally, we tell the replica set that we have a new configuration for it.

> use admin
switched to db admin
> db.runCommand({"replSetReconfig" : config})
{"ok" : 1}

All configuration changes must happen on the master. They are propagated out to the slaves from there. Now you can kill any server and the iMac will never become master.
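Putting the pieces together, here is a hedged Python sketch of the sanity checks replSetReconfig runs on a proposed config. The real server does more than this (for one thing, it heartbeats the members to make sure a majority of the new config is reachable):

```python
# Simplified sketch of reconfig validation: same set name, strictly
# increasing version, and at least one member able to become master.
def check_reconfig(old, new):
    if new["_id"] != old["_id"]:
        raise ValueError("set name cannot change")
    if new["version"] <= old["version"]:
        raise ValueError("version must increase")
    if not any(m.get("priority", 1) > 0 for m in new["members"]):
        raise ValueError("someone must be able to become master")

old = {"_id": "unicomplex", "version": 2,
       "members": [{"_id": 0, "host": "a"}, {"_id": 1, "host": "b"}]}
new = {"_id": "unicomplex", "version": 3,
       "members": [{"_id": 0, "host": "a"},
                   {"_id": 1, "host": "b", "priority": 0}]}
check_reconfig(old, new)  # fine: version bumped, member 0 still electable
```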

This configuration stuff is a bit finicky to do from the shell right now. In the future, most people will probably just use a GUI to configure their sets and mess with server settings.

Next up: how to hook this up with sharding to get a fault-tolerant distributed database.