Sharding and Replica Sets Illustrated

This post assumes you know what replica sets and sharding are.

Step 1: Don’t use sharding

Seriously. Almost no one needs it. If you were at the point where you needed to partition your MySQL database, you’ve probably got a long ways to go before you’ll need to partition MongoDB (we scoff at billions of rows).

Run MongoDB as a replica set. When you really need the extra capacity then, and only then, start sharding. Why?

  1. You have to choose a shard key. If you know the characteristics of your system before you choose a shard key, you can save yourself a world of pain.
  2. Sharding adds complexity: you have to keep track of more machines and processes.
  3. Premature optimization is the root of all evil. If your application isn’t running fast, is it CPU-bound or network-bound? Do you have too many indexes? Too few? Are they being hit by your queries? Check (at least) all of these causes first; see the sketch below.
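
For example, a quick index sanity check in the mongo shell looks something like this (the database, collection, and field names here are made up):

    // in the mongo shell, switch to your database
    use mydb

    // what indexes exist on this collection?
    db.users.getIndexes()

    // is this query actually using an index? In the explain() output, a
    // "BtreeCursor ..." cursor means an index was used; "BasicCursor"
    // means a full collection scan.
    db.users.find({email: "someone@example.com"}).explain()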

Using Sharding

A shard is defined as one or more servers with one master. Thus, a shard could be a single mongod (bad idea), a master-slave setup (better idea), or a replica set (best idea).

Let’s say we have three shards and each one is a replica set. For three shards, you’ll want a minimum of 3 servers (the general rule is: minimum of N servers for N shards). We’ll do the bare minimum on replica sets, too: a master, a slave, and an arbiter for each set.
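
As a rough sketch (the hostnames, ports, and paths below are made up), one of these minimal sets could be brought up by starting three mongods with --replSet on three different servers and then initiating the set from a shell connected to one of them:

    // mongod --replSet teal --port 27017 --dbpath /data/teal   (on server1, will be master)
    // mongod --replSet teal --port 27017 --dbpath /data/teal   (on server2, slave)
    // mongod --replSet teal --port 27017 --dbpath /data/teal   (on server3, arbiter)

    rs.initiate({
        _id: "teal",
        members: [
            {_id: 0, host: "server1:27017"},
            {_id: 1, host: "server2:27017"},
            {_id: 2, host: "server3:27017", arbiterOnly: true}
        ]
    })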

Mugs are MongoDB processes. So, we have three replica sets:

"teal", "green", and "blue" replica sets

“M” stands for “master,” “S” stands for “slave,” and “A” stands for “arbiter.” We also have config servers:

[Image: a config server]

…and mongos processes:

[Image: a mongos process]

Now, let’s stick these processes on servers (serving trays are servers). Each master needs to do a lot, so let’s give each master its own server.

Now we can put a slave and arbiter on each box, too.

Note how we mix things up: no replica set is housed on a single server, so that if a server goes down, the set can fail over to a different server and be fine.

Now we can add the three configuration servers and two mongos processes. mongos processes are usually put on the app server, but they’re pretty lightweight so we’ll stick a couple on here.

A bit crowded, but possible!

In case of emergency…

Let’s say we drop a tray. CRASH! With this setup, your data is safe (as long as you were using w) and the cluster loses no functionality (in terms of reads and writes).
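
(“Using w” refers to the w option on getLastError: you can ask that a write be replicated to w servers before it is acknowledged. A minimal shell sketch, with a made-up collection:)

    db.logs.insert({msg: "important"})
    // block until at least 2 members of the shard's replica set have this
    // write, or until 5 seconds have passed
    db.runCommand({getLastError: 1, w: 2, wtimeout: 5000})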

Chunks will not be able to migrate (because one of the config servers is down), so a shard may become bloated if the config server is down for a very long time.

Network partitions and losing two servers are bigger problems, so you should have more than three servers if you actually want great availability.

Let’s start and configure all 14 processes at once!

Or not. I was going to go through the commands to set this whole thing up but… they’re really long and finicky and I’ve already done it in other posts. So, if you’re interested, check out my posts on setting up replica sets and sharding.

Combining the two is left as an exercise for the reader.
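
If you just want the general shape, though, here is a heavily abbreviated sketch (made-up hostnames, most flags omitted): start the shard mongods, config servers, and mongos processes, initiate each replica set as above, and register each set as a shard from a shell connected to a mongos.

    // on the appropriate servers (one line per process, flags abbreviated):
    //   mongod --shardsvr --replSet teal --port 10001 --dbpath /data/teal
    //   mongod --configsvr --port 20001 --dbpath /data/config
    //   mongos --configdb cfg1:20001,cfg2:20002,cfg3:20003 --port 27017

    // then, from a shell connected to a mongos:
    use admin
    db.runCommand({addshard: "teal/server1:10001"})
    db.runCommand({addshard: "green/server2:10001"})
    db.runCommand({addshard: "blue/server3:10001"})

    // finally, shard something (database, collection, and key are made up):
    db.runCommand({enablesharding: "mydb"})
    db.runCommand({shardcollection: "mydb.users", key: {email: 1}})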

20 thoughts on “Sharding and Replica Sets Illustrated”

  1. Why shard? Because a billion-row MongoDB collection would have a 45 GB _id index, and it can be tricky to get budget approval for two servers with 96+ GB of RAM each. 🙂

    For me, sharding looks very attractive, because if I can split my 150 million rows/day across multiple servers, I can keep more days online. My two indexes add up to 23.4 GB on a 24 GB machine, so splitting it across three shards would give me excellent performance for two days of data, or somewhat-improved performance for three (depending on how well my shard key distributed the insert load).

    -j

  2. I think it’s fine for people to use sharding if the database has become the bottleneck (that’s why it’s there). I’m just trying to discourage people from leaping to it as soon as they start using MongoDB. One person emailed the list who was trying to set up sharding and didn’t know how to do an update!

  3. I am pretty impressed by Replica Sets but I think I am going to wait for some time until they become more mature. We had a few serious issues, especially related to halted replication, with Replica Pairs, and we eventually switched (and are still switching) to traditional Master-Slave Replication.

    Btw, thanks for the posts!

  4. Yeah, replica sets aren’t perfect. We tried to test the hell out of them but I can see not wanting to run in production with them until 1.6.2 or so. You might want to use them in dev (since you’ll be using them in production eventually, hopefully) and just report any problems you find.

  5. Oh, sure; I just thought it was fair to point out that scoffing at billions of rows requires some pretty serious hardware, given MongoDB’s “very strong desire” to have all indexes fit in physical RAM. I actually managed to get up to a billion rows on my server with only 24GB and it wasn’t completely awful, but it was obviously not something I wanted to keep doing. Trimming back to 150 million improved the performance and justified me saying “if you want more data online for analytics, buy me four more servers so I can convert to a cluster”.

    -j

  6. Sort of… MongoDB slows way down if you don’t have the working set of indexes in memory. So, if you’re doing realtime analytics and you only care about the last week or so, you only have to fit a week of records in memory, even if you have billions of records. If you need to access all billion in a somewhat random fashion, then yes, you’ll need some pretty crazy hardware to do that on one machine!

    Honestly, “MongoDB scoffs at billions of rows” was just a turn of phrase I liked from Buffy the Vampire Slayer (“A watcher scoffs at gravity”). A more accurate phrase would be, “MongoDB scoffs at billions of rows under certain circumstances with a properly designed database.”

  7. It’d be cool if you did a post on choosing a sensible sharding regime for a write-heavy application (like logging). Say I’ve got tons of timestamped log data streaming in… using the timestamp as my shard key will get it distributed but is going to keep slamming the last shard, right? Adding a secondary key (like url, say) might help, but what if there are only a few of those? I’d love some suggestions.
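
    (Purely as an illustration, not a recommendation from the post: a hypothetical compound shard key along those lines would be declared something like this, from a mongos, on the admin database.)

        // enable sharding on the (made-up) database, then shard the collection
        // on a coarser field first and the timestamp second, so inserts spread
        // across shards instead of all landing on the "newest" one
        db.runCommand({enablesharding: "logs"})
        db.runCommand({shardcollection: "logs.entries", key: {url: 1, ts: 1}})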

  8. To avoid the extra hassle of sharding and stay with a simple setup (a replica set or sets): would it be possible to apply some kind of load balancing to the set, perhaps in a round-robin manner?

    I suppose that adding replica members and enabling slaveOk() for each of those members would increase throughput/the ability to respond to requests, except for writes, which are still handled by the primary member?

    1. Yes, and this load balancing is gradually being built into the drivers (the Java driver already has support for automatically round-robining reads to slaves). It’s fairly easy to program yourself, too, and is a good way of scaling read-heavy applications. All writes have to be handled by the primary, so that can be a bottleneck.

      Most applications shouldn’t use sharding right away: they should start with a replica set and turn it into a cluster when they need to.
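
      (For example, in the mongo shell, reading from a slave yourself is just a matter of connecting to it and setting slaveOk; the hostnames and names below are made up.)

          // connect directly to a slave and allow it to serve reads
          var slave = new Mongo("server2:27017")
          slave.setSlaveOk()
          var slaveDb = slave.getDB("mydb")
          slaveDb.users.find({active: true})   // this read is served by the slave
          // writes still have to go through the primary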

      1. Thanks for replying… you say “All writes have to be handled by the primary, so that can be a bottleneck.” Are there any plans for distributing writes among members in a replica set, or any other means of avoiding that bottleneck?

      2. No, each replica set can only have one master, to avoid conflicting writes. If you need more write throughput, you can combine replica sets with sharding. Sharding gives you write scalability.

  9. Another thing I have been thinking of… I have not had so much luck with my replica setup; please read http://groups.google.com/group/mongodb-user/browse_thread/thread/86a4252d33267940/0a1e031bee28d3be?lnk=gst&q=+MongoDB+-+unstable#0a1e031bee28d3be if you feel like it.

    Each time a member (or members) in a replica set breaks for some reason, you have to bring it back up manually, is that not correct? Has some kind of automated restart been thought of?

    1. Yes, I have been following that thread. The problems you’re having are strange; I’m guessing they’re VPS-related.

      Theoretically, you could have a job that restarts mongod if it shuts down (this is the default on Homebrew installs; people have complained that they cannot figure out how to stop mongod on OS X). It’s not a good idea, though, because if MongoDB crashes, you need to run a repair before starting the database again. In 1.8 it’ll be a more viable option.

  10. How realistic is it to run a setup like this in production? I’ll probably post this question to the mailing list, but I’m working on sharding a fairly large collection. My assumption was that I’d shard over two servers with two other servers as replica sets, but I’m wondering if this would be a decent approach. It would certainly allow me to keep more of my indexes in RAM than across two servers, but I’m concerned there would be some contention between the different “master/slave” processes. That said, due to sharding key restrictions, I’m going to be write-heavy on one shard with occasional reads scattered through the remaining ones.

    1. This setup is more risky than I’d want to run in production. If one server goes down, you’ll have twice the load on one of your remaining servers, which might lead to it being overloaded. Think through what risks you’re willing to take and how much load your servers are capable of handling. If you have shards on overlapping servers (shard1 and shard2 share server1 and server2), you’re potentially doubling the load on one of your servers, and if server1 and server2 are in the same failure domain, you could end up with two shards down at once with no backup.

      1. One of the reasons I was thinking of doing this is our load won’t be evenly distributed due to an existing unique index on a serial value. One shard will be very write-heavy, while the others will have occasional reads scattered across them.

        Your comment definitely makes sense, though, and I’ll have to think it through some more before making the final decision. Thanks!

  11. Hi, thanks for this nice explanation. Although this post is a little old, it is still useful. While reading it, I was wondering if this configuration could eventually end up with all three primaries on one physical machine (since a former primary will rejoin as a secondary). Is that correct?

    1. Yes. It’s a very crowded configuration; you probably wouldn’t want to deploy like this except for testing. In a real deployment, one thing you could do to minimize the risk of all the primaries ending up on the same machine is to add the “priority” option to the replica set members. Priority lets you control which members “try hardest” to be primary.
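
      (Roughly, from a shell connected to the set’s primary, something like this; exactly which priority values are allowed depends on your version.)

          // make member 0 the preferred primary
          var cfg = rs.conf()
          cfg.members[0].priority = 2
          cfg.members[1].priority = 1
          rs.reconfig(cfg)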
