Implementing Replica Set Priorities

Replica set priorities will, very shortly, be allowed to vary between 0.0 and 100.0. The member with the highest priority that can reach a majority of the set will be elected master. (The change is done and works, but is being held up by 1.8.0… look for it after that release.) Implementing priorities was kind of an interesting problem, so I thought people might be interested in how it works. Following in the grand distributed system lit tradition I’m using the island nation of Replicos to demonstrate.

Replicos is a nation of islands that elect a primary island, called the Grand Poobah, to lead them. Each island cast a vote (or votes) and the island that gets a majority of the votes wins poobahship. If no one gets a majority (out of the total number of votes), no one becomes Grand Poobah. The islands can only communicate via carrier seagull.

Healthy configurations of islands, completely connected via carrier seagulls.

However, due to a perpetual war with the neighboring islands of Entropos, seagulls are often shot down mid-flight, distrupting communications between the Replicos islands until new seagulls can be trained.

The people of Replicos have realized that some islands are better at Poobah-ing than others. Their goal is to elect the island with the highest Poobah-ing ability that can reach a majority of the other islands. If all of the seagulls can make it to their destinations and back, electing a Poobah becomes trivial: an island sends a message saying they want to be Grand Poobah and everyone votes for them or says “I’m better at Poobah-ing, I should be Poobah.” However, it becomes tricky when you throw the Entropos Navy into the mix.

So, let’s Entropos has shot down a bunch of seagulls, leaving us with only three seagulls:

The island with .5 Poobah ability should be elected leader (the island with 1 Poobah ability can’t reach a majority of the set). But how can .5 know that it should be Poobah? It knows 1.0 exists, so theoretically it could ask the islands it can reach to ask 1.0 if it wants to be Poobah, but it’s a pain to pass messages through multiple islands (takes longer, more chances of failure, a lot more edge cases to check), so we’d like to be able to elect a Poobah using only the directly reachable islands, if possible.

One possibility might be for the islands sent a response indicating if they were connected to an island with a higher Poobah ability. In the case above, this would work (only one island is connected to an island with higher Poobah ability, so it can’t have a majority), but what about this case:

Every island, other than .5, is connected to a 1.0, but .5 should be the one elected! So, suppose we throw in a bit more information (which island of higher priority can be reached) and let the island trying to elect itself figure things out? Well, that doesn’t quite work, what if both .5 and 1.0 can reach a majority, but not the same one?

Conclusion: the Poobah-elect can’t figure this out on their own, everyone needs to work together.

Preliminaries: define an island to be Poohable if it has any aptitude for Poobah-ing and can reach a majority of the set. An island is not Poohable if it has no aptitude for Poobah-ing and/or cannot reach a majority of the set. Islands can be more or less Poohable, depending on their aptitude for Poobah-ing.

Every node knows whether or not it, itself, is Poohable: it knows its leadership abilities and if it can reach a majority of the islands. If more than one island (say islands A and B) is Poohable, then there must be at least one island that can reach both A and B [Proof at the bottom].

Let’s have each island keep a list of “possible Poobahs.” So, say we have an island A, that starts out with an empty list. If A is Poohable, it’ll add itself to the list (if it stops being Poohable, it’ll remove itself from the list). Now, whenever A communicates with another island, the other island will either say “add me to your list” or “remove me from your list,” depending on whether it is currently Poohable or not. Every other island does the same, so now each island has a list of the Poohable islands it can reach.

Now, say island X tries to elect itself master. It contacts all of the islands it can reach for votes. Each of the voting islands checks its list: if it has an entry on it that is more Poohable than X, it’ll send a veto. Otherwise X can be elected master. If you check the situations above (and any other situation) you can see that Poohability works, due to the strength of the guarantee that a Poobah must be able to reach a majority of the set.

Proof: suppose a replica set has n members and a node A can reach a majority of the set (at least ⌊n/2+1⌋) and a node B can reach a majority of the set (again, ⌊n/2+1⌋). If the sets of members A and B can reach are disjoint, then there must be ⌊n/2+1⌋+⌊n/2+1⌋ = at least n+1 members in the set. Therefore the set of nodes that A can reach and the set of nodes that B can reach are not disjoint.

“Scaling MongoDB” Update

Vroom vroom

In the last couple weeks, we’ve been getting a lot of questions like: (no one asked this specific question, this is just similar to the questions we’ve been getting)

I ran shardcollection, but it didn’t return immediately and I didn’t know what was going on, so I killed the shell and tried deleting a shard and then running the ‘shard collection’ command again and then I started killing ops and then I turned the balancer off and then I turned it back on and now I’m not sure what’s going on…

Aaaaagh! Stop running commands!

If a single server is like a TIE fighter then a sharded cluster is like the Death Star: you’ve got more power but you’re not making any sudden movements. For any configuration change you make, at least four servers have to talk to each other (usually more) and often a great deal of data has to get processed. If you ran all of the commands above on a big MongoDB install, everything would eventually work itself out (except the killing random operations part, it sort of depends on what got killed), but it could take a long time.

I think these questions stem from sharding being nerve-wracking: the documentation says what commands to run, but then nothing much seems to happen and everything seems slow and the command line doesn’t return a response (immediately). Meanwhile, you have hundreds of gigabytes of production data and MongoDB is chugging along doing… something.

So, I added some new sections to Scaling MongoDB on what to expect when you shard a big collection: if you run x in the shell, you’ll see y in the log, then MongoDB will be doing z until you see w. What it’s doing, what you’ll see, how (and if) you should react. In general: a sharding operation that hasn’t returned yet isn’t done, keep you eye on the logs, and don’t panic.

I’ve also added a section on backing up config servers and updated the chunk size information. If you bought the eBook, you can update it free to the latest version for free to get the new info. (I love this eBook update system!) The update should be going out this week.

Let me know if there’s any other info that you think is missing and I’ll add it for future updates.

Resizing Your Oplog

The MongoDB replication oplog is, by default, 5% of your free disk space. The theory behind this is that, if you’re writing 5% of your disk space every x amount of time, you’re going to run out of disk in 19x time. However, this doesn’t hold true for everyone, sometimes you’ll need a larger oplog. Some common cases:

  • Applications that delete almost as much data as they create.
  • Applications that do lots of in-place updates, which consume oplog entries but not disk space.
  • Applications that do lots of multi-updates or remove lots of documents at once. These multi-document operations have to be “exploded” into separate entries for each document in the oplog, so that the oplog remains idempotent.

If you fall into one of these categories, you might want to think about allocating a bigger oplog to start out with. (Or, if you have a read-heavy application that only does a few writes, you might want a smaller oplog.) However, what if your application is already running in production when you realize you need to change the oplog size?

Usually if you’re having oplog size problems, you want to change the oplog size on the master. To change its oplog, we need to “quarantine” it so it can’t reach the other members (and your application), change the oplog size, then un-quarantine it.

To start the quarantine, shut down the master. Restart it without the --replSet option on a different port. So, for example, if I was starting MongoDB like this:

$ mongod --replSet foo # default port

I would restart it with:

$ mongod --port 10000

Replica set members look at the last entry of the oplog to see where to start syncing from. So, we want to do the following:

  1. Save the latest insert in the oplog.
  2. Resize the oplog
  3. Put the entry we saved in the new oplog.

So, the process is:

1. Save the latest insert in the oplog.

> use local
switched to db local
> // "i" is short for "insert"
> db.temp.save(db.oplog.rs.find({op : "i"}).sort(
... {$natural : -1}).limit(1).next())

Note that we are saving the last insert here. If there have been other operations since that insert (deletes, updates, commands), that’s fine, the oplog is designed to be able to replay ops multiple times. We don’t want to use deletes or updates as a checkpoint because those could have $s in their keys, and $s cannot be inserted into user collections.

2. Resize the oplog

First, back up the existing oplog, just in case:

$ mongodump --db local --collection 'oplog.rs' --port 10000

Drop the local.oplog.rs collection, and recreate it to be the size that you want:

> db.oplog.rs.drop()
true
> // size is in bytes
> db.runCommand({create : "oplog.rs", capped : true, size : 1900000}) 
{ "ok" : 1 }

3. Put the entry we saved in the new oplog.

> db.oplog.rs.save(db.temp.findOne())

Making this server primary again

Now shut down the database and start it up again with --replSet on the correct port. Once it is a secondary, connect to the current primary and ask it to step down so you can have your old primary back (in 1.9+, you can use priorities to force a certain member to be preferentially primary and skip this step: it’ll automatically switch back to being primary ASAP).

> rs.stepDown(10000)
// you'll get some error messages because reconfiguration 
// causes the db to drop all connections

Your oplog is now the correct size.

Edit: as Graham pointed out in the comments, you should do this on each machine that could become primary.

Enchiladas of Doom

Damjan Stanković's Eko light design

Andrew and I are visiting San Francisco this week. Last night, I wanted enchiladas from the Mexican place across the street from the hotel. It was still warm out even though the sun had set hours ago, so we decided to walk over.

Our hotel is on a busy road with three lanes in both directions, but there are lights along the road so there are periods when no cars are coming. We waited until a wave of cars had passed and sprinted across the first three lanes. In the darkness, it looked like the median was flush with the pavement and I charged at it. It was actually raised and my foot hit it six inches before I had expected to encounter anything solid. I staggered and lost my balance but I was still running full-steam, so I tripped my way forward ending up in the middle of the road. I lay there on the highway, stunned, looking at three lanes of cars coming at me.

Get up! screamed my brain. It hadn’t even finished shouting when Andrew scooped me up and half-carried me off of the road. He had almost tripped over me as I fell, but managed to leap over and then turn and pick me up. He is my Batman!

My ankle is still pretty sore and I’m a bit banged up down the side I fell on, but other than that I’m fine. And the enchiladas were delicious.

Also: if you’re a subscriber and you’re only interested in MongoDB-related posts, I created a new RSS feed you can subscribe to for just those posts. The old feed will continue to have all MongoDB posts, plus stuff about my life and other technology.

My Life is Awesome

Andrew and I are getting married!

I can’t figure out how to say this eloquently, but: Andrew is so wonderful and I am incredibly lucky to have him. I love him so much. We love doing stuff together, talking about everything together (well, he puts up with more database talk than he’d probably strictly like), and just being with each other.

He has the cutest new haircut, too, but he’s a tough man to get in front of a camera (the picture on the right is from a year ago).

We’re getting married by the justice of the peace on our third anniversary. We already have tons of crap in our apartment and the last thing we need is more crap, so we’re asking wedding guests to skip the presents and donate to one of the following charities instead:

I am so happy.

A Short eBook on Scaling MongoDB

I just finished a little ebook for O’Reilly: Scaling MongoDB. I’m excited about it, it was really fun to write and I think it’ll be both fun and instructive to read. It covers:

  • What a cluster is
  • How it works (at a high level)
  • How to set it up correctly
  • The differences in programming for a cluster vs. a single server
  • How to administer a cluster
  • What can go wrong and how to handle it

So, why would you want to get a copy?

  • It’s a succinct reference for anything that’s likely to come up and covers the common questions I’ve heard people ask in the last year.
  • I heard some grumbling about my post on choosing a shard key (“Ugh, a metaphor,” -an IRC user). People who don’t like metaphors will be pleased to hear that this book has a straight-forward, serious-business section on choosing a shard key that not only lacks a metaphor but also goes much more into depth than the blog post.
  • It’s a quick read. There are code examples, of course, and it can be used as a reference, but after banging out 15,000 words in a couple of days, I took the next couple weeks to make them all flow together like fuckin’ Shakespeare.
  • It can be updated in real time! After MongoDB: The Definitive Guide becoming out-of-date approximately 6 seconds after we handed in the final draft, I’m delighted that new sections, advice, and features can be added as needed. Once you buy the ebook, you can update to the latest “edition” whenever you want, as many times as you want. O’Reilly wants to do automatic updates, but so far the infrastructure isn’t there in traditional ebook readers so you’ll have to update it manually.

You can also get a “print on demand” copy if you’re old school.

I hope you guys will check it out and let me know what you think!

To promote the book, Meghan is forcing me to do a webcast on Friday (February 4th). It’s called How Sharding Works and it’s a combination whitepaper and Magic School Bus tour of sharding. It should cover some of the interesting stuff about sharding that didn’t really fit into the 60 pages I had to work with (or the more practical focus of the book).

Look at the teeth on that guy! (He'll bite you if you make any webscale jokes.)

Why Command Helpers Suck

This is a rant from my role as a driver developer and person who gives support on the mailing list/IRC.

Command helpers are terrible. They confuse users, result in a steeper learning curve, and make MongoDB’s interface seem arbitrary.

The basics: what are command helpers?

Command helpers are wrappers for database commands. Database commands are everything other than CRUD (create, retrieve, update, delete) that you can do with MongoDB. This includes things like dropping a collection, doing a MapReduce, adding a member to a replica set, seeing what arguments you started mongod with, and finding out if the last write operation succeeded. They’re everywhere, if you’ve used MongoDB, you’ve run a database command (even if you weren’t aware of it).

So, what are command helpers? These are wrappers around the raw command, turning something like db.adminCommand({serverStatus:1}) into db.serverStatus(). This makes it slightly quicker to run and look “nicer” than the command. However, there are honey bunches of reasons that they’re a bad idea and should be avoided whenever possible.

Database helpers are unportable

Helpers are extremely unportable. If you know how to run db.serverStatus() in the shell, that’s great, but all you know is how to do it in the shell. If you know how to run the serverStatus command, you know how to get the server status in every language you’ll ever use.

Similarly, each language handles command options differently. Take a command like group: the shell helper chooses one order of options (a single argument “options”, incidentally) and the Python driver chooses another (“key”, “condition”, “initial”, “reduce”, “finalize”) and the PHP driver another (“key”, “initial”, “reduce”, “options”). If you just learn the group command itself, you can execute it in any language you please.

This affects almost everyone using MongoDB, as almost everyone uses at least two languages (JavaScript and something else). I have seen hundreds of questions of the form “How do I run <shell function> using my driver?” If these users knew it was a database command (and knew what a database command was), they wouldn’t have to ask.

Database helpers lock you to a certain API, often an out-of-date one

Suppose the database changes the options for a command. All of the drivers that support helpers for that command are suddenly out-of-date. Conversely, if you have a recent version of a driver and an old version of the database, you can have helpers for features that don’t exist yet or have different options.

An example of old driver/new database: MapReduce’s options changed in version 1.7.4. As far as I know, none of the drivers support the new options, yet.

You can’t support database helpers for everything

Next, there’s just the sheer volume of database commands, which makes it impossible to implement helpers for all of them. Everyone has their favorites: aggregation is important to some people, administration helpers are important to others, etc. If all of them had helpers, not only would there be a ridiculous number of methods polluting the API documentation, but it would leads to tons of compatibility problems between the driver and the database (as mentioned above).

Database helpers conceal what’s going on, giving users less options

Finally, using command helpers keeps people from understanding what’s actually going on, which is pointless and can lead to problems. It’s pointless to conceal the gory details because the details aren’t very gory: all database commands are queries. This means you can deconstruct command helpers as follows (example in PHP):

// the command helper
$db->lastError();
// is the same as
$db->command(array("getlasterror" => 1));
// is the same as
$db->selectCollection('$cmd')->findOne(array("getlasterror" => 1));
// is the same as
$db->selectCollection('$cmd')->find(array("getlasterror" => 1))->limit(1)->getNext();

Every command helper is just a find() in disguise! This means you can do (almost) anything with a database command that you could with a query.

This gives you more control. Not only can you use whatever options you want, you can do a few other things:

  • By default, drivers send all commands to the master, even if slaveOkay is set. If you want to send a command to a slave, you can deconstruct it to a query bypass the driver’s commands-go-to-master logic.
  • Suppose you have a command that takes a long time to execute and it times out on the client side. If you deconstruct the command into a query, you can (for some drivers) set the client-side timeout on the cursor.

Finally, if you’re using an unfamiliar driver, you might not know what its helpers are called but all drivers have a find() method, so you can always use that.

Exceptions

There are a couple command helpers worth implementing. I think that count and drop (at both the database and collection levels) are common enough to be worth having helpers for. Also, at a higher level (e.g., frameworks on top of the driver and admin GUIs) I think helpers are absolutely fine. However, as someone who has been maintaining a driver and supporting users for the last few years, I think that, at a driver level, command helpers are a terrible idea.

How to Use Replica Set Rollbacks

Rollin' rollin' rollin', keep that oplog rollin'

If you’re using replica sets, you can get into a situation where you have conflicting data. MongoDB will roll back conflicting data, but it never throws it out.

Let’s take an example, say you have three servers: A (arbiter), B, and C. You initialize A, B, and C:

$ mongo B:27017/foo
> rs.initiate()
> rs.add("C:27017")
{ "ok" : 1 }
> rs.addArb("A:27017")
{ "ok" : 1 }

Now do a couple of writes to the master (say it’s B).

> B = connect("B:27017/foo")
> B.bar.insert({_id : 1})
> B.bar.insert({_id : 2})
> B.bar.insert({_id : 3})

Then C gets disconnected (if you’re trying this out, you can just hit Ctrl-C—in real life, this might be caused by a network partition). B handles some more writes:

> B.bar.insert({_id : 4})
> B.bar.insert({_id : 5})
> B.bar.insert({_id : 6})

Now B gets disconnected. C gets reconnected and the arbiter elects it master, so it starts handling writes.

> C = connect("C:27017/foo")
> C.bar.insert({_id : 7})
> C.bar.insert({_id : 8})
> C.bar.insert({_id : 9})

But now B gets reconnected. B has data that C doesn’t have and C has data that B doesn’t have! What to do? MongoDB chooses to roll back B’s data, since it’s “further behind” (B’s latest timestamp is before C’s latest timestamp).

If we query the databases after the millisecond or so it takes to roll back, they’ll be the same:

> C.bar.find()
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
{ "_id" : 7 }
{ "_id" : 8 }
{ "_id" : 9 }
> B.bar.find()
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
{ "_id" : 7 }
{ "_id" : 8 }
{ "_id" : 9 }

Note that the data B wrote and C didn’t is gone. However, if you look in B’s data directory, you’ll see a rollback directory:

$ ls /data/db
journal  local.0  local.1  local.ns  mongod.lock  rollback  foo.0  foo.1  foo.ns  _tmp
$ ls /data/db/rollback
foo.bar.2011-01-19T18-27-14.0.bson

If you look in the rollback directory, there will be a file for each rollback MongoDB has done. You can examine what was rolled back with the bsondump utility (comes with MongoDB):

$ bsondump foo.bar.2011-01-19T18-27-14.0.bson
{ "_id" : 4 }
{ "_id" : 5 }
{ "_id" : 6 }
Wed Jan 19 13:33:32      3 objects found

If these won’t conflict with your existing data, you can add them back to the collection with mongorestore.

$ mongorestore -d foo -c bar foo.bar.2011-01-19T18-27-14.0.bson 
connected to: 127.0.0.1
Wed Jan 19 13:36:27 foo.bar.2011-01-19T18-27-14.0.bson
Wed Jan 19 13:36:27      going into namespace [foo.bar]
Wed Jan 19 13:36:27      3 objects found

Note that you need to specify -d foo and -c bar to get it into the correct collection. If it would conflict, you could restore it into another collection and do a more delicate merge operation.

Now, if you do a find, you’ll get all of the documents:

> B.bar.find()
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
{ "_id" : 7 }
{ "_id" : 8 }
{ "_id" : 9 }
{ "_id" : 4 }
{ "_id" : 5 }
{ "_id" : 6 }

Hopefully this sort of thing can tide most people over until MongoDB supports multi-master.

How to Choose a Shard Key: The Card Game

Choosing the right shard key for a MongoDB cluster is critical: a bad choice will make you and your application miserable. Shard Carks is a cooperative strategy game to help you choose a shard key. You can try out a few pre-baked strategies I set up online (I recommend reading this post first, though). Also, this won’t make much sense unless you understand the basics of sharding.

Mapping from MongoDB to Shard Carks

This game maps the pieces of a MongoDB cluster to the game “pieces:”

  • Shard – a player.
  • Some data – a playing card. In this example, one card is ~12 hours worth of data.
  • Application server – the dealer: passes out cards (data) to the players (shards).
  • Chunk – 0-4 cards defined by a range of cards it can contain, “owned” by a single player. Each player can have multiple chunks and pass chunks to other players.

Instructions

Before play begins, the dealer orders the deck to mimic the application being modeled. For this example, we’ll pretend we’re programming a news application, where users are mostly concerned with the latest few weeks of information. Since the data is “ascending” through time, it can be modeled by sorting the cards in ascending order: two through ace for spades, then two through ace of hearts, then diamonds, then clubs for the first deck. Multiple decks can be used to model longer periods of time.

Once the decks are prepared, the players decide on a shard key: the criteria used for chunk ranges. The shard key can be any deterministic strategy that an independent observer could calculate. Some examples: order dealt, suit, or deck number.

Gameplay

The game begins with Player 1 having a single chunk (chunk1). chunk1 has 0 cards and the shard key range [-∞, ∞).

Each turn, the dealer flips over a card, computes the value for the shard key, figures out which player has a chunk range containing that key, and hands the card to that player. Because the first card’s shard key value will obviously fall in the range [-∞, ∞), it will go to Player 1, who will add it to chunk1. The second and the third cards go to chunk1, too. When the fourth card goes to chunk1, the chunk is full (chunks can only hold up to four cards) so the player splits it into two chunks: one with a range of [-∞, midchunk1), the other with a range of [midchunk1, ∞), where midchunk1 is the midpoint shard key value for the cards in chunk1, such that two cards will end up in one chunk and two cards will end up in the other.

The dealer flips over the next card and computes the shard key’s value. If it’s in the [-∞, midchunk1) range, it will be added to that chunk. If it’s in the [midchunk1, ∞) range, it will be added to that chunk.

Splitting

Whenever a chunk gets four cards, the player splits the chunk into two 2-card chunks. If a chunk has the range [x, z), it can be split into two chunks with ranges [x, y), [y, z), where x < y < z.

Balancing

All of the players should have roughly the same number of chunks at all times. If, after splitting, Player A ends up with more chunks than Player B, Player A should pass one of their chunks to Player B.

Strategy

The goals of the game are for no player to be overwhelmed and for the gameplay to remain easy indefinitely, even if more players and dealers are added. For this to be possible, the players have to choose a good shard key. There are a few different strategies people usually try:

Sample Strategy 1: Let George Do It

The players huddle together and come up with a plan: they’ll choose “order dealt” as the shard key.

The dealer starts dealing out cards: 2 of spades, 3 of spades, 4 of spades, etc. This works for a bit, but all of the cards are going to one player (he has the [x, ∞) chunk, and each card’s shard key value is closer to ∞ than the preceding card’s). He’s filling up chunks and passing them to his friends like mad, but all of the incoming cards are added to this single chunk. Add a few more dealers and the situation becomes completely unmaintainable.

Ascending shard keys are equivalent to this strategy: ObjectIds, dates, timestamps, auto-incrementing primary keys.

Sample Strategy 2: More Scatter, Less Gather

When George falls over dead from exhaustion, the players regroup and realize they have to come up with a different strategy. “What if we go the other direction?” suggests one player. “Let’s have the shard key be the MD5 hash of the order dealt, so it’ll be totally random.” The players agree to give it a try.

The dealer begins calculating MD5 hashes with his calculator watch. This works great at first, at least compared to the last method. The cards are dealt at a pretty much even rate to all of the players. Unfortunately, once each player has a few dozen chunks in front of them, things start to get difficult. The dealer is handing out cards at a swift pace and the players are scrambling to find the right chunk every time the dealer hands them a card. The players realize that this strategy is just going to get more unmanageable as the number of chunks grows.

Sharding keys equivalent to this strategy: MD5 hashes, UUIDs. If you shard on a random key, you lose data locality benefits.

Sample Strategy 3: Combo Plate

What the players really want is something where they can take advantage of the order (like the first strategy) and distribute load across all of the players (like the second strategy). They figure out a trick: couple a coarsely-grained order with the random element. “Let’s say everything in a given deck is ‘equal,'” one player suggests. “If all of the cards in a deck are equivalent, we’ll need a way of splitting chunks, so we’ll also use the MD5 hash as a secondary criteria.”

The dealer passes the first four cards to Player 1. Player 1 splits his chunk and passes the new chunk to Player 2. Now the cards are being evenly distributed between Player 1 and Player 2. When one of them gets a full chunk again, they split it and hand a chunk to Player 3. After a few more cards, the dealer will be evenly distributing cards among all of the players because within a given deck, the order the players are getting the cards is random. Because the cards are being split in roughly ascending order, once a deck has finished, the players can put aside those cards and know that they’ll never have to pick them up again.

This strategy manages to both distribute load evenly and take advantage of the natural order of the data.

Applying Strategy 3 to MongoDB

For many applications where the data is roughly chronological, a good shard key is:

{<coarse timestamp> : 1, <search criteria> : 1}

The coarseness of the timestamp depends on your data: you want a bunch of chunks (a chunk is 200MB) to fit into one “unit” of timestamp. So, if 1 month’s worth of writes is 30GB, 1 month is a good granularity and your shard key could start with {"month" : 1}. If one month’s worth of data is 1 GB you might want to use the year as your timestamp. If you’re getting 500GB/month, a day would work better. If you’re inserting 5000GB/sec, sub-second timestamps would qualify as “coarse.”

If you only use a coarse granularity, you’ll end up with giant unsplittable chunks. For example, say you chose the shard key {"year" : 1}. You’d get one chunk per year, because MongoDB wouldn’t be able to split chunks based on any other criteria. So you need another field to target queries and prevent chunks from getting too big. This field shouldn’t really be random, as in Strategy 3, though. It’s good to group data by the criteria you’ll be looking for it by, so a good choice might be username, log level, or email, depending on your application.

Warning: this pattern is not the only way to choose a shard key and it won’t work well for all applications. Spend some time monitoring, figuring out what your application’s write and read patterns are. Setting up a distributed system is hard and should not be taken lightly.

How to Use Shard Carks

If you’re going to be sharding and you’re not sure what shard key to choose, try running through a few Shark Carks scenarios with coworkers. If a certain player start getting grouchy because they’re having to do twice the work or everyone is flailing around trying to find the right cards, take note and rethink your strategy. Your servers will be just as grumpy and flailing, only at 3am.

If you don’t have anyone easily persuadable around, I made a little web application for people to try out the strategies mentioned above. The source is written in PHP and available on Github, so feel free to modify. (Contribute back if you come up with something cool!)

And that’s how you choose a shard key.

Wireless dongle review

A dongle is a USB thingy (as you can see, I’m very qualified in this area) that lets you connect your computer to the internet wherever you go. It uses the same type of connection your cellphone data plan uses (3G or 4G).

A few months ago, Clear asked if they could send me a free sample dongle, as I am such a prestigious tech blogger. And I, being a sucker for free things (take note marketers) agreed to try out their dongle. And I have to say, it’s been pretty cool having free wifi wherever I go. The good bits:

  • It is very handy, especially when traveling. Waiting for hours in cold, smelly terminals become much more bearable. If I traveled more, I’d definitely get my own dongle (or try to get work to get one for me).
  • I could use the dongle on multiple laptops. I was worried about this, it seems like a lot of companies grub for money by binding devices like this to a single machine so you have to buy one for each computer you have (and who has just one computer?). It only supported Mac and Windows, though, so minor ding for that.
  • Andrew and I watched Law and Order (Netflix) using it and there was no noticeable difference in quality from our landline. I didn’t do a proper speed test, partly because I’m lazy and partly because I didn’t care. (If you know me IRL and want to do one, let me know and I’ll lend the dongle to you.)

But… there aren’t a whole lot of places I go where I don’t have free wifi already. Almost all of the coffeeshops and bookstores (and even bars) I go to already advertise free wifi. I used the dongle maybe once a week. I’ll miss it when my free trial runs out, but I won’t miss it $55-per-month-worth.

Also, I should be able to get the same sort of behavior by tethering my cellphone–if Sprint didn’t cripple their cellphones to prevent you from tethering. I actually don’t like having a phone, period, so when my contract runs out I’ll probably get a phone with just a data plan and a less douchey carrier.

So, my conclusions are: it’s super handy, but my cellphone should really be able to serve the same function. But that’s just me, and it is really cool being able to go online anywhere.