NoSQL vs. the world

About a year ago, Mike Dirolf drew an enormous circle covering a sheet of paper. “Here are the people who use databases,” he said, “and here are the people who have even heard of NoSQL,” and he drew a circle this big: ° .

I think that interest has grown since then. By this time, the number of people that know about NoSQL is at least this big: o .

For evidence, let’s take a look at what people are searching for. First, here’s a Google Insights chart for a bunch of popular NoSQL solutions:

MongoDB vs. Cassandra vs. CouchDB vs. Redis vs. Riak (source)

Woohoo! MongoDB seems to be leading the charge, so far. But wait, now let’s compare MongoDB (the top line above) to some other databases:

MongoDB vs. Sybase vs. PostgreSQL vs. Firebird vs. Sqlite (source)

Okay, well, we’re behind everyone, but it’s not too bad. You start to see some patterns with the relational databases that you don’t yet see so much with the NoSQL databases: people are using relational databases a lot more at work (during the week) than for fun (on the weekends). In fact, MongoDB is occasionally inching above Sybase on the weekends!

How about MySQL, SQL Server, and Oracle?

MongoDB vs. MySQL vs. SQL Server vs. Oracle (source)

Sigh. Back to work, people. We have a ways to go.

A finite simple group of order two

On my way into an underpass.

Andrew and I just returned from our honeymoon in the mountains, so I am now going to wax boring about how awesome it was.

We really love rock climbing, so we went on a lot of rock scrambles. Rock scrambles involve “scrambling” around rocks, boulders, little cliffs, and crevices. We skittered down slippery sheets of rock, trying to hit the tiny outcroppings and avoid tumbling to our deaths. Then the trail markers would betray us, pointing us right into the crevices, under huge boulders, across precariously balanced natural bridges of rock, through tiny gaps where I barely fit and there was a moment of panic where I thought Andrew was going to get stuck.

At one point, we came to a particularly buggy area. The air was warm and full of tiny flies and mosquitoes. I kept flailing at them, mostly just managing to whack myself in the head. The trail hit the edge of a pit with steep walls blocking our view of everything but the path in front of us. Slipping, sliding, and slapping ourselves in the head, we made our way down.

Not as bad as it looked! (Rockin my XKCD shirt)

As we went, the air got ten degrees colder, then twenty, then thirty. Suddenly there were no bugs at all. We reached the bottom of the pit and the air was pleasantly cold. Then we saw why: there were piles of snow! The hole was completely surrounded by rock and protected from the sun, forming a natural ice box.

As we scrambled up the other side of the pit, the air became warmer and the bugs resumed their assault, but for a moment there, it was like we were in an ice dragon’s lair in the heart of a hot jungle.

Also, we got room 314 (3.14), which seems like a good omen.

Finally, if you’re a geek and don’t know what the title is referring to, you should watch the video.

Simulating Network Partitions with mongobridge

Note: mongobridge does not come with the MongoDB binaries, but you can build it by getting the source code and running scons mongobridge.

Let’s say we have a replica set with members (M1, M2, and M3) and we want to see what happens when M1 and M3 cannot reach each other (or any other sort of network partition). How do we do this?

We can’t shut down M3 because that would test something different. If we block incoming connections to M3, we end up blocking M2’s connections, too, which we don’t want.

We want M3 to only block M1’s connection, not M2’s. This is where mongobridge comes in.

A replica set with three members.

mongobridge allows you to accept connections on one port and send them on to a MongoDB process listening on another port. So, for the M1↔M3 connection, we can set up a bridge. This bridge, like most bridges, needs to go in two directions (M1→M3 and M3→M1), so we actually need two instances of mongobridge.

So, we’ll make the “onramps” for our bridge M1:10013 (for the M1→M3 connection) and M3:10031 (for the M3→M1 connection). To set this up, we run:

$ ssh M1
M1$ mongobridge --port 10013 --dest M3:27017
M1$ exit
$ ssh M3
M3$ mongobridge --port 10031 --dest M1:27017
M3$ exit

This means, “Take all traffic heading to M1:10013 and send it to M3:27017. Take all traffic heading to M3:10031 and send it to M1:27017.”

However, M1:27017 doesn’t want to send its traffic to M1:10013: its configuration says to send M3-bound traffic to M3:27017. Same with M3:27017: it doesn’t want to send its traffic to M3:10031, it wants to send it straight to M1:27017. And we can’t reconfigure the set so that different members have different configurations… officially.

Unofficially, we can change each config to anything we want.

Warning: never, ever do what I’m about to do in production; this is just for trying out funky networking tricks.

Shut down M1 and M3, and start each back up without the --replSet option on a different port. This way it won’t connect to the rest of the replica set and you can edit the local.system.replset configurations to point at the bridges, instead of the other members.

> // supposing we started them on port 27018 instead of 27017...
> db = (new Mongo("M1:27018")).getDB("local")
> db.system.replset.update({}, {$set : {"members.2.host" : "M1:10013"}})
> db = (new Mongo("M3:27018")).getDB("local")
> db.system.replset.update({}, {$set : {"members.0.host" : "M3:10031"}})

Now, restart M1 and M3 with their proper ports and --replSet. They will connect to each other through mongobridge and our replica set will be ready to use.
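
To make sure the bridged set came up the way you expect, a quick sanity check from the shell will show each member’s state (this is just the standard helper; there’s nothing mongobridge-specific about it):

> // connect to any member and list everyone's state
> rs.status().members.forEach(function(m) { print(m.name + " : " + m.stateStr) })

All three members should show up healthy, with one primary and two secondaries.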

To simulate a network outage, kill the mongobridges between two servers.

Suppose M1 is the primary and you have knocked out the network between M1 and M3. If you do a write on M1, M2 will sync to M1 and then M3 will sync to M2, so you can see that the write will still make it to M3.
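
If you’d like to see that for yourself, here’s one way to check from the shell. The test database and partitionTest collection are just made-up names for this example:

> // on M1, which is still primary
> use test
> db.partitionTest.insert({msg : "hello from M1"})
>
> // on M3: it can't reach M1 directly, but it can reach M2,
> // so the write shows up anyway (give it a second to replicate)
> use test
> db.getMongo().setSlaveOk()
> db.partitionTest.findOne()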

And that’s the dramatic payoff.

Trying Out Replica Set Priorities

Respect my prioritah!

As promised in an earlier post, replica set priorities for MongoDB are now committed and will be available in 1.9.0, which should be coming out soon.

Priorities allow you to give weights to the servers, saying, “I want server X to be primary whenever possible.” Priorities can range from 0.0 to 100.0.

To use priorities, download the latest binaries (“Development Release (Unstable) – Nightly”) from the MongoDB site. You have to have all members of the set running 1.9.0- or higher, as older versions object strongly to priorities other than 0 and 1.

Once you have the latest code installed, start up your three servers (A, B, and C) and create the replica set:

> // on server A
> rs.initiate()
> rs.add("B")
> rs.add("C")

Suppose we want B to be the preferred primary. And C‘s a backup server, but it could be primary if we really need it. So, we can adjust the priorities:

> config = rs.conf()
>
> // the default priority is 1
>
> B = config.members[1]
> B.priority = 2
>
> C = config.members[2]
> C.priority = .5
>
> // always increment the version number when reconfiguring
> config.version++
> rs.reconfig(config)

In a few seconds, A will step down and B will take over as primary.
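
If you want to watch the handover happen, db.isMaster() is an easy way to check who’s currently primary (this is just the standard helper, nothing priority-specific):

> // run from a shell connected to any member; the value should
> // flip from A's host:port to B's within a few seconds
> db.isMaster().primary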

Protip: the actual values of the priorities don’t matter; what matters is the relative order: B > A > C. B could have a priority of 100 and C could have a priority of .00001 and the set would behave exactly the same.

FAQ

(Based on coworkers’ questions and the 12 hours this has been committed)

What if A steps down and B is behind? Won’t I lose data?

No. A will only step down if B is within 10 seconds of synced. Once A steps down, B will sync until it’s up-to-date and (only then) become master.
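
If you’re curious how far behind a secondary is at any given moment, the shell can tell you (again, a general replication helper, nothing specific to priorities):

> // run on the primary: prints how far behind each secondary is
> db.printSlaveReplicationInfo()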

Okay, but I want B to be primary now. Can I force that?

Yes, now-ish. Run this on A:

> db.adminCommand({replSetStepDown : 300, force : true})

This forces A to step down immediately (for 300 seconds). B will sync to it until it is up-to-date, then become primary.

I forgot to upgrade one of my servers before setting crazy priorities and now it’s complaining. What do I do?

Shut it down and restart it with 1.9.0. Setting non-1/0 priorities won’t harm anything with earlier versions; it just won’t work.

So, when should I use votes?

Almost never! Please ignore votes (in all versions, not just 1.9.0), unless: 1) you’re trying to have a replica set with more than 7 members or 2) you want to wedge your replica set into a primary-less state.

The Scripting Language of Databases

One of the most common questions non-users ask is “Why should I use MongoDB?”

There are a bunch of fancy answers: you can scale it (webscale!), you can use it for MapReduce, you can store files in it. Those things are all true, but every database worth its salt can scale (there are MySQL clusters with tens of thousands of nodes), every new-ish database I know of supports MapReduce, and filesystems are pretty good at storing files.

I think the reason is much simpler:

MongoDB gets out of your way.

For example, a user (on IRC) asked, “How do I store a form’s data in MongoDB?” Based on the question, I assumed he was using PHP, so I pasted the following three lines:

$m = new Mongo();
$m->test->formResults->insert($_POST);
print_r($m->test->formResults->findOne());

“Hey, it works!” he said.

(For those of you not familiar with MongoDB (or PHP), this stores and retrieves everything in the POST request and can be run with no prior database setup.*)

So, all of the bells and whistles are nice, but I think the real benefit is the simplicity. MongoDB is the scripting language of databases: you can just get stuff done, fast.

In this spirit, 10gen’s first monthly blogging contest topic was to write about something you developed quickly using MongoDB. The entries were cool: people built really interesting applications ridiculously fast.

A screenshot of Mojology displaying syslog entries

Some of my favorite entries were:

BugRocket’s Rapid(-ish) Development

Ryan Funduk wrote about creating a bug tracker.

“Without MongoDB, I would have easily racked up over a thousand database migrations.”

The Birth of Mojology

Gergely Nagy built an open source application for viewing and doing statistics on syslog messages.

“About four hours [after installing MongoDB], I posted the first version of my mongodb destination driver to the syslog-ng mailing list.”

From 0 to 1 Million in 6 Hours

Bradley Grzesiak wrote about programming VoiceRally.

“Friday, the day after VoiceRally was written, we sent over 1.5 million WebSocket messages.”

Family Spoon and MongoDB

Tom Maiaroto wrote about creating a recipe website.

“Yes, you need to be aware of “schema” and you can’t go hog wild, but you also get more forgiveness and MongoDB works with you to solve your problems. It’s very flexible. It’s not something that you need to work around, it’s something that you get to work with. Anytime that you have a situation like that as a developer, your day is going to be much more happy and productive.”

Check out the other entries, as well. It’s too bad we can only choose one to win!

This month, we’re asking people to write about an open source app using MongoDB and the prize is an iPad2!

Edited to add: some of the commenters are upset about my advice to store $_POST in MongoDB. You should not store any user input unsanitized. For people familiar with SQL, the code above does not allow a traditional injection attack with MongoDB (as it would with SQL). After the first flush of success, I told the guy to not do it this way and to go read the documentation. Inserting $_POST was a learning tool, not a solution, and I tried to make that clear over IRC, if not in this post.

Lorenz University: I can has degree?

Big thanks to Wondermark for allowing people to post their comics!

Misadventures in HR (a hilarious blog about… HR) mentioned Lorenz University, a degree mill. I’d never heard of a degree mill before, so I wanted to see how legit it looked from a computer scientist’s point of view.

whois

Every site on the internet has to register contact information with the king of the internet, so you can see who’s behind a website. Anyone can look up this info by running “whois domain” on their computer. For example, here are some legit universities’ info:

$ whois nyu.edu
Registrant:
   New York University
   ITS Communications Operations Services
   7 East 12th Street, 5th Floor
   New York, NY 10003
   UNITED STATES
$ whois mit.edu
Registrant:
   Massachusetts Institute of Technology
   Cambridge, MA 02139
   UNITED STATES
$ whois ufl.edu
Registrant:
   University of Florida
   Computing and Network Services
   Space Sciences Research Building
   Gainesville, FL 32611-2050
   UNITED STATES

Most businesses, institutions of higher learning, and pretty much any other large, legitimate site have their actual address listed there. What does Lorenz U have?

$ whois lorenzuniversity.com
Registrant:
   Domains by Proxy, Inc.
   DomainsByProxy.com
   15111 N. Hayden Rd., Ste 160, PMB 353
   Scottsdale, Arizona 85260
   United States

Domains by Proxy is a service where you can pay them to keep your contact info a secret. It’s good for privacy, but it’s a bit unusual for a university.

Also, protip: most universities are not .com addresses.

Accreditation

At first glance, Lorenz University seems to have some good proof that they’re a valid, accredited institution:

Lorenz University holds valid accreditation from reputable accrediting agencies including IAAFOE and ACTDE. These agencies have clearly mentioned on their official websites that Lorenz University is fully approved by their evaluation committee.

But wait, I’ve never heard of the IAAFOE or the ACTDE. A quick Google search turns up the International Accreditation Association for Online Education and the Accreditation Council for Distance Education.

Okay, Lorenz University is accredited by someone, but let’s take a look at who.

$ whois iaafoe.org
Registrant Name:Registration Private
Registrant Organization:Domains by Proxy, Inc.
Registrant Street1:DomainsByProxy.com
Registrant Street2:15111 N. Hayden Rd., Ste 160, PMB 353
Registrant Street3:
Registrant City:Scottsdale
Registrant State/Province:Arizona
Registrant Postal Code:85260
Registrant Country:US

Huh, Domains by Proxy. Again.

$ whois actde.org
Registrant Name:Registration Private
Registrant Organization:Domains by Proxy, Inc.
Registrant Street1:DomainsByProxy.com
Registrant Street2:15111 N. Hayden Rd., Ste 160, PMB 353
Registrant Street3:
Registrant City:Scottsdale
Registrant State/Province:Arizona
Registrant Postal Code:85260
Registrant Country:US

And again! What are the chances?!

Now, let’s take a closer look at these accreditation sites. I used wget to download the entirety of both sites (somehow, I had the feeling that they wouldn’t be that big). Indeed, one site was 10 files and the other was 11:

$ wget -r http://iaafoe.org/
$ wget -r http://actde.org/

Looking at these files, we can see certain similarities:

$ ls actde.org/
ACTDE  CSS  index.asp  index.html  PDF  robots.txt
$ ls iaafoe.org/
IAAFOE  index.asp  index.html  PDF  robots.txt
$ 
$ # is robots.txt non-trivial?
$ wc -l iaafoe.org/robots.txt
30 iaafoe.org/robots.txt
$  diff -s iaafoe.org/robots.txt actde.org/robots.txt 
Files iaafoe.org/robots.txt and actde.org/robots.txt are identical

Also, there’s a funny “Members Login” link on the ACTDE site that—whoops—isn’t actually a link. How hard is it to make a login page that doesn’t log anyone in?

Conclusion

Lorenz University seems to have “accredited” itself by creating two accreditation websites, and is trying to take advantage of people who think this will help them get a job.

What I’m really curious about is if they’ll accredit other bullshit. The accreditation sites seem to be non-interactive, and don’t have any way of taking money.

P.S. As long as I’m just picking on them… Lorenz University also bought the site lorenzuniversityscam.com, to defend against people calling them a scam. The scam site has a link, “Click here[sic] to visit the official website of Lorenz University and find out all the details about Lorenz University and the application process to get an accredited degrees.” They misspelled “university” in a link to their own site.


Implementing Replica Set Priorities

Replica set priorities will, very shortly, be allowed to vary between 0.0 and 100.0. The member with the highest priority that can reach a majority of the set will be elected master. (The change is done and works, but is being held up by 1.8.0… look for it after that release.) Implementing priorities was kind of an interesting problem, so I thought people might be interested in how it works. Following in the grand distributed systems lit tradition, I’m using the island nation of Replicos to demonstrate.

Replicos is a nation of islands that elect a primary island, called the Grand Poobah, to lead them. Each island casts a vote (or votes) and the island that gets a majority of the votes wins poobahship. If no one gets a majority (out of the total number of votes), no one becomes Grand Poobah. The islands can only communicate via carrier seagull.

Healthy configurations of islands, completely connected via carrier seagulls.

However, due to a perpetual war with the neighboring islands of Entropos, seagulls are often shot down mid-flight, disrupting communications between the Replicos islands until new seagulls can be trained.

The people of Replicos have realized that some islands are better at Poobah-ing than others. Their goal is to elect the island with the highest Poobah-ing ability that can reach a majority of the other islands. If all of the seagulls can make it to their destinations and back, electing a Poobah becomes trivial: an island sends a message saying they want to be Grand Poobah and everyone votes for them or says “I’m better at Poobah-ing, I should be Poobah.” However, it becomes tricky when you throw the Entropos Navy into the mix.

So, let’s say Entropos has shot down a bunch of seagulls, leaving us with only three:

The island with .5 Poobah ability should be elected leader (the island with 1 Poobah ability can’t reach a majority of the set). But how can .5 know that it should be Poobah? It knows 1.0 exists, so theoretically it could ask the islands it can reach to ask 1.0 if it wants to be Poobah, but it’s a pain to pass messages through multiple islands (takes longer, more chances of failure, a lot more edge cases to check), so we’d like to be able to elect a Poobah using only the directly reachable islands, if possible.

One possibility might be for the islands to send a response indicating whether they are connected to an island with a higher Poobah ability. In the case above, this would work (only one island is connected to an island with higher Poobah ability, so it can’t have a majority), but what about this case:

Every island, other than .5, is connected to a 1.0, but .5 should be the one elected! So, suppose we throw in a bit more information (which island of higher priority can be reached) and let the island trying to elect itself figure things out? Well, that doesn’t quite work: what if both .5 and 1.0 can reach a majority, but not the same one?

Conclusion: the Poobah-elect can’t figure this out on their own; everyone needs to work together.

Preliminaries: define an island to be Poohable if it has any aptitude for Poobah-ing and can reach a majority of the set. An island is not Poohable if it has no aptitude for Poobah-ing and/or cannot reach a majority of the set. Islands can be more or less Poohable, depending on their aptitude for Poobah-ing.

Every node knows whether or not it, itself, is Poohable: it knows its leadership abilities and if it can reach a majority of the islands. If more than one island (say islands A and B) is Poohable, then there must be at least one island that can reach both A and B [Proof at the bottom].

Let’s have each island keep a list of “possible Poobahs.” So, say we have an island A, that starts out with an empty list. If A is Poohable, it’ll add itself to the list (if it stops being Poohable, it’ll remove itself from the list). Now, whenever A communicates with another island, the other island will either say “add me to your list” or “remove me from your list,” depending on whether it is currently Poohable or not. Every other island does the same, so now each island has a list of the Poohable islands it can reach.

Now, say island X tries to elect itself master. It contacts all of the islands it can reach for votes. Each of the voting islands checks its list: if it has an entry on it that is more Poohable than X, it’ll send a veto. Otherwise X can be elected master. If you check the situations above (and any other situation) you can see that Poohability works, due to the strength of the guarantee that a Poobah must be able to reach a majority of the set.
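
Here’s a toy sketch of that list-plus-veto idea in plain JavaScript. This is not the actual server code; the islands, Poobah abilities, and the who-can-reach-whom structure are all invented for the example:

// poobahAbility[i] is island i's aptitude for Poobah-ing (0 means "never").
// reachable[i] is the list of islands island i can currently reach, including itself.

function isPoohable(i, poobahAbility, reachable, total) {
    // Poohable = some Poobah-ing ability AND can reach a majority of the set
    return poobahAbility[i] > 0 && reachable[i].length > total / 2;
}

// Each island keeps a list of the Poohable islands it can reach.
function poohableList(voter, poobahAbility, reachable, total) {
    return reachable[voter].filter(function(i) {
        return isPoohable(i, poobahAbility, reachable, total);
    });
}

// Island x asks every island it can reach for a vote; a voter vetoes if its
// list contains an island more Poohable than x.
function canBeElected(x, poobahAbility, reachable, total) {
    if (!isPoohable(x, poobahAbility, reachable, total)) return false;
    return reachable[x].every(function(voter) {
        return poohableList(voter, poobahAbility, reachable, total).every(function(i) {
            return poobahAbility[i] <= poobahAbility[x];
        });
    });
}

You can plug in the island configurations from the pictures above and check that only the most Poohable island that can reach a majority survives the vetoes.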

Proof: suppose a replica set has n members, a node A can reach a majority of the set (at least ⌊n/2+1⌋ members), and a node B can also reach a majority of the set (again, at least ⌊n/2+1⌋ members). If the sets of members A and B can reach were disjoint, the set would have to contain at least ⌊n/2+1⌋+⌊n/2+1⌋ ≥ n+1 members, which is more than the n it actually has. Therefore the set of nodes that A can reach and the set of nodes that B can reach are not disjoint.

“Scaling MongoDB” Update

Vroom vroom

In the last couple weeks, we’ve been getting a lot of questions like: (no one asked this specific question, this is just similar to the questions we’ve been getting)

I ran shardcollection, but it didn’t return immediately and I didn’t know what was going on, so I killed the shell and tried deleting a shard and then running the ‘shard collection’ command again and then I started killing ops and then I turned the balancer off and then I turned it back on and now I’m not sure what’s going on…

Aaaaagh! Stop running commands!

If a single server is like a TIE fighter, then a sharded cluster is like the Death Star: you’ve got more power, but you’re not making any sudden movements. For any configuration change you make, at least four servers have to talk to each other (usually more) and often a great deal of data has to get processed. If you ran all of the commands above on a big MongoDB install, everything would eventually work itself out (except the killing random operations part, that sort of depends on what got killed), but it could take a long time.

I think these questions stem from sharding being nerve-wracking: the documentation says what commands to run, but then nothing much seems to happen and everything seems slow and the command line doesn’t return a response (immediately). Meanwhile, you have hundreds of gigabytes of production data and MongoDB is chugging along doing… something.

So, I added some new sections to Scaling MongoDB on what to expect when you shard a big collection: if you run x in the shell, you’ll see y in the log, then MongoDB will be doing z until you see w. What it’s doing, what you’ll see, how (and if) you should react. In general: a sharding operation that hasn’t returned yet isn’t done, keep your eye on the logs, and don’t panic.

I’ve also added a section on backing up config servers and updated the chunk size information. If you bought the eBook, you can update it to the latest version for free to get the new info. (I love this eBook update system!) The update should be going out this week.

Let me know if there’s any other info that you think is missing and I’ll add it for future updates.

Resizing Your Oplog

The MongoDB replication oplog is, by default, 5% of your free disk space. The theory behind this is that, if you’re writing 5% of your disk space every x amount of time, you’re going to run out of disk in 19x time (the oplog itself takes up the other 5%). However, this doesn’t hold true for everyone; sometimes you’ll need a larger oplog. Some common cases:

  • Applications that delete almost as much data as they create.
  • Applications that do lots of in-place updates, which consume oplog entries but not disk space.
  • Applications that do lots of multi-updates or remove lots of documents at once. These multi-document operations have to be “exploded” into separate entries for each document in the oplog, so that the oplog remains idempotent (there’s a quick illustration after this list).
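
For example, here’s what that explosion looks like for a big remove. The test.logs collection and the 10,000-document count are made up, but the op : "d" entries are what you’d look for:

> // one remove on the primary...
> use test
> db.logs.remove({level : "debug"})   // suppose this matches 10,000 documents
>
> // ...turns into one oplog entry per deleted document
> use local
> db.oplog.rs.find({op : "d", ns : "test.logs"}).count()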

If you fall into one of these categories, you might want to think about allocating a bigger oplog to start out with. (Or, if you have a read-heavy application that only does a few writes, you might want a smaller oplog.) However, what if your application is already running in production when you realize you need to change the oplog size?

Usually if you’re having oplog size problems, you want to change the oplog size on the master. To change its oplog, we need to “quarantine” it so it can’t reach the other members (and your application), change the oplog size, then un-quarantine it.

To start the quarantine, shut down the master. Restart it without the --replSet option on a different port. So, for example, if I was starting MongoDB like this:

$ mongod --replSet foo # default port

I would restart it with:

$ mongod --port 10000

Replica set members look at the last entry of the oplog to see where to start syncing from. So, we want to do the following:

  1. Save the latest insert in the oplog.
  2. Resize the oplog
  3. Put the entry we saved in the new oplog.

So, the process is:

1. Save the latest insert in the oplog.

> use local
switched to db local
> // "i" is short for "insert"
> db.temp.save(db.oplog.rs.find({op : "i"}).sort(
... {$natural : -1}).limit(1).next())

Note that we are saving the last insert here. If there have been other operations since that insert (deletes, updates, commands), that’s fine, the oplog is designed to be able to replay ops multiple times. We don’t want to use deletes or updates as a checkpoint because those could have $s in their keys, and $s cannot be inserted into user collections.
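
If you want to make sure you grabbed the right thing, take a quick look at what you saved; it should be an entry with op : "i", along with ts, ns, and o fields (the exact fields vary a bit by version):

> db.temp.findOne()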

2. Resize the oplog

First, back up the existing oplog, just in case:

$ mongodump --db local --collection 'oplog.rs' --port 10000

Drop the local.oplog.rs collection, and recreate it to be the size that you want:

> db.oplog.rs.drop()
true
> // size is in bytes
> db.runCommand({create : "oplog.rs", capped : true, size : 1900000}) 
{ "ok" : 1 }

3. Put the entry we saved in the new oplog.

> db.oplog.rs.save(db.temp.findOne())

Making this server primary again

Now shut down the database and start it up again with --replSet on the correct port. Once it is a secondary, connect to the current primary and ask it to step down so you can have your old primary back (in 1.9+, you can use priorities to force a certain member to be preferentially primary and skip this step: it’ll automatically switch back to being primary ASAP).

> rs.stepDown(10000)
// you'll get some error messages because reconfiguration 
// causes the db to drop all connections

Your oplog is now the correct size.
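
If you want to double-check the new size, db.printReplicationInfo() prints the configured oplog size and the window of time it currently covers (a standard shell helper, nothing specific to this procedure):

> // run from a shell connected to this member
> db.printReplicationInfo()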

Edit: as Graham pointed out in the comments, you should do this on each machine that could become primary.

Enchiladas of Doom

Damjan Stanković's Eko light design

Andrew and I are visiting San Francisco this week. Last night, I wanted enchiladas from the Mexican place across the street from the hotel. It was still warm out even though the sun had set hours ago, so we decided to walk over.

Our hotel is on a busy road with three lanes in both directions, but there are lights along the road so there are periods when no cars are coming. We waited until a wave of cars had passed and sprinted across the first three lanes. In the darkness, it looked like the median was flush with the pavement and I charged at it. It was actually raised and my foot hit it six inches before I had expected to encounter anything solid. I staggered and lost my balance but I was still running full-steam, so I tripped my way forward ending up in the middle of the road. I lay there on the highway, stunned, looking at three lanes of cars coming at me.

Get up! screamed my brain. It hadn’t even finished shouting when Andrew scooped me up and half-carried me off of the road. He had almost tripped over me as I fell, but managed to leap over and then turn and pick me up. He is my Batman!

My ankle is still pretty sore and I’m a bit banged up down the side I fell on, but other than that I’m fine. And the enchiladas were delicious.

Also: if you’re a subscriber and you’re only interested in MongoDB-related posts, I created a new RSS feed you can subscribe to for just those posts. The old feed will continue to have all MongoDB posts, plus stuff about my life and other technology.