MapReduce – The Fanfiction

MapReduce is really cool, useful, and powerful, but a lot of people find it hard to wrap their heads around. This post is a fairly silly, non-technical explanation using Star Trek.

The Enterprise found a new planet, as it tends to do.

Kirk wanted to beam down immediately and start surveying the planet, but Spock told him to wait a moment. “It usually takes us one hour to survey a planet, correct, Captain? In less than 5 minutes, I can calculate whether the chance of encountering friendly alien females outweighs the risk of attack by brain-eating monsters.”

“Interesting idea, Spock,” said Kirk.  “Go ahead.”

The Data

“Logically,” thought Spock, “if we can survey a whole planet in one hour, we can survey 1/16th of a planet in 3.75 minutes.”  Spock divided the planet into 16 equal-size pieces and summoned 16 red shirts.

“You’ll be beamed down to the surface of the planet with this special data collection device called an ‘emitter.’ If you see a brain-eating monster, you press the ‘brain-eating monster’ button on your emitter. If you see an attractive female alien, you press the ‘hot alien chick’ button. Press either, neither, or both buttons, as your situation requires.”

The Map Step

The 16 red shirts were beamed down to the 16 parts of the planet. As they found things, they would press the buttons on their emitters.

Back on the Enterprise, Spock started getting lots of data pairs that looked like:

| type                 | location |
|----------------------|----------|
| Brain-eating monster | 2        |
| Hot alien chick      | 7        |
| Brain-eating monster | 14       |
| Brain-eating monster | 7        |

The Reduce Step

“Computer,” Spock said.  “Initialize a counter to 0 for each new type you get.  Then, for every subsequent data pair with the same type, increment that counter.”

“I dinnae understand,” said Scotty.  “What’s that, then?”

“I basically told the computer to initialize two variables, ‘Brain-eating monster’ and ‘Hot alien chick’, setting them both to zero. Every time the computer gets a ‘Brain-eating monster’ emit, it increments the ‘Brain-eating monster’ variable. Every time it gets a ‘Hot alien chick’ emit, it increments the ‘Hot alien chick’ variable.”

“Ah, I see,” said Scotty.  “But don’t you lose the location information?”

“Yes,” replied Spock.  “But I don’t actually care about location for this readout.  If I wanted the location, I could give the computer a slightly more complicated algorithm, but right now I just want the count.”

The Result

After 3.75 minutes, Spock beamed up the red shirts who were still alive and presented his findings to Kirk: “There are brain-eating monsters on 7/8ths of the planet, Captain. 1/16th of the planet has hot alien chicks.”

“Excellent work, Spock,” said Kirk. “Let’s boldly go somewhere else.”

And so they did.

Captain’s log, stardate 1419.7 (a.k.a. a summary of what we did)

  1. Goal – To generate a report on a planet.
  2. Data – 16 pieces of land with various attributes. Each piece of land could be represented by a JSON object such as:
    {
        "location" : 5,
        "contains" : ["Brain-eating monsters", "rocks", "poison gas"]
    }
  3. Map – Send attributes for each piece of data back to the processor. In JSON, each emit would look something like:
    {
        "Brain-eating monsters" : 5
    }
  4. Reduce – Sum up the data, grouping by type (see the sketch below).
  5. Result – How much of each attribute is on the planet.
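
Since the Enterprise’s computer presumably doesn’t run Python, here’s a minimal sketch of the same map/reduce pattern in Python. The data is made up, and this mimics the idea rather than any particular framework:

# each red shirt's report: one document per piece of land (hypothetical data)
planet = [
    {"location": 2, "contains": ["Brain-eating monsters"]},
    {"location": 7, "contains": ["Hot alien chicks", "Brain-eating monsters"]},
    {"location": 14, "contains": ["Brain-eating monsters"]},
]

# map step: each piece of land emits a (type, location) pair per attribute
def map_land(piece):
    for attribute in piece["contains"]:
        yield (attribute, piece["location"])

# reduce step: initialize a counter to 0 for each new type, then increment
counts = {}
for piece in planet:
    for attribute, location in map_land(piece):
        counts[attribute] = counts.get(attribute, 0) + 1

print(counts)  # {'Brain-eating monsters': 3, 'Hot alien chicks': 1}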

Further reading: Kyle Banker has an excellent (and more technical) explanation of MapReduce.

Bug Reporting: A How-To

This type of bug report drives me nuts:

You have a critical bug! This crashes Java!

for (int i=0; i<10; i++) {
     cursor.next();
}

(I’ve never gotten this exact report and I’m not picking on anyone, it’s a composite.)

This doesn’t crash for me.  It doesn’t even compile for me because the variable “cursor” isn’t defined. If you’re going to use a variable (a function, a framework, etc.) in a code snippet, you have to define it. Let’s try again:

Mongo m = new Mongo();
DB db = m.getDB("bar");
DBCollection coll = db.getCollection("foo");
DBCursor cursor = coll.find();

for (int i=0; i<10; i++) {
     cursor.next();
}

Better! But this is probably crashing because of something in your database.  Unless it crashes regardless of dataset, you need to send me the data that makes it crash.  The basic rule is:

The faster I can reproduce your bug, the faster I can fix it.
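
If the data lives in MongoDB, mongodump makes this painless. For the example above (database bar, collection foo), that would be something like:

$ mongodump --db bar --collection foo

Then zip up the resulting dump directory and attach it to the report.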

Some other tips for submitting bug reports:

  1. Make sure to include information about your environment.  The more the merrier.
  2. If I ask for log messages, please send me the entire log.  If it’s been running for days and the log’s a zillion lines long, send everything from around the time the error happened (before, during, after).  Please, please, please don’t skim the logs, extract a single suspicious-looking line, and send it to me.  I promise that I’m not going to be mad about having to wade through a couple hundred (or thousand) lines to find what I’m looking for.  I would rather quickly skim a bunch of extra info than pry the logs, line by line, from your clutches.
  3. If you are running on a non-traditional setup (e.g., a Linux kernel you modified yourself on a mainframe from 1972), it would be super helpful if you can give me access to your machine.  If your bug is platform-specific to ENIAC, it’s doubtful I’m going to be able to figure out what’s going wrong on my MacBook Air.

And, of course, flatter the developer’s ego when they’ve fixed the bug.  Not applicable for me, of course, but other developers like it when you give them a little “Thanks, you’re awesome!” biscuit when they’ve fixed your bug.

To my users: thank you to everyone who has ever filed a bug report.  I’m really, really happy that you’re using my software and that you care enough about it to submit a bug, instead of just giving up. Seriously, thank you. I have to give a shout-out to Perl developers in particular, here.  More than half the time, people reporting bugs in the MongoDB Perl driver actually include a patch in the bug report that fixes it!  I love you guys.

Sleepy.Mongoose: A MongoDB HTTP Interface

The first half of the MongoDB book is due this week, so I wrote a REST interface for Mongo (I’m a prolific procrastinator).  Anyway, it’s called Sleepy.Mongoose and it’s available at https://github.com/10gen-labs/sleepy.mongoose.

Installing Sleepy.Mongoose

  1. Install MongoDB.
  2. Install the Python driver:
    $ easy_install pymongo
  3. Download Sleepy.Mongoose.
  4. From the mongoose directory, run:
    $ python httpd.py

You’ll see something that looks like:

=================================
|      MongoDB REST Server      |
=================================

listening for connections on http://localhost:27080

Using Sleepy.Mongoose

First, we’re just going to ping Sleepy.Mongoose to make sure it’s awake. You can use curl:

$ curl 'http://localhost:27080/_hello'

and it’ll send back a Star Wars quote.

To really use the interface, we need to connect to a database server. To do this, we post our database server address to the URI “/_connect” (all actions start with an underscore):

$ curl --data server=localhost:27017 'http://localhost:27080/_connect'

This connects to the database running at localhost:27017.

Now let’s insert something into a collection.

$ curl --data 'docs=[{"x":1}]' 'http://localhost:27080/foo/bar/_insert'

This will insert the document {"x" : 1} into the foo database’s bar collection. If we open up the JavaScript shell (mongo), we can see the document we just added:

> use foo
> db.bar.find()
{ "_id" : ObjectId("4b7edc9a1d41c8137e000000"), "x" : 1 }

But why bother opening the shell when we can query with curl?

$ curl -X GET 'http://localhost:27080/foo/bar/_find'
{"ok": 1, "results": [{"x": 1, "_id": {"$oid": "4b7edc9a1d41c8137e000000"}}], "id": 0}

Note that queries are GET requests, whereas the other requests up to this point have been POSTs (well, _hello can be either).

A query returns three fields:

  • “ok”, which will be 1 if the query succeeded and 0 otherwise
  • “results”, an array of documents from the db
  • “id”, an identifier for that particular query

In this case, “id” is irrelevant, as we only have one document in the collection. If we had a bunch, though, we could use the id to get more results (by default, _find only returns the first 15 matching documents, although this is configurable). This will probably be clearer with an example, so let’s add some more documents to see how this works:

$ curl --data 'docs=[{"x":2},{"x":3}]' 'http://localhost:27080/foo/bar/_insert'
{"ok" : 1}

Now we have three documents. Let’s do a query and ask for it to return one result at a time:

$ curl -X GET 'http://localhost:27080/foo/bar/_find?batch_size=1'
{"ok": 1, "results": [{"x": 1, "_id": {"$oid": "4b7edc9a1d41c8137e000000"}}], "id": 1}

The only difference between this query and the one above is the “?batch_size=1” which means “send one document back.” Notice that the cursor id is 1 now, too (not 0). To get the next result, we can do:

$ curl -X GET 'http://localhost:27080/foo/bar/_more?id=1&batch_size=1'
{"ok": 1, "results": [{"x": 2, "_id": {"$oid": "4b7ee0731d41c8137e000001"}}], "id": 1}
$ curl -X GET 'http://localhost:27080/foo/bar/_more?id=1&batch_size=1'
{"ok": 1, "results": [{"x": 3, "_id": {"$oid": "4b7ee0731d41c8137e000002"}}], "id": 1}

Now let’s remove a document:

$ curl --data 'criteria={"x":2}' 'http://localhost:27080/foo/bar/_remove'
{"ok" : 1}

Now if we do a _find, it only returns two documents:

$ curl -X GET 'http://localhost:27080/foo/bar/_find'
{"ok": 1, "results": [{"x": 1, "_id": {"$oid": "4b7edc9a1d41c8137e000000"}}, {"x": 3, "_id": {"$oid": "4b7ee0731d41c8137e000002"}}], "id": 2}

And finally, updates:

$ curl --data 'criteria={"x":1}&newobj={"$inc":{"x":1}}' 'http://localhost:27080/foo/bar/_update'

Let’s do a _find to see the updated object, this time using the criteria {“x”:2}. To put this in a URL, we need to escape the special characters. You can do this by copy-pasting the string into any JavaScript interpreter (Rhino, SpiderMonkey, mongo, Firebug, Chrome’s dev tools) as follows:

> escape('{"x":2}')
%7B%22x%22%3A2%7D
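
If Python is handier for you than JavaScript, urllib does the same job:

>>> from urllib.parse import quote
>>> quote('{"x":2}')
'%7B%22x%22%3A2%7D'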

And now we can use that in our URL:

$ curl -X GET 'http://localhost:27080/foo/bar/_find?criteria=%7B%22x%22%3A2%7D'
{"ok": 1, "results": [{"x": 2, "_id": {"$oid": "4b7edc9a1d41c8137e000000"}}], "id": 0}

If you’re looking to go beyond the basic CRUD, there’s more documentation in the wiki.

This code is super-alpha. Comments, questions, suggestions, patches, and forks are all welcome.

Note: Sleepy.Mongoose is an offshoot of something I’m actually supposed to be working on: a JavaScript API we’re going to use to make an awesome sharding tool. Administering your cluster will be a point-and-click interface. You’ll be able to see how everything is doing, drag and drop chunks, visually split collections… it’s going to be so cool.

“Introduction to MongoDB” Video

This is the video of the talk I gave last Sunday at the NoSQL Devroom at FOSDEM. It’s about why MongoDB was created, what it’s good at (and a bit about what it’s not good for), the basic syntax, and how sharding and replication work; it covers a lot of ground.

You can also go to Parleys.com to see the video with my slides next to it.


St. Clementine’s Day

The night before Valentine’s Day, I got Andrew a crate of clementines (they’re already gone). Yesterday, The Doghouse Diaries ran a comic that fit the occasion perfectly.

I came down with a cold on Friday, and neither of us wanted to do anything for Valentine’s Day, so we ended up playing Legend of Zelda for most of it. When we got hungry, we started looking through the cabinets, and I saw some dried apricots that reminded me of some chocolates Andrew got me for Christmas.

“For future reference,” I said, “I loved those chocolate-covered apricots.  They were so good.”

“You want to make some now?” he asked.

Now, I was sick and tired, and that sounded a lot like work. But it was so freakin’ easy. And awesome. And delicious. And come on, how much more romantic can you get? Here’s how to do it yourself:

  1. Take a lump of semisweet baker’s chocolate about twice the size of a Hershey’s bar
  2. Stick it in a bowl in the microwave for 30 seconds
  3. Mush it up and dump in a bunch of dried fruit/nuts/whatever
  4. Spoon each chocolate-covered item onto a sheet of parchment
  5. Stick the parchment in the fridge

An hour later, we had my favorite chocolates.

Then we watched Star Wars.

My life is awesome.  Thank you for being so wonderful, Andrew.

Mongo Mailbag #2: Updating GridFS Files

Welcome to week two of Mongo Mailbag, where I take a question from the Mongo mailing list and answer it in more detail. If you have a question you’d like to see answered in excruciating detail, feel free to email it to me.

Is it possible (with the PHP driver) to storeBytes into GridFS (for example CSS data), and later change that data?!

I get some strange behavior when passing an existing _id value in the $extra array of MongoGridFS::storeBytes, sometimes Apache (under Windows) crashes when reloading the file, sometimes it doesn’t seem to be updated at all.

So I wonder, is it even possible to update files in GridFS?! 🙂

-Wouter

If you already understand GridFS, feel free to skip to the last section. For everyone else…

Intro to GridFS

GridFS is the standard way MongoDB drivers handle files: a protocol that allows you to save arbitrarily large files to the database. It’s not the only way, and it’s not necessarily the best way, but it’s the built-in way that all of the drivers support. This means that you can use GridFS to save a file in Ruby and then retrieve it using Perl, and vice versa.

Why would you want to store files in the database? Well, it can be handy for a number of reasons:

  • If you set up replication, you’ll have automatic backups of your files.
  • You can keep millions of files in one (logical) directory… something most filesystems either won’t allow or aren’t good at.
  • You can keep information associated with the file (who’s edited it, download count, description, etc.) right with the file itself.
  • You can easily access info from random sections of large files, another thing traditional file tools aren’t good at.

There are some limitations, too:

  • You can’t have an arbitrary number of files per document… it’s one file, one document.
  • You must use a specific naming scheme for the collections involved: prefix.files and prefix.chunks (by default prefix is “fs”: fs.files and fs.chunks).

If you have complex requirements for your files (e.g., YouTube), you’d probably want to come up with your own protocol for file storage. However, for most applications, GridFS is a good solution.

How it Works

GridFS breaks large files into manageable chunks. It saves the chunks to one collection (fs.chunks) and then metadata about the file to another collection (fs.files). When you query for the file, GridFS queries the chunks collection and returns the file one piece at a time.
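
You can watch this happen from Python with pymongo’s gridfs module. A quick sketch (the database name, filename, and contents here are made up):

import gridfs
from pymongo import MongoClient

db = MongoClient().test
fs = gridfs.GridFS(db)  # uses the default "fs" prefix

# save a file: GridFS splits the bytes into chunks for us
file_id = fs.put(b"x" * 10000000, filename="big.bin")

# the metadata lives in fs.files...
print(db.fs.files.find_one({"_id": file_id}))

# ...and the bytes live in fs.chunks, one document per chunk,
# tied back to the file by files_id and ordered by n
print(db.fs.chunks.count_documents({"files_id": file_id}))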

Here are some common questions about GridFS:

Q: Why not just save the whole file in a single document?
A: MongoDB has a 4MB cap on document size.
Q: That’s inconvenient, why?
A: It’s an arbitrary limit, mostly to prevent bad schema design.
Q: But in this case it would be so handy!
A: Not really. Imagine you’re storing a 20GB file. Do you really want to return the whole thing at once? That means 20GB of memory will be used whenever you query for that document. Do you even have that much memory? Do you want it taken up by a single request?
Q: Well, no.
A: The nice thing about GridFS is that it streams the data back to the client, so you never need more than 4MB of memory.
Q: Now I know.
A: And knowing is half the battle.
Together: G.I. Joe!

Answer the Damn Question

Back to Wouter’s question: changing the metadata is easy. If we wanted to add, say, a “permissions” field, we could run the following PHP code:

$files = $db->fs->files;
$files->update(array("filename" => "installer.bin"), array('$set' => array("permissions" => "555")));

// or, equivalently, from the MongoGridFS object:

$grid->update(array("filename" => "installer.bin"), array('$set' => array("permissions" => "555")));

Updating the file itself, which is what Wouter is actually asking about, is significantly more complex. If we want to update the binary data, we’ll need to reach into the chunks collection and update every document associated with the file. (Edit: unless you’re using the C# driver! See Sam Corder’s comment below.) It would look something like:

// a handle to the new version of the file's contents (path is an example)
$file = fopen("newVersion.bin", "rb");

// get the target file's chunks, in order
$chunks = $db->fs->chunks;
$cursor = $chunks->find(array("files_id" => $fileId))->sort(array("n" => 1));

$newLength = 0;

foreach ($cursor as $chunk) {
    // read in a string of bytes from the new version of the file
    $bindata = fread($file, MongoGridFS::$chunkSize);
    $newLength += strlen($bindata);

    // put the new version's contents in this chunk
    $chunk["data"] = new MongoBinData($bindata);

    // update the chunks collection with this new chunk
    $chunks->save($chunk);
}

// update the file length metadata (necessary for retrieving the file)
$db->fs->files->update(array("_id" => $fileId), array('$set' => array("length" => $newLength)));

The code above doesn’t handle a bunch of cases (what if the new file has a different number of chunks than the old one?), and anything beyond this basic scenario gets irritatingly complex. Rather than updating individual chunks, you should probably just remove the GridFS file and save it again; it’ll end up taking about the same amount of time and be less error-prone.
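
For instance, here’s what remove-and-resave looks like in Python with pymongo’s gridfs (a sketch; the paths and filenames are stand-ins):

import gridfs
from pymongo import MongoClient

db = MongoClient().test
fs = gridfs.GridFS(db)

# read in the new version of the file (hypothetical path)
with open("installer-v2.bin", "rb") as f:
    new_contents = f.read()

# find the old file, then replace it wholesale
old = db.fs.files.find_one({"filename": "installer.bin"})
fs.delete(old["_id"])  # removes the fs.files doc and all of its chunks
fs.put(new_contents, filename="installer.bin")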

FOSDEM: Some Pictures

This picture was taken by outerthought at FOSDEM. People look fairly interested 🙂 There was a guy on the other side of the room who was asleep the whole time, but he was old and I tried not to look at him. You can see I’m all super-professional in my XKCD “I’m not slacking off, my code’s compiling” T-shirt.

Given that I’m in Belgium, I sneaked a few Magrittes into my slideshow; they just seemed to call out “consistency” and “transactions” to me.

Andrew and I actually tried to visit the Magritte museum today, but we couldn’t find it. We walked all around, circled the block… I figure it must be in his old house and it was closed, because it was indistinguishable from every other townhouse on the street. Annoying.

FOSDEM

I gave a talk at FOSDEM (Free and Open Source Developers European Meetup) this morning: “Introduction to MongoDB”. It went pretty well, I think. Slides are up at scribd.com and it was recorded, so the video for it should be somewhere soon (I’ll update when I find out where).

The trip across the Atlantic was interesting. It was so bumpy that one of the stewardesses serving drinks fell over, and the captain announced, “All flight attendants, take your seats with your seat belts fastened!” Takeoff had been delayed to fix something and, in addition to the plane regularly dropping a couple dozen feet in altitude, the engine kept making funny noises, so I was pretty sure we were going to go down. I considered putting my shoes on, but I decided that the last thing I wanted while floating on a raft in the North Atlantic was wet shoes. The plane pulled through, though, and eventually we got to Belgium. So, no excellent story (or fiery death) for me.

One of the cool things about being here was that I got to meet chx, a Drupal developer. He’s helping integrate MongoDB and Drupal 7. We have been wanting to send him some schwag, but he’s on an extended visit to relatives who live on a hill that the postman can’t climb. However, I knew he was going to be here, so I carried a mug all the way to Belgium and got to give it to him. Mongo devs: neither snow nor sleet nor the gloom of 3,000 miles of ocean keeps these swift couriers from delivering mugs. Woot.

My talk was at 10am (4am New York time… ugh). Andrew and I went to the conference’s cafeteria beforehand so I could get some coffee. It was… interesting. I have a theory on how Belgians make coffee: they brew a pot of coffee, and then let it sit on a burner until only a cup is left in the pot. Then they serve you that cup. Now, I am grateful, because I managed to drink it (with the help of a chocolate croissant) and it kept me upright for my talk, but I am glad I live in a country where people like their coffee watery.

Andrew and I are at what I think is the Belgian equivalent of a diner, having some coffee and beer. I feel like a total philistine, but I can’t actually tell the difference between Belgian beer and a decent American beer. Obviously more data points are necessary; I’ll be looking into it further tonight.

Giving talks is fun, but stressful. I feel like my whole body is relaxing now. I’m looking forward to sleeping at least 12 hours tonight.