The Comments Conundrum

One of the most common questions I see about MongoDB schema design is:

I have a collection of blog posts and each post has an array of comments. How do I get…
…all comments by a given author
…the most recent comments
…the most popular commenters?

And so on. The answer to this has always been “Well, you can’t do that on the server side…” You can either do it on the client side or store comments in their own collection. What you really want is the ability to treat embedded documents like a “real” collection.

The aggregation pipeline gives you this ability by letting you “unwind” arrays into separate documents, then doing whatever else you need to do in subsequent pipeline operators.

For example…

Note: you must be running at least version 2.1.0 of MongoDB to use the aggregation pipeline.
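You can check which build you’re running from the shell:

> db.version()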

Getting all comments by Serious Cat

Serious Cat’s comments are scattered between post documents, so there wasn’t a good way of querying for just those embedded documents. Now there is.

Let’s assume we want each comment by Serious Cat, along with the title and url of the post Serious Cat was commenting on. So, the steps we need to take are:

  1. Extract the fields we want (title, url, comments)
  2. Unwind the comments field: make each comment into a “real” document
  3. Query our new “comments collection” for “Serious Cat”

Using the aggregation pipeline, this looks like:

> db.runCommand({aggregate: "posts", pipeline: [
{
   // extract the fields 
   $project: {
        title : 1,
        url : 1,
        comments : 1
    }
},
{
    // explode the "comments" array into separate documents
    $unwind: "$comments"
},
{
    // query like a boss
    $match: {"comments.author" : "Serious Cat"}
}]})

Now, this works well for something like a blog, where you have human-generated (small) data. If you’ve got gigs of comments to go through, you probably want to filter out as many as possible (e.g., with $match or $limit) before sending everything to the “everything-in-memory” parts of the pipeline.
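For example, here’s a sketch of a friendlier version of the pipeline above. The date field and the one-week cutoff are hypothetical; the point is just to shrink the working set before $unwind ever sees it:

> lastWeek = new Date(new Date() - 7*24*60*60*1000)
> db.runCommand({aggregate: "posts", pipeline: [
{
    // hypothetical filter: only consider posts from the last week
    $match: {date: {$gte: lastWeek}}
},
{
    $project: {title: 1, url: 1, comments: 1}
},
{
    $unwind: "$comments"
},
{
    // same query as before
    $match: {"comments.author": "Serious Cat"}
}]})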

Getting the most recent comments

Let’s assume our site lists the 10 most recent comments across all posts, with links back to the posts they appeared on, e.g.,

  1. Great post! -Jerry (February 2nd, 2012) from This is a Great Post
  2. What does batrachophagous mean? -Fred (February 2nd, 2012) from Fun with Crosswords
  3. Where can I get discount Prada shoes? -Tom (February 1st, 2012) from Rant about Spam

To extract these comments from a collection of posts, you could do something like:

> db.runCommand({aggregate: "posts", pipeline: [
{
   // extract the fields
   $project: {
        title : 1,
        url : 1,
        comments : 1
    }
},
{
    // explode "comments" array into separate documents
    $unwind: "$comments"
},
{
    // sort newest first
    $sort: {
        "comments.date" : -1
    }
},
{
    // get the 10 newest
    $limit: 10
}]})

Let’s take a moment to look at what $unwind does to a sample document.

Suppose you have a document that looks like this after the $project:

{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : [
        {text : "Where can I get discount Prada shoes?", ...},
        {text : "First!", ...},
        {text : "I hate spam, too!", ...},
        {text : "I love spam.", ...}
    ]
}

Then, after unwinding the comments field, you’d have four documents, one per comment, each with the comments array replaced by a single embedded comment:

{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : {text : "Where can I get discount Prada shoes?", ...}
}
{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : {text : "First!", ...}
}
{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : {text : "I hate spam, too!", ...}
}
{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : {text : "I love spam.", ...}
}

Then we $sort, $limit, and Bob’s your uncle.

Rank commenters by popularity

Suppose we allow users to upvote comments and we want to see who the most popular commenters are.

The steps we want to take are:

  1. Project out the fields we need (similar to above)
  2. Unwind the comments array (similar to above)
  3. Group by author, taking a count of votes (this will sum up all of the votes for each comment)
  4. Sort authors to find the most popular commenters

Using the pipeline, this would look like:

> db.runCommand({aggregate: "posts", pipeline: [
{
   // extract the fields we'll need
   $project: {
        title : 1,
        url : 1,
        comments : 1
    }
},
{
    // explode "comments" array into separate documents
    $unwind: "$comments"
},
{
    // count up votes by author
    $group : {
        _id : "$comments.author",
        popularity : {$sum : "$comments.votes"}
    }
},
{
    // sort by the new popular field
    $sort: {
        "popularity" : -1
    }
}]})

As I mentioned before, there are a couple of downsides to using the aggregation pipeline: a lot of the work is done in memory, so it can be very CPU- and memory-intensive. However, used judiciously, it gives you a lot more freedom to mush around your embedded documents.

Hacking Chess: Data Munging

This is a supplement to the Hacking Chess with the MongoDB Pipeline post; it has instructions for rolling your own data sets from chess games.

Download a collection of chess games you like. I’m using 1132 wins in 10 moves or less, but any of them should work.

These files are in a format called portable game notation (.PGN), which is a human-readable notation for chess games. For example, the first game in TEN.PGN (helloooo 80s filenames) looks like:

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "Gedult D"]
[Black "Kohn V"]
[Result "1-0"]
[ECO "B33/09"]

1.e4 c5 2.Nf3 Nc6 3.d4 cxd4 4.Nxd4 Nf6
5.Nc3 e5 6.Ndb5 d6 7.Nd5 Nxd5 8.exd5 Ne7
9.c4 a6 10.Qa4  1-0

This represents a 10-turn win at an unknown event. The “ECO” field shows which opening was used (a Sicilian in the game above).

Unfortunately for us, MongoDB doesn’t import PGNs in their native format, so we’ll need to convert them to JSON. I found a PGN->JSON converter in PHP that did the job here. Scroll down to the “download” section to get the .zip.

It’s one of those zips that vomits its contents into whatever directory you unzip it in, so create a new directory for it.

So far, we have:

$ mkdir chess
$ cd chess
$
$ ftp ftp://ftp.pitt.edu/group/student-activities/chess/PGN/Collections/ten-pg.zip ./
$ unzip ten-pg.zip
$
$ wget http://www.dhtmlgoodies.com/scripts/dhtml-chess/dhtml-chess.zip
$ unzip dhtml-chess.zip

Now, create a simple script, say parse.php, to run through the chess games and output them in JSON, one per line. (The include path and class name below depend on how the converter you downloaded is laid out, so adjust them to match wherever you unzipped it.)

<?php
// parse.php: dump every game in ten.pgn as one JSON document per line.
// Adjust the include path/class name to match the converter's layout.
include("pgn_parser.php");

$parser = new PgnParser("ten.pgn");
$numGames = $parser->getNumberOfGames();
for ($i = 0; $i < $numGames; $i++) {
    echo $parser->getGameDetailsAsJson($i)."\n";
}

?>

Run parse.php and dump the results into a file:

$ php parse.php > games.json

Now you’re ready to import games.json.
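If you want to try it right away, the import itself is a one-liner (covered in more detail in the main post):

$ mongoimport --db chess --collection fast_win games.json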

Back to the original “hacking” post

Hacking Chess with the MongoDB Pipeline

MongoDB’s new aggregation framework is now available in the nightly build! This post demonstrates some of its capabilities by using it to analyze chess games.

Make sure you have the “Development Release (Unstable)” nightly running before trying out the stuff in this post. The aggregation framework will be in 2.1.0, but as of this writing it’s only in the nightly build.

First, we need some chess games to analyze. Download games.json, which contains 1132 games that were won in 10 moves or less (crush their soul and do it quick).

You can use mongoimport to import games.json into MongoDB:

$ mongoimport --db chess --collection fast_win games.json
connected to: 127.0.0.1
imported 1132 objects

We can take a look at our chess games in the Mongo shell:

> use chess
switched to db chess
> db.fast_win.count()
1132
> db.fast_win.findOne()
{
	"_id" : ObjectId("4ed3965bf86479436d6f1cd7"),
	"event" : "?",
	"site" : "?",
	"date" : "????.??.??",
	"round" : "?",
	"white" : "Gedult D",
	"black" : "Kohn V",
	"result" : "1-0",
	"eco" : "B33/09",
	"moves" : {
		"1" : {
			"white" : {
				"move" : "e4"
			},
			"black" : {
				"move" : "c5"
			}
		},
		"2" : {
			"white" : {
				"move" : "Nf3"
			},
			"black" : {
				"move" : "Nc6"
			}
		},
                ...
		"10" : {
			"white" : {
				"move" : "Qa4"
			}
		}
	}
}

Not exactly the greatest schema, but that’s how the chess format exporter munged it. Regardless, now we can use aggregation pipelines to analyze these games.

Experiment #1: First Mover Advantage

White has a slight advantage in chess from moving first (Wikipedia puts white’s chance of winning at 52%-56%). I’d hypothesize that, in a short game, going first matters even more.

Let’s find out.

The “result” field in these docs is “1-0” if white wins and “0-1” if black wins. So, we want to divide our docs into two groups based on the “result” field and count how many docs are in each group. Using the aggregation pipeline, this looks like:

> db.runCommand({aggregate : "fast_win", pipeline : [
... {
...    $group : {
...        _id : "$result",      // group by 'result' field
...        numGames : {$sum : 1} // add 1 for every document in the group
...    }
... }]})
{
	"result" : [
		{
			"_id" : "0-1",
			"numGames" : 435
		},
		{
			"_id" : "1-0",
			"numGames" : 697
		}
	],
	"ok" : 1
}

That gives a 62% chance white will win (697 wins/1132 total games). Pretty good (although, of course, this isn’t a very large sample set).

[Image: a reference chessboard with ranks 1-8 and files a-h marked, in case you're not familiar with chess notation.]

Experiment #2: Best Starting Move

Given a starting move, what percent of the time will that move lead to victory? This probably depends on whether you’re playing white or black, so we’ll just focus on white’s opening move.

First, we’ll just determine what starting moves white uses with this series of steps:

  • project all of white’s first moves (the moves.1.white.move field)
  • group all docs with the same starting move together
  • and count how many documents (games) used that move.

These steps look like:

> db.runCommand({aggregate: "fast_win", pipeline: [
... // '$project' is used to extract all of white's opening moves
... {
...     $project : {
...         // extract moves.1.white.move into a new field, firstMove
...         firstMove : "$moves.1.white.move"
...     }
... },
... // use '$group' to calculate the number of times each move occurred
... {
...     $group : { 
...         _id : "$firstMove",
...         numGames : {$sum : 1}
...     }
... }]})
{
	"result" : [
		{
			"_id" : "d3",
			"numGames" : 2
		},
		{
			"_id" : "e4",
			"numGames" : 696
		},
		{
			"_id" : "b4",
			"numGames" : 17
		},
		{
			"_id" : "g3",
			"numGames" : 3
		},
		{
			"_id" : "e3",
			"numGames" : 2
		},
		{
			"_id" : "c4",
			"numGames" : 36
		},
		{
			"_id" : "b3",
			"numGames" : 4
		},
		{
			"_id" : "g4",
			"numGames" : 11
		},
		{
			"_id" : "h4",
			"numGames" : 1
		},
		{
			"_id" : "Nf3",
			"numGames" : 37
		},
		{
			"_id" : "f3",
			"numGames" : 1
		},
		{
			"_id" : "f4",
			"numGames" : 25
		},
		{
			"_id" : "Nc3",
			"numGames" : 14
		},
		{
			"_id" : "d4",
			"numGames" : 283
		}
	],
	"ok" : 1
}

Now let’s compare those numbers with whether white won or lost.

> db.runCommand({aggregate: "fast_win", pipeline: [
... // extract the first move
... {
...    $project : {
...        firstMove : "$moves.1.white.move",
...        // create a new field, "win", which is 1 if white won and 0 if black won
...        win : {$cond : [
...            {$eq : ["$result", "1-0"]}, 1, 0
...        ]}
...    }
... },
... // group by the move and count up how many winning games used it
... {
...     $group : {
...         _id : "$firstMove",
...         numGames : {$sum : 1},
...         numWins : {$sum : "$win"}
...     }
... },
... // calculate the percent of games won with this starting move
... {
...     $project : {
...         _id : 1,
...         numGames : 1,
...         percentWins : {
...             $multiply : [100, {
...                 $divide : ["$numWins","$numGames"]
...             }]
...         }
...     }
... },
... // discard moves that were used in less than 10 games (probably not representative) 
... {
...     $match : {
...         numGames : {$gte : 10}
...     }
... },
... // order from worst to best
... {
...     $sort : {
...         percentWins : 1
...     }
... }]})
{
	"result" : [
		{
			"_id" : "f4",
			"numGames" : 25,
			"percentWins" : 24
		},
		{
			"_id" : "b4",
			"numGames" : 17,
			"percentWins" : 35.294117647058826
		},
		{
			"_id" : "c4",
			"numGames" : 36,
			"percentWins" : 50
		},
		{
			"_id" : "d4",
			"numGames" : 283,
			"percentWins" : 50.53003533568905
		},
		{
			"_id" : "g4",
			"numGames" : 11,
			"percentWins" : 63.63636363636363
		},
		{
			"_id" : "Nf3",
			"numGames" : 37,
			"percentWins" : 67.56756756756756
		},
		{
			"_id" : "e4",
			"numGames" : 696,
			"percentWins" : 68.24712643678161
		},
		{
			"_id" : "Nc3",
			"numGames" : 14,
			"percentWins" : 78.57142857142857
		}
	],
	"ok" : 1
}

Pawn to e4 seems like the most dependable winner here. Knight to c3 also seems like a good choice (at a nearly 80% win rate), but it was only used in 14 games.

Experiment #3: Best and Worst Moves for Black

We basically want to do a similar pipeline to Experiment 2, but for black. At the end, we want to find the best and worst percent.

> db.runCommand({aggregate: "fast_win", pipeline: [
... // extract the first move
... {
...    $project : {
...        firstMove : "$moves.1.black.move",
...        win : {$cond : [
...            {$eq : ["$result", "0-1"]}, 1, 0
...        ]}
...    }
... },
... // group by the move and count up how many winning games used it
... {
...     $group : {
...         _id : "$firstMove",
...         numGames : {$sum : 1},
...         numWins : {$sum : "$win"}
...     }
... },
... // calculate the percent of games won with this starting move
... {
...     $project : {
...         _id : 1,
...         numGames : 1,
...         percentWins : {
...             $multiply : [100, {
...                 $divide : ["$numWins","$numGames"]
...             }]
...         }
...     }
... },
... // discard moves that were used in less than 10 games (probably not representative) 
... {
...     $match : {
...         numGames : {$gte : 10}
...     }
... },
... // sort by % win rate
... {
...     $sort : {
...         percentWins : -1
...     }
... },
... // get the best and worst
... {
...     $group : {
...          _id : 1,
...          best : {$first : "$_id"},
...          worst : {$last : "$_id"}
...     }
... }]})
{
	"result" : [
		{
			"_id" : 1,
			"best" : "Nf6",
			"worst" : "d6"
		}
	],
	"ok" : 1
}

I like this new aggregation functionality because it feels simpler than MapReduce. You can start with a one-operation pipeline and build it up, step by step, seeing exactly what a given operation does to your output. (And no JavaScript required, which is always a plus.)
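For instance, here’s how that step-by-step approach might look with the first-move projection from Experiment #2 (same fast_win collection as above): run a single-stage pipeline, check the output, then append the next operator.

// step 1: run a single-stage pipeline and eyeball the output
> db.runCommand({aggregate: "fast_win", pipeline: [
    {$project: {firstMove: "$moves.1.white.move"}}
]})

// step 2: once that looks right, tack on the next stage
> db.runCommand({aggregate: "fast_win", pipeline: [
    {$project: {firstMove: "$moves.1.white.move"}},
    {$group: {_id: "$firstMove", numGames: {$sum: 1}}}
]})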

There’s lots more documentation on aggregation pipelines in the docs and I’ll be doing a couple more posts on it.

SQL to MongoDB: An Updated Mapping

[Image: Rick Osborne's original SQL-to-MongoDB mapping chart.]

The aggregation pipeline code has finally been merged into the main development branch and is scheduled for release in 2.2. It lets you combine simple operations (like finding the max or min, projecting out fields, taking counts or averages) into a pipeline of operations, making a lot of things that were only possible by using MapReduce doable with a “normal” query.

In celebration of this, I thought I’d re-do the very popular MySQL to MongoDB mapping using the aggregation pipeline, instead of MapReduce.

Here is the original SQL:

SELECT
  Dim1, Dim2,
  SUM(Measure1) AS MSum,
  COUNT(*) AS RecordCount,
  AVG(Measure2) AS MAvg,
  MIN(Measure1) AS MMin,
  MAX(CASE
    WHEN Measure2 < 100
    THEN Measure2
  END) AS MMax
FROM DenormAggTable
WHERE (Filter1 IN ('A','B'))
  AND (Filter2 = 'C')
  AND (Filter3 > 123)
GROUP BY Dim1, Dim2
HAVING (MMin > 0)
ORDER BY RecordCount DESC
LIMIT 4, 8

We can break up this statement and replace each piece of SQL with the new aggregation pipeline syntax:

MongoDB pipeline:
    aggregate: "DenormAggTable"
MySQL:
    FROM DenormAggTable

MongoDB pipeline:
    {
        $match : {
            Filter1 : {$in : ['A','B']},
            Filter2 : 'C',
            Filter3 : {$gt : 123}
        }
    }
MySQL:
    WHERE (Filter1 IN ('A','B'))
      AND (Filter2 = 'C')
      AND (Filter3 > 123)

MongoDB pipeline:
    {
        $project : {
            Dim1 : 1,
            Dim2 : 1,
            Measure1 : 1,
            Measure2 : 1,
            lessThanAHundred : {
                $cond: [
                    {$lt: ["$Measure2", 100]},
                    "$Measure2", // if
                    0            // else
                ]
            }
        }
    }
MySQL:
    CASE
      WHEN Measure2 < 100
      THEN Measure2
    END

MongoDB pipeline:
    {
        $group : {
            _id : {Dim1 : 1, Dim2 : 1},
            MSum : {$sum : "$Measure1"},
            RecordCount : {$sum : 1},
            MAvg : {$avg : "$Measure2"},
            MMin : {$min : "$Measure1"},
            MMax : {$max : "$lessThanAHundred"}
        }
    }
MySQL:
    SELECT
      Dim1, Dim2,
      SUM(Measure1) AS MSum,
      COUNT(*) AS RecordCount,
      AVG(Measure2) AS MAvg,
      MIN(Measure1) AS MMin,
      MAX(CASE
        WHEN Measure2 < 100
        THEN Measure2
      END) AS MMax
    GROUP BY Dim1, Dim2

MongoDB pipeline:
    {
        $match : {MMin : {$gt : 0}}
    }
MySQL:
    HAVING (MMin > 0)

MongoDB pipeline:
    {
        $sort : {RecordCount : -1}
    }
MySQL:
    ORDER BY RecordCount DESC

MongoDB pipeline:
    {
        $skip : 4
    },
    {
        $limit : 8
    }
MySQL:
    LIMIT 4, 8

Putting all of these together gives you your pipeline:

> db.runCommand({aggregate: "DenormAggTable", pipeline: [
{
    $match : {
        Filter1 : {$in : ['A','B']},
        Filter2 : 'C',
        Filter3 : {$gt : 123}
    }
},
{
    $project : {
        Dim1 : 1,
        Dim2 : 1,
        Measure1 : 1,
        Measure2 : 1,
        lessThanAHundred : {$cond: [
            {$lt: ["$Measure2", 100]},
            "$Measure2", // if
            0            // else
        ]}
    }
},
{
    $group : {
        _id : {Dim1 : 1, Dim2 : 1},
        MSum : {$sum : "$Measure1"},
        RecordCount : {$sum : 1},
        MAvg : {$avg : "$Measure2"},
        MMin : {$min : "$Measure1"},
        MMax : {$max : "$lessThanAHundred"}
    }
},
{
    $match : {MMin : {$gt : 0}}
},
{
    $sort : {RecordCount : -1}
},
{
    $skip : 4
},
{
    $limit : 8
}
]})

As you can see, the SQL maps to the pipeline operations pretty clearly. If you want to play with it, it’ll be available soon in the development nightly builds.

If you’re at MongoSV today (December 9th, 2011), check out Chris Westin’s talk on the new aggregation framework at 3:45 in room B4.