The Comments Conundrum

One of the most common questions I see about MongoDB schema design is:

I have a collection of blog posts and each post has an array of comments. How do I get…
…all comments by a given author
…the most recent comments
…the most popular commenters?

And so on. The answer to this has always been “Well, you can’t do that on the server side…” You can either do it on the client side or store comments in their own collection. What you really want is the ability to treat embedded documents like a “real” collection.

The aggregation pipeline gives you this ability by letting you “unwind” arrays into separate documents, then doing whatever else you need to do in subsequent pipeline operators.

For example…

Note: you must be running at least version 2.1.0 of MongoDB to use the aggregation pipeline.

Getting all comments by Serious Cat

Serious Cat’s comments are scattered between post documents, so there wasn’t a good way of querying for just those embedded documents. Now there is.

Let’s assume we want each comment by Serious Cat, along with the title and url of the post Serious Cat was commenting on. So, the steps we need to take are:

  1. Extract the fields we want (title, url, comments)
  2. Unwind the comments field: make each comment into a “real” document
  3. Query our new “comments collection” for “Serious Cat”

Using the aggregation pipeline, this looks like:

> db.runCommand({aggregate: "posts", pipeline: [
{
   // extract the fields 
   $project: {
        title : 1,
        url : 1,
        comments : 1
    }
},
{
    // explode the "comments" array into separate documents
    $unwind: "$comments"
},
{
    // query like a boss
    $match: {comments.author : "Serious Cat"}
}]})

Now, this works well for something like a blog, where you have human-generated (small) data. If you’ve got gigs of comments to go through, you probably want to filter out as many as possible (e.g., with $match or $limit) before sending it to the “everything-in-memory” parts of the pipeline.

Getting the most recent comments

Let’s assume our site lists the 10 most recent comments across all posts, with links back to the posts they appeared on, e.g.,

  1. Great post! -Jerry (February 2nd, 2012) from This is a Great Post
  2. What does batrachophagous mean? -Fred (February 2nd, 2012) from Fun with Crosswords
  3. Where can I get discount Prada shoes? -Tom (February 1st, 2012) from Rant about Spam

To extract these comments from a collection of posts, you could do something like:

> db.runCommand({aggregate: "posts", pipeline: [
{
   // extract the fields
   $project: {
        title : 1,
        url : 1,
        comments : 1
    }
{
    // explode "comments" array into separate documents
    $unwind: "$comments"
},
{
    // sort newest first
    $sort: {
        "comments.date" : -1
    }
},
{
    // get the 10 newest
    $limit: 10
}]})

Let’s take a moment to look at what $unwind does to a sample document.

Suppose you have a document that looks like this after the $project:

{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : [
        {text : "Where can I get discount Prada shoes?", ...},
        {text : "First!", ...},
        {text : "I hate spam, too!", ...},
        {text : "I love spam.", ...}
    ]
}

Then, after unwinding the comments field, you’d have:

{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : [
        {text : "Where can I get discount Prada shoes?", ...},
    ]
}
{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : [
        {text : "First!", ...}
    ]
}
{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : [
        {text : "I hate spam, too!", ...}
    ]
},
{
    "url" : "/blog/spam",
    "title" : "Rant about Spam",
    "comments" : [
        {text : "I love spam.", ...}
    ]
}

Then we $sort, $limit, and Bob’s your uncle.

Rank commenters by popularity

Suppose we allow users to upvote comments and we want to see who the most popular commenters are.

The steps we want to take are:

  1. Project out the fields we need (similar to above)
  2. Unwind the comments array (similar to above)
  3. Group by author, taking a count of votes (this will sum up all of the votes for each comment)
  4. Sort authors to find the most popular commenters

Using the pipeline, this would look like:

> db.runCommand({aggregate: "posts", pipeline: [
{
   // extract the fields we'll need
   $project: {
        title : 1,
        url : 1,
        comments : 1
    }
},
{
    // explode "comments" array into separate documents
    $unwind: "$comments"
},
{
    // count up votes by author
    $group : {
        _id : "$comments.author",
        popularity : {$sum : "$comments.votes"}
    }
},
{
    // sort by the new popular field
    $sort: {
        "popularity" : -1
    }
}]})

As I mentioned before, there are a couple downsides to using the aggregation pipeline: a lot of the pipeline is done in-memory and can be very CPU- and memory-intensive. However, used judiciously, it give you a lot more freedom to mush around your embedded documents.

5 thoughts on “The Comments Conundrum

  1. Curious what the performance implications of the above are vs. simply having comments in their own collection.   As always, I think the answer is it depends on what one is trying to do but I am curious as to whether your team have come up with any usage suggestions.  

    Using the examples you provide, here are some alternative solutions:Example 1:alternative is to get all the posts by user, then go out and get all the related posts and assemble.  If there were a significant number of posts, the alternative might be slower due to sending lots of data across the wire.Example 2:alternative same as 1 but would probably be much faster since 

    Example 3:
    alternative would simply do the aggregation against the comments collection and the posts collection would never come into play.  I am guessing this will be a lot faster.

    Are my assumptions correct?  

    Thanks,

    George G.

    Like

    1. It’s always a tradeoff.  I haven’t seen any numbers yet on the aggregation framework, but I wouldn’t be surprised if you were correct, particularly about examples 2 and 3.  However, if 99% of your requests are for a post with its comments, it seems reasonable to optimize for that case vs. a “top” list you might compile once an hour.

      Like

  2. Wow 🙂 As usual a great post. I thought this was something which was a lot easier. Since I am building my system on a RDBMS system (well, due to my hosting env. restrictions) and want to shift onto Mongo later on, I am trying to learn about mongo as much as possible.  I had not thought that it was easy to embed comments and search for comments for something like a blog (my project has one). Thanks for telling how it is. Gonna help in taking some decisions during migration. Keep posting good stuff. I love your posts. Much easier than the online Docs. Also, tried to purchase your book “The Definitive Guide”. No local stores have it and I do not have a credit card to pay for an online purchase and my address is not that ‘reachable’ either. Ended up downloading the books from torrents. Sorry. On a fine day when I will have a way, I will purchase a copy (who doesn’t like to not-feel-guilty). 

    Like

    1. Thank you 🙂

      Yo ho ho, I’m glad you were able to get a copy.  If you’d like a copy, please ask bookstores to carry it if you can’t find it on their shelves (that helps them know to order more MongoDB books).

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: