
The aggregation pipeline code has finally been merged into the main development branch and is scheduled for release in 2.2. It lets you combine simple operations (like finding the max or min, projecting out fields, taking counts or averages) into a pipeline of operations, making a lot of things that were only possible by using MapReduce doable with a “normal” query.
In celebration of this, I thought I’d re-do the very popular MySQL to MongoDB mapping using the aggregation pipeline, instead of MapReduce.
Here is the original SQL:
SELECT Dim1, Dim2, SUM(Measure1) AS MSum, COUNT(*) AS RecordCount, AVG(Measure2) AS MAvg, MIN(Measure1) AS MMin MAX(CASE WHEN Measure2 123) GROUP BY Dim1, Dim2 HAVING (MMin > 0) ORDER BY RecordCount DESC LIMIT 4, 8
We can break up this statement and replace each piece of SQL with the new aggregation pipeline syntax:
MongoDB Pipeline | MySQL |
---|---|
aggregate: "DenormAggTable" |
FROM DenormAggTable |
{ $match : { Filter1 : {$in : ['A','B']}, Filter2 : 'C', Filter3 : {$gt : 123} } } |
WHERE (Filter1 IN (’A’,’B’)) AND (Filter2 = ‘C’) AND (Filter3 > 123) |
{ $project : { Dim1 : 1, Dim2 : 1, Measure1 : 1, Measure2 : 1, lessThanAHundred : { $cond: [ {$lt: ["$Measure2", 100] }, "$Measure2", // if 0] // else } } } |
CASE WHEN Measure2 < 100 THEN Measure2 END |
{ $group : { _id : {Dim1 : 1, Dim2 : 1}, MSum : {$sum : "$Measure1"}, RecordCount : {$sum : 1}, MAvg : {$avg : "$Measure2"}, MMin : {$min : "$Measure1"}, MMax : {$max : "$lessThanAHundred"} } } |
SELECT Dim1, Dim2, SUM(Measure1) AS MSum, COUNT(*) AS RecordCount, AVG(Measure2) AS MAvg, MIN(Measure1) AS MMin MAX(CASE WHEN Measure2 < 100 THEN Measure2 END) AS MMax GROUP BY Dim1, Dim2 |
{ $match : {MMin : {$gt : 0}} } |
HAVING (MMin > 0) |
{ $sort : {RecordCount : -1} } |
ORDER BY RecordCount DESC |
{ $limit : 8 }, { $skip : 4 } |
LIMIT 4, 8 |
Putting all of these together gives you your pipeline:
> db.runCommand({aggregate: "DenormAggTable", pipeline: [ { $match : { Filter1 : {$in : ['A','B']}, Filter2 : 'C', Filter3 : {$gt : 123} } }, { $project : { Dim1 : 1, Dim2 : 1, Measure1 : 1, Measure2 : 1, lessThanAHundred : {$cond: [{$lt: ["$Measure2", 100]}, { "$Measure2", 0] } } }, { $group : { _id : {Dim1 : 1, Dim2 : 1}, MSum : {$sum : "$Measure1"}, RecordCount : {$sum : 1}, MAvg : {$avg : "$Measure2"}, MMin : {$min : "$Measure1"}, MMax : {$max : "$lessThanAHundred"} } }, { $match : {MMin : {$gt : 0}} }, { $sort : {RecordCount : -1} }, { $limit : 8 }, { $skip : 4 } ]})
As you can see, the SQL matches the pipeline operations pretty clearly. If you want to play with it, it’ll be available soon to a the development nightly build.
If you’re at MongoSV today (December 9th, 2011), check out Chris Westin’s talk on the new aggregation framework at 3:45 in room B4.
This kinda stuff shouldn’t be in blog. It should go straight into tutorial! It’s that good! Thanks.
LikeLike
I think:
{
$limit : 4
},
{
$skip : 8
}
should be changed into:
{ $limit : 8},{ $skip : 4}
LikeLike
Whoops, thank you! Fixed.
LikeLike
I would like to view this chart reviewed with the new aggregation framework introduced in mongo 2.1 (i can help too)
LikeLike
That’d be cool, let me know if you come up with a new version.
LikeLike