Hacking Chess with the MongoDB Pipeline

MongoDB’s new aggegation framework is now available in the nightly build! This post demonstrates some of its capabilities by using it to analyze chess games.

Make sure you have a the “Development Release (Unstable)” nightly running before trying out the stuff in this post. The aggregation framework will be in 2.1.0, but as of this writing it’s only in the nightly build.

First, we need some chess games to analyze. Download games.json, which contains 1132 games that were won in 10 moves or less (crush their soul and do it quick).

You can use mongoimport to import games.json into MongoDB:

$ mongoimport --db chess --collection fast_win games.json
connected to: 127.0.0.1
imported 1132 objects

We can take a look at our chess games in the Mongo shell:

> use chess
switched to db chess
> db.fast_win.count()
1132
> db.fast_win.findOne()
{
	"_id" : ObjectId("4ed3965bf86479436d6f1cd7"),
	"event" : "?",
	"site" : "?",
	"date" : "????.??.??",
	"round" : "?",
	"white" : "Gedult D",
	"black" : "Kohn V",
	"result" : "1-0",
	"eco" : "B33/09",
	"moves" : {
		"1" : {
			"white" : {
				"move" : "e4"
			},
			"black" : {
				"move" : "c5"
			}
		},
		"2" : {
			"white" : {
				"move" : "Nf3"
			},
			"black" : {
				"move" : "Nc6"
			}
		},
                ...
		"10" : {
			"white" : {
				"move" : "Qa4"
			}
		}
	}
}

Not exactly the greatest schema, but that’s how the chess format exporter munged it. Regardless, now we can use aggregation pipelines to analyze these games.

Experiment #1: First Mover Advantage

White has a slight advantage in chess because you move first (Wikipedia says it’s a 52%-56% chance of winning). I’d hypothesize that, in a short game, going first matters even more.

Let’s find out.

The “result” field in these docs is “1-0” if white wins and “0-1” if black wins. So, we want to divide our docs into two groups based on the “result” field and count how many docs are in each group. Using the aggregation pipeline, this looks like:

> db.runCommand({aggregate : "fast_win", pipeline : [
... {
...    $group : {
...        _id : "$result",      // group by 'result' field
...        numGames : {$sum : 1} // add 1 for every document in the group
...    }
... }]})
{
	"result" : [
		{
			"_id" : "0-1",
			"numGames" : 435
		},
		{
			"_id" : "1-0",
			"numGames" : 697
		}
	],
	"ok" : 1
}

That gives a 62% chance white will win (697 wins/1132 total games). Pretty good (although, of course, this isn’t a very large sample set).

Chessboard — In case you're not familiar with it, a reference chessboard with 1-8, a-h marked.

Experiment #2: Best Starting Move

Given a starting move, what percent of the time will that move lead to victory? This probably depends on whether you’re playing white or black, so we’ll just focus on white’s opening move.

First, we’ll just determine what starting moves white uses with this series of steps:

project all of white’s first moves (the moves.1.white.move field)
group all docs with the same starting move together
and count how many documents (games) used that move.

These steps look like:

> db.runCommand({aggregate: "fast_win", pipeline: [
... // '$project' is used to extract all of white's opening moves
... {
...     $project : {
...         // extract moves.1.white.move into a new field, firstMove
...         firstMove : "$moves.1.white.move"
...     }
... },
... // use '$group' to calculate the number of times each move occurred
... {
...     $group : { 
...         _id : "$firstMove",
...         numGames : {$sum : 1}
...     }
... }]})
{
	"result" : [
		{
			"_id" : "d3",
			"numGames" : 2
		},
		{
			"_id" : "e4",
			"numGames" : 696
		},
		{
			"_id" : "b4",
			"numGames" : 17
		},
		{
			"_id" : "g3",
			"numGames" : 3
		},
		{
			"_id" : "e3",
			"numGames" : 2
		},
		{
			"_id" : "c4",
			"numGames" : 36
		},
		{
			"_id" : "b3",
			"numGames" : 4
		},
		{
			"_id" : "g4",
			"numGames" : 11
		},
		{
			"_id" : "h4",
			"numGames" : 1
		},
		{
			"_id" : "Nf3",
			"numGames" : 37
		},
		{
			"_id" : "f3",
			"numGames" : 1
		},
		{
			"_id" : "f4",
			"numGames" : 25
		},
		{
			"_id" : "Nc3",
			"numGames" : 14
		},
		{
			"_id" : "d4",
			"numGames" : 283
		}
	],
	"ok" : 1
}

Now let’s compare those numbers with whether white won or lost.

> db.runCommand({aggregate: "fast_win", pipeline: [
... // extract the first move
... {
...    $project : {
...        firstMove : "$moves.1.white.move",
...        // create a new field, "win", which is 1 if white won and 0 if black won
...        win : {$cond : [
...            {$eq : ["$result", "1-0"]}, 1, 0
...        ]}
...    }
... },
... // group by the move and count up how many winning games used it
... {
...     $group : {
...         _id : "$firstMove",
...         numGames : {$sum : 1},
...         numWins : {$sum : "$win"}
...     }
... },
... // calculate the percent of games won with this starting move
... {
...     $project : {
...         _id : 1,
...         numGames : 1,
...         percentWins : {
...             $multiply : [100, {
...                 $divide : ["$numWins","$numGames"]
...             }]
...         }
...     }
... },
... // discard moves that were used in less than 10 games (probably not representative) 
... {
...     $match : {
...         numGames : {$gte : 10}
...     }
... },
... // order from worst to best
... {
...     $sort : {
...         percentWins : 1
...     }
... }]})
{
	"result" : [
		{
			"_id" : "f4",
			"numGames" : 25,
			"percentWins" : 24
		},
		{
			"_id" : "b4",
			"numGames" : 17,
			"percentWins" : 35.294117647058826
		},
		{
			"_id" : "c4",
			"numGames" : 36,
			"percentWins" : 50
		},
		{
			"_id" : "d4",
			"numGames" : 283,
			"percentWins" : 50.53003533568905
		},
		{
			"_id" : "g4",
			"numGames" : 11,
			"percentWins" : 63.63636363636363
		},
		{
			"_id" : "Nf3",
			"numGames" : 37,
			"percentWins" : 67.56756756756756
		},
		{
			"_id" : "e4",
			"numGames" : 696,
			"percentWins" : 68.24712643678161
		},
		{
			"_id" : "Nc3",
			"numGames" : 14,
			"percentWins" : 78.57142857142857
		}
	],
	"ok" : 1
}

Pawn to e4 seems like the most dependable winner here. Knight to c3 also seems like a good choice (at a nearly 80% win rate), but it was only used in 14 winning games.

Experiment #3: Best and Worst Moves for Black

We basically want to do a similar pipeline to Experiment 2, but for black. At the end, we want to find the best and worst percent.

> db.runCommand({aggregate: "fast_win", pipeline: [
... // extract the first move
... {
...    $project : {
...        firstMove : "$moves.1.black.move",
...        win : {$cond : [
...            {$eq : ["$result", "0-1"]}, 1, 0
...        ]}
...    }
... },
... // group by the move and count up how many winning games used it
... {
...     $group : {
...         _id : "$firstMove",
...         numGames : {$sum : 1},
...         numWins : {$sum : "$win"}
...     }
... },
... // calculate the percent of games won with this starting move
... {
...     $project : {
...         _id : 1,
...         numGames : 1,
...         percentWins : {
...             $multiply : [100, {
...                 $divide : ["$numWins","$numGames"]
...             }]
...         }
...     }
... },
... // discard moves that were used in less than 10 games (probably not representative) 
... {
...     $match : {
...         numGames : {$gte : 10}
...     }
... },
... // sort by % win rate
... {
...     $sort : {
...         percentWins : -1
...     }
... },
... // get the best and worst
... {
...     $group : {
...          _id : 1,
...          best : {$first : "$_id"},
...          worst : {$last : "$_id"}
...     }
... }]})
{
	"result" : [
		{
			"_id" : 1,
			"best" : "Nf6",
			"worst" : "d6"
		}
	],
	"ok" : 1
}

I like this new aggregation functionality because it’s feels simpler than MapReduce. You can start with a one-operation pipeline and build it up, step-by-step, seeing exactly what a given operation does to your output. (And no Javascript required, which is always a plus.)

There’s lots more documentation on aggregation pipelines in the docs and I’ll be doing a couple more posts on it.