
This is the first in a three-part series on how replication works.
Replication gives you hot backups, read scaling, and all sorts of other goodness. If you know how it works you can get a lot more out of it, from how it should be configured to what you should monitor to using it directly in your applications. So, how does it work?
MongoDB’s replication is actually very simple: the master keeps a collection that describes writes and the slaves query that collection. This collection is called the oplog (short for “operation log”).
The oplog
Each write (insert, update, or delete) creates a document in the oplog collection, so long as replication is enabled (MongoDB won't bother keeping an oplog if replication isn't on). So, to see the oplog in action, start by running the database with the --replSet option:
$ ./mongod --replSet funWithOplogs
Now, when you do operations, you'll be able to see them in the oplog. Let's start out by initializing our replica set:
> rs.initiate()
Now if you query the oplog, you'll see this operation:
> use local
switched to db local
> db.oplog.rs.find()
{ "ts" : { "t" : 1286821527000, "i" : 1 }, "h" : NumberLong(0), "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
This is just an informational message for the slave; it isn't a "real" operation. Breaking this down, it contains the following fields (each of which you can pull straight off the document in the shell, as shown below):
- ts: the time this operation occurred.
- h: a unique ID for this operation. Each operation will have a different value in this field.
- op: the write operation that should be applied to the slave. n indicates a no-op; this is just an informational message.
- ns: the database and collection affected by this operation. Since this is a no-op, this field is left blank.
- o: the actual document representing the op. Since this is a no-op, this field is pretty useless.
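These entries are ordinary documents, so you can pull the fields just described straight off of one in the shell. A quick sketch, assuming the freshly initiated set from above (your ts and h values will differ):

> var entry = db.oplog.rs.findOne()
> entry.op
n
> entry.o.msg
initiating set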
To see some real oplog messages, we’ll need to do some writes. Let’s do a few simple ones in the shell:
> use test
switched to db test
> db.foo.insert({x:1})
> db.foo.update({x:1}, {$set : {y:1}})
> db.foo.update({x:2}, {$set : {y:1}}, true)
> db.foo.remove({x:1})
Now look at the oplog:
> use local
switched to db local
> db.oplog.rs.find()
{ "ts" : { "t" : 1286821527000, "i" : 1 }, "h" : NumberLong(0), "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : { "t" : 1286821977000, "i" : 1 }, "h" : NumberLong("1722870850266333201"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d"), "x" : 1 } }
{ "ts" : { "t" : 1286821984000, "i" : 1 }, "h" : NumberLong("1633487572904743924"), "op" : "u", "ns" : "test.foo", "o2" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") }, "o" : { "$set" : { "y" : 1 } } }
{ "ts" : { "t" : 1286821993000, "i" : 1 }, "h" : NumberLong("5491114356580488109"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb3586928ce78a2245fbd57"), "x" : 2, "y" : 1 } }
{ "ts" : { "t" : 1286821996000, "i" : 1 }, "h" : NumberLong("243223472855067144"), "op" : "d", "ns" : "test.foo", "b" : true, "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") } }
You can see that each operation has an ns now: "test.foo". There are also three types of write represented (the op field), corresponding to the three types of writes mentioned earlier: i for insert, u for update, and d for delete. (The upsert on {x:2} created a new document, so it was logged as an insert of the whole resulting document, not as an update.)
The o field now contains the document to insert or the criteria to update and remove. Notice that, for the update, there are two o fields (o and o2). o2 gives the update criteria and o gives the modifications (equivalent to update()'s second argument).
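Since the oplog is queryable like any other collection, you can also slice it by these fields. A couple of sketches against the example namespace above (there's nothing special about these, they're just regular finds):

> db.oplog.rs.find({op : "u", ns : "test.foo"})                // only the updates to test.foo
> db.oplog.rs.find({op : {$ne : "n"}}).sort({$natural : -1})   // all "real" ops, newest first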
Using this information
MongoDB doesn't yet have triggers, but applications could hook into this collection if they're interested in doing something every time a document is deleted (or updated, or inserted, etc.). Part three of this series will elaborate on this idea.
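As a teaser, here is a minimal sketch in the shell of what such a hook could look like, using a tailable cursor to watch for deletes on the test.foo collection from the examples above. A real watcher would re-create the cursor whenever it dies and resume from the last ts it saw, and drivers expose the same tailable/awaitData options under their own names:

> use local
switched to db local
> var cursor = db.oplog.rs.find({ns : "test.foo", op : "d"}).addOption(DBQuery.Option.tailable).addOption(DBQuery.Option.awaitData)
> while (cursor.hasNext()) {
...     // for a delete, o holds the _id of the removed document
...     print("deleted: " + tojson(cursor.next().o));
... }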
Next up: what the oplog is and how syncing works.
In the description above on o and o2 it seems that o is giving the update modifications, and o2 looks like the first argument to update().
Or am I crazy?
You are correct! Thanks, I’ve fixed the description.
Just curious if no-ops occur when an update is requested to set a value to the same value as it currently has.
No, at the moment it'll actually log an update op. This is sort of just an implementation detail, though, not by design. It actually may change to not being logged in 2.4, which is having a lot of the update code rewritten.
Thanks for the quick reply, Kristina! I was hoping to use this data for tracking changes to documents over a period and ignoring updates that were effectively no-ops (nothing modified). Perhaps I can work out a way to determine this using the current output.
No problem! I think you’d have to compare the op to the underlying document, at the moment.
OK. I'm concerned that there's some chance the document may have changed between when the log entry was created and when the process recording the event reads it. If so, the underlying document may no longer be current.
I am using MongoLab for MongoDB hosting, so while I can access the oplog when MongoDB is running on localhost, my DB is hosted on the web.
You cannot access the oplog on a hosted solution, as all the ops from everyone else’s DBs would be visible, too.
I can see myself getting into this same situation. Is there a work around or alternate method for creating triggers if you do not have access to the oplog?
I set up a test to listen to the oplog for changes to a certain collection. It's working fine, but I notice there is a delay of about 1 second, maybe 80% of the time. The other times it executes quickly, completing in under 100ms. Is there some delay on the oplog generation? Is it polling to generate the entries? Is there some configuration I could use to decrease the lag, and would that have other penalties, such as overtaxing the CPU?
I am using the node.js driver. This is my implementation, though I don’t imagine it will work outside of my application environment:
(embedded gist: eventChannel.coffee)
(embedded gist: watchMongo.coffee)
I’ll release this as a simple watcher library once I get it cleaned up.
First, it might be the rate you’re writing to the oplog: oplog queries will hang around for a while waiting for results. Second, you might want to use the “oplog replay” option (no idea what it’s called in Coffeescript) which makes querying the oplog more efficient.
Kristina,
Thanks for your reply!
I changed the tailableRetryInterval in my code to 100ms, and that reduced the delay. So it looks like the tailable cursor is actually polling for results. Hmm… not ideal. I’ll take this up on the node-mongodb-native google group though, as it seems it’s an implementation detail of the driver.
Ok, so I actually found an answer. If you initialize the cursor with ‘awaitdata: true’ then it will rely on Mongo server functionality to push out the new data. Here’s an example from the tests in node-mongodb-native:
https://github.com/mongodb/node-mongodb-native/blob/master/test/cursor_test.js#L2038
Using this my test runs in ~100ms (20ms lag over just waiting for the update callback), or if the server’s been idle for a little while it’s more like 500ms. That should be acceptable.
Cool, glad it worked out! I do recommend still looking into the oplog replay flag.
You’re referring to the OplogReplay bit mentioned here, I assume?
http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/
I think I’ll have to ask Christian more specifically if his driver handles that… I grepped the code and cannot find it, but maybe he is passing all the options through directly. What specifically does this flag do? I can’t seem to find much documentation on it, though I see it mentioned by Scott here:
https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/b1qiuAIG75A
Thanks again for your help. This is an adventure into the lair of the beast, to be certain. 🙂
Yes, that’s the flag I’m talking about. The oplog doesn’t have any indexes, so when you query mongod has to scan every document. The oplog replay flag makes the query start at the latest document and jump back by 200MB at a time to try to find a timestamp earlier than the one you’re querying for. Then, once it’s found the right oplog segment to search, it’ll move forward one document at a time. It makes querying the oplog for a particular timestamp much faster.
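In the shell it looks something like the sketch below (just an illustration; the query needs a ts clause for the flag to have a timestamp to seek to, and drivers expose the flag under their own option names):

> use local
switched to db local
> var last = db.oplog.rs.find().sort({$natural : -1}).limit(1).next().ts
> db.oplog.rs.find({ts : {$gte : last}}).addOption(DBQuery.Option.oplogReplay).addOption(DBQuery.Option.tailable)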
Alrighty, library is published. Here it is for any interested:
https://github.com/TorchlightSoftware/mongo-watch
Cool, thanks for sharing!