Replication Internals

[Image: a displacer beast, which seemed related (it's sort of in two places at the same time).]

This is the first in a three-part series on how replication works.

Replication gives you hot backups, read scaling, and all sorts of other goodness. If you know how it works you can get a lot more out of it, from how it should be configured to what you should monitor to using it directly in your applications. So, how does it work?

MongoDB’s replication is actually very simple: the master keeps a collection that describes writes and the slaves query that collection. This collection is called the oplog (short for “operation log”).

The oplog

Each write (insert, update, or delete) creates a document in the oplog collection, so long as replication is enabled (MongoDB won’t bother keeping an oplog if replication isn’t on). So, to see the oplog in action, start by running the database with the --replSet option:

$ ./mongod --replSet funWithOplogs

Now, when you do operations, you’ll be able to see them in the oplog. Let’s start out by initializing our replica set:

> rs.initiate()

Now if we query the oplog, we’ll see this operation:

> use local
switched to db local
> db.oplog.rs.find()
{ 
    "ts" : { "t" : 1286821527000, "i" : 1 }, 
    "h" : NumberLong(0), 
    "op" : "n", 
    "ns" : "", 
    "o" : { "msg" : "initiating set" } 
}

This is just an informational message for the slave; it isn’t a “real” operation. Breaking this down, it contains the following fields:

  • ts: the time this operation occurred.
  • h: a unique ID for this operation. Each operation will have a different value in this field.
  • op: the write operation that should be applied to the slave. n indicates a no-op; this is just an informational message.
  • ns: the database and collection affected by this operation. Since this is a no-op, this field is left blank.
  • o: the actual document representing the op. Since this is a no-op, this field is pretty useless.

To see some real oplog messages, we’ll need to do some writes. Let’s do a few simple ones in the shell: an insert, an update, an upsert (note the third argument to update()), and a remove:

> use test
switched to db test
> db.foo.insert({x:1})
> db.foo.update({x:1}, {$set : {y:1}})
> db.foo.update({x:2}, {$set : {y:1}}, true)
> db.foo.remove({x:1})

Now look at the oplog:

> use local
switched to db local
> db.oplog.rs.find()
{ "ts" : { "t" : 1286821527000, "i" : 1 }, "h" : NumberLong(0), "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
{ "ts" : { "t" : 1286821977000, "i" : 1 }, "h" : NumberLong("1722870850266333201"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d"), "x" : 1 } }
{ "ts" : { "t" : 1286821984000, "i" : 1 }, "h" : NumberLong("1633487572904743924"), "op" : "u", "ns" : "test.foo", "o2" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") }, "o" : { "$set" : { "y" : 1 } } }
{ "ts" : { "t" : 1286821993000, "i" : 1 }, "h" : NumberLong("5491114356580488109"), "op" : "i", "ns" : "test.foo", "o" : { "_id" : ObjectId("4cb3586928ce78a2245fbd57"), "x" : 2, "y" : 1 } }
{ "ts" : { "t" : 1286821996000, "i" : 1 }, "h" : NumberLong("243223472855067144"), "op" : "d", "ns" : "test.foo", "b" : true, "o" : { "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") } }

You can see that each operation now has an ns: “test.foo”. The op field shows the three types of writes mentioned earlier: i for inserts, u for updates, and d for deletes. Notice that the upsert was logged as an insert (i) of the full resulting document, not as an update, since no document matched {x:2}.

The o field now contains the document to insert or the criteria to update and remove. Notice that, for the update, there are two o fields (o and o2). o2 gives the update criteria and o gives the modifications (equivalent to update()’s second argument).
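As a rough sketch (not necessarily the slave’s actual code path), applying that update entry comes down to re-running its o2 and o fields through update():

> use test
switched to db test
> db.foo.update({ "_id" : ObjectId("4cb35859007cc1f4f9f7f85d") }, { "$set" : { "y" : 1 } })

In other words: find the document matching o2, then apply the modifications in o.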

Using this information

MongoDB doesn’t yet have triggers, but applications could hook into this collection if they’re interested in doing something every time a document is deleted (or updated, or inserted, etc.); a minimal sketch from the shell follows. Part three of this series will elaborate on this idea.
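For example, here’s a minimal sketch of that idea: a tailable cursor on the oplog that prints every delete on test.foo as it happens (awaitData just makes the server wait briefly for new results instead of returning right away; a real application would re-create the cursor whenever it dies):

> use local
switched to db local
> var cursor = db.oplog.rs.find({ "op" : "d", "ns" : "test.foo" }).addOption(DBQuery.Option.tailable).addOption(DBQuery.Option.awaitData)
> while (cursor.hasNext()) { printjson(cursor.next()); }  // your "document deleted" hook goes here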

Next up: what the oplog is and how syncing works.

24 thoughts on “Replication Internals”

  1. In the description above of o and o2, it seems that o is giving the update modifications, and o2 looks like the first argument to update().

    Or am I crazy?

  2. No, at the moment it’ll actually log an update op.  This is sort of just an implementation detail, though, not by design.  It may actually change to not being logged in 2.4, which is having a lot of the update code rewritten.

      1. Thanks for the quick reply, Kristina! I was hoping to use this data for tracking changes to documents over a period and ignoring updates that were effectively no-ops (nothing modified). Perhaps I can work out a way to determine this using the current output.

      2. OK. I’m concerned that there’s some chance the document may have changed between when the log entry was created and when the process recording the event reads it. If so, the underlying document may no longer be current.

  3. I am using MongoLab for MongoDB hosting, so while I can access the oplog when MongoDB is on localhost, my db is hosted on the web.

      1. I can see myself getting into this same situation.  Is there a workaround or alternate method for creating triggers if you do not have access to the oplog?

  4. I set up a test to listen to the oplog for changes to a certain collection.  It’s working fine, but I notice there is a delay of about 1 second maybe… 80% of the time.  The other times it executes quickly, completing in under 100ms.  Is there some delay in oplog generation?  Is it polling to generate the entries?  Is there some configuration I could use to decrease the lag, and would that have other penalties, such as overtaxing the CPU?

    I am using the node.js driver.  This is my implementation, though I don’t imagine it will work outside of my application environment:


    # eventChannel module: a shared EventEmitter the watcher publishes to
    {EventEmitter} = require 'events'
    eventChannel = new EventEmitter()
    eventChannel.on 'error', (args...) -> console.log "Event Channel received error:", args...
    module.exports = eventChannel


    # oplog watcher module
    {Server, Db, Timestamp} = require 'mongodb'
    client = new Db 'local', new Server('localhost', 27017, {native_parser: true}), {w: 0}
    eventChannel = config.require 'load/eventChannel'  # app-specific module loader

    getTimestamp = (date) ->
      date ||= new Date()
      time = Math.floor(date.getTime() / 1000)
      new Timestamp 0, time

    getDate = (timestamp) ->
      new Date timestamp.high_ * 1000

    # map oplog op codes to friendlier names (deletes are logged as "d")
    mapOp =
      n: 'noop'
      i: 'insert'
      u: 'update'
      d: 'remove'

    options = {} # raw

    module.exports =
      connect: (opts) ->
        options.merge opts  # assumes an app-provided merge helper
      watch: (collection) ->
        # watch user model
        client.open (err) ->
          console.log 'Error connecting:', err if err
          client.collection 'oplog.rs', (err, oplog) ->
            cursorOptions =
              tailable: true
              tailableRetryInterval: 1000
              numberOfRetries: 1000
            currentTime = getTimestamp()
            cursor = oplog.find {ts: {$gte: currentTime}}, cursorOptions
            stream = cursor.stream()
            stream.on 'data', (data) ->
              if collection
                return unless data.ns is collection
              if options.raw
                event = data
              else
                event =
                  timestamp: getDate data.ts
                  operation: mapOp[data.op] or data.op
                  namespace: data.ns
                  id: data.h.toString()
                  criteria: data.o2
                  data: data.o
              eventChannel.emit 'change', event

    I’ll release this as a simple watcher library once I get it cleaned up.

    1. First, it might be the rate you’re writing to the oplog: oplog queries will hang around for a while waiting for results.  Second, you might want to use the “oplog replay” option (no idea what it’s called in Coffeescript) which makes querying the oplog more efficient.

      1. Kristina,

        Thanks for your reply!

        I changed the tailableRetryInterval in my code to 100ms, and that reduced the delay.  So it looks like the tailable cursor is actually polling for results.  Hmm… not ideal.  I’ll take this up on the node-mongodb-native google group though, as it seems it’s an implementation detail of the driver.

      2. Ok, so I actually found an answer.  If you initialize the cursor with ‘awaitdata: true’ then it will rely on Mongo server functionality to push out the new data.  Here’s an example from the tests in node-mongodb-native:

        https://github.com/mongodb/node-mongodb-native/blob/master/test/cursor_test.js#L2038

        Using this my test runs in ~100ms (20ms lag over just waiting for the update callback), or if the server’s been idle for a little while it’s more like 500ms.  That should be acceptable.
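        For reference, the change amounts to adding that option when building the cursor, roughly like this in my code above (option spelling per this driver version, so treat it as an assumption):

        cursorOptions =
          tailable: true
          awaitdata: true             # ask the server to block briefly for new data
          tailableRetryInterval: 1000
          numberOfRetries: 1000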

      3. You’re referring to the OplogReplay bit mentioned here, I assume?

        http://docs.mongodb.org/meta-driver/latest/legacy/mongodb-wire-protocol/

        I think I’ll have to ask Christian more specifically if his driver handles that… I grepped the code and cannot find it, but maybe he is passing all the options through directly.  What specifically does this flag do?  I can’t seem to find much documentation on it, though I see it mentioned by Scott here:

        https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/b1qiuAIG75A

        Thanks again for your help.  This is an adventure into the lair of the beast, to be certain.  🙂

      4. Yes, that’s the flag I’m talking about.  The oplog doesn’t have any indexes, so when you query mongod has to scan every document.  The oplog replay flag makes the query start at the latest document and jump back by 200MB at a time to try to find a timestamp earlier than the one you’re querying for.  Then, once it’s found the right oplog segment to search, it’ll move forward one document at a time.  It makes querying the oplog for a particular timestamp much faster.
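        For example, from the shell, where the flag is exposed as DBQuery.Option.oplogReplay (it only applies to queries that filter on ts):

        > use local
        switched to db local
        > var last = db.oplog.rs.find().sort({ "$natural" : -1 }).limit(1).next()
        > db.oplog.rs.find({ "ts" : { "$gte" : last.ts } }).addOption(DBQuery.Option.oplogReplay)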
