Replica Set Internals Bootcamp Part IV: Syncing

I’ve been doing replica set “bootcamps” for new hires. They’re mainly focused on applying this material to debugging replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

I’m going to do one subject per post; we’ll see how many I can get through. I’m punting on reconfig for now because it’s finicky to write about.

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Syncing

When a secondary is operating normally, it chooses a member to sync from (more on that below) and starts pulling operations from the source’s local.oplog.rs collection. When it gets an op (call it W, for “write”), it does three things:

  1. Applies the op
  2. Writes the op to its own oplog (also local.oplog.rs)
  3. Requests the next op

If the db crashes between 1 & 2 and then comes back up, it’ll think it hasn’t applied W yet, so it’ll re-apply it. Luckily (i.e., due to massive amounts of hard work), oplog ops are idempotent: you can apply W once, twice, or a thousand times and you’ll end up with the same document.

For example, if you have a doc that looks like {counter:1} and you do an update like {$inc:{counter:1}} on the primary, you’ll end up with {counter:2} and the oplog will store {$set:{counter:2}}. The secondaries will replicate that instead of the $inc.
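You can see this for yourself by looking at the newest update entry in the oplog after running the $inc. This is just a sketch: it assumes the collection is test.foo and the document’s _id is 1, and the exact fields vary a little between versions, but the entry will look roughly like this:

> db.foo.update({_id:1}, {$inc:{counter:1}})
> use local
> db.oplog.rs.find({ns:"test.foo", op:"u"}).sort({$natural:-1}).limit(1)
{ "ts" : ..., "h" : ..., "op" : "u", "ns" : "test.foo", "o2" : { "_id" : 1 }, "o" : { "$set" : { "counter" : 2 } } }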

w

To ensure a write is present on, say, two members, you can do:

> db.runCommand({getLastError:1, w:2})

Syntax varies by driver (consult your API docs), but it’s always an option for writes. The way this works is kind of cool.

Suppose you have a member called primary and another member syncing from it, called secondary. How does primary know where secondary is synced to? Well, secondary is querying primary’s oplog for more results. So, if secondary requests an op written at 3pm, primary knows secondary has replicated all ops written before 3pm.

So, it goes like:

  1. Do a write on primary.
  2. Write is written to the oplog on primary, with a field “ts” saying the write occurred at time t.
  3. {getLastError:1,w:2} is called on primary. primary has done the write, so it is just waiting for one more server to get the write (w:2).
  4. secondary queries the oplog on primary and gets the op.
  5. secondary applies the op from time t.
  6. secondary requests ops with {ts:{$gt:t}} from primary’s oplog.
  7. primary records that secondary has applied up to t, because it is requesting ops > t.
  8. getLastError notices that primary and secondary both have the write, so w:2 is satisfied and it returns.
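Concretely, here’s what that flow looks like from the shell. This is only a sketch: the wtimeout value is made up for the example (it keeps getLastError from blocking forever if no second member ever catches up).

> // on primary: do the write, then wait for one more member to have it
> db.foo.insert({_id:1, counter:1})
> // returns once one other member has replicated past the insert's optime,
> // or gives up after 10 seconds (wtimeout)
> db.runCommand({getLastError:1, w:2, wtimeout:10000})

The secondary’s side of step 6 is essentially a tailable query on the oplog, and you can run a rough approximation of it yourself from the shell:

> use local
> var t = db.oplog.rs.find().sort({$natural:-1}).limit(1).next().ts   // this member's last optime
> db.oplog.rs.find({ts:{$gt:t}}).addOption(DBQuery.Option.oplogReplay | DBQuery.Option.tailable | DBQuery.Option.awaitData)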

Starting up

When you start up a node, it takes a look at its local.oplog.rs collection and finds the latest entry in there. This is called the lastOpTimeWritten, and it’s the optime of the latest op this secondary has applied.

You can always use this shell helper to get the current last op:

> rs.debug.getLastOpWritten()

The “ts” field is the last optime written.
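That helper is basically just looking at the newest entry in the oplog, so you can get roughly the same thing by hand:

> use local
> db.oplog.rs.find().sort({$natural:-1}).limit(1)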

If a member starts up and there are no entries in the oplog, it will begin the initial sync process, which is beyond the scope of this post.

Once it has the last op time, it will choose a target to sync from.

Who to sync from

As of 2.0, servers automatically sync from whoever is “nearest” based on average ping time. So, if you bring up a new member it starts sending out heartbeats to all the other members and averaging how long it takes to get a response. Once it has a decent picture of the world, it’ll decide who to sync from using the following algorithm:

for each member that is healthy:
    if member[state] == PRIMARY
        add to set of possible sync targets

    if member[lastOpTimeWritten] > our[lastOpTimeWritten]
        add to set of possible sync targets

sync target = member with the min ping time from the possible sync targets

The definition of “member that is healthy” has changed somewhat over the versions, but generally you can think of it as a “normal” member: a primary or secondary. In 2.0, “healthy” debatably includes slave delayed nodes.
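Here is the pseudocode above rendered as shell JavaScript, just to make the two conditions concrete. members is a hypothetical array of already-healthy members with made-up field names (state, lastOpTimeWritten, pingMs); it is not the output of any real command.

function chooseSyncTarget(members, myLastOpTimeWritten) {
    // a member is a candidate if it's the primary or it's ahead of us
    var candidates = members.filter(function (m) {
        return m.state == "PRIMARY" || m.lastOpTimeWritten > myLastOpTimeWritten;
    });
    if (candidates.length == 0) {
        return null;   // no one to sync from right now
    }
    // of the candidates, pick the one with the lowest average ping time
    candidates.sort(function (a, b) { return a.pingMs - b.pingMs; });
    return candidates[0];
}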

You can see who a server is syncing from by running db.adminCommand({replSetGetStatus:1}) and looking at the “syncingTo” field (only present on secondaries). Yes, yes, it probably should have been syncingFrom. Backwards compatibility sucks.
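For example, on a secondary (the hostname here is made up):

> db.adminCommand({replSetGetStatus:1}).syncingTo
otherHost:27017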

Chaining Slaves

The algorithm for choosing a sync target means that slave chaining is semi-automatic: start up a server in a data center and it’ll (probably) sync from a server in the same data center, minimizing WAN traffic. (Note that you can’t end up with a sync loop, i.e., A syncing from B and B syncing from A, because a secondary can only sync from another secondary with a strictly higher optime.)

One cool thing to implement was making w work with slave chaining. If A is syncing from B and B is syncing from C, how does C know where A is synced to? The way this works is that it builds on the existing oplog-reading protocol.

When A starts syncing from B (or any server starts syncing from another server), it sends a special “handshake” message that basically says, “Hi, I’m A and I’ll be syncing from your oplog. Please track this connection for w purposes.”

When B gets this message, it says, “Hmm, I’m not primary, so let me forward that along to the member I’m syncing from.” So it opens a new connection to C and says, “Pretend I’m A; I’ll be syncing from your oplog on A’s behalf.” Note that B now has two connections open to C: one for itself and one for A.

Whenever A requests more ops from B, B sends the ops from its oplog and then forwards a dummy request to C along “A’s” connection to C. A doesn’t even need to be able to connect directly to C.

A        B        C
  <====>   <====>
           <---->

<====> is a “real” sync connection. The connection between B and C on A’s behalf is called a “ghost” connection (<---->).

On the plus side, this minimizes network traffic. On the negative side, the absolute time it takes a write to get to all members is higher.

Coming soon to a replica set near you…

In 2.2, there will be a new command, replSetSyncFrom, that lets you change who a member is syncing from, bypassing the “choosing a sync target” logic.

> db.adminCommand({replSetSyncFrom:"otherHost:27017"})

8 thoughts on “Replica Set Internals Bootcamp Part IV: Syncing”

    1. Thanks!  C does actually send back a few bytes per op so that B can sanity-check.  So, it does add a little overhead, but it shouldn’t be noticeable (unless you’re _really_ close to saturating your network connection).

    1. Not sure when the next one will be, I think we’re trying to train people up in bigger batches now.  I’m sure you’ll catch the next one 🙂

    1. Yeah, you didn’t miss it!  I mentioned above that I’m punting on reconfig for now because it’s finicky to write about (it kind of blends in with the preamble, though).

  1. Nice explanation there on chaining slaves! So if I am using slave chaining, will initial sync also be replicated like this?
    Say B is syncing from A and I add a new node ‘C’ to my replica set and configure it to sync from B. During initial sync, will all the data be copied first to B (from A) and then to C?
    This may be fine during normal replication, but doing it this way for initial sync, where a *lot* of data needs to be replicated up front, might be a problem, right?

    1. Initial sync works differently than normal sync, so it won’t do a ghost sync, it’ll just copy directly from B. A new member isn’t even tracked in terms of w until it’s done with initial sync.
