Replica Set Internals Part V: Initial Sync

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

The Initial Sync Processes

When you add a brand new member to the set, it needs to copy over all of the data before it’s ready to be a “real” member. This process is called initial syncing and there are, essentially, 7 steps:

  1. Check the oplog. If it is not empty, this node does not initial sync, it just starts syncing normally. If the oplog is empty, then initial sync is necessary, continue to step #2:
  2. Get the latest oplog time from the source member: call this time start.
  3. Clone all of the data from the source member to the destination member.
  4. Build indexes on destination.
  5. Get the latest oplog time from the sync target, which is called minValid.
  6. Apply the sync target’s oplog from start to minValid.
  7. Become a “normal” member (transition into secondary state).

Note that this process only checks the oplog. You could have a petabyte of data on the server, but if there’s no oplog, the new member will delete all of it and initial sync. Members depend on the oplog to know where to sync from.

So, suppose we’re starting up a new member with no data on it. MongoDB checks the oplog, sees that it doesn’t even exist, and begins the initial sync process.
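
You can do the same check yourself from the shell; if this query comes back empty, the member has no oplog and will initial sync:

> use local
> db.oplog.rs.find().sort({$natural: -1}).limit(1)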

Copying Data

The code for this couldn’t be much simpler, in pseudo code it is basically:

for each db on sourceServer:
    for each collection in db:
        for each doc in db.collection.find():
            destinationServer.getDB(db).getCollection(collection).insert(doc)
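
If you want to see roughly the same thing as runnable mongo shell JS, it looks something like this (the hostnames are placeholders, and the real cloner batches documents, skips the local database, and handles errors):

> var source = new Mongo("sourceServer:27017"), dest = new Mongo("destinationServer:27017")
> source.getDBs().databases.forEach(function(d) {
...     if (d.name == "local") return;
...     var srcDB = source.getDB(d.name);
...     srcDB.getCollectionNames().forEach(function(collName) {
...         srcDB.getCollection(collName).find().forEach(function(doc) {
...             dest.getDB(d.name).getCollection(collName).insert(doc);
...         });
...     });
... })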

One of the issues with syncing is that it has to touch all of the source’s data, so if you’ve been carefully cultivating a working set on sourceServer, it’ll pretty much be destroyed.

There are benefits to initial syncing, though: it effectively compacts your data on the new secondary. Since all it's doing is inserts, the new member ends up using pretty much the minimum amount of space. Some users actually use rotating resyncs to keep their data compact.

On the downside, initial sync doesn’t consider padding factor, so if that’s important to your application, the new server will have to build up the right padding factor over time.

Syncing on a Live System

The tricky part of initial syncing is that we're trying to copy an (often) massive amount of data off of a live system. New data will be written while this copy is taking place, so it's a bit like trying to copy a tree over six months.

By the time you’re done, the data you copied first might have changed significantly on the source. The copy on the destination might not be… exactly what you’d expect.

That is what the oplog replay step is for: to get your data to a consistent state. Oplog ops are idempotent: they can be applied multiple times and yield the same result. Thus, as long as we apply all of the writes at least once (remember, they may or may not have been applied on the source before the copy), we'll end up with a consistent picture when we're done.


minValid (as mentioned in the list above) is the first timestamp where our new DB is in a consistent state: it may be behind the other members, but its data matches exactly how the other servers looked at some point in time.
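
(An implementation aside: minValid has to survive a restart in the middle of a sync, so it's persisted in the local database. In the 2.x code it lives in a collection called replset.minvalid, but that's an internal detail you shouldn't rely on.)

> use local
> db.replset.minvalid.findOne()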

Some examples of idempotency, as most people haven’t seen it since college:

// three idempotent functions:
function idemp1(doc) {
   doc.x = doc.x + 0;
}

function idemp2(doc) {
   doc.x = doc.x * 1;
}

function idemp3(doc) {
   // this is what replication does: it turns stuff like "$inc 4 by 1" into "$set to 5"
   doc.x = 5;
}

// two non-idempotent functions
function nonIdemp1(doc) {
   doc.x = doc.x + 1;
}

function nonIdemp2(doc) {
   doc.x = Math.random();
}

No matter how many times you call the idempotent functions, the value of doc.x will be the same (as long as you call them at least once).

Building Indexes

In 2.0, indexes were created on the secondary as part of the cloning step, but in 2.2, we moved index creation to after the oplog application. This is because of an interesting edge case. Let’s say we have a collection representing the tree above and we have a unique index on leaf height: no two leaves are at exactly the same height. So, pretend we have a document that looks like this:

{
    "_id" : 123,
    ...,
    "height" : 76.3
}

The cloner copies this doc from the source server to the destination server and moves on. On the source, we remove this leaf from the tree because of, I dunno, high winds.

> db.tree.remove({_id:123})

However, the cloner has already copied the leaf, so it doesn't notice this change. Now another leaf might grow at this height. Let's say leaf #10,012 grows to this height on the source.

> db.tree.update({_id:10012}, {$set : {height : 76.3}}) 

When the cloner gets to document #10,012, it'll copy it to the destination server. Now there are two documents with the same height field in the destination collection, so when it tries to create a unique index on “height”, the index creation will fail!

So, we moved the index creation to after the oplog application. That way, we’re always building the index on a consistent data set, so it should always succeed.
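
You can reproduce the failure by hand: build a unique index on a collection that already contains duplicate heights and you'll get something like the following (the exact error text varies a bit by version):

> db.tree.ensureIndex({height: 1}, {unique: true})
> db.getLastError()
E11000 duplicate key error index: test.tree.$height_1  dup key: { : 76.3 }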

There are a couple of other edge cases like that which have been super-fun to track down, which you can look up in Jira if you’re interested.

Restoring from Backup

Often, initial sync is too slow for people. If you want to get a secondary up and running as fast as possible, the best way to do so is to skip initial sync altogether and restore from backup. To restore from backup:

  1. Find a secondary you like the looks of.
  2. Either shut it down or fsync+lock it. Basically, get its data files into a clean state, so nothing is writing to them while you’re copying them.
  3. Copy the data files to your destination server. Make sure you get all of your data if you’re using any symlinks or anything.
  4. Start back up or unlock the source server.
  5. Point the destination server at the data you copied and start it up.

As there is already an oplog, it will not need to initial sync. It will begin syncing from another member’s oplog immediately when it starts up and usually catch up quite quickly.
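
For step 2, the fsync+lock part looks like this on the source secondary (recent shells also have db.fsyncLock()/db.fsyncUnlock() helpers that wrap the same commands):

> db.runCommand({fsync: 1, lock: 1})
> // copy the data files somewhere safe, then unlock:
> db.$cmd.sys.unlock.findOne()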

Note that mongodump/mongorestore actually does not work very well as a “restoring from backup” strategy because it doesn’t give you an oplog. You can create one on your own and prime it, but it’s more work and more fiddly than just copying files. There is a feature request for mongorestore to be able to prime the oplog automatically, but it won’t be in 2.2.

P.S. Trees were done with an awesome program I recently discovered called ArtRage, which I highly recommend to anyone who likes painting/drawing. It “feels” like real paint.

Good Night, Westley: Time-To-Live Collections

In The Princess Bride, every night the Dread Pirate Roberts tells Westley: “Good night, Westley. Good work. Sleep well. I’ll most likely kill you in the morning.”

Let’s say the Dread Pirate Roberts wants to optimize this process, so he stores prisoners in a database. When he captures Westley, he can put:

> db.prisoners.insert({
... name: "Westley",
... sentenceStart: new Date()
... })

However, now he has to run some sort of cron job that runs all the time in order to kill everyone who needs killing and keep his database up-to-date.

Enter time-to-live (TTL) collections. TTL collections are going to be released in MongoDB 2.2 and they’re collections where documents expire in a more controlled way than with capped collections.

What the Dread Pirate Roberts can do is:

> db.prisoners.ensureIndex({sentenceStart: 1}, {expireAfterSeconds: 24*60*60})

Now, MongoDB will regularly comb this index looking for docs to expire (so it’s actually more of a TTL index than a TTL collection).

Let’s try it out ourselves. You’ll need to download version 2.1.2 or higher to use this feature. Start up the mongod and run the following in the Mongo shell:

> db.prisoners.ensureIndex({sentenceStart: 1}, {expireAfterSeconds: 30})

We’re on a schedule here, so our pirate ship is more brutal: death after 30 seconds. Let’s take aboard a prisoner and watch him die.

> var start = new Date()
> db.prisoners.insert({name: "Haggard Richard", sentenceStart: start})
> while (true) { 
... var count = db.prisoners.count(); 
... print("# of prisoners: " + count + " (" + (new Date() - start) + "ms)");
... if (count == 0) 
...      break; 
... sleep(4000); }
# of prisoners: 1 (2020ms)
# of prisoners: 1 (6021ms)
# of prisoners: 1 (10022ms)
# of prisoners: 1 (14022ms)
# of prisoners: 1 (18023ms)
# of prisoners: 1 (22024ms)
# of prisoners: 1 (26024ms)
# of prisoners: 0 (30025ms)

…and he’s gone.

Edited to add: Stennie pointed out that the TTL job only runs once a minute, so YMMV on when Westley gets bumped.

Conversely, let’s say we want to play the “maybe I’ll kill you tomorrow” game and keep bumping Westley’s expiration date. We can do that by updating the TTL-indexed field:

> db.prisoners.insert({name: "Westley", sentenceStart: new Date()})
> for (i = 0; i < 60; i++) {
... db.prisoners.update({name: "Westley"}, {$set: {sentenceStart: new Date()}});
... sleep(1000);
... }
> db.prisoners.count()
1

…and Westley’s still there, even though it’s been more than 30 seconds.

Once he gets promoted and becomes the Dread Pirate Roberts himself, he can remove himself from the execution rotation by changing his sentenceStart field to a non-date (or removing it altogether):

> db.prisoners.update({name: "Westley"}, {$unset : {sentenceStart: 1}});

When not on pirate ships, developers generally use TTL collections for sessions and other cache expiration problems. If ye be wanting a less grim introduction to MongoDB’s TTL collections, there are some docs on it in the manual.
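
A session-expiration setup looks just like the pirate ship, only less grim (the collection and field names here are made up):

> db.sessions.ensureIndex({lastActivity: 1}, {expireAfterSeconds: 60*60})
> db.sessions.insert({user: "inigo", lastActivity: new Date()})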

The Snail Crawls On…

A bit of housekeeping: I’ve changed domain names, now this site is kchodorow.com, not snailinaturtleneck.com. Old links should still work, they’ll just be permanent redirects to the new domain. Please let me know if you come across any issues!

Why the change?

kchodorow.com sounds more professional and is shorter. Also, I’m after fame and glory here (in a very niche market) and having people remember “that snail blog” instead of my name doesn’t help.

So why was it Snail in a Turtleneck in the first place?

I started this site as a place to put up cartoons and it seemed like a cute name for my comics. Since the site's inception, the content has gotten more and more technical, and it's now basically just a programming blog (although I still dream of being the Hyperbole-and-a-Half of tech blogging).

I also slapped a new coat of paint on the site design, so if you startle easily, don’t worry that it looks like a completely different site under a completely different domain.

Replica Set Internals Bootcamp Part III: Reconfiguring

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Reconfiguring Prerequisites

One of the goals is to not let you reconfigure yourself into a corner (e.g., end up with all arbiters), so reconfig tries to make sure that a primary could be elected with the new config. Basically, we go through each node, tally up how many votes there will be, and check that a majority of those voters are up (the reconfig logic sends out heartbeats to check).

Also, the member you send the reconfig to has to be able to become primary in the new setup. It doesn’t have to become primary, but its priority has to be greater than 0. So, you can’t have all of the members have a priority of 0.

The reconfig also checks the version number, set name, and that nothing is going to an illegal state (e.g., arbiter-to-non-arbiter, upping the priority on a slave delayed node, and so on).

One thing to note is that you can change hostnames in a reconfig. If you’re using localhost for a single-node set and want to change it to an externally resolvable hostname so you can add some other members, you can just change the member’s hostname from localhost to someHostname and reconfig (so long as someHostname resolves, of course).
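
In the shell, that kind of hostname change is just a fetch-modify-reconfig (the member index and someHostname here are whatever applies to your set):

> var cfg = rs.conf()
> cfg.members[0].host = "someHostname:27017"
> rs.reconfig(cfg)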

Additive Reconfiguration vs. Full Reconfigs

Once the reconfiguration has been checked for correctness, MongoDB checks to see if this is a simple reconfig or a full reconfig. A simple reconfig adds a new node. Anything else is a full reconfig.

A simple reconfig starts a new heartbeat thread for the new member and it’s done.

A full reconfig clears all state. This means that the current primary closes all connections. All the current heartbeat threads are stopped and a new heartbeat thread for each member is started. The old config is replaced by the new config. Then the member formerly known as primary becomes primary again.

We definitely take a scorched-earth approach to reconfiguring. If you are, say, changing the priority of a node from 0 to 1, it would make more sense to change that field than to tear down the whole old config. However, we didn’t want to miss an edge case, so we went with better safe than sorry. Reconfig is considered a “slow” operation anyway, so we’ll generally make the tradeoff of slower and safer.

Propagation of Reconfiguration

Even if you have a node that is behind on replication or slave delayed, reconfiguration will propagate almost immediately. How? New configs are communicated via heartbeat.

Suppose you have 2 nodes, A and B.

You run a reconfig on A, changing the version number from 6 to 7.

B sends a heartbeat request to A, which includes a field stating that B's version number is 6.

When A gets that heartbeat request, it will see that B's config version is less than its own, so it'll send back its config (at version 7) as part of its heartbeat response.

When B sees that new config, it’ll load it (making the same checks for validity that A did originally) and follow the same procedure described above.
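
You can watch this happen by checking the config version on each member; once the new config has propagated, every member will report the same number:

> rs.conf().version
7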


Forcing Reconfig

Despite the checks made by reconfig, users sometimes get into a situation where they don’t have a primary. They’d permanently lose a couple servers or a data center and suddenly be stuck with a bunch of secondaries and no way to reconfig. So, in 2.0, we added a force:true option to reconfig, which allowed it to be run on a secondary. That is all that force:true does. Sometimes people complain that force:true wouldn’t let them load an invalid configuration. Indeed, it won’t. force:true does not relax any of the other reconfig constraints. You still have to pass in a valid config. You can just pass it to a secondary.
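
In the shell, a forced reconfig looks like this (run it against a reachable secondary):

> var cfg = rs.conf()
> // trim cfg.members down to the servers you actually have left, then:
> rs.reconfig(cfg, {force: true})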

Why is my version number 6,203,493?

When you force-reconfigure a set, it adds a random (big) number to the version, which can be unnerving. Why does the version number jump by thousands? Suppose that we have a network partition and force-reconfigure the set on both sides of the partition. If we ended up with both sides having a config version of 8 and the set got reconnected, then everyone would assume they were in sync (everyone has a config version of 8, no problems here!) and you’d have half of your nodes with one config and half with another. By adding a random number to the version on reconfig, it’s very probable that one “side” will have a higher version number than the other. When the network is fixed, whichever side has a higher version number will “win” and your set will end up with a consistent config.

It might not end up choosing the config you want, but some config is better than the set puttering along happily with two primaries (or something stupid like that). Basically, if shenanigans happen during a network partition, check your config after the network is healthy again.

Removing Nodes and Sharding

I’d just like to rant for a second: removing nodes sucks! You’d think it would be so easy, right? Just take the node out of the config and boom, done. It turns out it’s a total nightmare. Not only do you have to stop all of the replication stuff happening on the removed node, you have to stop everything the rest of the set is doing with that node (e.g., syncing from it).

You also have to change the way the removed node reports itself so that mongos won’t try to update a set’s config from a node that’s been removed. And you can’t just shut it down, because people want to be able to play around and do rs.add("foo"); rs.remove("foo"); rs.add("foo"), so you have to be able to entirely shut down the replica set’s interaction with the removed node, but in a way that can be restarted on a dime.

Basically, there are a lot of edge cases around removing nodes, so if you want to be on the safe side, shut down a node before removing it from the set. However, Eric Milkie has done a lot of awesome work on removing nodes for 2.2, so it should be getting better.
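
In shell terms, the cautious removal sequence is (the hostname here is a placeholder):

> // on the member you're removing:
> use admin
> db.shutdownServer()
> // then, on the primary:
> rs.remove("removedHost:27017")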

10 Kindle Apps for the Non-Existent Developer API

The Kindle should have a developer API. Ereaders could be revolutionizing the way people read, but right now they’re like paperbacks without the nice book smell.

I’ve heard a lot of people say, “the Kindle isn’t powerful enough for apps.” Poppycock. I’m not talking about using it to play Angry Birds, I’m talking about stuff a calculator could zoom through and would actually improve the reading experience.

So I present 10 apps that would be super-useful, require few resources, and (in some cases) increase profits:

  1. A “more content” button for magazines. If I’m reading a good magazine, I’d love to be able to get $10 more of content when I’m done. It’d be like giving a rat a lever that dispenses pellets. Yum, reading pellets.
  2. Renaming books. Apparently my workflow is defective, because I’ll often end up with 6 titles named “Imported PDF”, and there is no way to distinguish the one I want other than opening each PDF until I find it. If I could just rename the damn things…
  3. Support for other organizational schemes. Some people like tags (like whoever wrote Gmail, apparently) and everyone else likes hierarchical folders. I hate tags, I want things neatly tucked away in Sci Fi/Nebula Awards/Short Stories, not a franken-tag like “Sci Fi – Nebula Awards – Short Stories” (okay, it’s equivalent, but I hate tags).
  4. In technical books, how often is there a diagram that you keep flipping back and forth to for the next 10 to 15 pages? It would be nice to be able to “pin” it to the top of the screen as you read all of the text related to the diagram.
  5. Goodreads integration. When I finish a book, I want to rate it and have it automatically added to my “read” shelf at Goodreads.
  6. Related to above: recommendations when I finish a book and rate it. If I just rated it five stars, show me other books people who loved this book liked. If I rated it one star, show me books people who hated this book liked.
  7. Related to above (again): list my Amazon recommendations inline with my list of books. This would be a money-spinner for them, I think, because Amazon’s recommendation engine is freakishly accurate (except when it gets thrown out of whack by holiday shopping). If I was looking at my Kindle and saw a list of books I really wanted to read a click away… well, I’d be much poorer.
  8. Make looking up words pluggable to different search engines (Wikipedia, Urban Dictionary, D&D Compendium, etc.). I was recently reading “Crime and Punishment” and came across the term “yellow ticket.” The built-in dictionary knew what “yellow” was, and it knew what “ticket” was, but that didn’t help a whole lot (answer: an id given to Russian prostitutes).
  9. Update support. Technical books especially can benefit from this: O’Reilly has been working to do multiple quick releases of ebooks so that they can be updated as the technology changes. Imagine if you’re opening up your well-thumbed copy of Scaling MongoDB and a dialog pops up: “Version 1.2 of Scaling MongoDB is available, covering changes in MongoDB 2.2. Would you like to download? [Yes/No]”. However, the support just isn’t there on the device side. (And a new version of Scaling MongoDB isn’t available yet, sorry.)
  10. Metrics. As an author, I would love to know how long it took someone to read a page, how many times they came back to it, and when they put the book down and went to do something else. Authors have never been able to get this level of feedback before and I think it would revolutionize writing. Basic user tracking would be amazing.

I’m not sure why Amazon doesn’t have a dev API, but I’d imagine that part of the reason is that most publishers would not like it. However, I think Amazon is big enough to crush them into submission. I hope that they will hurry up and do so.

If anyone has any ideas on how to get Amazon to implement a developer API, please comment!

P.S. I know about the API here, but that’s essentially for Angry-Birds-type apps. I’m looking for an API that lets you mess with the reader.

--thursday #5: diagnosing high readahead

Having readahead set too high can slow your database to a crawl. This post discusses why that is and how you can diagnose it.

The #1 sign that readahead is too high is that MongoDB isn’t using as much RAM as it should be. If you’re running MongoDB Monitoring Service (MMS), take a look at the “resident” size on the “memory” chart. Resident memory can be thought of as “the amount of space MongoDB ‘owns’ in RAM.” Therefore, if MongoDB is the only thing running on a machine, we want resident size to be as high as possible. Suppose resident is ~3GB on our example machine.
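
If you're not running MMS, you can get the same numbers from the shell; db.serverStatus().mem reports resident and mapped sizes in megabytes:

> db.serverStatus().mem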

Is 3GB good or bad? Well, it depends on the machine. If the machine only has 3.5GB of RAM, I’d be pretty happy with 3GB resident. However, if the machine has, say, 15GB of RAM, then we’d like at least 15GB of the data to be in there (the “mapped” field is (sort of) data size, so I’m assuming we have 60GB of data).

Assuming we’re accessing a lot of this data, we’d expect MongoDB’s resident set size to be 15GB, but it’s only 3GB. If we turn down readahead, the resident size jumps to 15GB and our app starts going faster. But why is this?

Let’s take an example: suppose all of our docs are 512 bytes in size (readahead is set in 512-byte increments, called sectors, so 1 doc = 1 sector makes the math easier). If we have 60GB of data then we have ~120 million documents (60GB of data/(512 bytes/doc)). The 15GB of RAM on this machine should be able to hold ~30 million documents.

Our application accesses documents randomly across our data set, so we’d expect MongoDB to eventually “own” (have resident) all 15GB of RAM, as 1) it’s the only thing running and 2) it’ll eventually fetch at least 15GB of the data.

Now, let’s set our readahead to 100 (100 512-byte sectors, aka 100 documents): blockdev --setra 100 /dev/sda. What happens when we run our application?

Picture our disk as looking like this, where each o is a document:

...
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
... // keep going for millions more o's

Let’s say our app requests a document. We’ll mark it with “x” to show that the OS has pulled it into memory:

...
ooooooooooooooooooooooooo
ooooxoooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

See it on the third line there? But that’s not the only doc that’s pulled into memory: readahead is set to 100 so the next 99 documents are pulled into memory, too:

...
ooooooooooooooooooooooooo
ooooxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxx
xxxxooooooooooooooooooooo
ooooooooooooooooooooooooo
ooooooooooooooooooooooooo
...

Is your OS returning this with every document?

Now we have 100 docs in memory, but remember that our application is accessing documents randomly: the likelihood that the next document we access is in that block of 100 docs is almost nil. At this point, there’s 50KB of data in RAM (512 bytes * 100 docs = 51,200 bytes) and MongoDB’s resident size has only increased by 512 bytes (1 doc).

Our app will keep bouncing around the disk, reading docs from here and there and filling up memory with data MongoDB never asked for, until RAM is completely full of junk that’s never been used. Then, it’ll start evicting things to make room for new junk as our app continues to make requests.

Working this out, there’s a 25% chance of our app requesting a doc that’s already in memory, so 75% of the requests are going to go to disk. Say we’re doing 2 requests a second. Then 1 hour of requests is 2 requests * 3600 seconds/hour = 7200 requests, 5400 of which are going to disk (.75 * 7200). If each of those requests pulls back 50KB, that’s roughly 265MB read from disk/hour. If readahead were 0 (each miss pulling back just the 512-byte doc), we’d read under 3MB from disk/hour.
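
Working that arithmetic out in the shell (these are just the example's numbers):

> var hitRate = 15/60
> var diskReqsPerHour = (1 - hitRate) * 2 * 3600
> diskReqsPerHour
5400
> diskReqsPerHour * 50 / 1024      // MB/hour from disk with readahead 100 (50KB per miss)
263.671875
> diskReqsPerHour * 0.5 / 1024     // MB/hour from disk with readahead 0 (512 bytes per miss)
2.63671875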

Which brings us to the next symptom of a too-high readahead: unexpectedly high disk IO. Because most of the data we want isn’t in memory, we keep having to go to disk, dragging shopping-carts full of junk into RAM, perpetuating the high disk io/low resident mem cycle.

The general takeaway is that a DB is not a “normal” workload for an OS. The default settings may screw you over.

Replica Set Internals Bootcamp Part IV: Syncing

I’ve been doing replica set “bootcamps” for new hires. It’s mainly focused on applying this to debug replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.

There are 8 subjects I cover in my bootcamp:

  1. Elections
  2. Creating a set
  3. Reconfiguring
  4. Syncing
  5. Initial Sync
  6. Rollback
  7. Authentication
  8. Debugging

I’m going to do one subject per post, we’ll see how many I can get through. I’m punting on reconfig for now because it’s finicky to write about.

Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.

Syncing

When a secondary is operating normally, it chooses a member to sync from (more on that below) and starts pulling operations from the source’s local.oplog.rs collection. When it gets an op (call it W, for “write”), it does three things:

  1. Applies the op
  2. Writes the op to its own oplog (also local.oplog.rs)
  3. Requests the next op

If the db crashes between 1 & 2 and then comes back up, it’ll think it hasn’t applied W yet, so it’ll re-apply it. Luckily (i.e., due to massive amounts of hard work), oplog ops are idempotent: you can apply W once, twice, or a thousand times and you’ll end up with the same document.

For example, if you have a doc that looks like {counter:1} and you do an update like {$inc:{counter:1}} on the primary, you’ll end up with {counter:2} and the oplog will store {$set:{counter:2}}. The secondaries will replicate that instead of the $inc.
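
You can see this by peeking at the last entry in the oplog right after such an update; it'll look something like this (the ts and h values here are made up):

> use local
> db.oplog.rs.find().sort({$natural: -1}).limit(1)
{ "ts" : Timestamp(1341861600, 1), "h" : NumberLong("1234567890"), "op" : "u",
  "ns" : "test.foo", "o2" : { "_id" : 1 }, "o" : { "$set" : { "counter" : 2 } } }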

w

To ensure a write is present on, say, two members, you can do:

> db.runCommand({getLastError: 1, w: 2})

Syntax varies based on language, consult your API docs, but it’s always an option for writes. The way this works is kind of cool.

Suppose you have a member called primary and another member syncing from it, called secondary. How does primary know where secondary is synced to? Well, secondary is querying primary's oplog for more results. So, if secondary requests an op written at 3pm, primary knows secondary has replicated all ops written before 3pm.

So, it goes like:

  1. Do a write on primary.
  2. Write is written to the oplog on primary, with a field “ts” saying the write occurred at time t.
  3. {getLastError:1,w:2} is called on primary. primary has done the write, so it is just waiting for one more server to get the write (w:2).
  4. secondary queries the oplog on primary and gets the op.
  5. secondary applies the op from time t.
  6. secondary requests ops with {ts:{$gt:t}} from primary's oplog.
  7. primary updates that secondary has applied up to t because it is requesting ops > t.
  8. getLastError notices that primary and secondary both have the write, so w:2 is satisfied and it returns.
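
One practical note: if no second member ever gets the write (say it's down), {getLastError:1, w:2} will just block, so you usually pair it with a timeout in milliseconds:

> db.runCommand({getLastError: 1, w: 2, wtimeout: 5000})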

Starting up

When you start up a node, it takes a look at its local.oplog.rs collection and finds the latest entry in there. This is called the lastOpTimeWritten and it’s the latest op this secondary has applied.

You can always use this shell helper to get the current last op in the shell:

> rs.debug.getLastOpWritten()

The “ts” field is the last optime written.

If a member starts up and there are no entries in the oplog, it will begin the initial sync process, which is beyond the scope of this post.

Once it has the last op time, it will choose a target to sync from.

Who to sync from

As of 2.0, servers automatically sync from whoever is “nearest” based on average ping time. So, if you bring up a new member it starts sending out heartbeats to all the other members and averaging how long it takes to get a response. Once it has a decent picture of the world, it’ll decide who to sync from using the following algorithm:

for each member that is healthy:
    if member[state] == PRIMARY
        add to set of possible sync targets

    if member[lastOpTimeWritten] > our[lastOpTimeWritten]
        add to set of possible sync targets

sync target = member with the min ping time from the possible sync targets

The definition of “member that is healthy” has changed somewhat over the versions, but generally you can think of it as a “normal” member: a primary or secondary. In 2.0, “healthy” debatably includes slave delayed nodes.

You can see who a server is syncing from by running db.adminCommand({replSetGetStatus:1}) and looking at the “syncingTo” field (only present on secondaries). Yes, yes, it probably should have been syncingFrom. Backwards compatibility sucks.

Chaining Slaves

The algorithm for choosing a sync target means that slave chaining is semi-automatic: start up a server in a data center and it’ll (probably) sync from a server in the same data center, minimizing WAN traffic. (Note that you can’t end up with a sync loop, i.e., A syncing from B and B syncing from A, because a secondary can only sync from another secondary with a strictly higher optime.)

One cool thing to implement was making w work with slave chaining. If A is syncing from B and B is syncing from C, how does C know where A is synced to? The way this works is that it builds on the existing oplog-reading protocol.

When A starts syncing from B (or any server starts syncing from another server), it sends a special “handshake” message that basically says, “Hi, I’m A and I’ll be syncing from your oplog. Please track this connection for w purposes.”

When B gets this message, it says, “Hmm, I’m not primary, so let me forward that along to the member I’m syncing from.” So it opens a new connection to C and says “Pretend I’m A, I’ll be syncing from your oplog on A's behalf.” Note that B now has two connections open to C, one for itself and one for A.

Whenever A requests more ops from B, B sends the ops from its oplog and then forwards a dummy request to C along “A's” connection to C. A doesn’t even need to be able to connect directly to C.

A <====> B <====> C
         B <----> C   (B's second connection to C, on A's behalf)

<====> is a “real” sync connection. The connection between B and C on A’s behalf is called a “ghost” connection (<---->).

On the plus side, this minimizes network traffic. On the negative side, the absolute time it takes a write to get to all members is higher.

Coming soon to a replica set near you…

In 2.2, there will be a new command, replSetSyncFrom, that lets you change who a member is syncing from, bypassing the “choosing a sync target” logic.

> db.adminCommand({replSetSyncFrom:"otherHost:27017"})

Night of the Living Dead Ops

MongoDB users often ask about the “killed” field in db.currentOp() output. For example, if you’ve run db.killOp(), you might see something like:

> db.currentOp()
{
	"inprog" : [
		{
			"opid" : 3062962,
			"active" : true,
			"lockType" : "write",
			"waitingForLock" : false,
			"secs_running" : 32267,
			"op" : "update",
			"ns" : "httpdb.servers",
			"query" : {
				"_id" : "150.237.88.189"
			},
			"client" : "127.0.0.1:50416",
			"desc" : "conn",
			"threadId" : "0x2900c400",
			"connectionId" : 74,
			"killed" : true,
			"numYields" : 0
		},
		{
			"opid" : 3063051,
			"active" : false,
			"lockType" : "read",
			"waitingForLock" : true,
			"op" : "query",
			"ns" : "",
			"query" : {
				"count" : "servers",
				"query" : {
					"code" : {
						"$gte" : 200
					}
				}
			},
			"client" : "127.0.0.1:30736",
			"desc" : "conn",
			"threadId" : "0x29113700",
			"connectionId" : 191,
			"killed" : true,
			"numYields" : 0
		}
        ]
}

The operation looks dead… it has killed:true, right? But you can run db.currentOp() again and again and the op doesn’t go away, even though it’s been “killed.” So what’s up with that?

Chainsaws: the kill -9 of living dead

It has to do with how MongoDB handles killed operations. When you run db.killOp(3062962), MongoDB looks up operation 3062962 in a hashtable and sets its killed field to true. However, the code running that op gets to decide whether to even check that field and how to deal with it appropriately.

There are basically three ways MongoDB ops handle getting killed:

  • Ones that die when they yield whatever lock they’re holding. This means that if the op never yields (note that numYields is 0 in the example above), it will never be killed.
  • Ones that can be killed at certain checkpoints. For example, index builds happen in multiple stages and check killed between stages. (Many commands do this, too.)
  • Ones that cannot be killed at all. For example, rsSync, the name for the op applying replication, falls into this category. There are some sharding commands that can’t be killed, too.

There is no kill -9 equivalent in MongoDB (other than kill -9-ing the server itself): if an op doesn’t know how to safely kill itself, it won’t die until it’s good and ready. Which means that you can have a “killed” op in db.currentOp() output for a long time. killed might be better named killRequested.

Also, if you kill an operation before it acquires a lock, it’ll generally start executing anyway (e.g., op 3063051 above). For example, try opening up a shell and make the db hold the writelock for 10 minutes:

> db.eval("sleep(10*60*1000)")

While that’s running, in another shell, try doing an insert (which will block, as the db cannot do any writes while the db.eval() is holding the writelock).

> db.foo.insert({x:1})

Now, in a third shell, kill the insert we just did (before the 10 minutes elapse):

> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 25,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,
                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
                                
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "numYields" : 0
                }
        ]
}
> // get the opId of the insert from currentOp
> db.killOp(455949)
{ "info" : "attempting to kill op" }
> // run currentOp again to see that killed:true
> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 455937,
                        "active" : true,
                        "lockType" : "W",
                        "waitingForLock" : false,
                        "secs_running" : 221,
                        "op" : "query",
                        "ns" : "test",
                        "query" : {
                                "$eval" : "sleep(10*60*1000)"
                        },
                        "client" : "127.0.0.1:51797",
                        "desc" : "conn",
                        "threadId" : "0x7f241c0bf700",
                        "connectionId" : 14477,
                        "locks" : {
                                "." : "W"
                        },
                        "numYields" : 0
                },
                {
                        "opid" : 455949,
                        "active" : false,
                        "lockType" : "w",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
                                
                        },
                        "client" : "127.0.0.1:51799",
                        "desc" : "conn",
                        "threadId" : "0x7f24147f8700",
                        "connectionId" : 14478,
                        "locks" : {
                                "." : "w",
                                ".test" : "W"
                        },
                        "killed" : true,
                        "numYields" : 0
                }
        ]
}

If you wait 10 minutes for the db.eval() to finish, then do a find on db.foo, you’ll see that {x:1} was actually inserted anyway. This is because the op’s lifecycle looks something like:

  • Wait for lock
  • Acquire lock!
  • Start running
  • Yield lock
  • Check for killed

So it’ll run a bit before dying (if it can be killed at all), which may produce unintuitive results.

--thursday #4: blockdev

Disk IO is slow. You just won’t believe how vastly, hugely, mind-bogglingly slow it is. I mean, you may think your network is slow, but that’s just peanuts to disk IO.

There’s a great image visualizing just how slow it is, originally found on Hacker News and inspired by Gustavo Duarte’s blog.

The kernel knows how slow the disk is and tries to be smart about accessing it. It not only reads the data you requested, it also returns a bit more. This way, if you’re reading through a file or watching a movie (sequential access), your system doesn’t have to go to disk as frequently because you’re pulling more data back than you strictly requested each time.

You can see how far the kernel reads ahead using the blockdev tool:

$ sudo blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     80026361856   /dev/sda
rw   256   512  4096       2048     80025223168   /dev/sda1
rw   256   512  4096          0   2000398934016   /dev/sdb
rw   256   512  1024       2048        98566144   /dev/sdb1
rw   256   512  4096     194560      7999586304   /dev/sdb2
rw   256   512  4096   15818752     19999490048   /dev/sdb3
rw   256   512  4096   54880256   1972300152832   /dev/sdb4

Readahead is listed in the “RA” column. As you can see, I have two disks (sda and sdb) with readahead set to 256 on each. But what unit is that 256? Bytes? Kilobytes? Dolphins? If we look at the man page for blockdev, it says:

$ man blockdev
...
       --setra N
              Set readahead to N 512-byte sectors.
...

This means that my readahead is 512 bytes*256=131072 or 128KB. That means that, whenever I read from disk, the disk is actually reading at least 128KB of data, even if I only requested a few bytes.

So what value should you set your readahead to? Please don’t set it to a number you find online without understanding the consequences. If you Google for “blockdev setra”, the first result uses blockdev --setra 65536, which translates to 32MB of readahead. That means that, whenever you read from disk, the disk is actually doing 32MB worth of work. Please do not set your readahead this high if you’re doing a lot of random-access reads and writes, as all of the extra IO can slow things down a lot (and if you’re low on memory, you’ll be forcing the kernel to fill up your RAM with data you won’t need).
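
If you do want to experiment, setting and re-checking readahead looks like this (the value 32, i.e. 16KB, is purely illustrative, not a recommendation):

$ sudo blockdev --setra 32 /dev/sda
$ sudo blockdev --getra /dev/sda
32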

Getting a good readahead value can help disk IO issues to some extent, but if you are using MongoDB (in particular), please consider your typical document size and access patterns before changing your blockdev settings. I’m not recommending any particular value because what’s perfect for one application/machine can be death for another.

I’m really enjoying these --thursday posts because every week people have commented with different/better/interesting ways of doing what I talked about (or ways of telling the difference between stalagmites and stalactites), which is really cool. So I’m throwing this out there: how would you figure out what a good readahead setting is? Next week I’m planning to do iostat for --thursday, which should cover this a bit, but please leave a comment if you have any ideas.