I’ve been doing replica set “bootcamps” for new hires. The bootcamp is mainly focused on applying this material to debugging replica set issues and being able to talk fluently about what’s happening, but it occurred to me that you (blog readers) might be interested in it, too.
There are 8 subjects I cover in my bootcamp:
- Elections
- Creating a set
- Reconfiguring
- Syncing
- Initial Sync
- Rollback
- Authentication
- Debugging
Prerequisites: I’m assuming you know what replica sets are and you’ve configured a set, written data to it, read from a secondary, etc. You understand the terms primary and secondary.
Reconfiguring Prerequisites
One of the goals is to not let you reconfigure yourself into a corner (e.g., end up with all arbiters), so reconfig tries to make sure that a primary could be elected with the new config. Basically, we go through each node, tally up how many votes there will be under the new config, and check whether a majority of those voters is up (the reconfig logic sends out heartbeats to find out).
Also, the member you send the reconfig to has to be able to become primary in the new setup. It doesn’t have to actually become primary, but its priority has to be greater than 0. So, you can’t give every member a priority of 0.
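For example, this is the kind of reconfig those checks will refuse (a mongo shell sketch; the exact error message depends on your version):

    cfg = rs.conf()
    // setting every member's priority to 0 leaves no one electable
    for (var i = 0; i < cfg.members.length; i++) {
        cfg.members[i].priority = 0;
    }
    rs.reconfig(cfg)   // rejected: no member could become primary under this config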
The reconfig also checks the version number, set name, and that nothing is going to an illegal state (e.g., arbiter-to-non-arbiter, upping the priority on a slave delayed node, and so on).
One thing to note is that you can change hostnames in a reconfig. If you’re using localhost for a single-node set and want to change it to an externally resolvable hostname so you can add some other members, you can just change the member’s hostname from localhost to someHostname and reconfig (so long as someHostname resolves, of course).
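In the mongo shell, that looks something like this (the hostname and port are just examples):

    cfg = rs.conf()
    cfg.members[0].host = "someHostname:27017"   // was localhost:27017; must be resolvable by all members
    rs.reconfig(cfg)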
Additive Reconfiguration vs. Full Reconfigs
Once the reconfiguration has been checked for correctness, MongoDB checks to see if this is a simple reconfig or a full reconfig. A simple reconfig adds a new node. Anything else is a full reconfig.
A simple reconfig starts a new heartbeat thread for the new member and it’s done.
A full reconfig clears all state. This means that the current primary closes all connections. All the current heartbeat threads are stopped and a new heartbeat thread for each member is started. The old config is replaced by the new config. Then the member formerly known as primary becomes primary again.
We definitely take a scorched-earth approach to reconfiguring. If you are, say, changing the priority of a node from 0 to 1, it would make more sense to change that field than to tear down the whole old config. However, we didn’t want to miss an edge case, so we went with better safe than sorry. Reconfig is considered a “slow” operation anyway, so we’ll generally make the tradeoff of slower and safer.
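So, for example, a reconfig like this one touches a single field, but internally it still goes through the full teardown-and-rebuild described above:

    cfg = rs.conf()
    cfg.members[1].priority = 1   // one field changed, but still a "full" reconfig internally
    rs.reconfig(cfg)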
Propagation of Reconfiguration
Even if you have a node that is behind on replication or slave-delayed, a reconfiguration will propagate almost immediately. How? New configs are communicated via heartbeat.
Suppose you have 2 nodes, A and B.
- You run a reconfig on A, changing the version number from 6 to 7.
- B sends a heartbeat request to A, which includes a field stating that B’s config version is 6.
- When A gets that heartbeat request, it will see that B’s config version is less than its own, so it’ll send back its config (at version 7) as part of its heartbeat response.
- When B sees that new config, it’ll load it (making the same checks for validity that A did originally) and follow the same procedure described above.
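In pseudocode, the version check looks roughly like this (a sketch of the idea, not the actual server code; the names are made up):

    // A's side of the exchange: compare the requester's config version to our own
    function handleHeartbeatRequest(request, myConfig) {
        var response = { ok: 1 };
        if (request.configVersion < myConfig.version) {
            // the requester's config is older: piggyback the newer config on the response
            response.config = myConfig;
        }
        return response;
    }
    // B, seeing a config in the heartbeat response, validates it and swaps it in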

Forcing Reconfig
Despite the checks made by reconfig, users sometimes get into a situation where they don’t have a primary: they permanently lose a couple of servers or a data center and are suddenly stuck with a bunch of secondaries and no way to reconfig. So, in 2.0, we added a force:true option to reconfig, which allows it to be run on a secondary. That is all that force:true does. Sometimes people complain that force:true won’t let them load an invalid configuration. Indeed, it won’t. force:true does not relax any of the other reconfig constraints: you still have to pass in a valid config. You can just pass it to a secondary.
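For example, from one of the surviving secondaries (the member indexes here are made up; keep whichever members are actually still reachable):

    cfg = rs.conf()
    cfg.members = [cfg.members[0], cfg.members[1]]   // drop the members you've lost
    rs.reconfig(cfg, { force: true })                // only works on a secondary because of force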
Why is my version number 6,203,493?
When you force-reconfigure a set, it adds a random (big) number to the version, which can be unnerving. Why does the version number jump by thousands? Suppose that we have a network partition and force-reconfigure the set on both sides of the partition. If both sides ended up with a config version of 8 and the set got reconnected, then everyone would assume they were in sync (everyone has a config version of 8, no problems here!) and you’d have half of your nodes with one config and half with another. By adding a random number to the version on a forced reconfig, it’s very probable that one “side” will have a higher version number than the other. When the network is fixed, whichever side has the higher version number will “win” and your set will end up with a consistent config.
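The effect is roughly this (a sketch of the idea, not the server’s exact arithmetic):

    // a forced reconfig bumps the version by a large random amount, so two sides
    // of a partition are very unlikely to land on the same number
    cfg.version += Math.floor(Math.random() * 100000) + 1;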
It might not end up choosing the config you want, but some config is better than the set puttering along happily with two primaries (or something stupid like that). Basically, if shenanigans happen during a network partition, check your config after the network is healthy again.
Removing Nodes and Sharding
I’d just like to rant for a second: removing nodes sucks! You’d think it would be so easy, right? Just take the node out of the config and boom, done. It turns out it’s a total nightmare. Not only do you have to stop all of the replication stuff happening on the removed node, you have to stop everything the rest of the set is doing with that node (e.g., syncing from it).
You also have to change the way the removed node reports itself so that mongos won’t try to update a set’s config from a node that’s been removed. And you can’t just shut it down, because people want to be able to play around and do rs.add("foo"); rs.remove("foo"); rs.add("foo"), so you have to be able to entirely shut down the replica set’s interaction with the removed node, but in a way that can be restarted on a dime.
Basically, there are a lot of edge cases around removing nodes, so if you want to be on the safe side, shut down a node before removing it from the set. However, Eric Milkie has done a lot of awesome work on removing nodes for 2.2, so it should be getting better.
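If you do want to play it safe, the sequence looks like this (the hostname is made up):

    // on the node you're about to remove (run against its admin database):
    db.shutdownServer()
    // then, on the primary:
    rs.remove("node3.example.net:27017")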
Glad to hear about the improvements coming to removing nodes. That has definitely been one of the biggest RS pain points.
Hello, when you say “removing nodes sucks”, do you mean the operation we have to do is complicated, or just that the internal implementation is complicated?
Just the internal implementation is complicated. From the user’s POV, you can just do rs.remove(hostname).
thanks