If you want to jump right into trying out replica sets, see Part 1: Master-Slave is so 2009.
Replica sets are basically just master-slave with automatic failover.
The idea is: you have a master and one or more slaves. If the master goes down, one of the slaves will automatically become the new master. The database drivers will always find the master, so if the master they’re using goes down, they’ll automatically figure out who the new master is and send writes to that server instead. This is much easier to manage (and faster) than manually failing over to a slave.
So, you have a pool of servers with one primary (the master) and N secondaries (slaves). If the primary crashes or disappears, the other servers will hold an election to choose a new primary.
A server has to get a majority of the total votes in the set to be elected, not just a majority of the votes cast. This means that, if we have 50 servers and each server has 1 vote (the default; later posts will show how to change the number of votes a server gets), a server needs at least 26 votes to become primary. If no one gets 26 votes, no one becomes primary. The set can still handle reads, but not writes (as there’s no master).
If a server gets 26 or more votes, it will become primary. All future writes will be directed to it until it loses an election, blows up, gets caught breaking into the DNC, etc.
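The majority rule above is easy to sketch. Here’s a toy version in JavaScript (the function names are illustrative, not MongoDB internals): a candidate needs a strict majority of the *total* votes in the set.

```javascript
// A candidate needs a strict majority of the TOTAL votes in the set,
// not just a majority of the votes cast.
function votesNeeded(totalVotes) {
  return Math.floor(totalVotes / 2) + 1;
}

function canBecomePrimary(votesReceived, totalVotes) {
  return votesReceived >= votesNeeded(totalVotes);
}

// 50 servers, 1 vote each: a candidate needs at least 26 votes.
console.log(votesNeeded(50));          // 26
console.log(canBecomePrimary(25, 50)); // false -- no primary, set is read-only
console.log(canBecomePrimary(26, 50)); // true
```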
The original primary is still part of the set. If you bring it back up, it will become a secondary server (until it gets the majority of votes again).
Three’s a crowd (in a good way)
One complication with this voting system is that you can’t just have a master and a slave.
If you just set up a master and a slave, the system has a total of 2 votes, so a server needs both votes to be elected master (1 is not a majority). If one server goes down, the other server only has 1 out of 2 votes, so it will become (or stay) a slave. If the network is partitioned, suddenly the master doesn’t have a majority of the votes (it only has its own 1 vote), so it’ll be demoted to a slave. The slave also doesn’t have a majority of the votes, so it’ll stay a slave (so you’d end up with two slaves until the servers can reach each other again).
It’s a waste, though, to have two servers and no master up, so replica sets have a number of ways of avoiding this situation. One of the simplest and most versatile ways is using an arbiter, a special server that exists to resolve disputes. It doesn’t serve any data to the outside world; it’s just a voter (it can even run on the same machine as another server, since it’s very lightweight). In part 1, localhost:27019 was the arbiter.
So, let’s say we set up a master, a slave, and an arbiter, each with 1 vote (3 votes total). If we put the master and arbiter in one data center and the slave in another, then when a network partition occurs, the master still has a majority of the votes (its own plus the arbiter’s), while the slave has only 1 vote. If the master fails and the network is not partitioned, the arbiter can vote for the slave, promoting it to master.
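The two deployments discussed above can be compared with a toy simulation (illustrative only, not MongoDB’s actual election protocol): each side of a network partition can only count the votes of the members it can still reach.

```javascript
// Each side of a partition only counts votes from members it can reach;
// a primary can exist only on a side with a strict majority of all votes.
function sideCanHavePrimary(reachableVotes, totalVotes) {
  return reachableVotes > totalVotes / 2;
}

// Two-member set (master + slave), partitioned: neither side has 2 of 2 votes.
console.log(sideCanHavePrimary(1, 2)); // false -- both end up as slaves

// Master + arbiter in one data center, slave in another (3 votes total).
console.log(sideCanHavePrimary(2, 3)); // true  -- master + arbiter keep a majority
console.log(sideCanHavePrimary(1, 3)); // false -- the lone slave stays a slave
```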
With this three-server setup, we get sensible, robust failover.
Next up: dynamically configuring your replica set. In part 1, we fully specified everything on startup. In the real world you’ll want to be able to add servers dynamically and change the configuration.
16 thoughts on “Replica Sets Part 2: What are Replica Sets?”
But how does it actually work? How do you avoid race conditions? How do you deal with incoming requests during the process of elections?
> But how does it actually work? How do you avoid race conditions?
That’s a bit complicated to go into. I’ll do a post about replica set internals in a week or two.
> How do you deal with incoming requests during the process of elections?
If no one is master, the system becomes read-only. Writes will fail with “not master” (or similar) for the time it takes to elect a master. There are two options, really: you can always have a master and deal with conflicts (as you’ll sometimes end up with more than one) or you can sometimes have no master and the system, in this case, will become read-only. MongoDB chose “occasionally read-only,” as that’s generally easier for people to develop applications against.
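One way an application can ride out that “occasionally read-only” window is sketched below (a hypothetical helper, not part of any driver): retry a write a few times when it fails with a “not master” error, and give up into read-only mode if no primary shows up in time.

```javascript
// Retry a write while an election is in progress; fall back to
// read-only mode if no primary is elected within maxRetries attempts.
function attemptWrite(write, maxRetries) {
  for (let i = 0; i <= maxRetries; i++) {
    try {
      return { ok: true, result: write() };
    } catch (e) {
      if (!/not master/.test(e.message)) throw e; // unrelated error: rethrow
      // A real app would sleep briefly here before retrying.
    }
  }
  return { ok: false, reason: "no primary elected; dropping to read-only" };
}

// Fake write that fails twice (election in progress), then succeeds.
let calls = 0;
const write = () => {
  if (++calls < 3) throw new Error("not master");
  return "written";
};
console.log(attemptWrite(write, 5)); // { ok: true, result: 'written' }
```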
There is something I basically do not understand. I have a replica set (one master, one secondary, and an arbiter). The secondary has rs.slaveOk() set to allow reads (that should increase the webapp’s capacity to respond to requests, right?). They are all initiated/connected/configured using internal IPs.
What I do not understand is: if the primary fails and a new primary is elected, the site should remain up, right? But it does not – it gives a “Transport endpoint is not connected” error. Is it because it should be handled by the (in my case) PHP mongo driver? I mean, if a “MongoConnectionException” is caught and it resolves to “Transport endpoint is not connected”, should the driver then address the new master (and shouldn’t that happen automatically)?
It will failover automatically once a new master has been elected. Your application does have to handle that there may be a few seconds (or longer) where there is no master. This means that you should figure out what you want your application to do if you get an error: maybe drop into read-only mode? Block everything until a new master is elected? It really depends on the application.
Thanks for your reply. However, I’m still not sure how to achieve the desired failover. As of now, when the master fails (with a kill -15) and a new master is elected, I would expect my somedomain.com/mongoTestPage to be reachable, but it is not – I get a MongoConnectionException thrown by the PHP mongo driver. I must be missing something basic.
I think there were some issues with failover in 1.0.10 of the PHP driver. I’ll be releasing 1.0.11 soon, but you could try the latest code (http://github.com/mongodb/mongo-php-driver) and see if that works better for you.
It will throw MongoConnectionExceptions until a new master is elected, but once it is the driver should failover.
Again thanks for answering (that quick).
I tried to do a manual installation following your advice, but for some reason I cannot use PECL/phpize. I then tried to (re)install PEAR with lynx -source http://pear.php.net/go-pear | php but ran into a lot of error messages along the lines of “Warning: include_once(PEAR.php): failed to open stream: No such file or directory in – on line 682”. Is there another failproof way of installing the driver manually? Sorry for taking up your time with all these questions 🙂
Yes, you can install the driver manually by downloading the source from Github, then running the standard PHP extension build steps (phpize, ./configure, make, and sudo make install) from the command line.
New driver installed, but I’m afraid it did not solve anything. It must be something wrong with my setup: my secondary and arbiter do not have any domain attached, only the primary does, and there is no DNS magic between them, only mongoDB connecting the 3.
Should all 3 mongoDB servers have the latest PHP driver installed?
What is responsible for redirecting to the new master if the server that holds the current master mongoDB is killed?
You need each PHP install you’re using to have the latest driver installed, so you probably need to upgrade it on each app server you’re using.
The driver is responsible for redirecting to a new master. When you connect to the database using “replicaSet” => true, the driver will make a note of all of the members of the set. If the master goes down, it’ll try connecting to the others. You can print your connection (Mongo) object to see all of the members it knows about.
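That failover behavior can be modeled with a small sketch (a toy in JavaScript, illustrative only – not the real PHP driver): the driver remembers every member of the set and, when the current primary is unreachable, probes the others for the new primary.

```javascript
// Probe each known member of the set and return the first one that
// reports itself as primary, or null if the election isn't over yet.
function findPrimary(members, isPrimary) {
  for (const host of members) {
    if (isPrimary(host)) return host;
  }
  return null; // no primary yet: election still in progress
}

const members = ["db1:27017", "db2:27017", "db3:27017"];
// Pretend db1 just died and db2 won the election.
const primary = findPrimary(members, (h) => h === "db2:27017");
console.log(primary); // db2:27017
```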
Thanks Kristina, you are being very helpful.
My last question to you is: what if the physical machine that hosts the primary – that is, the mongodb and the PHP driver – is shut down for some reason? Then the PHP driver, which as you say is responsible for redirecting to a new master, is not able to do so – what happens then?
The PHP driver is installed wherever your PHP code is running (your application server). If you are running your application server on a single machine that goes down, your website will go down and it won’t really matter what your database is doing. The remaining replica set members will failover and choose a new master, but no one will be connecting to it.
Congratulations on your post.
You gave a great explanation. Even mongodb.org doesn’t have a text so clear and to the point.