PHP Extensions Made Eldrich: Installing PHP

A PHP extension allows you to connect almost any C/C++ code you want to PHP. This is a 4-part tutorial on how to write an extension:

  1. Setting Up PHP – compiling PHP for extension development
  2. Hello, world! – your first extension
  3. Working with the API – the PHP C API
  4. Classes – creating PHP objects in C

Almost all of the code examples in this tutorial are available on Github.

Zend Developer Zone has an excellent tutorial on writing PHP extensions. I wrote this tutorial because the DevZone article is getting a little old: it doesn’t cover objects, methods, or exceptions and it uses Zend API artifacts dating back to PHP 3. However, it is still an excellent tutorial and I highly recommend reading through it if you’re interested in writing PHP extensions.

Setting Up PHP

Before you start developing an extension, you should compile PHP from source (it’ll make debugging easier later on). If you hate future-you, though, you can just do which phpize and if it returns something, you’re can continue to the next section.

Compiling PHP yourself isn’t too scary (unless your on Windows, in which case welcome to Hell). First, download the source for version you want to develop with. The current stable release is a good choice.

Unpack the tarball and change to the PHP source directory:

$ tar jxvf php-5.3.6.tar.bz2
$ cd php-5.3.6/
$ PHPDIR=`pwd` # setting this up so I can refer to $PHPDIR later

Note: this tutorial assumes that you’re using version 5.3.*. The API changes every version, so if you’re not using 5.3, this tutorial is going to be very frustrating.

To install PHP, run:

$ mkdir install-debug-zts # install dir
$ ./configure --enable-debug --enable-maintainer-zts --prefix=$PHPDIR/install-debug-zts
$ make install

I recommend using a custom install prefix ($PHPDIR/install-debug-zts in the example above), to keep it separate from any PHP you might have installed previously.

If you install multiple versions of PHP to the default location (/usr/local), it’ll get really annoying really fast: you always have to re-install if you want to try a different build, package managers often install PHP in /usr (so you may have two PHP installs floating around), and the PHP install is oddly coy about overwriting existing files: sometimes it decides to just leave the old versions there if a file already exists.

Thus, it pays to keep things organized in custom installation folders.

There are a couple of configuration options that you should enable, too, for extension development: –enable-debug (debugging info) and –enable-maintainer-zts (thread stuff and memory tracking).

Once make install is done, you’ve got PHP installed! Add $PHPDIR/install-debug-zts/bin to your path with:

$ # this will only add it to the path for this shell
$ PATH=$PHPDIR/install-debug-zts/bin:$PATH

Now you’re ready to make an extension.

Next up: writing your first extension.


Since MongoDB was first created, the Mongo shell prompt has just been:


A couple of months ago, my prompt suddenly changed to:


It’s nice to have more information the prompt, but 1) I don’t care about the replica set name and 2) a programmer’s prompt is very personal. Having it change out from under you is like coming home and finding that someone replaced all of your underwear. It’s just disconcerting.

Anyway, I recently got an intern (well, I’m mentoring him, it’s not like I bought him), Matt Dannenberg, who’s interested in working on shell stuff. He committed some code last week that lets you customize the shell’s prompt (it will be in 1.9.1+).

Basically, you define a prompt() function, and then it’ll be executed every time the shell is displayed. Immediately, I did:

myReplSetName:SECONDARY> prompt = "> "
> // ah, bliss
> // some sysadmins think > is a weird prompt, as it's also 
> // used for redirections, so they might prefer $
> prompt = "$ "
$ // there we go

Okay, that’s much better. But there is some information I’d like to add to my prompt.

I often forget which database db is referring to, and then I have to type db to check (which is especially annoying when I’m in the middle of a multi-line script). So, let’s just add the current database name to the prompt.

> prompt = function() { return db+"> "; }
test> use foo
foo> use bar

The prompt no longer shows whether I’m connected to a PRIMARY or a SECONDARY (or something else), which is useful information to have. I hate that long string, though, so let’s neaten it up. I want it to be:

I’m connected to the primary
I’m connected to a secondary
I’m connected to a server with state STATE.

This might look something like:

> prompt = function() { 
... result = db.isMaster();
... if (result.ismaster) {
...     return db+"> "; 
... }
... else if (result.secondary) {
...    return "("+db+")> ";
... }
... result = db.adminCommand({replSetGetStatus : 1})
... return states[result.myState]+":"+db+"> ";
... }

Also, the default prompt displays if you’re connected to a mongos (with mongos>), which is good for keeping track when you’re running a cluster:

> prompt = function() {
... result = db.adminCommand({isdbgrid : 1});
... if (result.ok == 1) {
...     return "mongos> ";
... }
... return "> ";
... }

Another nice thing would be to have the time each time it displays the prompt: then you can kick off a long-running job, go to lunch, and know what time it finished when you get back.

> prompt = function() { 
... var now = new Date(); 
... return now.getHours()+":"+now.getMinutes()+":"+now.getSeconds()+"> ";
... }

Defining prompt() as shown above is nice for playing around, but it’s a pain to define your prompt every time you start up the shell. So, you can add it to a function and then either load it on startup (a command line argument) or from the shell itself:

$ # load from command line arg:
$ mongo shellConfig.js
MongoDB shell version 1.9.1-
connecting to: test
> // load from the shell itself
> load("/path/to/my/shellConfig.js")

Or, you can use another feature my intern has implemented: mongo will automatically look for (and, if it finds, load) a .mongorc.js file from your home directory on startup.

// my startup file

prompt = /* ... */

// getting "not master and slaveok=false" errors drives me nuts,
// so I'm overriding the getDB() code to ALWAYS set slaveok=true
Mongo.prototype.getDB = function(name) {
    return new DB(this, name);

/* and so on... */
Actually, 10gen employees would never have an intern make coffee, as they might mess it up: we have at least five different brewers, two grinders, two pages of close-typed instructions on how to grind/brew, and an RFC on coffee-making protocol.

Keep in mind that the is a “proper” JS file, you can’t use magic Mongo shell helpers, like use <dbname> (instead, use db.getSisterDB("<dbname>")). If you don’t want .mongorc.js loaded on startup, start the shell with –norc.

Hopefully these things will make life a little easier for people.

Both of these changes are in master and will be in 1.9.1+. They will not be backported to the 1.8 branch. You can use the 1.9 shell with the 1.8 database server, though, if you want to use these features with a production database.

Mongo in Flatland

MongoDB’s geospatial indexing lets you use a collection as a map. It works differently than “normal” indexing, but there’s actually a nice, visual way to see what geospatial indexing does.

Let’s say we have a 16×16 map; something that looks like this:

All of the coordinates in our map (as described above) are somewhere between [0,0] and [16,16], so I’m going to make the min value 0 and the max value 16.{point : "2d"}, {min : 0, max : 16, bits : 4})

This essentially turns our collection into a map. (Don’t worry about bits, for now, I’ll explain that below.)

Let’s say we have something at the point [4,6]. MongoDB generates a geohash of this point, which describes the point in a way that makes it easier to find things near it (and still be able to distribute the map across multiple servers). The geohash for this point is a string of bits describing the position of [4,6]. We can find the geohash of this point by dividing our map up into quadrants and determining which quadrant it is in. So, first we divide the map into 4 parts:

This is the trickiest part: each quadrant can be described by two bits, as shown in the table below:

01 11
00 10

[4,6] is in the lower-left quadrant, which matches 00 in the table above. Thus, its geohash starts with 00.

Geohash so far: 00

Now we divide that quadrant again:

[4,6] is now in the upper-right quadrant, so the next two bits in the geohash are 11. Note that the bottom and left edges are included in the quadrant, the top and right edges are excluded.

Geohash so far: 0011

Now we divide that quadrant again:

[4,6] is now in the upper-left quadrant, so the next two bits in the geohash are 01.

Geohash so far: 001101

Now we divide that quadrant again:

[4,6] is now in the lower-left quadrant, so the next two bits in the geohash are 00.

Geohash so far: 00110100

You may wonder: how far do we keep dividing? That’s exactly what the bits setting is for. We set it to 4 when we created the index, so we divide into quadrants 4 times. If we wanted higher precision, we could set bits to something higher.

You can check your math above by using the geoNear command, which returns the geohash for the point you’re search near:

> db.runCommand({geoNear : "map", near : [4,6]})
	"ns" : "",
	"near" : "00110100",
	"results" : [ ],
	"stats" : {
		"time" : 0,
		"btreelocs" : 0,
		"nscanned" : 0,
		"objectsLoaded" : 0,
		"avgDistance" : NaN,
		"maxDistance" : -1
	"ok" : 1

As you can see, the “near” field contains exactly the geohash we’d expect from our calculations.

The interesting thing about geohashing is that this makes it easy to figure out what’s near us, because things are sorted according to their position on the map: every document with a point geohash starting with 00 is in the lower-left quadrant, every point starting with 00111111 is very near the middle, but in the lower-left quadrant. Thus, you can eyeball where a point is by looking at its geohash.

Bits and Precision

Let’s say a wizard casts a ring of fire around him with a radius of 2. Is the point [4,6] caught in that ring of fire?

It’s pretty obvious from the picture that it isn’t, but if we look at the geohash, we actually can’t tell: [4,6] hashes to 00110100, but so does [4.9, 6.9], and any other value in the square between [4,6] and [5,7]. So, in order to figure out whether the point is within the circle, MongoDB must go to the document and look at the actual value in the point field. Thus, setting bits to 4 is a bit low for the data we’re using/queries we’re doing.

Generally you shouldn’t bother setting bits, I’ve only set it above for purposes of demonstration. bits defaults to 26, which gives you approximately 1 foot resolution using latitude and longitude. The higher the number of bits, the slower geohashing gets (conversely, lower bit values mean faster geohashing, but more accessing documents on lookup). If you’re doing particularly high or low resolution queries, you might want to play around with different values of bits (in dev, on a representative data set) and see if you get better performance.

Thanks to Greg Studer, who gave a geospatial tech talk last Friday and inspired this post. (Every Friday, a 10gen engineer does a tech talk on something they’re working on, which is a really nice way to keep up with all of the cool stuff coworkers are doing. If you’re ever running an engineering department, I highly recommend them!)

NoSQL vs. the world

About a year ago, Mike Dirolf drew an enormous circle covering a sheet of paper. “Here are the people who use databases,” he said, “and here are the people who have even heard of NoSQL,” and he drew a circle this big: ° .

I think that interest has grown since then. By this time, the number of people that know about NoSQL is at least this big: o .

For evidence, let’s take a look at what people are searching for. First, here’s a Google Insights chart for a bunch of popular NoSQL solutions:

MongoDB vs. Cassandra vs. CouchDB vs. Redis vs. Riak (source)

Woohoo! MongoDB seems to be leading the charge, so far. But wait, now let’s compare MongoDB (the top line above) to some other databases:

MongoDB vs. Sybase vs. PostgreSQL vs. Firebird vs. Sqlite (source)

Okay, well, we’re behind everyone, but it’s not too bad. You start to see some patterns with the relational databases that you don’t yet so much with the NoSQL databases: people are using relational databases a lot more at work (during the week) than for fun (on the weekends). In fact, MongoDB is occasionally inching above Sybase on the weekends!

How about MySQL, SQL Server, and Oracle?

MongoDB vs. MySQL vs. SQL Server vs. Oracle (source)

Sigh. Back to work, people. We have a ways to go.

A finite simple group of order two

On my way into an underpass.

Andrew and I just returned from our honeymoon in the mountains, so I am now going to wax boring about how awesome it was.

We really love rock climbing, so we went on a lot of rock scrambles. Rock scrambles involve “scrambling” around rocks, boulders, little cliffs, and crevices. We skittered down slippery sheets of rock, trying to hit the tiny outcroppings and avoid tumbling to our deaths. Then the trail markers would betray us, pointing us right into the crevices, under huge boulders, across precariously balanced natural bridges of rock, through tiny gaps where I barely fit and there was a moment of panic where I thought Andrew was going to get stuck.

At one point, we came to a particularly buggy area. The air was warm and full of tiny flies and mosquitoes. I kept flailing at them, mostly just managing to whack myself in the head. The trail hit the edge of a pit with steep walls blocking our view of everything but the path in front of us. Slipping, sliding, and slapping ourselves in the head, we made our way down.

Not as bad as it looked! (Rockin my XKCD shirt)

As we went, the air got ten degrees colder, then twenty, then thirty. Suddenly there were no bugs at all. We reached the bottom of the pit and the air was pleasantly cold. Then we saw why: there were piles of snow! The hole was completely surrounded by rock and protected from the sun, forming natural ice box.

As we scrambled up the other side of the pit, the air became warmer and the bugs resumed their assault, but for a moment there, it was like we were in an ice dragon’s lair in the heart of a hot jungle.

Also, we got room 314 (3.14), which seems like a good omen.

Finally, if you’re a geek and don’t know what the title is referring to, you should watch the video.

Simulating Network Paritions with mongobridge

Note: mongobridge does not come with the MongoDB binaries, but you can build it by getting the source code and running scons mongobridge.

Let’s say we have a replica set with members (M1, M2, and M3) and we want to see what happens when M1 and M3 cannot reach each other (or any other sort of network partition). How do we do this?

We can’t shut down M3 because that would test something different. If we block incoming connections to M3, we end up blocking M2’s connections, too, which we don’t want.

We want M3 to only block M1’s connection, not M2’s. This is where mongobridge comes in.

A replica set with three members.

mongobridge allows you to accept connections on one port and send them on to a MongoDB process listening on another port. So, for the M1↔M3 connection, we can set up a bridge. This bridge, like most bridges, needs to go two directions (M1→M3 and M3→M1), so we actually need two instances of mongobridge.

So, we’ll make the “onramps” for our bridge M1::10013 (for the M1→M3 connection) and M3:10031 (for the M3→M1 connection). To set this up, we run:

$ ssh M1
M1$ mongobridge --port 10013 --dest M3:27017
M1$ exit
$ ssh M3
M3$ mongobridge --port 10031 --dest M1:27017
M3$ exit

This means, “Take all traffic heading to M1:10013 and send it to M3:27017. Take all traffic heading to M3:10031 and send it to M1:27017.

However, M1:27017 doesn’t want to send its traffic to M1:10013, its configuration says to send M3-bound traffic to M3:27017. Same with M3:27017, it doesn’t want to send its traffic to M3:10031, it wants to send it to M1:27017. And we can’t reconfigure the set so that different members have different configurations… officially.

Unofficially, we can change each config to anything we want.

Warning: never, ever do what I’m about to do in production, this is just for trying out funky networking tricks.

Shut down M1 and M3, and start each back up without the --replSet option on a different port. This way it won’t connect to the rest of the replica set and you can edit the local.system.replset configurations to point at the bridges, instead of the other members.

> // supposing we started them on port 27018 instead of 27017...
> db = (new Mongo("M1:27018")).getDB("local")
> db.system.replset.update({}, {$set : {"" : "M1:10013"}})
> db = (new Mongo("M3:27018")).getDB("local")
> db.system.replset.update({}, {$set : {"" : "M3:10031"}})

Now, restart M1 and M3 with their proper ports and --replSet. They will connect to each other through mongobridge and our replica set will be ready to use.

To simulate a network outage, kill the mongobridges between two servers.

Suppose M1 is the primary and you have knocked out the network between M1 and M3. If you do a write on M1, M2 will sync to M1 and then M3 will sync to M2, so you can see that the write will still make it to M3.

And that’s the dramatic payoff.

Trying Out Replica Set Priorities

Respect my prioritah!

As promised in an earlier post, replica set priorities for MongoDB are now committed and will be available in 1.9.0, which should be coming out soon.

Priorities allow you to give weights to the servers, saying, “I want server X to be primary whenever possible.” Priorities can range from 0.0-100.0.

To use priorities, download the latest binaries (“Development Release (Unstable) – Nightly”) from the MongoDB site. You have to have all members of the set running 1.9.0- or higher, as older versions object strongly to priorities other than 0 and 1.

Once you have the latest code installed, start up your three servers (A, B, and C) and create the replica set:

> // on server A
> rs.initiate()
> rs.add("B")
> rs.add("C")

Suppose we want B to be the preferred primary. And C‘s a backup server, but it could be primary if we really need it. So, we can adjust the priorities:

> config = rs.conf()
> // the default priority is 1
> B = config.members[1]
> B.priority = 2
> C = config.members[2]
> C.priority = .5
> // always increment the version number when reconfiguring
> config.version++
> rs.reconfig(config)

In a few seconds, A will step down and B will take over as primary.

Protip: the actual values of the priorities don’t matter, it’s just the relative values: B > A > C. B could have a priority of 100 and C could have a priority of .00001 and the set would behave exactly the same.


(Based on coworkers+the 12 hours it’s been committed)

What if A steps down and B is behind? Won’t I lose data?

No. A will only step down if B is within 10 seconds of synced. Once A steps down, B will sync until it’s up-to-date and (only then) become master.

Okay, but I want B to be primary now. Can I force that?

Yes, now-ish. Run this on A:

> db.adminCommand({replSetStepDown : 300, force : true})

This forces A to step down immediately (for 300 seconds). B will sync to it until it is up-to-date, then become primary.

I forgot to upgrade one of my servers before setting crazy priorities and now it’s complaining. What do I do?

Shut it down and restart it with 1.9.0. Setting non-1/0 priorities won’t harm anything with earlier versions, it just won’t work.

So, when should I use votes?

Almost never! Please ignore votes (in all versions, not just 1.9.0), unless: 1) you’re trying to have a replica set with more than 7 members or 2) you want to wedge your replica set into a primary-less state.

The Scripting Language of Databases

One of the most common questions non-users ask is “Why should I use MongoDB?”

There are a bunch of fancy answers: you can scale it (webscale!), you can use it for MapReduce, you can store files in it. Those things are all true, but every database worth its salt can scale (there are MySQL clusters with tens of thousands of nodes), every new-ish database I know of supports MapReduce, and filesystems are pretty good at storing files.

I think the reason is much simpler:

MongoDB gets out of your way.

For example, a user (on IRC) asked, “How do I store a form’s data in MongoDB?” Based on the question, I assumed he was using PHP, so I pasted the following three lines:

$m = new Mongo();

“Hey, it works!” he said.

(For those of you not familiar with MongoDB (or PHP), this stores and retrieves everything in the POST request and can be run with no prior database setup.*)

So, all of the bells and whistles are nice, but I think the real benefit is the simplicity. MongoDB is the scripting language of databases: you can just get stuff done, fast.

In this spirit, 10gen’s first monthly blogging contest topic was to write about something you developed quickly using MongoDB. The entries were cool: people built really interesting applications ridiculously fast.

A screenshot of Mojology displaying syslog entries

Some of my favorites entries were:

BugRocket’s Rapid(-ish) Development

Ryan Funduk wrote about creating a bug tracker.

“Without MongoDB, I would have easily racked up over a thousand database migrations.”

The Birth of Mojology

Gergely Nagy built an open source application for viewing and doing statistics on syslog messages.

“About four hours [after installing MongoDB], I posted the first version of my mongodb destination driver to the syslog-ng mailing list.”

From 0 to 1 Million in 6 Hours

Bradley Grzesiak wrote about programming VoiceRally.

“Friday, the day after VoiceRally was written, we sent over 1.5 million WebSocket messages.”

Family Spoon and MongoDB

Tom Maiaroto writes about creating a recipe website.

“Yes, you need to be aware of “schema” and you can’t go hog wild, but you also get more forgiveness and MongoDB works with you to solve your problems. It’s very flexible. It’s not something that you need to work around, it’s something that you get to work with. Anytime that you have a situation like that as a developer, your day is going to be much more happy and productive.”

Check out the other entries, as well. It’s too bad we can only choose one to win!

This month, we’re asking people to write about an open source app using MongoDB and the prize is an iPad2!

Edited to add: some of the commenters are upset about my advice to store $_POST in MongoDB. You should not store any user input unsanitized. For people familiar with SQL, the code above does not allow a traditional injection attack with MongoDB (as it would with SQL). After the first flush of success, I told the guy to not do it this way and to go read the documentation. Inserting $_POST was a learning tool, not a solution, and I tried to make that clear over IRC, if not in this post.

Lorenz University: I can has degree?

Click on the image to see the original (full size) version in a new window. Big thanks to Wondermark for allowing people to post their comics!

Misadventures in HR (an hilarious blog about… HR) mentioned Lorenz University, a degree mill. I’d never heard of a degree mill before, so I wanted to see how legit it looked from a computer scientist’s point of view.


Every site on the internet has to register contact information the king of the internet, so you can see who’s behind a website. Anyone can look up this info by running “whois domain” on their computer. For example, here are some legit universities’ info:

$ whois
   New York University
   ITS Communications Operations Services
   7 East 12th Street, 5th Floor
   New York, NY 10003
$ whois
   Massachusetts Institute of Technology
   Cambridge, MA 02139
$ whois
   University of Florida
   Computing and Network Services
   Space Sciences Research Building
   Gainesville, FL 32611-2050

Most businesses, higher learning institutions, and pretty much any large, legitimate site has their actual address listed there. What does Lorenz U have?

$ whois
   Domains by Proxy, Inc.
   15111 N. Hayden Rd., Ste 160, PMB 353
   Scottsdale, Arizona 85260
   United States

Domains by Proxy is a service where you can pay them to keep your contact info a secret. It’s good for privacy, but it’s a bit unusual for a university.

Also, protip: most universities are not .com addresses.


At first glance, Lorenz University seem to have some good proof that they’re a valid, accredited institution:

Lorenz University holds valid accreditation from reputable accrediting agencies including IAAFOE and ACTDE. These agencies have clearly mentioned on their official websites that Lorenz University is fully approved by their evaluation committee.

But wait, I’ve never heard of the IAAFOE or the ACTDE. A quick Google search turns up the International Accreditation Association for Online Eduction and the Accreditation Council for Distance Education.

Okay, Lorenz University is accredited by someone, but let’s take a look at who.

$ whois
Registrant Name:Registration Private
Registrant Organization:Domains by Proxy, Inc.
Registrant Street2:15111 N. Hayden Rd., Ste 160, PMB 353
Registrant Street3:
Registrant City:Scottsdale
Registrant State/Province:Arizona
Registrant Postal Code:85260
Registrant Country:US

Huh, Domains by Proxy. Again.

$ whois
Registrant Name:Registration Private
Registrant Organization:Domains by Proxy, Inc.
Registrant Street2:15111 N. Hayden Rd., Ste 160, PMB 353
Registrant Street3:
Registrant City:Scottsdale
Registrant State/Province:Arizona
Registrant Postal Code:85260
Registrant Country:US

And again! What are the chances?!

Now, let’s take a closer look at these accreditation sites. I used wget to download the entirety of both sites (somehow, I had the feeling that they wouldn’t be that big). Indeed, one site was 10 files and the other was 11:

$ wget -r
$ wget -r

Looking at these files, we can see certain similarities:

$ ls
ACTDE  CSS  index.asp  index.html  PDF  robots.txt
$ ls
IAAFOE  index.asp  index.html  PDF  robots.txt
$ # is robots.txt non-trivial?
$ wc -l
$  diff -s 
Files and are identical

Also, there’s a funny “Members Login” link on the ACTDE site that—whoops—isn’t actually a link. How hard is it to make a login page that doesn’t log anyone in?


Lorenz University seems to have “accredited” themselves by creating two accreditation websites, and are trying to take advantage of people who think this will help them get a job.

What I’m really curious about is if they’ll accredit other bullshit. The accreditation sites seem to be non-interactive, and don’t have any way of taking money.

P.S. As long as I’m just picking on them… Lorenz University also bought the site, to defend against people calling them a scam. The scam site has a link, “Click here[sic] to visit the official website of Lorenz University and find out all the details about Lorenz University and the application process to get an accredited degrees.” They misspelled “university” in a link to their own site.

Edit: the

Implementing Replica Set Priorities

Replica set priorities will, very shortly, be allowed to vary between 0.0 and 100.0. The member with the highest priority that can reach a majority of the set will be elected master. (The change is done and works, but is being held up by 1.8.0… look for it after that release.) Implementing priorities was kind of an interesting problem, so I thought people might be interested in how it works. Following in the grand distributed system lit tradition I’m using the island nation of Replicos to demonstrate.

Replicos is a nation of islands that elect a primary island, called the Grand Poobah, to lead them. Each island cast a vote (or votes) and the island that gets a majority of the votes wins poobahship. If no one gets a majority (out of the total number of votes), no one becomes Grand Poobah. The islands can only communicate via carrier seagull.

Healthy configurations of islands, completely connected via carrier seagulls.

However, due to a perpetual war with the neighboring islands of Entropos, seagulls are often shot down mid-flight, distrupting communications between the Replicos islands until new seagulls can be trained.

The people of Replicos have realized that some islands are better at Poobah-ing than others. Their goal is to elect the island with the highest Poobah-ing ability that can reach a majority of the other islands. If all of the seagulls can make it to their destinations and back, electing a Poobah becomes trivial: an island sends a message saying they want to be Grand Poobah and everyone votes for them or says “I’m better at Poobah-ing, I should be Poobah.” However, it becomes tricky when you throw the Entropos Navy into the mix.

So, let’s Entropos has shot down a bunch of seagulls, leaving us with only three seagulls:

The island with .5 Poobah ability should be elected leader (the island with 1 Poobah ability can’t reach a majority of the set). But how can .5 know that it should be Poobah? It knows 1.0 exists, so theoretically it could ask the islands it can reach to ask 1.0 if it wants to be Poobah, but it’s a pain to pass messages through multiple islands (takes longer, more chances of failure, a lot more edge cases to check), so we’d like to be able to elect a Poobah using only the directly reachable islands, if possible.

One possibility might be for the islands sent a response indicating if they were connected to an island with a higher Poobah ability. In the case above, this would work (only one island is connected to an island with higher Poobah ability, so it can’t have a majority), but what about this case:

Every island, other than .5, is connected to a 1.0, but .5 should be the one elected! So, suppose we throw in a bit more information (which island of higher priority can be reached) and let the island trying to elect itself figure things out? Well, that doesn’t quite work, what if both .5 and 1.0 can reach a majority, but not the same one?

Conclusion: the Poobah-elect can’t figure this out on their own, everyone needs to work together.

Preliminaries: define an island to be Poohable if it has any aptitude for Poobah-ing and can reach a majority of the set. An island is not Poohable if it has no aptitude for Poobah-ing and/or cannot reach a majority of the set. Islands can be more or less Poohable, depending on their aptitude for Poobah-ing.

Every node knows whether or not it, itself, is Poohable: it knows its leadership abilities and if it can reach a majority of the islands. If more than one island (say islands A and B) is Poohable, then there must be at least one island that can reach both A and B [Proof at the bottom].

Let’s have each island keep a list of “possible Poobahs.” So, say we have an island A, that starts out with an empty list. If A is Poohable, it’ll add itself to the list (if it stops being Poohable, it’ll remove itself from the list). Now, whenever A communicates with another island, the other island will either say “add me to your list” or “remove me from your list,” depending on whether it is currently Poohable or not. Every other island does the same, so now each island has a list of the Poohable islands it can reach.

Now, say island X tries to elect itself master. It contacts all of the islands it can reach for votes. Each of the voting islands checks its list: if it has an entry on it that is more Poohable than X, it’ll send a veto. Otherwise X can be elected master. If you check the situations above (and any other situation) you can see that Poohability works, due to the strength of the guarantee that a Poobah must be able to reach a majority of the set.

Proof: suppose a replica set has n members and a node A can reach a majority of the set (at least ⌊n/2+1⌋) and a node B can reach a majority of the set (again, ⌊n/2+1⌋). If the sets of members A and B can reach are disjoint, then there must be ⌊n/2+1⌋+⌊n/2+1⌋ = at least n+1 members in the set. Therefore the set of nodes that A can reach and the set of nodes that B can reach are not disjoint.