My Life is Awesome

Andrew and I are getting married!

I can’t figure out how to say this eloquently, but: Andrew is so wonderful and I am incredibly lucky to have him. I love him so much. We love doing stuff together, talking about everything together (well, he puts up with more database talk than he’d probably strictly like), and just being with each other.

He has the cutest new haircut, too, but he’s a tough man to get in front of a camera (the picture on the right is from a year ago).

We’re getting married by the justice of the peace on our third anniversary. We already have tons of crap in our apartment and the last thing we need is more crap, so we’re asking wedding guests to skip the presents and donate to one of the following charities instead:

I am so happy.

A Short eBook on Scaling MongoDB

I just finished a little ebook for O’Reilly: Scaling MongoDB. I’m excited about it: it was really fun to write, and I think it’ll be both fun and instructive to read. It covers:

  • What a cluster is
  • How it works (at a high level)
  • How to set it up correctly
  • The differences in programming for a cluster vs. a single server
  • How to administer a cluster
  • What can go wrong and how to handle it

So, why would you want to get a copy?

  • It’s a succinct reference for anything that’s likely to come up and covers the common questions I’ve heard people ask in the last year.
  • I heard some grumbling about my post on choosing a shard key (“Ugh, a metaphor,” -an IRC user). People who don’t like metaphors will be pleased to hear that this book has a straightforward, serious-business section on choosing a shard key that not only lacks a metaphor but also goes into much more depth than the blog post.
  • It’s a quick read. There are code examples, of course, and it can be used as a reference, but after banging out 15,000 words in a couple of days, I took the next couple weeks to make them all flow together like fuckin’ Shakespeare.
  • It can be updated in real time! After MongoDB: The Definitive Guide became out-of-date approximately six seconds after we handed in the final draft, I’m delighted that new sections, advice, and features can be added to this book as needed. Once you buy the ebook, you can update to the latest “edition” whenever you want, as many times as you want. O’Reilly wants to do automatic updates, but so far the infrastructure isn’t there in traditional ebook readers, so you’ll have to update manually.

You can also get a “print on demand” copy if you’re old school.

I hope you guys will check it out and let me know what you think!

To promote the book, Meghan is forcing me to do a webcast on Friday (February 4th). It’s called How Sharding Works and it’s a combination whitepaper and Magic School Bus tour of sharding. It should cover some of the interesting stuff about sharding that didn’t really fit into the 60 pages I had to work with (or the more practical focus of the book).

Look at the teeth on that guy! (He'll bite you if you make any webscale jokes.)

Why Command Helpers Suck

This is a rant from my role as a driver developer and person who gives support on the mailing list/IRC.

Command helpers are terrible. They confuse users, result in a steeper learning curve, and make MongoDB’s interface seem arbitrary.

The basics: what are command helpers?

Command helpers are wrappers for database commands. Database commands are everything other than CRUD (create, retrieve, update, delete) that you can do with MongoDB. This includes things like dropping a collection, doing a MapReduce, adding a member to a replica set, seeing what arguments you started mongod with, and finding out if the last write operation succeeded. They’re everywhere: if you’ve used MongoDB, you’ve run a database command (even if you weren’t aware of it).

So, what are command helpers? These are wrappers around the raw command, turning something like db.adminCommand({serverStatus:1}) into db.serverStatus(). This makes them slightly quicker to type and “nicer” looking than the raw command. However, there are honey bunches of reasons that they’re a bad idea and should be avoided whenever possible.

Database helpers are unportable

Helpers are extremely unportable. If you know how to run db.serverStatus() in the shell, that’s great, but all you know is how to do it in the shell. If you know how to run the serverStatus command, you know how to get the server status in every language you’ll ever use.

Similarly, each language handles command options differently. Take a command like group: the shell helper chooses one order of options (a single argument “options”, incidentally) and the Python driver chooses another (“key”, “condition”, “initial”, “reduce”, “finalize”) and the PHP driver another (“key”, “initial”, “reduce”, “options”). If you just learn the group command itself, you can execute it in any language you please.
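
For example, here’s roughly what the raw group command looks like in the shell (a sketch: the collection and fields are made up, but the command document is the same thing every driver’s generic command method sends):

// the raw group command -- no helper, so it looks the same from any language
db.runCommand({
    group : {
        ns : "posts",
        key : {author : 1},
        cond : {},
        initial : {count : 0},
        $reduce : function(doc, prev) { prev.count++; }
    }
});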

This affects almost everyone using MongoDB, as almost everyone uses at least two languages (JavaScript and something else). I have seen hundreds of questions of the form “How do I run <shell function> using my driver?” If these users knew it was a database command (and knew what a database command was), they wouldn’t have to ask.

Database helpers lock you to a certain API, often an out-of-date one

Suppose the database changes the options for a command. All of the drivers that support helpers for that command are suddenly out-of-date. Conversely, if you have a recent version of a driver and an old version of the database, you can have helpers for features that don’t exist yet or have different options.

An example of old driver/new database: MapReduce’s options changed in version 1.7.4. As far as I know, none of the drivers support the new options, yet.
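
If you need the new options before your driver catches up, you can always send the command yourself. A sketch in the shell (the collection and functions are made up, and {inline : 1} is one of the 1.7.4-style output options; check the release notes for what your server version actually accepts):

// MapReduce as a raw command: you can pass whatever options the server
// understands, whether or not your driver's helper knows about them
db.runCommand({
    mapreduce : "events",
    map : function() { emit(this.type, 1); },
    reduce : function(key, values) {
        var total = 0;
        values.forEach(function(v) { total += v; });
        return total;
    },
    out : {inline : 1}
});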

You can’t support database helpers for everything

Next, there’s just the sheer volume of database commands, which makes it impossible to implement helpers for all of them. Everyone has their favorites: aggregation is important to some people, administration helpers are important to others, etc. If all of them had helpers, not only would there be a ridiculous number of methods polluting the API documentation, but it would lead to tons of compatibility problems between the driver and the database (as mentioned above).

Database helpers conceal what’s going on, giving users fewer options

Finally, using command helpers keeps people from understanding what’s actually going on, which is pointless and can lead to problems. It’s pointless to conceal the gory details because the details aren’t very gory: all database commands are queries. This means you can deconstruct command helpers as follows (example in PHP):

// the command helper
$db->lastError();
// is the same as
$db->command(array("getlasterror" => 1));
// is the same as
$db->selectCollection('$cmd')->findOne(array("getlasterror" => 1));
// is the same as
$db->selectCollection('$cmd')->find(array("getlasterror" => 1))->limit(1)->getNext();

Every command helper is just a find() in disguise! This means you can do (almost) anything with a database command that you could with a query.

This gives you more control. Not only can you use whatever options you want, you can do a few other things:

  • By default, drivers send all commands to the master, even if slaveOkay is set. If you want to send a command to a slave, you can deconstruct it into a query to bypass the driver’s commands-go-to-master logic (there’s a sketch of this after the list).
  • Suppose you have a command that takes a long time to execute and it times out on the client side. If you deconstruct the command into a query, you can (for some drivers) set the client-side timeout on the cursor.
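
Here’s a rough sketch of the first trick from the shell’s point of view (the server name is made up, and in the shell commands already run on whichever server you’re connected to, so treat this as a picture of the query your driver’s find() would send):

// connect straight to a slave and allow reads from it
slave = connect("some-slave:27017/admin");
slave.getMongo().setSlaveOk();
// send the command as a plain query against the magic $cmd collection
slave.getCollection("$cmd").findOne({serverStatus : 1});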

Finally, if you’re using an unfamiliar driver, you might not know what its helpers are called, but all drivers have a find() method, so you can always use that.

Exceptions

There are a couple command helpers worth implementing. I think that count and drop (at both the database and collection levels) are common enough to be worth having helpers for. Also, at a higher level (e.g., frameworks on top of the driver and admin GUIs) I think helpers are absolutely fine. However, as someone who has been maintaining a driver and supporting users for the last few years, I think that, at a driver level, command helpers are a terrible idea.

How to Use Replica Set Rollbacks

Rollin' rollin' rollin', keep that oplog rollin'

If you’re using replica sets, you can get into a situation where you have conflicting data. MongoDB will roll back conflicting data, but it never throws it out.

Let’s take an example: say you have three servers, A (an arbiter), B, and C. You set up the replica set:

$ mongo B:27017/foo
> rs.initiate()
> rs.add("C:27017")
{ "ok" : 1 }
> rs.addArb("A:27017")
{ "ok" : 1 }

Now do a couple of writes to the master (say it’s B).

> B = connect("B:27017/foo")
> B.bar.insert({_id : 1})
> B.bar.insert({_id : 2})
> B.bar.insert({_id : 3})

Then C gets disconnected (if you’re trying this out, you can just hit Ctrl-C—in real life, this might be caused by a network partition). B handles some more writes:

> B.bar.insert({_id : 4})
> B.bar.insert({_id : 5})
> B.bar.insert({_id : 6})

Now B gets disconnected. C gets reconnected and the arbiter elects it master, so it starts handling writes.

> C = connect("C:27017/foo")
> C.bar.insert({_id : 7})
> C.bar.insert({_id : 8})
> C.bar.insert({_id : 9})

But now B gets reconnected. B has data that C doesn’t have and C has data that B doesn’t have! What to do? MongoDB chooses to roll back B’s data, since it’s “further behind” (B’s latest timestamp is before C’s latest timestamp).

If we query the databases after the millisecond or so it takes to roll back, they’ll be the same:

> C.bar.find()
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
{ "_id" : 7 }
{ "_id" : 8 }
{ "_id" : 9 }
> B.bar.find()
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
{ "_id" : 7 }
{ "_id" : 8 }
{ "_id" : 9 }

Note that the data B wrote that C didn’t have is gone. However, if you look in B’s data directory, you’ll see a rollback directory:

$ ls /data/db
journal  local.0  local.1  local.ns  mongod.lock  rollback  foo.0  foo.1  foo.ns  _tmp
$ ls /data/db/rollback
foo.bar.2011-01-19T18-27-14.0.bson

If you look in the rollback directory, there will be a file for each rollback MongoDB has done. You can examine what was rolled back with the bsondump utility (comes with MongoDB):

$ bsondump foo.bar.2011-01-19T18-27-14.0.bson
{ "_id" : 4 }
{ "_id" : 5 }
{ "_id" : 6 }
Wed Jan 19 13:33:32      3 objects found

If these won’t conflict with your existing data, you can add them back to the collection with mongorestore.

$ mongorestore -d foo -c bar foo.bar.2011-01-19T18-27-14.0.bson 
connected to: 127.0.0.1
Wed Jan 19 13:36:27 foo.bar.2011-01-19T18-27-14.0.bson
Wed Jan 19 13:36:27      going into namespace [foo.bar]
Wed Jan 19 13:36:27      3 objects found

Note that you need to specify -d foo and -c bar to get it into the correct collection. If it would conflict, you could restore it into another collection and do a more delicate merge operation.

Now, if you do a find, you’ll get all of the documents:

> B.bar.find()
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 3 }
{ "_id" : 7 }
{ "_id" : 8 }
{ "_id" : 9 }
{ "_id" : 4 }
{ "_id" : 5 }
{ "_id" : 6 }

Hopefully this sort of thing can tide most people over until MongoDB supports multi-master.

How to Choose a Shard Key: The Card Game

Choosing the right shard key for a MongoDB cluster is critical: a bad choice will make you and your application miserable. Shard Carks is a cooperative strategy game to help you choose a shard key. You can try out a few pre-baked strategies I set up online (I recommend reading this post first, though). Also, this won’t make much sense unless you understand the basics of sharding.

Mapping from MongoDB to Shard Carks

This game maps the pieces of a MongoDB cluster to the game “pieces:”

  • Shard – a player.
  • Some data – a playing card. In this example, one card is ~12 hours worth of data.
  • Application server – the dealer: passes out cards (data) to the players (shards).
  • Chunk – a group of 0-4 cards defined by the range of shard key values it can contain, “owned” by a single player. Each player can have multiple chunks and pass chunks to other players.

Instructions

Before play begins, the dealer orders the deck to mimic the application being modeled. For this example, we’ll pretend we’re programming a news application, where users are mostly concerned with the latest few weeks of information. Since the data is “ascending” through time, it can be modeled by sorting the cards in ascending order: two through ace of spades, then two through ace of hearts, then diamonds, then clubs for the first deck. Multiple decks can be used to model longer periods of time.

Once the decks are prepared, the players decide on a shard key: the criteria used for chunk ranges. The shard key can be any deterministic strategy that an independent observer could calculate. Some examples: order dealt, suit, or deck number.

Gameplay

The game begins with Player 1 having a single chunk (chunk1). chunk1 has 0 cards and the shard key range [-∞, ∞).

Each turn, the dealer flips over a card, computes the value for the shard key, figures out which player has a chunk range containing that key, and hands the card to that player. Because the first card’s shard key value will obviously fall in the range [-∞, ∞), it will go to Player 1, who will add it to chunk1. The second and the third cards go to chunk1, too. When the fourth card goes to chunk1, the chunk is full (chunks can only hold up to four cards) so the player splits it into two chunks: one with a range of [-∞, midchunk1), the other with a range of [midchunk1, ∞), where midchunk1 is the midpoint shard key value for the cards in chunk1, such that two cards will end up in one chunk and two cards will end up in the other.

The dealer flips over the next card and computes the shard key’s value. If it’s in the [-∞, midchunk1) range, it will be added to that chunk. If it’s in the [midchunk1, ∞) range, it will be added to that chunk.

Splitting

Whenever a chunk gets four cards, the player splits the chunk into two 2-card chunks. If a chunk has the range [x, z), it can be split into two chunks with ranges [x, y), [y, z), where x < y < z.
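
If you prefer rules as code, here’s a toy version of that split (plain JavaScript game logic, not MongoDB internals; shardKey is whatever function the players agreed on for computing a card’s shard key value):

// split a full chunk [x, z) into [x, y) and [y, z)
function splitChunk(chunk, shardKey) {
    var cards = chunk.cards.slice().sort(function(a, b) {
        return shardKey(a) < shardKey(b) ? -1 : 1;
    });
    var y = shardKey(cards[2]);   // midpoint value: two cards below it, two at or above
    return [
        {min : chunk.min, max : y,         cards : cards.slice(0, 2)},
        {min : y,         max : chunk.max, cards : cards.slice(2)}
    ];
}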

Balancing

All of the players should have roughly the same number of chunks at all times. If, after splitting, Player A ends up with more chunks than Player B, Player A should pass one of their chunks to Player B.

Strategy

The goals of the game are for no player to be overwhelmed and for the gameplay to remain easy indefinitely, even if more players and dealers are added. For this to be possible, the players have to choose a good shard key. There are a few different strategies people usually try:

Sample Strategy 1: Let George Do It

The players huddle together and come up with a plan: they’ll choose “order dealt” as the shard key.

The dealer starts dealing out cards: 2 of spades, 3 of spades, 4 of spades, etc. This works for a bit, but all of the cards are going to one player (he has the [x, ∞) chunk, and each card’s shard key value is closer to ∞ than the preceding card’s). He’s filling up chunks and passing them to his friends like mad, but all of the incoming cards are added to this single chunk. Add a few more dealers and the situation becomes completely unmaintainable.

Ascending shard keys are equivalent to this strategy: ObjectIds, dates, timestamps, auto-incrementing primary keys.

Sample Strategy 2: More Scatter, Less Gather

When George falls over dead from exhaustion, the players regroup and realize they have to come up with a different strategy. “What if we go the other direction?” suggests one player. “Let’s have the shard key be the MD5 hash of the order dealt, so it’ll be totally random.” The players agree to give it a try.

The dealer begins calculating MD5 hashes with his calculator watch. This works great at first, at least compared to the last method. The cards are dealt at a pretty much even rate to all of the players. Unfortunately, once each player has a few dozen chunks in front of them, things start to get difficult. The dealer is handing out cards at a swift pace and the players are scrambling to find the right chunk every time the dealer hands them a card. The players realize that this strategy is just going to get more unmanageable as the number of chunks grows.

Shard keys equivalent to this strategy: MD5 hashes, UUIDs. If you shard on a random key, you lose data locality benefits.

Sample Strategy 3: Combo Plate

What the players really want is something where they can take advantage of the order (like the first strategy) and distribute load across all of the players (like the second strategy). They figure out a trick: couple a coarsely-grained order with a random element. “Let’s say everything in a given deck is ‘equal,’” one player suggests. “If all of the cards in a deck are equivalent, we’ll need a way of splitting chunks, so we’ll also use the MD5 hash as a secondary criterion.”

The dealer passes the first four cards to Player 1. Player 1 splits his chunk and passes the new chunk to Player 2. Now the cards are being evenly distributed between Player 1 and Player 2. When one of them gets a full chunk again, they split it and hand a chunk to Player 3. After a few more cards, the dealer will be evenly distributing cards among all of the players because within a given deck, the order the players are getting the cards is random. Because the cards are being split in roughly ascending order, once a deck has finished, the players can put aside those cards and know that they’ll never have to pick them up again.

This strategy manages to both distribute load evenly and take advantage of the natural order of the data.

Applying Strategy 3 to MongoDB

For many applications where the data is roughly chronological, a good shard key is:

{<coarse timestamp> : 1, <search criteria> : 1}

The coarseness of the timestamp depends on your data: you want a bunch of chunks (a chunk is 200MB) to fit into one “unit” of timestamp. So, if one month’s worth of writes is 30GB, a month is a good granularity and your shard key could start with {"month" : 1}. If one month’s worth of data is 1GB, you might want to use the year as your timestamp. If you’re getting 500GB/month, a day would work better. If you’re inserting 5000GB/sec, sub-second timestamps would qualify as “coarse.”

If you only use a coarse granularity, you’ll end up with giant unsplittable chunks. For example, say you chose the shard key {"year" : 1}. You’d get one chunk per year, because MongoDB wouldn’t be able to split chunks based on any other criteria. So you need another field to target queries and prevent chunks from getting too big. This field shouldn’t really be random, as in Strategy 3, though. It’s good to group data by the criteria you’ll be looking for it by, so a good choice might be username, log level, or email, depending on your application.
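
Putting that together, setting up such a key might look something like this in the shell (a sketch: the database, collection, and field names are made up, and I’m using the raw commands rather than any helpers):

// shard the collection on {coarse timestamp, search criterion}
db.adminCommand({enablesharding : "news"});
db.adminCommand({shardcollection : "news.stories", key : {month : 1, username : 1}});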

Warning: this pattern is not the only way to choose a shard key and it won’t work well for all applications. Spend some time monitoring, figuring out what your application’s write and read patterns are. Setting up a distributed system is hard and should not be taken lightly.

How to Use Shard Carks

If you’re going to be sharding and you’re not sure what shard key to choose, try running through a few Shard Carks scenarios with coworkers. If a certain player starts getting grouchy because they’re having to do twice the work, or everyone is flailing around trying to find the right cards, take note and rethink your strategy. Your servers will be just as grumpy and flailing, only at 3am.

If you don’t have anyone easily persuadable around, I made a little web application for people to try out the strategies mentioned above. The source is written in PHP and available on Github, so feel free to modify it. (Contribute back if you come up with something cool!)

And that’s how you choose a shard key.

Wireless dongle review

A dongle is a USB thingy (as you can see, I’m very qualified in this area) that lets you connect your computer to the internet wherever you go. It uses the same type of connection your cellphone data plan uses (3G or 4G).

A few months ago, Clear asked if they could send me a free sample dongle, as I am such a prestigious tech blogger. And I, being a sucker for free things (take note, marketers), agreed to try out their dongle. And I have to say, it’s been pretty cool having free wifi wherever I go. The good bits:

  • It is very handy, especially when traveling. Waiting for hours in cold, smelly terminals becomes much more bearable. If I traveled more, I’d definitely get my own dongle (or try to get work to get one for me).
  • I could use the dongle on multiple laptops. I was worried about this: it seems like a lot of companies grub for money by binding devices like this to a single machine, so you have to buy one for each computer you have (and who has just one computer?). It only supported Mac and Windows, though, so minor ding for that.
  • Andrew and I watched Law and Order (Netflix) using it and there was no noticeable difference in quality from our landline. I didn’t do a proper speed test, partly because I’m lazy and partly because I didn’t care. (If you know me IRL and want to do one, let me know and I’ll lend the dongle to you.)

But… there aren’t a whole lot of places I go where I don’t have free wifi already. Almost all of the coffeeshops and bookstores (and even bars) I go to already advertise free wifi. I used the dongle maybe once a week. I’ll miss it when my free trial runs out, but I won’t miss it $55-per-month-worth.

Also, I should be able to get the same sort of behavior by tethering my cellphone–if Sprint didn’t cripple their cellphones to prevent you from tethering. I actually don’t like having a phone, period, so when my contract runs out I’ll probably get a phone with just a data plan and a less douchey carrier.

So, my conclusions are: it’s super handy, but my cellphone should really be able to serve the same function. But that’s just me, and it is really cool being able to go online anywhere.

Setting Up Your Interview Toolbox

This post covers a couple “toolbox” topics that are easy to brush up on before the technical interview.

I recently read a post that drove me nuts, written by someone looking for a job. They said:

I can’t seem to crack the on-site coding interviews… [Interviews are geared towards] those who can suavely implement a linked list code library (inserting, deleting, reversing) as well as a data structure using that linked list (i.e. a stack) on a white board, no syntax errors, compilable, all error paths covered, interfaces cleanly buttoned up. Lather, rinse, repeat for binary search trees and sorting algorithms.

These are a programmer’s multiplication tables! If someone asked me “what’s 6×15?” on an interview, I wouldn’t throw my hands up and complain that I learned it 20 years ago, I’d be fucking thrilled that they had given me such a softball question.

Believe me, if you can’t figure out my basic algorithm questions, you do not want me to ask my “fun” questions.

If you’re looking for a job, I’d recommend accepting that interviewers want to see you know your multiplication tables and spend a few hours cramming if you need to. Make sure you have a basic toolbox set up in your brain, covering a couple basic categories:

  • Data structures: hashes, lists, trees – know how to implement them and the common manipulations and searches (there’s a quick example after this list).
  • Algorithms: sorts, recursion, search – simple algorithm problems. “Algorithms” covers a lot of ground, but at the very least know how to do the basic sorts (merge, quick, selection), recursion, and tree searches. They come up a lot. Also, make sure you know when to apply them (or they won’t be very useful).
  • Bit twiddling – this is mainly for C and C++ positions. I like to see if people know how to manipulate their bits (oh la la). This varies by company, though; I doubt a Web 2.0 site is going to care that you know your bit shifts backwards and forwards (or, rather, left and right).
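
To give a concrete example of the level I’m talking about: the linked-list reversal from the quote above is just a few lines in any language. Here’s one way it might look in JavaScript (a quick sketch, not a model answer):

// reverse a singly linked list; each node looks like {value : ..., next : ...}
function reverse(head) {
    var prev = null;
    while (head !== null) {
        var next = head.next;   // save the rest of the list
        head.next = prev;       // point this node backwards
        prev = head;
        head = next;
    }
    return prev;                // the new head
}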

If you are applying for a language-specific job, the interviewer will probably ask you about some specifics. A good interviewer shouldn’t try to trap you with obscure language trivia, but make sure you’re familiar with the basics. So, if you’re applying for, say, a Java position, get comfortable with java.lang, java.util, how garbage collection works, basic synchronization, and know that Strings are immutable.

Protip: when I was looking for a job, every single place I interviewed asked me about Java’s public/protected/private keywords. Nearly all of them asked about final, too.

Don’t freak out if you get up to the board and can’t remember whether it’s foo.toString() or (String)foo, or if you forget a semicolon. Any reasonable interviewer knows that it’s hard to program on a whiteboard and doesn’t expect compiler-ready code. On the other hand, if your resume says you’ve been doing C for 10 years and you allocate an array of chars as char *x[], we expect you to laugh and understand your mistake when we point it out (I know I might do something like that out of nerves, so I wouldn’t hold it against you as long as you understood the problem).

Good luck out there. Remember that, if a company brings you in for an interview, they want to hire you. Do everything you can to let them!

How I Became a Programmer

NYU's asbestos-filled math and CS building where I spent my undergrad

I started programming when I was 20. My original college plan was to major in mathematics and become a saxophonist (I didn’t feel like starving while I tried to make it as a musician).

Luckily, I had a crush on a computer science major, so I tagged along with him to a programming team meeting. Progteam blew my mind: programming was like math, only fun! Majoring in math made me feel smart and dignified, but it was never like “Wow, this is fun.” It was more like “Ow, my brain hurts, but I guess it’s building brain muscles…”

It turned out I was good at computer science, so I decided (somewhat randomly) that I was getting into MIT for grad school, dammit. I knew they’d want to see research, so I asked a professor to mentor an independent research project. Over the next year, I researched a classic optimization algorithm and wrote a paper on an algorithm I came up with to improve its performance for certain cases.

The problem was that, when the time came to apply to grad school, I wasn’t sure I wanted to go at all anymore. I had liked learning about optimization and coming up with a new algorithm, but I had hated research, in and of itself. I asked my parents for advice.

“Just apply,” they said. “Keep your options open.”

The computer science building at MIT

Grad school had been my goal for a while, so I applied to a couple of PhD programs. I half hoped that they would all reject me and make the choice easier. Of course, they all accepted me, even MIT (poor me</sarcasm>). I thought about it some more and told my parents that I still didn’t think I wanted to go.

“Just try a semester,” they said. “You can always leave if you don’t like it.”

I ended up accepting Columbia, not MIT. I had really liked every professor I met at Columbia, which I figured would give me more advisor options. Unfortunately, I continued to hate research and I was thoroughly sick of school. The next three months were the most miserable of my life.

“Just stick it out,” said my parents. “Until you get a master’s degree, at least.”

I finally put my foot down. Usually they have good advice but I realized that this was their thing, not mine. I dropped out of grad school and got a job I loved. My parents were happy that I was happy and got over the disappointment that I would never be Dr. Chodorow. I’m still at the same job and couldn’t be happier.

So, in the spirit of Thanksgiving, I’m really thankful that I lucked into discovering computer science. Math kind of sucks.

Firesheep: Internet Snooping made Easy

A demo of Firesheep, courtesy of a fellow bus rider

If you use an open wifi network, people around you can see what you’re doing. They not only can look at your accounts, but log in as you with a double click. Even if you’re non-technical (especially if you’re non-technical!) you should know how this works and how to protect your accounts. Here’s what’s happening:

When you use wireless internet you are sending information through the air from your computer to a router* somewhere. This information is like broadcasting your own little radio station: it can be picked up and seen by anyone in the area. The problem is, your radio station is broadcasting you checking email, updating your OkCupid profile, writing stupid messages to friends on Facebook… activities that you don’t want random “listeners” to know about.

To keep your radio station private, websites support encoding all of the data you send so it looks like gibberish to anyone on the outside. So, when you sign into Gmail (or Amazon or Chase) your computer turns your username and password into gibberish and sends it into the air. The website receives the message, decodes the gibberish, and says “Now that you’ve given me your credentials, I’ll assume you’re Joe Shmoe if you give me the unlikely combination of digits ‘874328972387498234’ every time you make a request.” And then most sites stop encoding anything.

So, when you post a status update to your wall, you send along “874328972387498234” as clear as day and Facebook says “Aha, it’s you. Okay, I’ll post that.”

However, remember that you’re broadcasting this on your own personal radio station. Well, someone finally built a tuner, called Firesheep. If you have Firesheep installed and you sit down in a coffeeshop (or anywhere with an open wifi network), you are logged in as everyone around you to every site the other patrons are visiting.

Important takeaways for non-geeks:

  • Don’t access any accounts you care about via a public wifi connection. There is an embarrassingly long list of sites built into Firesheep: Amazon, Cisco, Facebook, Flickr, Google, New York Times, Twitter, WordPress, Yahoo, and many others. My mom could figure out how to use Firesheep and it would take a geek ~10 minutes to add a new site.
  • This “hack” cannot be patched globally by flipping a switch. Each website needs to fix itself. It is analogous to a locksmith discovering that every lock can be unlocked by whistling at it: everyone needs to go and improve their locks, we can’t outlaw whistling.
  • There’s no easy way, other than not using your accounts, to prevent people from seeing what you’re doing. The easiest ways I can think of off the top of my head are setting up Tor or a VPN, which are beyond the abilities (or at least interest) of most non-geeks I know.
  • Gmail encodes everything, by default. Your Google account will pop up in Firesheep (see the screenshot above), but people won’t actually be able to access your email. Also, any bank or reasonably professional payment system will be secure (look for the little lock symbol in the corner of your browser or https:// in the address bar). You can log into someone’s Amazon account with Firesheep, but you can’t do any payment stuff.

The code for Firesheep is open source and available on Github. You can try it out by starting up Firefox, downloading Firesheep, going to File->Open File and selecting the file you just downloaded. You may have to select View->Sidebar->Firesheep if it doesn’t pop up automatically.

That’s it, it’s ready to start capturing data from other people on your wifi network.

* Geeks: I know it’s not necessarily a router, but most lay people know that a router is where internet comes out and it’s close enough.

Bending the Oplog to Your Will

Brains...

Part 3 of the replication internals series: three handy tricks.

This is the third post in a three-part series on replication. See also parts 1 (replication internals) and 2 (getting to know your oplog).

DIY triggers

MongoDB has a type of query that behaves like the tail -f command: it shows you new data as it’s written to a collection. This is great for the oplog, where you want to see new records as they pop up and don’t want to query over and over.

If you want this type of ongoing query, you can ask MongoDB for a tailable cursor. When this cursor gets to the end of the result set, it will hang around and wait for more elements to be added to the collection. As they’re added, the cursor will return them. If no elements are added for a while, the cursor will time out and the client has to requery if it wants more results.

Using your knowledge of the oplog’s format, you can use a tailable cursor to do a long poll for activities in a certain collection, of a certain type, at a certain time… almost any criteria you can imagine.
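
For example, here’s roughly what a do-it-yourself trigger for inserts into one collection might look like in the shell (a sketch: it assumes a replica set, where the oplog lives in local.oplog.rs, and it leans on the shell’s DBQuery.Option flags for tailable cursors):

// tail the oplog, printing each insert into foo.bar as it happens
var local = db.getSiblingDB("local");
var cursor = local.oplog.rs.find({ns : "foo.bar", op : "i"})
                 .addOption(DBQuery.Option.tailable | DBQuery.Option.awaitData);
while (cursor.hasNext()) {
    printjson(cursor.next().o);   // o is the document that was inserted
}
// once the cursor times out, hasNext() returns false and you have to requery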

Using the oplog for crash recovery

Suppose your database goes down, but you have a fairly recent backup. You could put a backup into production, but it’ll be a bit behind. You can bring it up-to-date using your oplog entries.

If you use the trigger mechanism (described above) to capture the entire oplog and send it to a non-capped collection on another server, you can then use an oplog replayer to play the oplog over your dump, bringing it as up-to-date as possible.

Pick a time pre-dump and start replaying the oplog from there. It’s okay if you’re not sure exactly when the dump was taken because the oplog is idempotent: you can apply it to your data as many times as you want and your data will end up the same.
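
For instance, if you had stashed the captured entries in a collection called oplog_backup, a crude replay loop in the shell might look like this (a sketch only: the database and collection names are made up, it leans on the applyOps command, and it does no batching or error handling, which a real replayer should):

// replay the saved entries, oldest first, on top of the restored data
// (idempotency means it's fine if some of them were already in the dump)
var backup = db.getSiblingDB("backup");
backup.oplog_backup.find().sort({ts : 1}).forEach(function(entry) {
    db.adminCommand({applyOps : [entry]});
});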

Also, warning: I haven’t tried out the oplog replayer I linked to, it’s just the first one I found. There are a few different ones out there and they’re pretty easy to write.

Creating non-replicated collections

The local database contains data that is local to a given server: it won’t be replicated anywhere. This is one reason why it holds all of the replication info.

local isn’t reserved for replication stuff: you can put your own data there, too. If you do a write in the local database and then check the oplog, you’ll notice that there’s no record of the write. The oplog doesn’t track changes to the local database, since they won’t be replicated.
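
You can see this for yourself in the shell (assuming a replica set; the collection name is made up):

var local = db.getSiblingDB("local");
local.scratch.insert({x : 1});                        // stays on this server only
local.oplog.rs.find({ns : "local.scratch"}).count();  // 0 -- the write was never logged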

And now I’m all replicated out. If you’re interested in learning more about replication, check out the core documentation on it. There’s also core documentation on tailable cursors and language-specific instructions in the driver documentation.