Got any advice?

I was interviewing a potential summer intern yesterday (hey college students, apply to be an intern at 10gen!) and at the end she asked me, “I’ve never been interviewed by a female programmer before. Do you have any advice for me, being a female in computer science?”

I had no idea what to tell her. Other than specific, non-gendered stuff like “learn how to use Linux” and general platitudes like “don’t let the bastards grind you down,” I couldn’t think of anything to say. So, what advice would you give a female CS student?

The most memorable piece of advice I got in college from a female engineer was, “You will cry at work. Try to make it to the bathroom before you start bawling,” which wasn’t terribly helpful. (And I haven’t yet, HA!)

How MongoDB’s Journaling Works

I was working on a section on the gooey innards of journaling for The Definitive Guide, but then I realized it’s an implementation detail that most people won’t care about. However, I had all of these nice diagrams just lying around.

Good idea, Patrick!

So, how does journaling work? Your disk has your data files and your journal files, which we’ll represent like this:

When you start up mongod, it maps your data files to a shared view. Basically, the operating system says: “Okay, your data file is 2,000 bytes on disk. I’ll map that to memory address 1,000,000-1,002,000. So, if you read the memory at memory address 1,000,042, you’ll be getting the 42nd byte of the file.” (Also, the data won’t necessarily be loaded until you actually access that memory.)

This memory is still backed by the file: if you make changes in memory, the operating system will flush these changes to the underlying file. This is basically how mongod works without journaling: it asks the operating system to flush in-memory changes every 60 seconds.

However, with journaling, mongod makes a second mapping, this one to a private view. Incidentally, this is why enabling journaling doubles the amount of virtual memory mongod uses.

Note that the private view is not connected to the data file, so the operating system cannot flush any changes from the private view to disk.

Now, when you do a write, mongod writes this to the private view.

mongod will then write this change to the journal file, creating a little description of which bytes in which file changed.

The journal appends each change description it gets.

At this point, the write is safe. If mongod crashes, the journal can replay the change, even though it hasn’t made it to the data file yet.

The journal will then replay this change on the shared view.

Then mongod remaps the private view to the shared view. This prevents the private view from getting too “dirty” (accumulating too many changes that differ from the shared view it was mapped from).

Finally, at a glacial speed compared to everything else, the shared view will be flushed to disk. By default, mongod requests that the OS do this every 60 seconds.

And that’s how journaling works. Thanks to Richard, who gave the best explanation of this I’ve heard (Richard is going to be teaching an online course on MongoDB this fall, if you’re interested in more wisdom from the source).

Go Get a Hot Water Bottle

If you don’t own one, go order an old-school hot water bottle. You can get one on Amazon for ~$10 and they feel amazing when you have a fever and your feet are freezing. They are also super-easy to use: just fill it up with hot tap water and they let out a nice even heat for ~8 hours. I am just so impressed with this technology. It’s like the Apple of cozy feet.

Anyway, I recommend getting one before you get sick.

––thursday #7: git-new-workdir

Often I’ll fix a bug (call it “bug A”), kick off some tests, and then get stuck. I’d like to start working on bug B, but I can’t because the tests are running and I don’t want to change the repo while they’re going. Luckily, there’s a git tool for that: git-new-workdir. It basically creates a copy of your repo somewhere else on the filesystem, with all of your local branches and commits.

git-new-workdir doesn’t actually come with git-core, but you should have a copy of the git source anyway, right?

$ git clone https://github.com/git/git.git

Copy the git-new-workdir script from contrib/workdir to somewhere on your $PATH. (There are some other gems in the contrib directory, so poke around.)

Now go back to your repository and do:

$ git-new-workdir ./ ../bug-b

This creates a directory one level up called bug-b, with a copy of your repo.

Thanks to Andrew for telling me about this.

Edited to add: Justin rightly asks, what’s the difference between this and a local clone? The difference is that git-new-workdir softlinks everything in your .git directory, so commits you make in bug-b appear in your original repository.

How to Make Your First MongoDB Commit

10gen is hiring a lot of people straight out of college, so I thought this guide would be useful.

Basically, the idea is: you have found and fixed a bug (so you’ve cloned the mongo repository, created a branch named SERVER-1234, and committed your change on it). You’ve had your fix code-reviewed (this page is only accessible to 10gen wiki accounts). Now you’re ready to submit your change, to be used and enjoyed by millions (no pressure). But how do you get it into the main repo?

Here’s the situation: there’s the main MongoDB repo on Github, which you don’t have access to (yet):

However, you can make your own copy of the repo, which you do have access to:

So, you can put your change in your repo and then ask one of the developers to merge it in, using a pull request.

That’s the 1000-foot overview. Here’s how you do it, step-by-step:

  1. Create a Github account.
  2. Go to the MongoDB repository and hit the “Fork” button.

  3. Now, if you go to https://www.github.com/yourUsername/mongo, you’ll see that you have a copy of the repository (replace yourUsername with the username you chose in step 1). You now have this setup:

  4. Add this repository as a remote locally:
    $ git remote add me git@github.com:yourUsername/mongo.git
    

    Now you have this:

  5. Now push your change from your local repo to your Github repo:
    $ git push me SERVER-1234
    
  6. Now you have to make a pull request. Visit your fork on Github and click the “Pull Request” button.

  7. This will pull up Github’s pull request interface. Make sure you have the right branch and the right commits.

  8. Hit “Send pull request” and you’re done!

A Neat C Preprocessor Trick

I’ve been looking at Clang and they define lexer tokens in a way that I thought was clever.

The challenge is: how do you keep a single list of language tokens but use them as both an enum and a list of strings?

Clang defines C token types in a file, TokenKinds.def, with all of the names of the different C language tokens (pretend C only has four tokens for now):

#ifndef TOK
#define TOK(X)
#endif

TOK(comment)
TOK(identifier)
TOK(string_literal)
TOK(char_constant)

#undef TOK

If you just #include this file, the preprocessor defines TOK(X) as “” (nothing), so the whole thing becomes an empty file.

However! When they want a declaration of all possible tokens that could be used, they make an enum of this list like this:

enum TokenKind {
#define TOK(X) X,
#include "clang/Basic/TokenKinds.def"
    NUM_TOKENS
};

Because TOK is defined when TokenKinds.def is included, the preprocessor will spit out something like:

enum TokenKind {
    comment,
    identifier,
    string_literal,
    char_constant,
    NUM_TOKENS
};

This has the nice property that you can check if a type is valid by making sure that it is less than NUM_TOKENS. But if we’re going to put the tokens into that enum, wouldn’t it be clearer just to put them there, instead of in a separate file? Maybe, but doing it this way gives them a nice way to get a string representation of the types, too. In another file, they do:

const char* const TokNames[] = {
#define TOK(X) #X,
#include "clang/Basic/TokenKinds.def"
    0
};

“#X” uses the preprocessor’s stringizing operator: it replaces X with its argument surrounded by quotes, so that turns into:

const char* const TokNames[] = {
    "comment",
    "identifier",
    "string_literal",
    "char_constant",
    0
};

Now if they have a token, they can say TokNames[token.kind] to get the string name of that token. It lets them use the token types efficiently, print them out nicely for debugging, and not have to maintain multiple lists of tokens.

Call for Schemas

The Return of the Mongoose Lemur

I just started working on MongoDB: The Definitive Guide, 2nd Edition! I’m planning to add:

  • Lots of ops info
  • Real-world schema design examples
  • Coverage of new features since 2010… so quite a few

However, I need your help on the schema design part! I want to include some real-world schemas people have used and why they worked (or didn’t). If you’re working on something 1) interesting and 2) non-confidential and you’d like to either share or get some free advice (or both), please email me (kristina at 10gen dot com) or leave a comment below. I’ll set up a little interview with you.

I am particularly looking for “cool” projects (video games, music, TV, sports), recognizable companies (Fortune 50 & HackerNews 500*), and geek elite (Linux development, research labs, robots, etc.). However, if you’re working on something you think is interesting that doesn’t fall into any of those categories, I’d love to hear about it!

* There isn’t really a HackerNews 500, I mean projects that people in the tech world recognize and think are pretty cool (DropBox, Github, etc.).

Git for Interns and New Employees

Think of commits as a trail future developers can follow. Would you like to leave a beautiful, easy-to-follow trail, or make them follow your… droppings?

My interns are leaving today 😦 and I think the most important skill they learned this summer was how to use git. However, I had a hard time finding concise references to send them about my expectations, so I’m writing this up for next year.

How to Create a Good Commit

A commit is an atomic unit of development.

You get a feeling for this as you gain experience, but commit early and often. I’ve never, ever thought that someone broke up a problem into too many commits. Ever.

That said, do not commit:

  1. Whitespace changes as part of non-whitespace commits.
  2. Debugging messages (“print(‘here!’)” and such).
  3. Commented out sections of code.
    Eew.

  4. Any type of binary (in general… if you think you have a special case, ask someone before you commit)
  5. Customer data, images you don’t own, passwords, etc. Assume that anything you commit will be included in the repo forever.
  6. Auto-generated files (any intermediate building files) and files specific to your system. If it mentions your personal directory structure, it probably shouldn’t be committed.

Point #5 deserves a little extra mention: git keeps everything, so when in doubt, don’t commit something dubious. You can always add it later. When I was new at 10gen, I found a memory leak in MongoDB and was told to commit “what was needed to reproduce it.”

I committed a 20GB database to the MongoDB repo.

One emergency surgery later and the repo was back to its svelte self. So it is possible to remove commits if you have to, but try not to commit stuff you shouldn’t. It’s extremely annoying to fix. And embarrassing.

When you’re getting ready to commit, run git gui. This is the #1 best tool I’ve found for beginners learning how to make good commits. You’ll see something that looks sort of like this:

The upper-left pane is unstaged changes and the lower right is staged changes. The big pane shows what you’ve added to and removed from the file currently selected.

Right click on a hunk to stage it, or a single line from the hunk.

Click on this icon: to stage all of the changes in a file.

Note that notes.js has moved to the staging area (if only some parts of notes.js are staged, it will show up in both the staged and unstaged areas).

Before you commit, look at each file in the staging area by clicking on its filename. Any stray hunks make it in? Whitespace changes? Remove those lines by right-clicking and unstaging.

That extra line isn’t part of this change so it shouldn’t be part of the commit.

git gui will also show you when you have trailing whitespace:

And if you have two lines that look identical, it’s probably a whitespace issue (maybe tabs vs. spaces?).

Once you’ve fixed all that, you’re ready to describe your change…

Writing a Good Commit Message

First of all, there are a couple of semantic rules for writing good commit messages:

  • One sentence
  • In the present tense
  • Capitalized
  • No period
  • Less than 80 characters

That describes the form, but just like you can have a valid program that doesn’t do anything, you can have a valid commit message that’s useless.

So what does a good commit message look like? It should clearly summarize what the change did. Not “changed X to Y” (although that’s better than just saying “Y”, which I’ve also seen) but why X had to change to Y.

Examples of good commit messages:

Show error message on failed "edit var" in shell
Very nice “added feature”-type message.
Extra restrictions on DB name characters for Windows only
Would have been nice to have a description below the commit line describing why we needed to change this for Windows, but good “changed code”-type message.
Compile when SIGPIPE is not defined
Nice “fixed bug”-type message.
Whitespace
I think this is the only case where you can get away with a 1-word commit message.

Examples of bad commit messages:

Add stuff
Doesn’t say what was added or why
Fix test, add input form, move text box
A commit should be one thought, this is three. Thus, this should probably be three commits, unless they’re all part of one thought you haven’t told us about.

And once you’ve committed…

When you inevitably mess up a commit and realize that you’ve accidentally committed a mishmash of ideas that break laws in six countries and are riddled with whitespace changes, check out my post on fixing git mistakes.

Or just go ahead and push.

Controlling Collection Distribution

Shard tagging is a new feature in MongoDB version 2.2.0. Its main use is to force writes to go to a local data center, but it can also be used to pin a collection to a shard or set of shards.

Note: to try this out, you’ll have to use 2.2.0-rc0 or greater.

To play with this feature, first you’ll need to spin up a sharded cluster:

> sharding = new ShardingTest({shards:3,chunksize:1})

This command will start up 3 shards, a config server, and a mongos. It’ll also start spewing out the logs from all the servers into stdout, so I recommend putting this shell aside and using a different one from here on in.

Start up a new shell and connect to the mongos (defaults to port 30999) and create some sharded collections and data to play with:

> // remember, different shell
> conn = new Mongo("localhost:30999")
> db = conn.getDB("villains")
>
> // shard db
> sh.enableSharding("villains")
>
> // shard collections
> sh.shardCollection("villains.joker", {jokes:1});
> sh.shardCollection("villains.two-face", {luck:1});
> sh.shardCollection("villains.poison ivy", {flora:1});
> 
> // add data
> for (var i=0; i<100000; i++) { db.joker.insert({jokes: Math.random(), count: i, time: new Date()}); }
> for (var i=0; i<100000; i++) { db["two-face"].insert({luck: Math.random(), count: i, time: new Date()}); }
> for (var i=0; i<100000; i++) { db["poison ivy"].insert({flora: Math.random(), count: i, time: new Date()}); }

Now we have 3 shards and 3 villains. If you look at where the chunks are, you should see that they’re pretty evenly spread out amongst the shards:

> use config
> db.chunks.find({ns: "villains.joker"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
> db.chunks.find({ns: "villains.two-face"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
> db.chunks.find({ns: "villains.poison ivy"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }

Or, as Harley would say, “Puddin’.”

However, villains tend to not play well with others, so we’d like to separate the collections: 1 villain per shard. Our goal:

Shard Namespace
shard0000 “villains.joker”
shard0001 “villains.two-face”
shard0002 “villains.poison ivy”

To accomplish this, we’ll use tags. A tag describes a property of a shard, any property (they’re very flexible). So, you might tag a shard as “fast” or “slow” or “east coast” or “rackspace”.

In this example, we want to mark a shard as belonging to a certain villain, so we’ll add villains’ nicknames as tags.

> sh.addShardTag("shard0000", "mr. j")
> sh.addShardTag("shard0001", "harv")
> sh.addShardTag("shard0002", "ivy")

This says, “put any chunks tagged ‘mr. j’ on shard0000.”

The second thing we have to do is to make a rule, “For all chunks created in the villains.joker collection, give them the tag ‘mr. j’.” To do this, we can use the addTagRange helper:

> sh.addTagRange("villains.joker", {jokes:MinKey}, {jokes:MaxKey}, "mr. j")

This says, “Mark every chunk in villains.joker with the ‘mr. j’ tag” (MinKey is negative infinity, MaxKey is positive infinity, so all of the chunks fall in this range).

Now let’s do the same thing for the other two collections:

> sh.addTagRange("villains.two-face", {luck:MinKey}, {luck:MaxKey}, "harv")
> sh.addTagRange("villains.poison ivy", {flora:MinKey}, {flora:MaxKey}, "ivy")

Now wait a couple of minutes (it takes a little while for it to rebalance) and then look at the chunks for these collections.

> use config
> db.chunks.find({ns: "villains.joker"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
> db.chunks.find({ns: "villains.two-face"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
> db.chunks.find({ns: "villains.poison ivy"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }

Scaling with Tags

Obviously, Two-Face isn’t very happy with this arrangement and immediately requests two servers for his data. We can move the Joker and Poison Ivy’s collections to one shard and expand Harvey’s to two by manipulating tags:

> // move Poison Ivy to shard0000
> sh.addShardTag("shard0000", "ivy")
> sh.removeShardTag("shard0002", "ivy")
>
> // expand Two-Face to shard0002
> sh.addShardTag("shard0002", "harv")

Now if you wait a couple minutes and look at the chunks, you’ll see that Two-Face’s collection is distributed across 2 shards and the other two collections are on shard0000.

> db.chunks.find({ns: "villains.poison ivy"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
> db.chunks.find({ns: "villains.two-face"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
“Bad heads, you get EBS.”

However, this still isn’t quite right for Harvey: he’d like one shard to be good and one to be bad. Let’s say we take advantage of Amazon’s new offering and replace shard0002 with SSDs. Then we divide up the traffic: send 50% of Harvey’s writes to the SSD shard and 50% to the spinning disk shard. First, we’ll add tags to the shards, describing them:

> sh.addShardTag("shard0001", "spinning")
> sh.addShardTag("shard0002", "ssd")

The value of the “luck” field is between 0 and 1, so we want to say, “If luck >= .5, send to the SSD.”

> sh.addTagRange("villains.two-face", {luck:MinKey}, {luck:.5}, "spinning")
> sh.addTagRange("villains.two-face", {luck:.5}, {luck:MaxKey}, "ssd")

Now “bad luck” docs will be written to the slow disk and “good luck” documents will be written to SSD.

As we add new servers, we can control what kind of load they get. Tagging gives operators a ton of control over what collections go where.

Finally, I wrote a small script that adds a “home” method to collections to pin them to a single tag. Example usage:

> // load the script
> load("batman.js")
> // put foo on bar
> db.foo.home("bar")
> // put baz on bar
> db.baz.home("bar")
> // move foo to bat
> db.foo.home("bat")

Enjoy!

Summer Reading Blogroll

What are some good ops blogs? Server Density does a nice weekly roundup of sys admin posts, but that’s about all I’ve found. So, anyone know any other good resources? The more basic the better.

In exchange, here are my top-10 “I’m totally doing something productive and learning something new” blogs:

Programming

Daniel Lemire’s Blog
Articles on databases and general musing on CS and higher education.
Embedded in Academia
Everything you ever wanted to know about debugging compilers.
Preshing on Programming
Bring-a-tent-length articles about advanced programming concepts.
Sutter’s Mill
C++ puzzlers.

Security

Schneier on Security
The best general security blog I’ve found.

Science!

How to Spot a Psychopath
General science Q&A, as well as justification for why every household needs 1kg of tungsten, 10,000 LEDs, and temperature-sensitive polymer.
In the Pipeline
A professional chemist’s blog. Sometimes way over my head, but generally pretty interesting.

10gen

On a less technical note, many of my coworkers write excellent blogs, here are two:

Max Schireson’s Blog
10gen’s president, writes about running a company and working at startups.
Meghan Gill’s Blog
10gen’s earliest non-technical hire, who deserves the credit for a lot of MongoDB’s success. Her blog is a really interesting and informative look at what marketing people do.

Whoops, that’s only nine. For the tenth, please leave a link to your favorite tech blog below so I can check it out!

Also, I artificially kept this list short, but there are a ton of terrific blogs I read that didn’t get a mention. If you’re a coworker or a MongoDB Master, I probably subscribe to your blog and I’m really sorry if I didn’t mention it above!