More PHP Internals: References

By request, a quick post on using PHP references in extensions.

To start, here’s an example of references in PHP we’ll be translating into C:
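
Something like this, which produces the output below:

function display($x) {
    echo "x is $x\n";
}

function not_by_ref($arg) {
    echo "called not_by_ref($arg)\n";
    $arg = 2;
}

function by_ref(&$arg) {
    echo "called by_ref($arg)\n";
    $arg = 3;
}

$x = 1;
display($x);
not_by_ref($x);
display($x);
by_ref($x);
display($x);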


This will print:

x is 1
called not_by_ref(1)
x is 1
called by_ref(1)
x is 3

If you want your C extension’s function to officially have a signature with ampersands in it, you have to declare to PHP that you want to pass in refs as arguments. Remember how we declared functions in this struct?

zend_function_entry rlyeh_functions[] = {
  PHP_FE(cthulhu, NULL)
  { NULL, NULL, NULL }
};

The second argument to PHP_FE, NULL, can optionally be an argument spec. For example, let’s say we’re implementing by_ref() in C. We would add this to php_rlyeh.c:

// the 1 indicates pass-by-reference
ZEND_BEGIN_ARG_INFO(arginfo_by_ref, 1)
ZEND_END_ARG_INFO();

zend_function_entry rlyeh_functions[] = {
  PHP_FE(cthulhu, NULL)
  PHP_FE(by_ref, arginfo_by_ref)
  { NULL, NULL, NULL }
};

PHP_FUNCTION(by_ref) {
  zval *zptr = 0;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
    return;
  }

  php_printf("called (the c version of) by_ref(%d)\n", (int)Z_LVAL_P(zptr));
  ZVAL_LONG(zptr, 3);
}

Suppose we also add not_by_ref(). This might look something like:

ZEND_BEGIN_ARG_INFO(arginfo_not_by_ref, 0)
ZEND_END_ARG_INFO();

zend_function_entry rlyeh_functions[] = {
  PHP_FE(cthulhu, NULL)
  PHP_FE(by_ref, arginfo_by_ref)
  PHP_FE(not_by_ref, arginfo_not_by_ref)
  { NULL, NULL, NULL }
};

PHP_FUNCTION(not_by_ref) {
  zval *zptr = 0;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
    return;
  }

  php_printf("called (the c version of) not_by_ref(%d)\n", (int)Z_LVAL_P(zptr));
  ZVAL_LONG(zptr, 2);
}

However, if we try running this, we’ll get:

x is 1
called (the c version of) not_by_ref(1)
x is 2
called (the c version of) by_ref(2)
x is 3

What happened? not_by_ref used our variable like a reference!

This is really weird and annoying behavior (if anyone knows why PHP does this, please comment below).

To work around it, if you want non-reference behavior, you have to manually make a copy of the argument.

Our not_by_ref() function becomes:

PHP_FUNCTION(not_by_ref) {
  zval *zptr = 0, *copy = 0;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
    return;
  }

  // make a copy                                                                                                                                                          
  MAKE_STD_ZVAL(copy);
  memcpy(copy, zptr, sizeof(zval));

  // set refcount to 1, as we're only using "copy" in this function                                                                                                         
  Z_SET_REFCOUNT_P(copy, 1);

  php_printf("called (the c version of) not_by_ref(%d)\n", (int)Z_LVAL_P(copy));
  ZVAL_LONG(copy, 2);

  zval_ptr_dtor(&copy);
}

Note that we set the refcount of copy to 1. This is because the refcount for zptr is 2: 1 ref from the calling function + 1 ref from the not_by_ref function. However, we don’t want the copy of zptr to have a refcount of 2, because it’s only being used by the current function.

Also note that memcpy-ing the zval only works because this is a scalar: if this were an array or object, we’d have to use PHP API functions to make a deep copy of the original.

If we run our PHP program again, it gives us:

x is 1
called (the c version of) not_by_ref(1)
x is 1
called (the c version of) by_ref(1)
x is 3

Okay, this is pretty good… but we’re actually missing a case. What happens if we pass in a reference to not_by_ref()? In PHP, this looks like:

function not_by_ref($arg) {
   $arg = 2;
}

$x = 1;
not_by_ref(&$x);
display($x);

…which displays “x is 2”. Unfortunately, we’ve overridden this behavior in our not_by_ref() C function, so we have to special-case it: if the argument is a reference, change its value; otherwise, make a copy and change the copy’s value.

PHP_FUNCTION(not_by_ref) {
  zval *zptr = 0, *copy = 0;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z", &zptr) == FAILURE) {
    return;
  }

  // NEW CODE
  if (Z_ISREF_P(zptr)) {
    // if this is a reference, make copy point to zptr
    copy = zptr;

    // adding a reference so we can indiscriminately delete copy later
    zval_add_ref(&zptr);
  }
  // OLD CODE
  else {
    // make a copy                                                                                                                                  
    MAKE_STD_ZVAL(copy);
    memcpy(copy, zptr, sizeof(zval));

    // set refcount to 1, as we're only using "copy" in this function                                                                                                       
    Z_SET_REFCOUNT_P(copy, 1);
  }

  php_printf("called (the c version of) not_by_ref(%d)\n", (int)Z_LVAL_P(copy));
  ZVAL_LONG(copy, 2);

  zval_ptr_dtor(&copy);
}

Now it’ll behave “properly.”

There may be a better way to do this, please leave a comment if you know of one. However, as far as I know, this is the only way to emulate the PHP reference behavior.

If you would like to read more about PHP references, Derick Rethans wrote a great article on it for PHP Architect.

Mongo Mailbag #2: Updating GridFS Files

Welcome to week two of Mongo Mailbag, where I take a question from the Mongo mailing list and answer it in more detail. If you have a question you’d like to see answered in excruciating detail, feel free to email it to me.

Is it possible (with the PHP driver) to storeBytes into GridFS (for example CSS data), and later change that data?!

I get some strange behavior when passing an existing _id value in the $extra array of MongoGridFS::storeBytes, sometimes Apache (under Windows) crashes when reloading the file, sometimes it doesn’t seem to be updated at all.

So I wonder, is it even possible to update files in GridFS?! 🙂

-Wouter

If you already understand GridFS, feel free to skip to the last section. For everyone else…

Intro to GridFS

GridFS is the standard way MongoDB drivers handle files: a protocol that allows you to save an arbitrarily large file to the database. It’s not the only way, it’s not the best way (necessarily), it’s just the built-in way that all of the drivers support. This means that you can use GridFS to save a file in Ruby and then retrieve it using Perl and vice versa.

Why would you want to store files in the database? Well, it can be handy for a number of reasons:

  • If you set up replication, you’ll have automatic backups of your files.
  • You can keep millions of files in one (logical) directory… something most filesystems either won’t allow or aren’t good at.
  • You can keep information associated with the file (who’s edited it, download count, description, etc.) right with the file itself.
  • You can easily access info from random sections of large files, another thing traditional file tools aren’t good at.

There are some limitations, too:

  • You can’t have an arbitrary number of files per document… it’s one file, one document.
  • You must use a specific naming scheme for the collections involved: prefix.files and prefix.chunks (by default prefix is “fs”: fs.files and fs.chunks).

If you have complex requirements for your files (e.g., YouTube), you’d probably want to come up with your own protocol for file storage. However, for most applications, GridFS is a good solution.

How it Works

GridFS breaks large files into manageable chunks. It saves the chunks to one collection (fs.chunks) and then metadata about the file to another collection (fs.files). When you query for the file, GridFS queries the chunks collection and returns the file one piece at a time.
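
To make this concrete, here’s roughly what storing and fetching a file looks like from the PHP driver (the paths and the “foo” database name are made up):

$m = new Mongo();
$grid = $m->foo->getGridFS();   // uses the default prefix, so fs.files and fs.chunks

// store: the file is split into chunks and a metadata document is written to fs.files
$id = $grid->storeFile("/path/to/installer.bin", array("filename" => "installer.bin"));

// fetch: findOne returns a MongoGridFSFile that reads the chunks back for you
$file = $grid->findOne(array("filename" => "installer.bin"));
file_put_contents("/tmp/installer-copy.bin", $file->getBytes());

// the metadata and the chunks live in ordinary collections you can query directly
var_dump($m->foo->fs->files->findOne(array("_id" => $id)));
echo $m->foo->fs->chunks->count(array("files_id" => $id)) . " chunks\n";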

Here are some common questions about GridFS:

Q: Why not just save the whole file in a single document?
A: MongoDB has a 4MB cap on document size.
Q: That’s inconvenient, why?
A: It’s an arbitrary limit, mostly to prevent bad schema design.
Q: But in this case it would be so handy!
A: Not really. Imagine you’re storing a 20GB file. Do you really want to return the whole thing at once? That means 20GB of memory will be used whenever you query for that document. Do you even have that much memory? Do you want it taken up by a single request?
Q: Well, no.
A: The nice thing about GridFS is that it streams the data back to the client, so you never need more than 4MB of memory.
Q: Now I know.
A: And knowing is half the battle.
Together: G.I. Joe!

Answer the Damn Question

Back to Wouter’s question: changing the metadata is easy: if we wanted to add, say, a “permissions” field, we could run the following PHP code:

$files = $db->fs->files;
$files->update(array("filename" => "installer.bin"), array('$set' => array("permissions" => "555")));

// or, equivalently, from the MongoGridFS object:

$grid->update(array("filename" => "installer.bin"), array('$set' => array("permissions" => "555")));

Updating the file itself, what Wouter is actually asking about, is significantly more complex. If we want to update the binary data, we’ll need to reach into the chunks collection and update every document associated with the file. Edit: Unless you’re using the C# driver! See Sam Corder’s comment below. It would look something like:

// get the target file's chunks, in order
$chunks = $db->fs->chunks;
$cursor = $chunks->find(array("files_id" => $fileId))->sort(array("n" => 1));

// $file is an open read handle to the new version of the file, e.g.:
// $file = fopen("/path/to/new-installer.bin", "rb");

$newLength = 0;

foreach ($cursor as $chunk) {
    // read in a string of bytes from the new version of the file
    $bindata = fread($file, MongoGridFS::$chunkSize);
    $newLength += strlen($bindata);

    // put the new version's contents in this chunk
    $chunk['data'] = new MongoBinData($bindata);

    // update the chunks collection with this new chunk
    $chunks->save($chunk);
}

// update the file length metadata (necessary for retrieving the file)
$db->fs->files->update(array("_id" => $fileId), array('$set' => array("length" => $newLength)));

The code above doesn’t handle a bunch of cases (what if the new file is a different number of chunks than the old one?) and anything beyond this basic scenario gets irritatingly complex. If you’re updating individual chunks you should probably just remove the GridFS file and save it again. It’ll end up taking about the same amount of time and be less error-prone.
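
In other words, something like this (reusing $grid and $fileId from above, with a made-up path for the new version):

// throw away the old file document and all of its chunks
$grid->delete($fileId);

// store the new version from disk, keeping the same filename
$newId = $grid->storeFile("/path/to/new-installer.bin", array("filename" => "installer.bin"));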

Mongo Mailbag: Master/Slave Configuration

Trying something new: each week, I’ll take an interesting question from the MongoDB mailing list and answer it in more depth. Some of the replies on the list are a bit short, given that the developers are trying to, you know, develop (as well as answer over a thousand questions a month). So, I’m going to grab some interesting ones and flesh things out a bit more.

Hi all,

Assume I have a Mongo master and 2 mongo slaves.  Using PHP, how do I do it so that writes goes to the master while reads are spread across the slaves (+maybe the master)?

1) 1 connect to all 3 nodes in one go, PHP/Mongo handles all the rest
2) 1 connect to the master for writes. Another connection to connect to all slave nodes and read from them.

Thanks all and sorry for the noobiness!

-Mr. Google

Basics first: what is master/slave?

One database server (the “master”) is in charge and can do anything. A bunch of other database servers keep copies of all the data that’s been written to the master and can optionally be queried (these are the “slaves”). Slaves cannot be written to directly, they are just copies of the master database. Setting up a master and slaves allows you to scale reads nicely because you can just keep adding slaves to increase your read capacity. Slaves also make great backup machines. If your master explodes, you’ll have a copy of your data safe and sound on the slave.

A handy-dandy comparison chart between master database servers and slave database servers:

              Master                               Slave
# of servers  1                                    ∞
permissions   read/write                           read
used for      queries, inserts, updates, removes   queries

So, how do you set up Mongo in a master/slave configuration? Assuming you’ve downloaded MongoDB from mongodb.org, you can start a master and slave by cutting and pasting the following lines into your shell:

$ mkdir -p ~/dbs/master ~/dbs/slave
$ ./mongod --master --dbpath ~/dbs/master >> ~/dbs/master.log &
$ ./mongod --slave --port 27018 --dbpath ~/dbs/slave --source localhost:27017 >> ~/dbs/slave.log &

(I’m assuming you’re running *NIX. The commands for Windows are similar, but I don’t want to encourage that sort of thing).

What are these lines doing?

  1. First, we’re making directories to keep the database in (~/dbs/master and ~/dbs/slave).
  2. Now we start the master, specifying that it should put its files in the ~/dbs/master directory and its log in the ~/dbs/master.log file.  So, now we have a master running on localhost:27017.
  3. Next, we start the slave. It needs to listen on a different port than the master since they’re on the same machine, so we’ll choose 27018. It will store its files in ~/dbs/slave and its logs in ~/dbs/slave.log. The most important part is letting it know who’s boss: the --source localhost:27017 option lets it know that the master it should be reading from is at localhost:27017.

There are tons of possible master/slave configurations. Some examples:

  • You could have a dozen slave boxes where you want to distribute the reads evenly across them all.
  • You might have one wimpy little slave machine that you don’t want any reads to go to, you just use it for backup.
  • You might have the most powerful server in the world as your master machine and you want it to handle both reads and writes… unless you’re getting more than 1,000 requests per second, in which case you want some of them to spill over to your slaves.

In short, Mongo can’t automatically configure your application to take advantage of your master-slave setup. Sorry. You’ll have to do this yourself. (Edit: the Python driver actually does handle case 1 for you, see Mike’s comment.)

However, it’s not too complicated, especially for what MG wants to do. MG is using 3 servers: a master and two slaves, so we need three connections: one to the master and one to each slave. Assuming he’s got the master at master.example.com and the slaves at slave1.example.com and slave2.example.com, he can create the connections with:

$master = new Mongo("master.example.com:27017");
$slave1 = new Mongo("slave1.example.com:27017");
$slave2 = new Mongo("slave2.example.com:27017");

This next bit is a little nasty and it would be cool if someone made a framework to do it (hint hint). What we want to do is abstract the master-slave logic into a separate layer, so the application talks to the master-slave layer, which talks to the driver. I’m lazy, though, so I’ll just extend the MongoCollection class and add some master-slave logic. Then, if a person creates a MongoMSCollection from their $master connection, they can add their slaves and use the collection as though it were a normal MongoCollection. Meanwhile, MongoMSCollection will evenly distribute reads amongst the slaves.

class MongoMSCollection extends MongoCollection {
    public $currentSlave = -1;

    // call this once to initialize the slaves
    public function addSlaves($slaves) {
        // extract the namespace for this collection: db name and collection name
        $db = $this->db->__toString();
        $c = $this->getName();

        // create an array of MongoCollections from the slave connections
        $this->slaves = array();
        foreach ($slaves as $slave) {
            $this->slaves[] = $slave->$db->$c;
        }

        $this->numSlaves = count($this->slaves);
    }

    public function find($query = array(), $fields = array()) {
        // get the next slave in the array
        $this->currentSlave = ($this->currentSlave+1) % $this->numSlaves;

        // use a slave connection to do the query
        return $this->slaves[$this->currentSlave]->find($query, $fields);
    }
}

To use this class, we instantiate it with the master database and then add an array of slaves to it:

$master = new Mongo("master.example.com:27017");
$slaves = array(new Mongo("slave1.example.com:27017"), new Mongo("slave2.example.com:27017"));

$c = new MongoMSCollection($master->foo, "bar");
$c->addSlaves($slaves);

Now we can use $c like a normal MongoCollection.  MongoMSCollection::find will alternate between the two slaves and all of the other operations (inserts, updates, and removes) will be done on the master.  If MG wants to have the master handle reads, too, he can just add it to the $slaves array (which might be better named the $reader array, now):

$slaves = array($master, new Mongo("slave1.example.com:27017"), new Mongo("slave2.example.com:27017"));

Alternatively, he could change the logic in the MongoMSCollection::find method.
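
For example, a version of find() that sends some of the reads to the master might look like this (just a sketch; the one-in-three split is arbitrary):

    public function find($query = array(), $fields = array()) {
        // send roughly one read in three to the master itself
        if (rand(0, 2) == 0) {
            // parent::find() is plain MongoCollection::find, which queries the master
            return parent::find($query, $fields);
        }

        // otherwise, round-robin across the slaves as before
        $this->currentSlave = ($this->currentSlave+1) % $this->numSlaves;
        return $this->slaves[$this->currentSlave]->find($query, $fields);
    }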

Edit: as of version 1.4.0, slaveOkay is not necessary for reading from slaves. slaveOkay should be used if you are using replica sets, not --master and --slave. Thus, the next section doesn’t really apply anymore to normal master/slave.

The only tricky thing about Mongo’s implementation of master/slave is that, by default, a slave isn’t even readable, it’s just a way of doing backup for the master database. If you actually want to read off of a slave, you have to set a flag on your query, called “slaveOkay”. Instead of saying:

$cursor = $slave->foo->bar->find();

we have:

$cursor = $slave->foo->bar->find()->slaveOkay();

Or, because this is a pain in the ass to set for every query (and almost impossible to do for findOnes unless you know the internals) you can set a static variable on MongoCursor that will hold for all of your queries:

MongoCursor::$slaveOkay = true;

And now you will be allowed to query your slave normally, without calling slaveOkay() on each cursor.

PHP Extension Wiki

I started a wiki on this site (http://www.kchodorow.com/php) to write down all the stuff I learn about writing PHP extensions. If anyone else has experience with them, feel free to add or edit articles.

Some basics: a PHP extension is written in C. In fact, PHP itself is written in C, so there’s a lot of good source code to look at out there. There’s an excellent introduction to writing PHP extensions at Zend DevZone. However, it doesn’t go into a lot of the specifics, which is why I started the wiki. I had to figure out how to do a ton of stuff on my own, mostly by digging through the PHP source code and other extensions’ source code. No one should have to look through 500 undocumented C files to figure out how to create a PHP class in C. (However, if you like digging through source code, it’s all available to view on the web. Extensions are under pecl and PHP source is available under php-src.)

I feel like I have a pretty good handle on how to do almost anything with PHP in C, so if anyone has any questions or suggestions for an article, feel free to ask and I’ll try to write a page on it.

Upcoming pages I’m planning on writing:
- Throwing exceptions
- How to extend/implement other classes
- Using HashTable

Got Mongo Working on Hostmonster!

This was written in April of 2009. It is very out of date. See http://rcrisman.net/article/11/installing-mongodb-on-hostmonster-bluehost-accounts for more up-to-date information (as of August 2010). Keep in mind that shared hosting with Hostmonster is very lame. They only let you run a program for 5 minutes before killing it, so it’s fairly useless to install MongoDB unless you have a dedicated IP.

I finally got MongoDB working on this site, so I’m going to start switching stuff over from MySQL. I’m biased, but I think it’s just an easier database to use.

And, because I like writing tutorials… How I did it:

  1. Downloaded the binary I created of MongoDB for “legacy” Linux. I originally compiled this for a user on Mandriva 2006 (see previous post about VMWare), but it works fine for other old Linux distros, too.
  2. Run:
    $ tar zxvf mongodb-linux-i686-old-linux-1.tgz
  3. Make a directory for the database to put files in:
    $ mkdir /home/user/data
  4. Upload libjava.so, libjvm.so, and libverify.so. Make sure they have execute permissions and put them somewhere like /home/user/lib.
  5. Run:
    $ export LD_LIBRARY_PATH=/home/user/lib

    replacing the path wherever you put the .so’s above.

  6. Start the database:
    $ cd mongodb-linux-i686-old-linux-1
    $ bin/mongod --dbpath /home/user/data --nojni run
            

I cheated a bit and didn’t install Java, so I had to use the --nojni option. If you install Java, you won’t need that (and you won’t need to upload the individual .so files).

Now, what good is a database if you can’t use it, right? So, I downloaded my PHP driver (go to its Github repository and click “Download” for the latest version). I then followed the install instructions and put the .so generated by make in /home/user/extensions.

I changed the options under “PHP Config” in Hostmonster’s CPanel to use php.ini in /home/user/public_html/php.ini, and then edited that file to use my extension.
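
The relevant php.ini lines end up being something like this (the extension directory is wherever you put the .so above):

; load the MongoDB driver from the uploaded extensions directory
extension_dir = "/home/user/extensions"
extension = mongo.so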

I made a simple test page with something like:
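
<?php

// a minimal sketch: connect to the local MongoDB server and print the connection string
$m = new Mongo();
echo $m;

?>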


Which connected me to MongoDB, showing:

localhost:27017

when I loaded the page!

phpdoc hell

I’ve been fighting with phpdoc for about a week now, trying to figure out how to document my extension, which is a combination of PHP and C code. I finally figured it out, and since I haven’t seen this documented anywhere, I figured I’d reproduce the steps here:

Download phpdoc from CVS.  No!  Not PEAR, you fool!  CVS!  For some reason, the package you download with PEAR and the package you get from CVS are completely different. Luckily, PHP has some instructions on how to download the package with CVS.

For my lovely Javadoc-style PHP-code comments, phpdoc can be used to convert them to XML:

$ phpdoc -d path/to/php/srcs -t output -o XML:DocBook/peardoc2:default

Cool!  Now you have your PHP documentation converted. Now for the C code. CVS phpdoc (and only this phpdoc, not PEAR phpdoc) has a script in scripts/docgen called docgen.php.  You use this as follows:

$ php docgen.php --output tmp --extension extname

That gives you an outline to fill in for your C documentation. Now you just need to mush them together and generate the manual, to see what it looks like. This is where I got stuck. Here’s what I figured out: in phpdoc/manual.xml.in, add 1 line:

&reference.mongo.book;

…except replace “mongo” with whatever your extension’s name is.

Put your xml documentation in the directory phpdoc/en/reference/mongo. In phpdoc/en/reference/ add a file called entities.mongo.xml, and use the other entities files as your guide for its contents.

To actually generate the documentation, go to phpdoc/ and run:

$ php configure.php --with-partial=book.mongo
$ phd -d /path/to/phpdoc/.manual.book.mongo.xml

Et voila.  Point your browser to file:///path/to/phpdoc/html/ref.mongo.html and you can see your documentation.

Pain in my CVS

This is pretty geeky, so sorry non-technical reader. There’s a glossary at the bottom if you’d like to follow along.

I’ve been developing a PHP database driver for work, and this week I proposed it as a new PECL (pronounced “pickle”) package. Unfortunately, they use CVS for their packages. I’m used to Git.

So, I created a cvsroot/ directory and imported my driver to it. After a couple tries, I figured that out. Then I was confused: all my files were suddenly named file.txt,v instead of file.txt. Then I realized that this was my master repository, and I had to check out code from there. So, I made a cvsstuff/ directory and did

$ cd cvsstuff
$ cvs co mongo

And it did the right thing! Cool.

So now I had to do it with php.net’s remote repository. I tried checking out a couple packages to get a feel for the syntax. Then I figured out my import command, triple checked it, and was about to press enter when… hmm, where was it getting the path to the directory I was importing? I was in my home directory, so… oh, crap. So, I almost uploaded my home directory to a public repository on php.net (for Windows users, this is roughly equivalent to uploading My Documents). My hand was hovering over the enter key, but I didn’t!

I successfully uploaded my driver, and all was well.

Non-technical person explanation:

PHP
you’ve heard of it, maybe? It’s a programming language, like C++ or Java, except it’s usually used for making webpages.
Database
information storage, usually can be pictured as a bunch of tables. MySQL is the most famous, my company’s is called Mongo (www.mongodb.org).
Database driver
when you create a database, like our company did, you want everyone, regardless of what language they program in, to be able to use it. So you write drivers, which translate a programming language to database-speak.
PECL package
PHP has a system set up so, if someone writes a useful program in PHP, it’s easy for you, halfway around the world, to download it and use it. It’s called PEAR, and it lets you type:

$ pear install cool-package

And then you have cool-package installed on your system, too. All the packages are open source.

CVS/Git
When you’re working on a programming project, often you’re like, “I’ll just fix this little thing here.” And then when it’s fixed, suddenly no one can log in and you can’t remember exactly what you changed and your company crashes and burns. And that is where version control software comes in. Whenever someone makes a change, they use a program like CVS to say “I’ve changed lines 4, 6, and 23 of MyProg.java.” Then they send this change to the central computer, which has the master code and updates it with the person’s change. If the change turns out to break stuff, there’s a record of exactly what changed and you can revert the code back to its original state. CVS is the grand-daddy of all version control; Git is another, more recent version control system.