Mongo Mailbag #2: Updating GridFS Files

kchodorow | February 11, 2010 | Mongo Mailbag | GridFS, MongoDB, PHP

Welcome to week two of Mongo Mailbag, where I take a question from the Mongo mailing list and answer it in more detail. If you have a question you’d like to see answered in excruciating detail, feel free to email it to me.

Is it possible (with the PHP driver) to storeBytes into GridFS (for example CSS data), and later change that data?!

I get some strange behavior when passing an existing _id value in the $extra array of MongoGridFS::storeBytes, sometimes Apache (under Windows) crashes when reloading the file, sometimes it doesn’t seem to be updated at all.

So I wonder, is it even possible to update files in GridFS?! 🙂

-Wouter

If you already understand GridFS, feel free to skip to the last section. For everyone else…

Intro to GridFS

GridFS is the standard way MongoDB drivers handle files: a protocol that lets you save an arbitrarily large file to the database. It’s not the only way, and it’s not necessarily the best way; it’s just the built-in way that all of the drivers support. This means that you can use GridFS to save a file in Ruby and then retrieve it using Perl, and vice versa.

Why would you want to store files in the database? Well, it can be handy for a number of reasons:

If you set up replication, you’ll have automatic backups of your files.
You can keep millions of files in one (logical) directory… something most filesystems either won’t allow or aren’t good at.
You can keep information associated with the file (who’s edited it, download count, description, etc.) right with the file itself.
You can easily access info from random sections of large files, another thing traditional file tools aren’t good at.
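
To make that last point a bit more concrete: GridFS stores a file as fixed-size, numbered chunks (see “How it Works” below), so finding the bytes at a given offset is just arithmetic. The sketch below is plain PHP with no driver calls; chunkRangeFor() is a hypothetical helper, and the 256KB chunk size is an assumption for illustration.

```php
<?php
// Hypothetical helper (not part of the PHP driver): given a byte offset and a
// length, work out which chunk numbers ("n") hold those bytes.
function chunkRangeFor($offset, $length, $chunkSize)
{
    if ($offset < 0 || $length < 1 || $chunkSize < 1) {
        throw new InvalidArgumentException("need offset >= 0, length >= 1, chunkSize >= 1");
    }

    $first = (int) floor($offset / $chunkSize);
    $last  = (int) floor(($offset + $length - 1) / $chunkSize);

    return array(
        'first' => $first,                        // first chunk to fetch
        'last'  => $last,                         // last chunk to fetch
        'skip'  => $offset - $first * $chunkSize, // bytes to skip inside the first chunk
    );
}

// 100 bytes starting at offset 1,000,000, assuming 256KB chunks
$range = chunkRangeFor(1000000, 100, 262144);
```

With a range like this, you would only need to query for the chunks numbered first through last instead of reading the whole file.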

There are some limitations, too:

You can’t have an arbitrary number of files per document… it’s one file, one document.
You must use a specific naming scheme for the collections involved: prefix.files and prefix.chunks (by default prefix is “fs”: fs.files and fs.chunks).

If you have complex requirements for your files (e.g., YouTube), you’d probably want to come up with your own protocol for file storage. However, for most applications, GridFS is a good solution.

How it Works

GridFS breaks large files into manageable chunks. It saves the chunks to one collection (fs.chunks) and then metadata about the file to another collection (fs.files). When you query for the file, GridFS queries the chunks collection and returns the file one piece at a time.
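
Here is a rough sketch of that split, in plain PHP with no database involved. splitIntoChunks() and joinChunks() are hypothetical helpers, not driver functions, and the documents are simplified: the field names (files_id, n, data, length, chunkSize) follow the GridFS spec, but real drivers store more metadata (upload date, md5, and so on) and wrap the data in a binary type.

```php
<?php
// Hypothetical helper: build one simplified fs.files-style document and the
// fs.chunks-style documents for a string of bytes.
function splitIntoChunks($fileId, $bytes, $chunkSize)
{
    // str_split('') behaves differently across PHP versions, so handle it here
    $pieces = ($bytes === '') ? array() : str_split($bytes, $chunkSize);

    $chunks = array();
    foreach ($pieces as $n => $piece) {
        $chunks[] = array(
            'files_id' => $fileId, // which file this chunk belongs to
            'n'        => $n,      // position of the chunk within the file
            'data'     => $piece,  // the bytes themselves
        );
    }

    $fileDoc = array(
        '_id'       => $fileId,
        'length'    => strlen($bytes),
        'chunkSize' => $chunkSize,
    );

    return array($fileDoc, $chunks);
}

// Hypothetical helper: put the file back together by reading chunks in "n" order.
function joinChunks(array $chunks)
{
    usort($chunks, function ($a, $b) {
        return $a['n'] - $b['n'];
    });

    $bytes = '';
    foreach ($chunks as $chunk) {
        $bytes .= $chunk['data'];
    }
    return $bytes;
}

// A tiny chunk size, just to make the split visible
list($fileDoc, $chunks) = splitIntoChunks('css-1', 'body{color:red}', 4);
$roundTrip = joinChunks($chunks);
```

The 15-byte string above ends up as four chunk documents, and joining them in order of n gives back the original bytes.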

Here are some common questions about GridFS:

Q: Why not just save the whole file in a single document?
A: MongoDB has a 4MB cap on document size.
Q: That’s inconvenient, why?
A: It’s an arbitrary limit, mostly to prevent bad schema design.
Q: But in this case it would be so handy!
A: Not really. Imagine you’re storing a 20GB file. Do you really want to return the whole thing at once? That means 20GB of memory will be used whenever you query for that document. Do you even have that much memory? Do you want it taken up by a single request?
Q: Well, no.
A: The nice thing about GridFS is that it streams the data back to the client, so you never need more than 4MB of memory.
Q: Now I know.
A: And knowing is half the battle.
Together: G.I. Joe!

Answer the Damn Question

Back to Wouter’s question. Changing the metadata is easy: if we wanted to add, say, a “permissions” field, we could run the following PHP code:

$files = $db->fs->files;
$files->update(array("filename" => "installer.bin"), array('$set' => array("permissions" => "555")));

// or, equivalently, from the MongoGridFS object:

$grid->update(array("filename" => "installer.bin"), array('$set' => array("permissions" => "555")));

Updating the file itself, which is what Wouter is actually asking about, is significantly more complex. If we want to update the binary data, we’ll need to reach into the chunks collection and update every document associated with the file. (Edit: unless you’re using the C# driver! See Sam Corder’s comment below.) In PHP, it would look something like:

// get the target file's chunks
$chunks = $db->fs->chunks;
$cursor = $chunks->find(array("files_id" => $fileId))->sort(array("n" => 1));

$newLength = 0;

foreach ($cursor as $chunk) {
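    // read in a string of bytes from the new version of the file
    // ($file is assumed to be a handle opened with fopen() on the new version)
    $bindata = fread($file, MongoGridFS::$chunkSize);
    $newLength += strlen($bindata);

    // put the new version's contents in this chunk
    // (the cursor returns each chunk as an array, not an object)
    $chunk["data"] = new MongoBinData($bindata);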

    // update the chunks collection with this new chunk
    $chunks->save($chunk);
}

// update the file length metadata (necessary for retrieving the file)
$db->fs->files->update(array("_id" => $fileId), array('$set' => array("length" => $newLength)));

The code above doesn’t handle a bunch of cases (what if the new file is a different number of chunks than the old one?) and anything beyond this basic scenario gets irritatingly complex. If you’re updating individual chunks you should probably just remove the GridFS file and save it again. It’ll end up taking about the same amount of time and be less error-prone.
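
To give a feel for the bookkeeping a chunk-level update would need when the chunk count changes, here is a minimal sketch in plain PHP. planChunkUpdate() is a hypothetical helper, not part of the driver; it only works out which chunk numbers would be overwritten, inserted, or removed and does not touch the database.

```php
<?php
// Hypothetical helper: compare the number of chunks the old file had with the
// chunks the new contents need, and plan what has to happen to each one.
function planChunkUpdate($oldChunkCount, $newBytes, $chunkSize)
{
    $pieces = ($newBytes === '') ? array() : str_split($newBytes, $chunkSize);

    $plan = array(
        'overwrite' => array(),            // n => data for chunks that already exist
        'insert'    => array(),            // n => data for chunks past the old end
        'remove'    => array(),            // n of old chunks that are no longer needed
        'length'    => strlen($newBytes),  // new value for the file's "length" field
    );

    foreach ($pieces as $n => $piece) {
        if ($n < $oldChunkCount) {
            $plan['overwrite'][$n] = $piece;
        } else {
            $plan['insert'][$n] = $piece;
        }
    }

    // any old chunks beyond the new end are leftovers
    for ($n = count($pieces); $n < $oldChunkCount; $n++) {
        $plan['remove'][] = $n;
    }

    return $plan;
}

// New version is shorter: the old file had 3 chunks, the new one needs 2
$shrink = planChunkUpdate(3, 'abcdefgh', 4);

// New version is longer: the old file had 1 chunk, the new one needs 3
$grow = planChunkUpdate(1, 'abcdefghij', 4);
```

Even this leaves out the rest of the metadata and what happens if something fails halfway through, which is part of why removing the file and storing it again is the simpler route.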

24 thoughts on “Mongo Mailbag #2: Updating GridFS Files”

Sam Corder says:

February 11, 2010 at 6:37 pm

Good write up on the internals. BTW didn’t they say “Yo Joe!”? That’s how I remember it. Of course I’ve just dated myself and admitted to a sordid past in one line. 🙂

Begin Shameless Plug
The C# driver exposes a read/write stream that can update a file in place. Simply get a stream of the file, seek to the spot you want to update, write your new bits and close the file. The only limitation is that you can’t add bits into the middle of the file. You can only change existing bits or append to the end of the stream.
End Shameless Plug

Wouter says:

February 11, 2010 at 10:53 pm

I’m honored my question made it to your blog :-). I was trying to store CSS files into GridFS whose contents could be changed.

Someone on the mailing list suggested to add the updated CSS file as a new file to GridFS, with the same filename, and then remove the outdated one. Much easier :-).

kristina says:

February 14, 2010 at 2:39 pm

@Sam: I added a note, that’s cool!

@Wouter: yeah, remove+insert is a lot easier than updating, at least at the moment.

Anonymous says:

February 14, 2010 at 5:46 pm

“Imagine you’re storing a 20GB file. Do you really want to return the whole thing at once? That means 20GB or memory will be used whenever you query for that document.”
FYI, there are ways for zero-copy data passing.

kristina says:

February 15, 2010 at 4:03 am

@Anonymous: I don’t understand what you mean, can you elaborate?

mauso says:

March 11, 2010 at 10:05 pm

Probably Anonymous means something like this: http://www.ibm.com/developerworks/library/j-zerocopy/

kristina says:

March 13, 2010 at 12:55 pm

That’s really cool, thanks for the link!

I don’t think it would really work with Mongo, though, because you have to parse the received data to figure out which bytes are even part of the file.

Alice Kelly says:

April 28, 2010 at 6:06 pm

I love watching GI Joe, both the cartoon series and the movie. I am hoping that they would make a sequel.

Uninstall Program says:

June 27, 2010 at 12:58 am

Thanks for your update of GridFS:)

driver update says:

July 28, 2010 at 9:43 am

Interesting article and nice blog you have too!

Myarak says:

July 19, 2011 at 4:43 am

Hi, why is there this limitation:
You can’t have an arbitrary number of files per document… it’s one file, one document.

1. Anonymous says:
  
  July 19, 2011 at 3:38 pm
  
  It’s arbitrary, it’s just the way the API was designed. I think it was supposed to be like, in a filesystem a filename points to one file (not many). However, it would be easy enough to roll your own GridFS-like API that allowed multiple files per document.
  
Pete Carr says:

November 8, 2011 at 9:28 am

+1 awesomeness for G.I. Joe Reference!
