How MongoDB’s Journaling Works

I was working on a section on the gooey innards of journaling for The Definitive Guide, but then I realized it’s an implementation detail that most people won’t care about. However, I had all of these nice diagrams just lying around.

Good idea, Patrick!

So, how does journaling work? Your disk has your data files and your journal files, which we’ll represent like this:

When you start up mongod, it maps your data files to a shared view. Basically, the operating system says: “Okay, your data file is 2,000 bytes on disk. I’ll map that to memory addresses 1,000,000-1,002,000. So, if you read the memory at memory address 1,000,042, you’ll be getting the 42nd byte of the file.” (Also, the data won’t necessarily be loaded until you actually access that memory.)
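
For the curious, this is roughly what that mapping looks like at the system-call level. This is a hand-wavy sketch, not mongod’s actual code; the file name and size are made up:

```c
/* Sketch: map a data file the way the shared view is mapped.
 * Not mongod's actual code; the file name and size are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("mydb.0", O_RDWR);
    if (fd < 0) return 1;

    /* MAP_SHARED: the memory is backed by the file itself. */
    size_t len = 2000;
    char *shared_view = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
    if (shared_view == MAP_FAILED) return 1;

    /* Reading shared_view[42] reads the 42nd byte of the file; the
     * page isn't actually loaded from disk until first access. */
    printf("byte 42: %c\n", shared_view[42]);

    munmap(shared_view, len);
    close(fd);
    return 0;
}
```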

This memory is still backed by the file: if you make changes in memory, the operating system will flush these changes to the underlying file. This is basically how mongod works without journaling: it asks the operating system to flush in-memory changes every 60 seconds.

However, with journaling, mongod makes a second mapping, this one to a private view. Incidentally, this is why enabling journaling doubles the amount of virtual memory mongod uses.

Note that the private view is not connected to the data file, so the operating system cannot flush any changes from the private view to disk.
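
The private mapping boils down to the MAP_PRIVATE flag: writes to it are copy-on-write, so they never reach the file. A toy sketch (again, a made-up file name, not mongod’s code):

```c
/* Sketch: a second, private mapping of the same file. Writes to it
 * are copy-on-write, so the OS will never flush them to the file.
 * Not mongod's actual code; the file name is made up. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("mydb.0", O_RDWR);
    if (fd < 0) return 1;

    size_t len = 2000;
    char *private_view = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE, fd, 0);
    if (private_view == MAP_FAILED) return 1;

    /* This modifies a private copy of the page; the underlying file
     * (and the shared view) are untouched. */
    private_view[42] = 'x';

    munmap(private_view, len);
    close(fd);
    return 0;
}
```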

Now, when you do a write, mongod writes the change to the private view.

mongod will then write this change to the journal file, creating a little description of which bytes in which file changed.

The journal appends each change description it gets.
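
Conceptually, a change description is just (file, offset, bytes). Here’s a rough sketch of an append; the real journal format is more involved (group commits, headers, checksums), so treat the struct and function as purely illustrative:

```c
/* Sketch of a journal entry: "these bytes changed at this offset in
 * this data file." The real on-disk format has headers, checksums,
 * and group commits; this struct is illustrative only. */
#include <stdint.h>
#include <unistd.h>

struct change_desc {
    uint32_t file_id;  /* which data file changed */
    uint64_t offset;   /* where in the file */
    uint32_t len;      /* how many bytes, followed by the bytes */
};

/* Append one change description (and the new bytes) to the journal. */
void journal_append(int journal_fd, const struct change_desc *desc,
                    const char *new_bytes) {
    write(journal_fd, desc, sizeof(*desc));
    write(journal_fd, new_bytes, desc->len);
    fdatasync(journal_fd);  /* once this returns, the change is durable */
}
```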

At this point, the write is safe. If mongod crashes, the journal can replay the change, even though it hasn’t made it to the data file yet.

The journal will then replay this change on the shared view.

Then mongod remaps the shared view to the private view. This prevents the private view from getting too “dirty” (having too many changes from the shared view it was mapped from).
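
The remap itself is conceptually just dropping the old private mapping, which discards its accumulated copy-on-write pages, and creating a fresh one over the same file. A sketch (a made-up function, not mongod’s code):

```c
/* Sketch: remapping the private view. Unmapping throws away the
 * accumulated copy-on-write pages; the fresh MAP_PRIVATE mapping
 * starts "clean," reading through to the file's current contents. */
#include <stddef.h>
#include <sys/mman.h>

char *remap_private_view(char *private_view, size_t len, int fd) {
    munmap(private_view, len);
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
}
```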

Finally, at a glacial speed compared to everything else, the shared view will be flushed to disk. By default, mongod requests that the OS do this every 60 seconds.
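
That background flush is essentially a periodic msync() over the shared view. A sketch of the idea (mongod’s actual sync logic is more nuanced than this loop):

```c
/* Sketch: the background flush. Every 60 seconds, ask the OS to
 * write the shared view's dirty pages back to the data file. */
#include <sys/mman.h>
#include <unistd.h>

void flush_loop(char *shared_view, size_t len) {
    for (;;) {
        sleep(60);
        msync(shared_view, len, MS_SYNC);  /* flush dirty pages to disk */
    }
}
```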

And that’s how journaling works. Thanks to Richard, who gave the best explanation of this I’ve heard (Richard is going to be teaching an online course on MongoDB this fall, if you’re interested in more wisdom from the source).

52 thoughts on “How MongoDB’s Journaling Works”

    1. Good question.  The oplog is a normal collection.  It is journaled in the same way that every other collection is journaled.  If mongod is running without journaling and crashes, the oplog may be corrupt like any other collection.

      MongoDB could have been designed to use the journal instead of the oplog for replication.  However, replication was written before journaling was implemented.  This might be an option in the future, but there are some benefits to having a “human-readable” replication log.

      Does that make sense?


      1. Sorry, I know this is old but I’m still not clear on what that means… When a write request hits the primary, does that make two journal entries at the same time – one for the oplog collection and one for the intended collection? Or is the intended collection data change driven from the oplog collection? Or something else?


      2. No problem, glad people are still finding it useful! Yes, two journal entries are flushed at the same time.


      3. Thanks for the detailed explanation, Kristina. The follow-up comments are even more informative. I have attached an image from my understanding of all this. Can you let me know if this is correct?


      4. You’re welcome! The diagram is almost correct. The oplog is not a separate component: the writes to it are journaled/written to the data files at the same time the “normal” writes are. So, you can get rid of that box/arrow altogether. Also, the secondary nodes don’t get the data from the data files on disk, but from the private view.


      5. Hi!
        I learned a lot from this blog while developing a migration tool for MongoDB sharding. Thank you a lot!
        Still, I have some questions:
        1. I use iostat to monitor disk I/O on one of my shards while it receives batches of inserts. I see regular write activity every second, but no read activity at all. So how are the journal files’ changes applied to the shared view?

        2. I see high utilization from write tasks on my disk every few seconds, lasting a few seconds or 20s+, and it is not regular like the 5s flush to disk from the Linux mmap mechanism, nor like the 60s flush. (By the way, if the 5s flush happens, why is there still a 60s flush?) Can I assume that a flush is triggered after 5s, or once a specific size is reached?

        Hope for your answer, sincerely!


  1. Awesome explanation. Do you know of any articles that compare and contrast Mongo’s journaling with the journal used by a file system or another database? Obviously the basics are the same (write change descriptions sequentially to disk before actually applying them to the data files, for replay in the future). However, it would be nice to see how different datastores with different considerations solved a similar problem.


    1. Thank you! http://www.ibm.com/developerworks/linux/library/l-journaling-filesystems/ looks pretty interesting for filesystems; does anyone know any good descriptions for relational DBs out there?


    2. Once the data is written to the journal, it never changes (so once it’s written, it’s safe). The interesting thing is that the machine could go down in the middle of writing a ledger (entry) to the journal, in which case some of the ledger may be written and some may not be. Thus, each ledger has a header and footer with a checksum so that, before replaying it, mongod knows the whole thing was written correctly to disk. If the checksum doesn’t match the data or the footer is missing (or whatever), the ledger is discarded and that write is lost (and due to the append-only nature of the file, that can only happen to the final ledger).
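
      In other words, replay only trusts a ledger whose checksum verifies. A sketch of the idea, using a stand-in checksum (the real journal format and checksum algorithm differ):

      ```c
      /* Sketch: validate a ledger before replaying it. If the machine
       * died mid-append, the checksum won't match and the (final)
       * ledger is discarded. FNV-1a is a stand-in; the real journal
       * uses its own format and checksum. */
      #include <stddef.h>
      #include <stdint.h>

      uint32_t checksum(const char *data, size_t len) {
          uint32_t h = 2166136261u;  /* FNV-1a, illustrative only */
          for (size_t i = 0; i < len; i++)
              h = (h ^ (uint8_t)data[i]) * 16777619u;
          return h;
      }

      /* Returns 1 if the ledger's footer checksum matches its body. */
      int ledger_is_valid(const char *body, size_t body_len,
                          uint32_t footer_checksum) {
          return checksum(body, body_len) == footer_checksum;
      }
      ```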



      1. The fact that it’s written forever doesn’t sit well with me. I feel like if the mongod server restarts and the journal and data files are in sync, then the journal should be cleared; it seems like wasted disk space to keep both versions around. I also think that if the journal were to keep growing, and mongod had to read it after an unsafe shutdown to check that the last writes also made it to the data files, having a 20GB+ file to parse would be a pain; even if they read the file from end to start, there’s a lot of overhead to handle.


      2. The journal files are cleared once they’ve been used. MongoDB should only ever keep around a couple of journal files at a time (each journal file is at most 1GB, so you’ll never have a 20GB file). You’ll generally have one or two “active” journal files and two preallocated journal files.


  2. Hi Kristina – you mention that the shared view is flushed to disk (in the background) every 60 seconds. The journal, by default, is flushed to disk every 100ms – is that the append in the diagram, where the private view’s changes are appended to the journal file?


  3. When you issue an msync() on the shared view, how do you guarantee the on-disk data file won’t be corrupted? If it crashes in the midst of an msync(), there’s no guarantee as to what order pages got written to disk, and there may have been partial page writes. It seems to me like the journal is not going to help in these cases. In traditional databases like InnoDB, there’s the double-write buffer to guard against partial page writes.


    1. See Eliot’s answer on the MongoDB blog: http://blog.mongodb.org/post/33700094220/how-mongodbs-journaling-works#comment-684898620.  To elaborate a bit, the shared view is only flushing changes that have already been written to the journal.  Therefore, if pages are partially flushed and then the machine crashes, it doesn’t matter: the journal has the full version of those partially flushed changes.  It can just rewrite those pages on start up.


      1. Thanks for the answer. So the journal does have full versions of pages (not just the diff) – that sounds similar to how postgres does things (there’s a full version of each page in the WAL after each checkpoint).


      2. The journal doesn’t have to keep around full pages because it doesn’t really matter if the unchanged parts of the page were half-flushed: they were being rewritten to the same values they had before, so there’s no way to “corrupt” them.


      3. One clarification: my coworker Scott mentioned that you might be talking about when the log sequence number is written, which does not get updated until after the shared view sync is complete.


  4. Excellent post, much better explained than in the official documentation.

    But still, I have some questions:

    1) I assume the journal file is not mmap-ped? Is that true?

    2) Why is writing to the journal file more secure than writing to the data file? Let’s assume the disk is full, so mongod cannot append journal entries any longer. In that situation, writing to the data file may still be possible because the corresponding file was already preallocated.

    3) Is it true that journaling is only intended for single-node durability? From what I know, in a replica set the oplog is used to recover out-of-date nodes.

    TIA
    Tobias


    1. 1) Correct.

      2) As you’ll see if you try to write _a lot_ of data, write requests will block if the journal is unable to flush. So it should just block all writes if it runs out of disk space, but there’s some special code that handles running out of space that I’m not familiar with, so it might do something smarter (e.g., error out the writes). Also, just FYI, MongoDB preallocates journal files as well as data files, so you’d start seeing failures as soon as the preallocation failed.

      3) Yes.  The journal has instructions like “write byte X to offset Y in file Z.”  The oplog is more like “write document {…} to collection W.”  More human readable, but each member must be run with journaling to be crash safe.


  5. Great article!
    Now a question: is there a “private view” per connection, or is the “private view” shared between connections? I know making the private view “shared” doesn’t make sense given the names “private view” vs. “shared view,” but it’s important to understand.
    A behavior we are seeing in automated tests is that in highly concurrent read/write scenarios, even after a flush to the journal on a writer thread, another reader thread trying to fetch the same object doesn’t seem to get a fresh version until a “period” of time has passed (very short, indeed).
    Is this because “private views” are private per connection, so until data makes it to the shared view it’s not visible to the rest of the world?


    1. Thanks! The same private view is used by all connections, but that can’t be what’s causing the issue you’re seeing. Any write is immediately visible to readers as soon as it has been written (well before it has been flushed or remapped).

      Generally, the issue in this type of test is that you need to set write concern to wait for a DB response before expecting a reader to find the write.  If write concern is not set properly, the client will continue “successfully” before the DB has actually performed the write.  If you’re still having problems, asking on the mailing list might be helpful (https://groups.google.com/forum/?fromgroups=#!forum/mongodb-user).


  6. Why is the private view needed? If I just write to the shared view and msync it, what’s the difference between them?
    Thank you!


    1. The OS can write data from the shared view to disk at any time without telling MongoDB. Thus, if we just used the shared view, data could end up in the data files before being written to the journal file. That would make the journal essentially useless.


  7. I have a question on ‘remap shared view to private view to prevent private view from getting too dirty’

    According to what I understand, on a write request the data-update sequence is: private view -> journal file -> shared view -> data files. So the data in the private view should not be older than the shared view; why is remapping required? And does the remapping risk losing data?


    1. > So the data in the private view should not be older than the shared view; why is remapping required?

      Suppose you’ve just started MongoDB. The private view takes up basically no memory. Now, suppose you write a KB of data to MongoDB: the private view takes up 1 KB of memory. Now you write 23 MB, and the private view is taking up 23.001 MB of space (23 MB + 1 KB). It continues to grow, the private view using more and more memory as you write more data. When MongoDB remaps the private view, it takes up (approximately) zero space again.

      > And does the remapping risk losing data?

      No. Once the data is in the journal it is safe.


      1. Does MongoDB issue a remap (shared view to private view) only after writing all the changes in the journal to the shared view? Does it block write access to the private view while remapping is in progress?


  8. In my opinion, “remapping the shared view to the private view” is almost the same as a checkpoint in an RDBMS, is it not?


    1. No, the step where the journal appends the change description is the most similar to a checkpoint.

      The remapping is an optimization; it has nothing to do with durability.


  9. Thanks for the great post, but I’m just trying to understand the concise advantages and disadvantages of MongoDB journaling.

    Advantages:

    – All writes are safe

    – Durability

    Disadvantages:

    I’m not sure about this; is there a tradeoff in performance, especially for read operations, when using the journal?

    Can you shed some light on this as well?

    Also, when should I think about using the journal – only when I’m concerned about data consistency?

    Also, if possible, how would you answer this question?

    http://dba.stackexchange.com/questions/49956/mongodb-advantages


  10. The journal contains change records, and oplog.rs also contains change records. How do these two differ? Are the changes in the journal and the oplog used differently in recovery scenarios?


      1. The change records are the same, but in different formats. The journal is used for crash recovery and the oplog for replica sets – do they work independently of each other?


  11. Great post, thanks for this.

    In one of the comments you mentioned that the journal files can’t grow large (i.e., one or two active files and another two preallocated “next” files). Since each journal file is allocated in 1GB increments, is it safe to assume the journal files will take up at most 5-6 GB of storage per instance, irrespective of the data files’ size?


      1. As soon as mongod applies all the changes in a journal file to the data files, it deletes the old journal file and creates a new one.

