I was working on a section on the gooey innards of journaling for The Definitive Guide, but then I realized it’s an implementation detail that most people won’t care about. However, I had all of these nice diagrams just lying around.
Good idea, Patrick!
So, how does journaling work? Your disk has your data files and your journal files, which we’ll represent like this:
When you start up mongod, it maps your data files to a shared view. Basically, the operating system says: “Okay, your data file is 2,000 bytes on disk. I’ll map that to memory addresses 1,000,000-1,002,000. So, if you read memory address 1,000,042, you’ll be getting the 42nd byte of the file.” (Also, the data won’t necessarily be loaded until you actually access that memory.)
This memory is still backed by the file: if you make changes in memory, the operating system will flush these changes to the underlying file. This is basically how mongod works without journaling: it asks the operating system to flush in-memory changes every 60 seconds.
However, with journaling, mongod makes a second mapping, this one to a private view. Incidentally, this is why enabling journaling doubles the amount of virtual memory mongod uses.
Note that the private view is not connected to the data file, so the operating system cannot flush any changes from the private view to disk.
Now, when you do a write, mongod writes this to the private view.
mongod will then write this change to the journal file, creating a little description of which bytes in which file changed.
The journal appends each change description it gets.
At this point, the write is safe. If mongod crashes, the journal can replay the change, even though it hasn’t made it to the data file yet.
The journal will then replay this change on the shared view.
Then mongod remaps the shared view to the private view. This prevents the private view from getting too “dirty” (accumulating too many changed pages that diverge from the shared view it was mapped from).
Finally, at a glacial speed compared to everything else, the shared view will be flushed to disk. By default, mongod requests that the OS do this every 60 seconds.
And that’s how journaling works. Thanks to Richard, who gave the best explanation of this I’ve heard (Richard is going to be teaching an online course on MongoDB this fall, if you’re interested in more wisdom from the source).
How does the oplog fit into all this?
Good question. The oplog is a normal collection. It is journaled in the same way that every other collection is journaled. If mongod is running without journaling and crashes, the oplog may be corrupt like any other collection.
MongoDB could have been designed to use the journal instead of the oplog for replication. However, replication was written before journaling was implemented. This might be an option in the future, but there are some benefits to having a “human-readable” replication log.
Does that make sense?
Best One…!! 🙂
Sorry, I know this is old but I’m still not clear on what that means… When a write request hits the primary, does that make two journal entries at the same time – one for the oplog collection and one for the intended collection? Or is the intended collection data change driven from the oplog collection? Or something else?
No problem, glad people are still finding it useful! Yes, two journal entries are flushed at the same time.
Thanks for the detailed explanation, Kristina. The follow up comments are even more informative. I have attached an image from my understanding of all this. Can you let me know if this is correct?
You’re welcome! The diagram is almost correct. The oplog is not a separate component: the writes to it are journaled/written to the data files at the same time the “normal” writes are. So, you can get rid of that box/arrow altogether. Also, the secondary nodes don’t get the data from the data files on disk, but from the private view.
Hi!
I learned a lot from this blog while developing a migration tool for mongo sharding. Thank you a lot!
Still, I have some questions:
1. I use iostat to monitor disk I/O on one of my shards, which is receiving batches of inserts. I see regular write tasks every second but no read tasks at all. So how do the journal files apply changes to the shared view?
2. I see a high-utilization burst of write tasks to my disk every few seconds, lasting a few seconds or 20s+; it is not as regular as the 5s flush to disk from the Linux OS mmap mechanism, nor like the 60s flush. (By the way, if the 5s flush happens, why is there still a 60s flush?) Can I assume that a flush is triggered when 5s have passed or a specific size is reached?
Sincerely hoping for your answer!
Awesome explanation. Do you know of any articles that compare and contrast mongo’s journaling to a journal used by a file system or a database? Obviously the basics are the same (write change descriptions sequentially to disk before actually applying them to the data, for replay in the future). However, it would be nice to see how different datastores with different considerations solved a similar problem.
Thank you! http://www.ibm.com/developerworks/linux/library/l-journaling-filesystems/ looks pretty interesting for filesystems, does anyone know any good descriptions for relational DBs out there?
What characteristics of the journal mean that it is durable, in the sense that it doesn’t get affected by a crash? The append only nature?
Once the data is written to the journal, it never changes (so once it’s written it’s safe). The interesting thing is that the machine could go down in the middle of writing a ledger (entry) to the journal, in which case some of the ledger may be written, some may not be. Thus, each ledger has a header and footer with a checksum so that, before replaying it, mongod knows that the whole thing was written correctly to disk. If the checksum doesn’t match the data or the footer is missing (or whatever), the ledger is discarded and that write is lost (and due to the append-only nature of the file, that can only happen to the final ledger).
To me, the fact that it’s written forever doesn’t sit well. I feel like if the mongod server restarts and the journal and data files are in sync, the journal should be cleared; it seems like wasted disk space to have both versions there. I also think that if the journal were to get longer, and mongod does a read to check that the last writes of an unsafe shutdown were also written to the data file, having a 20GB+ file to parse would be a pain; even if they read the file from end to start, there’s a lot of overhead to handle.
The journal files are cleared once they’ve been used. MongoDB should only ever keep around a couple of journal files at a time (each journal file is 2GB, so you’ll never have a 20GB file). You’ll generally have one or two “active” journal files and two preallocated journal files.
Hi Kristina – you mention that the shared view is flushed to disk (in the background) every 60 seconds. The journal, by default, is flushed to disk every 100ms – is that the append in the diagram when the private view appends to the journal file?
Yes, exactly. The 100ms flush is when it takes all changes to the private view and appends them to the journal.
When you issue an msync() on the shared view, how do you guarantee the on-disk data file won’t be corrupted? If it crashes in the midst of an msync(), there’s no guarantee as to what order pages got written to disk and if there were partial page writes to disk? It seems to me like the journal is not going to help in these cases. In traditional databases like InnoDB, there’s the double-write buffer to guard against partial page writes.
See Eliot’s answer on the MongoDB blog: http://blog.mongodb.org/post/33700094220/how-mongodbs-journaling-works#comment-684898620. To elaborate a bit, the shared view is only flushing changes that have already been written to the journal. Therefore, if pages are partially flushed and then the machine crashes, it doesn’t matter: the journal has the full version of those partially flushed changes. It can just rewrite those pages on start up.
Thanks for the answer. So the journal does have full versions of pages (not just the diff) – that sounds similar to how postgres does things (there’s a full version of each page in the WAL after each checkpoint).
The journal doesn’t have to keep around full pages because it doesn’t really matter if the unchanged parts of the page were half-flushed: they were being re-written to the same values that they were before, so there’s no way to “corrupt” them.
One clarification: my coworker Scott mentioned that you might be talking about when the log sequence number is written, which does not get updated until after the shared view sync is complete.
Excellent post, much better explained than in the official documentation.
But still, I have some questions:
1) I assume the journal file is not mmap-ped? Is that true?
2) Why is writing to the journal file more secure than writing to the data file? Let’s assume the disk is full, so mongod cannot append journal entries any longer. In that situation, writing to the data file may still be possible because the corresponding file was already preallocated.
3) Is it true that journalling is only intended for single node durability? From what I know, in a replica sets the oplog is used to recover out-of-date nodes.
TIA
Tobias
1) Correct.
2) As you’ll see if you try to write _a lot_ of data, write requests will block if the journal is unable to flush. So it should just block all writes if it runs out of disk, but there’s some special code that handles running out of space that I’m not familiar with, so it might do something smarter (e.g., error out the writes). Also, just FYI, MongoDB preallocates journal files as well as data files, so you’d start seeing failures as soon as the preallocation failed.
3) Yes. The journal has instructions like “write byte X to offset Y in file Z.” The oplog is more like “write document {…} to collection W.” More human readable, but each member must be run with journaling to be crash safe.
Great article!
Now a question: is there a “private view” per connection? Or is the “private view” shared between connections? I know making the private view “shared” doesn’t make any sense given the names “private view” vs. “shared view”, but it’s important to understand.
A behavior we are seeing in automated tests is that in highly concurrent read/write scenarios, even after a flush to the journal on a writer thread, another reader thread trying to fetch the same object doesn’t seem to get a fresh version until a period of time has passed (very short indeed).
Is this because “private views” are private per connection, so until data makes it to the shared view it is not visible to the rest of the world?
Thanks! The same private view is used by all connections, but that can’t cause the issue you’re seeing. Any write is immediately viewable by readers as soon as it has been written (well before it has been flushed or remapped).
Generally, the issue in this type of test is that you need to set write concern to wait for a DB response before expecting a reader to find the write. If write concern is not set properly, the client will continue “successfully” before the DB has actually performed the write. If you’re still having problems, asking on the mailing list might be helpful (https://groups.google.com/forum/?fromgroups=#!forum/mongodb-user).
Why is the private view needed? If I just write to the shared view and msync it, what’s the difference between them?
Thank you!
The OS can write data from the shared view to disk at anytime without telling MongoDB. Thus, if we just used the shared view, data could end up in the data files before being written to the journal file. That would make the journal essentially useless.
I have a question on ‘remap shared view to private view to prevent private view from getting too dirty’
According to what I understood, on a write request, the data update sequence is: private view -> journal file -> shared view -> data files. So the data in the private view should not be older than the shared view; why is remapping required? And does the remapping risk losing data?
> So the data in the private view should not be older than the shared view; why is remapping required?
Suppose you’ve just started MongoDB. The private view takes up basically no memory. Now, suppose you write a KB of data to MongoDB. Now the private view takes up 1 KB of memory. Now you write 23 MB. Now the private view is taking up 23.001 MB of space (23 MB + 1 KB). This continues to grow, the private view using more and more memory as you write more data. When MongoDB remaps the private view, it takes up (approximately) 0 space again.
> And does the remapping have risk of losing data?
No. Once the data is in the journal it is safe.
Does MongoDB issue a remap (shared view to private view) only after writing all the changes in the journal to the shared view? Does it block write access to private view when remapping is in progress?
So, is the remap done after all the updates from journal entries have made it to the shared view?
Yes.
In my opinion, “remaps the shared view to the private view” means almost the same thing as a checkpoint in an RDBMS, does it not?
No, the step where the journal appends the change description is the most similar to a checkpoint.
The remapping is an optimization; it has nothing to do with durability.
Thanks for the great post, but I’m just trying to understand the concise advantages and disadvantages of MongoDB journaling.
As for advantages:
– All writes are safe
– Durability
As for disadvantages:
Not sure on this, but is there a performance tradeoff, especially for read operations, when using the journal?
Can you shed light on this as well?
Also, when should I think of using the journal? Only when I’m concerned about data consistency?
Also, if possible, how would you answer this question:
http://dba.stackexchange.com/questions/49956/mongodb-advantages
The advantage is writes are durable, the disadvantage is writes are slower. Journaling shouldn’t affect read speed.
The journal contains change records and oplog.rs also contains change records. How do these two differ? Are the changes in the journal and oplog used differently in recovery conditions?
See my response to http://www.kchodorow.com/blog/2012/10/04/how-mongodbs-journaling-works/#comment-753105064's third question.
The change records are the same, but in different formats. The journal is used for crash recovery and the oplog for replica sets, and they work independently from each other?
Yup.
Great post, thanks for this.
In one of the comments you’ve mentioned journal files couldn’t grow large (i.e., one or two active files and another two “next” preallocated). Since each journal file is allocated in 1GB increments, is it safe to assume journal files are going to take up at most 5-6 GB of storage space per instance, irrespective of the data files’ size?
Thanks! Yes, I’m not sure about the exact size, but it should be on that order.
Thanks for the info, great help! More helpful than the documentation on this! 🙂
As soon as mongod applies all changes in a journal file to the data files, it will delete the old journal file and create a new one.
Great article, but the pictures are no longer there. Could you re-upload them? Thanks!
Unfortunately they were lost when I migrated from self-hosted to WordPress.com! Still visible on the Wayback Machine, if you want: https://web.archive.org/web/20130415093439/https://kchodorow.com/blog/2012/10/04/how-mongodbs-journaling-works/