Disk IO is slow. You just won’t believe how vastly, hugely, mind-bogglingly slow it is. I mean, you may think your network is slow, but that’s just peanuts to disk IO.
(Image: a visualization of relative latencies, originally found on Hacker News and inspired by Gustavo Duarte’s blog.)
The kernel knows how slow the disk is and tries to be smart about accessing it. It not only reads the data you requested, it also reads a bit extra. This way, if you’re reading through a file or watching a movie (sequential access), your system doesn’t have to go to disk as frequently, because each read pulls back more data than you strictly requested.
You can see how far the kernel reads ahead using the blockdev tool:
$ sudo blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0     80026361856   /dev/sda
rw   256   512  4096       2048     80025223168   /dev/sda1
rw   256   512  4096          0   2000398934016   /dev/sdb
rw   256   512  1024       2048        98566144   /dev/sdb1
rw   256   512  4096     194560      7999586304   /dev/sdb2
rw   256   512  4096   15818752     19999490048   /dev/sdb3
rw   256   512  4096   54880256   1972300152832   /dev/sdb4
Readahead is listed in the “RA” column. As you can see, I have two disks (sda and sdb) with readahead set to 256 on each. But what unit is that 256? Bytes? Kilobytes? Dolphins? If we look at the man page for blockdev, it says:
$ man blockdev
...
       --setra N
              Set readahead to N 512-byte sectors.
...
This means that my readahead is 256 × 512 bytes = 131,072 bytes, or 128KB. So whenever I read from disk, the kernel is actually reading at least 128KB of data, even if I only requested a few bytes.
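The sector-to-size arithmetic is easy to double-check in the shell (the 256 here is just the RA value from the report above):

```shell
# Convert a readahead value (in 512-byte sectors) to bytes and kilobytes.
ra_sectors=256
echo "$(( ra_sectors * 512 )) bytes ($(( ra_sectors * 512 / 1024 ))KB)"
# prints: 131072 bytes (128KB)
```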
So what value should you set your readahead to? Please don’t set it to a number you found online without understanding the consequences. If you Google for “blockdev setra”, the first result uses blockdev --setra 65536, which translates to 32MB of readahead. That means that whenever you read from disk, the disk is actually doing 32MB worth of work. Please do not set your readahead this high if you’re doing a lot of random-access reads and writes, as all of the extra IO can slow things down a lot (and if you’re low on memory, you’ll be forcing the kernel to fill up your RAM with data you won’t need).
Getting a good readahead value can help disk IO issues to some extent, but if you are using MongoDB (in particular), please consider your typical document size and access patterns before changing your blockdev settings. I’m not recommending any particular value because what’s perfect for one application/machine can be death for another.
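If, after thinking about your workload, you do decide to change the setting, blockdev can set readahead per device (the device name below is just an example; substitute your own, and note you’ll need root):

```shell
# Set readahead on /dev/sda to 256 sectors (128KB) -- example device name.
sudo blockdev --setra 256 /dev/sda
# Verify the new value (prints the readahead in 512-byte sectors).
sudo blockdev --getra /dev/sda
```

Keep in mind that this setting does not persist across reboots, so it is typically re-applied from a startup script.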
I’m really enjoying these --thursday posts because every week people have commented with different/better/interesting ways of doing what I talked about (or ways of telling the difference between stalagmites and stalactites), which is really cool. So I’m throwing this out there: how would you figure out what a good readahead setting is? Next week I’m planning to do iostat for --thursday, which should cover this a bit, but please leave a comment if you have any ideas.
“You just won’t believe how vastly, hugely, mind-bogglingly slow it is. I mean, you may think your network is slow, but that’s just peanuts to disk IO.”
Heh. DNA FTW.
🙂
Is there any way to collect some stats about the average size of the blocks the system reads from disk?
Yes, iostat can show you info about what the disk is doing. However, I’m not sure of the best way to correlate that with what MongoDB is using. I heard one suggestion that you could compare disk IO against how much data is going into resident memory, but it seems like that would only work until you’ve filled up resident memory.
I read your book – very well written. I have a question: is there a concept of a temp db, as found in traditional dbs? Is there something along the lines of staging vs. production databases? I would like to set up an environment where production data is in one db, separate from a db where users can freely run ad-hoc queries for testing or learning purposes, during which users may copy large chunks of production data to their test db. Given the memory-mapped nature of MongoDB, is it possible to make such temp copies of your big production collections?
Thank you! There is no built-in mechanism for doing this with MongoDB. You’d probably want to take snapshots of a secondary and then use those to re-create “clean” staging dbs for people to play with. You might want to ask on https://groups.google.com/forum/#!forum/mongodb-user about this for more ideas.