Supersize IO operations in Linux

Hi all,

This time I will tell you something that most of you really don’t care about: max IO size.

More specifically: how large an IO operation can be issued to a storage system in one go. This might not make much of a difference for most people, but when you are tuning a system for databases, distributed computing and the like, there is a case for being able to control this properly.

About a year ago I was testing storage performance on different platforms, namely Solaris and Linux. One of the tests measured throughput and IOPS while ramping from one to many processes across a range of IO sizes (8, 32, 64, 128, 256, 512, 1024 and 2048 kB).

To finalize this blog post, I had to go back to some emails I had sent to myself (my crude form of book-keeping), so I am lacking some of the screendumps. But the basic setup is the same on both Ubuntu and SUSE using the lpfc driver (Emulex).

On Linux, I just could not get the box to issue IO operations larger than 256 kB, which I found quite disturbing, since I really tried hard to bash the system with large IOs. No matter what I did (using different tools to generate the IO, and pressing the enter key on my keyboard really, really hard), I could not get iostat to show IOs larger than 256 kB. Since I trust iostat, I deduced that my system was not producing larger IOs.

When it comes to reading the output in this post, there are a couple of things to keep in mind so that you don’t get confused.

  • Even if you start iostat with -k to show kilobytes, avgrq-sz is reported as a number of 512-byte disk blocks, so you need to multiply the value shown by iostat by 512 to get the IO size in bytes (see the sketch after this list)
  • Some config parameters need to be multiplied by the 4 kB memory page size (I will get to that later)
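
For example, a one-liner like the following converts avgrq-sz into an IO size on the fly. It is just a sketch: it assumes the classic 11-column iostat -x layout from sysstat of this vintage, where the device name is field 1 and avgrq-sz is field 8, and it only matches sd* and dm-* devices:

malu@poc01:~> iostat -x 1 | awk '/^(sd|dm-)/ { printf "%-8s %6.0f kB\n", $1, $8 * 512 / 1024 }'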

To get to my point, I need to show you a couple of things. For example, I can see that dd is issuing 1 MB IOs like this:

malu@kmg-sandbox-0001:~$ sudo strace dd if=/dev/zero of=/dev/sdc bs=1024k count=10000
write(1, ""..., 1048576) = 1048576
read(0, ""..., 1048576) = 1048576
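
If the full strace output is too noisy, you can filter for just the read and write system calls (same dd invocation as above, shortened to a few blocks):

malu@kmg-sandbox-0001:~$ sudo strace -e trace=read,write dd if=/dev/zero of=/dev/sdc bs=1024k count=10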

But when checking the iostat output, it only shows me 256 kB IOs:

Device:         rrqm/s wrqm/s    r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz  await  svctm  %util
dm-7              0.00   0.00   0.00 1072.00    0.00 274432.00   512.00     3.07   2.85   0.93  99.20

Remember to multiply the 512.00 by 512 to get the IO size: 512 blocks × 512 bytes = 256 kB. You can cross-check with the throughput: 274432 kB/s divided by 1072 writes/s is also 256 kB per IO.

To find out how to tune this, I really had to dig deep. In the bad old days, before the 2.6 kernel, there was the MAXPHYS kernel parameter, which was probably not optimal, since it was removed in 2.6. Don’t ask me too much about the history, but I know it was a pain to get it set correctly, and often a kernel recompile was required, which in turn voided any support.

When investigating this topic, most people I talked to either told me “can’t be done” or “don’t bother”.

The first thing I found was that there is a tweakable parameter per device which controls the max physical IO size. It is not very convenient to use, and it only affects the LUN layer: how large an IO you can send to a single LUN.

malu@kmg-guran-0001:/sys/block/sda/queue$ cat max_hw_sectors_kb
4096

Setting the writable companion max_sectors_kb (echo 128 > max_sectors_kb; max_hw_sectors_kb itself is read-only and reports the hardware limit) changes the IO size sent to that device, but only up to some magic limit (256 kB), and not the 4 MB suggested by the value above.
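
To see both the hardware ceiling and the active limit for every block device at once, something like this works (a sketch; grep prefixes each value with its file name):

malu@poc01:~> grep . /sys/block/*/queue/max*sectors_kb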

So… I could only limit the max IO size to something smaller than or equal to 256 kB, regardless of the max_hw_sectors_kb content. It is quite easy to verify this, as the system reacts immediately when you tune the parameter. Here is an example:

In one terminal window, run the following, which will run “dd” reading 1 MB blocks over and over again:

malu@poc01:/sys/block/sda/queue> while true; do echo "Restarting dd" ; sudo dd iflag=direct if=/dev/sda of=/dev/null bs=1024k; done

In another terminal window, do the following:

malu@poc01:/sys/block/sda/queue> for bs in 8 16 32 64 128 256 512 1024 2048; do date "+%Y%m%d %H:%M:%S"; echo "Setting max_sectors_kb to $bs"; sudo sh -c "echo $bs > max_sectors_kb"; sleep 10; done
20110923 09:05:05
Setting max_sectors_kb to 8
20110923 09:05:15
Setting max_sectors_kb to 16
20110923 09:05:25
Setting max_sectors_kb to 32
20110923 09:05:35
Setting max_sectors_kb to 64
20110923 09:05:45
Setting max_sectors_kb to 128
20110923 09:05:55
Setting max_sectors_kb to 256
20110923 09:06:05
Setting max_sectors_kb to 512
20110923 09:06:15
Setting max_sectors_kb to 1024
20110923 09:06:25
Setting max_sectors_kb to 2048

In a third terminal window, run “iostat -xtc 1” or similar, and you will see the block size read from the device change from 8 kB up to 256 kB and then stay there, even as max_sectors_kb is raised further (don’t forget to multiply avgrq-sz by 512 to get the IO size).

This got me very frustrated. I really wanted to be able to tune this properly (to issue larger IOs), and none of my favorite contacts at the different vendors could help me out. After quite some googling, I came across a discussion about Lustre where someone had a similar issue. That led me to Bug 22850, where I finally found the configuration parameter I needed to change the size of IOs sent through the driver stack.

https://bugzilla.lustre.org/show_bug.cgi?id=22850

Voila!

Knowing this, I could get the following information from my system:

malu@poc01:/sys/class/scsi_host> cat host*/lpfc_sg_seg_cnt
64
64
64
64
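
Here is the page-size multiplication I promised earlier: each scatter-gather segment maps one 4 kB memory page, so 64 segments means 64 × 4 kB = 256 kB per IO. If you want the shell to do the arithmetic for you (a sketch, assuming the sysfs layout shown above):

malu@poc01:/sys/class/scsi_host> for h in host*; do echo "$h: $(( $(cat $h/lpfc_sg_seg_cnt) * $(getconf PAGESIZE) / 1024 )) kB"; done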

The Emulex driver was limited to a 256 kB IO size, which is easily tuned by setting the lpfc_sg_seg_cnt parameter to 256 (256 × 4 kB = 1 MB) in /etc/modprobe.conf.local.

malu@poc01:/sys/class/scsi_host> cat /etc/modprobe.conf.local

options lpfc lpfc_lun_queue_depth=16
options lpfc lpfc_sg_seg_cnt=256
options lpfc lpfc_link_speed=4

malu@poc01:/sys/class/scsi_host> sudo mkinitrd
malu@poc01:/sys/class/scsi_host> sudo shutdown -r now
...
malu@poc01:/sys/class/scsi_host> cat host*/lpfc_sg_seg_cnt
256
256
256
256

Now the max IO size is 1 MB and I am happy again. Try it out yourself!
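
For a quick end-to-end check once the new initrd is in place (same commands as above, just condensed):

malu@poc01:~> sudo dd iflag=direct if=/dev/sda of=/dev/null bs=1024k count=10000 &
malu@poc01:~> iostat -x sda 1

With lpfc_sg_seg_cnt at 256, avgrq-sz should now read 2048 (2048 × 512 bytes = 1 MB).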