<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>domas mituzas &#187; linux</title>
	<atom:link href="http://mituzas.lt/tag/linux/feed/" rel="self" type="application/rss+xml" />
	<link>http://mituzas.lt</link>
	<description></description>
	<lastBuildDate>Fri, 30 Jul 2010 07:36:08 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1-alpha</generator>
		<item>
		<title>Linux 2.6.29</title>
		<link>http://mituzas.lt/2009/03/24/linux-2629/</link>
		<comments>http://mituzas.lt/2009/03/24/linux-2629/#comments</comments>
		<pubDate>Tue, 24 Mar 2009 13:16:01 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=434</guid>
		<description><![CDATA[2.6.29 was released. I don&#8217;t usually write about Linux kernel releases, that&#8217;s what Slashdot is for :), but this one introduces write barriers in LVM, as well as ext4 with write barriers enabled by default. If you run this kernel &#8230; <a href="http://mituzas.lt/2009/03/24/linux-2629/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>2.6.29 was released. I don&#8217;t usually write about Linux kernel releases, that&#8217;s what <a href='http://slashdot.org'>Slashdot</a> is for :), but this one introduces write barriers in LVM, as well as ext4 with write barriers enabled by default. If you run this kernel and forget to turn off barrier support in your filesystems (for example, with the nobarrier mount option on XFS), you will see nasty performance slowdowns (<a href='http://dammit.lt/2008/11/03/xfs-write-barriers/'>recent post about it</a>). Beware.</p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2009/03/24/linux-2629/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>iostat -x</title>
		<link>http://mituzas.lt/2009/03/11/iostat/</link>
		<comments>http://mituzas.lt/2009/03/11/iostat/#comments</comments>
		<pubDate>Wed, 11 Mar 2009 16:26:48 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=393</guid>
		<description><![CDATA[My favorite Linux tool in DB work is &#8216;iostat -x&#8217; (and I really, really want to see it whenever I&#8217;m doing any kind of performance analysis), yet I had to learn its limitations and properties. For example, I took a 1s snapshot &#8230; <a href="http://mituzas.lt/2009/03/11/iostat/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>My favorite Linux tool in DB work is &#8216;iostat -x&#8217; (and I really, really want to see it whenever I&#8217;m doing any kind of performance analysis), yet I had to learn its limitations and properties. For example, I took a 1s snapshot from a slightly overloaded 16-disk database box:</p>
<pre>
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.12    0.00    2.57   21.65    0.00   67.66

Device:  rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s \
sda     7684.00    19.00 2420.00  498.00 81848.00  5287.00 \

        avgrq-sz avgqu-sz   await  svctm  %util
           29.86    32.99   11.17   0.34 100.00
</pre>
<p>I pasted this somewhere on IRC, and was told that it &#8220;doesn&#8217;t look too healthy&#8221; and that the box is disk-bound. Now, to understand whether it really is, one has to understand what iostat is telling us here. </p>
<p>The first line of numbers shows that we&#8217;ve got plenty of CPU resources (that&#8217;s because nowadays it is quite difficult to get a box with not enough CPU power, and I/O still seems to be the bottleneck) &#8211; and we have more threads waiting for I/O than we have executing on CPU (that sounds normal). </p>
<p>Now the actual per-disk statistics are where one should look. I used to prefer %util over general %iowait (I couldn&#8217;t really explain what %iowait is, and I can say what %util is). I don&#8217;t know why, but iostat has the most interesting bits at the end, and the not so interesting ones at the start:</p>
<ul>
<li><b>%util</b>: how much of the time the storage device had outstanding work (was busy). In proper RAID environments it is more like &#8220;how much of the time did at least one disk in the RAID array have something to do&#8221;. I&#8217;m deliberately excluding any kind of cache here &#8211; if a request can be served from cache, the chance that it will show up in %util is quite negligible, unlike in other values. What this also means &#8211; on this 16-disk box, a reported 100% can hide a RAID subsystem loaded anywhere from 6.25% (one disk doing the work) to 100% (all of them busy). That&#8217;s quite a lot of insight in a single value of &#8216;100%&#8217;, isn&#8217;t it?</li>
<li><b>svctm</b>: Though the manual says &#8220;The average service time (in milliseconds) for I/O requests that were issued to the device.&#8221;, it isn&#8217;t exactly that when you look at multiple-disk systems. What it says is, &#8220;when your I/O subsystem is busy, how fast does it respond to requests overall&#8221;. Actually, the less you load your system, the higher svctm is (as there are fewer outstanding requests, and the average time to serve them goes up). Of course, at a certain point, when I/O becomes really overloaded, you can see svctm going up. One can tweak /sys/block/sda/queue/nr_requests based on this &#8211; to avoid overloading the I/O controller, though that is really rarely needed. </li>
<li><b>await</b>: One of my favorites &#8211; how fast do requests go through. It is just an average of how long it takes to serve a request for a device, from the moment it gets into the device queue to the final &#8220;OK&#8221;. Low = good, high = bad. There are a few gotchas here &#8211; even though different reads can have different performance properties (middle of disk, outer areas of disk, etc), the biggest difference is between reads and writes. Reads take time, writes can be instant (write caching at underlying layers..). As roughly 80% of requests were reads, we can try to account for that by doing the 11.17/0.8 math, to get a 14ms figure. That&#8217;s quite high &#8211; systems that aren&#8217;t loaded can show ~5ms times (which isn&#8217;t that far away from the 4ms rotation time of a 15krpm disk). </li>
<li><b>avgqu-sz</b>: Very very very important value &#8211; how many requests there are in the request queue. Low = either your system is not loaded, or it has serialized I/O and cannot utilize the underlying storage properly. High = your software stack is scalable enough to properly load the underlying I/O. Queue size equal to the number of disks means (in the best case of request distribution) that all your disks are busy. Queue size higher than the number of disks means that you are already trading I/O response time for better throughput (disks can optimize the order of operations if they know them beforehand, that&#8217;s what <a href='http://en.wikipedia.org/wiki/NCQ'>NCQ &#8211; Native Command Queueing</a> does). If one complains about I/O performance issues when avgqu-sz is lower, then it is application-specific stuff that can be resolved with more aggressive read-ahead, fewer fsyncs, etc. One interesting part &#8211; avgqu-sz, await, svctm and %util are interdependent (await = avgqu-sz * svctm / (%util/100)).</li>
<li><b>avgrq-sz</b>: Just an average request size. Quite often will look like a block size of some kind &#8211; can indicate what kind of workload happens. This is already post-merging, so lots of adjacent block operations will bump this up. Also, if database page is 16k, though filesystem or volume manager block is 32k, this will be seen in avgrq-sz. Large requests indicate there&#8217;s some big batch/stream task going on. </li>
<li><b>wsec/s &#038; rsec/s</b>: Sectors read and written per second. Divide by 2048, and you&#8217;ll get megabytes per second. I wanted to write this isn&#8217;t important, but remembered all the non-database people who store videos on filesystems :) So, if megabytes per second matter, these values are important (and can be seen in &#8216;vmstat&#8217; output too). If not, for various database people there are other ones:</li>
<li><b>r/s &#038; w/s</b>: Read and write requests per second. This is already post-merging, and in proper I/O setups reads will mean blocking random read (serial reads are quite often merged), and writes will mean non-blocking random write (as underlying cache can allow to serve the OS instantly). These numbers are the ones that are the I/O capacity figures, though of course, depending on how much pressure underlying I/O subsystem gets (queue size!), they can vary. And as mentioned above, on rotational media it is possible to trade response time (which is not that important in parallel workloads) for better throughput.</li>
<li><b>rrqm/s &#038; wrqm/s</b>: How many requests were merged by the block layer. In an ideal world, there should be no merges at I/O level, because applications would have done it ages ago. Ideals differ though, for others it is good to have the kernel doing this job, so they don&#8217;t have to do it inside the application. Quite often there will be way fewer write merges, because applications which tend to write adjacent blocks also tend to wait after every write (see my rant on <a href='http://dammit.lt/2008/02/05/linux-io-schedulers/'>I/O schedulers</a>). Reads however can be merged way easier &#8211; especially if the application does &#8220;read ahead&#8221; block by block. Another reason for merges is simple block size mismatch &#8211; 16k database pages on top of 8k filesystem blocks will cause adjacent block reads, which would be merged by the block layer. On some systems a read of two adjacent pages would result in 1MB reads, but that&#8217;s another rant :)</li>
<li><b>Device:</b> &#8211; just to make sure, that you&#8217;re looking at the right device. :-) </li>
</ul>
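<p>As a rough cross-check of the relationships in the list above, here is a minimal Python sketch that recomputes a few derived figures from the sample line. The numbers are copied by hand from the snapshot rather than parsed from real iostat output, the disk count comes from the description of the box, and 512-byte sectors are an assumption (it is what the divide-by-2048 rule implies). The variable names are made up for this example.</p>
<pre>
# Values copied from the 1s "iostat -x" sample for sda above.
r_s, w_s = 2420.00, 498.00            # read and write requests per second
rsec_s, wsec_s = 81848.00, 5287.00    # sectors read and written per second
avgqu_sz, svctm, util = 32.99, 0.34, 100.00
disks = 16                            # the box is described as 16-disk

# Assuming 512-byte sectors, 2048 sectors make one megabyte.
read_mb_s = rsec_s / 2048

# Average request size in sectors, counted after merging.
avgrq_sz = (rsec_s + wsec_s) / (r_s + w_s)

# The identity from the list: await = avgqu-sz * svctm / (%util / 100).
await_ms = avgqu_sz * svctm / (util / 100)

# Share of requests that are reads, and average queue depth per disk.
read_share = r_s / (r_s + w_s)
queue_per_disk = avgqu_sz / disks

print(f"read throughput : {read_mb_s:.1f} MB/s")
print(f"avgrq-sz        : {avgrq_sz:.2f} sectors")
print(f"await (derived) : {await_ms:.2f} ms")
print(f"read share      : {read_share:.0%}")
print(f"queue per disk  : {queue_per_disk:.2f}")
</pre>
<p>The derived await and avgrq-sz should land within rounding distance of what iostat printed, since iostat itself works from unrounded counters.</p>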
<p>So, after all this, the iostat output above tells us something like:</p>
<ul>
<li>System has healthy high load (request queue has two-requests-per-disk)</li>
<li>Average request time is double the value one would expect from an idle system. It isn&#8217;t too harmful, but one can do better</li>
<li>It is reading <s>80</s> 40MB/s from disks, at 2420 requests/s. That&#8217;s quite high performance from an inexpensive 2u database server (shameless plug: X4240 :)</li>
<li>High amount of merges comes from LVM snapshots, can be ignored</li>
<li>System is alive, healthy and kicking, no matter what anyone says :)</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2009/03/11/iostat/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>mmap()</title>
		<link>http://mituzas.lt/2008/08/17/mmap/</link>
		<comments>http://mituzas.lt/2008/08/17/mmap/#comments</comments>
		<pubDate>Sun, 17 Aug 2008 21:51:00 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[mmap]]></category>
		<category><![CDATA[myisam]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=187</guid>
		<description><![CDATA[I&#8217;ve seen quite some work done on implementing mmap() in various places, including MySQL. mmap() is also used for malloc()&#8217;ing huge blocks of memory. mmap() data cache is part of VM cache, not file cache (though those are inside kernels &#8230; <a href="http://mituzas.lt/2008/08/17/mmap/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve seen quite some work done on implementing <a href='http://en.wikipedia.org/wiki/Mmap'>mmap()</a> in various places, including MySQL.<br />
mmap() is also used for malloc()&#8217;ing huge blocks of memory.<br />
mmap() data cache is part of VM cache, not file cache (though those are inside kernels tightly coupled, priorities still remain different).</p>
<p>If a small program with a low memory footprint maps a file, it will probably make file access faster (as it will be cached more aggressively in memory, and will put pressure on other cached file data &#8211; that&#8217;s cheating though). </p>
<p>If a large program with lots and lots of allocated memory maps a file, that will pressure the filesystem cache to flush pages, and then&#8230; will pressure existing VM pages of the very same large program to be swapped out. That&#8217;s certainly bad.</p>
<p>For now MySQL is <a href='http://bugs.mysql.com/bug.php?id=37408'>using mmap()</a> just for compressed MyISAM files. <a href='http://www.mysqlperformanceblog.com/2006/05/26/myisam-mmap-feature-51/'>Vadim wrote</a> a patch to do more of mmap()ing. </p>
<p>If there&#8217;s less data than RAM, mmap() may provide somewhat more efficient CPU cycles. If there&#8217;s more data than RAM, mmap() will kill the system. </p>
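<p>For readers who haven&#8217;t used it, here is a minimal sketch of what reading a file through a mapping looks like from user space. Python&#8217;s mmap module stands in for the C call, and the helper name read_via_mmap and the temporary demo file are made up for illustration. Pages touched through the mapping get faulted in by the VM, which is why they compete with the program&#8217;s own memory as described above.</p>
<pre>
import mmap
import os
import tempfile

def read_via_mmap(path, offset, length):
    """Return length bytes starting at offset, read through a read-only mapping."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Slicing the mapping copies the bytes out of the mapped pages.
            return m[offset:offset + length]

# Hypothetical demo file, created only for this example.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)
    demo_path = tmp.name

chunk = read_via_mmap(demo_path, 10, 10)
print(chunk)
os.remove(demo_path)
</pre>
<p>Note that the sketch only reads through the mapping and never mixes it with write() on the same file.</p>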
<p>Interestingly, a few months ago there was a <a href='http://kerneltrap.org/mailarchive/linux-kernel/2008/6/19/2166494/thread'>discussion on lkml</a> where <a href='http://en.wikipedia.org/wiki/Linus_Torvalds'>Linus</a> wrote:</p>
<blockquote><p>
Because quite frankly, the mixture of doing mmap() and write() system calls is quite fragile &#8211; and I&#8217;m not saying that just because of this particular bug, but because there are all kinds of nasty cache aliasing issues with virtually indexed caches etc that just fundamentally mean that it&#8217;s often a mistake to mix mmap with read/write at the same time.
</p></blockquote>
<p>So, simply, don&#8217;t. </p>
<p><b>Update:</b> Oh well, 5.1: --myisam_use_mmap option&#8230; Argh.<br />
<b>Update on update:</b> after a few minutes of internal testing all mmap()ed MyISAM tables <a href='http://bugs.mysql.com/bug.php?id=38848'>went fubar</a>. </p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2008/08/17/mmap/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Notes from land of I/O</title>
		<link>http://mituzas.lt/2008/08/11/notes-from-land-of-io/</link>
		<comments>http://mituzas.lt/2008/08/11/notes-from-land-of-io/#comments</comments>
		<pubDate>Mon, 11 Aug 2008 10:52:00 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[directio]]></category>
		<category><![CDATA[innodb]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[jfs]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[xfs]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=184</guid>
		<description><![CDATA[A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking program (where all configuration is in the source file, and I/O modes can be changed by editing various places in code ;-), &#8230; <a href="http://mituzas.lt/2008/08/11/notes-from-land-of-io/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking <a href='http://noc.wikimedia.org/~midom/raidbench.c.txt'>program</a> (where all configuration is in the source file, and I/O modes can be changed by editing various places in code ;-), and started playing with performance.</p>
<p>The machine for this testing was a 16-disk RAID10 box with a 2.6.24 kernel. I tried to understand how O_DIRECT works and how fsync() works, and ended up digging into some other stuff.</p>
<p>My notes for now are:</p>
<ul>
<li>O_DIRECT serializes writes to a file on ext2, ext3, jfs, so I got at most 200-250w/s.</li>
<li>xfs allows parallel (and out-of-order, if that matters) DIO, so I got 1500-2700w/s (depending on file size &#8211; seek time changes.. :) of random I/O without write-behind caching. There are a few outstanding bugs that lock this back down to 250w/s (<i>#xfs@freenode: &#8220;yeah, we drop back to taking the i_mutex in teh case where we are writing beyond EOF or we have cached pages&#8221;</i>, so
<pre>posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED)</pre>
helps).</li>
<li>fsync(), sync() and fdatasync() wait if there are any writes in flight. The bad part &#8211; they can wait forever. Filesystem people say that&#8217;s a bug &#8211; a sync shouldn&#8217;t wait for I/O that was issued after the sync was called. I tend to believe them, as it causes stuff like InnoDB semaphore waits and such. </li>
</ul>
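<p>The posix_fadvise() call from the notes above can be tried without writing C. This is a minimal sketch using Python&#8217;s os.posix_fadvise wrapper (Unix only); the helper name drop_cached_pages and the temporary file are made up for illustration, and the fsync() before the advice is my assumption about what makes it effective, since dirty pages cannot simply be dropped.</p>
<pre>
import os
import tempfile

def drop_cached_pages(fd):
    """Ask the kernel to drop cached pages for the whole file behind fd."""
    filesize = os.fstat(fd).st_size
    # Flush dirty pages first, otherwise there is little the kernel can drop.
    os.fsync(fd)
    # Same arguments as the C call: fd, offset, length, advice.
    os.posix_fadvise(fd, 0, filesize, os.POSIX_FADV_DONTNEED)
    return filesize

# Hypothetical demo file; a real test would point at the benchmark data file.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 65536)
    tmp.flush()
    advised = drop_cached_pages(tmp.fileno())
    print("advised", advised, "bytes")
</pre>
<p>The call is only advice, so the sketch cannot prove that pages were actually dropped; it just shows the shape of the call.</p>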
<p>Of course, having write-behind caching at the controller (or disk, *shudder*) level allows filesystems to be lazy (and benchmarks are no longer that different), but having the upper layers work efficiently is quite important too, to avoid bottlenecks. </p>
<p>It is interesting, that write-behind caching isn&#8217;t needed that much anymore for random writes, once filesystem parallelizes I/O, even direct, nonbuffered one. </p>
<p>Anyway, now that I found some of I/O properties and issues, should probably start thinking how they apply to the upper layers like InnoDB.. :) </p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2008/08/11/notes-from-land-of-io/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>I/O schedulers seriously revisited</title>
		<link>http://mituzas.lt/2008/02/05/linux-io-schedulers/</link>
		<comments>http://mituzas.lt/2008/02/05/linux-io-schedulers/#comments</comments>
		<pubDate>Tue, 05 Feb 2008 10:41:00 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[cfq]]></category>
		<category><![CDATA[deadline]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[scheduler]]></category>

		<guid isPermaLink="false">http://dammit.lt/2008/02/05/linux-io-schedulers/</guid>
		<description><![CDATA[The I/O scheduler problems have drawn my attention, and besides trusting empirical results, I tried to do more of benchmarking and analysis, why the heck strange things happen at Linux block layer. So, here is the story, which I found &#8230; <a href="http://mituzas.lt/2008/02/05/linux-io-schedulers/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The <a href='http://dammit.lt/2007/12/13/io-scheduler-deadline-cfq/'>I/O scheduler problems</a> have drawn my attention, and rather than just trusting empirical results, I tried to do more benchmarking and analysis of why the heck strange things happen at the Linux block layer. So, here is the story, which I myself found quite fascinating&#8230;<br />
<span id="more-97"></span></p>
<p>The synonym for &#8216;i/o scheduler&#8217; inside Linux kernel is &#8216;elevator&#8217; &#8211; and it helps to explain many things. The physical movement needs of disk spindles are similar to how elevators move people in tall buildings:</p>
<ul>
<li>Trying to go in a single direction, just &#8216;up&#8217; or &#8216;down&#8217;, and picking up all the people on the way.</li>
<li>At every floor where it stops, waiting for more people to show up, not closing the doors immediately. </li>
<li>The fuller the elevator cabin is, the more efficient it is in terms of transported people.</li>
<li>The fuller the elevator cabin is, the more annoying it is for the people in it &#8211; stopping at every floor, then waiting, people start hitting the &#8216;close door&#8217; button nervously. </li>
<li>Buildings solve this by having more elevators, or sophisticated queueing systems for getting into them</li>
</ul>
<p>Now, imagine a huge hotel building that is hosting a huge convention of privacy-worshippers. Or just misanthropes. They will never get into an elevator until they know that it is empty, and that the human who was traveling before them got out of the elevator safely. Essentially, that&#8217;s how database transaction serialization works. </p>
<p>That&#8217;s where smart elevators fail &#8211; they immediately notice that all writes are going to the same location and prefer to wait for more requests &#8211; and merge them. Though, whenever an elevator decides to wait, nothing happens &#8211; there is a global lock inside the database engine, which tells it not to write until the first write finishes.<br />
Scheduler waits, decides that it did wait too long, flushes the write, gets another request, notices write goes to same location and there might be chance to merge subsequent requests, which&#8230; do not come in again.</p>
<p>And the solution is &#8211; using teleports. Well, at least treating database writes as instant accesses, not caring about order, waiting, just doing everything as soon as possible. </p>
<p>To demonstrate this I went to the world of edge cases &#8211; made the performance test, which maybe shouldn&#8217;t be called a &#8216;benchmark&#8217;. I created a very simple table and started spamming rows, each as a separate transaction, into it. The hardware for test was a &#8216;regular&#8217; DB box, 8 disks, write-behind cache on RAID controller, 16GB of memory, 8 cores &#8211; not my desktop or laptop. The sole idea of test was finding how different I/O modes affect ability to write to I/O controller as fast as possible &#8211; resulting in transaction throughput. </p>
<p>Here are some results:</p>
<table class='benchmarks'>
<tr>
<td>&nbsp;</td>
<td colspan='2'>1 thread</td>
<td colspan='2'>8 threads</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>tps</td>
<td>ioutil%</td>
<td>tps</td>
<td>ioutil%</td>
</tr>
<tr>
<td>flush=0, deadline</td>
<td>15000</td>
<td>&lt;1%</td>
<td>19100</td>
<td>&lt;1%</td>
</tr>
<tr>
<td>deadline</td>
<td>6850</td>
<td>32%</td>
<td>15250</td>
<td>45%</td>
</tr>
<tr>
<td>cfq</td>
<td>5880</td>
<td>45%</td>
<td><b>1240</b></td>
<td>94%</td>
</tr>
</table>
<p>Here, CFQ had a huge regression at higher concurrency; Anticipatory actually showed similar, slightly better results. NOOP showed results similar to deadline, much faster with a single thread, slightly slower with multiple. </p>
<p>So, whichever decisions CFQ takes during this test, they must be all wrong &#8211; with multiple disks and raid controller handling flushing of write cache there is no need for elevation or request merging, multiple tagged commands can be sent to the I/O subsystem, and they will be executed swiftly. It is supposed to be a scheduler good for most of workloads, and it probably is. But high-performance databases rely on storage being fast, and for enforcing of ACID requirements, synchronous operations should not wait forever (or wait, when not necessary)</p>
<p>Of course, CFQ may provide better performance on systems that do extensive I/O scanning, so for folks with slow queries it may end up providing more throughput, as it will tolerate delaying small things for big things to get through &#8211; deadline would not care about fairness and try getting everything done as quickly as possible. </p>
<p>The worst part is that Deadline is enabled by default just on community distributions, like Ubuntu Server (though the Ubuntu desktop kernel has CFQ by default). So, most people will end up having an anti-database scheduler for ages, and will rarely get into the internals of the whole stack, or into analysis of the performance profile. Switching to another scheduler is a matter of a single command (though it managed to crash my system once :)</p>
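<p>To make the &#8216;single command&#8217; concrete: the scheduler is exposed per device in sysfs, in /sys/block/DEVICE/queue/scheduler, where the active one is shown in square brackets and writing a name into the file switches it. The Python sketch below is just one way to look at it; the function names are made up, sda is an assumed device name, and the sample line is hard-coded rather than read from a real box. The function that switches the scheduler needs root and is deliberately not called here.</p>
<pre>
def active_scheduler(sysfs_text):
    """Return the scheduler name shown in square brackets, or None."""
    for name in sysfs_text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None

def read_scheduler(device="sda"):
    """Read the scheduler line for a block device from sysfs."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return f.read()

def set_scheduler(name, device="sda"):
    """Switch the scheduler; needs root, and is not called in this sketch."""
    with open(f"/sys/block/{device}/queue/scheduler", "w") as f:
        f.write(name)

# Sample line in the format the kernel prints, hard-coded for illustration.
sample = "noop anticipatory deadline [cfq]"
print(active_scheduler(sample))
</pre>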
<p>Must read: <a href='http://kerneltrap.org/node/7637'>Jens Axboe on his block layer work</a>, and my previous <a href='http://dammit.lt/2007/12/13/io-scheduler-deadline-cfq/'>rant</a> :)</p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2008/02/05/linux-io-schedulers/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>
