<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>domas mituzas &#187; io</title>
	<atom:link href="http://mituzas.lt/tag/io/feed/" rel="self" type="application/rss+xml" />
	<link>http://mituzas.lt</link>
	<description></description>
	<lastBuildDate>Fri, 30 Jul 2010 07:36:08 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1-alpha</generator>
		<item>
		<title>On file system benchmarks</title>
		<link>http://mituzas.lt/2009/06/30/on-file-system-benchmarks/</link>
		<comments>http://mituzas.lt/2009/06/30/on-file-system-benchmarks/#comments</comments>
		<pubDate>Tue, 30 Jun 2009 19:34:35 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[benchmarks]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[rant]]></category>
		<category><![CDATA[xfs]]></category>

		<guid isPermaLink="false">http://mituzas.lt/?p=519</guid>
		<description><![CDATA[I see this benchmark being quoted in multiple places, and there I see stuff like: When carrying out more database benchmarking, but this time with PostgreSQL, XFS and Btrfs were too slow to even complete this test, even when it &#8230; <a href="http://mituzas.lt/2009/06/30/on-file-system-benchmarks/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I see <a href="http://www.phoronix.com/scan.php?page=article&amp;item=ext4_btrfs_nilfs2">this benchmark</a> being quoted in multiple places, and there I see stuff like:</p>
<blockquote><p>When carrying out more database benchmarking, but this time with PostgreSQL, XFS and Btrfs were too slow to even complete this test, even when it had been running for more than an hour for a single run. Between EXT3, EXT4, and NILFS2, the fastest file-system was EXT3 and then its successor, EXT4, was slightly behind that. Far behind the position of EXT4 were NILFS2 and then Btrfs and XFS.</p></blockquote>
<p>There were a few other benchmarks, e.g. SQLite showed &#8216;bad performance&#8217; on XFS and Btrfs.</p>
<p>*clears throat*</p>
<p>Dear benchmarkers, don&#8217;t compare apples and oranges. If you see differences between benchmarks, do some very, very tiny research, and use some of the intellect that you, as primates, do have. If database tests are slowest on filesystems created by Oracle (who know some stuff about systems in general) or SGI (who, despite giving away their campus to Google, still have lots of expertise in the field), that can indicate that your tests are probably flawed somewhere, at least for that test domain.</p>
<p>Now, you have probably heard about such a thing as &#8216;data consistency&#8217;. That is something the database stack tries to ensure, sometimes at a higher cost: not trusting volatile caches, enforcing certain write orders, depending on acknowledgements from the underlying hardware.</p>
<p>So, in this case it wasn&#8217;t &#8220;benchmarking file systems&#8221;, it was simply benchmarking &#8220;consistency&#8221; against &#8220;no consistency&#8221;. But don&#8217;t worry, most benchmarks have such flaws &#8211; getting numbers but not understanding them makes results much more interesting, right?</p>
<p>Oh, and&#8230; thanks for a few more misguided people.</p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2009/06/30/on-file-system-benchmarks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Linux 2.6.29</title>
		<link>http://mituzas.lt/2009/03/24/linux-2629/</link>
		<comments>http://mituzas.lt/2009/03/24/linux-2629/#comments</comments>
		<pubDate>Tue, 24 Mar 2009 13:16:01 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[linux]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=434</guid>
		<description><![CDATA[2.6.29 was released. I don&#8217;t usually write about Linux kernel releases, that&#8217;s what Slashdot is for :), but this one introduces write barriers in LVM, as well as ext4 with write barriers enabled by default. If you run this kernel &#8230; <a href="http://mituzas.lt/2009/03/24/linux-2629/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>2.6.29 was released. I don&#8217;t usually write about Linux kernel releases, that&#8217;s what <a href='http://slashdot.org'>Slashdot</a> is for :), but this one introduces write barriers in LVM, as well as ext4 with write barriers enabled by default. If you run this kernel and forget to turn off barrier support on your filesystems (for XFS, the nobarrier mount option), you will see nasty performance slowdowns (<a href='http://dammit.lt/2008/11/03/xfs-write-barriers/'>recent post about it</a>). Beware.</p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2009/03/24/linux-2629/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>iostat -x</title>
		<link>http://mituzas.lt/2009/03/11/iostat/</link>
		<comments>http://mituzas.lt/2009/03/11/iostat/#comments</comments>
		<pubDate>Wed, 11 Mar 2009 16:26:48 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=393</guid>
		<description><![CDATA[My favorite Linux tool in DB work is &#8216;iostat -x&#8217; (and I really, really want to see it whenever I&#8217;m doing any kind of performance analysis), yet I had to learn its limitations and properties. For example, I took a 1s snapshot &#8230; <a href="http://mituzas.lt/2009/03/11/iostat/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>My favorite Linux tool in DB work is &#8216;iostat -x&#8217; (and I really, really want to see it whenever I&#8217;m doing any kind of performance analysis), yet I had to learn its limitations and properties. For example, I took a 1s snapshot from a slightly overloaded 16-disk database box:</p>
<pre>
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.12    0.00    2.57   21.65    0.00   67.66

Device:  rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s \
sda     7684.00    19.00 2420.00  498.00 81848.00  5287.00 \

        avgrq-sz avgqu-sz   await  svctm  %util
           29.86    32.99   11.17   0.34 100.00
</pre>
<p>I pasted this somewhere on IRC, and was told that it &#8220;doesn&#8217;t look too healthy&#8221; and that it is disk-bound. Now, to understand if it really is, one has to understand what iostat is telling us here. </p>
<p>The first line of numbers shows that we&#8217;ve got plenty of CPU resources (that&#8217;s because nowadays it is quite difficult to get a box with not enough CPU power, and I/O still seems to be the bottleneck) &#8211; and we have more threads waiting for I/O than executing on CPU (that sounds normal). </p>
<p>Now the actual per-disk statistics are where one should look. I used to prefer %util over general %iowait (I couldn&#8217;t really explain what %iowait is, and I can say what %util is). I don&#8217;t know why, but iostat has the most interesting bits at the end, and the not so interesting ones at the start:</p>
<ul>
<li><b>%util</b>: how much of the time the storage device had outstanding work (was busy). In proper RAID environments it is more like &#8220;how much of the time did at least one disk in the RAID array have something to do&#8221;. I&#8217;m deliberately excluding any kind of cache here &#8211; if a request can be served from cache, the chance that it will show up in %util is quite negligible, unlike in other values. What this also means &#8211; on this 16-disk box the RAID subsystem can be loaded anywhere from 6.25% (one disk doing the work) to 100% (all of them busy). That&#8217;s quite a lot of insight in a single value of &#8216;100%&#8217;, isn&#8217;t it?</li>
<li><b>svctm</b>: Though the manual says &#8220;The average service time (in milliseconds) for I/O requests that were issued to the device.&#8221;, it isn&#8217;t exactly that when you look at multiple-disk systems. What it says is, &#8220;when your I/O subsystem is busy, how fast does it respond to requests overall&#8221;. Actually, the less you load your system, the higher svctm is (as there are fewer outstanding requests, and the average time to serve them goes up). Of course, at a certain moment, when I/O becomes really overloaded, you can see svctm going up too. One can tweak /sys/block/sda/queue/nr_requests based on this &#8211; to avoid overloading the I/O controller, though that is really rarely needed. </li>
<li><b>await</b>: One of my favorites &#8211; how fast do requests go through. It is just an average of how long it takes to serve a request for a device, from the moment it gets into the device queue to the final &#8220;OK&#8221;. Low = good, high = bad. There are a few gotchas here &#8211; even though different reads can have different performance properties (middle of disk, outer areas of disk, etc), the biggest difference is between reads and writes. Reads take time, writes can be instant (write caching at underlying layers..). As about 80% of requests were reads, we can try to account for that by doing the 11.17/0.8 math, to get a 14ms figure. That&#8217;s quite high &#8211; systems that aren&#8217;t loaded can show ~5ms times (which isn&#8217;t that far away from the 4ms rotation time of a 15krpm disk). </li>
<li><b>avgqu-sz</b>: Very very very important value &#8211; how many requests there are in the request queue. Low = either your system is not loaded, or it has serialized I/O and cannot utilize the underlying storage properly. High = your software stack is scalable enough to properly load the underlying I/O. Queue size equal to the number of disks means (in the best case of request distribution) that all your disks are busy. Queue size higher than the number of disks means that you are already trading I/O response time for better throughput (disks can optimize the order of operations if they know them beforehand, that&#8217;s what <a href='http://en.wikipedia.org/wiki/NCQ'>NCQ &#8211; Native Command Queueing</a> does). If one complains about I/O performance issues when avgqu-sz is lower, then it is application-specific stuff that can be resolved with more aggressive read-ahead, fewer fsyncs, etc. One interesting part &#8211; avgqu-sz, await, svctm and %util are interdependent: await = avgqu-sz * svctm / (%util/100).</li>
<li><b>avgrq-sz</b>: Just an average request size. Quite often will look like a block size of some kind &#8211; can indicate what kind of workload happens. This is already post-merging, so lots of adjacent block operations will bump this up. Also, if database page is 16k, though filesystem or volume manager block is 32k, this will be seen in avgrq-sz. Large requests indicate there&#8217;s some big batch/stream task going on. </li>
<li><b>wsec/s &#038; rsec/s</b>: Sectors written and read per second. Sectors are 512 bytes, so divide by 2048 and you&#8217;ll get megabytes per second. I wanted to write that this isn&#8217;t important, but remembered all the non-database people who store videos on filesystems :) So, if megabytes per second matter, these values are important (and can be seen in &#8216;vmstat&#8217; output too). If not, for various database people there are other ones:</li>
<li><b>r/s &#038; w/s</b>: Read and write requests per second. This is already post-merging, and in proper I/O setups reads will mean blocking random read (serial reads are quite often merged), and writes will mean non-blocking random write (as underlying cache can allow to serve the OS instantly). These numbers are the ones that are the I/O capacity figures, though of course, depending on how much pressure underlying I/O subsystem gets (queue size!), they can vary. And as mentioned above, on rotational media it is possible to trade response time (which is not that important in parallel workloads) for better throughput.</li>
<li><b>rrqm/s &#038; wrqm/s</b>: How many requests were merged by the block layer. In an ideal world, there should be no merges at I/O level, because applications would have done it ages ago. Ideals differ though, for others it is good to have the kernel doing this job, so they don&#8217;t have to do it inside the application. Quite often there will be far fewer write merges, because applications which tend to write adjacent blocks also tend to wait after every write (see my rant on <a href='http://dammit.lt/2008/02/05/linux-io-schedulers/'>I/O schedulers</a>). Reads however can be merged way easier &#8211; especially if the application does &#8220;read ahead&#8221; block by block. Another reason for merges is a simple block size mismatch &#8211; 16k database pages on top of 8k blocks will cause adjacent block reads, which would be merged by the block layer. On some systems a read of two adjacent pages would result in 1MB reads, but that&#8217;s another rant :)</li>
<li><b>Device:</b> &#8211; just to make sure, that you&#8217;re looking at the right device. :-) </li>
</ul>
<p>So, after all this, the iostat output above tells us something like:</p>
<ul>
<li>System has healthy high load (request queue has two-requests-per-disk)</li>
<li>Average request time is double the value one would expect from idle system, it isn&#8217;t too harmful, but one can do better</li>
<li>It is reading <s>80</s> 40MB/s from disks, at 2420 requests/s. That&#8217;s quite high performance from an inexpensive 2U database server (shameless plug: X4240 :)</li>
<li>High amount of merges comes from LVM snapshots, can be ignored</li>
<li>System is alive, healthy and kicking, no matter what anyone says :)</li>
</ul>
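<p>The arithmetic behind these conclusions can be re-checked with a few lines of code. This is only a sketch: the numbers are copied from the iostat sample above, the 16 disks come from the description of the box, and the names <code>sample</code> and <code>summarize</code> are made up for illustration.</p>

```python
# Values copied from the iostat -x sample above; DISKS comes from the "16-disk" box.
DISKS = 16
SECTOR_BYTES = 512  # rsec/s and wsec/s are counted in 512-byte sectors

sample = {
    "r/s": 2420.0, "w/s": 498.0,
    "rsec/s": 81848.0, "wsec/s": 5287.0,
    "avgqu-sz": 32.99, "await": 11.17, "svctm": 0.34, "%util": 100.0,
}


def summarize(s, disks=DISKS):
    """Derive the figures quoted in the text from one iostat -x row."""
    sectors_per_mb = 1024 * 1024 // SECTOR_BYTES  # 2048
    return {
        "read_mb_s": s["rsec/s"] / sectors_per_mb,
        "write_mb_s": s["wsec/s"] / sectors_per_mb,
        "read_share": s["r/s"] / (s["r/s"] + s["w/s"]),
        "queue_per_disk": s["avgqu-sz"] / disks,
        # await = avgqu-sz * svctm / (%util / 100)
        "await_estimate": s["avgqu-sz"] * s["svctm"] / (s["%util"] / 100.0),
        # %util of 100% is already reached with a single busy disk
        "min_array_load_pct": 100.0 / disks,
    }


if __name__ == "__main__":
    for name, value in summarize(sample).items():
        print(f"{name:>20}: {value:.2f}")
```

<p>With the sample above this gives roughly 40MB/s of reads, about two queued requests per disk, and an estimated await of about 11.2ms, close to the 11.17ms that iostat reported.</p>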
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2009/03/11/iostat/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>On SSDs, rotations and I/O</title>
		<link>http://mituzas.lt/2008/11/09/on-ssd-io/</link>
		<comments>http://mituzas.lt/2008/11/09/on-ssd-io/#comments</comments>
		<pubDate>Sun, 09 Nov 2008 14:05:18 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[ssd]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=266</guid>
		<description><![CDATA[Every time anyone mentions SSDs, I have a feeling of futility and being useless in near future. I have spent way too much time to work around limitations of rotational media, and understand the implications of whole vertical data stack &#8230; <a href="http://mituzas.lt/2008/11/09/on-ssd-io/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Every time anyone mentions SSDs, I have a feeling of futility and of being useless in the near future. I have spent way too much time working around the limitations of rotational media, and understanding the implications for the whole vertical data stack on top. </p>
<p>The most interesting upcoming shift is not only the speed difference, but simply a different cost balance between reads and writes. With modern RAID controllers, modern disks and modern filesystems, reads are a way more expensive operation from the application perspective than writes. </p>
<p>Writes can be merged at application and OS level, buffered at I/O controller level, and even sped up by on-disk volatile cache (NCQ write reordering can give +30% faster random write performance). </p>
<p>Reads have none of that. Of course, there&#8217;re caches, but they don&#8217;t speed up actual read operations, they just help to avoid them. This leads to very disproportionate amount of caches needed for reads, compared to writes. </p>
<p>Simply, a 32GB system with MySQL/InnoDB will be wasting 4GB on mutexes (argh!!..), a few more gigs on the data dictionary (arghhh #2), and everything else on read caching inside the buffer pool. There may be some dirty pages and adaptive hash or insert buffer entries, but they are all there not because systems lack write output capacity, but simply because of the braindead InnoDB page flushing policy. </p>
<p>Also, database write performance is mostly impacted not because of actual underlying write speed, but simply because every write has to read from multiple scattered places to actually find what needs to be changed.</p>
<p>This is where SSDs matter &#8211; they will have same satisfactory write performance (and fixes for InnoDB are out there ;-) &#8211; but the read performance will be <i>satisfactory</i> (uhm, much much better) too. </p>
<p>What this would mean for MySQL use:</p>
<ul>
<li>Buffer pool, key cache, read-ahead buffers &#8211; all gone (or drastically reduced).</li>
<li>Data locality wouldn&#8217;t matter that much anymore either, covering indexes would provide just double the performance, rather than up to a 100x speed increase.</li>
<li>Re-reading data may be cheaper than including it in various temporary sorting and grouping structures</li>
<li>RAIDs no longer needed (?), though RAM-backed write-behind caching would still be necessary</li>
<li>Log-based storage designs like PBXT will make much more sense</li>
<li>Complex data flushing logic like InnoDB&#8217;s will not be useful anymore (one can say, it is useless already ;-) &#8211; and straightforward methods such as in Maria are welcome again.</li>
</ul>
<p>Probably the happiest camp out there are PostgreSQL people &#8211; data locality issues were plaguing their performance most, and it is strong side of InnoDB. On the other hand, MySQL has pluggable engine support, so it may be way easier to produce SSD versions for anything we have now, or just use new ones (hello, Maria!). </p>
<p>Still, there is quite some work to do to adapt to the new storage model, and judging by how InnoDB works with modern rotational media, we will need a very strong push to adapt it for the new stuff. </p>
<p>You can sense the futility of any work done to optimize for rotation &#8211; all the &#8220;make reads fast&#8221; techniques will end up resolved at hardware layer, and the human isn&#8217;t needed anymore (nor all these servers with lots of memory and lots of spindles). </p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2008/11/09/on-ssd-io/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Notes from land of I/O</title>
		<link>http://mituzas.lt/2008/08/11/notes-from-land-of-io/</link>
		<comments>http://mituzas.lt/2008/08/11/notes-from-land-of-io/#comments</comments>
		<pubDate>Mon, 11 Aug 2008 10:52:00 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[directio]]></category>
		<category><![CDATA[innodb]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[jfs]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[xfs]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=184</guid>
		<description><![CDATA[A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking program (where all configuration is in source file, and I/O modes can be changed by editing various places in code ;-), &#8230; <a href="http://mituzas.lt/2008/08/11/notes-from-land-of-io/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A discussion on IRC sparked some interest in how various I/O things work in Linux. I wrote a small microbenchmarking <a href='http://noc.wikimedia.org/~midom/raidbench.c.txt'>program</a> (where all configuration is in the source file, and I/O modes can be changed by editing various places in code ;-), and started playing with performance.</p>
<p>The machine for this testing was a 16-disk RAID10 box with a 2.6.24 kernel. I tried to understand how O_DIRECT works and how fsync() works, and ended up digging into some other stuff.</p>
<p>My notes for now are:</p>
<ul>
<li>O_DIRECT serializes writes to a file on ext2, ext3, jfs, so I got at most 200-250w/s.</li>
<li>xfs allows parallel (and out-of-order, if that matters) DIO, so I got 1500-2700w/s (depending on file size &#8211; seek time changes.. :) of random I/O without write-behind caching. There are a few outstanding bugs that lock this back down to 250w/s (<i>#xfs@freenode: &#8220;yeah, we drop back to taking the i_mutex in teh case where we are writing beyond EOF or we have cached pages&#8221;</i>), so
<pre>posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED)</pre>
helps.</li>
<li>fsync(), sync() and fdatasync() wait if there are any writes in flight; the bad part &#8211; they can wait forever. Filesystem people say that&#8217;s a bug &#8211; a sync shouldn&#8217;t wait for I/O that was issued after the sync was called. I tend to believe them, as it causes stuff like InnoDB semaphore waits and such. </li>
</ul>
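<p>For those who prefer to poke at this from a script, the posix_fadvise() call mentioned above can be sketched like this. It assumes Linux and Python 3.3 or later, where <code>os.posix_fadvise</code> is available; the helper name <code>drop_cached_pages</code> and the temporary file are made up for illustration and are not part of the benchmark program.</p>

```python
import os
import tempfile


def drop_cached_pages(path):
    """Flush a file and ask the kernel to drop its cached pages.

    Returns the number of bytes the advice was applied to.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        size = os.fstat(fd).st_size
        os.fsync(fd)  # dirty pages are not dropped, so write them out first
        os.posix_fadvise(fd, 0, size, os.POSIX_FADV_DONTNEED)
        return size
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 8192)
    try:
        print(drop_cached_pages(tmp.name), "bytes advised")
    finally:
        os.unlink(tmp.name)
```

<p>The advice is only a hint to the kernel, so this shows the call, not a guarantee that every cached page is gone afterwards.</p>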
<p>Of course, having write-behind caching at the controller (or disk, *shudder*) level allows filesystems to be lazy (and benchmarks are no longer that different), but having the upper layers work efficiently is quite important too, to avoid bottlenecks. </p>
<p>It is interesting that write-behind caching isn&#8217;t needed that much anymore for random writes, once the filesystem parallelizes I/O, even direct, nonbuffered I/O. </p>
<p>Anyway, now that I have found some of the I/O properties and issues, I should probably start thinking about how they apply to the upper layers like InnoDB.. :) </p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2008/08/11/notes-from-land-of-io/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>On blocking</title>
		<link>http://mituzas.lt/2008/06/20/on-blocking/</link>
		<comments>http://mituzas.lt/2008/06/20/on-blocking/#comments</comments>
		<pubDate>Fri, 20 Jun 2008 08:44:00 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[rsync]]></category>
		<category><![CDATA[tcp]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=162</guid>
		<description><![CDATA[If a process has two blocking operations, each blocking other (like, I/O and networking), theoretical performance decrease will be 50%. Solution is very easy &#8211; convert one operation (quite often the one that blocks less, but I guess it doesn&#8217;t &#8230; <a href="http://mituzas.lt/2008/06/20/on-blocking/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>If a process has two blocking operations, each blocking the other (like I/O and networking), the theoretical performance decrease will be 50%. The solution is very easy &#8211; convert one operation (quite often the one that blocks less, but I guess it doesn&#8217;t matter that much) into a nonblocking one. </p>
<p>Though MySQL has a network write buffer, which provides some async network behavior, it still needs a context switch into the thread to write stuff. </p>
<p>rsync and other file transfer protocols are even worse in this regard. On a regular Linux machine rsync, even on a gigabit network, will keep the kernel&#8217;s send queue saturated (it is 128K by default anyway). </p>
<p>How to make MySQL&#8217;s or rsync&#8217;s networking snappier? If the Send-Q column in &#8216;netstat&#8217; is maxed out &#8211; just increase kernel buffers, instead of process buffers:</p>
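<p>A back-of-the-envelope model of that 50% figure, as a sketch only: it assumes each unit of work needs one I/O step and one network step with fixed costs, and the function names are made up for illustration.</p>

```python
def serial_time(io_s, net_s):
    """Both operations block each other: each step waits for the other to finish."""
    return io_s + net_s


def overlapped_time(io_s, net_s):
    """One operation is nonblocking, so the slower one hides the faster one."""
    return max(io_s, net_s)


def throughput_loss(io_s, net_s):
    """Fraction of throughput lost by running serially instead of overlapped."""
    return 1.0 - overlapped_time(io_s, net_s) / serial_time(io_s, net_s)


if __name__ == "__main__":
    # Equal costs are the worst case: half of the throughput is lost.
    print(throughput_loss(1.0, 1.0))  # 0.5
    # Unequal costs lose less.
    print(throughput_loss(3.0, 1.0))  # 0.25
```

<p>So 50% is the upper bound, reached when both operations block for the same amount of time.</p>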
<pre>
# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
</pre>
<p>This can add 10-20% of file transfer throughput (and sendq goes up to 500k &#8211; so it seems to be really worth it).</p>
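<p>Before changing anything it helps to see what the kernel currently uses. A small sketch, assuming a Linux box where these settings are exposed under /proc/sys; the helper names are made up for illustration:</p>

```python
from pathlib import Path

# The same four settings as in the sysctl fragment above, as /proc/sys paths.
SYSCTLS = (
    "net/core/rmem_max",
    "net/core/wmem_max",
    "net/ipv4/tcp_rmem",
    "net/ipv4/tcp_wmem",
)


def parse_sysctl(text):
    """Turn a sysctl value such as '4096 65536 16777216' into a tuple of ints."""
    return tuple(int(field) for field in text.split())


def read_tcp_buffer_limits(root="/proc/sys"):
    """Return the current values of the settings above, skipping missing ones."""
    limits = {}
    for name in SYSCTLS:
        path = Path(root) / name
        if path.exists():
            limits[name.replace("/", ".")] = parse_sysctl(path.read_text())
    return limits


if __name__ == "__main__":
    for key, values in read_tcp_buffer_limits().items():
        print(key, "=", *values)
```

<p>The output uses the same key = value form as the fragment above, which makes the two easy to compare.</p>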
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2008/06/20/on-blocking/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Shameless ad</title>
		<link>http://mituzas.lt/2008/05/15/sun-fire-x4240/</link>
		<comments>http://mituzas.lt/2008/05/15/sun-fire-x4240/#comments</comments>
		<pubDate>Thu, 15 May 2008 08:30:00 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[raid]]></category>
		<category><![CDATA[server]]></category>
		<category><![CDATA[x4240]]></category>

		<guid isPermaLink="false">http://dammit.lt/?p=104</guid>
		<description><![CDATA[&#8220;The Sun Fire X4240, powered by the AMD Opteron 2200 and 2300 processor series, is a two-socket, 8-core, 2RU system with up to twice the memory and storage capacity of any system in its class. It&#8217;s the first and only &#8230; <a href="http://mituzas.lt/2008/05/15/sun-fire-x4240/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<blockquote><p>&#8220;<a href="http://www.sun.com/servers/x64/x4240/">The Sun Fire X4240</a>, powered by the AMD Opteron 2200 and 2300 processor series, is a two-socket, 8-core, 2RU system with up to twice the memory and storage capacity of any system in its class. It&#8217;s the first and only two-socket AMD Opteron system with sixteen hard drive slots in a 2RU form factor.&#8221;</p></blockquote>
<p>Well, now that I work for Sun, it ends up being a shameless ad and boasting :) But back when I saw information about this product, my first thought was &#8220;wow, that&#8217;s the best machine for scaling up scaled out environments!&#8221;. </p>
<p>In the web database world people agree that the number of spindles (disks!) matters &#8211; remember YouTube&#8217;s &#8220;think disks, not servers&#8221; mantra said during the scaling panel at the MySQL conference. Before, getting such a number of spindles would&#8217;ve required external arrays taking space and sucking power (TCO! ;-)</p>
<p>And for us&#8230; it probably means we can finally start doing RAID10, instead of RAID0. :-)</p>
<p>By the way, that box even has Quad-Core service processor. Way to go! :) </p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2008/05/15/sun-fire-x4240/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>I/O schedulers seriously revisited</title>
		<link>http://mituzas.lt/2008/02/05/linux-io-schedulers/</link>
		<comments>http://mituzas.lt/2008/02/05/linux-io-schedulers/#comments</comments>
		<pubDate>Tue, 05 Feb 2008 10:41:00 +0000</pubDate>
		<dc:creator>Domas Mituzas</dc:creator>
				<category><![CDATA[mysql]]></category>
		<category><![CDATA[cfq]]></category>
		<category><![CDATA[deadline]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[scheduler]]></category>

		<guid isPermaLink="false">http://dammit.lt/2008/02/05/linux-io-schedulers/</guid>
		<description><![CDATA[The I/O scheduler problems have drawn my attention, and besides trusting empirical results, I tried to do more of benchmarking and analysis, why the heck strange things happen at Linux block layer. So, here is the story, which I found &#8230; <a href="http://mituzas.lt/2008/02/05/linux-io-schedulers/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The <a href='http://dammit.lt/2007/12/13/io-scheduler-deadline-cfq/'>I/O scheduler problems</a> have drawn my attention, and besides trusting empirical results, I tried to do more benchmarking and analysis of why the heck strange things happen at the Linux block layer. So, here is the story, which I myself found quite fascinating&#8230;<br />
<span id="more-97"></span></p>
<p>The synonym for &#8216;i/o scheduler&#8217; inside Linux kernel is &#8216;elevator&#8217; &#8211; and it helps to explain many things. The physical movement needs of disk spindles are similar to how elevators move people in tall buildings:</p>
<ul>
<li>Trying to go in a single direction, just &#8216;up&#8217; or &#8216;down&#8217;, and grabbing all the people on the way.</li>
<li>At every floor where it stops, waiting for more people to show up, not closing doors immediately. </li>
<li>The fuller the elevator cabin is, the more efficient it is in terms of transported people.</li>
<li>The fuller the elevator cabin is, the more annoying it is for the people in it &#8211; stopping at every floor, then waiting, people start hitting the &#8216;close door&#8217; button nervously. </li>
<li>Buildings solve this by having more elevators, or sophisticated queueing systems for getting into them</li>
</ul>
<p>Now, imagine a huge hotel building that is hosting a huge convention of privacy-worshippers. Or just misanthropes. They will never get into an elevator until they know that it is empty, and that the human who was traveling before got out of the elevator safely. Essentially, that&#8217;s how database transaction serialization works. </p>
<p>That&#8217;s where smart elevators fail &#8211; they immediately notice that all writes are going to the same location and prefer to wait for more requests &#8211; and merge them. Though, whenever an elevator decides to wait, nothing happens &#8211; there is a global lock inside the database engine, which tells it not to write until the first write finishes.<br />
The scheduler waits, decides that it has waited too long, flushes the write, gets another request, notices that the write goes to the same location and that there might be a chance to merge subsequent requests, which&#8230; do not come in again.</p>
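<p>A toy model of why the waiting does not pay off. This is not the kernel&#8217;s elevator code, just a sketch of merging adjacent requests under the assumption that a request is a (start sector, sector count) pair; <code>merge_adjacent</code> is a made-up name.</p>

```python
def merge_adjacent(requests):
    """Merge (start_sector, sector_count) requests that touch or overlap.

    Returns (merged_requests, merge_count).
    """
    merged = []
    merges = 0
    for start, count in sorted(requests):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # This request begins inside or right after the previous one.
            last_start, last_count = merged[-1]
            new_end = max(last_start + last_count, start + count)
            merged[-1] = (last_start, new_end - last_start)
            merges += 1
        else:
            merged.append((start, count))
    return merged, merges


if __name__ == "__main__":
    writes = [(0, 8), (8, 8), (16, 8), (100, 8)]
    # All four requests queued at once: the adjacent ones collapse into one.
    print(merge_adjacent(writes))  # ([(0, 24), (100, 8)], 2)
    # Serialized writer: the queue only ever holds one request, nothing to merge.
    print(sum(merge_adjacent([w])[1] for w in writes))  # 0
```

<p>The second case is the database one: each write arrives only after the previous one has completed, so there is never a neighbour in the queue to merge with, no matter how long the scheduler waits.</p>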
<p>And the solution is &#8211; using teleports. Well, at least treating database writes as instant accesses, not caring about order, waiting, just doing everything as soon as possible. </p>
<p>To demonstrate this I went to the world of edge cases &#8211; made a performance test, which maybe shouldn&#8217;t be called a &#8216;benchmark&#8217;. I created a very simple table and started spamming rows into it, each as a separate transaction. The hardware for the test was a &#8216;regular&#8217; DB box, 8 disks, write-behind cache on the RAID controller, 16GB of memory, 8 cores &#8211; not my desktop or laptop. The sole idea of the test was finding how different I/O modes affect the ability to write to the I/O controller as fast as possible &#8211; resulting in transaction throughput. </p>
<p>Here are some results:</p>
<table class='benchmarks'>
<tr>
<td>&nbsp;</td>
<td colspan='2'>1 thread</td>
<td colspan='2'>8 threads</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>tps</td>
<td>ioutil%</td>
<td>tps</td>
<td>ioutil%</td>
</tr>
<tr>
<td>flush=0, deadline</td>
<td>15000</td>
<td>&lt;1%</td>
<td>19100</td>
<td>&lt;1%</td>
</tr>
<tr>
<td>deadline</td>
<td>6850</td>
<td>32%</td>
<td>15250</td>
<td>45%</td>
</tr>
<tr>
<td>cfq</td>
<td>5880</td>
<td>45%</td>
<td><b>1240</b></td>
<td>94%</td>
</tr>
</table>
<p>Here, CFQ had a huge regression at higher concurrency; Anticipatory actually showed similar, slightly better results. NOOP showed similar results to deadline, much faster at a single thread, slightly slower at multiple. </p>
<p>So, whichever decisions CFQ takes during this test, they must all be wrong &#8211; with multiple disks and a RAID controller handling the flushing of the write cache there is no need for elevation or request merging, multiple tagged commands can be sent to the I/O subsystem, and they will be executed swiftly. It is supposed to be a scheduler good for most workloads, and it probably is. But high-performance databases rely on storage being fast, and to enforce ACID requirements, synchronous operations should not wait forever (or wait when not necessary).</p>
<p>Of course, CFQ may provide better performance on systems that do extensive I/O scanning, so for folks with slow queries it may end up providing more throughput, as it will tolerate delaying small things for big things to get through &#8211; deadline would not care about fairness and would try getting everything done as quickly as possible. </p>
<p>The worst part is that Deadline is enabled by default only on community distributions, like Ubuntu Server (though the Ubuntu desktop kernel has CFQ by default). So, most people will end up having an anti-database scheduler for ages, and will rarely get into the internals of the whole stack, or analysis of the performance profile. Switching to another scheduler is a matter of a single command (though it managed to crash my system once :)</p>
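<p>For checking which scheduler a disk is using, here is a small sketch. It assumes the usual sysfs layout, where /sys/block/&lt;device&gt;/queue/scheduler lists the available schedulers and marks the active one with square brackets; the device name <code>sda</code> and the helper names are only examples.</p>

```python
from pathlib import Path


def parse_schedulers(text):
    """Parse e.g. 'noop anticipatory deadline [cfq]' into (available, active)."""
    available, active = [], None
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return available, active


def scheduler_path(device):
    return Path("/sys/block") / device / "queue" / "scheduler"


def current_scheduler(device="sda"):
    """Read the scheduler list for one block device."""
    return parse_schedulers(scheduler_path(device).read_text())


def set_scheduler(device, name):
    """Needs root. Writes the scheduler name into the sysfs file."""
    scheduler_path(device).write_text(name)


if __name__ == "__main__":
    print(parse_schedulers("noop anticipatory deadline [cfq]"))
```

<p>Writing a name into that file takes effect immediately and does not survive a reboot, so it is a cheap way to compare schedulers on a test box before changing boot parameters.</p>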
<p>Must read: <a href='http://kerneltrap.org/node/7637'>Jens Axboe on his block layer work</a>, and my previous <a href='http://dammit.lt/2007/12/13/io-scheduler-deadline-cfq/'>rant</a> :)</p>
]]></content:encoded>
			<wfw:commentRss>http://mituzas.lt/2008/02/05/linux-io-schedulers/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>
