Posts Tagged ‘performance’

again, on benchmarks

Thursday, November 19th, 2009

Dear interweb, if you have no idea what you’re writing about, keep it to yourself, don’t litter into the tubes. Some people may not notice they’re eating absolute crap and get diarrhea.

This particular benchmark has two favorite parts, that go with each other together really well:

I didnt change absolutely any parameters for the servers, eg didn’t change the innodb_buffer_pool_size or key_buffer_size.

And..

If you need speed just to fetch a data for a given combination or key, Redis is a solution that you need to look at. MySQL can no way compare to Redis and Memcache. …

Seriously, how does one repay for all the damage of such idiotic benchmarks?

P.S. I’ve ranted at benchmarks before, and will continue doing so.

On file system benchmarks

Tuesday, June 30th, 2009

I see this benchmark being quoted in multiple places, and there I see stuff like:

When carrying out more database benchmarking, but this time with PostgreSQL, XFS and Btrfs were too slow to even complete this test, even when it had been running for more than an hour for a single run. Between EXT3, EXT4, and NILFS2, the fastest file-system was EXT3 and then its successor, EXT4, was slightly behind that. Far behind the position of EXT4 were NILFS2 and then Btrfs and XFS.

There were few other benchmarks, e.g. SQLite showed ‘bad performance’ on XFS and Btrfs.

*clear throat*

Dear benchmarkers, don’t compare apples and oranges. If you see differences between benchmarks, do some very very tiny research, and use some intellect, that you, as primates, do have. If database tests are slowest on filesystems created by Oracle (who know some stuff about systems in general) or SGI (who, despite giving away their campus to Google, still have lots of expertise in the field), that can indicate, that your tests are probably flawed somewhere, at least for that test domain.

Now, probably you’ve heard about such thing as ‘data consistency’. That is something what database stack tries to ensure, sometimes at higher costs, like not trusting volatile caches, enforcing certain write orders, depending on acknowledgements by underlying hardware.

So, in this case it wasn’t “benchmarking file systems”, it was simply, benchmarking “consistency” against “no consistency”. But don’t worry, most benchmarks have such flaws – getting numbers but not understanding them makes results much more interesting, right?

Oh, and… thanks for few more misguided people.

iostat -x

Wednesday, March 11th, 2009

My favorite Linux tool in DB work is ‘iostat -x’ (and I really really want to see whenever I’m doing any kind of performance analysis), yet I had to learn its limitations and properties. For example, I took 1s snapshot from a slightly overloaded 16-disk database box:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.12    0.00    2.57   21.65    0.00   67.66

Device:  rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s \
sda     7684.00    19.00 2420.00  498.00 81848.00  5287.00 \

        avgrq-sz avgqu-sz   await  svctm  %util
           29.86    32.99   11.17   0.34 100.00

I pasted this somewhere on IRC, and got “doesn’t look too healthy” and that it is disk-bound. Now, to understand if it really is, one has to understand what iostat tells here.

First line of numbers shows that we’ve got plenty of CPU resources (thats because nowadays it is quite difficult to get a box with not enough CPU power, and I/O still seems to be bottleneck) – and we have more threads waiting for I/O than we have CPU execution (that sounds normal).

Now the actual per-disk statistics are where one should look. I used to prefer %util over general %iowait (I couldn’t really explain what %iostat is, and I can say what %util is). I don’t know why, but iostat has most interesting bits at the end, and not so interesting at the start:

  • %util: how much time did the storage device have outstanding work (was busy). In proper RAID environments it is more like “how much time did at least one disk in RAID array have something to do”. I’m deliberately excluding any kind of cache here – if request can be served from cache, the chance is quite negligible it will show up in %util, unlike in other values. What this also means – the RAID subsystem can be loaded from 6.25% (one disk doing the work) to 100% (all of them busy). Thats quite a lot of insight in single value of ‘100%’, isn’t it?
  • svctm: Though manual says “The average service time (in milliseconds) for I/O requests that were issued to the device.”, it isn’t exactly that when you look at multiple-disk systems. What it says is, “when your I/O subsystem is busy, how fast does it respond requests overall”. Actually, less you load your system, higher svctm is (as there’re less outstanding requests, and average time to serve them goes up). Of course, at some certain moment, when I/O becomes really overloaded, you can see svctm going up. One can tweak /sys/block/sda/queue/nr_requests based on this – to avoid overloading I/O controller, though that is really rarely needed.
  • await. One of my favorites – how fast do requests go through. It is just an average, how long it takes to serve a request for a device, once it gets into device queue, to final “OK”. Low = good, high = bad. There’re few gotchas here – even though different reads can have different performance properties (middle of disk, outer areas of disk, etc), the biggest difference is between reads and writes. Reads take time, writes can be instant (write caching at underlying layers..). As 80% of requests were reads, we can try to account for that by doing 11.17/0.8 math, to get 14ms figure. Thats quite high – systems that aren’t loaded can show ~5ms times (which isn’t that far away from 4ms rotation time of 15krpm disk).
  • avgqu-sz: Very very very important value – how many requests are there in a request queue. Low = either your system is not loaded, or has serialized I/O and cannot utilize underlying storage properly. High = your software stack is scalable enough to load properly underlying I/O. Queue size equal to amount of disks means (in best case of request distribution) that all your disks are busy. Queue size higher than amount of disks means that you are already trading I/O response time for better throughput (disks can optimize order of operations if they know them beforehand, thats what NCQ – Native Command Queueing does). If one complains about I/O performance issues when avgqu-sz is lower, then it is application specific stuff, that can be resolved with more aggressive read-ahead, less fsyncs, etc. One interesting part – avqu-sz, await, svctm and %util are iterdependent ( await = avgqu-sz * svctm / (%util/100)
  • avgrq-sz: Just an average request size. Quite often will look like a block size of some kind – can indicate what kind of workload happens. This is already post-merging, so lots of adjacent block operations will bump this up. Also, if database page is 16k, though filesystem or volume manager block is 32k, this will be seen in avgrq-sz. Large requests indicate there’s some big batch/stream task going on.
  • wsec/s & rsec/s: Sectors read and written per second. Divide by 2048, and you’ll get megabytes per second. I wanted to write this isn’t important, but remembered all the non-database people who store videos on filesystems :) So, if megabytes per second matter, these values are important (and can be seen in ‘vmstat’ output too). If not, for various database people there are other ones:
  • r/s & w/s: Read and write requests per second. This is already post-merging, and in proper I/O setups reads will mean blocking random read (serial reads are quite often merged), and writes will mean non-blocking random write (as underlying cache can allow to serve the OS instantly). These numbers are the ones that are the I/O capacity figures, though of course, depending on how much pressure underlying I/O subsystem gets (queue size!), they can vary. And as mentioned above, on rotational media it is possible to trade response time (which is not that important in parallel workloads) for better throughput.
  • rrqm/s & wrqm/s: How many requests were merged by block layer. In ideal world, there should be no merges at I/O level, because applications would have done it ages ago. Ideals differ though, for others it is good to have kernel doing this job, so they don’t have to do it inside application. Quite often there will be way less merges, because applications which tend to write adjacent blocks, also tend to wait after every write (see my rant on I/O schedulers). Reads however can be merged way easier – especially if application does “read ahead” block by block. Another reason for merges is simple block size mismatch – 16k database pages on top of 8k database pages will cause adjacent block reads, which would be merged by block layer. On some systems read of two adjacent pages would result in 1MB reads, but thats another rant :)
  • Device: – just to make sure, that you’re looking at the right device. :-)

So, after all this, the iostat output above tells us something like:

  • System has healthy high load (request queue has two-requests-per-disk)
  • Average request time is double the value one would expect from idle system, it isn’t too harmful, but one can do better
  • It is reading 80 40MB/s from disks, at 2420 requests/s. Thats quite high performance from inexpensive 2u database server (shameless plug: X4240 :)
  • High amount of merges comes from LVM snapshots, can be ignored
  • System is alive, healthy and kicking, no matter what anyone says :)

Eyecandy mutexes!

Sunday, February 15th, 2009

In my quest of making MySQL usable, I managed to hit contention that wasn’t spotted by performance masters before. Meet most useless mutex ever (this is actual contention event, not just a hold):

Count     nsec Lock
 1451   511364 mysqld`ut_list_mutex       

      nsec ---- Distribution --- count    Stack
      2048 |@@                 |   132    libc.so.1`mutex_lock_impl
      4096 |@@@                |   186    libc.so.1`mutex_lock
      8192 |                   |     3    ut_malloc_low
     16384 |@@                 |   137    mem_heap_create_block
     32768 |@                  |   105    row_sel_store_mysql_rec
     65536 |@                  |    89    row_search_for_mysql
    131072 |@                  |    99    ha_innobase::general_fetch
    262144 |@@                 |   131    ha_innobase::rnd_next
    524288 |@@@                |   195    rr_sequential
   1048576 |@@@                |   223
   2097152 |@@                 |   137
   4194304 |                   |    14
----------------------------------------------------------------

ut_list_mutex guards a memory structure which has all memory blocks allocated by InnoDB (via ut_malloc/ut_free) in it.
It has two uses:

  1. Printing “Total memory allocated” in SHOW INNODB STATUS (though this can still be implemented lock-free)
  2. Deallocating all memory on shutdown (though, all modern operating systems do that anyway, so this is purely just to shut up valgrind)

If you have any BLOB/TEXT data in your tables, you’re definitely hitting this contention spot (it is #1 contention in such cases).

Fix? Kill the eyecandy, replace ut_malloc and ut_free with direct calls to malloc() and free(), oh and of course, use scalable allocators like tcmalloc or Hoard.

On SSDs, rotations and I/O

Sunday, November 9th, 2008

Every time anyone mentions SSDs, I have a feeling of futility and being useless in near future. I have spent way too much time to work around limitations of rotational media, and understand the implications of whole vertical data stack on top.

The most interesting upcoming shift is not only the speed difference, but simply different cost balance between reads and writes. With modern RAID controllers and modern disks and modern filesystems reads are way more expensive operation from application perspective than writes.

Writes can be merged at application and OS level, buffered at I/O controller level, and even sped up by on-disk volatile cache (NCQ write reordering can give +30% faster random write performance).

Reads have none of that. Of course, there’re caches, but they don’t speed up actual read operations, they just help to avoid them. This leads to very disproportionate amount of caches needed for reads, compared to writes.

Simply, 32GB system with MySQL/InnoDB will be wasting 4GB on mutexes (argh!!..), few more gigs on data dictionary (arghhh #2), and everything else for read caching inside buffer pool. There may be some dirty pages and adaptive hash or insert buffer entries, but they are all there not because systems lack write output capacity, but simply because of braindead InnoDB page flushing policy.

Also, database write performance is mostly impacted not because of actual underlying write speed, but simply because every write has to read from multiple scattered places to actually find what needs to be changed.

This is where SSDs matter – they will have same satisfactory write performance (and fixes for InnoDB are out there ;-) – but the read performance will be satisfactory (uhm, much much better) too.

What this would mean for MySQL use:

  • Buffer pool, key cache, read-ahead buffers – all gone (or drastically reduced).
  • Data locality wouldn’t matter that much anymore either, covering indexes would provide just double performance, rather than up to 100x speed increase.
  • Re-reading data may be cheaper, than including it in various temporary sorting and grouping structures
  • RAIDs no longer needed (?), though RAM-backed write-behind caching would still be necessary
  • Log-based storage designs like PBXT will make much more sense
  • Complex data flushing logic like inside InnoDB’s will not be useful anymore (one can say, it is useless already ;-) – and straightforward methods such as in Maria are welcome again.

Probably the happiest camp out there are PostgreSQL people – data locality issues were plaguing their performance most, and it is strong side of InnoDB. On the other hand, MySQL has pluggable engine support, so it may be way easier to produce SSD versions for anything we have now, or just use new ones (hello, Maria!).

Still, there is quite some work to adapt to the new storage model, and judging by the fact how InnoDB works with modern rotational media, we will need some very strong push to adapt it for the new stuff.

You can sense the futility of any work done to optimize for rotation – all the “make reads fast” techniques will end up resolved at hardware layer, and the human isn’t needed anymore (nor all these servers with lots of memory and lots of spindles).

On XFS write barriers

Monday, November 3rd, 2008

I’m very naive, when I have to trust software. I just can’t believe a filesystem may have a tunable that makes it 20x faster (or rather, doesn’t make it 20x slower). I expect it to work out of the box. So, I was pondering, why in my testing XFS on LVM flushes data ~20x faster than on a box where it talks directly to device. Though I have noticed some warnings before, people on #xfs pointed out that LVM doesn’t support write barriers.

So, as I had no idea what write barriers are, had to read up a bit on that. There is a very nice phrase in there regarding battery-backed write-behind caching:

Using write barriers in this instance is not warranted and will in fact lower performance. Therefore, it is recommended to turn off the barrier support and mount the filesystem with “nobarrier”.

No shit, 20x lower performance :) As usually, I was not the only one to spot that..

So, I just ran this:

mount -o remount,nobarrier /a

And InnoDB flushed pages at 80MB/s instead of 4MB/s.

Update (2009/03): 2.6.29 kernel will support write barriers for LVM too – so XFS performance degradation is very much expected at very very wide scope. Also ext4 uses write barriers by default too. This thing is getting huge.

Progress in percents: 0 1 2 3 …

Sunday, October 26th, 2008

Well, servers usually don’t crash ( our English Wikipedia master is running for 800 days, on white-box hardware, RAID0, 4.0 ;-), but when they do (like some kernel bugs on our big big boxes), one of most painful experiences is InnoDB log recovery.

Usually people will reduce the innodb-log-file-size to ease up with the recovery (it helps, in a way :), but the real problem is somewhere else.

See, when InnoDB does crash recovery, it applies the log changes in memory, and builds a flush list. It doesn’t flush any pages during the recovery process, so the flush list grows big, thousands, tens of thousands, maybe millions kind of big, anyway, big-number big.

Oh, did I mention? The flush list is actually a linked list, not some kind of hippy tree stuff. Every time a log record is read from a log and something gets updated, the flush list will be traversed, thousands, tens of thousands, maybe millions of entries.

The expensive code looks something like this:

while (b && (ut_dulint_cmp(b->oldest_modification,
             block->oldest_modification) > 0)) {
       prev_b = b;
       b = UT_LIST_GET_NEXT(flush_list, b);
}

Then your profile starts looking like this, and you wish your systems didn’t crash:

%        symbol name
87.6978  buf_flush_insert_sorted_into_flush_list
 5.8571  -kernel
 1.9793  recv_apply_hashed_log_recs
 0.8059  buf_calc_page_new_checksum

So, the recovery process cost is exponential, and people work around it by reducing the log file size, by reducing performance of their system, while the actual fix is right there, in optimizing the data structure. Current model is outdated for anything built in last 5 years anyway.

Oh, and of course, I’d like systems not to crash at all, like that database master on whitebox raid0 running for 800 days.

Update: this is old stuff. Peter wrote about it, Heikki opened a bug, then thought it would need more than five minutes to fix it and classified it as a feature request, so Peter could write more about it. That makes it even more sad. We’d probably change the synopsis for the feature request, “make crash recovery work”.

Update 2: get the patch at Percona (Yasufumi is god :)

dtrace!

Friday, October 3rd, 2008

At the MySQL developer conference I accidently showed up some of things we’ve been doing with dtrace (I used it in few cases and realized the power it has), and saw some jaws drop. Then I ended up doing small demos around the event. What most people know about dtrace, is that there’re some probes and you can trace them. What people don’t know is that you can actually create lots of probes dynamically, and use them with lots of flexibility.

One of major things not really grasped by many is that dtrace is a combination of a tracing tool, debugger, programming language and a database, having minor, but very valuable functionality for each. It can attach to any place in code, it can get stacks, function arguments, traverse structures, do some evaluations, aggregate data, and in the end – thats all compiled code executed by kernel (or programs). 

Sometimes a probe may look not that useful (strace would provide file writes too?), but once combined with ability to get immediate stack, as well as set or read context variables (a previous probe on any other event could have saved some important information, e.g. host,user,table names, etc) – so final result may tell statistics correlated to many other activities. 

One developer (a traitor who has left support for easier life in engineering dept) listened to all this, and I asked what his current project was – apparently he was adding static dtrace probes to MySQL. It ended up being quite interesting discussion, as static probes provide two value points. First of all, it provides an interface – whereas dynamic probes can change with code changes (though, that doesn’t happen too often :) Second value – one can do additional calculations on a specific probe, which would be done only on-demand (when the probe is attached). 

So, having a static probe that directly maps to easy-mode dynamic one (it is straightforward to attach to a function, and quite easy to read its arguments), is a bit of waste (both in development time, as well as few instructions are actually written there). Dynamic tracing generally modifies binaries on fly – so it does not carry additional costs otherwise. Though an example where static probe would be awesome – having “query start” event, which would have query string canonized with all literals removed – this would allow on-demand query profiling for query groups, rather than stand-alone queries.

The other major value is ability to set thread-specific context variables in different probes, so they can read each other data. At the type of incoming packet one can tag the thread with whichever information needed – then any subsequent actions can reuse such information to filter out important events. That also removes the need of static probes providing multiple-layer information – it all can be achieved by chaining the events – without too much complexity. 

I took a bit of trollish stance when approached a developer implementing internal performance statistics. We were playing a game – he’d tell me what kind of performance information he’d like to extract, and I’d show a method to do that with dtrace. More people from monitoring field joined, and we ended up discussing what is the perfect performance monitoring and analysis system. It is quite easy to understand, that different people will need different kinds of metrics. For MySQL development work performance engineer will need mutex contention information, someone fixing a leak will need heap profiling, someone writing a feature will want an easy way to trace how server executes their code – and all that is way far from any needs actual user or DBA has. Someone who writes a query just wants to see the query plan with some easy-to-understand costs (just need to pump more steroids into EXPLAIN). DBAs may want to see resource consumption per-user, per-table, etc (something Google patch  provides). It is interesting to find a balance, between external tools and what should be supported out-of-the-box internally – and it is way easier to force internal crowd to have proper tools, and it is always nice to provide a much as possible instrumentation for anyone externally. 

Of course, there’s poor guy in the middle of two camps – a support engineer – who needs easy performance metrics to be accessible from clients, but needs way more depth than standard tools provide. In ideal case dtrace would be everywhere (someone recently said, thats one of coolest things Sun has ever brought) – then we’d be able to retrieve on-demand performance metrics from everywhere, and would be tempted to write DTraceToolkit  (a suite of programs that give lots and lots of information based on dtrace) like bunch of stuff for MySQL internals analysis.

I already made one very very simple tool  which visualizes dtrace output, so we can have graphviz based SVG callgraph for pretty much any type of probe (like, who in application does expensive file reads) – all from a single dtrace oneliner. It seems I can sell the tool to Sun’s performance engineering team – they liked it. :) 

Some people even installed Solaris afterwards for their performance tests. Great, I won’t have to (haha!).

Though lack of dtrace in Linux is currently a blocker for the technology, lots of engineers already have it on their laptops – MacOSX 10.5 ships it. It even has visual toolkit, that allows building some dtrace stuff in a GUI. 

I’m pretty sure now, any engineer would love dtrace (or dtrace based tools), they just don’t know that yet.

On blocking

Friday, June 20th, 2008

If a process has two blocking operations, each blocking other (like, I/O and networking), theoretical performance decrease will be 50%. Solution is very easy – convert one operation (quite often the one that blocks less, but I guess it doesn’t matter that much) into a nonblocking one.

Though MySQL has network-write buffer, which provides some async network behavior, it still has to get context switch into a thread to write stuff.

rsync and other file transfer protocols are even worse in this regard. On a regular Linux machine rsync even on gigabit network will keep kernel’s send-queue saturated (it is 128K by default anyway).

How to make MySQL’s or rsync networking snappier? If in ‘netstat’ sendq column is maxed out – just increase kernel buffers, instead of process buffers:

# increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
# min, default, and max number of bytes to use
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

This can add additional 10-20% of file transfer throughput (and sendq goes up to 500k – so it seems to be really worth it).

On checksums

Thursday, May 29th, 2008

InnoDB maintains two checksums per buffer pool block. Old formula of checksum, and new formula of checksum. Both are read, both are written. I guess this had to be some kind of transition period, but it obviously took too long (or was forgotten). Anyway, disabling checksums code entirely makes single-thread data load 7% faster – though in parallel activity locking contention provides with some extra CPU resources for checksum calculation.

Leaving just single version of checksum would cut this fat in half, without abandoning the feature entirely – probably worth trying.

Update: Benchmarked InnoDB checksum against Fletcher. Results were interesting (milliseconds for 10000 iterations):

Algorithm: InnoDB Fletcher
- 826 453
-O2: 316 133
-O3: 42 75

So, though using Fletcher doubles the performance, -O3 optimizes InnoDB checksumming much better. How many folks do run -O3 compiled mysqld?