Posts Tagged ‘innodb’

Read ahead…

Saturday, January 2nd, 2010

Mark wrote about how to find situations where InnoDB read-ahead is a bottleneck. What he didn’t disclose, though, is his trick to disable read-ahead without restart or recompile of MySQL. See, there’s no internal “disable read ahead knob”. But there is…

buf_read_ahead_random(...){ ...
       if (srv_startup_is_before_trx_rollback_phase) {
                /* No read-ahead to avoid thread deadlocks */
                return(0);
        }

This variable is tested at two functions – buf_read_ahead_linear() and buf_read_ahead_random() and nowhere else. So yeah, “server startup is before transaction rollback phase” is another way of saying “don’t do read ahead, please please”.

gdb -ex "set  srv_startup_is_before_trx_rollback_phase=1" \
    --batch -p $(pidof mysqld)

And many servers bottlenecked on this became much much much faster (and 2000 concurrent threads running dropped to 10). Of course, this is most visible in high-latency-high-throughput I/O situations, but we’re hitting this contention spot on local disk setups too.

Don’t forget to have the fix if gdb decides to be nasty and locks up your server :)

Opening tables!

Saturday, December 26th, 2009

There’s one bottleneck in MySQL/InnoDB that ultimately sucks. It sucked in 4.0, sucked in 5.0, sucks in 5.1 with newest InnoDB plugin. Opening tables has been a bottleneck on machines that have thousands of tables all the time (as LOCK_open is being held during the process), and while there was a table being opened, everything else would stall on the machine.

It can simply take hours on such systems just to open tables – and the major portion of time spent is randomly diving into InnoDB tables to populate index statistics. It obviously sounds like low hanging fruit – as statistics aren’t needed while you are opening a table, they’re needed just for querying the table.

So, I threw in few thousand tables to my machine, and tried opening them with ten connections. Standard InnoDB code was opening 13.5 tables a second. After spending few minutes and moving (this is pure prototype, not suitable for production) statistic collection post ha_innodb::open(), I noticed performance increase.

Tables were opened at 105-a-second speed. A bit better, ~8x better.

Merry Christmas, MySQL!

On deadlock detection

Monday, December 21st, 2009

InnoDB detects deadlocks. Deadlocks are those nasty situations, when transaction 1 tries to acquire locks A and B, whereas transaction 2 tries to acquire locks B and A at the same time. As both are stubborn, InnoDB will decide simply to terminate one of them. If it wouldn’t do that, both transactions would have to wait until lock_wait_timeout to expire otherwise. There is a big chance that longer the transaction is, more likely it is to cause deadlocks. Deadlock detection kind of helps, then, but… at certain costs.

Transaction 1 and 2 case is way too easy, try adding few hundred transactions that contend over same set of locks. To do that, InnoDB deadlock monitor will recursively brute-force lock graph, until it hits a 200-transaction-long chain (it will say it is a deadlock), or until it runs out of paths to check. Still, with the power of modern hardware that will still be milliseconds.

Unfortunately, InnoDB will also hold kernel_mutex at that time, so lots and lots of InnoDB operations will not happen at that time. To be exact, InnoDB will rarely do anything else, while deadlock check is happening.

To illustrate that, I have a very simple testcase (that in certain conditions stalls the server for half an hour, even if it is not being ran):

UPDATE t1 SET b=b+1 WHERE a=1;

With few threads it executes nearly 20000 times a second on my desktop machine. With ten threads it executes 14000/s. With 50 threads it is only 3000/s. With 100 threads it falls down to 639 operations a second. At 140 threads it is already just 266.

I built InnoDB without deadlock detection (tiny tiny patch), and tried same test. Similar performance with 10 threads, still doing 10000 operations a second at 100 threads:

Though I illustrated edge case here, its purity actually didn’t show how bad this can go – this situation can happen not only because of high contention on single row, but simply because someone holds up the row lock for a bit too long (there’s always that sleep between UPDATE and COMMIT, too). It can take a single transaction to cause a lock convoy, and once transactions queue up, and update rate falls down below 100/s, all MySQL will be doing is checking for deadlocks, even if they never happen.

On many systems deadlock detection is causing way more issues, than lack of it would. Most deadlocks happen on transactions that are somewhere in the middle of their lock wait anyway :)

There’s some discussion about it at MySQL Bug#49047

Crash recovery, again

Tuesday, December 8th, 2009

There’s one stage in InnoDB crash recovery where it reads log file, you know, this one:

InnoDB: Doing recovery: scanned up to log sequence number 354164119040
InnoDB: Doing recovery: scanned up to log sequence number 354169361920

On a machine with bigger logs it will start spending nearly 100% CPU somewhere in recv_scan_log_recs. Guess what it does…. -fno-inline builds to the rescue:

#0  mem_block_get_len at ./include/mem0mem.ic:86
#1  mem_heap_get_size at ./include/mem0mem.ic:591
#2  recv_scan_log_recs at log/log0recv.c:2727

And:

samples  %        symbol name
8467     72.9222  mem_heap_get_size
291       2.5062  recv_add_to_hash_table
95        0.8182  mem_block_get_len

To speak in layman’s terms, InnoDB does SUM(LENGTH(allocation)) on its relatively wide memory (tens, hundreds of thousands of entries) arena, FOR EVERY LOG SEGMENT, to make sure it didn’t run out of available 32GBs. Hehehe, lame.

As for now, I’ll just killed the check and have my recovery much much faster – finished in 3 minutes, what it wasn’t able to do in 30 before.

P.S. This is different from what I wrote before (and magic Yasufumi’s patch)
P.P.S. Now I got to learn to reuse LOG_DUMMY table during the recovery process, as it is next low hanging fruit there…

plugin and 5.1

Tuesday, August 18th, 2009

You can check it yourself – 5.1 seems to be shipped with InnoDB plugin in future :-) (oh the joy of open source repositories, always ready to spoil the surprise, eh?:)

Checksums again, some I/O too

Thursday, May 28th, 2009

When I was doing data loading tests, I realized that usually low checksum calculation CPU percentage is actually the blocking factor. See, usually when background writers do the flushing, it gets parallelized, but if active query is forcing a checkpoint, it all happens in ‘foreground’ thread, checksum computation included. This is where more Sun-ish wisdom (these people tune kernel with debugger all the time) comes in:

gdb -p $(pidof mysqld) -ex "set srv_use_checksums=0" --batch

Puff. Everything becomes much faster. Of course, one would be able to restart the server with –skip-innodb-checksums, but that would interrupt the whole process, etc. Of course, proper people would implement tunable parameter (5 lines of code, or so), but anyone with Solaris experience knows how to tune stuff with debuggers, hahaha.

Odd though, I was used to compiled -O3 mode optimizing checksums to disappear from profiles, so either this doesn’t work, or it just got so much emphasis to actually matter. This is why I told Heikki at the Users Conference, that checksums ‘must go’ or be fixed in one way or another. Even when they’re small part of operation and doesn’t cost anything if it doesn’t block primary operations (e.g. happens in parallel), people keep accessing SSDs, and then relative cost of checksum is insane.

It is quite easy to check that, just run InnoDB with tiny buffer pool, OS caching, and try some benchmarking with checksums enabled and disabled – there will be a huge difference, which may simply explain the performance difference of caching at buffer pool and OS buffers. Turn checksums off, and OS caching may be even tolerable for your workloads.

Other interesting issue here is that MarkC has lots of I/O path optimization in his fourth (and four is a good number!) patch. Now… maybe everyone will be soon caching pages in OS, once some more work is done in I/O access path cost work?

stop messing with the tablespace

Thursday, May 21st, 2009

People keep loving and endorsing the –innodb-file-per-table. Then poor new users read about that, get confused, start using –innodb-file-per-table, and tell others to. Others read then, get confused even more, and start using –innodb-file-per-table, then write about it. Then…

Oh well. Here, some endorsements and FUD against one-tablespace-to-unite-them-all:

This same nice property also translates to a not so nice one: data can be greatly fragmented across the tablespace.

Of course, having file-per-table will mean that only one table will be in a file, so, kind of, it will not be ‘mixed’… inside file. Now, when data grows organically (not when you restore few-hundred-gigabyte dump sequentially), all those files grow and start getting fragmented (at ratios depending on how smart filesystem is, and.. how many people choose smart filesystems nowadays?). So, disks will have same soup of data, just instead of ‘fragmentation’ inside tablespace which is laid out sequentially on a disk/filesystem, you get fragmentation at file system level. Now, InnoDB has extents and such, and can afford new storage formats more often than filesystems do, so… which one is more modern for proper data allocation strategies?

So, some more criticism of InnoDB tablespaces:

An annoying property of InnoDB’s tablespaces is that they never shrink.

Annoying are people who find this annoying. Look, it is simple, if your data grows to X, then you do something amazing and shrink your dataset, there’re two outstanding questioms:

  • How soon will your dataset grow back to X?
  • What temporary data will be placed there, until the moment dataset grows back to X?

It is very simple, database servers house data. Run OPTIMIZE, data will get eventually fragmented (quite fast, actually, at usual access patterns, as once you have lots of full pages around, a simple insertion will split pages). That ‘free space’ achieved does not bring too much value, it will be gone, and data will be placed there. Oh, well, and it _will_ be used by database, by _any_ table, not just the owner-table of a file.

It does not release storage to the file system.

*facepalm*. File system doesn’t want that storage. It will give it back to InnoDB as soon as it asks. :)

I’ve seen more than once how certain tables are left unwatched, growing until disk space reaches 90% and SMS notifications start beeping all around.

Where is that issue of data storage strategies? I have tablespaces taking 95%. When they will reach 99% I will kill the database servers and cycle them out to other tasks to deal with smaller datasets. There is nothing bad with data growing, as long as you know the reason. There is nothing wrong with tablespace growing, thats its job – to house the data you feed in.

But having a file which consumes some 80-90% of disk space is a performance catastrophe. It means the disk needle needs to move large distances. Overall disk performance runs very low.

[citation needed]. Which filesystem will not place data at those ‘long distances’ for the needle, when it gets full? Which filesystem will relocate the data to the start of disk? This is most poorly sourced statement, which spreads FUD, even though same data placement would happen pretty much with every storage strategy. If start area gets full, you have to place data elsewhere. End of tablespace, end of filesystem, end of disk, it is same thing.

Also, one can never be sure how underlying storage will be mapped to block device. Writing to the end can be fastest, you never know. Did you try? :)

mysqldump, mk-parallel-dump

Where is mydumper? ;-D

One other nice thing about innodb_file_per_table is that it is possible to monitor table size on the file system level.

Oh sure, thats just what you need, give people who deal with data at the application level, access to MySQL data directories, and let them run awk, sort, du, etc, even though INFORMATION_SCHEMA gives you that data for any kind of table storage type you use. Oh, by the way, both I_S and ‘SHOW TABLE STATUS’ will tell about free space, whereas filesystem has no idea about what data is inside file. You choose your tools… ;-)

Oh, wait, this was given as an argument:

You don’t need access to MySQL, to use SHOW TABLE STATUS or to query the INFORMATION_SCHEMA.

Right…

Though actually I know where file-per-table solves something. I’ve written already, that most Linux file systems serialize O_DIRECT writes. Using separate files will mean that O_DIRECTs will be placed in separate queues per each inode. I’d suggest just using proper file system :-)

Now I should probably tell why having single tablespace is better. Besides the obvious (freed space shared by all tables), I don’t know. Well, there’re lots of different issues that have different impact on different operating systems, file systems, storage, etc.

For example, when data lives inside single tablespace, filesystem metadata operations are nearly non-existing. That means no writes to FS journal, less need for multiple-points-of-synchronization, etc.

Multiple file descriptors could also be mentioned, though probably nowadays it has nearly zero significance. I don’t know – statements like that need more than just assumptions. I don’t have good numbers about how much FS fragmentation affects proper database workload – and results can be different depending on data access patterns, size composition, and of course, filesystems used.

Still, from OS administrator’s perspective data is not fragmented in any way, moving files around will hit bandwidth issues in network, memory, I/O, but will not do any seeks. I probably have dealt too much with systems that had hundreds of millions of files, to love single-file-approach. :)

after the conference, mydumper, parallelism, etc

Monday, May 18th, 2009

Though slides for my MySQL Conference talks were on the O’Reilly website, I placed them in my talks page too, for both dtrace and security presentations.

I also gave a lightning talk about mydumper. Since my original announcement mydumper has changed a bit. It supports writing compressed files, detecting and killing slow queries that could block table flushes, supports regular expressions for table names, and trunk is slowly moving towards understanding that storage engines differ :)

I’ve been using mydumper quite a lot in my deployments (and observing 10x faster dumps). Now, the sad part is how to do faster recovery. It is quite easy to parallelize load of data (apparently, xargs supports running parallel processes):

echo *.sql.gz | xargs -n1 -P 16 -I % sh -c 'zcat % | mysql dbname'

Still, that doesn’t scale much – only doubles the load speed, compared to single threaded load, even on quite powerful machine. The problem lives in log_sys mutex – it is acquired for every InnoDB row operation, to grab LogicalSequenceNumbers (LSNs), so neither batching nor differentiation strategies really help, and same problem is hit by LOAD DATA too. In certain cases I saw quite some spinning on other mutexes, and it seems that InnoDB currently doesn’t scale that well with lots of small row operations. Maybe someone some day will pick this up and fix, thats why we go to conferences and share our findings :)

Eyecandy mutexes!

Sunday, February 15th, 2009

In my quest of making MySQL usable, I managed to hit contention that wasn’t spotted by performance masters before. Meet most useless mutex ever (this is actual contention event, not just a hold):

Count     nsec Lock
 1451   511364 mysqld`ut_list_mutex       

      nsec ---- Distribution --- count    Stack
      2048 |@@                 |   132    libc.so.1`mutex_lock_impl
      4096 |@@@                |   186    libc.so.1`mutex_lock
      8192 |                   |     3    ut_malloc_low
     16384 |@@                 |   137    mem_heap_create_block
     32768 |@                  |   105    row_sel_store_mysql_rec
     65536 |@                  |    89    row_search_for_mysql
    131072 |@                  |    99    ha_innobase::general_fetch
    262144 |@@                 |   131    ha_innobase::rnd_next
    524288 |@@@                |   195    rr_sequential
   1048576 |@@@                |   223
   2097152 |@@                 |   137
   4194304 |                   |    14
----------------------------------------------------------------

ut_list_mutex guards a memory structure which has all memory blocks allocated by InnoDB (via ut_malloc/ut_free) in it.
It has two uses:

  1. Printing “Total memory allocated” in SHOW INNODB STATUS (though this can still be implemented lock-free)
  2. Deallocating all memory on shutdown (though, all modern operating systems do that anyway, so this is purely just to shut up valgrind)

If you have any BLOB/TEXT data in your tables, you’re definitely hitting this contention spot (it is #1 contention in such cases).

Fix? Kill the eyecandy, replace ut_malloc and ut_free with direct calls to malloc() and free(), oh and of course, use scalable allocators like tcmalloc or Hoard.

Charsets mutex meets InnoDB

Friday, February 6th, 2009

When InnoDB compares data (like, when looking up in indexes), it actually asks help from MySQL – character sets may provide different rules for evaluation. MySQL then looks up character set information. This is where the fun begins – if character set used for comparison is default server character set, or latin1 – there are shortcuts. If not (say, some smart developer decided to use Unicode without forcing DBA to set default server charset, as it isn’t needed) – a very nice internal routine is called, which at the very beginning does exactly this:

 /*
    To make things thread safe we are not
    allowing other threads to interfere
    while we may changing the cs_info_table
  */
  pthread_mutex_lock(&THR_LOCK_charset);

Apparently, ha_innodb.cc has such comment too:

/* Use the charset number to pick the right charset struct for
the comparison. Since the MySQL function get_charset may be
slow before Bar removes the mutex operation there, we first
look at 2 common charsets directly. */

if (charset_number == default_charset_info->number) {
    charset = default_charset_info;
} else if (charset_number == my_charset_latin1.number) {
    charset = &my_charset_latin1;
} else {
    charset = get_charset(charset_number, MYF(MY_WME));
    [...]

I’ll avoid going into discussions why such global lock at every row operation is harmful, but in case anyone is hitting lots of mutex contention there – just set default server character set to what your databases are in (or use binary or latin1, or add a clause for utf8 up here, or remove mutex, nobody is changing your character sets anyway ;-)

Update: Some of the discussion is at bug#42649