Opening tables!

There’s one bottleneck in MySQL/InnoDB that ultimately sucks. It sucked in 4.0, sucked in 5.0, sucks in 5.1 with newest InnoDB plugin. Opening tables has been a bottleneck on machines that have thousands of tables all the time (as LOCK_open is being held during the process), and while there was a table being opened, everything else would stall on the machine.

It can simply take hours on such systems just to open tables – and the major portion of time spent is randomly diving into InnoDB tables to populate index statistics. It obviously sounds like low hanging fruit – as statistics aren’t needed while you are opening a table, they’re needed just for querying the table.

So, I threw in few thousand tables to my machine, and tried opening them with ten connections. Standard InnoDB code was opening 13.5 tables a second. After spending few minutes and moving (this is pure prototype, not suitable for production) statistic collection post ha_innodb::open(), I noticed performance increase.

Tables were opened at 105-a-second speed. A bit better, ~8x better.

Merry Christmas, MySQL!

Posted in mysql | Tagged | 3 Comments

On deadlock detection

InnoDB detects deadlocks. Deadlocks are those nasty situations, when transaction 1 tries to acquire locks A and B, whereas transaction 2 tries to acquire locks B and A at the same time. As both are stubborn, InnoDB will decide simply to terminate one of them. If it wouldn’t do that, both transactions would have to wait until lock_wait_timeout to expire otherwise. There is a big chance that longer the transaction is, more likely it is to cause deadlocks. Deadlock detection kind of helps, then, but… at certain costs.

Transaction 1 and 2 case is way too easy, try adding few hundred transactions that contend over same set of locks. To do that, InnoDB deadlock monitor will recursively brute-force lock graph, until it hits a 200-transaction-long chain (it will say it is a deadlock), or until it runs out of paths to check. Still, with the power of modern hardware that will still be milliseconds.

Unfortunately, InnoDB will also hold kernel_mutex at that time, so lots and lots of InnoDB operations will not happen at that time. To be exact, InnoDB will rarely do anything else, while deadlock check is happening.

To illustrate that, I have a very simple testcase (that in certain conditions stalls the server for half an hour, even if it is not being ran):

UPDATE t1 SET b=b+1 WHERE a=1;

With few threads it executes nearly 20000 times a second on my desktop machine. With ten threads it executes 14000/s. With 50 threads it is only 3000/s. With 100 threads it falls down to 639 operations a second. At 140 threads it is already just 266.

I built InnoDB without deadlock detection (tiny tiny patch), and tried same test. Similar performance with 10 threads, still doing 10000 operations a second at 100 threads:

Though I illustrated edge case here, its purity actually didn’t show how bad this can go – this situation can happen not only because of high contention on single row, but simply because someone holds up the row lock for a bit too long (there’s always that sleep between UPDATE and COMMIT, too). It can take a single transaction to cause a lock convoy, and once transactions queue up, and update rate falls down below 100/s, all MySQL will be doing is checking for deadlocks, even if they never happen.

On many systems deadlock detection is causing way more issues, than lack of it would. Most deadlocks happen on transactions that are somewhere in the middle of their lock wait anyway :)

There’s some discussion about it at MySQL Bug#49047

Posted in mysql | Tagged | 2 Comments

On replication, some more

Dear MySQL,

I feel ashamed that I ever wanted you to support 4.0->5.1 replication, and apologize for that. I really understand that it was really egoistic of me even to consider you should be involved in this.

I even understand that 5.0 is running out of active support (I’m not questioning that you’ll stop supporting 4.1 entirely too), and you’ll stop doing pretty much anything to 5.0, except “critical security fixes” (w00t, I managed to get one into 4.1, 8 year old MITM flaw :).

I really understand that supporting more than one release is very very very difficult, and people should do only adjacent version upgrade.

I’m not asking you much, but, maybe you could then support 5.1 to 5.1 replication? I don’t want much, just:

  • Gracefully recover after slave crashes.
  • Don’t have single serial reading of pages for replication stream as a bottleneck – either read-ahead properly (you can do that with RBR!!!), or apply events in parallel (you can do that with RBR too!)
  • Allow to edit replication filters without restarting servers.
  • Allow to enable and disable binary logging of events received from master, as well as enabling and disabling binary logging without restarting the instance.

I hope it isn’t too much too ask! It is just supported replication between two same version instances.

Thanks!

Posted in mysql | Tagged | 6 Comments

best free() is exit()

Whenever any maintenance needs server restarts, there’s a list of unsolved bottlenecks or inefficient code that gets touched a lot at that time. I can understand that heating up the server can take lots of time (though lots of low hanging fruits there), but the way actual shutdown is done, even if there’s not much of dirty data to flush, sucks.

See, developers are perfectionists, and their perfectionism also includes the crazy idea that all memory has to be deallocated at server shutdown, as otherwise Valgrind and other tools will complain that someone leaked memory. Developers will write expensive code in shutdown routines that will traverse every memory structure and deallocate/free() it.

Now, guess what would happen if they wouldn’t write all this expensive memory deallocation code.

Still guessing?

OS would do it for them, much much much faster, without blocking the shutdown for minutes or using excessive amounts of CPU. \o/

Posted in mysql | 7 Comments

Crash recovery, again

There’s one stage in InnoDB crash recovery where it reads log file, you know, this one:

InnoDB: Doing recovery: scanned up to log sequence number 354164119040
InnoDB: Doing recovery: scanned up to log sequence number 354169361920

On a machine with bigger logs it will start spending nearly 100% CPU somewhere in recv_scan_log_recs. Guess what it does…. -fno-inline builds to the rescue:

#0  mem_block_get_len at ./include/mem0mem.ic:86
#1  mem_heap_get_size at ./include/mem0mem.ic:591
#2  recv_scan_log_recs at log/log0recv.c:2727

And:

samples  %        symbol name
8467     72.9222  mem_heap_get_size
291       2.5062  recv_add_to_hash_table
95        0.8182  mem_block_get_len

To speak in layman’s terms, InnoDB does SUM(LENGTH(allocation)) on its relatively wide memory (tens, hundreds of thousands of entries) arena, FOR EVERY LOG SEGMENT, to make sure it didn’t run out of available 32GBs. Hehehe, lame.

As for now, I’ll just killed the check and have my recovery much much faster – finished in 3 minutes, what it wasn’t able to do in 30 before.

P.S. This is different from what I wrote before (and magic Yasufumi’s patch)
P.P.S. Now I got to learn to reuse LOG_DUMMY table during the recovery process, as it is next low hanging fruit there…

Posted in mysql | Tagged , , | Comments Off

on replication compatibility

Dear MySQL,

I will do this to rest of your code, if you continue breaking replication for me.

– Domas

Posted in mysql | Tagged , | Comments Off

again, on benchmarks

Dear interweb, if you have no idea what you’re writing about, keep it to yourself, don’t litter into the tubes. Some people may not notice they’re eating absolute crap and get diarrhea.

This particular benchmark has two favorite parts, that go with each other together really well:

I didnt change absolutely any parameters for the servers, eg didn’t change the innodb_buffer_pool_size or key_buffer_size.

And..

If you need speed just to fetch a data for a given combination or key, Redis is a solution that you need to look at. MySQL can no way compare to Redis and Memcache. …

Seriously, how does one repay for all the damage of such idiotic benchmarks?

P.S. I’ve ranted at benchmarks before, and will continue doing so.

Posted in mysql | Tagged , , | 12 Comments

update

In past few months I had lots of changes going on – left the Sun/MySQL job, my term on Wikimedia Board of Trustees ended, I joined Facebook and now I got appointed to Wikimedia Advisory Board. This also probably means that I will have slightly less hands-on work on Wikipedia technology (I’ll be mostly in “relaxed maintenance mode“), though I don’t know yet how much less – time will show :)

P.S. I also quit World of Warcraft. ;-)

Posted in misc, wikitech | 3 Comments

A formal objection to formal objection!

Dear EC, I have an opinion on Sun/Oracle merger.

I don’t like objections, therefore I object to them. Of course, I understand, that some people want to play major parts in dramas and soap operas, others are spectating and enjoying it, but most people I know and respect simply want to be efficient and get their work done.

These people, many of whom were my colleagues back at Sun/MySQL, sometimes can be labeled as ones who don’t care about greater picture, or the common good. Unfortunately, the topic of common interest depends from religion to religion, from society to society, from person to person.

My approach to common good is doing what I want to do for the community, and not what community wants me to do. If that relation is not compatible for any of the sides, there is no relation at all. Now, what happens now, is that there is some user community (and I include all the people who are protecting their free-software-usage interests) which has demands for vendors.

When PG zealots keep telling that “PG is open, MySQL is closed-development by a company”, you will often see MySQL camp not listening to that at all. But there is some true in that, every MySQL user was using a product that was developed inside a citadel (even some builds were called ‘Citadel’), and company had a say over lots of issues.

Start with Windows licensing at the beginning, GPLv2 versus GPLv3 or any other issue, that was affecting our user community at large, it has always been top-down decision, nobody has ever asked me as a user what kind of license I would want – and all those decisions were made by MySQL, the semi-independent company.

I never actually craved for too much control of the project – I’m always free to fork (and I have forked at least 3 releases of MySQL, and now have to deal with yet another fork at work ;-). I have went through various stages of relations with MySQL – and at every of them I had to understand that open-source software carries its own set of risks, but also has its own set of mitigations. I’m free to support my organization needs at various outcomes, I don’t have any data lock-in, I don’t have any vendor lock-in, the only impact is a risk that I should have calculated years ago, not today.

So even if the argument starts with “MySQL is being acquired by competitor, and it is risky situation”, there are few real messages out there: “we need to protect people who don’t care about risks” – sounds very much like bailout money for all the people with crazy spending, as well as “if your entry campaign is about doing lots of common good, you may not be allowed to profit in the future”.

This doesn’t make open-source a good investment (would Sun pay 1B$ if it forecasted today’s EC position?), and it not being a good investment means there won’t be opportunities for people to balance common good and business opportunities in future.

If a community was supposed to have a voice in this situation, it should’ve had a fair share of board members in project management. If a community didn’t get its fair share of management at the organization, it should’ve found another organization, or accepted the risks.

In my opinion, if you are right now in the camp of supporting objections, it is not because you’re seeing a lot now, it is mostly because you didn’t see anything before.

Posted in mysql | Tagged | 4 Comments

GDB 7!

I wasn’t prepared for this. After spending months playing with GDB development trees I somehow entirely missed that 7.0 release is getting close, and took me more than an hour to spot it.

My favorite features are python scripting and non-stop debugging. I was toying around with python scripting for a while, and was planning to make backtraces make sense. Having hands that open means that one can see PHP backtraces, when gdb’ing apache, see table names and states when MySQL thread access handler interfaces, or remote IPs and users, when it is writing to network. Process inspection can simply rock, if right tools are created using these new capabilities, and I’m way too excited when I think about those. “Always have debugging symbols” gets way more meaning now.

Another issue I’ve been trying to resolve lately is avoiding long locking periods for running processes (directly attaching to process can freeze its work for a second or so, which isn’t that tolerable in production environments). GDB is getting closer to the async debugging capabilities – where one can run a debugger without actually stopping anything.

So, congratulations GDB team, now it is job for us to find all the uses of the tool. It has been invaluable so far, but this is much much more.

Posted in mysql, wikitech | Tagged | 1 Comment