April 23rd, 2010

Debugging memory leaks in plugins with Valgrind

I had an interesting IRC discussion the other day with Monty Taylor about what turned out to be a limitation in Valgrind with respect to debugging memory leaks in dynamically loaded plugins.

Monty Taylor's original problem was with Drizzle, but as it turns out, it is common to all of the MySQL-derived code bases. When there is a memory leak from an allocation in a dynamically loaded plugin, Valgrind will detect the leak, but the part of the stack trace that is within the plugin shows up only as three unhelpful question marks ("???"):

==1287== 400 bytes in 4 blocks are definitely lost in loss record 5 of 8
==1287==    at 0x4C22FAB: malloc (vg_replace_malloc.c:207)
==1287==    by 0x126A2186: ???
==1287==    by 0x7C8E01: ha_initialize_handlerton(st_plugin_int*) (handler.cc:429)
==1287==    by 0x88ADD6: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1033)

This tells you little more than that the leak is somewhere in one of your plugins.

After trying a couple of things, we found that this is a known limitation in Valgrind in relation to code that is loaded with dlopen() and later unloaded with dlclose():

http://bugs.kde.org/show_bug.cgi?id=79362

The basic problem is that Valgrind records the location of the malloc() call as just a memory address. And when the memory leak check is performed after the end of program execution, the plugin has been unloaded with dlclose(), and the recorded memory address is therefore no longer valid.

The problem is specific to memory leak checks, which are done only after the code has been unloaded. Other checks (like use of uninitialised values and use-after-free) work fine, with full information in the stack traces, as such checks are done while the plugin code is still loaded into memory. But the memory leak checks are arguably among the most useful checks Valgrind does, as Valgrind is often the only way to find and fix critical memory leaks efficiently.

Fortunately, once the issue was understood, we had an easy work-around: disable the dlclose() call in the server plugin code, and the leak is then detected with full information in the stack trace. Unfortunately this introduces a leak of its own, since now the memory allocated in dlopen() is never freed, so we get another spurious Valgrind memory leak warning.

Another possible way to achieve the same effect is to pass the RTLD_NODELETE flag to dlopen(), though I have not tried this yet.
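
For illustration, here is a minimal sketch of what loading a plugin with RTLD_NODELETE could look like (the dlopen() flags are standard, but the plugin file name is made up for the example, and in the server the change would of course go into the existing plugin-loading code):

    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        /* RTLD_NODELETE keeps the object mapped in memory even after
           dlclose(), so Valgrind can still resolve the addresses it
           recorded for allocations made from inside the plugin. */
        void *handle = dlopen("ha_example.so", RTLD_NOW | RTLD_NODELETE);
        if (!handle)
        {
            fprintf(stderr, "dlopen() failed: %s\n", dlerror());
            return 1;
        }
        /* ... look up plugin symbols with dlsym() and use them ... */
        dlclose(handle);    /* The code stays mapped despite this call. */
        return 0;
    }

On Linux this would be linked with -ldl (not needed on recent glibc versions, where dlopen() lives in libc itself).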

A possibly better work-around (which I also have not tried yet) is the one suggested in the Valgrind feature request referenced above: add the offending plugin(s) to LD_PRELOAD when starting the server. The plugin code will then not actually be unloaded by dlclose(), so stack traces should be available without any spurious leak warnings from Valgrind. However, according to the suggestion in the feature request, this will not work well if some of the dynamic plugins need a particular load order. I also need to check whether this actually works for plugins (like storage engines) that have link dependencies on symbols in the main program. But it might be a good option if it can be made to work.

(At first I was surprised to learn that this was a problem in MySQL and MariaDB, as I never saw it before. But I suppose the reason is that we so far have built most plugins as built-in, rather than as dynamically loaded .so files. The problem is likely to occur more frequently as we are moving to do more and more with plugins in MariaDB, so it is nice to know a work-around. Thanks, Monty!)

Fixing MySQL group commit (part 1)

This is the first in a series of three articles about ideas for implementing full support for group commit in MariaDB (for the other parts see the second and third articles). Group commit is an important optimisation for databases that helps mitigate the latency of physically writing data to permanent storage. Group commit can have a dramatic effect on performance, as the following graph shows:

[Graph: benchmark results, showing transactions per second against the number of parallel threads, with and without group commit]

The rising blue and yellow lines show transactions per second when group commit is working, showing greatly improved throughput as the parallelism (number of concurrently running transactions) increases. The flat red and green lines show transactions per second with no group commit, with no scaling at all as parallelism increases. As can be seen, the effect of group commit on performance can be dramatic, improving throughput by an order of magnitude or more. More details of this benchmark below, but first some background information.

Durability and group commit

In a fully transactional system, it is traditionally expected that when a transaction is committed successfully, it becomes durable. The term durable is the "D" in ACID, and means that even if the system crashes the very moment after commit succeeded (power failure, kernel panic, server software crash, or whatever), the transaction will remain committed after the system is restarted, and crash recovery has been performed.

The usual way to ensure durability is by writing, to a transactional log file, sufficient information to fully recover the transaction in case of a crash, and then use the fsync() system call to force the data to be physically written to the disk drive before returning successfully from the commit operation. This way, in case crash recovery becomes necessary, we know that the needed information will be available. There are other methods than fsync(), including calling the fdatasync() system call or using the O_DIRECT flag when opening the log file, but for simplicity we will use fsync() to refer to any method for forcing data to physical disk.
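
As a minimal sketch of this principle (not the actual server code; the log format and error handling are greatly simplified), a durable append to a log file could look like this:

    #include <unistd.h>

    /* Append a log record and force it to physical storage before the
       commit is reported as successful.  Only after fsync() returns do
       we know the record will survive a crash. */
    static bool durable_append(int log_fd, const char *record, size_t len)
    {
        if (write(log_fd, record, len) != (ssize_t)len)
            return false;
        if (fsync(log_fd) != 0)
            return false;
        return true;    /* Now it is safe to tell the client "committed". */
    }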

fsync() is an expensive operation. A good traditional hard disk drive (HDD) will do around 150 fsyncs per second (for a 10,000 rpm drive). A good solid state disk like the Intel X25-M will do around 1200 fsyncs per second. It is possible to use a RAID controller with a battery-backed cache (which keeps the data in cache memory during a power failure and physically writes it to disk when the power returns); this reduces the overhead of fsync(), but does not eliminate it completely.

(There are other ways than fsync() to ensure durability. For example, in a cluster with synchronous replication (like NDB or Galera), durability can be achieved by making sure the transaction is fully replicated to multiple nodes, on the assumption that a system failure will not take out all nodes at once. Whatever method is used, ensuring durability is usually significantly more expensive than merely committing a transaction to local memory.)

So the naive approach, which does fsync() after every commit, will limit throughput to around 150 transactions per second (on a standard HDD). But with group commit we can do much better. If we have several transactions running concurrently, all waiting to fsync their data in the commit step, we can use a single fsync() call to flush them all to physical storage in one go. The cost of fsync() is often not much higher for multiple transactions than for a single one, so as the above graph shows, this simple optimisation greatly reduces the overhead of fsync() for parallel workloads.
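
The idea can be sketched roughly like this (a simplified illustration only, with made-up names; writing the log records themselves and updating log_end is elided, and the real InnoDB code is of course structured differently):

    #include <cstdint>
    #include <mutex>
    #include <unistd.h>

    static std::mutex log_mutex;
    static uint64_t log_end = 0;      /* End of what has been written to the log. */
    static uint64_t durable_end = 0;  /* End of what fsync() has made durable.    */

    /* Each committing transaction calls this after writing its log record,
       passing the position at the end of its own record.  While one caller
       is inside fsync(), later callers queue up on the mutex, and a single
       fsync() then covers the whole queued-up group at once. */
    static void wait_until_durable(int log_fd, uint64_t my_pos)
    {
        std::lock_guard<std::mutex> guard(log_mutex);
        if (durable_end >= my_pos)
            return;                      /* An earlier fsync() already covered us. */
        uint64_t batch_end = log_end;    /* Everything written so far ...          */
        fsync(log_fd);                   /* ... becomes durable with one fsync().  */
        durable_end = batch_end;
    }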

Group commit in MySQL and MariaDB

MySQL (and MariaDB) has full support for ACID when using the popular InnoDB (XtraDB in MariaDB) storage engine (and there are other storage engines with ACID support as well). For InnoDB/XtraDB, durability is enabled by setting innodb_flush_log_at_trx_commit=1.

Durability is needed when there is a requirement that committed transactions must survive a crash. But this is not the only reason for enabling durability. Another reason is when the binary log is enabled, for using a server as a replication master.

When the binary log is used for replication, it is important that the content of the binary log on the master exactly match the changes done in the storage engine(s). Otherwise the slaves will replicate to different data than what is on the master, causing replication to diverge and possibly even break if the differences are such that a query on the master is unable to run on the slave without error. If we do not have durability, some number of transactions may be lost after a crash, and if the transactions lost in the storage engines are not the same as the transactions lost in the binary log, we will end up with an inconsistency. So with the binary log enabled, durability is needed in MySQL/MariaDB to be able to recover into a consistent state after a crash.

With the binary log enabled, MySQL/MariaDB uses XA/2-phase commit between the binary log and the storage engine to ensure the needed durability of all transactions. In XA, committing a transaction is a three-step process:

  1. First, a prepare step, in which the transaction is made durable in the engine(s). After this step, the transaction can still be rolled back; also, in case of a crash after the prepare phase, the transaction can be recovered.
  2. If the prepare step succeeds, the transaction is made durable in the binary log.
  3. Finally, the commit step is run in the engine(s) to make the transaction actually committed (after this step the transaction can no longer be rolled back).

The idea is that when the system comes back up after a crash, crash recovery will go through the binary log. Any prepared (but not committed) transactions that are found in the binary log will be committed in the storage engine(s). Other prepared transactions will be rolled back. The result is guaranteed consistency between the engines and the binary log.
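
Schematically, the commit path looks something like the sketch below (the function names are placeholders standing in for the real server code, not actual MySQL functions):

    bool engine_prepare();            /* Step 1: prepare in the engine(s).       */
    bool binlog_write_and_fsync();    /* Step 2: make durable in the binary log. */
    void engine_commit();             /* Step 3: finalize in the engine(s).      */

    bool two_phase_commit()
    {
        if (!engine_prepare())          /* fsync() #1: recoverable after this,     */
            return false;               /* but can still be rolled back.           */
        if (!binlog_write_and_fsync())  /* fsync() #2: the binary log decides; a   */
            return false;               /* prepared transaction not found in the   */
                                        /* binlog is rolled back during recovery.  */
        engine_commit();                /* fsync() #3: the commit becomes final.   */
        return true;
    }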

Now, each of the three above steps requires an fsync() to work, making a commit three times as costly in this respect compared to a commit with the binary log disabled. This makes it all the more important to use the group commit optimisation to mitigate the overhead from fsync().

But unfortunately, group commit does not work in MySQL/MariaDB when the binary log is enabled! This is the infamous Bug#13669, reported by Peter Zaitsev back in 2005.

So this is what we see in the graph and benchmark shown at the start. This is a benchmark running a lot of very simple transactions (a single REPLACE statement on a smallish XtraDB table) against a server with and without the binary log enabled. This kind of benchmark is bottlenecked on the fsync throughput of the I/O system when durability is enabled.

The benchmark is done against two different servers. One has a pair of Western Digital 10k rpm HDDs (with binary log and XtraDB log on different drives). The other has a single Intel X25-M SSD. The servers are both running MariaDB 5.1.44, and are configured with durable commits in XtraDB, and with drive cache turned off (drives like to lie about fsync to look better in casual benchmarks).

The graph shows throughput in transactions per second for different number of threads running in parallel. For each server, there is a line for results with the binary log disabled, and one with the binary log enabled.

We see that with one thread, there is some overhead in enabling the binary log, as is to be expected given that three calls to fsync() are required instead of just one.

But much worse, we also see that group commit does not work at all when the binary log is enabled. While the lines with the binary log disabled show excellent scaling as the parallelism increases, the lines with the binary log enabled are completely flat. With group commit non-functional, the overhead of enabling the binary log is enormous at higher parallelism (at 64 threads on HDD it is actually two orders of magnitude worse with the binary log enabled).

So this concludes the first part of the series. We have seen that if we can get group commit to work when the binary log is enabled, we can expect a huge gain in performance on workloads that are bottlenecked on the fsync throughput of the available I/O system.

The second part will go into detail with why the code for group commit does not work when the binary log is enabled. The third (and final) part will discuss some ideas about how to fix the code with respect to group commit and the binary log.

Fixing MySQL group commit (part 2)

This is the second in a series of three articles about ideas for implementing full support for group commit in MariaDB. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. See also the third article.

Internally, InnoDB (and hence XtraDB) does support group commit. The way this works can be seen in the innobase_commit() function. The work in this function is split into two parts. First, a "fast" part, which registers the commit in memory:

    trx->flush_log_later = TRUE;
    innobase_commit_low(trx);
    trx->flush_log_later = FALSE;
Second, a "slow" part, which writes and fsync's the commit to disk to make it durable:
    trx_commit_complete_for_mysql(trx)
While one transaction is busy executing the "slow" part, any number of later transactions can complete their "fast" part, and queue up waiting for the running fsync() to finish. Once it does finish, a single fsync() of the log is now sufficient to complete the slow part for all of the queued-up transactions. This is how group commit works in InnoDB when the binary log is disabled.

When the binary log is enabled, MySQL uses XA/2-phase commit to ensure consistency between the binary log and the storage engine. This means that a commit now consists of three steps:

    innobase_xa_prepare()
    write() and fsync() binary log
    innobase_commit()

Now, there is an extra detail to the prepare and commit code in InnoDB. InnoDB locks the prepare_commit_mutex in innobase_xa_prepare(), and does not release it until after the "fast" part of innobase_commit() has completed. This means that while one transaction is executing innobase_commit(), all subsequent transactions will be blocked inside innobase_xa_prepare() waiting for the mutex. As a result, no transactions can queue up to share an fsync(), and group commit is broken with the binary log enabled.
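
Heavily simplified, the pattern looks something like this (placeholder names; the real code is spread across innobase_xa_prepare(), the binary log code, and innobase_commit()):

    #include <mutex>

    struct Trx;                                  /* Placeholder for an InnoDB transaction. */
    void write_and_fsync_prepare_record(Trx *);  /* Placeholders for the real log I/O.     */
    void commit_in_memory(Trx *);
    void write_and_fsync_commit_record(Trx *);

    static std::mutex prepare_commit_mutex;

    void xa_prepare(Trx *trx)
    {
        write_and_fsync_prepare_record(trx);
        prepare_commit_mutex.lock();         /* Taken here ...                          */
    }

    /* ... the binary log is written and fsync()ed between the two calls ... */

    void commit(Trx *trx)
    {
        commit_in_memory(trx);               /* The "fast" part.                        */
        prepare_commit_mutex.unlock();       /* ... and only released here.  No other   */
                                             /* transaction can get past xa_prepare()   */
                                             /* in the meantime, so commits are fully   */
                                             /* serialized and cannot share an fsync(). */
        write_and_fsync_commit_record(trx);  /* The "slow" part.                        */
    }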

So, why does InnoDB hold the problematic prepare_commit_mutex across the binary logging? That turns out to be a really good question. After extensive research into the issue, it appears that in fact there is no good reason at all for the mutex to be held.

Comments in the InnoDB code, in the bug tracker, and elsewhere, mention that taking the mutex is necessary to ensure that commits happen in the same order in InnoDB and in the binary log. This is certainly true; without taking the mutex we can have transaction A committed in InnoDB before transaction B, but B written to the binary log before transaction A.

But this just raises the next question: why is it necessary to ensure the same commit order in InnoDB and in the binary log? The only reason that I could find stated is that this is needed for InnoDB hot backup and XtraBackup to be able to extract the correct binary log position corresponding to the state of the engine contained in the backup.

Sergei Golubchik investigated this issue during the 2010 MySQL conference, inspired by the many discussions of group commit that took place there. It turns out that XtraBackup does a FLUSH TABLES WITH READ LOCK when it extracts the binary log position. This statement completely blocks the processing of commits until it is released, removing any possibility of a different commit order in the engine and the binary log (InnoDB Hot Backup is closed source, so this is difficult to verify, but it presumably works in the same way). So there certainly is no need to hold the prepare_commit_mutex to ensure a consistent binary log position for backups!

There is another popular way to do hot backups without using FLUSH TABLES WITH READ LOCK: LVM snapshots. But an LVM snapshot essentially runs the recovery algorithm at restore time. In this case, XA is used to ensure that engine and binary log are consistent at server start, eliminating any need to enforce same ordering of commits.

So it really seems that there just is no good reason for the prepare_commit_mutex to exist in the first place. Unless someone can come up with a good explanation for why it should be needed, I am forced to conclude that we have lived with 5 years of broken group commit in MySQL solely because of incorrect hearsay about how things should work. Which is kind of sad, and suggests that no one at MySQL or InnoDB ever cared enough to take a serious look at this important issue.

(In order to get full group commit in MySQL, there is another issue that needs to be solved. The current binary log code does not implement group commit, so this also needs to be added. Such an implementation should be possible using standard techniques, and it is independent of fixing group commit in InnoDB.)

This concludes the second part of the series, showing that group commit can be restored simply by removing the offending prepare_commit_mutex from InnoDB. The third and final article in the series will discuss some deeper issues that arise from looking into this part of the server code, and some interesting ideas for further improving things related to group commit.

Fixing MySQL group commit (part 3)

This is the third and final article in a series about group commit in MySQL. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. The second article explained the part of the InnoDB code that is responsible for the problem.

So how do we fix group commit in MySQL? As we saw in the second article of this series, we can just eliminate the prepare_commit_mutex from InnoDB, extend the binary logging to do group commit by itself, and that would solve the problem.

However, we might be able to do even better. As explained in the first article, with binary logging enabled we need XA to ensure consistency after a crash, and that requires three fsyncs per commit. Even if each of those can be shared with other transactions using group commit, it is still expensive. During a discussion on the maria-developers@ mailing list, an idea came up for how to do this with only a single (shared) fsync() per commit.

The basic idea is to only do fsync() for the binary log, not for the storage engine, corresponding to running with innodb_flush_log_at_trx_commit set to 2 or even 0.

If we do this, we can end up in the following situation: some transaction A is written into the binary log, and fsync() makes sure it is stored durably on disk. Then transaction A is committed in InnoDB. And before the operating system and hardware get around to storing the InnoDB part of A durably on disk, we get a crash.

Now on crash recovery, we will have A in the binary log, but in the engine A may be lost, causing an inconsistency. But this inconsistency can be resolved simply by re-playing the transaction A against InnoDB, using the data for A stored in the binary log. Just like it would normally be applied on a replication slave. After re-playing the transaction, we will again be in a consistent state.
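
A rough sketch of what such recovery could look like (hypothetical helper names; a real implementation would live in the server's crash recovery code):

    #include <cstdint>

    struct binlog_pos { uint64_t offset; };   /* Position in the binary log.        */
    struct binlog_trx { /* ... */ };          /* One transaction read from the log. */

    /* Hypothetical helpers. */
    binlog_pos engine_last_durable_binlog_pos();            /* Stored inside the engine. */
    bool binlog_read_next_trx(binlog_pos *pos, binlog_trx *trx);
    void engine_replay(const binlog_trx &trx);              /* Apply like a slave would. */

    void crash_recovery_replay()
    {
        binlog_pos pos = engine_last_durable_binlog_pos();
        binlog_trx trx;
        /* Re-apply every transaction in the binary log that comes after the
           last one the engine is known to have made durable. */
        while (binlog_read_next_trx(&pos, &trx))
            engine_replay(trx);
    }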

In order to do this, we need two things:

  • For each transaction, we need to store in the InnoDB engine the corresponding position in the binary log, so that at crash recovery we know from which point in the binary log to start re-playing transactions.
  • We also need to ensure that the order of commits in the binary log and in InnoDB is the same! Otherwise, after a crash we could find ourselves in the situation that the binary log has transaction A followed by transaction B, while the InnoDB storage engine contains only transaction B committed, not transaction A. This would leave us with no reliable place in the binary log to start re-playing transactions from.

Now, for ensuring the same commit order, we do not want to re-introduce the (by now) infamous prepare_commit_mutex, as that would make it impossible to have group commit for the binary log. Instead we should use another way to ensure the ordering. There are several ways to do this; Mark Callaghan explained one of them at the MySQL conference, described further in this article.

The basic idea is that when writing transactions into the binary log, we remember their ordering. We can do this by putting the transactions into a queue, by assigning them a global transaction id in monotonic sequence, or by assigning them some kind of ticket as Mark suggests. Then inside innobase_commit(), transactions can coordinate with each other to make sure they go into the engine in the order dictated by the queue, global transaction id, or ticket.
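
A simple sketch of the ticket variant (made-up names and simplified; a real implementation would be integrated with the binary log mutex and the engine commit code):

    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    static std::mutex order_mutex;
    static std::condition_variable order_cond;
    static uint64_t next_ticket_to_issue = 0;   /* Handed out in binlog write order.         */
    static uint64_t next_ticket_to_commit = 0;  /* Next transaction allowed into the engine. */

    /* Called while holding the binary log mutex, i.e. in binary log commit order. */
    uint64_t assign_ticket()
    {
        std::lock_guard<std::mutex> guard(order_mutex);
        return next_ticket_to_issue++;
    }

    /* Called at the start of the engine commit ("fast" part): wait until all
       transactions with earlier binary log positions have entered the engine. */
    void wait_for_turn(uint64_t my_ticket)
    {
        std::unique_lock<std::mutex> lock(order_mutex);
        order_cond.wait(lock, [&] { return next_ticket_to_commit == my_ticket; });
    }

    /* Called once this transaction's "fast" commit in the engine is done. */
    void release_turn()
    {
        std::lock_guard<std::mutex> guard(order_mutex);
        ++next_ticket_to_commit;
        order_cond.notify_all();
    }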

I think I have a working idea for how to extend the storage engine API to be able to do this in a clean way for any transactional engine. We can introduce an optional handler call commit_fast() that is guaranteed to be called in the same order as transactions are written to the binary log, prior to the normal commit handler call. Basically, it would be called under a binary log mutex. The idea is that commit_fast() will contain the "fast" part of innobase_commit(), as explained in the previous article. Then in commit_fast(), the engine can do the assignment of a ticket or the insertion into a queue, as needed.
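
To make the idea a bit more concrete, such an extension might look roughly like the sketch below (a hypothetical, heavily simplified version of the handlerton interface, not the actual MySQL/MariaDB definition):

    struct THD;    /* The server's per-connection object. */

    struct handlerton_ext
    {
        /* Existing calls (signatures simplified). */
        int (*prepare)(THD *thd, bool all);
        int (*commit)(THD *thd, bool all);

        /* Proposed optional call: invoked under the binary log mutex, in the
           exact order transactions are written to the binary log, and before
           the normal commit() call.  An engine like InnoDB would do the
           "fast", in-memory part of its commit here (and, for example, record
           its ticket or queue position), leaving the expensive fsync() to
           commit(). */
        int (*commit_fast)(THD *thd, bool all);
    };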

I think that for symmetry we might also want to add a similar xa_prepare_fast() handler call, which would be invoked after the normal xa_prepare() and similarly be guaranteed to run in the same order as the binary log commit, though I need to consider this a bit more to fully make up my mind.

I believe such an addition to the storage engine API would make it possible to implement, in a clean way for all engines, the method of re-playing the binary log at crash recovery, so that no more than a single fsync() is needed at commit.

So this concludes the series. Using these ideas, I hope we will soon see patches for MySQL and MariaDB that greatly enhance the performance of durable and crash-safe commits, so that we can finally declare Peter's original Bug#13669 fixed!