This is the third and final article in a series about group commit in MySQL. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. The second article explained the part of the InnoDB code that is responsible for the problem.
So how do we fix group commit in MySQL? As we saw in the second
article of this series, we can just eliminate
prepare_commit_mutex from InnoDB, extend the binary logging
to do group commit by itself, and that would solve the problem.
However, we might be able to do even better. As explained in
the first article, with binary logging enabled we need XA
to ensure consistency after a crash, and that requires to do three fsyncs for
a commit. Even if each of those can be shared with other transactions using
group commit, it is still expensive. During a discussion on
the maria-developers@ mailing list, an idea came up for how
to do this with only a single (shared)
fsync() for a commit.
The basic idea is to only do
fsync() for the binary log, not for
the storage engine, corresponding to running
innodb_flush_log_at_trx_commit set to 2 or even 0.
If we do this, we can end up in the following situation: some transaction A is
written into the binary log, and
fsync() makes sure that is
stored durably on disk. Then transaction A is committed in InnoDB. And before
the operating system and hardware gets around to store the InnoDB part of A
durably on disk, we get a crash.
Now on crash recovery, we will have A in the binary log, but in the engine A may be lost, causing an inconsistency. But this inconsistency can be resolved simply by re-playing the transaction A against InnoDB, using the data for A stored in the binary log. Just like it would normally be applied on a replication slave. After re-playing the transaction, we will again be in a consistent state.
In order to do this, we need two things:
- For each transaction, we need to store in the InnoDB engine information about which is the corresponding position in the binary log, so that at crash recovery we will know from which position in the binary log to start re-playing transactions from.
- We also need to ensure that the order of commits in the binary log and in InnoDB is the same! Otherwise, after a crash we could find ourselves in the situation that the binary log has transaction A followed by transaction B, while the InnoDB storage engine contains only transaction B committed, not transaction A. This would leave us with no reliable place in the binary log to start re-playing transactions from.
Now, for ensuring same commit order, we do not want to re-introduce
the (by now) infamous
prepare_commit_mutex, as that would make it
impossible to have group commit for the binary log. Instead we should use
another way to ensure such order. There are several ways to do this. Mark
Callaghan explained one such way to do this at the MySQL conference, described
further in this article.
The basic idea is that when writing transactions into the binary log, we
remember their ordering. We can do this by putting the transactions into a
queue, by assigning them a global transaction id in monotonic sequence, or by
assigning them some kind of ticket as Mark suggests. Then
innobase_commit(), transactions can coordinate with each
other to make sure they go into the engine in the order dictated by the queue,
global transaction id, or ticket.
I think I have a working idea for how to extend the storage engine API to be
able to do this in a clean way for any transactional engine. We can Introduce an
optional handler call
commit_fast() that is guaranteed to be called
in the same order as transactions are written to the binary log, prior to the
normal commit handler call. Basically it would be called under a binary log
mutex. The idea is that
commit_fast() will contain the "fast" part
innobase_commit(), as explained in the previous
article. Then in
commit_fast(), the engine can do the
assignment of a ticket or insertion into a queue, as needed.
I think possibly for symmetry we would want to also add a
xa_prepare_fast() handler call that would be
invoked after the normal
xa_prepare() and similarly be
guaranteed to be in the same order as binary log commit, though I need to
consider this a bit more to fully make up my mind.
I believe such an addition to the storage engine API would allow to implement
in a clean way for all engines the method of re-playing the binary log at
crash recovery to avoid more than a single
fsync() at commit.
So this concludes the series. Using these ideas, I hope we will soon see patches for MySQL and MariaDB that greatly enhances the performance for durable and crash-safe commits, so that we can finally declare Peter's original Bug#13669 for fixed!