This is the second in a series of three articles about ideas for implementing full support for group commit in MariaDB. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. See also the third article.
Internally, InnoDB (and hence XtraDB) does support group commit. The way this
works is seen in the innobase_commit() function. The work in this
function is split into two parts. First, a "fast" part, which registers the commit in
memory:
trx->flush_log_later = TRUE;
innobase_commit_low(trx);
trx->flush_log_later = FALSE;
Second, a "slow" part, which writes the commit to disk and fsyncs it to make
it durable:
trx_commit_complete_for_mysql(trx);
While one transaction is busy executing the "slow" part, any number of later
transactions can complete their "fast" part, and queue up waiting for the
running fsync() to finish. Once it does finish, a
single fsync() of the log is now sufficient to complete the slow
part for all of the queued-up transactions. This is how group commit
works in InnoDB when the binary log is disabled.
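To make the queueing effect concrete, here is a minimal sketch of a toy model in Python. It is not InnoDB code: the function name `fsync_groups` and its parameters are made up for illustration, and it assumes that each transaction's "fast" part finishes at a known time and that every fsync() takes the same fixed duration.

```python
def fsync_groups(fast_done_times, fsync_duration):
    """Toy model: return how many commits each fsync() of the log covers.

    fast_done_times: times at which each transaction finished its "fast" part.
    fsync_duration:  assumed fixed cost of one fsync() (same time unit).
    """
    pending = sorted(fast_done_times)
    groups = []
    clock = 0.0  # time at which the previous fsync() finished
    i = 0
    while i < len(pending):
        # The next fsync() starts once the log is free and at least one
        # transaction is waiting for its commit to become durable.
        start = max(clock, pending[i])
        # Every transaction that registered its commit in memory by then
        # is made durable by this single fsync().
        j = i
        while j < len(pending) and pending[j] <= start:
            j += 1
        groups.append(j - i)
        clock = start + fsync_duration
        i = j
    return groups
```

With a slow disk, `fsync_groups([0, 1, 2, 3, 4], 10)` gives `[1, 4]`: the first transaction pays for its own fsync(), and the four that queued up behind it share the second one. With a fast disk, `fsync_groups([0, 1, 2, 3, 4], 0.5)` gives `[1, 1, 1, 1, 1]`, since nothing ever has to wait.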
When the binary log is enabled, MySQL uses XA/2-phase commit to ensure consistency between the binary log and the storage engine. This means that a commit now consists of three steps:
innobase_xa_prepare()
write() and fsync() binary log
innobase_commit()
Now, there is an extra detail to the prepare and commit code in InnoDB. InnoDB
locks the prepare_commit_mutex
in innobase_xa_prepare(), and does not release it until after the
"fast" part of innobase_commit() has completed. This means that
while one transaction is executing innobase_commit(), all
subsequent transactions will be blocked
inside innobase_xa_prepare() waiting for the mutex. As a result,
no transactions can queue up to share an fsync(), and group
commit is broken with the binary log enabled.
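The effect of the mutex can be illustrated with a small toy model in Python. This is only a sketch under stated assumptions, not a model of the real server: the function name and parameters are hypothetical, all transactions are assumed to be ready at time 0, and `hold_time` stands for the time the mutex is held per transaction (prepare, binary log write and fsync(), and the "fast" part of the commit).

```python
def fsync_groups_with_mutex(n_transactions, hold_time, fsync_duration):
    """Toy model: commits covered by each engine-log fsync() when a mutex
    serializes prepare + binary log write + the "fast" part of commit.

    hold_time:      assumed time the mutex is held per transaction.
    fsync_duration: assumed fixed cost of one engine-log fsync().
    """
    # Only one transaction at a time gets through the mutex, so the
    # "fast" parts complete hold_time apart instead of all at once.
    fast_done = [hold_time * (k + 1) for k in range(n_transactions)]
    groups = []
    clock = 0.0  # time at which the previous fsync() finished
    i = 0
    while i < n_transactions:
        start = max(clock, fast_done[i])
        # Count the transactions already waiting when this fsync() starts.
        j = i
        while j < n_transactions and fast_done[j] <= start:
            j += 1
        groups.append(j - i)
        clock = start + fsync_duration
        i = j
    return groups
```

Because the mutex is held across an fsync() of the binary log, `hold_time` will typically be at least as large as `fsync_duration`. In that case, for example `fsync_groups_with_mutex(4, 10, 10)`, the result is `[1, 1, 1, 1]`: one fsync() per commit. Only if the hold time were much shorter, as in `fsync_groups_with_mutex(4, 1, 10)`, would commits start to share an fsync() again (`[1, 3]`).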
So, why does InnoDB hold the problematic prepare_commit_mutex
across the binary logging? That turns out to be a really good question. After
extensive research into the issue, it appears that in fact there is no good
reason at all for the mutex to be held.
Comments in the InnoDB code, in the bug tracker, and elsewhere, mention that taking the mutex is necessary to ensure that commits happen in the same order in InnoDB and in the binary log. This is certainly true; without taking the mutex we can have transaction A committed in InnoDB before transaction B, but B written to the binary log before transaction A.
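The reordering is easy to see by replaying a schedule of events. The following is a minimal Python sketch; the event names `"binlog"` and `"engine_commit"` and the function `commit_orders` are invented for this illustration only.

```python
def commit_orders(schedule):
    """Replay (transaction, step) events in order and return the resulting
    (binary log order, engine commit order)."""
    binlog_order, engine_order = [], []
    for trx, step in schedule:
        if step == "binlog":
            binlog_order.append(trx)
        elif step == "engine_commit":
            engine_order.append(trx)
        else:
            raise ValueError("unknown step: %r" % (step,))
    return binlog_order, engine_order


# Without the mutex the steps of A and B may interleave freely:
# B reaches the binary log first, but A commits in the engine first.
without_mutex = [("B", "binlog"), ("A", "binlog"),
                 ("A", "engine_commit"), ("B", "engine_commit")]

# With the mutex, each transaction does both steps before the next starts.
with_mutex = [("A", "binlog"), ("A", "engine_commit"),
              ("B", "binlog"), ("B", "engine_commit")]
```

Here `commit_orders(without_mutex)` returns `(["B", "A"], ["A", "B"])`, so the two orders differ, while `commit_orders(with_mutex)` returns `(["A", "B"], ["A", "B"])`.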
But this just raises the next question: why is it necessary to ensure the same commit order in InnoDB and in the binary log? The only reason that I could find stated is that this is needed for InnoDB hot backup and XtraBackup to be able to extract the correct binary log position corresponding to the state of the engine contained in the backup.
Sergei Golubchik investigated this issue during the 2010 MySQL conference,
inspired by the many discussions of group commit that took place there. It
turns out that XtraBackup does a FLUSH TABLES WITH READ LOCK when it
extracts the binary log position. This statement completely blocks the
processing of commits until released, removing any possibility of a
different commit order in engine and binary log (InnoDB hot backup is closed
source, so difficult to check, but presumably works in the same way). So there
certainly is no need for holding the prepare_commit_mutex to
ensure consistent binary log position for backups!
There is another popular way to do hot backups without using FLUSH
TABLES WITH READ LOCK: LVM snapshots. But restoring from an LVM snapshot
essentially runs the crash recovery algorithm at restore time, just as after a
server crash. In this case, XA is used to ensure that engine and binary log are
consistent at server start, eliminating any need to enforce the same ordering
of commits.
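A minimal Python sketch of the idea behind that recovery step follows. It uses the usual two-phase commit rule: a transaction found in the prepared state in the engine is committed if it reached the binary log, and rolled back otherwise. The function and argument names are hypothetical and do not correspond to actual server code.

```python
def xa_recover(prepared_in_engine, xids_in_binlog):
    """Decide the fate of transactions left in the prepared state.

    prepared_in_engine: transaction ids the engine reports as prepared
                        but not yet committed after the restart.
    xids_in_binlog:     transaction ids found in the binary log.
    Returns (to_commit, to_rollback).
    """
    logged = set(xids_in_binlog)
    # Prepared and written to the binary log: the commit must be completed.
    to_commit = [xid for xid in prepared_in_engine if xid in logged]
    # Prepared but never written to the binary log: undo the transaction.
    to_rollback = [xid for xid in prepared_in_engine if xid not in logged]
    return to_commit, to_rollback
```

For example, `xa_recover([101, 102, 103], [100, 101, 102])` returns `([101, 102], [103])`. Note that the decision depends only on which transactions are present in the binary log, not on the order in which they committed.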
So it really seems that there just is no good reason for
the prepare_commit_mutex to exist in the first
place. Unless someone can come up with a good explanation for why it should be
needed, I am forced to conclude that we have lived with 5 years of broken
group commit in MySQL solely because of incorrect hearsay about how things
should work. Which is kind of sad, and suggests that no one at MySQL or InnoDB
ever cared sufficiently to take a serious look at this important issue.
(In order to get full group commit in MySQL, there is another issue that needs to be solved. The current binary log code does not implement group commit either, so this also needs to be added. Such an implementation should be possible using standard techniques, and is independent of fixing group commit in InnoDB.)
This concludes the second part of the series, showing that group commit can be
restored simply by removing the offending prepare_commit_mutex
from InnoDB. The third and final article in the series will discuss some
deeper issues that arise from looking into this part of the server code, and
some interesting ideas for further improving things related to group commit.