Kristian Nielsen (kristiannielsen) wrote,

Fixing MySQL group commit (part 2)

This is the second in a series of three articles about ideas for implementing full support for group commit in MariaDB. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. See also the third article.

Internally, InnoDB (and hence XtraDB) does support group commit. The way this works can be seen in the innobase_commit() function. The work in this function is split into two parts. First, a "fast" part, which registers the commit in memory:

    trx->flush_log_later = TRUE;
    innobase_commit_low(trx);
    trx->flush_log_later = FALSE;

Second, a "slow" part, which writes and fsyncs the commit to disk to make it durable:

    trx_commit_complete_for_mysql(trx);

While one transaction is busy executing the "slow" part, any number of later transactions can complete their "fast" part, and queue up waiting for the running fsync() to finish. Once it does finish, a single fsync() of the log is now sufficient to complete the slow part for all of the queued-up transactions. This is how group commit works in InnoDB when the binary log is disabled.
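The queueing behaviour described above can be sketched with a small deterministic model. This is a toy simulation, not InnoDB code: the function name count_fsyncs(), the ready_at times and the fsync_cost parameter are all made up for illustration, and the model assumes that one fsync() makes durable every commit registered in memory by the time that fsync() started.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model, not InnoDB code. ready_at[i] is the time at which transaction i
// finished its "fast" part (must be sorted ascending). One fsync() takes
// fsync_cost time units and makes durable every commit that was registered
// in memory by the time the fsync() started.
int count_fsyncs(const std::vector<int>& ready_at, int fsync_cost) {
    assert(fsync_cost > 0);
    int fsyncs = 0;
    int log_free_at = 0;   // time at which the running fsync() finishes
    std::size_t next = 0;  // first transaction that is not yet durable
    while (next < ready_at.size()) {
        // The next fsync() starts once the log is idle and a commit is waiting.
        int start = ready_at[next] > log_free_at ? ready_at[next] : log_free_at;
        // Every commit registered by then shares this single fsync().
        while (next < ready_at.size() && ready_at[next] <= start) ++next;
        log_free_at = start + fsync_cost;
        ++fsyncs;
    }
    return fsyncs;
}
```

In this model, five transactions that finish their fast part at times 0, 1, 2, 3 and 4 need only two fsync() calls when an fsync() costs 10 time units: the first covers the first transaction, and the second covers the four that queued up behind it.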

When the binary log is enabled, MySQL uses XA/2-phase commit to ensure consistency between the binary log and the storage engine. This means that a commit now consists of three steps:

    innobase_xa_prepare()
    write() and fsync() binary log
    innobase_commit()

Now, there is an extra detail to the prepare and commit code in InnoDB. InnoDB locks the prepare_commit_mutex in innobase_xa_prepare(), and does not release it until after the "fast" part of innobase_commit() has completed. This means that while one transaction is executing innobase_commit(), all subsequent transactions will be blocked inside innobase_xa_prepare() waiting for the mutex. As a result, no transactions can queue up to share an fsync(), and group commit is broken with the binary log enabled.
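A rough way to see the effect of the mutex is to compute when each transaction finishes its "fast" part if the three steps are serialised. The sketch below is a hypothetical cost model in arbitrary time units, not measured values and not server code; the Costs struct and both function names are invented here, and it assumes all transactions arrive at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cost model in arbitrary time units; not measured values.
struct Costs {
    int prepare;       // innobase_xa_prepare()
    int binlog_fsync;  // write() and fsync() of the binary log
    int fast_commit;   // "fast" part of innobase_commit()
    int log_fsync;     // "slow" part: fsync() of the InnoDB log
};

// Time at which each of n transactions (all arriving at time 0) finishes its
// "fast" part when prepare_commit_mutex is held from prepare until then.
std::vector<int> fast_part_done_with_mutex(int n, const Costs& c) {
    assert(n >= 0);
    std::vector<int> done;
    int mutex_free_at = 0;
    for (int i = 0; i < n; ++i) {
        // The next transaction cannot even start to prepare before this point.
        mutex_free_at += c.prepare + c.binlog_fsync + c.fast_commit;
        done.push_back(mutex_free_at);
    }
    return done;
}

// True if some transaction finishes its fast part while the previous
// transaction's InnoDB log fsync() is still running, so that it could
// queue up and share the next fsync().
bool any_commit_can_queue(const std::vector<int>& done, int log_fsync) {
    for (std::size_t i = 1; i < done.size(); ++i)
        if (done[i] < done[i - 1] + log_fsync) return true;
    return false;
}
```

Under the assumption that the binary log fsync() costs about as much as the InnoDB log fsync(), the fast parts are spaced at least one fsync() apart, so any_commit_can_queue() returns false: every commit ends up paying for its own fsync().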

So, why does InnoDB hold the problematic prepare_commit_mutex across the binary logging? That turns out to be a really good question. After extensive research into the issue, it appears that in fact there is no good reason at all for the mutex to be held.

Comments in the InnoDB code, in the bug tracker, and elsewhere, mention that taking the mutex is necessary to ensure that commits happen in the same order in InnoDB and in the binary log. This is certainly true; without taking the mutex we can have transaction A committed in InnoDB before transaction B, but B written to the binary log before transaction A.
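To make the reordering concrete, here is a minimal sketch that replays a schedule of steps from two transactions and records the resulting order in the binary log and in the engine. The types and function names (Step, Event, replay(), allowed_with_mutex()) are made up for this illustration and do not correspond to anything in the server.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

enum class Step { BinlogWrite, EngineCommit };
using Event = std::pair<char, Step>;  // (transaction name, step)

struct Orders {
    std::string binlog;  // order in which transactions reached the binary log
    std::string engine;  // order in which transactions committed in the engine
};

// Replay a schedule and record both orders.
Orders replay(const std::vector<Event>& schedule) {
    Orders o;
    for (const Event& e : schedule) {
        if (e.second == Step::BinlogWrite) o.binlog += e.first;
        else o.engine += e.first;
    }
    return o;
}

// True if the schedule could happen with a mutex held from the binary log
// write until the engine commit, i.e. no other transaction runs in between.
bool allowed_with_mutex(const std::vector<Event>& schedule) {
    char holder = '\0';  // '\0' means the mutex is free
    for (const Event& e : schedule) {
        assert(e.first != '\0');  // names must differ from the "free" marker
        if (e.second == Step::BinlogWrite) {
            if (holder != '\0') return false;
            holder = e.first;
        } else {
            if (holder != e.first) return false;
            holder = '\0';
        }
    }
    return holder == '\0';
}
```

Replaying the schedule "B writes binlog, A writes binlog, A commits, B commits" gives binary log order BA but engine order AB, and allowed_with_mutex() rejects it. A fully serial schedule gives AB in both places.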

But this just raises the next question: why is it necessary to ensure the same commit order in InnoDB and in the binary log? The only reason that I could find stated is that this is needed for InnoDB hot backup and XtraBackup to be able to extract the correct binary log position corresponding to the state of the engine contained in the backup.

Sergei Golubchik investigated this issue during the 2010 MySQL conference, inspired by the many discussions of group commit that took place there. It turns out that XtraBackup does a FLUSH TABLES WITH READ LOCK when it extracts the binary log position. This statement completely blocks the processing of commits until released, removing any possibility of different commit order in engine and binary log (InnoDB hot backup is closed source, so difficult to check, but presumably works in the same way). So there certainly is no need for holding the prepare_commit_mutex to ensure consistent binary log position for backups!

There is another popular way to do hot backups without using FLUSH TABLES WITH READ LOCK: LVM snapshots. But restoring from an LVM snapshot essentially runs the crash recovery algorithm at restore time. In this case, XA is used to ensure that engine and binary log are consistent at server start, eliminating any need to enforce the same ordering of commits.

So it really seems that there just is no good reason for the prepare_commit_mutex to exist in the first place. Unless someone can come up with a good explanation for why it should be needed, I am forced to conclude that we have lived with 5 years of broken group commit in MySQL solely because of incorrect hearsay about how things should work. Which is kind of sad, and suggests that no one at MySQL or InnoDB ever cared sufficiently to take a serious look at this important issue.

(In order to get full group commit in MySQL there is another issue that needs to be solved. The current binary log code does not include an implementation of group commit, so this also needs to be implemented. Such an implementation should be possible using standard techniques, and is independent of fixing group commit in InnoDB.)

This concludes the second part of the series, showing that group commit can be restored simply by removing the offending prepare_commit_mutex from InnoDB. The third and final article in the series will discuss some deeper issues that arise from looking into this part of the server code, and some interesting ideas for further improving things related to group commit.

Tags: freesoftware, mariadb, mysql, performance, programming