This is the first in a series of three articles about ideas for implementing full support for group commit in MariaDB (for the other parts see the second and third articles). Group commit is an important optimisation for databases that helps mitigate the latency of physically writing data to permanent storage. Group commit can have a dramatic effect on performance, as the following graph shows:
The rising blue and yellow lines show transactions per second when group commit is working: throughput improves greatly as the parallelism (the number of concurrently running transactions) increases. The flat red and green lines show transactions per second with no group commit; here there is no scaling at all as parallelism increases. As can be seen, the effect of group commit on performance can be dramatic, improving throughput by an order of magnitude or more. More details of this benchmark follow below, but first some background information.
Durability and group commit
In a fully transactional system, it is traditionally expected that when a transaction is committed successfully, it becomes durable. The term durable is the "D" in ACID, and means that even if the system crashes the very moment after commit succeeded (power failure, kernel panic, server software crash, or whatever), the transaction will remain committed after the system is restarted, and crash recovery has been performed.
The usual way to ensure durability is to write, to a transactional log file, sufficient information to fully recover the transaction in case of a crash, and then use the fsync() system call to force the data to be physically written to the disk drive before returning successfully from the commit operation. This way, if crash recovery becomes necessary, we know that the needed information will be available. There are other methods than fsync(), such as calling the fdatasync() system call or opening the log file with the O_DIRECT flag, but for simplicity we will use fsync() to refer to any method of forcing data to physical disk.
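To make this concrete, here is a minimal sketch (not actual MariaDB code; the function and variable names are made up for illustration) of a commit made durable by appending a record to the transaction log and forcing it to disk before the commit is acknowledged:

```c
#include <unistd.h>

/* Minimal sketch of a durable commit: append a commit record to the
   transaction log and force it to physical storage before returning.
   A real server buffers log records and handles partial writes and
   I/O errors much more carefully. */
int
durable_commit(int log_fd, const void *commit_record, size_t len)
{
  /* Step 1: write enough information to the log to redo the
     transaction during crash recovery. */
  if (write(log_fd, commit_record, len) != (ssize_t)len)
    return -1;

  /* Step 2: force the data to physical disk. Only after fsync()
     returns is the transaction durable. */
  if (fsync(log_fd) != 0)
    return -1;

  return 0;  /* Safe to report "commit OK" to the client. */
}
```

The key point is that the commit cannot be reported as successful until the fsync() call has returned.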
fsync() is an expensive operation. A good traditional hard disk drive (HDD) will do around 150 fsyncs per second (for a 10,000 RPM drive). A good solid-state disk like the Intel X25-M will do around 1200 fsyncs per second. It is possible to use RAID controllers with a battery-backed cache (which will keep data in cache memory during a power failure and physically write it to disk when power returns); this reduces the overhead of fsync(), but does not eliminate it completely.
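The fsync rate of a given drive is easy to measure with a small test program along the following lines (a rough sketch; the file name and iteration count are arbitrary, and the file should be placed on the drive being tested):

```c
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Rough sketch of an fsync-rate micro-benchmark: repeatedly write a
   small record and fsync it, then report fsyncs per second. */
int main(void)
{
  int fd = open("fsync_test.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) { perror("open"); return 1; }

  const int iterations = 1000;
  char buf[128] = "dummy commit record";
  struct timespec start, end;

  clock_gettime(CLOCK_MONOTONIC, &start);
  for (int i = 0; i < iterations; i++) {
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) { perror("write"); return 1; }
    if (fsync(fd) != 0) { perror("fsync"); return 1; }
  }
  clock_gettime(CLOCK_MONOTONIC, &end);

  double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
  printf("%.0f fsyncs per second\n", iterations / secs);
  close(fd);
  return 0;
}
```

Note that numbers like the ones quoted above only hold when the drive's write cache is disabled; with the cache enabled, fsync() may return before the data is actually on stable storage.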
(There are other ways than fsync() to ensure durability. For example, in a cluster with synchronous replication (like NDB or Galera), durability can be achieved by making sure the transaction is fully replicated to multiple nodes, on the assumption that a system failure will not take out all nodes at once. Whatever method is used, ensuring durability is usually significantly more expensive than merely committing a transaction to local memory.)
So the naive approach, which does an fsync() after every commit, will limit throughput to around 150 transactions per second (on a standard HDD). But with group commit we can do much better. If we have several transactions running concurrently, all waiting to fsync their data in the commit step, we can use a single fsync() call to flush them all to physical storage in one go. The cost of fsync() is often not much higher for multiple transactions than for a single one, so as the above graph shows, this simple optimisation greatly reduces the overhead of fsync() for parallel workloads.
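The idea can be sketched roughly as follows (a simplified illustration with made-up names, not the actual server code): each transaction writes its log record, then one thread acts as the "leader" for the group and issues a single fsync() that makes every queued transaction durable at once, while the others simply wait to be woken up:

```c
#include <pthread.h>
#include <unistd.h>

/* Simplified group commit sketch. Transactions are identified by their
   position (sequence number) in the log; one thread at a time becomes
   the leader, fsyncs everything queued so far, and wakes the group. */

static pthread_mutex_t group_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  group_cond  = PTHREAD_COND_INITIALIZER;
static int flush_in_progress = 0;
static unsigned long long flushed_up_to = 0;  /* durable up to this sequence */
static unsigned long long last_queued   = 0;  /* last record written to the log */

/* Called by a committing transaction after its log record has been
   written (but not yet synced); my_seq is its position in the log. */
void group_commit_fsync(int log_fd, unsigned long long my_seq)
{
  pthread_mutex_lock(&group_mutex);
  if (my_seq > last_queued)
    last_queued = my_seq;

  while (flushed_up_to < my_seq) {
    if (!flush_in_progress) {
      /* Become the leader: one fsync() covers the whole group. */
      unsigned long long flush_target = last_queued;
      flush_in_progress = 1;
      pthread_mutex_unlock(&group_mutex);

      fsync(log_fd);

      pthread_mutex_lock(&group_mutex);
      flushed_up_to = flush_target;
      flush_in_progress = 0;
      pthread_cond_broadcast(&group_cond);  /* wake all waiting transactions */
    } else {
      /* Another thread is already flushing; wait for it to finish. */
      pthread_cond_wait(&group_cond, &group_mutex);
    }
  }
  pthread_mutex_unlock(&group_mutex);
}
```

With many transactions committing concurrently, this turns what would have been one fsync() per transaction into one (or a few) per group, which is where the scaling in the graph comes from.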
Group commit in MySQL and MariaDB
MySQL (and MariaDB) has full support for ACID when using the popular InnoDB (XtraDB in MariaDB) storage engine (and there are other storage engines with ACID support as well). For InnoDB/XtraDB, durability is enabled by setting innodb_flush_log_at_trx_commit=1.
Durability is needed when there is a requirement that committed transactions must survive a crash. But this is not the only reason for enabling it; another reason is when the binary log is enabled, for example to use the server as a replication master.
When the binary log is used for replication, it is important that the content of the binary log on the master exactly matches the changes done in the storage engine(s). Otherwise the slaves will end up with different data than the master, causing replication to diverge and possibly even break, if the differences are such that a query from the master cannot run on the slave without error. If we do not have durability, some number of transactions may be lost after a crash, and if the transactions lost in the storage engines are not the same as the transactions lost in the binary log, we end up with an inconsistency. So with the binary log enabled, durability is needed in MySQL/MariaDB to be able to recover into a consistent state after a crash.
With the binary log enabled, MySQL/MariaDB uses XA/2-phase commit between the binary log and the storage engine to ensure the needed durability of all transactions. In XA, committing a transaction is a three-step process (sketched in code below):
- First, a prepare step, in which the transaction is made durable in the engine(s). After this step, the transaction can still be rolled back; also, in case of a crash after the prepare phase, the transaction can be recovered.
- If the prepare step succeeds, the transaction is made durable in the binary log.
- Finally, the commit step is run in the engine(s) to make the transaction actually committed (after this step the transaction can no longer be rolled back).
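As a rough illustration (hypothetical function names and stubs, not the actual server API), the three steps amount to something like the following:

```c
#include <stdio.h>

/* Hypothetical sketch of the 2-phase commit between the binary log and
   a storage engine. The stubs just print; in the real server each step
   involves durably writing (fsyncing) a log. */

typedef struct { const char *name; } Transaction;

static int engine_prepare(Transaction *t)        { printf("prepare %s in engine (engine log fsync)\n", t->name); return 0; }
static int binlog_write_and_sync(Transaction *t) { printf("write %s to binary log (binlog fsync)\n", t->name); return 0; }
static int engine_commit(Transaction *t)         { printf("commit %s in engine (engine log fsync)\n", t->name); return 0; }
static int rollback(Transaction *t)              { printf("roll back %s\n", t->name); return 1; }

int two_phase_commit(Transaction *trx)
{
  /* Step 1: prepare in the engine(s). The transaction is durable in the
     engine but can still be rolled back, or recovered after a crash. */
  if (engine_prepare(trx) != 0)
    return rollback(trx);

  /* Step 2: make the transaction durable in the binary log. */
  if (binlog_write_and_sync(trx) != 0)
    return rollback(trx);

  /* Step 3: final commit in the engine(s); after this, no rollback. */
  return engine_commit(trx);
}

int main(void)
{
  Transaction trx = { "trx1" };
  return two_phase_commit(&trx);
}
```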
Now, each of the three above steps requires an fsync() to work, making a commit three times as costly in this respect compared to a commit with the binary log disabled. This makes it all the more important to use the group commit optimisation to mitigate the overhead of fsync().
But unfortunately, group commit does not work in MySQL/MariaDB when the binary log is enabled! This is the infamous Bug#13669, reported by Peter Zaitsev back in 2005.
So this is what we see in the graph and benchmark shown at the start. This is a benchmark running a lot of very simple transactions (a single REPLACE statement on a smallish XtraDB table) against a server with and without the binary log enabled. This kind of benchmark is bottlenecked on the fsync throughput of the I/O system when durability is enabled.
The benchmark is done against two different servers. One has a pair of Western Digital 10k rpm HDDs (with binary log and XtraDB log on different drives). The other has a single Intel X25-M SSD. The servers are both running MariaDB 5.1.44, and are configured with durable commits in XtraDB, and with drive cache turned off (drives like to lie about fsync to look better in casual benchmarks).
The graph shows throughput in transactions per second for different numbers of threads running in parallel. For each server, there is one line for results with the binary log disabled, and one with the binary log enabled.
We see that with one thread, there is some overhead in enabling the binary log, as is to be expected given that three calls to fsync() are required instead of just one.
But much worse, we also see that group commit does not work at all when the binary log is enabled. While the lines with binary log disabled show excellent scaling as the parallelism increases, the lines for binary log enabled are completely flat. With group commit non-functional, the overhead of enabling the binary log is enormous at higher parallelism (at 64 threads on HDD it is actually two orders of magnitude worse with binary log enabled).
So this concludes the first part of the series. We have seen that if we can get group commit to work when the binary log is enabled, we can expect a huge gain in performance on workloads that are bottlenecked on the fsync throughput of the available I/O system.
The second part will go into detail about why group commit does not work when the binary log is enabled. The third (and final) part will discuss some ideas for how to fix the code so that group commit works together with the binary log.