This is the first in a series of three articles about ideas for implementing
full support for group commit in MariaDB (for the other parts see the second and third articles). Group commit is an
important optimisation for databases that helps mitigate the latency of
physically writing data to permanent storage. Group commit can have a dramatic
effect on performance, as the following graph shows:
The rising blue and yellow lines show transactions per second when group
commit is working, showing greatly improved throughput as the parallelism
(number of concurrently running transactions) increases. The flat red and
green lines show transactions per second with no group commit, with no scaling
at all as parallelism increases. As can be seen, the effect of group commit on
performance can be dramatic, improving throughput by an order of magnitude or
more. More details of this benchmark below, but first some background
Durability and group commit
In a fully transactional system, it is traditionally expected that when a
transaction is committed successfully, it becomes durable. The
term durable is the "D" in ACID, and means that even if the system
crashes the very moment after commit succeeded (power failure, kernel panic,
server software crash, or whatever), the transaction will remain committed
after the system is restarted, and crash recovery has been performed.
The usual way to ensure durability is by writing, to a transactional log file,
sufficient information to fully recover the transaction in case of a crash,
and then use the
fsync() system call to force the data to be
physically written to the disk drive before returning successfully from the
commit operation. This way, in case crash recovery becomes necessary, we know
that the needed information will be available. There are other methods
fsync(), including calling the
system call or using the
O_DIRECT flag when opening the log file,
but for simplicity we will use
fsync() to refer to any method for
forcing data to physical disk.
fsync() is an expensive operation. A good traditional hard disk
drive (HDD) will do around 150 fsyncs per second (10k rotation per minute
drives). A good solid state disk like the Intel X25-M will do around 1200
fsyncs per second. It is possible to use RAID controllers with a battery
backed up cache (which will keep data in cache memory during a power failure
and physically write it to disk when the power returns); this will reduce the
fsync(), but not eliminate it completely.
(There are other ways than
fsync() to ensure durability. For
example in a cluster with synchronous replication
durability can be achieved by making sure the transaction is replicated fully
to multiple nodes, on the assumption that a system failure will not take out
all nodes at once. Whatever method used, ensuring durability is usually
signficantly more expensive that merely committing a transaction to local
So the naive approach, which does
fsync() after every commit,
will limit throughput to around 150 transactions per second (on a standard
HDD). But with group commit we can do much better. If we have several
transactions running concurrently, all waiting to fsync their data in the
commit step, we can use a single
fsync() call to flush them all
to physical storage in one go. The cost of
fsync() is often not
much higher for multiple transactions than for a single one, so as the above
graph shows, this simple optimisation greatly reduces the overhead
fsync() for parallel workloads.
Group commit in MySQL and MariaDB
MySQL (and MariaDB) has full support for ACID when using the popular InnoDB
(XtraDB in MariaDB) storage engine (and there are other storage engines with
ACID support as well). For InnoDB/XtraDB, durability is enabled by
Durability is needed when there is a requirement that committed
transactions must survive a crash. But this is not the only reason
for enabling durability. Another reason is when the binary log is enabled, for
using a server as a replication master.
When the binary log is used for replication, it is important that the content
of the binary log on the master exactly match the changes done in the storage
engine(s). Otherwise the slaves will replicate to different data than what is
on the master, causing replication to diverge and possibly even break if the
differences are such that a query on the master is unable to run on the slave
without error. If we do not have durability, some number of transactions may
be lost after a crash, and if the transactions lost in the storage engines are
not the same as the transactions lost in the binary log, we will end up with
an inconsistency. So with the binary log enabled, durability is needed in
MySQL/MariaDB to be able to recover into a consistent state after a crash.
With the binary log enabled, MySQL/MariaDB uses XA/2-phase commit between the
binary log and the storage engine to ensure the needed durability of all
transactions. In XA, committing a transaction is a three-step process:
- First, a prepare step, in which the transaction is made durable in
the engine(s). After this step, the transaction can still be rolled
back; also, in case of a crash after the prepare phase, the transaction
can be recovered.
- If the prepare step succeeds, the transaction is made durable in the
- Finally, the commit step is run in the engine(s) to make the
transaction actually committed (after this step the transaction can no
longer be rolled back).
The idea is that when the system comes back up after a crash, crash recovery
will go through the binary log. Any prepared (but not committed) transactions
that are found in the binary log will be committed in the storage
engine(s). Other prepared transactions will be rolled back. The result is
guaranteed consistency between the engines and the binary log.
Now, each of the three above steps requires an
to work, making a commit three times as costly in this respect compared to a
commit with the binary log disabled. This makes it all the more important to
use the group commit optimisation to mitigate the overhead from
But unfortunately, group commit does not work in MySQL/MariaDB when the binary
log is enabled! This is the
reported by Peter Zaitsev back in 2005.
So this is what we see in the graph and benchmark shown at the start. This is
a benchmark running a lot of very simple transactions
REPLACE statement on a smallish XtraDB table)
against a server with and without the binary log enabled. This kind of
benchmark is bottlenecked on the fsync throughput of the I/O system when
durability is enabled.
The benchmark is done against two different servers. One has a pair of Western
Digital 10k rpm HDDs (with binary log and XtraDB log on different drives). The
other has a single Intel X25-M SSD. The servers are both running MariaDB
5.1.44, and are configured with durable commits in XtraDB, and with drive
cache turned off (drives like to lie about fsync to look better in casual
The graph shows throughput in transactions per second for different number of
threads running in parallel. For each server, there is a line for results with
the binary log disabled, and one with the binary log enabled.
We see that with one thread, there is some overhead in enabling the binary
log, as is to be expected given that three calls to
required instead of just one.
But much worse, we also see that group commit does not work at all when the
binary log is enabled. While the lines with binary log disabled show excellent
scaling as the parallelism increases, the lines for binary log enabled are
completely flat. With group commit non-functional, the overhead of enabling
the binary log is enourmous at higher parallelism (at 64 threads on HDD it is
actually two orders of magnitude worse with binary log enabled).
So this concludes the first part of the series. We have seen that if we can
get group commit to work when the binary log is enabled, we can expect a huge
gain in performance on workloads that are bottlenecked on the fsync throughput
of the available I/O system.
The second part will go into detail with why the code for group commit does
not work when the binary log is enabled. The third (and final) part will
discuss some ideas about how to fix the code with respect to group commit and
the binary log.