I found time to continue my previous work on group commit for the binary log in MariaDB.
In current code, a (group) commit to InnoDB does not less than three fsync()
calls:
- Once during InnoDB prepare, to make sure we can recover the transaction in InnoDB if we crash after writing it to the binlog.
- Once after binlog write, to make sure we have the transaction in the binlog before we irrevocably commit it in InnoDB.
- Once during InnoDB commit, to make sure we no longer need to scan the binlog after a crash to recover the transaction.
fsync() after every (group) commit. In fact, it seems hardly
necessary to do such fsync() at all! If we should crash before
the commit record hits the disk, we can always recover the transaction by
scanning the binlogs and checking which of the transactions in InnoDB prepared
state should be committed. Of course, we do not want to keep and scan years
worth of binlogs, but we need only fsync() every so often, not
after every commit.
So I implemented MDEV-232. This removes the fsync() call in the
commit step in InnoDB. Instead, the binlog code requests from InnoDB (and any
other transactional storage engines) to flush all pending commits to disk when
the binlog is rotated. When InnoDB is done with the flush, it reports back to
the binlog code. We keep track of how far InnoDB has flushed by writing
so-called checkpoint events into the current binlog. After a crash, we first
scan the latest binlog. The last checkpoint event found will tell us if we
need to scan any older binlogs to be sure to find all commits that were not
durably committed inside InnoDB prior to the crash.
The result is that we only need to do two fsync() calls per
(group) commit instead of three.
I benchmarked the code on a server with a good disk system - HP RAID
controller with a battery-backed up disk cache. When the cache is enabled,
fsync() is fast, around 400 microseconds. When the cache is
disabled, it is slow, several milliseconds. The setup should be mostly
comparable to Mark Callaghan's benchmarks here and here.
I used sysbench update_non_index.lua to make it easier for others
to understand/reproduce the test. This does a single update of one row in a
table in each transaction. I used 100,000 rows in the table. Group commit is
now so fast that at higher concurrencies, it is no longer the bottleneck. It
will be interesting to test again with the new InnoDB code from MySQL 5.6 and
any other scaslability improvements that have been made there.
Slow fsync()

As can be seen, we have a very substantial improvement, around 30-60% more
commits per second depending on concurrency. Not only are we saving one out of
three expensive fsync() calls, improvements to the locking done
during commit also allow more commits to share the same fsync().
Fast fsync()

Even with fast fsync(), the improvements are substantial.
I am fairly pleased with these results. There is still substantial overhead
from enabling the binlog (like several times slowdown if fsync()
time is the bottleneck), and I have a design for mostly solving this in MWL#164. But
I think perhaps it is now time to turn towards other more important areas. In
particular I would like to turn to MWL#184 -
another method for parallel apply of events on slaves that can help in cases
where the per-database split of workload that exists in Tungsten and MySQL 5.6
can not be used, like many updates to a single table. Improving throughput
even further on the master may not be the most important if slaves are already
struggling to keep up with current throughput, and this is another
relatively simple spin-off from group commit that could greatly help.
For anyone interested, the current code is pushed to lp:~maria-captains/maria/5.5-mdev232
MySQL group commit
It was an interesting coincidence that the new MySQL group commit preview was
published just as I was finishing this work. So I had the
chance to take a quick look and include it in the benchmarks (with slow
fsync()):

While the implementation in MySQL 5.6 preview is completely different from
MariaDB (talk about "not invented here ..."), the basic design is now quite
similar, as far as I could gather from the code. A single thread writes all
transactions in the group into the binlog, in order; likewise a single thread
does the commits (to memory) inside InnoDB, in order. The storage engine
interface is extended with a thd_get_durability_property()
callback for the engines - when the server returns HA_IGNORE_DURABILITY from this,
InnoDB commit() method is changed to work exactly like MariaDB
commit_ordered(): commit to memory but do not sync to disk.
(It remains to see what storage engine developers will think of MySQL implementing a different API for the same functionality ...)
The new MySQL group commit also removes the third fsync() in the
InnoDB commit, same as the new MariaDB code. To ensure they can still recover
after a crash, they just call into the storage engines to sync all commits to
disk during binlog rotate. I actually like that from the point of simplicity -
even if it does stall commits for longer, it is unlikely to matter in
practice. What actually happens inside InnoDB in the two implementations is
identical.
The new MySQL group commit is substantially slower than the new MariaDB group
commit in this benchmark. My guess is that this is in part due to suboptimal inter-thread
communication. As I wrote about earlier, this is crucial to get best
performance at high commit rates, and the MySQL code seems to do additional
synchronisation between what they call stages - binlog write, binlog
fsync(), and storage engine commit. Since the designs are now
basically identical, it should not be hard to get this fixed to perform the same as
MariaDB. (Of course, if they had started from my work, they could have spent
the effort improving that even more, rather than wasting it on catch-up).
Note that the speedup from group commit (any version of it) is highly
dependent on the workload and the speed of the disk system. With fast
transactions, slow fsync(), and high concurrency, the speedup
will be huge. With long transactions, fast fsync(), and low
concurrency, the speedups will be modest, if any.
Incidentally, the new MySQL group commit is a change from the designs described earlier, where individual commit threads would use
pwrite() in parallel into the binary log. I am convinced this is
a good change. The writing to binlog is just memcpy() between
buffers, a single thread can do gigabytes worth of that, it is not where the
bottleneck is. While it is crucial to optimise the inter-thread communication,
as I found out here - and lots of small parallel
pwrite() calls into the same few data blocks at the end of a file
delivered to the file system is not likely to be a success. If binlog write
bandwidth would really turn out to be a problem the solution is to have
multiple logs in parallel - but I think we are quite far from being there yet.
It is a pity that we cannot work together in the MySQL world. I approached the MySQL developers several times over the past few years suggesting we work together, with no success. There are trivial bugs in the MySQL group commit preview whose fix yield great speedup. I could certainly have used more input while doing my implementation. The MySQL user community could have much better quality if we would only work together.
Instead, Oracle engineers use their own bugtracker which is not accessible to others, push to their own development trees which are not accessible to others, communicate on their own mailing lists which are not accessible to others, hold their own developer meetings which are not accessible to others ... the list is endless.
The most important task when MySQL was aquired was to collect the different development groups working on the code base and create a real, great, collaborative Open Source project. Oracle has totally botched this task up. Instead, what we have is lots of groups each working on their own tree, with no real interesting in collaborating. I am amazed every time I read some prominent MySQL community member praise the Oracle stewardship of MySQL. If these people are not interested in creating a healthy Open Source project and just want to not pay for their database software, why do they not go use the express/cost-free editions of SQL server or Oracle or whatever?
It is kind of sad, really.