Benchmarking thread scheduling in group commit
The best part of the recent MariaDB meeting in Lisbon for me
was that I got some good feedback on my group
commit work. This has been waiting in the review
queue for quite some time now.
One comment I got revolved around an optimisation in the implementation related
to how threads are scheduled.
A crucial step in the group commit algorithm is when the transactions being
committed have been written into the binary log, and we want to commit them in
the storage engine(s) in the same order as they were written into the
binlog. This ordering requirement makes that part of the commit process
serialised (think global mutex).
Even though care is taken to make this serial part very quick to run inside
the storage engine(s), I was still concerned about how it would impact
scalability on multi-core machines. So I took extra care to minimise the time
spent on the server layer in this step.
Suppose we have three transactions being committed as a group, each running in
their own connection thread in the server. It would be natural to let the
first thread do the first commit, then have the first thread signal the second
thread to do the second commit, and finally have the second thread signal the
third thread. The problem with this is that now the inherently serial part of
the group commit not only includes the work in the storage engines, it also
includes the time needed for two context switches (from thread 1 to thread 2,
and from thread 2 to thread 3)! This is particularly costly if, after finishing
with thread 1, we end up having to wait for thread 2 to be scheduled because
all CPU cores are busy.
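As a toy illustration (my own simplified model using Python threads, not the actual server code), the naive hand-off chain looks roughly like this, with each wake-up of the next thread happening inside the serial section:

```python
import threading

def naive_group_commit(num_txns):
    """Toy model of the naive scheduling: each transaction's own thread
    performs its engine commit, then wakes the next thread in binlog order."""
    commit_order = []
    # events[i] set => thread i may run its engine commit
    events = [threading.Event() for _ in range(num_txns)]

    def worker(i):
        events[i].wait()        # wait to be woken by the previous thread
        commit_order.append(i)  # the (fast) commit step in the storage engine
        if i + 1 < num_txns:
            # Waking thread i+1 puts a context switch inside the serial
            # section; if all CPU cores are busy, we may additionally wait
            # for the scheduler before the next commit can even start.
            events[i + 1].set()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_txns)]
    for t in threads:
        t.start()
    events[0].set()             # the first transaction starts the chain
    for t in threads:
        t.join()
    return commit_order

print(naive_group_commit(3))   # commits strictly in binlog order: [0, 1, 2]
```

The chain guarantees binlog order, but every link in it pays for one thread wake-up before the next commit can proceed.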
So what I did instead was to run all of the serial part in a single thread
(the thread of the first transaction). The single thread will handle the
commit ordering inside the storage engine for all the transactions, and the
remaining threads will just wait for the first one to wake them up. This means
the context switches for the waiting threads are not included in the serial
part of the algorithm. But it also means that the storage engines need to be
prepared to run this part of the commit in a separate thread from the rest of
the transaction.
So, in Lisbon there was some discussion around whether the modifications I made
to InnoDB/XtraDB for this were sufficient to ensure that there would not be any
problems with running this part of the commit in a different thread. After
all, this requirement is a complication. And then the question came
up whether the above optimisation is actually needed. Does it notably increase
performance?
Now, that is a good question, and I did not have an answer, as I had never
tested it. So now I did! I added an option,
--binlog-optimize-thread-scheduling, to allow switching
between the naive and the optimised way of handling the commit of the different
transactions in the serial part of the algorithm, and benchmarked them against
each other.
Unfortunately, the two many-core servers we normally have available for testing
were both down (the hosting and quality of our servers leave a lot to be
desired). So I was left to test on a 4-core (8 threads with
hyperthreading) desktop box I have in my own office. I was able to get some
useful results from this nevertheless, though I hope to revisit the benchmark
later on more interesting hardware.
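For comparison with the naive chain, here is the same kind of toy model (again my own simplification, not the server code) of the optimised scheduling: the first thread commits every transaction in the group itself, and the waiting threads are only woken afterwards, outside the serial section:

```python
import threading

def leader_group_commit(num_txns):
    """Toy model of the optimised scheduling: the first thread runs the
    whole serial section, committing all transactions in binlog order."""
    commit_order = []
    done = threading.Event()

    def leader():
        for i in range(num_txns):
            commit_order.append(i)  # engine commits back to back, no switches
        done.set()                  # wake the other threads only afterwards

    def follower(i):
        done.wait()                 # just sleep until the leader is finished

    threads = [threading.Thread(target=leader)]
    threads += [threading.Thread(target=follower, args=(i,))
                for i in range(1, num_txns)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return commit_order

print(leader_group_commit(3))  # [0, 1, 2], with no hand-offs in between
```

The commit order is the same in both models; the difference is purely where the thread wake-ups happen relative to the serial section.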
In order to stress the group commit code maximally, I used a synthetic workload
with as many commits per second as possible. I used the fastest "disk" I have
available: /dev/shm (a Linux ramdisk). The transactions are
single-row updates of the form:
    REPLACE INTO t (a,b) VALUES (?, ?)
The machine is an Intel Core i7 quad-core with hyperthreading enabled and
8 GB of memory. I used Gypsy to generate the load.
Table type is XtraDB.
Here are the results in queries per second, with different numbers of
concurrent connections running the queries:

Number of connections | QPS (naive scheduling) | QPS (optimised scheduling) | QPS (binlog disabled)

As we can see from this table, even with just four cores we get noticeably
better performance by running the serial part of group commit in a single
thread. The improvement is around 10% or so, depending on parallelism. So I
think this means that I will want to keep the optimised version.
It is nice to see that we can get > 20k commits/second with the group commit
code on cheap desktop hardware. On real servers the I/O subsystem will
probably be the bottleneck, but that is what I wanted to see: that the group
commit code will not limit the ability to fully utilise large amounts of I/O
capacity.
While I was at it, I also measured the throughput when the binlog is
disabled. As can be seen, enabling the binlog has a notable performance impact,
even with a very fast disk. Still, considering the added overhead of writing an
extra log file, not to mention the added 2-phase commit step, the overhead is
not that unreasonable.
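To make that extra work concrete: with the binlog enabled, each commit roughly follows the standard 2-phase commit between the storage engine and the binlog. A minimal sketch of the step ordering (my own toy model; the names are illustrative, not the server's real internal API):

```python
# Toy model of the commit path with the binlog enabled: a 2-phase commit
# with the binlog acting as the coordinator log.
class FakeLog:
    def __init__(self, name, trace):
        self.name, self.trace = name, trace
    def record(self, action):
        self.trace.append(f"{self.name} {action}")

def commit_with_binlog(trace):
    engine = FakeLog("engine", trace)
    binlog = FakeLog("binlog", trace)
    engine.record("prepare")  # phase 1: txn durable in engine, not committed
    binlog.record("write")    # txn written to the binary log
    binlog.record("fsync")    # the extra log flush the binlog adds
    engine.record("commit")   # phase 2: engine commit, in binlog order
    return trace

print(commit_with_binlog([]))
# ['engine prepare', 'binlog write', 'binlog fsync', 'engine commit']
```

So compared to running with the binlog disabled, every commit carries an extra log write and flush plus the prepare step, which is where the measured overhead comes from.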
From the table we also see some negative scaling as the number of parallel
connections increases. Some of this is likely from InnoDB/XtraDB, but I would
like to investigate it more deeply at some point to see if there is anything in
the group commit part that can be improved in this respect.
Looking back, should I have done this benchmark when designing the code? I
think it is a tricky question, and one that cannot be given a simple
answer. It will always be a trade-off: it is not feasible to test (and
implement!) every conceivable variant of a new feature during development, so
it is necessary to also rely on common sense and experience. On the other
hand, it is dangerous to rely on intuition with respect to performance; time
and time again measurements prove that the real world is very often counter to
intuition. In this case I was right, and my optimisation was beneficial;
however, I could easily have been wrong. I think the main lesson here is how
important it is to get feedback on complex design work like this; such
feedback is crucial for motivating and structuring the work to be of the
quality that we need to see in MariaDB.
Tags: mariadb, mysql, performance, programming