One comment I got revolve around an optimisation in the implementation related to how threads are scheduled.
A crucial step in the group commit algorithm is when the transactions being committed have been written into the binary log, and we want to commit them in the storage engine(s) in the same order as they were committed in the binlog. This ordering requirement makes that part of the commit process serialised (think global mutex).
Even though care is taken to make this serial part very quick to run inside the storage engine(s), I was still concerned about how it would impact scalability on multi-core machines. So I took extra care to minimise the time spent on the server layer in this step.
Suppose we have three transactions being committed as a group, each running in their own connection thread in the server. It would be natural to let the first thread do the first commit, then have the first thread signal the second thread to do the second commit, and finally have the second thread signal the third thread. The problem with this is that now the inherently serial part of the group commit not only includes the work in the storage engines, it also includes the time needed for two context switches (from thread 1 to thread 2, and from thread 2 to thread 3)! This is particularly costly if, after finishing with thread 1, we end up having to wait for thread 2 to be scheduled because all CPU cores are busy.
So what I did instead was to run all of the serial part in a single thread (the thread of the first transaction). The single thread will handle the commit ordering inside the storage engine for all the transactions, and the remaining threads will just wait for the first one to wake them up. This means the context switches for the waiting threads are not included in the serial part of the algorithm. But it also means that the storage engines need to be prepared to run this part of the commit in a separate thread from the rest of the transaction.
So, in Lisbon there was some discussion around if the modifications I did to InnoDB/XtraDB for this were sufficient to ensure that there would not be any problems with this running part of the commit in a different thread. After all, this requirement is a complication. And then the question came up if the above optimisation is actually needed? Does it notably increase performance?
Now, that is a good question, and I did not have an answer as I never tested it.
So now I did! I added an
--binlog-optimize-thread-scheduling to allow to switch
between the naive and the optimised way to handle the commit of the different
transactions in the serial part of the algorithm, and benchmarked them against
Unfortunately, the two many-core servers we have available for testing were both unavailable (our hosting and quality of servers leaves a lot to be desired unfortunately). So I was left to test on a 4-core (8 threads with hyperthreading) desktop box I have in my own office. I was able to get some useful results from this nevertheless, though I hope to revisit the benchmark later on more interesting hardware.
In order to stress the group commit code maximally, I used a syntetic workload
with as many commits per second as possible. I used the fastest disk I have
/dev/shm (Linux ramdisk). The transactions are
single-row updates of the form
REPLACE INTO t (a,b) VALUES (?, ?)The server is an Intel Core i7 quad-core with hyperthreading enabled. It has 8GByte of memory. I used Gypsy to generate the load. Table type is XtraDB. The server is running with
Here are the results in queries per second, with different number of concurrent connections running the queries:
|Number of connections||QPS (naive scheduling)||QPS (optimised scheduling)||QPS (binlog disabled)|
It is nice to see that we can get > 20k commits/second with the group commit code on cheap desktop hardware. For real servers the I/O subsystem will probably be a bottleneck, but that is what I wanted to see: that the group commit code will not limit the ability to fully utilise high amounts of I/O resources.
While I was at it, I also measured the throughput when the binlog is disabled. As can be seen, enabling the binlog has notable performance impact even with very fast disk. Still, considering the added overhead of writing an extra log file, not to mention the added 2-phase commit step, the overhead is not that unreasonable.
From the table we also see some negative scaling as the number of parallel connections increases. Some of this is likely from InnoDB/XtraDB, but I would like to investigate it deeper at some point to see if there is anything in the group commit part that can be improved with respect to this.
Looking back, should I have done this benchmark when designing the code? I think it is a tricky question, and one that cannot be given a simple answer. It will always be a trade-off: It is not feasible to test (and implement!) every conceivable variant of a new feature during development, it is necessary to also rely on common sense and experience. On the other hand, it is dangerous to rely on intuition with respect to performance; time and time again measurements prove that the real world is very often counter to intuition. In this case I was right, and my optimisation was beneficial; however I could easily have been wrong. I think the main lesson here is how important it is to get feedback on complex design work like this; such feedback is crucial for motivating and structuring the work to be of the quality that we need to see in MariaDB.