40% better single-threaded performance in MariaDB
Continuing my investigation of
single-threaded performance in the MariaDB server, I managed to increase
throughput of single-threaded read-only sysbench by more than 40% so far.
I use read-only sysbench 0.4.12 run like this:
sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx run
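(For completeness: sysbench 0.4 needs its test table created once before the run, with something like

sysbench --num-threads=1 --test=oltp prepare

leaving the table size at its default.)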
And mysqld is run with minimal options:
sql/mysqld --no-defaults --basedir=X --datadir=Y --innodb-buffer-pool-size=128M
With modern high-performance CPUs, it is necessary to do detailed measurements
using the built-in performance counters in order to get any kind of
understanding of how an application performs and what the bottlenecks are.
Forget about looking at the code and counting instructions or cycles as we did
in the old days. It no longer works, not even to within an order of magnitude.
I am using the Linux
perf program for this. During my
investigations, I found that the main bottleneck in the benchmark turns out to
be instruction cache misses. Here is an example measurement, using the icache
miss ratio defined in the Intel Optimization Manual:

ICACHE.MISSES / INST_RETIRED.ANY = 10%
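(A rough way to collect such numbers, sketched here with the generic perf events rather than the raw Intel counters, might be:

perf stat -e instructions,L1-icache-load-misses -p $(pidof mysqld) -- sleep 60

where the miss ratio is L1-icache-load-misses divided by instructions. The exact event names available vary with CPU generation and perf version.)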
So 10% of executed instructions missed the level 1 instruction cache ("icache"). That is
bad. The Intel Optimization Manual says this about the ratio:
"Anything over 1% of instructions retired can be a significant issue."
So we are 10 times worse than "a significant issue".
Instruction cache misses cause a bottleneck in the frontend of the CPU -
where x86 instructions are fetched, decoded, and supplied to the micro-op queue
to be scheduled for the out-of-order dataflow backend. To get an idea about
how badly bottlenecked we actually are in the frontend in this benchmark, we
can use another measure from the Intel manual:
IDQ_UOPS_NOT_DELIVERED.CORE / (4*CPU_CLK_UNHALTED.THREAD) = 81.5%
This ratio estimates the percentage of time in which the front-end is not able
to deliver instructions sufficiently fast for the backend to work at full
speed. In this case, for more than 80% of the time, the CPU would have been
able to do more work in the same time, if only instructions could be fetched
and decoded fast enough.
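(The two raw counters in that formula can also be read with perf; a sketch, assuming a recent Intel CPU where perf exposes the counter under this name:

perf stat -e idq_uops_not_delivered.core,cycles -p $(pidof mysqld) -- sleep 60

The ratio is then idq_uops_not_delivered.core / (4 * cycles), the factor 4 being the maximum number of micro-ops the frontend can deliver per cycle.)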
Note the importance of this analysis in knowing what areas to work on to
improve performance. Roughly speaking, 80% of our time is spent waiting for
the icache. We could have spent years on improving data locality or branch
mispredictions or whatever in the code, and we would never have been able to
get very big improvements. Now, knowing where the sore spot is, we can spend
our efforts where it matters the most.
In the case of icache misses, the best place to start is with profile-guided
optimisation. This works by first running a specially instrumented mysqld
binary to collect data about the load during the benchmark. Then the binary is
recompiled, where the compiler can take into account the information obtained
to better optimise the generated code.
Using information about how the program actually behaves when run can help the
compiler lay out the basic blocks of the program to benefit the most common
execution paths. This can help reduce the footprint in the icache by reducing
jumps into or out of the middle of a 16-byte instruction stream (such jumps
waste the rest of the 16-byte stream, which is the unit of storage of the
icache). It can also increase the length of straight-line code execution
paths, which should help the hardware prefetcher keep the level 1 icache
filled with the instructions that are about to be needed.
So that is the theory - but does it work in practice? It turns out that it
does. First I compile with
--coverage added to the
CFLAGS/CXXFLAGS. Then I run sysbench to generate the profile data. Then I
recompile, adding instead the
-fprofile-use flag. That is all
there is to it; GCC handles everything else automatically.
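Spelled out, the cycle looks roughly like this (a sketch only; the exact flags and build invocation for MariaDB will differ):

# Step 1: build an instrumented binary that records execution counts
make CFLAGS="-O2 --coverage" CXXFLAGS="-O2 --coverage"
# Step 2: run the training workload; this writes .gcda profile files next to the objects
sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx run
# Step 3: rebuild, letting GCC lay out the code according to the recorded profile
make clean
make CFLAGS="-O2 -fprofile-use" CXXFLAGS="-O2 -fprofile-use"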
Re-running the benchmark with the optimised binary yields a large speedup:
| Binary | Queries per second | Speedup |
So a 44% speedup just from compiling with different optimisation, not bad!
(The actual server-side improvement is in fact even higher, as the sysbench
client consumes part of the runtime due to the single-threaded nature of the
benchmark.)
By the way, that 44% speedup comes from just a modest reduction in icache miss
rate - from 10% down to 8%. It just shows how expensive those icache misses
are. Fetching from the L2 cache takes something like 12 cycles, and during
each of those cycles the CPU will be able to do nothing. Given that we could
potentially execute 4 operations in the backend in each of those clocks, a
single miss can waste on the order of 12 * 4 = 48 operations. So with 8%
misses still left there is room for further huge
improvements, though those improvements will likely be much harder to achieve
than just fiddling with compiler options.
I think it is clear that we will need to start using profile-guided
optimisation for future MariaDB binaries. This will be some work, as we need
to develop a set of typical workloads to optimise for, covering a range of
different common usages, and integrate running them into the compilation. And
more tests with a wider range of benchmarks will be needed to ensure that the
gain is generally applicable and does not cause significant regressions in
other parts of the code. With such a huge improvement in this test I am
confident that things can be generally improved, but it still needs proper
testing.
My work on improving single-threaded performance will continue, as time
permits, and I certainly expect more good results along the way (several
patches are already in the pipeline). But I thought this one was so
interesting that it was worth mentioning to a wider audience.
Tags: database, mariadb, mysql, performance, programming