MySQL/MariaDB single-threaded performance regressions, and a lesson in thread synchronisation primitives
I took a quick look at MariaDB 10.0 single-threaded performance (simple
read-only sysbench). One thing immediately leaps to the eye, and I thought it
worthy of mention. It contains an important lesson about the use of
synchronisation primitives, and in particular "atomic operations", in MariaDB.
I am using the Linux perf tool on this sysbench command:

    sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx
Look at the top offender in the output of perf report:
    1,54%  mysqld  mysqld  [.] set_thread_state_v1
The only thing this does is set a string for SHOW PROCESSLIST (and the like)
about what the thread is doing. And we are spending a whopping 1.5% of the
total time doing this.
And why? That becomes clear when looking at the disassembly (extract):
    2d: mov    0x7b0(%rbx),%eax
    33: mov    %eax,%edx
        lock cmpxchg %eax,0x7b0(%rbx)
    5e: mov    %eax,%edx
        lock cmpxchg %eax,0x7b0(%rbx)
No less than two locked cmpxchg instructions. Each of these implies a full
memory barrier, which is really expensive.
To achieve their very high performance, modern CPUs process many instructions
at the same time, a hundred or more is possible. But for a full memory
barrier, the CPU needs to stop and ensure that all pending store instructions
have fully reached the last-level cache. Only then can it start processing any
following instructions involving a load. This is not something we should be
doing every time a thread moves to a new stage of query processing.
I am not really familiar with the performance schema code (nor am I sure I
want to be). But let us make some guess at what the code is trying to do here.
Here is the code from which the first locked instruction originates:
    /**
      Execute an allocated to dirty transition.
      This transition should be executed by the writer that owns the record,
      before the record is modified.
    */
    uint32 copy= PFS_atomic::load_u32(&m_version_state);
    uint32 new_val= (copy & VERSION_MASK) + PFS_LOCK_DIRTY;
    /* We own the record, no need to use compare and swap. */
So perhaps the idea is that only the thread itself can modify its state, but
other threads can read it. And we want the update to be fast (I suppose
updating the state is more important than inspecting it). So the developers
wanted to avoid a lock that could block the writer. Instead, it sets a flag to
mark that the data is being modified before the update. And clears the flag
after. And a reader can then check the flag; if the flag was modified during
a read, the data is potentially inconsistent, and the read can be re-tried.
Now, this really is poorly designed. The store of the new version value is
implemented using my_atomic_store32(), which is inappropriate here. The
my_atomic() facility implies a full memory barrier in all of its
operations, which makes them somewhat less useful for performance-critical
code.
All we need here is that the setting of the update flag becomes visible before
changes to the state, and that the clearing of the flag becomes visible
after. Thus a pair of write memory barriers in the writer (and corresponding
read memory barriers in the reader) should be sufficient.
The write and read memory barriers are much less expensive than a
full barrier. All they do is prevent a store (respectively a load) after the
barrier from being satisfied before a store (load) before the barrier. There
is no need for the CPU to stop execution. In fact, on x86, there is usually
nothing special to do, because stores are never re-ordered with other stores
(nor loads with other loads) unless the special non-temporal memory
instructions are used.
But look closer at the code:
    static void set_thread_state_v1(const char* state)
    {
      int state_len= state ? strlen(state) : 0;
So it is storing a pointer to a string along with the length, and it needs to
do synchronisation to ensure that the two are consistent with each other. But
suppose instead that we store only the pointer. This can be done atomically without _any_
need for memory barriers or other synchronisation (as long as the memory
location is naturally aligned). That would seem to be even better than using
the (cheaper) write- and read-barriers.
To get a better idea of how bad the use of full memory barriers is for
performance, I wrote a quick and dirty test program.
It has code similar to what is in
set_thread_state_v1(), as well
as three other variants of it: One using just a write memory barrier, one
using no memory barrier (not SMP safe, but useful for comparison), and one
using normal pthread mutexes rather than a lock-free algorithm. Here is the
time taken for 100,000,000 iterations of the function:
|               | Wallclock (s) | Cycles | Cycles overhead |
| Original      | 3.70          | 81     | 71              |
| Write barrier | 0.53          | 12     | 2               |
| No barrier    | 0.45          | 10     | -               |
| Pthread mutex | 2.26          | 50     | 40              |

Note how expensive the full memory barrier version is. The overhead (compared
to no barrier) is 35 times larger than with the lesser write barrier. And I
suspect that the real overhead is even bigger in real life than in this
synthetic micro-benchmark (for example, imagine if we get a cache miss on one
of those locked instructions).
The sad thing here is that the "clever" lock-free algorithm in this case
actually makes things slower than just using a simple pthread mutex: the
overhead of the original code is some 75% higher than that of the mutex
version. This is not really surprising. The main cost of a mutex operation is
also just a single full memory barrier, but pthread mutexes have been very
carefully tuned and optimised by some very knowledgeable people, unlike
the ad-hoc performance schema code. Lock-free algorithms require a lot of
knowledge and care to get right; it is not simple stuff.
I think there is an important lesson here about our use of synchronisation
primitives in MariaDB/MySQL. While there is a lot of focus on mutex contention
issues, and various parts of the code are starting to use lock-free
algorithms, there does not seem to be a general understanding of the issues
around memory barriers among the developers. In fact, to my knowledge there
are no general facilities for explicit memory barriers. We have
my_atomic.h, which confuses
the issues of "atomicity" and "memory barriers" and defines a handful of
"atomic" operations that are really locked instructions implying a full memory
barrier, as in the example above. As the test program shows, it is really
important to distinguish between the different kinds of memory barriers and
to understand the performance implications of each. I have always been
really unhappy about
my_atomic.h, and I think this is an area
in which we need to improve significantly.
And who knows, maybe the performance schema developers can take a hint or two.
Tags: database, mariadb, mysql, performance, programming