Kristian Nielsen (kristiannielsen) wrote,

MySQL/MariaDB single-threaded performance regressions, and a lesson in thread synchronisation primitives

I took a quick look at MariaDB 10.0 single-threaded performance (simple read-only sysbench). One thing immediately leaps to the eye, and I thought it worthy of mention. It contains an important lesson about the use of synchronisation primitives, and in particular "atomic operations", in MariaDB (and MySQL).

I am using the Linux perf tool on this sysbench command:

  sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx
Look at the top offender in the output from perf report:
  1,54%  mysqld  mysqld               [.] set_thread_state_v1
The only thing this does is set a string for SHOW PROCESSLIST (and the like) about what the thread is doing. And we are spending a whopping 1.5% of the total time doing this.

And why? That becomes clear when looking at the disassembly (extract):

  2d:   mov    0x7b0(%rbx),%eax
  33:   mov    %eax,%edx
        lock   cmpxchg %eax,0x7b0(%rbx)
        jne    33
        and    $0xfffffffc,%edx
        add    $0x1,%edx
        xchg   %edx,0x7b0(%rbx)
        mov    %rbp,0x7b8(%rbx)
        mov    %ecx,0x7c0(%rbx)
        mov    0x7b0(%rbx),%eax
  5e:   mov    %eax,%edx
        lock   cmpxchg %eax,0x7b0(%rbx)
        jne    5e
        and    $0xfffffffc,%edx
        add    $0x6,%edx
        xchg   %edx,0x7b0(%rbx)
No less than two locked cmpxchg instructions. Each of these implies a full memory barrier, which is really expensive.

To achieve their very high performance, modern CPUs process many instructions at the same time; a hundred or more in flight at once is possible. But for a full memory barrier, the CPU needs to stop and ensure that all pending store instructions have fully reached the last-level cache. Only then can it start processing any following instructions involving a load. This is not something we should be doing every time a thread moves to a new stage of query processing.
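To make the difference concrete, here is a small illustration using C++11 std::atomic (this is not the my_atomic code, just a sketch of the two kinds of store). On x86-64, GCC typically compiles the sequentially consistent store to an xchg instruction, which is a full barrier, while the release store is an ordinary mov:

  #include <atomic>

  std::atomic<unsigned> stage;

  /* Full-barrier store: compiles to xchg (or mov + mfence) on x86-64,
     stalling until the store buffer has drained. */
  void enter_stage_full_barrier(unsigned s)
  {
    stage.store(s, std::memory_order_seq_cst);
  }

  /* Release store: a plain mov on x86-64; earlier stores in this thread
     are still guaranteed to become visible before it. */
  void enter_stage_release(unsigned s)
  {
    stage.store(s, std::memory_order_release);
  }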

I am not really familiar with the performance schema code (nor am I sure I want to be). But let us make a guess at what the code is trying to do here. Here is the code from which the first locked instruction originates:

  /**
    Execute an allocated to dirty transition.
    This transition should be executed by the writer that owns the record,
    before the record is modified.
  */
  void allocated_to_dirty(void)
  {
    uint32 copy= PFS_atomic::load_u32(&m_version_state);
    uint32 new_val= (copy & VERSION_MASK) + PFS_LOCK_DIRTY;
    /* We own the record, no need to use compare and swap. */
    PFS_atomic::store_u32(&m_version_state, new_val);
  }
So perhaps the idea is that only the thread itself can modify its state, while other threads can read it. And we want the update to be fast (I suppose updating the state is more important than inspecting it). So the developers wanted to avoid a lock that could block the writer. Instead, the writer sets a flag to mark that the data is being modified before the update, and clears the flag afterwards. A reader can then check the flag; if the flag was modified during a read, the data is potentially inconsistent, and the read can be retried.
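For illustration, here is a sketch of what the reader side of such a scheme might look like. This is not the actual performance schema code; the names and the exact flag layout are made up, with the low bit used as the "dirty" flag and the rest as a version counter:

  #include <atomic>
  #include <cstdint>

  struct thread_state
  {
    std::atomic<uint32_t> version_state;  /* version in the high bits, dirty flag in bit 0 */
    const char *state_ptr;
    int state_len;
  };

  /* Returns false if the writer was active; the caller then retries. */
  bool try_read_state(const thread_state *ts, const char **ptr, int *len)
  {
    uint32_t before= ts->version_state.load(std::memory_order_acquire);
    if (before & 1)                        /* writer is in the middle of an update */
      return false;
    *ptr= ts->state_ptr;                   /* copy out the data */
    *len= ts->state_len;
    /* Read barrier: the loads above must be satisfied before we re-check
       the version below. */
    std::atomic_thread_fence(std::memory_order_acquire);
    uint32_t after= ts->version_state.load(std::memory_order_relaxed);
    return after == before;                /* unchanged => the copy is consistent */
  }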

Now, this really is poorly designed. The PFS_atomic::load_u32() and store_u32() used here are implemented using my_atomic_load32() and my_atomic_store32(), which is inappropriate in this situation. The my_atomic() facility implies a full memory barrier in all of its operations, which makes them somewhat less useful for performance-critical code.

All we need here is that the setting of the update flag becomes visible before the changes to the state, and that the clearing of the flag becomes visible after. Thus a pair of write memory barriers in the writer (and corresponding read memory barriers in the reader) should be sufficient.

The write and read memory barriers are much less expensive than a full barrier. All they do is ensure that stores (respectively loads) issued before the barrier become visible (are satisfied) before stores (loads) issued after it. There is no need for the CPU to stop execution. In fact, on x86, there is usually nothing special to do, because the hardware never reorders stores with other stores, nor loads with other loads, unless the special non-temporal memory instructions are used.
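As a sketch of the idea (this is not the actual server code, and the names are made up), the writer could look something like this:

  #include <atomic>
  #include <cstdint>

  std::atomic<uint32_t> version_state;   /* even: clean, odd: writer is updating */
  const char *state_ptr;
  int state_len;

  /* Only the owning thread calls this, so no compare-and-swap is needed. */
  void set_state(const char *s, int len)
  {
    uint32_t v= version_state.load(std::memory_order_relaxed);

    /* Mark the record dirty.  The release fence is the first write
       barrier: this store becomes visible before the data stores below. */
    version_state.store(v + 1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);

    state_ptr= s;
    state_len= len;

    /* Clear the dirty flag and bump the version.  The release store is
       the second write barrier: the data stores above become visible
       before it. */
    version_state.store(v + 2, std::memory_order_release);
  }

On x86-64 this compiles to plain mov instructions; neither the release fence nor the release store emits any barrier or locked instruction at all.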

But look closer at the code:

static void set_thread_state_v1(const char* state)
  ...
    int state_len= state ? strlen(state) : 0;
    pfs->m_processlist_lock.allocated_to_dirty();
    pfs->m_processlist_state_ptr= state;
    pfs->m_processlist_state_length= state_len;
    pfs->m_processlist_lock.dirty_to_allocated();
So it is storing a pointer to a string along with the length, and it needs to do synchronisation to ensure that the two are consistent with each other. But suppose instead that we store only the pointer. This can be done atomically without _any_ need for memory barriers or other synchronisation (as long as the memory location is naturally aligned). That would seem to be even better than using the (cheaper) write- and read-barriers.
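As a sketch of that idea (again not actual server code, and assuming the state strings are constants that never change after program start, as stage names are), it could look like this:

  #include <atomic>
  #include <cstddef>
  #include <cstring>

  std::atomic<const char *> processlist_state{nullptr};

  /* Writer: the thread itself.  A naturally aligned pointer store is
     atomic, and no barrier or locked instruction is needed. */
  void set_state(const char *state)
  {
    processlist_state.store(state, std::memory_order_relaxed);
  }

  /* Reader: SHOW PROCESSLIST and the like.  The string pointed to is a
     constant, so the length can simply be recomputed. */
  std::size_t read_state(const char **out)
  {
    const char *s= processlist_state.load(std::memory_order_relaxed);
    *out= s;
    return s ? strlen(s) : 0;
  }

A reader might momentarily see a slightly stale pointer, but for a status string shown in SHOW PROCESSLIST that hardly matters.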

To get a better idea of how bad the use of full memory barriers is for performance, I wrote a quick and dirty test program. It has code similar to what is in set_thread_state_v1(), as well as three other variants of it: One using just a write memory barrier, one using no memory barrier (not SMP safe, but useful for comparison), and one using normal pthread mutexes rather than a lock-free algorithm. Here is the time taken for 100,000,000 iterations of the function:

  Variant          wallclock (s)   cycles   cycles overhead
  Original                  3.70       81                71
  Write barrier             0.53       12                 2
  No barrier                0.45       10                 -
  Pthread mutex             2.26       50                40
Note how expensive the full memory barrier version is. The overhead (compared to no barrier) is 35 times larger than that of the lighter write barrier. And I suspect that the real overhead is even bigger in real life than in this synthetic micro-benchmark (for example, imagine if we get a cache miss on one of those locked instructions).

Note the sad thing here: the "clever" lock-free algorithm in this case actually makes things slower than just using a simple pthread mutex. The overhead of the original code is some 75% higher than that of the mutex version. This is not really surprising: the main cost of a mutex operation is also just a single full memory barrier, but pthread mutexes have probably been very carefully tuned and optimised by some very knowledgeable people, unlike the ad-hoc performance schema code. Lock-free algorithms require a lot of knowledge and care to get right; this is not simple stuff.
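The test program itself is not included here, but the following is a rough sketch of what the four variants might look like (C++11 atomics and pthreads; all names are made up, and the timing and cycle-counting code is omitted):

  #include <atomic>
  #include <cstdint>
  #include <pthread.h>

  static std::atomic<uint32_t> version_state{0};
  static uint32_t plain_version= 0;
  static const char *state_ptr;
  static int state_len;
  static pthread_mutex_t state_mutex= PTHREAD_MUTEX_INITIALIZER;

  /* "Original": a locked instruction (full barrier) on each flag update. */
  static void update_full_barrier(const char *s, int len)
  {
    version_state.fetch_add(1, std::memory_order_seq_cst);
    state_ptr= s; state_len= len;
    version_state.fetch_add(1, std::memory_order_seq_cst);
  }

  /* "Write barrier": release ordering only; no barrier instructions on x86. */
  static void update_write_barrier(const char *s, int len)
  {
    uint32_t v= version_state.load(std::memory_order_relaxed);
    version_state.store(v + 1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    state_ptr= s; state_len= len;
    version_state.store(v + 2, std::memory_order_release);
  }

  /* "No barrier": not SMP safe, included only as a baseline. */
  static void update_no_barrier(const char *s, int len)
  {
    plain_version++;
    state_ptr= s; state_len= len;
    plain_version++;
  }

  /* "Pthread mutex": a plain mutex around the update. */
  static void update_mutex(const char *s, int len)
  {
    pthread_mutex_lock(&state_mutex);
    state_ptr= s; state_len= len;
    pthread_mutex_unlock(&state_mutex);
  }

  int main()
  {
    for (long i= 0; i < 100000000; i++)
      update_full_barrier("Sending data", 12);   /* swap in the variant to measure */
    return 0;
  }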

I think there is an important lesson here about our use of synchronisation primitives in MariaDB/MySQL. While there is a lot of focus on mutex contention issues, and various parts are starting to use lock-free algorithms, there does not seem to be a general understanding of the issues around memory barriers among the developers. In fact, to my knowledge there are no general facilities for explicit memory barriers. We have my_atomic.h, which confuses the issues of "atomicity" and "memory barriers" and defines a handful of "atomic" operations that are really locked instructions implying a full memory barrier, as in the example above. As the test program shows, it is really important to distinguish between the different kinds of memory barriers and to understand the performance implications of each. I have always been really unhappy about my_atomic.h, and I think this is an area in which we need to improve significantly.
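For contrast, a barrier-aware interface lets the caller say which ordering is actually required instead of always paying for a full barrier. The GCC __atomic builtins (and C11/C++11 atomics) are one existing example of such an interface; the variables here are just placeholders:

  static unsigned version_state;
  static unsigned counter;

  static void example(void)
  {
    /* Acquire load: read-barrier semantics only. */
    unsigned v= __atomic_load_n(&version_state, __ATOMIC_ACQUIRE);

    /* Release store: write-barrier semantics only. */
    __atomic_store_n(&version_state, v + 1, __ATOMIC_RELEASE);

    /* Atomic read-modify-write, with no ordering requested. */
    __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);

    /* An explicit full barrier, for the rare cases that really need one. */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
  }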

And who knows, maybe the performance schema developers can take a hint or two from this.

Tags: database, mariadb, mysql, performance, programming