MySQL/MariaDB single-threaded performance regressions, and a lesson in thread synchronisation primit

My blog has moved! Read it in the new location here.

I took a quick look at MariaDB 10.0 single-treaded performance (simple read-only sysbench). One thing immediately leaps to the eye, and I thought it worthy of mention. It contains an important lesson about the use of synchronisation primitives and in particular "atomic operations" in MariaDB (and MySQL).

I am using the Linux perf tool on this sysbench command:

  sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx
Look at the top offender in the output from perf report:
  1,54%  mysqld  mysqld               [.] set_thread_state_v1
The only thing this does is set a string for SHOW PROCESSLIST (and the like) about what the thread is doing. And we are spending a whopping 1.5% of the total time doing this.

And why? That becomes clear when looking at the disassembly (extract):

  2d:   mov    0x7b0(%rbx),%eax
  33:   mov    %eax,%edx
        lock   cmpxchg %eax,0x7b0(%rbx)
        jne    33
        and    $0xfffffffc,%edx
        add    $0x1,%edx
        xchg   %edx,0x7b0(%rbx)
        mov    %rbp,0x7b8(%rbx)
        mov    %ecx,0x7c0(%rbx)
        mov    0x7b0(%rbx),%eax
  5e:   mov    %eax,%edx
        lock   cmpxchg %eax,0x7b0(%rbx)
        jne    5e
        and    $0xfffffffc,%edx
        add    $0x6,%edx
        xchg   %edx,0x7b0(%rbx)
No less than two locked cmpxchg instructions. Each of these implies a full (memory barrier), which is really expensive.

To achieve their very high performance, modern CPUs process many instructions at the same time, a hundred or more is possible. But for a full memory barrier, the CPU needs to stop and ensure that all pending store instructions have fully reached the last-level cache. Only then can it start processing any following instructions involving a load. This is not something we should be doing every time a thread moves to a new stage of query processing.

I am not really familiar with the performance schema code (nor am I sure I want to be). But let us make some guess at what the code is trying to do here. Here is the code from which the first locked instruction originates:

    Execute an allocated to dirty transition.
    This transition should be executed by the writer that owns the record,
    before the record is modified.
  void allocated_to_dirty(void)
    uint32 copy= PFS_atomic::load_u32(&m_version_state);
    uint32 new_val= (copy & VERSION_MASK) + PFS_LOCK_DIRTY;
    /* We own the record, no need to use compare and swap. */
    PFS_atomic::store_u32(&m_version_state, new_val);
So perhaps the idea is that only the thread itself can modify its state, but other threads can read it. And we want the update to be fast (I suppose updating the state is more important than inspecting it). So the developers wanted to avoid a lock that could block the writer. Instead, it sets a flag to mark that the data is being modified before the update. And clears the flag after. And a reader can then check the flag; if the flag was modified during a read, the data is potentially inconsistent, and the read can be re-tried.

Now, this really is poorly designed. The PFS_atomic::load_u32() is implemented using my_atomic_load32() and my_atimic_store32(), which is inappropriate here. The my_atomic() facility implies a full memory barrier in all of the operations, which makes them somewhat less useful for performance-critical code.

All we need here is that the setting of the update flag becomes visible before changes to the state, and that the clearing of the flag becomes visible after. Thus it should be sufficient with a pair of write memory barriers in the writer (and corresponding read memory barriers in the reader).

The write and read memory barriers are much less expensive than a full barrier. All they do is prevent a store (respectively a load) before the barrier to be satisfied before a store (load) after the barrier. There is no need for the CPU to stop execution. In fact, on x86, there is usually nothing special to do, because stores and loads are never re-ordered unless the special non-temporal memory instructions are used.

But look closer at the code:

static void set_thread_state_v1(const char* state)
    int state_len= state ? strlen(state) : 0;
    pfs->m_processlist_state_ptr= state;
    pfs->m_processlist_state_length= state_len;
So it is storing a pointer to a string along with the length, and it needs to do synchronisation to ensure that the two are consistent with each other. But suppose instead that we store only the pointer. This can be done atomically without _any_ need for memory barriers or other synchronisation (as long as the memory location is naturally aligned). That would seem to be even better than using the (cheaper) write- and read-barriers.

To get a better idea of how bad the use of full memory barriers is for performance, I wrote a quick and dirty test program. It has code similar to what is in set_thread_state_v1(), as well as three other variants of it: One using just a write memory barrier, one using no memory barrier (not SMP safe, but useful for comparison), and one using normal pthread mutexes rather than a lock-free algorithm. Here is the time taken for 100,000,000 iterations of the function:

  wallclock cycles cycles overhead
Original 3.70 81 71
Write barrier 0.53 12 2
No barrier 0.45 10 -
Pthread mutex 2.26 50 40
Note how expensive the full memory barrier version is. The overhead (compared to no barrier) is 35 times larger than the lesser write-barrier. And I suspect that the real overhead is even bigger in real-life than in this synthetic micro-benchmark (for example imagine if we get a cache miss on one of those locked instructions).

Note the sad thing here that the "clever" lock-free algorithm in this case actually makes things slower than just using a simple pthread mutex. The overhead of the original code is 75% higher than the mutex version. This is not really surprising: the main cost of a mutex operations is also just a single full memory barrier, but pthread mutexes have probably been very carefully tuned and optimised by some very knowledgeable people, unlike the ad-hoc performance schema code. Lock-free algorithms require a lot of knowledge and care to do correctly, it is not simple stuff.

I think there is an important lesson here about our use of synchronisation primitives in MariaDB/MySQL. While there is a lot of focus on mutex contention issues, and some various parts starting to use lock-free algorithms, there does not seem to be a general understanding of issues around memory barriers among the developers. In fact to my knowledge there is no general facilities for explicit memory barriers. We have my_atomic.h, which confuses the issues of "atomicity" and "memory barriers" and defines a handful of "atomic" operations that are really locked instructions implying a full memory barrier, as in the example above. As the test program shows, it is really important to distinguish between different kinds of memory barriers and understand the performance implications of the use of each. I have always been really unhappy about my_atomic.h, and I think this is an area that we need to improve significantly in.

And who knows, maybe the performance schema developers can take a hint or two from this.

First steps with MariaDB Global Transaction ID

My blog has moved! Read it in the new location here.

My previous writings were mostly teoretical, so I wanted to give a more practical example, showing the actual state of the current code. I also wanted to show how I have tried to make the feature fit well into the existing replication features, without requiring the user to enable lots of options or understand lots of restrictions before being able to use it.

So let us start! We will build the code from lp:~maria-captains/maria/10.0-mdev26, which at the time of writing is at revision

First, we start a master server on port 3310 and put a bit of data into it:

    server1> use test;
    server1> create table t1 (a int primary key, b int) engine=innodb;
    server1> insert into t1 values (1,1);
    server1> insert into t1 values (2,1);
    server1> insert into t1 values (3,1);
To provision a slave, we take a mysqldump:
    bash$ mysqldump --master-data=2 --single-transaction -uroot test > /tmp/dump.sql
Note that with --master-data=2 --single-transaction we obtain the exact binlog position corresponding to the data in the dump. Since MariaDB 5.3, this is completely non-blocking on the server (it does not do FLUSH TABLES WITH READ LOCK):
    bash$ grep "CHANGE MASTER" /tmp/dump.sql
    -- CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=910;
Meanwhile, the master server has a couple more transactions:
    server1> insert into t1 values (4,2);
    server1> insert into t1 values (5,2);
Now let us start up the slave server on port 3311, load the dump, and start replicating from the master:
    bash$ mysql -uroot test < /tmp/dump.sql
    server2> change master to master_host='', master_port=3310,
        master_user='root', master_log_file='master-bin.000001', master_log_pos=910;
    server2> start slave;
    server2> select * from t1;
    | a | b    |
    | 1 |    1 |
    | 2 |    1 |
    | 3 |    1 |
    | 4 |    2 |
    | 5 |    2 |
    5 rows in set (0.00 sec)
So slave is up to date. In addition, when the slave connects to the master, it downloads the current GTID replication state, so everything is now ready for using global transaction ID. Let us promote the slave as the new master, and then later make the old master a slave of the new master. So stop the slave thread on the old slave, and run another transaction to simulate it being the new master:
    server2> stop slave;
    server2> insert into t1 values (6,3);
Finally, let us attach the old master as a slave using global transaction ID:
    server1> change master to master_host='', master_port=3311,
        master_user='root', master_gtid_pos=auto;
    server1> start slave;
    server1> select * from t1;
    | a | b    |
    | 1 |    1 |
    | 2 |    1 |
    | 3 |    1 |
    | 4 |    2 |
    | 5 |    2 |
    | 6 |    3 |
    6 rows in set (0.01 sec)
Old master is now running as slave and is up-to-date with the new master.

So that is it! A short post from me for once, but that is the whole point. Replication with MariaDB Global Transaction ID works much as it always did. The only new thing here is when we issue the CHANGE MASTER to make the old master a slave of the new master. We do not have to manually try to compute or guess the correct binlog position on the new master, we can just specify MASTER_GTID_POS=AUTO and the servers figure out the rest for themselves.

I hope I managed to show more concretely my ideas with MariaDB Global Transaction ID. Comments and questions are most welcome, as always. Everything above is actual commands that work on the current code on Launchpad. Everything else may or may not work yet, as this is work in progress, just so you know!

More on global transaction ID in MariaDB

My blog has moved! Read it in the new location here.

I got some very good comments/questions on my previous post on MariaDB global transaction ID, from Giuseppe and Robert (of Tungsten fame). I thought a follow-up post would be appropriate to answer and further elaborate on the comments, as the points they raise are very important and interesting.

(It also gives me the opportunity to explain more deeply a lot of interesting design decisions that I left out in the first post for the sake of brevity and clarity.)

On crash-safe slave

One of the things I really wanted to improve with global transaction ID is to make the replication slaves more crash safe with respect to their current replication state. This state is mostly persistently stored information about which event(s) were last executed on the slave, so that after a server restart the slave will know from which point in the master binlog(s) to resume replication. In current (5.5 and earlier) replication, this state is stored simply by continuously writing a file after each event executed. If the server crashes, this is very susceptible to corruption where the contents of the file no longer matches the actual state of tables in the database. </p>

With MariaDB global transaction ID, the replication state is stored in the following table instead of in a plain file:

    CREATE TABLE rpl_slave_state (
	PRIMARY KEY (domain_id, sub_id));
When a transaction is executed on the slave, this table is updated as part of the transaction. So if the table is created with InnoDB (or other transactional engine) and the replicated events also use transactional tables, then the replication state is crash safe. DDL, or non-transactional engines such as MyISAM, remain crash-unsafe of course.

A global transaction ID in MariaDB consists of domain_id as described in the previous post, an increasing sequence number, and the usual server_id. Recall that the replication state with global transaction ID consists of the last global transaction ID applied within each independent replication stream, ie. a mapping from domain_id to global transaction ID. This is what the table rpl_slave_state provides.

But what about the sub_id in the above table? This is to prevent concurrency issues when parallel replication is used. If we want to be able to execute in parallel two transactions with the same domain_id, then these two transactions will both need to update the table rpl_slave_state. If the transactions would update the same row in the table, then one transaction would have to wait on a row lock for the other to commit. This would prevent any kind of group commit, which would be a very serious performance bottleneck.

So instead, each transaction inserts a new, separate row into the table to record the new global transaction ID applied. There may thus be multiple entries for a given domain_id. The sub_id is used to distinguish them, it is simply a (local) integer that is increased for each transaction received from the master binlog. Thus, at any given time the last applied global transaction ID for given domain_id is the one with the highest sub_id in the table.

In effect, the replication state is obtained with a query like this:

    SELECT domain_id, server_id, seq_no
    FROM rpl_slave_state
    WHERE (domain_id, sub_id) IN
      (SELECT domain_id, MAX(sub_id) FROM rpl_slave_state GROUP BY domain_id)
Old rows are deleted when no longer needed.

Thus, two replicated transactions can be executed like this:

    BEGIN;                                 BEGIN;

    UPDATE some_table                      UPDATE other_table
       SET value = value + 1                  SET name = "foo"
     WHERE id = 10;                         WHERE category LIKE "food_%";

    INSERT INTO mysql.rpl_slave_state      INSERT INTO mysql.rpl_slave_state
       SET domain_id = 1,                     SET domain_id = 1,
	   sub_id = 100,                          sub_id = 101,
	   server_id = 5,                         server_id = 5,
	   seq_no = 300010;                       seq_no = 300011;

    COMMIT;                                COMMIT;
These two transactions can run completely independent, including the insert into rpl_slave_state. And the commits at the end can be done together as a single group commit, where we ensure that the second one is recorded as happening after the first one so commit order (and binlog order) is preserved and visibility is correct (second transaction not visible to any query without the first also being visible).

Contrast this with how things would be with a rpl_slave_state table with a single row per domain_id:

    BEGIN;                                 BEGIN;

    UPDATE some_table                      UPDATE other_table
       SET value = value + 1                  SET name = "foo"
     WHERE id = 10;                         WHERE category LIKE "food_%";

    UPDATE bad_rpl_slave_state
       SET server_id = 5,
	   seq_no = 300010
     WHERE domain_id = 1;


					   UPDATE bad_rpl_slave_state
					      SET server_id = 5,
						  seq_no = 300011
					    WHERE domain_id = 1;

Here the update of the replication state table in the second transaction would have to wait for the first transaction to commit, because of row locks. Group commit becomes impossible.

(I actually explained this issue to the replication developers at MySQL/Oracle a long time ago, but last time I looked at MySQL 5.6, they had ignored it...)

On where to store the replication state

As Giuseppe pointed out, in the global transaction ID design it is still written that the replication state will be stored in the slave binlog, not in the rpl_slave_state table. Sorry about this, I will get the document updated as soon as possible.

I had basically two ideas for how to store the slave state in a crash-safe way:

  1. In the slave's binlog.
  2. In a (transactional) table.
The big advantage of (2) is that it works also when the binlog is not enabled on the slave. Since there can still be substantial overhead to enabling the binlog, I currently plan to go with this approach.

The advantage of (1) is that it is potentially cheaper when the binlog is enabled on the slave, as it commonly will be when global transaction ID is enabled (to be able to promote a slave as a new master, the binlog must be enabled, after all). We already write every single global transaction ID applied into the binlog, and if we crash, we already scan the binlog during crash recovery. Thus, it is easy during crash recovery to rebuild the replication state from the binlog contents. This way we get crash safe slave state without the overhead of maintaining an extra rpl_slave_state table.

It will be possible in the future to refine this, so that we could use method (1) if binlog is enabled, else method (2). This might improve performance slightly when binlog is enabled. But we should first benchmark to check if such refinement will be worth it in terms of performance gained. It seems likely that any gains will be modest, at best.

On parallel replication

Parallel replication is something that has been long overdue, but is now a reality. MariaDB 10.0 will have multi-source replication, which is actually a form of parallel replication. MySQL 5.6 will have multi-threaded slave. Galera can do parallel replication, as can Tungsten I believe, though I am not familiar with details. There are several other mechanisms for parallel replication planned for later MariaDB releases, like MWL#184 and MDEV-520.

It is thus very important to think parallel replication into the design of global transaction ID from the start. I fully agree with Giuseppe's remarks here about MySQL 5.6 replication features failing completely to do this. They introduce in 5.6 three new features that require extensions to how the replication state is stored: crash-safe slave, global transaction ID, and multi-threaded slave. They have managed to do this by introducing three completely different solutions. This is just insane. It makes one wonder if Oracle management forbids Oracle developers to talk to each other, just like we already know they prevent discussions with the community ...

So, in relation to global transaction ID there are basically two kinds of parallel replication techniques: in-order and out-of-order. The two interact with global transaction ID in different ways.

On in-order parallel replication

In-order is when two (or more) different transactions are executed in parallel on the slave, but the commit of the second transaction is delayed to after the first transaction has committed. Galera is an example of this, I think. Planned tasks MWL#184 and possibly MDEV-520 are also in-order techniques.

In-order parallel replication is transparent to applications and users (at least with MVCC transactional engines like InnoDB), since changes only become visible on COMMIT, and commits are done in a serial way. It is thus also mostly transparent to global transaction ID, and does not need much special consideration for the design.

One thing that can be done, and that I am currently working on, is to integrate in-order parallel replication with group commit. Suppose we run transactions T1 and T2 in parallel on the slave, and suppose that T2 happens to complete first so that we have to wait in T2's commit for T1 to commit first. If we integrate this wait with group commit, we can actually commit T1 and T2 at the same time, taking care to write the commit records to the binlog and to the storage engine in the right order (T1 before T2). This way, the wait is likely to improve performance rather than reduce it, in fact.

On out-of-order parallel replication

Out-of-order parallel replication is when transactions can be committed in a different order on the slave than on the master. The MySQL 5.6 multi-threaded slave feature is an example of this.

Out-of-order must be explicitly enabled by the application/DBA, because it breaks fundamental semantics. If commits happen in different order, the slave database may be temporarily in a state that never existed on the master and may be invalid for the application. But if the application is written to tolerate such inconsistencies, and explicitly declares this to the database, then there may be potential for more parallelism than with in-order methods. This can make out-of-order interesting.

The typical example, which MySQL 5.6 multi-threaded slave uses, is when the application declares that transactions against different schemas are guaranteed independent. Different schemas can then be replicated independently (though if the application messes up and transactions happen to not really be independent, things can break). MariaDB 10.0 multi-source replication is another example, where the application declares a guarantee that two master servers can be replicated independently.

Out-of-order creates a challenge for global transaction ID when switching to a new master. Because events in the binlog on the new master are in different order, there will not in general be a single place from which to start replication without either loosing or duplicating some event.

MariaDB global transaction ID handles this by only allowing out-of-order parallel replication between different replication domains, never within a single domain. In effect, the DBA/application explicitly declares the possible independent replication streams, and then it is sufficient to remember one global transaction ID per domain_id as the position reached within each independent stream.

Thus, suppose we have a master where updates to schemas are independent, and we want to replicate them in parallel on slaves. On the master, we configure 10 (say) domain IDs 20-29. When we log a global transaction ID to the binlog, we set the domain_id value to a hash of the used schema.

On the slave, we then configure 10 SQL threads. Two received transactions with different domain_id can be executed in parallel. Two transactions using same schema will map to the same domain_id and will thus not be able to execute in parallel. Thus we get MySQL 5.6 style multi-threaded slave almost for free, using the exact same mechanism as for executing multi-source replication in parallel. The replication state on the slave will in this case consist of the 10 different global transaction IDs reached within each of the 10 replication domains. And they can be stored in the table rpl_slave_state just as described above. Thus replication state for out-of-order parallel replication is fully integrated with the rest of the design, needing no special mechanisms.

And we can do more! The application (with suitable privileges) is allowed to change domain_id per-query. For example, we can run all normal queries with domain_id=1. But then if we have a long-running maintenance query like an ALTER TABLE or something that updates every row in a large table, we can execute it with domain_id=2, if we take care that no other queries conflict with it. This way, the long-running query can run in parallel "in the background", without causing any replication delay for normal queries.

In effect, the application or DBA now has great flexibility in declaring which queries can replicate independent of (and thus in parallel with) each other, and all this just falls out almost for free from the overall design. I foresee that this will be a very powerful feature to have for large, busy replication setups.

Note btw. that most out-of-order parallel replication techniques can also be done as in-order simply by delaying the final COMMIT steps of transactions to happen in-order. This way one could for example do per-schema parallel replication without polluting the replication state with many global transaction IDs. This should generally achieve similar improvement in overall throughput, though latency of individual transactions can be longer.

On "holes" in the global transaction ID sequences

Global transaction IDs have a sequence-number component, which ensures uniqueness by being always increasing. This raises the issue of whether an event will always have a sequence number exactly one bigger than the previous event, or if it is allowed to have "holes", where some sequence number is never allocated to an event.

For MariaDB global transaction ID, I took the approach that holes are allowed. There are a number of good reasons for this.

Mostly, I believe that a design that relies on "no holes" is a fundamental mistake. In MySQL 5.6 global transaction ID, holes are absolutely not allowed. If a hole ever turns up, you will be stuck with it literally forever. The MySQL 5.6 replication state lists every sequence number not yet applied on a slave server, so if one becomes missing it will forever remain. Unless you remove it manually, and as far as I have been able to determine, there are currently no facilities for this. Anyone who knows how fragile MySQL replication can be should realise that this is a recipe for disaster.

Another point: because of the strict "no holes" requirement, in MySQL 5.6, when events are filtered with --replicate-ignore-db or whatever, they had to change the code so that a dummy event is used to replace the filtered event. In effect, you cannot really filter any events any more! I think that alone should be enough to realise that the design is wrong.

A more subtle point is that a strict "no holes" requirement makes it much harder to correctly and scalable handle allocation of new numbers. Basically, to allocate next number in a multi-thread environment, a lock needs to be taken. We need to take this lock for as short as possible to preserve scalability. But then, what happens if we allocate some sequence number N to transaction T1, and then later we get some failure that prevents T1 from successfully committing and being written into the binlog? We now cannot simply rollback T1, because some other transaction T2 may have already allocated the next number, and then we would leave a hole. Subtle issues like this are important to achieve good scalability.

So I think it is wrong to base the design on never having holes. On the other hand, there is no reason to deliberately introduce holes just for the fun of it. Sequence numbers in MariaDB global transaction ID will generally be without holes, it is just that nothing will break if somehow a hole should sneak in.

Also, whenever a global transaction ID is received on a slave server, the server's own internal counter for the next sequence number to allocate will be set to one more than the received sequence number, if it is currently smaller. This gives the very nice property in a standard setup, where only one master is ever written at any one time: The sequence number in itself will be globally unique and always increasing. This means that one can look at any two global transaction IDs and immediately know which one comes first in the history, which can be very useful. It also allows to give a warning if multiple masters are being written without being configured with distinct replication domain ID. This is detected when a server replicates a global transaction ID with same domain_id as its own but smaller sequence number.

In multi-master-like setups, sequence number by itself can no longer be globally unique. But even here, if the system is correctly configured so that each actively written master has its own domain_id, sequence number will be unique per domain_id and allow to order global transaction IDs within one domain (and of course, between different domains there is no well-defined ordering anyway).


In the previous post, I did not really go into what new syntax will be introduced for MariaDB global transaction ID, both for the sake of brevity, and also because it has not really been fully decided yet.

However, there will certainly be the possibility to change directly the replication slave state, ie. the global transaction ID to start replicating from within each replication domain. For example something like this:

    CHANGE MASTER TO master_host = "master.lan",
           master_gtid_pos = "1-1-100,2-5-300";
This is a fine and supported way to start replication. I just want to mention that it will also be supported to start replication in the old way, with master_log_file and master_log_pos, and the master will automatically convert this into the corresponding master_gtid_pos and set this for the slave. This can be convenient, as many tools like mysqldump or XtraBackup provide easy access to the old-style binlog position. It is certainly an improvement over MySQL 5.6 global transaction ID, where the only documented way to setup a slave involves RESET MASTER (!) on the master server...

Incidentally, note that master_gtid_pos has just one global transaction ID per domain_id, not one per server_id. Thus, if not using any form of multi-master, there will be just one global transaction ID to set.

So if we start with server 1 as the master, and then some time later switch over to server 2 for a master, the binlog will have global transaction IDs both with server_id=1 and server_id=2. But the slave binlog state will be just a single global transaction ID, with server_id=2 in this case. Since binlog order is always the same within one replication domain, a single global transaction ID is sufficient to know the correct place to continue replication.

I think this is a very nice property, that the size of the replication state is fixed: one global transaction ID per configured replication domain. In contrast, for MySQL 5.6 global transaction ID, any server_id that ever worked as master will remain in the replication state forever. If you ever had a server id 666 you will still be stuck with it 10 years later when you specify the replication state in CHANGE MASTER (assuming they will at some point even allow specifying the replication state in CHANGE MASTER).

Once the global transaction replication state is set, changing to a new master could happen with something like this:

    CHANGE MASTER TO master_host = "master.lan",
                     master_gtid_pos = AUTO;
This is the whole point of global transaction ID, of course, to be able to do this and automatically get replication started from the right point(s) in the binlog on the new master.

One idea that I have not yet decided about is to allow just this simple syntax:

    CHANGE MASTER TO master_host = "master.lan";
so that if no starting position is specified, and a global transaction state already exists, we default to master_gtid_pos = AUTO. This would be a change in current behaviour, so maybe it is not a good idea. On the other hand, the current behaviour is to start replication from whatever happens to be the first not yet purged binlog file on the master, which is almost guaranteed to be wrong. So it is tempting to change this simple syntax to just do the right thing.

On extensions to mysqlbinlog

Robert mentions some possible extensions to the mysqlbinlog program, and I agree with those.

The master binlog is carefully designed so that it is easily possible to locate any given global transaction ID and the corresponding binlog position (or determine that such global transaction ID is not present in any binlog files). In the initial design this requires scanning one (but just one) binlog file from the beginning; later we could add an index facility if this becomes a bottleneck. The mysqlbinlog program should also support this, probably by allowing to specify a global transaction ID (or multiple IDs) for --start-position and --stop-position.

Robert also mentions the usefulness of an option to filter out events from within just one replication domain/stream. This is something I had not thought of, but it would clearly be useful and is simple to implement.

On session variable server_id

With MariaDB global transaction ID, server_id becomes a session variable, as do newly introduced variables gtid_domain_id and gtid_seq_no. This allows an external replication mechanism to apply events from another server outside the normal replication mechanism, and still preserve the global transaction ID when the resulting queries are logged in the local binlog. One important use case for this is of course point-in-time recovery, mysqlbinlog | mysql. Here, mysqlbinlog can set these variables to preserve the global transaction IDs on the events applied, so that fail-over and so on will still work correctly.

Since messing with server_id and so on has the possibility to easily break replication, setting these requires SUPER privileges.

Global transaction ID in MariaDB

My blog has moved! Read it in the new location here.

The main goal of global transaction ID is to make it easy to promote a new master and switch all slaves over to continue replication from the new master. This is currently harder than it could be, since the current replication position for a slave is specified in coordinates that are specific to the current master, and it is not trivial to translate them into the corresponding coordinates on the new master. Global transaction ID solves this by annotating each event with the global transaction id which is unique and universal across the whole replication hierarchy.

In addition, there are at least two other main goals for MariaDB global transaction ID:

  1. Make it easy to setup global transaction ID, and easy to provision a new slave into an existing replication hierarchy.
  2. Fully support multi-source replication and other similar setups.

Replication streams

Let us consider the second point first, dealing with multi-source replication. The figure shows a replication topology with five servers. Server 3 is a slave with two independent masters, server 1 and server 2. Server 3 is in addition itself a master for two slaves server 4 and server 5. The coloured boxes A1, A2, ... and B1, B2, ... denote the binlogs in each server.

When server 3 replicates events from its two master servers, events from one master are applied independently from and in parallel with events from the other master. So the events from server 1 and server 2 get interleaved with each other in the binlog of server 3 in essentially arbitrary order. However, an important point is that events from the same master are still strictly ordered. A2 can be either before or after B1, but it will always be after A1.

When the slave server 4 replicates from master server 3, server 4 sees just a single binlog stream, which is the interleaving of events originating in server 1 and server 2. However, since the two original streams are fully independent (by the way that multi-source replication works in MariaDB), they can be freely applied in parallel on server 4 as well. This is very important! It is already a severe performance bottleneck that replication from one master is single-threaded on the slaves, so replicating events from multiple masters also serially would make matters even worse.

What we do is to annotate the events with a global transaction ID that includes a tag that identifies the original source. In the figure, this is marked with different colours, blue for events originating in server 1, and red for server 2. Server 4 and server 5 are then free to apply a "blue" event in parallel with a "red" event, and such parallel events can thus end up in different order in the binlogs in different places in the replication hierarchy. So every server can have a distinct interleaving of the two original event streams, but every interleaving respects the order within a single original stream. In the figure, we see for example that A2 comes before B1 in server 3 and server 5, but after in server 4, however it is always after A1.

This concept of having a collection of distict binlog streams, each strictly ordered but interleaved with each other in a relaxed way, is very powerful. It allows both great flexibility (and hence opportunity for parallelism) in applying independent events, as well as simple representation of the state of each replication slave at each point in time. For each slave, we simply need to remember the global transaction ID of the last event applied in each independent stream. Then to switch a slave to a new master, the master finds the corresponding places within its own binlog for each independent stream and starts sending events from the appropriate location for each stream.

For example, in the figure, we see that the state of server 4 is (A4, B3) and for server 5 it is (A3, B3). Thus we can change server 5 to use server 4 as a slave directly, as server 4 is strictly ahead of server 5 in the replication streams.

Or if we want to instead make server 5 the new master, then we first need to temporarily replicate from server 4 to server 5 up to (A4, B3). Then we can switch over and make server 5 the new master. Note that in general such a procedure may be necessary, as there may be no single server in the hierarchy that is ahead of every other server in every stream if the original master goes away. But since each stream is simply ordered, it is always possible to bring one server up ahead to server as a master for the others.

Setup and provisioning

This brings us back to the first point about, making it easy to setup replication using global transaction ID, and easy to provision a new slave into an existing replication hierarchy.

To create a new slave for a given master, one can proceed exactly the same way whether using global transaction id or not. Make a copy of the master obtaining the corresponding binlog position (mysqldump --master-data, XtraBackup, whatever). Setup the copy as the new slave, and issue CHANGE MASTER TO ... MASTER_LOG_POS=... to start replication. Then when the slave first connects, the master will send last global transaction ID within each existing replication stream, and slave will thus automatically be configured with the correct state. Then if there later is a need to switch the slave to a different master, global transaction ID is already properly set up.

This works exactly because of the property that while we have potentially interleaved distinct replication streams, each stream is strictly ordered across the whole replication hierarchy. I believe this is a very important point, and essential for getting a good global transaction ID design. The notion of an ordered sequence of the statements and transactions executed on the master is the central core of MySQL replication, it is what users know and what has made it so successful despite all its limitations.

Replication domains

To implement this design, MariaDB global transaction ID introduces the notion of a replication domain and an associated domain_id. A replication domain is just a server or group of servers that generate a single, strictly ordered replication stream. Thus, in the example above, there are two domain_id values in play corresponds to the two colours blue and red. The global transaction ID includes the domain_id, and this way every event can be identified with its containing replication stream.

Another important point here is that domain_id is something the DBA configures explicitly. MySQL replication is all about the DBA having control and flexibility. The existence of independent streams of events is a property of the application of MySQL, not some server internal, so it needs to be under the control of the user/DBA. In the example, one would configure server 1 with domain_id=1 and server 2 with domain_id=2.

Of course, in basic (and not so basic) replication setups where only one master server is written to by applications at any one time, there is only a single ordered event stream, so domain_id can be ignored and remain at the default (which is 0).

Note by the way that domain_id is different from server_id! It is possible and normal for multiple servers to share the same domain_id, for example server 1 might be a slave of some higher-up master server, and the two would then share the domain_id. One could even imagine that at some point in the future, servers would have moved around so that server 2 was re-provisioned to replace server 1, it would then retain its old server_id but change its domain_id to 1. So both the blue and the red event stream would have instances with server_id=1, but domain_id will always be consistent.

It is also possible for a single server to use multiple domain IDs. For example, a DBA might configure events generated to receive as domain_id a hash of the current schema. This would be a way of declaring that transactions in distinct schemas are guaranteed to be independent, and it would allow slaves to apply those independent transactions in parallel. The slave will just see distinct streams, and apply them in parallel same way as for multi-source replication. This is similar to the multi-threaded slave that MySQL 5.6 implements. But it is more flexible, for example an application could explicitly mark a long-running transaction with a distict domain_id, and then ensure that it is independent of other queries, allowing it to be replicated in parallel and not delay replication of normal queries.

Current status

The MariaDB global transaction ID is work-in-progress, currently planned for MariaDB 10.0.

The current code is maintained on Launchpad: lp:~maria-captains/maria/10.0-mdev26. The design is written up in detail in Jira task MDEV-26, where the progress is also tracked.

Global transaction ID has already been discussed on the maria-developers mailing list. I have received valuable feedback there which has been included in the current design. But I very much welcome additional feedback, I am open to changing anything if it makes the end result better. Much of the community seems to not be using mailing lists to their full potential (hint hint!), hence this blog post to hopefully reach a wider audience that might be interested.

Integer overflow

My blog has moved! Read it in the new location here.

What do you think of this piece of C code?

  void foo(long v) {
    unsigned long u;
    unsigned sign;
    if (v < 0) {
      u = -v;
      sign = 1;
    } else {
      u = v;
      sign = 0;
Seems pretty simple, right? Then what do you think of this output from MySQL:
  mysql> create table t1 (a bigint) as select '-9223372036854775807.5' as a;
  mysql> select * from t1;
  | a                    |
  | -'..--).0-*(+,))+(0( | 
Yes, that is authentic output from older versions of MySQL. Not just the wrong number, the output is complete garbage! This is my all-time favorite MySQL bug#31799. It was caused by code like the above C snippet.

So can you spot what is wrong with the code? Looks pretty simple, does it not? But the title of this post may give a hint...

It is a little known fact that signed integer overflow is undefined in C! The code above contains such undefined behaviour. The expression -v overflows when v contains the smallest negative integer of the long type (-263 on 64-bit) - the absolute value of this cannot be represented in the type. The correct way to put the absolute value of signed v into unsigned u is u = (unsigned long)0 - (unsigned long)v. Unsigned overflow is well-defined in C, in contrast to signed overflow.

And yes, GCC will generate unexpected (but technically valid) assembler for such code, as seen in the Bug#31799. If you do not like this, then use -fno-strict-overflow like I believe Postgresql and the Linux kernel do.

(But better write correct C code from the start).

Even faster group commit!

My blog has moved! Read it in the new location here.

I found time to continue my previous work on group commit for the binary log in MariaDB.

In current code, a (group) commit to InnoDB does not less than three fsync() calls:

  1. Once during InnoDB prepare, to make sure we can recover the transaction in InnoDB if we crash after writing it to the binlog.
  2. Once after binlog write, to make sure we have the transaction in the binlog before we irrevocably commit it in InnoDB.
  3. Once during InnoDB commit, to make sure we no longer need to scan the binlog after a crash to recover the transaction.
Of course, in point 3, it really is not necessary to do an fsync() after every (group) commit. In fact, it seems hardly necessary to do such fsync() at all! If we should crash before the commit record hits the disk, we can always recover the transaction by scanning the binlogs and checking which of the transactions in InnoDB prepared state should be committed. Of course, we do not want to keep and scan years worth of binlogs, but we need only fsync() every so often, not after every commit.

So I implemented MDEV-232. This removes the fsync() call in the commit step in InnoDB. Instead, the binlog code requests from InnoDB (and any other transactional storage engines) to flush all pending commits to disk when the binlog is rotated. When InnoDB is done with the flush, it reports back to the binlog code. We keep track of how far InnoDB has flushed by writing so-called checkpoint events into the current binlog. After a crash, we first scan the latest binlog. The last checkpoint event found will tell us if we need to scan any older binlogs to be sure to find all commits that were not durably committed inside InnoDB prior to the crash.

The result is that we only need to do two fsync() calls per (group) commit instead of three.

I benchmarked the code on a server with a good disk system - HP RAID controller with a battery-backed up disk cache. When the cache is enabled, fsync() is fast, around 400 microseconds. When the cache is disabled, it is slow, several milliseconds. The setup should be mostly comparable to Mark Callaghan's benchmarks here and here.

I used sysbench update_non_index.lua to make it easier for others to understand/reproduce the test. This does a single update of one row in a table in each transaction. I used 100,000 rows in the table. Group commit is now so fast that at higher concurrencies, it is no longer the bottleneck. It will be interesting to test again with the new InnoDB code from MySQL 5.6 and any other scaslability improvements that have been made there.

Slow fsync()

As can be seen, we have a very substantial improvement, around 30-60% more commits per second depending on concurrency. Not only are we saving one out of three expensive fsync() calls, improvements to the locking done during commit also allow more commits to share the same fsync().

Fast fsync()

Even with fast fsync(), the improvements are substantial.

I am fairly pleased with these results. There is still substantial overhead from enabling the binlog (like several times slowdown if fsync() time is the bottleneck), and I have a design for mostly solving this in MWL#164. But I think perhaps it is now time to turn towards other more important areas. In particular I would like to turn to MWL#184 - another method for parallel apply of events on slaves that can help in cases where the per-database split of workload that exists in Tungsten and MySQL 5.6 can not be used, like many updates to a single table. Improving throughput even further on the master may not be the most important if slaves are already struggling to keep up with current throughput, and this is another relatively simple spin-off from group commit that could greatly help.

For anyone interested, the current code is pushed to lp:~maria-captains/maria/5.5-mdev232

MySQL group commit

It was an interesting coincidence that the new MySQL group commit preview was published just as I was finishing this work. So I had the chance to take a quick look and include it in the benchmarks (with slow fsync()):

While the implementation in MySQL 5.6 preview is completely different from MariaDB (talk about "not invented here ..."), the basic design is now quite similar, as far as I could gather from the code. A single thread writes all transactions in the group into the binlog, in order; likewise a single thread does the commits (to memory) inside InnoDB, in order. The storage engine interface is extended with a thd_get_durability_property() callback for the engines - when the server returns HA_IGNORE_DURABILITY from this, InnoDB commit() method is changed to work exactly like MariaDB commit_ordered(): commit to memory but do not sync to disk.

(It remains to see what storage engine developers will think of MySQL implementing a different API for the same functionality ...)

The new MySQL group commit also removes the third fsync() in the InnoDB commit, same as the new MariaDB code. To ensure they can still recover after a crash, they just call into the storage engines to sync all commits to disk during binlog rotate. I actually like that from the point of simplicity - even if it does stall commits for longer, it is unlikely to matter in practice. What actually happens inside InnoDB in the two implementations is identical.

The new MySQL group commit is substantially slower than the new MariaDB group commit in this benchmark. My guess is that this is in part due to suboptimal inter-thread communication. As I wrote about earlier, this is crucial to get best performance at high commit rates, and the MySQL code seems to do additional synchronisation between what they call stages - binlog write, binlog fsync(), and storage engine commit. Since the designs are now basically identical, it should not be hard to get this fixed to perform the same as MariaDB. (Of course, if they had started from my work, they could have spent the effort improving that even more, rather than wasting it on catch-up).

Note that the speedup from group commit (any version of it) is highly dependent on the workload and the speed of the disk system. With fast transactions, slow fsync(), and high concurrency, the speedup will be huge. With long transactions, fast fsync(), and low concurrency, the speedups will be modest, if any.

Incidentally, the new MySQL group commit is a change from the designs described earlier, where individual commit threads would use pwrite() in parallel into the binary log. I am convinced this is a good change. The writing to binlog is just memcpy() between buffers, a single thread can do gigabytes worth of that, it is not where the bottleneck is. While it is crucial to optimise the inter-thread communication, as I found out here - and lots of small parallel pwrite() calls into the same few data blocks at the end of a file delivered to the file system is not likely to be a success. If binlog write bandwidth would really turn out to be a problem the solution is to have multiple logs in parallel - but I think we are quite far from being there yet.

It is a pity that we cannot work together in the MySQL world. I approached the MySQL developers several times over the past few years suggesting we work together, with no success. There are trivial bugs in the MySQL group commit preview whose fix yield great speedup. I could certainly have used more input while doing my implementation. The MySQL user community could have much better quality if we would only work together.

Instead, Oracle engineers use their own bugtracker which is not accessible to others, push to their own development trees which are not accessible to others, communicate on their own mailing lists which are not accessible to others, hold their own developer meetings which are not accessible to others ... the list is endless.

The most important task when MySQL was aquired was to collect the different development groups working on the code base and create a real, great, collaborative Open Source project. Oracle has totally botched this task up. Instead, what we have is lots of groups each working on their own tree, with no real interesting in collaborating. I am amazed every time I read some prominent MySQL community member praise the Oracle stewardship of MySQL. If these people are not interested in creating a healthy Open Source project and just want to not pay for their database software, why do they not go use the express/cost-free editions of SQL server or Oracle or whatever?

It is kind of sad, really.

Tale of a bug

My blog has moved! Read it in the new location here.

This is a tale of the bug lp:798213. The bug report has the initial report, and a summary of the real problem obtained after detailed analysis, but it does not describe the processes of getting from the former to the latter. I thought it would be interesting to document this, as the analysis of this bug was rather tricky and contains several good lessons.


The bug first manifested itself as a sporadic failure in one of our random query generator tests for replication. We run this test after all MariaDB pushes in our Buildbot setup. However, this failure had only occured twice in several months, so it is clearly a very rare failure.

The first task was to try to repeat the problem and get some more data in the form of binlog files and so on. Philip kindly helped with this, and after running the test repeatedly for several hours he finally managed to obtain a failure and attach the available information to the initial bug report. Time for analysis!

Understanding the failure

The first step is to understand what the test is doing, and what the failure means.

The test starts up a master server and exposes it to some random parallel write load. Half-way through, with the server running at full speed, it takes a non-blocking XtraBackup backup, and restores the backup into a new slave server. Finally it starts the new slave replicating from the binlog position reported by XtraBackup, and when the generated load is done and the slave caught up, it compares the master and slave to check that they are consistent. This test is an important check of my group commit work, which is carefully engineered to provide group commit while still preserving the commit order and consistent binlog position that is needed by XtraBackup to do such non-blocking provisioning of new slaves.

The failure is that in a failed run, the master and slave are different when compared at the end. The slave has a couple of extra rows (later I discovered the bug could also manifest itself as a single row being different). So this is not good obviously, and needs to be investigated.

Analysing the failure

So this is a typical case of a "hard" failure to debug. We have binlogs with 100k queries or so, and a slave that somewhere in those 100k queries diverges from the master. Working on problems like this, it is important to work methodically, slowly but surely narrowing down the problem, come up with hypothesis about the behaviour and positively affirm or reject them, until finally the problem is narrowed down sufficiently that the real cause is apparent. Random poking around not only is likely to waste time, but far worse, without a real understanding of the root cause of the failure, there is a great danger of eventually tweaking things so that the failure happens to go away in the test at hand, yet the underlying bug is still there. After all, the failure was already highly sporadic to begin with.

First I wanted to know if the problem is that replication diverges (eg. because of non-deterministic queries in the statement-based replication), or if it is a problem with the restored backup used to start the slave (wrong data or starting binlog position). Clearly, I strongly suspected a wrong starting binlog position, as this is what my group commit work messes with. But as it turns out, this was not the problem, again stressing the need to always verify positively any assumptions made during debugging.

To check this, I setup a new slave server from scratch, and had it replicate from the master binlog all the way from the start to the end. I then compared all three end results: (A) the original master; (B) the slave provisioned by XtraBackup, and (C) the new slave replicated from the start of the binlogs. It turns out that (A) and (C) are identical, while (B) differs. So this strongly suggests a problem with the restored XtraBackup; the binlogs by themselves replicate without problems.

To go further, I needed to analyse the state of the slave server just after the XtraBackup has been restored, without the effect of the thousands of queries replicated afterwards. Unfortunately this was not saved as part of the original test. It was trivial to add to the test (just copy away the backup to a safe place before starting the slave server), but then the need came to reproduce the failure again.

This is another important step in debugging hard sporadic failures: Get to the point where the failure can be reliably reproduced, at least for some semi-reasonable meaning of "reliable". This is really important not only to help debugging, but also to be able to verify that a proposed bug fix actually fixes the original bug! I do have experienced once or twice a failure so elusive that the only way to fix was to commit blindly a possible fix, then wait for several months to see if the failure would re-appear in that interval. Fortunately, in far the most cases, with a bit of work, this is not necessary.

Same here: After a bit of experimentation, I found that I could reliably reproduce the failure by reducing the duration of the test from 5 minutes to 35 seconds, and running the test in a tight loop until it failed. It always failed after typically 15-40 runs.

So now I had the state of the slave provisioned with XtraBackup as it was just before it starts replicating. So what I did was to set up another slave server from scratch and let it replicate from the master binlogs using START SLAVE UNTIL with the binlog position reported by XtraBackup. If the XtraBackup and its reported binlog start position are correct, these two servers should be identical. But sure enough, a comparison showed that they differed! In this case it was a single row that had different data. So this confirms the hypothesis that the problem is with the restored XtraBackup data and/or binlog position.

So now, thinking it was the binlog position that was off, I naturally next looked into the master binlog around this position, looking for an event just before the position that was not applied, or an event just after that already was applied. However, to my surprise I did not find this. I did find an event just after that updated the table that had the wrong row. However, the data in the update looked nothing like the data that was expected in the wrong row. And besides, that update was part of a transaction updating multiple tables; if that event was duplicated or missing, there would have been more row differences in more tables, not just one row in a single table. I did find an earlier event that looked somewhat related, however it was far back in the binlog (so not resolvable by merely adjusting the starting binlog pos); and besides again it was part of a bigger transaction updating more rows, while I had only one row with wrong data.

So at this point I need a new idea; the original hypothesis has been proven false. The restored XtraBackup is clearly wrong in a single row, but nothing in the binlog explains how this one row difference could have occured. When analysis runs up against a dead end, it is time to get more data. So I ran the test for a couple hours, obtained a handful more failures, and analysed then in the same way. Each time I saw the XtraBackup differing from the master in one row, or in once case the slave had a few extra rows.

So this is strange. After we restore the XtraBackup, we have one row (or a few rows) different from the master server. And those rows were updated in a multi-row transaction. It is as if we are somehow missing part of a transactions. Which is obviously quite bad, and indicates something bad going on at a deep level

Again it is time to get more data. Now I try running the test with different server options to see if it makes any difference. Running with --binlog-optimize-thread-scheduling=0 still caused the failure, so it was not related to the thread scheduling optimisation that I implemented. Then I noticed that the test runs with the option --innodb-release-locks-early=1 enabled. On a hunch I tried running without this option, and AHA! Without this option, I was no longer able to repeat the failure, even after 250 runs!

At this point, I start to strongly suspect a bug in the --innodb-release-locks-early feature. But this is still not proven! It could also be that with the option disabled, there is less opportunity for parallelism, hiding the problem which could really be elsewhere. So I still needed to understand exactly what the root cause of the problem is.


At this point, I had sufficient information to start just thinking about the problem, trying to work out ways in which things could go wrong in a way that would produce symptoms like what we see. So I started to think on how --innodb-release-locks-early works and how InnoDB undo and recovery in general function. So I tried a couple of ideas, some did not seem relevant...

...and then, something occured to me. What the --innodb-release-locks-early feature does is to make InnoDB release row locks for a transaction earlier than normal, just after the transaction has been prepared (but before it is written into the binlog and committed inside InnoDB). This is to allow another transaction waiting on the same row locks to proceed as quickly as possible.

Now, this means that with proper timing, it is possible for such a second transaction to also prepare before the first has time to commit. At this point we thus have two transactions in the prepared state, both of which modified the same row. If we were to take an XtraBackup snapshot at this exact moment, upon restore XtraBackup would need to roll back both of those transactions (the situation would be the same if the server crashed at that point and later did crash recovery).

This begs the question if such rollback will work correctly? This certainly is not something that could occur in InnoDB before the patch for --innodb-release-locks-early was implemented, and from my knowledge of the patch, I know it does not explicitly do anything to make this work. Aha! So now we have a new hypothesis: Rollback of multiple transactions modifying the same row causes problems.

To test this hypothesis, I used the Debug_Sync facility to create a mysql-test-run test case. This test case creates runs a few transactions in parallel, all modifying a common row, then starts them committing but pauses the server when some of them are still in the prepared state. At this point it takes an XtraBackup snapshot. I then tried restoring the XtraBackup snapshot to a fresh server. Tada... but unfortunately this did not show any problems, the restore looked correct.

However, while restoring, I noticed that the prepared-but-not-committed transactions seems to be rolled back in reverse order of InnoDB transactions id. So this got me thinking - what would happen if they were rolled back in a different order? Indeed, for multiple transactions modifying a common row, the rollback order is critical! The first transaction to modify the row must be rolled back last, so that the correct before-image is left in the row. If we were to roll back in a different order, we could end up restoring the wrong before-image of the row - which would result in exactly the kind of single-row corruption that we seem to experience in the test failure! So we are getting close it seems. Since InnoDB seems to roll back transactions in reverse order of transaction ID, and since transaction IDs are presumably allocated in order of transaction start, maybe by starting the transactions in a different order, the failure can be provoked.

And sure enough, modifying the test case so that the transactions are started in the opposite order causes it to show the failure! During XtraBackup restore, the last transaction to modify the row is rolled back last, so that the row ends up with the value that was there just before the last update. But this value is wrong, as it was written there by a transaction that was itself rolled back. So we have the corruption reproduced with a small, repeatable test case, and the root cause of the problem completely understood. Task solved!

(Later I cleaned up the test case to crash the server and work with crash recovery instead; this is simpler as it does not involve XtraBackup. Though this also involves the XA recovery procedure with the binlog, the root problem is the same and shows the same failure. As to a fix for the bug, that remains to be seen. I wrote some ideas in the bug report, but it appears non-trivial to fix. The --innodb-release-locks-early feature is originally from the Facebook patch; maybe they will fix it, or maybe we will remove the feature from MariaDB 5.3 before GA release. Corrupt data is a pretty serious bug, after all.)

Lessons learned

I think there are some important points to learn from this debugging story:

  1. When working on high-performance system code, some debugging problems are just inherently hard! Things happen in parallel, and we operate with complex algorithms and high quality requirements on the code. But with a methodical approach, even the hard problems can be solved eventually.
  2. It is important not to ignore test failures in the test frameworks (such as Buildbot), no matter how random, sporadic, and/or elusive they may appear at first. True, many of them are false positives or defects in the test framework rather than the server code. But some of them are true bugs, and among them are some of the most serious and yet difficult bugs to track down. The debugging in this story is trivial compared to how the story would have been if this had to be debugged in a production setting at a support customer. Much nicer to work from a (semi-)repeatable test failure in Buildbot!

Benchmarking thread scheduling in group commit, part 2

My blog has moved! Read it in the new location here.

I got access to our 12-core Intel server, so I was able to do some better benchmarks to test the different group commit thread scheduling methods:

This graph shows queries-per-second as a function of number of parallel connections, for three test runs:

  1. Baseline MariaDB, without group commit.
  2. MariaDB with group commit, using the simple thread scheduling, where the serial part of the group commit algorithm is done by each thread signalling the next one.
  3. MariaDB with group commit and optimised thread scheduling, where the first thread does the serial group commit processing for all transactions at once, in a single thread.
(see the previous post linked above for a more detailed explanation of the two thread scheduling algorithms.)

This test was run on a 12-core server with hyper-threading, memory is 24 GByte. MariaDB was running with datadir in /dev/shm (Linux ram disk), to simulate a really fast disk system and maximise the stress on the CPUs. Binlog is enabled with sync_binlog=1 and innodb_flush_log_at_trx_commit=1. Table type is InnoDB.

I use Gypsy to generate the client load, which is simple auto-commit primary key updates:

    REPLACE INTO t (a,b) VALUES (?, ?)

The graph clearly shows the optimised thread scheduling algorithm to improve scalability. As expected, the effect is more pronounced on the twelve-core server than on the 4-core machine I tested on previously. The optimised thread scheduling has around 50% higher throughput at higher concurrencies. While the naive thread scheduling algorithm suffers from scalability problems to the degree that it is only slightly better than no group commit at all (but remember that this is on ram disk, where group commit is hardly needed in the first place).

There is no doubt that this kind of optimised thread scheduling involves some complications and trickery. Running one part of a transaction in a different thread context from the rest does have the potential to cause subtle bugs.

On the other hand, we are moving fast towards more and more CPU cores and more and more I/O resources, and scalability just keeps getting more and more important. If we can scale MariaDB/MySQL with the hardware improvements, more and more applications can make do with scale-up rather than scale-out, which significantly simplifies the system architecture.

So I am just not comfortable introducing more serialisation (e.g. more global mutex contention) in the server than absolutely necessary. That is why I did the optimisation in the first place even without testing. Still, the question is if an optimisation that only has any effect above 20,000 commits per second is worth the extra complexity? I think I still need to think this over to finally make up my mind, and discuss with other MariaDB developers, but at least now we have a good basis for such discussion (and fortunately, the code is easy to change one way or the other).

Benchmarking thread scheduling in group commit

My blog has moved! Read it in the new location here.

The best part of the recent MariaDB meeting in Lisbon for me was that I got some good feedback on my group commit work. This has been waiting in the review queue for quite some time now.

One comment I got revolve around an optimisation in the implementation related to how threads are scheduled.

A crucial step in the group commit algorithm is when the transactions being committed have been written into the binary log, and we want to commit them in the storage engine(s) in the same order as they were committed in the binlog. This ordering requirement makes that part of the commit process serialised (think global mutex).

Even though care is taken to make this serial part very quick to run inside the storage engine(s), I was still concerned about how it would impact scalability on multi-core machines. So I took extra care to minimise the time spent on the server layer in this step.

Suppose we have three transactions being committed as a group, each running in their own connection thread in the server. It would be natural to let the first thread do the first commit, then have the first thread signal the second thread to do the second commit, and finally have the second thread signal the third thread. The problem with this is that now the inherently serial part of the group commit not only includes the work in the storage engines, it also includes the time needed for two context switches (from thread 1 to thread 2, and from thread 2 to thread 3)! This is particularly costly if, after finishing with thread 1, we end up having to wait for thread 2 to be scheduled because all CPU cores are busy.

So what I did instead was to run all of the serial part in a single thread (the thread of the first transaction). The single thread will handle the commit ordering inside the storage engine for all the transactions, and the remaining threads will just wait for the first one to wake them up. This means the context switches for the waiting threads are not included in the serial part of the algorithm. But it also means that the storage engines need to be prepared to run this part of the commit in a separate thread from the rest of the transaction.

So, in Lisbon there was some discussion around if the modifications I did to InnoDB/XtraDB for this were sufficient to ensure that there would not be any problems with this running part of the commit in a different thread. After all, this requirement is a complication. And then the question came up if the above optimisation is actually needed? Does it notably increase performance?

Now, that is a good question, and I did not have an answer as I never tested it. So now I did! I added an option --binlog-optimize-thread-scheduling to allow to switch between the naive and the optimised way to handle the commit of the different transactions in the serial part of the algorithm, and benchmarked them against each other.

Unfortunately, the two many-core servers we have available for testing were both unavailable (our hosting and quality of servers leaves a lot to be desired unfortunately). So I was left to test on a 4-core (8 threads with hyperthreading) desktop box I have in my own office. I was able to get some useful results from this nevertheless, though I hope to revisit the benchmark later on more interesting hardware.

In order to stress the group commit code maximally, I used a syntetic workload with as many commits per second as possible. I used the fastest disk I have available, /dev/shm (Linux ramdisk). The transactions are single-row updates of the form

    REPLACE INTO t (a,b) VALUES (?, ?)
The server is an Intel Core i7 quad-core with hyperthreading enabled. It has 8GByte of memory. I used Gypsy to generate the load. Table type is XtraDB. The server is running with innodb_flush_log_at_trx_commit=1 and sync_binlog=1.

Here are the results in queries per second, with different number of concurrent connections running the queries:

Number of connections QPS (naive scheduling) QPS (optimised scheduling) QPS (binlog disabled)
So as we see from this table, even with just four cores we see noticable better performance by running the serial part of group commit in a single thread. The improvement is around 10% or so, depending on parallelism. So I think this means that I will want to keep the optimised version.

It is nice to see that we can get > 20k commits/second with the group commit code on cheap desktop hardware. For real servers the I/O subsystem will probably be a bottleneck, but that is what I wanted to see: that the group commit code will not limit the ability to fully utilise high amounts of I/O resources.

While I was at it, I also measured the throughput when the binlog is disabled. As can be seen, enabling the binlog has notable performance impact even with very fast disk. Still, considering the added overhead of writing an extra log file, not to mention the added 2-phase commit step, the overhead is not that unreasonable.

From the table we also see some negative scaling as the number of parallel connections increases. Some of this is likely from InnoDB/XtraDB, but I would like to investigate it deeper at some point to see if there is anything in the group commit part that can be improved with respect to this.

Looking back, should I have done this benchmark when designing the code? I think it is a tricky question, and one that cannot be given a simple answer. It will always be a trade-off: It is not feasible to test (and implement!) every conceivable variant of a new feature during development, it is necessary to also rely on common sense and experience. On the other hand, it is dangerous to rely on intuition with respect to performance; time and time again measurements prove that the real world is very often counter to intuition. In this case I was right, and my optimisation was beneficial; however I could easily have been wrong. I think the main lesson here is how important it is to get feedback on complex design work like this; such feedback is crucial for motivating and structuring the work to be of the quality that we need to see in MariaDB.

My presentation from OpenSourceDays2011

My blog has moved! Read it in the new location here.

Here are the slides from my talk at Open Source Days 2011 on Saturday. The talk was about MariaDB and other parts of the MySQL development community outside of MySQL@Oracle.

For me, the most memorable part of the conference was the talk by Noirin Shirley titled Open Source: Saving the World. Noirin described the Open Source Ushahidi project and how it was used during the natural disaster crisis in Indonesia, New Zealand and other places.

Now, there is a long way from implementing group commit in MariaDB to rescuing injured people out of collapsed buildings, and not all use of Free Software is as samaritan as Ushahidi. Well, no-one can save the world alone.

But in the Free Software community, we work together, each contributing his or her microscopic part, and together slowly but surely building the most valuable software infrastructure in the world. Which then in turn empowers others to work together in other areas outside of (and more important than) software. Working in Free Software enables me to contribute my skills and resources, and I think Noirin managed very well to capture this in her talk.