Parallel replication is in MariaDB 10.0. I did some benchmarks on the code in 10.0.9. The results are quite good! Here is a graph that shows a 10-times improvement when enabling parallel replication:
With a durable, crash-safe slave configuration (--sync-binlog=1 --innodb-flush-log-at-trx-commit=1), the slave is about ten times faster at 12 worker threads and above compared to the old single-threaded replication.
These results are for in-order parallel replication. With in-order, transactions are committed on the slave in strictly the same order as on the master, so that applications do not see any differences from using parallel replication. So no changes to the application are needed to use parallel replication; this is just standard sysbench 0.5 with a single table. This makes parallel replication particularly interesting, as it can be used with any existing application, without the need to, e.g., split the data into independent schemas as is necessary with out-of-order techniques like the multi-threaded slave feature in MySQL 5.6. It does however make it much harder for the slave to find parallelism to exploit in the events received from the master, which is why it is interesting to see how much improvement can be obtained from normal workloads.
(MariaDB 10.0 does also support out-of-order parallel replication, but that will be the subject of a different article).
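To make the in-order idea concrete, here is a minimal Python sketch of "apply in parallel, commit in order". It is only a toy model, not how the MariaDB code is written: the function name, the `apply` callback and the worker count are all hypothetical, and a real slave does far more work around scheduling and error handling.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def apply_in_parallel_commit_in_order(txns, apply, n_workers=4):
    """Run apply(txn) concurrently, but record commits in the order of `txns`.

    Toy model only: `apply` stands in for executing a transaction's changes,
    and appending to `commit_log` stands in for the commit step.
    """
    commit_log = []
    cond = threading.Condition()
    next_to_commit = 0  # position in `txns` of the next transaction allowed to commit

    def worker(pos, txn):
        nonlocal next_to_commit
        apply(txn)  # the expensive part; may overlap with other workers
        with cond:
            # Wait until every earlier transaction has committed.
            cond.wait_for(lambda: next_to_commit == pos)
            commit_log.append(txn)
            next_to_commit += 1
            cond.notify_all()

    # The pool hands out tasks in submission order, so the earliest
    # uncommitted transaction is always running in (or next in line for)
    # a worker, and the waiting above cannot deadlock.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(worker, pos, txn) for pos, txn in enumerate(txns)]
        for f in futures:
            f.result()  # re-raise any exception from a worker
    return commit_log
```

Even if a later transaction finishes its apply step first, it waits for its turn before committing, which is why applications reading from the slave never observe an order different from the master's.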
The hardware used for the sysbench oltp.lua benchmark is the same machine I used to benchmark group commit previously; I am told this is a machine that is typical for a "standard" datacenter server, with decent I/O on a RAID controller with battery-backed-up cache. Sysbench was run with 10 million rows in one table. The mysqld was configured with a 16GB buffer pool and two 1.9 gigabyte redo logs. The different graphs are as follows:
- binlog, crash-safe:
--log-slave-updates --sync-binlog=1 --innodb-flush-log-at-trx-commit=1
- no binlog, durable:
- no binlog, non-durable:
- binlog, non-crash-safe:
--log-slave-updates --sync-binlog=0 --innodb-flush-log-at-trx-commit=0
For this test, the master was configured with a group commit delay. This allows the master to delay a transaction up to 10 milliseconds in order to find up to 12 transactions that can commit in parallel; this helps a lot in improving parallel replication performance, since transactions that commit in parallel on the master can be executed in parallel on the slave.
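The effect of such a delay can be illustrated with a small Python model. This is a deliberately simplified sketch, not the server's actual group commit logic: the function and its parameters `wait_count` and `wait_ms` are hypothetical names that just mirror the "up to 12 transactions" and "up to 10 milliseconds" limits mentioned above.

```python
def group_commits(arrival_times_ms, wait_count=12, wait_ms=10.0):
    """Split transaction arrival times into commit groups.

    Simplified model: a group is closed once it holds `wait_count`
    transactions, or once a new arrival comes more than `wait_ms` after
    the first transaction in the group. Each returned group is a set of
    transactions a slave could try to apply in parallel.
    """
    groups = []
    current = []
    for t in sorted(arrival_times_ms):
        if current and (len(current) >= wait_count or t - current[0] > wait_ms):
            groups.append(current)
            current = []
        current.append(t)
    if current:
        groups.append(current)
    return groups
```

In this model a busy master (one commit every 0.5 ms) produces full groups of 12, while a lightly loaded master (one commit every 20 ms) produces groups of a single transaction, leaving the slave nothing to run in parallel.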
Adding such a delay will be acceptable for many applications to speed up the slaves; in fact in my test it did not affect master throughput at all. One attractive option might be to set up monitoring of the slaves, and if they start falling behind, then the commit delay on the master can be temporarily increased, throttling the master a bit while giving the slaves a better chance to catch up.
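The monitoring idea could be sketched roughly as follows in Python. Everything here is an assumption for illustration: the function name, the thresholds and the linear ramp are made up, and I am assuming the delay being tuned is the master's `binlog_commit_wait_usec` system variable and that lag is read from `Seconds_Behind_Master` in `SHOW SLAVE STATUS`. The function only computes a value; actually applying it to the master is left out.

```python
def commit_wait_usec_for_lag(lag_seconds, base_usec=0, max_usec=10_000,
                             lag_threshold=30.0, lag_ceiling=300.0):
    """Pick a commit delay (microseconds) for the master from the worst slave lag.

    Hypothetical policy: no extra delay while lag is at or below
    `lag_threshold`, the full `max_usec` at or above `lag_ceiling`,
    and a linear ramp in between.
    """
    if lag_seconds is None:
        # Lag unknown (for example, the slave threads are stopped): do not throttle.
        return base_usec
    if lag_seconds <= lag_threshold:
        return base_usec
    if lag_seconds >= lag_ceiling:
        return max_usec
    fraction = (lag_seconds - lag_threshold) / (lag_ceiling - lag_threshold)
    return int(base_usec + fraction * (max_usec - base_usec))
```

A monitoring job might call this periodically with the largest lag seen across all slaves and, if the result differs from the current setting, update the master (assumed here to be something like `SET GLOBAL binlog_commit_wait_usec = ...`).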
The other source of parallelism on the slave is that irrespective of how the transactions were executed on the master, the commit steps of different transactions can always be applied in parallel on the slave. This is particularly effective at improving performance when the commit step is expensive, as happens when a durable, crash-safe configuration is used. This is seen in the benchmark, where the speedup is particularly large when the slave is configured to be crash-safe and durable, to the point where parallel replication almost eliminates any performance penalty for enabling crash-safe binlog on the slaves. But significant speedup is seen in all the configurations.
(In fact, if you look closely, you will see that turning off the binlog ends up decreasing the performance of the slave. This is bug MDEV-5802; once this bug is fixed, performance should improve when the binlog is disabled.)
I think these are very promising results. I hope this will inspire users to give the new feature a test on real workloads, and share their experiences.
Exploring the limits of parallel replication
I also wanted to see how the code works for workloads that are not favorable to parallel replication. For this I used sysbench update_index.lua. This benchmark creates transactions with a single primary-key update. With just 10 million rows in the table, the test is in-memory, and the actual work spent on the single-row update is rather small compared to the overhead of reading and parsing binlog events and scheduling the work between the various replication threads. So it is interesting to see if parallel replication can still give some smaller benefit here, or at least not make things much worse.
Here are the results for update_index.lua with 48 threads generating the load on the master. No commit delay was used to help the parallel replication.
Finally, to test the absolutely worst-case scenario for parallel replication, I created another workload on the master, this time with update_index.lua running with just a single thread. This way, there is absolutely no opportunity for parallel replication to execute the actual transactions in parallel, though there is still some opportunity to speed up the commit step using group commit.
Here are the results for the single-threaded update_index.lua master load:
So overall, I am pleased with the results of the parallel replication code in MariaDB 10.0. The single-threaded applier has been a major limitation in replication for a long, long time, and I now feel fairly confident that parallel replication can play an important part in solving that. I hope this will inspire users to start testing their own loads, hopefully getting good results and/or reporting any issues found on the way.
To test, use MariaDB 10.0.9 or later on both master and slave. Configure the slave with --slave-parallel-threads=20 or something like that. And optionally, if your application can tolerate some extra commit latency, set some reasonable values for the group commit delay on the master, as described above.
By the way, the raw numbers for the different runs can be seen in this mail on maria-developers@.