Kristian Nielsen (kristiannielsen) wrote,

Global transaction ID in MariaDB

The main goal of global transaction ID is to make it easy to promote a new master and switch all slaves over to continue replication from the new master. This is currently harder than it could be, since the current replication position for a slave is specified in coordinates that are specific to the current master, and it is not trivial to translate them into the corresponding coordinates on the new master. Global transaction ID solves this by annotating each event with the global transaction id which is unique and universal across the whole replication hierarchy.

In addition, there are at least two other main goals for MariaDB global transaction ID:

  1. Make it easy to setup global transaction ID, and easy to provision a new slave into an existing replication hierarchy.
  2. Fully support multi-source replication and other similar setups.

Replication streams

Let us consider the second point first, dealing with multi-source replication. The figure shows a replication topology with five servers. Server 3 is a slave with two independent masters, server 1 and server 2. Server 3 is in addition itself a master for two slaves server 4 and server 5. The coloured boxes A1, A2, ... and B1, B2, ... denote the binlogs in each server.

When server 3 replicates events from its two master servers, events from one master are applied independently from and in parallel with events from the other master. So the events from server 1 and server 2 get interleaved with each other in the binlog of server 3 in essentially arbitrary order. However, an important point is that events from the same master are still strictly ordered. A2 can be either before or after B1, but it will always be after A1.

When the slave server 4 replicates from master server 3, server 4 sees just a single binlog stream, which is the interleaving of events originating in server 1 and server 2. However, since the two original streams are fully independent (by the way that multi-source replication works in MariaDB), they can be freely applied in parallel on server 4 as well. This is very important! It is already a severe performance bottleneck that replication from one master is single-threaded on the slaves, so replicating events from multiple masters also serially would make matters even worse.

What we do is to annotate the events with a global transaction ID that includes a tag that identifies the original source. In the figure, this is marked with different colours, blue for events originating in server 1, and red for server 2. Server 4 and server 5 are then free to apply a "blue" event in parallel with a "red" event, and such parallel events can thus end up in different order in the binlogs in different places in the replication hierarchy. So every server can have a distinct interleaving of the two original event streams, but every interleaving respects the order within a single original stream. In the figure, we see for example that A2 comes before B1 in server 3 and server 5, but after in server 4, however it is always after A1.

This concept of having a collection of distict binlog streams, each strictly ordered but interleaved with each other in a relaxed way, is very powerful. It allows both great flexibility (and hence opportunity for parallelism) in applying independent events, as well as simple representation of the state of each replication slave at each point in time. For each slave, we simply need to remember the global transaction ID of the last event applied in each independent stream. Then to switch a slave to a new master, the master finds the corresponding places within its own binlog for each independent stream and starts sending events from the appropriate location for each stream.

For example, in the figure, we see that the state of server 4 is (A4, B3) and for server 5 it is (A3, B3). Thus we can change server 5 to use server 4 as a slave directly, as server 4 is strictly ahead of server 5 in the replication streams.

Or if we want to instead make server 5 the new master, then we first need to temporarily replicate from server 4 to server 5 up to (A4, B3). Then we can switch over and make server 5 the new master. Note that in general such a procedure may be necessary, as there may be no single server in the hierarchy that is ahead of every other server in every stream if the original master goes away. But since each stream is simply ordered, it is always possible to bring one server up ahead to server as a master for the others.

Setup and provisioning

This brings us back to the first point about, making it easy to setup replication using global transaction ID, and easy to provision a new slave into an existing replication hierarchy.

To create a new slave for a given master, one can proceed exactly the same way whether using global transaction id or not. Make a copy of the master obtaining the corresponding binlog position (mysqldump --master-data, XtraBackup, whatever). Setup the copy as the new slave, and issue CHANGE MASTER TO ... MASTER_LOG_POS=... to start replication. Then when the slave first connects, the master will send last global transaction ID within each existing replication stream, and slave will thus automatically be configured with the correct state. Then if there later is a need to switch the slave to a different master, global transaction ID is already properly set up.

This works exactly because of the property that while we have potentially interleaved distinct replication streams, each stream is strictly ordered across the whole replication hierarchy. I believe this is a very important point, and essential for getting a good global transaction ID design. The notion of an ordered sequence of the statements and transactions executed on the master is the central core of MySQL replication, it is what users know and what has made it so successful despite all its limitations.

Replication domains

To implement this design, MariaDB global transaction ID introduces the notion of a replication domain and an associated domain_id. A replication domain is just a server or group of servers that generate a single, strictly ordered replication stream. Thus, in the example above, there are two domain_id values in play corresponds to the two colours blue and red. The global transaction ID includes the domain_id, and this way every event can be identified with its containing replication stream.

Another important point here is that domain_id is something the DBA configures explicitly. MySQL replication is all about the DBA having control and flexibility. The existence of independent streams of events is a property of the application of MySQL, not some server internal, so it needs to be under the control of the user/DBA. In the example, one would configure server 1 with domain_id=1 and server 2 with domain_id=2.

Of course, in basic (and not so basic) replication setups where only one master server is written to by applications at any one time, there is only a single ordered event stream, so domain_id can be ignored and remain at the default (which is 0).

Note by the way that domain_id is different from server_id! It is possible and normal for multiple servers to share the same domain_id, for example server 1 might be a slave of some higher-up master server, and the two would then share the domain_id. One could even imagine that at some point in the future, servers would have moved around so that server 2 was re-provisioned to replace server 1, it would then retain its old server_id but change its domain_id to 1. So both the blue and the red event stream would have instances with server_id=1, but domain_id will always be consistent.

It is also possible for a single server to use multiple domain IDs. For example, a DBA might configure events generated to receive as domain_id a hash of the current schema. This would be a way of declaring that transactions in distinct schemas are guaranteed to be independent, and it would allow slaves to apply those independent transactions in parallel. The slave will just see distinct streams, and apply them in parallel same way as for multi-source replication. This is similar to the multi-threaded slave that MySQL 5.6 implements. But it is more flexible, for example an application could explicitly mark a long-running transaction with a distict domain_id, and then ensure that it is independent of other queries, allowing it to be replicated in parallel and not delay replication of normal queries.

Current status

The MariaDB global transaction ID is work-in-progress, currently planned for MariaDB 10.0.

The current code is maintained on Launchpad: lp:~maria-captains/maria/10.0-mdev26. The design is written up in detail in Jira task MDEV-26, where the progress is also tracked.

Global transaction ID has already been discussed on the maria-developers mailing list. I have received valuable feedback there which has been included in the current design. But I very much welcome additional feedback, I am open to changing anything if it makes the end result better. Much of the community seems to not be using mailing lists to their full potential (hint hint!), hence this blog post to hopefully reach a wider audience that might be interested.

Tags: database, freesoftware, mariadb, mysql, programming, replication
  • Post a new comment


    default userpic

    Your reply will be screened