I got some very good comments/questions on my previous post on MariaDB global transaction ID, from Giuseppe and Robert (of Tungsten fame). I thought a follow-up post would be appropriate to answer and further elaborate on the comments, as the points they raise are very important and interesting.
(It also gives me the opportunity to explain more deeply a lot of interesting
design decisions that I left out of the first post for the sake of brevity and
clarity.)
On crash-safe slave
One of the things I really wanted to improve with global transaction ID is to
make the replication slaves more crash safe with respect to their current
replication state. This state is mostly persistently stored information about
which event(s) were last executed on the slave, so that after a server restart
the slave will know from which point in the master binlog(s) to resume
replication. In current (5.5 and earlier) replication, this state is stored
simply by continuously re-writing a small state file after every
event executed. If the server crashes, this is very susceptible to corruption,
where the contents of the file no longer match the actual state of the tables in
the database.
With MariaDB global transaction ID, the replication state is stored in the
following table instead of in a plain file:
CREATE TABLE rpl_slave_state (
domain_id INT UNSIGNED NOT NULL,
sub_id BIGINT UNSIGNED NOT NULL,
server_id INT UNSIGNED NOT NULL,
seq_no BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (domain_id, sub_id));
When a transaction is executed on the slave, this table is updated as part of
the transaction. So if the table is created with InnoDB (or other
transactional engine) and the replicated events also use transactional tables,
then the replication state is crash safe. DDL, or non-transactional engines
such as MyISAM, remain crash-unsafe of course.
A global transaction ID in MariaDB consists of the replication domain ID
described in the previous post, an increasing sequence number, and the
server_id. Recall that the replication state with global
transaction ID consists of the last global transaction ID applied within each
independent replication stream, ie. a mapping from
domain_id to global transaction ID. This is what the
rpl_slave_state table stores. But what about the
sub_id in the above table? It is there to prevent
concurrency issues when parallel replication is used. If we want to be able to
execute in parallel two transactions with the same domain_id,
then these two transactions will both need to update the
rpl_slave_state table. If the transactions were to update the same
row in the table, then one transaction would have to wait on a row lock for
the other to commit. This would prevent any kind of group commit, which would
be a very serious performance bottleneck.
So instead, each transaction inserts a new, separate row into the table to
record the new global transaction ID applied. There may thus be multiple
entries for a given domain_id, and the sub_id is used to
distinguish them. It is simply a (local) integer that is increased for each
transaction received from the master binlog. Thus, at any given time, the last
applied global transaction ID for a given
domain_id is the one with the highest
sub_id in the table.
In effect, the replication state is obtained with a query like this:
SELECT domain_id, server_id, seq_no
  FROM rpl_slave_state
 WHERE (domain_id, sub_id) IN
       (SELECT domain_id, MAX(sub_id) FROM rpl_slave_state GROUP BY domain_id)
Old rows are deleted when no longer needed.
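The sub_id mechanism can be illustrated with a small Python sketch (my own illustration of the idea, not actual server code): given the rows of rpl_slave_state, the current replication state is found by taking, within each domain, the row with the highest sub_id.

```python
def current_state(rows):
    """Given rpl_slave_state rows as (domain_id, sub_id, server_id, seq_no)
    tuples, return the last-applied GTID per domain: the row with the
    highest sub_id within each domain_id."""
    state = {}
    for domain_id, sub_id, server_id, seq_no in rows:
        best = state.get(domain_id)
        if best is None or sub_id > best[0]:
            state[domain_id] = (sub_id, server_id, seq_no)
    # Map each domain to its GTID (domain_id, server_id, seq_no).
    return {d: (d, s[1], s[2]) for d, s in state.items()}

rows = [
    (1, 100, 5, 300010),
    (1, 101, 5, 300011),  # higher sub_id wins within domain 1
    (2, 7, 3, 500),
]
state = current_state(rows)
```

This is exactly what the SELECT above computes; the point is that concurrent transactions only ever insert fresh rows, so no row lock is contended.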
Thus, two replicated transactions can be executed like this:
BEGIN;                                 BEGIN;

UPDATE some_table                      UPDATE other_table
   SET value = value + 1                  SET name = "foo"
 WHERE id = 10;                         WHERE category LIKE "food_%";

INSERT INTO mysql.rpl_slave_state      INSERT INTO mysql.rpl_slave_state
   SET domain_id = 1,                     SET domain_id = 1,
       sub_id = 100,                          sub_id = 101,
       server_id = 5,                         server_id = 5,
       seq_no = 300010;                       seq_no = 300011;

COMMIT;                                COMMIT;
These two transactions can run completely independently, including the inserts
into mysql.rpl_slave_state. And the commits at the end can be done
together as a single group commit, where we ensure that the second one is
recorded as happening after the first one so commit order (and binlog order)
is preserved and visibility is correct (second transaction not visible to any
query without the first also being visible).
Contrast this with how things would be with a rpl_slave_state table keeping a
single row per domain_id:
BEGIN;                                 BEGIN;

UPDATE some_table                      UPDATE other_table
   SET value = value + 1                  SET name = "foo"
 WHERE id = 10;                         WHERE category LIKE "food_%";

UPDATE mysql.rpl_slave_state           UPDATE mysql.rpl_slave_state
   SET server_id = 5,                     SET server_id = 5,
       seq_no = 300010                        seq_no = 300011
 WHERE domain_id = 1;                   WHERE domain_id = 1;

COMMIT;                                COMMIT;
Here the update of the replication state table in the second transaction would
have to wait for the first transaction to commit, because of row locks. Group
commit becomes impossible.
(I actually explained this issue to the replication developers at
MySQL/Oracle a long time ago, but last time I looked at MySQL 5.6, they had
not fixed it.)
On where to store the replication state
As Giuseppe pointed out, in the global transaction ID design document it is
still written that the replication state will be stored
in the slave binlog, not in the
rpl_slave_state table. Sorry
about this, I will get the document updated as soon as possible.
I had basically two ideas for how to store the slave state in a crash-safe way:
1. In the slave's binlog.
2. In a (transactional) table.
The big advantage of (2) is that it works also when the binlog is not enabled
on the slave. Since there can still be substantial overhead to enabling the
binlog, I currently plan to go with this approach.
The advantage of (1) is that it is potentially cheaper when the binlog is
enabled on the slave, as it commonly will be when global transaction ID is
enabled (to be able to promote a slave as a new master, the binlog must be
enabled, after all). We already write every single global transaction ID
applied into the binlog, and if we crash, we already scan the binlog during
crash recovery. Thus, it is easy during crash recovery to rebuild the
replication state from the binlog contents. This way we get crash safe slave
state without the overhead of maintaining an extra table.
It will be possible in the future to refine this, so that we could use method
(1) if binlog is enabled, else method (2). This might improve performance
slightly when binlog is enabled. But we should first benchmark to check if
such refinement will be worth it in terms of performance gained. It seems
likely that any gains will be modest, at best.
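The recovery step of method (1) can be sketched like this (a minimal illustration of the idea, assuming the crash-recovery scan yields the GTID events in binlog order; names are my own, not server code):

```python
def recover_state_from_binlog(gtid_events):
    """Rebuild the replication state from a crash-recovery scan of the
    slave's own binlog. gtid_events is the sequence of
    (domain_id, server_id, seq_no) GTIDs in the order they appear in the
    binlog; within one domain, binlog order is commit order, so the last
    event seen per domain is the state to resume from."""
    state = {}
    for domain_id, server_id, seq_no in gtid_events:
        state[domain_id] = (domain_id, server_id, seq_no)
    return state

# Two domains interleaved in the binlog; domain 0 ends at 0-2-12,
# domain 2 at 2-5-300.
events = [(0, 1, 10), (0, 1, 11), (2, 5, 300), (0, 2, 12)]
recovered = recover_state_from_binlog(events)
```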
On parallel replication
Parallel replication is something that has been long overdue, but is now a
reality. MariaDB 10.0 will have multi-source replication, which
is actually a form of parallel replication. MySQL 5.6 will
have multi-threaded slave. Galera can do
parallel replication, as can Tungsten I believe, though I am
not familiar with details. There are several other mechanisms for parallel
replication planned for later MariaDB releases.
It is thus very important to think parallel replication into the design of
global transaction ID from the start. I fully agree with Giuseppe's remarks
here about MySQL 5.6 replication features failing completely to do this. They
introduce in 5.6 three new features that require extensions to how the
replication state is stored: crash-safe slave, global transaction ID,
and multi-threaded slave. They have managed to do this by
introducing three completely different solutions. This is just
insane. It makes one wonder if Oracle management forbids Oracle developers to
talk to each other, just like we already know they prevent discussions with
the community ...
So, in relation to global transaction ID there are basically two kinds of
parallel replication techniques: in-order and out-of-order. The two interact
with global transaction ID in different ways.
On in-order parallel replication
In-order is when two (or more) different transactions are executed in parallel
on the slave, but the commit of the second transaction is delayed until after
the first transaction has committed. Galera is an example of this, I
think. Several of the parallel replication tasks planned for MariaDB
are also in-order techniques.
In-order parallel replication is transparent to applications and users (at
least with MVCC transactional engines like InnoDB), since changes only become
visible on COMMIT, and commits are done in a serial way. It is thus also
mostly transparent to global transaction ID, and does not need much special
consideration for the design.
One thing that can be done, and that I am currently working on, is to
integrate in-order parallel replication with group commit. Suppose we run
transactions T1 and T2 in parallel on the slave, and suppose that T2 happens
to complete first so that we have to wait in T2's commit for T1 to commit
first. If we integrate this wait with group commit, we can actually commit T1
and T2 at the same time, taking care to write the commit records to the binlog
and to the storage engine in the right order (T1 before T2). This way, the
wait is likely to improve performance rather than reduce it, in fact.
On out-of-order parallel replication
Out-of-order parallel replication is when transactions can be committed in a
different order on the slave than on the master. The MySQL 5.6 multi-threaded
slave feature is an example of this.
Out-of-order must be explicitly enabled by the application/DBA, because it
breaks fundamental semantics. If commits happen in different order, the slave
database may be temporarily in a state that never existed on the master and
may be invalid for the application. But if the application is written to
tolerate such inconsistencies, and explicitly declares this to the database,
then there may be potential for more parallelism than with in-order methods.
This can make out-of-order interesting.
The typical example, which MySQL 5.6 multi-threaded slave uses, is when the
application declares that transactions against different schemas are
guaranteed independent. Different schemas can then be replicated independently
(though if the application messes up and transactions happen to not really be
independent, things can break). MariaDB 10.0 multi-source replication is
another example, where the application declares a guarantee that two master
servers can be replicated independently.
Out-of-order creates a challenge for global transaction ID when switching to a
new master. Because events in the binlog on the new master are in different
order, there will not in general be a single place from which to start
replication without either losing or duplicating some event.
MariaDB global transaction ID handles this by only allowing out-of-order
parallel replication between different replication domains, never within a
single domain. In effect, the DBA/application explicitly declares the possible
independent replication streams, and then it is sufficient to remember one
global transaction ID per
domain_id as the position reached within each independent stream.
Thus, suppose we have a master where updates to schemas are independent, and
we want to replicate them in parallel on slaves. On the master, we configure
(say) 10 domain IDs, 20-29. When we log a global transaction ID to the binlog,
we set the domain_id value to a hash of the schema used.
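The schema-to-domain mapping could look something like this (a sketch of the idea in Python; the hash choice and function name are my own illustration, not the actual server implementation — the real requirement is just that the mapping be stable):

```python
import zlib

def schema_to_domain_id(schema_name, base=20, n_domains=10):
    """Map a schema name to one of n_domains replication domain IDs
    (base .. base + n_domains - 1). CRC32 is used here because it is
    stable across restarts and hosts, so the same schema always lands
    in the same replication domain."""
    return base + zlib.crc32(schema_name.encode("utf-8")) % n_domains

# Same schema always maps to the same domain; transactions against it
# therefore stay ordered, while other schemas can replicate in parallel.
d_sales = schema_to_domain_id("sales")
d_inventory = schema_to_domain_id("inventory")
```

Note that two different schemas may hash to the same domain; that only costs parallelism, never correctness, since within one domain everything is applied in order.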
On the slave, we then configure 10 SQL threads. Two received transactions with
different domain_ids can be executed in parallel. Two transactions
using the same schema will map to the same
domain_id and will thus
not be able to execute in parallel. Thus we get MySQL 5.6 style multi-threaded
slave almost for free, using the exact same mechanism as for executing
multi-source replication in parallel. The replication state on the slave will
in this case consist of the 10 different global transaction IDs reached within
each of the 10 replication domains. And they can be stored in the
rpl_slave_state just as described above. Thus replication
state for out-of-order parallel replication is fully integrated with the rest
of the design, needing no special mechanisms.
And we can do more! The application (with suitable privileges) is allowed to
set the domain_id per query. For example, we can run all normal
queries with domain_id=1. But then if we have a long-running
maintenance query like an ALTER TABLE or something that updates every row in a
large table, we can execute it with
domain_id=2, if we take care
that no other queries conflict with it. This way, the long-running query can
run in parallel "in the background", without causing any replication delay for
the normal queries.
In effect, the application or DBA now has great flexibility in declaring which
queries can replicate independent of (and thus in parallel with) each other,
and all this just falls out almost for free from the overall design. I foresee
that this will be a very powerful feature to have for large, busy replication
setups.
Note btw. that most out-of-order parallel replication techniques can also be
done as in-order simply by delaying the final COMMIT steps of transactions to
happen in-order. This way one could for example do per-schema parallel
replication without polluting the replication state with many global
transaction IDs. This should generally achieve similar improvement in overall
throughput, though latency of individual transactions can be longer.
On "holes" in the global transaction ID sequences
Global transaction IDs have a sequence-number component, which ensures
uniqueness by being always increasing. This raises the issue of whether an
event will always have a sequence number exactly one bigger than the previous
event, or if it is allowed to have "holes", where some sequence number is
never allocated to an event.
For MariaDB global transaction ID, I took the approach that holes are
allowed. There are a number of good reasons for this.
Mostly, I believe that a design that relies on "no holes" is a fundamental
mistake. In MySQL 5.6 global transaction ID, holes are absolutely not
allowed. If a hole ever turns up, you will be stuck with it literally
forever. The MySQL 5.6 replication state lists every sequence number not yet
applied on a slave server, so if one becomes missing it will forever
remain. Unless you remove it manually, and as far as I have been able to
determine, there are currently no facilities for this. Anyone who knows how
fragile MySQL replication can be should realise that this is a recipe for
disaster.
Another point: because of the strict "no holes" requirement, in MySQL 5.6,
when events are filtered with
--replicate-ignore-db or whatever,
they had to change the code so that a dummy event is used to replace the
filtered event. In effect, you cannot really filter any events any more! I
think that alone should be enough to realise that the design is wrong.
A more subtle point is that a strict "no holes" requirement makes it much
harder to correctly and scalably handle the allocation of new numbers. Basically,
to allocate next number in a multi-thread environment, a lock needs to be
taken. We need to take this lock for as short as possible to preserve
scalability. But then, what happens if we allocate some sequence number N to
transaction T1, and then later we get some failure that prevents T1 from
successfully committing and being written into the binlog? We now cannot
simply rollback T1, because some other transaction T2 may have already
allocated the next number, and then we would leave a hole. Subtle issues like
this are important to achieve good scalability.
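The point can be sketched as follows (my own Python illustration of the allocation pattern, not the actual server code): with holes allowed, the lock only protects the increment itself, and a failed transaction simply never uses its number.

```python
import threading

class SeqAllocator:
    """Allocate sequence numbers under a short lock. If a transaction
    fails after allocation, its number is simply never written to the
    binlog, leaving a (harmless) hole, rather than forcing other
    transactions to wait for its outcome or to renumber."""
    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def allocate(self):
        with self._lock:
            n = self._next
            self._next += 1
            return n

alloc = SeqAllocator()
n1 = alloc.allocate()   # handed to transaction T1
n2 = alloc.allocate()   # handed to transaction T2, without waiting for T1
# Suppose T1 now fails before commit: we just roll T1 back and never
# binlog n1. T2 still commits with n2; n1 becomes a hole and nothing breaks.
```

With a strict "no holes" rule, the failure of T1 above would force either T2 to wait until T1's fate is known, or some renumbering scheme; either way the lock must be held far longer.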
So I think it is wrong to base the design on never having holes. On the other
hand, there is no reason to deliberately introduce holes just for the fun of
it. Sequence numbers in MariaDB global transaction ID will generally be
without holes; it is just that nothing will break if somehow a hole should
appear.
Also, whenever a global transaction ID is received on a slave server, the
server's own internal counter for the next sequence number to allocate will be
set to one more than the received sequence number, if it is currently
smaller. This gives the very nice property in a standard setup, where only one
master is ever written at any one time: The sequence number in itself will be
globally unique and always increasing. This means that one can look at any two
global transaction IDs and immediately know which one comes first in the
history, which can be very useful. It also allows to give a warning if
multiple masters are being written without being configured with distinct
replication domain ID. This is detected when a server replicates a global
transaction ID with the same domain_id as its own but a smaller sequence
number.
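The counter-update rule is simple enough to state as one line of Python (a sketch of the rule, with hypothetical names):

```python
def update_seq_counter(own_next_seq_no, received_seq_no):
    """When a slave replicates a GTID in its own domain, bump its local
    counter to one past the received sequence number if it is behind.
    In a setup where only one master is written at a time, this keeps
    sequence numbers globally unique and monotonically increasing
    across master switches."""
    return max(own_next_seq_no, received_seq_no + 1)

# Slave is behind the master: jump ahead past the received number.
a = update_seq_counter(100, 300)   # -> 301
# Slave's counter is already ahead: keep it unchanged.
b = update_seq_counter(500, 300)   # -> 500
```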
In multi-master-like setups, sequence number by itself can no longer be
globally unique. But even here, if the system is correctly configured so that
each actively written master has its own domain_id, the sequence
number will be unique per
domain_id and allow ordering of global
transaction IDs within one domain (and of course, between different domains
there is no well-defined ordering anyway).
On CHANGE MASTER TO syntax
In the previous post, I did not really go into what new syntax will be
introduced for MariaDB global transaction ID, both for the sake of brevity,
and also because it has not really been fully decided yet.
However, there will certainly be the possibility to change directly the
replication slave state, ie. the global transaction ID to start replicating
from within each replication domain. For example something like this:
CHANGE MASTER TO master_host = "master.lan",
master_gtid_pos = "1-1-100,2-5-300";
This is a fine and supported way to start replication. I just want to mention
that it will also be supported to start replication in the old way, specifying
master_log_file and master_log_pos, and the
master will automatically convert this into the corresponding global
transaction ID position and set this for the slave. This
can be convenient, as many existing tools
provide easy access to the old-style binlog position. It is certainly an
improvement over MySQL 5.6 global transaction ID, where the only documented
way to setup a slave involves RESET MASTER (!) on the master server...
Incidentally, note that
master_gtid_pos has just one global
transaction ID per
domain_id, not one per
server_id. Thus, if not using any form of multi-master, there
will be just one global transaction ID to set.
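Parsing such a position string is straightforward; this sketch (my own illustration, not server code) shows the fixed-size state that results, one GTID per domain:

```python
def parse_gtid_pos(pos):
    """Parse a master_gtid_pos string like "1-1-100,2-5-300" into a dict
    mapping domain_id -> (domain_id, server_id, seq_no). Each domain may
    appear at most once: one GTID per replication domain is the entire
    replication state, regardless of how many server_ids ever appeared
    in the binlog."""
    state = {}
    for gtid in pos.split(","):
        domain_id, server_id, seq_no = (int(x) for x in gtid.strip().split("-"))
        if domain_id in state:
            raise ValueError("duplicate domain_id %d" % domain_id)
        state[domain_id] = (domain_id, server_id, seq_no)
    return state

pos = parse_gtid_pos("1-1-100,2-5-300")
```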
So if we start with server 1 as the master, and then some time later switch
over to server 2 as master, the binlog will have global transaction IDs with
both server_id=1 and server_id=2. But the slave
replication state will be just a single global transaction ID, the one with
server_id=2 in this case. Since binlog order is always the
same within one replication domain, a single global transaction ID is
sufficient to know the correct place to continue replication.
I think this is a very nice property, that the size of the replication state
is fixed: one global transaction ID per configured replication domain. In
contrast, for MySQL 5.6 global transaction ID, any server_id that has
ever worked as master will remain in the replication state forever. If you
ever had a server id 666 you will still be stuck with it 10 years later when
you specify the replication state in CHANGE MASTER (assuming they will at some
point even allow specifying the replication state in CHANGE MASTER).
Once the global transaction replication state is set, changing to a new master
could happen with something like this:
CHANGE MASTER TO master_host = "master.lan",
master_gtid_pos = AUTO;
This is the whole point of global transaction ID, of course, to be able to do
this and automatically get replication started from the right point(s) in the
binlog on the new master.
One idea that I have not yet decided about is to allow just this simple syntax:
CHANGE MASTER TO master_host = "master.lan";
so that if no starting position is specified, and a global transaction state
already exists, we default to master_gtid_pos = AUTO. This would be
a change in current behaviour, so maybe it is not a good idea. On the other
hand, the current behaviour is to start replication from whatever happens to
be the first not yet purged binlog file on the master, which is almost guaranteed
to be wrong. So it is tempting to change this simple syntax to just do the
right thing.
On extensions to mysqlbinlog
Robert mentions some possible extensions to the mysqlbinlog
program, and I agree with those.
The master binlog is carefully designed so that it is easily possible to
locate any given global transaction ID and the corresponding binlog position
(or determine that such global transaction ID is not present in any binlog
files). In the initial design this requires scanning one (but just one) binlog
file from the beginning; later we could add an index facility if this becomes
a bottleneck. The
mysqlbinlog program should also support this,
probably by allowing the user to specify a global transaction ID (or multiple
IDs) as the start and/or stop position.
Robert also mentions the usefulness of an option to filter out events from
within just one replication domain/stream. This is something I had not thought
of, but it would clearly be useful and is simple to implement.
On session variables
With MariaDB global transaction ID,
server_id becomes a session
variable, as do the newly introduced variables gtid_domain_id and
gtid_seq_no. This allows an external replication mechanism to
apply events from another server outside the normal replication mechanism, and
still preserve the global transaction ID when the resulting queries are logged
in the local binlog. One important use case for this is of course
mysqlbinlog | mysql. Here, mysqlbinlog
can set these variables to preserve the global transaction IDs on the events
applied, so that fail-over and so on will still work correctly.
Since messing with server_id and so on can easily break replication,
setting these variables requires SUPER privileges.