
Kristian Nielsen - More on global transaction ID in MariaDB
January 4th, 2013, 04:55 pm

I got some very good comments/questions on my previous post on MariaDB global transaction ID, from Giuseppe and Robert (of Tungsten fame). I thought a follow-up post would be appropriate to answer and further elaborate on the comments, as the points they raise are very important and interesting.

(It also gives me the opportunity to explain more deeply a lot of interesting design decisions that I left out in the first post for the sake of brevity and clarity.)

On crash-safe slave

One of the things I really wanted to improve with global transaction ID is to make replication slaves more crash safe with respect to their current replication state. This state is mostly persistently stored information about which event(s) were last executed on the slave, so that after a server restart the slave knows at which point in the master binlog(s) to resume replication. In current (5.5 and earlier) replication, this state is stored simply by re-writing the file relay-log.info after each executed event. If the server crashes, this is very susceptible to corruption, where the contents of the file no longer match the actual state of the tables in the database.

With MariaDB global transaction ID, the replication state is stored in the following table instead of in a plain file:

    CREATE TABLE rpl_slave_state (
        domain_id INT UNSIGNED NOT NULL,
        sub_id BIGINT UNSIGNED NOT NULL,
        server_id INT UNSIGNED NOT NULL,
        seq_no BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (domain_id, sub_id));

When a transaction is executed on the slave, this table is updated as part of the transaction. So if the table is created with InnoDB (or other transactional engine) and the replicated events also use transactional tables, then the replication state is crash safe. DDL, or non-transactional engines such as MyISAM, remain crash-unsafe of course.
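To make the crash-safety property concrete, here is a minimal sketch in Python, using SQLite purely as a stand-in for InnoDB (the table and column names follow the article; the helper apply_transaction() is hypothetical, not server code). The point is that the replicated change and the state row commit, or roll back, together:

```python
import sqlite3

# SQLite stands in for a transactional engine like InnoDB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE some_table (id INTEGER PRIMARY KEY, value INTEGER);
    CREATE TABLE rpl_slave_state (
        domain_id INTEGER NOT NULL,
        sub_id INTEGER NOT NULL,
        server_id INTEGER NOT NULL,
        seq_no INTEGER NOT NULL,
        PRIMARY KEY (domain_id, sub_id));
    INSERT INTO some_table VALUES (10, 1);
""")

def apply_transaction(conn, domain_id, sub_id, server_id, seq_no):
    # One transaction: the replicated event and the replication-state
    # row either both persist or neither does, even across a crash.
    with conn:
        conn.execute("UPDATE some_table SET value = value + 1 WHERE id = 10")
        conn.execute("INSERT INTO rpl_slave_state VALUES (?,?,?,?)",
                     (domain_id, sub_id, server_id, seq_no))

apply_transaction(conn, 1, 100, 5, 300010)
print(conn.execute("SELECT seq_no FROM rpl_slave_state").fetchone()[0])  # 300010
```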

A global transaction ID in MariaDB consists of domain_id as described in the previous post, an increasing sequence number, and the usual server_id. Recall that the replication state with global transaction ID consists of the last global transaction ID applied within each independent replication stream, ie. a mapping from domain_id to global transaction ID. This is what the table rpl_slave_state provides.

But what about the sub_id in the above table? It is there to prevent concurrency issues when parallel replication is used. If we want to execute in parallel two transactions with the same domain_id, then both transactions will need to update the table rpl_slave_state. If they updated the same row in the table, then one transaction would have to wait on a row lock for the other to commit. This would prevent any kind of group commit, which would be a very serious performance bottleneck.

So instead, each transaction inserts a new, separate row into the table to record the new global transaction ID applied. There may thus be multiple entries for a given domain_id. The sub_id is used to distinguish them; it is simply a (local) integer that is incremented for each transaction received from the master binlog. Thus, at any given time, the last applied global transaction ID for a given domain_id is the one with the highest sub_id in the table.

In effect, the replication state is obtained with a query like this:

    SELECT domain_id, server_id, seq_no
      FROM rpl_slave_state
     WHERE (domain_id, sub_id) IN
           (SELECT domain_id, MAX(sub_id) FROM rpl_slave_state GROUP BY domain_id);
Old rows are deleted when no longer needed.
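A small Python sketch of the same logic, with the table modeled as a list of tuples (illustrative only; the function names are mine, not the server's). For each domain, the row with the highest sub_id is the current position, and older rows can be pruned:

```python
rows = [  # (domain_id, sub_id, server_id, seq_no)
    (1, 100, 5, 300010),
    (1, 101, 5, 300011),
    (2,   7, 9,   1500),
]

def slave_state(rows):
    # domain_id -> (server_id, seq_no) of the row with the highest sub_id.
    latest = {}
    for domain_id, sub_id, server_id, seq_no in rows:
        if domain_id not in latest or sub_id > latest[domain_id][0]:
            latest[domain_id] = (sub_id, server_id, seq_no)
    return {d: (srv, seq) for d, (_, srv, seq) in latest.items()}

def prune(rows):
    # Keep only the newest row per domain; older rows are no longer needed.
    newest = {}
    for row in rows:
        if row[0] not in newest or row[1] > newest[row[0]][1]:
            newest[row[0]] = row
    return sorted(newest.values())

print(slave_state(rows))  # {1: (5, 300011), 2: (9, 1500)}
```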

Thus, two replicated transactions can be executed like this:

    BEGIN;                                 BEGIN;

    UPDATE some_table                      UPDATE other_table
       SET value = value + 1                  SET name = "foo"
     WHERE id = 10;                         WHERE category LIKE "food_%";

    INSERT INTO mysql.rpl_slave_state      INSERT INTO mysql.rpl_slave_state
       SET domain_id = 1,                     SET domain_id = 1,
           sub_id = 100,                          sub_id = 101,
           server_id = 5,                         server_id = 5,
           seq_no = 300010;                       seq_no = 300011;

    COMMIT;                                COMMIT;
These two transactions can run completely independently, including the inserts into rpl_slave_state. And the commits at the end can be done together as a single group commit, where we ensure that the second one is recorded as happening after the first one, so that commit order (and binlog order) is preserved and visibility is correct (the second transaction is never visible to a query unless the first one is also visible).

Contrast this with how things would be with a rpl_slave_state table with a single row per domain_id:

    BEGIN;                                 BEGIN;

    UPDATE some_table                      UPDATE other_table
       SET value = value + 1                  SET name = "foo"
     WHERE id = 10;                         WHERE category LIKE "food_%";

    UPDATE bad_rpl_slave_state
       SET server_id = 5,
           seq_no = 300010
     WHERE domain_id = 1;

    COMMIT;

                                           UPDATE bad_rpl_slave_state
                                              SET server_id = 5,
                                                  seq_no = 300011
                                            WHERE domain_id = 1;

                                           COMMIT;
Here the update of the replication state table in the second transaction would have to wait for the first transaction to commit, because of row locks. Group commit becomes impossible.

(I actually explained this issue to the replication developers at MySQL/Oracle a long time ago, but last time I looked at MySQL 5.6, they had ignored it...)

On where to store the replication state

As Giuseppe pointed out, the global transaction ID design document still says that the replication state will be stored in the slave binlog, not in the rpl_slave_state table. Sorry about this, I will get the document updated as soon as possible.

I had basically two ideas for how to store the slave state in a crash-safe way:

  1. In the slave's binlog.
  2. In a (transactional) table.
The big advantage of (2) is that it works also when the binlog is not enabled on the slave. Since there can still be substantial overhead to enabling the binlog, I currently plan to go with this approach.

The advantage of (1) is that it is potentially cheaper when the binlog is enabled on the slave, as it commonly will be when global transaction ID is enabled (to be able to promote a slave as a new master, the binlog must be enabled, after all). We already write every single global transaction ID applied into the binlog, and if we crash, we already scan the binlog during crash recovery. Thus, it is easy during crash recovery to rebuild the replication state from the binlog contents. This way we get crash safe slave state without the overhead of maintaining an extra rpl_slave_state table.

It will be possible in the future to refine this, so that we could use method (1) if binlog is enabled, else method (2). This might improve performance slightly when binlog is enabled. But we should first benchmark to check if such refinement will be worth it in terms of performance gained. It seems likely that any gains will be modest, at best.

On parallel replication

Parallel replication is something that has been long overdue, but is now a reality. MariaDB 10.0 will have multi-source replication, which is actually a form of parallel replication. MySQL 5.6 will have multi-threaded slave. Galera can do parallel replication, as can Tungsten I believe, though I am not familiar with details. There are several other mechanisms for parallel replication planned for later MariaDB releases, like MWL#184 and MDEV-520.

It is thus very important to take parallel replication into account in the design of global transaction ID from the start. I fully agree with Giuseppe's remarks here about MySQL 5.6 replication features failing completely to do this. In 5.6 they introduce three new features that require extensions to how the replication state is stored: crash-safe slave, global transaction ID, and multi-threaded slave. They have managed to do this by introducing three completely different solutions. This is just insane. It makes one wonder if Oracle management forbids Oracle developers to talk to each other, just like we already know they prevent discussions with the community ...

So, in relation to global transaction ID there are basically two kinds of parallel replication techniques: in-order and out-of-order. The two interact with global transaction ID in different ways.

On in-order parallel replication

In-order is when two (or more) different transactions are executed in parallel on the slave, but the commit of the second transaction is delayed to after the first transaction has committed. Galera is an example of this, I think. Planned tasks MWL#184 and possibly MDEV-520 are also in-order techniques.

In-order parallel replication is transparent to applications and users (at least with MVCC transactional engines like InnoDB), since changes only become visible on COMMIT, and commits are done in a serial way. It is thus also mostly transparent to global transaction ID, and does not need much special consideration for the design.

One thing that can be done, and that I am currently working on, is to integrate in-order parallel replication with group commit. Suppose we run transactions T1 and T2 in parallel on the slave, and suppose that T2 happens to complete first so that we have to wait in T2's commit for T1 to commit first. If we integrate this wait with group commit, we can actually commit T1 and T2 at the same time, taking care to write the commit records to the binlog and to the storage engine in the right order (T1 before T2). This way, the wait is likely to improve performance rather than reduce it, in fact.
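The ordering part of this can be sketched in a few lines of Python (a toy model, not the actual server code): transactions apply in parallel, but each waits for its predecessor before writing its commit record. Real group commit would additionally batch the commit records into a single fsync(); here we only enforce the order:

```python
import threading
import time

class InOrderCommitter:
    def __init__(self):
        self._cond = threading.Condition()
        self._next_to_commit = 0
        self.commit_log = []

    def commit(self, seq, name):
        with self._cond:
            while seq != self._next_to_commit:
                self._cond.wait()            # predecessor not yet committed
            self.commit_log.append(name)     # commit record written in order
            self._next_to_commit += 1
            self._cond.notify_all()

committer = InOrderCommitter()

def apply_and_commit(seq, name, apply_time):
    time.sleep(apply_time)                   # the "apply" phase runs in parallel
    committer.commit(seq, name)

t1 = threading.Thread(target=apply_and_commit, args=(0, "T1", 0.05))
t2 = threading.Thread(target=apply_and_commit, args=(1, "T2", 0.01))
t2.start(); t1.start()
t1.join(); t2.join()
print(committer.commit_log)  # ['T1', 'T2'] even though T2 finished applying first
```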

On out-of-order parallel replication

Out-of-order parallel replication is when transactions can be committed in a different order on the slave than on the master. The MySQL 5.6 multi-threaded slave feature is an example of this.

Out-of-order must be explicitly enabled by the application/DBA, because it breaks fundamental semantics. If commits happen in different order, the slave database may be temporarily in a state that never existed on the master and may be invalid for the application. But if the application is written to tolerate such inconsistencies, and explicitly declares this to the database, then there may be potential for more parallelism than with in-order methods. This can make out-of-order interesting.

The typical example, which MySQL 5.6 multi-threaded slave uses, is when the application declares that transactions against different schemas are guaranteed independent. Different schemas can then be replicated independently (though if the application messes up and transactions happen to not really be independent, things can break). MariaDB 10.0 multi-source replication is another example, where the application declares a guarantee that two master servers can be replicated independently.

Out-of-order creates a challenge for global transaction ID when switching to a new master. Because events in the binlog on the new master are in a different order, there will not in general be a single place from which to start replication without either losing or duplicating some event.

MariaDB global transaction ID handles this by only allowing out-of-order parallel replication between different replication domains, never within a single domain. In effect, the DBA/application explicitly declares the possible independent replication streams, and then it is sufficient to remember one global transaction ID per domain_id as the position reached within each independent stream.

Thus, suppose we have a master where updates to different schemas are independent, and we want to replicate them in parallel on slaves. On the master, we configure, say, 10 domain IDs, 20-29. When we log a global transaction ID to the binlog, we set the domain_id value to a hash of the schema used.
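The schema-to-domain mapping can be sketched like this (CRC32 is just one possible stable hash; the article does not specify which hash the server would actually use, and the function name is mine):

```python
import zlib

def schema_to_domain_id(schema, base=20, n_domains=10):
    # Map a schema name onto one of the configured domain IDs 20-29.
    # A stable hash guarantees the same schema always gets the same domain.
    return base + zlib.crc32(schema.encode("utf-8")) % n_domains

# Same schema -> same domain, so its transactions stay serialized;
# different schemas usually land in different domains and can then
# be applied in parallel on the slave.
for schema in ("accounts", "sales", "logs"):
    print(schema, "->", schema_to_domain_id(schema))
```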

On the slave, we then configure 10 SQL threads. Two received transactions with different domain_id can be executed in parallel. Two transactions using the same schema will map to the same domain_id and will thus not be able to execute in parallel. Thus we get MySQL 5.6-style multi-threaded slave almost for free, using the exact same mechanism as for executing multi-source replication in parallel. The replication state on the slave will in this case consist of the 10 different global transaction IDs reached within each of the 10 replication domains. They can be stored in the table rpl_slave_state just as described above; the replication state for out-of-order parallel replication is thus fully integrated with the rest of the design, needing no special mechanisms.

And we can do more! The application (with suitable privileges) is allowed to change domain_id per-query. For example, we can run all normal queries with domain_id=1. But then if we have a long-running maintenance query like an ALTER TABLE or something that updates every row in a large table, we can execute it with domain_id=2, if we take care that no other queries conflict with it. This way, the long-running query can run in parallel "in the background", without causing any replication delay for normal queries.

In effect, the application or DBA now has great flexibility in declaring which queries can replicate independently of (and thus in parallel with) each other, and all this falls out almost for free from the overall design. I foresee that this will be a very powerful feature to have for large, busy replication setups.

Note btw. that most out-of-order parallel replication techniques can also be done as in-order simply by delaying the final COMMIT steps of transactions to happen in-order. This way one could for example do per-schema parallel replication without polluting the replication state with many global transaction IDs. This should generally achieve similar improvement in overall throughput, though latency of individual transactions can be longer.

On "holes" in the global transaction ID sequences

Global transaction IDs have a sequence-number component, which ensures uniqueness by being always increasing. This raises the issue of whether an event will always have a sequence number exactly one bigger than the previous event, or if it is allowed to have "holes", where some sequence number is never allocated to an event.

For MariaDB global transaction ID, I took the approach that holes are allowed. There are a number of good reasons for this.

Mostly, I believe that a design that relies on "no holes" is a fundamental mistake. In MySQL 5.6 global transaction ID, holes are absolutely not allowed. If a hole ever turns up, you will be stuck with it literally forever. The MySQL 5.6 replication state lists every sequence number not yet applied on a slave server, so if one goes missing it will remain listed forever, unless you remove it manually; and as far as I have been able to determine, there are currently no facilities for this. Anyone who knows how fragile MySQL replication can be should realise that this is a recipe for disaster.

Another point: because of the strict "no holes" requirement, in MySQL 5.6, when events are filtered with --replicate-ignore-db or whatever, they had to change the code so that a dummy event is used to replace the filtered event. In effect, you cannot really filter any events any more! I think that alone should be enough to realise that the design is wrong.

A more subtle point is that a strict "no holes" requirement makes it much harder to handle allocation of new numbers correctly and scalably. Basically, to allocate the next number in a multi-threaded environment, a lock needs to be taken. We need to hold this lock for as short a time as possible to preserve scalability. But then, what happens if we allocate some sequence number N to transaction T1, and later some failure prevents T1 from successfully committing and being written into the binlog? We now cannot simply roll back T1's allocation, because some other transaction T2 may already have taken the next number, and then we would leave a hole. Subtle issues like this are important for achieving good scalability.
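A toy sketch of the allocation problem (illustrative only, not server code): the lock is held just for the increment, so once T1's number is handed out and T2 has taken the next one, a failure of T1 leaves its number permanently unused:

```python
import threading

class SeqAllocator:
    def __init__(self, start=1):
        self._lock = threading.Lock()
        self._next = start

    def allocate(self):
        with self._lock:      # held as briefly as possible, for scalability
            n = self._next
            self._next += 1
            return n

alloc = SeqAllocator()
n1 = alloc.allocate()  # T1 is assigned sequence number 1
n2 = alloc.allocate()  # T2 is assigned 2 before T1 has committed
# If T1 now fails before reaching the binlog, we cannot roll the counter
# back to 1 without colliding with T2: sequence number 1 becomes a hole.
print(n1, n2)
```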

So I think it is wrong to base the design on never having holes. On the other hand, there is no reason to deliberately introduce holes just for the fun of it. Sequence numbers in MariaDB global transaction ID will generally be without holes, it is just that nothing will break if somehow a hole should sneak in.

Also, whenever a global transaction ID is received on a slave server, the server's own internal counter for the next sequence number to allocate is set to one more than the received sequence number, if it is currently smaller. This gives a very nice property in a standard setup where only one master is ever written to at any one time: the sequence number by itself will be globally unique and always increasing. This means that one can look at any two global transaction IDs and immediately know which one comes first in the history, which can be very useful. It also makes it possible to give a warning if multiple masters are written to without being configured with distinct replication domain IDs. This is detected when a server replicates a global transaction ID with the same domain_id as its own but a smaller sequence number.
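Both rules from the paragraph above fit in a few lines (the function names are illustrative, not the server's):

```python
def bump_counter(own_next_seq_no, received_seq_no):
    # On replicating a GTID, move our own counter past the received
    # sequence number, so sequence numbers keep increasing across
    # master switches in a single-writer setup.
    return max(own_next_seq_no, received_seq_no + 1)

def misconfigured_multi_master(own_domain_id, own_next_seq_no,
                               rcv_domain_id, rcv_seq_no):
    # A replicated GTID in our own domain with a *smaller* sequence
    # number suggests two masters are writing the same domain.
    return rcv_domain_id == own_domain_id and rcv_seq_no < own_next_seq_no

print(bump_counter(5, 300011))                        # 300012
print(misconfigured_multi_master(0, 300012, 0, 100))  # True
```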

In multi-master-like setups, the sequence number by itself can no longer be globally unique. But even here, if the system is correctly configured so that each actively written master has its own domain_id, the sequence number will be unique per domain_id and allows ordering global transaction IDs within one domain (and of course, between different domains there is no well-defined ordering anyway).

On CHANGE MASTER TO syntax

In the previous post, I did not really go into what new syntax will be introduced for MariaDB global transaction ID, both for the sake of brevity, and also because it has not really been fully decided yet.

However, there will certainly be the possibility to directly change the replication slave state, i.e. the global transaction ID to start replicating from within each replication domain. For example something like this:

    CHANGE MASTER TO master_host = "master.lan",
           master_gtid_pos = "1-1-100,2-5-300";
This is a fine and supported way to start replication. I just want to mention that it will also be supported to start replication in the old way, with master_log_file and master_log_pos, and the master will automatically convert this into the corresponding master_gtid_pos and set this for the slave. This can be convenient, as many tools like mysqldump or XtraBackup provide easy access to the old-style binlog position. It is certainly an improvement over MySQL 5.6 global transaction ID, where the only documented way to setup a slave involves RESET MASTER (!) on the master server...

Incidentally, note that master_gtid_pos has just one global transaction ID per domain_id, not one per server_id. Thus, if not using any form of multi-master, there will be just one global transaction ID to set.
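A sketch of parsing such a master_gtid_pos value (each element is domain_id-server_id-seq_no, with exactly one entry per replication domain; the parser is mine for illustration, not server code):

```python
def parse_gtid_pos(pos):
    # "1-1-100,2-5-300" -> {domain_id: (server_id, seq_no)}
    state = {}
    for gtid in pos.split(","):
        domain_id, server_id, seq_no = (int(x) for x in gtid.split("-"))
        if domain_id in state:
            raise ValueError("duplicate domain_id %d" % domain_id)
        state[domain_id] = (server_id, seq_no)
    return state

print(parse_gtid_pos("1-1-100,2-5-300"))  # {1: (1, 100), 2: (5, 300)}
```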

So if we start with server 1 as the master, and then some time later switch over to server 2 for a master, the binlog will have global transaction IDs both with server_id=1 and server_id=2. But the slave binlog state will be just a single global transaction ID, with server_id=2 in this case. Since binlog order is always the same within one replication domain, a single global transaction ID is sufficient to know the correct place to continue replication.

I think this is a very nice property, that the size of the replication state is fixed: one global transaction ID per configured replication domain. In contrast, for MySQL 5.6 global transaction ID, any server_id that ever worked as master will remain in the replication state forever. If you ever had a server id 666 you will still be stuck with it 10 years later when you specify the replication state in CHANGE MASTER (assuming they will at some point even allow specifying the replication state in CHANGE MASTER).

Once the global transaction replication state is set, changing to a new master could happen with something like this:

    CHANGE MASTER TO master_host = "master.lan",
                     master_gtid_pos = AUTO;
This is the whole point of global transaction ID, of course, to be able to do this and automatically get replication started from the right point(s) in the binlog on the new master.

One idea that I have not yet decided about is to allow just this simple syntax:

    CHANGE MASTER TO master_host = "master.lan";
so that if no starting position is specified, and a global transaction state already exists, we default to master_gtid_pos = AUTO. This would be a change in current behaviour, so maybe it is not a good idea. On the other hand, the current behaviour is to start replication from whatever happens to be the first not yet purged binlog file on the master, which is almost guaranteed to be wrong. So it is tempting to change this simple syntax to just do the right thing.

On extensions to mysqlbinlog

Robert mentions some possible extensions to the mysqlbinlog program, and I agree with those.

The master binlog is carefully designed so that it is easily possible to locate any given global transaction ID and the corresponding binlog position (or determine that such a global transaction ID is not present in any binlog file). In the initial design this requires scanning one (but just one) binlog file from the beginning; later we could add an index facility if this becomes a bottleneck. The mysqlbinlog program should also support this, probably by allowing one or more global transaction IDs to be specified for --start-position and --stop-position.

Robert also mentions the usefulness of an option to filter out events from within just one replication domain/stream. This is something I had not thought of, but it would clearly be useful and is simple to implement.

On session variable server_id

With MariaDB global transaction ID, server_id becomes a session variable, as do newly introduced variables gtid_domain_id and gtid_seq_no. This allows an external replication mechanism to apply events from another server outside the normal replication mechanism, and still preserve the global transaction ID when the resulting queries are logged in the local binlog. One important use case for this is of course point-in-time recovery, mysqlbinlog | mysql. Here, mysqlbinlog can set these variables to preserve the global transaction IDs on the events applied, so that fail-over and so on will still work correctly.

Since messing with server_id and these related variables can easily break replication, setting them requires SUPER privileges.


Comments
 
From:Robert Hodges
Date:January 4th, 2013 06:02 pm (UTC)
Hi Kristian,

Thanks for the detailed follow-up! I'm still digesting it fully, but I should point out that the no-gaps approach we use in Tungsten works quite nicely for us because we are parsing the binlog, which is already in serial order. The same is true for replicating out of other DBMS types--we first ensure serialization, then create our log records. Therefore we can just assign the sequence number in order without creating gaps. I understand your reasoning for not wanting this if you are operating in a concurrent environment within the server. It seems to me that each way has its own difficulties--for example, once you allow gaps *at all*, the fact that they may only occur occasionally is irrelevant to algorithms for handling the log.

Cheers, Robert
From:(Anonymous)
Date:January 5th, 2013 05:52 am (UTC)
Must the binlog format be set to ROW if parallel replication is enabled?
From:kristiannielsen
Date:January 5th, 2013 10:51 am (UTC)
There are different techniques being planned.

The technique linked to from MDEV-520 will only be able to replicate transactions in parallel if row-based replication is used.

The other techniques work with statement, mixed, and row modes alike.
From:Alexey Yurchenko
Date:January 7th, 2013 12:29 pm (UTC)
Hi Kristian,
I've noticed that you're planning to use manually assigned ordinal numbers as domain IDs. It looks nice and simple, but has the potential to create streams with the same ID. Have you considered using UUIDs as domain IDs to avoid the risk of ever having two different events with the same GTID?
Regards, Alex
From:kristiannielsen
Date:January 7th, 2013 01:49 pm (UTC)
I do not see how using UUID for domain_id helps; it just makes the job of the
DBA harder by being unreadable.

The whole point of having domain_id separate from server_id is that multiple
servers can share the same domain_id. For example, in a simple setup with one
master and N slaves, all N+1 servers must have the same domain_id (probably
the default of 0). So that if one of the slaves is promoted as the new master,
it will continue writing the same stream (same domain_id) in its binlog.

In fact, domain_id is only allowed to be different on two separate servers if
the DBA/user/application can guarantee that updates done on those two servers
will always be independent of each other...

In contrast, server_id can be allocated using UUID, as it must always be
unique per server. This is what MySQL 5.6 does. However, I dislike this use of
UUID, and am currently not planning on merging that to MariaDB ...
From:Alexey Yurchenko
Date:January 7th, 2013 02:37 pm (UTC)
From your previous post:
"A replication domain is just a server or group of servers that generate a single, strictly ordered replication stream."

So let's put it a bit wider (and forget about UUIDs): how do you plan to ensure that all servers in a given domain have the same domain ID, and servers not belonging to this domain a different domain ID? Or do you not see this as a potential hazard?
From:kristiannielsen
Date:January 7th, 2013 03:50 pm (UTC)
This will be up to the DBA to configure.

I do not see how this can be any different. Replication domains are a property
of the application. The replication domain a given query belongs to is
determined by the domain_id that the DBA configures for it.

So by definition, two events (or two servers) are in the same domain iff they
are configured with the same domain_id. The application/user/DBA decides which
domains are there, not MariaDB.

Maybe some examples will help?

If you have a replication topology with a single master active at the same
time, configure all servers with the same domain_id (the default will do).

If you have multi-source replication with one slave replicating two
independent masters, configure each master with a different domain_id. The
slave with multiple master sources can detect if same domain_id is received
from multiple sources and give a warning or error.

If you have two or more masters in a ring, configure each such master with its
own domain_id. Every direct slave of a master should get the same domain_id if
it is promoted to replace that master.

Basically, domain_id is a tool that allows the DBA to know and control exactly
which independent replication streams exist in the system, and hence exactly
what the replication slave state looks like. That is the point, knowledge and
control.

The server cannot guess on its own which replication domains exist. However,
one can do like MySQL 5.6, which basically assumes that every event can
potentially belong to its own domain. This makes things automatic, so the DBA
does not need to configure domain_id.

But without knowing which event belongs to which domain, the server is then
unable to use this information. So if S1 does multi-source replication from M1
and M2, and S2 is a slave of S1, then we can not on S2 apply events from M1 in
parallel with those from M2, which is a critical performance issue. And we can
not deal with error handling around missing events. And replication state
status becomes more complex. That is why I prefer explicit domain_id
configuration.
From:kristiannielsen
Date:January 7th, 2013 04:06 pm (UTC)
To put it another way:

Currently, there is no way to see on a master server if it is used for
multi-source replication, or if it is part of a multi-master ring. But this
information is useful/needed for good global transaction ID support.

I propose to add a configuration variable (in the form of domain_id) that
allows the DBA to provide the needed information.

This gives a slightly higher burden on the DBA for complex replication
setups. But in return it provides simpler and more powerful management, and
higher performance.
From:kristiannielsen
Date:January 7th, 2013 05:43 pm (UTC)
Still, this is an interesting topic, thanks for raising it!

I think I will try to write a follow-up post with some examples in more detail, going through how the MySQL and the MariaDB approach will deal with them. It is certain to be instructive and help improve the final design.
From:(Anonymous)
Date:February 18th, 2013 11:10 am (UTC)
Hi, Kristian

What about the idea of batching transaction commits on the slave to accelerate applying the binary log, instead of parallel replication?
It is much simpler and can work with all binlog formats.
https://code.google.com/p/david-mysql-tools/wiki/Batch_Commit_for_accelerate_MySQL_slave
From:kristiannielsen
Date:February 18th, 2013 11:20 am (UTC)
> What about the idea of batching transaction commits on the slave to
> accelerate applying the binary log, instead of parallel replication?

Yes, I think it is a good idea.

If you check recent discussions on the maria-developers@ mailing list, you
will see a patch from me on parallel replication with in-order commit. This
actually does something related: applies three transactions in parallel, but
commits them in a single batch (a single fsync() shared among all).

But with my patch it will be somewhat random how many commits are batched. It
seems worthwhile to allow for more controlled configuration of how to batch
commits to reduce fsync. It is something for me to consider after I have
finished the global transaction ID ...

Thanks,

- Kristian.