Dissecting the MySQL replication binlog events|
For the replication
project that I am currently working on in MariaDB, I wanted to understand
exactly what information is needed to do full replication of all MySQL/MariaDB
statements on the level of completeness that existing replication does. So I
went through the code, and this is what I found.
What I am after here is a complete list of what the execution engine needs to
provide to have everything that a replication system needs to be able to
completely replicate all changes made on a master server. But not
anything specific to the particular implementation of replication used, like
binlog positions or replication event disk formats, etc.
The basic information needed is of course the query (for statement-based
replication), or the column values (for row-based replication). But there are
lots of extra details needed, especially for statement-based replication. I
need to make sure that the replication API we are designing will be able to
provide all needed information, and it was always nagging in the back
of my head that there would be lots and lots of small bits in various corners
that would be missing and cause problems. So it was good to get this
overview. Turns out that there are a lot of details, but not
that many, and it should be manageable.
All of the events that are used in replication are listed in
sql/log_event.h. So anything needed for
complete replication can be found here, but mixed up with lots of other
details about the MySQL binlog implementation, backwards compatibility,
etc. So what follows is an extract from
log_event.cc of the
actual change information contained in those events.
The main event for statement-based replication is QUERY_EVENT. It contains the
query to be executed (as a string) and some information to provide the context
for correct execution. Here is the list of information:
- SQL query.
- Default database for the query (eg. from
- The setting of some server variables in effect at the time the query
autocommit (whether autocommit is enabled).
- Character set and collation at various levels (see the
section 9.1.4. Connection
Character Sets and Collations in the MySQL manual for background on these):
- Client (
- Connection (
- Server (
- Current default database (
note that there are few statements that rely on this, comments in
the code say it is only
foreign_key_checks (whether foreign keys are checked).
unique_checks (whether unique constraint checks are
- Time zone of the master database server.
- Names to use for days and months; this is identified by a code that
is mapped to a table of names to use in
SELECT ... WHERE
autoinc IS NULL returns last insert id for ODBC compatibility).
- Error code from executing the query on the master (for non-transactional
statements that may still make permanent changes even though they fail
mid-way; on the slave the query should fail with the same error).
- Connection ID (this is used to correctly distinguish
TABLEs with same name used in different connections on the master
Note that not all of this information is replicated in all query events, as
not all of it is needed for a given query. But a replication API must make the
information available for the queries where it is needed.
These events provide additional context for executing a query on the slave:
- Value of
LAST_INSERT_ID (for queries that reference it).
- Value of
INSERT_ID (to get same auto_increment numbers for
inserts on the slave as on the master).
- The random seed (so RAND() can return same values in queries on slaves
as on the master).
- The values for any
@user_variables referenced in a query
These four events are used to do statement replication of
INFILE. The contents of the file to be loaded is sent in blocks
BEGIN_LOAD_QUERY_EVENT followed by zero or
APPEND_BLOCK_EVENT. Then the actual query is sent
EXECUTE_LOAD_QUERY_EVENT, which is a variant
QUERY_EVENT that replaces the original filename with the name
of a temporary file on the slave and deletes the temporary file afterwards
DELETE_FILE_EVENT is used in certain error cases).
This is the complete story of exactly how much information needs to be
provided on the master to make statement replication work as it does currently
in MySQL. If you get the thought that this is a little bit scary in terms of
complexity I tend to agree with you ;-). There is a lot to be said for the
comparative simplicity of row-based replication (and it is also interesting to
see the history of bug fixes in MySQL 5.1 that gradually have moved more and
more statements to be replicated row-based (in mixed-mode binlogging) due to
corner cases where statement-based replication can fail).
Still, once we have the list of information, it is not that hard to provide
the information in a pluggable replication API for any implementations that
want to try their luck with statement-based replication. And of course,
row-based replication only
DELETE! We also
need to support
CREATE TABLE and similar statements, for which it
is still useful to know the above exhaustive list of information that may be
needed in one form or another.
In row-based replication, each DML statement is binlogged in two parts. First
the tables modified in the query are described
TABLE_MAP_EVENT, and second the row values changed are
The information describing modified tables in row-based replication is as
- Database name.
- Table name.
- List of columns in the table. For each column, the following
information is included:
- Column type (this is
- Column metadata (this is what is returned
field->save_field_metadata(); this is for example the
maximum length of a
VARCHAR, the precision and number of
- Whether the column is NULL-able.
- Table map id; this is just an internally generated uniqie number for
subsequent events to refer to the table described.
Note in particular that column names are not used/needed in current
MySQL/MariaDB row-based replication. I personally think this is a good way to
do it. However, in a generic API, it will make sense to make the full table
definition available to implementations, each of which can choose what and how
to log in terms of table metadata.
These events handle replication of
(and similar statements like
REPLACE etc.) They contain the
- Table map id, referencing a table previously described
- Value of
similar to statement-based binlogging (but for row-based, those two are
all the context storedm though see remarks below).
- List (bitmap really) of columns updated. This is essentially
write_set that is used in the storage engine API (but see
below for explanation). For
UPDATE, there are two bitmaps, one
for the before image and one for the after image.
- List of records containing the values of each column modified. There is
one such record for every row update logged. For
are two records for each row update, one for the before image (values before
the update was done) and one for the after image (values after the update
I must say, investigating how these row-based events are implemented in MySQL
really makes the feature seem rather half-baked. There are several issues:
- The lists/bitmaps of column updated sound useful, but in reality they
are set unconditionally to include all columns! Except for NDB).
- This also means all columns
in a row are always sent, even for
UPDATE. Except for NDB, which only logs needed
- Some of the "extra" flags in the storage engine API are not included,
HA_EXTRA_WRITE_CAN_REPLACE. This is actually a bug, as
it means that a storage engine using such flags to optimise its operation
will not replicate correctly. In the existing MySQL source, only NDB uses
this flag, but NDB does special tricks for binlogging and slave replication
which avoids this particular issue in most cases.
I strongly suspect that some of this half-baking was done in a quick-and-dirty
attempt to squeeze NDB replication in. At least, there are several "this is
only used by NDB" type comments in the vicinity of these things in the source code.
In any case, for the replication API, it is probably a good idea to re-think
this part and make sure that the information logged for row updates is
complete and sane for all reasonable use cases.
My idea is to have a replication API that provides for generation and
consumption of events completely separate from any details of the actual
format of events in the binlog or any other method used to store or process
the events. This will allow replication plugins that use a completely
different binlog implementation, or even has no binlog at all.
So such an API needs to provide all of the above information (to allow
re-implementing the existing binlog/replication as a plugin, if for no other
reason), but need not provide such information in any particular event
format. In fact, I am trying to make the API so that such information need not
be materialised in structures or memory buffers at all; instead relying on
providing accessor methods, so that an implementation can request just the
information it needs, and materialise it as or if needed.
On top of this I still think it makes sense to define a standard (but
optional) materialised event format, so that more light-weight plugins can be
written that can do interesting things with replication without having to
implement a full new event format each time. I am still considering whether to
extend the existing binlog format (which is not all that attractive, as it is
not very easily extensible), or whether to define a new more flexible format
(for example based on the Google protobuffer library).
More on the existing binlog format
Just for completeness, here is some additional description of the existing
MySQL/MariaDB 5.1 binlog format. These are things that I believe are not
required in a new API, as they are mostly internal implementation
details. However, as I had to go through them anyway while finding the stuff
that does need to be in the API, I will include a brief description here.
Additional query information
Some additional information, which is mostly redundant, is included with query
events for statement-based binlogging:
- Bitmap of tables affected by multi-table update (this allows to know
which tables will be updated without parsing the query, eg. for filtering
events based on database/table name.)
- Time spent in query on master.
- Catalog (I believe this is old unused stuff. Idea is that each database
belongs to a catalog, but I have never seen this actually used anywhere).
- A flag
LOG_EVENT_THREAD_SPECIFIC_F which is set if the
TEMPORARY table (allows to get this information
without parsing the query).
- A flag
LOG_EVENT_SUPPRESS_USE_F set in a few cases when the
master knows that the query is independent of what the current database is
(so that a possible
USE statement can be optimised away).
Binlog specific events
These are events that are specific to the binlog implementation:
This is used to record a transaction ID for each transaction written to
the binlog in 2-phase commit. This recorded ID is needed during crash
recovery on the master to know which prepared transactions in
transactional engines need to be recovered to get consistency with what is
in the binlog. It is not used on the slave in replication (though this
events implies a
COMMIT, which _does_ have effect on the
slave, of course.)
This event is written at the start of every binlog file. It provides to
slaves reading the binlog the master server version and the event size of
all following events, thereby providing some facilities for extending
event formats while maintaining backwards compatibility.
This is logged when the master shuts down gracefully (though I do not
think this is used much, if any)
This is logged at the end of a binlog file when the master starts a new
binlog file. It is needed by the slave to reset it's master binlog
position so that the IO thread can proceed correctly from the next binlog
file (incidentally, it is a clear weakness in the binlog implementation
that the slaves need knowledge about binlog file names and data offsets on
the master server, and is a cause of much complexity when switching
masters in advanced replication topologies. Something that really needs
improvements in the near future).
This is logged by the master when something bad happens that may cause
replication to fail/diverge, so that the slave can be notified of the
problem and stop, informing the DBA/sysadm to resolve the issue.
Finally there are a number of events that are no longer generated (but which
are still important for the slave replication code to handle to be able to
work with masters of older versions):
Various old events for handling
LOAD DATA INFILE (as can be
LOAD DATA INFILE has had some changes in replication
over the years :-).
Old version of
Old versions of the row-based replication binlog events.
Not used, I think it may have been related to some feature that was never
Tags: freesoftware, mariadb, mysql, programming