MariaDB replication feature preview released

I am pleased to announce the availability of the MariaDB 5.2 feature preview release. Find the details and download links on the knowledgebase.

There has been quite a lot of interest in the replication work I have been doing around MariaDB, and I wanted a way to make it easy for people to use, experiment with, and give feedback on the new features. The result is this replication feature preview release. It will all eventually make it into the next official release, but that is likely still some months off.

All the usual binary packages and source tarballs are available for download. As something new, I have now also made apt-enabled repositories available for Debian and Ubuntu; this should greatly simplify installation on these .deb-based distributions.

So please try it out, and give feedback on the mailing list or bug tracker. I will make sure to fix any bugs and keep the feature preview updated until everything is available in an official release.

Here is the list of new features in the replication preview release:

Group commit for the binary log

This preview release implements group commit that works when using XtraDB with the binary log enabled. (In previous MariaDB releases, and all MySQL releases at the time of writing, group commit works in InnoDB/XtraDB when the binary log is disabled, but stops working when the binary log is enabled).

Documentation.

Enhancements for START TRANSACTION WITH CONSISTENT SNAPSHOT

START TRANSACTION WITH CONSISTENT SNAPSHOT now also works with the binary log. This means that it is possible to obtain the binlog position corresponding to a transactional snapshot of the database without any blocking of other queries at all. This is used by mysqldump --single-transaction --master-data to do a fully non-blocking backup that can be used to provision a new slave.

START TRANSACTION WITH CONSISTENT SNAPSHOT now also gives a consistent snapshot for transactions involving more than one storage engine (currently XtraDB and PBXT support this).

Documentation.

Annotation of row-based replication events with the original SQL statement

When using row-based replication, the binary log does not contain SQL statements, only discrete single-row insert/update/delete events. This can make it harder to read mysqlbinlog output and understand where in an application a given event may have originated, complicating analysis and debugging.

This feature adds an option to include the original SQL statement as a comment in the binary log (and shown in mysqlbinlog output) for row-based replication events.

Documentation.

Row-based replication for tables with no primary key

This feature can improve the performance of row-based replication on tables that do not have a primary key (or other unique key), but that do have another index that can help locate rows to update or delete. With this feature, index cardinality information from ANALYZE TABLE is considered when selecting the index to use (before this feature, the first index was selected unconditionally).

Documentation.

Early release during prepare phase of XtraDB row locks

This feature adds an option to make XtraDB release the row locks for a transaction earlier during the COMMIT step when running with --sync-binlog=1 and --innodb-flush-log-at-trx-commit=1. This can improve throughput if the workload has a bottleneck on hot-spot rows.

Documentation.

PBXT consistent commit ordering

This feature implements the new commit ordering storage engine API in PBXT. With this feature, it is possible to use START TRANSACTION WITH CONSISTENT SNAPSHOT and get consistency among transactions that involve both XtraDB and PBXT. (Without this feature, there is no such consistency guarantee. For example, even after running START TRANSACTION WITH CONSISTENT SNAPSHOT it was still possible for the InnoDB/XtraDB part of some transaction T to be visible while the PBXT part of the same transaction T was not yet visible.)

Documentation.

Miscellaneous

  • Small change to make mysqlbinlog omit redundant USE statements around BEGIN/SAVEPOINT/COMMIT/ROLLBACK events when reading MySQL 5.0 binlogs.

Christmas @ MariaDB

The Danish "julehjerte" is apparently a Danish/Northern Europe Christmas tradition (at least according to Wikipedia). But hopefully people outside this region will also be able to enjoy this variant:

[Photo: a MariaDB-themed julehjerte]

I have been making "julehjerter" ever since I was a small kid, and every Christmas I try to do something different with them. As seen above, this year I decided to combine the tradition with the MariaDB logo, and I am frankly quite pleased with the result :-)

The future of replication revealed in Istanbul

A very good meeting in Istanbul is drawing to an end. People from Monty Program, Facebook, Galera, Percona, SkySQL, and other parts of the community are meeting with one foot on the European continent and another in Asia to discuss all things MariaDB and MySQL and experience the mystery of the Orient.

At the meeting I had the opportunity to present my plans and visions for the future development of replication in MariaDB. My talk was very well received, and I had a lot of good discussions afterwards with many of the bright people here. Working from home in a virtual company, I find it means a lot to get this kind of inspiration and encouragement from others on occasion, and I am looking forward to continuing the work after an early flight to Copenhagen tomorrow.

The new interface for transaction coordinator plugins is what particularly interests me at the moment. The immediate benefit of this work is working group commit for transactions with the binary log enabled. But just as interesting (if more subtle), the project is an enabler for several other nice features related to hot backup and recovery. I spent a lot of effort on the interfaces to the transaction coordinator and the related extensions to the storage engine API, and I think the result is quite solid and a good basis for the coming work.

After the transaction coordinator plugin, the next step is an API for event generators that will allow plugins to receive replication events on an equal footing with the built-in MySQL binary log implementation; I will be using this in cooperation with Codership to more tightly integrate their Galera synchronous replication into MariaDB. And long-term, I am hoping to combine all of the pieces to finally start attacking the general problem of parallel execution of events on replication slaves, the solution of which is long overdue.

(The MariaDB replication project page has lots of pointers to more information on the various projects for anyone interested).

Almost too good to be true, our excursion today was blessed with sunshine and mild weather after countless days of rain and storm. There were even rumours of sightings of dolphins jumping during the SkySQL excursion yesterday. So while lots of hard work remains, all in all the omens seem good for the future of replication in MariaDB!

Dynamic linking costs two cycles

It turns out that the overhead of dynamic linking on Linux amd64 is 2 CPU cycles per cross-module call. I usually take forever to get to the point in my writing, so I thought I would change this for once :-)

In MySQL, there has been a historical tendency to favour static linking, in part to avoid the execution overhead associated with dynamic linking. However, on modern systems static linking also has very serious drawbacks.

The particular issue that inspired this article is that I was working on MWL#74, building a proper shared libmysqld.so library for the MariaDB embedded server. The lack of a proper libmysqld.so in MySQL and MariaDB has caused no end of grief for packaging Amarok for the various Linux distributions. My patch increases the amount of dynamic linking (in a default build), so I did a quick test to get an idea of the overhead of this.

ELF dynamic linking

The overhead comes from the way dynamic linking works in ELF-based systems like Linux (and many other POSIX-like operating systems). Code in shared libraries must be compiled to be position-independent, achieved with the -fPIC compiler switch. This allows the loader to simply mmap() the image of a shared library into the process address space at whatever free memory space is available, and the code can run without any need for the loader to do any kind of relocations of the code. For a much more detailed explanation see for example this article.

When generating position-independent code for a function call into another shared object, the compiler cannot generate a simple absolute call instruction, as the destination address is not known until run-time. Instead, an indirect jump is generated that fetches the destination address from a table called the PLT, short for Procedure Linkage Table. For example:

                       callq  0x400680 <mylib_myfunc@plt>
...
<mylib_myfunc@plt>:    jmpq   *0x200582(%rip)

The indirect jump resolves at runtime into the address of the real function to be called, so that is the overhead of the call when using dynamic linking: one indirect jump instruction.

Micro-benchmarking

To measure this one-instruction overhead in terms of execution time, I used the following code:

    for (i= 0; i < count; i++)
      v= mylib_myfunc(v);

The function mylib_myfunc() is placed in a library, with the following code:

    int mylib_myfunc(int v) {return v+1;}
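
To make the setup reproducible, here is a minimal, self-contained sketch of a benchmark harness in this style (the timing code and exact build commands are illustrative, not the precise program used for the numbers below):

    /* bench.c -- measure the per-call cost of mylib_myfunc().
       Build mylib once as a static archive and once as a shared object,
       link this program against each, and compare the timings. */
    #include <stdio.h>
    #include <time.h>

    extern int mylib_myfunc(int v);

    int main(void)
    {
      const long long count= 1000000000LL;      /* one billion calls */
      long long i;
      int v= 0;
      struct timespec t1, t2;
      double secs;

      clock_gettime(CLOCK_MONOTONIC, &t1);
      for (i= 0; i < count; i++)
        v= mylib_myfunc(v);
      clock_gettime(CLOCK_MONOTONIC, &t2);

      secs= (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) * 1e-9;
      printf("total %.2f sec, %.2f ns/call (v=%d)\n",
             secs, secs * 1e9 / count, v);
      return 0;
    }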

I tested this with both static and dynamic linking on a Core 2 Duo 2.4 GHz machine running Linux amd64. Here are the results from running the loop for 1,000,000,000 (one billion) operations:

                      total time (sec.)   CPU cycles/iteration
    Static linking    2.54                6
    Dynamic linking   3.38                8

So that is the two CPU cycles of overhead per call that I referred to at the start of this post.

Incidentally, if you try stepping through the call with a debugger, you will see a much larger overhead for the very first call. Do not be fooled by this; it is just because the loader fills in the PLT lazily, computing the correct destination address only the first time the call is made (so the addresses of functions that are never called by a process need never be computed). See the above-referenced article for more details.

(Note that this is for 64-bit amd64. For 32-bit x86, the mechanism is similar, but the actual overhead may be somewhat larger, since that architecture lacks program-counter-relative addressing and so must reserve one register, %ebx, out of its already quite limited register bank for this purpose. I did not measure the 32-bit case; I think it is of little interest nowadays for high-performance MySQL or MariaDB deployments (and the overhead of function calls on 32-bit x86 is significantly higher anyway, dynamic linking or not, due to the need to push and pop all arguments to/from the stack).)

Conclusion

Two cycles per call is, in my opinion, a very modest overhead. It is hard to imagine high-performance code where this would have a noticeable real-life effect. Modern systems rely heavily on dynamic linking, and static linking nowadays causes many more problems than it solves. I think it is time to put the efficiency argument for static linking to rest.

Micro-benchmarking pthread_cond_broadcast()

In my work on group commit for MariaDB, I have the following situation:

A group of threads are going to participate in group commit. This means that one of the threads, called the group leader, will run an fsync() for all of them, while the other threads wait. Once the group leader is done, it needs to wake up all of the other threads.

The obvious way to do this is to have the group leader call pthread_cond_broadcast() on a condition that the other threads are waiting for with pthread_cond_wait():

  bool wakeup= false;
  pthread_cond_t wakeup_cond;
  pthread_mutex_t wakeup_mutex;

Waiter:

  pthread_mutex_lock(&wakeup_mutex);
  while (!wakeup)
    pthread_cond_wait(&wakeup_cond, &wakeup_mutex);
  pthread_mutex_unlock(&wakeup_mutex);
  // Continue processing after group commit is now done.

Group leader:

  pthread_mutex_lock(&wakeup_mutex);
  wakeup= true;
  pthread_cond_broadcast(&wakeup_cond);
  pthread_mutex_unlock(&wakeup_mutex);

Note the association of the condition with a mutex. This association is inherent in the way pthread condition variables work. The mutex must be locked when calling pthread_cond_wait(); it is released while the thread waits and re-acquired before the call returns. (Check the man page for pthread_cond_wait() for details).

Now, when I think about how these condition variables work, something strikes me as somewhat odd.

The idea is that the broadcast signals every waiting thread to wake up. However, because of the associated mutex, only one thread will actually be able to wake up: that thread will obtain the lock on the mutex, and all the other to-be-awoken threads will have to wait for the mutex! Only after the first thread releases the mutex can the next thread wake up holding it; only after that thread releases it can the third thread wake up, and so on.

So if we have say 100 threads waiting, the last one will have to wait for the first 99 threads to each be scheduled and each release the mutex, one after the other in a completely serialised fashion.

But what I really want is to just let them all run at once in parallel (or at least as many as my machine has spare cores for). There is another way to achieve this, by simply using a separate condition and mutex for each thread, and have the group leader signal each one individually:

Waiter:

  pthread_mutex_lock(&me->wakeup_mutex);
  while (!me->wakeup)
    pthread_cond_wait(&me->wakeup_cond, &me->wakeup_mutex);
  pthread_mutex_unlock(&me->wakeup_mutex);

Group leader:

  /* Walk the list of waiting threads, waking each one on its own condition. */
  for (waiter= first_waiter; waiter != NULL; waiter= waiter->next)
  {
    pthread_mutex_lock(&waiter->wakeup_mutex);
    waiter->wakeup= true;
    pthread_cond_signal(&waiter->wakeup_cond);
    pthread_mutex_unlock(&waiter->wakeup_mutex);
  }

This way, every waiter is free to start running as soon as it is woken up by the leader; no waiters have to wait for one another. This seems advantageous, especially as the number of cores increases (rumour has it that 48-core machines are becoming commodity).

"Seems" advantageous. But is it really? Let us micro-benchmark it.

For this, I start up 5000 threads. Each thread goes to wait on a condition, either a single shared one, or distinct in each thread. The main program then signals the threads to wakeup, either with a single pthread_cond_broadcast(), or with one pthread_cond_signal() per thread. Each thread records the time they woke up, and the main program collects these times and computes how long it took between starting to signal the condition(s) and wakeup of the last thread. (Here is the full C source code for the test program).
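
For readers who just want the gist without reading the full program, here is a heavily condensed sketch of the broadcast variant (error handling, the per-thread-condition variant, and the exact timing bookkeeping are omitted; the thread count matches the description above):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    #define NUM_THREADS 5000

    static pthread_mutex_t mutex= PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond= PTHREAD_COND_INITIALIZER;
    static int wakeup= 0;
    static struct timespec wake_time[NUM_THREADS];

    static void *waiter(void *arg)
    {
      long idx= (long)arg;
      pthread_mutex_lock(&mutex);
      while (!wakeup)
        pthread_cond_wait(&cond, &mutex);
      pthread_mutex_unlock(&mutex);
      /* Record when this thread actually got to run again. */
      clock_gettime(CLOCK_MONOTONIC, &wake_time[idx]);
      return NULL;
    }

    int main(void)
    {
      static pthread_t threads[NUM_THREADS];
      struct timespec start;
      long i;

      for (i= 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, waiter, (void *)i);
      sleep(2);                   /* Crude: let all threads reach the wait. */

      clock_gettime(CLOCK_MONOTONIC, &start);
      pthread_mutex_lock(&mutex);
      wakeup= 1;
      pthread_cond_broadcast(&cond);   /* the variant being measured */
      pthread_mutex_unlock(&mutex);

      for (i= 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
      /* The reported number is the difference between 'start' and the
         latest entry in wake_time[]; that bookkeeping is omitted here. */
      printf("all %d threads woken\n", NUM_THREADS);
      return 0;
    }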

I ran the program on an Intel quad-core Core i7 with hyperthreading enabled, the most parallel machine I have easy access to. The results are as follows:

pthread_cond_broadcast(): 46.9 msec
pthread_cond_signal(): 17.6 msec

Conclusion: pthread_cond_broadcast() is slower, as I speculated. I would expect the effect to be more pronounced on systems with more cores; it would be interesting if readers with access to such systems could try the test program and comment below on the results.

MySQL/MariaDB replication: applying events on the slave side

Working on a new set of replication APIs in MariaDB, I have given some thought to the generation of replication events on the master server.

But there is another side of the equation: applying the generated events on a slave server. This is something that most replication setups will need (unless they replicate to non-MySQL/MariaDB slaves). So it will be good to provide a generic interface for this; otherwise every binlog-like plugin implementation will have to re-invent it itself.

A central idea in the current design for generating events is that we do not enforce a specific content of events. Instead, the API provides accessors for a lot of different information related to each event, allowing the plugin flexibility in choosing what to include in a particular event format. For example, one plugin may request column names for a row-based UPDATE event; another plugin may not need them and can avoid any overhead related to column names simply by not requesting them.

To get the same flexibility on the slave side, the roles of plugin and API are reversed. Here, the plugin will have a certain pre-determined (by the particular event format implemented) set of information related to the event available. And the API must make do with whatever information it is provided (or fail gracefully if essential information is missing).

My idea is that the event application API will provide events corresponding to the events in the generation API. Each application event will have "provider" methods corresponding to the accessor methods of the generator API. So a plugin that wants to apply an event can obtain an event application object, call the appropriate provider methods for all the information it has available, and finally ask the API to execute the event with the provided context information.
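
To make the idea slightly more concrete, here is a purely hypothetical sketch of how such provider calls might look from a plugin's point of view; every name below is invented for illustration, nothing like this exists in the code yet:

    /* Hypothetical sketch only; none of these names exist in the server. */
    #include <stddef.h>

    struct rpl_apply_context;                /* opaque, provided by the API */
    struct rpl_apply_event;
    enum rpl_event_type { RPL_EVENT_QUERY };

    /* Imagined API entry points ("provider" methods). */
    struct rpl_apply_event *rpl_apply_begin(struct rpl_apply_context *ctx,
                                            enum rpl_event_type type);
    void rpl_apply_set_query(struct rpl_apply_event *ev,
                             const char *query, size_t len);
    void rpl_apply_set_default_db(struct rpl_apply_event *ev, const char *db);
    int  rpl_apply_execute(struct rpl_apply_event *ev);
    void rpl_apply_end(struct rpl_apply_event *ev);

    /* How a binlog-like plugin might use it to apply one query event. */
    int plugin_apply_query_event(struct rpl_apply_context *ctx,
                                 const char *query, size_t query_len,
                                 const char *db)
    {
      int err;
      struct rpl_apply_event *ev= rpl_apply_begin(ctx, RPL_EVENT_QUERY);
      rpl_apply_set_query(ev, query, query_len);   /* provide what we have */
      rpl_apply_set_default_db(ev, db);
      /* Context the plugin's format does not carry is simply not provided;
         the API must make do with what it gets, or fail gracefully. */
      err= rpl_apply_execute(ev);
      rpl_apply_end(ev);
      return err;
    }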

This is only an abstract idea at this point; there are lots of details to take care of to make this idea into a concrete design proposal. And I have not fully decided if such an API will be part of the replication project or not. But I like the idea so far.

Understanding how MySQL binlog events are applied on the slave

I wanted to get a better understanding of what is involved in an event application API like the one described above. So I did a similar exercise to the one I wrote about in my last post, where I went through in detail all the information that the existing MySQL binlog format includes. This time I went through the details in the code that applies MySQL binlog events on a slave.

Again, I concentrate on the actual events that change the database, ignoring (most) details that relate only to the particular binlog format used by MySQL (and there are quite a few :-).

At the top level, the slave SQL thread (in exec_relay_log_event()) reads events (next_event()) from the relay logs and executes (apply_event_and_update_pos()) them.

There are a number of intricate details here relating to switching to a new relay log (and purging old ones), and re-winding in the relay log to re-try a failed transaction (eg. in case of deadlock or the like). This is mostly specific to the particular binlog implementation.

The actual data changes are mostly done in the do_apply_event() methods of each event class in sql/log_event.cc. I will go briefly through this method for the events that are used to change actual data in current MySQL replication. It is relatively easy to read a particular detail out of the code, since it is all located in these do_apply_event() methods.

(One problem, however, is that there are special cases sprinkled across the code where special action is taken (or not taken) when running in the slave SQL thread. I have not so far tried to determine the full list of such special cases, or assess how many there are.)

Query_log_event

The main task done here is to set up various context in the THD, execute the query, and then perform necessary cleanup. If the query throws an error, there is also some fairly complex logic to handle this error correctly; for example to ignore certain errors, to require same error as on master (if the query failed on the master in some particular way), and to re-try the query/transaction for certain errors (like deadlock).

There are also some hard-coded special cases for NDB (this seems to be a common theme in the replication code).

The main thing that to my eyes makes this part of the code complex is the set of actions taken to prepare context before executing the query, and to clean up after execution. Each individual step in the code is in fact relatively easy to follow (and often the commenting is quite good). The problem is that there are so many individual steps. It is very hard to feel sure that exactly this set of actions is sufficient (and that none are redundant, for that matter).

For example, code like this (not complete, just a random part of the setup):

    thd->set_time((time_t)when);
    thd->set_query((char*)query_arg, q_len_arg);
    VOID(pthread_mutex_lock(&LOCK_thread_count));
    thd->query_id = next_query_id();
    VOID(pthread_mutex_unlock(&LOCK_thread_count));
    thd->variables.pseudo_thread_id= thread_id;

It seems very easy to forget to assign query_id or whatever if one were to write this from scratch. That is something I would really like to improve in an API for replication plugins: it should be possible to understand completely exactly what setup and cleanup is needed around event execution, and there should be appropriate methods to achieve such setup/cleanup.

Another thing that is interesting in the code for Query_log_event::do_apply_event() is that the code does strcmp() on the query against keywords like COMMIT, SAVEPOINT, and ROLLBACK, and follows a fairly different execution path for these. This seems to bypass the SQL parser! But in reality, these particular queries are generated on the master with special code in the server that carefully restricts the possible format (eg. no whitespace or comments etc). So in effect, this is just a way to hack in special event types for these special queries without actually adding new binlog format events.

Rand_log_event, Intvar_log_event, and User_var_log_event

The action taken for these events is essentially to update the THD with the information in the event: value of random seed, LAST_INSERT_ID/INSERT_ID, or @user_variable.

Xid_log_event

On the slave side, this event is essentially a COMMIT. One thing that springs to mind is how different the code to handle this event is from the special code in Query_log_event that handles a query "COMMIT". Again, it is very hard to tell from the code if this is a bug, or if the different-looking code is in fact equivalent.

Begin_load_query_log_event, Append_block_log_event, Delete_file_log_event, and Execute_load_query_log_event

This implements the LOAD DATA INFILE query, which needs special handling as it originally references a file on the master server (or on the client machine). The actual execution of the query is handled the same way as for normal queries (Execute_load_query_log_event is a sub-class of Query_log_event), but some preparation is needed first to write the data from the events in the relay log into a temporary file on the slave.

The main thing one notices about this code is how it handles re-writing the LOAD DATA query to use a new temporary file name on the slave. Take for example this query:

    LOAD DATA CONCURRENT /**/ LOCAL INFILE 'foobar.dat' REPLACE INTO /**/ TABLE ...

The event includes offsets into the query string of the two places marked /**/ in the example. This part of the query is then re-written in the slave code (so it is not just replacing the filename). Again, the code bypasses the SQL parser; it just so happens that the SQL syntax in this case is sufficiently simple that this is not too hackish to do. If one were to check, one would probably see that any user comments in this particular part of the query string disappear from the slave binlog if --log-slave-updates is enabled.

Table_map_log_event

This just links a struct RPL_TABLE_LIST into a list, containing information about the tables described by the event. The actual opening of the table is done when executing the first row event (WRITE/UPDATE/DELETE).

Write_rows_log_event, Update_rows_log_event, and Delete_rows_log_event

These are what handle the application of row-based replication events on a slave.

The first of the row events causes some setup to be done; a partial extract is this:

    lex_start(thd);
    mysql_reset_thd_for_next_command(thd, 0);
    thd->transaction.stmt.modified_non_trans_table= FALSE;
    query_cache.invalidate_locked_for_write(rli->tables_to_lock);

(As remarked for Query_log_event, while row event application setup is somewhat simpler, it still appears a bit magic that exactly these setups are sufficient and necessary).

The code also switches to row-based binlogging for the following row operations (this is for --log-slave-updates, as it is not possible to binlog the application of row-based events as statement-based events in the slave binlog). This is by the way an interesting challenge for a generic replication API: how does one handle binlogging of events applied on a slave, for daisy-chaining replication servers? This of course gets more interesting if one were to use a different binlogging plugin on the slave than on the master. I need to think more about it, but this seems to be a pretty strong argument that a generic event application API is needed, which hooks into event generators to properly generate all needed events for the updates done on slaves. Another important aspect is the support of a global transaction ID, that will identify a transaction uniquely across an entire replication setup to make migrating slaves to a new master easier. Such a global transaction ID also needs to be preserved in a slave binlog when replicating events from a master.

During execution of the first row event, the code also sets the flags for foreign key checks and unique checks that are included in the event from the master. And it checks whether the event should be skipped (for --replicate-ignore-db and friends).

A check is made to ensure that the table(s) on the slave are compatible with the tables on the master (as described in the table map event(s) received just before the row event(s)). The MySQL row-based replication has a fair bit of flexibility in terms of allowing differences in tables between master and slave, such as allowing different column types, different storage engine, or even extra columns on the slave table. In particular allowing extra columns raises some issues about default values etc. for these columns, though I did not really go into details about this.

Again, there are a number of hard-coded differences for NDB.

For Write_rows_log_event, flags need to be set to ensure that column values for AUTO_INCREMENT and TIMESTAMP columns are taken from the supplied values, not auto-generated. If slave_exec_mode=IDEMPOTENT, an INSERT that fails due to an already existing row does not cause replication to fail; instead an UPDATE is tried, or in some cases (like when there are foreign keys) a DELETE plus a re-tried INSERT. There is also code to hint storage engines about the approximate number of rows that are part of a bulk insert.

For Update_rows_log_event and Delete_rows_log_event, the code needs to locate the row to update/delete. This is done by primary key if one exists on the slave table (the current binlogging always includes every column in the before image of the row, so the primary key value of the row to modify is always available). But there is also support for tables with no primary key, in which case the first index on the table (if any) is used to locate the row, or failing that a full table scan. This, by the way, is a good reminder not to use row-based replication on primary-key-less tables with a non-trivial number of rows: every row operation applied will need a full table scan!

Finally, the values from the event are unpacked into a row buffer in the format used by MySQL storage engines, and the read_set and write_set are set up (current replication always includes all columns in row operations), before the actual ha_write_row(), ha_update_row(), or ha_delete_row() call into the storage engine handler is made to perform the actual update. Note that a single row event can include multiple rows, which are applied one after the other.

Final words

And that is it! Quite a bit of detail, but again I found it very useful to create this complete overview; it will make things easier when re-implementing this in a new replication API.

Dissecting the MySQL replication binlog events

For the replication project that I am currently working on in MariaDB, I wanted to understand exactly what information is needed to do full replication of all MySQL/MariaDB statements on the level of completeness that existing replication does. So I went through the code, and this is what I found.

What I am after here is a complete list of what the execution engine needs to provide to have everything that a replication system needs to be able to completely replicate all changes made on a master server. But not anything specific to the particular implementation of replication used, like binlog positions or replication event disk formats, etc.

The basic information needed is of course the query (for statement-based replication), or the column values (for row-based replication). But there are lots of extra details needed, especially for statement-based replication. I need to make sure that the replication API we are designing will be able to provide all needed information, and it was always nagging in the back of my head that there would be lots and lots of small bits in various corners that would be missing and cause problems. So it was good to get this overview. It turns out that there are quite a few details, but not so many that it becomes unmanageable.

All of the events that are used in replication are listed in enum Log_event_type in sql/log_event.h. So anything needed for complete replication can be found here, but mixed up with lots of other details about the MySQL binlog implementation, backwards compatibility, etc. So what follows is an extract from log_event.cc of the actual change information contained in those events.

Statement-based replication

QUERY_EVENT

The main event for statement-based replication is QUERY_EVENT. It contains the query to be executed (as a string) and some information to provide the context for correct execution. Here is the list of information:

  • SQL query.
  • Default database for the query (eg. from USE statement).
  • The setting of some server variables in effect at the time the query was run:
    • sql_mode.
    • autocommit (whether autocommit is enabled).
    • Character set and collation at various levels (see the section 9.1.4. Connection Character Sets and Collations in the MySQL manual for background on these):
      • Client (character_set_client).
      • Connection (character_set_connection).
      • Server (character_set_server).
      • Current default database (character_set_database; note that there are few statements that rely on this, comments in the code say it is only LOAD DATA).
    • foreign_key_checks (whether foreign keys are checked).
    • unique_checks (whether unique constraint checks are enforced).
    • auto_increment_offset and auto_increment_increment.
    • Time zone of the master database server.
    • Names to use for days and months; this is identified by a code that is mapped to a table of names to use in sql/sql_locale.cc.
    • sql_auto_is_null (whether SELECT ... WHERE autoinc IS NULL returns last insert id for ODBC compatibility).
  • Error code from executing the query on the master (for non-transactional statements that may still make permanent changes even though they fail mid-way; on the slave the query should fail with the same error).
  • Connection ID (this is used to correctly distinguish TEMPORARY TABLEs with same name used in different connections on the master simultaneously).

Note that not all of this information is replicated in all query events, as not all of it is needed for a given query. But a replication API must make the information available for the queries where it is needed.
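
Purely as an illustration, the list above could be collected into a single context structure along these lines (field names are invented; the real binlog format stores the information scattered across the event, and only when needed):

    /* Hypothetical illustration only: the execution context from the list
       above, gathered into one struct. */
    struct statement_replication_context
    {
      const char *query;                 /* the SQL text itself */
      const char *default_db;            /* current database (USE ...) */
      unsigned long long sql_mode;
      int autocommit;
      unsigned int charset_client;       /* character_set_client */
      unsigned int charset_connection;   /* character_set_connection + collation */
      unsigned int charset_server;
      unsigned int charset_database;     /* only a few statements (LOAD DATA) need this */
      int foreign_key_checks;
      int unique_checks;
      unsigned long long auto_increment_offset;
      unsigned long long auto_increment_increment;
      const char *time_zone;             /* master's time zone */
      unsigned int lc_time_names;        /* code for day/month name locale */
      int sql_auto_is_null;
      unsigned int master_error_code;    /* expected error, 0 if none */
      unsigned long connection_id;       /* to disambiguate TEMPORARY tables */
    };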

INTVAR_EVENT, RAND_EVENT, and USER_VAR_EVENT

These events provide additional context for executing a query on the slave:

  • Value of LAST_INSERT_ID (for queries that reference it).
  • Value of INSERT_ID (to get same auto_increment numbers for inserts on the slave as on the master).
  • The random seed (so RAND() can return same values in queries on slaves as on the master).
  • The values for any @user_variables referenced in a query

BEGIN_LOAD_QUERY_EVENT, APPEND_BLOCK_EVENT, EXECUTE_LOAD_QUERY_EVENT, and DELETE_FILE_EVENT

These four events are used to do statement replication of LOAD DATA INFILE. The contents of the file to be loaded is sent in blocks in BEGIN_LOAD_QUERY_EVENT followed by zero or more APPEND_BLOCK_EVENT. Then the actual query is sent in EXECUTE_LOAD_QUERY_EVENT, which is a variant of QUERY_EVENT that replaces the original filename with the name of a temporary file on the slave and deletes the temporary file afterwards (DELETE_FILE_EVENT is used in certain error cases).

This is the complete story of exactly how much information needs to be provided on the master to make statement replication work as it does currently in MySQL. If you think this is a little bit scary in terms of complexity, I tend to agree with you ;-). There is a lot to be said for the comparative simplicity of row-based replication (and it is also interesting to see the history of bug fixes in MySQL 5.1 that have gradually moved more and more statements to be replicated row-based (in mixed-mode binlogging) due to corner cases where statement-based replication can fail).

Still, once we have the list of information, it is not that hard to provide the information in a pluggable replication API for any implementations that want to try their luck with statement-based replication. And of course, row-based replication only handles INSERT/UPDATE/DELETE! We also need to support CREATE TABLE and similar statements, for which it is still useful to know the above exhaustive list of information that may be needed in one form or another.

Row-based replication

In row-based replication, each DML statement is binlogged in two parts. First the tables modified in the query are described with TABLE_MAP_EVENT, and second the row values changed are logged with WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, or DELETE_ROWS_EVENT.

TABLE_MAP_EVENT

The information describing modified tables in row-based replication is as follows:

  • Database name.
  • Table name.
  • List of columns in the table. For each column, the following information is included:
    • Column type (this is field->type()).
    • Column metadata (this is what is returned by field->save_field_metadata(); this is for example the maximum length of a VARCHAR, the precision and number of decimals in DECIMAL, etc.)
    • Whether the column is NULL-able.
  • Table map id; this is just an internally generated unique number for subsequent events to refer to the table described.

Note in particular that column names are not used/needed in current MySQL/MariaDB row-based replication. I personally think this is a good way to do it. However, in a generic API, it will make sense to make the full table definition available to implementations, each of which can choose what and how to log in terms of table metadata.

WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, and DELETE_ROWS_EVENT

These events handle replication of respectively INSERT, UPDATE, and DELETE (and similar statements like REPLACE etc.) They contain the following information:

  • Table map id, referencing a table previously described with TABLE_MAP_EVENT.
  • Value of foreign_key_checks and unique_checks, similar to statement-based binlogging (but for row-based, those two are all the context stored, though see the remarks below).
  • List (bitmap really) of columns updated. This is essentially the write_set that is used in the storage engine API (but see below for explanation). For UPDATE, there are two bitmaps, one for the before image and one for the after image.
  • List of records containing the values of each column modified. There is one such record for every row update logged. For UPDATE there are two records for each row update, one for the before image (values before the update was done) and one for the after image (values after the update was done).

I must say, investigating how these row-based events are implemented in MySQL really makes the feature seem rather half-baked. There are several issues:

  • The lists/bitmaps of columns updated sound useful, but in reality they are set unconditionally to include all columns (except for NDB).
  • This also means all columns in a row are always sent, even for DELETE and UPDATE. Except for NDB, which only logs needed columns.
  • Some of the "extra" flags in the storage engine API are not included, such as HA_EXTRA_WRITE_CAN_REPLACE. This is actually a bug, as it means that a storage engine using such flags to optimise its operation will not replicate correctly. In the existing MySQL source, only NDB uses this flag, but NDB does special tricks for binlogging and slave replication which avoids this particular issue in most cases.

I strongly suspect that some of this half-baking was done in a quick-and-dirty attempt to squeeze NDB replication in. At least, there are several "this is only used by NDB" type comments in the vicinity of these things in the source code.

In any case, for the replication API, it is probably a good idea to re-think this part and make sure that the information logged for row updates is complete and sane for all reasonable use cases.

Thoughts

My idea is to have a replication API that provides for generation and consumption of events completely separate from any details of the actual format of events in the binlog or any other method used to store or process the events. This will allow replication plugins that use a completely different binlog implementation, or even has no binlog at all.

So such an API needs to provide all of the above information (to allow re-implementing the existing binlog/replication as a plugin, if for no other reason), but need not provide such information in any particular event format. In fact, I am trying to make the API so that such information need not be materialised in structures or memory buffers at all; instead relying on providing accessor methods, so that an implementation can request just the information it needs, and materialise it as or if needed.

On top of this I still think it makes sense to define a standard (but optional) materialised event format, so that more light-weight plugins can be written that can do interesting things with replication without having to implement a full new event format each time. I am still considering whether to extend the existing binlog format (which is not all that attractive, as it is not very easily extensible), or whether to define a new more flexible format (for example based on the Google protobuffer library).

More on the existing binlog format

Just for completeness, here is some additional description of the existing MySQL/MariaDB 5.1 binlog format. These are things that I believe are not required in a new API, as they are mostly internal implementation details. However, as I had to go through them anyway while finding the stuff that does need to be in the API, I will include a brief description here.

Additional query information

Some additional information, which is mostly redundant, is included with query events for statement-based binlogging:

  • Bitmap of tables affected by a multi-table update (this makes it possible to know which tables will be updated without parsing the query, e.g. for filtering events based on database/table name).
  • Time spent in query on master.
  • Catalog (I believe this is old unused stuff. The idea is that each database belongs to a catalog, but I have never seen this actually used anywhere).
  • A flag LOG_EVENT_THREAD_SPECIFIC_F which is set if the query uses TEMPORARY table (allows to get this information without parsing the query).
  • A flag LOG_EVENT_SUPPRESS_USE_F set in a few cases when the master knows that the query is independent of what the current database is (so that a possible USE statement can be optimised away).

Binlog specific events

These are events that are specific to the binlog implementation:

XID_EVENT
This is used to record a transaction ID for each transaction written to the binlog in 2-phase commit. This recorded ID is needed during crash recovery on the master to know which prepared transactions in transactional engines need to be recovered to get consistency with what is in the binlog. It is not used on the slave in replication (though this event implies a COMMIT, which _does_ have an effect on the slave, of course).
FORMAT_DESCRIPTION_EVENT
This event is written at the start of every binlog file. It provides to slaves reading the binlog the master server version and the event size of all following events, thereby providing some facilities for extending event formats while maintaining backwards compatibility.
STOP_EVENT
This is logged when the master shuts down gracefully (though I do not think this is used much, if at all).
ROTATE_EVENT
This is logged at the end of a binlog file when the master starts a new binlog file. It is needed by the slave to reset its master binlog position so that the IO thread can proceed correctly from the next binlog file. (Incidentally, it is a clear weakness in the binlog implementation that the slaves need knowledge of binlog file names and data offsets on the master server; this is a cause of much complexity when switching masters in advanced replication topologies, and something that really needs improvement in the near future.)
INCIDENT_EVENT
This is logged by the master when something bad happens that may cause replication to fail/diverge, so that the slave can be notified of the problem and stop, informing the DBA/sysadm to resolve the issue.

Obsolete events

Finally there are a number of events that are no longer generated (but which are still important for the slave replication code to handle to be able to work with masters of older versions):

LOAD_EVENT, NEW_LOAD_EVENT, CREATE_FILE_EVENT, and EXEC_LOAD_EVENT
Various old events for handling LOAD DATA INFILE (as can be seen, LOAD DATA INFILE has had some changes in replication over the years :-).
START_EVENT_V3
Old version of FORMAT_DESCRIPTION_EVENT.
PRE_GA_WRITE_ROWS_EVENT, PRE_GA_UPDATE_ROWS_EVENT, and PRE_GA_DELETE_ROWS_EVENT
Old versions of the row-based replication binlog events.
SLAVE_EVENT
Not used; I think it may have been related to some feature that was never completed.

Fixing MySQL group commit (part 4 of 3)

(No three-part series is complete without a part 4, right?)

Here is an analogy that describes well what group commit does. We have a bus driving back and forth transporting people from A to B (corresponding to fsync() "transporting" commits to durable storage on disk). The group commit optimisation is to have the bus pick up everyone that is waiting at A before driving to B, not drive people one by one. Makes sense, huh? :-)

It is pretty obvious that this optimisation of having more than one person in the bus can dramatically improve throughput, and it is the same for the group commit optimisation. Here is a graph from a benchmark comparing stock MariaDB 5.1 vs. MariaDB patched with a proof-of-concept patch that enables group commit:

[Graph: benchmark results, queries per second vs. number of concurrent threads, with and without group commit, on HDD and SSD]

When group commit is implemented, we see clearly how performance (measured in queries per second) scales dramatically as the number of threads increases. Whereas with stock MariaDB with no group commit, there is no scaling at all. We also see that SSD is better than HDD (no surprise there), but that with sufficient parallelism from the application, group commit can to a large extent compensate for the slower disks.

This is the same benchmark as in the first part of the series. The binlog is enabled. Durability is enabled with sync_binlog=1 and innodb_flush_log_at_trx_commit=1 (and the disk cache disabled to prevent the disks from lying about when data is durable). The load is single-row transactions against a 1,000,000-row XtraDB table. The benchmark is thus specifically designed to make the fsync() calls at the end of commit the bottleneck.

I should remark that I did not really tune the servers used in the benchmark for high parallelism (except for raising max_connections :-), and I ran the client on the same machine as the server. So it is likely that there are effects other than group commit influencing the performance at high parallelism (especially in the SSD results, which I ran on my laptop). But I just wanted to see if my group commit work scales with higher parallelism, and the graphs clearly show that it does!

Architecture

For this work, I have focused a lot on the API for storage engine and binlog plugins (we do not have binlog plugins now, but this is something that we will be working on in MariaDB later this year). I want a clean interface that allows plugins to implement group commit in a simple and efficient manner.

A crucial point is the desire to get commits ordered the same way in the different engines (i.e. typically in InnoDB and in the binlog), as I discussed in previous articles. As group commit is about parallelism, and ordering is about serialisation, these two tend to come into conflict. My idea is to introduce new calls in the interface to storage engines and the XA transaction coordinator (which is how the binlog interacts with commit internally in the server). These new calls allow plugins that care about commit order to cooperate on getting correct ordering without getting in each other's way and killing parallelism. Plugins that do not need any ordering can ignore the new calls, which are optional (for example, the transaction coordinator that runs when the binlog is disabled does not need any ordering).

The full architecture is written up in detail in the MariaDB Worklog#116. But the basic idea is to introduce a new handlerton method:

    void commit_ordered(handlerton *hton, THD *thd, bool all);

This is called just prior to the normal commit() method, and is guaranteed to run in the same commit order across all engines (and binlog) participating in the transaction.

This allows for a lot of flexibility in plugin implementations. A typical implementation would in the commit_ordered() method write the transaction data into its in-memory log buffer, and delay the time-consuming write() and fsync() to the parallel commit() method. InnoDB/XtraDB is already structured in this way, so fits very well into this scheme.

But if an engine wants to use another approach, for example a ticket-based approach as Mark and Mats suggested, that is easy to do too. Just allocate the ticket in commit_ordered(), and use it in commit(). I believe most approaches should fit in well with the proposed model.
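
As a rough illustration of the ticket idea, not actual engine code, an engine could do something along these lines (the two method signatures follow the proposal above; the ticket bookkeeping and helper usage are assumptions, written against <pthread.h>, <stdint.h>, and the server's plugin headers):

    /* Sketch of a ticket-based engine implementation; all names except the
       handlerton method signatures are made up for illustration. */
    static pthread_mutex_t ticket_mutex= PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t ticket_cond= PTHREAD_COND_INITIALIZER;
    static unsigned long long next_ticket= 1;     /* next ticket to hand out */
    static unsigned long long commit_turn= 1;     /* ticket allowed to commit */

    /* Runs in binlog commit order, guaranteed by the server. Must be fast. */
    static void example_commit_ordered(handlerton *hton, THD *thd, bool all)
    {
      pthread_mutex_lock(&ticket_mutex);
      thd_set_ha_data(thd, hton, (void *)(uintptr_t)next_ticket++);
      pthread_mutex_unlock(&ticket_mutex);
    }

    /* Runs later, possibly from many threads in parallel. */
    static int example_commit(handlerton *hton, THD *thd, bool all)
    {
      unsigned long long my_ticket=
        (unsigned long long)(uintptr_t)thd_get_ha_data(thd, hton);

      pthread_mutex_lock(&ticket_mutex);
      while (commit_turn != my_ticket)
        pthread_cond_wait(&ticket_cond, &ticket_mutex);
      /* ... the part of the commit that must happen in order goes here ... */
      commit_turn++;
      pthread_cond_broadcast(&ticket_cond);
      pthread_mutex_unlock(&ticket_mutex);

      /* ... expensive, unordered work (write()/fsync()) can follow here ... */
      return 0;
    }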

I also added a corresponding prepare_ordered() call, which runs in commit order during the prepare phase. The intention is to provide a place to release InnoDB row locks early for even better performance, though I still need to get the Facebook people to explain exactly what they want to do in this respect ;-)

I also put a lot of thought into getting efficient inter-thread synchronisation in the architecture. As Mats mentioned, if one is not careful, it is easy to end up with O(N²) cost of thread wake-up, with N the number of transactions participating in group commit. As the goal is to get N as high as possible to maximise the sharing of the expensive fsync() call, such O(N²) cost is to be avoided.

In the architecture described in MariaDB Worklog#116, there should in the normal case be only a single highly contended lock, the one on the binlog group commit (which is inherent in the idea of group commit: one thread does the fsync() while the rest of the participating threads wait). I use a lock-free queue so that threads in prepare_ordered() do not block threads in commit_ordered() and vice versa. The prepare_ordered() calls run under a global lock, but as they are intended to execute very quickly, there should ideally be little contention here. The commit_ordered() calls run in a loop in a single thread, also avoiding serious lock contention as long as commit_ordered() runs quickly as intended.

In particular, running the commit_ordered() loop in a single thread for each group commit avoids the high cost of thread wake-up. If we were to try to run the sequential part of commit in different threads in a specific commit order, we would need to switch execution context from one thread to the next, bouncing the thread of control all over the cores of an SMP system, which takes lots of context switches and could potentially be costly. In the proposed architecture, a single thread runs all the commit_ordered() method calls and then wakes up the other waiting threads individually, each free to proceed immediately without any further waiting for one another.
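
Here is a greatly simplified, purely illustrative sketch of what that leader loop amounts to (the real code uses a lock-free queue and lives inside the binlog group commit; all names below are invented):

    /* Purely illustrative sketch of the group commit leader. */
    typedef struct THD THD;                       /* server connection context */

    struct commit_entry
    {
      THD *thd;
      struct commit_entry *next;
    };

    /* Hypothetical helpers, declared only to make the sketch self-contained. */
    void binlog_write_and_sync_group(struct commit_entry *queue);
    void ha_commit_ordered_all_engines(THD *thd);
    void signal_commit_done(struct commit_entry *e);

    static void group_commit_leader(struct commit_entry *queue)
    {
      struct commit_entry *e;

      /* 1. One shared write() + fsync() of the binlog covers the whole group. */
      binlog_write_and_sync_group(queue);

      /* 2. The leader runs the quick, ordered commit_ordered() part for every
            queued transaction, in queue order, with no thread switching. */
      for (e= queue; e != NULL; e= e->next)
        ha_commit_ordered_all_engines(e->thd);

      /* 3. Wake each participant individually; each then finishes its own
            (unordered) engine commit() in parallel with the others. */
      for (e= queue; e != NULL; e= e->next)
        signal_commit_done(e);
    }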

Of course, an engine/binlog plugin that so desires is free to implement such thread-hopping itself, by allocating a ticket in one of the _ordered() methods, and doing its own synchronisation in its commit() method. After all, it may be beneficial or necessary in some cases. The point is that different plugins can use different methods, each using the one that works best for that particular engine without getting in the way of each other.

Further improvements

If we implement this approach, there are a couple of other interesting enhancements that can be implemented relatively easy due to the commit ordering facilities:

  • Currently, we sync to disk three times per commit to ensure consistency between InnoDB and binlog after a crash. But if we know the commit order is the same in engine and in binlog, and if we store in the engine the corresponding binlog position (which InnoDB already does), then we need only sync once (for the binlog) and can still recover reliably after a crash. Since we have a consistent commit order, we can during crash recovery replay the binlog from the position after the last not lost commit inside InnoDB (just like we would apply the binlog on a slave).
  • Currently, the START TRANSACTION WITH CONSISTENT SNAPSHOT, which is supposed to run a transaction with a consistent view in multiple transactional engines, is not really all that consistent. It is quite possible to see a transaction committed in one engine but not in another, and vice versa. However, with an architecture like the one proposed here, it should be easy to just take the snapshot under the same lock that commit_ordered() runs under, and the snapshot will be truly consistent (on engines that support commit ordering). As a bonus, it would also be possible to provide a binlog position corresponding to the consistent snapshot.
  • XtraBackup (and similar backup solutions) should be able to create a backup which includes a binlog position (suitable for provisioning a new slave) without having to run FLUSH TABLES WITH READ LOCK, which can be quite costly as it blocks all transaction processing while it runs.
  • As already mentioned, the Facebook group has some ideas for releasing InnoDB row locks early in order to reduce the load on hot-spot rows; this requires consistent commit order.
Implementation

If anyone is interested in looking at the actual code of the proof-of-concept implementation, it is available as a quilt patch series and as a Launchpad bzr tree (licence is GPLv2).

Do be aware that this is work in progress.

Fixing MySQL group commit (part 3)

This is the third and final article in a series about group commit in MySQL. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. The second article explained the part of the InnoDB code that is responsible for the problem.

So how do we fix group commit in MySQL? As we saw in the second article of this series, we can just eliminate the prepare_commit_mutex from InnoDB, extend the binary logging to do group commit by itself, and that would solve the problem.

However, we might be able to do even better. As explained in the first article, with binary logging enabled we need XA to ensure consistency after a crash, and that requires three fsync() calls for a commit. Even if each of those can be shared with other transactions using group commit, it is still expensive. During a discussion on the maria-developers@ mailing list, an idea came up for how to do this with only a single (shared) fsync() per commit.

The basic idea is to only do fsync() for the binary log, not for the storage engine, corresponding to running with innodb_flush_log_at_trx_commit set to 2 or even 0.

If we do this, we can end up in the following situation: some transaction A is written into the binary log, and fsync() makes sure it is stored durably on disk. Then transaction A is committed in InnoDB. And before the operating system and hardware get around to storing the InnoDB part of A durably on disk, we get a crash.

Now on crash recovery, we will have A in the binary log, but in the engine A may be lost, causing an inconsistency. But this inconsistency can be resolved simply by re-playing the transaction A against InnoDB, using the data for A stored in the binary log. Just like it would normally be applied on a replication slave. After re-playing the transaction, we will again be in a consistent state.

In order to do this, we need two things:

  • For each transaction, we need to store in the InnoDB engine the corresponding position in the binary log, so that at crash recovery we know from which point in the binary log to start re-playing transactions.
  • We also need to ensure that the order of commits in the binary log and in InnoDB is the same! Otherwise, after a crash we could find ourselves in the situation that the binary log has transaction A followed by transaction B, while the InnoDB storage engine contains only transaction B committed, not transaction A. This would leave us with no reliable place in the binary log to start re-playing transactions from.

Now, for ensuring same commit order, we do not want to re-introduce the (by now) infamous prepare_commit_mutex, as that would make it impossible to have group commit for the binary log. Instead we should use another way to ensure such order. There are several ways to do this. Mark Callaghan explained one such way to do this at the MySQL conference, described further in this article.

The basic idea is that when writing transactions into the binary log, we remember their ordering. We can do this by putting the transactions into a queue, by assigning them a global transaction id in monotonic sequence, or by assigning them some kind of ticket as Mark suggests. Then inside innobase_commit(), transactions can coordinate with each other to make sure they go into the engine in the order dictated by the queue, global transaction id, or ticket.

I think I have a working idea for how to extend the storage engine API to be able to do this in a clean way for any transactional engine. We can introduce an optional handler call commit_fast() that is guaranteed to be called in the same order as transactions are written to the binary log, prior to the normal commit handler call. Basically, it would be called under a binary log mutex. The idea is that commit_fast() will contain the "fast" part of innobase_commit(), as explained in the previous article. Then in commit_fast(), the engine can do the assignment of a ticket or insertion into a queue, as needed.

I think possibly for symmetry we would want to also add a similar xa_prepare_fast() handler call that would be invoked after the normal xa_prepare() and similarly be guaranteed to be in the same order as binary log commit, though I need to consider this a bit more to fully make up my mind.

I believe such an addition to the storage engine API would make it possible to implement, in a clean way for all engines, the method of re-playing the binary log at crash recovery so that no more than a single fsync() is needed at commit.

So this concludes the series. Using these ideas, I hope we will soon see patches for MySQL and MariaDB that greatly enhance the performance of durable and crash-safe commits, so that we can finally declare Peter's original Bug#13669 fixed!

Fixing MySQL group commit (part 2)

This is the second in a series of three articles about ideas for implementing full support for group commit in MariaDB. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. See also the third article.

Internally, InnoDB (and hence XtraDB) does support group commit. The way this works is seen in the innobase_commit() function. The work in this function is split into two parts. First, a "fast" part, which registers the commit in memory:

    trx->flush_log_later = TRUE;
    innobase_commit_low(trx);
    trx->flush_log_later = FALSE;
Second, a "slow" part, which writes and fsync's the commit to disk to make it durable:
    trx_commit_complete_for_mysql(trx)
While one transaction is busy executing the "slow" part, any number of later transactions can complete their "fast" part, and queue up waiting for the running fsync() to finish. Once it does finish, a single fsync() of the log is now sufficient to complete the slow part for all of the queued-up transactions. This is how group commit works in InnoDB when the binary log is disabled.

When the binary log is enabled, MySQL uses XA/2-phase commit to ensure consistency between the binary log and the storage engine. This means that a commit now takes three parts:

    innobase_xa_prepare()
    write() and fsync() binary log
    innobase_commit()

Now, there is an extra detail to the prepare and commit code in InnoDB. InnoDB locks the prepare_commit_mutex in innobase_xa_prepare(), and does not release it until after the "fast" part of innobase_commit() has completed. This means that while one transaction is executing innobase_commit(), all subsequent transactions will be blocked inside innobase_xa_prepare() waiting for the mutex. As a result, no transactions can queue up to share an fsync(), and group commit is broken with the binary log enabled.

So, why does InnoDB hold the problematic prepare_commit_mutex across the binary logging? That turns out to be a really good question. After extensive research into the issue, it appears that in fact there is no good reason at all for the mutex to be held.

Comments in the InnoDB code, in the bug tracker, and elsewhere, mention that taking the mutex is necessary to ensure that commits happen in the same order in InnoDB and in the binary log. This is certainly true; without taking the mutex we can have transaction A committed in InnoDB before transaction B, but B written to the binary log before transaction A.

But this just raises the next question: why is it necessary to ensure the same commit order in InnoDB and in the binary log? The only reason that I could find stated is that this is needed for InnoDB hot backup and XtraBackup to be able to extract the correct binary log position corresponding to the state of the engine contained in the backup.

Sergei Golubchik investigated this issue during the 2010 MySQL conference, inspired by the many discussions of group commit that took place there. It turns out that XtraBackup does a FLUSH TABLES WITH READ LOCK when it extracts the binary log position. This statement completely blocks the processing of commits until released, removing any possibility of a different commit order in the engine and the binary log (InnoDB hot backup is closed source, so difficult to check, but presumably works in the same way). So there certainly is no need for holding the prepare_commit_mutex to ensure a consistent binary log position for backups!

There is another popular way to do hot backups without using FLUSH TABLES WITH READ LOCK: LVM snapshots. But an LVM snapshot essentially runs the recovery algorithm at restore time. In this case, XA is used to ensure that engine and binary log are consistent at server start, eliminating any need to enforce same ordering of commits.

So it really seems that there just is no good reason for the prepare_commit_mutex to exist in the first place. Unless someone can come up with a good explanation for why it should be needed, I am forced to conclude that we have lived with 5 years of broken group commit in MySQL solely because of incorrect hearsay about how things should work. That is kind of sad, and suggests that no-one at MySQL or InnoDB ever cared enough to take a serious look at this important issue.

(In order to get full group commit in MySQL there is another issue that needs to be solved. The current binary log code does not include an implementation of group commit, so this also needs to be implemented. Such an implementation should be possible using standard techniques, and is independent of the fix for group commit in InnoDB.)

This concludes the second part of the series, showing that group commit can be restored simply by removing the offending prepare_commit_mutex from InnoDB. The third and final article in the series will discuss some deeper issues that arise from looking into this part of the server code, and some interesting ideas for further improving things related to group commit.