Home
Kristian Nielsen Below are the 10 most recent journal entries recorded in the "Kristian Nielsen" journal:

[<< Previous 10 entries]

July 9th, 2010
12:33 am

[Link]

MySQL/MariaDB replication: applying events on the slave side

Working on a new set of replication APIs in MariaDB, I have given some thought to the generation of replication events on the master server.

But there is another side of the equation: to apply the generated events on a slave server. This is something that most replication setups will need (unless they replicate to non-MySQL/MariaDB slaves). So it will be good to provide a generic interface for this, otherwise every binlog-like plugin implementation will have to re-invent this themselves.

A central idea in the current design for generating events is that we do not enforce a specific content of events. Instead, the API provides accessors for a lot of different information related to each event, allowing the plugin flexibility in choosing what to include in a particular event format. For example, one plugin may request column names for a row-based UPDATE event; another plugin may not need them and can avoid any overhead related to column names simply by not requesting them.

To get the same flexibility on the slave side, the roles of plugin and API are reversed. Here, the plugin will have a certain pre-determined (by the particular event format implemented) set of information related to the event available. And the API must make do with whatever information it is provided (or fail gracefully if essential information is missing).

My idea is that the event application API will provide corresponding events to the events in the generation API. Each application event will have "provider" methods corresponding to the accessor methods of the generator API. So the plugin that wants to apply an event can obtain an event generator object, call the appropriate provider methods for all the information available, and finally ask the API to execute the event with the provided context information.

This is only an abstract idea at this point; there are lots of details to take care of to make this idea into a concrete design proposal. And I have not fully decided if such an API will be part of the replication project or not. But I like the idea so far.

Understanding how MySQL binlog events are applied on the slave

I wanted to get a better understanding of what is involved in an event application API like the one described above. So I did a similar exercise to the one I wrote about in my last post, where I went through in detail all the information that the existing MySQL binlog format includes. This time I went through the details in the code that applies MySQL binlog events on a slave.

Again, I concentrate on the actual events that change the database, ignoring (most) details that relate only to the particular binlog format used by MySQL (and there are quite a few :-).

At the top level, the slave SQL thread (in exec_relay_log_event()) reads events (next_event()) from the relay logs and executes (apply_event_and_update_pos()) them.

There are a number of intricate details here relating to switching to a new relay log (and purging old ones), and re-winding in the relay log to re-try a failed transaction (eg. in case of deadlock or the like). This is mostly specific to the particular binlog implementation.

The actual data changes are mostly done in the do_apply_event() methods of each event class in sql/log_event.cc. I will go briefly through this method for the events that are used to change actual data in current MySQL replication. It is relatively easy to read a particular detail out of the code, since it is all located in these do_apply_event() methods.

(One problem however is that there are special cases sprinkled across the code where special action is taken (or not taken) when running in the slave SQL thread. I have not so far tried to determine the full list of such special cases, or access how many there are).

Query_log_event

The main task done here is to set up various context in the THD, execute the query, and then perform necessary cleanup. If the query throws an error, there is also some fairly complex logic to handle this error correctly; for example to ignore certain errors, to require same error as on master (if the query failed on the master in some particular way), and to re-try the query/transaction for certain errors (like deadlock).

There are also some hard-coded special cases for NDB (this seems to be a common theme in the replication code).

The main think that to my eyes make this part of the code complex is the set of actions taken to prepare context before executing the query, and to clean up after execution. Each individual step in the code is in fact relatively easy to follow (and often the commenting is quite good). The problem is that there are so many individual steps. It is very hard to feel sure that exactly this set of actions is sufficient (and that none are redundant for that matter).

For example, code like this (not complete, just a random part of the setup):

    thd->set_time((time_t)when);
    thd->set_query((char*)query_arg, q_len_arg);
    VOID(pthread_mutex_lock(&LOCK_thread_count));
    thd->query_id = next_query_id();
    VOID(pthread_mutex_unlock(&LOCK_thread_count));
    thd->variables.pseudo_thread_id= thread_id;

It seems very easy to forget to assign query_id or whatever if one was to write this from scratch. That is something I would really like to improve in an API for replication plugins: it should be possible to understand completely exactly what setup and cleanup is needed around event execution, and there should be appropriate methods to achieve such setup/cleanup.

Another thing that is interesting in the code for Query_log_event::do_apply_event() is that the code does strcmp() on the query against keywords like COMMIT, SAVEPOINT, and ROLLBACK, and follows a fairly different execution path for these. This seems to bypass the SQL parser! But in reality, these particular queries are generated on the master with special code in the server that carefully restricts the possible format (eg. no whitespace or comments etc). So in effect, this is just a way to hack in special event types for these special queries without actually adding new binlog format events.

Rand_log_event, Intvar_log_event, and User_var_log_event

The action taken for these events is essentially to update the THD with the information in the event: value of random seed, LAST_INSERT_ID/INSERT_ID, or @user_variable.

Xid_log_event

On the slave side, this event is essentially a COMMIT. One thing that springs to mind is how different the code to handle this event is from the special code in Query_log_event that handles a query "COMMIT". Again, it is very hard to tell from the code if this is a bug, or if the different-looking code is in fact equivalent.

Begin_load_query_log_event, Append_block_log_event, Delete_file_log_event, and Execute_load_query_log_event

This implements the LOAD DATA INFILE query, which needs special handling as it originally references a file on the master server (or on the client machine). The actual execution of the query is handled the same way as for normal queries (Execute_load_query_log_event is a sub-class of Query_log_event), but some preparation is needed first to write the data from the events in the relay log into a temporary file on the slave.

The main thing one notices about this code is how it handles re-writing the LOAD DATA query to use a new temporary file name on the slave. Take for example this query:

    LOAD DATA CONCURRENT /**/ LOCAL INFILE 'foobar.dat' REPLACE INTO /**/ TABLE ...

The event includes offsets into the query string of the two places marked /**/ in the example. This part of the query is then re-written in the slave code (so it is not just replacing the filename). Again, the code by-passes the SQL parser, it just so happens that the SQL syntax in this case is sufficiently simple that this is not too hackish to do. If one were to check, one would probably see that any user comments in this particular part of the query string disappear in the slave binlog if --log-slave-updates is enabled.

Table_map_log_event

This just links a struct RPL_TABLE_LIST into a list, containing information about the tables described by the event. The actual opening of the table is done when executing the first row event (WRITE/UPDATE/DELETE).

Write_rows_log_event, Update_rows_log_event, and Delete_rows_log_event

These are what handle the application of row-based replication events on a slave.

The first of the row events causes some setup to be done, a partial extract is this:

    lex_start(thd);
    mysql_reset_thd_for_next_command(thd, 0);
    thd->transaction.stmt.modified_non_trans_table= FALSE;
    query_cache.invalidate_locked_for_write(rli->tables_to_lock);

(As remarked for Query_log_event, while row event application setup is somewhat simpler, it still appears a bit magic that exactly these setups are sufficient and necessary).

The code also switches to row-based binlogging for the following row operations (this is for --log-slave-updates, as it is not possible to binlog the application of row-based events as statement-based events in the slave binlog). This is by the way an interesting challenge for a generic replication API: how does one handle binlogging of events applied on a slave, for daisy-chaining replication servers? This of course gets more interesting if one were to use a different binlogging plugin on the slave than on the master. I need to think more about it, but this seems to be a pretty strong argument that a generic event application API is needed, which hooks into event generators to properly generate all needed events for the updates done on slaves. Another important aspect is the support of a global transaction ID, that will identify a transaction uniquely across an entire replication setup to make migrating slaves to a new master easier. Such a global transaction ID also needs to be preserved in a slave binlog when replicating events from a master.

During execution of the first row event, the code also sets the flags for foreign key checks and unique checks that is included in the event from the master. And it checks if the event should be skipped (for --replicate-ignore-db and friends).

A check is made to ensure that the table(s) on the slave are compatible with the tables on the master (as described in the table map event(s) received just before the row event(s)). The MySQL row-based replication has a fair bit of flexibility in terms of allowing differences in tables between master and slave, such as allowing different column types, different storage engine, or even extra columns on the slave table. In particular allowing extra columns raises some issues about default values etc. for these columns, though I did not really go into details about this.

Again, there are a number of hard-coded differences for NDB.

For Write_rows_log_event, flags need to be set to ensure that column values for AUTO_INCREMENT and TIMESTAMP columns are taken from the supplied values, not auto-generated. If slave_exec_mode=IDEMPOTENT, an INSERT that fails due to already existing row does not cause replication to fail; instead an UPDATE is tried, or in some cases (like if there are foreign keys) a DELETE + re-tried INSERT. There is also code to hint storage engines about the approximate number of rows that are part of a bulk insert.

For Update_rows_log_event and Delete_rows_log_event, the code needs to locate the row to update/delete. This is done by primary key if one exists on the slave table (the current binlogging always includes every column in the before image of the row, so the primary key value of the row to modify is always available). But there is also support for tables with no primary key, in which the first index on the table is used to locate the row (if any), or failing that a full table scan. This btw. is a good reminder to not use row-based replication with primary-key-less tables with non-trivial amount of rows: every row operation applied will need a full table scan!

Finally, the values from the event are unpacked into a row buffer in the format used by MySQL storage engines, and the read_set and write_set are set up (current replication always includes all columns in row operations), before the actual ha_write_row(), ha_update_row(), or ha_delete_row() call into the storage engine handler is made to perform the actual update. Note that a single row event can include multiple rows, which are applied one after the other.

Final words

And that is it! Quite a bit of detail, but again I found it very useful to create this complete overview; it will make things easier when re-implementing this in a new replication API.

Tags: , , , ,

(Leave a comment)

June 21st, 2010
02:16 pm

[Link]

Dissecting the MySQL replication binlog events

For the replication project that I am currently working on in MariaDB, I wanted to understand exactly what information is needed to do full replication of all MySQL/MariaDB statements on the level of completeness that existing replication does. So I went through the code, and this is what I found.

What I am after here is a complete list of what the execution engine needs to provide to have everything that a replication system needs to be able to completely replicate all changes made on a master server. But not anything specific to the particular implementation of replication used, like binlog positions or replication event disk formats, etc.

The basic information needed is of course the query (for statement-based replication), or the column values (for row-based replication). But there are lots of extra details needed, especially for statement-based replication. I need to make sure that the replication API we are designing will be able to provide all needed information, and it was always nagging in the back of my head that there would be lots and lots of small bits in various corners that would be missing and cause problems. So it was good to get this overview. Turns out that there are a lot of details, but not that many, and it should be manageable.

All of the events that are used in replication are listed in enum Log_event_type in sql/log_event.h. So anything needed for complete replication can be found here, but mixed up with lots of other details about the MySQL binlog implementation, backwards compatibility, etc. So what follows is an extract from log_event.cc of the actual change information contained in those events.

Statement-based replication

QUERY_EVENT

The main event for statement-based replication is QUERY_EVENT. It contains the query to be executed (as a string) and some information to provide the context for correct execution. Here is the list of information:

  • SQL query.
  • Default database for the query (eg. from USE statement).
  • The setting of some server variables in effect at the time the query was run:
    • sql_mode.
    • autocommit (whether autocommit is enabled).
    • Character set and collation at various levels (see the section 9.1.4. Connection Character Sets and Collations in the MySQL manual for background on these):
      • Client (character_set_client).
      • Connection (character_set_connection).
      • Server (character_set_server).
      • Current default database (character_set_database; note that there are few statements that rely on this, comments in the code say it is only LOAD DATA).
    • foreign_key_checks (whether foreign keys are checked).
    • unique_checks (whether unique constraint checks are enforced).
    • auto_increment_offset and auto_increment_increment.
    • Time zone of the master database server.
    • Names to use for days and months; this is identified by a code that is mapped to a table of names to use in sql/sql_locale.cc.
    • sql_auto_is_null (whether SELECT ... WHERE autoinc IS NULL returns last insert id for ODBC compatibility).
  • Error code from executing the query on the master (for non-transactional statements that may still make permanent changes even though they fail mid-way; on the slave the query should fail with the same error).
  • Connection ID (this is used to correctly distinguish TEMPORARY TABLEs with same name used in different connections on the master simultaneously).

Note that not all of this information is replicated in all query events, as not all of it is needed for a given query. But a replication API must make the information available for the queries where it is needed.

INTVAR_EVENT, RAND_EVENT, and USER_VAR_EVENT

These events provide additional context for executing a query on the slave:

  • Value of LAST_INSERT_ID (for queries that reference it).
  • Value of INSERT_ID (to get same auto_increment numbers for inserts on the slave as on the master).
  • The random seed (so RAND() can return same values in queries on slaves as on the master).
  • The values for any @user_variables referenced in a query

BEGIN_LOAD_QUERY_EVENT, APPEND_BLOCK_EVENT, EXECUTE_LOAD_QUERY_EVENT, and DELETE_FILE_EVENT

These four events are used to do statement replication of LOAD DATA INFILE. The contents of the file to be loaded is sent in blocks in BEGIN_LOAD_QUERY_EVENT followed by zero or more APPEND_BLOCK_EVENT. Then the actual query is sent in EXECUTE_LOAD_QUERY_EVENT, which is a variant of QUERY_EVENT that replaces the original filename with the name of a temporary file on the slave and deletes the temporary file afterwards (DELETE_FILE_EVENT is used in certain error cases).

This is the complete story of exactly how much information needs to be provided on the master to make statement replication work as it does currently in MySQL. If you get the thought that this is a little bit scary in terms of complexity I tend to agree with you ;-). There is a lot to be said for the comparative simplicity of row-based replication (and it is also interesting to see the history of bug fixes in MySQL 5.1 that gradually have moved more and more statements to be replicated row-based (in mixed-mode binlogging) due to corner cases where statement-based replication can fail).

Still, once we have the list of information, it is not that hard to provide the information in a pluggable replication API for any implementations that want to try their luck with statement-based replication. And of course, row-based replication only handles INSERT/UPDATE/DELETE! We also need to support CREATE TABLE and similar statements, for which it is still useful to know the above exhaustive list of information that may be needed in one form or another.

Row-based replication

In row-based replication, each DML statement is binlogged in two parts. First the tables modified in the query are described with TABLE_MAP_EVENT, and second the row values changed are logged with WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, or DELETE_ROWS_EVENT.

TABLE_MAP_EVENT

The information describing modified tables in row-based replication is as follows:

  • Database name.
  • Table name.
  • List of columns in the table. For each column, the following information is included:
    • Column type (this is field->type()).
    • Column metadata (this is what is returned by field->save_field_metadata(); this is for example the maximum length of a VARCHAR, the precision and number of decimals in DECIMAL, etc.)
    • Whether the column is NULL-able.
  • Table map id; this is just an internally generated uniqie number for subsequent events to refer to the table described.

Note in particular that column names are not used/needed in current MySQL/MariaDB row-based replication. I personally think this is a good way to do it. However, in a generic API, it will make sense to make the full table definition available to implementations, each of which can choose what and how to log in terms of table metadata.

WRITE_ROWS_EVENT, UPDATE_ROWS_EVENT, and DELETE_ROWS_EVENT

These events handle replication of respectively INSERT, UPDATE, and DELETE (and similar statements like REPLACE etc.) They contain the following information:

  • Table map id, referencing a table previously described with TABLE_MAP_EVENT.
  • Value of foreign_key_checks and unique_checks, similar to statement-based binlogging (but for row-based, those two are all the context storedm though see remarks below).
  • List (bitmap really) of columns updated. This is essentially the write_set that is used in the storage engine API (but see below for explanation). For UPDATE, there are two bitmaps, one for the before image and one for the after image.
  • List of records containing the values of each column modified. There is one such record for every row update logged. For UPDATE there are two records for each row update, one for the before image (values before the update was done) and one for the after image (values after the update was done).

I must say, investigating how these row-based events are implemented in MySQL really makes the feature seem rather half-baked. There are several issues:

  • The lists/bitmaps of column updated sound useful, but in reality they are set unconditionally to include all columns! Except for NDB).
  • This also means all columns in a row are always sent, even for DELETE and UPDATE. Except for NDB, which only logs needed columns.
  • Some of the "extra" flags in the storage engine API are not included, such as HA_EXTRA_WRITE_CAN_REPLACE. This is actually a bug, as it means that a storage engine using such flags to optimise its operation will not replicate correctly. In the existing MySQL source, only NDB uses this flag, but NDB does special tricks for binlogging and slave replication which avoids this particular issue in most cases.

I strongly suspect that some of this half-baking was done in a quick-and-dirty attempt to squeeze NDB replication in. At least, there are several "this is only used by NDB" type comments in the vicinity of these things in the source code.

In any case, for the replication API, it is probably a good idea to re-think this part and make sure that the information logged for row updates is complete and sane for all reasonable use cases.

Thoughts

My idea is to have a replication API that provides for generation and consumption of events completely separate from any details of the actual format of events in the binlog or any other method used to store or process the events. This will allow replication plugins that use a completely different binlog implementation, or even has no binlog at all.

So such an API needs to provide all of the above information (to allow re-implementing the existing binlog/replication as a plugin, if for no other reason), but need not provide such information in any particular event format. In fact, I am trying to make the API so that such information need not be materialised in structures or memory buffers at all; instead relying on providing accessor methods, so that an implementation can request just the information it needs, and materialise it as or if needed.

On top of this I still think it makes sense to define a standard (but optional) materialised event format, so that more light-weight plugins can be written that can do interesting things with replication without having to implement a full new event format each time. I am still considering whether to extend the existing binlog format (which is not all that attractive, as it is not very easily extensible), or whether to define a new more flexible format (for example based on the Google protobuffer library).

More on the existing binlog format

Just for completeness, here is some additional description of the existing MySQL/MariaDB 5.1 binlog format. These are things that I believe are not required in a new API, as they are mostly internal implementation details. However, as I had to go through them anyway while finding the stuff that does need to be in the API, I will include a brief description here.

Additional query information

Some additional information, which is mostly redundant, is included with query events for statement-based binlogging:

  • Bitmap of tables affected by multi-table update (this allows to know which tables will be updated without parsing the query, eg. for filtering events based on database/table name.)
  • Time spent in query on master.
  • Catalog (I believe this is old unused stuff. Idea is that each database belongs to a catalog, but I have never seen this actually used anywhere).
  • A flag LOG_EVENT_THREAD_SPECIFIC_F which is set if the query uses TEMPORARY table (allows to get this information without parsing the query).
  • A flag LOG_EVENT_SUPPRESS_USE_F set in a few cases when the master knows that the query is independent of what the current database is (so that a possible USE statement can be optimised away).

Binlog specific events

These are events that are specific to the binlog implementation:

XID_EVENT
This is used to record a transaction ID for each transaction written to the binlog in 2-phase commit. This recorded ID is needed during crash recovery on the master to know which prepared transactions in transactional engines need to be recovered to get consistency with what is in the binlog. It is not used on the slave in replication (though this events implies a COMMIT, which _does_ have effect on the slave, of course.)
FORMAT_DESCRIPTION_EVENT
This event is written at the start of every binlog file. It provides to slaves reading the binlog the master server version and the event size of all following events, thereby providing some facilities for extending event formats while maintaining backwards compatibility.
STOP_EVENT
This is logged when the master shuts down gracefully (though I do not think this is used much, if any)
ROTATE_EVENT
This is logged at the end of a binlog file when the master starts a new binlog file. It is needed by the slave to reset it's master binlog position so that the IO thread can proceed correctly from the next binlog file (incidentally, it is a clear weakness in the binlog implementation that the slaves need knowledge about binlog file names and data offsets on the master server, and is a cause of much complexity when switching masters in advanced replication topologies. Something that really needs improvements in the near future).
INCIDENT_EVENT
This is logged by the master when something bad happens that may cause replication to fail/diverge, so that the slave can be notified of the problem and stop, informing the DBA/sysadm to resolve the issue.

Obsolete events

Finally there are a number of events that are no longer generated (but which are still important for the slave replication code to handle to be able to work with masters of older versions):

LOAD_EVENT, NEW_LOAD_EVENT, CREATE_FILE_EVENT, and EXEC_LOAD_EVENT
Various old events for handling LOAD DATA INFILE (as can be seen, LOAD DATA INFILE has had some changes in replication over the years :-).
START_EVENT_V3
Old version of FORMAT_DESCRIPTION_EVENT.
PRE_GA_WRITE_ROWS_EVENT, PRE_GA_UPDATE_ROWS_EVENT, and PRE_GA_DELETE_ROWS_EVENT
Old versions of the row-based replication binlog events.
SLAVE_EVENT
Not used, I think it may have been related to some feature that was never completed.

Tags: , , ,

(1 comment | Leave a comment)

May 31st, 2010
04:17 pm

[Link]

Fixing MySQL group commit (part 4 of 3)

(No three-part series is complete without a part 4, right?)

Here is an analogy that describes well what group commit does. We have a bus driving back and forth transporting people from A to B (corresponding to fsync() "transporting" commits to durable storage on disk). The group commit optimisation is to have the bus pick up everyone that is waiting at A before driving to B, not drive people one by one. Makes sense, huh? :-)

It is pretty obvious that this optimisation of having more than one person in the bus can dramatically improve throughput, and it is the same for the group commit optimisation. Here is a graph from a benchmark comparing stock MariaDB 5.1 vs. MariaDB patched with a proof-of-concept patch that enables group commit:

Benchmark results

When group commit is implemented, we see clearly how performance (measured in queries per second) scales dramatically as the number of threads increases. Whereas with stock MariaDB with no group commit, there is no scaling at all. We also see that SSD is better than HDD (no surprise there), but that with sufficient parallelism from the application, group commit can to a large extent compensate for the slower disks.

This is the same benchmark as in the first part of the series. Binlog is enabled. Durability is enabled with sync_binlog=1 and flush_log_at_trx_commit=1 (and disk cache disabled to prevent the disks lying about when data is durable). The load is single-row transactions against a 1000000-row XtraDB table. The benchmark is thus specifically designed to make the fsync() calls at the end of commit the bottleneck.

I should remark that I did not really tune the servers used in the benchmark for high parallelism (except for raising max_connections :-), and I ran the client on the same machine as the server. So it is likely that there are other effects than group commit influencing the performance at high parallelism (especially on the SSD results, which I ran on my laptop). But I just wanted to see if my group commit work scales with higher parallelism, and the graphs clearly shows that it does!

Architecture

For this work, I have focused a lot on the API for storage engine and binlog plugins (we do not have binlog plugins now, but this is something that we will be working on in MariaDB later this year). I want a clean interface that allows plugins to implement group commit in a simple and efficient manner.

A crucial point is the desire to get commits ordered the same way in the different engines (ie. typically in InnoDB and in the binlog), as I discussed in previous articles. As group commit is about parallelims, and ordering is about serialisation, these two tend to get into conflict. My idea is to introduce new calls in the interface to storage engines and the XA transaction coordinator (which is how binlog interacts with commit internally in the server). These new calls allow plugins that care about commit order to cooperate on getting correct ordering without getting in each others way and killing parallelims. Plugins that do not need any ordering can ignore the new calls, which are optional (for example the transaction coordinator that runs when the binlog is disabled does not need any ordering).

The full architecture is written up in detail in the MariaDB Worklog#116. But the basic idea is to introduce a new handlerton method:

    void commit_ordered(handlerton *hton, THD *thd, bool all);
This is called just prior to the normal commit() method, and is guaranteed to run in the same commit order across all engines (and binlog) participating in the transaction.

This allows for a lot of flexibility in plugin implementations. A typical implementation would in the commit_ordered() method write the transaction data into its in-memory log buffer, and delay the time-consuming write() and fsync() to the parallel commit() method. InnoDB/XtraDB is already structured in this way, so fits very well into this scheme.

But if an engine wants to use another approach, for example a ticket-based approach as Mark and Mats suggested, that is easy to do too. Just allocate the ticket in commit_ordered(), and use it in commit(). I believe most approaches should fit in well with the proposed model.

I also added a corresponding prepare_ordered() call, which runs in commit order during the prepare phase. The intension is to provide a place to release InnoDB row locks early for even better performance, though I still need to get the Facebook people to explain exactly what they want to do in this respect ;-)

I also spent a lot of thought on getting efficient inter-thread synchronisation in the archtecture. As Mats mentioned, if one is not careful, it is easy to end up with O(N2) cost of thread wake-up, with N the number of transactions participating in group commit. As the goal is to get N as high as possible to maximise sharing of the expensive fsync() call, such O(N2) cost is to be avoided.

In the architecture described in MariaDB Worklog#116, there should in the normal case only be a single highly contested lock, the one on the binlog group commit (which is inherent to the idea of group commit, one thread does the fsync() while the rest of participating threads wait). I use a lock-free queue to make threads in prepare_ordered() not block threads in commit_ordered() and vice versa. The prepare_ordered() calls runs under a global lock, but as they are intended to execute very quickly there should ideally be little contention here. The commit_ordered() calls run in a loop in a single thread, also avoiding serious lock contention as long as commit_ordered() runs quickly as intended.

In particular, running the commit_ordered() loop in a single thread for each group commit avoids high cost of thread wake-up. If we were to try to run the sequential part of commit in different threads in a specific commit order, we would need to switch execution context from one thread to the next, bouncing the thread of control all over the cores in an SMP system. Which takes lots of context switches, and could potentially be costly. In the proposed architecture, a single thread runs all commit_ordered() method calls and wakes up the other waiting threads individually, each free to proceed immediately without any more waiting for one another.

Of course, an engine/binlog plugin that so desires is free to implement such thread-hopping itself, by allocating a ticket in one of the _ordered() methods, and doing its own synchronisation in its commit() method. After all, it may be beneficial or necessary in some cases. The point is that different plugins can use different methods, each using the one that works best for that particular engine without getting in the way of each other.

Further improvements

If we implement this approach, there are a couple of other interesting enhancements that can be implemented relatively easy due to the commit ordering facilities:

  • Currently, we sync to disk three times per commit to ensure consistency between InnoDB and binlog after a crash. But if we know the commit order is the same in engine and in binlog, and if we store in the engine the corresponding binlog position (which InnoDB already does), then we need only sync once (for the binlog) and can still recover reliably after a crash. Since we have a consistent commit order, we can during crash recovery replay the binlog from the position after the last not lost commit inside InnoDB (just like we would apply the binlog on a slave).
  • Currently, the START TRANSACTION WITH CONSISTENT SNAPSHOT, which is supposed to run a transaction with a consistent view in multiple transactional engines, is not really all that consistent. It is quite possible to see a transaction committed in one engine but not in another, and vice versa. However, with an architecture like the one proposed here, it should be easy to just take the snapshot under the same lock that commit_ordered() runs under, and the snapshot will be really consistent (on engines that support commit order). As a bonus, it would also be possible to provice a binlog position corresponding to the consistent snapshot.
  • XtraDB (and similar backup solutions) should be able to create a backup which includes a binlog position (suitable for provisioning a new slave) without having to run FLUSH TABLES WITH READ LOCK, which can be quite costly as it blocks all transaction processing while it runs.
  • As already mentioned, the Facebook group has some ideas for releasing InnoDB row locks early in order to reduce the load on hot-spot rows; this requires consistent commit order.
  • Implementation

    If anyone is interested in looking at the actual code of the proof-of-concept implementation, it is available as a quilt patch series and as a Launchpad bzr tree (licence is GPLv2).

    Do be aware that this is work in progress.

Tags: , , , ,

(2 comments | Leave a comment)

April 23rd, 2010
11:18 am

[Link]

Fixing MySQL group commit (part 3)

This is the third and final article in a series about group commit in MySQL. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. The second article explained the part of the InnoDB code that is responsible for the problem.

So how do we fix group commit in MySQL? As we saw in the second article of this series, we can just eliminate the prepare_commit_mutex from InnoDB, extend the binary logging to do group commit by itself, and that would solve the problem.

However, we might be able to do even better. As explained in the first article, with binary logging enabled we need XA to ensure consistency after a crash, and that requires to do three fsyncs for a commit. Even if each of those can be shared with other transactions using group commit, it is still expensive. During a discussion on the maria-developers@ mailing list, an idea came up for how to do this with only a single (shared) fsync() for a commit.

The basic idea is to only do fsync() for the binary log, not for the storage engine, corresponding to running with innodb_flush_log_at_trx_commit set to 2 or even 0.

If we do this, we can end up in the following situation: some transaction A is written into the binary log, and fsync() makes sure that is stored durably on disk. Then transaction A is committed in InnoDB. And before the operating system and hardware gets around to store the InnoDB part of A durably on disk, we get a crash.

Now on crash recovery, we will have A in the binary log, but in the engine A may be lost, causing an inconsistency. But this inconsistency can be resolved simply by re-playing the transaction A against InnoDB, using the data for A stored in the binary log. Just like it would normally be applied on a replication slave. After re-playing the transaction, we will again be in a consistent state.

In order to do this, we need two things:

  • For each transaction, we need to store in the InnoDB engine information about which is the corresponding position in the binary log, so that at crash recovery we will know from which position in the binary log to start re-playing transactions from.
  • We also need to ensure that the order of commits in the binary log and in InnoDB is the same! Otherwise, after a crash we could find ourselves in the situation that the binary log has transaction A followed by transaction B, while the InnoDB storage engine contains only transaction B committed, not transaction A. This would leave us with no reliable place in the binary log to start re-playing transactions from.

Now, for ensuring same commit order, we do not want to re-introduce the (by now) infamous prepare_commit_mutex, as that would make it impossible to have group commit for the binary log. Instead we should use another way to ensure such order. There are several ways to do this. Mark Callaghan explained one such way to do this at the MySQL conference, described further in this article.

The basic idea is that when writing transactions into the binary log, we remember their ordering. We can do this by putting the transactions into a queue, by assigning them a global transaction id in monotonic sequence, or by assigning them some kind of ticket as Mark suggests. Then inside innobase_commit(), transactions can coordinate with each other to make sure they go into the engine in the order dictated by the queue, global transaction id, or ticket.

I think I have a working idea for how to extend the storage engine API to be able to do this in a clean way for any transactional engine. We can Introduce an optional handler call commit_fast() that is guaranteed to be called in the same order as transactions are written to the binary log, prior to the normal commit handler call. Basically it would be called under a binary log mutex. The idea is that commit_fast() will contain the "fast" part of innobase_commit(), as explained in the previous article. Then in commit_fast(), the engine can do the assignment of a ticket or insertion into a queue, as needed.

I think possibly for symmetry we would want to also add a similar xa_prepare_fast() handler call that would be invoked after the normal xa_prepare() and similarly be guaranteed to be in the same order as binary log commit, though I need to consider this a bit more to fully make up my mind.

I believe such an addition to the storage engine API would allow to implement in a clean way for all engines the method of re-playing the binary log at crash recovery to avoid more than a single fsync() at commit.

So this concludes the series. Using these ideas, I hope we will soon see patches for MySQL and MariaDB that greatly enhances the performance for durable and crash-safe commits, so that we can finally declare Peter's original Bug#13669 for fixed!

Tags: , , , ,

(2 comments | Leave a comment)

10:51 am

[Link]

Fixing MySQL group commit (part 2)

This is the second in a series of three articles about ideas for implementing full support for group commit in MariaDB. The first article discussed the background: group commit in MySQL does not work when the binary log is enabled. See also the third article.

Internally, InnoDB (and hence XtraDB) do support group commit. The way this works is seen in the innobase_commit() function. The work in this function is split into two parts. First, a "fast" part, which registers the commit in memory:

    trx->flush_log_later = TRUE;
    innobase_commit_low(trx);
    trx->flush_log_later = FALSE;
Second, a "slow" part, which writes and fsync's the commit to disk to make it durable:
    trx_commit_complete_for_mysql(trx)
While one transaction is busy executing the "slow" part, any number of later transactions can complete their "fast" part, and queue up waiting for the running fsync() to finish. Once it does finish, a single fsync() of the log is now sufficient to complete the slow part for all of the queued-up transactions. This is how group commit works in InnoDB when the binary log is disabled.

When the binary log is enabled, MySQL uses XA/2-phase commit to ensure consistency between the binary log and the storage engine. This means that a commit now takes three parts:

    innobase_xa_prepare()
    write() and fsync() binary log
    innobase_commit()

Now, there is an extra detail to the prepare and commit code in InnoDB. InnoDB locks the prepare_commit_mutex in innobase_xa_prepare(), and does not release it until after the "fast" part of innobase_commit() has completed. This means that while one transaction is executing innobase_commit(), all subsequent transactions will be blocked inside innobase_xa_prepare() waiting for the mutex. As a result, no transactions can queue up to share an fsync(), and group commit is broken with the binary log enabled.

So, why does InnoDB hold the problematic prepare_commit_mutex across the binary logging? That turns out to be a really good question. After extensive research into the issue, it appears that in fact there is no good reason at all for the mutex to be held.

Comments in the InnoDB code, in the bug tracker, and elsewhere, mention that taking the mutex is necessary to ensure that commits happen in the same order in InnoDB and in the binary log. This is certainly true; without taking the mutex we can have transaction A committed in InnoDB before transaction B, but B written to the binary log before transaction A.

But this just raises the next question: why is it necessary to ensure the same commit order in InnoDB and in the binary log? The only reason that I could find stated is that this is needed for InnoDB hot backup and XtraBackup to be able to extract the correct binary log position corresponding to the state of the engine contained in the backup.

Sergei Golubchik investigated this issue during the 2010 MySQL conference, inspired by the many discussions of group commit that took place there. It turns out that XtraDB does a FLUSH TABLES WITH READ LOCK when it extracts the binary log position. This statement completely blocks the processing of commits until released, removing any possibility of different commit order in engine and binary log (InnoDB hot backup is closed source, so difficult to check, but presumably works in the same way). So there certainly is no need for holding the prepare_commit_mutex to ensure consistent binary log position for backups!

There is another popular way to do hot backups without using FLUSH TABLES WITH READ LOCK: LVM snapshots. But an LVM snapshot essentially runs the recovery algorithm at restore time. In this case, XA is used to ensure that engine and binary log are consistent at server start, eliminating any need to enforce same ordering of commits.

So it really seems that there just is no good reason for the prepare_commit_mutex mutex to exist in the first place. Unless someone can come up with a good explanation for why it should be needed, I am forced to conclude that we have lived with 5 years of broken group commit in MySQL solely because of incorrect hearsay about how things should work. Which is kind of sad, and suggest that no-one at MySQL or InnoDB ever cared sufficiently to take a serious look at this important issue.

(In order to get full group commit in MySQL there is another issue that needs to be solved. The current binary log code does not include implementation of group commit, so this also needs to be implemented. Such an implementation should be possible to do using standard techniques, and is independent of fixing of group commit in InnoDB).

This concludes the second part of the series, showing that group commit can be restored simply by removing the offending prepare_commit_mutex from InnoDB. The third and final article in the series will discuss some deeper issues that arise from looking into this part of the server code, and some interesting ideas for further improving things related to group commit.

Tags: , , , ,

(6 comments | Leave a comment)

10:48 am

[Link]

Fixing MySQL group commit (part 1)

This is the first in a series of three articles about ideas for implementing full support for group commit in MariaDB (for the other parts see the second and third articles). Group commit is an important optimisation for databases that helps mitigate the latency of physically writing data to permanent storage. Group commit can have a dramatic effect on performance, as the following graph shows:

Benchmark results

The rising blue and yellow lines show transactions per second when group commit is working, showing greatly improved throughput as the parallelism (number of concurrently running transactions) increases. The flat red and green lines show transactions per second with no group commit, with no scaling at all as parallelism increases. As can be seen, the effect of group commit on performance can be dramatic, improving throughput by an order of magnitude or more. More details of this benchmark below, but first some background information.

Durability and group commit

In a fully transactional system, it is traditionally expected that when a transaction is committed successfully, it becomes durable. The term durable is the "D" in ACID, and means that even if the system crashes the very moment after commit succeeded (power failure, kernel panic, server software crash, or whatever), the transaction will remain committed after the system is restarted, and crash recovery has been performed.

The usual way to ensure durability is by writing, to a transactional log file, sufficient information to fully recover the transaction in case of a crash, and then use the fsync() system call to force the data to be physically written to the disk drive before returning successfully from the commit operation. This way, in case crash recovery becomes necessary, we know that the needed information will be available. There are other methods than fsync(), including calling the fdatasync() system call or using the O_DIRECT flag when opening the log file, but for simplicity we will use fsync() to refer to any method for forcing data to physical disk.

fsync() is an expensive operation. A good traditional hard disk drive (HDD) will do around 150 fsyncs per second (10k rotation per minute drives). A good solid state disk like the Intel X25-M will do around 1200 fsyncs per second. It is possible to use RAID controllers with a battery backed up cache (which will keep data in cache memory during a power failure and physically write it to disk when the power returns); this will reduce the overhead of fsync(), but not eliminate it completely.

(There are other ways than fsync() to ensure durability. For example in a cluster with synchronous replication (like NDB or Galera), durability can be achieved by making sure the transaction is replicated fully to multiple nodes, on the assumption that a system failure will not take out all nodes at once. Whatever method used, ensuring durability is usually signficantly more expensive that merely committing a transaction to local memory.)

So the naive approach, which does fsync() after every commit, will limit throughput to around 150 transactions per second (on a standard HDD). But with group commit we can do much better. If we have several transactions running concurrently, all waiting to fsync their data in the commit step, we can use a single fsync() call to flush them all to physical storage in one go. The cost of fsync() is often not much higher for multiple transactions than for a single one, so as the above graph shows, this simple optimisation greatly reduces the overhead of fsync() for parallel workloads.

Group commit in MySQL and MariaDB

MySQL (and MariaDB) has full support for ACID when using the popular InnoDB (XtraDB in MariaDB) storage engine (and there are other storage engines with ACID support as well). For InnoDB/XtraDB, durability is enabled by setting innodb_flush_log_at_trx_commit=1.

Durability is needed when there is a requirement that committed transactions must survive a crash. But this is not the only reason for enabling durability. Another reason is when the binary log is enabled, for using a server as a replication master.

When the binary log is used for replication, it is important that the content of the binary log on the master exactly match the changes done in the storage engine(s). Otherwise the slaves will replicate to different data than what is on the master, causing replication to diverge and possibly even break if the differences are such that a query on the master is unable to run on the slave without error. If we do not have durability, some number of transactions may be lost after a crash, and if the transactions lost in the storage engines are not the same as the transactions lost in the binary log, we will end up with an inconsistency. So with the binary log enabled, durability is needed in MySQL/MariaDB to be able to recover into a consistent state after a crash.

With the binary log enabled, MySQL/MariaDB uses XA/2-phase commit between the binary log and the storage engine to ensure the needed durability of all transactions. In XA, committing a transaction is a three-step process:

  1. First, a prepare step, in which the transaction is made durable in the engine(s). After this step, the transaction can still be rolled back; also, in case of a crash after the prepare phase, the transaction can be recovered.
  2. If the prepare step succeeds, the transaction is made durable in the binary log.
  3. Finally, the commit step is run in the engine(s) to make the transaction actually committed (after this step the transaction can no longer be rolled back).
The idea is that when the system comes back up after a crash, crash recovery will go through the binary log. Any prepared (but not committed) transactions that are found in the binary log will be committed in the storage engine(s). Other prepared transactions will be rolled back. The result is guaranteed consistency between the engines and the binary log.

Now, each of the three above steps requires an fsync() to work, making a commit three times as costly in this respect compared to a commit with the binary log disabled. This makes it all the more important to use the group commit optimisation to mitigate the overhead from fsync().

But unfortunately, group commit does not work in MySQL/MariaDB when the binary log is enabled! This is the infamous Bug#13669, reported by Peter Zaitsev back in 2005.

So this is what we see in the graph and benchmark shown at the start. This is a benchmark running a lot of very simple transactions (a single REPLACE statement on a smallish XtraDB table) against a server with and without the binary log enabled. This kind of benchmark is bottlenecked on the fsync throughput of the I/O system when durability is enabled.

The benchmark is done against two different servers. One has a pair of Western Digital 10k rpm HDDs (with binary log and XtraDB log on different drives). The other has a single Intel X25-M SSD. The servers are both running MariaDB 5.1.44, and are configured with durable commits in XtraDB, and with drive cache turned off (drives like to lie about fsync to look better in casual benchmarks).

The graph shows throughput in transactions per second for different number of threads running in parallel. For each server, there is a line for results with the binary log disabled, and one with the binary log enabled.

We see that with one thread, there is some overhead in enabling the binary log, as is to be expected given that three calls to fsync() are required instead of just one.

But much worse, we also see that group commit does not work at all when the binary log is enabled. While the lines with binary log disabled show excellent scaling as the parallelism increases, the lines for binary log enabled are completely flat. With group commit non-functional, the overhead of enabling the binary log is enourmous at higher parallelism (at 64 threads on HDD it is actually two orders of magnitude worse with binary log enabled).

So this concludes the first part of the series. We have seen that if we can get group commit to work when the binary log is enabled, we can expect a huge gain in performance on workloads that are bottlenecked on the fsync throughput of the available I/O system.

The second part will go into detail with why the code for group commit does not work when the binary log is enabled. The third (and final) part will discuss some ideas about how to fix the code with respect to group commit and the binary log.

Tags: , , , ,

(10 comments | Leave a comment)

10:38 am

[Link]

Debugging memory leaks in plugins with Valgrind

I had an interesting IRC discussion the other day with Monty Taylor about what turned out to be a limitation in Valgrind with respect to debugging memory leaks in dynamically loaded plugins.

Monty Taylor's original problem was with Drizzle, but as it turns out, it is common to all of the MySQL-derived code bases. When there is a memory leak from an allocation in a dynamically loaded plugin, Valgrind will detect the leak, but the part of the stack trace that is within the plugin shows up as an unhelpful three question marks "???":

==1287== 400 bytes in 4 blocks are definitely lost in loss record 5 of 8
==1287==    at 0x4C22FAB: malloc (vg_replace_malloc.c:207)
==1287==    by 0x126A2186: ???
==1287==    by 0x7C8E01: ha_initialize_handlerton(st_plugin_int*) (handler.cc:429)
==1287==    by 0x88ADD6: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1033)
Which tells you little more than that there is a leak in one of your plugins.

After trying a couple of things, we found that this is a known limitation in Valgrind in relation to code that is loaded with dlopen() and later unloaded with dlclose():

http://bugs.kde.org/show_bug.cgi?id=79362

The basic problem is that Valgrind records the location of the malloc() call as just a memory address. And when the memory leak check is performed after the end of program execution, the plugin has been unloaded with dlclose(), and the recorded memory address is therefore no longer valid.

The problem is specific to memory leak checks, which are done only after the code has been unloaded. Other checks (like use of uninitialised values and use-after-free) work fine with full information in the stack traces, as such checks are done while the plugin code is still loaded into memory. But the memory leak checks are arguably among the most useful cheks Valgrind does, as Valgrind is often the only way to find and fix critical memory leaks efficiently.

Fortunately, once the issue was understood, we had an easy work-around: disable the dlclose() call in the server plugin code, and the leak is then detected with full information in the stack trace. Unfortunately this introduces a leak of its own, since now the memory allocated in dlopen() is never freed, so we get another spurious Valgrind memory leak warning.

Another possible way to get the same effect is to pass the RTLD_NODELETE flag to dlopen() to achieve the same effect, though I did not try this yet.

A possibly better work-around (which I also did not try yet) is one suggested in the above referenced Valgrind feature request. By adding the offending plugin(s) as LD_PRELOAD when starting the server, the plugin code will not actually be unloaded in dlclose(), so stack traces should be available without any spurious leak warnings from Valgrind. However, this will not work well if some of the dynamic plugins need a particular load order (according to the suggestion in the feature request). I also need to check if this actually works for plugins (like storage engines) that has link dependencies to symbols in the main program. But it might be a good option if it can be made to work.

(At first I was surprised to learn that this was a problem in MySQL and MariaDB, as I never saw it before. But I suppose the reason is that we so far have built most plugins as built-in, rather than as dynamically loaded .so files. The problem is likely to occur more frequently as we are moving to do more and more with plugins in MariaDB, so it is nice to know a work-around. Thanks, Monty!)

Tags: , , , , , , ,

(Leave a comment)

March 29th, 2010
11:03 am

[Link]

MariaDB talk at the OpenSourceDays 2010 conference

Earlier this month, I was at the OpenSourceDays 2010 conference, giving a talk on MariaDB (the slides from the talk are available).

The talk went quite well I think (though I probably talked way too fast as I usually do; at least that means that I finished on time with plenty room for questions..)

There was quite a bit of interest after the talk from many of the people who heard it. It was even reported on by the Danish IT media version2.dk (article in Danish).

Especially interesting to me was to discuss with three people from Danish site komogvind.dk, who told me fascinating details about their work keeping a busy site running; one of them even went right home to benchmark against MariaDB. Thanks to you, and to everyone else for your interest and time!

This time in the talk, I tried to also focus on the community and development aspects of MariaDB (in addition to the mandatory feature list and benchmark graphs, of course). To me, the most important thing about MariaDB is that we now have the infrastructure and community for people outside of MySQL to do fullscale development at the same level as inside MySQL. This was missing before. It is a much less concrete thing than features and benchmarks, so I found it much harder to present in a good way, without it turning into nothing but buzzwords. But from the feedback I got afterwards, it seems I succeeded pretty well with this part also, which I am especially happy about!

The talk was recorded on video by the organisers. The latest I heard was that the video footage is still being edited though (I was kind of waiting, hoping to be able to include a link to the video in this post). But if they do manage to finish the editing and make the videos available later, I will post an update.

A big thanks to the organisers of the OpenSourceDays 2010 conference! I had a great time, and hope to be back again next spring for OpenSourceDays 2011.

Tags: , , ,

(Leave a comment)

February 15th, 2010
12:44 pm

[Link]

Conference time!

It is conference time for me. I just came home from FOSDEM 2010 where we had a booth and I gave a talk. At the end of the month there will be a company meeting in Iceland for Monty Program, followed by Open Source Days 2010 where I will also be speaking. And then in April there is the MySQL User Conference. With two additional talks given at local user groups end of last year, I think I've about filled my quota for now, I feel quite fortunate that it turned out that I will not also be presenting at the UC! (I do not have a natural talent for speaking, and tend to need to spend quite a lot of time in preparations.)

MariaDB/PBXT booth at FOSDEM
Having a booth at FOSDEM turned out really well I think, as I got to talk to a lot of different people that passed by the booth. I also had a very nice dinner with the PostgreSQL people where I learned a lot about the internals of that database. As well as dinners with people from the MySQL world, also with lots of interesting discussions.

Thanks to all of the people that I met at FOSDEM. It was fun and inspiring to meet you, looking forward to the next time!

One thing strikes me as I am piecing together the mandatory "MariaDB feature list" for my next talk. It seems people tend to focus a lot on the extra features MariaDB has over MySQL. But there is another aspect that I think is just as important: MariaDB is creating an open framework where community developers can work together on new development. This is something that has been missing in the past.

It is easy to focus on a concrete list of features, whereas the idea of an abstract framework is much harder to present as more than buzzword talk. But I will try to get it into my next talk, as I think ultimately both are of equal importance.

Tags: , , ,

(2 comments | Leave a comment)

January 20th, 2010
11:50 am

[Link]

Why I work on Free Software

I happened upon this old LinuxJournal article about how the University of Zululand in South Africa used MySQL and other Free Software to make do with a 128 kbit (and later 768 kbit) internet connection for their staff and students.

This made me remember the trip I made to another African country, Burkina Faso, 15 years ago:

With the huge amount of work and numerous difficult obstacles facing my work on the MariaDB project, it can be hard to keep up motivation at times. It helps to remember why I am doing this.

It must be about the time when some of these kids should go to University or start up new projects. Maybe some of them will work with software. I want them to be able to invest in local skills and infrastructure that they need, rather than in software licenses funding nice houses and boats in other countries:

(House and yacht pictures courtesy of Wikipedia under the Creative Commons Attribution ShareAlike 3.0)

Tags: , ,

(1 comment | Leave a comment)

[<< Previous 10 entries]

Powered by LiveJournal.com