<?xml version='1.0' encoding='utf-8' ?>
<!--  If you are running a bot please visit this policy page outlining rules you must respect. http://www.livejournal.com/bots/  -->
<rss version='2.0' xmlns:lj='http://www.livejournal.org/rss/lj/1.0/' xmlns:media='http://search.yahoo.com/mrss/' xmlns:atom10='http://www.w3.org/2005/Atom'>
<channel>
  <title>Kristian Nielsen</title>
  <link>http://kristiannielsen.livejournal.com/</link>
  <description>Kristian Nielsen - LiveJournal.com</description>
  <lastBuildDate>Thu, 14 Feb 2013 15:23:13 GMT</lastBuildDate>
  <generator>LiveJournal / LiveJournal.com</generator>
  <lj:journal>kristiannielsen</lj:journal>
  <lj:journalid>9980544</lj:journalid>
  <lj:journaltype>personal</lj:journaltype>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/17238.html</guid>
  <pubDate>Thu, 14 Feb 2013 15:23:13 GMT</pubDate>
  <title>First steps with MariaDB Global Transaction ID</title>
  <link>http://kristiannielsen.livejournal.com/17238.html</link>
  <description>
&lt;p&gt;
My &lt;a href=&quot;http://kristiannielsen.livejournal.com/16826.html&quot; rel=&quot;nofollow&quot;&gt;previous&lt;/a&gt;
&lt;a href=&quot;http://kristiannielsen.livejournal.com/17008.html&quot; rel=&quot;nofollow&quot;&gt;writings&lt;/a&gt; were
mostly theoretical, so I wanted to give a more practical example, showing
the actual state of the current code. I also wanted to show how I have tried
to make the feature fit well into the existing replication features, without
requiring the user to enable lots of options or understand lots of
restrictions before being able to use it.
&lt;/p&gt;

&lt;p&gt;
So let us start! We will build the code from &lt;a href=&quot;https://code.launchpad.net/~maria-captains/maria/10.0-mdev26&quot; rel=&quot;nofollow&quot;&gt;
&lt;code&gt;lp:~maria-captains/maria/10.0-mdev26&lt;/code&gt;&lt;/a&gt;, which at the
time of writing is at revision
&lt;code&gt;knielsen@knielsen-hq.org-20130214134205-403yjqvzva6xk52j&lt;/code&gt;.
&lt;/p&gt;

&lt;p&gt;
First, we start a master server on port 3310 and put a bit of data into it:
&lt;pre&gt;
    server1&amp;gt; use test;
    server1&amp;gt; create table t1 (a int primary key, b int) engine=innodb;
    server1&amp;gt; insert into t1 values (1,1);
    server1&amp;gt; insert into t1 values (2,1);
    server1&amp;gt; insert into t1 values (3,1);
&lt;/pre&gt;
To provision a slave, we take a mysqldump:
&lt;pre&gt;
    bash$ mysqldump --master-data=2 --single-transaction -uroot test &amp;gt; /tmp/dump.sql
&lt;/pre&gt;
Note that with &lt;code&gt;--master-data=2 --single-transaction&lt;/code&gt; we obtain the
exact binlog position corresponding to the data in the dump. Since &lt;a href=&quot;https://kb.askmonty.org/en/enhancements-for-start-transaction-with-consistent/&quot; rel=&quot;nofollow&quot;&gt;MariaDB
5.3&lt;/a&gt;, this is completely non-blocking on the server (it does not do
&lt;code&gt;FLUSH TABLES WITH READ LOCK&lt;/code&gt;):
&lt;pre&gt;
    bash$ grep &quot;CHANGE MASTER&quot; /tmp/dump.sql
    -- CHANGE MASTER TO MASTER_LOG_FILE=&apos;master-bin.000001&apos;, MASTER_LOG_POS=910;
&lt;/pre&gt;
Meanwhile, the master server has a couple more transactions:
&lt;pre&gt;
    server1&amp;gt; insert into t1 values (4,2);
    server1&amp;gt; insert into t1 values (5,2);
&lt;/pre&gt;
Now let us start up the slave server on port 3311, load the dump, and start
replicating from the master:
&lt;pre&gt;
    bash$ mysql -uroot test &amp;lt; /tmp/dump.sql
    server2&amp;gt; change master to master_host=&apos;127.0.0.1&apos;, master_port=3310,
        master_user=&apos;root&apos;, master_log_file=&apos;master-bin.000001&apos;, master_log_pos=910;
    server2&amp;gt; start slave;
    server2&amp;gt; select * from t1;
    +---+------+
    | a | b    |
    +---+------+
    | 1 |    1 |
    | 2 |    1 |
    | 3 |    1 |
    | 4 |    2 |
    | 5 |    2 |
    +---+------+
    5 rows in set (0.00 sec)
&lt;/pre&gt;
So the slave is up to date. In addition, when the slave connects to the master, it
downloads the current GTID replication state, so everything is now ready for
using global transaction ID. Let us promote the slave as the new master,
and then later make the old master a slave of the new master. So stop the
slave thread on the old slave, and run another transaction to simulate it
being the new master:
&lt;pre&gt;
    server2&amp;gt; stop slave;
    server2&amp;gt; insert into t1 values (6,3);
&lt;/pre&gt;
Finally, let us attach the old master as a slave using global transaction ID:
&lt;pre&gt;
    server1&amp;gt; change master to master_host=&apos;127.0.0.1&apos;, master_port=3311,
        master_user=&apos;root&apos;, master_gtid_pos=auto;
    server1&amp;gt; start slave;
    server1&amp;gt; select * from t1;
    +---+------+
    | a | b    |
    +---+------+
    | 1 |    1 |
    | 2 |    1 |
    | 3 |    1 |
    | 4 |    2 |
    | 5 |    2 |
    | 6 |    3 |
    +---+------+
    6 rows in set (0.01 sec)
&lt;/pre&gt;
The old master is now running as a slave and is up to date with the new master.
&lt;/p&gt;

&lt;p&gt;
So that is it! A short post from me for once, but that is the whole
point. Replication with MariaDB Global Transaction ID works much as it always
did. The only new thing here is when we issue the &lt;code&gt;CHANGE MASTER&lt;/code&gt;
to make the old master a slave of the new master. We do not have to manually
try to compute or guess the correct binlog position on the new master; we can
just specify &lt;code&gt;MASTER_GTID_POS=AUTO&lt;/code&gt; and the servers figure out the
rest for themselves.
&lt;/p&gt;

&lt;p&gt;
I hope I managed to show more concretely my ideas with MariaDB Global
Transaction ID. Comments and questions are most welcome, as always. Everything
above is actual commands that work on the current code on
Launchpad. Everything else may or may not work yet, as this is work in
progress, just so you know!
&lt;/p&gt;
</description>
  <comments>http://kristiannielsen.livejournal.com/17238.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>programming</category>
  <category>replication</category>
  <category>database</category>
  <lj:security>public</lj:security>
  <lj:reply-count>9</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/17008.html</guid>
  <pubDate>Fri, 04 Jan 2013 15:55:23 GMT</pubDate>
  <title>More on global transaction ID in MariaDB</title>
  <link>http://kristiannielsen.livejournal.com/17008.html</link>
  <description>
&lt;p&gt;
I got some very good comments/questions on &lt;a href=&quot;http://kristiannielsen.livejournal.com/16826.html&quot; rel=&quot;nofollow&quot;&gt;my previous post on MariaDB global transaction ID&lt;/a&gt;, from Giuseppe and Robert (of  Tungsten fame). I thought a follow-up post would be appropriate to answer and further elaborate on the comments, as the points they raise are very important and interesting.
&lt;/p&gt;

&lt;p&gt;
(It also gives me the opportunity to explain more deeply a lot of interesting
design decisions that I left out in the first post for the sake of brevity and
clarity.)
&lt;/p&gt;


&lt;h3&gt;On crash-safe slave&lt;/h3&gt;

&lt;p&gt;
One of the things I really wanted to improve with global transaction ID is to
make the replication slaves more crash safe with respect to their current
replication state. This state is mostly persistently stored information about
which event(s) were last executed on the slave, so that after a server restart
the slave will know from which point in the master binlog(s) to resume
replication. In current (5.5 and earlier) replication, this state is stored
simply by continuously writing a file &lt;code&gt;relay-log.info&lt;/code&gt; after each
event executed. If the server crashes, this is very susceptible to corruption
where the contents of the file no longer match the actual state of tables in
the database.
&lt;/p&gt;

&lt;p&gt;
With MariaDB global transaction ID, the replication state is stored in the
following table instead of in a plain file:
&lt;pre&gt;
    CREATE TABLE rpl_slave_state (
	domain_id INT UNSIGNED NOT NULL,
	sub_id BIGINT UNSIGNED NOT NULL,
	server_id INT UNSIGNED NOT NULL,
	seq_no BIGINT UNSIGNED NOT NULL,
	PRIMARY KEY (domain_id, sub_id));
&lt;/pre&gt;
When a transaction is executed on the slave, this table is updated as part of
the transaction. So if the table is created with InnoDB (or other
transactional engine) and the replicated events also use transactional tables,
then the replication state is crash safe. DDL, or non-transactional engines
such as MyISAM, remain crash-unsafe of course.
&lt;/p&gt;

&lt;p&gt;
A global transaction ID in MariaDB consists of &lt;code&gt;domain_id&lt;/code&gt; as
described in the previous post, an increasing sequence number, and the
usual &lt;code&gt;server_id&lt;/code&gt;. Recall that the replication state with global
transaction ID consists of the last global transaction ID applied within each
independent replication stream, &lt;i&gt;i.e.&lt;/i&gt; a mapping
from &lt;code&gt;domain_id&lt;/code&gt; to global transaction ID. This is what the
table &lt;code&gt;rpl_slave_state&lt;/code&gt; provides.
&lt;/p&gt;

&lt;p&gt;
But what about the &lt;code&gt;sub_id&lt;/code&gt; in the above table? This is to prevent
concurrency issues when parallel replication is used. If we want to be able to
execute in parallel two transactions with the same &lt;code&gt;domain_id&lt;/code&gt;,
then these two transactions will both need to update the
table &lt;code&gt;rpl_slave_state&lt;/code&gt;. If the transactions would update the same
row in the table, then one transaction would have to wait on a row lock for
the other to commit. This would prevent any kind of group commit, which would
be a very serious performance bottleneck.
&lt;/p&gt;

&lt;p&gt;
So instead, each transaction inserts a new, separate row into the table to
record the new global transaction ID applied. There may thus be multiple
entries for a given &lt;code&gt;domain_id&lt;/code&gt;. The &lt;code&gt;sub_id&lt;/code&gt; is used to
distinguish them; it is simply a (local) integer that is increased for each
transaction received from the master binlog. Thus, at any given time the last
applied global transaction ID for a given &lt;code&gt;domain_id&lt;/code&gt; is the one with
the highest &lt;code&gt;sub_id&lt;/code&gt; in the table.
&lt;/p&gt;

&lt;p&gt;
In effect, the replication state is obtained with a query like this:
&lt;pre&gt;
    SELECT domain_id, server_id, seq_no
    FROM rpl_slave_state
    WHERE (domain_id, sub_id) IN
      (SELECT domain_id, MAX(sub_id) FROM rpl_slave_state GROUP BY domain_id)
&lt;/pre&gt;
Old rows are deleted when no longer needed.
&lt;/p&gt;
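&lt;p&gt;
As a rough illustration of the query above, here is a small Python sketch (not
the server code) that derives the replication state from rows shaped like
&lt;code&gt;rpl_slave_state&lt;/code&gt;. The function names and the sample rows are
invented for the example, and the purge step only mimics the idea of deleting
old rows; the source does not say when the server actually does this.
&lt;/p&gt;
&lt;pre&gt;
```python
from collections import namedtuple

# One row of a table shaped like rpl_slave_state (sample data is invented).
Row = namedtuple("Row", "domain_id sub_id server_id seq_no")


def latest_rows(rows):
    """Return {domain_id: row} keeping the row with the highest sub_id."""
    latest = {}
    for row in rows:
        best = latest.get(row.domain_id)
        if best is None or row.sub_id > best.sub_id:
            latest[row.domain_id] = row
    return latest


def replication_state(rows):
    """Return {domain_id: (server_id, seq_no)}, one entry per domain."""
    return {d: (r.server_id, r.seq_no) for d, r in latest_rows(rows).items()}


def purge_old_rows(rows):
    """Drop every row except the newest one in each domain."""
    return sorted(latest_rows(rows).values())


rows = [
    Row(1, 100, 5, 300010),
    Row(1, 101, 5, 300011),
    Row(2, 7, 9, 42),
]
print(replication_state(rows))   # {1: (5, 300011), 2: (9, 42)}
print(purge_old_rows(rows))
```
&lt;/pre&gt;
&lt;p&gt;
Purging does not change the result of &lt;code&gt;replication_state()&lt;/code&gt;, which
is why old rows are safe to delete once a newer row for the same domain exists.
&lt;/p&gt;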

&lt;p&gt;
Thus, two replicated transactions can be executed like this:
&lt;pre&gt;
    BEGIN;                                 BEGIN;

    UPDATE some_table                      UPDATE other_table
       SET value = value + 1                  SET name = &quot;foo&quot;
     WHERE id = 10;                         WHERE category LIKE &quot;food_%&quot;;

    INSERT INTO mysql.rpl_slave_state      INSERT INTO mysql.rpl_slave_state
       SET domain_id = 1,                     SET domain_id = 1,
	   sub_id = 100,                          sub_id = 101,
	   server_id = 5,                         server_id = 5,
	   seq_no = 300010;                       seq_no = 300011;

    COMMIT;                                COMMIT;
&lt;/pre&gt;
These two transactions can run completely independently, including the insert
into &lt;code&gt;rpl_slave_state&lt;/code&gt;. And the commits at the end can be done
together as a single group commit, where we ensure that the second one is
recorded as happening after the first one so commit order (and binlog order)
is preserved and visibility is correct (second transaction not visible to any
query without the first also being visible).
&lt;/p&gt;

&lt;p&gt;
Contrast this with how things would be with a rpl_slave_state table with a
single row per &lt;code&gt;domain_id&lt;/code&gt;:
&lt;pre&gt;
    BEGIN;                                 BEGIN;

    UPDATE some_table                      UPDATE other_table
       SET value = value + 1                  SET name = &quot;foo&quot;
     WHERE id = 10;                         WHERE category LIKE &quot;food_%&quot;;

    UPDATE bad_rpl_slave_state
       SET server_id = 5,
	   seq_no = 300010
     WHERE domain_id = 1;

    COMMIT;

					   UPDATE bad_rpl_slave_state
					      SET server_id = 5,
						  seq_no = 300011
					    WHERE domain_id = 1;

					   COMMIT;
&lt;/pre&gt;
Here the update of the replication state table in the second transaction would
have to wait for the first transaction to commit, because of row locks. Group
commit becomes impossible.
&lt;/p&gt;

&lt;p&gt;
(I actually explained this issue to the replication developers at
MySQL/Oracle a long time ago, but last time I looked at MySQL 5.6, they had
ignored it...)
&lt;/p&gt;


&lt;h3&gt;On where to store the replication state&lt;/h3&gt;

&lt;p&gt;
As Giuseppe pointed out, in
the &lt;a href=&quot;https://mariadb.atlassian.net/browse/MDEV-26&quot; rel=&quot;nofollow&quot;&gt;global transaction
  ID design&lt;/a&gt; it is still written that the replication state will be stored
in the slave binlog, not in the &lt;code&gt;rpl_slave_state&lt;/code&gt; table. Sorry
about this, I will get the document updated as soon as possible.
&lt;/p&gt;

&lt;p&gt;
I had basically two ideas for how to store the slave state in a crash-safe way:
&lt;ol&gt;
&lt;li&gt; In the slave&apos;s binlog.
&lt;li&gt; In a (transactional) table.
&lt;/ol&gt;
The big advantage of (2) is that it works also when the binlog is not enabled
on the slave. Since there can still be substantial overhead to enabling the
binlog, I currently plan to go with this approach.
&lt;/p&gt;

&lt;p&gt;
The advantage of (1) is that it is potentially cheaper when the binlog is
enabled on the slave, as it commonly will be when global transaction ID is
enabled (to be able to promote a slave as a new master, the binlog must be
enabled, after all). We already write every single global transaction ID
applied into the binlog, and if we crash, we already scan the binlog during
crash recovery. Thus, it is easy during crash recovery to rebuild the
replication state from the binlog contents. This way we get crash safe slave
state without the overhead of maintaining an
extra &lt;code&gt;rpl_slave_state&lt;/code&gt; table.
&lt;/p&gt;

&lt;p&gt;
It will be possible in the future to refine this, so that we could use method
(1) if binlog is enabled, else method (2). This might improve performance
slightly when binlog is enabled. But we should first benchmark to check if
such refinement will be worth it in terms of performance gained. It seems
likely that any gains will be modest, at best.
&lt;/p&gt;


&lt;h3&gt;On parallel replication&lt;/h3&gt;

&lt;p&gt;
Parallel replication is something that has been long overdue, but is now a
reality. MariaDB 10.0 will have &lt;a href=&quot;&quot;&gt;multi-source replication&lt;/a&gt;, which
is actually a form of parallel replication. MySQL 5.6 will
have &lt;a href=&quot;&quot;&gt;multi-threaded slave&lt;/a&gt;. &lt;a href=&quot;&quot;&gt;Galera&lt;/a&gt; can do
parallel replication, as can &lt;a href=&quot;&quot;&gt;Tungsten&lt;/a&gt; I believe, though I am
not familiar with details. There are several other mechanisms for parallel
replication planned for later MariaDB releases, like
&lt;a href=&quot;http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184&quot; rel=&quot;nofollow&quot;&gt;MWL#184&lt;/a&gt;
and &lt;a href=&quot;https://mariadb.atlassian.net/browse/MDEV-520&quot; rel=&quot;nofollow&quot;&gt;MDEV-520&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
It is thus very important to think parallel replication into the design of
global transaction ID from the start. I fully agree with Giuseppe&apos;s remarks
here about MySQL 5.6 replication features failing completely to do this. They
introduce in 5.6 three new features that require extensions to how the
replication state is stored: crash-safe slave, global transaction ID,
and multi-threaded slave. They have managed to do this by
introducing three completely different solutions. This is just
insane. It makes one wonder if Oracle management forbids Oracle developers to
talk to each other, just like we already know they prevent discussions with
the community ...
&lt;/p&gt;

&lt;p&gt;
So, in relation to global transaction ID there are basically two kinds of
parallel replication techniques: in-order and out-of-order. The two interact
with global transaction ID in different ways.
&lt;/p&gt;


&lt;h3&gt;On in-order parallel replication&lt;/h3&gt;

&lt;p&gt;
In-order is when two (or more) different transactions are executed in parallel
on the slave, but the commit of the second transaction is delayed to after the
first transaction has committed. Galera is an example of this, I
think. Planned tasks
&lt;a href=&quot;http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184&quot; rel=&quot;nofollow&quot;&gt;MWL#184&lt;/a&gt;
and possibly
&lt;a href=&quot;https://mariadb.atlassian.net/browse/MDEV-520&quot; rel=&quot;nofollow&quot;&gt;MDEV-520&lt;/a&gt;
are also in-order techniques.
&lt;/p&gt;

&lt;p&gt;
In-order parallel replication is transparent to applications and users (at
least with MVCC transactional engines like InnoDB), since changes only become
visible on COMMIT, and commits are done in a serial way. It is thus also
mostly transparent to global transaction ID, and does not need much special
consideration for the design.
&lt;/p&gt;

&lt;p&gt;
One thing that can be done, and that I am currently working on, is to
integrate in-order parallel replication with group commit. Suppose we run
transactions T1 and T2 in parallel on the slave, and suppose that T2 happens
to complete first so that we have to wait in T2&apos;s commit for T1 to commit
first. If we integrate this wait with group commit, we can actually commit T1
and T2 at the same time, taking care to write the commit records to the binlog
and to the storage engine in the right order (T1 before T2). This way, the
wait is likely to improve performance rather than reduce it, in fact.
&lt;/p&gt;
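&lt;p&gt;
The ordered wait itself can be sketched in a few lines of Python. This is only
a toy model under stated assumptions: the class &lt;code&gt;CommitOrderer&lt;/code&gt; and
the function &lt;code&gt;run_in_order&lt;/code&gt; are hypothetical names, each thread
stands in for one replicated transaction, and the sketch shows just the
"commit in master order" rule, not the group commit integration.
&lt;/p&gt;
&lt;pre&gt;
```python
import threading


class CommitOrderer:
    """Let workers finish in any order but commit strictly in sequence."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_to_commit = 0
        self.commit_log = []

    def commit(self, position, name):
        with self._cond:
            # Wait until every earlier transaction has committed.
            self._cond.wait_for(lambda: self._next_to_commit == position)
            self.commit_log.append(name)
            self._next_to_commit += 1
            self._cond.notify_all()


def run_in_order(names):
    """Run one thread per transaction and return the resulting commit order."""
    orderer = CommitOrderer()
    threads = []
    # Start the workers in reverse so later transactions reach commit first.
    for position in reversed(range(len(names))):
        worker = threading.Thread(
            target=orderer.commit, args=(position, names[position])
        )
        worker.start()
        threads.append(worker)
    for worker in threads:
        worker.join()
    return orderer.commit_log


print(run_in_order(["T1", "T2", "T3"]))   # ['T1', 'T2', 'T3']
```
&lt;/pre&gt;
&lt;p&gt;
Even though T3 and T2 reach their commit step first, they wait for T1, so the
commit log always comes out in the original order.
&lt;/p&gt;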


&lt;h3&gt;On out-of-order parallel replication&lt;/h3&gt;

&lt;p&gt;
Out-of-order parallel replication is when transactions can be committed in a
different order on the slave than on the master. The MySQL 5.6 multi-threaded
slave feature is an example of this.
&lt;/p&gt;

&lt;p&gt;
Out-of-order must be explicitly enabled by the application/DBA, because it
breaks fundamental semantics. If commits happen in different order, the slave
database may be temporarily in a state that never existed on the master and
may be invalid for the application. But if the application is written to
tolerate such inconsistencies, and explicitly declares this to the database,
then there may be potential for more parallelism than with in-order methods.
This can make out-of-order interesting.
&lt;/p&gt;

&lt;p&gt;
The typical example, which MySQL 5.6 multi-threaded slave uses, is when the
application declares that transactions against different schemas are
guaranteed independent. Different schemas can then be replicated independently
(though if the application messes up and transactions happen to not really be
independent, things can break). MariaDB 10.0 multi-source replication is
another example, where the application declares a guarantee that two master
servers can be replicated independently.
&lt;/p&gt;

&lt;p&gt;
Out-of-order creates a challenge for global transaction ID when switching to a
new master. Because events in the binlog on the new master are in different
order, there will not in general be a single place from which to start
replication without either losing or duplicating some event.
&lt;/p&gt;

&lt;p&gt;
MariaDB global transaction ID handles this by only allowing out-of-order
parallel replication between different replication domains, never within a
single domain. In effect, the DBA/application explicitly declares the possible
independent replication streams, and then it is sufficient to remember one
global transaction ID per &lt;code&gt;domain_id&lt;/code&gt; as the position reached within each independent stream.
&lt;/p&gt;

&lt;p&gt;
Thus, suppose we have a master where updates to schemas are independent, and
we want to replicate them in parallel on slaves. On the master, we configure
10 (say) domain IDs 20-29. When we log a global transaction ID to the binlog,
we set the &lt;code&gt;domain_id&lt;/code&gt; value to a hash of the used schema.
&lt;/p&gt;

&lt;p&gt;
On the slave, we then configure 10 SQL threads. Two received transactions with
different &lt;code&gt;domain_id&lt;/code&gt; can be executed in parallel. Two transactions
using the same schema will map to the same &lt;code&gt;domain_id&lt;/code&gt; and will thus
not be able to execute in parallel. Thus we get MySQL 5.6 style multi-threaded
slave almost for free, using the exact same mechanism as for executing
multi-source replication in parallel. The replication state on the slave will
in this case consist of the 10 different global transaction IDs reached within
each of the 10 replication domains. And they can be stored in the
table &lt;code&gt;rpl_slave_state&lt;/code&gt; just as described above. Thus replication
state for out-of-order parallel replication is fully integrated with the rest
of the design, needing no special mechanisms.
&lt;/p&gt;
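&lt;p&gt;
A minimal sketch of the schema-to-domain mapping described above, in Python.
The use of CRC32 is purely an assumption for illustration (the text only says
"a hash of the used schema"), and &lt;code&gt;domain_for_schema&lt;/code&gt; and
&lt;code&gt;can_run_in_parallel&lt;/code&gt; are hypothetical helper names.
&lt;/p&gt;
&lt;pre&gt;
```python
import zlib

FIRST_DOMAIN_ID = 20   # domain IDs 20-29, as in the example above
DOMAIN_COUNT = 10


def domain_for_schema(schema):
    """Map a schema name to one of the configured domain IDs."""
    # crc32 is only a stand-in for whatever hash would really be used.
    return FIRST_DOMAIN_ID + zlib.crc32(schema.encode("utf-8")) % DOMAIN_COUNT


def can_run_in_parallel(schema_a, schema_b):
    """Transactions may run in parallel only if their domains differ."""
    return domain_for_schema(schema_a) != domain_for_schema(schema_b)


for schema in ["shop", "blog", "wiki"]:
    print(schema, domain_for_schema(schema))
```
&lt;/pre&gt;
&lt;p&gt;
Two different schemas can hash to the same domain. That only costs some
parallelism; it never lets two transactions on the same schema run out of order.
&lt;/p&gt;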

&lt;p&gt;
And we can do more! The application (with suitable privileges) is allowed to
change &lt;code&gt;domain_id&lt;/code&gt; per-query. For example, we can run all normal
queries with &lt;code&gt;domain_id=1&lt;/code&gt;. But then if we have a long-running
maintenance query like an ALTER TABLE or something that updates every row in a
large table, we can execute it with &lt;code&gt;domain_id=2&lt;/code&gt;, if we take care
that no other queries conflict with it. This way, the long-running query can
run in parallel &quot;in the background&quot;, without causing any replication delay for
normal queries.
&lt;/p&gt;

&lt;p&gt;
In effect, the application or DBA now has great flexibility in declaring which
queries can replicate independently of (and thus in parallel with) each other,
and all this just falls out almost for free from the overall design. I foresee
that this will be a very powerful feature to have for large, busy replication
setups.
&lt;/p&gt;

&lt;p&gt;
Note btw. that most out-of-order parallel replication techniques can also be
done as in-order simply by delaying the final COMMIT steps of transactions to
happen in-order. This way one could for example do per-schema parallel
replication without polluting the replication state with many global
transaction IDs. This should generally achieve similar improvement in overall
throughput, though latency of individual transactions can be longer.
&lt;/p&gt;


&lt;h3&gt;On &quot;holes&quot; in the global transaction ID sequences&lt;/h3&gt;

&lt;p&gt;
Global transaction IDs have a sequence-number component, which ensures
uniqueness by being always increasing. This raises the issue of whether an
event will always have a sequence number exactly one bigger than the previous
event, or if it is allowed to have &quot;holes&quot;, where some sequence number is
never allocated to an event.
&lt;/p&gt;

&lt;p&gt;
For MariaDB global transaction ID, I took the approach that holes are
allowed. There are a number of good reasons for this.
&lt;/p&gt;

&lt;p&gt;
Mostly, I believe that a design that relies on &quot;no holes&quot; is a fundamental
mistake. In MySQL 5.6 global transaction ID, holes are absolutely not
allowed. If a hole ever turns up, you will be stuck with it literally
forever. The MySQL 5.6 replication state lists every sequence number not yet
applied on a slave server, so if one becomes missing it will forever
remain. Unless you remove it manually, and as far as I have been able to
determine, there are currently no facilities for this. Anyone who knows how
fragile MySQL replication can be should realise that this is a recipe for
disaster.
&lt;/p&gt;

&lt;p&gt;
Another point: because of the strict &quot;no holes&quot; requirement, in MySQL 5.6,
when events are filtered with &lt;code&gt;--replicate-ignore-db&lt;/code&gt; or whatever,
they had to change the code so that a dummy event is used to replace the
filtered event. In effect, you cannot really filter any events any more! I
think that alone should be enough to realise that the design is wrong.
&lt;/p&gt;

&lt;p&gt;
A more subtle point is that a strict &quot;no holes&quot; requirement makes it much
harder to handle allocation of new numbers correctly and scalably. Basically,
to allocate the next number in a multi-threaded environment, a lock needs to be
taken. We need to hold this lock for as short a time as possible to preserve
scalability. But then, what happens if we allocate some sequence number N to
transaction T1, and then later we get some failure that prevents T1 from
successfully committing and being written into the binlog? We now cannot
simply rollback T1, because some other transaction T2 may have already
allocated the next number, and then we would leave a hole. Subtle issues like
this are important to achieve good scalability.
&lt;/p&gt;

&lt;p&gt;
So I think it is wrong to base the design on never having holes. On the other
hand, there is no reason to deliberately introduce holes just for the fun of
it. Sequence numbers in MariaDB global transaction ID will generally be
without holes; it is just that nothing will break if somehow a hole should
sneak in.
&lt;/p&gt;
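&lt;p&gt;
To make the T1/T2 scenario concrete, here is a toy Python model of a sequence
number allocator that holds its lock only for the increment. The names
&lt;code&gt;SeqNoAllocator&lt;/code&gt; and &lt;code&gt;find_holes&lt;/code&gt; are made up for this
sketch, and the list &lt;code&gt;binlog&lt;/code&gt; merely stands in for the real binlog.
&lt;/p&gt;
&lt;pre&gt;
```python
import threading


class SeqNoAllocator:
    """Hand out increasing sequence numbers under a short lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = 1

    def allocate(self):
        with self._lock:
            seq_no = self._next
            self._next += 1
            return seq_no


def find_holes(logged):
    """Return numbers missing between the smallest and largest logged value."""
    if not logged:
        return []
    present = set(logged)
    return [n for n in range(min(logged), max(logged) + 1) if n not in present]


alloc = SeqNoAllocator()
binlog = []
t1 = alloc.allocate()      # T1 gets 1
t2 = alloc.allocate()      # T2 gets 2
t3 = alloc.allocate()      # T3 gets 3
binlog.append(t1)          # T1 commits
# T2 fails before reaching the binlog; its number is simply never used.
binlog.append(t3)          # T3 commits
print(binlog, find_holes(binlog))   # [1, 3] [2]
```
&lt;/pre&gt;
&lt;p&gt;
With holes allowed, the failed transaction needs no special handling: the
numbers that did get logged are still unique and increasing.
&lt;/p&gt;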

&lt;p&gt;
Also, whenever a global transaction ID is received on a slave server, the
server&apos;s own internal counter for the next sequence number to allocate will be
set to one more than the received sequence number, if it is currently
smaller. This gives a very nice property in a standard setup, where only one
master is ever written to at any one time: the sequence number in itself will be
globally unique and always increasing. This means that one can look at any two
global transaction IDs and immediately know which one comes first in the
history, which can be very useful. It also makes it possible to give a warning if
multiple masters are being written to without being configured with distinct
replication domain IDs. This is detected when a server replicates a global
transaction ID with the same &lt;code&gt;domain_id&lt;/code&gt; as its own but a smaller
sequence number.
&lt;/p&gt;
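&lt;p&gt;
The counter rule and the warning can be sketched like this in Python. It is a
simplified model, not the server logic: &lt;code&gt;GtidCounter&lt;/code&gt; is a
hypothetical class, and it assumes "smaller" means smaller than the highest
sequence number the server has seen or allocated so far in its own domain.
&lt;/p&gt;
&lt;pre&gt;
```python
class GtidCounter:
    """Track the highest sequence number seen in the server's own domain."""

    def __init__(self, domain_id):
        self.domain_id = domain_id
        self.last_seq_no = 0
        self.warnings = []

    def allocate(self):
        """Allocate the next sequence number for a local transaction."""
        self.last_seq_no += 1
        return self.last_seq_no

    def on_replicated(self, domain_id, server_id, seq_no):
        """Update the counter when a GTID arrives from a master."""
        if domain_id != self.domain_id:
            return  # other domains have their own, independent sequence
        if seq_no > self.last_seq_no:
            # Normal case: jump ahead so local numbers keep increasing.
            self.last_seq_no = seq_no
        elif seq_no != self.last_seq_no:
            # Assumption: an older number in our own domain suggests that
            # two masters are being written to with the same domain_id.
            self.warnings.append(
                "GTID %d-%d-%d is older than local seq_no %d"
                % (domain_id, server_id, seq_no, self.last_seq_no)
            )


counter = GtidCounter(domain_id=1)
counter.on_replicated(1, 1, 100)   # received from the master
first_local = counter.allocate()   # next local transaction gets 101
counter.on_replicated(1, 2, 50)    # stale number from another master
print(first_local, counter.warnings)
```
&lt;/pre&gt;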

&lt;p&gt;
In multi-master-like setups, sequence number by itself can no longer be
globally unique. But even here, if the system is correctly configured so that
each actively written master has its own &lt;code&gt;domain_id&lt;/code&gt;, sequence
number will be unique per &lt;code&gt;domain_id&lt;/code&gt; and will allow global
transaction IDs within one domain to be ordered (and of course, between different domains
there is no well-defined ordering anyway).
&lt;/p&gt;


&lt;h3&gt;On CHANGE MASTER TO syntax&lt;/h3&gt;

&lt;p&gt;
In the previous post, I did not really go into what new syntax will be
introduced for MariaDB global transaction ID, both for the sake of brevity,
and also because it has not really been fully decided yet.
&lt;/p&gt;

&lt;p&gt;
However, it will certainly be possible to set the replication slave state
directly, i.e. the global transaction ID to start replicating
from within each replication domain. For example something like this:
&lt;pre&gt;
    CHANGE MASTER TO master_host = &quot;master.lan&quot;,
           master_gtid_pos = &quot;1-1-100,2-5-300&quot;;
&lt;/pre&gt;
This is a fine and supported way to start replication. I just want to mention
that it will also be supported to start replication in the old way,
with &lt;code&gt;master_log_file&lt;/code&gt; and &lt;code&gt;master_log_pos&lt;/code&gt;, and the
master will automatically convert this into the
corresponding &lt;code&gt;master_gtid_pos&lt;/code&gt; and set this for the slave. This
can be convenient, as many tools like &lt;code&gt;mysqldump&lt;/code&gt; or XtraBackup
provide easy access to the old-style binlog position. It is certainly an
improvement over MySQL 5.6 global transaction ID, where the only documented
way to set up a slave involves RESET MASTER (!) on the master server...
&lt;/p&gt;

&lt;p&gt;
Incidentally, note that &lt;code&gt;master_gtid_pos&lt;/code&gt; has just one global
transaction ID per &lt;code&gt;domain_id&lt;/code&gt;, not one
per &lt;code&gt;server_id&lt;/code&gt;. Thus, if not using any form of multi-master, there
will be just one global transaction ID to set.
&lt;/p&gt;
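&lt;p&gt;
As an illustration of "one global transaction ID per domain", here is a short
Python sketch that parses a position string like the one above. It assumes the
three dash-separated fields are &lt;code&gt;domain_id&lt;/code&gt;,
&lt;code&gt;server_id&lt;/code&gt; and sequence number, in that order; the helper names
&lt;code&gt;parse_gtid_pos&lt;/code&gt; and &lt;code&gt;format_gtid_pos&lt;/code&gt; are invented
for the example.
&lt;/p&gt;
&lt;pre&gt;
```python
def parse_gtid_pos(text):
    """Parse 'domain-server-seq,...' into {domain_id: (server_id, seq_no)}."""
    state = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        domain_id, server_id, seq_no = (int(part) for part in item.split("-"))
        if domain_id in state:
            # The state holds exactly one GTID per replication domain.
            raise ValueError("more than one GTID for domain %d" % domain_id)
        state[domain_id] = (server_id, seq_no)
    return state


def format_gtid_pos(state):
    """Turn the parsed state back into a position string."""
    return ",".join(
        "%d-%d-%d" % (d, s, n) for d, (s, n) in sorted(state.items())
    )


state = parse_gtid_pos("1-1-100,2-5-300")
print(state)                    # {1: (1, 100), 2: (5, 300)}
print(format_gtid_pos(state))   # 1-1-100,2-5-300
```
&lt;/pre&gt;
&lt;p&gt;
The dictionary is keyed by domain, so its size depends only on the number of
replication domains, not on how many servers have ever acted as master.
&lt;/p&gt;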

&lt;p&gt;
So if we start with server 1 as the master, and then some time later switch
over to server 2 as the master, the binlog will have global transaction IDs
both with &lt;code&gt;server_id=1&lt;/code&gt; and &lt;code&gt;server_id=2&lt;/code&gt;. But the slave
binlog state will be just a single global transaction ID,
with &lt;code&gt;server_id=2&lt;/code&gt; in this case. Since binlog order is always the
same within one replication domain, a single global transaction ID is
sufficient to know the correct place to continue replication.
&lt;/p&gt;

&lt;p&gt;
I think this is a very nice property, that the size of the replication state
is fixed: one global transaction ID per configured replication domain. In
contrast, for MySQL 5.6 global transaction ID, any &lt;code&gt;server_id&lt;/code&gt; that
ever worked as master will remain in the replication state forever. If you
ever had a server id 666 you will still be stuck with it 10 years later when
you specify the replication state in CHANGE MASTER (assuming they will at some
point even allow specifying the replication state in CHANGE MASTER).
&lt;/p&gt;

&lt;p&gt;
Once the global transaction replication state is set, changing to a new master
could happen with something like this:
&lt;pre&gt;
    CHANGE MASTER TO master_host = &quot;master.lan&quot;,
                     master_gtid_pos = AUTO;
&lt;/pre&gt;
This is the whole point of global transaction ID, of course, to be able to do
this and automatically get replication started from the right point(s) in the
binlog on the new master.
&lt;/p&gt;

&lt;p&gt;
One idea that I have not yet decided about is to allow just this simple syntax:
&lt;pre&gt;
    CHANGE MASTER TO master_host = &quot;master.lan&quot;;
&lt;/pre&gt;
so that if no starting position is specified, and a global transaction state
already exists, we default to &lt;code&gt;master_gtid_pos = AUTO&lt;/code&gt;. This would be
a change in current behaviour, so maybe it is not a good idea. On the other
hand, the current behaviour is to start replication from whatever happens to
be the first not yet purged binlog file on the master, which is almost guaranteed
to be wrong. So it is tempting to change this simple syntax to just do the
right thing.
&lt;/p&gt;


&lt;h3&gt;On extensions to &lt;code&gt;mysqlbinlog&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;
Robert mentions some possible extensions to the &lt;code&gt;mysqlbinlog&lt;/code&gt;
program, and I agree with those.
&lt;/p&gt;

&lt;p&gt;
The master binlog is carefully designed so that it is easily possible to
locate any given global transaction ID and the corresponding binlog position
(or determine that such global transaction ID is not present in any binlog
files). In the initial design this requires scanning one (but just one) binlog
file from the beginning; later we could add an index facility if this becomes
a bottleneck. The &lt;code&gt;mysqlbinlog&lt;/code&gt; program should also support this,
probably by allowing a global transaction ID (or multiple IDs) to be specified
for &lt;code&gt;--start-position&lt;/code&gt; and &lt;code&gt;--stop-position&lt;/code&gt;.
&lt;/p&gt;

&lt;p&gt;
Robert also mentions the usefulness of an option to filter out events from
within just one replication domain/stream. This is something I had not thought
of, but it would clearly be useful and is simple to implement.
&lt;/p&gt;


&lt;h3&gt;On session variable &lt;code&gt;server_id&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;
With MariaDB global transaction ID, &lt;code&gt;server_id&lt;/code&gt; becomes a session
variable, as do newly introduced variables &lt;code&gt;gtid_domain_id&lt;/code&gt;
and &lt;code&gt;gtid_seq_no&lt;/code&gt;. This allows an external replication mechanism to
apply events from another server outside the normal replication mechanism, and
still preserve the global transaction ID when the resulting queries are logged
in the local binlog. One important use case for this is of course
point-in-time recovery, &lt;code&gt;mysqlbinlog | mysql&lt;/code&gt;. Here, mysqlbinlog
can set these variables to preserve the global transaction IDs on the events
applied, so that fail-over and so on will still work correctly.
&lt;/p&gt;

&lt;p&gt;
Since messing with &lt;code&gt;server_id&lt;/code&gt; and so on can
easily break replication, setting these variables requires SUPER privileges.
&lt;/p&gt;
</description>
  <comments>http://kristiannielsen.livejournal.com/17008.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>programming</category>
  <category>replication</category>
  <category>database</category>
  <lj:security>public</lj:security>
  <lj:reply-count>11</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/16826.html</guid>
  <pubDate>Thu, 03 Jan 2013 14:28:50 GMT</pubDate>
  <title>Global transaction ID in MariaDB</title>
  <link>http://kristiannielsen.livejournal.com/16826.html</link>
  <description>&lt;p&gt;
The main goal of global transaction ID is to make it easy to promote a new
master and switch all slaves over to continue replication from the new
master. This is currently harder than it could be, since the current
replication position for a slave is specified in coordinates that are specific
to the current master, and it is not trivial to translate them into the
corresponding coordinates on the new master. Global transaction ID solves this
by annotating each event with the &lt;b&gt;global transaction id&lt;/b&gt; which is unique
and universal across the whole replication hierarchy.
&lt;/p&gt;
&lt;p&gt;
In addition, there are at least two other main goals for &lt;a href=&quot;https://mariadb.atlassian.net/browse/MDEV-26&quot; rel=&quot;nofollow&quot;&gt;MariaDB global transaction
ID&lt;/a&gt;:
&lt;ol&gt;
&lt;li&gt; Make it easy to set up global transaction ID, and easy to provision a new
slave into an existing replication hierarchy.
&lt;li&gt; Fully support &lt;a href=&quot;https://kb.askmonty.org/en/multi-source-replication/&quot; rel=&quot;nofollow&quot;&gt;multi-source
replication&lt;/a&gt; and other similar setups.
&lt;/ol&gt;
&lt;/p&gt;


&lt;h3&gt;Replication streams&lt;/h3&gt;

&lt;p&gt;
&lt;a href=&quot;http://knielsen-hq.org/blog/img/gtid-topology-1.svg&quot; rel=&quot;nofollow&quot;&gt;
  &lt;img align=&quot;right&quot; src=&quot;http://knielsen-hq.org/blog/img/gtid-topology-1.png&quot;&gt;
&lt;/a&gt;
Let us consider the second point first, dealing with multi-source
replication. The figure shows a replication topology with five servers. Server
3 is a slave with two independent masters, server 1 and server 2. Server 3 is
in addition itself a master for two slaves, server 4 and server 5. The coloured
boxes A1, A2, ... and B1, B2, ... denote the binlogs in each server.
&lt;/p&gt;

&lt;p&gt;
When server 3 replicates events from its two master servers, events from
one master are applied independently from and in parallel with events from
the other master. So the events from server 1 and server 2 get interleaved
with each other in the binlog of server 3 in essentially arbitrary
order. However, an important point is that events from the same master are
still strictly ordered. A2 can be either before or after B1, but it will
always be after A1.
&lt;/p&gt;

&lt;p&gt;
When the slave server 4 replicates from master server 3, server 4 sees just a
single binlog stream, which is the interleaving of events originating in
server 1 and server 2. However, since the two original streams are fully
independent (by the way that multi-source replication works in MariaDB), they
can be freely applied in parallel on server 4 as well. This is very important!
It is already a severe performance bottleneck that replication from one master
is single-threaded on the slaves, so replicating events from multiple masters
also serially would make matters even worse.
&lt;/p&gt;

&lt;p&gt;
What we do is to annotate the events with a global transaction ID that
includes a tag that identifies the original source. In the figure, this is
marked with different colours, blue for events originating in server 1, and
red for server 2. Server 4 and server 5 are then free to apply a &quot;blue&quot; event
in parallel with a &quot;red&quot; event, and such parallel events can thus end up in
different order in the binlogs in different places in the replication
hierarchy. So every server can have a distinct interleaving of the two original
event streams, &lt;em&gt;but&lt;/em&gt; every interleaving respects the order within a
single original stream. In the figure, we see for example that A2 comes before
B1 in server 3 and server 5, but after it in server 4; however, it is always after
A1.
&lt;/p&gt;

&lt;p&gt;
This concept of having a collection of distinct binlog streams, each strictly
ordered but interleaved with each other in a relaxed way, is very powerful. It
allows both great flexibility (and hence opportunity for parallelism) in
applying independent events, as well as simple representation of the state of
each replication slave at each point in time. For each slave, we simply need
to remember the global transaction ID of the last event applied in each
independent stream. Then to switch a slave to a new master, the master finds
the corresponding places within its own binlog for each independent stream and
starts sending events from the appropriate location for each stream.
&lt;/p&gt;

&lt;p&gt;
For example, in the figure, we see that the state of server 4 is (A4, B3)
and for server 5 it is (A3, B3). Thus we can change server 5 to use server 4
as a master directly, as server 4 is at least as far along as server 5 in
every replication stream.
&lt;/p&gt;
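&lt;p&gt;
The comparison in this example can be sketched in a few lines of C. This is
only an illustration under simplifying assumptions: the names
&lt;code&gt;slave_state&lt;/code&gt;, &lt;code&gt;last_applied&lt;/code&gt; and
&lt;code&gt;can_serve_as_master&lt;/code&gt; are hypothetical and not taken from the
MariaDB source, and each stream position is reduced to a plain counter (so A4
becomes 4 in stream 0).
&lt;/p&gt;
&lt;pre&gt;
```c
/* Hypothetical representation, for illustration only: the state of a
   server is the number of the last event applied in each stream. */
enum { NUM_STREAMS = 2 };  /* stream 0 is "A" (blue), stream 1 is "B" (red) */

struct slave_state {
  unsigned long last_applied[NUM_STREAMS];
};

/* Returns 1 if candidate has applied at least as much as other in every
   stream, so that other could replicate from candidate directly;
   returns 0 otherwise. */
int can_serve_as_master(const struct slave_state *candidate,
                        const struct slave_state *other)
{
  int i;
  for (i = 0; i != NUM_STREAMS; i++) {
    if (other->last_applied[i] > candidate->last_applied[i])
      return 0;  /* candidate is behind in stream i */
  }
  return 1;
}
```
&lt;/pre&gt;
&lt;p&gt;
With server 4 at (4, 3) and server 5 at (3, 3), the function returns 1 for
server 4 as a master of server 5, and 0 the other way around.
&lt;/p&gt;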

&lt;p&gt;
Or if we want to instead make server 5 the new master, then we first need to
temporarily replicate from server 4 to server 5 up to (A4, B3). Then we can
switch over and make server 5 the new master. Note that in general such a
procedure may be necessary, as there may be no single server in the hierarchy
that is ahead of every other server in every stream if the original master
goes away. But since each stream is simply ordered, it is always possible to
bring one server up ahead to serve as a master for the others.
&lt;/p&gt;

&lt;h3&gt;Setup and provisioning&lt;/h3&gt;

&lt;p&gt;
This brings us back to the first point, about making it easy to set up
replication using global transaction ID, and easy to provision a new slave
into an existing replication hierarchy.
&lt;/p&gt;

&lt;p&gt;
To create a new slave for a given master, one can proceed exactly the same way
whether using global transaction ID or not. Make a copy of the master,
obtaining the corresponding binlog position (&lt;code&gt;mysqldump
--master-data&lt;/code&gt;, XtraBackup, whatever). Set up the copy as the new slave,
and issue &lt;code&gt;CHANGE MASTER TO ... MASTER_LOG_POS=...&lt;/code&gt; to start
replication. Then when the slave first connects, the master will send the last
global transaction ID within each existing replication stream, and the slave will thus
automatically be configured with the correct state. Then if there later is a
need to switch the slave to a different master, global transaction ID is
already properly set up.
&lt;/p&gt;

&lt;p&gt;
This works exactly because of the property that while we have potentially
interleaved distinct replication streams, each stream is strictly ordered
across the whole replication hierarchy. I believe this is a very important
point, and essential for getting a good global transaction ID design. The
notion of an ordered sequence of the statements and transactions executed on
the master is &lt;em&gt;the&lt;/em&gt; central core of MySQL replication; it is what users
know and what has made it so successful despite all its limitations.
&lt;/p&gt;


&lt;h3&gt;Replication domains&lt;/h3&gt;

&lt;p&gt;
To implement this design, MariaDB global transaction ID introduces the notion
of a &lt;em&gt;replication domain&lt;/em&gt; and an associated &lt;code&gt;domain_id&lt;/code&gt;. A
replication domain is just a server or group of servers that generate a
single, strictly ordered replication stream. Thus, in the example above,
there are two &lt;code&gt;domain_id&lt;/code&gt; values in play, corresponding to the two
colours blue and red. The global transaction ID includes the
&lt;code&gt;domain_id&lt;/code&gt;, and this way every event can be identified with its
containing replication stream.
&lt;/p&gt;

&lt;p&gt;
Another important point here is that &lt;code&gt;domain_id&lt;/code&gt; is something the
DBA configures explicitly. MySQL replication is all about the DBA having
control and flexibility. The existence of independent streams of events is a
property of the application of MySQL, not some server internal, so it needs to
be under the control of the user/DBA. In the example, one would configure
server 1 with &lt;code&gt;domain_id=1&lt;/code&gt; and server 2 with
&lt;code&gt;domain_id=2&lt;/code&gt;.
&lt;/p&gt;

&lt;p&gt;
Of course, in basic (and not so basic) replication setups where only one
master server is written to by applications at any one time, there is only a
single ordered event stream, so &lt;code&gt;domain_id&lt;/code&gt; can be ignored and
remain at the default (which is 0).
&lt;/p&gt;

&lt;p&gt;
Note by the way that &lt;code&gt;domain_id&lt;/code&gt; is different from
&lt;code&gt;server_id&lt;/code&gt;! It is possible and normal for multiple servers to
share the same &lt;code&gt;domain_id&lt;/code&gt;; for example, server 1 might be a slave
of some higher-up master server, and the two would then share the
&lt;code&gt;domain_id&lt;/code&gt;. One could even imagine that at some point in the
future, servers would have moved around so that server 2 was re-provisioned to
replace server 1; it would then retain its old &lt;code&gt;server_id&lt;/code&gt; but
change its &lt;code&gt;domain_id&lt;/code&gt; to 1. So both the blue and the red event
stream would have instances with &lt;code&gt;server_id=2&lt;/code&gt;, but
&lt;code&gt;domain_id&lt;/code&gt; will always be consistent.
&lt;/p&gt;

&lt;p&gt;
It is also possible for a single server to use multiple domain IDs. For
example, a DBA might configure events generated to receive as domain_id a hash
of the current schema. This would be a way of declaring that transactions in
distinct schemas are guaranteed to be independent, and it would allow slaves
to apply those independent transactions in parallel. The slave will just see
distinct streams, and apply them in parallel in the same way as for multi-source
replication. This is similar to the multi-threaded slave that MySQL 5.6
implements. But it is more flexible; for example, an application could
explicitly mark a long-running transaction with a distinct
&lt;code&gt;domain_id&lt;/code&gt;, and then ensure that it is independent of other
queries, allowing it to be replicated in parallel and not delay replication of
normal queries.
&lt;/p&gt;


&lt;h3&gt;Current status&lt;/h3&gt;

&lt;p&gt;
The MariaDB global transaction ID is work-in-progress, currently planned for
MariaDB 10.0.
&lt;/p&gt;

&lt;p&gt;
The current code is maintained on Launchpad: &lt;a href=&quot;https://code.launchpad.net/~maria-captains/maria/10.0-mdev26&quot; rel=&quot;nofollow&quot;&gt;&lt;code&gt;lp:~maria-captains/maria/10.0-mdev26&lt;/code&gt;&lt;/a&gt;. The
design is written up in detail in &lt;a href=&quot;https://mariadb.atlassian.net/browse/MDEV-26&quot; rel=&quot;nofollow&quot;&gt;Jira task MDEV-26&lt;/a&gt;,
where the progress is also tracked.
&lt;/p&gt;

&lt;p&gt;
Global transaction ID has already been discussed on the &lt;a href=&quot;https://launchpad.net/~maria-developers&quot; rel=&quot;nofollow&quot;&gt;maria-developers mailing
list&lt;/a&gt;. I have received valuable feedback there which has been included in
the current design. But I very much welcome additional feedback; I am open to
changing anything if it makes the end result better. Much of the community
seems to not be using mailing lists to their full potential (hint hint!),
hence this blog post to hopefully reach a wider audience that might be
interested.
&lt;/p&gt;
</description>
  <comments>http://kristiannielsen.livejournal.com/16826.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>programming</category>
  <category>replication</category>
  <category>database</category>
  <lj:security>public</lj:security>
  <lj:reply-count>3</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/16392.html</guid>
  <pubDate>Tue, 03 Jul 2012 11:17:51 GMT</pubDate>
  <title>Integer overflow</title>
  <link>http://kristiannielsen.livejournal.com/16392.html</link>
  <description>&lt;p&gt;
What do you think of this piece of C code?
&lt;pre&gt;
  void foo(long v) {
    unsigned long u;
    unsigned sign;
    if (v &amp;lt; 0) {
      u = -v;
      sign = 1;
    } else {
      u = v;
      sign = 0;
    }
    ...
&lt;/pre&gt;
Seems pretty simple, right? Then what do you think of this output from MySQL:
&lt;pre&gt;
  mysql&amp;gt; create table t1 (a bigint) as select &apos;-9223372036854775807.5&apos; as a;
  mysql&amp;gt; select * from t1;
  +----------------------+
  | a                    |
  +----------------------+
  | -&apos;..--).0-*(+,))+(0( | 
  +----------------------+
&lt;/pre&gt;
Yes, that is authentic output from older versions of MySQL. Not just the wrong
number, the output is complete garbage!
This is my all-time
favorite &lt;a href=&quot;http://bugs.mysql.com/bug.php?id=31799&quot; rel=&quot;nofollow&quot;&gt;MySQL bug#31799&lt;/a&gt;.
It was caused by code like the above C snippet.
&lt;/p&gt;

&lt;p&gt;
So can you spot what is wrong with the code? Looks pretty simple, does it not?
But the title of this post may give a hint...
&lt;/p&gt;

&lt;p&gt;
It is a little known fact that signed integer overflow is &lt;em&gt;undefined&lt;/em&gt;
in C! The code above contains such undefined behaviour. The
expression &lt;code&gt;-v&lt;/code&gt; &lt;em&gt;overflows&lt;/em&gt; when &lt;code&gt;v&lt;/code&gt; contains the
smallest negative integer of the long type (-2&lt;sup&gt;63&lt;/sup&gt; on 64-bit) - the
absolute value of this cannot be represented in the type. The correct way to
put the absolute value of signed &lt;code&gt;v&lt;/code&gt; into unsigned &lt;code&gt;u&lt;/code&gt;
is &lt;code&gt;u = (unsigned long)0 - (unsigned long)v&lt;/code&gt;. Unsigned
overflow &lt;em&gt;is&lt;/em&gt; well-defined in C, in contrast to signed overflow.
&lt;/p&gt;

&lt;p&gt;
And yes, GCC &lt;em&gt;will&lt;/em&gt; generate unexpected (but technically valid)
assembler for such code, as seen in Bug#31799. If you do not like this,
then use &lt;code&gt;-fno-strict-overflow&lt;/code&gt; like I believe PostgreSQL and the
Linux kernel do.
&lt;/p&gt;

&lt;p&gt;
(But it is better to write correct C code from the start.)
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/16392.html</comments>
  <category>freesoftware</category>
  <category>compiler</category>
  <category>mysql</category>
  <category>programming</category>
  <category>c</category>
  <category>database</category>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/16382.html</guid>
  <pubDate>Mon, 11 Jun 2012 12:48:35 GMT</pubDate>
  <title>Even faster group commit!</title>
  <link>http://kristiannielsen.livejournal.com/16382.html</link>
  <description>&lt;p&gt;
I found time to continue my &lt;a href=&quot;http://kristiannielsen.livejournal.com/12254.html&quot; rel=&quot;nofollow&quot;&gt;previous&lt;/a&gt; &lt;a href=&quot;http://kristiannielsen.livejournal.com/12408.html&quot; rel=&quot;nofollow&quot;&gt;work on&lt;/a&gt; &lt;a href=&quot;http://kristiannielsen.livejournal.com/12553.html&quot; rel=&quot;nofollow&quot;&gt; &lt;a href=&quot;http://kristiannielsen.livejournal.com/12553.html&quot; rel=&quot;nofollow&quot;&gt;group&lt;/a&gt; &lt;a href=&quot;http://kristiannielsen.livejournal.com/12810.html&quot; rel=&quot;nofollow&quot;&gt;commit&lt;/a&gt; for the
binary log in MariaDB.
&lt;/p&gt;

&lt;p&gt;
In current code, a (group) commit to InnoDB does no fewer than &lt;em&gt;three&lt;/em&gt; &lt;code&gt;fsync()&lt;/code&gt;
calls:
&lt;ol&gt;
&lt;li&gt; Once during InnoDB prepare, to make sure we can recover the transaction
in InnoDB if we crash after writing it to the binlog.
&lt;li&gt; Once after binlog write, to make sure we have the transaction in the
binlog before we irrevocably commit it in InnoDB.
&lt;li&gt; Once during InnoDB commit, to make sure we no longer need to scan the
binlog after a crash to recover the transaction.
&lt;/ol&gt;
Of course, in point 3, it really is not necessary to do an
&lt;code&gt;fsync()&lt;/code&gt; after &lt;em&gt;every&lt;/em&gt; (group) commit. In fact, it seems hardly
necessary to do such &lt;code&gt;fsync()&lt;/code&gt; at all! If we should crash before
the commit record hits the disk, we can always recover the transaction by
scanning the binlogs and checking which of the transactions in InnoDB prepared
state should be committed. Of course, we do not want to keep and scan years
worth of binlogs, but we need only &lt;code&gt;fsync()&lt;/code&gt; every so often, not
after every commit.
&lt;/p&gt;

&lt;p&gt; So I implemented &lt;a href=&quot;https://mariadb.atlassian.net/browse/MDEV-232&quot; rel=&quot;nofollow&quot;&gt;MDEV-232&lt;/a&gt;. This removes the &lt;code&gt;fsync()&lt;/code&gt; call in the
commit step in InnoDB. Instead, the binlog code requests from InnoDB (and any
other transactional storage engines) to flush all pending commits to disk when
the binlog is rotated. When InnoDB is done with the flush, it reports back to
the binlog code. We keep track of how far InnoDB has flushed by writing
so-called checkpoint events into the current binlog. After a crash, we first
scan the latest binlog. The last checkpoint event found will tell us if we
need to scan any older binlogs to be sure to find all commits that were not
durably committed inside InnoDB prior to the crash.
&lt;/p&gt;

&lt;p&gt;
The result is that we only need to do two &lt;code&gt;fsync()&lt;/code&gt; calls per
(group) commit instead of three.
&lt;/p&gt;

&lt;p&gt;
I benchmarked the code on a server with a good disk system - HP RAID
controller with a battery-backed disk cache. When the cache is enabled,
&lt;code&gt;fsync()&lt;/code&gt; is fast, around 400 microseconds. When the cache is
disabled, it is slow, several milliseconds. The setup should be mostly
comparable to Mark Callaghan&apos;s benchmarks &lt;a href=&quot;http://www.facebook.com/note.php?note_id=10150211546215933&quot; rel=&quot;nofollow&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://www.facebook.com/notes/mysql-at-facebook/group-commit-again/10150261692455933&quot; rel=&quot;nofollow&quot;&gt;here&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
I used sysbench &lt;code&gt;update_non_index.lua&lt;/code&gt; to make it easier for others
to understand/reproduce the test. This does a single update of one row in a
table in each transaction. I used 100,000 rows in the table. Group commit is
now so fast that at higher concurrencies, it is no longer the bottleneck. It
will be interesting to test again with the new InnoDB code from MySQL 5.6 and
any other scalability improvements that have been made there.
&lt;/p&gt;

&lt;h2&gt;Slow fsync()&lt;/h2&gt;

&lt;div align=&quot;center&quot;&gt;&lt;img src=&quot;http://knielsen-hq.org/blog/img/even-faster-group-commit-1.png&quot;&gt;&lt;/div&gt;

&lt;p&gt;
As can be seen, we have a very substantial improvement, around 30-60% more
commits per second depending on concurrency. Not only are we saving one out of
three expensive &lt;code&gt;fsync()&lt;/code&gt; calls, improvements to the locking done
during commit also allow more commits to share the same &lt;code&gt;fsync()&lt;/code&gt;.
&lt;/p&gt;

&lt;h2&gt;Fast fsync()&lt;/h2&gt;

&lt;div align=&quot;center&quot;&gt;&lt;img src=&quot;http://knielsen-hq.org/blog/img/even-faster-group-commit-2.png&quot;&gt;&lt;/div&gt;

&lt;p&gt;
Even with fast &lt;code&gt;fsync()&lt;/code&gt;, the improvements are substantial.
&lt;/p&gt;

&lt;p&gt;
I am fairly pleased with these results. There is still substantial overhead
from enabling the binlog (a slowdown of several times if &lt;code&gt;fsync()&lt;/code&gt;
time is the bottleneck), and I have a design for mostly solving this in &lt;a href=&quot;http://askmonty.org/worklog/Server-RawIdeaBin/?tid=164&quot; rel=&quot;nofollow&quot;&gt;MWL#164&lt;/a&gt;. But
I think perhaps it is now time to turn towards other more important areas. In
particular I would like to turn to &lt;a href=&quot;http://askmonty.org/worklog/Server-RawIdeaBin/?tid=184&quot; rel=&quot;nofollow&quot;&gt;MWL#184&lt;/a&gt; -
another method for parallel apply of events on slaves that can help in cases
where the per-database split of workload that exists in Tungsten and MySQL 5.6
cannot be used, like many updates to a single table. Improving throughput
even further on the master may not be the most important if slaves are already
struggling to keep up with current throughput, and this is another
relatively simple spin-off from group commit that could greatly help.
&lt;/p&gt;

&lt;p&gt;
For anyone interested, the current code is pushed to &lt;a href=&quot;https://code.launchpad.net/~maria-captains/maria/5.5-mdev232&quot; rel=&quot;nofollow&quot;&gt;&lt;code&gt;lp:~maria-captains/maria/5.5-mdev232&lt;/code&gt;&lt;/a&gt;.
&lt;/p&gt;

&lt;h2&gt;MySQL group commit&lt;/h2&gt;

&lt;p&gt;
It was an interesting coincidence that the new MySQL group commit preview was
&lt;a href=&quot;http://mysqlmusings.blogspot.dk/2012/06/binary-log-group-commit-in-mysql-56.html&quot; rel=&quot;nofollow&quot;&gt;published&lt;/a&gt; just as I was finishing this work. So I had the
chance to take a quick look and include it in the benchmarks (with slow
&lt;code&gt;fsync()&lt;/code&gt;):
&lt;div align=&quot;center&quot;&gt;&lt;img src=&quot;http://knielsen-hq.org/blog/img/even-faster-group-commit-3.png&quot;&gt;&lt;/div&gt;
&lt;p&gt;
While the implementation in MySQL 5.6 preview is completely different from
MariaDB (talk about &quot;not invented here ...&quot;), the basic design is now quite
similar, as far as I could gather from the code. A single thread writes all
transactions in the group into the binlog, in order; likewise a single thread
does the commits (to memory) inside InnoDB, in order. The storage engine
interface is extended with a &lt;code&gt;thd_get_durability_property()&lt;/code&gt;
callback for the engines - when the server returns HA_IGNORE_DURABILITY from this,
the InnoDB &lt;code&gt;commit()&lt;/code&gt; method is changed to work exactly like MariaDB
&lt;code&gt;commit_ordered()&lt;/code&gt;: commit to memory but do not sync to disk.
&lt;/p&gt;

&lt;p&gt;
(It remains to be seen what storage engine developers will think of MySQL implementing
a different API for the same functionality ...)
&lt;/p&gt;

&lt;p&gt;
The new MySQL group commit also removes the third &lt;code&gt;fsync()&lt;/code&gt; in the
InnoDB commit, same as the new MariaDB code. To ensure they can still recover
after a crash, they just call into the storage engines to sync all commits to
disk during binlog rotate. I actually like that from the point of view of simplicity -
even if it does stall commits for longer, it is unlikely to matter in
practice. What actually happens inside InnoDB in the two implementations is
identical.
&lt;/p&gt;

&lt;p&gt;
The new MySQL group commit is substantially slower than the new MariaDB group
commit in this benchmark. My guess is that this is in part due to suboptimal inter-thread
communication. As &lt;a href=&quot;http://kristiannielsen.livejournal.com/15739.html&quot; rel=&quot;nofollow&quot;&gt;I wrote about earlier&lt;/a&gt;, this is crucial to get best
performance at high commit rates, and the MySQL code seems to do additional
synchronisation between what they call stages - binlog write, binlog
&lt;code&gt;fsync()&lt;/code&gt;, and storage engine commit. Since the designs are now
basically identical, it should not be hard to get this fixed to perform the same as
MariaDB. (Of course, if they had started from my work, they could have spent
the effort improving that even more, rather than wasting it on catch-up).
&lt;/p&gt;

&lt;p&gt;
Note that the speedup from group commit (any version of it) is highly
dependent on the workload and the speed of the disk system. With fast
transactions, slow &lt;code&gt;fsync()&lt;/code&gt;, and high concurrency, the speedup
will be huge. With long transactions, fast &lt;code&gt;fsync()&lt;/code&gt;, and low
concurrency, the speedups will be modest, if any.
&lt;/p&gt;

&lt;p&gt; Incidentally, the new MySQL group commit is a change from the designs described &lt;a href=&quot;http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html&quot; rel=&quot;nofollow&quot;&gt;earlier&lt;/a&gt;, where individual commit threads would use
&lt;code&gt;pwrite()&lt;/code&gt; in parallel into the binary log. I am convinced this is
a good change. The writing to binlog is just &lt;code&gt;memcpy()&lt;/code&gt; between
buffers; a single thread can do gigabytes worth of that, so it is not where the
bottleneck is. What is crucial is to optimise the inter-thread communication,
as &lt;a href=&quot;http://kristiannielsen.livejournal.com/15739.html&quot; rel=&quot;nofollow&quot;&gt;I found out here&lt;/a&gt;, and lots of small parallel
&lt;code&gt;pwrite()&lt;/code&gt; calls into the same few data blocks at the end of a file
delivered to the file system are not likely to be a success. If binlog write
bandwidth really turns out to be a problem, the solution is to have
multiple logs in parallel - but I think we are quite far from being there yet.
&lt;/p&gt;

&lt;p&gt;
It is a pity that we cannot work together in the MySQL world. I approached the MySQL developers
several times over the past few years suggesting we work together, with no
success. There are trivial bugs in the MySQL group commit preview whose fixes
yield great speedups. I could certainly have used more input while doing my
implementation. The MySQL user community could have much better quality if we
would only work together.
&lt;/p&gt;

&lt;p&gt;
Instead, Oracle engineers use their own bugtracker which is not accessible to
others, push to their own development trees which are not accessible to
others, communicate on their own mailing lists which are not accessible to
others, hold their own developer meetings which are not accessible to others
... the list is endless.
&lt;/p&gt;

&lt;p&gt;
The most important task when MySQL was acquired was to collect the different
development groups working on the code base and create a real, great,
collaborative Open Source project. Oracle has totally botched this task up.
Instead, what we have is lots of groups each working on their own tree, with
no real interest in collaborating. I am amazed every time I read some prominent MySQL
community member praise the Oracle stewardship of MySQL. If these people are
not interested in creating a healthy Open Source project and just
want to not pay for their database software, why do they not go use the
express/cost-free editions of SQL Server or Oracle or whatever?
&lt;/p&gt;

&lt;p&gt;
It is kind of sad, really.
&lt;/p&gt;&lt;/a&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/16382.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>performance</category>
  <category>programming</category>
  <category>database</category>
  <lj:security>public</lj:security>
  <lj:reply-count>1</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/15893.html</guid>
  <pubDate>Wed, 22 Jun 2011 14:37:43 GMT</pubDate>
  <title>Tale of a bug</title>
  <link>http://kristiannielsen.livejournal.com/15893.html</link>
  <description>&lt;p&gt;
This is a tale of the
bug &lt;a href=&quot;https://bugs.launchpad.net/maria/+bug/798213&quot; rel=&quot;nofollow&quot;&gt;lp:798213&lt;/a&gt;. The
bug report has the initial report, and a summary of the real problem obtained
after detailed analysis, but it does not describe the process of getting
from the former to the latter. I thought it would be interesting to document
this, as the analysis of this bug was rather tricky and contains several good
lessons.
&lt;/p&gt;

&lt;h3&gt;Background&lt;/h3&gt;

&lt;p&gt;
The bug first manifested itself as a sporadic failure in one of
our &lt;a href=&quot;https://launchpad.net/randgen&quot; rel=&quot;nofollow&quot;&gt;random query generator&lt;/a&gt; tests
for replication. We run this test after all MariaDB pushes in our Buildbot
setup. However, this failure had only occurred twice in several months, so it
is clearly a very rare failure.
&lt;/p&gt;

&lt;p&gt;
The first task was to try to repeat the problem and get some more data in the
form of binlog files and so on. Philip kindly helped with this, and after
running the test repeatedly for several hours he finally managed to obtain a
failure and attach the available information to the initial bug report. Time
for analysis!
&lt;/p&gt;

&lt;h3&gt;Understanding the failure&lt;/h3&gt;

&lt;p&gt;
The first step is to understand what the test is doing, and what the failure
means.
&lt;/p&gt;

&lt;p&gt;
The test starts up a master server and exposes it to some random
parallel write load. Half-way through, with the server running at full speed,
it takes a non-blocking XtraBackup backup, and restores the backup into a new
slave server. Finally it starts the new slave replicating from the binlog position
reported by XtraBackup, and when the generated load is done and the slave
caught up, it compares the master and slave to check that they are consistent.
This test is an important check of &lt;a href=&quot;http://kristiannielsen.livejournal.com/12254.html&quot; rel=&quot;nofollow&quot;&gt;my&lt;/a&gt; &lt;a href=&quot;http://kristiannielsen.livejournal.com/12408.html&quot; rel=&quot;nofollow&quot;&gt;group&lt;/a&gt;
&lt;a href=&quot;http://kristiannielsen.livejournal.com/12553.html&quot; rel=&quot;nofollow&quot;&gt;commit&lt;/a&gt; &lt;a href=&quot;http://kristiannielsen.livejournal.com/12810.html&quot; rel=&quot;nofollow&quot;&gt;work&lt;/a&gt;,
which is carefully engineered to provide group commit while still preserving
the commit order and consistent binlog position that is needed by XtraBackup
to do such non-blocking provisioning of new slaves.
&lt;/p&gt;

&lt;p&gt;
The failure is that in a failed run, the master and slave are different when
compared at the end. The slave has a couple of extra rows (later I discovered
the bug could also manifest itself as a single row being different).
This is obviously not good, and needs to be investigated.
&lt;/p&gt;

&lt;h3&gt;Analysing the failure&lt;/h3&gt;

&lt;p&gt;
So this is a typical case of a &quot;hard&quot; failure to debug. We have binlogs with
100k queries or so, and a slave that somewhere in those 100k queries diverges
from the master. Working on problems like this, it is important to work
methodically, slowly but surely narrowing down the problem, coming up with
hypotheses about the behaviour and positively confirming or rejecting them, until
finally the problem is narrowed down sufficiently that the real cause is
apparent. Random poking around not only is likely to waste time, but far
worse, without a real understanding of the root cause of the failure, there is
a great danger of eventually tweaking things so that the failure happens to go
away in the test at hand, yet the underlying bug is still there. After all,
the failure was already highly sporadic to begin with.
&lt;/p&gt;

&lt;p&gt;
First I wanted to know if the problem is that replication diverges
(eg. because of non-deterministic queries in the statement-based replication),
or if it is a problem with the restored backup used to start the slave (wrong
data or starting binlog position). Clearly, I strongly suspected a wrong
starting binlog position, as this is what my group commit work messes
with. But as it turns out, this was &lt;i&gt;not&lt;/i&gt; the problem, again stressing the
need to always verify positively any assumptions made during debugging.
&lt;/p&gt;

&lt;p&gt;
To check this, I set up a new slave server from scratch, and had it replicate
from the master binlog all the way from the start to the end. I then compared
all three end results: (A) the original master; (B) the slave provisioned by
XtraBackup, and (C) the new slave replicated from the start of the binlogs. It
turns out that (A) and (C) are identical, while (B) differs. So this strongly
suggests a problem with the restored XtraBackup; the binlogs by themselves
replicate without problems.
&lt;/p&gt;

&lt;p&gt;
To go further, I needed to analyse the state of the slave server just after
the XtraBackup has been restored, without the effect of the thousands of
queries replicated afterwards. Unfortunately this was not saved as part of the
original test. It was trivial to add to the test (just copy away the backup to
a safe place before starting the slave server), but then the need came to
reproduce the failure again.
&lt;/p&gt;

&lt;p&gt;
This is another important step in debugging hard sporadic failures: Get to the
point where the failure can be reliably reproduced, at least for some
semi-reasonable meaning of &quot;reliable&quot;. This is really important not only to
help debugging, but also to be able to verify that a proposed bug fix actually
fixes the original bug! I have once or twice experienced a failure so
elusive that the only way to fix it was to blindly commit a possible fix, then wait
several months to see if the failure would reappear in that
interval. Fortunately, in the vast majority of cases, with a bit of work, this is not
necessary.
&lt;/p&gt;

&lt;p&gt;
Same here: After a bit of experimentation, I found that I could reliably
reproduce the failure by reducing the duration of the test from 5 minutes to
35 seconds, and running the test in a tight loop until it failed. It would
typically fail after 15-40 runs.
&lt;/p&gt;
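
&lt;p&gt;
Such a loop can be sketched in a few lines of Python. This is only an
illustration under assumptions, not the script actually used for this test:
the function names are made up, and the &lt;code&gt;./run-test.sh&lt;/code&gt; command
with its &lt;code&gt;--duration&lt;/code&gt; flag in the comment is hypothetical.
&lt;/p&gt;
&lt;pre&gt;
```python
import subprocess


def run_until_failure(run_once, max_runs=250):
    """Call run_once() repeatedly; return the 1-based number of the first
    failing run, or None if all max_runs runs passed."""
    for attempt in range(1, max_runs + 1):
        if not run_once():
            return attempt
    return None


def run_test_script(cmd):
    """Run one test iteration as an external command; exit status 0 is a pass."""
    return subprocess.run(cmd).returncode == 0


# Hypothetical usage (script name and flag are assumptions):
#   failed_at = run_until_failure(
#       lambda: run_test_script(["./run-test.sh", "--duration=35"]))
```
&lt;/pre&gt;
&lt;p&gt;
Capping the number of runs matters later on: a run of the loop that
reaches the cap without failing is itself a useful data point.
&lt;/p&gt;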

&lt;p&gt;
So now I had the state of the slave provisioned with XtraBackup as it was just
before it starts replicating. So what I did was to set up another slave server
from scratch and let it replicate from the master binlogs using START SLAVE
UNTIL with the binlog position reported by XtraBackup. If the XtraBackup and
its reported binlog start position are correct, these two servers should be
identical. But sure enough, a comparison showed that they differed! In this
case it was a single row that had different data. So this confirms the
hypothesis that the problem is with the restored XtraBackup data and/or binlog
position.
&lt;/p&gt;

&lt;p&gt;
So now, thinking it was the binlog position that was off, I naturally next
looked into the master binlog around this position, looking for an event just
before the position that was not applied, or an event just after that already
was applied. However, to my surprise I did not find this. I did find an event
just after that updated the table that had the wrong row. However, the data in
the update looked nothing like the data that was expected in the wrong
row. And besides, that update was part of a transaction updating multiple
tables; if that event was duplicated or missing, there would have been more
row differences in more tables, not just one row in a single table. I did find
an earlier event that looked somewhat related, however it was far back in the
binlog (so not resolvable by merely adjusting the starting binlog pos); and
besides again it was part of a bigger transaction updating more rows, while I
had only one row with wrong data.
&lt;/p&gt;

&lt;p&gt;
So at this point I need a new idea; the original hypothesis has been proven
false. The restored XtraBackup is clearly wrong in a single row, but nothing
in the binlog explains how this one row difference could have occurred. When
analysis runs up against a dead end, it is time to get more data. So I ran the
test for a couple of hours, obtained a handful more failures, and analysed them
in the same way. Each time I saw the XtraBackup differing from the master in
one row, or in one case the slave had a few extra rows.
&lt;/p&gt;

&lt;p&gt;
So this is strange. After we restore the XtraBackup, we have one row (or a few
rows) different from the master server. And those rows were updated in a
multi-row transaction. It is as if we are somehow missing part of a
transaction. Which is obviously quite bad, and indicates something bad going
on at a deep level.
&lt;/p&gt;

&lt;p&gt;
Again it is time to get more data. Now I try running the test with different
server options to see if it makes any difference. Running with
&lt;code&gt;--binlog-optimize-thread-scheduling=0&lt;/code&gt; still caused the failure, 
so it was not related to the
&lt;a href=&quot;http://kristiannielsen.livejournal.com/15515.html&quot; rel=&quot;nofollow&quot;&gt;thread scheduling
optimisation&lt;/a&gt; that I implemented. Then I noticed that the test runs with
the option &lt;code&gt;--innodb-release-locks-early=1&lt;/code&gt; enabled.
On a hunch I tried running without this option, and AHA! Without this option,
I was no longer able to repeat the failure, even after 250 runs!
&lt;/p&gt;

&lt;p&gt;
At this point, I start to strongly suspect a bug in the
&lt;code&gt;--innodb-release-locks-early&lt;/code&gt;
feature. But this is still not proven! It could also be that with the option
disabled, there is less opportunity for parallelism, hiding the problem which
could really be elsewhere. So I still needed to understand exactly what the
root cause of the problem is.
&lt;/p&gt;

&lt;h3&gt;Eureka!&lt;/h3&gt;

&lt;p&gt;
At this point, I had sufficient information to start just thinking about the
problem, trying to work out ways in which things could go wrong in a way that
would produce symptoms like what we see. So I started to think about how
&lt;code&gt;--innodb-release-locks-early&lt;/code&gt; works and how InnoDB undo and
recovery in general function. So I tried a couple of ideas, some did not seem
relevant...
&lt;/p&gt;

&lt;p&gt;
...and then, something occurred to me. What the
&lt;code&gt;--innodb-release-locks-early&lt;/code&gt;
feature does is to make InnoDB release row locks for a transaction earlier
than normal, just after the transaction has been prepared (but before it is
written into the binlog and committed inside InnoDB). This is to allow another
transaction waiting on the same row locks to proceed as quickly as possible.
&lt;/p&gt;

&lt;p&gt;
Now, this means that with proper timing, it is possible for such a second
transaction to &lt;i&gt;also&lt;/i&gt; prepare before the first has time to commit. At
this point we thus have &lt;i&gt;two&lt;/i&gt; transactions in the prepared state, both of
which modified the same row. If we were to take an XtraBackup snapshot at this
exact moment, upon restore XtraBackup would need to roll back both of those
transactions (the situation would be the same if the server crashed at that
point and later did crash recovery).
&lt;/p&gt;

&lt;p&gt;
This raises the question: will such a rollback work correctly? This certainly is
not something that could occur in InnoDB before the patch for
&lt;code&gt;--innodb-release-locks-early&lt;/code&gt; was implemented, and from
my knowledge of the patch, I know it does not explicitly do anything to
make this work. Aha! So now we have a new hypothesis: Rollback of multiple
transactions modifying the same row causes problems.
&lt;/p&gt;

&lt;p&gt;
To test this hypothesis, I used
the &lt;a href=&quot;http://forge.mysql.com/wiki/MySQL_Internals_Test_Synchronization#Debug_Sync_Facility&quot; rel=&quot;nofollow&quot;&gt;Debug_Sync&lt;/a&gt;
facility to create a mysql-test-run test case. This test case runs a
few transactions in parallel, all modifying a common row, then starts them
committing but pauses the server when some of them are still in the prepared
state. At this point it takes an XtraBackup snapshot. I then tried restoring
the XtraBackup snapshot to a fresh server. Tada... but unfortunately this did
not show any problems, the restore looked correct.
&lt;/p&gt;

&lt;p&gt;
However, while restoring, I noticed that the prepared-but-not-committed
transactions seem to be rolled back in reverse order of InnoDB transaction
ID. So this got me thinking - what would happen if they were rolled back in a
different order? Indeed, for multiple transactions modifying a common row, the
rollback order is critical! The &lt;i&gt;first&lt;/i&gt; transaction to modify the row
must be rolled back &lt;i&gt;last&lt;/i&gt;, so that the correct before-image is left in
the row. If we were to roll back in a different order, we could end up
restoring the wrong before-image of the row - which would result in exactly
the kind of single-row corruption that we seem to experience in the test
failure! So we are getting close it seems. Since InnoDB seems to roll back
transactions in reverse order of transaction ID, and since transaction IDs are
presumably allocated in order of transaction &lt;i&gt;start&lt;/i&gt;, maybe by starting
the transactions in a different order, the failure can be provoked.
&lt;/p&gt;

&lt;p&gt;
And sure enough, modifying the test case so that the transactions are started
in the opposite order causes it to show the failure! During XtraBackup
restore, the last transaction to modify the row is rolled back last, so that
the row ends up with the value that was there just before the last update. But
this value is wrong, as it was written there by a transaction that was itself
rolled back. So we have the corruption reproduced with a small, repeatable
test case, and the root cause of the problem completely understood. Task
solved!
&lt;/p&gt;
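
&lt;p&gt;
The effect of the rollback order can be shown with a toy model in Python. This
is a sketch under simplifying assumptions, not InnoDB&apos;s actual undo
implementation: there is a single row, each prepared transaction updates it
exactly once and saves the value it overwrote as its before-image, and
rollback simply restores before-images in descending transaction ID order.
The function name and the values are made up for illustration.
&lt;/p&gt;
&lt;pre&gt;
```python
def rollback_prepared(initial, updates):
    """Apply updates to one row by prepared transactions, then roll them
    all back in descending transaction-ID order.

    updates: list of (trx_id, new_value) in the order the row was modified.
    Returns the row value left after all rollbacks.
    """
    row = initial
    undo = {}  # trx_id -> before-image saved when that transaction wrote the row
    for trx_id, new_value in updates:
        undo[trx_id] = row
        row = new_value
    for trx_id in sorted(undo, reverse=True):
        row = undo[trx_id]
    return row


# IDs allocated in the same order as the updates: rollback restores 10.
in_order = rollback_prepared(10, [(1, 20), (2, 30)])
# Transaction 1 started first but updated the row last: rollback leaves 20,
# a value written by a transaction that was itself rolled back.
swapped = rollback_prepared(10, [(2, 20), (1, 30)])
print(in_order, swapped)
```
&lt;/pre&gt;
&lt;p&gt;
In the second call the last transaction to modify the row is rolled back last,
which is exactly the ordering that produced the wrong row in the test case.
&lt;/p&gt;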

&lt;p&gt;
(Later I cleaned up the test case to crash the server and work with crash
recovery instead; this is simpler as it does not involve XtraBackup. Though
this also involves the XA recovery procedure with the binlog, the root problem
is the same and shows the same failure. As to a fix for the bug, that remains
to be seen. I wrote some ideas in the bug report, but it appears non-trivial
to fix. The &lt;code&gt;--innodb-release-locks-early&lt;/code&gt; feature is originally
from the &lt;a href=&quot;https://launchpad.net/mysqlatfacebook&quot; rel=&quot;nofollow&quot;&gt;Facebook patch&lt;/a&gt;;
maybe they will fix it, or maybe we will remove the feature from MariaDB 5.3 before
GA release. Corrupt data is a pretty serious bug, after all.)
&lt;/p&gt;

&lt;h3&gt;Lessons learned&lt;/h3&gt;

&lt;p&gt;
I think there are some important points to learn from this debugging story:
&lt;ol&gt;
&lt;li&gt; When working on high-performance system code, some debugging problems are
  just inherently &lt;i&gt;hard&lt;/i&gt;! Things happen in parallel, and we operate with
  complex algorithms and high quality requirements on the code. But with a
  methodical approach, even the hard problems can be solved eventually.
&lt;li&gt; It is important not to ignore test failures in the test frameworks (such
  as Buildbot), no matter how random, sporadic, and/or elusive they may appear
  at first. True, many of them are false positives or defects in the test
  framework rather than the server code. But some of them are true bugs, and
  among them are some of the most serious and yet difficult bugs to track
  down. The debugging in this story is &lt;i&gt;trivial&lt;/i&gt; compared to how the
  story would have been if this had to be debugged in a production setting at
  a support customer. Much nicer to work from a (semi-)repeatable test failure
  in Buildbot!
&lt;/ol&gt;
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/15893.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>debugging</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>6</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/15739.html</guid>
  <pubDate>Thu, 24 Mar 2011 16:49:39 GMT</pubDate>
  <title>Benchmarking thread scheduling in group commit, part 2</title>
  <link>http://kristiannielsen.livejournal.com/15739.html</link>
  <description>&lt;p&gt;
I got access to our 12-core Intel server, so I was able to do some better
benchmarks to test the different &lt;a href=&quot;http://kristiannielsen.livejournal.com/15515.html&quot; rel=&quot;nofollow&quot;&gt;group commit thread scheduling&lt;/a&gt; methods:
&lt;/p&gt;
&lt;div align=&quot;center&quot;&gt;
&lt;img src=&quot;http://knielsen-hq.org/maria/group-commit-thread-scheduling-graph.png&quot;&gt;
&lt;/div&gt;
&lt;p&gt;
This graph shows queries-per-second as a function of number of parallel
connections, for three test runs:
&lt;ol&gt;
&lt;li&gt; Baseline MariaDB, without group commit.
&lt;li&gt; MariaDB with group commit, using the simple thread scheduling, where the
  serial part of the group commit algorithm is done by each thread signalling
  the next one.
&lt;li&gt; MariaDB with group commit and optimised thread scheduling, where the
  first thread does the serial group commit processing for all transactions at
  once, in a single thread.
&lt;/ol&gt;
(see the previous post linked above for a more detailed explanation of the two
thread scheduling algorithms.)
&lt;/p&gt;

&lt;p&gt;
This test was run on a 12-core server with hyper-threading and
24 GByte of memory. MariaDB was running with datadir in &lt;code&gt;/dev/shm&lt;/code&gt; (Linux ram
disk), to simulate a really fast disk system and maximise the stress on the
CPUs. Binlog is enabled with &lt;code&gt;sync_binlog=1&lt;/code&gt;
and &lt;code&gt;innodb_flush_log_at_trx_commit=1&lt;/code&gt;. Table type is InnoDB.
&lt;/p&gt;

&lt;p&gt;
I use &lt;a href=&quot;https://launchpad.net/gypsy&quot; rel=&quot;nofollow&quot;&gt;Gypsy&lt;/a&gt; to generate the client
load, which is simple auto-commit primary key updates:
&lt;pre&gt;
    REPLACE INTO t (a,b) VALUES (?, ?)
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
The graph clearly shows the optimised thread scheduling algorithm to improve
scalability. As expected, the effect is more pronounced on the twelve-core
server than on the 4-core machine I tested on previously. The optimised thread
scheduling has around 50% higher throughput at higher concurrencies, while the
naive thread scheduling algorithm suffers from scalability problems to the
degree that it is only slightly better than no group commit at all (but
remember that this is on ram disk, where group commit is hardly needed in the
first place).
&lt;/p&gt;

&lt;p&gt;
There is no doubt that this kind of optimised thread scheduling involves some
complications and trickery. Running one part of a transaction in a different
thread context from the rest does have the potential to cause subtle bugs.
&lt;/p&gt;

&lt;p&gt;
On the other hand, we are moving fast towards more and more CPU cores and more
and more I/O resources, and scalability just keeps getting more and more
important. If we can scale MariaDB/MySQL with the hardware improvements, more
and more applications can make do with scale-up rather than scale-out, which
significantly simplifies the system architecture.
&lt;/p&gt;

&lt;p&gt;
So I am just not comfortable introducing more serialisation (e.g. more global
mutex contention) in the server than absolutely necessary. That is why I did
the optimisation in the first place even without testing. Still, the question
is if an optimisation that only has any effect above 20,000 commits per second
is worth the extra complexity? I think I still need to think this over to
finally make up my mind, and discuss with other MariaDB developers, but at
least now we have a good basis for such discussion (and fortunately, the code
is easy to change one way or the other).
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/15739.html</comments>
  <category>mariadb</category>
  <category>mysql</category>
  <category>performance</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>8</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/15515.html</guid>
  <pubDate>Wed, 23 Mar 2011 13:03:33 GMT</pubDate>
  <title>Benchmarking thread scheduling in group commit</title>
  <link>http://kristiannielsen.livejournal.com/15515.html</link>
  <description>&lt;p&gt;
The best part of the recent &lt;a href=&quot;http://askmonty.org/blog/newsletters-from-the-mariadb-dev-meeting-in-lisbon/&quot; rel=&quot;nofollow&quot;&gt;MariaDB meeting&lt;/a&gt; in Lisbon for me
was that I got some good feedback on &lt;a href=&quot;http://kristiannielsen.livejournal.com/12254.html&quot; rel=&quot;nofollow&quot;&gt;my&lt;/a&gt; &lt;a href=&quot;http://kristiannielsen.livejournal.com/12408.html&quot; rel=&quot;nofollow&quot;&gt;group&lt;/a&gt;
&lt;a href=&quot;http://kristiannielsen.livejournal.com/12553.html&quot; rel=&quot;nofollow&quot;&gt;commit&lt;/a&gt; &lt;a href=&quot;http://kristiannielsen.livejournal.com/12810.html&quot; rel=&quot;nofollow&quot;&gt;work&lt;/a&gt;. This has been waiting in the review
queue for quite some time now.
&lt;/p&gt;

&lt;p&gt;
One comment I got revolves around an optimisation in the implementation related
to how threads are scheduled.
&lt;/p&gt;

&lt;p&gt;
A crucial step in the group commit algorithm is when the transactions being
committed have been written into the binary log, and we want to commit them in
the storage engine(s) &lt;b&gt;in the same order&lt;/b&gt; as they were committed in the
binlog. This ordering requirement makes that part of the commit process
serialised (think global mutex).
&lt;/p&gt;

&lt;p&gt;
Even though care is taken to make this serial part very quick to run inside
the storage engine(s), I was still concerned about how it would impact
scalability on multi-core machines. So I took extra care to minimise the time
spent on the server layer in this step.
&lt;/p&gt;

&lt;p&gt;
Suppose we have three transactions being committed as a group, each running in
their own connection thread in the server. It would be natural to let the
first thread do the first commit, then have the first thread signal the second
thread to do the second commit, and finally have the second thread signal the
third thread. The problem with this is that now the inherently serial part of
the group commit not only includes the work in the storage engines, it also
includes the time needed for two context switches (from thread 1 to thread 2,
and from thread 2 to thread 3)! This is particularly costly if, after finishing
with thread 1, we end up having to wait for thread 2 to be scheduled because
all CPU cores are busy.
&lt;/p&gt;

&lt;p&gt;
So what I did instead was to run all of the serial part in a single thread
(the thread of the first transaction). The single thread will handle the
commit ordering inside the storage engine for all the transactions, and the
remaining threads will just wait for the first one to wake them up. This means
the context switches for the waiting threads are not included in the serial
part of the algorithm. But it also means that the storage engines need to be
prepared to run this part of the commit in a separate thread from the rest of
the transaction.
&lt;/p&gt;
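
&lt;p&gt;
The single-thread variant can be illustrated with a small Python model. This
is a sketch only, not the MariaDB server code: it assumes the group is already
formed and given in binlog order, uses one Python thread per transaction, and
all names in it are hypothetical. The first thread (the leader) runs the
serial step for every transaction and only afterwards wakes the others.
&lt;/p&gt;
&lt;pre&gt;
```python
import threading


def commit_group(names):
    """Commit one non-empty group of transactions, given in binlog order.

    Only the first thread (the leader) runs the serial engine-commit step,
    for every transaction in the group; the other threads sleep until the
    leader wakes them. Returns a list of (transaction, committing thread).
    """
    done = {name: threading.Event() for name in names}
    log = []  # only the leader appends, so no lock is needed

    def leader():
        for name in names:  # the serial part: no thread hand-offs inside
            log.append((name, threading.current_thread().name))
        for name in names[1:]:
            done[name].set()  # wake the followers after the serial part

    def follower(name):
        done[name].wait()  # its commit has already been done by the leader

    threads = [threading.Thread(target=leader, name=names[0])]
    threads += [threading.Thread(target=follower, args=(n,), name=n)
                for n in names[1:]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


result = commit_group(["trx1", "trx2", "trx3"])
print(result)
```
&lt;/pre&gt;
&lt;p&gt;
Every entry in the result names the leader as the committing thread, which is
the point of the design: waking the followers happens outside the serial part.
&lt;/p&gt;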

&lt;p&gt;
So, in Lisbon there was some discussion about whether the modifications I made to
InnoDB/XtraDB for this were sufficient to ensure that there would not be any
problems with running this part of the commit in a different thread. After
all, this requirement &lt;em&gt;is&lt;/em&gt; a complication. And then the question came
up whether the above optimisation is actually needed. Does it notably increase
performance?
&lt;/p&gt;

&lt;p&gt;
Now, that is a good question, and I did not have an answer as I had never tested it.
So now I did! I added an
option &lt;code&gt;--binlog-optimize-thread-scheduling&lt;/code&gt; to allow switching
between the naive and the optimised way to handle the commit of the different
transactions in the serial part of the algorithm, and benchmarked them against
each other.
&lt;/p&gt;

&lt;p&gt;
Unfortunately, the two many-core servers we have for testing were
both unavailable (our hosting and quality of servers leave a lot to be
desired). So I was left to test on a 4-core (8 threads with
hyperthreading) desktop box I have in my own office. I was able to get some
useful results from this nevertheless, though I hope to revisit the benchmark
later on more interesting hardware.
&lt;/p&gt;

&lt;p&gt;
In order to stress the group commit code maximally, I used a synthetic workload
with as many commits per second as possible. I used the fastest disk I have
available, &lt;code&gt;/dev/shm&lt;/code&gt; (Linux ramdisk). The transactions are
single-row updates of the form
&lt;pre&gt;
    REPLACE INTO t (a,b) VALUES (?, ?)
&lt;/pre&gt;
The server is an Intel Core i7 quad-core with hyperthreading enabled. It has
8GByte of memory. I used &lt;a href=&quot;https://launchpad.net/gypsy&quot; rel=&quot;nofollow&quot;&gt;Gypsy&lt;/a&gt; to generate the load.
Table type is XtraDB. The server is running
with &lt;code&gt;innodb_flush_log_at_trx_commit=1&lt;/code&gt;
and &lt;code&gt;sync_binlog=1&lt;/code&gt;.
&lt;/p&gt;

&lt;p&gt;
Here are the results in queries per second, with different number of
concurrent connections running the queries:
&lt;div align=&quot;center&quot;&gt;
&lt;table border=&quot;1&quot;&gt;
&lt;tr&gt;
&lt;th&gt;Number of connections&lt;/th&gt;
&lt;th&gt;QPS (naive scheduling)&lt;/th&gt;
&lt;th&gt;QPS (optimised scheduling)&lt;/th&gt;
&lt;th&gt;QPS (binlog disabled)&lt;/th&gt;
&lt;/tr&gt;
&lt;tr&gt;&lt;td align=&quot;right&quot;&gt;16&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;21700&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;23600&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29000&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align=&quot;right&quot;&gt;32&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;19000&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;22500&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;29700&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td align=&quot;right&quot;&gt;128&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;18000&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;19500&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;26800&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
So as we see from this table, even with just four cores we see noticeably
better performance by running the serial part of group commit in a single
thread. The improvement is roughly 8% to 18%, depending on parallelism. So I think
this means that I will want to keep the optimised version.
&lt;/p&gt;
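
&lt;p&gt;
The relative gain at each concurrency level can be worked out directly from
the table. A small Python sketch of the arithmetic follows; the QPS figures
are copied from the table, while the variable and function names are just
illustrative.
&lt;/p&gt;
&lt;pre&gt;
```python
# QPS figures copied from the table: connections -> (naive, optimised).
qps = {16: (21700, 23600), 32: (19000, 22500), 128: (18000, 19500)}


def improvement_percent(naive, optimised):
    """Relative throughput gain of optimised over naive scheduling, in percent."""
    return 100.0 * (optimised - naive) / naive


gains = {conns: round(improvement_percent(*pair), 1)
         for conns, pair in qps.items()}
print(gains)  # {16: 8.8, 32: 18.4, 128: 8.3}
```
&lt;/pre&gt;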

&lt;p&gt;
It is nice to see that we can get &amp;gt; 20k commits/second with the group commit
code on cheap desktop hardware. For real servers the I/O subsystem will
probably be a bottleneck, but that is what I wanted to see: that the group
commit code will not limit the ability to fully utilise high amounts of I/O
resources.
&lt;/p&gt;

&lt;p&gt;
While I was at it, I also measured the throughput when the binlog is
disabled. As can be seen, enabling the binlog has a notable performance impact
even with a very fast disk. Still, considering the added overhead of writing an
extra log file, not to mention the added 2-phase commit step, the overhead is
not &lt;em&gt;that&lt;/em&gt; unreasonable.
&lt;/p&gt;

&lt;p&gt;
From the table we also see some negative scaling as the number of parallel
connections increases. Some of this is likely from InnoDB/XtraDB, but I would
like to investigate it deeper at some point to see if there is anything in the
group commit part that can be improved with respect to this.
&lt;/p&gt;

&lt;p&gt;
Looking back, should I have done this benchmark when designing the code? I
think it is a tricky question, and one that cannot be given a simple
answer. It will always be a trade-off: It is not feasible to test (and
implement!) every conceivable variant of a new feature during development,
it is necessary to also rely on common sense and experience. On the other
hand, it is dangerous to rely on intuition with respect to performance; time
and time again measurements prove that the real world is very often counter to
intuition. In this case I was right, and my optimisation was beneficial;
however I could easily have been wrong. I think the main lesson here is how
important it is to get feedback on complex design work like this; such
feedback is crucial for motivating and structuring the work to be of the
quality that we need to see in MariaDB.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/15515.html</comments>
  <category>mariadb</category>
  <category>mysql</category>
  <category>performance</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>1</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/15178.html</guid>
  <pubDate>Tue, 08 Mar 2011 08:30:52 GMT</pubDate>
  <title>My presentation from OpenSourceDays2011</title>
  <link>http://kristiannielsen.livejournal.com/15178.html</link>
  <description>&lt;p&gt;
Here are &lt;a href=&quot;http://knielsen-hq.org/maria/osd2011.pdf&quot; rel=&quot;nofollow&quot;&gt;the slides&lt;/a&gt;
from my &lt;a href=&quot;http://opensourcedays.org/print/74&quot; rel=&quot;nofollow&quot;&gt;talk at Open Source Days
2011&lt;/a&gt; on Saturday. The talk was about MariaDB and other parts of the MySQL
development community outside of MySQL@Oracle.
&lt;/p&gt;

&lt;p&gt;
For me, the most memorable part of the conference was the talk by
Noirin Shirley titled
&lt;a href=&quot;http://opensourcedays.org/print/101&quot; rel=&quot;nofollow&quot;&gt;Open Source: Saving the World&lt;/a&gt;.
Noirin described the Open Source &lt;a href=&quot;http://blog.ushahidi.com&quot; rel=&quot;nofollow&quot;&gt;Ushahidi project&lt;/a&gt;
and how it was used during the natural disaster crisis in Indonesia, New
Zealand and other places.
&lt;/p&gt;

&lt;p&gt;
Now, there is a long way
from &lt;a href=&quot;http://kristiannielsen.livejournal.com/12810.html&quot; rel=&quot;nofollow&quot;&gt;implementing
group commit in MariaDB&lt;/a&gt; to rescuing injured people out of collapsed
buildings, and not all use of Free Software is as samaritan as
Ushahidi. Well, no-one can save the world alone.
&lt;/p&gt;

&lt;p&gt;But in the Free Software
community, we work together, each contributing his or her microscopic part,
and together slowly but surely building the most valuable software
infrastructure in the world. Which then in turn empowers others to work
together in other areas outside of (and more important than) software.
Working in Free Software enables me to contribute my skills and resources, and
I think Noirin managed very well to capture this in her talk.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/15178.html</comments>
  <category>conference</category>
  <category>mariadb</category>
  <category>talk</category>
  <category>mysql</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/14961.html</guid>
  <pubDate>Thu, 24 Feb 2011 11:13:11 GMT</pubDate>
  <title>Speaking at OpenSourceDays2011</title>
  <link>http://kristiannielsen.livejournal.com/14961.html</link>
  <description>&lt;p&gt;
Again this year, I will be speaking about MariaDB and stuff at the
&lt;a href=&quot;http://opensourcedays.org/&quot; rel=&quot;nofollow&quot;&gt;OpenSourceDays2011&lt;/a&gt; conference in
Copenhagen, Denmark. The conference will take place on Saturday March 5,
that&apos;s just over a week from now!
The &lt;a href=&quot;http://opensourcedays.org/node/77&quot; rel=&quot;nofollow&quot;&gt;program&lt;/a&gt; is ready and my
talk is scheduled for the afternoon at 15:30. Hope to meet a lot of people
there!
&lt;/p&gt;

&lt;p&gt;
(I will be sure to make the slides from my talk available here afterwards, for
those of you interested but unable to attend.)
&lt;/p&gt;

&lt;p&gt;
Here is the abstract for my talk:
&lt;/p&gt;

&lt;h3&gt;Latest news from the MariaDB (and MySQL) community&lt;/h3&gt;

&lt;p&gt;A lot of Open Source software projects got transferred to Oracle last year as part of the acquisition of Sun Microsystems. Not everybody in the affected Open Source communities has been happy with this transfer, to put it mildly, and projects like LibreOffice, Illumos, Jenkins, and others are forking left and right to become independent of Oracle.&lt;/p&gt;

&lt;p&gt;Interestingly, MySQL, one of the major projects taken over from Sun, already had several forks active prior to the acquisition, among them MariaDB, which was started by original MySQL founder Michael &quot;Monty&quot; Widenius in 2009.&lt;/p&gt;

&lt;p&gt;In the talk I will describe the MariaDB project: why it was started, what it is, and what we have been up to in the first two years of the project&apos;s existence. I will then give a more technical description of one particular performance feature that is new in MariaDB, &quot;group commit&quot;, which is something I worked on personally the last year, and which I think is a good example of the kind of development that happens in MariaDB. Finally I want to give an &quot;interactive FAQ&quot;, answering some of the questions that are buzzing around in
the community concerning the future of MySQL and derivatives inside and outside of Oracle.&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/14961.html</comments>
  <category>conference</category>
  <category>mariadb</category>
  <category>talk</category>
  <category>mysql</category>
  <lj:security>public</lj:security>
  <lj:reply-count>1</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/14805.html</guid>
  <pubDate>Thu, 24 Feb 2011 11:11:43 GMT</pubDate>
  <title>MariaDB replication feature preview released</title>
  <link>http://kristiannielsen.livejournal.com/14805.html</link>
  <description>&lt;p&gt;
I am pleased to announce the availability of the MariaDB 5.2 feature preview
release. Find the details and download
links &lt;a href=&quot;http://kb.askmonty.org/v/mariadb-52-replication-feature-preview&quot; rel=&quot;nofollow&quot;&gt;on
the knowledgebase&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
There has been quite good interest in the replication work I have been doing
around MariaDB, and I wanted a way to make it easy for people to use,
experiment with, and give feedback on the new features. The result is this
replication feature preview release. This will all eventually make it into the
next official release; however, this is likely still some months off.
&lt;/p&gt;

&lt;p&gt;
All the usual binary packages and source tarballs
are &lt;a href=&quot;http://kb.askmonty.org/v/mariadb-52-replication-feature-preview&quot; rel=&quot;nofollow&quot;&gt;available
for download&lt;/a&gt;. As something new, I now also made apt-enabled repositories
available for Debian and Ubuntu; this should greatly simplify installation on
these .deb based distributions.
&lt;/p&gt;

&lt;p&gt;
So please try it out, and give feedback on
the &lt;a href=&quot;https://launchpad.net/~maria-developers&quot; rel=&quot;nofollow&quot;&gt;mailing list&lt;/a&gt;
or &lt;a href=&quot;https://bugs.launchpad.net/maria/+filebug&quot; rel=&quot;nofollow&quot;&gt;bug tracker&lt;/a&gt;. I will
make sure to fix any bugs and keep the feature preview updated until
everything is available in an official release.
&lt;/p&gt;

&lt;p&gt;
Here is the list of new features in the replication preview release:
&lt;/p&gt;
&lt;h3&gt;Group commit for the binary log&lt;/h3&gt;
&lt;p&gt;This preview release implements group commit that works when using XtraDB with
the binary log enabled. (In previous MariaDB releases, and all MySQL releases at
the time of writing, group commit works in InnoDB/XtraDB when the binary log
is disabled, but stops working when the binary log is enabled).&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://kb.askmonty.org/v/group-commit-for-the-binary-log&quot; rel=&quot;nofollow&quot;&gt;Documentation.&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Enhancements for &lt;code&gt;START TRANSACTION WITH CONSISTENT SNAPSHOT&lt;/code&gt;&lt;/h3&gt;
&lt;p&gt;&lt;code&gt;START TRANSACTION WITH CONSISTENT SNAPSHOT&lt;/code&gt; now also works with the binary
log. This means that it is possible to obtain the binlog position
corresponding to a transactional snapshot of the database without any blocking
of other queries at all. This is used by &lt;code&gt;mysqldump --single-transaction
--master-data&lt;/code&gt; to do a fully non-blocking backup that can be used to provision
a new slave.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;START TRANSACTION WITH CONSISTENT SNAPSHOT&lt;/code&gt; now also works consistently between
transactions involving more than one storage engine (currently XtraDB and PBXT
support this).&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://kb.askmonty.org/v/1549&quot; rel=&quot;nofollow&quot;&gt;Documentation.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Annotation of row-based replication events with the original SQL statement&lt;/h3&gt;

&lt;p&gt;When using row-based replication, the binary log does not contain SQL
statements, only discrete single-row insert/update/delete events. This can
make it harder to read mysqlbinlog output and understand where in an
application a given event may have originated, complicating analysis and
debugging.&lt;/p&gt;
&lt;p&gt;This feature adds an option to include the original SQL statement as a
comment in the binary log (and shown in mysqlbinlog output) for row-based
replication events.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://kb.askmonty.org/v/annotate_rows_log_event&quot; rel=&quot;nofollow&quot;&gt;Documentation.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Row-based replication for tables with no primary key&lt;/h3&gt;
&lt;p&gt;This feature can improve the performance of row-based replication on tables
that do not have a primary key (or other unique key), but that do have another
index that can help locate rows to update or delete. With this feature, index
cardinality information from &lt;code&gt;ANALYZE TABLE&lt;/code&gt; is considered when selecting the
index to use (before this feature was implemented, the first index was selected
unconditionally).&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://kb.askmonty.org/v/row-based-replication-with-no-primary-key&quot; rel=&quot;nofollow&quot;&gt;Documentation.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Early release during prepare phase of XtraDB row locks&lt;/h3&gt;

&lt;p&gt;This feature adds an option to make XtraDB release the row locks for a
transaction earlier during the &lt;code&gt;COMMIT&lt;/code&gt; step when running with &lt;code&gt;--sync-binlog=1&lt;/code&gt;
and &lt;code&gt;--innodb-flush-log-at-trx-commit=1&lt;/code&gt;. This can improve throughput if the
workload has a bottleneck on hot-spot rows.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;http://kb.askmonty.org/v/xtradb-option-innodb-release-locks-early&quot; rel=&quot;nofollow&quot;&gt;Documentation.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;PBXT consistent commit ordering&lt;/h3&gt;
&lt;p&gt;This feature implements the new commit ordering storage engine API in
PBXT. With this feature, it is possible to use &lt;code&gt;START TRANSACTION WITH
CONSISTENT SNAPSHOT&lt;/code&gt; and get consistency among transactions that involve both
XtraDB and PBXT. (Without this feature, there is no such consistency
guarantee. For example, even after running &lt;code&gt;START TRANSACTION WITH CONSISTENT
SNAPSHOT&lt;/code&gt; it was still possible for the InnoDB/XtraDB part of some transaction
T to be visible and the PBXT part of the same transaction T to not be visible.)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://kb.askmonty.org/v/1549&quot; rel=&quot;nofollow&quot;&gt;Documentation.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;Miscellaneous&lt;/h3&gt;
&lt;ul&gt;&lt;li&gt;Small change to make mysqlbinlog omit redundant &lt;code&gt;use&lt;/code&gt; statements around &lt;code&gt;BEGIN&lt;/code&gt;/&lt;code&gt;SAVEPOINT&lt;/code&gt;/&lt;code&gt;COMMIT&lt;/code&gt;/&lt;code&gt;ROLLBACK&lt;/code&gt; events when reading MySQL 5.0 binlogs.
&lt;/li&gt;&lt;/ul&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/14805.html</comments>
  <category>mariadb</category>
  <category>release</category>
  <category>mysql</category>
  <category>replication</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/14429.html</guid>
  <pubDate>Tue, 07 Dec 2010 11:05:52 GMT</pubDate>
  <title>Christmas @ MariaDB</title>
  <link>http://kristiannielsen.livejournal.com/14429.html</link>
  <description>&lt;p&gt;
The &quot;julehjerte&quot; is apparently a Danish/Northern European Christmas
tradition
(&lt;a href=&quot;https://secure.wikimedia.org/wikipedia/en/wiki/Pleated_Christmas_hearts&quot; rel=&quot;nofollow&quot;&gt;at
    least according to Wikipedia&lt;/a&gt;). But hopefully people outside this
region will also be able to enjoy this variant:
&lt;/p&gt;
&lt;div align=&quot;center&quot;&gt;
&lt;img src=&quot;http://knielsen-hq.org/maria/mariadb-hjerte.jpg&quot; align=&quot;middle&quot;&gt;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
&lt;img src=&quot;http://knielsen-hq.org/maria/mariadb-logo.png&quot; align=&quot;middle&quot;&gt;
&lt;/div&gt;
&lt;p&gt;
I have been making &quot;julehjerter&quot; ever since I was a small kid, and every
Christmas I try to do something different with them. As seen above, this year I
decided to combine the tradition with the MariaDB logo, and I am frankly quite
pleased with the result :-)
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/14429.html</comments>
  <category>mariadb</category>
  <category>mysql</category>
  <category>fun</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/14305.html</guid>
  <pubDate>Mon, 11 Oct 2010 15:32:30 GMT</pubDate>
  <title>The future of replication revealed in Istanbul</title>
  <link>http://kristiannielsen.livejournal.com/14305.html</link>
  <description>&lt;p&gt;
A very good meeting in Istanbul is drawing to an end. People from Monty Program,
Facebook, Galera, Percona, SkySQL, and other parts of the community are
meeting with one foot on the European continent and another in Asia to discuss
all things MariaDB and MySQL and experience the mystery of the Orient.
&lt;/p&gt;

&lt;p&gt;
At the meeting I had the opportunity
to &lt;a href=&quot;http://knielsen-hq.org/maria/repl.pdf&quot; rel=&quot;nofollow&quot;&gt;present&lt;/a&gt; my plans and
visions for the future development of replication in MariaDB. My talk was very
well received, and I had a lot of good discussions afterwards with many of the
bright people here. Working from home in a virtual company, it means a lot to
get this kind of inspiration and encouragement from others on occasion, and I
am looking forward to continuing the work after an early flight to Copenhagen
tomorrow.
&lt;/p&gt;

&lt;p&gt;
The new interface for transaction coordinator plugins is what particularly
interests me at the moment. The immediate benefit of this work is
&lt;a href=&quot;http://kristiannielsen.livejournal.com/12810.html&quot; rel=&quot;nofollow&quot;&gt;working group
commit&lt;/a&gt; for transactions with the binary log enabled. But just as interesting (if
more subtle), the project is an enabler for several other nice features
related to hot backup and recovery. I spent a lot of effort working on the
interfaces to the transaction controller and related extensions to the storage
engine API, and I think the result is quite solid and a good basis for coming
work.
&lt;/p&gt;

&lt;p&gt;
After the transaction coordinator plugin, the next step is an API for event
generators that will allow plugins to receive replication events on an equal
footing with the built-in MySQL binary log implementation; I will be using
this in cooperation with Codership to more tightly integrate
their &lt;a href=&quot;http://www.codership.com/&quot; rel=&quot;nofollow&quot;&gt;Galera&lt;/a&gt; synchronous replication
into MariaDB. And long-term, I am hoping to combine all of the pieces to finally
start attacking the general problem of parallel execution of events on
replication slaves, the solution of which is long overdue.
&lt;/p&gt;

&lt;p&gt;
(The MariaDB &lt;a href=&quot;http://askmonty.org/wiki/ReplicationProject&quot; rel=&quot;nofollow&quot;&gt;replication
project page&lt;/a&gt; has lots of pointers to more information on the various
projects for anyone interested).
&lt;/p&gt;

&lt;p&gt;
Almost too good to be true, our excursion today was blessed with sunshine and
mild weather after countless days of rain and storm. There were even rumours
of sightings of dolphins jumping again during the SkySQL excursion yesterday.
So while lots of hard work remains, all in all, the omens seem all good for
the future of replication in MariaDB!
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/14305.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>programming</category>
  <category>replication</category>
  <lj:security>public</lj:security>
  <lj:reply-count>11</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/14038.html</guid>
  <pubDate>Sun, 03 Oct 2010 16:55:36 GMT</pubDate>
  <title>Dynamic linking costs two cycles</title>
  <link>http://kristiannielsen.livejournal.com/14038.html</link>
  <description>&lt;p&gt;
It turns out that the overhead of dynamic linking on Linux amd64 is 2 CPU
cycles per cross-module call. I usually take forever to get to the point in my
writing, so I thought I would change this for once :-)
&lt;/p&gt;

&lt;p&gt;
In MySQL, there has been a historical tendency to favour static linking, in
part to avoid the overhead (in execution efficiency) associated with
dynamic linking. However, on modern systems there are also very serious
drawbacks when using static linking.
&lt;/p&gt;

&lt;p&gt;
The particular issue that inspired this article is that I was working
on &lt;a href=&quot;http://askmonty.org/worklog/Server-Sprint/?tid=74&quot; rel=&quot;nofollow&quot;&gt;MWL#74&lt;/a&gt;,
building a proper shared &lt;code&gt;libmysqld.so&lt;/code&gt; library for the MariaDB
embedded server. The lack of a proper &lt;code&gt;libmysqld.so&lt;/code&gt; in MySQL and
MariaDB has caused no end of grief for
packaging &lt;a href=&quot;http://amarok.kde.org/&quot; rel=&quot;nofollow&quot;&gt;Amarok&lt;/a&gt; for the various Linux
distributions. My patch increases the amount of dynamic linking (in a default
build), so I did a quick test to get an idea of the overhead of this.
&lt;/p&gt;

&lt;h2&gt;ELF dynamic linking&lt;/h2&gt;

&lt;p&gt;
The overhead comes from the way dynamic linking works in ELF-based systems
like Linux (and many other POSIX-like operating systems). Code in shared
libraries must be compiled to be position-independent, achieved with
the &lt;code&gt;-fPIC&lt;/code&gt; compiler switch. This allows the loader to
simply &lt;code&gt;mmap()&lt;/code&gt; the image of a shared library into the process
address space at whatever free memory space is available, and the code can run
without any need for the loader to do any kind of relocations of the code. For
a much more detailed explanation see for
example &lt;a href=&quot;http://www.symantec.com/connect/es/articles/dynamic-linking-linux-and-windows-part-one&quot; rel=&quot;nofollow&quot;&gt;this
    article&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
When generating position-independent code for a function call into another
shared object, the compiler cannot generate a simple
absolute &lt;code&gt;call&lt;/code&gt; instruction, as the destination address is not
known until run-time. Instead, the call goes through a small stub in a table
called the PLT, short for Procedure Linkage Table; the stub performs an indirect
jump through an address slot that is filled in at run-time. For example:
&lt;/p&gt;
&lt;pre&gt;
                       callq  0x400680 &lt;mylib_myfunc@plt&gt;
...
&amp;lt;mylib_myfunc@plt&amp;gt;:    jmpq   *0x200582(%rip)
&lt;/pre&gt;
&lt;p&gt;
The indirect jump resolves at runtime into the address of the real function to
be called, so that is the overhead of the call when using dynamic linking: one
indirect jump instruction.
&lt;/p&gt;

&lt;h2&gt;Micro-benchmarking&lt;/h2&gt;

&lt;p&gt;
To measure this one-instruction overhead in terms of execution time, I used
the following code:
&lt;/p&gt;
&lt;pre&gt;
    for (i= 0; i &amp;lt; count; i++)
      v= mylib_myfunc(v);
&lt;/pre&gt;
&lt;p&gt;
The function &lt;code&gt;mylib_myfunc()&lt;/code&gt; is placed in a library, with the
following code:
&lt;/p&gt;
&lt;pre&gt;
    int mylib_myfunc(int v) {return v+1;}
&lt;/pre&gt;
&lt;p&gt;
I tested this with both static and dynamic linking on a Core 2 Duo 2.4 GHz
machine running Linux amd64. Here are the results from running the loop for
1,000,000,000 (one billion) operations:
&lt;/p&gt;
&lt;div align=&quot;left&quot;&gt;
&lt;table border=&quot;1&quot;&gt;
  &lt;tr&gt;&lt;th&gt;&amp;nbsp;&lt;/th&gt;&lt;th&gt;total time (sec.)&lt;/th&gt;&lt;th&gt;CPU cycles/iteration&lt;/th&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Static linking&lt;/th&gt;&lt;td align=&quot;right&quot;&gt;2.54&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;6&lt;/td&gt;&lt;/tr&gt;
  &lt;tr&gt;&lt;th align=&quot;left&quot;&gt;Dynamic linking&lt;/th&gt;&lt;td align=&quot;right&quot;&gt;3.38&lt;/td&gt;&lt;td align=&quot;right&quot;&gt;8&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
&lt;/div&gt;
&lt;p&gt;
So that is the two CPU cycles of overhead per call that I referred to at the
start of this post.
&lt;/p&gt;
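&lt;p&gt;
As a quick check of the arithmetic, the cycle counts in the table above follow
from the measured times, the 2.4 GHz clock rate, and the one billion
iterations. A small Python sketch of that conversion (the constant and function
names are illustrative only, and it assumes the CPU ran at its nominal clock
rate for the whole test):
&lt;/p&gt;
&lt;pre&gt;
```python
# Convert the measured wall-clock times into CPU cycles per loop iteration.
CLOCK_HZ = 2.4e9            # Core 2 Duo clock rate quoted in the text
ITERATIONS = 1_000_000_000  # one billion calls to mylib_myfunc()

def cycles_per_iteration(total_seconds):
    """Cycles spent per iteration, assuming a constant clock of CLOCK_HZ."""
    return total_seconds * CLOCK_HZ / ITERATIONS

static_cycles = cycles_per_iteration(2.54)
dynamic_cycles = cycles_per_iteration(3.38)
overhead_cycles = dynamic_cycles - static_cycles

print(f"static:   {static_cycles:.2f} cycles/iteration")
print(f"dynamic:  {dynamic_cycles:.2f} cycles/iteration")
print(f"overhead: {overhead_cycles:.2f} cycles/call")
```
&lt;/pre&gt;
&lt;p&gt;
This gives roughly 6.1 and 8.1 cycles per iteration, that is, a difference of
about two cycles per call.
&lt;/p&gt;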
&lt;p&gt;
Incidentally, if you try stepping through the call with a debugger, you will
see a much larger overhead for the very first call. Do not be fooled by this;
it is just because the loader resolves PLT entries lazily, computing the correct
address of the destination only the first time the call is made (so
addresses of functions that are never called by a process need never be
calculated). See the above-referenced article for more details.
&lt;/p&gt;
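&lt;p&gt;
The lazy resolution described above can be modelled, very loosely, as a table of
slots that are filled in on first use. The following Python toy is only an
analogy, not how the ELF loader is implemented; the class
&lt;code&gt;LazyLinkTable&lt;/code&gt; and its members are made-up names:
&lt;/p&gt;
&lt;pre&gt;
```python
class LazyLinkTable:
    """Toy model of lazy binding: one slot per imported function."""

    def __init__(self, exported_symbols):
        self._exported = exported_symbols  # stands in for the shared library
        self._slots = {}                   # slots that have been filled in
        self.lookups = 0                   # how many slow symbol lookups ran

    def call(self, name, *args):
        func = self._slots.get(name)
        if func is None:
            # First call only: resolve the symbol and cache it in the slot.
            func = self._exported[name]
            self._slots[name] = func
            self.lookups += 1
        # Every call, first or not, ends with one indirect call via the slot.
        return func(*args)

table = LazyLinkTable({"mylib_myfunc": lambda v: v + 1})
v = 0
for _ in range(1000):
    v = table.call("mylib_myfunc", v)
print(v, table.lookups)
```
&lt;/pre&gt;
&lt;p&gt;
The loop makes 1000 calls but triggers only a single lookup, which is why the
cost seen on the very first call says little about the steady-state overhead.
&lt;/p&gt;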
&lt;p&gt;
(Note that this is for 64-bit amd64. For 32-bit x86, the mechanism is similar,
but the actual overhead may be somewhat larger, since that architecture lacks
program-counter-relative addressing and so must reserve one
register &lt;code&gt;%ebx&lt;/code&gt; (out of its already quite limited register bank)
for this purpose. I did not measure the 32-bit case, as I think it is of little
interest nowadays for high-performance MySQL or MariaDB deployments (and the
overhead of function calls on x86 32-bit is significantly higher anyway,
dynamic linking or not, due to the need to push and pop all arguments to/from
the stack)).
&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;
Two cycles per call is, in my opinion, a very modest overhead. It is hard to
imagine high-performance code where this will have a real-life noticeable
effect. Modern systems rely heavily on dynamic linking, and static linking
nowadays causes many more problems than it solves. And I think it is also
time to put the efficiency argument for static linking to rest.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/14038.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>performance</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>1</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/13577.html</guid>
  <pubDate>Sun, 05 Sep 2010 16:38:49 GMT</pubDate>
  <title>Micro-benchmarking pthread_cond_broadcast()</title>
  <link>http://kristiannielsen.livejournal.com/13577.html</link>
  <description>&lt;p&gt;
In my work
on &lt;a href=&quot;http://askmonty.org/worklog/Server-BackLog/?tid=116&quot; rel=&quot;nofollow&quot;&gt;group
    commit&lt;/a&gt; for MariaDB, I have the following situation:
&lt;/p&gt;

&lt;p&gt;
A group of threads are going to participate in group commit. This means that
one of the threads, called the &lt;em&gt;group leader&lt;/em&gt;, will run
an &lt;code&gt;fsync()&lt;/code&gt; for all of them, while the other threads wait.
Once the group leader is done, it needs to wake up all of the other threads.
&lt;/p&gt;

&lt;p&gt;
The obvious way to do this is to have the group leader
call &lt;code&gt;pthread_cond_broadcast()&lt;/code&gt; on a condition that the other
threads are waiting for with &lt;code&gt;pthread_cond_wait()&lt;/code&gt;:
&lt;/p&gt;
&lt;pre&gt;
  bool wakeup= false;
  pthread_cond_t wakeup_cond;
  pthread_mutex_t wakeup_mutex;
&lt;/pre&gt;
&lt;p&gt;
Waiter:
&lt;/p&gt;
&lt;pre&gt;
  pthread_mutex_lock(&amp;wakeup_mutex);
  while (!wakeup)
    pthread_cond_wait(&amp;wakeup_cond, &amp;wakeup_mutex);
  pthread_mutex_unlock(&amp;wakeup_mutex);
  // Continue processing after group commit is now done.
&lt;/pre&gt;
&lt;p&gt;
Group leader:
&lt;/p&gt;
&lt;pre&gt;
  pthread_mutex_lock(&amp;amp;wakeup_mutex);
  wakeup= true;
  pthread_cond_broadcast(&amp;amp;wakeup_cond);
  pthread_mutex_unlock(&amp;amp;wakeup_mutex);
&lt;/pre&gt;
&lt;p&gt;
Note the association of the condition with a mutex. This association is
inherent in the way pthread condition variables work. The mutex must be locked
when calling into &lt;code&gt;pthread_cond_wait()&lt;/code&gt;, and
&lt;b&gt;will be obtained again before the call returns&lt;/b&gt;.
(Check the
&lt;a href=&quot;http://www.opengroup.org/onlinepubs/009695399/functions/pthread_cond_wait.html&quot; rel=&quot;nofollow&quot;&gt;man
page&lt;/a&gt;
for &lt;code&gt;pthread_cond_wait()&lt;/code&gt; for details).
&lt;/p&gt;

&lt;p&gt;
Now, when I think about how these condition variables work, something strikes
me as somewhat odd.
&lt;/p&gt;

&lt;p&gt;
The idea is that the broadcast signals every waiting thread to wake
up. However, because of the associated mutex, only one thread will actually be
able to wake up; this thread will obtain a lock on the mutex, and all other
to-be-awoken threads will now have to wait for this mutex! Only after the
first thread releases this mutex will the next thread wake up holding the
mutex; once that thread releases it, the third thread can wake up, and so on.
&lt;/p&gt;

&lt;p&gt;
So if we have, say, 100 threads waiting, the last one will have to wait for the
first 99 threads to each be scheduled and each release the mutex, one after
the other in a completely serialised fashion.
&lt;/p&gt;

&lt;p&gt;
But what I really want is to just let them all run at once in parallel (or at
least as many as my machine has spare cores for). There is another way to
achieve this, by simply using a separate condition and mutex for each thread,
and have the group leader signal each one individually:
&lt;/p&gt;
&lt;p&gt;
Waiter:
&lt;/p&gt;
&lt;pre&gt;
  pthread_mutex_lock(&amp;me-&gt;wakeup_mutex);
  while (!me-&gt;wakeup)
    pthread_cond_wait(&amp;me-&gt;wakeup_cond, &amp;me-&gt;wakeup_mutex);
  pthread_mutex_unlock(&amp;me-&gt;wakeup_mutex);
&lt;/pre&gt;
&lt;p&gt;
Group leader:
&lt;/p&gt;
&lt;pre&gt;
  for waiter in &amp;lt;all waiters&amp;gt;
    pthread_mutex_lock(&amp;amp;waiter-&amp;gt;wakeup_mutex);
    waiter-&amp;gt;wakeup= true;
    pthread_cond_signal(&amp;amp;wakeup_cond);
    pthread_mutex_unlock(&amp;amp;wakeup_mutex);
&lt;/pre&gt;
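&lt;p&gt;
For readers who want to play with the per-waiter pattern without a C
toolchain, here is a rough Python analogue. It is only a sketch of the
structure: &lt;code&gt;threading.Condition&lt;/code&gt; bundles the mutex and the
condition into one object, the &lt;code&gt;Waiter&lt;/code&gt; class and
&lt;code&gt;leader_wake_all()&lt;/code&gt; are hypothetical names, and because of the
Python interpreter lock it says nothing about real parallel wakeup cost.
&lt;/p&gt;
&lt;pre&gt;
```python
import threading

class Waiter:
    """One waiting thread with its own condition variable and wakeup flag."""

    def __init__(self):
        self.cond = threading.Condition()  # private mutex + condition
        self.wakeup = False
        self.woke = False

    def wait_for_leader(self):
        with self.cond:
            while not self.wakeup:         # loop guards against spurious wakeups
                self.cond.wait()
        self.woke = True                   # continue after group commit is done

def leader_wake_all(waiters):
    """Signal each waiter individually, holding only that waiter's mutex."""
    for w in waiters:
        with w.cond:
            w.wakeup = True
            w.cond.notify()

waiters = [Waiter() for _ in range(8)]
threads = [threading.Thread(target=w.wait_for_leader) for w in waiters]
for t in threads:
    t.start()
leader_wake_all(waiters)
for t in threads:
    t.join()
print(all(w.woke for w in waiters))
```
&lt;/pre&gt;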
&lt;p&gt;
This way, every waiter is free to start running as soon as woken up by the
leader; no waiters have to wait for one another. This seems advantageous,
especially as the number of cores increases (rumours are that 48-core machines are
becoming commodity).
&lt;/p&gt;

&lt;p&gt;
&lt;b&gt;&quot;Seems&quot;&lt;/b&gt; advantageous. But is it really? Let us micro-benchmark it.
&lt;/p&gt;

&lt;p&gt;
For this, I start up 5000 threads. Each thread goes to wait on a condition,
either a single shared one, or distinct in each thread. The main program then
signals the threads to wake up, either with a single &lt;code&gt;pthread_cond_broadcast()&lt;/code&gt;,
or with one &lt;code&gt;pthread_cond_signal()&lt;/code&gt; per thread. Each thread records
the time it woke up, and the main program collects these times and computes
how long it took between starting to signal the condition(s) and wakeup of the
last thread. (Here is
the &lt;a href=&quot;http://knielsen-hq.org/maria/pthread_cond_broadcast.c&quot; rel=&quot;nofollow&quot;&gt;full C
    source code&lt;/a&gt; for the test program).
&lt;/p&gt;
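&lt;p&gt;
The measurement idea (time from the start of signalling until the last waiter
has woken up) can be sketched in a few lines of Python for the shared-condition
case. This is not the C test program linked above, just a scaled-down
illustration with made-up names; the thread count and the short sleep used to
let the waiters block are arbitrary assumptions, and the absolute numbers it
prints are not comparable to the C results.
&lt;/p&gt;
&lt;pre&gt;
```python
import threading
import time

N_THREADS = 50  # far fewer than the 5000 used in the text, to keep this quick

cond = threading.Condition()   # one shared mutex + condition for all waiters
state = {"wakeup": False}
wake_times = []
times_lock = threading.Lock()

def waiter():
    with cond:
        while not state["wakeup"]:
            cond.wait()
    woke_at = time.monotonic()  # taken after the shared mutex is released
    with times_lock:
        wake_times.append(woke_at)

threads = [threading.Thread(target=waiter) for _ in range(N_THREADS)]
for t in threads:
    t.start()
time.sleep(0.1)  # crude: give the waiters time to block on the condition

start = time.monotonic()
with cond:
    state["wakeup"] = True
    cond.notify_all()           # the broadcast
for t in threads:
    t.join()

elapsed_ms = (max(wake_times) - start) * 1000.0
print(f"last of {N_THREADS} waiters woke after {elapsed_ms:.3f} msec")
```
&lt;/pre&gt;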

&lt;p&gt;
I ran the program on an Intel quad Core i7 with hyperthreading enabled, the
most parallel machine I have easy access to. The results are the following:
&lt;br&gt;
&lt;table&gt;
  &lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;pthread_cond_broadcast()&lt;/code&gt;:
      &lt;td align=&quot;right&quot;&gt;46.9 msec
  &lt;tr&gt;&lt;td align=&quot;left&quot;&gt;&lt;code&gt;pthread_cond_signal()&lt;/code&gt;:
      &lt;td align=&quot;right&quot;&gt;17.6 msec
&lt;/table&gt;
&lt;br&gt;
Conclusion: &lt;code&gt;pthread_cond_broadcast()&lt;/code&gt; &lt;em&gt;is&lt;/em&gt; slower, as I
speculated. I would expect the effect to be more pronounced on systems with
more cores; it would be interesting if readers with access to such systems
could try the test program and comment below on the results.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/13577.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>performance</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>10</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/13382.html</guid>
  <pubDate>Thu, 08 Jul 2010 22:33:07 GMT</pubDate>
  <title>MySQL/MariaDB replication: applying events on the slave side</title>
  <link>http://kristiannielsen.livejournal.com/13382.html</link>
  <description>&lt;p&gt;
Working on a new set of replication APIs in MariaDB, I have
given &lt;a href=&quot;http://kristiannielsen.livejournal.com/13253.html&quot; rel=&quot;nofollow&quot;&gt;some&lt;/a&gt; &lt;a href=&quot;http://askmonty.org/worklog/Server-Sprint/index.pl?tid=120&quot; rel=&quot;nofollow&quot;&gt;thought&lt;/a&gt;
to the generation of replication events on the master server.
&lt;/p&gt;

&lt;p&gt;
But there is another side of the equation: to apply the generated events on a
slave server. This is something that most replication setups will need (unless
they replicate to non-MySQL/MariaDB slaves). So it will be good to provide a
generic interface for this, otherwise every binlog-like plugin implementation
will have to re-invent this themselves.
&lt;/p&gt;

&lt;p&gt;
A central idea in the current design for generating events is that we do not
enforce a specific content of events. Instead, the API provides accessors for
a lot of different information related to each event, allowing the plugin
flexibility in choosing what to include in a particular event format. For
example, one plugin may request column names for a row-based UPDATE event;
another plugin may not need them and can avoid any overhead related to column
names simply by not requesting them.
&lt;/p&gt;

&lt;p&gt;
To get the same flexibility on the slave side, the roles of plugin and API are
reversed. Here, the plugin will have a certain pre-determined (by the
particular event format implemented) set of information related to the event
available. And the API must make do with whatever information it is provided
(or fail gracefully if essential information is missing).
&lt;/p&gt;

&lt;p&gt;
My idea is that the event application API will provide corresponding events to
the events in the generation API. Each application event will have &quot;provider&quot;
methods corresponding to the accessor methods of the generator API. So the
plugin that wants to apply an event can obtain an application event object, call
the appropriate provider methods for all the information available, and
finally ask the API to execute the event with the provided context
information.
&lt;/p&gt;
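&lt;p&gt;
To make the idea a little more tangible, here is a purely hypothetical sketch
in Python of what such an application event could look like for a row-based
UPDATE. None of these names exist in MariaDB; the class, the
&lt;code&gt;provide_*()&lt;/code&gt; methods, and the in-memory &quot;tables&quot; are invented for
illustration only.
&lt;/p&gt;
&lt;pre&gt;
```python
class MissingEventInfo(Exception):
    """Essential information was never provided for the event."""

class UpdateRowApplyEvent:
    """Hypothetical application-side counterpart of a row-based UPDATE event."""

    def __init__(self):
        self.table = None
        self.column_names = None  # optional: not every event format has them
        self.before = None
        self.after = None

    # "Provider" methods: the plugin calls only those it has data for.
    def provide_table(self, table):
        self.table = table

    def provide_column_names(self, names):
        self.column_names = list(names)

    def provide_row_images(self, before, after):
        self.before, self.after = before, after

    def execute(self, tables):
        """Apply the update to `tables`, a dict of table name to list of rows."""
        if self.table is None or self.before is None or self.after is None:
            raise MissingEventInfo("table name and both row images are required")
        rows = tables[self.table]
        rows[rows.index(self.before)] = self.after

tables = {"t1": [(1, "a"), (2, "b")]}
event = UpdateRowApplyEvent()
event.provide_table("t1")
event.provide_row_images((2, "b"), (2, "c"))  # column names deliberately omitted
event.execute(tables)
print(tables["t1"])
```
&lt;/pre&gt;
&lt;p&gt;
The plugin here never supplies column names, and the event still applies; had
it omitted the table name or the row images, &lt;code&gt;execute()&lt;/code&gt; would fail
with a clear error rather than guess.
&lt;/p&gt;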

&lt;p&gt;
This is only an abstract idea at this point; there are lots of details to take
care of to make this idea into a concrete design proposal. And I have not
fully decided if such an API will be part of the replication project or
not. But I like the idea so far.
&lt;/p&gt;


&lt;h2&gt;Understanding how MySQL binlog events are applied on the slave&lt;/h2&gt;

&lt;p&gt;
I wanted to get a better understanding of what is involved in an event
application API like the one described above. So I did a similar exercise to
the one I wrote about in my
&lt;a href=&quot;http://kristiannielsen.livejournal.com/13253.html&quot; rel=&quot;nofollow&quot;&gt;last post&lt;/a&gt;,
where I went through in detail all the information that the existing MySQL
binlog format includes. This time I went through the details in the code that
applies MySQL binlog events on a slave.
&lt;/p&gt;

&lt;p&gt;
Again, I concentrate on the actual events that change the database, ignoring
(most) details that relate only to the particular binlog format used by MySQL
(and there are quite a few :-).
&lt;/p&gt;

&lt;p&gt;
At the top level, the slave SQL thread
(in &lt;code&gt;exec_relay_log_event()&lt;/code&gt;) reads events
(&lt;code&gt;next_event()&lt;/code&gt;) from the relay logs and executes
(&lt;code&gt;apply_event_and_update_pos()&lt;/code&gt;) them.
&lt;/p&gt;

&lt;p&gt;
There are a number of intricate details here relating to switching to a new
relay log (and purging old ones), and re-winding in the relay log to re-try a
failed transaction (eg. in case of deadlock or the like). This is mostly
specific to the particular binlog implementation.
&lt;/p&gt;

&lt;p&gt;
The actual data changes are mostly done in the &lt;code&gt;do_apply_event()&lt;/code&gt;
methods of each event class in &lt;code&gt;sql/log_event.cc&lt;/code&gt;. I will go
briefly through this method for the events that are used to change actual data
in current MySQL replication. It is relatively easy to read a particular
detail out of the code, since it is all located in
these &lt;code&gt;do_apply_event()&lt;/code&gt; methods.
&lt;/p&gt;

&lt;p&gt;
(One problem however is that there are special cases sprinkled across the code
where special action is taken (or not taken) when running in the slave SQL
thread. I have not so far tried to determine the full list of such special
cases, or assess how many there are).
&lt;/p&gt;


&lt;h3&gt;Query_log_event&lt;/h3&gt;

&lt;p&gt;
The main task done here is to set up various context in the &lt;code&gt;THD&lt;/code&gt;,
execute the query, and then perform necessary cleanup. If the query throws an
error, there is also some fairly complex logic to handle this error correctly;
for example to ignore certain errors, to require the same error as on the master (if
the query failed on the master in some particular way), and to re-try the
query/transaction for certain errors (like deadlock).
&lt;/p&gt;

&lt;p&gt;
There are also some hard-coded special cases for NDB (this seems to be a
common theme in the replication code).
&lt;/p&gt;

&lt;p&gt;
The main thing that to my eyes makes this part of the code complex is the set
of actions taken to prepare context before executing the query, and to clean
up after execution. Each individual step in the code is in fact relatively
easy to follow (and often the commenting is quite good). The problem is that
there are so many individual steps. It is very hard to feel sure that exactly
this set of actions is sufficient (and that none are redundant for that
matter).
&lt;/p&gt;

&lt;p&gt;
For example, code like this (not complete, just a random part of the setup):
&lt;/p&gt;
&lt;pre&gt;
    thd-&amp;gt;set_time((time_t)when);
    thd-&amp;gt;set_query((char*)query_arg, q_len_arg);
    VOID(pthread_mutex_lock(&amp;amp;LOCK_thread_count));
    thd-&amp;gt;query_id = next_query_id();
    VOID(pthread_mutex_unlock(&amp;amp;LOCK_thread_count));
    thd-&amp;gt;variables.pseudo_thread_id= thread_id;
&lt;/pre&gt;
&lt;p&gt;
It seems very easy to forget to assign &lt;code&gt;query_id&lt;/code&gt; or whatever if
one were to write this from scratch. That is something I would really like to
improve in an API for replication plugins: it should be possible to understand
exactly what setup and cleanup is needed around event execution,
and there should be appropriate methods to achieve such setup/cleanup.
&lt;/p&gt;

&lt;p&gt;
Another thing that is interesting in the code
for &lt;code&gt;Query_log_event::do_apply_event()&lt;/code&gt; is that the code
does &lt;code&gt;strcmp()&lt;/code&gt; on the query against keywords like COMMIT,
SAVEPOINT, and ROLLBACK, and follows a fairly different execution path for
these. This seems to bypass the SQL parser! But in reality, these particular
queries are generated on the master with special code in the server that
carefully restricts the possible format (eg. no whitespace or comments
etc). So in effect, this is just a way to hack in special event types for
these special queries without actually adding new binlog format events.
&lt;/p&gt;
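&lt;p&gt;
The effect of such string comparisons can be illustrated with a small Python
sketch. This is a simplification and not the server logic: the function name,
the returned labels, and the choice of exact match versus prefix match are
assumptions made for the example.
&lt;/p&gt;
&lt;pre&gt;
```python
EXACT_KEYWORDS = ("COMMIT", "ROLLBACK")

def classify_query(query):
    """Return which execution path a replicated query string would take."""
    if query in EXACT_KEYWORDS:
        return "transaction-control"     # plain string match, no SQL parsing
    if query.startswith("SAVEPOINT "):
        return "transaction-control"
    return "parse-and-execute"           # everything else goes to the parser

samples = ("COMMIT", "SAVEPOINT sp1", "commit /* done */",
           "INSERT INTO t1 VALUES (1)")
for q in samples:
    print(f"{q!r}: {classify_query(q)}")
```
&lt;/pre&gt;
&lt;p&gt;
A query such as &lt;code&gt;commit /* done */&lt;/code&gt; does not match and falls through
to the normal path, which is only safe because the master never writes the
special queries in any other form than the exact one.
&lt;/p&gt;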


&lt;h3&gt;Rand_log_event, Intvar_log_event, and User_var_log_event&lt;/h3&gt;

&lt;p&gt;
The action taken for these events is essentially to update
the &lt;code&gt;THD&lt;/code&gt; with the information in the event: value of random
seed, &lt;code&gt;LAST_INSERT_ID&lt;/code&gt;/&lt;code&gt;INSERT_ID&lt;/code&gt;,
or &lt;code&gt;@user_variable&lt;/code&gt;.
&lt;/p&gt;


&lt;h3&gt;Xid_log_event&lt;/h3&gt;

&lt;p&gt;
On the slave side, this event is essentially a COMMIT. One thing that springs
to mind is how different the code to handle this event is from the special
code in &lt;code&gt;Query_log_event&lt;/code&gt; that handles a query &quot;COMMIT&quot;. Again, it
is very hard to tell from the code if this is a bug, or if the
different-looking code is in fact equivalent.
&lt;/p&gt;


&lt;h3&gt;Begin_load_query_log_event, Append_block_log_event, Delete_file_log_event, and Execute_load_query_log_event&lt;/h3&gt;

&lt;p&gt;
This implements the &lt;code&gt;LOAD DATA INFILE&lt;/code&gt; query, which needs special
handling as it originally references a file on the master server (or on the
client machine). The actual execution of the query is handled the same way as
for normal queries (&lt;code&gt;Execute_load_query_log_event&lt;/code&gt; is a sub-class
of &lt;code&gt;Query_log_event&lt;/code&gt;), but some preparation is needed first to
write the data from the events in the relay log into a temporary file on the
slave.
&lt;/p&gt;

&lt;p&gt;
The main thing one notices about this code is how it handles re-writing the
&lt;code&gt;LOAD DATA&lt;/code&gt; query to use a new temporary file name on the
slave. Take for example this query:
&lt;/p&gt;
&lt;pre&gt;
    LOAD DATA CONCURRENT /**/ LOCAL INFILE &apos;foobar.dat&apos; REPLACE INTO /**/ TABLE ...
&lt;/pre&gt;
&lt;p&gt;
The event includes offsets into the query string of the two places
marked &lt;code&gt;/**/&lt;/code&gt; in the example. This part of the query is then
re-written in the slave code (so it is not just replacing the
filename). Again, the code by-passes the SQL parser; it just so happens that
the SQL syntax in this case is sufficiently simple that this is not too
hackish to do. If one were to check, one would probably see that any user comments
in this particular part of the query string disappear in the slave
binlog if &lt;code&gt;--log-slave-updates&lt;/code&gt; is enabled.
&lt;/p&gt;
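&lt;p&gt;
To make the offset-based re-writing concrete, here is a minimal sketch in
Python (not the server code). The function name, the argument names and the
temporary file name are made up for illustration, and the offsets are computed
from the string here, whereas in the real event they arrive from the master.
&lt;/p&gt;
&lt;pre&gt;
```python
def rewrite_load_data(query, start, end, tmp_file, dup_handling=""):
    # Everything between the two offsets is thrown away and rebuilt, so
    # LOCAL (and any user comment in that span) does not survive.
    # Note: no quoting/escaping of tmp_file is attempted in this sketch.
    clause = "INFILE '" + tmp_file + "'"
    if dup_handling:
        clause += " " + dup_handling
    clause += " INTO"
    return query[:start] + clause + query[end:]


original = "LOAD DATA CONCURRENT LOCAL INFILE 'foobar.dat' REPLACE INTO TABLE t1"
start = original.index("LOCAL")    # first marked position
end = original.index(" TABLE")     # second marked position
rewritten = rewrite_load_data(original, start, end,
                              "/tmp/SQL_LOAD-1.data", "REPLACE")
print(rewritten)
```
&lt;/pre&gt;
&lt;p&gt;
Since the whole span between the offsets is regenerated rather than edited,
anything else the user wrote there is lost, which is the comment-dropping
effect suspected above.
&lt;/p&gt;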


&lt;h3&gt;Table_map_log_event&lt;/h3&gt;

&lt;p&gt;
This just links a &lt;code&gt;struct RPL_TABLE_LIST&lt;/code&gt; into a list, containing
information about the tables described by the event. The actual opening of the
table is done when executing the first row event (WRITE/UPDATE/DELETE).
&lt;/p&gt;


&lt;h3&gt;Write_rows_log_event, Update_rows_log_event, and Delete_rows_log_event&lt;/h3&gt;

&lt;p&gt;
These are what handle the application of row-based replication events on a slave.
&lt;/p&gt;

&lt;p&gt;
The first of the row events causes some setup to be done; here is a partial extract:
&lt;/p&gt;
&lt;pre&gt;
    lex_start(thd);
    mysql_reset_thd_for_next_command(thd, 0);
    thd-&amp;gt;transaction.stmt.modified_non_trans_table= FALSE;
    query_cache.invalidate_locked_for_write(rli-&amp;gt;tables_to_lock);
&lt;/pre&gt;
&lt;p&gt;
(As remarked for &lt;code&gt;Query_log_event&lt;/code&gt;, while row event application
setup is somewhat simpler, it still appears a bit magic that exactly these
setups are sufficient and necessary).
&lt;/p&gt;

&lt;p&gt;
The code also switches to row-based binlogging for the following row
operations (this is for &lt;code&gt;--log-slave-updates&lt;/code&gt;, as it is
not possible to binlog the application of row-based events as statement-based
events in the slave binlog). This is by the way an interesting challenge for a
generic replication API: how does one handle binlogging of events applied on a
slave, for daisy-chaining replication servers? This of course gets more
interesting if one were to use a different binlogging plugin on the slave than
on the master. I need to think more about it, but this seems to be a pretty
strong argument that a generic event application API is needed, which hooks
into event generators to properly generate all needed events for the updates
done on slaves. Another important aspect is the support of a global
transaction ID, that will identify a transaction uniquely across an entire
replication setup to make migrating slaves to a new master easier. Such a
global transaction ID also needs to be preserved in a slave binlog when
replicating events from a master.
&lt;/p&gt;

&lt;p&gt;
During execution of the first row event, the code also sets the flags for
foreign key checks and unique checks that are included in the event from the master.
And it checks if the event should be skipped
(for &lt;code&gt;--replicate-ignore-db&lt;/code&gt; and friends).
&lt;/p&gt;

&lt;p&gt;
A check is made to ensure that the table(s) on the slave are compatible with
the tables on the master (as described in the table map event(s) received just
before the row event(s)). The MySQL row-based replication has a fair bit of
flexibility in terms of allowing differences in tables between master and
slave, such as allowing different column types, different storage engine, or
even extra columns on the slave table. In particular allowing extra columns
raises some issues about default values etc. for these columns, though I did
not really go into details about this.
&lt;/p&gt;

&lt;p&gt;
Again, there are a number of hard-coded differences for NDB.
&lt;/p&gt;

&lt;p&gt;
For &lt;code&gt;Write_rows_log_event&lt;/code&gt;, flags need to be set to ensure that
column values for &lt;code&gt;AUTO_INCREMENT&lt;/code&gt; and &lt;code&gt;TIMESTAMP&lt;/code&gt;
columns are taken from the supplied values, not
auto-generated. If &lt;code&gt;slave_exec_mode=IDEMPOTENT&lt;/code&gt;, an INSERT that
fails due to already existing row does not cause replication to fail; instead
an UPDATE is tried, or in some cases (like if there are foreign keys) a DELETE
+ re-tried INSERT. There is also code to hint storage engines about the
approximate number of rows that are part of a bulk insert.
&lt;/p&gt;
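&lt;p&gt;
The idempotent fallback can be sketched as follows. This is a toy model in
Python where a dictionary keyed by primary key stands in for the table; the
names &lt;code&gt;apply_write_row&lt;/code&gt; and &lt;code&gt;DuplicateKey&lt;/code&gt; are
hypothetical, and the real code has more cases than the single foreign key
flag used here.
&lt;/p&gt;
&lt;pre&gt;
```python
class DuplicateKey(Exception):
    pass


def insert_row(table, key, row):
    if key in table:
        raise DuplicateKey(key)
    table[key] = row


def apply_write_row(table, key, row, idempotent, has_foreign_keys=False):
    """Apply one inserted row; return which path was taken."""
    try:
        insert_row(table, key, row)
        return "insert"
    except DuplicateKey:
        if not idempotent:
            raise                      # strict mode: replication stops here
        if has_foreign_keys:
            del table[key]             # DELETE followed by a re-tried INSERT
            insert_row(table, key, row)
            return "delete+insert"
        table[key] = row               # overwrite the existing row (UPDATE)
        return "update"


table = {1: "old"}
print(apply_write_row(table, 2, "new", idempotent=True))
print(apply_write_row(table, 1, "changed", idempotent=True))
print(table)
```
&lt;/pre&gt;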

&lt;p&gt;
For &lt;code&gt;Update_rows_log_event&lt;/code&gt; and &lt;code&gt;Delete_rows_log_event&lt;/code&gt;,
the code needs to locate the row to update/delete. This is done by primary key
if one exists on the slave table (the current binlogging always includes every
column in the before image of the row, so the primary key value of the row to
modify is always available). But there is also support for tables with no
primary key, in which case the first index on the table is used to locate the row
(if any), or failing that a full table scan. This btw. is a good reminder
to &lt;i&gt;not&lt;/i&gt; use row-based replication with primary-key-less tables with
a non-trivial number of rows: &lt;i&gt;every&lt;/i&gt; row operation applied will need a
full table scan!
&lt;/p&gt;
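&lt;p&gt;
As a rough illustration of why this matters, the following Python sketch
mirrors the order of preference described above and gives a crude worst-case
count of rows examined. The function names and the cost model (one lookup per
event when a key is usable, a whole-table read otherwise) are simplifying
assumptions of mine, not measurements.
&lt;/p&gt;
&lt;pre&gt;
```python
def choose_lookup(primary_key, indexes):
    """Pick how to find the row to update/delete on the slave table."""
    if primary_key:
        return "primary key"
    if indexes:
        return "index " + indexes[0]   # first index on the table
    return "full table scan"


def rows_examined(n_rows, n_events, strategy):
    # Crude worst case: a scan reads the whole table for every row event.
    if strategy == "full table scan":
        return n_rows * n_events
    return n_events


for pk, idx in (("id", []), (None, ["k1", "k2"]), (None, [])):
    strategy = choose_lookup(pk, idx)
    print(strategy, rows_examined(100000, 1000, strategy))
```
&lt;/pre&gt;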

&lt;p&gt;
Finally, the values from the event are unpacked into a row buffer in the
format used by MySQL storage engines, and the &lt;code&gt;read_set&lt;/code&gt;
and &lt;code&gt;write_set&lt;/code&gt; are set up (current replication always includes all
columns in row operations), before the
actual &lt;code&gt;ha_write_row()&lt;/code&gt;, &lt;code&gt;ha_update_row()&lt;/code&gt;,
or &lt;code&gt;ha_delete_row()&lt;/code&gt; call into the storage engine handler is made
to perform the actual update. Note that a single row event can include
multiple rows, which are applied one after the other.
&lt;/p&gt;


&lt;h2&gt;Final words&lt;/h2&gt;

&lt;p&gt;
And that is it! Quite a bit of detail, but again I found it very useful to
create this complete overview; it will make things easier when re-implementing
this in a new replication API.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/13382.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>programming</category>
  <category>replication</category>
  <lj:security>public</lj:security>
  <lj:reply-count>1</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/13253.html</guid>
  <pubDate>Mon, 21 Jun 2010 12:16:57 GMT</pubDate>
  <title>Dissecting the MySQL replication binlog events</title>
  <link>http://kristiannielsen.livejournal.com/13253.html</link>
  <description>&lt;p&gt;
For the &lt;a href=&quot;http://askmonty.org/wiki/ReplicationProject&quot; rel=&quot;nofollow&quot;&gt;replication
project&lt;/a&gt; that I am currently working on in MariaDB, I wanted to understand
exactly what information is needed to do full replication of all MySQL/MariaDB
statements on the level of completeness that existing replication does. So I
went through the code, and this is what I found.
&lt;/p&gt;

&lt;p&gt;
What I am after here is a complete list of what the execution engine needs to
provide to have everything that a replication system needs to be able to
completely replicate all changes made on a master server. But &lt;i&gt;not&lt;/i&gt;
anything specific to the particular implementation of replication used, like
binlog positions or replication event disk formats, etc.
&lt;/p&gt;

&lt;p&gt;
The basic information needed is of course the query (for statement-based
replication), or the column values (for row-based replication). But there are
lots of extra details needed, especially for statement-based replication. I
need to make sure that the replication API we are designing will be able to
provide &lt;i&gt;all&lt;/i&gt; needed information, and it was always nagging in the back
of my head that there would be lots and lots of small bits in various corners
that would be missing and cause problems. So it was good to get this
overview. Turns out that there &lt;i&gt;are&lt;/i&gt; a lot of details, but not
&lt;i&gt;that&lt;/i&gt; many, and it should be manageable.
&lt;/p&gt;

&lt;p&gt;
All of the events that are used in replication are listed in &lt;code&gt;enum
Log_event_type&lt;/code&gt; in &lt;code&gt;sql/log_event.h&lt;/code&gt;. So anything needed for
complete replication can be found here, but mixed up with lots of other
details about the MySQL binlog implementation, backwards compatibility,
etc. So what follows is an extract from &lt;code&gt;log_event.cc&lt;/code&gt; of the
actual change information contained in those events.
&lt;/p&gt;

&lt;h2&gt;Statement-based replication&lt;/h2&gt;

&lt;h3&gt;&lt;code&gt;QUERY_EVENT&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;
The main event for statement-based replication is QUERY_EVENT. It contains the
query to be executed (as a string) and some information to provide the context
for correct execution. Here is the list of information:
&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt; SQL query.
  &lt;li&gt; Default database for the query (eg. from &lt;code&gt;USE&lt;/code&gt; statement).
  &lt;li&gt; The setting of some server variables in effect at the time the query
    was run:
    &lt;ul&gt;
      &lt;li&gt; &lt;code&gt;sql_mode&lt;/code&gt;.
      &lt;li&gt; &lt;code&gt;autocommit&lt;/code&gt; (whether autocommit is enabled).
      &lt;li&gt; Character set and collation at various levels (see the
      section &lt;a href=&quot;http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html&quot; rel=&quot;nofollow&quot;&gt;9.1.4. Connection
      Character Sets and Collations&lt;/a&gt; in the MySQL manual for background on these):
	&lt;ul&gt;
	  &lt;li&gt; Client (&lt;code&gt;character_set_client&lt;/code&gt;).
	  &lt;li&gt; Connection (&lt;code&gt;character_set_connection&lt;/code&gt;).
	  &lt;li&gt; Server (&lt;code&gt;character_set_server&lt;/code&gt;).
	  &lt;li&gt; Current default database (&lt;code&gt;character_set_database&lt;/code&gt;;
	  note that there are few statements that rely on this, comments in
	  the code say it is only &lt;code&gt;LOAD DATA&lt;/code&gt;).
        &lt;/ul&gt;
      &lt;li&gt; &lt;code&gt;foreign_key_checks&lt;/code&gt; (whether foreign keys are checked).
      &lt;li&gt; &lt;code&gt;unique_checks&lt;/code&gt; (whether unique constraint checks are
	enforced).
      &lt;li&gt; &lt;code&gt;auto_increment_offset&lt;/code&gt; and &lt;code&gt;auto_increment_increment&lt;/code&gt;.
      &lt;li&gt; Time zone of the master database server.
      &lt;li&gt; Names to use for days and months; this is identified by a code that
	is mapped to a table of names to use in &lt;code&gt;sql/sql_locale.cc&lt;/code&gt;.
      &lt;li&gt; &lt;code&gt;sql_auto_is_null&lt;/code&gt; (whether &lt;code&gt;SELECT ... WHERE
      autoinc IS NULL&lt;/code&gt; returns last insert id for ODBC compatibility).
    &lt;/ul&gt;
  &lt;li&gt; Error code from executing the query on the master (for non-transactional
    statements that may still make permanent changes even though they fail
    mid-way; on the slave the query should fail with the same error).
  &lt;li&gt; Connection ID (this is used to correctly distinguish &lt;code&gt;TEMPORARY
      TABLE&lt;/code&gt;s with same name used in different connections on the master
    simultaneously).
&lt;/ul&gt;
&lt;p&gt;
Note that not all of this information is replicated in all query events, as
not all of it is needed for a given query. But a replication API must make the
information available for the queries where it is needed.
&lt;/p&gt;
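&lt;p&gt;
One way to picture the list above is as a record with mostly optional fields.
The Python sketch below is only an illustration: the class name, the field
names and the convention that &lt;code&gt;None&lt;/code&gt; means &quot;not logged for this
query&quot; are my own assumptions, not the binlog format.
&lt;/p&gt;
&lt;pre&gt;
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryContext:
    """Context for one statement-based event; None means not logged."""
    query: str
    default_db: Optional[str] = None
    sql_mode: Optional[int] = None
    autocommit: Optional[bool] = None
    character_set_client: Optional[str] = None
    character_set_connection: Optional[str] = None
    character_set_server: Optional[str] = None
    character_set_database: Optional[str] = None
    foreign_key_checks: Optional[bool] = None
    unique_checks: Optional[bool] = None
    auto_increment_increment: Optional[int] = None
    auto_increment_offset: Optional[int] = None
    time_zone: Optional[str] = None
    lc_time_names_code: Optional[int] = None
    sql_auto_is_null: Optional[bool] = None
    error_code: int = 0
    connection_id: int = 0

    def logged_fields(self):
        """Names of the fields that carry a value."""
        return sorted(k for k, v in vars(self).items() if v is not None)


ctx = QueryContext(query="INSERT INTO t1 VALUES (1)", default_db="test",
                   autocommit=True, connection_id=42)
print(ctx.logged_fields())
```
&lt;/pre&gt;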

&lt;h3&gt;&lt;code&gt;INTVAR_EVENT&lt;/code&gt;, &lt;code&gt;RAND_EVENT&lt;/code&gt;, and &lt;code&gt;USER_VAR_EVENT&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;
These events provide additional context for executing a query on the slave:
&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt; Value of &lt;code&gt;LAST_INSERT_ID&lt;/code&gt; (for queries that reference it).
  &lt;li&gt; Value of &lt;code&gt;INSERT_ID&lt;/code&gt; (to get same auto_increment numbers for
  inserts on the slave as on the master).
  &lt;li&gt; The random seed (so RAND() can return same values in queries on slaves
  as on the master).
  &lt;li&gt; The values for any &lt;code&gt;@user_variables&lt;/code&gt; referenced in a query
&lt;/ul&gt;

&lt;h3&gt;&lt;code&gt;BEGIN_LOAD_QUERY_EVENT&lt;/code&gt;, &lt;code&gt;APPEND_BLOCK_EVENT&lt;/code&gt;, &lt;code&gt;EXECUTE_LOAD_QUERY_EVENT&lt;/code&gt;, and &lt;code&gt;DELETE_FILE_EVENT&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;
These four events are used to do statement replication of &lt;code&gt;LOAD DATA
INFILE&lt;/code&gt;. The contents of the file to be loaded are sent in blocks
in &lt;code&gt;BEGIN_LOAD_QUERY_EVENT&lt;/code&gt; followed by zero or
more &lt;code&gt;APPEND_BLOCK_EVENT&lt;/code&gt;. Then the actual query is sent
in &lt;code&gt;EXECUTE_LOAD_QUERY_EVENT&lt;/code&gt;, which is a variant
of &lt;code&gt;QUERY_EVENT&lt;/code&gt; that replaces the original filename with the name
of a temporary file on the slave and deletes the temporary file afterwards
(&lt;code&gt;DELETE_FILE_EVENT&lt;/code&gt; is used in certain error cases).
&lt;/p&gt;
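&lt;p&gt;
The sequence can be sketched like this in Python. Events are modelled as
simple &lt;code&gt;(type, payload)&lt;/code&gt; tuples, which is an assumption made for the
example; the real events carry more fields (such as a file id), and the query
is actually executed rather than just returned.
&lt;/p&gt;
&lt;pre&gt;
```python
import os
import tempfile


def apply_load_events(events):
    """Rebuild the data file from block events, then 'run' the query."""
    path = None
    executed = None
    for kind, payload in events:
        if kind == "BEGIN_LOAD_QUERY":
            fd, path = tempfile.mkstemp()
            with os.fdopen(fd, "wb") as f:
                f.write(payload)              # first block creates the file
        elif kind == "APPEND_BLOCK":
            with open(path, "ab") as f:
                f.write(payload)              # later blocks are appended
        elif kind == "EXECUTE_LOAD_QUERY":
            with open(path, "rb") as f:
                executed = (payload, f.read())
            os.remove(path)                   # temporary file is cleaned up
            path = None
        elif kind == "DELETE_FILE":
            if path is not None and os.path.exists(path):
                os.remove(path)               # error case: discard the file
            path = None
    return executed


events = [
    ("BEGIN_LOAD_QUERY", b"a,1\n"),
    ("APPEND_BLOCK", b"b,2\n"),
    ("EXECUTE_LOAD_QUERY", "LOAD DATA INFILE ... INTO TABLE t1"),
]
print(apply_load_events(events))
```
&lt;/pre&gt;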

&lt;p&gt;
This is the complete story of exactly how much information needs to be
provided on the master to make statement replication work as it does currently
in MySQL. If you get the thought that this is a little bit scary in terms of
complexity I tend to agree with you ;-). There is a lot to be said for the
comparative simplicity of row-based replication (and it is also interesting to
see the history of bug fixes in MySQL 5.1 that gradually have moved more and
more statements to be replicated row-based (in mixed-mode binlogging) due to
corner cases where statement-based replication can fail).
&lt;/p&gt;

&lt;p&gt;
Still, once we have the list of information, it is not that hard to provide
the information in a pluggable replication API for any implementations that
want to try their luck with statement-based replication. And of course,
row-based replication only
handles &lt;code&gt;INSERT&lt;/code&gt;/&lt;code&gt;UPDATE&lt;/code&gt;/&lt;code&gt;DELETE&lt;/code&gt;! We also
need to support &lt;code&gt;CREATE TABLE&lt;/code&gt; and similar statements, for which it
is still useful to know the above exhaustive list of information that may be
needed in one form or another.
&lt;/p&gt;

&lt;h2&gt;Row-based replication&lt;/h2&gt;

&lt;p&gt;
In row-based replication, each DML statement is binlogged in two parts. First
the tables modified in the query are described
with &lt;code&gt;TABLE_MAP_EVENT&lt;/code&gt;, and second the row values changed are
logged with &lt;code&gt;WRITE_ROWS_EVENT&lt;/code&gt;, &lt;code&gt;UPDATE_ROWS_EVENT&lt;/code&gt;,
or &lt;code&gt;DELETE_ROWS_EVENT&lt;/code&gt;.
&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;TABLE_MAP_EVENT&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;
The information describing modified tables in row-based replication is as
follows:
&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt; Database name.
  &lt;li&gt; Table name.
  &lt;li&gt; List of columns in the table. For each column, the following
  information is included:
    &lt;ul&gt;
      &lt;li&gt; Column type (this is &lt;code&gt;field-&amp;gt;type()&lt;/code&gt;).
      &lt;li&gt; Column metadata (this is what is returned
      by &lt;code&gt;field-&amp;gt;save_field_metadata()&lt;/code&gt;; this is for example the
      maximum length of a &lt;code&gt;VARCHAR&lt;/code&gt;, the precision and number of
      decimals in &lt;code&gt;DECIMAL&lt;/code&gt;, etc.)
      &lt;li&gt; Whether the column is NULL-able.
    &lt;/ul&gt;
  &lt;li&gt; Table map id; this is just an internally generated unique number for
  subsequent events to refer to the table described.
&lt;/ul&gt;
&lt;p&gt;
Note in particular that column names are &lt;i&gt;not&lt;/i&gt; used/needed in current
MySQL/MariaDB row-based replication. I personally think this is a good way to
do it. However, in a generic API, it will make sense to make the full table
definition available to implementations, each of which can choose what and how
to log in terms of table metadata.
&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;WRITE_ROWS_EVENT&lt;/code&gt;, &lt;code&gt;UPDATE_ROWS_EVENT&lt;/code&gt;, and &lt;code&gt;DELETE_ROWS_EVENT&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;
These events handle replication of
respectively &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, and &lt;code&gt;DELETE&lt;/code&gt;
(and similar statements like &lt;code&gt;REPLACE&lt;/code&gt; etc.) They contain the
following information:
&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Table map id, referencing a table previously described
  with &lt;code&gt;TABLE_MAP_EVENT&lt;/code&gt;.
  &lt;li&gt;Value of &lt;code&gt;foreign_key_checks&lt;/code&gt; and &lt;code&gt;unique_checks&lt;/code&gt;,
  similar to statement-based binlogging (but for row-based, those two are
  all the context stored, though see remarks below).
  &lt;li&gt;List (bitmap really) of columns updated. This is essentially
  the &lt;code&gt;write_set&lt;/code&gt; that is used in the storage engine API (but see
  below for explanation). For &lt;code&gt;UPDATE&lt;/code&gt;, there are two bitmaps, one
  for the before image and one for the after image.
  &lt;li&gt;List of records containing the values of each column modified. There is
  one such record for every row update logged. For &lt;code&gt;UPDATE&lt;/code&gt; there
  are two records for each row update, one for the before image (values before
  the update was done) and one for the after image (values after the update
  was done).
&lt;/ul&gt;
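&lt;p&gt;
For readers who have not worked with such column bitmaps, here is a small
Python sketch of the idea. The bit order (column 0 in the lowest bit of the
first byte) and the helper names are assumptions for the example; check the
source before relying on the exact layout.
&lt;/p&gt;
&lt;pre&gt;
```python
def pack_columns(used, n_columns):
    """Encode a set of column numbers as a bitmap, one bit per column."""
    bitmap = bytearray((n_columns + 7) // 8)
    for col in used:
        bitmap[col // 8] |= 2 ** (col % 8)
    return bytes(bitmap)


def unpack_columns(bitmap, n_columns):
    """Return the column numbers whose bit is set."""
    return [col for col in range(n_columns)
            if (bitmap[col // 8] >> (col % 8)) % 2 == 1]


all_columns = pack_columns(range(10), 10)     # every column marked
some_columns = pack_columns([0, 2, 9], 10)    # only the columns touched
print(all_columns.hex(), some_columns.hex(), unpack_columns(some_columns, 10))
```
&lt;/pre&gt;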

&lt;p&gt;
I must say, investigating how these row-based events are implemented in MySQL
really makes the feature seem rather half-baked. There are several issues:
&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;The lists/bitmaps of columns updated sound useful, but in reality they
  are set unconditionally to include all columns (except for NDB)!
  &lt;li&gt;This also means all columns
  in a row are always sent, even for &lt;code&gt;DELETE&lt;/code&gt;
  and &lt;code&gt;UPDATE&lt;/code&gt;. &lt;i&gt;Except&lt;/i&gt; for NDB, which only logs needed
  columns.
  &lt;li&gt;Some of the &quot;extra&quot; flags in the storage engine API are not included,
  such as &lt;code&gt;HA_EXTRA_WRITE_CAN_REPLACE&lt;/code&gt;. This is actually a bug, as
  it means that a storage engine using such flags to optimise its operation
  will not replicate correctly. In the existing MySQL source, only NDB uses
  this flag, but NDB does special tricks for binlogging and slave replication
  which avoids this particular issue in most cases.
&lt;/ul&gt;
&lt;p&gt;
I strongly suspect that some of this half-baking was done in a quick-and-dirty
attempt to squeeze NDB replication in. At least, there are several &quot;this is
only used by NDB&quot; type comments in the vicinity of these things in the source code.
&lt;/p&gt;

&lt;p&gt;
In any case, for the replication API, it is probably a good idea to re-think
this part and make sure that the information logged for row updates is
complete and sane for all reasonable use cases.
&lt;/p&gt;


&lt;h2&gt;Thoughts&lt;/h2&gt;

&lt;p&gt;
My idea is to have a replication API that provides for generation and
consumption of events completely separate from any details of the actual
format of events in the binlog or any other method used to store or process
the events. This will allow replication plugins that use a completely
different binlog implementation, or even have no binlog at all.
&lt;/p&gt;

&lt;p&gt;
So such an API needs to provide all of the above information (to allow
re-implementing the existing binlog/replication as a plugin, if for no other
reason), but need not provide such information in any particular event
format. In fact, I am trying to make the API so that such information need not
be materialised in structures or memory buffers at all; instead relying on
providing accessor methods, so that an implementation can request just the
information it needs, and materialise it as or if needed.
&lt;/p&gt;
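&lt;p&gt;
A minimal sketch of the accessor idea, in Python rather than the server&apos;s
C++: the consumer asks for exactly the pieces it wants, and nothing else is
fetched or copied. The class, the method names and the dictionary standing in
for session state are all hypothetical; this is not the API under design.
&lt;/p&gt;
&lt;pre&gt;
```python
class QueryEventAccessors:
    """Hands out pieces of context on request instead of a prebuilt record."""

    def __init__(self, session):
        self._session = session      # stand-in for server session state
        self.materialised = []       # which values were actually fetched

    def _get(self, name):
        self.materialised.append(name)
        return self._session[name]

    def query(self):
        return self._get("query")

    def default_db(self):
        return self._get("default_db")

    def sql_mode(self):
        return self._get("sql_mode")


def minimal_plugin(event):
    # This consumer only cares about the database and the statement text.
    return (event.default_db(), event.query())


session = {"query": "DELETE FROM t1", "default_db": "test", "sql_mode": 0}
event = QueryEventAccessors(session)
print(minimal_plugin(event), event.materialised)
```
&lt;/pre&gt;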

&lt;p&gt;
On top of this I still think it makes sense to define a standard (but
optional) materialised event format, so that more light-weight plugins can be
written that can do interesting things with replication without having to
implement a full new event format each time. I am still considering whether to
extend the existing binlog format (which is not all that attractive, as it is
not very easily extensible), or whether to define a new more flexible format
(for example based on the Google protobuffer library).
&lt;/p&gt;


&lt;h2&gt;More on the existing binlog format&lt;/h2&gt;

&lt;p&gt;
Just for completeness, here is some additional description of the existing
MySQL/MariaDB 5.1 binlog format. These are things that I believe are not
required in a new API, as they are mostly internal implementation
details. However, as I had to go through them anyway while finding the stuff
that &lt;i&gt;does&lt;/i&gt; need to be in the API, I will include a brief description here.
&lt;/p&gt;


&lt;h3&gt;Additional query information&lt;/h3&gt;

&lt;p&gt;
Some additional information, which is mostly redundant, is included with query
events for statement-based binlogging:
&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Bitmap of tables affected by multi-table update (this makes it possible to know
   which tables will be updated without parsing the query, eg. for filtering
   events based on database/table name.)
  &lt;li&gt;Time spent in query on master.
  &lt;li&gt;Catalog (I believe this is old unused stuff. Idea is that each database
  belongs to a catalog, but I have never seen this actually used anywhere).
  &lt;li&gt;A flag &lt;code&gt;LOG_EVENT_THREAD_SPECIFIC_F&lt;/code&gt; which is set if the
    query uses a &lt;code&gt;TEMPORARY&lt;/code&gt; table (this makes the information available
    without parsing the query).
  &lt;li&gt;A flag &lt;code&gt;LOG_EVENT_SUPPRESS_USE_F&lt;/code&gt; set in a few cases when the
    master knows that the query is independent of what the current database is
    (so that a possible &lt;code&gt;USE&lt;/code&gt; statement can be optimised away).
&lt;/ul&gt;


&lt;h3&gt;Binlog specific events&lt;/h3&gt;

&lt;p&gt;
These are events that are specific to the binlog implementation:
&lt;/p&gt;
&lt;dl&gt;
  &lt;dt&gt;&lt;code&gt;XID_EVENT&lt;/code&gt;&lt;/dt&gt;
  &lt;dd&gt;
    This is used to record a transaction ID for each transaction written to
    the binlog in 2-phase commit. This recorded ID is needed during crash
    recovery on the master to know which prepared transactions in
    transactional engines need to be recovered to get consistency with what is
    in the binlog. It is not used on the slave in replication (though this
    event implies a &lt;code&gt;COMMIT&lt;/code&gt;, which &lt;i&gt;does&lt;/i&gt; have an effect on the
    slave, of course.)
  &lt;/dd&gt;
  &lt;dt&gt;&lt;code&gt;FORMAT_DESCRIPTION_EVENT&lt;/code&gt;&lt;/dt&gt;
  &lt;dd&gt;
    This event is written at the start of every binlog file. It provides to
    slaves reading the binlog the master server version and the event size of
    all following events, thereby providing some facilities for extending
    event formats while maintaining backwards compatibility.
  &lt;/dd&gt;
  &lt;dt&gt;&lt;code&gt;STOP_EVENT&lt;/code&gt;&lt;/dt&gt;
  &lt;dd&gt;
    This is logged when the master shuts down gracefully (though I do not
    think this is used much, if at all).
  &lt;/dd&gt;
  &lt;dt&gt;&lt;code&gt;ROTATE_EVENT&lt;/code&gt;&lt;/dt&gt;
  &lt;dd&gt;
    This is logged at the end of a binlog file when the master starts a new
    binlog file. It is needed by the slave to reset its master binlog
    position so that the IO thread can proceed correctly from the next binlog
    file (incidentally, it is a clear weakness in the binlog implementation
    that the slaves need knowledge about binlog file names and data offsets on
    the master server, and is a cause of much complexity when switching
    masters in advanced replication topologies. Something that really needs
    improvements in the near future).
  &lt;/dd&gt;
  &lt;dt&gt;&lt;code&gt;INCIDENT_EVENT&lt;/code&gt;&lt;/dt&gt;
  &lt;dd&gt;
    This is logged by the master when something bad happens that may cause
    replication to fail/diverge, so that the slave can be notified of the
    problem and stop, informing the DBA/sysadm to resolve the issue.
  &lt;/dd&gt;
&lt;/dl&gt;

&lt;h3&gt;Obsolete events&lt;/h3&gt;

&lt;p&gt;
Finally there are a number of events that are no longer generated (but which
are still important for the slave replication code to handle to be able to
work with masters of older versions):
&lt;/p&gt;
&lt;dl&gt;
  &lt;dt&gt;&lt;code&gt;LOAD_EVENT&lt;/code&gt;, &lt;code&gt;NEW_LOAD_EVENT&lt;/code&gt;, &lt;code&gt;CREATE_FILE_EVENT&lt;/code&gt;,
    and &lt;code&gt;EXEC_LOAD_EVENT&lt;/code&gt;&lt;/dt&gt;
  &lt;dd&gt;
    Various old events for handling &lt;code&gt;LOAD DATA INFILE&lt;/code&gt; (as can be
    seen, &lt;code&gt;LOAD DATA INFILE&lt;/code&gt; has had some changes in replication
    over the years :-).
  &lt;/dd&gt;
  &lt;dt&gt;&lt;code&gt;START_EVENT_V3&lt;/code&gt;&lt;/dt&gt;
  &lt;dd&gt;
    Old version of &lt;code&gt;FORMAT_DESCRIPTION_EVENT&lt;/code&gt;.
  &lt;/dd&gt;
  &lt;dt&gt;&lt;code&gt;PRE_GA_WRITE_ROWS_EVENT&lt;/code&gt;, &lt;code&gt;PRE_GA_UPDATE_ROWS_EVENT&lt;/code&gt;,
  and &lt;code&gt;PRE_GA_DELETE_ROWS_EVENT&lt;/code&gt;&lt;/dt&gt;
  &lt;dd&gt;
    Old versions of the row-based replication binlog events.
  &lt;/dd&gt;
  &lt;dt&gt;&lt;code&gt;SLAVE_EVENT&lt;/code&gt;&lt;/dt&gt;
  &lt;dd&gt;
    Not used, I think it may have been related to some feature that was never
    completed.
  &lt;/dd&gt;
&lt;/dl&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/13253.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/12810.html</guid>
  <pubDate>Mon, 31 May 2010 14:17:25 GMT</pubDate>
  <title>Fixing MySQL group commit (part 4 of 3)</title>
  <link>http://kristiannielsen.livejournal.com/12810.html</link>
  <description>&lt;p&gt;
(No
&lt;a href=&quot;http://kristiannielsen.livejournal.com/12254.html&quot; rel=&quot;nofollow&quot;&gt;three&lt;/a&gt;-&lt;a href=&quot;http://kristiannielsen.livejournal.com/12408.html&quot; rel=&quot;nofollow&quot;&gt;part&lt;/a&gt; &lt;a href=&quot;http://kristiannielsen.livejournal.com/12553.html&quot; rel=&quot;nofollow&quot;&gt;series&lt;/a&gt; is complete without a part 4, right?)
&lt;/p&gt;

&lt;p&gt;
Here is an analogy that describes well what group commit does. We have a bus
driving back and forth transporting people from A to B (corresponding
to &lt;code&gt;fsync()&lt;/code&gt; &quot;transporting&quot; commits to durable storage on
disk). The group commit optimisation is to have the bus pick up everyone that
is waiting at A before driving to B, not drive people one by one. Makes sense,
huh? :-)
&lt;/p&gt;

&lt;p&gt;
It is pretty obvious that this optimisation of having more than one person in
the bus can dramatically improve throughput, and it is the same for the group
commit optimisation. Here is a graph from a benchmark comparing stock MariaDB
5.1 vs. MariaDB patched with a proof-of-concept patch that enables group
commit:
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;http://knielsen-hq.org/maria/fix-group-commit-4.png&quot; alt=&quot;Benchmark results&quot;&gt;
&lt;/center&gt;

&lt;p&gt;
When group commit is implemented, we see clearly how performance (measured in
queries per second) scales dramatically as the number of threads
increases. Whereas with stock MariaDB with no group commit, there is no
scaling at all. We also see that SSD is better than HDD (no surprise there),
but that with sufficient parallelism from the application, group commit can to
a large extent compensate for the slower disks.
&lt;/p&gt;

&lt;p&gt;
This is the same benchmark as in
the &lt;a href=&quot;http://kristiannielsen.livejournal.com/12254.html&quot; rel=&quot;nofollow&quot;&gt;first part&lt;/a&gt;
of the series. Binlog is enabled. Durability is enabled
with &lt;code&gt;sync_binlog=1&lt;/code&gt; and &lt;code&gt;flush_log_at_trx_commit=1&lt;/code&gt;
(and disk cache disabled to prevent the disks lying about when data is
durable). The load is single-row transactions against a 1000000-row XtraDB
table. The benchmark is thus specifically designed to make
the &lt;code&gt;fsync()&lt;/code&gt; calls at the end of commit the bottleneck.
&lt;/p&gt;

&lt;p&gt;
I should remark that I did not really tune the servers used in the benchmark
for high parallelism (except for raising &lt;code&gt;max_connections&lt;/code&gt; :-), and
I ran the client on the same machine as the server. So it is likely that there
are other effects than group commit influencing the performance at high
parallelism (especially on the SSD results, which I ran on my laptop). But I
just wanted to see if my group commit work scales with higher parallelism, and the
graph clearly shows that it does!
&lt;/p&gt;

&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;p&gt;
For this work, I have focused a lot on the API for storage engine and binlog
plugins (we do not have binlog plugins now, but this is something that we will
be working on in MariaDB later this year). I want a clean interface that
allows plugins to implement group commit in a simple and efficient manner.
&lt;/p&gt;

&lt;p&gt;
A crucial point is the desire to get commits ordered the same way in the
different engines (ie. typically in InnoDB and in the binlog), as I discussed
in previous articles. As group commit is about parallelism, and ordering is
about serialisation, these two tend to get into conflict. My idea is to
introduce new calls in the interface to storage engines and the XA transaction
coordinator (which is how binlog interacts with commit internally in the
server). These new calls allow plugins that care about commit order to
cooperate on getting correct ordering without getting in each other&apos;s way and
killing parallelism. Plugins that do not need any ordering can ignore the
new calls, which are optional (for example the transaction coordinator that
runs when the binlog is disabled does not need any ordering).
&lt;/p&gt;

&lt;p&gt;
The full architecture is written up in detail in the
MariaDB &lt;a href=&quot;http://askmonty.org/worklog/Server-RawIdeaBin/?tid=116&quot; rel=&quot;nofollow&quot;&gt;Worklog#116&lt;/a&gt;.
But the basic idea is to introduce a new handlerton method:
&lt;pre&gt;
    void commit_ordered(handlerton *hton, THD *thd, bool all);
&lt;/pre&gt;
This is called just prior to the normal &lt;code&gt;commit()&lt;/code&gt; method, and is
guaranteed to run in the same commit order across all engines (and binlog)
participating in the transaction.
&lt;/p&gt;

&lt;p&gt;
This allows for a lot of flexibility in plugin implementations. A typical
implementation would in the &lt;code&gt;commit_ordered()&lt;/code&gt; method write the
transaction data into its in-memory log buffer, and delay the
time-consuming &lt;code&gt;write()&lt;/code&gt; and &lt;code&gt;fsync()&lt;/code&gt; to the
parallel &lt;code&gt;commit()&lt;/code&gt; method. InnoDB/XtraDB is already structured in
this way, so fits very well into this scheme.
&lt;/p&gt;
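&lt;p&gt;
Here is a single-threaded toy model of that split, in Python, just to show why
it lets several commits share one sync. The class, the simulated
&lt;code&gt;fsync()&lt;/code&gt; counter and the batching driver are assumptions made for
the example; real engines do this with threads and real I/O.
&lt;/p&gt;
&lt;pre&gt;
```python
class GroupCommitLog:
    def __init__(self):
        self.buffer = []     # in-memory log buffer, filled in commit order
        self.durable = 0     # how many buffered entries are already synced
        self.fsyncs = 0      # number of simulated fsync() calls

    def commit_ordered(self, txn):
        # Cheap, serialised step: fix the commit order in memory.
        self.buffer.append(txn)
        return len(self.buffer)

    def commit(self, position):
        # Expensive step: one sync makes everything buffered so far durable,
        # so later callers whose entry is already covered skip the sync.
        if position > self.durable:
            self.fsyncs += 1
            self.durable = len(self.buffer)


def run(batch_sizes):
    """Each batch is a set of transactions that queue up before any sync."""
    log = GroupCommitLog()
    txn = 0
    for size in batch_sizes:
        positions = []
        for _ in range(size):
            txn += 1
            positions.append(log.commit_ordered(txn))
        for position in positions:
            log.commit(position)
    return log.fsyncs


print(run([1] * 8), run([8]), run([3, 5]))
```
&lt;/pre&gt;
&lt;p&gt;
Eight transactions committing one at a time cost eight syncs in this model,
while the same eight arriving together cost one.
&lt;/p&gt;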

&lt;p&gt;
But if an engine wants to use another approach, for example a
ticket-based approach
as &lt;a href=&quot;http://www.facebook.com/note.php?note_id=386328905932&quot; rel=&quot;nofollow&quot;&gt;Mark&lt;/a&gt;
and &lt;a href=&quot;http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html&quot; rel=&quot;nofollow&quot;&gt;Mats&lt;/a&gt;
suggested, that is easy to do too. Just allocate the ticket
in &lt;code&gt;commit_ordered()&lt;/code&gt;, and use it in &lt;code&gt;commit()&lt;/code&gt;. I
believe most approaches should fit in well with the proposed model.
&lt;/p&gt;

&lt;p&gt;
I also added a corresponding &lt;code&gt;prepare_ordered()&lt;/code&gt; call, which runs
in commit order during the prepare phase. The intention is to provide a place
to release InnoDB row locks early for even better performance, though I still
need to get the Facebook people to explain exactly what they want to do in
this respect ;-)
&lt;/p&gt;

&lt;p&gt;
I also spent a lot of thought on getting efficient inter-thread
synchronisation in the
architecture. As &lt;a href=&quot;http://mysqlmusings.blogspot.com/2010/04/binary-log-group-commit-implementation.html&quot; rel=&quot;nofollow&quot;&gt;Mats&lt;/a&gt;
mentioned, if one is not careful, it is easy to end up
with &lt;span&gt;O(N&lt;sup&gt;2&lt;/sup&gt;)&lt;/span&gt; cost of thread wake-up, with N
the number of transactions participating in group commit. As the goal is to get N
as high as possible to maximise sharing of the expensive &lt;code&gt;fsync()&lt;/code&gt;
call, such &lt;span&gt;O(N&lt;sup&gt;2&lt;/sup&gt;)&lt;/span&gt; cost is to be avoided.
&lt;/p&gt;
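&lt;p&gt;
One way such a quadratic cost can arise (an assumed model for illustration,
not a measurement) is when all waiters sleep on a single shared condition
variable and every finishing transaction broadcasts on it. The two hypothetical
functions below just count wake-ups under that model and under targeted
signalling.
&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Assumed model: each finishing transaction broadcasts, waking every thread
// that is still waiting; all but one of them go straight back to sleep.
std::uint64_t wakeups_with_broadcast(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t still_waiting = n; still_waiting > 0; --still_waiting)
        total += still_waiting;
    assert(total == n * (n + 1) / 2);  // closed form of the sum above
    return total;
}

// Alternative: each waiter is signalled exactly once, when it may proceed.
std::uint64_t wakeups_with_targeted_signal(std::uint64_t n) { return n; }
```

&lt;p&gt;
With 64 transactions in a group, the broadcast model gives 64*65/2 = 2080
wake-ups, against 64 when each thread is woken individually.
&lt;/p&gt;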

&lt;p&gt;
In the architecture
described in &lt;a href=&quot;http://askmonty.org/worklog/Server-RawIdeaBin/?tid=116&quot; rel=&quot;nofollow&quot;&gt;MariaDB
Worklog#116&lt;/a&gt;, there should in the normal case only be a single highly
contested lock, the one on the binlog group commit (which is inherent to the
idea of group commit, one thread does the &lt;code&gt;fsync()&lt;/code&gt; while the rest
of participating threads wait). I use a lock-free queue to make threads
in &lt;code&gt;prepare_ordered()&lt;/code&gt; not block threads
in &lt;code&gt;commit_ordered()&lt;/code&gt; and vice
versa. The &lt;code&gt;prepare_ordered()&lt;/code&gt; calls run under a global lock, but
as they are intended to execute very quickly there should ideally be little contention
here. The &lt;code&gt;commit_ordered()&lt;/code&gt; calls run in a loop in a single
thread, also avoiding serious lock contention as long as &lt;code&gt;commit_ordered()&lt;/code&gt; runs
quickly as intended.
&lt;/p&gt;
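&lt;p&gt;
For readers unfamiliar with the technique, a lock-free queue of this kind can
be sketched as an atomically updated linked list that producers push onto and
that a single consumer detaches in one step. This is a generic illustration
under my own naming (&lt;code&gt;PendingQueue&lt;/code&gt;, &lt;code&gt;QueueEntry&lt;/code&gt;),
not the data structure from the patch.
&lt;/p&gt;

```cpp
#include <atomic>
#include <cassert>
#include <vector>

struct QueueEntry {
    int trx_id;
    QueueEntry *next;
};

// Lock-free push list; names are hypothetical, not taken from the patch.
struct PendingQueue {
    std::atomic<QueueEntry *> head{nullptr};

    // Called by any thread; never blocks, only retries the CAS on contention.
    void push(QueueEntry *entry) {
        assert(entry != nullptr);
        QueueEntry *old_head = head.load(std::memory_order_relaxed);
        do {
            entry->next = old_head;
        } while (!head.compare_exchange_weak(old_head, entry,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

    // Called by the single consumer: detach the whole list in one atomic step
    // and reverse it, so that entries come out in the order they were pushed.
    std::vector<int> take_all_in_push_order() {
        QueueEntry *node = head.exchange(nullptr, std::memory_order_acquire);
        QueueEntry *reversed = nullptr;
        while (node != nullptr) {
            QueueEntry *next = node->next;
            node->next = reversed;
            reversed = node;
            node = next;
        }
        std::vector<int> order;
        for (QueueEntry *e = reversed; e != nullptr; e = e->next)
            order.push_back(e->trx_id);
        return order;
    }
};
```

&lt;p&gt;
Producers never wait for the consumer and the consumer never waits for
producers, which is the property wanted between the two ordered phases.
&lt;/p&gt;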

&lt;p&gt;
In particular, running the &lt;code&gt;commit_ordered()&lt;/code&gt; loop in a single
thread for each group commit avoids the high cost of thread wake-up. If we were to
try to run the sequential part of commit in different threads in a specific
commit order, we would need to switch execution context from one thread to the
next, bouncing the thread of control all over the cores in an SMP
system, which takes lots of context switches and could potentially be
costly. In the proposed architecture, a single thread runs
all &lt;code&gt;commit_ordered()&lt;/code&gt; method calls and wakes up the other waiting
threads individually, each free to proceed immediately without any more
waiting for one another.
&lt;/p&gt;

&lt;p&gt;
Of course, an engine/binlog plugin that so desires is free to implement such
thread-hopping itself, by allocating a ticket in one of
the &lt;code&gt;_ordered()&lt;/code&gt; methods, and doing its own synchronisation in
its &lt;code&gt;commit()&lt;/code&gt; method. After all, it may be beneficial or necessary in
some cases. The point is that different plugins can use different methods,
each using the one that works best for that particular engine without getting
in the way of each other.
&lt;/p&gt;


&lt;h2&gt;Further improvements&lt;/h2&gt;

&lt;p&gt;
If we implement this approach, there are a couple of other interesting
enhancements that can be implemented relatively easily due to the commit
ordering facilities:
&lt;ul&gt;
&lt;li&gt; Currently, we sync to disk three times per commit to ensure consistency
  between InnoDB and binlog after a crash. But if we know the commit order is
  the same in engine and in binlog, and if we store in the engine the
  corresponding binlog position (which InnoDB already does), then we need only
  sync once (for the binlog) and can still recover reliably after a
  crash. Since we have a consistent commit order, we can during crash recovery replay
  the binlog from the position just after the last commit that survived inside InnoDB
  (just like we would apply the binlog on a slave).
&lt;li&gt; Currently, the &lt;code&gt;START TRANSACTION WITH CONSISTENT SNAPSHOT&lt;/code&gt;,
  which is supposed to run a transaction with a consistent view in multiple
  transactional engines, is not really all that consistent. It is quite
  possible to see a transaction committed in one engine but not in another,
  and vice versa. However, with an architecture like the one proposed here, it
  should be easy to just take the snapshot under the same lock
  that &lt;code&gt;commit_ordered()&lt;/code&gt; runs under, and the snapshot will be
  really consistent (on engines that support commit order). As a bonus, it
  would also be possible to provide a binlog position corresponding to the
  consistent snapshot.
&lt;li&gt; XtraBackup (and similar backup solutions) should be able to create a backup
  which includes a binlog position (suitable for provisioning a new slave)
  without having to run &lt;code&gt;FLUSH TABLES WITH READ LOCK&lt;/code&gt;, which can be
  quite costly as it blocks all transaction processing while it runs.
&lt;li&gt;
  As already mentioned, the Facebook group has some ideas for releasing InnoDB
  row locks early in order to reduce the load on hot-spot rows; this requires
  consistent commit order.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/p&gt;

&lt;h2&gt;Implementation&lt;/h2&gt;
&lt;p&gt;
If anyone is interested in looking at the actual code of the proof-of-concept
implementation, it is available as
a &lt;a href=&quot;https://knielsen-hq.org/maria/patches.mwl116/&quot; rel=&quot;nofollow&quot;&gt;quilt patch
    series&lt;/a&gt; and as
a &lt;a href=&quot;https://code.launchpad.net/~knielsen/maria/5.1-group-commit&quot; rel=&quot;nofollow&quot;&gt;Launchpad
    bzr tree&lt;/a&gt; (licence is GPLv2).
&lt;/p&gt;

&lt;p&gt;
Do be aware that this is work in progress.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/12810.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>performance</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/12553.html</guid>
  <pubDate>Fri, 23 Apr 2010 09:18:28 GMT</pubDate>
  <title>Fixing MySQL group commit (part 3)</title>
  <link>http://kristiannielsen.livejournal.com/12553.html</link>
  <description>&lt;p&gt;
This is the third and final article in a series about group commit in
MySQL. The &lt;a href=&quot;http://kristiannielsen.livejournal.com/12254.html&quot; rel=&quot;nofollow&quot;&gt;first article&lt;/a&gt; discussed the background: group
commit in MySQL does not work when the binary log is
enabled. The &lt;a href=&quot;http://kristiannielsen.livejournal.com/12408.html&quot; rel=&quot;nofollow&quot;&gt;second article&lt;/a&gt; explained the part of the
InnoDB code that is responsible for the problem.
&lt;/p&gt;

&lt;p&gt;
So how do we fix group commit in MySQL? As we saw in the &lt;a href=&quot;http://kristiannielsen.livejournal.com/12408.html&quot; rel=&quot;nofollow&quot;&gt;second
article&lt;/a&gt; of this series, we can just eliminate
the &lt;code&gt;prepare_commit_mutex&lt;/code&gt; from InnoDB, extend the binary logging
to do group commit by itself, and that would solve the problem.
&lt;/p&gt;

&lt;p&gt;
However, we might be able to do even better. As explained in
the &lt;a href=&quot;http://kristiannielsen.livejournal.com/12254.html&quot; rel=&quot;nofollow&quot;&gt;first article&lt;/a&gt;, with binary logging enabled we need XA
to ensure consistency after a crash, and that requires doing three fsyncs for
a commit. Even if each of those can be shared with other transactions using
group commit, it is still expensive. During a &lt;a href=&quot;https://lists.launchpad.net/maria-developers/msg01998.html&quot; rel=&quot;nofollow&quot;&gt;discussion&lt;/a&gt; on
the &lt;a href=&quot;https://launchpad.net/~maria-developers&quot; rel=&quot;nofollow&quot;&gt;maria-developers@&lt;/a&gt; mailing list, an idea came up for how
to do this with only a single (shared) &lt;code&gt;fsync()&lt;/code&gt; for a commit.
&lt;/p&gt;

&lt;p&gt;
The basic idea is to only do &lt;code&gt;fsync()&lt;/code&gt; for the binary log, not for
the storage engine, corresponding to running
with &lt;code&gt;innodb_flush_log_at_trx_commit&lt;/code&gt; set to 2 or even 0.
&lt;/p&gt;

&lt;p&gt;
If we do this, we can end up in the following situation: some transaction A is
written into the binary log, and &lt;code&gt;fsync()&lt;/code&gt; makes sure that is
stored durably on disk. Then transaction A is committed in InnoDB. And before
the operating system and hardware get around to storing the InnoDB part of A
durably on disk, we get a crash.
&lt;/p&gt;

&lt;p&gt;
Now on crash recovery, we will have A in the binary log, but in the engine A
may be lost, causing an inconsistency. But this inconsistency can be resolved
simply by re-playing the transaction A against InnoDB, using the data for A
stored in the binary log. Just like it would normally be applied on a
replication slave. After re-playing the transaction, we will again be in a
consistent state.
&lt;/p&gt;

&lt;p&gt;
In order to do this, we need two things:
&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt; For each transaction, we need to store in the InnoDB engine
  the corresponding position in the binary log, so that at crash
  recovery we know from which position in the binary log to
  start re-playing transactions.
&lt;li&gt; We also need to ensure that the order of commits in the binary log and in
  InnoDB is the same! Otherwise, after a crash we could find ourselves in the
  situation that the binary log has transaction A followed by transaction B,
  while the InnoDB storage engine contains only transaction B committed, not
  transaction A. This would leave us with no reliable place in the binary log
  to start re-playing transactions from.
&lt;/ul&gt;
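&lt;p&gt;
Taken together, the two requirements make recovery a simple loop. The sketch
below uses hypothetical types (&lt;code&gt;BinlogEvent&lt;/code&gt;,
&lt;code&gt;SketchEngine&lt;/code&gt;) and treats a transaction as an opaque string; it
only illustrates the replay rule, not how the server actually parses or applies
binlog events.
&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct BinlogEvent {
    std::size_t end_position;  // binlog offset just after this transaction
    std::string trx;           // stand-in for the logged changes
};

struct SketchEngine {
    std::size_t last_binlog_position = 0;  // stored with each engine commit
    std::vector<std::string> committed;

    void apply(const BinlogEvent &ev) {
        committed.push_back(ev.trx);
        last_binlog_position = ev.end_position;
    }
};

// Replay every binlog transaction that lies after the position the engine
// has recorded. This is only valid if engine and binlog commit in the same
// order, so that "everything after the stored position" is exactly the set
// of transactions the engine lost.
std::size_t recover_from_binlog(SketchEngine &engine,
                                const std::vector<BinlogEvent> &binlog) {
    std::size_t replayed = 0;
    std::size_t previous_end = 0;
    for (const BinlogEvent &ev : binlog) {
        assert(ev.end_position > previous_end);  // binlog offsets only grow
        previous_end = ev.end_position;
        if (ev.end_position > engine.last_binlog_position) {
            engine.apply(ev);
            ++replayed;
        }
    }
    return replayed;
}
```

&lt;p&gt;
If the engine could contain B without A, no single stored position would
describe its state, and this loop would either skip A or apply B twice.
&lt;/p&gt;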
&lt;p&gt;
Now, for ensuring same commit order, we do &lt;em&gt;not&lt;/em&gt; want to re-introduce
the (by now) infamous &lt;code&gt;prepare_commit_mutex&lt;/code&gt;, as that would make it
impossible to have group commit for the binary log. Instead we should use
another way to ensure such order. There are several ways to do this. Mark
Callaghan explained one such way to do this at the MySQL conference, described
further in &lt;a href=&quot;http://www.facebook.com/note.php?note_id=386328905932&quot; rel=&quot;nofollow&quot;&gt;this article&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
The basic idea is that when writing transactions into the binary log, we
remember their ordering. We can do this by putting the transactions into a
queue, by assigning them a global transaction id in monotonic sequence, or by
assigning them some kind of ticket as Mark suggests. Then
inside &lt;code&gt;innobase_commit()&lt;/code&gt;, transactions can coordinate with each
other to make sure they go into the engine in the order dictated by the queue,
global transaction id, or ticket.
&lt;/p&gt;

&lt;p&gt;
I think I have a working idea for how to extend the storage engine API to be
able to do this in a clean way for any transactional engine. We can introduce an
optional handler call &lt;code&gt;commit_fast()&lt;/code&gt; that is guaranteed to be called
in the same order as transactions are written to the binary log, prior to the
normal commit handler call. Basically it would be called under a binary log
mutex. The idea is that &lt;code&gt;commit_fast()&lt;/code&gt; will contain the &quot;fast&quot; part
of &lt;code&gt;innobase_commit()&lt;/code&gt;, as explained in the &lt;a href=&quot;http://kristiannielsen.livejournal.com/12408.html&quot; rel=&quot;nofollow&quot;&gt;previous
article&lt;/a&gt;. Then in &lt;code&gt;commit_fast()&lt;/code&gt;, the engine can do the
assignment of a ticket or insertion into a queue, as needed.
&lt;/p&gt;

&lt;p&gt;
I think possibly for symmetry we would want to also add a
similar &lt;code&gt;xa_prepare_fast()&lt;/code&gt; handler call that would be
invoked &lt;em&gt;after&lt;/em&gt; the normal &lt;code&gt;xa_prepare()&lt;/code&gt; and similarly be
guaranteed to be in the same order as binary log commit, though I need to
consider this a bit more to fully make up my mind.
&lt;/p&gt;

&lt;p&gt;
I believe such an addition to the storage engine API would make it possible to
implement, in a clean way for all engines, the method of re-playing the binary
log at crash recovery, avoiding more than a single &lt;code&gt;fsync()&lt;/code&gt; at commit.
&lt;/p&gt;

&lt;p&gt;
So this concludes the series. Using these ideas, I hope we will soon see
patches for MySQL and MariaDB that greatly enhance the performance for
durable and crash-safe commits, so that we can finally declare Peter&apos;s
original &lt;a href=&quot;http://bugs.mysql.com/bug.php?id=13669&quot; rel=&quot;nofollow&quot;&gt;Bug#13669&lt;/a&gt;
fixed!
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/12553.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>performance</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/12408.html</guid>
  <pubDate>Fri, 23 Apr 2010 08:51:37 GMT</pubDate>
  <title>Fixing MySQL group commit (part 2)</title>
  <link>http://kristiannielsen.livejournal.com/12408.html</link>
  <description>&lt;p&gt;
This is the second in a series of three articles about ideas for implementing
full support for &lt;em&gt;group commit&lt;/em&gt; in MariaDB. The &lt;a href=&quot;http://kristiannielsen.livejournal.com/12254.html&quot; rel=&quot;nofollow&quot;&gt;first
article&lt;/a&gt; discussed the background: group commit in MySQL does not work when
the binary log is enabled. See also the &lt;a href=&quot;http://kristiannielsen.livejournal.com/12553.html&quot; rel=&quot;nofollow&quot;&gt;third&lt;/a&gt; article.
&lt;/p&gt;

&lt;p&gt;
Internally, InnoDB (and hence XtraDB) do support group commit. The way this
works is seen in the &lt;code&gt;innobase_commit()&lt;/code&gt; function. The work in this
function is split into two parts. First, a &quot;fast&quot; part, which registers the commit in
memory:
&lt;pre&gt;
    trx-&amp;gt;flush_log_later = TRUE;
    innobase_commit_low(trx);
    trx-&amp;gt;flush_log_later = FALSE;
&lt;/pre&gt;
Second, a &quot;slow&quot; part, which writes and fsync&apos;s the commit to disk to make
it durable:
&lt;pre&gt;
    trx_commit_complete_for_mysql(trx);
&lt;/pre&gt;
While one transaction is busy executing the &quot;slow&quot; part, any number of later
transactions can complete their &quot;fast&quot; part, and queue up waiting for the
running &lt;code&gt;fsync()&lt;/code&gt; to finish. Once it does finish, a
single &lt;code&gt;fsync()&lt;/code&gt; of the log is now sufficient to complete the slow
part for &lt;em&gt;all&lt;/em&gt; of the queued-up transactions. This is how group commit
works in InnoDB when the binary log is disabled.
&lt;/p&gt;
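&lt;p&gt;
The effect can be modelled with two log positions. The following is a
single-threaded sketch with hypothetical names (&lt;code&gt;SketchRedoLog&lt;/code&gt;
and its members are not InnoDB&apos;s), in which the &lt;code&gt;fsync()&lt;/code&gt; is
simulated by advancing a counter.
&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical redo-log model; field names are illustrative, not InnoDB's.
struct SketchRedoLog {
    std::uint64_t end_lsn = 0;      // end of the log in memory
    std::uint64_t flushed_lsn = 0;  // everything up to here is on disk
    int fsync_calls = 0;

    // "Fast" part: register the commit in memory and note its log position.
    std::uint64_t commit_fast_part() { return ++end_lsn; }

    // "Slow" part: make the commit durable, unless another transaction's
    // fsync() has already covered this position.
    void commit_slow_part(std::uint64_t my_lsn) {
        assert(my_lsn <= end_lsn);
        if (my_lsn <= flushed_lsn)
            return;
        flushed_lsn = end_lsn;  // one simulated fsync() covers the whole queue
        ++fsync_calls;
    }
};
```

&lt;p&gt;
If one transaction completes both parts, and three more then finish their fast
part before any of them starts its slow part, the first of the three pays for
one simulated &lt;code&gt;fsync()&lt;/code&gt; and the other two return immediately: two
syncs for four commits.
&lt;/p&gt;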

&lt;p&gt;
When the binary log is enabled, MySQL uses XA/2-phase commit to ensure
consistency between the binary log and the storage engine. This means that a
commit now takes three parts:
&lt;pre&gt;
    innobase_xa_prepare()
    write() and fsync() binary log
    innobase_commit()
&lt;/pre&gt;
&lt;/p&gt;

&lt;p&gt;
Now, there is an extra detail to the prepare and commit code in InnoDB. InnoDB
locks the &lt;code&gt;prepare_commit_mutex&lt;/code&gt;
in &lt;code&gt;innobase_xa_prepare()&lt;/code&gt;, and does not release it until after the
&quot;fast&quot; part of &lt;code&gt;innobase_commit()&lt;/code&gt; has completed. This means that
while one transaction is executing &lt;code&gt;innobase_commit()&lt;/code&gt;, all
subsequent transactions will be blocked
inside &lt;code&gt;innobase_xa_prepare()&lt;/code&gt; waiting for the mutex. As a result,
no transactions can queue up to share an &lt;code&gt;fsync()&lt;/code&gt;, and group
commit is broken with the binary log enabled.
&lt;/p&gt;

&lt;p&gt;
So, why does InnoDB hold the problematic &lt;code&gt;prepare_commit_mutex&lt;/code&gt;
across the binary logging? That turns out to be a really good question. After
extensive research into the issue, it appears that in fact there is no good
reason at all for the mutex to be held.
&lt;/p&gt;

&lt;p&gt;
Comments in the InnoDB code, in the bug tracker, and elsewhere, mention that
taking the mutex is necessary to ensure that commits happen in the same order
in InnoDB and in the binary log. This is certainly true; without taking the
mutex we can have transaction A committed in InnoDB before transaction B, but
B written to the binary log before transaction A.
&lt;/p&gt;

&lt;p&gt;
But this just raises the next question: why is it necessary to ensure the same
commit order in InnoDB and in the binary log? The only reason that I could
find stated is that this is needed for InnoDB hot backup and XtraBackup to be able
to extract the correct binary log position corresponding to the state of the
engine contained in the backup.
&lt;/p&gt;

&lt;p&gt;
Sergei Golubchik investigated this issue during the 2010 MySQL conference,
inspired by the many discussions of group commit that took place there. It
turns out that XtraBackup does a &lt;code&gt;FLUSH TABLES WITH READ LOCK&lt;/code&gt; when it
extracts the binary log position. This statement completely blocks the
processing of commits until released, removing &lt;em&gt;any&lt;/em&gt; possibility of
different commit order in engine and binary log (InnoDB hot backup is closed
source, so difficult to check, but presumably works in the same way). So there
certainly is no need for holding the &lt;code&gt;prepare_commit_mutex&lt;/code&gt; to
ensure consistent binary log position for backups!
&lt;/p&gt;

&lt;p&gt;
There is another popular way to do hot backups without using &lt;code&gt;FLUSH
TABLES WITH READ LOCK&lt;/code&gt;: LVM snapshots. But an LVM snapshot essentially
runs the recovery algorithm at restore time. In this case, XA is used to
ensure that engine and binary log are consistent at server start, eliminating
any need to enforce same ordering of commits.
&lt;/p&gt;

&lt;p&gt;
So it really seems that there just is no good reason for
the &lt;code&gt;prepare_commit_mutex&lt;/code&gt; to exist in the first
place. Unless someone can come up with a good explanation for why it should be
needed, I am forced to conclude that we have lived with 5 years of broken
group commit in MySQL solely because of incorrect hearsay about how things
should work. Which is kind of sad, and suggests that no-one at MySQL or InnoDB
ever cared sufficiently to take a serious look at this important issue.
&lt;/p&gt;

&lt;p&gt;
(In order to get full group commit in MySQL there is another issue that needs
to be solved. The current binary log code does not include implementation of
group commit, so this also needs to be implemented. Such an implementation
should be possible to do using standard techniques, and is independent of
fixing group commit in InnoDB.)
&lt;/p&gt;

&lt;p&gt;
This concludes the second part of the series, showing that group commit can be
restored simply by removing the offending &lt;code&gt;prepare_commit_mutex&lt;/code&gt;
from InnoDB. The third and final article in the series will discuss some
deeper issues that arise from looking into this part of the server code, and
some interesting ideas for further improving things related to group commit.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/12408.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>performance</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>7</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/12254.html</guid>
  <pubDate>Fri, 23 Apr 2010 08:48:15 GMT</pubDate>
  <title>Fixing MySQL group commit (part 1)</title>
  <link>http://kristiannielsen.livejournal.com/12254.html</link>
  <description>&lt;p&gt;
This is the first in a series of three articles about ideas for implementing
full support for &lt;em&gt;group commit&lt;/em&gt; in MariaDB (for the other parts see the &lt;a href=&quot;http://kristiannielsen.livejournal.com/12408.html&quot; rel=&quot;nofollow&quot;&gt;second&lt;/a&gt; and &lt;a href=&quot;http://kristiannielsen.livejournal.com/12553.html&quot; rel=&quot;nofollow&quot;&gt;third&lt;/a&gt; articles). Group commit is an
important optimisation for databases that helps mitigate the latency of
physically writing data to permanent storage. Group commit can have a dramatic
effect on performance, as the following graph shows:
&lt;/p&gt;
&lt;center&gt;
&lt;img src=&quot;http://knielsen-hq.org/maria/fix-group-commit-1.png&quot; alt=&quot;Benchmark results&quot;&gt;
&lt;/center&gt;
&lt;p&gt;
The rising blue and yellow lines show transactions per second when group
commit is working, showing greatly improved throughput as the parallelism
(number of concurrently running transactions) increases. The flat red and
green lines show transactions per second with no group commit, with no scaling
at all as parallelism increases. As can be seen, the effect of group commit on
performance can be dramatic, improving throughput by an order of magnitude or
more. More details of this benchmark below, but first some background
information.
&lt;/p&gt;

&lt;h2&gt;Durability and group commit&lt;/h2&gt;

&lt;p&gt;
In a fully transactional system, it is traditionally expected that when a
transaction is committed successfully, it becomes &lt;em&gt;durable&lt;/em&gt;. The
term &lt;em&gt;durable&lt;/em&gt; is the &quot;D&quot; in ACID, and means that even if the system
crashes the very moment after commit succeeded (power failure, kernel panic,
server software crash, or whatever), the transaction will remain committed
after the system is restarted, and crash recovery has been performed.
&lt;/p&gt;

&lt;p&gt;
The usual way to ensure durability is by writing, to a transactional log file,
sufficient information to fully recover the transaction in case of a crash,
and then use the &lt;code&gt;fsync()&lt;/code&gt; system call to force the data to be
physically written to the disk drive before returning successfully from the
commit operation. This way, in case crash recovery becomes necessary, we know
that the needed information will be available. There are other methods
than &lt;code&gt;fsync()&lt;/code&gt;, including calling the &lt;code&gt;fdatasync()&lt;/code&gt;
system call or using the &lt;code&gt;O_DIRECT&lt;/code&gt; flag when opening the log file,
but for simplicity we will use &lt;code&gt;fsync()&lt;/code&gt; to refer to any method for
forcing data to physical disk.
&lt;/p&gt;

&lt;p&gt;
&lt;code&gt;fsync()&lt;/code&gt; is an expensive operation. A good traditional hard disk
drive (HDD) will do around 150 fsyncs per second (for 10k rpm
drives). A good solid state disk like the Intel X25-M will do around 1200
fsyncs per second. It is possible to use RAID controllers with a battery
backed up cache (which will keep data in cache memory during a power failure
and physically write it to disk when the power returns); this will reduce the
overhead of &lt;code&gt;fsync()&lt;/code&gt;, but not eliminate it completely.
&lt;/p&gt;

&lt;p&gt;
(There are other ways than &lt;code&gt;fsync()&lt;/code&gt; to ensure durability. For
example in a cluster with synchronous replication
(like &lt;a href=&quot;http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster.html&quot; rel=&quot;nofollow&quot;&gt;NDB&lt;/a&gt;
or &lt;a href=&quot;http://www.codership.com/&quot; rel=&quot;nofollow&quot;&gt;Galera&lt;/a&gt;),
durability can be achieved by making sure the transaction is replicated fully
to multiple nodes, on the assumption that a system failure will not take out
all nodes at once. Whatever method is used, ensuring durability is usually
significantly more expensive than merely committing a transaction to local
memory.)
&lt;/p&gt;

&lt;p&gt;
So the naive approach, which does &lt;code&gt;fsync()&lt;/code&gt; after every commit,
will limit throughput to around 150 transactions per second (on a standard
HDD). But with group commit we can do much better. If we have several
transactions running concurrently, all waiting to fsync their data in the
commit step, we can use a single &lt;code&gt;fsync()&lt;/code&gt; call to flush them all
to physical storage in one go. The cost of &lt;code&gt;fsync()&lt;/code&gt; is often not
much higher for multiple transactions than for a single one, so as the above
graph shows, this simple optimisation greatly reduces the overhead
of &lt;code&gt;fsync()&lt;/code&gt; for parallel workloads.
&lt;/p&gt;
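&lt;p&gt;
As a back-of-the-envelope model of this, assume every &lt;code&gt;fsync()&lt;/code&gt;
costs the same no matter how many transactions it flushes, and that nothing
else limits throughput. Both are idealisations, and the function and parameter
names below are mine.
&lt;/p&gt;

```cpp
#include <cassert>

// Idealised model: constant fsync() cost, no other bottleneck.
//   fsyncs_per_second      - what the I/O system can sustain
//   fsyncs_per_commit      - how many fsync() calls one commit needs
//                            (1 in the simple case described here)
//   transactions_per_group - how many commits share each fsync()
double commits_per_second(double fsyncs_per_second,
                          int fsyncs_per_commit,
                          int transactions_per_group) {
    assert(fsyncs_per_commit > 0 && transactions_per_group > 0);
    return fsyncs_per_second * transactions_per_group / fsyncs_per_commit;
}
```

&lt;p&gt;
With the HDD figure of 150 fsyncs per second, the model gives 150 commits per
second without group commit, and 2400 if 16 transactions share each sync.
&lt;/p&gt;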

&lt;h2&gt;Group commit in MySQL and MariaDB&lt;/h2&gt;

&lt;p&gt;
MySQL (and MariaDB) has full support for ACID when using the popular InnoDB
(XtraDB in MariaDB) storage engine (and there are other storage engines with
ACID support as well). For InnoDB/XtraDB, durability is enabled by
setting &lt;code&gt;innodb_flush_log_at_trx_commit=1&lt;/code&gt;.
&lt;/p&gt;

&lt;p&gt;
Durability is needed when there is a requirement that committed
transactions &lt;em&gt;must&lt;/em&gt; survive a crash. But this is not the only reason
for enabling durability. Another reason is when the binary log is enabled, for
using a server as a replication master.
&lt;/p&gt;

&lt;p&gt;
When the binary log is used for replication, it is important that the content
of the binary log on the master exactly match the changes done in the storage
engine(s). Otherwise the slaves will replicate to different data than what is
on the master, causing replication to diverge and possibly even break if the
differences are such that a query on the master is unable to run on the slave
without error. If we do not have durability, some number of transactions may
be lost after a crash, and if the transactions lost in the storage engines are
not the same as the transactions lost in the binary log, we will end up with
an inconsistency. So with the binary log enabled, durability is needed in
MySQL/MariaDB to be able to recover into a consistent state after a crash.
&lt;/p&gt;

&lt;p&gt;
With the binary log enabled, MySQL/MariaDB uses XA/2-phase commit between the
binary log and the storage engine to ensure the needed durability of all
transactions. In XA, committing a transaction is a three-step process:
&lt;ol&gt;
  &lt;li&gt; First, a &lt;em&gt;prepare&lt;/em&gt; step, in which the transaction is made durable in
    the engine(s). After this step, the transaction can still be rolled
    back; also, in case of a crash after the prepare phase, the transaction
    can be recovered.&lt;/li&gt;
  &lt;li&gt; If the prepare step succeeds, the transaction is made durable in the
    binary log.&lt;/li&gt;
  &lt;li&gt; Finally, the &lt;em&gt;commit&lt;/em&gt; step is run in the engine(s) to make the
    transaction actually committed (after this step the transaction can no
    longer be rolled back).&lt;/li&gt;
&lt;/ol&gt;
The idea is that when the system comes back up after a crash, crash recovery
will go through the binary log. Any prepared (but not committed) transactions
that are found in the binary log will be committed in the storage
engine(s). Other prepared transactions will be rolled back. The result is
guaranteed consistency between the engines and the binary log.
&lt;/p&gt;
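&lt;p&gt;
The recovery rule itself is small enough to write down directly. In this
sketch the transaction ids are plain integers and the two input sets are
assumed to have been collected already; &lt;code&gt;RecoveryPlan&lt;/code&gt; and
&lt;code&gt;plan_xa_recovery()&lt;/code&gt; are hypothetical names.
&lt;/p&gt;

```cpp
#include <cassert>
#include <set>
#include <vector>

struct RecoveryPlan {
    std::vector<unsigned long> to_commit;
    std::vector<unsigned long> to_roll_back;
};

// prepared_in_engine: ids the engine reports as prepared but not committed.
// xids_in_binlog:     ids found when scanning the binary log.
RecoveryPlan plan_xa_recovery(const std::set<unsigned long> &prepared_in_engine,
                              const std::set<unsigned long> &xids_in_binlog) {
    RecoveryPlan plan;
    for (unsigned long xid : prepared_in_engine) {
        if (xids_in_binlog.count(xid) != 0)
            plan.to_commit.push_back(xid);     // made it into the binlog
        else
            plan.to_roll_back.push_back(xid);  // never reached the binlog
    }
    // Every prepared transaction ends up in exactly one of the two lists.
    assert(plan.to_commit.size() + plan.to_roll_back.size() ==
           prepared_in_engine.size());
    return plan;
}
```

&lt;p&gt;
For example, with transactions 7, 8 and 9 prepared in the engine and 5, 6, 7
and 8 present in the binary log, 7 and 8 are committed and 9 is rolled back.
&lt;/p&gt;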

&lt;p&gt;
Now, &lt;em&gt;each&lt;/em&gt; of the three above steps requires an &lt;code&gt;fsync()&lt;/code&gt;
to work, making a commit three times as costly in this respect compared to a
commit with the binary log disabled. This makes it all the more important to
use the group commit optimisation to mitigate the overhead from
&lt;code&gt;fsync()&lt;/code&gt;.
&lt;/p&gt;

&lt;p&gt;
But unfortunately, group commit does not work in MySQL/MariaDB when the binary
log is enabled! This is the
infamous &lt;a href=&quot;http://bugs.mysql.com/bug.php?id=13669&quot; rel=&quot;nofollow&quot;&gt;Bug#13669&lt;/a&gt;,
reported by Peter Zaitsev back in 2005.
&lt;/p&gt;

&lt;p&gt;
So this is what we see in the graph and benchmark shown at the start. This is
a benchmark running a lot of very simple transactions
(a single &lt;code&gt;REPLACE&lt;/code&gt; statement on a smallish XtraDB table)
against a server with and without the binary log enabled. This kind of
benchmark is bottlenecked on the fsync throughput of the I/O system when
durability is enabled.
&lt;/p&gt;

&lt;p&gt;
The benchmark is done against two different servers. One has a pair of Western
Digital 10k rpm HDDs (with binary log and XtraDB log on different drives). The
other has a single Intel X25-M SSD. The servers are both running MariaDB
5.1.44, and are configured with durable commits in XtraDB, and with drive
cache turned off (drives like to lie about fsync to look better in casual
benchmarks).
&lt;/p&gt;

&lt;p&gt;
The graph shows throughput in transactions per second for different numbers of
threads running in parallel. For each server, there is a line for results with
the binary log disabled, and one with the binary log enabled.
&lt;/p&gt;

&lt;p&gt;
We see that with one thread, there is some overhead in enabling the binary
log, as is to be expected given that three calls to &lt;code&gt;fsync()&lt;/code&gt; are
required instead of just one.
&lt;/p&gt;

&lt;p&gt;
But much worse, we also see that group commit does not work at all when the
binary log is enabled. While the lines with binary log disabled show excellent
scaling as the parallelism increases, the lines for binary log enabled are
completely flat. With group commit non-functional, the overhead of enabling
the binary log is enormous at higher parallelism (at 64 threads on HDD it is
actually two orders of magnitude worse with binary log enabled).
&lt;/p&gt;

&lt;p&gt;
So this concludes the first part of the series. We have seen that if we can
get group commit to work when the binary log is enabled, we can expect a huge
gain in performance on workloads that are bottlenecked on the fsync throughput
of the available I/O system.
&lt;/p&gt;

&lt;p&gt;
The second part will go into detail about why the code for group commit does
not work when the binary log is enabled. The third (and final) part will
discuss some ideas about how to fix the code with respect to group commit and
the binary log.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/12254.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <category>performance</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>12</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/11783.html</guid>
  <pubDate>Fri, 23 Apr 2010 08:38:06 GMT</pubDate>
  <title>Debugging memory leaks in plugins with Valgrind</title>
  <link>http://kristiannielsen.livejournal.com/11783.html</link>
  <description>&lt;p&gt;
I had an interesting IRC discussion the other day with Monty Taylor about what
turned out to be a limitation in &lt;a href=&quot;http://valgrind.org/&quot; rel=&quot;nofollow&quot;&gt;Valgrind&lt;/a&gt;
with respect to debugging memory leaks in dynamically loaded plugins.
&lt;/p&gt;

&lt;p&gt;
Monty Taylor&apos;s original problem was
with &lt;a href=&quot;http://drizzle.org/&quot; rel=&quot;nofollow&quot;&gt;Drizzle&lt;/a&gt;, but as it turns out, it is
common to all of the MySQL-derived code bases. When there is a memory leak
from an allocation in a dynamically loaded plugin, Valgrind will detect the
leak, but the part of the stack trace that is within the plugin shows up as an
unhelpful three question marks &quot;???&quot;:
&lt;pre&gt;
==1287== 400 bytes in 4 blocks are definitely lost in loss record 5 of 8
==1287==    at 0x4C22FAB: malloc (vg_replace_malloc.c:207)
==1287==    by 0x126A2186: ???
==1287==    by 0x7C8E01: ha_initialize_handlerton(st_plugin_int*) (handler.cc:429)
==1287==    by 0x88ADD6: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1033)
&lt;/pre&gt;
Which tells you little more than that there is a leak in one of your plugins.
&lt;/p&gt;

&lt;p&gt;
After trying a couple of things, we found that this is a known limitation in
Valgrind in relation to code that is loaded with &lt;code&gt;dlopen()&lt;/code&gt; and
later unloaded with &lt;code&gt;dlclose()&lt;/code&gt;:
&lt;/p&gt;
&lt;center&gt;&lt;a href=&quot;http://bugs.kde.org/show_bug.cgi?id=79362&quot; rel=&quot;nofollow&quot;&gt;http://bugs.kde.org/show_bug.cgi?id=79362&lt;/a&gt;&lt;/center&gt;
&lt;p&gt;
The basic problem is that Valgrind records the location of
the &lt;code&gt;malloc()&lt;/code&gt; call as just a memory address. And when the memory
leak check is performed after the end of program execution, the plugin has
been unloaded with &lt;code&gt;dlclose()&lt;/code&gt;, and the recorded memory address is
therefore no longer valid.
&lt;/p&gt;

&lt;p&gt;
The problem is specific to memory leak checks, which are done only after the
code has been unloaded. Other checks (like use of uninitialised values and
use-after-free) work fine with full information in the stack traces, as such
checks are done while the plugin code is still loaded into memory. But the
memory leak checks are arguably among the most useful checks Valgrind does, as
Valgrind is often the only way to find and fix critical memory leaks
efficiently.
&lt;/p&gt;

&lt;p&gt;
Fortunately, once the issue was understood, we had an easy work-around:
disable the &lt;code&gt;dlclose()&lt;/code&gt; call in the server plugin code, and the
leak is then detected with full information in the stack trace. Unfortunately
this introduces a leak of its own, since now the memory allocated
in &lt;code&gt;dlopen()&lt;/code&gt; is never freed, so we get another spurious Valgrind
memory leak warning.
&lt;/p&gt;

&lt;p&gt;Another possible way to get the same effect is to pass
the &lt;code&gt;RTLD_NODELETE&lt;/code&gt; flag to &lt;code&gt;dlopen()&lt;/code&gt;,
though I have not tried this yet.
&lt;/p&gt;
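
&lt;p&gt;
As a rough illustration only, here is a minimal sketch of what passing that
flag to &lt;code&gt;dlopen()&lt;/code&gt; looks like. It is written in Python
with &lt;code&gt;ctypes&lt;/code&gt; rather than in the server&apos;s C++ plugin code, and
it assumes Linux with glibc. The helper name &lt;code&gt;dlopen_nodelete&lt;/code&gt; is
hypothetical, and the C math library is only a stand-in for a real plugin;
this has not been tried against the server under Valgrind.
&lt;/p&gt;
&lt;pre&gt;
```python
import ctypes
import ctypes.util
import os


def dlopen_nodelete(path):
    # Hypothetical helper: dlopen() a shared object with RTLD_NODELETE.
    # RTLD_NOW resolves all symbols at load time; RTLD_NODELETE asks the
    # dynamic loader to keep the object mapped even after dlclose().
    flags = os.RTLD_NOW | os.RTLD_NODELETE
    return ctypes.CDLL(path, mode=flags)


# Stand-in for a plugin .so file (assumption): the C math library.
STAND_IN_LIBRARY = ctypes.util.find_library("m")

lib = dlopen_nodelete(STAND_IN_LIBRARY)
lib.cos.restype = ctypes.c_double
lib.cos.argtypes = [ctypes.c_double]
print(lib.cos(0.0))
```
&lt;/pre&gt;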

&lt;p&gt;
A possibly better work-around (which I also did not try yet) is one suggested
in the above referenced Valgrind feature request. By adding the offending
plugin(s) as &lt;code&gt;LD_PRELOAD&lt;/code&gt; when starting the server, the plugin code
will not actually be unloaded in &lt;code&gt;dlclose()&lt;/code&gt;, so stack traces
should be available without any spurious leak warnings from Valgrind. However,
this will not work well if some of the dynamic plugins need a particular load
order (according to the suggestion in the feature request). I also need to
check if this actually works for plugins (like storage engines) that have link
dependencies on symbols in the main program. But it might be a good option if
it can be made to work.
&lt;/p&gt;

&lt;p&gt;
(At first I was surprised to learn that this was a problem in MySQL and
MariaDB, as I never saw it before. But I suppose the reason is that we so far
have built most plugins as built-in, rather than as dynamically
loaded &lt;code&gt;.so&lt;/code&gt; files. The problem is likely to occur more frequently
as we are moving to do more and more with plugins in MariaDB, so it is nice to
know a work-around. Thanks, Monty!)
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/11783.html</comments>
  <category>developmentprocess</category>
  <category>freesoftware</category>
  <category>drizzle</category>
  <category>mariadb</category>
  <category>valgrind</category>
  <category>mysql</category>
  <category>debugging</category>
  <category>programming</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/11602.html</guid>
  <pubDate>Mon, 29 Mar 2010 09:03:56 GMT</pubDate>
  <title>MariaDB talk at the OpenSourceDays 2010 conference</title>
  <link>http://kristiannielsen.livejournal.com/11602.html</link>
  <description>&lt;p&gt;
Earlier this month, I was at
the &lt;a href=&quot;http://www.opensourcedays.org/2010/&quot; rel=&quot;nofollow&quot;&gt;OpenSourceDays 2010&lt;/a&gt;
conference, giving a talk on MariaDB
(the &lt;a href=&quot;http://askmonty.org/w/images/a/a4/Osd2010.pdf&quot; rel=&quot;nofollow&quot;&gt;slides from the talk&lt;/a&gt; are
available).
&lt;/p&gt;

&lt;p&gt;
The talk went quite well, I think (though I probably talked way too fast, as I usually
do; at least that means that I finished on time with plenty of room for
questions).
&lt;/p&gt;

&lt;p&gt;
There was quite a bit of interest after the talk from many of the
people who heard it.
It was even
&lt;a href=&quot;http://www.version2.dk/artikel/14131-mysqls-lillesoester-loesner-grebet-fra-sun-og-oracle&quot; rel=&quot;nofollow&quot;&gt;reported on
  by the Danish IT media version2.dk&lt;/a&gt; (article in Danish).
&lt;/p&gt;

&lt;p&gt;
Especially interesting to me was talking with three people
from the Danish site &lt;a href=&quot;http://www.komogvind.dk/&quot; rel=&quot;nofollow&quot;&gt;komogvind.dk&lt;/a&gt;, who told
me fascinating details about their work keeping a busy site running; one of
them even went right home to benchmark against MariaDB. Thanks to you, and to
everyone else for your interest and time!
&lt;/p&gt;

&lt;p&gt;
This time in the talk, I tried to also focus on the community and development
aspects of MariaDB (in addition to the mandatory feature list and benchmark
graphs, of course). To me, the most important thing about MariaDB is that we
now have the infrastructure and community for people outside of MySQL to do full-scale
development at the same level as inside MySQL. This was missing before. It is
a much less concrete thing than features and benchmarks, so I found it much
harder to present in a good way, without it turning into nothing but buzzwords. But from the
feedback I got afterwards, it seems I succeeded pretty well with this part
also, which I am especially happy about!
&lt;/p&gt;

&lt;p&gt;
The talk was recorded on video by the organisers. The latest I heard was that
the video footage is still being edited though (I was kind of waiting, hoping
to be able to include a link to the video in this post). But if they do manage
to finish the editing and make the videos available later, I will post an update.
&lt;/p&gt;

&lt;p&gt;
A big thanks to the organisers of the OpenSourceDays 2010 conference! I had
a great time, and hope to be back again next spring for OpenSourceDays 2011.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/11602.html</comments>
  <category>conference</category>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <lj:security>public</lj:security>
  <lj:reply-count>0</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/11316.html</guid>
  <pubDate>Mon, 15 Feb 2010 11:44:58 GMT</pubDate>
  <title>Conference time!</title>
  <link>http://kristiannielsen.livejournal.com/11316.html</link>
  <description>&lt;p&gt;
It is conference time for me. I just came home
from &lt;a href=&quot;http://fosdem.org/2010/&quot; rel=&quot;nofollow&quot;&gt;FOSDEM 2010&lt;/a&gt; where we had a booth
and I gave a talk. At the end of the month there will be a company meeting in
Iceland for Monty Program, followed
by &lt;a href=&quot;http://www.opensourcedays.org/2010/&quot; rel=&quot;nofollow&quot;&gt;Open Source Days 2010&lt;/a&gt;
where I will also be speaking. And then in April there is
the &lt;a href=&quot;http://en.oreilly.com/mysql2010/&quot; rel=&quot;nofollow&quot;&gt;MySQL User Conference&lt;/a&gt;.
With two additional talks given at local user groups at the end of last year, I think
I&apos;ve about filled my quota for now. I feel quite fortunate that it turned out
that I will not also be presenting at the UC! (I do not have a natural talent
for speaking, and tend to need to spend quite a lot of time in preparations.)
&lt;/p&gt;

&lt;p&gt;
&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;http://knielsen-hq.org/blog/img/fosdem-booth.jpeg&quot; alt=&quot;MariaDB/PBXT booth at FOSDEM&quot;&gt;
&lt;/div&gt;
Having a booth at FOSDEM turned out really well, I think, as I got to talk to a
lot of different people who passed by the booth. I also had a very nice
dinner with the PostgreSQL people, where I learned a lot about the internals of
that database, as well as dinners with people from the MySQL world, also with
lots of interesting discussions.
&lt;/p&gt;

&lt;p&gt;
Thanks to all of the people that I met at FOSDEM. It was fun and inspiring to
meet you, and I am looking forward to the next time!
&lt;/p&gt;

&lt;p&gt;
One thing strikes me as I am piecing together the mandatory &quot;MariaDB feature
list&quot; for my next talk. It seems people tend to focus a lot on the extra
features MariaDB has over MySQL. But there is another aspect that I think is
just as important: MariaDB is creating an open framework where community
developers can work together on new development. This is something that has
been missing in the past.
&lt;/p&gt;

&lt;p&gt;
It is easy to focus on a concrete list of features, whereas the idea of an
abstract framework is much harder to present as more than buzzword talk. But I
will try to get it into my next talk, as I think ultimately both are of
equal importance.
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/11316.html</comments>
  <category>developmentprocess</category>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <lj:security>public</lj:security>
  <lj:reply-count>2</lj:reply-count>
</item>
<item>
  <guid isPermaLink='true'>http://kristiannielsen.livejournal.com/11132.html</guid>
  <pubDate>Wed, 20 Jan 2010 10:50:14 GMT</pubDate>
  <title>Why I work on Free Software</title>
  <link>http://kristiannielsen.livejournal.com/11132.html</link>
  <description>&lt;p&gt;
I happened upon this &lt;a href=&quot;http://www.linuxjournal.com/article/6214&quot; rel=&quot;nofollow&quot;&gt;old
LinuxJournal&lt;/a&gt; article about how the University of Zululand in South Africa
used &lt;a href=&quot;http://www.mysql.com/&quot; rel=&quot;nofollow&quot;&gt;MySQL&lt;/a&gt; and other Free Software to make
do with a 128 kbit (and later 768 kbit) internet connection for their staff
and students.
&lt;/p&gt;

&lt;p&gt;
This made me remember the trip I made to another African
country, &lt;a href=&quot;https://secure.wikimedia.org/wikipedia/en/wiki/Burkina_Faso&quot; rel=&quot;nofollow&quot;&gt;Burkina
Faso&lt;/a&gt;, 15 years ago:
&lt;div&gt;
&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
  &lt;a href=&quot;https://knielsen-hq.org/blog/img/burkina1.jpeg&quot; rel=&quot;nofollow&quot;&gt;
    &lt;img src=&quot;https://knielsen-hq.org/blog/img/burkina1-small.jpeg&quot;&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;
  &lt;a href=&quot;https://knielsen-hq.org/blog/img/burkina2.jpeg&quot; rel=&quot;nofollow&quot;&gt;
    &lt;img src=&quot;https://knielsen-hq.org/blog/img/burkina2-small.jpeg&quot;&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
With the huge amount of work and numerous difficult obstacles facing my work
on the &lt;a href=&quot;http://mariadb.com/&quot; rel=&quot;nofollow&quot;&gt;MariaDB project&lt;/a&gt;, it can be hard to
keep up motivation at times. It helps to remember why I am doing this.
&lt;/p&gt;

&lt;p&gt;
It must be about the time when some of these kids should go to University or
start up new projects. Maybe some of them will work with software. I want them
to be able to invest in local skills and infrastructure that they need, rather
than in software licenses funding nice houses and boats in other countries:
&lt;div&gt;
&lt;table&gt;&lt;tr&gt;
&lt;td&gt;
  &lt;a href=&quot;https://knielsen-hq.org/blog/img/Bill_gates_house.jpg&quot; rel=&quot;nofollow&quot;&gt;
    &lt;img src=&quot;https://knielsen-hq.org/blog/img/Bill_gates_house-small.jpg&quot;&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;
  &lt;a href=&quot;https://knielsen-hq.org/blog/img/Rising_Sun_Yacht.JPG&quot; rel=&quot;nofollow&quot;&gt;
    &lt;img src=&quot;https://knielsen-hq.org/blog/img/Rising_Sun_Yacht-small.JPG&quot;&gt;&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/p&gt;

&lt;p&gt;
(House and yacht pictures courtesy of &lt;a href=&quot;http://www.wikipedia.org/&quot; rel=&quot;nofollow&quot;&gt;Wikipedia&lt;/a&gt; under
the &lt;a href=&quot;http://creativecommons.org/licenses/by-sa/3.0/&quot; rel=&quot;nofollow&quot;&gt;Creative Commons Attribution ShareAlike 3.0&lt;/a&gt; license)
&lt;/p&gt;</description>
  <comments>http://kristiannielsen.livejournal.com/11132.html</comments>
  <category>freesoftware</category>
  <category>mariadb</category>
  <category>mysql</category>
  <lj:security>public</lj:security>
  <lj:reply-count>1</lj:reply-count>
</item>
</channel>
</rss>
