June 11th, 2012
02:48 pm

Even faster group commit!

I found time to continue my previous work on group commit for the binary log in MariaDB.

In the current code, a (group) commit to InnoDB does no less than three fsync() calls:

  1. Once during InnoDB prepare, to make sure we can recover the transaction in InnoDB if we crash after writing it to the binlog.
  2. Once after binlog write, to make sure we have the transaction in the binlog before we irrevocably commit it in InnoDB.
  3. Once during InnoDB commit, to make sure we no longer need to scan the binlog after a crash to recover the transaction.
Of course, in point 3, it really is not necessary to do an fsync() after every (group) commit. In fact, it seems hardly necessary to do such an fsync() at all! If we should crash before the commit record hits the disk, we can always recover the transaction by scanning the binlogs and checking which of the transactions left in InnoDB prepared state should be committed. Of course, we do not want to keep and scan years' worth of binlogs, but we need only fsync() every so often, not after every commit.
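
To make the three sync points concrete, here is a rough C-style sketch of the commit path described in the list above. The function names are invented for illustration; this is not the actual server code:

    /* Sketch only: invented names, not the real MariaDB/InnoDB functions. */
    typedef struct Trx Trx;
    void engine_prepare(Trx *trx);      /* write prepare record to engine log */
    void engine_commit(Trx *trx);       /* write commit record to engine log  */
    void binlog_write(Trx *trx);
    void sync_engine_log(void);
    void sync_binlog_file(void);

    void commit_transaction(Trx *trx)
    {
      engine_prepare(trx);
      sync_engine_log();     /* fsync #1: transaction is recoverable in InnoDB */

      binlog_write(trx);
      sync_binlog_file();    /* fsync #2: transaction is durable in the binlog */

      engine_commit(trx);
      sync_engine_log();     /* fsync #3: only needed so that crash recovery
                                need not scan the binlog; this is the one that
                                MDEV-232 defers to binlog rotation */
    }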

So I implemented MDEV-232. This removes the fsync() call in the commit step in InnoDB. Instead, the binlog code asks InnoDB (and any other transactional storage engine) to flush all pending commits to disk when the binlog is rotated. When InnoDB is done with the flush, it reports back to the binlog code. We keep track of how far InnoDB has flushed by writing so-called checkpoint events into the current binlog. After a crash, we first scan the latest binlog. The last checkpoint event found will tell us whether we need to scan any older binlogs to be sure to find all commits that were not durably committed inside InnoDB prior to the crash.
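
Crash recovery under this scheme then has to do roughly the following. This is a simplified sketch with invented types and helpers, not the real recovery code:

    /* Sketch only: invented types and helpers, not the real recovery code. */
    #include <string.h>

    typedef unsigned long long Xid;
    typedef struct XidSet XidSet;                 /* set of binlogged XIDs */

    int  xidset_contains(const XidSet *s, Xid xid);
    const char *oldest_file_from_last_checkpoint(const char *latest_binlog);
    const char *next_binlog(const char *file);
    void scan_binlog_for_xids(const char *file, XidSet *xids);
    int  engine_recover_prepared(Xid *out, int max); /* XIDs still prepared */
    void engine_commit_by_xid(Xid xid);
    void engine_rollback_by_xid(Xid xid);

    void binlog_crash_recovery(const char *latest_binlog, XidSet *xids)
    {
      /* The last checkpoint event in the newest binlog names the oldest
         binlog that can still contain commits not yet durable in the engine. */
      const char *file = oldest_file_from_last_checkpoint(latest_binlog);
      for (;;)
      {
        scan_binlog_for_xids(file, xids);         /* remember binlogged XIDs */
        if (strcmp(file, latest_binlog) == 0)
          break;
        file = next_binlog(file);
      }

      Xid prepared[1024];
      int n = engine_recover_prepared(prepared, 1024);
      for (int i = 0; i < n; i++)
      {
        if (xidset_contains(xids, prepared[i]))
          engine_commit_by_xid(prepared[i]);      /* it reached the binlog: commit */
        else
          engine_rollback_by_xid(prepared[i]);    /* never binlogged: roll back */
      }
    }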

The result is that we only need to do two fsync() calls per (group) commit instead of three.

I benchmarked the code on a server with a good disk system - an HP RAID controller with a battery-backed disk cache. When the cache is enabled, fsync() is fast, around 400 microseconds. When the cache is disabled, it is slow, several milliseconds. The setup should be mostly comparable to Mark Callaghan's benchmarks here and here.

I used sysbench update_non_index.lua to make it easier for others to understand/reproduce the test. This does a single update of one row in a table in each transaction. I used 100,000 rows in the table. Group commit is now so fast that at higher concurrencies, it is no longer the bottleneck. It will be interesting to test again with the new InnoDB code from MySQL 5.6 and any other scalability improvements that have been made there.

Slow fsync()

As can be seen, we have a very substantial improvement, around 30-60% more commits per second depending on concurrency. Not only are we saving one out of three expensive fsync() calls, improvements to the locking done during commit also allow more commits to share the same fsync().

Fast fsync()

Even with fast fsync(), the improvements are substantial.

I am fairly pleased with these results. There is still substantial overhead from enabling the binlog (a several-times slowdown if fsync() time is the bottleneck), and I have a design for mostly solving this in MWL#164. But I think perhaps it is now time to turn towards other, more important areas. In particular I would like to turn to MWL#184 - another method for parallel apply of events on slaves that can help in cases where the per-database split of workload that exists in Tungsten and MySQL 5.6 cannot be used, such as many updates to a single table. Improving throughput even further on the master may not be the most important thing if slaves are already struggling to keep up with current throughput, and this is another relatively simple spin-off from group commit that could greatly help.

For anyone interested, the current code is pushed to lp:~maria-captains/maria/5.5-mdev232

MySQL group commit

It was an interesting coincidence that the new MySQL group commit preview was published just as I was finishing this work. So I had the chance to take a quick look and include it in the benchmarks (with slow fsync()):

While the implementation in the MySQL 5.6 preview is completely different from MariaDB's (talk about "not invented here ..."), the basic design is now quite similar, as far as I could gather from the code. A single thread writes all transactions in the group into the binlog, in order; likewise a single thread does the commits (to memory) inside InnoDB, in order. The storage engine interface is extended with a thd_get_durability_property() callback for the engines - when the server returns HA_IGNORE_DURABILITY from this, the InnoDB commit() method is changed to work exactly like MariaDB's commit_ordered(): commit to memory but do not sync to disk.
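
From my reading of the preview code, the control flow amounts to something like the sketch below. This is only to illustrate the idea; the real signatures and types differ:

    /* Sketch of the idea only; the actual MySQL 5.6/InnoDB signatures differ. */
    enum durability_properties { HA_REGULAR_DURABILITY, HA_IGNORE_DURABILITY };

    enum durability_properties thd_get_durability_property(const void *thd);
    void trx_commit_in_memory(void *trx);    /* make the commit visible */
    void flush_engine_log_to_disk(void);     /* fsync the engine redo log */

    void innodb_style_commit(const void *thd, void *trx)
    {
      trx_commit_in_memory(trx);
      /* The server returns HA_IGNORE_DURABILITY while the binlog is handling
         durability (and will request a flush at binlog rotation), so the
         engine skips its own sync - the same behaviour as commit_ordered(). */
      if (thd_get_durability_property(thd) != HA_IGNORE_DURABILITY)
        flush_engine_log_to_disk();
    }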

(It remains to be seen what storage engine developers will think of MySQL implementing a different API for the same functionality ...)

The new MySQL group commit also removes the third fsync() in the InnoDB commit, same as the new MariaDB code. To ensure they can still recover after a crash, they just call into the storage engines to sync all commits to disk during binlog rotation. I actually like that for its simplicity - even if it does stall commits for longer, it is unlikely to matter in practice. What actually happens inside InnoDB in the two implementations is identical.

The new MySQL group commit is substantially slower than the new MariaDB group commit in this benchmark. My guess is that this is in part due to suboptimal inter-thread communication. As I wrote about earlier, this is crucial to get best performance at high commit rates, and the MySQL code seems to do additional synchronisation between what they call stages - binlog write, binlog fsync(), and storage engine commit. Since the designs are now basically identical, it should not be hard to get this fixed to perform the same as MariaDB. (Of course, if they had started from my work, they could have spent the effort improving that even more, rather than wasting it on catch-up).

Note that the speedup from group commit (any version of it) is highly dependent on the workload and the speed of the disk system. With fast transactions, slow fsync(), and high concurrency, the speedup will be huge. With long transactions, fast fsync(), and low concurrency, the speedups will be modest, if any.

Incidentally, the new MySQL group commit is a change from the designs described earlier, where individual commit threads would use pwrite() in parallel into the binary log. I am convinced this is a good change. Writing to the binlog is just memcpy() between buffers; a single thread can do gigabytes worth of that, so it is not where the bottleneck is. It is crucial to optimise the inter-thread communication, as I found out here, and lots of small parallel pwrite() calls into the same few data blocks at the end of a file are not likely to be a success for the file system. If binlog write bandwidth really does turn out to be a problem, the solution is to have multiple logs in parallel - but I think we are quite far from being there yet.

It is a pity that we cannot work together in the MySQL world. I approached the MySQL developers several times over the past few years suggesting we work together, with no success. There are trivial bugs in the MySQL group commit preview whose fixes yield great speedups. I could certainly have used more input while doing my implementation. The MySQL user community could have much better quality software if only we would work together.

Instead, Oracle engineers use their own bugtracker which is not accessible to others, push to their own development trees which are not accessible to others, communicate on their own mailing lists which are not accessible to others, hold their own developer meetings which are not accessible to others ... the list is endless.

The most important task when MySQL was acquired was to collect the different development groups working on the code base and create a real, great, collaborative Open Source project. Oracle has totally botched this task. Instead, what we have is lots of groups each working on their own tree, with no real interest in collaborating. I am amazed every time I read some prominent MySQL community member praise the Oracle stewardship of MySQL. If these people are not interested in creating a healthy Open Source project and just want to not pay for their database software, why do they not go use the express/cost-free editions of SQL Server or Oracle or whatever?

It is kind of sad, really.

June 22nd, 2011
04:37 pm

Tale of a bug

This is a tale of the bug lp:798213. The bug report has the initial report, and a summary of the real problem obtained after detailed analysis, but it does not describe the process of getting from the former to the latter. I thought it would be interesting to document this, as the analysis of this bug was rather tricky and contains several good lessons.

Background

The bug first manifested itself as a sporadic failure in one of our random query generator tests for replication. We run this test after all MariaDB pushes in our Buildbot setup. However, this failure had only occurred twice in several months, so it is clearly a very rare failure.

The first task was to try to repeat the problem and get some more data in the form of binlog files and so on. Philip kindly helped with this, and after running the test repeatedly for several hours he finally managed to obtain a failure and attach the available information to the initial bug report. Time for analysis!

Understanding the failure

The first step is to understand what the test is doing, and what the failure means.

The test starts up a master server and exposes it to some random parallel write load. Half-way through, with the server running at full speed, it takes a non-blocking XtraBackup backup, and restores the backup into a new slave server. Finally it starts the new slave replicating from the binlog position reported by XtraBackup, and when the generated load is done and the slave caught up, it compares the master and slave to check that they are consistent. This test is an important check of my group commit work, which is carefully engineered to provide group commit while still preserving the commit order and consistent binlog position that is needed by XtraBackup to do such non-blocking provisioning of new slaves.

The failure is that in a failed run, the master and slave are different when compared at the end. The slave has a couple of extra rows (later I discovered the bug could also manifest itself as a single row being different). So this is obviously not good, and needs to be investigated.

Analysing the failure

So this is a typical case of a "hard" failure to debug. We have binlogs with 100k queries or so, and a slave that somewhere in those 100k queries diverges from the master. Working on problems like this, it is important to work methodically, slowly but surely narrowing down the problem, coming up with hypotheses about the behaviour and positively confirming or rejecting them, until finally the problem is narrowed down sufficiently that the real cause is apparent. Random poking around is not only likely to waste time; far worse, without a real understanding of the root cause of the failure, there is a great danger of eventually tweaking things so that the failure happens to go away in the test at hand, while the underlying bug is still there. After all, the failure was already highly sporadic to begin with.

First I wanted to know if the problem is that replication diverges (e.g. because of non-deterministic queries with statement-based replication), or if it is a problem with the restored backup used to start the slave (wrong data or wrong starting binlog position). Clearly, I strongly suspected a wrong starting binlog position, as this is what my group commit work messes with. But as it turns out, this was not the problem, again stressing the need to always positively verify any assumptions made during debugging.

To check this, I setup a new slave server from scratch, and had it replicate from the master binlog all the way from the start to the end. I then compared all three end results: (A) the original master; (B) the slave provisioned by XtraBackup, and (C) the new slave replicated from the start of the binlogs. It turns out that (A) and (C) are identical, while (B) differs. So this strongly suggests a problem with the restored XtraBackup; the binlogs by themselves replicate without problems.

To go further, I needed to analyse the state of the slave server just after the XtraBackup backup had been restored, without the effect of the thousands of queries replicated afterwards. Unfortunately this was not saved as part of the original test. It was trivial to add to the test (just copy away the backup to a safe place before starting the slave server), but then came the need to reproduce the failure again.

This is another important step in debugging hard sporadic failures: get to the point where the failure can be reliably reproduced, at least for some semi-reasonable meaning of "reliable". This is really important not only to help debugging, but also to be able to verify that a proposed bug fix actually fixes the original bug! I have once or twice experienced a failure so elusive that the only way to fix it was to blindly commit a possible fix, then wait several months to see if the failure would re-appear in that interval. Fortunately, in by far most cases, with a bit of work, this is not necessary.

Same here: After a bit of experimentation, I found that I could reliably reproduce the failure by reducing the duration of the test from 5 minutes to 35 seconds, and running the test in a tight loop until it failed. It always failed after typically 15-40 runs.

So now I had the state of the slave provisioned with XtraBackup as it was just before it started replicating. What I did next was to set up another slave server from scratch and let it replicate from the master binlogs using START SLAVE UNTIL with the binlog position reported by XtraBackup. If the XtraBackup backup and its reported binlog start position were correct, these two servers should be identical. But sure enough, a comparison showed that they differed! In this case it was a single row that had different data. So this confirms the hypothesis that the problem is with the restored XtraBackup data and/or binlog position.

So now, thinking it was the binlog position that was off, I naturally next looked into the master binlog around this position, looking for an event just before the position that was not applied, or an event just after that already was applied. However, to my surprise I did not find this. I did find an event just after that updated the table that had the wrong row. However, the data in the update looked nothing like the data that was expected in the wrong row. And besides, that update was part of a transaction updating multiple tables; if that event was duplicated or missing, there would have been more row differences in more tables, not just one row in a single table. I did find an earlier event that looked somewhat related, however it was far back in the binlog (so not resolvable by merely adjusting the starting binlog pos); and besides again it was part of a bigger transaction updating more rows, while I had only one row with wrong data.

So at this point I need a new idea; the original hypothesis has been proven false. The restored XtraBackup is clearly wrong in a single row, but nothing in the binlog explains how this one-row difference could have occurred. When analysis runs up against a dead end, it is time to get more data. So I ran the test for a couple of hours, obtained a handful more failures, and analysed them in the same way. Each time I saw the XtraBackup differing from the master in one row, or in one case the slave had a few extra rows.

So this is strange. After we restore the XtraBackup, we have one row (or a few rows) different from the master server. And those rows were updated in a multi-row transaction. It is as if we are somehow missing part of a transaction, which is obviously quite bad and indicates something going wrong at a deep level.

Again it is time to get more data. Now I try running the test with different server options to see if it makes any difference. Running with --binlog-optimize-thread-scheduling=0 still caused the failure, so it was not related to the thread scheduling optimisation that I implemented. Then I noticed that the test runs with the option --innodb-release-locks-early=1 enabled. On a hunch I tried running without this option, and AHA! Without this option, I was no longer able to repeat the failure, even after 250 runs!

At this point, I start to strongly suspect a bug in the --innodb-release-locks-early feature. But this is still not proven! It could also be that with the option disabled, there is less opportunity for parallelism, hiding the problem which could really be elsewhere. So I still needed to understand exactly what the root cause of the problem is.

Eureka!

At this point, I had sufficient information to start just thinking about the problem, trying to work out ways in which things could go wrong in a way that would produce symptoms like what we see. So I started to think on how --innodb-release-locks-early works and how InnoDB undo and recovery in general function. So I tried a couple of ideas, some did not seem relevant...

...and then, something occured to me. What the --innodb-release-locks-early feature does is to make InnoDB release row locks for a transaction earlier than normal, just after the transaction has been prepared (but before it is written into the binlog and committed inside InnoDB). This is to allow another transaction waiting on the same row locks to proceed as quickly as possible.

Now, this means that with proper timing, it is possible for such a second transaction to also prepare before the first has time to commit. At this point we thus have two transactions in the prepared state, both of which modified the same row. If we were to take an XtraBackup snapshot at this exact moment, upon restore XtraBackup would need to roll back both of those transactions (the situation would be the same if the server crashed at that point and later did crash recovery).

This raises the question of whether such a rollback will work correctly. This certainly is not something that could occur in InnoDB before the patch for --innodb-release-locks-early was implemented, and from my knowledge of the patch, I know it does not explicitly do anything to make this work. Aha! So now we have a new hypothesis: rollback of multiple transactions modifying the same row causes problems.

To test this hypothesis, I used the Debug_Sync facility to create a mysql-test-run test case. This test case runs a few transactions in parallel, all modifying a common row, then starts them committing but pauses the server while some of them are still in the prepared state. At this point it takes an XtraBackup snapshot. I then tried restoring the XtraBackup snapshot to a fresh server. Tada... but unfortunately this did not show any problems; the restore looked correct.

However, while restoring, I noticed that the prepared-but-not-committed transactions seem to be rolled back in reverse order of InnoDB transaction id. So this got me thinking - what would happen if they were rolled back in a different order? Indeed, for multiple transactions modifying a common row, the rollback order is critical! The first transaction to modify the row must be rolled back last, so that the correct before-image is left in the row. If we were to roll back in a different order, we could end up restoring the wrong before-image of the row - which would result in exactly the kind of single-row corruption that we seem to experience in the test failure! So we are getting close, it seems. Since InnoDB seems to roll back transactions in reverse order of transaction ID, and since transaction IDs are presumably allocated in order of transaction start, maybe by starting the transactions in a different order, the failure can be provoked.
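
To make the ordering argument concrete, here is a tiny stand-alone illustration - not InnoDB code, just the before-images modelled as plain ints:

    #include <stdio.h>

    /* Two prepared transactions, A then B, both updated the same row.
       Each undo record holds the before-image the transaction saw. */
    int main(void)
    {
      int row = 1;           /* last committed value       */
      int undo_a = row;      /* A saw 1 and wrote 2        */
      row = 2;
      int undo_b = row;      /* B saw 2 and wrote 3        */
      row = 3;

      /* Correct recovery: the first modifier (A) is rolled back last. */
      int good = row;
      good = undo_b;         /* undo B: back to 2          */
      good = undo_a;         /* undo A: back to 1, correct */

      /* Wrong order: A's before-image is restored first, then B's. */
      int bad = row;
      bad = undo_a;          /* undo A: back to 1          */
      bad = undo_b;          /* undo B: back to 2 - a value written by
                                the rolled-back transaction A!        */

      printf("correct order: %d   wrong order: %d\n", good, bad);
      return 0;
    }

The "wrong order" result leaves the row with a value written by a transaction that was itself rolled back - the same kind of single-row corruption seen in the failing test runs.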

And sure enough, modifying the test case so that the transactions are started in the opposite order causes it to show the failure! During XtraBackup restore, the last transaction to modify the row is rolled back last, so that the row ends up with the value that was there just before the last update. But this value is wrong, as it was written there by a transaction that was itself rolled back. So we have the corruption reproduced with a small, repeatable test case, and the root cause of the problem completely understood. Task solved!

(Later I cleaned up the test case to crash the server and work with crash recovery instead; this is simpler as it does not involve XtraBackup. Though this also involves the XA recovery procedure with the binlog, the root problem is the same and shows the same failure. As to a fix for the bug, that remains to be seen. I wrote some ideas in the bug report, but it appears non-trivial to fix. The --innodb-release-locks-early feature is originally from the Facebook patch; maybe they will fix it, or maybe we will remove the feature from MariaDB 5.3 before GA release. Corrupt data is a pretty serious bug, after all.)

Lessons learned

I think there are some important points to learn from this debugging story:

  1. When working on high-performance system code, some debugging problems are just inherently hard! Things happen in parallel, and we operate with complex algorithms and high quality requirements on the code. But with a methodical approach, even the hard problems can be solved eventually.
  2. It is important not to ignore test failures in the test frameworks (such as Buildbot), no matter how random, sporadic, and/or elusive they may appear at first. True, many of them are false positives or defects in the test framework rather than the server code. But some of them are true bugs, and among them are some of the most serious and yet difficult bugs to track down. The debugging in this story is trivial compared to how the story would have been if this had to be debugged in a production setting at a support customer. Much nicer to work from a (semi-)repeatable test failure in Buildbot!

March 24th, 2011
05:49 pm

Benchmarking thread scheduling in group commit, part 2

I got access to our 12-core Intel server, so I was able to do some better benchmarks to test the different group commit thread scheduling methods:

This graph shows queries-per-second as a function of number of parallel connections, for three test runs:

  1. Baseline MariaDB, without group commit.
  2. MariaDB with group commit, using the simple thread scheduling, where the serial part of the group commit algorithm is done by each thread signalling the next one.
  3. MariaDB with group commit and optimised thread scheduling, where the first thread does the serial group commit processing for all transactions at once, in a single thread.
(see the previous post linked above for a more detailed explanation of the two thread scheduling algorithms.)

This test was run on a 12-core server with hyper-threading, memory is 24 GByte. MariaDB was running with datadir in /dev/shm (Linux ram disk), to simulate a really fast disk system and maximise the stress on the CPUs. Binlog is enabled with sync_binlog=1 and innodb_flush_log_at_trx_commit=1. Table type is InnoDB.

I use Gypsy to generate the client load, which is simple auto-commit primary key updates:

    REPLACE INTO t (a,b) VALUES (?, ?)

The graph clearly shows that the optimised thread scheduling algorithm improves scalability. As expected, the effect is more pronounced on the 12-core server than on the 4-core machine I tested on previously. The optimised thread scheduling has around 50% higher throughput at higher concurrencies. The naive thread scheduling algorithm, on the other hand, suffers from scalability problems to the degree that it is only slightly better than no group commit at all (but remember that this is on a ram disk, where group commit is hardly needed in the first place).

There is no doubt that this kind of optimised thread scheduling involves some complications and trickery. Running one part of a transaction in a different thread context from the rest does have the potential to cause subtle bugs.

On the other hand, we are moving fast towards more and more CPU cores and more and more I/O resources, and scalability just keeps getting more and more important. If we can scale MariaDB/MySQL with the hardware improvements, more and more applications can make do with scale-up rather than scale-out, which significantly simplifies the system architecture.

So I am just not comfortable introducing more serialisation (e.g. more global mutex contention) in the server than absolutely necessary. That is why I did the optimisation in the first place, even without testing. Still, the question is whether an optimisation that only has any effect above 20,000 commits per second is worth the extra complexity. I think I still need to think this over to finally make up my mind, and discuss with other MariaDB developers, but at least now we have a good basis for such a discussion (and fortunately, the code is easy to change one way or the other).

March 23rd, 2011
02:03 pm

Benchmarking thread scheduling in group commit

The best part of the recent MariaDB meeting in Lisbon for me was that I got some good feedback on my group commit work. This has been waiting in the review queue for quite some time now.

One comment I got revolved around an optimisation in the implementation related to how threads are scheduled.

A crucial step in the group commit algorithm is when the transactions being committed have been written into the binary log, and we want to commit them in the storage engine(s) in the same order as they were committed in the binlog. This ordering requirement makes that part of the commit process serialised (think global mutex).

Even though care is taken to make this serial part very quick to run inside the storage engine(s), I was still concerned about how it would impact scalability on multi-core machines. So I took extra care to minimise the time spent on the server layer in this step.

Suppose we have three transactions being committed as a group, each running in their own connection thread in the server. It would be natural to let the first thread do the first commit, then have the first thread signal the second thread to do the second commit, and finally have the second thread signal the third thread. The problem with this is that now the inherently serial part of the group commit not only includes the work in the storage engines, it also includes the time needed for two context switches (from thread 1 to thread 2, and from thread 2 to thread 3)! This is particularly costly if, after finishing with thread 1, we end up having to wait for thread 2 to be scheduled because all CPU cores are busy.

So what I did instead was to run all of the serial part in a single thread (the thread of the first transaction). The single thread will handle the commit ordering inside the storage engine for all the transactions, and the remaining threads will just wait for the first one to wake them up. This means the context switches for the waiting threads are not included in the serial part of the algorithm. But it also means that the storage engines need to be prepared to run this part of the commit in a separate thread from the rest of the transaction.
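
In code, the optimised scheduling boils down to something like the following pthreads sketch. It is heavily simplified, with invented names; the real code preserves the queue order, handles the binlog write for the whole group, and propagates errors:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Sketch only: the first thread to queue up becomes the "leader" and runs
       the serial commit_ordered() step for every queued transaction, so no
       context switches happen inside the serialised part. */
    struct trx_node
    {
      struct trx_node *next;
      bool done;
      pthread_cond_t cond;
    };

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct trx_node *group_queue = NULL;

    void commit_ordered(struct trx_node *trx);      /* in-order engine commit */
    void binlog_write_and_sync_group(void);         /* stand-in for binlog step */

    void group_commit_step(struct trx_node *me)
    {
      pthread_mutex_lock(&queue_lock);
      bool leader = (group_queue == NULL);
      me->next = group_queue;                       /* note: reverses order; the
                                                       real code keeps commit order */
      me->done = false;
      pthread_cond_init(&me->cond, NULL);
      group_queue = me;

      if (!leader)
      {
        /* Follower: our commit_ordered() runs on the leader's thread. */
        while (!me->done)
          pthread_cond_wait(&me->cond, &queue_lock);
        pthread_mutex_unlock(&queue_lock);
        return;
      }
      pthread_mutex_unlock(&queue_lock);

      binlog_write_and_sync_group();                /* while this runs, more
                                                       transactions join the queue */

      pthread_mutex_lock(&queue_lock);
      struct trx_node *batch = group_queue;         /* take the whole group */
      group_queue = NULL;
      pthread_mutex_unlock(&queue_lock);

      for (struct trx_node *t = batch; t != NULL; t = t->next)
        commit_ordered(t);                          /* serial part, one thread */

      pthread_mutex_lock(&queue_lock);
      for (struct trx_node *t = batch; t != NULL; t = t->next)
      {
        t->done = true;
        pthread_cond_signal(&t->cond);              /* wake followers outside the
                                                       serialised section */
      }
      pthread_mutex_unlock(&queue_lock);
    }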

So, in Lisbon there was some discussion around whether the modifications I did to InnoDB/XtraDB for this were sufficient to ensure that there would not be any problems with running this part of the commit in a different thread. After all, this requirement is a complication. And then the question came up whether the above optimisation is actually needed. Does it notably increase performance?

Now, that is a good question, and I did not have an answer as I had never tested it. So now I did! I added an option --binlog-optimize-thread-scheduling to allow switching between the naive and the optimised way of handling the commit of the different transactions in the serial part of the algorithm, and benchmarked them against each other.

Unfortunately, the two many-core servers we have available for testing were both unavailable (the hosting and quality of our servers leave a lot to be desired, unfortunately). So I was left to test on a 4-core (8 threads with hyperthreading) desktop box I have in my own office. I was able to get some useful results from this nevertheless, though I hope to revisit the benchmark later on more interesting hardware.

In order to stress the group commit code maximally, I used a synthetic workload with as many commits per second as possible. I used the fastest disk I have available, /dev/shm (Linux ramdisk). The transactions are single-row updates of the form

    REPLACE INTO t (a,b) VALUES (?, ?)
The server is an Intel Core i7 quad-core with hyperthreading enabled. It has 8GByte of memory. I used Gypsy to generate the load. Table type is XtraDB. The server is running with innodb_flush_log_at_trx_commit=1 and sync_binlog=1.

Here are the results in queries per second, with different number of concurrent connections running the queries:

Number of connections   QPS (naive scheduling)   QPS (optimised scheduling)   QPS (binlog disabled)
16                      21700                    23600                        29000
32                      19000                    22500                        29700
128                     18000                    19500                        26800

So as we see from this table, even with just four cores we get noticeably better performance by running the serial part of group commit in a single thread. The improvement is around 10% or so, depending on parallelism. So I think this means that I will want to keep the optimised version.

It is nice to see that we can get > 20k commits/second with the group commit code on cheap desktop hardware. For real servers the I/O subsystem will probably be a bottleneck, but that is what I wanted to see: that the group commit code will not limit the ability to fully utilise high amounts of I/O resources.

While I was at it, I also measured the throughput when the binlog is disabled. As can be seen, enabling the binlog has notable performance impact even with very fast disk. Still, considering the added overhead of writing an extra log file, not to mention the added 2-phase commit step, the overhead is not that unreasonable.

From the table we also see some negative scaling as the number of parallel connections increases. Some of this is likely from InnoDB/XtraDB, but I would like to investigate it deeper at some point to see if there is anything in the group commit part that can be improved with respect to this.

Looking back, should I have done this benchmark when designing the code? I think it is a tricky question, and one that cannot be given a simple answer. It will always be a trade-off: It is not feasible to test (and implement!) every conceivable variant of a new feature during development, it is necessary to also rely on common sense and experience. On the other hand, it is dangerous to rely on intuition with respect to performance; time and time again measurements prove that the real world is very often counter to intuition. In this case I was right, and my optimisation was beneficial; however I could easily have been wrong. I think the main lesson here is how important it is to get feedback on complex design work like this; such feedback is crucial for motivating and structuring the work to be of the quality that we need to see in MariaDB.

March 8th, 2011
09:30 am

My presentation from OpenSourceDays2011

Here are the slides from my talk at Open Source Days 2011 on Saturday. The talk was about MariaDB and other parts of the MySQL development community outside of MySQL@Oracle.

For me, the most memorable part of the conference was the talk by Noirin Shirley titled Open Source: Saving the World. Noirin described the Open Source Ushahidi project and how it was used during natural disaster crises in Indonesia, New Zealand, and other places.

Now, there is a long way from implementing group commit in MariaDB to rescuing injured people out of collapsed buildings, and not all use of Free Software is as samaritan as Ushahidi. Well, no-one can save the world alone.

But in the Free Software community, we work together, each contributing his or her microscopic part, and together slowly but surely building the most valuable software infrastructure in the world. Which then in turn empowers others to work together in other areas outside of (and more important than) software. Working in Free Software enables me to contribute my skills and resources, and I think Noirin managed very well to capture this in her talk.

February 24th, 2011
12:13 pm

Speaking at OpenSourceDays2011

Again this year, I will be speaking about MariaDB and stuff at the OpenSourceDays2011 conference in Copenhagen, Denmark. The conference will take place on Saturday March 5, that's just over a week from now! The program is ready and my talk is scheduled for the afternoon at 15:30. Hope to meet a lot of people there!

(I will be sure to make the slides from my talk available here afterwards, for those of you interested but unable to attend.)

Here is the abstract for my talk:

Latest news from the MariaDB (and MySQL) community

A lot of Open Source software projects got transferred to Oracle last year as part of the acquisition of Sun Microsystems. Not everybody in the affected Open Source communities has been happy with this transfer, to put it mildly, and projects like LibreOffice, Illumos, Jenkins, and others are forking left and right to become independent of Oracle.

Interestingly, MySQL, one of the major projects taken over from Sun, already had several forks active prior to the acquisition, among them MariaDB, which was started by original MySQL founder Michael "Monty" Widenius in 2009.

In the talk I will describe the MariaDB project: why it was started, what it is, and what we have been up to in the first two years of the project's existence. I will then give a more technical description of one particular performance feature that is new in MariaDB, "group commit", which is something I worked on personally the last year, and which I think is a good example of the kind of development that happens in MariaDB. Finally I want to give an "interactive FAQ", answering some of the questions that are buzzing around in the community concerning the future of MySQL and derivatives inside and outside of Oracle.

12:11 pm

MariaDB replication feature preview released

I am pleased to announce the availability of the MariaDB 5.2 feature preview release. Find the details and download links on the knowledgebase.

There has been quite good interest in the replication work I have been doing around MariaDB, and I wanted a way to make it easy for people to use, experiment with, and give feedback on the new features. The result is this replication feature preview release. This will all eventually make it into the next official release; however, that is likely still some months off.

All the usual binary packages and source tarballs are available for download. As something new, I now also made apt-enabled repositories available for Debian and Ubuntu; this should greatly simplify installation on these .deb based distributions.

So please try it out, and give feedback on the mailing list or bug tracker. I will make sure to fix any bugs and keep the feature preview updated until everything is available in an official release.

Here is the list of new features in the replication preview release:

Group commit for the binary log

This preview release implements group commit that works when using XtraDB with the binary log enabled. (In previous MariaDB releases, and all MySQL releases at the time of writing, group commit works in InnoDB/XtraDB when the binary log is disabled, but stops working when the binary log is enabled).

Documentation.

Enhancements for START TRANSACTION WITH CONSISTENT SNAPSHOT

START TRANSACTION WITH CONSISTENT SNAPSHOT now also works with the binary log. This means that it is possible to obtain the binlog position corresponding to a transactional snapshot of the database without any blocking of other queries at all. This is used by mysqldump --single-transaction --master-data to do a fully non-blocking backup that can be used to provision a new slave.

START TRANSACTION WITH CONSISTENT SNAPSHOT now also works consistently between transactions involving more than one storage engine (currently XtraDB and PBXT support this).

Documentation.

Annotation of row-based replication events with the original SQL statement

When using row-based replication, the binary log does not contain SQL statements, only discrete single-row insert/update/delete events. This can make it harder to read mysqlbinlog output and understand where in an application a given event may have originated, complicating analysis and debugging.

This feature adds an option to include the original SQL statement as a comment in the binary log (and shown in mysqlbinlog output) for row-based replication events.

Documentation.

Row-based replication for tables with no primary key

This feature can improve the performance of row-based replication on tables that do not have a primary key (or other unique key), but that do have another index that can help locate rows to update or delete. With this feature, index cardinality information from ANALYZE TABLE is considered when selecting the index to use (before this feature was implemented, the first index was selected unconditionally).

Documentation.
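
The selection rule itself is conceptually simple; here is a small sketch of the idea (invented struct and function names, nothing like the actual replication code):

    /* Sketch only: pick the usable key with the highest cardinality (best
       selectivity) instead of unconditionally taking the first one. */
    struct key_stat
    {
      const char *name;
      unsigned long cardinality;   /* from ANALYZE TABLE statistics */
      int usable;                  /* covers columns present in the row event */
    };

    int pick_slave_lookup_index(const struct key_stat *keys, int n_keys)
    {
      int best = -1;
      for (int i = 0; i < n_keys; i++)
      {
        if (!keys[i].usable)
          continue;
        if (best < 0 || keys[i].cardinality > keys[best].cardinality)
          best = i;
      }
      return best;                 /* -1 means fall back to a full table scan */
    }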

Early release during prepare phase of XtraDB row locks

This feature adds an option to make XtraDB release the row locks for a transaction earlier during the COMMIT step when running with --sync-binlog=1 and --innodb-flush-log-at-trx-commit=1. This can improve throughput if the workload has a bottleneck on hot-spot rows.

Documentation.

PBXT consistent commit ordering

This feature implements the new commit ordering storage engine API in PBXT. With this feature, it is possible to use START TRANSACTION WITH CONSISTENT SNAPSHOT and get consistency among transactions that involve both XtraDB and PBXT. (Without this feature, there is no such consistency guarantee. For example, even after running START TRANSACTION WITH CONSISTENT SNAPSHOT it was still possible for the InnoDB/XtraDB part of some transaction T to be visible and the PBXT part of the same transaction T to not be visible.)

Documentation.

Miscellaneous

  • Small change to make mysqlbinlog omit redundant use statements around BEGIN/SAVEPOINT/COMMIT/ROLLBACK events when reading MySQL 5.0 binlogs.

December 7th, 2010
12:05 pm

Christmas @ MariaDB

The "julehjerte" is apparently a Danish/Northern European Christmas tradition (at least according to Wikipedia). But hopefully people outside this region will also be able to enjoy this variant:

    [photo: a julehjerte woven with the MariaDB logo]

I have been making "julehjerter" ever since I was a small kid, and every Christmas I try to do something different with them. As seen above, this year I decided to combine the tradition with the MariaDB logo, and I am frankly quite pleased with the result :-)

October 11th, 2010
05:32 pm

The future of replication revealed in Istanbul

A very good meeting in Istanbul is drawing to an end. People from Monty Program, Facebook, Galera, Percona, SkySQL, and other parts of the community are meeting with one foot on the European continent and another in Asia to discuss all things MariaDB and MySQL and experience the mystery of the Orient.

At the meeting I had the opportunity to present my plans and visions for the future development of replication in MariaDB. My talk was very well received, and I had a lot of good discussions afterwards with many of the bright people here. Working from home in a virtual company, it means a lot to get this kind of inspiration and encouragement from others on occasion, and I am looking forward to continuing the work after an early flight to Copenhagen tomorrow.

The new interface for transaction coordinator plugins is what particularly interests me at the moment. The immediate benefit of this work is working group commit for transactions with the binary log enabled. But just as interesting (if more subtle), the project is an enabler for several other nice features related to hot backup and recovery. I spent a lot of effort working on the interfaces to the transaction coordinator and the related extensions to the storage engine API, and I think the result is quite solid and a good basis for coming work.

After the transaction coordinator plugin, the next step is an API for event generators that will allow plugins to receive replication events on an equal footing with the built-in MySQL binary log implementation; I will be using this in cooperation with Codership to more tightly integrate their Galera synchronous replication into MariaDB. And long-term, I am hoping to combine all of the pieces to finally start attacking the general problem of parallel execution of events on replication slaves, the solution of which is long overdue.

(The MariaDB replication project page has lots of pointers to more information on the various projects for anyone interested).

Almost too good to be true, our excursion today was blessed with sunshine and mild weather after countless days of rain and storm. There were even rumours of sightings of dolphins jumping again during the SkySQL excursion yesterday. So while lots of hard work remains, all in all, the omens all seem good for the future of replication in MariaDB!

October 3rd, 2010
06:55 pm

Dynamic linking costs two cycles

It turns out that the overhead of dynamic linking on Linux amd64 is 2 CPU cycles per cross-module call. I usually take forever to get to the point in my writing, so I thought I would change this for once :-)

In MySQL, there has been a historical tendency to favour static linking, in part to avoid the overhead (in execution efficiency) associated with dynamic linking. However, on modern systems there are also very serious drawbacks to using static linking.

The particular issue that inspired this article is that I was working on MWL#74, building a proper shared libmysqld.so library for the MariaDB embedded server. The lack of a proper libmysqld.so in MySQL and MariaDB has caused no end of grief for packaging Amarok for the various Linux distributions. My patch increases the amount of dynamic linking (in a default build), so I did a quick test to get an idea of the overhead of this.

ELF dynamic linking

The overhead comes from the way dynamic linking works in ELF-based systems like Linux (and many other POSIX-like operating systems). Code in shared libraries must be compiled to be position-independent, achieved with the -fPIC compiler switch. This allows the loader to simply mmap() the image of a shared library into the process address space at whatever free memory space is available, and the code can run without any need for the loader to do any kind of relocations of the code. For a much more detailed explanation see for example this article.

When generating position-independent code for a function call into another shared object, the compiler cannot generate a simple absolute call instruction, as the destination address is not known until run-time. Instead, the call goes via an indirect jump, fetching the destination address from a table called the PLT, short for Procedure Linkage Table. For example:

                       callq  0x400680 <mylib_myfunc@plt>
...
<mylib_myfunc@plt>:    jmpq   *0x200582(%rip)

The indirect jump resolves at runtime into the address of the real function to be called, so that is the overhead of the call when using dynamic linking: one indirect jump instruction.

Micro-benchmarking

To measure this one-instruction overhead in terms of execution time, I used the following code:

    for (i= 0; i < count; i++)
      v= mylib_myfunc(v);

The function mylib_myfunc() is placed in a library, with the following code:

    int mylib_myfunc(int v) {return v+1;}

I tested this with both static and dynamic linking on a Core 2 Duo 2.4 GHz machine running Linux amd64. Here are the results from running the loop for 1,000,000,000 (one billion) operations:

                     total time (sec.)   CPU cycles/iteration
Static linking       2.54                6
Dynamic linking      3.38                8

So that is the two CPU cycles of overhead per call that I referred to at the start of this post.
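
For anyone who wants to reproduce the numbers, the measurement boils down to something like this (a reconstruction; the exact build commands and the use of clock_gettime() here are assumptions on my part):

    /* bench.c - assumed build, roughly:
         gcc -O2 -fPIC -shared -o libmylib.so mylib.c      (dynamic case)
         gcc -O2 -o bench bench.c -L. -lmylib -Wl,-rpath,.
       versus compiling mylib.c directly into the binary for the static case. */
    #include <stdio.h>
    #include <time.h>

    int mylib_myfunc(int v);               /* lives in the library under test */

    int main(void)
    {
      const long count = 1000000000L;      /* one billion calls, as above */
      int v = 0;
      struct timespec t0, t1;

      clock_gettime(CLOCK_MONOTONIC, &t0);
      for (long i = 0; i < count; i++)
        v = mylib_myfunc(v);
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
      /* cycles per call = seconds * clock frequency / number of calls */
      printf("v=%d  total %.2f sec  ~%.1f cycles/call at 2.4 GHz\n",
             v, secs, secs * 2.4e9 / count);
      return 0;
    }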

Incidentally, if you try stepping through the call with a debugger, you will see a much larger overhead for the very first call. Do not be fooled by this, this is just because the loader fills in the PLT lazily, computing the correct address of the destination only on the first time the call is made (so addresses of functions that are never called by a process need never be calculated). See above-referenced article for more details.

(Note that this is for 64-bit amd64. For 32-bit x86, the mechanism is similar, but the actual overhead may be somewhat larger, since that architecture lacks program-counter-relative addressing and so must reserve one register %ebx (out of its already quite limited register bank) for this purpose. I did not measure the 32-bit case, I think it is of little interest nowadays for high-performance MySQL or MariaDB deployments (and the overhead of function calls on x86 32-bit is significantly higher anyway, dynamic linking or not, due to the need to push and pop all arguments to/from the stack)).

Conclusion

Two cycles per call is, in my opinion, a very modest overhead. It is hard to imagine high-performance code where this will have a real-life noticeable effect. Modern systems rely heavily on dynamic linking, and static linking nowadays causes many more problems than it solves. I think it is time to put the efficiency argument for static linking to rest.
