This is a tale of the bug lp:798213. The bug report has the initial report, and a summary of the real problem obtained after detailed analysis, but it does not describe the processes of getting from the former to the latter. I thought it would be interesting to document this, as the analysis of this bug was rather tricky and contains several good lessons.
The bug first manifested itself as a sporadic failure in one of our random query generator tests for replication. We run this test after all MariaDB pushes in our Buildbot setup. However, this failure had only occured twice in several months, so it is clearly a very rare failure.
The first task was to try to repeat the problem and get some more data in the form of binlog files and so on. Philip kindly helped with this, and after running the test repeatedly for several hours he finally managed to obtain a failure and attach the available information to the initial bug report. Time for analysis!
Understanding the failure
The first step is to understand what the test is doing, and what the failure means.
The test starts up a master server and exposes it to some random parallel write load. Half-way through, with the server running at full speed, it takes a non-blocking XtraBackup backup, and restores the backup into a new slave server. Finally it starts the new slave replicating from the binlog position reported by XtraBackup, and when the generated load is done and the slave caught up, it compares the master and slave to check that they are consistent. This test is an important check of my group commit work, which is carefully engineered to provide group commit while still preserving the commit order and consistent binlog position that is needed by XtraBackup to do such non-blocking provisioning of new slaves.
The failure is that in a failed run, the master and slave are different when compared at the end. The slave has a couple of extra rows (later I discovered the bug could also manifest itself as a single row being different). So this is not good obviously, and needs to be investigated.
Analysing the failure
So this is a typical case of a "hard" failure to debug. We have binlogs with 100k queries or so, and a slave that somewhere in those 100k queries diverges from the master. Working on problems like this, it is important to work methodically, slowly but surely narrowing down the problem, come up with hypothesis about the behaviour and positively affirm or reject them, until finally the problem is narrowed down sufficiently that the real cause is apparent. Random poking around not only is likely to waste time, but far worse, without a real understanding of the root cause of the failure, there is a great danger of eventually tweaking things so that the failure happens to go away in the test at hand, yet the underlying bug is still there. After all, the failure was already highly sporadic to begin with.
First I wanted to know if the problem is that replication diverges (eg. because of non-deterministic queries in the statement-based replication), or if it is a problem with the restored backup used to start the slave (wrong data or starting binlog position). Clearly, I strongly suspected a wrong starting binlog position, as this is what my group commit work messes with. But as it turns out, this was not the problem, again stressing the need to always verify positively any assumptions made during debugging.
To check this, I setup a new slave server from scratch, and had it replicate from the master binlog all the way from the start to the end. I then compared all three end results: (A) the original master; (B) the slave provisioned by XtraBackup, and (C) the new slave replicated from the start of the binlogs. It turns out that (A) and (C) are identical, while (B) differs. So this strongly suggests a problem with the restored XtraBackup; the binlogs by themselves replicate without problems.
To go further, I needed to analyse the state of the slave server just after the XtraBackup has been restored, without the effect of the thousands of queries replicated afterwards. Unfortunately this was not saved as part of the original test. It was trivial to add to the test (just copy away the backup to a safe place before starting the slave server), but then the need came to reproduce the failure again.
This is another important step in debugging hard sporadic failures: Get to the point where the failure can be reliably reproduced, at least for some semi-reasonable meaning of "reliable". This is really important not only to help debugging, but also to be able to verify that a proposed bug fix actually fixes the original bug! I do have experienced once or twice a failure so elusive that the only way to fix was to commit blindly a possible fix, then wait for several months to see if the failure would re-appear in that interval. Fortunately, in far the most cases, with a bit of work, this is not necessary.
Same here: After a bit of experimentation, I found that I could reliably reproduce the failure by reducing the duration of the test from 5 minutes to 35 seconds, and running the test in a tight loop until it failed. It always failed after typically 15-40 runs.
So now I had the state of the slave provisioned with XtraBackup as it was just before it starts replicating. So what I did was to set up another slave server from scratch and let it replicate from the master binlogs using START SLAVE UNTIL with the binlog position reported by XtraBackup. If the XtraBackup and its reported binlog start position are correct, these two servers should be identical. But sure enough, a comparison showed that they differed! In this case it was a single row that had different data. So this confirms the hypothesis that the problem is with the restored XtraBackup data and/or binlog position.
So now, thinking it was the binlog position that was off, I naturally next looked into the master binlog around this position, looking for an event just before the position that was not applied, or an event just after that already was applied. However, to my surprise I did not find this. I did find an event just after that updated the table that had the wrong row. However, the data in the update looked nothing like the data that was expected in the wrong row. And besides, that update was part of a transaction updating multiple tables; if that event was duplicated or missing, there would have been more row differences in more tables, not just one row in a single table. I did find an earlier event that looked somewhat related, however it was far back in the binlog (so not resolvable by merely adjusting the starting binlog pos); and besides again it was part of a bigger transaction updating more rows, while I had only one row with wrong data.
So at this point I need a new idea; the original hypothesis has been proven false. The restored XtraBackup is clearly wrong in a single row, but nothing in the binlog explains how this one row difference could have occured. When analysis runs up against a dead end, it is time to get more data. So I ran the test for a couple hours, obtained a handful more failures, and analysed then in the same way. Each time I saw the XtraBackup differing from the master in one row, or in once case the slave had a few extra rows.
So this is strange. After we restore the XtraBackup, we have one row (or a few rows) different from the master server. And those rows were updated in a multi-row transaction. It is as if we are somehow missing part of a transactions. Which is obviously quite bad, and indicates something bad going on at a deep level
Again it is time to get more data. Now I try running the test with different
server options to see if it makes any difference. Running with
--binlog-optimize-thread-scheduling=0 still caused the failure,
so it was not related to the
optimisation that I implemented. Then I noticed that the test runs with
On a hunch I tried running without this option, and AHA! Without this option,
I was no longer able to repeat the failure, even after 250 runs!
At this point, I start to strongly suspect a bug in the
feature. But this is still not proven! It could also be that with the option
disabled, there is less opportunity for parallelism, hiding the problem which
could really be elsewhere. So I still needed to understand exactly what the
root cause of the problem is.
At this point, I had sufficient information to start just thinking about the
problem, trying to work out ways in which things could go wrong in a way that
would produce symptoms like what we see. So I started to think on how
--innodb-release-locks-early works and how InnoDB undo and
recovery in general function. So I tried a couple of ideas, some did not seem
...and then, something occured to me. What the
feature does is to make InnoDB release row locks for a transaction earlier
than normal, just after the transaction has been prepared (but before it is
written into the binlog and committed inside InnoDB). This is to allow another
transaction waiting on the same row locks to proceed as quickly as possible.
Now, this means that with proper timing, it is possible for such a second transaction to also prepare before the first has time to commit. At this point we thus have two transactions in the prepared state, both of which modified the same row. If we were to take an XtraBackup snapshot at this exact moment, upon restore XtraBackup would need to roll back both of those transactions (the situation would be the same if the server crashed at that point and later did crash recovery).
This begs the question if such rollback will work correctly? This certainly is
not something that could occur in InnoDB before the patch for
--innodb-release-locks-early was implemented, and from
my knowledge of the patch, I know it does not explicitly do anything to
make this work. Aha! So now we have a new hypothesis: Rollback of multiple
transactions modifying the same row causes problems.
To test this hypothesis, I used the Debug_Sync facility to create a mysql-test-run test case. This test case creates runs a few transactions in parallel, all modifying a common row, then starts them committing but pauses the server when some of them are still in the prepared state. At this point it takes an XtraBackup snapshot. I then tried restoring the XtraBackup snapshot to a fresh server. Tada... but unfortunately this did not show any problems, the restore looked correct.
However, while restoring, I noticed that the prepared-but-not-committed transactions seems to be rolled back in reverse order of InnoDB transactions id. So this got me thinking - what would happen if they were rolled back in a different order? Indeed, for multiple transactions modifying a common row, the rollback order is critical! The first transaction to modify the row must be rolled back last, so that the correct before-image is left in the row. If we were to roll back in a different order, we could end up restoring the wrong before-image of the row - which would result in exactly the kind of single-row corruption that we seem to experience in the test failure! So we are getting close it seems. Since InnoDB seems to roll back transactions in reverse order of transaction ID, and since transaction IDs are presumably allocated in order of transaction start, maybe by starting the transactions in a different order, the failure can be provoked.
And sure enough, modifying the test case so that the transactions are started in the opposite order causes it to show the failure! During XtraBackup restore, the last transaction to modify the row is rolled back last, so that the row ends up with the value that was there just before the last update. But this value is wrong, as it was written there by a transaction that was itself rolled back. So we have the corruption reproduced with a small, repeatable test case, and the root cause of the problem completely understood. Task solved!
(Later I cleaned up the test case to crash the server and work with crash
recovery instead; this is simpler as it does not involve XtraBackup. Though
this also involves the XA recovery procedure with the binlog, the root problem
is the same and shows the same failure. As to a fix for the bug, that remains
to be seen. I wrote some ideas in the bug report, but it appears non-trivial
to fix. The
--innodb-release-locks-early feature is originally
from the Facebook patch;
maybe they will fix it, or maybe we will remove the feature from MariaDB 5.3 before
GA release. Corrupt data is a pretty serious bug, after all.)
I think there are some important points to learn from this debugging story:
- When working on high-performance system code, some debugging problems are just inherently hard! Things happen in parallel, and we operate with complex algorithms and high quality requirements on the code. But with a methodical approach, even the hard problems can be solved eventually.
- It is important not to ignore test failures in the test frameworks (such as Buildbot), no matter how random, sporadic, and/or elusive they may appear at first. True, many of them are false positives or defects in the test framework rather than the server code. But some of them are true bugs, and among them are some of the most serious and yet difficult bugs to track down. The debugging in this story is trivial compared to how the story would have been if this had to be debugged in a production setting at a support customer. Much nicer to work from a (semi-)repeatable test failure in Buildbot!