Kristian Nielsen
Below are the 10 most recent journal entries recorded in the "Kristian Nielsen" journal:
04:37 pm
Tale of a bug
This is a tale of the
bug lp:798213. The
bug report has the initial report, and a summary of the real problem obtained
after detailed analysis, but it does not describe the process of getting
from the former to the latter. I thought it would be interesting to document
this, as the analysis of this bug was rather tricky and contains several good
lessons.
Background
The bug first manifested itself as a sporadic failure in one of
our random query generator tests
for replication. We run this test after all MariaDB pushes in our Buildbot
setup. However, this failure had only occurred twice in several months, so it
is clearly a very rare failure.
The first task was to try to repeat the problem and get some more data in the
form of binlog files and so on. Philip kindly helped with this, and after
running the test repeatedly for several hours he finally managed to obtain a
failure and attach the available information to the initial bug report. Time
for analysis!
Understanding the failure
The first step is to understand what the test is doing, and what the failure
means.
The test starts up a master server and exposes it to some random
parallel write load. Half-way through, with the server running at full speed,
it takes a non-blocking XtraBackup backup, and restores the backup into a new
slave server. Finally it starts the new slave replicating from the binlog position
reported by XtraBackup, and when the generated load is done and the slave
caught up, it compares the master and slave to check that they are consistent.
This test is an important check of my group
commit work,
which is carefully engineered to provide group commit while still preserving
the commit order and consistent binlog position that is needed by XtraBackup
to do such non-blocking provisioning of new slaves.
The failure is that in a failed run, the master and slave are different when
compared at the end. The slave has a couple of extra rows (later I discovered
the bug could also manifest itself as a single row being different).
So this is not good obviously, and needs to be investigated.
Analysing the failure
So this is a typical case of a "hard" failure to debug. We have binlogs with
100k queries or so, and a slave that somewhere in those 100k queries diverges
from the master. Working on problems like this, it is important to work
methodically, slowly but surely narrowing down the problem, coming up with
hypotheses about the behaviour and positively confirming or rejecting them, until
finally the problem is narrowed down sufficiently that the real cause is
apparent. Random poking around not only is likely to waste time, but far
worse, without a real understanding of the root cause of the failure, there is
a great danger of eventually tweaking things so that the failure happens to go
away in the test at hand, yet the underlying bug is still there. After all,
the failure was already highly sporadic to begin with.
First I wanted to know if the problem is that replication diverges
(eg. because of non-deterministic queries in the statement-based replication),
or if it is a problem with the restored backup used to start the slave (wrong
data or starting binlog position). Clearly, I strongly suspected a wrong
starting binlog position, as this is what my group commit work messes
with. But as it turns out, this was not the problem, again stressing the
need to always verify positively any assumptions made during debugging.
To check this, I set up a new slave server from scratch, and had it replicate
from the master binlog all the way from the start to the end. I then compared
all three end results: (A) the original master; (B) the slave provisioned by
XtraBackup, and (C) the new slave replicated from the start of the binlogs. It
turns out that (A) and (C) are identical, while (B) differs. So this strongly
suggests a problem with the restored XtraBackup; the binlogs by themselves
replicate without problems.
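As a rough illustration of this three-way comparison (not the actual test code), here is a minimal Python sketch. It assumes each server's table contents have already been dumped into a dict keyed by primary key; the function name, the dump format and the sample rows are all made up for the example.

```python
def diff_tables(reference, other):
    """Return the primary keys on which two table dumps disagree.

    Both arguments are dicts mapping primary key -> row tuple
    (a hypothetical dump format, chosen only for this illustration).
    """
    missing = sorted(k for k in reference if k not in other)
    extra = sorted(k for k in other if k not in reference)
    changed = sorted(k for k in reference
                     if k in other and reference[k] != other[k])
    return {"missing": missing, "extra": extra, "changed": changed}


# Made-up sample data for (A) master, (B) XtraBackup slave,
# and (C) slave replicated from the start of the binlogs.
master = {1: ("a", 10), 2: ("b", 20), 3: ("c", 30)}
xtrabackup_slave = {1: ("a", 10), 2: ("b", 99), 3: ("c", 30)}
full_replay_slave = {1: ("a", 10), 2: ("b", 20), 3: ("c", 30)}

print(diff_tables(master, full_replay_slave))  # (A) vs (C)
print(diff_tables(master, xtrabackup_slave))   # (A) vs (B)
```

If (A) and (C) agree while (B) differs, as in this sample data, the binlogs replicate correctly on their own and suspicion falls on the restored backup.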
To go further, I needed to analyse the state of the slave server just after
the XtraBackup has been restored, without the effect of the thousands of
queries replicated afterwards. Unfortunately this was not saved as part of the
original test. It was trivial to add to the test (just copy away the backup to
a safe place before starting the slave server), but then the need came to
reproduce the failure again.
This is another important step in debugging hard sporadic failures: Get to the
point where the failure can be reliably reproduced, at least for some
semi-reasonable meaning of "reliable". This is really important not only to
help debugging, but also to be able to verify that a proposed bug fix actually
fixes the original bug! I have once or twice experienced a failure so
elusive that the only way to fix it was to blindly commit a possible fix, then wait for
several months to see if the failure would re-appear in that
interval. Fortunately, in the vast majority of cases, with a bit of work, this is not
necessary.
Same here: After a bit of experimentation, I found that I could reliably
reproduce the failure by reducing the duration of the test from 5 minutes to
35 seconds, and running the test in a tight loop until it failed. It
typically failed after 15-40 runs.
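The "run it in a tight loop until it fails" approach can be sketched in Python as below. This is only an illustration: the helper name is made up, and the real test run is replaced by a stand-in that fails at random with an assumed probability of 1 in 25.

```python
import random


def run_until_failure(run_once, max_runs):
    """Call run_once() repeatedly; return the 1-based number of the first
    failing run, or None if all max_runs runs pass."""
    for attempt in range(1, max_runs + 1):
        if not run_once():
            return attempt
    return None


# Stand-in for one short test run (hypothetical): it fails with an
# assumed probability of 1/25, which is not a measured figure.
rng = random.Random(12345)


def fake_test_run():
    return rng.random() >= 1 / 25   # True means the run passed


failed_at = run_until_failure(fake_test_run, max_runs=1000)
print("first failure on run", failed_at)
```

In real use, run_once would start the actual test and report whether it passed, and the loop would stop at the first failure so that the state of that run can be saved for analysis.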
So now I had the state of the slave provisioned with XtraBackup as it was just
before it starts replicating. So what I did was to set up another slave server
from scratch and let it replicate from the master binlogs using START SLAVE
UNTIL with the binlog position reported by XtraBackup. If the XtraBackup and
its reported binlog start position are correct, these two servers should be
identical. But sure enough, a comparison showed that they differed! In this
case it was a single row that had different data. So this confirms the
hypothesis that the problem is with the restored XtraBackup data and/or binlog
position.
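For reference, the statement used for this kind of check has the form START SLAVE UNTIL MASTER_LOG_FILE=..., MASTER_LOG_POS=.... The small Python helper below just builds that statement; the helper name and the binlog coordinates are placeholders, not the values from the actual failure.

```python
def start_slave_until(log_file, log_pos):
    """Build a START SLAVE UNTIL statement that stops replication at the
    given master binlog coordinates."""
    if not isinstance(log_pos, int) or log_pos < 0:
        raise ValueError("binlog position must be a non-negative integer")
    # Escape backslashes and single quotes in the file name.
    escaped = log_file.replace("\\", "\\\\").replace("'", "\\'")
    return ("START SLAVE UNTIL MASTER_LOG_FILE='%s', MASTER_LOG_POS=%d"
            % (escaped, log_pos))


# Hypothetical coordinates, standing in for what XtraBackup reported.
print(start_slave_until("master-bin.000001", 123456))
```

Once the fresh slave stops at that position, its data can be compared directly against the restored backup.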
So now, thinking it was the binlog position that was off, I naturally next
looked into the master binlog around this position, looking for an event just
before the position that was not applied, or an event just after that already
was applied. However, to my surprise I did not find this. I did find an event
just after that updated the table that had the wrong row. However, the data in
the update looked nothing like the data that was expected in the wrong
row. And besides, that update was part of a transaction updating multiple
tables; if that event was duplicated or missing, there would have been more
row differences in more tables, not just one row in a single table. I did find
an earlier event that looked somewhat related, however it was far back in the
binlog (so not resolvable by merely adjusting the starting binlog pos); and
besides again it was part of a bigger transaction updating more rows, while I
had only one row with wrong data.
So at this point I need a new idea; the original hypothesis has been proven
false. The restored XtraBackup is clearly wrong in a single row, but nothing
in the binlog explains how this one row difference could have occurred. When
analysis runs up against a dead end, it is time to get more data. So I ran the
test for a couple of hours, obtained a handful more failures, and analysed them
in the same way. Each time I saw the XtraBackup differing from the master in
one row, or in one case the slave had a few extra rows.
So this is strange. After we restore the XtraBackup, we have one row (or a few
rows) different from the master server. And those rows were updated in a
multi-row transaction. It is as if we are somehow missing part of a
transaction. Which is obviously quite bad, and indicates something bad going
on at a deep level.
Again it is time to get more data. Now I try running the test with different
server options to see if it makes any difference. Running with
--binlog-optimize-thread-scheduling=0 still caused the failure,
so it was not related to the
thread scheduling
optimisation that I implemented. Then I noticed that the test runs with
the option --innodb-release-locks-early=1 enabled.
On a hunch I tried running without this option, and AHA! Without this option,
I was no longer able to repeat the failure, even after 250 runs!
At this point, I start to strongly suspect a bug in the
--innodb-release-locks-early
feature. But this is still not proven! It could also be that with the option
disabled, there is less opportunity for parallelism, hiding the problem which
could really be elsewhere. So I still needed to understand exactly what the
root cause of the problem is.
Eureka!
At this point, I had sufficient information to start just thinking about the
problem, trying to work out ways in which things could go wrong in a way that
would produce symptoms like what we see. So I started to think on how
--innodb-release-locks-early works and how InnoDB undo and
recovery in general function. So I tried a couple of ideas, some did not seem
relevant...
...and then, something occurred to me. What the
--innodb-release-locks-early
feature does is to make InnoDB release row locks for a transaction earlier
than normal, just after the transaction has been prepared (but before it is
written into the binlog and committed inside InnoDB). This is to allow another
transaction waiting on the same row locks to proceed as quickly as possible.
Now, this means that with proper timing, it is possible for such a second
transaction to also prepare before the first has time to commit. At
this point we thus have two transactions in the prepared state, both of
which modified the same row. If we were to take an XtraBackup snapshot at this
exact moment, upon restore XtraBackup would need to roll back both of those
transactions (the situation would be the same if the server crashed at that
point and later did crash recovery).
This raises the question of whether such a rollback will work correctly. This certainly is
not something that could occur in InnoDB before the patch for
--innodb-release-locks-early was implemented, and from
my knowledge of the patch, I know it does not explicitly do anything to
make this work. Aha! So now we have a new hypothesis: Rollback of multiple
transactions modifying the same row causes problems.
To test this hypothesis, I used
the Debug_Sync
facility to create a mysql-test-run test case. This test case runs a
few transactions in parallel, all modifying a common row, then starts them
committing but pauses the server when some of them are still in the prepared
state. At this point it takes an XtraBackup snapshot. I then tried restoring
the XtraBackup snapshot to a fresh server. Tada... but unfortunately this did
not show any problems, the restore looked correct.
However, while restoring, I noticed that the prepared-but-not-committed
transactions seem to be rolled back in reverse order of InnoDB transaction
ID. So this got me thinking - what would happen if they were rolled back in a
different order? Indeed, for multiple transactions modifying a common row, the
rollback order is critical! The first transaction to modify the row
must be rolled back last, so that the correct before-image is left in
the row. If we were to roll back in a different order, we could end up
restoring the wrong before-image of the row - which would result in exactly
the kind of single-row corruption that we seem to experience in the test
failure! So we are getting close it seems. Since InnoDB seems to roll back
transactions in reverse order of transaction ID, and since transaction IDs are
presumably allocated in order of transaction start, maybe by starting
the transactions in a different order, the failure can be provoked.
And sure enough, modifying the test case so that the transactions are started
in the opposite order causes it to show the failure! During XtraBackup
restore, the last transaction to modify the row is rolled back last, so that
the row ends up with the value that was there just before the last update. But
this value is wrong, as it was written there by a transaction that was itself
rolled back. So we have the corruption reproduced with a small, repeatable
test case, and the root cause of the problem completely understood. Task
solved!
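The effect of rollback order can be shown with a toy model in Python. This is a deliberately simplified sketch, not how InnoDB undo logging is actually implemented: each transaction just remembers the value it overwrote (its before-image), and rolling it back restores that value. All names and values are made up.

```python
def apply_updates(initial, updates):
    """Apply updates in order; each update records the value it overwrote
    (its before-image), like a very simplified undo log entry."""
    value = initial
    undo = {}
    for trx_id, new_value in updates:
        undo[trx_id] = value      # before-image seen by this transaction
        value = new_value
    return value, undo


def roll_back(value, undo, order):
    """Roll back transactions in the given order by restoring each
    transaction's before-image."""
    for trx_id in order:
        value = undo[trx_id]
    return value


# The row starts as "original"; T1 modifies it first, then T2.
# Both transactions are only prepared, so both must be rolled back.
value, undo = apply_updates("original", [("T1", "from T1"), ("T2", "from T2")])

good = roll_back(value, undo, ["T2", "T1"])  # last modifier rolled back first
bad = roll_back(value, undo, ["T1", "T2"])   # first modifier rolled back first

print(good)  # original
print(bad)   # from T1
```

In the second ordering the row ends up holding the value written by T1, even though T1 was itself rolled back. This is the same kind of single-row corruption as in the test failure.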
(Later I cleaned up the test case to crash the server and work with crash
recovery instead; this is simpler as it does not involve XtraBackup. Though
this also involves the XA recovery procedure with the binlog, the root problem
is the same and shows the same failure. As to a fix for the bug, that remains
to be seen. I wrote some ideas in the bug report, but it appears non-trivial
to fix. The --innodb-release-locks-early feature is originally
from the Facebook patch;
maybe they will fix it, or maybe we will remove the feature from MariaDB 5.3 before
GA release. Corrupt data is a pretty serious bug, after all.)
Lessons learned
I think there are some important points to learn from this debugging story:
- When working on high-performance system code, some debugging problems are
just inherently hard! Things happen in parallel, and we operate with
complex algorithms and high quality requirements on the code. But with a
methodical approach, even the hard problems can be solved eventually.
- It is important not to ignore test failures in the test frameworks (such
as Buildbot), no matter how random, sporadic, and/or elusive they may appear
at first. True, many of them are false positives or defects in the test
framework rather than the server code. But some of them are true bugs, and
among them are some of the most serious and yet difficult bugs to track
down. The debugging in this story is trivial compared to how the
story would have been if this had to be debugged in a production setting at
a support customer. Much nicer to work from a (semi-)repeatable test failure
in Buildbot!
Tags: debugging, freesoftware, mariadb, mysql, programming
|
05:49 pm
Benchmarking thread scheduling in group commit, part 2
I got access to our 12-core Intel server, so I was able to do some better
benchmarks to test the different group commit thread scheduling methods:
This graph shows queries-per-second as a function of number of parallel
connections, for three test runs:
- Baseline MariaDB, without group commit.
- MariaDB with group commit, using the simple thread scheduling, where the
serial part of the group commit algorithm is done by each thread signalling
the next one.
- MariaDB with group commit and optimised thread scheduling, where the
first thread does the serial group commit processing for all transactions at
once, in a single thread.
(see the previous post linked above for a more detailed explanation of the two
thread scheduling algorithms.)
This test was run on a 12-core server with hyper-threading, memory is
24 GByte. MariaDB was running with datadir in /dev/shm (Linux ram
disk), to simulate a really fast disk system and maximise the stress on the
CPUs. Binlog is enabled with sync_binlog=1
and innodb_flush_log_at_trx_commit=1. Table type is InnoDB.
I use Gypsy to generate the client
load, which is simple auto-commit primary key updates:
REPLACE INTO t (a,b) VALUES (?, ?)
The graph clearly shows the optimised thread scheduling algorithm to improve
scalability. As expected, the effect is more pronounced on the twelve-core
server than on the 4-core machine I tested on previously. The optimised thread
scheduling has around 50% higher throughput at higher concurrencies, while the
naive thread scheduling algorithm suffers from scalability problems to the
degree that it is only slightly better than no group commit at all (but
remember that this is on ram disk, where group commit is hardly needed in the
first place).
There is no doubt that this kind of optimised thread scheduling involves some
complications and trickery. Running one part of a transaction in a different
thread context from the rest does have the potential to cause subtle bugs.
On the other hand, we are moving fast towards more and more CPU cores and more
and more I/O resources, and scalability just keeps getting more and more
important. If we can scale MariaDB/MySQL with the hardware improvements, more
and more applications can make do with scale-up rather than scale-out, which
significantly simplifies the system architecture.
So I am just not comfortable introducing more serialisation (e.g. more global
mutex contention) in the server than absolutely necessary. That is why I did
the optimisation in the first place even without testing. Still, the question
is if an optimisation that only has any effect above 20,000 commits per second
is worth the extra complexity? I think I still need to think this over to
finally make up my mind, and discuss with other MariaDB developers, but at
least now we have a good basis for such discussion (and fortunately, the code
is easy to change one way or the other).
Tags: mariadb, mysql, performance, programming
|
02:03 pm
Benchmarking thread scheduling in group commit
The best part of the recent MariaDB meeting in Lisbon for me
was that I got some good feedback on my group
commit work. This has been waiting in the review
queue for quite some time now.
One comment I got revolved around an optimisation in the implementation related
to how threads are scheduled.
A crucial step in the group commit algorithm is when the transactions being
committed have been written into the binary log, and we want to commit them in
the storage engine(s) in the same order as they were committed in the
binlog. This ordering requirement makes that part of the commit process
serialised (think global mutex).
Even though care is taken to make this serial part very quick to run inside
the storage engine(s), I was still concerned about how it would impact
scalability on multi-core machines. So I took extra care to minimise the time
spent on the server layer in this step.
Suppose we have three transactions being committed as a group, each running in
their own connection thread in the server. It would be natural to let the
first thread do the first commit, then have the first thread signal the second
thread to do the second commit, and finally have the second thread signal the
third thread. The problem with this is that now the inherently serial part of
the group commit not only includes the work in the storage engines, it also
includes the time needed for two context switches (from thread 1 to thread 2,
and from thread 2 to thread 3)! This is particularly costly if, after finishing
with thread 1, we end up having to wait for thread 2 to be scheduled because
all CPU cores are busy.
So what I did instead was to run all of the serial part in a single thread
(the thread of the first transaction). The single thread will handle the
commit ordering inside the storage engine for all the transactions, and the
remaining threads will just wait for the first one to wake them up. This means
the context switches for the waiting threads are not included in the serial
part of the algorithm. But it also means that the storage engines need to be
prepared to run this part of the commit in a separate thread from the rest of
the transaction.
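The two ways of scheduling the serial part can be sketched with Python threads. This is only a toy model under stated assumptions: appending to a list stands in for the commit step inside the storage engine, the function names are invented for the example, and nothing here reflects the actual server code.

```python
import threading


def commit_naive(trx_ids):
    """Each thread commits its own transaction, then signals the next one."""
    order = []
    turn = [threading.Event() for _ in trx_ids]

    def worker(i):
        turn[i].wait()                 # wait until it is our turn
        order.append(trx_ids[i])       # stand-in for the engine commit step
        if i + 1 < len(trx_ids):
            turn[i + 1].set()          # hand over to the next thread

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(trx_ids))]
    for t in threads:
        t.start()
    turn[0].set()                      # the first transaction may go ahead
    for t in threads:
        t.join()
    return order


def commit_optimised(trx_ids):
    """The first thread commits for the whole group, then wakes the others."""
    order = []
    done = threading.Event()

    def leader():
        for trx in trx_ids:            # whole serial part runs in one thread
            order.append(trx)
        done.set()                     # wake all waiting threads at once

    def follower():
        done.wait()                    # the commit was done on our behalf

    threads = [threading.Thread(target=leader)]
    threads += [threading.Thread(target=follower) for _ in trx_ids[1:]]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order


group = ["trx1", "trx2", "trx3"]
print(commit_naive(group))
print(commit_optimised(group))
```

Both variants commit in the same order. The difference is that in the naive variant every hand-over between threads sits inside the serial section, while in the optimised variant the other threads are only woken after the serial section is finished.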
So, in Lisbon there was some discussion around whether the modifications I did to
InnoDB/XtraDB for this were sufficient to ensure that there would not be any
problems with running this part of the commit in a different thread. After
all, this requirement is a complication. And then the question came
up whether the above optimisation is actually needed. Does it notably increase
performance?
Now, that is a good question, and I did not have an answer as I never tested it.
So now I did! I added an
option --binlog-optimize-thread-scheduling to allow switching
between the naive and the optimised way to handle the commit of the different
transactions in the serial part of the algorithm, and benchmarked them against
each other.
Unfortunately, the two many-core servers we have available for testing were
both unavailable (our hosting and quality of servers leaves a lot to be
desired unfortunately). So I was left to test on a 4-core (8 threads with
hyperthreading) desktop box I have in my own office. I was able to get some
useful results from this nevertheless, though I hope to revisit the benchmark
later on more interesting hardware.
In order to stress the group commit code maximally, I used a synthetic workload
with as many commits per second as possible. I used the fastest disk I have
available, /dev/shm (Linux ramdisk). The transactions are
single-row updates of the form
REPLACE INTO t (a,b) VALUES (?, ?)
The server is an Intel Core i7 quad-core with hyperthreading enabled. It has
8GByte of memory. I used Gypsy to generate the load.
Table type is XtraDB. The server is running
with innodb_flush_log_at_trx_commit=1
and sync_binlog=1.
Here are the results in queries per second, with different number of
concurrent connections running the queries:
| Number of connections | QPS (naive scheduling) | QPS (optimised scheduling) | QPS (binlog disabled) |
| 16 | 21700 | 23600 | 29000 |
| 32 | 19000 | 22500 | 29700 |
| 128 | 18000 | 19500 | 26800 |
So as we see from this table, even with just four cores we see noticeably
better performance by running the serial part of group commit in a single
thread. The improvement is roughly 8% to 18%, depending on parallelism. So I think
this means that I will want to keep the optimised version.
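The relative improvement at each concurrency level can be worked out directly from the table. The short Python sketch below does just that arithmetic; the function name is made up for the example.

```python
# (connections, QPS naive, QPS optimised), copied from the table above.
results = [
    (16, 21700, 23600),
    (32, 19000, 22500),
    (128, 18000, 19500),
]


def improvement_percent(naive_qps, optimised_qps):
    """Relative throughput gain of optimised over naive scheduling, in percent."""
    return (optimised_qps - naive_qps) / naive_qps * 100.0


for connections, naive, optimised in results:
    gain = improvement_percent(naive, optimised)
    print("%4d connections: %+.1f%%" % (connections, gain))
```

This gives gains of about 8.8%, 18.4% and 8.3% for 16, 32 and 128 connections respectively.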
It is nice to see that we can get > 20k commits/second with the group commit
code on cheap desktop hardware. For real servers the I/O subsystem will
probably be a bottleneck, but that is what I wanted to see: that the group
commit code will not limit the ability to fully utilise high amounts of I/O
resources.
While I was at it, I also measured the throughput when the binlog is
disabled. As can be seen, enabling the binlog has notable performance impact
even with very fast disk. Still, considering the added overhead of writing an
extra log file, not to mention the added 2-phase commit step, the overhead is
not that unreasonable.
From the table we also see some negative scaling as the number of parallel
connections increases. Some of this is likely from InnoDB/XtraDB, but I would
like to investigate it deeper at some point to see if there is anything in the
group commit part that can be improved with respect to this.
Looking back, should I have done this benchmark when designing the code? I
think it is a tricky question, and one that cannot be given a simple
answer. It will always be a trade-off: It is not feasible to test (and
implement!) every conceivable variant of a new feature during development;
it is necessary to also rely on common sense and experience. On the other
hand, it is dangerous to rely on intuition with respect to performance; time
and time again measurements prove that the real world is very often counter to
intuition. In this case I was right, and my optimisation was beneficial;
however I could easily have been wrong. I think the main lesson here is how
important it is to get feedback on complex design work like this; such
feedback is crucial for motivating and structuring the work to be of the
quality that we need to see in MariaDB.
Tags: mariadb, mysql, performance, programming
|
09:30 am
My presentation from OpenSourceDays2011
Here are the slides
from my talk at Open Source Days
2011 on Saturday. The talk was about MariaDB and other parts of the MySQL
development community outside of MySQL@Oracle.
For me, the most memorable part of the conference was the talk by
Noirin Shirley titled
Open Source: Saving the World.
Noirin described the Open Source Ushahidi project
and how it was used during the natural disaster crises in Indonesia, New
Zealand and other places.
Now, there is a long way
from implementing
group commit in MariaDB to rescuing injured people out of collapsed
buildings, and not all use of Free Software is as samaritan as
Ushahidi. Well, no-one can save the world alone.
But in the Free Software
community, we work together, each contributing his or her microscopic part,
and together slowly but surely building the most valuable software
infrastructure in the world. Which then in turn empowers others to work
together in other areas outside of (and more important than) software.
Working in Free Software enables me to contribute my skills and resources, and
I think Noirin managed very well to capture this in her talk.
Tags: conference, mariadb, mysql, talk
|
12:13 pm
Speaking at OpenSourceDays2011
Again this year, I will be speaking about MariaDB and stuff at the
OpenSourceDays2011 conference in
Copenhagen, Denmark. The conference will take place on Saturday March 5,
that's just over a week from now!
The program is ready and my
talk is scheduled for the afternoon at 15:30. Hope to meet a lot of people
there!
(I will be sure to make the slides from my talk available here afterwards, for
those of you interested but unable to attend.)
Here is the abstract for my talk:
Latest news from the MariaDB (and MySQL) community
A lot of Open Source software projects got transferred to Oracle last year as part of the acquisition of Sun Microsystems. Not everybody in the affected Open Source communities has been happy with this transfer, to put it mildly, and projects like LibreOffice, Illumos, Jenkins, and others are forking left and right to become independent of Oracle.
Interestingly, MySQL, one of the major projects taken over from Sun, already had several forks active prior to the acquisition, among them MariaDB, which was started by original MySQL founder Michael "Monty" Widenius in 2009.
In the talk I will describe the MariaDB project: why it was started, what it is, and what we have been up to in the first two years of the project's existence. I will then give a more technical description of one particular performance feature that is new in MariaDB, "group commit", which is something I worked on personally the last year, and which I think is a good example of the kind of development that happens in MariaDB. Finally I want to give an "interactive FAQ", answering some of the questions that are buzzing around in
the community concerning the future of MySQL and derivatives inside and outside of Oracle.
Tags: conference, mariadb, mysql, talk
|
12:11 pm
MariaDB replication feature preview released
I am pleased to announce the availability of the MariaDB 5.2 feature preview
release. Find the details and download
links on
the knowledgebase.
There has been quite good interest in the replication work I have been doing
around MariaDB, and I wanted a way to make it easy for people to use,
experiment with, and give feedback on the new features. The result is this
replication feature preview release. This will all eventually make it into the
next official release, however this is likely still some months off.
All the usual binary packages and source tarballs
are available
for download. As something new, I now also made apt-enabled repositories
available for Debian and Ubuntu; this should greatly simplify installation on
these .deb based distributions.
So please try it out, and give feedback on
the mailing list
or bug tracker. I will
make sure to fix any bugs and keep the feature preview updated until
everything is available in an official release.
Here is the list of new features in the replication preview release:
Group commit for the binary log
This preview release implements group commit that works when using XtraDB with
the binary log enabled. (In previous MariaDB releases, and all MySQL releases at
the time of writing, group commit works in InnoDB/XtraDB when the binary log
is disabled, but stops working when the binary log is enabled).
Documentation.
Enhancements for START TRANSACTION WITH CONSISTENT SNAPSHOT
START TRANSACTION WITH CONSISTENT SNAPSHOT now also works with the binary
log. This means that it is possible to obtain the binlog position
corresponding to a transactional snapshot of the database without any blocking
of other queries at all. This is used by mysqldump --single-transaction
--master-data to do a fully non-blocking backup that can be used to provision
a new slave.
START TRANSACTION WITH CONSISTENT SNAPSHOT now also works consistently between
transactions involving more than one storage engine (currently XtraDB and PBXT
support this).
Documentation.
Annotation of row-based replication events with the original SQL statement
When using row-based replication, the binary log does not contain SQL
statements, only discrete single-row insert/update/delete events. This can
make it harder to read mysqlbinlog output and understand where in an
application a given event may have originated, complicating analysis and
debugging.
This feature adds an option to include the original SQL statement as a
comment in the binary log (and shown in mysqlbinlog output) for row-based
replication events.
Documentation.
Row-based replication for tables with no primary key
This feature can improve the performance of row-based replication on tables
that do not have a primary key (or other unique key), but that do have another
index that can help locate rows to update or delete. With this feature, index
cardinality information from ANALYZE TABLE is considered when selecting the
index to use (before this feature was implemented, the first index was selected
unconditionally).
Documentation.
Early release during prepare phase of XtraDB row locks
This feature adds an option to make XtraDB release the row locks for a
transaction earlier during the COMMIT step when running with --sync-binlog=1
and --innodb-flush-log-at-trx-commit=1. This can improve throughput if the
workload has a bottleneck on hot-spot rows.
Documentation.
PBXT consistent commit ordering
This feature implements the new commit ordering storage engine API in
PBXT. With this feature, it is possible to use START TRANSACTION WITH
CONSISTENT SNAPSHOT and get consistency among transactions that involve both
XtraDB and PBXT. (Without this feature, there is no such consistency
guarantee. For example, even after running START TRANSACTION WITH CONSISTENT
SNAPSHOT it was still possible for the InnoDB/XtraDB part of some transaction
T to be visible and the PBXT part of the same transaction T to not be visible.)
Documentation.
Miscellaneous
- Small change to make mysqlbinlog omit redundant
use statements around BEGIN/SAVEPOINT/COMMIT/ROLLBACK events when reading MySQL 5.0 binlogs.
Tags: mariadb, mysql, release, replication
|
12:05 pm
Christmas @ MariaDB
The Danish "julehjerte" is apparently a Danish/Northern Europe Christmas
tradition
(at
least according to Wikipedia). But hopefully people outside this
region will also be able to enjoy this variant:
I have been doing "julehjerter" ever since I was a small kid, and every
Christmas try to do something different with it. As seen above, this year I
decided to combine the tradition with the MariaDB logo, and I am frankly quite
pleased with the result :-)
Tags: fun, mariadb, mysql
|
05:32 pm
The future of replication revealed in Istanbul
A very good meeting in Istanbul is drawing to an end. People from Monty Program,
Facebook, Galera, Percona, SkySQL, and other parts of the community are
meeting with one foot on the European continent and another in Asia to discuss
all things MariaDB and MySQL and experience the mystery of the Orient.
At the meeting I had the opportunity
to present my plans and
visions for the future development of replication in MariaDB. My talk was very
well received, and I had a lot of good discussions afterwards with many of the
bright people here. Working from home in a virtual company, it means a lot to
get this kind of inspiration and encouragement from others on occasion, and I
am looking forward to continuing the work after an early flight to Copenhagen
tomorrow.
The new interface for transaction coordinator plugins is what particularly
interests me at the moment. The immediate benefit of this work is
working group
commit for transactions with the binary log enabled. But just as interesting (if
more subtle), the project is an enabler for several other nice features
related to hot backup and recovery. I spent a lot of effort working on the
interfaces to the transaction coordinator and related extensions to the storage
engine API, and I think the result is quite solid and a good basis for coming
work.
After the transaction coordinator plugin, the next step is an API for event
generators that will allow plugins to receive replication events on an equal
footing with the built-in MySQL binary log implementation; I will be using
this in cooperation with Codership to more tightly integrate
their Galera synchronous replication
into MariaDB. And long-term, I am hoping to combine all of the pieces to finally
start attacking the general problem of parallel execution of events on
replication slaves, the solution of which is long overdue.
(The MariaDB replication
project page has lots of pointers to more information on the various
projects for anyone interested).
Almost too good to be true, our excursion today was blessed with sunshine and
mild weather after countless days of rain and storm. There were even rumours
of sightings of dolphins jumping again during the SkySQL excursion yesterday.
So while lots of hard work remains, all in all, the omens seem all good for
the future of replication in MariaDB!
Tags: freesoftware, mariadb, mysql, programming, replication
|
06:55 pm
[Link] |
Dynamic linking costs two cycles
It turns out that the overhead of dynamic linking on Linux amd64 is 2 CPU
cycles per cross-module call. I usually take forever to get to the point in my
writing, so I thought I would change this for once :-)
In MySQL, there has been a historical tendency to favour static linking, in
part to avoid the overhead (in execution efficiency) associated with
dynamic linking. However, on modern systems there are also very serious
drawbacks when using static linking.
The particular issue that inspired this article is that I was working
on MWL#74,
building a proper shared libmysqld.so library for the MariaDB
embedded server. The lack of a proper libmysqld.so in MySQL and
MariaDB has caused no end of grief for
packaging Amarok for the various Linux
distributions. My patch increases the amount of dynamic linking (in a default
build), so I did a quick test to get an idea of the overhead of this.
ELF dynamic linking
The overhead comes from the way dynamic linking works in ELF-based systems
like Linux (and many other POSIX-like operating systems). Code in shared
libraries must be compiled to be position-independent, achieved with
the -fPIC compiler switch. This allows the loader to
simply mmap() the image of a shared library into the process
address space at whatever free memory space is available, and the code can run
without any need for the loader to do any kind of relocations of the code. For
a much more detailed explanation see for
example this
article.
When generating position-independent code for a function call into another
shared object, the compiler cannot generate a simple
absolute call instruction, as the destination address is not
known until run-time. Instead, the call goes via a small stub in a table
called the PLT, short for Procedure Linkage Table. The stub is an indirect
jump that fetches the destination address from memory at run-time. For example:
callq 0x400680 <mylib_myfunc@plt>
...
<mylib_myfunc@plt>: jmpq *0x200582(%rip)
The indirect jump resolves at runtime into the address of the real function to
be called, so that is the overhead of the call when using dynamic linking: one
indirect jump instruction.
Micro-benchmarking
To measure this one-instruction overhead in terms of execution time, I used
the following code:
for (i= 0; i < count; i++)
v= mylib_myfunc(v);
The function mylib_myfunc() is placed in a library, with the
following code:
int mylib_myfunc(int v) {return v+1;}
I tested this with both static and dynamic linking on a Core 2 Duo 2.4 GHz
machine running Linux amd64. Here are the results from running the loop for
1,000,000,000 (one billion) operations:
| | total time (sec.) | CPU cycles/iteration |
| Static linking | 2.54 | 6 |
| Dynamic linking | 3.38 | 8 |
So that is the two CPU cycles of overhead per call that I referred to at the
start of this post.
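The cycles column follows from the total time, the clock frequency and the iteration count. As a small sketch of that arithmetic (the function name is made up here, and it assumes the CPU really runs at its nominal 2.4 GHz for the whole test):

```c
#include <assert.h>

/* Average number of CPU cycles spent per loop iteration, given the total
   run time in seconds, the clock frequency in Hz and the iteration count.
   Assumes a constant clock frequency (no frequency scaling). */
double cycles_per_iteration(double total_sec, double clock_hz,
                            double iterations)
{
  assert(iterations > 0.0);
  return total_sec * clock_hz / iterations;
}
```

With the numbers from the table, 2.54 seconds gives about 6.1 cycles and 3.38 seconds gives about 8.1 cycles, a difference of roughly 2 cycles per call.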
Incidentally, if you try stepping through the call with a debugger, you will
see a much larger overhead for the very first call. Do not be fooled by this;
this is just because the loader fills in the PLT lazily, computing the correct
address of the destination only on the first time the call is made (so
addresses of functions that are never called by a process need never be
calculated). See the above-referenced article for more details.
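The lazy filling-in can be pictured with a plain C analogy. This is only an illustration of the idea, not what the dynamic loader actually executes; all names below (myfunc_slot, resolver_stub, call_myfunc, resolve_count) are made up for the sketch.

```c
#include <assert.h>

/* The real destination of the call. */
int real_myfunc(int v) { return v + 1; }

int resolve_count = 0;           /* how many times the slow path ran */

int resolver_stub(int v);

/* One table slot. It starts out pointing at the resolver, much like an
   entry that has not been resolved yet. */
int (*myfunc_slot)(int) = resolver_stub;

/* Slow path, taken on the first call only: "look up" the destination,
   patch the slot, then forward the call. */
int resolver_stub(int v)
{
  assert(myfunc_slot == resolver_stub);  /* only runs while unpatched */
  resolve_count++;
  myfunc_slot = real_myfunc;
  return real_myfunc(v);
}

/* What every call site does: one indirect call through the slot. */
int call_myfunc(int v) { return myfunc_slot(v); }
```

After the first call the slot points straight at real_myfunc(), so later calls pay only for the indirection.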
(Note that this is for 64-bit amd64. For 32-bit x86, the mechanism is similar,
but the actual overhead may be somewhat larger, since that architecture lacks
program-counter-relative addressing and so must reserve one
register %ebx (out of its already quite limited register bank)
for this purpose. I did not measure the 32-bit case, as I think it is of little
interest nowadays for high-performance MySQL or MariaDB deployments (and the
overhead of function calls on x86 32-bit is significantly higher anyway,
dynamic linking or not, due to the need to push and pop all arguments to/from
the stack)).
Conclusion
Two cycles per call is, in my opinion, a very modest overhead. It is hard to
imagine high-performance code where this will have a real-life noticeable
effect. Modern systems rely heavily on dynamic linking, and static linking is
nowadays causing many more problems than it solves. And I think it is also
time to put the efficiency argument for static linking to rest.
Tags: freesoftware, mariadb, mysql, performance, programming
|
06:38 pm
[Link] |
Micro-benchmarking pthread_cond_broadcast()
In my work on group commit for MariaDB, I have the following situation:
A group of threads are going to participate in group commit. This means that
one of the threads, called the group leader, will run
an fsync() for all of them, while the other threads wait.
Once the group leader is done, it needs to wake up all of the other threads.
The obvious way to do this is to have the group leader
call pthread_cond_broadcast() on a condition that the other
threads are waiting for with pthread_cond_wait():
bool wakeup= false;
pthread_cond_t wakeup_cond;
pthread_mutex_t wakeup_mutex;
Waiter:
pthread_mutex_lock(&wakeup_mutex);
while (!wakeup)
pthread_cond_wait(&wakeup_cond, &wakeup_mutex);
pthread_mutex_unlock(&wakeup_mutex);
// Continue processing after group commit is now done.
Group leader:
pthread_mutex_lock(&wakeup_mutex);
wakeup= true;
pthread_cond_broadcast(&wakeup_cond);
pthread_mutex_unlock(&wakeup_mutex);
Note the association of the condition with a mutex. This association is
inherent in the way pthread condition variables work. The mutex must be locked
when calling into pthread_cond_wait(), and
will be obtained again before the call returns.
(Check the man page for pthread_cond_wait() for details).
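For readers who want to try the broadcast pattern, here is a minimal compilable sketch. The helper run_broadcast_demo(), the woken counter and the MAX_WAITERS limit are assumptions added for illustration; they are not part of the group commit code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 64               /* arbitrary limit for this sketch */

static bool wakeup = false;
static int woken = 0;                /* protected by wakeup_mutex */
static pthread_cond_t wakeup_cond = PTHREAD_COND_INITIALIZER;
static pthread_mutex_t wakeup_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *waiter_main(void *arg)
{
  (void)arg;
  pthread_mutex_lock(&wakeup_mutex);
  while (!wakeup)                    /* loop also covers spurious wakeups */
    pthread_cond_wait(&wakeup_cond, &wakeup_mutex);
  woken++;                           /* we hold the mutex again here */
  pthread_mutex_unlock(&wakeup_mutex);
  return NULL;
}

/* Starts nthreads waiters, wakes them with a single broadcast and returns
   how many woke up, or -1 if a thread could not be created. */
int run_broadcast_demo(int nthreads)
{
  pthread_t threads[MAX_WAITERS];
  int i, started = 0;

  assert(nthreads >= 0 && nthreads <= MAX_WAITERS);
  pthread_mutex_lock(&wakeup_mutex);
  wakeup = false;
  woken = 0;
  pthread_mutex_unlock(&wakeup_mutex);

  for (i = 0; i < nthreads; i++) {
    if (pthread_create(&threads[i], NULL, waiter_main, NULL) != 0)
      break;
    started++;
  }

  /* Group leader: set the flag and wake everybody. */
  pthread_mutex_lock(&wakeup_mutex);
  wakeup = true;
  pthread_cond_broadcast(&wakeup_cond);
  pthread_mutex_unlock(&wakeup_mutex);

  for (i = 0; i < started; i++)
    pthread_join(threads[i], NULL);
  return started == nthreads ? woken : -1;   /* all joined, safe to read */
}
```

A waiter that has not yet reached pthread_cond_wait() when the broadcast happens is still handled correctly, because it checks the wakeup flag under the mutex before waiting.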
Now, when I think about how these condition variables work, something strikes
me as somewhat odd.
The idea is that the broadcast signals every waiting thread to wake
up. However, because of the associated mutex, only one thread will actually be
able to wake up; this thread will obtain a lock on the mutex, and all other
to-be-awoken threads will now have to wait for this mutex! Only after the
first thread releases this mutex will the next thread wake up holding the
mutex, then after that thread releases it the third thread can wake up, and so on.
So if we have say 100 threads waiting, the last one will have to wait for the
first 99 threads to each be scheduled and each release the mutex, one after
the other in a completely serialised fashion.
But what I really want is to just let them all run at once in parallel (or at
least as many as my machine has spare cores for). There is another way to
achieve this, by simply using a separate condition and mutex for each thread,
and have the group leader signal each one individually:
Waiter:
pthread_mutex_lock(&me->wakeup_mutex);
while (!me->wakeup)
pthread_cond_wait(&me->wakeup_cond, &me->wakeup_mutex);
pthread_mutex_unlock(&me->wakeup_mutex);
Group leader:
for waiter in <all waiters>
pthread_mutex_lock(&waiter->wakeup_mutex);
waiter->wakeup= true;
pthread_cond_signal(&waiter->wakeup_cond);
pthread_mutex_unlock(&waiter->wakeup_mutex);
This way, every waiter is free to start running as soon as woken up by the
leader; no waiters have to wait for one another. This seems advantageous,
especially as the number of cores increases (rumours are that 48 core machines are
becoming commodity).
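The per-thread variant can be sketched as compilable C along the following lines. The struct layout, the woke flag, run_signal_demo() and MAX_WAITERS are assumptions made for this illustration only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 64               /* arbitrary limit for this sketch */

struct waiter {
  pthread_t thread;
  pthread_mutex_t wakeup_mutex;
  pthread_cond_t wakeup_cond;
  bool wakeup;                       /* set by the leader */
  bool woke;                         /* set by the thread once woken */
};

static void *waiter_main(void *arg)
{
  struct waiter *me = arg;
  pthread_mutex_lock(&me->wakeup_mutex);
  while (!me->wakeup)
    pthread_cond_wait(&me->wakeup_cond, &me->wakeup_mutex);
  me->woke = true;
  pthread_mutex_unlock(&me->wakeup_mutex);
  return NULL;
}

/* Starts nthreads waiters, signals each one on its own condition and
   returns how many woke up, or -1 if a thread could not be created. */
int run_signal_demo(int nthreads)
{
  static struct waiter waiters[MAX_WAITERS];
  int i, started = 0, count = 0;

  assert(nthreads >= 0 && nthreads <= MAX_WAITERS);
  for (i = 0; i < nthreads; i++) {
    struct waiter *w = &waiters[i];
    pthread_mutex_init(&w->wakeup_mutex, NULL);
    pthread_cond_init(&w->wakeup_cond, NULL);
    w->wakeup = false;
    w->woke = false;
    if (pthread_create(&w->thread, NULL, waiter_main, w) != 0) {
      pthread_cond_destroy(&w->wakeup_cond);
      pthread_mutex_destroy(&w->wakeup_mutex);
      break;
    }
    started++;
  }

  /* Group leader: wake each waiter through its own mutex and condition. */
  for (i = 0; i < started; i++) {
    struct waiter *w = &waiters[i];
    pthread_mutex_lock(&w->wakeup_mutex);
    w->wakeup = true;
    pthread_cond_signal(&w->wakeup_cond);
    pthread_mutex_unlock(&w->wakeup_mutex);
  }

  for (i = 0; i < started; i++) {
    struct waiter *w = &waiters[i];
    pthread_join(w->thread, NULL);
    if (w->woke)
      count++;
    pthread_cond_destroy(&w->wakeup_cond);
    pthread_mutex_destroy(&w->wakeup_mutex);
  }
  return started == nthreads ? count : -1;
}
```

Since no two waiters share a mutex, a woken thread never has to queue behind another waiter to get out of pthread_cond_wait().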
"Seems" advantageous. But is it really? Let us micro-benchmark it.
For this, I start up 5000 threads. Each thread goes to wait on a condition,
either a single shared one, or distinct in each thread. The main program then
signals the threads to wake up, either with a single pthread_cond_broadcast(),
or with one pthread_cond_signal() per thread. Each thread records
the time it woke up, and the main program collects these times and computes
how long it took between starting to signal the condition(s) and wakeup of the
last thread. (Here is
the full C
source code for the test program).
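The final computation, from recorded wakeup times to a single latency figure, could be done as in the sketch below. It is not taken from the linked test program: the helper names are made up, and it assumes each thread stored its wakeup time as a struct timespec, for example from clock_gettime(CLOCK_MONOTONIC, ...).

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Milliseconds elapsed from start to end. */
double elapsed_msec(struct timespec start, struct timespec end)
{
  return (double)(end.tv_sec - start.tv_sec) * 1000.0 +
         (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}

/* Time in milliseconds from the moment signalling started until the last
   of the n threads woke up. */
double last_wakeup_msec(struct timespec start,
                        const struct timespec *wakeups, size_t n)
{
  double max_msec = 0.0;
  size_t i;

  assert(wakeups != NULL && n > 0);
  for (i = 0; i < n; i++) {
    double d = elapsed_msec(start, wakeups[i]);
    if (d > max_msec)
      max_msec = d;
  }
  return max_msec;
}
```

Taking the maximum rather than the average matches the question asked here: how long until every waiter is running again.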
I ran the program on an Intel quad Core i7 with hyperthreading enabled, the
most parallel machine I have easy access to. The results are the following:
| pthread_cond_broadcast() | 46.9 msec |
| pthread_cond_signal() | 17.6 msec |
Conclusion: pthread_cond_broadcast() is slower, as I
speculated. I would expect the effect to be more pronounced on systems with
more cores; it would be interesting if readers with access to such systems
could try the test program and comment below on the results.
Tags: freesoftware, mariadb, mysql, performance, programming
|