Kristian Nielsen's journal
February 14th, 2015
03:56 pm


Understanding skew factors in Simplex/Improved Perlin Noise

[Here is a PDF version for readers whose browser does not understand MathML.]

The Simplex Noise (or Improved Perlin Noise) algorithm uses a somewhat mysterious "skew factor" of $\frac{\sqrt{3}-1}{2}$. I did not find any really satisfactory explanation for this factor in the descriptions of Simplex Noise that I read. But I managed to work it out nevertheless, which I thought was a fun exercise and worth a quick write-up.

Simplex noise is constructed by assigning random values to each point in a simplex grid. The simplex grid is a tiling of the plane using rhombuses, each rhombus consisting of two equilateral triangles. See the figure on the right.

Given a point $\left(x,y\right)$ (expressed in normal rectangular coordinates), we first transform the coordinates into $\left(u,v\right)$ expressed in the simplex grid. Then we take the integer parts of u and v to find the corners of the containing equilateral triangle, and take the random values assigned to these corners. The "noise value" of the original point $\left(x,y\right)$ is then some suitable interpolation of these values with respect to the distance from $\left(x,y\right)$ to each corner.

The implementation of this algorithm is explained in detail in several places. The code to transform into and back out of the simplex grid might look like this:

```
final double F2 = 0.5*(Math.sqrt(3.0)-1.0);
double s = (xin+yin)*F2;
int u = fastfloor(xin+s);
int v = fastfloor(yin+s);
final double G2 = -(3.0-Math.sqrt(3.0))/6.0;
double t = (u+v)*G2;
double X0 = u+t;
double Y0 = v+t;
```
So the question is, where do these funny factors F2 and G2 come from?

To understand this, let us first consider the general form of the transformation from simplex coordinates $\left(u,v\right)$ in the grid spanned by $\stackrel{\to }{u}$ and $\stackrel{\to }{v}$ to the rectangular coordinates $\left(x,y\right)$. It is

$x=au+bv$
$y=cu+dv$
where $\stackrel{\to }{u}= \left(a,c\right)$ and $\stackrel{\to }{v}=\left(b,d\right)$. So this requires 4 multiplications in the general case.

However, we can freely choose which simplex grid to use! So we can try to choose one that reduces the computational work needed.

First, we can choose the orientation of the grid so that $\stackrel{\to }{u}$ and $\stackrel{\to }{v}$ are symmetric around the diagonal $x=y$. Then $a=d$ and $b=c$, so we can write the transformation as

$x=\left(a-b\right)u+b\left(u+v\right)$
$y=\left(a-b\right)v+b\left(u+v\right)$
Second, we can choose the scale of the grid so that $\left(a-b\right)=1$, and then we get simply
$x=u+t$
$y=v+t$
where $t=b\left(u+v\right)$. This simpler form requires only a single multiplication.

This is exactly the form we see in the above code snippet, with G2 being the name for the constant $b$. We can work out from this that the vectors that span the grid used by the code are

$\stackrel{\to }{u}= \left(1-\frac{3-\sqrt{3}}{6},-\frac{3-\sqrt{3}}{6}\right)$
$\stackrel{\to }{v}= \left(-\frac{3-\sqrt{3}}{6},1-\frac{3-\sqrt{3}}{6}\right)$
The interested reader is encouraged to check that $\stackrel{\to }{u}$, $\stackrel{\to }{v}$, and $\stackrel{\to }{u}+\stackrel{\to }{v}$ all have the same length, so that the triangles that form the grid indeed end up being equilateral.

Working out the inverse of the transformation (again a good exercise for the interested reader) leads to another transformation of the same simple form, this time with the constant $F2 =\frac{\sqrt{3}-1}{2}$ as seen in the code.

So I believe this is the explanation of the mysterious factors F2 and G2 in the example code found by Google searches on the net. They arise from the choice of the simplex grid to use. This choice is made to make the associated coordinate transformation use only a single multiplication, rather than the four required in the general case. This makes the implementation of the algorithm more efficient. Pretty clever, right?

March 29th, 2014
12:11 am


Arduino to the max: 11x11x11 LED-cube

March 29, 2014 is Arduino Day, also in Labitat. This is a good opportunity to describe my LED-cube:

This LED-cube pulls a number of tricks to get the most out of just a single normal Arduino Uno. A meager 16 MHz and 2048 bytes of RAM go a long way with sufficient ingenuity and creativity. Here are some highlights:

• 12-bit PWM with 16 non-linear grayscale levels.
• Animations generated on-board, read from SD-card, or streamed over USB.
• 178 Hz refresh rate, transferring 3 Mbit/s of data to the LED driver shift registers.
• 50 Hz animation framerate, receiving 269 kbit/s of animation data over the serial port.
• Approximately half of the CPU time left free for on-board generation of animations.
• Multi-processor Arduino! Uses the Atmega16U2 USB-handling chip on the Arduino Uno as a second processor, reading animations from an SD-card.
• Hand-crafted assembler to speed up time-critical parts.

## LED-cube basics

One of the first questions I get when I show the LED-cube to people is: How is it possible to turn individual LEDs on and off? After all, it looks like everything is connected to everything. Here is how it works.

The LEDs in the cube are organised into columns. The 11x11x11 cube thus has 11*11=121 columns, each containing 11 LEDs. Let us first see how a single column works:

All the cathodes are tied together and connected to ground. Each anode can be individually connected to the supply voltage using one of the switches. This makes it possible to turn on any single LED by closing the appropriate switch.

Now let us see how it works with multiple columns:

The anodes of one LED from each column are all tied together and connected by a single switch to the supply voltage. Each column now has its own switch connecting it to ground. Now, to turn on e.g. D5, we would close SW2 and SW5. To turn on D8 as well, we would additionally close SW6.

But what if we want to turn on D1 and D5? For this, we need to close all of SW1, SW2, SW4, and SW5. But that also turns on D2 and D4. Thus, to control every LED independently, we need something more.

The answer is multiplexing. We only ever have one of SW1, SW2, and SW3 turned on at a time. This lets us individually control each LED in one layer of the columns using the bottom column switches. Shortly after (half a millisecond in the 11x11x11 LED-cube), we turn off that switch and turn on the switch for the next layer, at the same time flipping the column switches as appropriate for the LEDs at that layer. By doing this quickly enough, the human eye is too slow to perceive any flickering, and we get the illusion that each LED is turned on or off independently, at will.

Using multiplexing, the required number of switches and connections is greatly reduced. In the 11x11x11 cube, we need only 11*11=121 switches for the columns, plus an additional 11 switches, one for the anodes in each horizontal layer. ("Only", compared to 11*11*11=1331.) In the cube structure, the column connections are the vertical structures, and the horizontal structures connect the anodes in each layer to a switch through 11 extra columns at the back of the cube.

Soldering the LED structure requires some soldering skills, but with 1331 LEDs to do, such skill will be naturally acquired before the end of the project. Most of all, patience is needed. There are several good pages on the internet describing in detail how the LEDs are soldered together for an LED-cube; here is one, for example.

The layer/anode switches are implemented using 11 P-channel MOSFETs, controlled directly from 11 GPIO pins on the Arduino. The column switches are implemented using 8 TLC5940 LED driver ICs. The TLC5940 has the additional benefit of being able to accurately control how much current each LED receives, as well as being able to dynamically adjust the LED brightness using 12-bit PWM (4096 intensity levels).

## Electronics

The electronics sit on a 20cm-by-20cm PCB located below the base of the LED-cube. The picture shows the PCB from the bottom side; the LED columns (and the connection to the 11 layers) enter through the yellow holes from the top, are bent 90 degrees, and soldered to the copper pads.

The 8 ICs spaced around the middle are the TLC5940 LED drivers, connected to each LED pad. The connections are made for ease of PCB layout, so the software has a lookup table that maps each LED column to the TLC5940 output to which it is connected. Along the right of the board are seen the 11 P-channel MOSFETs that supply the voltage for each layer, one at a time.

At the very right edge is seen the connector to the GPIO pins on the Arduino Uno. Originally, the cube was controlled by an actual Arduino Uno device, connected to the cube with a ribbon cable. But the current version uses instead a custom control PCB with a small, narrow footprint that just fits below the base of the LED-cube. This way all the electronics is hidden away below the very narrow base, resulting in a very nice elegant presentation of the cube:

This control PCB is similar to an Arduino Uno, but it is stripped of everything not needed for the cube, and has a micro-SD card slot and level-shifter soldered directly on the board (originally a small external micro-SD board was connected to the Arduino Uno). The micro-SD slot (not shown on the drawing) and the level-shifter are in the middle. To the left is the USB connector and the Atmega16U2 chip that handles the USB. To the right is the Atmega328 that runs the main code for the LED-cube. The connections to the main PCB are soldered to the pads along the right and top-right.

The PCBs were laid out with KiCAD; see the end of this article for links to source code. The main board was manufactured with the PCB CNC milling machine we have at Labitat. The small Arduino-clone PCB was manufactured by the SeeedStudio Fusion PCB service.

## Software

The LED-cube has 1331 LEDs, each of which can be PWM'ed for 16 different levels of brightness, from 0 (turned off) to 15 (fully on). The state of all the LEDs is stored in a framebuffer; with 4 bits per LED that amounts to 666 bytes. To avoid flicker, double-buffering is needed. With 2kByte of memory, the Arduino has just room for two framebuffers, with a bit of memory to spare for the rest of the activities.

The software has two main tasks to control the LED-cube:

1. Every 20 milliseconds, load one of the frame buffers with the next frame of the animation being played, for an animation framerate of 50 per second.
2. Every 500 microseconds, turn on the LEDs in one layer of the cube with the appropriate intensity, by programming the TLC5940 driver chips with the correct PWM value and turning on the power to that layer through the next P-channel MOSFET; resulting in a refresh rate for the entire 11-layer cube of around 178 Hz.

The loading of animation frames into the framebuffers can happen from three sources: They can be loaded from an SD-card, they can be streamed from a PC over the USB port, or they can be generated by code on-board the Arduino itself. The refresh of the LED intensities happens in a timer interrupt, using the SPI device on the Atmega328 to communicate with the TLC5940 driver ICs. The individual steps are detailed in the following sections.

## Communicating with the TLC5940 LED-driver ICs.

The TLC5940 LED-drivers need an external PWM clock; this is supplied from the Arduino from a timer in PWM mode. An 8 MHz PWM clock is used; this is the fastest clock that the 16 MHz Atmega328 can generate. One PWM cycle is 4096 PWM clocks, as the TLC5940 uses 12-bit PWM. It is important to always let the TLC5940s run an integral number of PWM cycles; otherwise the brightness of the LEDs will be off. So a timer interrupt runs every 4096 PWM clocks (one full PWM cycle) to trigger the TLC5940s; this works out to about 2 kHz for switching each layer on and off, or a 178 Hz refresh rate for all 11 layers. The TLC5940 ICs employ a latch, so the timer interrupt first latches the new data to get stable PWM timing, then goes to work sending out new data to the driver ICs, ready to be latched at the next timer interrupt.

The 8 TLC5940 ICs have a total of 128 outputs, each of which needs 12 bits of PWM data. That amounts to 1536 bits, or 192 bytes, of data to send to the LED drivers in each refresh cycle. One refresh cycle has a total of 8192 CPU clock cycles, and we need to leave as much of those as possible for generating / downloading the next frame of animation, so this step is carefully optimised for speed.

The data is shifted into the TLC5940s using data and clock pins, so we can use the Atmega328 SPI device in master mode. The maximum speed possible for the SPI device is an 8 MHz clock, which means 16 CPU cycles per byte, plus a couple extra to load the next byte, as the SPI device is only single-buffered, for a total of around 3500-4000 cycles of the 8192 available.

But we also need to prepare the data to send to the LED drivers. This involves a few steps per TLC5940 output:

1. Fetch from a lookup-table the index of the LED that is connected to the next TLC5940 output on the PCB (remember, those connections are in more or less random order to optimise the PCB layout).
2. Use that index into the frame buffer to fetch the 4-bit value that gives the required brightness of the LED.
3. Use the brightness value to index into another lookup table, which maps each possible brightness into the corresponding 12-bit PWM value. This mapping is made non-linear to improve the dynamic range available with just 16 different levels, as the lower PWM values are perceived by the human eye to be much easier to distinguish than the higher values.
4. Pack the 12-bit values of two consecutive outputs into three bytes suitable for transmission on the SPI device (it is thus easier to do two LEDs at a time, producing three whole bytes of SPI data for each loop iteration).
As it turns out, these operations together take something like 60 CPU cycles for two LEDs or 20 cycles per SPI output byte, which is another 3500-4000 cycles.

With 8192 cycles total, we cannot afford to first spend 4000 cycles preparing the data, then spend another 4000 cycles sending the data out. So we use a trick. By writing the code in assembler and carefully timing each instruction, we can overlap the two operations, inserting into the data generation algorithm instructions to feed the next byte of data to the SPI at exactly the points in time where they are needed.

In C it might look something like this, with the data generation algorithm on the left and the SPI data output on the right:

```
// Loop over the TLC5940 outputs two at a time
for (o = 127; o >= 0; o = o - 2)
{
    // Find index into frame buffer
    led_idx = led_map[o];                           SPDR = byte1;  // Output one byte
    // Load the 4-bit intensity from the frame buffer
    if (led_idx & 1)
        intensity = frame_buffer[(offset + led_idx)/2] & 0xf;
    else
        intensity = frame_buffer[(offset + led_idx)/2] >> 4;

    // Lookup the 12-bit PWM value from the intensity.
    pwm1 = pwm_lookup[intensity];

    // Same for second output
    led_idx = led_map[o+1];                         SPDR = byte2;  // Output one byte
    if (led_idx & 1)
        intensity = frame_buffer[(offset + led_idx)/2] & 0xf;
    else
        intensity = frame_buffer[(offset + led_idx)/2] >> 4;
    pwm2 = pwm_lookup[intensity];

    // Pack the two 12-bit PWM values into three SPI bytes
    byte1 = pwm1 >> 4;
    byte2 = (pwm1 & 0xf) << 4 | (pwm2 >> 8);
    byte3 = pwm2 & 0xff;                            SPDR = byte3;  // Output one byte
}
```
However, the algorithm is not really expressible in C. The instructions that output the data bytes to the SPI device need to be placed at points where those bytes have already been computed, and the timing between them must be just right, so that each output happens as soon as the SPI has completed the previous transfer (but no sooner, or the previous byte will be corrupted). The actual assembler code can be seen here, in the function `shift_out_frame_spi()`. The comments in the function mark the cycles spent on each instruction so that the SPI output can be spaced out with the correct timing.

This way, the code is able to send out data for two TLC5940 outputs in around 63 CPU cycles, or about 4000 cycles total per refresh cycle, leaving half the CPU time free for handling the frames of animation, which is nice. I think this is a rather interesting programming technique. It is a bit like multi-threading, but with the instruction scheduling hardcoded explicitly into the program.

## Serial reception

In addition to generating some animations on the Arduino itself, they can be streamed into the Atmega328 through the serial port. The protocol is mostly the raw binary data in the framebuffer (4 bits per LED), plus a couple of control bytes like start/end marker, frame number, and checksum, to facilitate synchronisation between sender and receiver. If the receiver detects that the frames are not received correctly, it sends back an error byte; the sender notices this and pauses the data for a few frames, and the two ends re-synchronise. This is a simple but effective technique that allows for efficient transfer of the data.

One frame of data on the serial port is 672 bytes, including control bytes. At 50 frames of animation per second, that amounts to 33600 bytes, or 268800 bits, per second. The serial port is run at 500000 bits per second; even with the overhead of start and stop bits, that comfortably accommodates the required data rate.

However, with this rather high data rate, care is needed to be able to process all received bytes sufficiently fast that no data is lost. The Atmega328 has just a single byte receive buffer. At 500kbps, a new byte arrives 50000 times per second, meaning that we have just 320 CPU cycles to process a received byte before it will be lost due to being overwritten by the next byte.

To handle this, a serial receive interrupt is employed. The interrupt is triggered whenever a byte is received by the serial device on the Atmega328, and we need to ensure that it will be serviced within at most 320 CPU cycles. The Atmega328 does not have interrupt priorities, but it does support nested interrupts. Interrupts are automatically disabled whenever an interrupt routine is invoked, but that routine can re-enable interrupts explicitly, which allows another, nested interrupt to be handled before the first one completes. Indeed, this is absolutely necessary in the cube's refresh timer interrupt, as it runs for several thousand cycles. Nested interrupts work well, but they require a lot of care; race conditions between conflicting interrupts can be quite hard to debug, and one also needs to protect against runaway interrupts (where the same interrupt is invoked recursively and repeatedly on top of itself until the stack is overrun).

With more than 30000 serial interrupts per second, we also want to make the code for the serial interrupt handler as efficient as possible. Unfortunately the AVR architecture does not exactly shine in this respect. Here is how a typical interrupt routine looks as generated by GCC:

```
push    r1
push    r0
in      r0, 0x3f
push    r0
eor     r1, r1
push    r16
push    r17
push    r18
push    r19
push    r20
push    r21
push    r22
push    r23
push    r24
push    r25
push    r26
push    r27
push    r28
push    r30
push    r31

...

pop     r31
pop     r30
pop     r28
pop     r27
pop     r26
pop     r25
pop     r24
pop     r23
pop     r22
pop     r21
pop     r20
pop     r19
pop     r18
pop     r17
pop     r16
pop     r0
out     0x3f, r0
pop     r0
pop     r1
reti
```
That is no fewer than 40 instructions of pure prologue and epilogue, most of which take two CPU cycles each.

Of course, in an interrupt routine, we do need to save and restore all registers used. However, most invocations of the serial interrupt do not need more than a few registers; just enough to grab the next byte from the serial device and put it into the frame buffer. Only for the control bytes at the start and end of a frame do we need more registers for more complex processing. Unfortunately, GCC always generates the code to push and pop all the registers unconditionally, even though some of them are only used in rarely executed code paths (the large number of call-clobbered registers in the AVR calling convention, which must be saved before calling a C function, is part of the problem here).

The solution is to write the serial interrupt in hand-optimised assembler. In the fast path, where we are just stuffing a byte into the framebuffer (and computing a checksum on-the-fly, incidentally), we only need to save three registers (plus the condition codes). That all can be done in just 26 instructions. Then in the slow path, the assembler code goes on to push all remaining registers and defer to the more complex processing in a C function.

The actual code can be seen here. The assembler code for the fast path is in `serial_interrupt_rx_naked()`, while the slow path is in the function `serial_interrupt_slow_part()`.

## Reading animations off SD-card

While some of the animations can be computed on-the-fly on-board the Arduino, some of the more complex ones are too much for the puny Atmega328 to handle. The problem usually is not so much processing speed as memory: with a total of just 2048 bytes of RAM, most of which is already reserved for frame buffers, stack space, and various global variables, not a lot is left to keep track of things like the position and velocity of the many particles in the fireworks animation. Using the serial port, we can generate the animations on a PC and stream them to the cube; however, it is also nice to be able to run the cube completely standalone: just plug it into power (or even run it off a battery) and it runs and displays animations on its own, without a laptop in tow. Thus the idea was born to pre-compute the animations and read them from an SD-card.

Now, at just 672 bytes per frame of animation, a 4 GB SD-card can store more than a day's worth of animation, so this is fine. The problem, however, is that we are running short on IO as well as on memory. Almost all available pins are already needed to control the 11 MOSFETs and to communicate with the TLC5940 ICs. Besides, the SPI device, which is normally used to communicate with an SD-card, is needed for communication with the TLC5940s, and the serial device (which can also do SPI) is needed for the USB-to-serial interface. So what can be done?

Let us take a closer look at the Arduino Uno, the top left corner near the USB connector:

What we see here is the Atmega16U2 IC, which comes pre-installed with firmware (based on LUFA) to control the USB port and function as a USB-to-serial proxy. It is however perfectly possible to modify that firmware to do additional things - such as connecting an SD-card! Furthermore, just next to it we have the ICP header; this is normally used to program the flash on the Atmega16U2, but it carries the pins for the SPI device, which is just what we need to communicate with an SD-card over SPI. And finally, we even have unsoldered pads with four spare GPIOs, one of which we will need as a chip-select pin for the SD-card.

So as it turns out, the Arduino Uno is in fact a multi-processor device! The LED-cube exploits this, connecting the SD-card to the ICP pins and spare GPIO of the Atmega16U2 MCU and hacking the USB firmware to handle the SD-card. If there is no activity on the USB port, the hacked firmware will check for the presence of an SD-card with animation data on it; if found, data will be streamed from the SD-card over the serial port to the Atmega328, which will handle the serial data the same, whether it originates from a PC at the other end of the USB, or from the SD card.

Now, using the Atmega16U2 in this way does present some challenges. The Atmega16U2 is only equipped with a meager 512 bytes of RAM, some of which is already needed for LUFA data and so on. The data on SD-cards is read one sector at a time, and a single sector is 512 bytes: already more than the RAM we have left. Most libraries for reading SD-cards and dealing with the FAT filesystem on them are based on reading one sector at a time into a buffer in RAM and processing it there; that just will not work when we have only a few hundred bytes of RAM to spare for the task.

Furthermore, most SD-card/FAT libraries are written in a traditional blocking style. That is, they provide some function you can call to read data from a file on the SD-card. Such a function takes a memory buffer (which we do not have the memory for), and it does not return to the caller until all of the requested data has been read, which means waiting at least for one sector to be read. That does not integrate well with the existing USB/LUFA firmware, which runs its own main loop that waits for activity on the USB device and does not return to the main program unless there is some activity to respond to.

To overcome these challenges, I wrote a small event-driven FAT library, seen in `ev_fat.h` and `ev_fat.c`. This library works in a streaming fashion, without any blocking. It never needs to process SD-card data in a memory buffer. Instead, the caller feeds it the bytes read off the SD-card one by one, and the library processes the bytes as they are received, keeping track of its state in a small data structure, and returning status information back to the caller about which sectors from the SD-card to read next.

```
/*
Open a named file in root dir of FAT file system.
Before calling, st->state must be initialised to 0.
Then the function must be repeatedly called until it returns
EV_FILE_ST_DONE or negative error code EV_FILE_ST_E*.

The returned status tells the next action to take, see comments in struct
ev_file_status for details.

When EV_FILE_ST_DONE is returned, the first sector of the file, and the
length in bytes of the file, is returned in st->st_get_block_done.
*/
int
ev_file_get_first_block(const char *filename, struct ev_file_status *st);

/*
After opening a file, this finds the next sector in the file. When calling
this function, st->st_get_block_done must be set to / retain the value set
by the previous call to ev_file_get_first_block() /
ev_file_get_next_block().  After EV_FILE_ST_DONE is returned the new sector
number is then found in st->st_get_block_done.
*/
int
ev_file_get_next_block(struct ev_file_status *st);

/*
This callback is used to stream bytes read as a response to a request
EV_FILE_ST_STREAM_BYTES. Each byte requested must be passed in, in
sequence. The return value is true if no more data needs to be streamed;
in this case it is permissible, but not required, to stop the read early
and not stream the rest of the requested bytes.
*/
int
ev_file_stream_bytes(uint8_t byte_read, struct ev_file_status *st);
```

With this library, the reading of the SD-card can be handled completely inside an SPI interrupt routine, without disturbing the LUFA USB code. Each time a byte has been processed in the communication between the Atmega16U2 and the SD-card, the SPI device triggers the SPI interrupt. This interrupt processes any byte received, updates its internal state, and loads the next byte to be processed into the SPI device data register. The interrupt is seen in `ISR(SPI_STC_vect)`. The code handles the protocol to connect to and initialise the SD-card, and then takes care of reading in sectors and passing the bytes to the event-driven FAT library.

When we get to actually read real file data out of the SD-card, we stream it directly to the serial port (where it will be received and processed by the Atmega328), to avoid the need for large memory buffers. The existing firmware already has a small FIFO used to buffer data for sending down the serial line. We re-use that, so that when no data is available from the USB for a few seconds we start filling up the FIFO with data from the SD-card instead. A serial device interrupt is triggered whenever the previous byte has been fully transmitted down the serial line; this interrupt fetches the next byte from the FIFO and loads it into the serial transmit data register. If the SD-card delivers data faster than the 500kbps serial line can transmit, we temporarily pause the SPI communication and resume it once the serial interrupt has made room for more data in the FIFO; the SD-card specifications explicitly mention this as a supported way to operate, precisely to help very small microcontrollers be able to process data without requiring excess buffering capabilities.

The end result is an extended USB firmware that retains all the original functionality (streaming serial data from a PC and so on; even flashing the Atmega328 over the serial port still works). And in addition, if the USB is idle and an SD-card is present, data is instead continuously streamed from the card, allowing completely stand-alone operation of the cube.

The code to handle all this does end up rather intricate, as can be imagined. Apart from the need to write a custom FAT-reading library, the precise timing between the different interrupt handlers ends up requiring quite a lot of careful coding and debugging. But in the end, I found that code to be quite an interesting exercise, and fun as well - and this is after all a for-the-fun-of-it type project.

## Calculating the animations

One of the nice touches of this LED-cube's visuals in particular is the availability of 16 different intensity levels. This allows for some nice effects, such as fading LEDs in and out to give a warmer, light-bulb-like perception, and using anti-aliasing to greatly reduce the disadvantage of the very limited 11-by-11-by-11 resolution.

All the animations are computed by this C++ program. The code is mostly a lot of math using vector computations, trigonometry, random number distributions, permutations, physics simulations, and other nice stuff. The end result is a sequential stream of animation frames that can be sent directly to the LED-cube over the serial port, or stored in a file on an SD-card for stand-alone playback.

## Conclusions, and source code

If I were to do this project today, I would probably use an ARM microcontroller like the STM32F4. Such a microcontroller is easily able to handle driving something like this LED-cube without the need for any special tricks due to its much larger memory and performance. But this was just a for-fun project, and it was interesting to see just how much could be squeezed out of the very popular AVR-based Arduino. That is quite a lot, as it turns out.

The nice thing about the LED-cube is: On the one hand it involves lots of tricky programming and advanced technology. On the other hand it has an immediate appeal to many different kinds of people, as is seen whenever we take it on display and it immediately draws the eyes of people passing by. The technology aspect is much harder to appreciate than the visual aspect. I have wanted to write up this article describing the project for some time, in all the gory technical details. I hope a few people will be able to use this write-up to appreciate the technical side as well as the visual side.

All of the code and design files for this project are available on GitHub under an open source license (GPL):

In addition, the Arduino code needs these header files by Esmil for easy access to Atmega registers and so on.

March 13th, 2014
05:56 pm


No application changes needed: 10 times faster slave with MariaDB 10 parallel replication

Parallel replication is in MariaDB 10.0. I did some benchmarks on the code in 10.0.9. The results are quite good! Here is a graph that shows a 10-times improvement when enabling parallel replication:

The graph shows transactions per second as a function of the number of slave worker threads, when the slave is executing events from the master at full speed. The master binlog was generated with sysbench oltp.lua. When the binlog is enabled on the slave and made crash-safe (`--sync-binlog=1 --innodb-flush-log-at-trx-commit=1`), the slave is about ten times faster at 12 worker threads and above, compared to the old single-threaded replication.

These results are for in-order parallel replication. With in-order, transactions are committed on the slave in strictly the same order as on the master, so that applications do not see any differences from using parallel replication. So no changes to the application are needed to use parallel replication; this is just standard sysbench 0.5 with a single table. This makes parallel replication particularly interesting, as it can be used with any existing applications, without the need to eg. split the data into independent schemas as is necessary with out-of-order techniques like the multi-threaded slave feature in MySQL 5.6. It does however make it much harder for the slave to find parallelism to exploit in the events received from the master, which is why it is interesting to see how much improvement can be obtained from normal workloads.
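The in-order constraint can be illustrated with a short sketch (a toy model, not MariaDB's actual scheduler): transactions execute in parallel on worker threads, but each waits its turn to commit, so the slave exposes exactly the master's commit order.

```python
# Toy model of in-order parallel apply: parallel execution,
# strictly ordered commit.
import threading

class InOrderApplier:
    def __init__(self):
        self.next_commit = 0          # sequence number allowed to commit next
        self.cond = threading.Condition()
        self.committed = []           # commit order actually observed

    def apply(self, seq, work):
        work()                        # the parallel part: execute the changes
        with self.cond:               # the serial part: commit in binlog order
            while self.next_commit != seq:
                self.cond.wait()
            self.committed.append(seq)
            self.next_commit += 1
            self.cond.notify_all()

applier = InOrderApplier()
# Start the workers in scrambled order to show ordering is enforced.
threads = [threading.Thread(target=applier.apply, args=(i, lambda: None))
           for i in reversed(range(8))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(applier.committed)  # prints [0, 1, 2, 3, 4, 5, 6, 7]
```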

(MariaDB 10.0 also supports out-of-order parallel replication, but that will be the subject of a different article.)

The hardware used for the sysbench oltp.lua is the same machine I used to benchmark group commit previously; I am told this is a machine that is typical for a "standard" datacenter server, with decent I/O on a RAID controller with battery-backed-up cache. Sysbench was run with 10 million rows in one table. The mysqld was configured with 16GB buffer pool and 2 times 1.9 gigabyte redo logs. The different graphs are as follows:

• binlog, crash-safe: `--log-slave-updates --sync-binlog=1 --innodb-flush-log-at-trx-commit=1`
• no binlog, durable: `--skip-log-slave-updates --innodb-flush-log-at-trx-commit=1`
• no binlog, non-durable: `--skip-log-bin --innodb-flush-log-at-trx-commit=2`
• binlog, non-crash-safe: `--log-slave-updates --sync-binlog=0 --innodb-flush-log-at-trx-commit=0`

For this test, the master was configured with `--binlog-commit-wait-count=12 --binlog-commit-wait-usec=10000`. This allows the master to delay a transaction up to 10 milliseconds in order to find up to 12 transactions that can commit in parallel; this helps a lot in improving parallel replication performance, since transactions that commit in parallel on the master can be executed in parallel on the slave.
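As a sketch, the master-side settings used in this test could be expressed as a my.cnf fragment (option names as above; the values are the ones used for this benchmark):

```
[mysqld]
# Wait up to 10000 microseconds to gather up to 12 transactions into
# one group commit; transactions that commit together on the master
# can then be applied in parallel on the slave.
binlog-commit-wait-count=12
binlog-commit-wait-usec=10000
```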

Adding such a delay will be acceptable for many applications as a way to speed up the slaves; in fact, in my test it did not affect master throughput at all. One attractive option might be to set up monitoring of the slaves, and if they start falling behind, temporarily increase the commit delay on the master, throttling it a bit while giving the slaves a better chance to catch up.

The other source of parallelism on the slave is that irrespective of how the transactions were executed on the master, the commit steps of different transactions can always be applied in parallel on the slave. This is particularly effective at improving performance when the commit step is expensive, as happens when a durable, crash-safe configuration is used. This is seen in the benchmark, where the speedup is particularly large when the slave is configured to be crash-safe and durable, to the point where parallel replication almost eliminates any performance penalty for enabling crash-safe binlog on the slaves. But significant speedup is seen in all the configurations.

(In fact, if you look closely, you will see that turning off the binlog ends up decreasing the performance of the slave. This is bug MDEV-5802; performance with the binlog disabled should improve once this bug is fixed.)

I think these are very promising results. I hope this will inspire users to give the new feature a test on real workloads, and share their experiences.

## Exploring the limits of parallel replication

I also wanted to see how the code works for workloads that are not favorable to parallel replication. For this I use sysbench update_index.lua. This benchmark creates transactions with a single primary-key update. With just 10 million rows in the table, the test is in-memory, and the actual work spent for the single-row update is rather small compared to the overhead of reading and parsing binlog events and scheduling the work between the various replication threads. So it is interesting to see if parallel replication can still give some smaller benefit here, or at least not make things much worse.

Here are the results for update_index.lua with 48 threads for the load on the master. No commit delay was used to help the parallel replication (`--binlog-commit-wait-count=0`):

As can be seen, even in this workload there is significant improvement from parallel replication. Again, the improvement is particularly high when commit is expensive, due to the possibility to do the commits in parallel and amortise the cost using group commit. But even with binlog disabled and non-durable InnoDB commit, we see some improvement, though only a modest one.

Finally, to test the absolutely worst-case scenario for parallel replication, I created another workload on the master, this time with update_index.lua running with just a single thread. This way, there is absolutely no opportunity for parallel replication to execute the actual transactions in parallel, though there is still some opportunity to speed up the commit step using group commit.

Here are the results for the single-threaded update_index.lua master load:

As can be seen, even in this extreme case, we do see some speedup when commits are expensive (durable/crash-safe configuration). But when durability and crash-safety is turned off, and commits are cheap, then the overhead of the extra thread scheduling shows, and the results of parallel replication are worse, if not disastrously so. Note by the way that with this test, the time taken by the slave to replicate the load was smaller than the time for the master to generate it with single-threaded sysbench.

## Conclusions

So overall, I am pleased with the results of the parallel replication code in MariaDB 10.0. The single-threaded applier has been a major limitation in replication for a long, long time, and I now feel fairly confident that this can play an important part in solving that. I hope this will inspire users to start testing their own loads, hopefully getting good results and/or reporting any issues found on the way.

To test, use MariaDB 10.0.9 or later on both master and slave. Configure the slave with `--slave-parallel-threads=20` or something like that. And optionally, if your application can tolerate some extra commit latency, set some reasonable values for `--binlog-commit-wait-count` and `--binlog-commit-wait-usec`.
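For example, the setup could look like this (a sketch; note that `slave_parallel_threads` can only be changed while the slave is stopped):

```
-- On the slave:
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 20;
START SLAVE;

-- Optionally, on the master (if the extra commit latency is acceptable):
SET GLOBAL binlog_commit_wait_count = 12;
SET GLOBAL binlog_commit_wait_usec = 10000;
```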

By the way, the raw numbers for the different runs can be seen in this mail on maria-developers@.

February 10th, 2014
07:23 pm


Using MASTER_GTID_WAIT() to avoid stale reads from slaves in replication

I have just implemented MASTER_GTID_WAIT() in MariaDB 10.0. This can be used to give a very elegant solution to the problem of stale reads in replication read-scaleout, without incurring the overheads normally associated with synchronous replication techniques. This idea came up recently in a discussion with Stephane Varoqui, and is similar to the concept of a Lamport logical clock described in this Wikipedia article.

I wanted to describe this here, hoping to encourage people to test it and maybe start using it; it is a simple but very neat idea.

A very typical use of MariaDB/MySQL replication is for read-scaleout. The application does all updates against a single master, which replicates to a set of slaves. The application can then distribute its reads among the slaves. If most of the database load is from reads, then this is an effective way to scale beyond what a single database server can handle.

The problem of stale reads occurs because MariaDB/MySQL replication is asynchronous. There is some delay from the commit of a change on the master server until that change becomes committed and visible on a slave server. This can be quite annoying in many cases: eg. a user changes some preferences and reloads the page, the read goes to a different slave server which is not yet caught up, and now the change looks as if it was lost.

The idea is to use the MariaDB Global Transaction ID (GTID for short) as a logical clock. When we do an update on the master, we can obtain the associated GTID from `@@LAST_GTID`. Then on the slave, we can execute a `MASTER_GTID_WAIT(<GTID from master>, <timeout>)` before doing a query for which read consistency is critical; this will ensure that the slave will catch up before the query is run, and that no stale read results. Or if we get a timeout, we can try another slave server, which hopefully is not lagging as much behind.
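A toy model of this mechanism (not the server implementation; the GTID is reduced to its sequence number for brevity, and 0/-1 are the success/timeout return values, as in the real function):

```python
# Toy slave model: MASTER_GTID_WAIT() as a wait on the slave's
# replicated-up-to watermark.
import threading

class SlaveModel:
    def __init__(self):
        self.applied_seq = 0               # highest GTID sequence applied
        self.cond = threading.Condition()

    def apply_event(self, seq):
        # Called by the (simulated) replication applier thread.
        with self.cond:
            self.applied_seq = seq
            self.cond.notify_all()

    def master_gtid_wait(self, seq, timeout):
        # Block until the slave has applied up to seq, or until timeout.
        with self.cond:
            ok = self.cond.wait_for(lambda: self.applied_seq >= seq, timeout)
        return 0 if ok else -1

slave = SlaveModel()
slave.apply_event(41)                      # slave has caught up to seq 41
print(slave.master_gtid_wait(41, 0.1))     # already applied -> prints 0
print(slave.master_gtid_wait(99, 0.1))     # still lagging -> prints -1
```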

Typically, we would pass around the GTID (the logical clock value) between different parts of the system to maintain consistency where it is needed and avoid stale reads. Eg. in a web application, we would set a cookie with the GTID when an update is made to the database. Then on the next HTTP request, we have the GTID which is needed in `MASTER_GTID_WAIT()`, even if it goes to a completely different web server. Or if we send an email, we can encode the GTID in any contained link, to be sure that it will work whichever slave server it might end up in, possibly in a different data center on a different continent.

By passing around the GTID whenever we communicate between parts of the system, we can ensure that if transaction A is known to have occurred before transaction B, then any change made by A will also be visible to B. If there was no communication (direct or indirect), then B cannot know that A came before - and if there was communication, we can avoid stale reads by passing the GTID as part of that communication.

The great thing about this technique is that it is optional. We can use it for just the queries where avoiding stale reads is critical, and only there take the associated penalty (in the form of increased latency). Other parts of the database or application will not be negatively affected. This is much more flexible than using some form of synchronous replication, which will incur some penalty for all parts of the system.

So for applications that require more performance than what is currently possible to get from fully synchronous replication, we can choose consistency in a few critical places where it matters, and go for the performance of asynchronous replication in the rest.

I am hoping that some users will start to do tests with this, and I am very eager to hear any feedback and suggestions for further improvements (the maria-developers@ list is a good place for those). I think this can become a very useful technique in many cases.

By the way, note that it is not necessary to actually switch the replication to GTID to use this technique, as in MariaDB the GTIDs are always generated and replicated, even if they are not used for the slave when connecting to the master. This should make it simple for people to start testing. One will need to run MariaDB 10.0 on both the master and the slave, of course - but this could be achieved by setting up one MariaDB server as a slave of the main system (which could be running MySQL or Percona Server or MariaDB 5.5 or whatever), and then setting up another MariaDB 10.0 slave using the first slave as master.

If this turns out to be something that people find generally useful, it would be interesting to integrate this more tightly into the client protocol and client API. The `@@LAST_GTID` could be sent automatically back with the query results, the same way that `@@LAST_INSERT_ID` is done. And another option could enable the automatic execution of the `MASTER_GTID_WAIT()` call before queries sent to the slave servers. This could make the technique almost transparent for the end application, and could help avoid the overhead of extra client-server roundtrips and extra SQL parsing.

However, that is only worth doing if this turns out to be something that will actually be shown to be useful in practice. So if you want to see that happen eventually, then get testing and send in that feedback!

The code is available in the 10.0 bzr tree now, and will be in the next (10.0.9) release of MariaDB.

January 23rd, 2014
11:44 am


More on 40% better single-threaded performance in MariaDB

In my previous post I wrote about how I achieved a >40% speedup on sysbench read-only using profile-guided optimisation (PGO). While this is a preliminary result, I thought it was so interesting that it deserved early mention. The fact that any benchmark can be improved that much shows clearly that PGO is something worth looking into. Even if we will probably not improve all workloads by 40%, it seems highly likely that we can also obtain significant gains for many real workloads.

I had one or two interesting comments on the post that raise valid concerns, so I wanted to write a follow-up here, explaining some of the points in more detail and going deeper into the performance counter measurements. As I wrote before, actual observations and measurements are crucial to fully understand the performance of complex code on modern CPUs. Intuition and guesswork just do not suffice.

## On branch mispredictions

bludwarf suggested that branch mispredictions could be an important part of the picture:

```
> It would be interesting to see the hottest spots.... Having to execute just
> a lot of code is not an explanation. If it would be so, then profiler
> feedback won't affect performance so much. I believe the reson for so big
> difference for the results is branch misprediction.
```
I had similar thoughts when I first saw the results, but I was able to verify my explanation with actual measurements of instructions retired and icache misses before and after profile-guided optimisation:
```
Before PGO:
    171821  INST_RETIRED.ANY
     17730  ICACHE.MISSES

After PGO:
    166686  INST_RETIRED.ANY
     13643  ICACHE.MISSES
```

Before PGO, ICACHE.MISSES/INST_RETIRED.ANY = 10.3%. After, we get 8.2%. So we do get significantly fewer icache misses. I wrote in my previous post about some plausible reasons for that, with my remarks on basic blocks, though I have not tried to verify those with actual measurements.
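These ratios are easy to recompute from the raw counts (a quick sanity check):

```python
# ICACHE.MISSES / INST_RETIRED.ANY, before and after PGO.
before = 17730 / 171821
after = 13643 / 166686
print(round(100 * before, 1), round(100 * after, 1))  # prints: 10.3 8.2
```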

I also checked branch mispredictions. The Intel manual has two measures to estimate the cost of branch misprediction. One is time wasted due to speculatively issued uops that do not retire, %Bad_Speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4*CPU_CLK_UNHALTED.THREAD). The other is cycles wasted based on an estimated cost of 20 cycles per misprediction, %BR.MISP.COST = 20 * BR_MISP_RETIRED.ALL_BRANCHES_PS / CPU_CLK_UNHALTED.THREAD. Here are the relevant measurements taken before and after (two sets of measurements, as Sandy Bridge supports only 4 simultaneous performance counters):

```
Before PGO:
    486641  CPU_CLK_UNHALTED.THREAD
    205880  UOPS_RETIRED.RETIRE_SLOTS
    226184  UOPS_ISSUED.ANY
      7219  INT_MISC.RECOVERY_CYCLES
    %Bad_Speculation = 2.5%

After PGO:
    335051  CPU_CLK_UNHALTED.THREAD
    194616  UOPS_RETIRED.RETIRE_SLOTS
    216353  UOPS_ISSUED.ANY
      6830  INT_MISC.RECOVERY_CYCLES
    %Bad_Speculation = 3.7%

Before PGO:
    490250  CPU_CLK_UNHALTED.THREAD
      1310  BR_MISP_RETIRED.ALL_BRANCHES_PS
    %BR.MISP.COST = 5.3%

After PGO:
    338524  CPU_CLK_UNHALTED.THREAD
      1147  BR_MISP_RETIRED.ALL_BRANCHES_PS
    %BR.MISP.COST = 6.8%
```
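As a sanity check, both metrics can be recomputed directly from the counters, using the two formulas given above:

```python
# %Bad_Speculation: fraction of issue slots wasted on uops that never retire.
def bad_speculation(uops_issued, uops_retired_slots, recovery_cycles, clk):
    return (uops_issued - uops_retired_slots + 4 * recovery_cycles) / (4 * clk)

# %BR.MISP.COST: cycles wasted, assuming ~20 cycles per misprediction.
def br_misp_cost(mispredicted, clk):
    return 20 * mispredicted / clk

print(round(100 * bad_speculation(226184, 205880, 7219, 486641), 1))  # 2.5 (before)
print(round(100 * bad_speculation(216353, 194616, 6830, 335051), 1))  # 3.7 (after)
print(round(100 * br_misp_cost(1310, 490250), 1))                     # 5.3 (before)
print(round(100 * br_misp_cost(1147, 338524), 1))                     # 6.8 (after)
```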

Note that the actual number of branch mispredictions is much the same. But with less time spent on other bottlenecks, the estimated relative cost becomes higher.

Most of the branch mispredictions are due to indirect branches (virtual functions and @plt entries in shared libraries). Even without PGO the CPU is able to predict the other branches very well, so the compiler or __builtin_expect hints cannot help much.

At the end of this post, I have appended the top of some per-function profiles. They show how branch mispredictions are mostly from commonly called library functions like memcpy() (these are mispredictions of the indirect jumps in the PLT), and from virtual function calls. They also show how the icache misses are spread out more or less evenly, correlating strongly with the count of actual instructions executed in various parts of the code.

## On better coverage of the PGO

A comment from Anonymous raised another valid concern:

```
> If I read you correctly, you are using the same benchmark for optimizing as
> well as measuring performance increases. You need to have different test and
> training sets or your measurements don't generalize.
```

This is true, of course. The sysbench read-only benchmark exercises only an extremely narrow subset of mysqld functionality. The next step, on which I am already working, is to implement a more suitable and general workload for the profile, which can hopefully cover, and hence speed up, a significant portion of common real-life workloads.
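For reference, the PGO cycle itself looks roughly like this (a sketch; the GCC flags are real, but the build commands and paths are illustrative):

```
# Step 1: build an instrumented binary.
gcc -O3 -fprofile-generate ... -o mysqld

# Step 2: run the training load (eg. gen_profile_load) against the
#         instrumented server; each run writes .gcda profile files.

# Step 3: rebuild using the recorded profile.
#         -fprofile-correction smooths out inconsistencies in profiles
#         collected from multi-threaded runs.
gcc -O3 -fprofile-use -fprofile-correction ... -o mysqld
```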

More tests will be needed to be sure that real-world gains can be achieved, but I am very hopeful. After all, much of the code exercised by sysbench read-only is also used in many other workloads, and many of its properties (eg. the infrequency of the lots and lots of error-check branches) should be similar.

My main concern is that GCC will choose to de-optimise code that is not covered by the test load, to the extent that it significantly harms performance in some cases. But it should not be too much of a problem - after all, GCC will still try to optimise the generated code as best it can, and if the information it has from the profile is incomplete, it should not be much worse than having no information, as is the case without PGO. And over time, we can extend the coverage of the test load to help eliminate any problems that do turn up.

Since the last post, I implemented a first version of a program that generates a test load. For now, I put it on github: gen_profile_load. The program generates a range of different loads, such as simple key lookups, small joins, inserts/updates/deletes, under various conditions such as with/without binlogging, with/without prepared statements, and so on.

I plan to next do some tests with using this for doing the PGO, and checking the effect on sysbench read-only and possibly other benchmarks. Another idea I had was to try to use only parts of the generated load, and test that this does not cause significant regressions on the omitted parts. This could help in getting some idea of what kind of (hopefully small) performance regressions can be expected for workloads that are not well covered by the PGO.

## Detailed profiles

These are pasted from the output of `perf report`. They show the top functions in terms of instructions executed, icache misses suffered, and branches mispredicted.

Instructions retired (INST_RETIRED.ANY), before PGO:

```
Samples: 171K of event 'rc0', Event count (approx.): 171821
1,94%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,92%  mysqld  mysqld               [.] my_hash_sort_bin
1,60%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,48%  mysqld  mysqld               [.] malloc
1,36%  mysqld  mysqld               [.] free
1,34%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,31%  mysqld  mysqld               [.] alloc_root
1,30%  mysqld  mysqld               [.] JOIN::optimize_inner()
1,16%  mysqld  mysqld               [.] my_malloc_size_cb_func
1,14%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
1,08%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
1,08%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,07%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
1,01%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
1,00%  mysqld  mysqld               [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*)
0,98%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
0,93%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,90%  mysqld  mysqld               [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long)
0,88%  mysqld  [vdso]               [.] 0x000000000000070c
0,83%  mysqld  mysqld               [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
0,79%  mysqld  libc-2.13.so         [.] __memset_sse2
0,78%  mysqld  libc-2.13.so         [.] __strlen_sse42
0,77%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,77%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,71%  mysqld  mysqld               [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227
0,70%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,70%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,70%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
0,69%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,69%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231
0,68%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,60%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,59%  mysqld  [kernel.kallsyms]    [k] effective_load
0,56%  mysqld  [kernel.kallsyms]    [k] select_task_rq_fair
0,56%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,50%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,48%  mysqld  mysqld               [.] my_malloc
0,48%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
0,48%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,47%  mysqld  [kernel.kallsyms]    [k] update_entity_load_avg
0,46%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,46%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,45%  mysqld  mysqld               [.] Item_func::type() const
0,44%  mysqld  [kernel.kallsyms]    [k] update_curr
0,44%  mysqld  [kernel.kallsyms]    [k] copy_user_generic_string
0,44%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,43%  mysqld  mysqld               [.] my_lengthsp_8bit
0,43%  mysqld  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
0,43%  mysqld  [kernel.kallsyms]    [k] init_sync_kiocb.constprop.15
0,41%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,40%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
0,36%  mysqld  mysqld               [.] Item_equal::Item_equal(Item*, Item*, bool)
0,36%  mysqld  mysqld               [.] Item::Item()
0,35%  mysqld  [kernel.kallsyms]    [k] atomic64_dec_and_test
0,35%  mysqld  mysqld               [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result)
0,35%  mysqld  mysqld               [.] JOIN::exec_inner()
0,35%  mysqld  [kernel.kallsyms]    [k] account_entity_enqueue
0,35%  mysqld  [kernel.kallsyms]    [k] do_sys_poll
0,35%  mysqld  [kernel.kallsyms]    [k] system_call
0,34%  mysqld  mysqld               [.] JOIN::destroy()
0,34%  mysqld  mysqld               [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)
```

Icache misses (ICACHE.MISSES), before PGO:

```
Samples: 17K of event 'r280', Event count (approx.): 17730
2,21%  mysqld  mysqld               [.] JOIN::optimize_inner()
2,04%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,91%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,42%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,27%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,20%  mysqld  mysqld               [.] malloc
1,09%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
1,07%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,90%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,90%  mysqld  [vdso]               [.] 0x0000000000000771
0,86%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
0,83%  mysqld  mysqld               [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227
0,73%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,72%  mysqld  mysqld               [.] alloc_root
0,68%  mysqld  mysqld               [.] JOIN::exec_inner()
0,65%  mysqld  libc-2.13.so         [.] __memset_sse2
0,64%  mysqld  mysqld               [.] JOIN::destroy()
0,63%  mysqld  mysqld               [.] free
0,59%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,55%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231
0,55%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,54%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,54%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,52%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,51%  mysqld  mysqld               [.] my_malloc_size_cb_func
0,51%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,51%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,50%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
0,47%  mysqld  mysqld               [.] setup_conds(THD*, TABLE_LIST*, List<TABLE_LIST>&, Item**)
0,47%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,47%  mysqld  mysqld               [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)
0,47%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,47%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,46%  mysqld  mysqld               [.] trx_commit(trx_t*)
0,44%  mysqld  [kernel.kallsyms]    [k] fget_light
0,43%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
0,43%  mysqld  [kernel.kallsyms]    [k] system_call
0,42%  mysqld  mysqld               [.] JOIN::cleanup(bool)
0,41%  mysqld  mysqld               [.] handle_select(THD*, LEX*, select_result*, unsigned long)
0,41%  mysqld  mysqld               [.] mtr_commit(mtr_t*)
0,41%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
0,41%  mysqld  mysqld               [.] my_hash_sort_bin
0,40%  mysqld  mysqld               [.] open_table(THD*, TABLE_LIST*, st_mem_root*, Open_table_context*)
0,40%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
0,39%  mysqld  mysqld               [.] buf_page_get_known_nowait(unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)
0,39%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
0,38%  mysqld  mysqld               [.] _ZL21join_read_const_tableP13st_join_tableP11st_position.isra.226
0,38%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,38%  mysqld  mysqld               [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*)
0,37%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
0,37%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
0,37%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
0,36%  mysqld  mysqld               [.] add_key_field(JOIN*, key_field_t**, unsigned int, Item_func*, Field*, bool, Item**, unsigned int, unsigned long long, st_sargable_param**)
0,36%  mysqld  mysqld               [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**)
0,35%  mysqld  [kernel.kallsyms]    [k] update_rq_clock
0,34%  mysqld  mysqld               [.] do_select(JOIN*, List<Item>*, TABLE*, Procedure*)
0,33%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
0,33%  mysqld  mysqld               [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result)
0,33%  mysqld  [kernel.kallsyms]    [k] __schedule
0,33%  mysqld  mysqld               [.] Item::Item()
0,32%  mysqld  mysqld               [.] thd_ha_data
```

Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), before PGO:

```
Samples: 1K of event 'rc5:p', Event count (approx.): 1310
7,02%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
6,64%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
3,74%  mysqld  libpthread-2.13.so   [.] pthread_mutex_unlock
3,66%  mysqld  mysqld               [.] Item_func::type() const
3,05%  mysqld  libc-2.13.so         [.] __memset_sse2
2,67%  mysqld  mysqld               [.] Field_long::type() const
2,60%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
2,52%  mysqld  libc-2.13.so         [.] __strcmp_sse42
2,06%  mysqld  mysqld               [.] Item_field::used_tables() const
2,06%  mysqld  mysqld               [.] Item_param::used_tables() const
2,06%  mysqld  libc-2.13.so         [.] __strlen_sse42
1,76%  mysqld  librt-2.13.so        [.] clock_gettime
1,68%  mysqld  mysqld               [.] Item::cmp_type() const
1,60%  mysqld  mysqld               [.] st_select_lex::master_unit()
1,37%  mysqld  mysqld               [.] Item_equal::functype() const
1,30%  mysqld  mysqld               [.] Item::const_item() const
1,30%  mysqld  mysqld               [.] Item::real_item()
1,07%  mysqld  mysqld               [.] Field_str::charset() const
0,92%  mysqld  mysqld               [.] Protocol::end_statement()
0,92%  mysqld  mysqld               [.] Item_field::type() const
0,84%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,84%  mysqld  mysqld               [.] Item_field::cleanup()
0,84%  mysqld  mysqld               [.] Item_field::field_type() const
0,84%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
0,84%  mysqld  libpthread-2.13.so   [.] __libc_recv
0,76%  mysqld  mysqld               [.] Item_param::save_in_field(Field*, bool)
0,76%  mysqld  mysqld               [.] my_hash_sort_bin
0,69%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
0,69%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
0,69%  mysqld  mysqld               [.] scheduler_wait_net_end
0,69%  mysqld  mysqld               [.] Item_param::val_int()
0,69%  mysqld  mysqld               [.] Item::check_cols(unsigned int)
0,69%  mysqld  mysqld               [.] Item_field::result_type() const
0,61%  mysqld  mysqld               [.] Field_long::pack_length() const
0,61%  mysqld  mysqld               [.] vio_read
0,61%  mysqld  [kernel.kallsyms]    [k] system_call
0,53%  mysqld  mysqld               [.] Item::is_expensive()
0,53%  mysqld  mysqld               [.] st_select_lex::outer_select()
0,53%  mysqld  mysqld               [.] do_handle_one_connection(THD*)
0,53%  mysqld  mysqld               [.] Explain_select::~Explain_select()
0,53%  mysqld  mysqld               [.] Field_num::result_type() const
0,53%  mysqld  mysqld               [.] Item_param::result_type() const
0,53%  mysqld  mysqld               [.] Item_param::field_type() const
0,53%  mysqld  mysqld               [.] Item_equal::count_sargable_conds(unsigned char*)
0,53%  mysqld  mysqld               [.] Item_func_eq::functype() const
0,53%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
0,53%  mysqld  [kernel.kallsyms]    [k] __pollwait
0,53%  mysqld  [kernel.kallsyms]    [k] sock_def_readable
0,46%  mysqld  mysqld               [.] net_before_header_psi(st_net*, void*, unsigned long)
0,46%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,46%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
0,46%  mysqld  mysqld               [.] handler::index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)
0,46%  mysqld  mysqld               [.] Item_equal::select_optimize() const
0,46%  mysqld  mysqld               [.] Item_func::used_tables() const
0,46%  mysqld  mysqld               [.] ha_innobase::info(unsigned int)
0,46%  mysqld  libc-2.13.so         [.] __memmove_ssse3
0,46%  mysqld  libc-2.13.so         [.] __memcmp_sse4_1
0,46%  mysqld  [kernel.kallsyms]    [k] dequeue_task_fair
0,46%  mysqld  [kernel.kallsyms]    [k] pick_next_task_fair
0,38%  mysqld  mysqld               [.] Protocol_binary::store(char const*, unsigned long, charset_info_st const*)
0,38%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
```

Instructions retired (INST_RETIRED.ANY), after PGO:

```Samples: 166K of event 'rc0', Event count (approx.): 166686
2,55%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,79%  mysqld  mysqld               [.] my_hash_sort_bin
1,60%  mysqld  mysqld               [.] alloc_root
1,48%  mysqld  mysqld               [.] malloc
1,42%  mysqld  mysqld               [.] my_malloc_size_cb_func
1,42%  mysqld  mysqld               [.] free
1,29%  mysqld  libc-2.13.so         [.] __memset_sse2
1,23%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
1,20%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
1,16%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
1,15%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,15%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,13%  mysqld  mysqld               [.] JOIN::optimize_inner()
1,12%  mysqld  mysqld               [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long)
1,07%  mysqld  [vdso]               [.] 0x00000000000006a1
1,03%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,00%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
0,99%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,94%  mysqld  libc-2.13.so         [.] __strlen_sse42
0,83%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
0,80%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,78%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,78%  mysqld  mysqld               [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*)
0,75%  mysqld  mysqld               [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
0,69%  mysqld  mysqld               [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*)
0,68%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183
0,68%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,63%  mysqld  [kernel.kallsyms]    [k] effective_load
0,63%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,62%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,59%  mysqld  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
0,57%  mysqld  [kernel.kallsyms]    [k] select_task_rq_fair
0,56%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
0,56%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,52%  mysqld  [kernel.kallsyms]    [k] update_curr
0,52%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,52%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,51%  mysqld  mysqld               [.] my_malloc
0,50%  mysqld  [kernel.kallsyms]    [k] update_entity_load_avg
0,46%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,46%  mysqld  [kernel.kallsyms]    [k] copy_user_generic_string
0,45%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,43%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,43%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,42%  mysqld  mysqld               [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*)
0,42%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
0,41%  mysqld  [kernel.kallsyms]    [k] account_entity_enqueue
0,39%  mysqld  [kernel.kallsyms]    [k] init_sync_kiocb.constprop.15
0,39%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
0,39%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,38%  mysqld  mysqld               [.] my_lengthsp_8bit
0,38%  mysqld  [kernel.kallsyms]    [k] system_call
0,38%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,37%  mysqld  mysqld               [.] Item::Item()
0,36%  mysqld  [kernel.kallsyms]    [k] atomic64_dec_and_test
0,35%  mysqld  librt-2.13.so        [.] clock_gettime
0,34%  mysqld  mysqld               [.] JOIN::cleanup(bool)
0,34%  mysqld  mysqld               [.] Item_func::type() const
0,33%  mysqld  mysqld               [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long)
0,33%  mysqld  mysqld               [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
0,33%  mysqld  [kernel.kallsyms]    [k] __update_tg_runnable_avg.isra.25
```

Icache misses (ICACHE.MISSES), after PGO:

```Samples: 13K of event 'r280', Event count (approx.): 13643
2,07%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,39%  mysqld  mysqld               [.] malloc
1,33%  mysqld  mysqld               [.] JOIN::optimize_inner()
1,26%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,17%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,08%  mysqld  [vdso]               [.] 0x0000000000000864
1,07%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,06%  mysqld  libc-2.13.so         [.] __memset_sse2
1,01%  mysqld  mysqld               [.] free
0,97%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
0,92%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,80%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,76%  mysqld  mysqld               [.] alloc_root
0,74%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
0,73%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,71%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,69%  mysqld  mysqld               [.] my_hash_sort_bin
0,67%  mysqld  [kernel.kallsyms]    [k] fget_light
0,63%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,62%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,59%  mysqld  mysqld               [.] Item::Item()
0,57%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183
0,53%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
0,52%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
0,51%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,51%  mysqld  mysqld               [.] my_malloc_size_cb_func
0,50%  mysqld  mysqld               [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*)
0,48%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,47%  mysqld  mysqld               [.] free_root
0,47%  mysqld  [kernel.kallsyms]    [k] idle_cpu
0,47%  mysqld  [kernel.kallsyms]    [k] unix_stream_sendmsg
0,46%  mysqld  [kernel.kallsyms]    [k] system_call
0,44%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,43%  mysqld  mysqld               [.] ha_innobase::info_low(unsigned int, bool)
0,43%  mysqld  mysqld               [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
0,43%  mysqld  librt-2.13.so        [.] clock_gettime
0,43%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
0,42%  mysqld  mysqld               [.] JOIN::init(THD*, List<Item>&, unsigned long long, select_result*)
0,42%  mysqld  mysqld               [.] Item_field::used_tables() const
0,41%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,40%  mysqld  mysqld               [.] Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)
0,40%  mysqld  [kernel.kallsyms]    [k] emulate_vsyscall
0,40%  mysqld  mysqld               [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long)
0,40%  mysqld  [kernel.kallsyms]    [k] update_rq_clock
0,39%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,38%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
0,38%  mysqld  [kernel.kallsyms]    [k] update_entity_load_avg
0,37%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,37%  mysqld  mysqld               [.] Item_field::fix_fields(THD*, Item**)
0,37%  mysqld  mysqld               [.] my_malloc
0,37%  mysqld  [kernel.kallsyms]    [k] __perf_event_task_sched_out
0,37%  mysqld  [kernel.kallsyms]    [k] __schedule
0,36%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,36%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
0,35%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,35%  mysqld  mysqld               [.] Prepared_statement::execute(String*, bool)
0,34%  mysqld  mysqld               [.] thd_ha_data
0,34%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
0,34%  mysqld  [kernel.kallsyms]    [k] select_task_rq_fair
0,34%  mysqld  [kernel.kallsyms]    [k] do_sys_poll
0,34%  mysqld  mysqld               [.] st_select_lex::master_unit()
```

Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), after PGO:

```Samples: 1K of event 'rc5:p', Event count (approx.): 1147
8,37%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
7,06%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
3,84%  mysqld  mysqld               [.] Item_func::type() const
2,88%  mysqld  libc-2.13.so         [.] __memset_sse2
2,70%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
2,35%  mysqld  mysqld               [.] st_select_lex::master_unit()
2,27%  mysqld  libpthread-2.13.so   [.] pthread_mutex_unlock
2,18%  mysqld  mysqld               [.] Item_param::used_tables() const
2,01%  mysqld  mysqld               [.] Field_long::type() const
2,01%  mysqld  libc-2.13.so         [.] __strlen_sse42
1,92%  mysqld  libc-2.13.so         [.] __strcmp_sse42
1,39%  mysqld  mysqld               [.] Item::const_item() const
1,31%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
1,22%  mysqld  mysqld               [.] handler::index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function)
1,13%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
1,13%  mysqld  mysqld               [.] end_send(JOIN*, st_join_table*, bool)
1,05%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
1,05%  mysqld  mysqld               [.] Item_equal::functype() const
1,05%  mysqld  mysqld               [.] Item_bool_func2::cleanup()
0,96%  mysqld  mysqld               [.] Item::is_expensive()
0,96%  mysqld  mysqld               [.] Item_field::type() const
0,96%  mysqld  mysqld               [.] Item_param::type() const
0,87%  mysqld  mysqld               [.] Item::real_item()
0,78%  mysqld  mysqld               [.] Item_field::used_tables() const
0,78%  mysqld  mysqld               [.] scheduler_wait_net_end
0,78%  mysqld  librt-2.13.so        [.] clock_gettime
0,70%  mysqld  mysqld               [.] Statement::set_statement(Statement*)
0,70%  mysqld  mysqld               [.] _ZL21join_read_const_tableP13st_join_tableP11st_position.isra.180
0,70%  mysqld  mysqld               [.] JOIN::optimize()
0,70%  mysqld  mysqld               [.] Field_num::size_of() const
0,70%  mysqld  mysqld               [.] Field_str::charset() const
0,70%  mysqld  mysqld               [.] Item_field::cleanup()
0,70%  mysqld  libpthread-2.13.so   [.] pthread_rwlock_rdlock
0,61%  mysqld  mysqld               [.] select_result::initialize_tables(JOIN*)
0,61%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,61%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
0,61%  mysqld  mysqld               [.] JOIN::optimize_inner()
0,61%  mysqld  mysqld               [.] Field_str::decimals() const
0,61%  mysqld  mysqld               [.] handler::rebind_psi()
0,61%  mysqld  mysqld               [.] Item_field::field_type() const
0,61%  mysqld  mysqld               [.] Item::cmp_type() const
0,52%  mysqld  mysqld               [.] net_before_header_psi(st_net*, void*, unsigned long)
0,52%  mysqld  mysqld               [.] my_net_read
0,52%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,52%  mysqld  mysqld               [.] Field::new_key_field(st_mem_root*, TABLE*, unsigned char*, unsigned char*, unsigned int)
0,52%  mysqld  mysqld               [.] Field::send_binary(Protocol*)
0,52%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
0,52%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,52%  mysqld  [vdso]               [.] 0x0000000000000630
0,44%  mysqld  mysqld               [.] Protocol::send_eof(unsigned int, unsigned int)
0,44%  mysqld  mysqld               [.] Item::has_subquery() const
0,44%  mysqld  mysqld               [.] Item::mark_as_condition_AND_part(TABLE_LIST*)
0,44%  mysqld  mysqld               [.] select_send::send_eof()
0,44%  mysqld  mysqld               [.] select_send::send_data(List<Item>&)
0,44%  mysqld  mysqld               [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**)
0,44%  mysqld  mysqld               [.] do_handle_one_connection(THD*)
0,44%  mysqld  mysqld               [.] Field_str::repertoire() const
0,44%  mysqld  mysqld               [.] handler::cancel_pushed_idx_cond()
0,44%  mysqld  mysqld               [.] Item_field::get_item_equal()
0,44%  mysqld  mysqld               [.] Item::check_cols(unsigned int)
0,44%  mysqld  mysqld               [.] Item_field::fix_fields(THD*, Item**)
```

January 20th, 2014
02:13 pm

[Link]

40% better single-threaded performance in MariaDB

Continuing my investigation of single-threaded performance in the MariaDB server, I have managed to increase the throughput of single-threaded read-only sysbench by more than 40% so far.

I use read-only sysbench 0.4.12 run like this:

```    sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx run
```

And mysqld is run with minimal options:

```    sql/mysqld --no-defaults --basedir=X --datadir=Y --innodb-buffer-pool-size=128M
```

With modern high-performance CPUs, it is necessary to do detailed measurements using the built-in performance counters in order to get any kind of understanding of how an application performs and what the bottlenecks are. Forget about looking at the code and counting instructions or cycles as we did in the old days. It no longer works, not even to within an order of magnitude.

I am using the Linux `perf` program for this. During my investigations, I found that the main bottleneck in the benchmark turns out to be instruction cache misses. Here is an example measurement:

Counter | Value
------- | ------
INST_RETIRED.ANY | 170431
ICACHE.MISSES | 17194
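For reference, these counters can be collected with `perf stat` using the same raw event codes that appear in the listings elsewhere in this post (the codes are specific to this Intel CPU model, and the `mysqld` invocation is hypothetical). The miss rate then falls out directly:

```shell
# Raw event codes as used in this post (Intel, CPU-model specific):
#   rc0  = INST_RETIRED.ANY,  r280 = ICACHE.MISSES
# Hypothetical invocation; requires perf and a running mysqld:
#   perf stat -e rc0,r280 -p "$(pidof mysqld)" -- sleep 60
# The miss rate from the measurement above:
awk 'BEGIN { printf "%.1f%%\n", 100 * 17194 / 170431 }'
```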

So 10% of executed instructions missed the level 1 instruction cache ("icache"). That is bad. The Intel Optimization Manual says this about the ratio ICACHE.MISSES/INST_RETIRED.ANY:

Anything over 1% of instructions retired can be a significant issue.

So we are 10 times worse than "a significant issue".

Instruction cache misses cause a bottleneck in the frontend of the CPU - where x86 instructions are fetched, decoded, and supplied to the micro-op queue to be scheduled for the out-of-order dataflow backend. To get an idea of how badly bottlenecked we actually are in the frontend in this benchmark, we can use another measure from the Intel manual:

IDQ_UOPS_NOT_DELIVERED.CORE / (4*CPU_CLK_UNHALTED.THREAD) = 81.5%

This ratio estimates the percentage of time in which the front-end is not able to deliver instructions fast enough for the backend to work at full speed. In this case, for more than 80% of the time, the CPU could have done more work in the same time if only instructions could be fetched and decoded faster.

Note the importance of this analysis in knowing what areas to work on to improve performance. Roughly speaking, 80% of our time is spent waiting for the icache. We could have spent years on improving data locality or branch mispredictions or whatever in the code, and we would never have been able to get very big improvements. Now, knowing where the sore spot is, we can spend our efforts where it matters the most.

In the case of icache misses, the best place to start is with profile-guided optimisation. This works by first running a specially instrumented mysqld binary to collect data about the load during the benchmark. Then the binary is recompiled, where the compiler can take into account the information obtained to better optimise the generated code.

Using information about how the program actually behaves when run can help the compiler lay out the basic blocks of the program to benefit the most common execution paths. This can help reduce the footprint in the icache by reducing jumps into or out of the middle of a 16-byte instruction stream (such jumps waste the rest of the 16-byte stream, which is the unit of storage of the icache). It can also increase the length of straight-line code execution paths, which should help the hardware prefetcher keep the level 1 icache supplied.

So that is the theory - but does it work in practice? It turns out that it does. First I compile with `--coverage` added to the CFLAGS/CXXFLAGS. Then I run sysbench to generate the profile data. Then I recompile adding instead the `--profile-use` flag. That is all there is to it, GCC handles everything else automatically.

Re-running the benchmark with the optimised binary yields a large speedup:

Binary | Queries per second | Speedup
------ | ------------------ | -------
Base: `-O3` | 21404 | -
PGO: `-O3 --profile-use` | 30903 | 44%

So a 44% speedup just from compiling with different optimisation, not bad! (The actual server-side improvement is in fact even higher, as the sysbench client consumes part of the runtime due to the single-threaded nature of the test).

By the way, that 44% speedup comes from just a modest reduction in icache miss rate - from 10% down to 8%. It just shows how expensive those icache misses are. Fetching from L2 cache takes something like 12 cycles, and during each of those cycles the CPU will be able to do nothing. Given that we could potentially execute 4 operations in the backend each of those clocks, that is a lot of waste. So with 8% misses still left there is room for further huge improvements, though those improvements will likely be much harder to achieve than just fiddling with compiler options.

I think it is clear that we will need to start using profile-guided optimisation for future MariaDB binaries. This will be some work, as we need to develop a set of typical workloads to optimise for, covering a range of different common usages, and integrate running them into the compilation. More tests with a wider range of benchmarks will also be needed to ensure that the gain is generally applicable and does not cause significant performance regressions in other parts of the code. With such a huge improvement in this test I am confident that things can be generally improved, but it still needs proper testing.

My work on improving single-threaded performance will continue, as time permits, and I certainly expect more good results along the way (several patches are already in the pipeline). But I thought this one was so interesting that it was worth mentioning to a wider audience.

November 28th, 2013
12:34 pm

[Link]

MySQL/MariaDB single-threaded performance regressions, and a lesson in thread synchronisation primitives

I took a quick look at MariaDB 10.0 single-threaded performance (simple read-only sysbench). One thing immediately leaps to the eye, and I thought it worthy of mention. It contains an important lesson about the use of synchronisation primitives, and in particular "atomic operations", in MariaDB (and MySQL).

I am using the Linux `perf` tool on this sysbench command:

```  sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx
```
Look at the top offender in the output from `perf report`:
```  1,54%  mysqld  mysqld               [.] set_thread_state_v1
```
The only thing this does is set a string for SHOW PROCESSLIST (and the like) about what the thread is doing. And we are spending a whopping 1.5% of the total time doing this.

And why? That becomes clear when looking at the disassembly (extract):

```  2d:   mov    0x7b0(%rbx),%eax
33:   mov    %eax,%edx
lock   cmpxchg %eax,0x7b0(%rbx)
jne    33
and    $0xfffffffc,%edx
add    $0x1,%edx
xchg   %edx,0x7b0(%rbx)
mov    %rbp,0x7b8(%rbx)
mov    %ecx,0x7c0(%rbx)
mov    0x7b0(%rbx),%eax
5e:   mov    %eax,%edx
lock   cmpxchg %eax,0x7b0(%rbx)
jne    5e
and    $0xfffffffc,%edx
add    $0x6,%edx
xchg   %edx,0x7b0(%rbx)
```
No less than two locked cmpxchg instructions. Each of these implies a full memory barrier, which is really expensive.

To achieve their very high performance, modern CPUs process many instructions at the same time, a hundred or more is possible. But for a full memory barrier, the CPU needs to stop and ensure that all pending store instructions have fully reached the last-level cache. Only then can it start processing any following instructions involving a load. This is not something we should be doing every time a thread moves to a new stage of query processing.

I am not really familiar with the performance schema code (nor am I sure I want to be). But let us make a guess at what the code is trying to do here. Here is the code from which the first locked instruction originates:

```  /**
  Execute an allocated to dirty transition.
  This transition should be executed by the writer that owns the record,
  before the record is modified.
*/
void allocated_to_dirty(void)
{
  uint32 copy= PFS_atomic::load_u32(&m_version_state);
  uint32 new_val= (copy & VERSION_MASK) + PFS_LOCK_DIRTY;
  /* We own the record, no need to use compare and swap. */
  PFS_atomic::store_u32(&m_version_state, new_val);
}
```
So perhaps the idea is that only the thread itself can modify its state, but other threads can read it. And we want the update to be fast (I suppose updating the state is more important than inspecting it). So the developers wanted to avoid a lock that could block the writer. Instead, it sets a flag to mark that the data is being modified before the update. And clears the flag after. And a reader can then check the flag; if the flag was modified during a read, the data is potentially inconsistent, and the read can be re-tried.

Now, this really is poorly designed. `PFS_atomic::load_u32()` and `PFS_atomic::store_u32()` are implemented using `my_atomic_load32()` and `my_atomic_store32()`, which is inappropriate here. The `my_atomic()` facility implies a full memory barrier in every operation, which makes it somewhat less useful for performance-critical code.

All we need here is that the setting of the update flag becomes visible before changes to the state, and that the clearing of the flag becomes visible after. Thus it should be sufficient with a pair of write memory barriers in the writer (and corresponding read memory barriers in the reader).

The write and read memory barriers are much less expensive than a full barrier. All they do is ensure that a store (respectively a load) before the barrier is satisfied before any store (load) after the barrier. There is no need for the CPU to stop execution. In fact, on x86, there is usually nothing special to do, because stores are never re-ordered with other stores (nor loads with other loads) unless the special non-temporal memory instructions are used.

But look closer at the code:

```static void set_thread_state_v1(const char* state)
  ...
  int state_len= state ? strlen(state) : 0;
  pfs->m_processlist_lock.allocated_to_dirty();
  pfs->m_processlist_state_ptr= state;
  pfs->m_processlist_state_length= state_len;
  pfs->m_processlist_lock.dirty_to_allocated();
```
So it is storing a pointer to a string along with the length, and it needs to do synchronisation to ensure that the two are consistent with each other. But suppose instead that we store only the pointer. This can be done atomically without _any_ need for memory barriers or other synchronisation (as long as the memory location is naturally aligned). That would seem to be even better than using the (cheaper) write- and read-barriers.

To get a better idea of how bad the use of full memory barriers is for performance, I wrote a quick and dirty test program. It has code similar to what is in `set_thread_state_v1()`, as well as three other variants of it: One using just a write memory barrier, one using no memory barrier (not SMP safe, but useful for comparison), and one using normal pthread mutexes rather than a lock-free algorithm. Here is the time taken for 100,000,000 iterations of the function:

Variant | Wallclock (s) | Cycles | Cycle overhead
------- | ------------- | ------ | --------------
Original (full barrier) | 3.70 | 81 | 71
Write barrier | 0.53 | 12 | 2
No barrier | 0.45 | 10 | -
Pthread mutex | 2.26 | 50 | 40

Note how expensive the full memory barrier version is. Its overhead (compared to no barrier) is 35 times larger than that of the lesser write barrier. And I suspect that the real overhead is even bigger in real life than in this synthetic micro-benchmark (for example, imagine if we get a cache miss on one of those locked instructions).

Note the sad fact that the "clever" lock-free algorithm in this case actually makes things slower than just using a simple pthread mutex: the overhead of the original code is 75% higher than that of the mutex version. This is not really surprising. The main cost of a mutex operation is also just a single full memory barrier, but pthread mutexes have been very carefully tuned and optimised by some very knowledgeable people, unlike the ad-hoc performance schema code. Lock-free algorithms require a lot of knowledge and care to get right; this is not simple stuff.

I think there is an important lesson here about our use of synchronisation primitives in MariaDB/MySQL. While there is a lot of focus on mutex contention issues, and various parts are starting to use lock-free algorithms, there does not seem to be a general understanding of memory barrier issues among the developers. In fact, to my knowledge there are no general facilities for explicit memory barriers. We have `my_atomic.h`, which confuses the issues of "atomicity" and "memory barriers" and defines a handful of "atomic" operations that are really locked instructions implying a full memory barrier, as in the example above. As the test program shows, it is really important to distinguish between the different kinds of memory barriers and to understand the performance implications of each. I have always been really unhappy about `my_atomic.h`, and I think this is an area where we need to improve significantly.

And who knows, maybe the performance schema developers can take a hint or two from this.

February 14th, 2013
04:23 pm

[Link]

First steps with MariaDB Global Transaction ID

My previous writings were mostly theoretical, so I wanted to give a more practical example, showing the actual state of the current code. I also wanted to show how I have tried to make the feature fit well into the existing replication features, without requiring the user to enable lots of options or understand lots of restrictions before being able to use it.

So let us start! We will build the code from `lp:~maria-captains/maria/10.0-mdev26`, which at the time of writing is at revision `knielsen@knielsen-hq.org-20130214134205-403yjqvzva6xk52j`.

First, we start a master server on port 3310 and put a bit of data into it:

```    server1> use test;
server1> create table t1 (a int primary key, b int) engine=innodb;
server1> insert into t1 values (1,1);
server1> insert into t1 values (2,1);
server1> insert into t1 values (3,1);
```
To provision a slave, we take a mysqldump:
```    bash$ mysqldump --master-data=2 --single-transaction -uroot test > /tmp/dump.sql
```
Note that with `--master-data=2 --single-transaction` we obtain the exact binlog position corresponding to the data in the dump. Since MariaDB 5.3, this is completely non-blocking on the server (it does not do `FLUSH TABLES WITH READ LOCK`):
```    bash$ grep "CHANGE MASTER" /tmp/dump.sql
-- CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=910;
```
Meanwhile, the master server has a couple more transactions:
```    server1> insert into t1 values (4,2);
server1> insert into t1 values (5,2);
```
Now let us start up the slave server on port 3311, load the dump, and start replicating from the master:
```    bash$ mysql -uroot test < /tmp/dump.sql
server2> change master to master_host='127.0.0.1', master_port=3310,
master_user='root', master_log_file='master-bin.000001', master_log_pos=910;
server2> start slave;
server2> select * from t1;
+---+------+
| a | b    |
+---+------+
| 1 |    1 |
| 2 |    1 |
| 3 |    1 |
| 4 |    2 |
| 5 |    2 |
+---+------+
5 rows in set (0.00 sec)
```
So the slave is up to date. In addition, when the slave connects to the master, it downloads the current GTID replication state, so everything is now ready for using global transaction ID. Let us promote the slave to be the new master, and then later make the old master a slave of the new master. So we stop the slave thread on the old slave, and run another transaction to simulate it being the new master:
```    server2> stop slave;
server2> insert into t1 values (6,3);
```
Finally, let us attach the old master as a slave using global transaction ID:
```    server1> change master to master_host='127.0.0.1', master_port=3311,
master_user='root', master_gtid_pos=auto;
server1> start slave;
server1> select * from t1;
+---+------+
| a | b    |
+---+------+
| 1 |    1 |
| 2 |    1 |
| 3 |    1 |
| 4 |    2 |
| 5 |    2 |
| 6 |    3 |
+---+------+
6 rows in set (0.01 sec)
```
The old master is now running as a slave and is up to date with the new master.

So that is it! A short post from me for once, but that is the whole point. Replication with MariaDB Global Transaction ID works much as it always did. The only new thing here is when we issue the `CHANGE MASTER` to make the old master a slave of the new master. We do not have to manually try to compute or guess the correct binlog position on the new master, we can just specify `MASTER_GTID_POS=AUTO` and the servers figure out the rest for themselves.

I hope I managed to show more concretely my ideas with MariaDB Global Transaction ID. Comments and questions are most welcome, as always. Everything above is actual commands that work on the current code on Launchpad. Everything else may or may not work yet, as this is work in progress, just so you know!

January 4th, 2013
04:55 pm

[Link]

More on global transaction ID in MariaDB

I got some very good comments/questions on my previous post on MariaDB global transaction ID, from Giuseppe and Robert (of Tungsten fame). I thought a follow-up post would be appropriate to answer and further elaborate on the comments, as the points they raise are very important and interesting.

(It also gives me the opportunity to explain more deeply a lot of interesting design decisions that I left out in the first post for the sake of brevity and clarity.)

### On crash-safe slave

One of the things I really wanted to improve with global transaction ID is to make replication slaves more crash safe with respect to their current replication state. This state is mostly persistently stored information about which event(s) were last executed on the slave, so that after a server restart the slave will know from which point in the master binlog(s) to resume replication. In current (5.5 and earlier) replication, this state is stored simply by continuously writing a file `relay-log.info` after each event executed. If the server crashes, this is very susceptible to corruption, where the contents of the file no longer match the actual state of the tables in the database.

With MariaDB global transaction ID, the replication state is stored in the following table instead of in a plain file:

```    CREATE TABLE rpl_slave_state (
domain_id INT UNSIGNED NOT NULL,
sub_id BIGINT UNSIGNED NOT NULL,
server_id INT UNSIGNED NOT NULL,
seq_no BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (domain_id, sub_id));
```
When a transaction is executed on the slave, this table is updated as part of the transaction. So if the table is created with InnoDB (or other transactional engine) and the replicated events also use transactional tables, then the replication state is crash safe. DDL, or non-transactional engines such as MyISAM, remain crash-unsafe of course.

A global transaction ID in MariaDB consists of `domain_id` as described in the previous post, an increasing sequence number, and the usual `server_id`. Recall that the replication state with global transaction ID consists of the last global transaction ID applied within each independent replication stream, i.e. a mapping from `domain_id` to global transaction ID. This is what the table `rpl_slave_state` provides.

But what about the `sub_id` in the above table? This is to prevent concurrency issues when parallel replication is used. If we want to be able to execute in parallel two transactions with the same `domain_id`, then these two transactions will both need to update the table `rpl_slave_state`. If the transactions would update the same row in the table, then one transaction would have to wait on a row lock for the other to commit. This would prevent any kind of group commit, which would be a very serious performance bottleneck.

So instead, each transaction inserts a new, separate row into the table to record the new global transaction ID applied. There may thus be multiple entries for a given `domain_id`. The `sub_id` is used to distinguish them; it is simply a (local) integer that is increased for each transaction received from the master binlog. Thus, at any given time, the last applied global transaction ID for a given `domain_id` is the one with the highest `sub_id` in the table.

In effect, the replication state is obtained with a query like this:

```    SELECT domain_id, server_id, seq_no
FROM rpl_slave_state
WHERE (domain_id, sub_id) IN
(SELECT domain_id, MAX(sub_id) FROM rpl_slave_state GROUP BY domain_id)
```
Old rows are deleted when no longer needed.
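Conceptually, this cleanup of obsolete rows could be expressed with a multi-table DELETE like the following. This is only an illustrative sketch of the idea; the server does the purging internally, and the exact mechanism is not specified in the post:

```    -- Illustrative only: purge all but the newest row per domain_id.
DELETE s FROM mysql.rpl_slave_state s
JOIN (SELECT domain_id, MAX(sub_id) AS max_sub_id
      FROM mysql.rpl_slave_state
      GROUP BY domain_id) m
  ON s.domain_id = m.domain_id AND s.sub_id < m.max_sub_id;
```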

Thus, two replicated transactions can be executed like this:

```    BEGIN;                                 BEGIN;

UPDATE some_table                      UPDATE other_table
SET value = value + 1                  SET name = "foo"
WHERE id = 10;                         WHERE category LIKE "food_%";

INSERT INTO mysql.rpl_slave_state      INSERT INTO mysql.rpl_slave_state
SET domain_id = 1,                     SET domain_id = 1,
sub_id = 100,                          sub_id = 101,
server_id = 5,                         server_id = 5,
seq_no = 300010;                       seq_no = 300011;

COMMIT;                                COMMIT;
```
These two transactions can run completely independently, including the inserts into `rpl_slave_state`. And the commits at the end can be done together as a single group commit, where we ensure that the second one is recorded as happening after the first one, so that commit order (and binlog order) is preserved and visibility is correct (the second transaction is not visible to any query unless the first is also visible).

Contrast this with how things would be with an `rpl_slave_state` table holding a single row per `domain_id`:

```    BEGIN;                                 BEGIN;

UPDATE some_table                      UPDATE other_table
SET value = value + 1                  SET name = "foo"
WHERE id = 10;                         WHERE category LIKE "food_%";

UPDATE bad_rpl_slave_state
SET server_id = 5,
seq_no = 300010
WHERE domain_id = 1;

COMMIT;

UPDATE bad_rpl_slave_state
SET server_id = 5,
seq_no = 300011
WHERE domain_id = 1;

COMMIT;
```
Here the update of the replication state table in the second transaction would have to wait for the first transaction to commit, because of row locks. Group commit becomes impossible.

(I actually explained this issue to the replication developers at MySQL/Oracle a long time ago, but last time I looked at MySQL 5.6, they had ignored it...)

### On where to store the replication state

As Giuseppe pointed out, in the global transaction ID design it is still written that the replication state will be stored in the slave binlog, not in the `rpl_slave_state` table. Sorry about this, I will get the document updated as soon as possible.

I had basically two ideas for how to store the slave state in a crash-safe way:

1. In the slave's binlog.
2. In a (transactional) table.

The big advantage of (2) is that it works also when the binlog is not enabled on the slave. Since there can still be substantial overhead to enabling the binlog, I currently plan to go with this approach.

The advantage of (1) is that it is potentially cheaper when the binlog is enabled on the slave, as it commonly will be when global transaction ID is enabled (to be able to promote a slave as a new master, the binlog must be enabled, after all). We already write every single global transaction ID applied into the binlog, and if we crash, we already scan the binlog during crash recovery. Thus, it is easy during crash recovery to rebuild the replication state from the binlog contents. This way we get crash safe slave state without the overhead of maintaining an extra `rpl_slave_state` table.

It will be possible in the future to refine this, so that we could use method (1) if binlog is enabled, else method (2). This might improve performance slightly when binlog is enabled. But we should first benchmark to check if such refinement will be worth it in terms of performance gained. It seems likely that any gains will be modest, at best.

### On parallel replication

Parallel replication is something that has been long overdue, but is now a reality. MariaDB 10.0 will have multi-source replication, which is actually a form of parallel replication. MySQL 5.6 will have multi-threaded slave. Galera can do parallel replication, as can Tungsten I believe, though I am not familiar with details. There are several other mechanisms for parallel replication planned for later MariaDB releases, like MWL#184 and MDEV-520.

It is thus very important to design global transaction ID with parallel replication in mind from the start. I fully agree with Giuseppe's remarks here about MySQL 5.6 replication features failing completely to do this. They introduce in 5.6 three new features that require extensions to how the replication state is stored: crash-safe slave, global transaction ID, and multi-threaded slave. They have managed to do this by introducing three completely different solutions. This is just insane. It makes one wonder if Oracle management forbids Oracle developers to talk to each other, just as we already know they prevent discussions with the community ...

So, in relation to global transaction ID there are basically two kinds of parallel replication techniques: in-order and out-of-order. The two interact with global transaction ID in different ways.

### On in-order parallel replication

In-order is when two (or more) different transactions are executed in parallel on the slave, but the commit of the second transaction is delayed to after the first transaction has committed. Galera is an example of this, I think. Planned tasks MWL#184 and possibly MDEV-520 are also in-order techniques.

In-order parallel replication is transparent to applications and users (at least with MVCC transactional engines like InnoDB), since changes only become visible on COMMIT, and commits are done in a serial way. It is thus also mostly transparent to global transaction ID, and does not need much special consideration for the design.

One thing that can be done, and that I am currently working on, is to integrate in-order parallel replication with group commit. Suppose we run transactions T1 and T2 in parallel on the slave, and suppose that T2 happens to complete first so that we have to wait in T2's commit for T1 to commit first. If we integrate this wait with group commit, we can actually commit T1 and T2 at the same time, taking care to write the commit records to the binlog and to the storage engine in the right order (T1 before T2). This way, the wait is likely to improve performance rather than reduce it, in fact.

### On out-of-order parallel replication

Out-of-order parallel replication is when transactions can be committed in a different order on the slave than on the master. The MySQL 5.6 multi-threaded slave feature is an example of this.

Out-of-order must be explicitly enabled by the application/DBA, because it breaks fundamental semantics. If commits happen in different order, the slave database may be temporarily in a state that never existed on the master and may be invalid for the application. But if the application is written to tolerate such inconsistencies, and explicitly declares this to the database, then there may be potential for more parallelism than with in-order methods. This can make out-of-order interesting.

The typical example, which MySQL 5.6 multi-threaded slave uses, is when the application declares that transactions against different schemas are guaranteed independent. Different schemas can then be replicated independently (though if the application messes up and transactions happen to not really be independent, things can break). MariaDB 10.0 multi-source replication is another example, where the application declares a guarantee that two master servers can be replicated independently.

Out-of-order creates a challenge for global transaction ID when switching to a new master. Because events in the binlog on the new master are in a different order, there will not in general be a single place from which to start replication without either losing or duplicating some event.

MariaDB global transaction ID handles this by only allowing out-of-order parallel replication between different replication domains, never within a single domain. In effect, the DBA/application explicitly declares the possible independent replication streams, and then it is sufficient to remember one global transaction ID per `domain_id` as the position reached within each independent stream.

Thus, suppose we have a master where updates to schemas are independent, and we want to replicate them in parallel on slaves. On the master, we configure 10 (say) domain IDs 20-29. When we log a global transaction ID to the binlog, we set the `domain_id` value to a hash of the used schema.
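A minimal sketch of how such a mapping could be expressed on the master, assuming the session variable `gtid_domain_id` described later in this post. The `20 + CRC32(...) % 10` mapping to domain IDs 20-29 is purely illustrative, not an actual facility from the design:

```    -- Illustrative only: map the current schema to one of the ten
-- configured domain IDs 20-29 before running a transaction.
SET SESSION gtid_domain_id = 20 + (CRC32(DATABASE()) % 10);
```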

On the slave, we then configure 10 SQL threads. Two received transactions with different `domain_id` can be executed in parallel. Two transactions using the same schema will map to the same `domain_id` and will thus not be able to execute in parallel. Thus we get MySQL 5.6 style multi-threaded slave almost for free, using the exact same mechanism as for executing multi-source replication in parallel. The replication state on the slave will in this case consist of the 10 different global transaction IDs reached within each of the 10 replication domains. And they can be stored in the table `rpl_slave_state` just as described above. Thus the replication state for out-of-order parallel replication is fully integrated with the rest of the design, needing no special mechanisms.

And we can do more! The application (with suitable privileges) is allowed to change `domain_id` per-query. For example, we can run all normal queries with `domain_id=1`. But then if we have a long-running maintenance query like an ALTER TABLE or something that updates every row in a large table, we can execute it with `domain_id=2`, if we take care that no other queries conflict with it. This way, the long-running query can run in parallel "in the background", without causing any replication delay for normal queries.
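As a sketch, such a background maintenance query might look like this, assuming the session variable `gtid_domain_id` described later in this post (table and column names are hypothetical):

```    -- Run a long-running statement in its own replication domain, so
-- slaves can apply it in parallel with normal traffic (domain 1).
SET SESSION gtid_domain_id = 2;
ALTER TABLE big_table ADD INDEX (some_column);
SET SESSION gtid_domain_id = 1;  -- switch back to the normal domain
```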

In effect, the application or DBA now has great flexibility in declaring which queries can replicate independently of (and thus in parallel with) each other, and all this falls out almost for free from the overall design. I foresee that this will be a very powerful feature to have for large, busy replication setups.

Note, by the way, that most out-of-order parallel replication techniques can also be run as in-order simply by delaying the final COMMIT steps of transactions so they happen in order. This way one could, for example, do per-schema parallel replication without polluting the replication state with many global transaction IDs. This should generally achieve a similar improvement in overall throughput, though the latency of individual transactions can be longer.

### On "holes" in the global transaction ID sequences

Global transaction IDs have a sequence-number component, which ensures uniqueness by being always increasing. This raises the issue of whether an event will always have a sequence number exactly one bigger than the previous event, or if it is allowed to have "holes", where some sequence number is never allocated to an event.

For MariaDB global transaction ID, I took the approach that holes are allowed. There are a number of good reasons for this.

Mostly, I believe that a design that relies on "no holes" is a fundamental mistake. In MySQL 5.6 global transaction ID, holes are absolutely not allowed. If a hole ever turns up, you will be stuck with it literally forever. The MySQL 5.6 replication state lists every sequence number not yet applied on a slave server, so if one becomes missing it will forever remain. Unless you remove it manually, and as far as I have been able to determine, there are currently no facilities for this. Anyone who knows how fragile MySQL replication can be should realise that this is a recipe for disaster.

Another point: because of the strict "no holes" requirement, in MySQL 5.6, when events are filtered with `--replicate-ignore-db` or whatever, they had to change the code so that a dummy event is used to replace the filtered event. In effect, you cannot really filter any events any more! I think that alone should be enough to realise that the design is wrong.

A more subtle point is that a strict "no holes" requirement makes it much harder to handle allocation of new numbers correctly and scalably. Basically, to allocate the next number in a multi-threaded environment, a lock needs to be taken. We need to hold this lock for as short a time as possible to preserve scalability. But then, what happens if we allocate some sequence number N to transaction T1, and later get some failure that prevents T1 from successfully committing and being written into the binlog? We cannot now simply roll back T1, because some other transaction T2 may have already allocated the next number, and then we would leave a hole. Subtle issues like this are important for achieving good scalability.

So I think it is wrong to base the design on never having holes. On the other hand, there is no reason to deliberately introduce holes just for the fun of it. Sequence numbers in MariaDB global transaction ID will generally be without holes, it is just that nothing will break if somehow a hole should sneak in.

Also, whenever a global transaction ID is received on a slave server, the server's own internal counter for the next sequence number to allocate is set to one more than the received sequence number, if it is currently smaller. This gives a very nice property in a standard setup, where only one master is ever written to at any one time: the sequence number in itself will be globally unique and always increasing. This means that one can look at any two global transaction IDs and immediately know which comes first in the history, which can be very useful. It also makes it possible to warn if multiple masters are being written to without being configured with distinct replication domain IDs. This is detected when a server replicates a global transaction ID with the same `domain_id` as its own but a smaller sequence number.

In multi-master-like setups, the sequence number by itself can no longer be globally unique. But even here, if the system is correctly configured so that each actively written master has its own `domain_id`, sequence numbers will be unique per `domain_id` and allow ordering of global transaction IDs within one domain (and of course, between different domains there is no well-defined ordering anyway).

### On CHANGE MASTER TO syntax

In the previous post, I did not really go into what new syntax will be introduced for MariaDB global transaction ID, both for the sake of brevity, and also because it has not really been fully decided yet.

However, there will certainly be the possibility to change directly the replication slave state, i.e. the global transaction ID to start replicating from within each replication domain. For example something like this:

```    CHANGE MASTER TO master_host = "master.lan",
master_gtid_pos = "1-1-100,2-5-300";
```
This is a fine and supported way to start replication. I just want to mention that starting replication in the old way, with `master_log_file` and `master_log_pos`, will also be supported; the master will automatically convert this into the corresponding `master_gtid_pos` and set it for the slave. This can be convenient, as many tools like `mysqldump` or XtraBackup provide easy access to the old-style binlog position. It is certainly an improvement over MySQL 5.6 global transaction ID, where the only documented way to set up a slave involves RESET MASTER (!) on the master server...

Incidentally, note that `master_gtid_pos` has just one global transaction ID per `domain_id`, not one per `server_id`. Thus, if not using any form of multi-master, there will be just one global transaction ID to set.

So if we start with server 1 as the master, and then some time later switch over to server 2 for a master, the binlog will have global transaction IDs both with `server_id=1` and `server_id=2`. But the slave binlog state will be just a single global transaction ID, with `server_id=2` in this case. Since binlog order is always the same within one replication domain, a single global transaction ID is sufficient to know the correct place to continue replication.

I think this is a very nice property, that the size of the replication state is fixed: one global transaction ID per configured replication domain. In contrast, for MySQL 5.6 global transaction ID, any `server_id` that ever worked as master will remain in the replication state forever. If you ever had a server id 666 you will still be stuck with it 10 years later when you specify the replication state in CHANGE MASTER (assuming they will at some point even allow specifying the replication state in CHANGE MASTER).

Once the global transaction replication state is set, changing to a new master could happen with something like this:

```    CHANGE MASTER TO master_host = "master.lan",
master_gtid_pos = AUTO;
```
This is the whole point of global transaction ID, of course, to be able to do this and automatically get replication started from the right point(s) in the binlog on the new master.

One idea that I have not yet decided about is to allow just this simple syntax:

```    CHANGE MASTER TO master_host = "master.lan";
```
so that if no starting position is specified, and a global transaction state already exists, we default to `master_gtid_pos = AUTO`. This would be a change in current behaviour, so maybe it is not a good idea. On the other hand, the current behaviour is to start replication from whatever happens to be the first not yet purged binlog file on the master, which is almost guaranteed to be wrong. So it is tempting to change this simple syntax to just do the right thing.

### On extensions to `mysqlbinlog`

Robert mentions some possible extensions to the `mysqlbinlog` program, and I agree with those.

The master binlog is carefully designed so that it is easily possible to locate any given global transaction ID and the corresponding binlog position (or to determine that such a global transaction ID is not present in any binlog file). In the initial design this requires scanning one (but just one) binlog file from the beginning; later we could add an index facility if this becomes a bottleneck. The `mysqlbinlog` program should also support this, probably by allowing a global transaction ID (or multiple IDs) to be specified for `--start-position` and `--stop-position`.

Robert also mentions the usefulness of an option to filter out events from within just one replication domain/stream. This is something I had not thought of, but it would clearly be useful and is simple to implement.

### On session variable `server_id`

With MariaDB global transaction ID, `server_id` becomes a session variable, as do newly introduced variables `gtid_domain_id` and `gtid_seq_no`. This allows an external replication mechanism to apply events from another server outside the normal replication mechanism, and still preserve the global transaction ID when the resulting queries are logged in the local binlog. One important use case for this is of course point-in-time recovery, `mysqlbinlog | mysql`. Here, mysqlbinlog can set these variables to preserve the global transaction IDs on the events applied, so that fail-over and so on will still work correctly.

Since messing with `server_id` and the other variables can easily break replication, setting them requires SUPER privileges.
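As a sketch, the statement stream applied via `mysqlbinlog | mysql` could preserve each event's GTID by setting these session variables before the event itself. The concrete values and the table are illustrative, not actual `mysqlbinlog` output:

```    -- Illustrative replay of one event with its original GTID preserved
-- (requires SUPER privileges, as noted above):
SET SESSION gtid_domain_id = 1;
SET SESSION server_id = 5;
SET SESSION gtid_seq_no = 300010;
INSERT INTO t1 VALUES (4,2);  -- the replayed event itself
```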

January 3rd, 2013
03:28 pm

[Link]

Global transaction ID in MariaDB

The main goal of global transaction ID is to make it easy to promote a new master and switch all slaves over to continue replication from the new master. This is currently harder than it could be, since the current replication position for a slave is specified in coordinates that are specific to the current master, and it is not trivial to translate them into the corresponding coordinates on the new master. Global transaction ID solves this by annotating each event with a global transaction ID that is unique and universal across the whole replication hierarchy.

In addition, there are at least two other main goals for MariaDB global transaction ID:

1. Make it easy to setup global transaction ID, and easy to provision a new slave into an existing replication hierarchy.
2. Fully support multi-source replication and other similar setups.

### Replication streams

Let us consider the second point first, dealing with multi-source replication. The figure shows a replication topology with five servers. Server 3 is a slave with two independent masters, server 1 and server 2. Server 3 is in addition itself a master for two slaves server 4 and server 5. The coloured boxes A1, A2, ... and B1, B2, ... denote the binlogs in each server.

When server 3 replicates events from its two master servers, events from one master are applied independently from and in parallel with events from the other master. So the events from server 1 and server 2 get interleaved with each other in the binlog of server 3 in essentially arbitrary order. However, an important point is that events from the same master are still strictly ordered. A2 can be either before or after B1, but it will always be after A1.

When the slave server 4 replicates from master server 3, server 4 sees just a single binlog stream, which is the interleaving of events originating in server 1 and server 2. However, since the two original streams are fully independent (by the way that multi-source replication works in MariaDB), they can be freely applied in parallel on server 4 as well. This is very important! It is already a severe performance bottleneck that replication from one master is single-threaded on the slaves, so replicating events from multiple masters also serially would make matters even worse.

What we do is to annotate the events with a global transaction ID that includes a tag that identifies the original source. In the figure, this is marked with different colours, blue for events originating in server 1, and red for server 2. Server 4 and server 5 are then free to apply a "blue" event in parallel with a "red" event, and such parallel events can thus end up in different order in the binlogs in different places in the replication hierarchy. So every server can have a distinct interleaving of the two original event streams, but every interleaving respects the order within a single original stream. In the figure, we see for example that A2 comes before B1 in server 3 and server 5, but after in server 4, however it is always after A1.

This concept of having a collection of distinct binlog streams, each strictly ordered but interleaved with each other in a relaxed way, is very powerful. It allows both great flexibility (and hence opportunity for parallelism) in applying independent events, as well as a simple representation of the state of each replication slave at each point in time. For each slave, we simply need to remember the global transaction ID of the last event applied in each independent stream. Then, to switch a slave to a new master, the master finds the corresponding places within its own binlog for each independent stream and starts sending events from the appropriate location in each stream.

For example, in the figure, we see that the state of server 4 is (A4, B3) and for server 5 it is (A3, B3). Thus we can directly switch server 5 to replicate from server 4 as its new master, as server 4 is strictly ahead of server 5 in the replication streams.

Or if we want to instead make server 5 the new master, then we first need to temporarily replicate from server 4 to server 5 up to (A4, B3). Then we can switch over and make server 5 the new master. Note that in general such a procedure may be necessary, as there may be no single server in the hierarchy that is ahead of every other server in every stream if the original master goes away. But since each stream is simply ordered, it is always possible to bring one server up ahead of the rest so it can serve as a master for the others.

### Setup and provisioning

This brings us back to the first point above: making it easy to set up replication using global transaction ID, and easy to provision a new slave into an existing replication hierarchy.

To create a new slave for a given master, one can proceed exactly the same way whether using global transaction ID or not. Make a copy of the master, obtaining the corresponding binlog position (`mysqldump --master-data`, XtraBackup, whatever). Set up the copy as the new slave, and issue `CHANGE MASTER TO ... MASTER_LOG_POS=...` to start replication. Then, when the slave first connects, the master will send the last global transaction ID within each existing replication stream, and the slave will thus automatically be configured with the correct state. If there later is a need to switch the slave to a different master, global transaction ID is already properly set up.

This works exactly because of the property that while we have potentially interleaved distinct replication streams, each stream is strictly ordered across the whole replication hierarchy. I believe this is a very important point, and essential for getting a good global transaction ID design. The notion of an ordered sequence of the statements and transactions executed on the master is the central core of MySQL replication, it is what users know and what has made it so successful despite all its limitations.

### Replication domains

To implement this design, MariaDB global transaction ID introduces the notion of a replication domain and an associated `domain_id`. A replication domain is just a server or group of servers that generate a single, strictly ordered replication stream. Thus, in the example above, there are two `domain_id` values in play, corresponding to the two colours blue and red. The global transaction ID includes the `domain_id`, and this way every event can be identified with its containing replication stream.

Another important point here is that `domain_id` is something the DBA configures explicitly. MySQL replication is all about the DBA having control and flexibility. The existence of independent streams of events is a property of the application using MySQL, not some server internal, so it needs to be under the control of the user/DBA. In the example, one would configure server 1 with `domain_id=1` and server 2 with `domain_id=2`.
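For the example above, the configuration might look like this (the variable name matches what was planned for MariaDB 10.0, but as the feature was work-in-progress, the final syntax could differ):

```sql
-- On server 1 (the blue stream):
SET GLOBAL gtid_domain_id = 1;

-- On server 2 (the red stream):
SET GLOBAL gtid_domain_id = 2;
```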

Of course, in basic (and not so basic) replication setups where only one master server is written to by applications at any one time, there is only a single ordered event stream, so `domain_id` can be ignored and remain at the default (which is 0).

Note by the way that `domain_id` is different from `server_id`! It is possible and normal for multiple servers to share the same `domain_id`; for example, server 1 might be a slave of some higher-up master server, and the two would then share the `domain_id`. One could even imagine that at some point in the future, servers would have moved around so that server 2 was re-provisioned to replace server 1; it would then retain its old `server_id` but change its `domain_id` to 1. So both the blue and the red event stream would have instances with `server_id=1`, but `domain_id` will always be consistent.

It is also possible for a single server to use multiple domain IDs. For example, a DBA might configure generated events to receive as `domain_id` a hash of the current schema. This would be a way of declaring that transactions in distinct schemas are guaranteed to be independent, and it would allow slaves to apply those independent transactions in parallel. The slave will just see distinct streams, and apply them in parallel the same way as for multi-source replication. This is similar to the multi-threaded slave that MySQL 5.6 implements, but it is more flexible; for example, an application could explicitly mark a long-running transaction with a distinct `domain_id` (after ensuring that it is independent of other queries), allowing it to be replicated in parallel without delaying replication of normal queries.
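The schema-hashing idea can be illustrated with a small sketch. This is not MariaDB code; the function name, the hash choice (CRC32), and the number of domains are all assumptions for illustration. The key property is that the mapping is stable, so every transaction against a given schema always lands in the same strictly ordered stream.

```python
# Hypothetical sketch: derive a domain_id from a schema name, so that
# transactions in distinct schemas fall into distinct, independently
# ordered replication streams that a slave may apply in parallel.
import zlib

NUM_DOMAINS = 8  # assumed number of parallel streams; tune for the workload

def domain_id_for_schema(schema: str) -> int:
    # A stable hash (CRC32 here) guarantees the same schema always maps to
    # the same domain, keeping each per-schema stream strictly ordered.
    return zlib.crc32(schema.encode("utf-8")) % NUM_DOMAINS

# All transactions against one schema share a stream:
print(domain_id_for_schema("sales") == domain_id_for_schema("sales"))
```

Note the trade-off: two schemas may hash to the same domain and thus share a stream, which is safe (ordering is preserved) but loses some potential parallelism.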

### Current status

The MariaDB global transaction ID is work-in-progress, currently planned for MariaDB 10.0.

The current code is maintained on Launchpad: `lp:~maria-captains/maria/10.0-mdev26`. The design is written up in detail in Jira task MDEV-26, where the progress is also tracked.

Global transaction ID has already been discussed on the maria-developers mailing list. I have received valuable feedback there which has been included in the current design. But I very much welcome additional feedback, I am open to changing anything if it makes the end result better. Much of the community seems to not be using mailing lists to their full potential (hint hint!), hence this blog post to hopefully reach a wider audience that might be interested.
