### Hacking a box of 240x320 displays with the ESP8266

For the last few days, I have been playing with some small displays we have lying around in Labitat. We have ninety-odd of them from an old donation, and I thought it would be cool to be able to use them for some fun projects.

The displays are ET024002DMU with a built-in ST7787 controller. They really are quite nice displays. 240-by-320 resolution in a 42mm-by-60mm form factor. 18-bit colour and a nice and clear display at a high resolution. The protocol is reasonably well designed and described, and they have some nice extra features that should make them suitable for most tasks, even demanding ones.

The picture to the right shows a display soldered to a simple breakout board and connected to an old STM32F405 (ARM Cortex-M4 board) of mine. This is using the display's 16-bit parallel bus, which allows for really fast update. It should even be possible to connect the display to the memory controller of the STM32 and utilise DMA to update the display without tying up the CPU (though my test just used bit-banging on a GPIO port). Other nice features include the ability to sync with or control the vertical refresh (to avoid visible glitches/tearing during frame updates) and hardware scrolling. The display is spec'ed for 2.8V, which is somewhat inconvenient when most of my stuff is otherwise running 3.3V logic, but so far just running the display at 3.3V seems to work fine.

Driving the display over a 16-bit bus from and ARM MUC is cool for high-end stuff. Since we have so many of these displays lying around, it would also be cool to have something that is really easy, and really cheap, to use. So I extended the TFT_eSPI library for the ESP8266 to support the displays. The ESP8266 is very cheap and conveniently runs 3.3V logic (as opposed to the Arduino's normally 5V). The ESPs have become very popular, so since I have not used them before, this was also a good chance to learn a bit more about them.

The TFT_eSPI library is done by Bodmer, derived from some Adafruit libraries. It uses the serial interface mode of the display. This greatly reduces pin count (the ESP8266 seems quite starved for GPIOs), but also greatly reduces performance, of course. But the TFT_eSPI is well optimised, and should be adequate for most purposes. I think I managed to fully implement the ST7787 controller for the library, all of the demos seem to run well. And the performance should be on par with the other supported display controllers. Here is a video showing the display running some demos from the library:

While implementing the support in the library, I discovered a couple interesting details about the display that I wanted to write up. Generally, these kind of displays all seem to share a protocol, and work much the same way. However, the ST7787 only supports a 3-wire serial mode, while the other controllers supported by the library are using a 4-wire serial mode. The difference is that the 4-wire mode uses a dedicated pin DC to distinguish between command (DC=0) and data (DC=1) bytes. In the 3-wire mode, the DC pin is not available, and the data/command bit is instead included as a 9th bit with every byte. Here are diagrams showing this, taken from the datasheet of another display controller, the ILI9341:

The main work in extending the library to support the ST7787 was to re-write all the data transfers to handle the 3-wire mode, bit-stuffing the D/C bit into the SPI data stream instead of pulsing the D/C GPIO. The bit-stuffing probably introduces some overhead; on the other hand, now multiple commands can be given to the ESP8266's SPI peripheral at once, without the need to manually pulse D/C in-between transfers. So I expect the performance to be roughly equivalent; it will be interesting to run some benchmark comparisons against other displays.

I did discover one notable limitation of the 3-wire mode, though. This is when reading pixel data back out of the display's frame buffer memory. There is a command RAMRD to do this, and in the other modes, it allows to read out all pixels in a previously defined read window at once in a single command, or a subset of pixels. But this did not appear to work in the 3-wire mode. It turns out that the display controller takes its dataline into high-Z mode as soon as one pixel has been transfered, no matter the size of the defined read window. This means that frame buffer reads will be a lot slower on the ST7787, due to the overhead of reading out pixels one by one (fortunately, framebuffer read is not a common operation).

I am thinking that this limitation comes from not having the D/C pin available to distinguish data from commands and thereby terminating a transfer. Apparently, the read window size is not available to the SPI interface, or forcing the host to transfer the full window size was deemed undesirable. Thus, only single pixel transfers are supported. The data sheet is not very clear on this, to say the least.

A related limitation is seen in the RAMWR command, which writes a sequence of pixels to the display. The RAMWR does support writing multiple pixels, which is fortunate as this is one of the most used and most performance-critical operations. But what I found was that when writing less that the defined window size, some glitches can occur. It appears that the controller's SPI interface is buffering up to 4 pixels before writing them to the display memory. When writing less than 4 pixels, the pixels would not be written to framebuffer memory immediately. Instead, they would be included in a later write, appearing in the wrong location!

This problem only occurs when writing less than the defined window size. So once understood, it was easy to work around by always setting the right window size before writing pixels - which only affected the line drawing algorithm. Again, this seems to be a limitation of the 3-wire protocol due to missing D/C input. Apparently, it is not able to utilise the D/C bit that is available to the controller in write mode, instead relying on the frame buffer handler to notify when the write window is full. Or maybe it is just a controller bug.

It was btw. interesting to see how similar these different controllers are. This is quite apparent also from similarities between the data sheets. There has clearly been a lot of copy-paste between them. For example, I found this gem in the ST7787 datasheet (in the description of the WRX pin):

-In 4-line serial interface, this pin is used as D/CX (data/ command select)

The ST7787 though does not have any 4-line serial interface (there is no other mention of this mode anywhere in the datasheet). So this is clearly a left-over from a copy-paste from another controller...

Otherwise the implementation of ST7787 support in the library was reasonably straight-forward. The library uses a lot of direct register access to the SPI hardware to optimise the transfer speed, so some care is needed to get the bit-stuffing right in all cases. One thing to be aware of is that the RAMRD command for reading a pixel needs 9 dummy clock pulses between sending the command and reading back the pixel data. A very careful reading of the datasheet might be said to hint at this, though it really is not very clear.

Another thing to be aware of is that the ST7787 uses a single shared data line for its communication; it does not have separate MOSI/MISO lines. This is not a problem in practice, as the controller never transmits and receives data at the same time. So the D0 pin of the ST7787 can simply be connected to both MOSI and MISO on the ESP8266 microcontroller. But the code then needs to temporarily disable the MOSI pin output while reading, to avoid contention on the data line.

So with the library working, it should be easy to control the displays from an ESP8266. The library has a lot of nice functionality, including really good text support with a lot of available fonts. The next step should be to make a couple nice PCBs with connector for the display and for an ESP8266, some power supply, handy pin breakouts, etc. Hopefully some cool projects will come of it in the coming months.

### FVWM and Java application window focus

I found the solution for a problem that has been annoying me for quite some time. I am using FVWM for a window manager. A few programs written in Java, most notably the Arduino IDE, will not accept keyboard focus. As soon as the window is given focus using normal window-manager commands, it immediately loses the focus again, making it impossible to eg. edit code or use keyboard shortcuts. I could usually get it to focus by randomly clicking a few menus or some such, but it was hardly usable.

Turns out that FVWM has a Style option to handle this: Lenience. From the FVWM manual:

The ICCCM states that windows possessing the property WM_HINTS(WM_HINTS): Client accepts input or input focus: False should not be given the keyboard input focus by the window manager... A number of applications set this property, and yet expect the window manager to give them the keyboard focus anyway, so fvwm provides a window style, Lenience, which allows fvwm to overlook this ICCCM rule.

And indeed, the problem seems solved after I put the following in my FVWM configuration:

```  Style "*Arduino*" Lenience
```

The Arduino IDE window now keeps focus when switching between virtual desktops, using window selection hotkeys, etc. It still seems to require explicitly clicking the window with the mouse to actually accept keyboard events, which is annoying, but at least it is no longer impossible to type into the window. Not sure if this is a fundamental "feature" of Arduino, or if there is a way to fix this also. I rarely use the Arduino IDE anyway.

From the FVWM documentation it would appear that this is a problem with the Arduino code (or the underlying Java window toolkit, I remember seeing similar problems in other Java applications). Not sure why it has never been fixed though - or if it is actually FVWM that is mis-interpreting or being overly zealous with its adherence to ICCCM compliance.

### Improving replication with multiple storage engines

New MariaDB/MySQL storage engines such as MyRocks and TokuDB have renewed interest in using engines other than InnoDB. This is great, but also presents new challenges. In this article, I will describe work that I am currently finishing, and which addresses one such challenge.

For example, the left bar in the figure shows what happens to MyRocks replication performance when used with a default install where the replication state table uses InnoDB. The middle bar shows the performance improvement from my patch.

Current MariaDB and MySQL replication uses tables to transactionally record the replication state (eg `mysql.gtid_slave_pos`). When non-InnoDB storage engines are introduced the question becomes: What engine should be used for the replication table? Any choice will penalise other engines heavily by injecting a cross-engine transaction with every replicated change. Unless all tables can be migrated to the other engine at once, this is an unavoidable problem with current MariaDB / MySQL code.

To solve this I have implemented MDEV-12179, per-engine mysql.gtid_slave_pos tables, which should hopefully be in MariaDB 10.3 soon. This patch makes the server able to use multiple replication state tables, one for each engine used. This way, InnoDB transactions can update the InnoDB replication table, and eg. MyRocks transactions can update the MyRocks table. No cross-engine transactions are needed (unless the application itself uses both InnoDB and MyRocks in a single transaction).

The feature is enabled with the new option `--gtid-pos-auto-engines=innodb,rocksdb`. The server will automatically create the new replication tables when/if needed, and will read any such tables present at server start to restore the replication state.

## Performance test

To test the impact of the new feature, I ran a sysbench write-only load on a master, and measured the time for a slave to apply the full load. The workload is using MyRocks tables, while the default mysql.gtid_slave_pos table is stored in InnoDB. The performance was compared with and without `--gtid-pos-auto-engines=innodb,rocksdb`. Full details of test options are available following the link at the end of the article.

Replication injects an update into a small table as part of each commit. The performance impact of this will be most noticeable for fast transactions, where the commit overhead is relatively larger. It will be particularly noticeable when durability is enabled (`--innodb-flush-log-at-trx-commit=1` and similar). If another storage engine is added into a transaction, extra fsync() calls are needed in the commit step, which can be very expensive.

I tested the performance in two scenarios:

1. A "worst case" scenario with durability/fsync enabled for binlog, InnoDB, and MyRocks, on hardware with slow fsync.
2. A "best case" scenario with all durability/fsync disabled.
In the "worst case" we would hope to see substantial improvement due to reducing the number of fsync operations. In the "best case" improvements will be expected to be small, if any, though there may still be some improvement due to avoiding CPU and some I/O overhead from running the commits through two engines.

The figure at the start of the article shows the results from the "worst case". The left bar is the time for the slave to catch up when the replication state table is using the default InnoDB storage engine. The middle bar is the time when using `--gtid-pos-auto-engines=innodb,myrocks` and the right bar is when the state table is changed to MyRocks (MyRocks-only load).

We see a huge speed penalty from the cross-engine transactions introduced by the InnoDB state table, the slave is twice as slow. However, with the patch, all the performance is recovered compared to MyRocks-only setup.

The test was run on consumer-grade hardware with limited I/O capabilities. I ran a small script to test the speed of `fdatasync()` (see link at end of article). This SATA-attached SSD can do around 120 fdatasync() calls per second, writing 16 KB blocks at random round-robin among five 1MB data files. In the "worst case" test, the load is completely disk-bound. Thus, the absolute transactions-per-second numbers are low, and the impact of the new feature is very big.

The results of the "best case" is in the following figure. The "best case" workload is CPU-bound, disk utilisation is low. Sysbench write-only does several queries in each transaction, so commit overhead is relatively lower. Still, we see a substantial cost of replication introduced cross-engine, it runs 18% slower than the MyRocks-only case. And again, the patch is able to fully recover the performance.

So I think these are really good results for the new feature. The impact for the user is low - just set a server option, and the server will handle the rest. We could eventually make InnoDB, TokuDB, and MyRocks default for `--gtid-pos-auto-engines` to make it fully automatic. The actual performance gain will depend completely on the workload, and the absolute numbers from these performance tests mean little, perhaps. But they do show that there should be significant potential gain in many cases, and enourmous gains in some cases.

## Conclusions

I hope this feature will help experiments with, and eventual migration to, the new storage engines such as TokuDB and MyRocks. The ability to have good replication performance when different storage engines are used in different transactions (but not within a single transaction) should make it easier to experiment, without committing everything on a server to a new and unknown engine. There might even be use cases for deploying permanently on a mixed-engine setup, with different parts of the data utilising different performance characteristics of each engine.

The present work here is implemented for MariaDB only. However, there is some discussions on porting it to other MySQL variants. While the details of the implementation will differ somewhat, due to code differences in MariaDB replication, I believe a similar approach should work well in the other MySQL variants also. So it is definitely a possibility, if there is interest.

### Understanding skew factors in Simplex/Improved Perlin Noise

The Simplex Noise (or Improved Perlin Noise) algorithm uses a somewhat mysterious "skew factor" of $\frac{\sqrt{3}-1}{2}$. I did not find any really satisfactory explanation for this factor in the descriptions of Simplex Noise that I read. But I managed to work it out nevertheless, which I thought was a fun exercise and worth a quick write-up.

Simplex noise is constructed by assigning random values to each point in a simplex grid. The simplex grid is a tiling of the plane using rhombuses, each rhombus consisting of two equilateral triangles. See the figure on the right.

Given a point $\left(x,y\right)$ (expressed in normal rectangular coordinates), we first transform the coordinates into $\left(u,v\right)$ expressed in the simplex grid. Then we take the integer parts of u and v to find the corners of the containing equilateral triangle, and take the random values assigned to these corners. The "noise value" of the original point $\left(x,y\right)$ is then some suitable interpolation of these values with respect to the distance from $\left(x,y\right)$ to each corner.

The implementation of this algorithm is explained in detail in several places. The code to transform into and back out of the simplex grid might look like this:

```final double F2 = 0.5*(Math.sqrt(3.0)-1.0);
double s = (xin+yin)*F2;
int u = fastfloor(xin+s);
int v = fastfloor(yin+s);
final double G2 = -(3.0-Math.sqrt(3.0))/6.0;
double t = (u+v)*G2;
double X0 = u+t;
double Y0 = v+t;
```
So the question is, where do these funny factors F2 and G2 come from?

To understand this, let us first consider the general form of the transformation from simplex coordinates $\left(u,v\right)$ in the grid spanned by $\stackrel{\to }{u}$ and $\stackrel{\to }{v}$ to the rectangular coordinates $\left(x,y\right)$. It is

$x=au+bv$
$y=cu+dv$
where $\stackrel{\to }{u}= \left(a,c\right)$ and $\stackrel{\to }{u}=\left(b,d\right)$. So this requires 4 multiplications in the general case.

However, we can freely choose which simplex grid to use! So we can try to choose one that reduces the computational work needed.

First, we can choose the orientation of the grid so that $\stackrel{\to }{u}$ and $\stackrel{\to }{v}$ are symmetric around the diagonal $x=y$. Then $a=d$ and $b=c$, so we can write the transformation as

$x=\left(a-b\right)u+b\left(u+v\right)$
$y=\left(a-b\right)v+b\left(u+v\right)$
Second, we can choose the scale of the grid so that $\left(a-b\right)=1$, and then we get simply
$x=u+t$
$y=v+t$
where $t=b\left(u+v\right)$. This simpler form requires only a single multiplication.

This is exactly the form we see in the above code snippet, with G2 being the name for the constant $b$. We can work out from this that the vectors that span the grid used by the code are

$\stackrel{\to }{u}= \left(1-\frac{3-\sqrt{3}}{6},-\frac{3-\sqrt{3}}{6}\right)$
$\stackrel{\to }{v}= \left(-\frac{3-\sqrt{3}}{6},1-\frac{3-\sqrt{3}}{6}\right)$
The interested reader is encouraged to check that $\stackrel{\to }{u}$, $\stackrel{\to }{v}$, and $\stackrel{\to }{u}+\stackrel{\to }{v}$ all have the same length, so that the triangles that form the grid indeed end up being equilateral.

Working out the inverse of the transformation (again a good exercise for the interested reader) leads to another transformation of the same simple form, this time with the constant $F2 =\frac{\sqrt{3}-1}{2}$ as seen in the code.

So I believe this is the explanation of the mysterious factors F2 and G2 in the example code found by Google searches on the net. They arise from the choice of the simplex grid to use. This choice is made to make the associated coordinate transformation use only a single multiplication, rather than the four required in the general case. This makes the implementation of the algorithm more efficient. Pretty clever, right?

### Arduino to the max: 11x11x11 LED-cube

March 29 2014 is Arduino day, also in Labitat. This is a good opportunity to describe my LED-cube:

This LED-cube pulls a number of tricks to get the most out of just a single normal Arduino Uno. A meager 16 MHz and 2048 bytes of RAM goes a long way with sufficient ingenuity and creativity. Here are some highlights:

• 12-bit PWM, 16 grayscales non-linear.
• Animations generated on-board, read from SD-card, or streamed over USB.
• 178 Hz refresh rate, transferring 3 Mbits/s of data to the LED driver shift registers.
• 50 Hz animation framerate, receiving 269kbit/s of animation data over the serial port.
• Approximately half of spare CPU time available for on-board generation of animations.
• Multi-processor Arduino! Uses the Atmega 16U2 USB-handling chip on the Arduino Uno as a second processor, reading animations from an SD-card.
• Hand-crafted assembler to speed up time-critical parts.

## LED-cube basics

One of the first questions I get when I show the LED-cube to people is: How is it possible to turn individual LEDs on and off? After all, it looks like everything is connected to everything. Here is how it works.

The LEDs in the cube are organised into columns. The 11x11x11 cube thus has 11*11=121 columns, each containing 11 LEDs. Let us first see how a single column works:

All the cathodes are tied together and connected to ground. Each anode can be individually connected to the supply voltage using one of the switches. This allows to turn on one LED by connecting the appropriate switch.

Now let us see how it works with multiple columns:

The anodes of one LED from each column are all tied together and connected by a single switch to the supply voltage. The columns now have each a switch connecting them to ground. Now, to turn on eg. D5, we would connect SW2 and SW5. To turn on in addition D8, we would connect in addition SW6.

But what if we want to turn on D1 and D5? For this, we need to connect all of switches SW1, SW2, SW4, and SW5. But that turns on also D2 and D4. Thus, to control every LED independently, we need something more.

The answer is multiplexing. We only ever have one of SW1, SW2, and SW3 turned on at the same time. This allows to individually control each LED in one level of the columns using the bottom column-switches. Then shortly after (half a millisecond in the 11x11x11 LED-cube), we turn off that switch and turn on the switch for the next level, at the same time flipping the column-switches as appropriate for controlling the LEDs at that level. By doing this quickly enough, the human eye is too slow to perceive any flickering, and we get the illusion that each LED is turned on or off independently, at will.

Using multiplexing, the required number of switches and connections is greatly reduced. In the 11x11x11 cube, we need only 11*11=121 switches for the columns, plus an additional 11 swiches, one for the anodes in each horizontal layer. ("Only", compared to 11*11*11=1331). In the cube structure, the column connections are the vertical structures, and the horizontal structures connect the anodes in one layer each to a switch through 11 extra columns at the back of the cube.

Soldering the LED structure requires some soldering skills, but with 1331 LEDs to do, such skill will be naturally aquired before the end of the project. Most especially, patience is needed. There are several good pages on the internet describing in detail how the LEDs are soldered together for an LED-cube, here is one, for example.

The layer/anode switches are implemented using 11 P-channel MOSFETs, controlled directly from 11 GPIO pins on the Arduino. The column switches are implemented using 8 TLC5940 LED driver ICs. The TLC5940 has the additional benefit of being able to accurately control how much current each LED receives, as well as being able to dynamically adjust the LED brightness using 12-bit PWM (4096 intensity levels).

## Electronics

The electronics sit on a 20cm-by-20cm PCB located below the base of the LED-cube. The picture shows the PCB from the bottom side; the LED columns (and the connection to the 11 layers) enter through the yellow holes from the top, are bent 90 degrees, and soldered to the copper pads.

The 8 ICs spaced around the middle are the TLC5940 LED drivers, connected to each LED pad. The connections are made for ease of PCB layout, so the software has a lookup table that maps each LED column to the TLC5940 output to which it is connected. Along the right of the board are seen the 11 P-channel MOSFETs that supply the voltage for each layer, one at a time.

At the very right edge is seen the connector to the GPIO pins on the Arduino Uno. Originally, the cube was controlled by an actual Arduino Uno device, connected to the cube with a ribbon cable. But the current version uses instead a custom control PCB with a small, narrow footprint that just fits below the base of the LED-cube. This way all the electronics is hidden away below the very narrow base, resulting in a very nice elegant presentation of the cube:

This control PCB is similar to an Arduino Uno; but it is stripped for anything not needed for the cube, and has a micro-SD card slot and level-shifter soldered directly on the board (originally an small external micro-SD board was connected to the Arduino Uno). The micro-SD slot (not shown on the drawing) and the level-shifter are in the middle. To the left is the USB-connector and the Atmega16U2 chip that handles the USB. To the right is the Atmega328 that runs the main code for the LED-cube. The connections to the main PCB are soldered to the pads along the right and top-right.

The PCBs were layed out with KiCAD; see the end of this article for links to source code. The main board was manufactured with the PCB CNC milling machine we have at Labitat. The small Arduino-clone PCB was manufactured by the SeeedStudio Fusion PCB service.

## Software

The LED-cube has 1331 LEDs, each of which can be PWM'ed for 16 different levels of brightness, from 0 (turned off) to 15 (fully on). The state of all the LEDs is stored in a framebuffer; with 4 bits per LED that amounts to 666 bytes. To avoid flicker, double-buffering is needed. With 2kByte of memory, the Arduino has just room for two framebuffers, with a bit of memory to spare for the rest of the activities.

The software has two main tasks to control the LED-cube:

1. Every 20 milliseconds, load one of the frame buffers with the next frame of the animation being played, for an animation framerate of 50 per second.
2. Every 500 microseconds, turn on the LEDs in one layer of the cube with the appropriate intensity, by programming the TLC5940 driver chips with the correct PWM value and turning on the power to that layer through the next P-channel MOSFET; resulting in a refresh rate for the entire 11-layer cube of around 178 Hz.
The loading of animation frames into the framebuffers can happen from three sources: They can be loaded from an SD-card, they can be streamed from a PC over the USB port, or they can be generated by code on-board the Arduino itself. The refresh of the LED intensities happens in a timer interrupt, using the SPI device on the Atmega328 to communicate with the TLC5940 driver ICs. The individual steps are detailed in the following sections.

## Communicating with the TLC5940 LED-driver ICs.

The TLC5940 LED-drivers need an external PWM clock; this is supplied from the arduino from a timer in PWM mode. An 8 MHz PWM-clock is used; this is the fastest clock that the 16 MHz Atmega328 can generate. One PWM cycle is 4096 PWM clocks, as the TLC5940 uses 12-bit PWM. It is important to always let the TLC5940s run an integral number of PWM cycles; else the brightness of LEDs will be off. So a timer-interrupt runs every 4096 PWM cycles to trigger the TLC5940s; this works out to about 2 kHz for switching each layer on and off, or 178 Hz refresh rate for all 11 layers. The TLC5940 ICs employ a latch, so the timer interrupt first latches new data to get stable PWM timing, then goes to work to send out new data to the driver ICs to be ready to latch at the next timer interrupt.

The 8 TLC5940 ICs have a total of 128 outputs, each of which needs 12 bits of PWM data. That amount to 1536 bits or 192 bytes of data to send to the LED drivers in each refresh cycle. One refresh cycle has a total of 8192 CPU clock cycles, and we need to leave as much as possible of those for generating / downloading the next frame of animation, so this step is carefully optimised for speed.

The data is shifted into the TLC5940s using data and clock pins, so we can use the Atmega328 SPI device in master mode. The maximum speed possible for the SPI device is using an 8 MHz clock, so that means 16 CPU cycles per byte plus a couple extra to load the next byte, as the SPI device is only single-buffered, for a total of around 3500-4000 cycles of the 8192 available.

But we also need to prepare the data to send to the LED drivers. This involves a few steps per TLC5940 output:

1. Fetch from a lookup-table the index of the LED that is connected to the next TLC5940 output on the PCB (remember, those connections are in more or less random order to optimise the PCB layout).
2. Use that index into the frame buffer to fetch the 4-bit value that gives the required brightness of the LED.
3. Use the brightness value to index into another lookup table, which maps each possible brightness into the corresponding 12-bit PWM value. This mapping is made non-linear to improve the dynamic range available with just 16 different levels, as the lower PWM values are perceived by the human eye to be much easier to distinguish than the higer values.
4. Pack the 12-bit values of two consecutive outputs into three bytes suitable for transmission on the SPI device (it is thus easier to do two LEDs at a time, producing three whole bytes of SPI data for each loop iteration).
As it turns out, these operations together take something like 60 CPU cycles for two LEDs or 20 cycles per SPI output byte, which is another 3500-4000 cycles.

With 8192 cycles total, we cannot afford to first spend 4000 cycles preparing the data, then spend another 4000 cycles sending the data out. So we use a trick. By writing the code in assembler and carefully timing each instruction, we can overlap the two operations, inserting into the data generation algorithm instructions to feed the next byte of data to the SPI at exactly the points in time where they are needed.

In C it might look something like this, with the data generation algorithm on the left and the SPI data output on the right:

```    // Loop over the TLC5940 outputs two at a time
for (o = 127; o >= 0; o = o - 2)
{
// Find index into frame buffer
led_idx = led_map[o];                                           // Output one byte
SPDR = byte1
// Load the 4-bit intensity from the frame buffer
if (led_idx & 1)
intensity = frame_buffer[(offset + led_idx)/2] & 0xf;
else
intensity = frame_buffer[(offset + led_idx)/2] >> 4;

// Loopup the 12-bit PWM value from the intensity.
pwm1 = pwm_lookup[intensity];
// Output one byte
// Same for second output                                       SPDR = byte2
led_idx = led_map[o+1];
if (led_idx & 1)
intensity = frame_buffer[(offset + led_idx)/2] & 0xf;
else
intensity = frame_buffer[(offset + led_idx)/2] >> 4;
pwm2 = pwm_lookup[intensity];

// Pack the two 12-bit PWM values into three SPI bytes
byte1 = pwm1 >> 4;
byte2 = (pwm1 & 0xf) << 4 | (pwm2 >> 8);                        // Output one byte
byte3 = pwm2 & 0xff;                                            SPDR = byte3
```
However, the algorithm is not really expressible in C. The instructions to output the data bytes to the SPI device need to be placed so that the bytes to output have been computed and the timing between them must be just right so that they happen as soon as the SPI has completed the last transfer (but no sooner, or the previous byte will be corrupted). The actual assembler code can be seen here, in the function `shift_out_frame_spi()`. The comments in the function mark the cycles spent on each instruction so that the SPI output can be spaced out with the correct timing.

This way, the code is able to send out data for two TLC5940 outputs in around 63 CPU cycles, or about 4000 cycles total per refresh cycle, leaving half the CPU time free for handling the frames of animation, which is nice. I think this is a rather interesting programming technique. It is a bit like multi-threading, but with the instruction scheduling hardcoded explicitly into the program.

## Serial reception

In addition to generating some animations on the Arduino itself, they can be streamed into the Atmega328 through the serial port. The protocol is mostly the raw binary data in the framebuffer (4 bits per LED), plus a couple of control bytes like start/end marker, frame number, and checksum, to facilitate synchronisation between sender and receiver. If the receiver detects that the frames are not received correctly, it sends back an error byte; the sender notices this and pauses the data for a few frames, and the two ends re-synchronise. This is a simple but effective technique that allows for efficient transfer of the data.

One frame of data on the serial port is 672 bytes inclusive control bytes. At 50 frames of animation per second, that amounts to 33600 bytes or 268800 bits per second. The serial port is run at 500000 bits per second; with the overhead of start and stop bits, that still allows to comfortably transfer the required data at the desired rate.

However, with this rather high data rate, care is needed to be able to process all received bytes sufficiently fast that no data is lost. The Atmega328 has just a single byte receive buffer. At 500kbps, a new byte arrives 50000 times per second, meaning that we have just 320 CPU cycles to process a received byte before it will be lost due to being overwritten by the next byte.

To handle this, a serial receive interrupt is employed. The interrupt is triggered whenever a byte is received by the serial device on the Atmega328, and we need to ensure that it will be serviced within at most 320 CPU cycles. The Atmega328 does not have interrupt priorities, but it does support nested interrupts. Interrupts are automatically disabled whenever an interrupt routine is invoked, but that routine can re-enable interrupts explicitly, and this will allow another nested interrupt to be handled before the first one is completed. Indeed, this is absolute necessary to do in the cube in the refresh timer interrupt, as this runs for several thousand cycles. Nested interrupts work well, but they require a lot of care; race conditions between conflicting interrupts can be quite hard to debug, and one also needs to protect against runaway interrupts (where the same interrupt is invoked recursively and repeatedly on top of itself until the stack is overrun).

With more than 30000 serial interrupts per second, we also want to make the code for the serial interrupt handler as efficient as possible. Unfortunately the AVR architecture does not exactly shine in this respect. Here is how a typical interrupt routine looks as generated by GCC:

```    push    r1
push    r0
in      r0, 0x3f
push    r0
eor     r1, r1
push    r16
push    r17
push    r18
push    r19
push    r20
push    r21
push    r22
push    r23
push    r24
push    r25
push    r26
push    r27
push    r28
push    r30
push    r31

...

pop     r31
pop     r30
pop     r28
pop     r27
pop     r26
pop     r25
pop     r24
pop     r23
pop     r22
pop     r21
pop     r20
pop     r19
pop     r18
pop     r17
pop     r16
pop     r0
out     0x3f, r0
pop     r0
pop     r1
reti
```
That is no less than 40 instructions just as pre/post-ample, most of which take two CPU cycles each.

Of course, in an interrupt routine, we do need to save/restore all registers used. However, most of the invocations of the serial interrupt do not need to use more than a few registers; just enough to grab the next byte from the serial device and put it into the frame buffer. Only for the control bytes at the start and end of a frame do we need more registers for more complex processing. Unfortunately, GCC always generates the code to push and pop all the registers unconditionally, even though some of them are only used in rarely executed code paths (the large number of callee-save registers in the AVR calling convention plays a part of the problem here).

The solution is to write the serial interrupt in hand-optimised assembler. In the fast path, where we are just stuffing a byte into the framebuffer (and computing a checksum on-the-fly, incidentally), we only need to save three registers (plus the condition codes). That all can be done in just 26 instructions. Then in the slow path, the assembler code goes on to push all remaining registers and defer to the more complex processing in a C function.

The actual code can be seen here. The assembler code for the fath path is in `serial_interrupt_rx_naked()`, while the slow path is in the function `serial_interrupt_slow_part()`.

While some of the animations can be computed on-the-fly on-board the Arduino, some of the more complex ones are too much for the puny Atmega328 to handle. The problem usually is not so much processing speed as memory: With a total of just 2048 bytes of RAM, most of which is already reserved for frame buffers and stack space and various global variables, not a lot is left to keep track of stuff like position and velocity of lots of particles in the fireworks animation or similar stuff. Using the serial port, we can generate the animations on a PC and stream them to the cube; however it is also nice to be able to run the cube completely standalone: just plug it into power (or even run it off a battery) and it runs and displays animations on its own, without needing a laptop on tow. Thus the idea was born to pre-compute the animations and read them from an SD-card.

Now, at just 672 bytes per frame of animation, a 4 GB SD-card can store more than one day worth of animation, so this is fine. The problem however is that we are running short on IO as well as on memory. Mostly all available pins are already needed to control the 11 MOSFETs and to communicate with the TLC5940 ICs. Besides, the SPI device, which is normally used to communicate with an SD-card, is needed for communication with the TLC5940s, and the serial device (which can also do SPI) is needed for the USB-to-serial interface. So what can be done?

Let us take a closer look at the Arduino Uno, the top left corner near the USB connector:

What we see here is the Atmega16U2 IC, which comes pre-installed with firmware (based on LUFA to control the USB port and function as USB-to-serial proxy. It is however perfectly possible to modify that firmware to do additional things - such as connect an SD-card! Furthermore, just next to it we have the ICP header - this is normally used to program the flash on the Atmega16U2, but it has the pins for the SPI device, which is just what we need to communicate with an SD-card over SPI. And finally, we even have unsoldered pads with four spare GPIO, one of which we will need as a chip select pin for the SD-card.

So as it turns out, the Arduino Uno is in fact a multi-processor device! The LED-cube exploits this, connecting the SD-card to the ICP pins and spare GPIO of the Atmega16U2 MCU and hacking the USB firmware to handle the SD-card. If there is no activity on the USB port, the hacked firmware will check for the presence of an SD-card with animation data on it; if found, data will be streamed from the SD-card over the serial port to the Atmega328, which will handle the serial data the same, whether it originates from a PC at the other end of the USB, or from the SD card.

Now, using the Atmega16U2 in this way does present some challenges. The Atmega16U2 is only equipped with a meager 512 bytes of RAM, some of which is already needed for LUFA data and so on. The data on SD-cards is read one sector at a time, and a single sector is 512 bytes, already more than the RAM we have left. Most libraries for reading SD-cards and dealing with the FAT filesystem on them is based on reading one sector at a time into a buffer in RAM and processing it there; that just will not work when we have only a few hundred bytes of RAM to spare for the task.

Furthermore, most SD-card/FAT libraries are written in a traditional blocking style. That means, they provide some function you can call to read data from a file on the SD-card. Such function will take a memory buffer (which we do not have the memory for), and it will not return to the caller until all of the requested data has been read, which means waiting at least for one sector to be read. That does not integrate well with the existing USB/LUFA firmware, which runs its own main loop that waits for activity on the USB device and does not return to the main program unless there is some activity to respond to.

To overcome these challenges, I wrote a small event-driven FAT library, seen in `ev_fat.h` and `ev_fat.c`. This library works in a streaming fashion, without any blocking. It never needs to process SD-card data in a memory buffer. Instead, the caller feeds it the bytes read off the SD-card one by one, and the library processes the bytes as they are received, keeping track of its state in a small data structure, and returning status information back to the caller about which sectors from the SD-card to read next.

```    /*
Open a named file in root dir of FAT file system.
Before calling, st->state must be initialised to 0.
Then the function must be repeatedly called until it returns
EV_FILE_ST_DONE or negative error code EV_FILE_ST_E*.

The returned status tells the next action to take, see comments in struct
ev_file_status for details.

When EV_FILE_ST_DONE is returned, the first sector of the file, and the
length in bytes of the file, is returned in st->st_get_block_done.
*/
int
ev_file_get_first_block(const char *filename, struct ev_file_status *st);

/*
After opening a file, this finds the next sector in the file. When calling
this function, st->st_get_block_done must be set to / retain the value set
by the previous call to ev_file_get_first_block() /
ev_file_get_next_block().  After EV_FILE_ST_DONE is returned the new sector
number is then found in st->st_get_block_done.
*/
int
ev_file_get_next_block(struct ev_file_status *st);

/*
This callback is used to stream bytes read as a response to a request
EV_FILE_ST_STREAM_BYTES. Each byte requested must be passed in, in
sequence. The return value is true if no more data needs to be streamed;
in this case it is permissible, but not required, to stop the read early
and not stream the rest of the requested bytes.
*/
int
```

With this library, the reading of the SD-card can be handled completely inside an SPI interrupt routine, without disturbing the LUFA USB code. Each time a byte has been processed in the communication between the Atmega16U2 and the SD-card, the SPI device triggers the SPI interrupt. This interrupt processes any byte received, updates its internal state, and loads the next byte to be processed into the SPI device data register. The interrupt is seen in `ISR(SPI_STC_vect)`. The code handles the protocol to connect to and initialise the SD-card, and then takes care of reading in sectors and passing the bytes to the event-driven FAT library.

When we get to actually read real file data out of the SD-card, we stream it directly to the serial port (where it will be received and processed by the Atmega328), to avoid the need for large memory buffers. The existing firmware already has a small FIFO used to buffer data for sending down the serial line. We re-use that, so that when no data is available from the USB for a few seconds we start filling up the FIFO with data from the SD-card instead. A serial device interrupt is triggered whenever the previous byte has been fully transmitted down the serial line; this interrupt fetches the next byte from the FIFO and loads it into the serial transmit data register. If the SD-card delivers data faster than the 500kbps serial line can transmit, we temporarily pause the SPI communication and resume it once the serial interrupt has made room for more data in the FIFO; the SD-card specifications explicitly mention this as a supported way to operate, precisely to help very small microcontrollers be able to process data without requiring excess buffering capabilities.

The end result is an extended USB firmware that retains all the original functionality (streaming serial data from a PC and so on; even flashing the Atmega328 over the serial port still works). And in addition, if the USB is idle and an SD-card is present, data is instead continuously streamed from the card, allowing completely stand-alone operation of the cube.

The code to handle all this does end up rather intricate, as can be imagined. Apart from the need to write a custom FAT-reading library, the precise timing between the different interrupt handlers end up requiring quite a lot of careful coding and debugging. But in the end, I found that code to be quite an interresting exercise, and fun as well - and this is after all a for-the-fun-of-it type project.

## Calculating the animations

One of the nice thouches of the visuals this LED-cube in particular is the availability of 16 different intensity levels. This allows for some nice effects, such as fading the LEDs in-out to give a warmer, light-bulb-like perception, and using anti-aliasing to greatly reduce the disadvantage of the very limited 11-by-11-by-11 resolution.

All the animations are computed by this C++ program. The code is mostly a lot of math using vector computations, trigonometry, random number distributions, permutations, physics simulations and other nice stuff. The end result is a sequential stream of animation frames that can be send directly to the LED-cube over the serial port, or stored in a file on an SD-card for stand-alone playback.

## Conclusions, and source code

If I were to do this project today, I would probably use an ARM microcontroller like the STM32F4. Such a microcontroller is easily able to handle driving something like this LED-cube without the need for any special tricks due to its much larger memory and performance. But this was just a for-fun project, and it was interesting to see just how much could be squeezed out of the very popular AVR-based Arduino. That is quite a lot, as it turns out.

The nice thing about the LED-cube is: On the one hand it involves lots of tricky programming and advanced technology. On the other hand it has an immediate appeal to many different kinds of people, as is seen whenever we take it on display and it immediately draws the eyes of people passing by. The technology aspect is much harder to appreciate than the visual aspect. I have wanted to write up this article describing the project for some time, in all the gory technical details. I hope a few people will be able to use this write-up to appreciate the technical side as well as the visual side.

All of the code and design files for this project are available on Github under an open source license (GPL):

### No application changes needed: 10 times faster slave with MariaDB 10 parallel replication

Parallel replication is in MariaDB 10.0. I did some benchmarks on the code in 10.0.9. The results are quite good! Here is a graph that shows a 10-times improvement when enabling parallel replication:

The graph shows the transaction per second as a function of number of slave worker threads, when the slave is executing events from the master at full speed. The master binlog was generated with sysbench oltp.lua. When the binlog is enabled on the slave and made crash-safe (```--sync-binlog=1 --innodb-flush-log-at-trx-commit=1```), the slave is about ten times faster at 12 worker threads and above compared to the old single-threaded replication.

These results are for in-order parallel replication. With in-order, transactions are committed on the slave in strictly the same order as on the master, so that applications do not see any differences from using parallel replication. So no changes to the application are needed to use parallel replication; this is just standard sysbench 0.5 with a single table. This makes parallel replication particularly interesting, as it can be used with any existing applications, without the need to eg. split the data into independent schemas as is necessary with out-of-order techniques like the multi-threaded slave feature in MySQL 5.6. It does however make it much harder for the slave to find parallelism to exploit in the events received from the master, which is why it is interesting to see how much improvement can be obtained from normal workloads.

(MariaDB 10.0 does also support out-of-order parallel replication, but that will be the subject of a different article).

The hardware used for the sysbench oltp.lua is the same machine I used to benchmark group commit previously; I am told this is a machine that is typical for a "standard" datacenter server, with decent I/O on a RAID controller with battery-backed-up cache. Sysbench was run with 10 million rows in one table. The mysqld was configured with 16GB buffer pool and 2 times 1.9 gigabyte redo logs. The different graphs are as follows:

• binlog, crash-safe: `--log-slave-updates --sync-binlog=1 --innodb-flush-log-at-trx-commit=1`
• no binlog, durable: `--skip-log-slave-updates --innodb-flush-log-at-trx-commit=1`
• no binlog, non-durable: `--skip-log-bin --innodb-flush-log-at-trx-commit=2`
• binlog, non-crash-safe: `--log-slave-updates --sync-binlog=0 --innodb-flush-log-at-trx-commit=0`

For this test, the master was configured with `--binlog-commit-wait-count=12 --binlog-commit-wait-usec=10000`. This allows the master to delay a transaction up to 10 milliseconds in order to find up to 12 transactions that can commit in parallel; this helps a lot in improving parallel replication performance, since transactions that commit in parallel on the master can be executed in parallel on the slave.

Adding such delay will be acceptable for many applications to speed up the slaves; in fact in my test it did not affect master throughput at all. One attractive option might be to set up monitoring of the slaves, and if they start falling behind, then the commit delay on the master can be temporarily increased, throttling the master a bit while allowing slaves better possibility to catch up.

The other source of parallelism on the slave is that irrespectively of how the transactions were executed on the master, the commit steps of different transactions can always be applied in parallel on the slave. This is particularly effective at improving performance when the commit step is expensive, as happens when a durable, crash-safe configuration is used. This is seen in the benchmark, where the speedup is particularly large when the slave is configured to be crash-safe and durable, to the point where parallel replication almost eliminates any performance penalty for enabling crash-safe binlog on the slaves. But significant speedup is seen in all the configurations.

(In fact, if you look closely, you will see that turning off the binlog ends up decreasing the performance of the slave. This is bug MDEV-5802, and performance should improve when binlog is disabled when this bug is fixed).

I think these are very promising results. I hope this will inspire users to give the new feature a test on real workloads, and share their experiences.

## Exploring the limits of parallel replication

I also wanted to see how the code works for workloads that are not favorable to parallel replication. For this I use sysbench update_index.lua. This benchmark creates transactions with a single primary-key update. With just 10 million rows in the table, the test is in-memory, and the actual work spent for the single-row update is rather small compared to the overhead of reading and parsing binlog events and scheduling the work between the various replication threads. So it is interesting to see if parallel replication can still give some smaller benefit here, or at least not make things much worse.

Here are the results for an update_index.lua with 48 threads for the load on the master. No commit delay was used to help the parallel replication (--binlog-commit-wait-count=0):

As can be seen, even in this workload there is significant improvement from parallel replication. Again, the improvement is particularly high when commit is expensive, due to the possibility to do the commits in parallel and amortise the cost using group commit. But even with binlog disabled and non-durable InnoDB commit, we see some improvement, though only a modest one.

Finally, to test the absolutely worst-case scenario for parallel replication, I created another workload on the master, this time with update_index.lua running with just a single thread. This way, there is absolutely no opportunity for parallel replication to execute the actual transactions in parallel, though there is still some opportunity to speed up the commit step using group commit.

As can be seen, even in this extreme case, we do see some speedup when commits are expensive (durable/crash-safe configuration). But when durability and crash-safety is turned off, and commits are cheap, then the overhead of the extra thread scheduling shows, and the results of parallel replication are worse, if not disastrously so. Note by the way that with this test, the time taken by the slave to replicate the load was smaller than the time for the master to generate it with single-threaded sysbench.

## Conclusions

So overall, I am pleased with the results of the parallel replication code in MariaDB 10.0. Single-threaded applier has been a major limitation in replication for a long, long time, and I feel now fairly confident that this can play an important part in solving that. I hope this will inspire users to start testing their own loads and hopefully getting good results and/or reporting any issues found on the way.

To test, use MariaDB 10.0.9 or later on both master and slave. Configure the slave with `--slave-parallel-threads=20` or something like that. And optionally, if your application can tolerate some extra commit latency, set some reasonable values for `--binlog-commit-wait-count` and `--binlog-commit-wait-usec`.

By the way, the raw numbers for the different runs can be seen in this mail on maria-developers@.

### Using MASTER_GTID_WAIT() to avoid stale reads from slaves in replication

I have just implemented MASTER_GTID_WAIT() in MariaDB 10.0. This can be used to give a very elegant solution to the problem of stale reads in replication read-scaleout, without incuring the overheads normally associated with synchronous replication techniques. This idea came up recently in a discussion with Stephane Varoqui, and is similar to the concept of Lamport logical clock described in this Wikipedia article.

I wanted to describe this, hoping to induce people to test and maybe start using this, as it is a simple but very neat idea, actually.

A very typical use of MariaDB/MySQL replication is for read-scaleout. The application does all updates against a single master, which replicates to a set of slaves. The application can then distribute its reads among the slaves. If most of the database load is from reads, then this is an effective way to scale beyond what a single database server can handle.

The problem of stale reads occurs since MariaDB/MySQL replication is asynchronous. There is some delay from the commit of a change on the master server until that change becomes committed and visible on a slave server. This problem can be quite annoying in many cases, eg. user changes some preferences and reloads the page (read goes to a different slave server which is not caught up); now the change looks as if it was lost.

The idea is to use the MariaDB Global Transaction ID (GTID for short) as a logical clock. When we do an update on the master, we can obtain the associated GTID from `@@LAST_GTID`. Then on the slave, we can execute a `MASTER_GTID_WAIT(<GTID from master>, <timeout>)` before doing a query for which read consistency is critical; this will ensure that the slave will catch up before the query is run, and that no stale read results. Or if we get a timeout, we can try another slave server, which hopefully is not lagging as much behind.

Typically, we would pass around the GTID (the logical clock value) between different parts of the system to maintain consistency where it is needed and avoid stale reads. Eg. in a web application, we would set a cookie with the GTID when an update is made to the database. Then on the next HTTP request, we have the GTID which is needed in `MASTER_GTID_WAIT()`, even if it goes to a completely different web server. Or if we send an email, we can encode the GTID in any contained link, to be sure that it will work whichever slave server it might end up in, possibly in a different data center on a different continent.

By passing around the GTID whenever we communicate between parts of the system, we can ensure that if transaction A is known to have occured before transaction B, then any change made by A will also be visible to B. If there was no communication (direct or indirect), then B cannot know that A came before - and if there was communication, we can avoid stale reads by passing the GTID as part of that communication.

The great thing about this technique is that it is optional. We can use it for just the queries where avoiding stale reads is critical, and only there take the associated penalty (in the form of increased latency). Other parts of the database or application will not be negatively affected. This is much more flexible than using some form of synchronous replication, which will incur some penalty for all parts of the system.

So for applications that require more performance than what is currently possible to get from fully synchronous replication, we can choose consistency in a few critical places where it matters, and go for the performance of asynchronous replication in the rest.

I am hoping that some users will start to do tests with this, and I am very eager to hear any feedback and suggestions for further improvements (the maria-developers@ list is a good place for those). I think this can become a very useful technique in many cases.

By the way, note that it is not necessary to actually switch the replication to GTID to use this technique, as in MariaDB the GTIDs are always generated and replicated, even if they are not used for the slave when connecting to the master. This should make it simple for people to start testing. One will need to run MariaDB 10.0 on both the master and the slave, of course - but this could be achieved by setting up one MariaDB server as a slave of the main system (which could be running MySQL or Percona Server or MariaDB 5.5 or whatever), and then setting up another MariaDB 10.0 slave using the first slave as master.

If this turns out to be something that people find generally useful, it would be interesting to integrate this more tightly into the client protocol and client API. The `@@LAST_GTID` could be sent automatically back with the query results, the same way that `@@LAST_INSERT_ID` is done. And another option could enable the automatic execution of the `MASTER_GTID_WAIT()` call before queries sent to the slave servers. This could make the technique almost transparent for the end application, and could help avoid the overhead of extra client-server roundtrips and extra SQL parsing.

However, that is only worth doing if this turns out to be something that will actually be shown to be useful in practice. So if you want to see that happen eventually, then get testing and send in that feedback!

The code is available in the 10.0 bzr tree now, and will be in the next (10.0.9) release of MariaDB.

In my previous post I wrote about how I achived a >40% speedup on sysbench read-only using profile-guided optimisation (PGO). While this is a preliminary result, I though it was so interesting that it deserved early mention. The fact that any benchmark can be improved that much shows clearly that PGO is something worth looking into. Even if we will probably not improve all workloads by 40%, it seems highly likely that we can obtain significant gains also for many real workloads.

I had one or two interesting comments on the post that raise valid concerns, so I wanted to write a follow-up here, explaining some of the points in more details and going deeper into the performance counter measurements. As I wrote before, actual observations and measurements are crucial to fully understand performance of complex code on modern CPUs. Intuition and guesswork just does not suffice.

## On branch mispredictions

bludwarf suggested that branch mispredictions could be an important part of the picture:

```> It would be interesting to see the hottest spots.... Having to execute just
> a lot of code is not an explanation. If it would be so, then profiler
> feedback won't affect performance so much. I believe the reson for so big
> difference for the results is branch misprediction.
```
I had similar thoughts when I first saw the results, but I was able to verify my explanation with actual measurements of instructions retired and icache misses before and after profile-guided optimisations:
```    171821  INST_RETIRED.ANY
17730  ICACHE.MISSES

166686  INST_RETIRED.ANY
13643  ICACHE.MISSES
```

Before PGO, ICACHE.MISSES/INST_RETIRED.ANY = 10.3%. After, we get 8.2%. So we do get significant less icache misses. I wrote in my previous post some plausible reasons for that with my remarks on basic blocks, though I have not tried to verify those with actual measurements.

I also checked branch mispredictions. The Intel manual has two measures to estimate the cost of branch misprediction. One is time wasted due to speculatively issued uops that do not retire, %Bad_Speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4*CPU_CLK_UNHALTED.THREAD). The other is cycles wasted based on an estimated cost of 20 cycles per misprediction, %BR.MISP.COST = 20 * BR_MISP_RETIRED.ALL_BRANCHES_PS / CPU_CLK_UNHALTED.THREAD. Here are the relevant measurements taken before and after (two sets of measurements, as Sandy Bridge supports only 4 simultaneous performance counters):

```    486641  CPU_CLK_UNHALTED.THREAD
205880  UOPS_RETIRED.RETIRE_SLOTS
226184  UOPS_ISSUED.ANY
7219  INT_MISC.RECOVERY_CYCLES

194616  UOPS_RETIRED.RETIRE_SLOTS
216353  UOPS_ISSUED.ANY
6830  INT_MISC.RECOVERY_CYCLES

1310  BR_MISP_RETIRED.ALL_BRANCHES_PS
%BR.MISP.COST = 5.3%

1147  BR_MISP_RETIRED.ALL_BRANCHES_PS
%BR.MISP.COST = 6.8%
```

Note that the actual branch misprediction is much the same. But with less time spent on other bottlenecks, the estimated relative cost becomes higher.

Most of the branch mispredictions are due to indirect branches (virtual functions and @plt entries in shared libraries). Even without PGO the CPU is able to predict other branches very well, so compiler or __builtin_expect can not help much.

At the end of this post, I have appended the top of some per-function profiles. They show how branch mispredictions are mostly from commonly called library functions like memcpy() (these are mispredictions of the indirect jumps in the PLT), and from virtual function calls. They also show how the icache misses are spread out more or less evenly, correlating strongly with the count of actual instructions executed in various parts of the code.

## On better coverage of the PGO

A comment from Anonymous raised another valid concern:

```> If I read you correctly, you are using the same benchmark for optimizing as
> well as measuring performance increases. You need to have different test and
> training sets or your measurements don't generalize.
```

This is true, of course. The sysbench read-only exercises only an extremely narrow subset of mysqld functionality. The next step, on which I am already working, is to implement a more suitable and general workload for the profile, which can hopefully cover, and hence speedup, a significant portion of common real-life workloads.

More tests will be needed to be sure that real-world gains can be achieved, but I am very hopeful. After all, much of the code of sys-bench read-only is also used in many other workloads, and much of the properties (eg. infrequency of the lots and lots of error check branches) should be similar.

My main concern is that GCC will choose to de-optimise code that is not covered by the test load, to the extent that it significantly harms performance in some cases. But it should not be too much of a problem - after all, GCC will still try to optimise the generated code as best it can, and if the information it has from the profile is incomplete, it should not be much worse than having no information, as is the case without PGO. And over time, we can extend the coverage of the test load to help eliminate any problems that do turn up.

Since the last post, I implemented a first version of a program that generates a test load. For now, I put it on github: gen_profile_load. The program generates a range of different loads, such as simple key lookups, small joins, inserts/updates/deletes, under various conditions such as with/without binlogging, with/without prepared statements, and so on.

I plan to next do some tests with using this for doing the PGO, and checking the effect on sysbench read-only and possibly other benchmarks. Another idea I had was to try to use only parts of the generated load, and test that this does not cause significant regressions on the omitted parts. This could help in getting some idea of what kind of (hopefully small) performance regressions can be expected for workloads that are not well covered by the PGO.

## Detailed profiles

These are pasted from the output of `perf report`. They show the top functions in terms of instructions executed, icache misses suffered, and branches mispredicted.

Instructions retired (INST_RETIRED.ANY), before PGO:

```Samples: 171K of event 'rc0', Event count (approx.): 171821
1,94%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,92%  mysqld  mysqld               [.] my_hash_sort_bin
1,60%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,48%  mysqld  mysqld               [.] malloc
1,36%  mysqld  mysqld               [.] free
1,34%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,31%  mysqld  mysqld               [.] alloc_root
1,30%  mysqld  mysqld               [.] JOIN::optimize_inner()
1,16%  mysqld  mysqld               [.] my_malloc_size_cb_func
1,14%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
1,08%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
1,08%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,00%  mysqld  mysqld               [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*)
0,93%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,90%  mysqld  mysqld               [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long)
0,88%  mysqld  [vdso]               [.] 0x000000000000070c
0,83%  mysqld  mysqld               [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
0,79%  mysqld  libc-2.13.so         [.] __memset_sse2
0,78%  mysqld  libc-2.13.so         [.] __strlen_sse42
0,77%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,77%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,71%  mysqld  mysqld               [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227
0,70%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,70%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,70%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
0,69%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,69%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231
0,68%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,60%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,56%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,50%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,48%  mysqld  mysqld               [.] my_malloc
0,48%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,46%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,46%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,45%  mysqld  mysqld               [.] Item_func::type() const
0,44%  mysqld  [kernel.kallsyms]    [k] update_curr
0,44%  mysqld  [kernel.kallsyms]    [k] copy_user_generic_string
0,44%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,43%  mysqld  mysqld               [.] my_lengthsp_8bit
0,43%  mysqld  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
0,43%  mysqld  [kernel.kallsyms]    [k] init_sync_kiocb.constprop.15
0,41%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,40%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
0,36%  mysqld  mysqld               [.] Item_equal::Item_equal(Item*, Item*, bool)
0,36%  mysqld  mysqld               [.] Item::Item()
0,35%  mysqld  [kernel.kallsyms]    [k] atomic64_dec_and_test
0,35%  mysqld  mysqld               [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result)
0,35%  mysqld  mysqld               [.] JOIN::exec_inner()
0,35%  mysqld  [kernel.kallsyms]    [k] account_entity_enqueue
0,35%  mysqld  [kernel.kallsyms]    [k] do_sys_poll
0,35%  mysqld  [kernel.kallsyms]    [k] system_call
0,34%  mysqld  mysqld               [.] JOIN::destroy()
0,34%  mysqld  mysqld               [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)
```

Icache misses (ICACHE.MISSES), before PGO:

```Samples: 17K of event 'r280', Event count (approx.): 17730
2,21%  mysqld  mysqld               [.] JOIN::optimize_inner()
2,04%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,91%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,42%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,27%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,20%  mysqld  mysqld               [.] malloc
1,09%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
1,07%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,90%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,90%  mysqld  [vdso]               [.] 0x0000000000000771
0,83%  mysqld  mysqld               [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227
0,73%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,72%  mysqld  mysqld               [.] alloc_root
0,68%  mysqld  mysqld               [.] JOIN::exec_inner()
0,65%  mysqld  libc-2.13.so         [.] __memset_sse2
0,64%  mysqld  mysqld               [.] JOIN::destroy()
0,63%  mysqld  mysqld               [.] free
0,59%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,55%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231
0,55%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,54%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,54%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,52%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,51%  mysqld  mysqld               [.] my_malloc_size_cb_func
0,51%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,51%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,50%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
0,47%  mysqld  mysqld               [.] setup_conds(THD*, TABLE_LIST*, List<TABLE_LIST>&, Item**)
0,47%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,47%  mysqld  mysqld               [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)
0,47%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,47%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,46%  mysqld  mysqld               [.] trx_commit(trx_t*)
0,44%  mysqld  [kernel.kallsyms]    [k] fget_light
0,43%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
0,43%  mysqld  [kernel.kallsyms]    [k] system_call
0,42%  mysqld  mysqld               [.] JOIN::cleanup(bool)
0,41%  mysqld  mysqld               [.] handle_select(THD*, LEX*, select_result*, unsigned long)
0,41%  mysqld  mysqld               [.] mtr_commit(mtr_t*)
0,41%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
0,41%  mysqld  mysqld               [.] my_hash_sort_bin
0,40%  mysqld  mysqld               [.] open_table(THD*, TABLE_LIST*, st_mem_root*, Open_table_context*)
0,39%  mysqld  mysqld               [.] buf_page_get_known_nowait(unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)
0,38%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,38%  mysqld  mysqld               [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*)
0,37%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
0,37%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
0,36%  mysqld  mysqld               [.] add_key_field(JOIN*, key_field_t**, unsigned int, Item_func*, Field*, bool, Item**, unsigned int, unsigned long long, st_sargable_param**)
0,36%  mysqld  mysqld               [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**)
0,35%  mysqld  [kernel.kallsyms]    [k] update_rq_clock
0,34%  mysqld  mysqld               [.] do_select(JOIN*, List<Item>*, TABLE*, Procedure*)
0,33%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
0,33%  mysqld  mysqld               [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result)
0,33%  mysqld  [kernel.kallsyms]    [k] __schedule
0,33%  mysqld  mysqld               [.] Item::Item()
0,32%  mysqld  mysqld               [.] thd_ha_data
```

Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), before PGO:

```Samples: 1K of event 'rc5:p', Event count (approx.): 1310
7,02%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
3,66%  mysqld  mysqld               [.] Item_func::type() const
3,05%  mysqld  libc-2.13.so         [.] __memset_sse2
2,67%  mysqld  mysqld               [.] Field_long::type() const
2,52%  mysqld  libc-2.13.so         [.] __strcmp_sse42
2,06%  mysqld  mysqld               [.] Item_field::used_tables() const
2,06%  mysqld  mysqld               [.] Item_param::used_tables() const
2,06%  mysqld  libc-2.13.so         [.] __strlen_sse42
1,76%  mysqld  librt-2.13.so        [.] clock_gettime
1,68%  mysqld  mysqld               [.] Item::cmp_type() const
1,60%  mysqld  mysqld               [.] st_select_lex::master_unit()
1,37%  mysqld  mysqld               [.] Item_equal::functype() const
1,30%  mysqld  mysqld               [.] Item::const_item() const
1,30%  mysqld  mysqld               [.] Item::real_item()
1,07%  mysqld  mysqld               [.] Field_str::charset() const
0,92%  mysqld  mysqld               [.] Protocol::end_statement()
0,92%  mysqld  mysqld               [.] Item_field::type() const
0,84%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,84%  mysqld  mysqld               [.] Item_field::cleanup()
0,84%  mysqld  mysqld               [.] Item_field::field_type() const
0,84%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
0,76%  mysqld  mysqld               [.] Item_param::save_in_field(Field*, bool)
0,76%  mysqld  mysqld               [.] my_hash_sort_bin
0,69%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
0,69%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
0,69%  mysqld  mysqld               [.] scheduler_wait_net_end
0,69%  mysqld  mysqld               [.] Item_param::val_int()
0,69%  mysqld  mysqld               [.] Item::check_cols(unsigned int)
0,69%  mysqld  mysqld               [.] Item_field::result_type() const
0,61%  mysqld  mysqld               [.] Field_long::pack_length() const
0,61%  mysqld  [kernel.kallsyms]    [k] system_call
0,53%  mysqld  mysqld               [.] Item::is_expensive()
0,53%  mysqld  mysqld               [.] st_select_lex::outer_select()
0,53%  mysqld  mysqld               [.] do_handle_one_connection(THD*)
0,53%  mysqld  mysqld               [.] Explain_select::~Explain_select()
0,53%  mysqld  mysqld               [.] Field_num::result_type() const
0,53%  mysqld  mysqld               [.] Item_param::result_type() const
0,53%  mysqld  mysqld               [.] Item_param::field_type() const
0,53%  mysqld  mysqld               [.] Item_equal::count_sargable_conds(unsigned char*)
0,53%  mysqld  mysqld               [.] Item_func_eq::functype() const
0,53%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
0,53%  mysqld  [kernel.kallsyms]    [k] __pollwait
0,46%  mysqld  mysqld               [.] net_before_header_psi(st_net*, void*, unsigned long)
0,46%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,46%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
0,46%  mysqld  mysqld               [.] handler::index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)
0,46%  mysqld  mysqld               [.] Item_equal::select_optimize() const
0,46%  mysqld  mysqld               [.] Item_func::used_tables() const
0,46%  mysqld  mysqld               [.] ha_innobase::info(unsigned int)
0,46%  mysqld  libc-2.13.so         [.] __memmove_ssse3
0,46%  mysqld  libc-2.13.so         [.] __memcmp_sse4_1
0,38%  mysqld  mysqld               [.] Protocol_binary::store(char const*, unsigned long, charset_info_st const*)
0,38%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
```

Instructions retired (INST_RETIRED.ANY), after PGO:

```Samples: 166K of event 'rc0', Event count (approx.): 166686
2,55%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,79%  mysqld  mysqld               [.] my_hash_sort_bin
1,60%  mysqld  mysqld               [.] alloc_root
1,48%  mysqld  mysqld               [.] malloc
1,42%  mysqld  mysqld               [.] my_malloc_size_cb_func
1,42%  mysqld  mysqld               [.] free
1,29%  mysqld  libc-2.13.so         [.] __memset_sse2
1,20%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
1,15%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,15%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,13%  mysqld  mysqld               [.] JOIN::optimize_inner()
1,12%  mysqld  mysqld               [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long)
1,07%  mysqld  [vdso]               [.] 0x00000000000006a1
1,03%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
0,99%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,94%  mysqld  libc-2.13.so         [.] __strlen_sse42
0,83%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
0,80%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,78%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,78%  mysqld  mysqld               [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*)
0,75%  mysqld  mysqld               [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
0,69%  mysqld  mysqld               [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*)
0,68%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183
0,68%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,63%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,62%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,59%  mysqld  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
0,56%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
0,56%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,52%  mysqld  [kernel.kallsyms]    [k] update_curr
0,52%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,52%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,51%  mysqld  mysqld               [.] my_malloc
0,46%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,46%  mysqld  [kernel.kallsyms]    [k] copy_user_generic_string
0,45%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,43%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,43%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,42%  mysqld  mysqld               [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*)
0,41%  mysqld  [kernel.kallsyms]    [k] account_entity_enqueue
0,39%  mysqld  [kernel.kallsyms]    [k] init_sync_kiocb.constprop.15
0,39%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
0,39%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,38%  mysqld  mysqld               [.] my_lengthsp_8bit
0,38%  mysqld  [kernel.kallsyms]    [k] system_call
0,38%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,37%  mysqld  mysqld               [.] Item::Item()
0,36%  mysqld  [kernel.kallsyms]    [k] atomic64_dec_and_test
0,35%  mysqld  librt-2.13.so        [.] clock_gettime
0,34%  mysqld  mysqld               [.] JOIN::cleanup(bool)
0,34%  mysqld  mysqld               [.] Item_func::type() const
0,33%  mysqld  mysqld               [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long)
0,33%  mysqld  mysqld               [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
0,33%  mysqld  [kernel.kallsyms]    [k] __update_tg_runnable_avg.isra.25
```

Icache misses (ICACHE.MISSES), after PGO:

```Samples: 13K of event 'r280', Event count (approx.): 13643
2,07%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,39%  mysqld  mysqld               [.] malloc
1,33%  mysqld  mysqld               [.] JOIN::optimize_inner()
1,26%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,17%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,08%  mysqld  [vdso]               [.] 0x0000000000000864
1,07%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,06%  mysqld  libc-2.13.so         [.] __memset_sse2
1,01%  mysqld  mysqld               [.] free
0,92%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,80%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,76%  mysqld  mysqld               [.] alloc_root
0,73%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,71%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,69%  mysqld  mysqld               [.] my_hash_sort_bin
0,67%  mysqld  [kernel.kallsyms]    [k] fget_light
0,63%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,62%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,59%  mysqld  mysqld               [.] Item::Item()
0,57%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183
0,52%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
0,51%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,51%  mysqld  mysqld               [.] my_malloc_size_cb_func
0,50%  mysqld  mysqld               [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*)
0,48%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,47%  mysqld  mysqld               [.] free_root
0,47%  mysqld  [kernel.kallsyms]    [k] idle_cpu
0,47%  mysqld  [kernel.kallsyms]    [k] unix_stream_sendmsg
0,46%  mysqld  [kernel.kallsyms]    [k] system_call
0,44%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,43%  mysqld  mysqld               [.] ha_innobase::info_low(unsigned int, bool)
0,43%  mysqld  mysqld               [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
0,43%  mysqld  librt-2.13.so        [.] clock_gettime
0,42%  mysqld  mysqld               [.] JOIN::init(THD*, List<Item>&, unsigned long long, select_result*)
0,42%  mysqld  mysqld               [.] Item_field::used_tables() const
0,41%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,40%  mysqld  mysqld               [.] Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)
0,40%  mysqld  [kernel.kallsyms]    [k] emulate_vsyscall
0,40%  mysqld  mysqld               [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long)
0,40%  mysqld  [kernel.kallsyms]    [k] update_rq_clock
0,39%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,38%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
0,37%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,37%  mysqld  mysqld               [.] Item_field::fix_fields(THD*, Item**)
0,37%  mysqld  mysqld               [.] my_malloc
0,37%  mysqld  [kernel.kallsyms]    [k] __schedule
0,36%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,36%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
0,35%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,35%  mysqld  mysqld               [.] Prepared_statement::execute(String*, bool)
0,34%  mysqld  mysqld               [.] thd_ha_data
0,34%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
0,34%  mysqld  [kernel.kallsyms]    [k] do_sys_poll
0,34%  mysqld  mysqld               [.] st_select_lex::master_unit()
```

Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), after PGO:

```Samples: 1K of event 'rc5:p', Event count (approx.): 1147
8,37%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
3,84%  mysqld  mysqld               [.] Item_func::type() const
2,88%  mysqld  libc-2.13.so         [.] __memset_sse2
2,35%  mysqld  mysqld               [.] st_select_lex::master_unit()
2,18%  mysqld  mysqld               [.] Item_param::used_tables() const
2,01%  mysqld  mysqld               [.] Field_long::type() const
2,01%  mysqld  libc-2.13.so         [.] __strlen_sse42
1,92%  mysqld  libc-2.13.so         [.] __strcmp_sse42
1,39%  mysqld  mysqld               [.] Item::const_item() const
1,31%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
1,22%  mysqld  mysqld               [.] handler::index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function)
1,13%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
1,13%  mysqld  mysqld               [.] end_send(JOIN*, st_join_table*, bool)
1,05%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
1,05%  mysqld  mysqld               [.] Item_equal::functype() const
1,05%  mysqld  mysqld               [.] Item_bool_func2::cleanup()
0,96%  mysqld  mysqld               [.] Item::is_expensive()
0,96%  mysqld  mysqld               [.] Item_field::type() const
0,96%  mysqld  mysqld               [.] Item_param::type() const
0,87%  mysqld  mysqld               [.] Item::real_item()
0,78%  mysqld  mysqld               [.] Item_field::used_tables() const
0,78%  mysqld  mysqld               [.] scheduler_wait_net_end
0,78%  mysqld  librt-2.13.so        [.] clock_gettime
0,70%  mysqld  mysqld               [.] Statement::set_statement(Statement*)
0,70%  mysqld  mysqld               [.] JOIN::optimize()
0,70%  mysqld  mysqld               [.] Field_num::size_of() const
0,70%  mysqld  mysqld               [.] Field_str::charset() const
0,70%  mysqld  mysqld               [.] Item_field::cleanup()
0,61%  mysqld  mysqld               [.] select_result::initialize_tables(JOIN*)
0,61%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,61%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
0,61%  mysqld  mysqld               [.] JOIN::optimize_inner()
0,61%  mysqld  mysqld               [.] Field_str::decimals() const
0,61%  mysqld  mysqld               [.] handler::rebind_psi()
0,61%  mysqld  mysqld               [.] Item_field::field_type() const
0,61%  mysqld  mysqld               [.] Item::cmp_type() const
0,52%  mysqld  mysqld               [.] net_before_header_psi(st_net*, void*, unsigned long)
0,52%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,52%  mysqld  mysqld               [.] Field::new_key_field(st_mem_root*, TABLE*, unsigned char*, unsigned char*, unsigned int)
0,52%  mysqld  mysqld               [.] Field::send_binary(Protocol*)
0,52%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
0,52%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,52%  mysqld  [vdso]               [.] 0x0000000000000630
0,44%  mysqld  mysqld               [.] Protocol::send_eof(unsigned int, unsigned int)
0,44%  mysqld  mysqld               [.] Item::has_subquery() const
0,44%  mysqld  mysqld               [.] Item::mark_as_condition_AND_part(TABLE_LIST*)
0,44%  mysqld  mysqld               [.] select_send::send_eof()
0,44%  mysqld  mysqld               [.] select_send::send_data(List<Item>&)
0,44%  mysqld  mysqld               [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**)
0,44%  mysqld  mysqld               [.] do_handle_one_connection(THD*)
0,44%  mysqld  mysqld               [.] Field_str::repertoire() const
0,44%  mysqld  mysqld               [.] handler::cancel_pushed_idx_cond()
0,44%  mysqld  mysqld               [.] Item_field::get_item_equal()
0,44%  mysqld  mysqld               [.] Item::check_cols(unsigned int)
0,44%  mysqld  mysqld               [.] Item_field::fix_fields(THD*, Item**)
```

Continuing my investigation of single-threaded performance in the MariaDB server, I managed to increase throughput of single-threaded read-only sysbench by more than 40% so far:

I use read-only sysbench 0.4.12 run like this:

```    sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx run
```

And mysqld is run with minimal options:

```    sql/mysqld --no-defaults --basedir=X --datadir=Y --innodb-buffer-pool-size=128M
```

With modern high-performance CPUs, it is necessary to do detailed measurements using the built-in performance counters in order to get any kind of understanding of how an application performs and what the bottlenecks are. Forget about looking at the code and counting instructions or cycles as we did in the old days. It no longer works, not even to within an order of magnitude.

I am using the Linux `perf` program for this. During my invistigations, I found that the main bottleneck in the benchmark turns out to be instruction cache misses. Here is an example measurement:

CounterValue
INST_RETIRED.ANY170431
ICACHE.MISSES17194

So 10% of executed instructions missed the level 1 instruction cache ("icache"). That is bad. The Intel Optimization Manual says this about the ratio ICACHE.MISSES/INST_RETIRED.ANY:

Anything over 1% of instructions retired can be a significant issue.

So we are 10 times worse than "a significant issue".

Instruction cache misses cause a bottleneck in the frontend of the CPU - where x86 instructions are fetch, decoded, and supplied to the micro-op queue to be scheduled for the out-of-order dataflow backend. To get an idea about how badly bottlenecked we actually are in the frontend in this benchmark, we can use another measure from the Intel manual:

This ratio estimates the percentage of time in which the front-end is not able to deliver instructions sufficiently fast for the backend to work at full speed. In this case, for more than 80% of the time, the CPU would have been able to do more work in the same time if only instructions could be fetch and decoded faster.

Note the importance of this analysis in knowing what areas to work on to improve performance. Roughly speaking, 80% of our time is spent waiting for the icache. We could have spent years on improving data locality or branch mispredictions or whatever in the code, and we would never have been able to get very big improvements. Now, knowing where the sore spot is, we can spend our efforts where it matters the most.

In the case of icache misses, the best place to start is with profile-guided optimisation. This works by first running a specially instrumented mysqld binary to collect data about the load during the benchmark. Then the binary is recompiled, where the compiler can take into account the information obtained to better optimise the generated code.

Using information about how the program actually behaves when run can help the compiler lay out the basic blocks of the program to benefit the most common execution paths. This can help reduce the footprint in the icache by reducing jumps into or out of the middle of a 16-byte instruction stream (such jumps waste the rest of the 16-byte stream, which is the unit of storage of the icache). It can also increase the length of straight-line code execution paths, which should help the hardware prefetcher keep the level 1 icache supplied.

So that is the theory - but does it work in practice? It turns out that it does. First I compile with `--coverage` added to the CFLAGS/CXXFLAGS. Then I run sysbench to generate the profile data. Then I recompile adding instead the `--profile-use` flag. That is all there is to it, GCC handles everything else automatically.

Re-running the benchmark with the optimised binary yields a large speedup:

BinaryQueries per secondSpeedup
Base: `-O3`21404
PGO: `-O3 --profile-use`3090344%

So a 44% speedup just from compiling with different optimisation, not bad! (The actual server-side improvement is in fact even higher, as the sysbench client consumes part of the runtime due to the single-threaded nature of the test).

By the way, that 44% speedup comes from just a modest reduction in icache miss rate - from 10% down to 8%. It just shows how expensive those icache misses are. Fetching from L2 cache takes something like 12 cycles, and during each of those cycles the CPU will be able to do nothing. Given that we could potentially execute 4 operations in the backend each of those clocks, that is a lot of waste. So with 8% misses still left there is room for further huge improvements, though those improvements will likely be much harder to achieve than just fiddling with compiler options.

I think it is clear that we will need to start using profile-guided optimisation for future MariaDB binaries. This will be some work, as we need to develop a set of typical workloads to optimise after, and integrate running of those into the compilation. We will need to cover a range of different common usages to optimise after. And more tests with a wider range of benchmarks will be needed to ensure that the gain is more generally applicable and does not cause significant regressions in performance of other parts of the code. With such a huge improvement in this test I am confident that things can be generally improved, but it still needs proper testing.

My work on improving single-threaded performance will continue, as time permits, and I certainly expect more good results along the way (several patches are already in the pipeline). But I thought this one was so interesting that it was worth mentioning to a wider audience.

I took a quick look at MariaDB 10.0 single-treaded performance (simple read-only sysbench). One thing immediately leaps to the eye, and I thought it worthy of mention. It contains an important lesson about the use of synchronisation primitives and in particular "atomic operations" in MariaDB (and MySQL).

I am using the Linux `perf` tool on this sysbench command:

```  sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx
```
Look at the top offender in the output from `perf report`:
```  1,54%  mysqld  mysqld               [.] set_thread_state_v1
```
The only thing this does is set a string for SHOW PROCESSLIST (and the like) about what the thread is doing. And we are spending a whopping 1.5% of the total time doing this.

And why? That becomes clear when looking at the disassembly (extract):

```  2d:   mov    0x7b0(%rbx),%eax
33:   mov    %eax,%edx
lock   cmpxchg %eax,0x7b0(%rbx)
jne    33
and    \$0xfffffffc,%edx
xchg   %edx,0x7b0(%rbx)
mov    %rbp,0x7b8(%rbx)
mov    %ecx,0x7c0(%rbx)
mov    0x7b0(%rbx),%eax
5e:   mov    %eax,%edx
lock   cmpxchg %eax,0x7b0(%rbx)
jne    5e
and    \$0xfffffffc,%edx
xchg   %edx,0x7b0(%rbx)
```
No less than two locked cmpxchg instructions. Each of these implies a full (memory barrier), which is really expensive.

To achieve their very high performance, modern CPUs process many instructions at the same time, a hundred or more is possible. But for a full memory barrier, the CPU needs to stop and ensure that all pending store instructions have fully reached the last-level cache. Only then can it start processing any following instructions involving a load. This is not something we should be doing every time a thread moves to a new stage of query processing.

I am not really familiar with the performance schema code (nor am I sure I want to be). But let us make some guess at what the code is trying to do here. Here is the code from which the first locked instruction originates:

```  /**
Execute an allocated to dirty transition.
This transition should be executed by the writer that owns the record,
before the record is modified.
*/
void allocated_to_dirty(void)
{
uint32 new_val= (copy & VERSION_MASK) + PFS_LOCK_DIRTY;
/* We own the record, no need to use compare and swap. */
PFS_atomic::store_u32(&m_version_state, new_val);
}
```
So perhaps the idea is that only the thread itself can modify its state, but other threads can read it. And we want the update to be fast (I suppose updating the state is more important than inspecting it). So the developers wanted to avoid a lock that could block the writer. Instead, it sets a flag to mark that the data is being modified before the update. And clears the flag after. And a reader can then check the flag; if the flag was modified during a read, the data is potentially inconsistent, and the read can be re-tried.

Now, this really is poorly designed. The `PFS_atomic::load_u32()` is implemented using `my_atomic_load32()` and `my_atimic_store32()`, which is inappropriate here. The `my_atomic()` facility implies a full memory barrier in all of the operations, which makes them somewhat less useful for performance-critical code.

All we need here is that the setting of the update flag becomes visible before changes to the state, and that the clearing of the flag becomes visible after. Thus it should be sufficient with a pair of write memory barriers in the writer (and corresponding read memory barriers in the reader).

The write and read memory barriers are much less expensive than a full barrier. All they do is prevent a store (respectively a load) before the barrier to be satisfied before a store (load) after the barrier. There is no need for the CPU to stop execution. In fact, on x86, there is usually nothing special to do, because stores and loads are never re-ordered unless the special non-temporal memory instructions are used.

But look closer at the code:

```static void set_thread_state_v1(const char* state)
...
int state_len= state ? strlen(state) : 0;
pfs->m_processlist_lock.allocated_to_dirty();
pfs->m_processlist_state_ptr= state;
pfs->m_processlist_state_length= state_len;
pfs->m_processlist_lock.dirty_to_allocated();
```
So it is storing a pointer to a string along with the length, and it needs to do synchronisation to ensure that the two are consistent with each other. But suppose instead that we store only the pointer. This can be done atomically without _any_ need for memory barriers or other synchronisation (as long as the memory location is naturally aligned). That would seem to be even better than using the (cheaper) write- and read-barriers.

To get a better idea of how bad the use of full memory barriers is for performance, I wrote a quick and dirty test program. It has code similar to what is in `set_thread_state_v1()`, as well as three other variants of it: One using just a write memory barrier, one using no memory barrier (not SMP safe, but useful for comparison), and one using normal pthread mutexes rather than a lock-free algorithm. Here is the time taken for 100,000,000 iterations of the function:

I think there is an important lesson here about our use of synchronisation primitives in MariaDB/MySQL. While there is a lot of focus on mutex contention issues, and some various parts starting to use lock-free algorithms, there does not seem to be a general understanding of issues around memory barriers among the developers. In fact to my knowledge there is no general facilities for explicit memory barriers. We have `my_atomic.h`, which confuses the issues of "atomicity" and "memory barriers" and defines a handful of "atomic" operations that are really locked instructions implying a full memory barrier, as in the example above. As the test program shows, it is really important to distinguish between different kinds of memory barriers and understand the performance implications of the use of each. I have always been really unhappy about `my_atomic.h`, and I think this is an area that we need to improve significantly in.