Kristian Nielsen's journal
February 14th, 2015
03:56 pm


Understanding skew factors in Simplex/Improved Perlin Noise

[Here is a PDF version for readers whose browser does not understand MathML.]

The Simplex Noise (or Improved Perlin Noise) algorithm uses a somewhat mysterious "skew factor" of $\frac{\sqrt{3}-1}{2}$. I did not find any really satisfactory explanation for this factor in the descriptions of Simplex Noise that I read. But I managed to work it out nevertheless, which I thought was a fun exercise and worth a quick write-up.

Simplex noise is constructed by assigning random values to each point in a simplex grid. The simplex grid is a tiling of the plane using rhombuses, each rhombus consisting of two equilateral triangles. See the figure on the right.

Given a point $\left(x,y\right)$ (expressed in normal rectangular coordinates), we first transform the coordinates into $\left(u,v\right)$ expressed in the simplex grid. Then we take the integer parts of u and v to find the corners of the containing equilateral triangle, and take the random values assigned to these corners. The "noise value" of the original point $\left(x,y\right)$ is then some suitable interpolation of these values with respect to the distance from $\left(x,y\right)$ to each corner.

The implementation of this algorithm is explained in detail in several places. The code to transform into and back out of the simplex grid might look like this:

```
final double F2 = 0.5*(Math.sqrt(3.0)-1.0);
double s = (xin+yin)*F2;
int u = fastfloor(xin+s);
int v = fastfloor(yin+s);
final double G2 = -(3.0-Math.sqrt(3.0))/6.0;
double t = (u+v)*G2;
double X0 = u+t;
double Y0 = v+t;
```
So the question is, where do these funny factors F2 and G2 come from?

To understand this, let us first consider the general form of the transformation from simplex coordinates $\left(u,v\right)$ in the grid spanned by $\stackrel{\to }{u}$ and $\stackrel{\to }{v}$ to the rectangular coordinates $\left(x,y\right)$. It is

$x=au+bv$
$y=cu+dv$
where $\stackrel{\to }{u}= \left(a,c\right)$ and $\stackrel{\to }{v}=\left(b,d\right)$. So this requires 4 multiplications in the general case.

However, we can freely choose which simplex grid to use! So we can try to choose one that reduces the computational work needed.

First, we can choose the orientation of the grid so that $\stackrel{\to }{u}$ and $\stackrel{\to }{v}$ are symmetric around the diagonal $x=y$. Then $a=d$ and $b=c$, so we can write the transformation as

$x=\left(a-b\right)u+b\left(u+v\right)$
$y=\left(a-b\right)v+b\left(u+v\right)$
Second, we can choose the scale of the grid so that $\left(a-b\right)=1$, and then we get simply
$x=u+t$
$y=v+t$
where $t=b\left(u+v\right)$. This simpler form requires only a single multiplication.

This is exactly the form we see in the above code snippet, with G2 being the name for the constant $b$. We can work out from this that the vectors that span the grid used by the code are

$\stackrel{\to }{u}= \left(1-\frac{3-\sqrt{3}}{6},-\frac{3-\sqrt{3}}{6}\right)$
$\stackrel{\to }{v}= \left(-\frac{3-\sqrt{3}}{6},1-\frac{3-\sqrt{3}}{6}\right)$
The interested reader is encouraged to check that $\stackrel{\to }{u}$, $\stackrel{\to }{v}$, and $\stackrel{\to }{u}+\stackrel{\to }{v}$ all have the same length, so that the triangles that form the grid indeed end up being equilateral.

Working out the inverse of the transformation (again a good exercise for the interested reader) leads to another transformation of the same simple form, this time with the constant $F2 =\frac{\sqrt{3}-1}{2}$ as seen in the code.

So I believe this is the explanation of the mysterious factors F2 and G2 in the example code found by Google searches on the net. They arise from the choice of the simplex grid to use. This choice is made to make the associated coordinate transformation use only a single multiplication, rather than the four required in the general case. This makes the implementation of the algorithm more efficient. Pretty clever, right?

March 29th, 2014
12:11 am


Arduino to the max: 11x11x11 LED-cube

March 29, 2014 is Arduino Day, also in Labitat. This is a good opportunity to describe my LED-cube:

This LED-cube pulls a number of tricks to get the most out of just a single normal Arduino Uno. A meager 16 MHz and 2048 bytes of RAM go a long way with sufficient ingenuity and creativity. Here are some highlights:

• 12-bit PWM with 16 non-linear grayscale levels.
• Animations generated on-board, read from SD-card, or streamed over USB.
• 178 Hz refresh rate, transferring 3 Mbit/s of data to the LED driver shift registers.
• 50 Hz animation framerate, receiving 269 kbit/s of animation data over the serial port.
• Approximately half of the CPU time left free for on-board generation of animations.
• Multi-processor Arduino! Uses the Atmega16U2 USB-handling chip on the Arduino Uno as a second processor, reading animations from an SD-card.
• Hand-crafted assembler to speed up time-critical parts.

## LED-cube basics

One of the first questions I get when I show the LED-cube to people is: How is it possible to turn individual LEDs on and off? After all, it looks like everything is connected to everything. Here is how it works.

The LEDs in the cube are organised into columns. The 11x11x11 cube thus has 11*11=121 columns, each containing 11 LEDs. Let us first see how a single column works:

All the cathodes are tied together and connected to ground. Each anode can be individually connected to the supply voltage using one of the switches. This makes it possible to turn on any single LED by closing the appropriate switch.

Now let us see how it works with multiple columns:

The anodes of one LED from each column are all tied together and connected by a single switch to the supply voltage. Each column now has its own switch connecting it to ground. Now, to turn on e.g. D5, we would close SW2 and SW5. To turn on D8 as well, we would additionally close SW6.

But what if we want to turn on D1 and D5? For this, we need to close all of SW1, SW2, SW4, and SW5. But that also turns on D2 and D4. Thus, to control every LED independently, we need something more.

The answer is multiplexing. We only ever have one of SW1, SW2, and SW3 turned on at a time. This lets us individually control each LED in one layer of the columns using the bottom column switches. Shortly after (half a millisecond in the 11x11x11 LED-cube), we turn off that switch and turn on the switch for the next layer, at the same time flipping the column switches as appropriate for the LEDs at that layer. By doing this quickly enough, the human eye is too slow to perceive any flickering, and we get the illusion that each LED is turned on or off independently, at will.

Using multiplexing, the required number of switches and connections is greatly reduced. In the 11x11x11 cube, we need only 11*11=121 switches for the columns, plus an additional 11 switches, one for the anodes in each horizontal layer. ("Only", compared to 11*11*11=1331.) In the cube structure, the column connections are the vertical structures, and the horizontal structures connect the anodes in each layer to a switch through 11 extra columns at the back of the cube.

Soldering the LED structure requires some soldering skills, but with 1331 LEDs to do, such skill will be naturally acquired before the end of the project. Most of all, patience is needed. There are several good pages on the internet describing in detail how the LEDs are soldered together for an LED-cube; here is one, for example.

The layer/anode switches are implemented using 11 P-channel MOSFETs, controlled directly from 11 GPIO pins on the Arduino. The column switches are implemented using 8 TLC5940 LED driver ICs. The TLC5940 has the additional benefit of being able to accurately control how much current each LED receives, as well as being able to dynamically adjust the LED brightness using 12-bit PWM (4096 intensity levels).

## Electronics

The electronics sit on a 20cm-by-20cm PCB located below the base of the LED-cube. The picture shows the PCB from the bottom side; the LED columns (and the connection to the 11 layers) enter through the yellow holes from the top, are bent 90 degrees, and soldered to the copper pads.

The 8 ICs spaced around the middle are the TLC5940 LED drivers, connected to each LED pad. The connections are made for ease of PCB layout, so the software has a lookup table that maps each LED column to the TLC5940 output to which it is connected. Along the right of the board are seen the 11 P-channel MOSFETs that supply the voltage for each layer, one at a time.

At the very right edge is seen the connector to the GPIO pins on the Arduino Uno. Originally, the cube was controlled by an actual Arduino Uno device, connected to the cube with a ribbon cable. But the current version uses instead a custom control PCB with a small, narrow footprint that just fits below the base of the LED-cube. This way all the electronics is hidden away below the very narrow base, resulting in a very nice elegant presentation of the cube:

This control PCB is similar to an Arduino Uno, but it is stripped of everything not needed for the cube, and has a micro-SD card slot and level-shifter soldered directly on the board (originally a small external micro-SD board was connected to the Arduino Uno). The micro-SD slot (not shown on the drawing) and the level-shifter are in the middle. To the left is the USB connector and the Atmega16U2 chip that handles the USB. To the right is the Atmega328 that runs the main code for the LED-cube. The connections to the main PCB are soldered to the pads along the right and top-right.

The PCBs were laid out with KiCAD; see the end of this article for links to source code. The main board was manufactured with the PCB CNC milling machine we have at Labitat. The small Arduino-clone PCB was manufactured by the SeeedStudio Fusion PCB service.

## Software

The LED-cube has 1331 LEDs, each of which can be PWM'ed for 16 different levels of brightness, from 0 (turned off) to 15 (fully on). The state of all the LEDs is stored in a framebuffer; with 4 bits per LED that amounts to 666 bytes. To avoid flicker, double-buffering is needed. With 2kByte of memory, the Arduino has just room for two framebuffers, with a bit of memory to spare for the rest of the activities.

The software has two main tasks to control the LED-cube:

1. Every 20 milliseconds, load one of the frame buffers with the next frame of the animation being played, for an animation framerate of 50 per second.
2. Every 500 microseconds, turn on the LEDs in one layer of the cube with the appropriate intensity, by programming the TLC5940 driver chips with the correct PWM value and turning on the power to that layer through the next P-channel MOSFET; resulting in a refresh rate for the entire 11-layer cube of around 178 Hz.

The loading of animation frames into the framebuffers can happen from three sources: They can be loaded from an SD-card, they can be streamed from a PC over the USB port, or they can be generated by code on-board the Arduino itself. The refresh of the LED intensities happens in a timer interrupt, using the SPI device on the Atmega328 to communicate with the TLC5940 driver ICs. The individual steps are detailed in the following sections.

## Communicating with the TLC5940 LED-driver ICs.

The TLC5940 LED-drivers need an external PWM clock; this is supplied from the Arduino from a timer in PWM mode. An 8 MHz PWM clock is used; this is the fastest clock that the 16 MHz Atmega328 can generate. One PWM cycle is 4096 PWM clocks, as the TLC5940 uses 12-bit PWM. It is important to always let the TLC5940s run an integral number of PWM cycles; otherwise the brightness of the LEDs will be off. So a timer interrupt runs every 4096 PWM clocks (one full PWM cycle) to trigger the TLC5940s; this works out to about 2 kHz for switching each layer on and off, or a 178 Hz refresh rate for all 11 layers. The TLC5940 ICs employ a latch, so the timer interrupt first latches the new data to get stable PWM timing, then goes to work sending out new data to the driver ICs, ready to be latched at the next timer interrupt.

The 8 TLC5940 ICs have a total of 128 outputs, each of which needs 12 bits of PWM data. That amounts to 1536 bits, or 192 bytes, of data to send to the LED drivers in each refresh cycle. One refresh cycle has a total of 8192 CPU clock cycles, and we need to leave as much of those as possible for generating / downloading the next frame of animation, so this step is carefully optimised for speed.

The data is shifted into the TLC5940s using data and clock pins, so we can use the Atmega328 SPI device in master mode. The maximum speed possible for the SPI device is an 8 MHz clock, which means 16 CPU cycles per byte, plus a couple extra to load the next byte, as the SPI device is only single-buffered, for a total of around 3500-4000 cycles of the 8192 available.

But we also need to prepare the data to send to the LED drivers. This involves a few steps per TLC5940 output:

1. Fetch from a lookup-table the index of the LED that is connected to the next TLC5940 output on the PCB (remember, those connections are in more or less random order to optimise the PCB layout).
2. Use that index into the frame buffer to fetch the 4-bit value that gives the required brightness of the LED.
3. Use the brightness value to index into another lookup table, which maps each possible brightness into the corresponding 12-bit PWM value. This mapping is made non-linear to improve the dynamic range available with just 16 different levels, as the lower PWM values are perceived by the human eye to be much easier to distinguish than the higher values.
4. Pack the 12-bit values of two consecutive outputs into three bytes suitable for transmission on the SPI device (it is thus easier to do two LEDs at a time, producing three whole bytes of SPI data for each loop iteration).
As it turns out, these operations together take something like 60 CPU cycles for two LEDs or 20 cycles per SPI output byte, which is another 3500-4000 cycles.

With 8192 cycles total, we cannot afford to first spend 4000 cycles preparing the data, then spend another 4000 cycles sending the data out. So we use a trick. By writing the code in assembler and carefully timing each instruction, we can overlap the two operations, inserting into the data generation algorithm instructions to feed the next byte of data to the SPI at exactly the points in time where they are needed.

In C it might look something like this, with the data generation algorithm on the left and the SPI data output on the right:

```
// Loop over the TLC5940 outputs two at a time
for (o = 127; o >= 0; o = o - 2)
{
    // Find index into frame buffer
    led_idx = led_map[o];                           SPDR = byte1;  // Output one byte
    // Load the 4-bit intensity from the frame buffer
    if (led_idx & 1)
        intensity = frame_buffer[(offset + led_idx)/2] & 0xf;
    else
        intensity = frame_buffer[(offset + led_idx)/2] >> 4;

    // Lookup the 12-bit PWM value from the intensity.
    pwm1 = pwm_lookup[intensity];

    // Same for second output
    led_idx = led_map[o+1];                         SPDR = byte2;  // Output one byte
    if (led_idx & 1)
        intensity = frame_buffer[(offset + led_idx)/2] & 0xf;
    else
        intensity = frame_buffer[(offset + led_idx)/2] >> 4;
    pwm2 = pwm_lookup[intensity];

    // Pack the two 12-bit PWM values into three SPI bytes
    byte1 = pwm1 >> 4;
    byte2 = (pwm1 & 0xf) << 4 | (pwm2 >> 8);
    byte3 = pwm2 & 0xff;                            SPDR = byte3;  // Output one byte
}
```
However, the algorithm is not really expressible in C. The instructions that output the data bytes to the SPI device need to be placed at points where those bytes have already been computed, and the timing between them must be just right, so that each output happens as soon as the SPI has completed the previous transfer (but no sooner, or the previous byte will be corrupted). The actual assembler code can be seen here, in the function `shift_out_frame_spi()`. The comments in the function mark the cycles spent on each instruction so that the SPI output can be spaced out with the correct timing.

This way, the code is able to send out data for two TLC5940 outputs in around 63 CPU cycles, or about 4000 cycles total per refresh cycle, leaving half the CPU time free for handling the frames of animation, which is nice. I think this is a rather interesting programming technique. It is a bit like multi-threading, but with the instruction scheduling hardcoded explicitly into the program.

## Serial reception

In addition to generating some animations on the Arduino itself, they can be streamed into the Atmega328 through the serial port. The protocol is mostly the raw binary data in the framebuffer (4 bits per LED), plus a couple of control bytes like start/end marker, frame number, and checksum, to facilitate synchronisation between sender and receiver. If the receiver detects that the frames are not received correctly, it sends back an error byte; the sender notices this and pauses the data for a few frames, and the two ends re-synchronise. This is a simple but effective technique that allows for efficient transfer of the data.

One frame of data on the serial port is 672 bytes, including control bytes. At 50 frames of animation per second, that amounts to 33600 bytes, or 268800 bits, per second. The serial port is run at 500000 bits per second; even with the overhead of start and stop bits, that comfortably accommodates the required data rate.

However, with this rather high data rate, care is needed to be able to process all received bytes sufficiently fast that no data is lost. The Atmega328 has just a single byte receive buffer. At 500kbps, a new byte arrives 50000 times per second, meaning that we have just 320 CPU cycles to process a received byte before it will be lost due to being overwritten by the next byte.

To handle this, a serial receive interrupt is employed. The interrupt is triggered whenever a byte is received by the serial device on the Atmega328, and we need to ensure that it will be serviced within at most 320 CPU cycles. The Atmega328 does not have interrupt priorities, but it does support nested interrupts. Interrupts are automatically disabled whenever an interrupt routine is invoked, but that routine can re-enable interrupts explicitly, which allows another, nested interrupt to be handled before the first one completes. Indeed, this is absolutely necessary in the cube's refresh timer interrupt, as it runs for several thousand cycles. Nested interrupts work well, but they require a lot of care; race conditions between conflicting interrupts can be quite hard to debug, and one also needs to protect against runaway interrupts (where the same interrupt is invoked recursively and repeatedly on top of itself until the stack is overrun).

With more than 30000 serial interrupts per second, we also want to make the code for the serial interrupt handler as efficient as possible. Unfortunately the AVR architecture does not exactly shine in this respect. Here is how a typical interrupt routine looks as generated by GCC:

```
push    r1
push    r0
in      r0, 0x3f
push    r0
eor     r1, r1
push    r16
push    r17
push    r18
push    r19
push    r20
push    r21
push    r22
push    r23
push    r24
push    r25
push    r26
push    r27
push    r28
push    r30
push    r31

...

pop     r31
pop     r30
pop     r28
pop     r27
pop     r26
pop     r25
pop     r24
pop     r23
pop     r22
pop     r21
pop     r20
pop     r19
pop     r18
pop     r17
pop     r16
pop     r0
out     0x3f, r0
pop     r0
pop     r1
reti
```
That is no fewer than 40 instructions of pure prologue and epilogue, most of which take two CPU cycles each.

Of course, in an interrupt routine, we do need to save and restore all registers used. However, most invocations of the serial interrupt do not need more than a few registers; just enough to grab the next byte from the serial device and put it into the frame buffer. Only for the control bytes at the start and end of a frame do we need more registers for more complex processing. Unfortunately, GCC always generates the code to push and pop all the registers unconditionally, even though some of them are only used in rarely executed code paths (the large number of call-clobbered registers in the AVR calling convention, which must be saved before calling a C function, is part of the problem here).

The solution is to write the serial interrupt in hand-optimised assembler. In the fast path, where we are just stuffing a byte into the framebuffer (and computing a checksum on-the-fly, incidentally), we only need to save three registers (plus the condition codes). That all can be done in just 26 instructions. Then in the slow path, the assembler code goes on to push all remaining registers and defer to the more complex processing in a C function.

The actual code can be seen here. The assembler code for the fast path is in `serial_interrupt_rx_naked()`, while the slow path is in the function `serial_interrupt_slow_part()`.

## Reading animations off SD-card

While some of the animations can be computed on-the-fly on-board the Arduino, some of the more complex ones are too much for the puny Atmega328 to handle. The problem usually is not so much processing speed as memory: with a total of just 2048 bytes of RAM, most of which is already reserved for frame buffers, stack space, and various global variables, not a lot is left to keep track of things like the position and velocity of the many particles in the fireworks animation. Using the serial port, we can generate the animations on a PC and stream them to the cube; however, it is also nice to be able to run the cube completely standalone: just plug it into power (or even run it off a battery) and it runs and displays animations on its own, without a laptop in tow. Thus the idea was born to pre-compute the animations and read them from an SD-card.

Now, at just 672 bytes per frame of animation, a 4 GB SD-card can store more than a day's worth of animation, so this is fine. The problem, however, is that we are running short on IO as well as on memory. Almost all available pins are already needed to control the 11 MOSFETs and to communicate with the TLC5940 ICs. Besides, the SPI device, which is normally used to communicate with an SD-card, is needed for communication with the TLC5940s, and the serial device (which can also do SPI) is needed for the USB-to-serial interface. So what can be done?

Let us take a closer look at the Arduino Uno, the top left corner near the USB connector:

What we see here is the Atmega16U2 IC, which comes pre-installed with firmware (based on LUFA) to control the USB port and function as a USB-to-serial proxy. It is however perfectly possible to modify that firmware to do additional things - such as connecting an SD-card! Furthermore, just next to it we have the ICP header; this is normally used to program the flash on the Atmega16U2, but it carries the pins for the SPI device, which is just what we need to communicate with an SD-card over SPI. And finally, we even have unsoldered pads with four spare GPIOs, one of which we will need as a chip-select pin for the SD-card.

So as it turns out, the Arduino Uno is in fact a multi-processor device! The LED-cube exploits this, connecting the SD-card to the ICP pins and spare GPIO of the Atmega16U2 MCU and hacking the USB firmware to handle the SD-card. If there is no activity on the USB port, the hacked firmware will check for the presence of an SD-card with animation data on it; if found, data will be streamed from the SD-card over the serial port to the Atmega328, which will handle the serial data the same, whether it originates from a PC at the other end of the USB, or from the SD card.

Now, using the Atmega16U2 in this way does present some challenges. The Atmega16U2 is only equipped with a meager 512 bytes of RAM, some of which is already needed for LUFA data and so on. The data on SD-cards is read one sector at a time, and a single sector is 512 bytes: already more than the RAM we have left. Most libraries for reading SD-cards and dealing with the FAT filesystem on them are based on reading one sector at a time into a buffer in RAM and processing it there; that just will not work when we have only a few hundred bytes of RAM to spare for the task.

Furthermore, most SD-card/FAT libraries are written in a traditional blocking style. That is, they provide some function you can call to read data from a file on the SD-card. Such a function takes a memory buffer (which we do not have the memory for), and it does not return to the caller until all of the requested data has been read, which means waiting at least for one sector to be read. That does not integrate well with the existing USB/LUFA firmware, which runs its own main loop that waits for activity on the USB device and does not return to the main program unless there is some activity to respond to.

To overcome these challenges, I wrote a small event-driven FAT library, seen in `ev_fat.h` and `ev_fat.c`. This library works in a streaming fashion, without any blocking. It never needs to process SD-card data in a memory buffer. Instead, the caller feeds it the bytes read off the SD-card one by one, and the library processes the bytes as they are received, keeping track of its state in a small data structure, and returning status information back to the caller about which sectors from the SD-card to read next.

```
/*
Open a named file in root dir of FAT file system.
Before calling, st->state must be initialised to 0.
Then the function must be repeatedly called until it returns
EV_FILE_ST_DONE or negative error code EV_FILE_ST_E*.

The returned status tells the next action to take, see comments in struct
ev_file_status for details.

When EV_FILE_ST_DONE is returned, the first sector of the file, and the
length in bytes of the file, is returned in st->st_get_block_done.
*/
int
ev_file_get_first_block(const char *filename, struct ev_file_status *st);

/*
After opening a file, this finds the next sector in the file. When calling
this function, st->st_get_block_done must be set to / retain the value set
by the previous call to ev_file_get_first_block() /
ev_file_get_next_block().  After EV_FILE_ST_DONE is returned the new sector
number is then found in st->st_get_block_done.
*/
int
ev_file_get_next_block(struct ev_file_status *st);

/*
This callback is used to stream bytes read as a response to a request
EV_FILE_ST_STREAM_BYTES. Each byte requested must be passed in, in
sequence. The return value is true if no more data needs to be streamed;
in this case it is permissible, but not required, to stop the read early
and not stream the rest of the requested bytes.
*/
int
ev_file_stream_bytes(uint8_t byte_read, struct ev_file_status *st);
```

With this library, the reading of the SD-card can be handled completely inside an SPI interrupt routine, without disturbing the LUFA USB code. Each time a byte has been processed in the communication between the Atmega16U2 and the SD-card, the SPI device triggers the SPI interrupt. This interrupt processes any byte received, updates its internal state, and loads the next byte to be processed into the SPI device data register. The interrupt is seen in `ISR(SPI_STC_vect)`. The code handles the protocol to connect to and initialise the SD-card, and then takes care of reading in sectors and passing the bytes to the event-driven FAT library.

When we get to actually read real file data out of the SD-card, we stream it directly to the serial port (where it will be received and processed by the Atmega328), to avoid the need for large memory buffers. The existing firmware already has a small FIFO used to buffer data for sending down the serial line. We re-use that, so that when no data is available from the USB for a few seconds we start filling up the FIFO with data from the SD-card instead. A serial device interrupt is triggered whenever the previous byte has been fully transmitted down the serial line; this interrupt fetches the next byte from the FIFO and loads it into the serial transmit data register. If the SD-card delivers data faster than the 500kbps serial line can transmit, we temporarily pause the SPI communication and resume it once the serial interrupt has made room for more data in the FIFO; the SD-card specifications explicitly mention this as a supported way to operate, precisely to help very small microcontrollers be able to process data without requiring excess buffering capabilities.

The end result is an extended USB firmware that retains all the original functionality (streaming serial data from a PC and so on; even flashing the Atmega328 over the serial port still works). And in addition, if the USB is idle and an SD-card is present, data is instead continuously streamed from the card, allowing completely stand-alone operation of the cube.

The code to handle all this does end up rather intricate, as can be imagined. Apart from the need to write a custom FAT-reading library, the precise timing between the different interrupt handlers ends up requiring quite a lot of careful coding and debugging. But in the end, I found that code to be quite an interesting exercise, and fun as well - and this is after all a for-the-fun-of-it type project.

## Calculating the animations

One of the nice touches of this LED-cube's visuals in particular is the availability of 16 different intensity levels. This allows for some nice effects, such as fading LEDs in and out to give a warmer, light-bulb-like perception, and using anti-aliasing to greatly reduce the disadvantage of the very limited 11-by-11-by-11 resolution.

All the animations are computed by this C++ program. The code is mostly a lot of math using vector computations, trigonometry, random number distributions, permutations, physics simulations, and other nice stuff. The end result is a sequential stream of animation frames that can be sent directly to the LED-cube over the serial port, or stored in a file on an SD-card for stand-alone playback.

## Conclusions, and source code

If I were to do this project today, I would probably use an ARM microcontroller like the STM32F4. Such a microcontroller is easily able to handle driving something like this LED-cube without the need for any special tricks due to its much larger memory and performance. But this was just a for-fun project, and it was interesting to see just how much could be squeezed out of the very popular AVR-based Arduino. That is quite a lot, as it turns out.

The nice thing about the LED-cube is: On the one hand it involves lots of tricky programming and advanced technology. On the other hand it has an immediate appeal to many different kinds of people, as is seen whenever we take it on display and it immediately draws the eyes of people passing by. The technology aspect is much harder to appreciate than the visual aspect. I have wanted to write up this article describing the project for some time, in all the gory technical details. I hope a few people will be able to use this write-up to appreciate the technical side as well as the visual side.

All of the code and design files for this project are available on GitHub under an open source license (GPL):

In addition, the Arduino code needs these header files by Esmil for easy access to Atmega registers and so on.

March 13th, 2014
05:56 pm


No application changes needed: 10 times faster slave with MariaDB 10 parallel replication

Parallel replication is in MariaDB 10.0. I did some benchmarks on the code in 10.0.9. The results are quite good! Here is a graph that shows a 10-times improvement when enabling parallel replication:

The graph shows transactions per second as a function of the number of slave worker threads, when the slave is executing events from the master at full speed. The master binlog was generated with sysbench oltp.lua. When the binlog is enabled on the slave and made crash-safe (`--sync-binlog=1 --innodb-flush-log-at-trx-commit=1`), the slave is about ten times faster at 12 worker threads and above, compared to the old single-threaded replication.

These results are for in-order parallel replication. With in-order, transactions are committed on the slave in strictly the same order as on the master, so that applications do not see any differences from using parallel replication. So no changes to the application are needed to use parallel replication; this is just standard sysbench 0.5 with a single table. This makes parallel replication particularly interesting, as it can be used with any existing applications, without the need to eg. split the data into independent schemas as is necessary with out-of-order techniques like the multi-threaded slave feature in MySQL 5.6. It does however make it much harder for the slave to find parallelism to exploit in the events received from the master, which is why it is interesting to see how much improvement can be obtained from normal workloads.
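The in-order constraint can be illustrated with a short sketch (a toy model, not MariaDB's actual scheduler): transactions execute in parallel on worker threads, but each waits its turn to commit, so the slave exposes exactly the master's commit order.

```python
# Toy model of in-order parallel apply: parallel execution,
# strictly ordered commit.
import threading

class InOrderApplier:
    def __init__(self):
        self.next_commit = 0          # sequence number allowed to commit next
        self.cond = threading.Condition()
        self.committed = []           # commit order actually observed

    def apply(self, seq, work):
        work()                        # the parallel part: execute the changes
        with self.cond:               # the serial part: commit in binlog order
            while self.next_commit != seq:
                self.cond.wait()
            self.committed.append(seq)
            self.next_commit += 1
            self.cond.notify_all()

applier = InOrderApplier()
# Start the workers in scrambled order to show ordering is enforced.
threads = [threading.Thread(target=applier.apply, args=(i, lambda: None))
           for i in reversed(range(8))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(applier.committed)  # prints [0, 1, 2, 3, 4, 5, 6, 7]
```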

(MariaDB 10.0 also supports out-of-order parallel replication, but that will be the subject of a different article.)

The hardware used for the sysbench oltp.lua is the same machine I used to benchmark group commit previously; I am told this is a machine that is typical for a "standard" datacenter server, with decent I/O on a RAID controller with battery-backed-up cache. Sysbench was run with 10 million rows in one table. The mysqld was configured with 16GB buffer pool and 2 times 1.9 gigabyte redo logs. The different graphs are as follows:

• binlog, crash-safe: `--log-slave-updates --sync-binlog=1 --innodb-flush-log-at-trx-commit=1`
• no binlog, durable: `--skip-log-slave-updates --innodb-flush-log-at-trx-commit=1`
• no binlog, non-durable: `--skip-log-bin --innodb-flush-log-at-trx-commit=2`
• binlog, non-crash-safe: `--log-slave-updates --sync-binlog=0 --innodb-flush-log-at-trx-commit=0`

For this test, the master was configured with `--binlog-commit-wait-count=12 --binlog-commit-wait-usec=10000`. This allows the master to delay a transaction up to 10 milliseconds in order to find up to 12 transactions that can commit in parallel; this helps a lot in improving parallel replication performance, since transactions that commit in parallel on the master can be executed in parallel on the slave.
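As a sketch, the master-side settings used in this test could be expressed as a my.cnf fragment (option names as above; the values are the ones used for this benchmark):

```
[mysqld]
# Wait up to 10000 microseconds to gather up to 12 transactions into
# one group commit; transactions that commit together on the master
# can then be applied in parallel on the slave.
binlog-commit-wait-count=12
binlog-commit-wait-usec=10000
```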

Adding such a delay will be acceptable for many applications as a way to speed up the slaves; in fact, in my test it did not affect master throughput at all. One attractive option might be to set up monitoring of the slaves, and if they start falling behind, temporarily increase the commit delay on the master, throttling it a bit while giving the slaves a better chance to catch up.

The other source of parallelism on the slave is that irrespective of how the transactions were executed on the master, the commit steps of different transactions can always be applied in parallel on the slave. This is particularly effective at improving performance when the commit step is expensive, as happens when a durable, crash-safe configuration is used. This is seen in the benchmark, where the speedup is particularly large when the slave is configured to be crash-safe and durable, to the point where parallel replication almost eliminates any performance penalty for enabling crash-safe binlog on the slaves. But significant speedup is seen in all the configurations.

(In fact, if you look closely, you will see that turning off the binlog ends up decreasing the performance of the slave. This is bug MDEV-5802; performance with the binlog disabled should improve once this bug is fixed.)

I think these are very promising results. I hope this will inspire users to give the new feature a test on real workloads, and share their experiences.

## Exploring the limits of parallel replication

I also wanted to see how the code works for workloads that are not favorable to parallel replication. For this I use sysbench update_index.lua. This benchmark creates transactions with a single primary-key update. With just 10 million rows in the table, the test is in-memory, and the actual work spent for the single-row update is rather small compared to the overhead of reading and parsing binlog events and scheduling the work between the various replication threads. So it is interesting to see if parallel replication can still give some smaller benefit here, or at least not make things much worse.

Here are the results for update_index.lua with 48 threads for the load on the master. No commit delay was used to help the parallel replication (`--binlog-commit-wait-count=0`):

As can be seen, even in this workload there is significant improvement from parallel replication. Again, the improvement is particularly high when commit is expensive, due to the possibility to do the commits in parallel and amortise the cost using group commit. But even with binlog disabled and non-durable InnoDB commit, we see some improvement, though only a modest one.

Finally, to test the absolutely worst-case scenario for parallel replication, I created another workload on the master, this time with update_index.lua running with just a single thread. This way, there is absolutely no opportunity for parallel replication to execute the actual transactions in parallel, though there is still some opportunity to speed up the commit step using group commit.

Here are the results for the single-threaded update_index.lua master load:

As can be seen, even in this extreme case, we do see some speedup when commits are expensive (durable/crash-safe configuration). But when durability and crash-safety is turned off, and commits are cheap, then the overhead of the extra thread scheduling shows, and the results of parallel replication are worse, if not disastrously so. Note by the way that with this test, the time taken by the slave to replicate the load was smaller than the time for the master to generate it with single-threaded sysbench.

## Conclusions

So overall, I am pleased with the results of the parallel replication code in MariaDB 10.0. The single-threaded applier has been a major limitation in replication for a long, long time, and I now feel fairly confident that this can play an important part in solving that. I hope this will inspire users to start testing their own loads, hopefully getting good results and/or reporting any issues found on the way.

To test, use MariaDB 10.0.9 or later on both master and slave. Configure the slave with `--slave-parallel-threads=20` or something like that. And optionally, if your application can tolerate some extra commit latency, set some reasonable values for `--binlog-commit-wait-count` and `--binlog-commit-wait-usec`.
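For example, the setup could look like this (a sketch; note that `slave_parallel_threads` can only be changed while the slave is stopped):

```
-- On the slave:
STOP SLAVE;
SET GLOBAL slave_parallel_threads = 20;
START SLAVE;

-- Optionally, on the master (if the extra commit latency is acceptable):
SET GLOBAL binlog_commit_wait_count = 12;
SET GLOBAL binlog_commit_wait_usec = 10000;
```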

By the way, the raw numbers for the different runs can be seen in this mail on maria-developers@.

February 10th, 2014
07:23 pm


Using MASTER_GTID_WAIT() to avoid stale reads from slaves in replication

I have just implemented MASTER_GTID_WAIT() in MariaDB 10.0. This can be used to give a very elegant solution to the problem of stale reads in replication read-scaleout, without incurring the overheads normally associated with synchronous replication techniques. This idea came up recently in a discussion with Stephane Varoqui, and is similar to the concept of a Lamport logical clock described in this Wikipedia article.

I wanted to describe this here, hoping to encourage people to test it and maybe start using it; it is a simple but very neat idea.

A very typical use of MariaDB/MySQL replication is for read-scaleout. The application does all updates against a single master, which replicates to a set of slaves. The application can then distribute its reads among the slaves. If most of the database load is from reads, then this is an effective way to scale beyond what a single database server can handle.

The problem of stale reads occurs because MariaDB/MySQL replication is asynchronous. There is some delay from the commit of a change on the master server until that change becomes committed and visible on a slave server. This can be quite annoying in many cases: eg. a user changes some preferences and reloads the page, the read goes to a different slave server which is not yet caught up, and now the change looks as if it was lost.

The idea is to use the MariaDB Global Transaction ID (GTID for short) as a logical clock. When we do an update on the master, we can obtain the associated GTID from `@@LAST_GTID`. Then on the slave, we can execute a `MASTER_GTID_WAIT(<GTID from master>, <timeout>)` before doing a query for which read consistency is critical; this will ensure that the slave will catch up before the query is run, and that no stale read results. Or if we get a timeout, we can try another slave server, which hopefully is not lagging as much behind.
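A toy model of this mechanism (not the server implementation; the GTID is reduced to its sequence number for brevity, and 0/-1 are the success/timeout return values, as in the real function):

```python
# Toy slave model: MASTER_GTID_WAIT() as a wait on the slave's
# replicated-up-to watermark.
import threading

class SlaveModel:
    def __init__(self):
        self.applied_seq = 0               # highest GTID sequence applied
        self.cond = threading.Condition()

    def apply_event(self, seq):
        # Called by the (simulated) replication applier thread.
        with self.cond:
            self.applied_seq = seq
            self.cond.notify_all()

    def master_gtid_wait(self, seq, timeout):
        # Block until the slave has applied up to seq, or until timeout.
        with self.cond:
            ok = self.cond.wait_for(lambda: self.applied_seq >= seq, timeout)
        return 0 if ok else -1

slave = SlaveModel()
slave.apply_event(41)                      # slave has caught up to seq 41
print(slave.master_gtid_wait(41, 0.1))     # already applied -> prints 0
print(slave.master_gtid_wait(99, 0.1))     # still lagging -> prints -1
```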

Typically, we would pass around the GTID (the logical clock value) between different parts of the system to maintain consistency where it is needed and avoid stale reads. Eg. in a web application, we would set a cookie with the GTID when an update is made to the database. Then on the next HTTP request, we have the GTID which is needed in `MASTER_GTID_WAIT()`, even if it goes to a completely different web server. Or if we send an email, we can encode the GTID in any contained link, to be sure that it will work whichever slave server it might end up in, possibly in a different data center on a different continent.

By passing around the GTID whenever we communicate between parts of the system, we can ensure that if transaction A is known to have occurred before transaction B, then any change made by A will also be visible to B. If there was no communication (direct or indirect), then B cannot know that A came before - and if there was communication, we can avoid stale reads by passing the GTID as part of that communication.

The great thing about this technique is that it is optional. We can use it for just the queries where avoiding stale reads is critical, and only there take the associated penalty (in the form of increased latency). Other parts of the database or application will not be negatively affected. This is much more flexible than using some form of synchronous replication, which will incur some penalty for all parts of the system.

So for applications that require more performance than what is currently possible to get from fully synchronous replication, we can choose consistency in a few critical places where it matters, and go for the performance of asynchronous replication in the rest.

I am hoping that some users will start to do tests with this, and I am very eager to hear any feedback and suggestions for further improvements (the maria-developers@ list is a good place for those). I think this can become a very useful technique in many cases.

By the way, note that it is not necessary to actually switch the replication to GTID to use this technique, as in MariaDB the GTIDs are always generated and replicated, even if they are not used for the slave when connecting to the master. This should make it simple for people to start testing. One will need to run MariaDB 10.0 on both the master and the slave, of course - but this could be achieved by setting up one MariaDB server as a slave of the main system (which could be running MySQL or Percona Server or MariaDB 5.5 or whatever), and then setting up another MariaDB 10.0 slave using the first slave as master.

If this turns out to be something that people find generally useful, it would be interesting to integrate this more tightly into the client protocol and client API. The `@@LAST_GTID` could be sent automatically back with the query results, the same way that `@@LAST_INSERT_ID` is done. And another option could enable the automatic execution of the `MASTER_GTID_WAIT()` call before queries sent to the slave servers. This could make the technique almost transparent for the end application, and could help avoid the overhead of extra client-server roundtrips and extra SQL parsing.

However, that is only worth doing if this turns out to be something that will actually be shown to be useful in practice. So if you want to see that happen eventually, then get testing and send in that feedback!

The code is available in the 10.0 bzr tree now, and will be in the next (10.0.9) release of MariaDB.

January 23rd, 2014
11:44 am


More on 40% better single-threaded performance in MariaDB

In my previous post I wrote about how I achieved a >40% speedup on sysbench read-only using profile-guided optimisation (PGO). While this is a preliminary result, I thought it was so interesting that it deserved early mention. The fact that any benchmark can be improved that much shows clearly that PGO is something worth looking into. Even if we will probably not improve all workloads by 40%, it seems highly likely that we can also obtain significant gains for many real workloads.

I had one or two interesting comments on the post that raise valid concerns, so I wanted to write a follow-up here, explaining some of the points in more detail and going deeper into the performance counter measurements. As I wrote before, actual observations and measurements are crucial to fully understand the performance of complex code on modern CPUs. Intuition and guesswork just do not suffice.

## On branch mispredictions

bludwarf suggested that branch mispredictions could be an important part of the picture:

```
> It would be interesting to see the hottest spots.... Having to execute just
> a lot of code is not an explanation. If it would be so, then profiler
> feedback won't affect performance so much. I believe the reson for so big
> difference for the results is branch misprediction.
```
I had similar thoughts when I first saw the results, but I was able to verify my explanation with actual measurements of instructions retired and icache misses before and after profile-guided optimisation:
```
Before PGO:
    171821  INST_RETIRED.ANY
     17730  ICACHE.MISSES

After PGO:
    166686  INST_RETIRED.ANY
     13643  ICACHE.MISSES
```

Before PGO, ICACHE.MISSES/INST_RETIRED.ANY = 10.3%. After, we get 8.2%. So we do get significantly fewer icache misses. I wrote in my previous post about some plausible reasons for that, with my remarks on basic blocks, though I have not tried to verify those with actual measurements.
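These ratios are easy to recompute from the raw counts (a quick sanity check):

```python
# ICACHE.MISSES / INST_RETIRED.ANY, before and after PGO.
before = 17730 / 171821
after = 13643 / 166686
print(round(100 * before, 1), round(100 * after, 1))  # prints: 10.3 8.2
```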

I also checked branch mispredictions. The Intel manual has two measures to estimate the cost of branch misprediction. One is time wasted due to speculatively issued uops that do not retire, %Bad_Speculation = (UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES ) / (4*CPU_CLK_UNHALTED.THREAD). The other is cycles wasted based on an estimated cost of 20 cycles per misprediction, %BR.MISP.COST = 20 * BR_MISP_RETIRED.ALL_BRANCHES_PS / CPU_CLK_UNHALTED.THREAD. Here are the relevant measurements taken before and after (two sets of measurements, as Sandy Bridge supports only 4 simultaneous performance counters):

```
Before PGO:
    486641  CPU_CLK_UNHALTED.THREAD
    205880  UOPS_RETIRED.RETIRE_SLOTS
    226184  UOPS_ISSUED.ANY
      7219  INT_MISC.RECOVERY_CYCLES
    %Bad_Speculation = 2.5%

After PGO:
    335051  CPU_CLK_UNHALTED.THREAD
    194616  UOPS_RETIRED.RETIRE_SLOTS
    216353  UOPS_ISSUED.ANY
      6830  INT_MISC.RECOVERY_CYCLES
    %Bad_Speculation = 3.7%

Before PGO:
    490250  CPU_CLK_UNHALTED.THREAD
      1310  BR_MISP_RETIRED.ALL_BRANCHES_PS
    %BR.MISP.COST = 5.3%

After PGO:
    338524  CPU_CLK_UNHALTED.THREAD
      1147  BR_MISP_RETIRED.ALL_BRANCHES_PS
    %BR.MISP.COST = 6.8%
```
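As a sanity check, both metrics can be recomputed directly from the counters, using the two formulas given above:

```python
# %Bad_Speculation: fraction of issue slots wasted on uops that never retire.
def bad_speculation(uops_issued, uops_retired_slots, recovery_cycles, clk):
    return (uops_issued - uops_retired_slots + 4 * recovery_cycles) / (4 * clk)

# %BR.MISP.COST: cycles wasted, assuming ~20 cycles per misprediction.
def br_misp_cost(mispredicted, clk):
    return 20 * mispredicted / clk

print(round(100 * bad_speculation(226184, 205880, 7219, 486641), 1))  # 2.5 (before)
print(round(100 * bad_speculation(216353, 194616, 6830, 335051), 1))  # 3.7 (after)
print(round(100 * br_misp_cost(1310, 490250), 1))                     # 5.3 (before)
print(round(100 * br_misp_cost(1147, 338524), 1))                     # 6.8 (after)
```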

Note that the actual number of branch mispredictions is much the same. But with less time spent on other bottlenecks, the estimated relative cost becomes higher.

Most of the branch mispredictions are due to indirect branches (virtual functions and @plt entries in shared libraries). Even without PGO the CPU is able to predict the other branches very well, so the compiler or __builtin_expect hints cannot help much.

At the end of this post, I have appended the top of some per-function profiles. They show how branch mispredictions are mostly from commonly called library functions like memcpy() (these are mispredictions of the indirect jumps in the PLT), and from virtual function calls. They also show how the icache misses are spread out more or less evenly, correlating strongly with the count of actual instructions executed in various parts of the code.

## On better coverage of the PGO

A comment from Anonymous raised another valid concern:

```
> If I read you correctly, you are using the same benchmark for optimizing as
> well as measuring performance increases. You need to have different test and
> training sets or your measurements don't generalize.
```

This is true, of course. The sysbench read-only benchmark exercises only an extremely narrow subset of mysqld functionality. The next step, on which I am already working, is to implement a more suitable and general workload for the profile, which can hopefully cover, and hence speed up, a significant portion of common real-life workloads.
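For reference, the PGO cycle itself looks roughly like this (a sketch; the GCC flags are real, but the build commands and paths are illustrative):

```
# Step 1: build an instrumented binary.
gcc -O3 -fprofile-generate ... -o mysqld

# Step 2: run the training load (eg. gen_profile_load) against the
#         instrumented server; each run writes .gcda profile files.

# Step 3: rebuild using the recorded profile.
#         -fprofile-correction smooths out inconsistencies in profiles
#         collected from multi-threaded runs.
gcc -O3 -fprofile-use -fprofile-correction ... -o mysqld
```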

More tests will be needed to be sure that real-world gains can be achieved, but I am very hopeful. After all, much of the code exercised by sysbench read-only is also used in many other workloads, and many of its properties (eg. the infrequency of the lots and lots of error-check branches) should be similar.

My main concern is that GCC will choose to de-optimise code that is not covered by the test load, to the extent that it significantly harms performance in some cases. But it should not be too much of a problem - after all, GCC will still try to optimise the generated code as best it can, and if the information it has from the profile is incomplete, it should not be much worse than having no information, as is the case without PGO. And over time, we can extend the coverage of the test load to help eliminate any problems that do turn up.

Since the last post, I implemented a first version of a program that generates a test load. For now, I put it on github: gen_profile_load. The program generates a range of different loads, such as simple key lookups, small joins, inserts/updates/deletes, under various conditions such as with/without binlogging, with/without prepared statements, and so on.

I plan to next do some tests with using this for doing the PGO, and checking the effect on sysbench read-only and possibly other benchmarks. Another idea I had was to try to use only parts of the generated load, and test that this does not cause significant regressions on the omitted parts. This could help in getting some idea of what kind of (hopefully small) performance regressions can be expected for workloads that are not well covered by the PGO.

## Detailed profiles

These are pasted from the output of `perf report`. They show the top functions in terms of instructions executed, icache misses suffered, and branches mispredicted.

Instructions retired (INST_RETIRED.ANY), before PGO:

```
Samples: 171K of event 'rc0', Event count (approx.): 171821
1,94%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,92%  mysqld  mysqld               [.] my_hash_sort_bin
1,60%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,48%  mysqld  mysqld               [.] malloc
1,36%  mysqld  mysqld               [.] free
1,34%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,31%  mysqld  mysqld               [.] alloc_root
1,30%  mysqld  mysqld               [.] JOIN::optimize_inner()
1,16%  mysqld  mysqld               [.] my_malloc_size_cb_func
1,14%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
1,08%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
1,08%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,07%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
1,01%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
1,00%  mysqld  mysqld               [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*)
0,98%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
0,93%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,90%  mysqld  mysqld               [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long)
0,88%  mysqld  [vdso]               [.] 0x000000000000070c
0,83%  mysqld  mysqld               [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
0,79%  mysqld  libc-2.13.so         [.] __memset_sse2
0,78%  mysqld  libc-2.13.so         [.] __strlen_sse42
0,77%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,77%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,71%  mysqld  mysqld               [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227
0,70%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,70%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,70%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
0,69%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,69%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231
0,68%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,60%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,59%  mysqld  [kernel.kallsyms]    [k] effective_load
0,56%  mysqld  [kernel.kallsyms]    [k] select_task_rq_fair
0,56%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,50%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,48%  mysqld  mysqld               [.] my_malloc
0,48%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
0,48%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,47%  mysqld  [kernel.kallsyms]    [k] update_entity_load_avg
0,46%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,46%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,45%  mysqld  mysqld               [.] Item_func::type() const
0,44%  mysqld  [kernel.kallsyms]    [k] update_curr
0,44%  mysqld  [kernel.kallsyms]    [k] copy_user_generic_string
0,44%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,43%  mysqld  mysqld               [.] my_lengthsp_8bit
0,43%  mysqld  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
0,43%  mysqld  [kernel.kallsyms]    [k] init_sync_kiocb.constprop.15
0,41%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,40%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
0,36%  mysqld  mysqld               [.] Item_equal::Item_equal(Item*, Item*, bool)
0,36%  mysqld  mysqld               [.] Item::Item()
0,35%  mysqld  [kernel.kallsyms]    [k] atomic64_dec_and_test
0,35%  mysqld  mysqld               [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result)
0,35%  mysqld  mysqld               [.] JOIN::exec_inner()
0,35%  mysqld  [kernel.kallsyms]    [k] account_entity_enqueue
0,35%  mysqld  [kernel.kallsyms]    [k] do_sys_poll
0,35%  mysqld  [kernel.kallsyms]    [k] system_call
0,34%  mysqld  mysqld               [.] JOIN::destroy()
0,34%  mysqld  mysqld               [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)
```

Icache misses (ICACHE.MISSES), before PGO:

```
Samples: 17K of event 'r280', Event count (approx.): 17730
2,21%  mysqld  mysqld               [.] JOIN::optimize_inner()
2,04%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,91%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,42%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,27%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,20%  mysqld  mysqld               [.] malloc
1,09%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
1,07%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,90%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,90%  mysqld  [vdso]               [.] 0x0000000000000771
0,86%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
0,83%  mysqld  mysqld               [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227
0,73%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,72%  mysqld  mysqld               [.] alloc_root
0,68%  mysqld  mysqld               [.] JOIN::exec_inner()
0,65%  mysqld  libc-2.13.so         [.] __memset_sse2
0,64%  mysqld  mysqld               [.] JOIN::destroy()
0,63%  mysqld  mysqld               [.] free
0,59%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,55%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231
0,55%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,54%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,54%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,52%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,51%  mysqld  mysqld               [.] my_malloc_size_cb_func
0,51%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,51%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,50%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
0,47%  mysqld  mysqld               [.] setup_conds(THD*, TABLE_LIST*, List<TABLE_LIST>&, Item**)
0,47%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,47%  mysqld  mysqld               [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)
0,47%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,47%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,46%  mysqld  mysqld               [.] trx_commit(trx_t*)
0,44%  mysqld  [kernel.kallsyms]    [k] fget_light
0,43%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
0,43%  mysqld  [kernel.kallsyms]    [k] system_call
0,42%  mysqld  mysqld               [.] JOIN::cleanup(bool)
0,41%  mysqld  mysqld               [.] handle_select(THD*, LEX*, select_result*, unsigned long)
0,41%  mysqld  mysqld               [.] mtr_commit(mtr_t*)
0,41%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
0,41%  mysqld  mysqld               [.] my_hash_sort_bin
0,40%  mysqld  mysqld               [.] open_table(THD*, TABLE_LIST*, st_mem_root*, Open_table_context*)
0,40%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
0,39%  mysqld  mysqld               [.] buf_page_get_known_nowait(unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)
0,39%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
0,38%  mysqld  mysqld               [.] _ZL21join_read_const_tableP13st_join_tableP11st_position.isra.226
0,38%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,38%  mysqld  mysqld               [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*)
0,37%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
0,37%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
0,37%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
0,36%  mysqld  mysqld               [.] add_key_field(JOIN*, key_field_t**, unsigned int, Item_func*, Field*, bool, Item**, unsigned int, unsigned long long, st_sargable_param**)
0,36%  mysqld  mysqld               [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**)
0,35%  mysqld  [kernel.kallsyms]    [k] update_rq_clock
0,34%  mysqld  mysqld               [.] do_select(JOIN*, List<Item>*, TABLE*, Procedure*)
0,33%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
0,33%  mysqld  mysqld               [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result)
0,33%  mysqld  [kernel.kallsyms]    [k] __schedule
0,33%  mysqld  mysqld               [.] Item::Item()
0,32%  mysqld  mysqld               [.] thd_ha_data
```

Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), before PGO:

```
Samples: 1K of event 'rc5:p', Event count (approx.): 1310
7,02%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
6,64%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
3,74%  mysqld  libpthread-2.13.so   [.] pthread_mutex_unlock
3,66%  mysqld  mysqld               [.] Item_func::type() const
3,05%  mysqld  libc-2.13.so         [.] __memset_sse2
2,67%  mysqld  mysqld               [.] Field_long::type() const
2,60%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
2,52%  mysqld  libc-2.13.so         [.] __strcmp_sse42
2,06%  mysqld  mysqld               [.] Item_field::used_tables() const
2,06%  mysqld  mysqld               [.] Item_param::used_tables() const
2,06%  mysqld  libc-2.13.so         [.] __strlen_sse42
1,76%  mysqld  librt-2.13.so        [.] clock_gettime
1,68%  mysqld  mysqld               [.] Item::cmp_type() const
1,60%  mysqld  mysqld               [.] st_select_lex::master_unit()
1,37%  mysqld  mysqld               [.] Item_equal::functype() const
1,30%  mysqld  mysqld               [.] Item::const_item() const
1,30%  mysqld  mysqld               [.] Item::real_item()
1,07%  mysqld  mysqld               [.] Field_str::charset() const
0,92%  mysqld  mysqld               [.] Protocol::end_statement()
0,92%  mysqld  mysqld               [.] Item_field::type() const
0,84%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,84%  mysqld  mysqld               [.] Item_field::cleanup()
0,84%  mysqld  mysqld               [.] Item_field::field_type() const
0,84%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
0,84%  mysqld  libpthread-2.13.so   [.] __libc_recv
0,76%  mysqld  mysqld               [.] Item_param::save_in_field(Field*, bool)
0,76%  mysqld  mysqld               [.] my_hash_sort_bin
0,69%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
0,69%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
0,69%  mysqld  mysqld               [.] scheduler_wait_net_end
0,69%  mysqld  mysqld               [.] Item_param::val_int()
0,69%  mysqld  mysqld               [.] Item::check_cols(unsigned int)
0,69%  mysqld  mysqld               [.] Item_field::result_type() const
0,61%  mysqld  mysqld               [.] Field_long::pack_length() const
0,61%  mysqld  mysqld               [.] vio_read
0,61%  mysqld  [kernel.kallsyms]    [k] system_call
0,53%  mysqld  mysqld               [.] Item::is_expensive()
0,53%  mysqld  mysqld               [.] st_select_lex::outer_select()
0,53%  mysqld  mysqld               [.] do_handle_one_connection(THD*)
0,53%  mysqld  mysqld               [.] Explain_select::~Explain_select()
0,53%  mysqld  mysqld               [.] Field_num::result_type() const
0,53%  mysqld  mysqld               [.] Item_param::result_type() const
0,53%  mysqld  mysqld               [.] Item_param::field_type() const
0,53%  mysqld  mysqld               [.] Item_equal::count_sargable_conds(unsigned char*)
0,53%  mysqld  mysqld               [.] Item_func_eq::functype() const
0,53%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
0,53%  mysqld  [kernel.kallsyms]    [k] __pollwait
0,53%  mysqld  [kernel.kallsyms]    [k] sock_def_readable
0,46%  mysqld  mysqld               [.] net_before_header_psi(st_net*, void*, unsigned long)
0,46%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,46%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
0,46%  mysqld  mysqld               [.] handler::index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)
0,46%  mysqld  mysqld               [.] Item_equal::select_optimize() const
0,46%  mysqld  mysqld               [.] Item_func::used_tables() const
0,46%  mysqld  mysqld               [.] ha_innobase::info(unsigned int)
0,46%  mysqld  libc-2.13.so         [.] __memmove_ssse3
0,46%  mysqld  libc-2.13.so         [.] __memcmp_sse4_1
0,46%  mysqld  [kernel.kallsyms]    [k] dequeue_task_fair
0,46%  mysqld  [kernel.kallsyms]    [k] pick_next_task_fair
0,38%  mysqld  mysqld               [.] Protocol_binary::store(char const*, unsigned long, charset_info_st const*)
0,38%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
```

Instructions retired (INST_RETIRED.ANY), after PGO:

```Samples: 166K of event 'rc0', Event count (approx.): 166686
2,55%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,79%  mysqld  mysqld               [.] my_hash_sort_bin
1,60%  mysqld  mysqld               [.] alloc_root
1,48%  mysqld  mysqld               [.] malloc
1,42%  mysqld  mysqld               [.] my_malloc_size_cb_func
1,42%  mysqld  mysqld               [.] free
1,29%  mysqld  libc-2.13.so         [.] __memset_sse2
1,23%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
1,20%  mysqld  mysqld               [.] ha_innobase::build_template(bool)
1,16%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
1,15%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,15%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,13%  mysqld  mysqld               [.] JOIN::optimize_inner()
1,12%  mysqld  mysqld               [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long)
1,07%  mysqld  [vdso]               [.] 0x00000000000006a1
1,03%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,00%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
0,99%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,94%  mysqld  libc-2.13.so         [.] __strlen_sse42
0,83%  mysqld  mysqld               [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*)
0,80%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,78%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,78%  mysqld  mysqld               [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*)
0,75%  mysqld  mysqld               [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long)
0,69%  mysqld  mysqld               [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*)
0,68%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183
0,68%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,63%  mysqld  [kernel.kallsyms]    [k] effective_load
0,63%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,62%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,59%  mysqld  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
0,57%  mysqld  [kernel.kallsyms]    [k] select_task_rq_fair
0,56%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
0,56%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,52%  mysqld  [kernel.kallsyms]    [k] update_curr
0,52%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,52%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,51%  mysqld  mysqld               [.] my_malloc
0,50%  mysqld  [kernel.kallsyms]    [k] update_entity_load_avg
0,46%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,46%  mysqld  [kernel.kallsyms]    [k] copy_user_generic_string
0,45%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,43%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,43%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,42%  mysqld  mysqld               [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*)
0,42%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
0,41%  mysqld  [kernel.kallsyms]    [k] account_entity_enqueue
0,39%  mysqld  [kernel.kallsyms]    [k] init_sync_kiocb.constprop.15
0,39%  mysqld  mysqld               [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool)
0,39%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,38%  mysqld  mysqld               [.] my_lengthsp_8bit
0,38%  mysqld  [kernel.kallsyms]    [k] system_call
0,38%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,37%  mysqld  mysqld               [.] Item::Item()
0,36%  mysqld  [kernel.kallsyms]    [k] atomic64_dec_and_test
0,35%  mysqld  librt-2.13.so        [.] clock_gettime
0,34%  mysqld  mysqld               [.] JOIN::cleanup(bool)
0,34%  mysqld  mysqld               [.] Item_func::type() const
0,33%  mysqld  mysqld               [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long)
0,33%  mysqld  mysqld               [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
0,33%  mysqld  [kernel.kallsyms]    [k] __update_tg_runnable_avg.isra.25
```

Icache misses (ICACHE.MISSES), after PGO:

```Samples: 13K of event 'r280', Event count (approx.): 13643
2,07%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
1,39%  mysqld  mysqld               [.] malloc
1,33%  mysqld  mysqld               [.] JOIN::optimize_inner()
1,26%  mysqld  mysqld               [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*)
1,17%  mysqld  mysqld               [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*)
1,08%  mysqld  [vdso]               [.] 0x0000000000000864
1,07%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
1,06%  mysqld  libc-2.13.so         [.] __memset_sse2
1,01%  mysqld  mysqld               [.] free
0,97%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
0,92%  mysqld  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
0,80%  mysqld  mysqld               [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int)
0,76%  mysqld  mysqld               [.] alloc_root
0,74%  mysqld  libpthread-2.13.so   [.] __pthread_mutex_unlock_usercnt
0,73%  mysqld  libc-2.13.so         [.] __strcmp_sse42
0,71%  mysqld  mysqld               [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*)
0,69%  mysqld  mysqld               [.] my_hash_sort_bin
0,67%  mysqld  [kernel.kallsyms]    [k] fget_light
0,63%  mysqld  [kernel.kallsyms]    [k] do_raw_spin_lock
0,62%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,59%  mysqld  mysqld               [.] Item::Item()
0,57%  mysqld  mysqld               [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183
0,53%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
0,52%  mysqld  [kernel.kallsyms]    [k] update_cfs_shares
0,51%  mysqld  mysqld               [.] ha_innobase::external_lock(THD*, int)
0,51%  mysqld  mysqld               [.] my_malloc_size_cb_func
0,50%  mysqld  mysqld               [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*)
0,48%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,47%  mysqld  mysqld               [.] free_root
0,47%  mysqld  [kernel.kallsyms]    [k] idle_cpu
0,47%  mysqld  [kernel.kallsyms]    [k] unix_stream_sendmsg
0,46%  mysqld  [kernel.kallsyms]    [k] system_call
0,44%  mysqld  [kernel.kallsyms]    [k] unix_stream_recvmsg
0,43%  mysqld  mysqld               [.] ha_innobase::info_low(unsigned int, bool)
0,43%  mysqld  mysqld               [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**)
0,43%  mysqld  librt-2.13.so        [.] clock_gettime
0,43%  mysqld  [kernel.kallsyms]    [k] enqueue_task_fair
0,42%  mysqld  mysqld               [.] JOIN::init(THD*, List<Item>&, unsigned long long, select_result*)
0,42%  mysqld  mysqld               [.] Item_field::used_tables() const
0,41%  mysqld  mysqld               [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long)
0,40%  mysqld  mysqld               [.] Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)
0,40%  mysqld  [kernel.kallsyms]    [k] emulate_vsyscall
0,40%  mysqld  mysqld               [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long)
0,40%  mysqld  [kernel.kallsyms]    [k] update_rq_clock
0,39%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
0,38%  mysqld  mysqld               [.] ha_innobase::extra(ha_extra_function)
0,38%  mysqld  [kernel.kallsyms]    [k] update_entity_load_avg
0,37%  mysqld  mysqld               [.] get_best_combination(JOIN*)
0,37%  mysqld  mysqld               [.] Item_field::fix_fields(THD*, Item**)
0,37%  mysqld  mysqld               [.] my_malloc
0,37%  mysqld  [kernel.kallsyms]    [k] __perf_event_task_sched_out
0,37%  mysqld  [kernel.kallsyms]    [k] __schedule
0,36%  mysqld  mysqld               [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
0,36%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
0,35%  mysqld  mysqld               [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
0,35%  mysqld  mysqld               [.] Prepared_statement::execute(String*, bool)
0,34%  mysqld  mysqld               [.] thd_ha_data
0,34%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
0,34%  mysqld  [kernel.kallsyms]    [k] select_task_rq_fair
0,34%  mysqld  [kernel.kallsyms]    [k] do_sys_poll
0,34%  mysqld  mysqld               [.] st_select_lex::master_unit()
```

Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), after PGO:

```Samples: 1K of event 'rc5:p', Event count (approx.): 1147
8,37%  mysqld  libc-2.13.so         [.] __memcpy_ssse3
7,06%  mysqld  libpthread-2.13.so   [.] pthread_getspecific
3,84%  mysqld  mysqld               [.] Item_func::type() const
2,88%  mysqld  libc-2.13.so         [.] __memset_sse2
2,70%  mysqld  libpthread-2.13.so   [.] pthread_mutex_lock
2,35%  mysqld  mysqld               [.] st_select_lex::master_unit()
2,27%  mysqld  libpthread-2.13.so   [.] pthread_mutex_unlock
2,18%  mysqld  mysqld               [.] Item_param::used_tables() const
2,01%  mysqld  mysqld               [.] Field_long::type() const
2,01%  mysqld  libc-2.13.so         [.] __strlen_sse42
1,92%  mysqld  libc-2.13.so         [.] __strcmp_sse42
1,39%  mysqld  mysqld               [.] Item::const_item() const
1,31%  mysqld  mysqld               [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
1,22%  mysqld  mysqld               [.] handler::index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function)
1,13%  mysqld  mysqld               [.] dispatch_command(enum_server_command, THD*, char*, unsigned int)
1,13%  mysqld  mysqld               [.] end_send(JOIN*, st_join_table*, bool)
1,05%  mysqld  mysqld               [.] my_real_read(st_net*, unsigned long*)
1,05%  mysqld  mysqld               [.] Item_equal::functype() const
1,05%  mysqld  mysqld               [.] Item_bool_func2::cleanup()
0,96%  mysqld  mysqld               [.] Item::is_expensive()
0,96%  mysqld  mysqld               [.] Item_field::type() const
0,96%  mysqld  mysqld               [.] Item_param::type() const
0,87%  mysqld  mysqld               [.] Item::real_item()
0,78%  mysqld  mysqld               [.] Item_field::used_tables() const
0,78%  mysqld  mysqld               [.] scheduler_wait_net_end
0,78%  mysqld  librt-2.13.so        [.] clock_gettime
0,70%  mysqld  mysqld               [.] Statement::set_statement(Statement*)
0,70%  mysqld  mysqld               [.] _ZL21join_read_const_tableP13st_join_tableP11st_position.isra.180
0,70%  mysqld  mysqld               [.] JOIN::optimize()
0,70%  mysqld  mysqld               [.] Field_num::size_of() const
0,70%  mysqld  mysqld               [.] Field_str::charset() const
0,70%  mysqld  mysqld               [.] Item_field::cleanup()
0,70%  mysqld  libpthread-2.13.so   [.] pthread_rwlock_rdlock
0,61%  mysqld  mysqld               [.] select_result::initialize_tables(JOIN*)
0,61%  mysqld  mysqld               [.] mysql_execute_command(THD*)
0,61%  mysqld  mysqld               [.] mysqld_stmt_execute(THD*, char*, unsigned int)
0,61%  mysqld  mysqld               [.] JOIN::optimize_inner()
0,61%  mysqld  mysqld               [.] Field_str::decimals() const
0,61%  mysqld  mysqld               [.] handler::rebind_psi()
0,61%  mysqld  mysqld               [.] Item_field::field_type() const
0,61%  mysqld  mysqld               [.] Item::cmp_type() const
0,52%  mysqld  mysqld               [.] net_before_header_psi(st_net*, void*, unsigned long)
0,52%  mysqld  mysqld               [.] my_net_read
0,52%  mysqld  mysqld               [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_
0,52%  mysqld  mysqld               [.] Field::new_key_field(st_mem_root*, TABLE*, unsigned char*, unsigned char*, unsigned int)
0,52%  mysqld  mysqld               [.] Field::send_binary(Protocol*)
0,52%  mysqld  mysqld               [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
0,52%  mysqld  mysqld               [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*)
0,52%  mysqld  [vdso]               [.] 0x0000000000000630
0,44%  mysqld  mysqld               [.] Protocol::send_eof(unsigned int, unsigned int)
0,44%  mysqld  mysqld               [.] Item::has_subquery() const
0,44%  mysqld  mysqld               [.] Item::mark_as_condition_AND_part(TABLE_LIST*)
0,44%  mysqld  mysqld               [.] select_send::send_eof()
0,44%  mysqld  mysqld               [.] select_send::send_data(List<Item>&)
0,44%  mysqld  mysqld               [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**)
0,44%  mysqld  mysqld               [.] do_handle_one_connection(THD*)
0,44%  mysqld  mysqld               [.] Field_str::repertoire() const
0,44%  mysqld  mysqld               [.] handler::cancel_pushed_idx_cond()
0,44%  mysqld  mysqld               [.] Item_field::get_item_equal()
0,44%  mysqld  mysqld               [.] Item::check_cols(unsigned int)
0,44%  mysqld  mysqld               [.] Item_field::fix_fields(THD*, Item**)
```

January 20th, 2014
02:13 pm

[Link]

40% better single-threaded performance in MariaDB

Continuing my investigation of single-threaded performance in the MariaDB server, I have managed to increase the throughput of single-threaded read-only sysbench by more than 40% so far.

I use read-only sysbench 0.4.12 run like this:

```    sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx run
```

And mysqld is run with minimal options:

```    sql/mysqld --no-defaults --basedir=X --datadir=Y --innodb-buffer-pool-size=128M
```

With modern high-performance CPUs, it is necessary to do detailed measurements using the built-in performance counters in order to get any kind of understanding of how an application performs and what the bottlenecks are. Forget about looking at the code and counting instructions or cycles as we did in the old days. It no longer works, not even to within an order of magnitude.

I am using the Linux `perf` program for this. During my investigations, I found that the main bottleneck in the benchmark turns out to be instruction cache misses. Here is an example measurement:

Counter | Value
------- | ------
INST_RETIRED.ANY | 170431
ICACHE.MISSES | 17194
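For reference, these counters can be collected with `perf stat` using the same raw event codes that appear in the listings elsewhere in this post (the codes are specific to this Intel CPU model, and the `mysqld` invocation is hypothetical). The miss rate then falls out directly:

```shell
# Raw event codes as used in this post (Intel, CPU-model specific):
#   rc0  = INST_RETIRED.ANY,  r280 = ICACHE.MISSES
# Hypothetical invocation; requires perf and a running mysqld:
#   perf stat -e rc0,r280 -p "$(pidof mysqld)" -- sleep 60
# The miss rate from the measurement above:
awk 'BEGIN { printf "%.1f%%\n", 100 * 17194 / 170431 }'
```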

So 10% of executed instructions missed the level 1 instruction cache ("icache"). That is bad. The Intel Optimization Manual says this about the ratio ICACHE.MISSES/INST_RETIRED.ANY:

Anything over 1% of instructions retired can be a significant issue.

So we are 10 times worse than "a significant issue".

Instruction cache misses cause a bottleneck in the frontend of the CPU - where x86 instructions are fetched, decoded, and supplied to the micro-op queue to be scheduled for the out-of-order dataflow backend. To get an idea of how badly bottlenecked we actually are in the frontend in this benchmark, we can use another measure from the Intel manual:

IDQ_UOPS_NOT_DELIVERED.CORE / (4*CPU_CLK_UNHALTED.THREAD) = 81.5%

This ratio estimates the percentage of time in which the front-end is not able to deliver instructions fast enough for the backend to work at full speed. In this case, for more than 80% of the time, the CPU could have done more work in the same time if only instructions could be fetched and decoded faster.

Note the importance of this analysis in knowing what areas to work on to improve performance. Roughly speaking, 80% of our time is spent waiting for the icache. We could have spent years on improving data locality or branch mispredictions or whatever in the code, and we would never have been able to get very big improvements. Now, knowing where the sore spot is, we can spend our efforts where it matters the most.

In the case of icache misses, the best place to start is with profile-guided optimisation. This works by first running a specially instrumented mysqld binary to collect data about the load during the benchmark. Then the binary is recompiled, where the compiler can take into account the information obtained to better optimise the generated code.

Using information about how the program actually behaves when run can help the compiler lay out the basic blocks of the program to benefit the most common execution paths. This can help reduce the footprint in the icache by reducing jumps into or out of the middle of a 16-byte instruction stream (such jumps waste the rest of the 16-byte stream, which is the unit of storage of the icache). It can also increase the length of straight-line code execution paths, which should help the hardware prefetcher keep the level 1 icache supplied.

So that is the theory - but does it work in practice? It turns out that it does. First I compile with `--coverage` added to the CFLAGS/CXXFLAGS. Then I run sysbench to generate the profile data. Then I recompile adding instead the `--profile-use` flag. That is all there is to it, GCC handles everything else automatically.

Re-running the benchmark with the optimised binary yields a large speedup:

Binary | Queries per second | Speedup
------ | ------------------ | -------
Base: `-O3` | 21404 | -
PGO: `-O3 --profile-use` | 30903 | 44%

So a 44% speedup just from compiling with different optimisation, not bad! (The actual server-side improvement is in fact even higher, as the sysbench client consumes part of the runtime due to the single-threaded nature of the test).

By the way, that 44% speedup comes from just a modest reduction in icache miss rate - from 10% down to 8%. It just shows how expensive those icache misses are. Fetching from L2 cache takes something like 12 cycles, and during each of those cycles the CPU will be able to do nothing. Given that we could potentially execute 4 operations in the backend each of those clocks, that is a lot of waste. So with 8% misses still left there is room for further huge improvements, though those improvements will likely be much harder to achieve than just fiddling with compiler options.

I think it is clear that we will need to start using profile-guided optimisation for future MariaDB binaries. This will be some work, as we need to develop a set of typical workloads to optimise for, covering a range of different common usages, and integrate running them into the compilation. More tests with a wider range of benchmarks will also be needed to ensure that the gain is generally applicable and does not cause significant performance regressions in other parts of the code. With such a huge improvement in this test I am confident that things can be generally improved, but it still needs proper testing.

My work on improving single-threaded performance will continue, as time permits, and I certainly expect more good results along the way (several patches are already in the pipeline). But I thought this one was so interesting that it was worth mentioning to a wider audience.

November 28th, 2013
12:34 pm

[Link]

MySQL/MariaDB single-threaded performance regressions, and a lesson in thread synchronisation primitives

I took a quick look at MariaDB 10.0 single-threaded performance (simple read-only sysbench). One thing immediately leaps to the eye, and I thought it worthy of mention. It contains an important lesson about the use of synchronisation primitives, and in particular "atomic operations", in MariaDB (and MySQL).

I am using the Linux `perf` tool on this sysbench command:

```  sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx
```
Look at the top offender in the output from `perf report`:
```  1,54%  mysqld  mysqld               [.] set_thread_state_v1
```
The only thing this does is set a string for SHOW PROCESSLIST (and the like) about what the thread is doing. And we are spending a whopping 1.5% of the total time doing this.

And why? That becomes clear when looking at the disassembly (extract):

```  2d:   mov    0x7b0(%rbx),%eax
33:   mov    %eax,%edx
lock   cmpxchg %eax,0x7b0(%rbx)
jne    33
and    $0xfffffffc,%edx
add    $0x1,%edx
xchg   %edx,0x7b0(%rbx)
mov    %rbp,0x7b8(%rbx)
mov    %ecx,0x7c0(%rbx)
mov    0x7b0(%rbx),%eax
5e:   mov    %eax,%edx
lock   cmpxchg %eax,0x7b0(%rbx)
jne    5e
and    $0xfffffffc,%edx
add    $0x6,%edx
xchg   %edx,0x7b0(%rbx)
```
No less than two locked cmpxchg instructions. Each of these implies a full memory barrier, which is really expensive.

To achieve their very high performance, modern CPUs process many instructions at the same time, a hundred or more is possible. But for a full memory barrier, the CPU needs to stop and ensure that all pending store instructions have fully reached the last-level cache. Only then can it start processing any following instructions involving a load. This is not something we should be doing every time a thread moves to a new stage of query processing.

I am not really familiar with the performance schema code (nor am I sure I want to be). But let us make a guess at what the code is trying to do here. Here is the code from which the first locked instruction originates:

```  /**
  Execute an allocated to dirty transition.
  This transition should be executed by the writer that owns the record,
  before the record is modified.
*/
void allocated_to_dirty(void)
{
  uint32 copy= PFS_atomic::load_u32(&m_version_state);
  uint32 new_val= (copy & VERSION_MASK) + PFS_LOCK_DIRTY;
  /* We own the record, no need to use compare and swap. */
  PFS_atomic::store_u32(&m_version_state, new_val);
}
```
So perhaps the idea is that only the thread itself can modify its state, but other threads can read it. And we want the update to be fast (I suppose updating the state is more important than inspecting it). So the developers wanted to avoid a lock that could block the writer. Instead, it sets a flag to mark that the data is being modified before the update. And clears the flag after. And a reader can then check the flag; if the flag was modified during a read, the data is potentially inconsistent, and the read can be re-tried.

Now, this really is poorly designed. `PFS_atomic::load_u32()` and `PFS_atomic::store_u32()` are implemented using `my_atomic_load32()` and `my_atomic_store32()`, which is inappropriate here. The `my_atomic()` facility implies a full memory barrier in every operation, which makes it somewhat less useful for performance-critical code.

All we need here is that the setting of the update flag becomes visible before changes to the state, and that the clearing of the flag becomes visible after. Thus it should be sufficient with a pair of write memory barriers in the writer (and corresponding read memory barriers in the reader).

The write and read memory barriers are much less expensive than a full barrier. All they do is ensure that a store (respectively a load) before the barrier is satisfied before any store (load) after the barrier. There is no need for the CPU to stop execution. In fact, on x86, there is usually nothing special to do, because stores are never re-ordered with other stores (nor loads with other loads) unless the special non-temporal memory instructions are used.

But look closer at the code:

```static void set_thread_state_v1(const char* state)
  ...
  int state_len= state ? strlen(state) : 0;
  pfs->m_processlist_lock.allocated_to_dirty();
  pfs->m_processlist_state_ptr= state;
  pfs->m_processlist_state_length= state_len;
  pfs->m_processlist_lock.dirty_to_allocated();
```
So it is storing a pointer to a string along with the length, and it needs to do synchronisation to ensure that the two are consistent with each other. But suppose instead that we store only the pointer. This can be done atomically without _any_ need for memory barriers or other synchronisation (as long as the memory location is naturally aligned). That would seem to be even better than using the (cheaper) write- and read-barriers.

To get a better idea of how bad the use of full memory barriers is for performance, I wrote a quick and dirty test program. It has code similar to what is in `set_thread_state_v1()`, as well as three other variants of it: One using just a write memory barrier, one using no memory barrier (not SMP safe, but useful for comparison), and one using normal pthread mutexes rather than a lock-free algorithm. Here is the time taken for 100,000,000 iterations of the function:

Variant | Wallclock (s) | Cycles | Cycle overhead
------- | ------------- | ------ | --------------
Original (full barrier) | 3.70 | 81 | 71
Write barrier | 0.53 | 12 | 2
No barrier | 0.45 | 10 | -
Pthread mutex | 2.26 | 50 | 40

Note how expensive the full memory barrier version is. Its overhead (compared to no barrier) is 35 times larger than that of the lesser write barrier. And I suspect that the real overhead is even bigger in real life than in this synthetic micro-benchmark (for example, imagine if we get a cache miss on one of those locked instructions).

Note the sad fact that the "clever" lock-free algorithm in this case actually makes things slower than just using a simple pthread mutex: the overhead of the original code is 75% higher than that of the mutex version. This is not really surprising. The main cost of a mutex operation is also just a single full memory barrier, but pthread mutexes have been very carefully tuned and optimised by some very knowledgeable people, unlike the ad-hoc performance schema code. Lock-free algorithms require a lot of knowledge and care to get right; this is not simple stuff.

I think there is an important lesson here about our use of synchronisation primitives in MariaDB/MySQL. While there is a lot of focus on mutex contention issues, and various parts are starting to use lock-free algorithms, there does not seem to be a general understanding of memory barrier issues among the developers. In fact, to my knowledge there are no general facilities for explicit memory barriers. We have `my_atomic.h`, which confuses the issues of "atomicity" and "memory barriers" and defines a handful of "atomic" operations that are really locked instructions implying a full memory barrier, as in the example above. As the test program shows, it is really important to distinguish between the different kinds of memory barriers and to understand the performance implications of each. I have always been really unhappy about `my_atomic.h`, and I think this is an area where we need to improve significantly.

And who knows, maybe the performance schema developers can take a hint or two from this.

February 14th, 2013
04:23 pm

[Link]

First steps with MariaDB Global Transaction ID

My previous writings were mostly theoretical, so I wanted to give a more practical example, showing the actual state of the current code. I also wanted to show how I have tried to make the feature fit well into the existing replication features, without requiring the user to enable lots of options or understand lots of restrictions before being able to use it.

So let us start! We will build the code from `lp:~maria-captains/maria/10.0-mdev26`, which at the time of writing is at revision `knielsen@knielsen-hq.org-20130214134205-403yjqvzva6xk52j`.

First, we start a master server on port 3310 and put a bit of data into it:

```    server1> use test;
server1> create table t1 (a int primary key, b int) engine=innodb;
server1> insert into t1 values (1,1);
server1> insert into t1 values (2,1);
server1> insert into t1 values (3,1);
```
To provision a slave, we take a mysqldump:
```    bash$ mysqldump --master-data=2 --single-transaction -uroot test > /tmp/dump.sql
```
Note that with `--master-data=2 --single-transaction` we obtain the exact binlog position corresponding to the data in the dump. Since MariaDB 5.3, this is completely non-blocking on the server (it does not do `FLUSH TABLES WITH READ LOCK`):
```    bash$ grep "CHANGE MASTER" /tmp/dump.sql
-- CHANGE MASTER TO MASTER_LOG_FILE='master-bin.000001', MASTER_LOG_POS=910;
```
Meanwhile, the master server has a couple more transactions:
```    server1> insert into t1 values (4,2);
server1> insert into t1 values (5,2);
```
Now let us start up the slave server on port 3311, load the dump, and start replicating from the master:
```    bash$ mysql -uroot test < /tmp/dump.sql
server2> change master to master_host='127.0.0.1', master_port=3310,
master_user='root', master_log_file='master-bin.000001', master_log_pos=910;
server2> start slave;
server2> select * from t1;
+---+------+
| a | b    |
+---+------+
| 1 |    1 |
| 2 |    1 |
| 3 |    1 |
| 4 |    2 |
| 5 |    2 |
+---+------+
5 rows in set (0.00 sec)
```
So the slave is up to date. In addition, when the slave connects to the master, it downloads the current GTID replication state, so everything is now ready for using global transaction ID. Let us promote the slave to be the new master, and then later make the old master a slave of the new master. So we stop the slave thread on the old slave, and run another transaction to simulate it being the new master:
```    server2> stop slave;
server2> insert into t1 values (6,3);
```
Finally, let us attach the old master as a slave using global transaction ID:
```    server1> change master to master_host='127.0.0.1', master_port=3311,
master_user='root', master_gtid_pos=auto;
server1> start slave;
server1> select * from t1;
+---+------+
| a | b    |
+---+------+
| 1 |    1 |
| 2 |    1 |
| 3 |    1 |
| 4 |    2 |
| 5 |    2 |
| 6 |    3 |
+---+------+
6 rows in set (0.01 sec)
```
The old master is now running as a slave and is up to date with the new master.

So that is it! A short post from me for once, but that is the whole point. Replication with MariaDB Global Transaction ID works much as it always did. The only new thing here is when we issue the `CHANGE MASTER` to make the old master a slave of the new master. We do not have to manually try to compute or guess the correct binlog position on the new master, we can just specify `MASTER_GTID_POS=AUTO` and the servers figure out the rest for themselves.

I hope I managed to show more concretely my ideas with MariaDB Global Transaction ID. Comments and questions are most welcome, as always. Everything above is actual commands that work on the current code on Launchpad. Everything else may or may not work yet, as this is work in progress, just so you know!

January 4th, 2013
04:55 pm

[Link]

More on global transaction ID in MariaDB

I got some very good comments/questions on my previous post on MariaDB global transaction ID, from Giuseppe and Robert (of Tungsten fame). I thought a follow-up post would be appropriate to answer and further elaborate on the comments, as the points they raise are very important and interesting.

(It also gives me the opportunity to explain more deeply a lot of interesting design decisions that I left out in the first post for the sake of brevity and clarity.)

### On crash-safe slave

One of the things I really wanted to improve with global transaction ID is to make replication slaves more crash safe with respect to their current replication state. This state is mostly persistently stored information about which event(s) were last executed on the slave, so that after a server restart the slave will know from which point in the master binlog(s) to resume replication. In current (5.5 and earlier) replication, this state is stored simply by continuously writing a file `relay-log.info` after each event executed. If the server crashes, this is very susceptible to corruption, where the contents of the file no longer match the actual state of the tables in the database.

With MariaDB global transaction ID, the replication state is stored in the following table instead of in a plain file:

```    CREATE TABLE rpl_slave_state (
domain_id INT UNSIGNED NOT NULL,
sub_id BIGINT UNSIGNED NOT NULL,
server_id INT UNSIGNED NOT NULL,
seq_no BIGINT UNSIGNED NOT NULL,
PRIMARY KEY (domain_id, sub_id));
```
When a transaction is executed on the slave, this table is updated as part of the transaction. So if the table is created with InnoDB (or other transactional engine) and the replicated events also use transactional tables, then the replication state is crash safe. DDL, or non-transactional engines such as MyISAM, remain crash-unsafe of course.

A global transaction ID in MariaDB consists of `domain_id` as described in the previous post, an increasing sequence number, and the usual `server_id`. Recall that the replication state with global transaction ID consists of the last global transaction ID applied within each independent replication stream, i.e. a mapping from `domain_id` to global transaction ID. This is what the table `rpl_slave_state` provides.

But what about the `sub_id` in the above table? This is to prevent concurrency issues when parallel replication is used. If we want to be able to execute in parallel two transactions with the same `domain_id`, then these two transactions will both need to update the table `rpl_slave_state`. If the transactions would update the same row in the table, then one transaction would have to wait on a row lock for the other to commit. This would prevent any kind of group commit, which would be a very serious performance bottleneck.

So instead, each transaction inserts a new, separate row into the table to record the new global transaction ID applied. There may thus be multiple entries for a given `domain_id`. The `sub_id` is used to distinguish them; it is simply a (local) integer that is increased for each transaction received from the master binlog. Thus, at any given time, the last applied global transaction ID for a given `domain_id` is the one with the highest `sub_id` in the table.

In effect, the replication state is obtained with a query like this:

```    SELECT domain_id, server_id, seq_no
FROM rpl_slave_state
WHERE (domain_id, sub_id) IN
(SELECT domain_id, MAX(sub_id) FROM rpl_slave_state GROUP BY domain_id)
```
Old rows are deleted when no longer needed.
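Conceptually, this cleanup of obsolete rows could be expressed with a multi-table DELETE like the following. This is only an illustrative sketch of the idea; the server does the purging internally, and the exact mechanism is not specified in the post:

```    -- Illustrative only: purge all but the newest row per domain_id.
DELETE s FROM mysql.rpl_slave_state s
JOIN (SELECT domain_id, MAX(sub_id) AS max_sub_id
      FROM mysql.rpl_slave_state
      GROUP BY domain_id) m
  ON s.domain_id = m.domain_id AND s.sub_id < m.max_sub_id;
```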

Thus, two replicated transactions can be executed like this:

```    BEGIN;                                 BEGIN;

UPDATE some_table                      UPDATE other_table
SET value = value + 1                  SET name = "foo"
WHERE id = 10;                         WHERE category LIKE "food_%";

INSERT INTO mysql.rpl_slave_state      INSERT INTO mysql.rpl_slave_state
SET domain_id = 1,                     SET domain_id = 1,
sub_id = 100,                          sub_id = 101,
server_id = 5,                         server_id = 5,
seq_no = 300010;                       seq_no = 300011;

COMMIT;                                COMMIT;
```
These two transactions can run completely independently, including the inserts into `rpl_slave_state`. And the commits at the end can be done together as a single group commit, where we ensure that the second one is recorded as happening after the first one, so that commit order (and binlog order) is preserved and visibility is correct (the second transaction is not visible to any query unless the first is also visible).

Contrast this with how things would be with an `rpl_slave_state` table holding a single row per `domain_id`:

```    BEGIN;                                 BEGIN;

UPDATE some_table                      UPDATE other_table
SET value = value + 1                  SET name = "foo"
WHERE id = 10;                         WHERE category LIKE "food_%";

UPDATE bad_rpl_slave_state
SET server_id = 5,
seq_no = 300010
WHERE domain_id = 1;

COMMIT;

UPDATE bad_rpl_slave_state
SET server_id = 5,
seq_no = 300011
WHERE domain_id = 1;

COMMIT;
```
Here the update of the replication state table in the second transaction would have to wait for the first transaction to commit, because of row locks. Group commit becomes impossible.

(I actually explained this issue to the replication developers at MySQL/Oracle a long time ago, but last time I looked at MySQL 5.6, they had ignored it...)

### On where to store the replication state

As Giuseppe pointed out, in the global transaction ID design it is still written that the replication state will be stored in the slave binlog, not in the `rpl_slave_state` table. Sorry about this, I will get the document updated as soon as possible.

I had basically two ideas for how to store the slave state in a crash-safe way:

1. In the slave's binlog.
2. In a (transactional) table.

The big advantage of (2) is that it works also when the binlog is not enabled on the slave. Since there can still be substantial overhead to enabling the binlog, I currently plan to go with this approach.

The advantage of (1) is that it is potentially cheaper when the binlog is enabled on the slave, as it commonly will be when global transaction ID is enabled (to be able to promote a slave as a new master, the binlog must be enabled, after all). We already write every single global transaction ID applied into the binlog, and if we crash, we already scan the binlog during crash recovery. Thus, it is easy during crash recovery to rebuild the replication state from the binlog contents. This way we get crash safe slave state without the overhead of maintaining an extra `rpl_slave_state` table.

It will be possible in the future to refine this, so that we could use method (1) if binlog is enabled, else method (2). This might improve performance slightly when binlog is enabled. But we should first benchmark to check if such refinement will be worth it in terms of performance gained. It seems likely that any gains will be modest, at best.

### On parallel replication

Parallel replication is something that has been long overdue, but is now a reality. MariaDB 10.0 will have multi-source replication, which is actually a form of parallel replication. MySQL 5.6 will have multi-threaded slave. Galera can do parallel replication, as can Tungsten I believe, though I am not familiar with details. There are several other mechanisms for parallel replication planned for later MariaDB releases, like MWL#184 and MDEV-520.

It is thus very important to design global transaction ID with parallel replication in mind from the start. I fully agree with Giuseppe's remarks here about MySQL 5.6 replication features failing completely to do this. They introduce in 5.6 three new features that require extensions to how the replication state is stored: crash-safe slave, global transaction ID, and multi-threaded slave. They have managed to do this by introducing three completely different solutions. This is just insane. It makes one wonder if Oracle management forbids Oracle developers to talk to each other, just as we already know they prevent discussions with the community ...

So, in relation to global transaction ID there are basically two kinds of parallel replication techniques: in-order and out-of-order. The two interact with global transaction ID in different ways.

### On in-order parallel replication

In-order is when two (or more) different transactions are executed in parallel on the slave, but the commit of the second transaction is delayed to after the first transaction has committed. Galera is an example of this, I think. Planned tasks MWL#184 and possibly MDEV-520 are also in-order techniques.

In-order parallel replication is transparent to applications and users (at least with MVCC transactional engines like InnoDB), since changes only become visible on COMMIT, and commits are done in a serial way. It is thus also mostly transparent to global transaction ID, and does not need much special consideration for the design.

One thing that can be done, and that I am currently working on, is to integrate in-order parallel replication with group commit. Suppose we run transactions T1 and T2 in parallel on the slave, and suppose that T2 happens to complete first so that we have to wait in T2's commit for T1 to commit first. If we integrate this wait with group commit, we can actually commit T1 and T2 at the same time, taking care to write the commit records to the binlog and to the storage engine in the right order (T1 before T2). This way, the wait is likely to improve performance rather than reduce it, in fact.

### On out-of-order parallel replication

Out-of-order parallel replication is when transactions can be committed in a different order on the slave than on the master. The MySQL 5.6 multi-threaded slave feature is an example of this.

Out-of-order must be explicitly enabled by the application/DBA, because it breaks fundamental semantics. If commits happen in different order, the slave database may be temporarily in a state that never existed on the master and may be invalid for the application. But if the application is written to tolerate such inconsistencies, and explicitly declares this to the database, then there may be potential for more parallelism than with in-order methods. This can make out-of-order interesting.

The typical example, which MySQL 5.6 multi-threaded slave uses, is when the application declares that transactions against different schemas are guaranteed independent. Different schemas can then be replicated independently (though if the application messes up and transactions happen to not really be independent, things can break). MariaDB 10.0 multi-source replication is another example, where the application declares a guarantee that two master servers can be replicated independently.

Out-of-order creates a challenge for global transaction ID when switching to a new master. Because events in the binlog on the new master are in a different order, there will not in general be a single place from which to start replication without either losing or duplicating some event.

MariaDB global transaction ID handles this by only allowing out-of-order parallel replication between different replication domains, never within a single domain. In effect, the DBA/application explicitly declares the possible independent replication streams, and then it is sufficient to remember one global transaction ID per `domain_id` as the position reached within each independent stream.

Thus, suppose we have a master where updates to schemas are independent, and we want to replicate them in parallel on slaves. On the master, we configure 10 (say) domain IDs 20-29. When we log a global transaction ID to the binlog, we set the `domain_id` value to a hash of the used schema.
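A minimal sketch of how such a mapping could be expressed on the master, assuming the session variable `gtid_domain_id` described later in this post. The `20 + CRC32(...) % 10` mapping to domain IDs 20-29 is purely illustrative, not an actual facility from the design:

```    -- Illustrative only: map the current schema to one of the ten
-- configured domain IDs 20-29 before running a transaction.
SET SESSION gtid_domain_id = 20 + (CRC32(DATABASE()) % 10);
```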

On the slave, we then configure 10 SQL threads. Two received transactions with different `domain_id` can be executed in parallel. Two transactions using the same schema will map to the same `domain_id` and will thus not be able to execute in parallel. Thus we get MySQL 5.6 style multi-threaded slave almost for free, using the exact same mechanism as for executing multi-source replication in parallel. The replication state on the slave will in this case consist of the 10 different global transaction IDs reached within each of the 10 replication domains. And they can be stored in the table `rpl_slave_state` just as described above. Thus the replication state for out-of-order parallel replication is fully integrated with the rest of the design, needing no special mechanisms.

And we can do more! The application (with suitable privileges) is allowed to change `domain_id` per-query. For example, we can run all normal queries with `domain_id=1`. But then if we have a long-running maintenance query like an ALTER TABLE or something that updates every row in a large table, we can execute it with `domain_id=2`, if we take care that no other queries conflict with it. This way, the long-running query can run in parallel "in the background", without causing any replication delay for normal queries.
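As a sketch, such a background maintenance query might look like this, assuming the session variable `gtid_domain_id` described later in this post (table and column names are hypothetical):

```    -- Run a long-running statement in its own replication domain, so
-- slaves can apply it in parallel with normal traffic (domain 1).
SET SESSION gtid_domain_id = 2;
ALTER TABLE big_table ADD INDEX (some_column);
SET SESSION gtid_domain_id = 1;  -- switch back to the normal domain
```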

In effect, the application or DBA now has great flexibility in declaring which queries can replicate independently of (and thus in parallel with) each other, and all this falls out almost for free from the overall design. I foresee that this will be a very powerful feature to have for large, busy replication setups.

Note, by the way, that most out-of-order parallel replication techniques can also be run as in-order simply by delaying the final COMMIT steps of transactions so they happen in order. This way one could, for example, do per-schema parallel replication without polluting the replication state with many global transaction IDs. This should generally achieve a similar improvement in overall throughput, though the latency of individual transactions can be longer.

### On "holes" in the global transaction ID sequences

Global transaction IDs have a sequence-number component, which ensures uniqueness by being always increasing. This raises the issue of whether an event will always have a sequence number exactly one bigger than the previous event, or if it is allowed to have "holes", where some sequence number is never allocated to an event.

For MariaDB global transaction ID, I took the approach that holes are allowed. There are a number of good reasons for this.

Mostly, I believe that a design that relies on "no holes" is a fundamental mistake. In MySQL 5.6 global transaction ID, holes are absolutely not allowed. If a hole ever turns up, you will be stuck with it literally forever. The MySQL 5.6 replication state lists every sequence number not yet applied on a slave server, so if one becomes missing it will forever remain. Unless you remove it manually, and as far as I have been able to determine, there are currently no facilities for this. Anyone who knows how fragile MySQL replication can be should realise that this is a recipe for disaster.

Another point: because of the strict "no holes" requirement, in MySQL 5.6, when events are filtered with `--replicate-ignore-db` or whatever, they had to change the code so that a dummy event is used to replace the filtered event. In effect, you cannot really filter any events any more! I think that alone should be enough to realise that the design is wrong.

A more subtle point is that a strict "no holes" requirement makes it much harder to handle allocation of new numbers correctly and scalably. Basically, to allocate the next number in a multi-threaded environment, a lock needs to be taken. We need to hold this lock for as short a time as possible to preserve scalability. But then, what happens if we allocate some sequence number N to transaction T1, and later get some failure that prevents T1 from successfully committing and being written into the binlog? We cannot now simply roll back T1, because some other transaction T2 may have already allocated the next number, and then we would leave a hole. Subtle issues like this are important for achieving good scalability.

So I think it is wrong to base the design on never having holes. On the other hand, there is no reason to deliberately introduce holes just for the fun of it. Sequence numbers in MariaDB global transaction ID will generally be without holes, it is just that nothing will break if somehow a hole should sneak in.

Also, whenever a global transaction ID is received on a slave server, the server's own internal counter for the next sequence number to allocate is set to one more than the received sequence number, if it is currently smaller. This gives a very nice property in a standard setup, where only one master is ever written to at any one time: the sequence number in itself will be globally unique and always increasing. This means that one can look at any two global transaction IDs and immediately know which comes first in the history, which can be very useful. It also makes it possible to warn if multiple masters are being written to without being configured with distinct replication domain IDs. This is detected when a server replicates a global transaction ID with the same `domain_id` as its own but a smaller sequence number.

In multi-master-like setups, the sequence number by itself can no longer be globally unique. But even here, if the system is correctly configured so that each actively written master has its own `domain_id`, sequence numbers will be unique per `domain_id` and allow ordering of global transaction IDs within one domain (and of course, between different domains there is no well-defined ordering anyway).

### On CHANGE MASTER TO syntax

In the previous post, I did not really go into what new syntax will be introduced for MariaDB global transaction ID, both for the sake of brevity, and also because it has not really been fully decided yet.

However, there will certainly be the possibility to change directly the replication slave state, i.e. the global transaction ID to start replicating from within each replication domain. For example something like this:

```    CHANGE MASTER TO master_host = "master.lan",
master_gtid_pos = "1-1-100,2-5-300";
```
This is a fine and supported way to start replication. I just want to mention that starting replication in the old way, with `master_log_file` and `master_log_pos`, will also be supported; the master will automatically convert this into the corresponding `master_gtid_pos` and set it for the slave. This can be convenient, as many tools like `mysqldump` or XtraBackup provide easy access to the old-style binlog position. It is certainly an improvement over MySQL 5.6 global transaction ID, where the only documented way to set up a slave involves RESET MASTER (!) on the master server...

Incidentally, note that `master_gtid_pos` has just one global transaction ID per `domain_id`, not one per `server_id`. Thus, if not using any form of multi-master, there will be just one global transaction ID to set.

So if we start with server 1 as the master, and then some time later switch over to server 2 for a master, the binlog will have global transaction IDs both with `server_id=1` and `server_id=2`. But the slave binlog state will be just a single global transaction ID, with `server_id=2` in this case. Since binlog order is always the same within one replication domain, a single global transaction ID is sufficient to know the correct place to continue replication.

I think this is a very nice property, that the size of the replication state is fixed: one global transaction ID per configured replication domain. In contrast, for MySQL 5.6 global transaction ID, any `server_id` that ever worked as master will remain in the replication state forever. If you ever had a server id 666 you will still be stuck with it 10 years later when you specify the replication state in CHANGE MASTER (assuming they will at some point even allow specifying the replication state in CHANGE MASTER).

Once the global transaction replication state is set, changing to a new master could happen with something like this:

```    CHANGE MASTER TO master_host = "master.lan",
master_gtid_pos = AUTO;
```
This is the whole point of global transaction ID, of course, to be able to do this and automatically get replication started from the right point(s) in the binlog on the new master.

One idea that I have not yet decided about is to allow just this simple syntax:

```    CHANGE MASTER TO master_host = "master.lan";
```
so that if no starting position is specified, and a global transaction state already exists, we default to `master_gtid_pos = AUTO`. This would be a change in current behaviour, so maybe it is not a good idea. On the other hand, the current behaviour is to start replication from whatever happens to be the first not yet purged binlog file on the master, which is almost guaranteed to be wrong. So it is tempting to change this simple syntax to just do the right thing.

### On extensions to `mysqlbinlog`

Robert mentions some possible extensions to the `mysqlbinlog` program, and I agree with those.

The master binlog is carefully designed so that it is easily possible to locate any given global transaction ID and the corresponding binlog position (or to determine that such a global transaction ID is not present in any binlog file). In the initial design this requires scanning one (but just one) binlog file from the beginning; later we could add an index facility if this becomes a bottleneck. The `mysqlbinlog` program should also support this, probably by allowing a global transaction ID (or multiple IDs) to be specified for `--start-position` and `--stop-position`.

Robert also mentions the usefulness of an option to filter out events from within just one replication domain/stream. This is something I had not thought of, but it would clearly be useful and is simple to implement.

### On session variable `server_id`

With MariaDB global transaction ID, `server_id` becomes a session variable, as do newly introduced variables `gtid_domain_id` and `gtid_seq_no`. This allows an external replication mechanism to apply events from another server outside the normal replication mechanism, and still preserve the global transaction ID when the resulting queries are logged in the local binlog. One important use case for this is of course point-in-time recovery, `mysqlbinlog | mysql`. Here, mysqlbinlog can set these variables to preserve the global transaction IDs on the events applied, so that fail-over and so on will still work correctly.

Since messing with `server_id` and the other variables can easily break replication, setting them requires SUPER privileges.
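As a sketch, the statement stream applied via `mysqlbinlog | mysql` could preserve each event's GTID by setting these session variables before the event itself. The concrete values and the table are illustrative, not actual `mysqlbinlog` output:

```    -- Illustrative replay of one event with its original GTID preserved
-- (requires SUPER privileges, as noted above):
SET SESSION gtid_domain_id = 1;
SET SESSION server_id = 5;
SET SESSION gtid_seq_no = 300010;
INSERT INTO t1 VALUES (4,2);  -- the replayed event itself
```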

January 3rd, 2013
03:28 pm

[Link]

Global transaction ID in MariaDB

The main goal of global transaction ID is to make it easy to promote a new master and switch all slaves over to continue replication from the new master. This is currently harder than it could be, since the current replication position for a slave is specified in coordinates that are specific to the current master, and it is not trivial to translate them into the corresponding coordinates on the new master. Global transaction ID solves this by annotating each event with a global transaction ID that is unique and universal across the whole replication hierarchy.

In addition, there are at least two other main goals for MariaDB global transaction ID:

1. Make it easy to setup global transaction ID, and easy to provision a new slave into an existing replication hierarchy.
2. Fully support multi-source replication and other similar setups.

### Replication streams

Let us consider the second point first, dealing with multi-source replication. The figure shows a replication topology with five servers. Server 3 is a slave with two independent masters, server 1 and server 2. Server 3 is in addition itself a master for two slaves server 4 and server 5. The coloured boxes A1, A2, ... and B1, B2, ... denote the binlogs in each server.

When server 3 replicates events from its two master servers, events from one master are applied independently from and in parallel with events from the other master. So the events from server 1 and server 2 get interleaved with each other in the binlog of server 3 in essentially arbitrary order. However, an important point is that events from the same master are still strictly ordered. A2 can be either before or after B1, but it will always be after A1.

When the slave server 4 replicates from master server 3, server 4 sees just a single binlog stream, which is the interleaving of events originating in server 1 and server 2. However, since the two original streams are fully independent (by the way that multi-source replication works in MariaDB), they can be freely applied in parallel on server 4 as well. This is very important! It is already a severe performance bottleneck that replication from one master is single-threaded on the slaves, so replicating events from multiple masters also serially would make matters even worse.

What we do is to annotate the events with a global transaction ID that includes a tag that identifies the original source. In the figure, this is marked with different colours, blue for events originating in server 1, and red for server 2. Server 4 and server 5 are then free to apply a "blue" event in parallel with a "red" event, and such parallel events can thus end up in different order in the binlogs in different places in the replication hierarchy. So every server can have a distinct interleaving of the two original event streams, but every interleaving respects the order within a single original stream. In the figure, we see for example that A2 comes before B1 in server 3 and server 5, but after in server 4, however it is always after A1.

This concept of having a collection of distinct binlog streams, each strictly ordered but interleaved with each other in a relaxed way, is very powerful. It allows both great flexibility (and hence opportunity for parallelism) in applying independent events, as well as a simple representation of the state of each replication slave at each point in time. For each slave, we simply need to remember the global transaction ID of the last event applied in each independent stream. Then, to switch a slave to a new master, the master finds the corresponding places within its own binlog for each independent stream and starts sending events from the appropriate location in each stream.

For example, in the figure, we see that the state of server 4 is (A4, B3) and for server 5 it is (A3, B3). Thus we can directly switch server 5 to replicate from server 4 as its new master, as server 4 is strictly ahead of server 5 in the replication streams.

Or if we want to instead make server 5 the new master, then we first need to temporarily replicate from server 4 to server 5 up to (A4, B3). Then we can switch over and make server 5 the new master. Note that in general such a procedure may be necessary, as there may be no single server in the hierarchy that is ahead of every other server in every stream if the original master goes away. But since each stream is simply ordered, it is always possible to bring one server up ahead of the rest so it can serve as a master for the others.

### Setup and provisioning

This brings us back to the first point above: making it easy to set up replication using global transaction ID, and easy to provision a new slave into an existing replication hierarchy.

To create a new slave for a given master, one can proceed exactly the same way whether using global transaction ID or not. Make a copy of the master, obtaining the corresponding binlog position (`mysqldump --master-data`, XtraBackup, whatever). Set up the copy as the new slave, and issue `CHANGE MASTER TO ... MASTER_LOG_POS=...` to start replication. Then, when the slave first connects, the master will send the last global transaction ID within each existing replication stream, and the slave will thus automatically be configured with the correct state. If there later is a need to switch the slave to a different master, global transaction ID is already properly set up.

This works exactly because of the property that while we have potentially interleaved distinct replication streams, each stream is strictly ordered across the whole replication hierarchy. I believe this is a very important point, and essential for getting a good global transaction ID design. The notion of an ordered sequence of the statements and transactions executed on the master is the central core of MySQL replication, it is what users know and what has made it so successful despite all its limitations.

### Replication domains

To implement this design, MariaDB global transaction ID introduces the notion of a replication domain and an associated `domain_id`. A replication domain is just a server or group of servers that generate a single, strictly ordered replication stream. Thus, in the example above, there are two `domain_id` values in play, corresponding to the two colours blue and red. The global transaction ID includes the `domain_id`, and this way every event can be identified with its containing replication stream.

Another important point here is that `domain_id` is something the DBA configures explicitly. MySQL replication is all about the DBA having control and flexibility. The existence of independent streams of events is a property of the application using MySQL, not some server internal, so it needs to be under the control of the user/DBA. In the example, one would configure server 1 with `domain_id=1` and server 2 with `domain_id=2`.
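For the example above, the configuration might look like this (the variable name matches what was planned for MariaDB 10.0, but as the feature was work-in-progress, the final syntax could differ):

```sql
-- On server 1 (the blue stream):
SET GLOBAL gtid_domain_id = 1;

-- On server 2 (the red stream):
SET GLOBAL gtid_domain_id = 2;
```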

Of course, in basic (and not so basic) replication setups where only one master server is written to by applications at any one time, there is only a single ordered event stream, so `domain_id` can be ignored and remain at the default (which is 0).

Note by the way that `domain_id` is different from `server_id`! It is possible and normal for multiple servers to share the same `domain_id`; for example, server 1 might be a slave of some higher-up master server, and the two would then share the `domain_id`. One could even imagine that at some point in the future, servers would have moved around so that server 2 was re-provisioned to replace server 1; it would then retain its old `server_id` but change its `domain_id` to 1. So both the blue and the red event stream would have instances with `server_id=1`, but `domain_id` will always be consistent.

It is also possible for a single server to use multiple domain IDs. For example, a DBA might configure generated events to receive as `domain_id` a hash of the current schema. This would be a way of declaring that transactions in distinct schemas are guaranteed to be independent, and it would allow slaves to apply those independent transactions in parallel. The slave will just see distinct streams, and apply them in parallel the same way as for multi-source replication. This is similar to the multi-threaded slave that MySQL 5.6 implements, but it is more flexible; for example, an application could explicitly mark a long-running transaction with a distinct `domain_id` (after ensuring that it is independent of other queries), allowing it to be replicated in parallel without delaying replication of normal queries.
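The schema-hashing idea can be illustrated with a small sketch. This is not MariaDB code; the function name, the hash choice (CRC32), and the number of domains are all assumptions for illustration. The key property is that the mapping is stable, so every transaction against a given schema always lands in the same strictly ordered stream.

```python
# Hypothetical sketch: derive a domain_id from a schema name, so that
# transactions in distinct schemas fall into distinct, independently
# ordered replication streams that a slave may apply in parallel.
import zlib

NUM_DOMAINS = 8  # assumed number of parallel streams; tune for the workload

def domain_id_for_schema(schema: str) -> int:
    # A stable hash (CRC32 here) guarantees the same schema always maps to
    # the same domain, keeping each per-schema stream strictly ordered.
    return zlib.crc32(schema.encode("utf-8")) % NUM_DOMAINS

# All transactions against one schema share a stream:
print(domain_id_for_schema("sales") == domain_id_for_schema("sales"))
```

Note the trade-off: two schemas may hash to the same domain and thus share a stream, which is safe (ordering is preserved) but loses some potential parallelism.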

### Current status

The MariaDB global transaction ID is work-in-progress, currently planned for MariaDB 10.0.

The current code is maintained on Launchpad: `lp:~maria-captains/maria/10.0-mdev26`. The design is written up in detail in Jira task MDEV-26, where the progress is also tracked.

Global transaction ID has already been discussed on the maria-developers mailing list. I have received valuable feedback there which has been included in the current design. But I very much welcome additional feedback, I am open to changing anything if it makes the end result better. Much of the community seems to not be using mailing lists to their full potential (hint hint!), hence this blog post to hopefully reach a wider audience that might be interested.
