Home
Kristian Nielsen Below are the 17 most recent journal entries recorded in the "Kristian Nielsen" journal:
December 31st, 2009
12:49 pm

[Link]

Oracle speculations

The Planet MySQL has been abuzz with opinions for or against the acquisition of Sun (and in particular MySQL) by Oracle, but I do not have a strong opinion to chime in with in support of either groups. The reason is that I do not know anything about antitrust laws, which is the legal basis for the EC blocking or not blocking the deal; and also I do not know what the alternative is to Oracle buying the MySQL part of Sun.

However, that does not mean that I can not join in the speculations about Oracles reasons for wanting MySQL in the first place ;-)

I think it is basically a matter of obtaining control over MySQL.

The horror scenario for Oracle is that MySQL (or Postgress or another Free Software program) does to the proprietary databases what Linux has done to the proprietary Unixes. Which is essentially to kill them, slowly but surely. This is not an immediate threat to Oracle, but it is a real long-term threat given how similar the technical challenges are of developing a kernel/OS and a database/RDBMS. And should it happen, the impact on the license revenues from the Oracle RDBMS would be devastating.

Compared to that horror scenario, the potential loss of some fraction of Oracle license sales which will stem from the continued development of MySQL is of less consequences. And this loss will happen in any case, if not to MySQL then to Postgress, to a MySQL fork, or to another free database.

So from this reasoning, it makes the most sense for Oracle to continue development of MySQL more or less unchanged from what it was in MySQL AB and later at Sun. At a sufficiently high level (in terms of bug fixes, features, etc.) that most of the community interest will remain on MySQL and not turn to forks or other Free databases that are outside of Oracles control. And keeping the very tight control over the development that MySQL AB and Sun also had, with the community having basically no influence over what goes into the code or not. Then, should it ever become necessary, Oracle has the control it needs to prevent or at least manage the above-mentioned horror scenario.

In fact, this is exactly what Oracle has done with InnoDB since buying them four years ago. The development has continued more or less as before, as far as I can tell at a similar pace and with the same team. And while there is a fork in XtraDB, the Oracle-controlled InnoDB is good enough that by far the majority of the community is using it and not XtraDB.

So basically, after buying InnoDB Oracle has done essentially nothing with it, one way or the other. Notably, Oracle has not done any of the following or similar bold and visionary steps with InnoDB:

  • Transfered some of the technology from the Oracle RDBMS product into InnoDB which is currently missing. Like multiple tablespaces which can be assigned independently to tables (even though MySQL already has the syntax support for this). Or the use of multiple buffer pools to control page eviction. Or online backup based on the transaction log (innobackup is not part of the GPL InnoDB version, and there is no integration with the new MySQL backup interface). Etc. etc.
  • Opened up the development process like other Open Source projects, with public revision control repository (as far as I know they have not even switched to using bzr like the rest of the MySQL world), public mailing lists for reviews and discussions, public bug tracker, etc.
  • Taken steps to integrate the development process better with MySQL AB and later Sun. As far as I know, changes to the InnoDB included in the MySQL source tree still happens by manually sending patches from Oracle to MySQL which are then manually committed to the MySQL tree! Even worse, much of the development of new features has taken place on a separate product, the InnoDB plugin, which is not enabled by default (it was not even in the MySQL source tree until this Summer), and used by only a minority of the community.
  • Given up any kind of control of the development to the community (eg. as far as I know Oracle has taken no steps to work together with XtraDB).

Nothing I have seen in the statements or discussions so far seems to suggest that Oracle would treat an acquisition of MySQL differently.

Will this be good or bad for MySQL? Without an alternative scenario to compare with, I do not know. But it certainly is not sufficient! MySQL development has been stalling for several years, and we need to invigorate it to make MySQL meet the new challenges facing existing applications, and to improve MySQL for use in applications where it is currently weak.

We need improved management of huge databases: tablespace management; buffer pool control; backup infrastructure; etc. We need replication improvements: binlog storage inside default storage engine for improved transaction handling; interleaved logging of transactions; multi-threaded application of row-based events in MVCC engines; robust automatic handling of fail-over scenarios, etc. We need refactoring of the core server to enable future development: new parser; separate abstract syntax tree; move to modern multi-threading architecture with lock-free operations and RCU; etc. We need scalability improvements to multi-core computers. We need versioned metadata for better support of on-line DDL. We need server-side cache of already executed statements with access to per-statement statistics and execution plans. We need merge and hash joins. We need better integration of the many new storage engines being developed: inclusion by default in source and binaries to make them easy to try and use; extensions to the storage engine API to better interface to the new engines and fully exploit each of their unique features. And lots more.

Will Oracle take the lead on some of these, and give up sufficient control for the community to take the lead on the rest? Well, we do not know. It does seem hard to find a motivation for Oracle to drive the development of MySQL into new areas that will necessarily cannibalise their huge license revenues. On the other hand, if they do, they will be most welcome, and we look forward to hopefully working together with them. In any case, we must not blindly rely on this to happen, not from Oracle or any other company which may end up with the MySQL assets!

I sincerely hope that whatever happens to MySQL the company, a sufficient part of the community will remember that we need not just a MySQL (under whatever name) that is "good enough" today, we also need a MySQL for tomorrow. And for this the community needs to support those that step up to lead future MySQL development, whoever it will be.

So will it be MariaDB leading? We still have a way to go before we have proven ourselves worthy to saying this. But what I can say is that we are trying!

Tags: , , , ,

(1 comment | Leave a comment)

December 18th, 2009
12:09 pm

[Link]

MariaDB Buildbot configuration file published

I have now published the Buildbot configuration file that we use for our continuous integration tests in our Buildbot setup. Every push into main and development branches of MariaDB is built and tested on a range of platforms to catch and fix any problems early (and we also test MySQL releases before merging to easily see whether any new problems already existed in MySQL or were introduced by something specific to MariaDB).

The configuration is included in the Tools for MariaDB Launchpad project.

Now, the Buildbot configuration file is not something that most MariaDB users will need or want to care about, of course. But I think it is still very important to have it publicly available, not sitting on some private server of the company Monty Program AB.

The reason is that the whole idea with MariaDB is to make a community branch of MySQL, developed by the community and for the community. We want MariaDB the project to be bigger than Monty Program AB the company. And since the Buildbot testing is so central to the whole MariaDB development process, the Buildbot setup also needs to be available for the community. Want to improve the setup, just see what it is doing, or even set up your own master to show you can do a better job (and yes our Windows setup currently really suck)? Just go ahead! Wondering how the Buildbot setup can be continued if Monty Program AB disappears or turns fascist? Now there is an answer.

Hopefully the configuration can also be useful as an example for people doing fancy things with Buildbot. There is some cool stuff in there. Like creating a source tarball on a linux host, and uploading it to be built on a Windows host (this is how releases are done, so important to check that no files are missing from the source tarball). Another cool thing is the builders that boots op KVM virtual machines on demand to build and test binary packages (.deb, .rpm, and .tar.gz) on all of the 18 Linux platforms we currently release for.

BTW, you do not get the miscellaneous passwords in the published configuration file, sorry! :-)

[The license for the configuration file (which is in fact a sizable Python script, as this is the way Buildbot is configured) is GPL.]

Tags: , , , , ,

(Leave a comment)

December 17th, 2009
01:59 pm

[Link]

Fixing a MariaDB package bug

One of the things that I am really happy about in MariaDB is that we have our releases available as apt (and yum for Centos) repositories. This is largely thanks to being able to build this on the OurDelta package build infrastructure (which again builds on things like the Debian packaging scripts for MySQL).

Something like the Debian apt-get package system (which is also used by Ubuntu) is one of the major innovations in the Free Software world in my opinion. Debian has spent many years refining this system to where it is today. Want to run the mysql client, but it isn't installed? Just try to run it on your local Ubuntu host:

    $ mysql
    The program 'mysql' can be found in the following packages:
     * mysql-client-5.0
     * mysql-client-5.1
    Try: sudo apt-get install <selected package>
    -bash: mysql: command not found
Installing software does not get much easier than that!

Now, MariaDB is not in the distributions yet. However, it is easy to add external repositories like OurDelta into your system, after which packages from the external repositories are available fully integrated with the package system:

    wget -O- http://ourdelta.org/deb/ourdelta.gpg | sudo apt-key add -
    sudo wget http://ourdelta.org/deb/sources/karmic-mariadb-ourdelta.list \
	-O /etc/apt/sources.list.d/ourdelta.list
(there are also GUI ways to do this of course, for those who prefer that). After this is done, installation is just a sudo apt-get install mariadb-server-5.1 away, security updates will appear automatically for MariaDB just like any other package, etc. It will even upgrade an existing MySQL 5.0 installation automatically (but do take a backup first).

In order to make all this work, there is a lot of work going on behind the scenes in the scripts that make up the .deb packaging. I think most people underestimate the amount of work and clever engineering that goes into making a well-working .deb package. It is easy to laugh at how behind the latest stable Debian release is on software versions (and I am happy to do this on occasion as well). But Debian is still unique in the sheer amount of software it contains and the level of integration of each package in the whole system. And it is this work through the last more than 15 years that allows something like Ubuntu to exists, with an upgrade system that allows it to do 6-month release cycles to provide up-to-date software to its users.

As an example of the kind of details that needs to be dealt with, I wanted to explain a tricky packaging bug I fixed recently.

The bug was with installing MariaDB 5.1.39 on top of an existing MySQL 5.0 installation in Debian and Ubuntu. This is supposed to automatically run the mysql_upgrade program to upgrade all tables from the 5.0 format to the 5.1 format. The symptoms were that the upgrade was not performed correctly, for example the system tables in the mysql database were missing some of the columns added in the 5.1 version.

What made this tricky was that the bug was quite elusive. It did not happen always, and some platforms were ok while others (eg. Ubuntu Hardy amd64) seemed to have it more or less repeatedly. Even worse, even when it did occur, it went away by itself (this is because the .deb MariaDB and MySQL packages actually check for the need to upgrade the database on every server start, and the upgrade procedure always worked when the server was restarted after installation).

After poking for this for some hours, I managed to get it reproducibly on a KVM virtual machine containing Ubuntu Hardy. I then traced the problem into the upgrade procedure, which happens in the /etc/mysql/msyql_upgrade script which is run from /etc/init.d/mysql start. Comparing the log from this (in syslog) with the corresponding log from a successful installation showed that the upgrade procedure seemed to be aborted half-way through, but with absolutely no output (like an error message) to indicate any reason for this.

Now the fact that running the upgrade procedure (or just server start) manually did not exhibit any problems made it quite puzzling to figure out what was going on. Looking through the source code of mysql_upgrade and mysqlcheck (which is called from mysql_upgrade) did not reveal anything that would indicate a problem like this. One step would be to start instrumenting the code with printouts, but this would require building a full set of packages and installing them inside a virtual machine for every iteration, which would have been quite time-consuming.

A better approach turned out to be to install the package running under strace -f. This generates a log (45 MByte of it!) of all system calls made by the installation process, including child processes like the mysql_upgrade invocation. Digging through this for a while, I finally discovered that the upgrade process was being terminated because it received a SIGHUP (hangup) signal!

Now why would it be killed with SIGHUP? Turns out it is due to this snippet from the /etc/mysql/debian-start bash script:

    (
      upgrade_system_tables_if_necessary;
      check_root_accounts;
      check_for_crashed_tables;
    ) <&2 &
The upgrade procedure is run in the background (as it may take a long time, and since this is run on every server restart, it could also happen eg. during host boot, which we do not want to block for long periods of time). But there is no protection against the controlling terminal going away! My guess is that apt-get in some cases will allocate a pseudo-tty to deal with package configuration input, and this is closed when installation is done, causing the background upgrade procedure to be killed with SIGHUP.

Now finally the problem is understood, and the fix is a one-liner: just add this before starting the background job:

    trap "" SIGHUP
Problem solved!

[Incidentally, if you installed MariaDB 5.1.39 .debs and experienced this problem, there is an easy workaround for this bug: just restart the MariaDB server once after installation (sudo /etc/init.d/mysql restart). This will make the upgrade procedure go through if it did not already. This fix will be included in the upcoming MariaDB 5.1.41 release.]

Tags: , , , , , ,

(Leave a comment)

October 19th, 2009
11:07 am

[Link]

Building MariaDB/MySQL with Buildbot and KVM

Testing and automation. These two are key to ensuring high quality of software releases.

Ever since I worked briefly in the team at MySQL AB that is responsible for creating the binary (and source) packages of MySQL releases, I have had the vision of a fully automated release procedure. Whenever someone pushes a new commit to the release branch revision control tree, the continuous integration test framework should kick in and do all the steps needed for producing release packages:

  • Checkout the new revision.
  • Build a source tarball, and save it.
  • For each platform, build a binary package from the source tarball. The build should be done in a freshly installed machine without any revision control checkouts, previous build trees, or extra installed software, to ensure that no unwanted dependencies or stray references to other files or packages are introduced.
  • For each platform, install the binary packages, this time on freshly installed machines with also no build tools (compilers, development packages, etc.) installed, to check that they install correctly without unexpected dependencies. Run tests of the installed server, including starting the server and running basic queries and test suites.
  • Upgrade testing, installing the new packages on machines prepared with earlier installations, and testing that the upgrade procedure works and preserves old data.

To do this efficiently, clearly the use of virtual machines is needed. This weekend I played with KVM and Buildbot, and managed to set up a proof-of-concept of this that I am really pleased with.

KVM

There are lots of options for virtualisation these days, including KVM, Xen, VirtualBox, and Vmware. I use KVM, and I really like it. The integration into the distributions is excellent (sudo apt-get install kvm and you're up and running). The interface is powerful and flexible, and at the same time really simple to learn and use. Just a couple of commands with man pages, like it should be in a Unix system. Basically, it just works!

I started by installing a basic ubuntu Jaunty server in a virtual machine:

  qemu-img create -f qcow2 vm-jaunty-i386-base.qcow2 8G
  kvm -m 2047 -hda vm-jaunty-i386-base.qcow2 -cdrom ubuntu-9.04-server-i386.iso \
    -boot d -smp 2 -cpu qemu32,-nx -net nic,model=virtio -net user -redir tcp:2222::22
I use the user mode network stack with port forwarding for ssh access. This allows to run kvm without root privileges, avoids any need to manage different MAC addresses, avoids the need for routing or configuring interfaces, etc.

Using the virtio network driver greatly improved throughput for me when copying things into and out of the virtual machine. The -cpu qemu32,-nx (disable "No eXecute" support) is needed in this case due to some bug or incompatibility, or the installation hangs upon reboot. As usual Google is your friend in cases like this:

Incidentally, I did this using remote X over an SSH connection. This works fine, no need for physical access to the host server. After installation we will run the virtual machine without a graphic console, but it was just easier to use the stock Ubuntu installer than trying to find a way to install over the emulated serial port.

Initial setup

Next I did some basic preparation to make the installed virtual machine work well for command line and script usage. However, the amount of extra packages installed is kept to a minimum to get proper testing against unwanted dependencies.

I Installed ssh server for remote access. I then set it up to use the serial console (as we will be running kvm in -nographic mode). To get a login prompt on serial port 0, create /etc/event.d/ttyS0:

    start on stopped rc2
    start on stopped rc3
    start on stopped rc4
    start on stopped rc5
    stop on runlevel 0
    stop on runlevel 1
    stop on runlevel 6
    respawn
    exec /sbin/getty 115200 ttyS0

To get the kernel to output its boot log to the serial port, edit the kernel line in /boot/grub/menu.lst, removing quiet splash and adding console=ttyS0,115200n8 console=tty0. To get Grub to use the serial port, add these lines to /boot/grub/menu.lst:

    serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1
    terminal --timeout=3 serial console

Next, we need a user account inside the virtual machine that we can use from the outside with passwordless login and sudo access. Inside the guest, create the account and grant passwordless sudo:

    sudo adduser --disabled-password buildbot
    sudo adduser buildbot sudo
    sudo visudo
    # uncomment `%sudo ALL=NOPASSWD: ALL'
Then, in the host create an SSH public/private key pair without passphrase:
    ssh-keygen -t dsa
Copy the resulting ~/.ssh/id_dsa.pub from the host into ~/.ssh/authorized_keys in the guest.

Now we should be able to test that things work:

    kvm_pid_2222' ; exec kvm -m 2047 -hda /kvm/vms/vm-jaunty-i386-makedist.qcow2 \
        -redir 'tcp:2222::22' -boot c -smp 2 -cpu qemu32,-nx -nographic \
        -net nic,model=virtio -net user
    # We should get a login prompt in the terminal window
    ssh -p 2222 buildbot@127.0.0.1 'sudo id'
    # We should get root access without login or sudo asking for password.
We now have the basis for scripting actions against the virtual machine: We can start up the guest from the command line (and shutdown with kill from the host or sudo shutdown -h now from the guest). And we can run commands inside the guest using ssh -p 2222 buildbot@127.0.0.1. The next step is to create variants of this base virtual installation for the different purposes we need.

qemu-img create -b base_image.qcow2

The qcow2 virtual hard disk image format used by qemu (and kvm) has a very powerful feature, activated with the -b option of qemu-img create:

    qemu-img create -b vm-jaunty-i386-base.qcow2 -f qcow2 vm-jaunty-i386-makedist.qcow2
This creates a new image vm-jaunty-i386-makedist.qcow2, which is initially a clone of the base image vm-jaunty-i386-base.qcow2 that takes up (almost) no extra space. But as we use this new image, changes are added in the new image (copy-on-write), without modifying the original base image. This allows painless mass cloning and modification of virtual machines without having to re-install, and without taking up unnecessary extra disk space and I/O for copying images.

We use this to create a virtual machine that we will use to produce the source tarball from bzr sources. This needs installing bzr and some development packages (compilers etc).

    sudo apt-get install bzr
    sudo apt-get build-dep mysql-5.1-server
Note the very nice build-dep feature of apt-get, it actually installs a ton of packages needed to build the MySQL server (and MariaDB has the save dependencies). I also copied in an existing shared bzr repository; this is not strictly necessary, but saves a very painful initial cloing of the entire MariaDB repository from Launchpad (bzr is just painfully slow on source trees of the size of MariaDB/MySQL):
    scp -rp -P 2222 .bzr buildbot@127.0.0.1:

Another virtual machine image is set up for building the binary packages (this does not need bzr):

    qemu-img create -b vm-jaunty-i386-base.qcow2 -f qcow2 vm-jaunty-i386-build.qcow2
(with a bit more planning, I could have cloned -makedist from -build; now I just repeated the install of mysql-server-5.1 dependencies, but not the bzr install. Also, a refinement wouldbe to setup the -build guest without autotools and bison, to check that build is possible without those installed).

Finally, a third image for testing installation:

    qemu-img create -b vm-jaunty-i386-base.qcow2 -f qcow2 vm-jaunty-i386-install.qcow2
I will be testing a bintar package install, so create the mysql user and group:
     sudo adduser --system --group mysql

With these preparations, we should be ready to put the pieces together:

Buildbot

For MariaDB, we use Buildbot for continuous integration testing. The Pushbuild system I developed at MySQL was never released publicly, and in any case it is better to use a general tool like Buildbot that is widely used and maintained by a large community.

I have been very satisfied with Buildbot. It has its quirks and bugs, but we can fix those over time (and have fixed a number of them already, as well as added extra features we needed). I think Buildbot has all of the right ideas for doing serious continuous integration testing. As I read in some presentation, running the builds and tests is the easy part. The hard part is providing the information and tools needed by developers to fix problems that are found by testing. Fixing these problems is what it is all about, after all, not just producing pretty status reports.

First, I installed a buildbot slave on the host machine:

    sudo apt-get install buildbot
    sudo addgroup buildbot kvm  # To allow buildbot to run kvm
    sudo -u buildbot buildbot create-slave --usepty=0 /var/lib/buildbot/maria-slave \
        hasky.askmonty.org:9989 knielsen-kvm-x86 <password>
Then I set up an account for this in the Buildbot master, and configured the builder.

With the above preparation, configuring the build is just setting up the proper shell commands to be run against the slave, although it is of course a bit more involved than for a normal configure+make. I really like the simplicity of this. Basically, after initial preparation of the KVM images, there is very little setup required on the buildbot slave host, it is all just normal shell commands configured on the master. Of course going forward we can refine some of this and maybe put some of it into generic scripts called from the main config, but for a proof-of-concept I think it is brilliant that one can see exactly which commands are run.

I included the complete config in all detail at the end of this post, but here are the main points.

f_kvm_jaunty_x86.addStep(Compile(
        logfiles={"kernel": "kernel_2222.log"},
        command=["sh", "-c", """
kill -9 "$(cat kvm_pid_2222)"
(exec sh -c "echo \$\$ > 'kvm_pid_2222' ; exec kvm -m 2047 \
    -hda /kvm/vms/vm-jaunty-i386-makedist.qcow2 -redir 'tcp:2222::22' -boot c \
    -smp 2 -cpu qemu32,-nx -nographic -net nic,model=virtio -net user" \
    </dev/null >kernel_2222.log 2>&1) &
sleep 15
while : ; do ssh -o ConnectTimeout=4 -p 2222 buildbot@127.0.0.1 true && break; sleep 2; done
ssh -p 2222 buildbot@127.0.0.1 'mkdir -p buildbot && cd buildbot && \
    rm -Rf build && bzr co "lp:~maria-captains/maria/mariadb-5.1-knielsen" build && \
    cd build && BUILD/compile-dist && make dist && \\
    mv "$(make show-dist-name).tar.gz" ..'
"""]))
The kill command removes any previous left-over kvm process (better safe than sorry). We run kvm in the backgroud, getting the console output through a log file. Note that redirecting the kvm output is necessary, as the buildstep will wait for all processes to close the stdout before considering the buildstep done.

After starting the virtual machine, we wait for boot to have completed by checking for successful ssh connection in the while loop. Once it is ready, we send the commands to build the source tarball into the guest using ssh. </p>

f_kvm_jaunty_x86.addStep(SetProperty(
        property="distname",
	command=["ssh", "-p", "2222", "buildbot@127.0.0.1", "cd buildbot/build && make show-dist-name"],
        ))
This gets the base name of the source tarball into a Buildbot build property, an essential feature of Buildbot for more advanced usage. We will need this name in the following build steps (the name depends on the version of the MariaDB server code).

f_kvm_jaunty_x86.addStep(ShellCommand(
	command=["sh", "-c", WithProperties("""
scp -P 2222 buildbot@127.0.0.1:buildbot/%(distname)s.tar.gz .
ssh -p 2222 buildbot@127.0.0.1 'sudo shutdown -h now'
while : ; do sleep 5; kill -0 "$(cat kvm_pid_2222)" || break; done
rm -f kvm_pid_2222
""")],))
We copy out the generated source tarball (we will need it in the next buildstep, which runs in a different virtual machine). We then shutdown this guest, and wait for it to finish with another while loop. Note the use of WithProperties to interpolate the source tarball name obtained in the previous build step.

f_kvm_jaunty_x86.addStep(Compile(
        command=["sh", "-c", WithProperties("""
qemu-img create -b /kvm/vms/vm-jaunty-i386-build.qcow2 -f qcow2 vm-tmp-2222.qcow2
kill -9 "$(cat kvm_pid_2222)"
(exec sh -c "echo \$\$ > 'kvm_pid_2222' ; exec kvm -m 2047 -hda vm-tmp-2222.qcow2 \
    -redir 'tcp:2222::22' -boot c -smp 2 -cpu qemu32,-nx -nographic -net nic,model=virtio \
    -net user" </dev/null >>kernel_2222.log 2>&1) &
# ...
ssh -p 2222 buildbot@127.0.0.1 'rm -Rf buildbot && mkdir buildbot'
scp -P 2222 %(distname)s.tar.gz buildbot@127.0.0.1:buildbot/
ssh -p 2222 buildbot@127.0.0.1 'cd buildbot && tar zxf %(distname)s.tar.gz && \
    cd %(distname)s && ./configure ...'
# ...
""")],))
Here (and in the following install step), we use qemu-img create -b to create a new, temporary image to work in. This ensures that each build will run in a clean, fresh install, without any risk of contamination from previous builds. (The reason we did not do this for the initial step is that we want to save the bzr revisions pulled from Launchpad so we do not have to keep repeatedly pulling the old ones over for each new build. An alternative would be to keep the permanent shared repository on the host machine and export from that inside the virtual machine).

And that's it! Full config details below, but it is basically the same, just with different commands run in the different steps. The result is a builder that fully automatically tests build and install on real machines with the correct setup, 100% repeatable between builds.

The results from this can be seen on the MariaDB Buildbot pages. Things are likely to to shuffle around as we extend and refine this, but for now an example build can be seen here:

This is just a quick proof of concept, but I think all of the essential ingredients are in there. I am hoping that in the not too distant future we will be using something like this regularly to check MariaDB release builds, which should be very good for getting ensuring both the quality and efficiency of future MariaDB releases!

Full config

f_kvm_jaunty_x86= factory.BuildFactory()
f_kvm_jaunty_x86.addStep(Compile(
        description=["making", "dist"],
        descriptionDone=["make", "dist"],
        logfiles={"kernel": "kernel_2222.log"},
        command=["sh", "-c", """
kill -9 "$(cat kvm_pid_2222)"
(exec sh -c "echo \$\$ > 'kvm_pid_2222' ; exec kvm -m 2047 \
    -hda /kvm/vms/vm-jaunty-i386-makedist.qcow2 -redir 'tcp:2222::22' -boot c -smp 2 \
    -cpu qemu32,-nx -nographic -net nic,model=virtio -net user" \
    </dev/null >kernel_2222.log 2>&1) &
sleep 15
while : ; do ssh -o ConnectTimeout=4 -p 2222 buildbot@127.0.0.1 true && break; sleep 2; done
ssh -p 2222 buildbot@127.0.0.1 'mkdir -p buildbot && cd buildbot && \
    rm -Rf build && bzr co "lp:~maria-captains/maria/mariadb-5.1-knielsen" build && \
    cd build && BUILD/compile-dist && make dist && \
    mv "$(make show-dist-name).tar.gz" ..'
"""
                 ],
        ))
f_kvm_jaunty_x86.addStep(SetProperty(
        property="distname",
	command=["ssh", "-p", "2222", "buildbot@127.0.0.1",
                 "cd buildbot/build && make show-dist-name"],
        ))
f_kvm_jaunty_x86.addStep(ShellCommand(
        description=["copying", "tarball"],
        descriptionDone=["copying", "tarball"],
        logfiles={"kernel": "kernel_2222.log"},
	command=["sh", "-c", WithProperties("""
scp -P 2222 buildbot@127.0.0.1:buildbot/%(distname)s.tar.gz .
ssh -p 2222 buildbot@127.0.0.1 'sudo shutdown -h now'
while : ; do sleep 5; kill -0 "$(cat kvm_pid_2222)" || break; done
rm -f kvm_pid_2222
""")],
        ))
f_kvm_jaunty_x86.addStep(Compile(
        description=["making", "bintar"],
        descriptionDone=["make", "bintar"],
        logfiles={"kernel": "kernel_2222.log"},
        command=["sh", "-c", WithProperties("""
qemu-img create -b /kvm/vms/vm-jaunty-i386-build.qcow2 -f qcow2 vm-tmp-2222.qcow2
kill -9 "$(cat kvm_pid_2222)"
(exec sh -c "echo \$\$ > 'kvm_pid_2222' ; exec kvm -m 2047 -hda vm-tmp-2222.qcow2 \
    -redir 'tcp:2222::22' -boot c -smp 2 -cpu qemu32,-nx -nographic -net nic,model=virtio \
    -net user" </dev/null >>kernel_2222.log 2>&1) &
sleep 15
while : ; do ssh -o ConnectTimeout=4 -p 2222 buildbot@127.0.0.1 true && break; sleep 2; done
ssh -p 2222 buildbot@127.0.0.1 'rm -Rf buildbot && mkdir buildbot'
scp -P 2222 %(distname)s.tar.gz buildbot@127.0.0.1:buildbot/
ssh -p 2222 buildbot@127.0.0.1 'cd buildbot && tar zxf %(distname)s.tar.gz && \
    cd %(distname)s && CC="gcc -static-libgcc" CXX="gcc -static-libgcc" \
    CFLAGS="-O2 -fno-omit-frame-pointer -g" CXXFLAGS="-O2 -fno-omit-frame-pointer -g" \
    ./configure --prefix=/usr/local/mysql --exec-prefix=/usr/local/mysql \
    --libexecdir=/usr/local/mysql/bin --localstatedir=/usr/local/mysql/data \
    --with-server-suffix=1 --with-comment="(MariaDB - http://askmonty.org/)" \
    --with-system-type=linux-gnu --enable-shared --enable-static --enable-thread-safe-client \
    --enable-local-infile --with-big-tables --with-libwrap --with-ssl --without-docs \
    --with-readline --with-extra-charsets=all --with-embedded-server --with-libevent \
    --with-partition --with-zlib-dir=bundled --with-plugins=max-no-ndb && make -j3 && \
     sudo rm -Rf /usr/local/mysql && sudo make install && \
     sudo mv /usr/local/mysql /usr/local/%(distname)s-Linux-x386 && \
    tar zcf ../%(distname)s-Linux-x386.tar.gz -C /usr/local %(distname)s-Linux-x386/'
scp -P 2222 buildbot@127.0.0.1:buildbot/%(distname)s-Linux-x386.tar.gz .
ssh -p 2222 buildbot@127.0.0.1 'sudo shutdown -h now'
while : ; do sleep 5; kill -0 "$(cat kvm_pid_2222)" || break; done
rm -f kvm_pid_2222
""")],
        ))
f_kvm_jaunty_x86.addStep(Test(
        description=["testing", "bintar"],
        descriptionDone=["test", "bintar"],
        logfiles={"kernel": "kernel_2222.log"},
        command=["sh", "-c", WithProperties("""
qemu-img create -b /kvm/vms/vm-jaunty-i386-install.qcow2 -f qcow2 vm-tmp-2222.qcow2
kill -9 "$(cat kvm_pid_2222)"
(exec sh -c "echo \$\$ > 'kvm_pid_2222' ; exec kvm -m 2047 -hda vm-tmp-2222.qcow2 \
    -redir 'tcp:2222::22' -boot c -smp 2 -cpu qemu32,-nx -nographic -net nic,model=virtio \
    -net user" </dev/null >>kernel_2222.log 2>&1) &
sleep 15
while : ; do ssh -o ConnectTimeout=4 -p 2222 buildbot@127.0.0.1 true && break; sleep 2; done
ssh -p 2222 buildbot@127.0.0.1 'rm -Rf buildbot && mkdir buildbot'
scp -P 2222 %(distname)s-Linux-x386.tar.gz buildbot@127.0.0.1:buildbot/
ssh -p 2222 buildbot@127.0.0.1 'cd buildbot && \
    sudo rm -Rf /usr/local/mysql /usr/local/%(distname)s-Linux-x386 && \
    sudo tar zxf %(distname)s-Linux-x386.tar.gz -C /usr/local/ && \
    sudo ln -s %(distname)s-Linux-x386 /usr/local/mysql && cd /usr/local/mysql && \
    sudo sudo chown -R mysql . && sudo chgrp -R mysql . && \
    sudo bin/mysql_install_db --user=mysql && sudo chown -R root . && \
    sudo chown -R mysql data mysql-test && \
    cd mysql-test && sudo su -s /bin/sh -c "perl mysql-test-run.pl alias" mysql'
ssh -p 2222 buildbot@127.0.0.1 'sudo shutdown -h now'
while : ; do sleep 5; kill -0 "$(cat kvm_pid_2222)" || break; done
rm -f kvm_pid_2222
""")],
        ))

bld_kvm_jaunty_x86 = {'name': 'kvm-jaunty-x86',
                      'slavename': 'knielsen-kvm-x86',
                      'builddir': 'kvm-jaunty-x86',
                      'factory': f_kvm_jaunty_x86,
                      }

c['builders'].append(bld_kvm_jaunty_x86)

Tags: , , , ,

(Leave a comment)

August 1st, 2009
09:57 pm

[Link]

Valgrinding Drizzle

Like so many others, I got interested in the Drizzle project when it started. Some good ideas, lots of enthusiasm, and just pure GPL license, no "yes, we will take your work for free and sell proprietary licenses to it" SCA.

I even started contributing some development, fixing a number of Valgrind-detected bugs in Drizzle. I am proud that we kept the MySQL code 100% free of Valgrind errors, and wanted to help keep the same in Drizzle. So I debugged and fixed quite a few of the Valgrind-detected bugs that had crept in since forking from MySQL.

As I remember, I got down to two or three remaining or so. However, I it did discourage me somewhat to see how quickly these bugs had been allowed to enter the code. I remember one case where there was a Drizzle patch that had tried to simplify some field types. As I remember, the patch tried to simplify the code by eliminating some of multiple variants of string types. All well and good, but then there was one place where this elimination was a bit tricky, and the patch just #ifdef'ed out the offending part of the code, leaving the resulting code completely broken, as detected by Valgrind. And this had been in the source for 4 months! Cleaning up code is good, but not if only the easy 90% is done, and the rest is left undone. [Later the Drizzle people started the nonsense with "Drizzle is GPL, but contributions are considered licensed to Sun under BSD", and I kind of lost interest.]

Anyway, so half a year later I though it would be interesting to see how the state of Valgrind is for Drizzle nowadays. So I branced the latest lp:drizzle (and lp:libdrizzle), built it, and ran the test suite under Valgrind:

    (cd tests && ./mtr --valgrind --force)

Unfortunately, the results are not good: 1900 Valgrind warnings!

The warnings are all kinds: Memory leaks, mismatched free()/delete, uninitialised memory accessed from system calls or conditional jumps, etc. Some are probably benign or even false positives from Valgrind. Some are probably minor bugs, like tiny memory leaks in seldom-used features or garbage in log output. And some are most definitely serious bugs in the code that need to be fixed. With a flood of 1900 errors, it is impossible to tell which is which without days of careful study of the errors and debugging of the code.

I hope Drizzle will fix these issues. I have a lot of experience with Valgrind, and I know how hard it is both to debug and fix the issues reported, and also to get all developers to understand the importance of not allowing code into the tree with Valgrind problems. But I have also learned how many real, serious problems Valgrind can detect, problems that are often impossible to otherwise catch during development. Valgrind warnings can be caused by benign problems, but they are very rarely false alarms. But it is important to fix problems quickly, otherwise the number of problems will pile up until the sheer mass of issues makes it impossible to ever get back to a clean state with zero warnings in the test suite.

(Drizzle has done other good stuff. Like Building with -Werror -pedantic -Wall. This is something I hope we can soon duplicate in MariaDB. We do have a clean Valgrind test run in MariaDB, and make sure we keep it by running every push through Valgrind in Buildbot).

Tags: , , , , ,

(4 comments | Leave a comment)

July 12th, 2009
09:27 am

[Link]

Learning Python

Among other things, these past few months I have been working on setting up Buildbot, including adding various enhancements and bug fixes that are needed to properly build and test the MariaDB and MySQL code base.

Since Buildbot is written in Python, this means I have also had to learn Python. I am an old-time Perl hacker, so this exercise feels a bit like living in enemy territory ;-)

Anyway, Python is often touted as a more "pretty" language. And in many ways it is. Still, it is not without its own gotchas. Think "scope rules". Obviously someone haven't been reading up on the subject before implementing things in Python, causing the language to behave stupidly (and certainly different from what one expects) in the following three cases that I hit during my Buildbot work.

First assignment is implicit scope declaration

    def foo():
	s = 0
	def inc():
	    s = s + 1
	print s
	inc()
	print s
This results in this error:
UnboundLocalError: local variable 's' referenced before assignment
Why? Because assigning to `s' declares a new variable. Yep, that's right, a nested scope can read the value of a variable in an outer scope, but it cannot assign it!

This is the work-around:

    def foo():
	s = { 'blarg': 0 }
	def inc():
	    s['blarg'] = s['blarg'] + 1
	print s['blarg']
	inc()
	print s['blarg']
Now the inner scope in inc() does not assign to the outer variable `s'. It merely reads the value, and updates the dictionary it contains. So now things work. Hm ...

Class vs. instance members

    class Bar:
	s = 0
	def foo(self, x):
	    self.s += x
	    print self.s

    a = Bar()
    a.foo(5)
    b = Bar()
    b.foo(8)
So this example actually works as one would expect from first glance (it prints "5" then "8"). But then when I looked closer, I did not understand how it could work. That s = 0 creates a class member, shared by all instances of the class. So how can each instance still get their own private copy, each correctly initialised to 0?

Ah, the answer is another variant of assignment creating a new scope. Look at self.s += x. This statement first reads s.self, which provides the value of the class member. It then assigns the new value to s.self, but since this is assignment, it now refers to an instance member, so it creates a new instance member! I don't know what those Python guys where thinking when they made self.s refer to two different variables in a single statement ...

So this means that while the above example works as expected, this very similar one does not:

    class Bar:
	s = []
	def foo(self, x):
	    self.s.append(x)
	    print self.s

    a = Bar()
    a.foo(5)
    b = Bar()
    b.foo(8)
The last statement prints [5,8] as self.s is now a class member shared among all instances.

The work-around here is to initialise member variables in the constructor __init__(), not in the class declaration.

    class Bar:
	def __init__(self):
	    self.s = []
	def foo(self, x):
	    self.s.append(x)
	    print self.s

    a = Bar()
    a.foo(5)
    b = Bar()
    b.foo(8)

Late-binding closure construction

    b = []
    for i in range(10):
	b.append(lambda x: i)

    b[0](42)
    b[3](42)
This outputs the same value "9" twice. All of the functions in the list return 9! Oops.

The reason is apparently that the closure created by (lambda ...) does late binding of captured outer variables, meaning that it refers to the name, not to the value at the time of closure construction. This is unlike any other language I have ever seen that has lexical scoping, so quite confusing.

I know of two work-arounds in this case, neither of them pretty.

One is to use a dummy extra parameter with a default value:

    b = []
    for i in range(10):
	b.append(lambda x, dummy=i: dummy)

    b[0](42)
    b[3](42)
See, when the variable i appears in the default value of a parameter, it is bound early (so to the value of i is used, not the name), different from when the variable appears in the body of the lambda expression.

The other work-around is to build and call an extra closure to force binding to the correct value:

    b = []
    for i in range(10):
	b.append( (lambda j: (lambda x: j)) (i) )

    b[0](42)
    b[3](42)
This time, passing the value of i to the outer lambda forces early binding, so we get the expected results.

Something to be aware of for an old-time Perl hacker like me, used to using functional style when programming...

Tags: , , ,

(6 comments | Leave a comment)

January 28th, 2009
08:10 am

[Link]

Placeholders and SQL injection, part 2

Actually, what I really wanted to blog about before getting carried away with irony yesterday was an old idea on how to force my developers to use placeholders exclusively for SQL queries in applications. As should be apparent from yesterdays blog entry, I am strongly in favour of using placeholders for interpolating values into SQL queries, due to the great reduction in potential bugs (including, but not limited to, SQL injections).

Basically, wrap the database API so that all database access passes through the wrapper. This can usually be achieved, for example by subclassing DBI (for Perl) and returning such subclasses from the application connection pool, or other similar methods. Probably many large web applications already have such wrappers or use APIs that can be patched or extended appropriately.

Now add code that basically bombs out with a big error message if any SQL query contains a quote character. Something like "Always use placeholders for interpolating values into SQL queries! If in disagreement, go see your development lead for your regular spanking!" or words to the same effect.

Sometimes, the wrapper may sit below some code in the database API that emulates placeholders (for example, DBD::mysql used to emulate placeholders in the client using mysql_real_quote_string() or equivalent, since real server-side placeholders are only available with the newer version of the MySQL protocol for prepared statements). But even in this case, the wrapper can still force the use of placeholders by exploiting the fact that MySQL supports both single (')and double (") quotes. Basically, the wrapper would set some private global variable at random to either a single or a double quote, make placeholder emulation use one, and bomb out if the other is detected in query strings. Then any developer trying to sneak manual quoting into the application would quickly be caught, and subsequently taught.

The technique is not perfect. It does not catch completely unquoted number interpolation (shudder). It will also be somewhat of an annoyance to have to specify all string constants as placeholders (there is nothing wrong with "SELECT value FROM t WHERE id = ? AND color = 'red'"). In the end, I never got to implement it, also because my team was small enough and clue-full enough that normal face-to-face talk was sufficient to make placeholders be used throughout.

But if I ever find myself as lead or architect for a web application team, I will be sorely tempted to implement it, as an educational means for the developers and just to see what reactions it will cause.

Tags: , , ,

(3 comments | Leave a comment)

January 27th, 2009
04:06 pm

[Link]

Placeholders and SQL injection

It is sad to see how 9X% (or should that be 99.X%?) of SQL applications are riddled with SQL injection bugs.

There really is no excuse for this. Nobody writes code like this:

sub stupid_sum {
    my ($list) = @_;
    my $string = shift @$list;
    for (@$list) {
      $string .= " + " . $_;
    }
    my $sum = eval($string);
    return $sum;
}
Right? Just because our computers use the Von Neumann architecture, where CPU instructions and data is stored in the same memory, does not mean that we cannot distinguish between code and data (ok, so in TeX we do not, but there is a reason TeX is not pleasant to write applications in).

So when we use functions to group our code into logical units, we have this fancy syntax for something called parameters. And we can write clever stuff like this:

int foo(int a) {
    return a + 1;
}
And so the "+" and the "1" are part of the code for the function foo(), while the other value "a" to be added is data which comes from another part of the program. Great stuff!

In fact, in the old days, people were using something called embedded SQL, which tries to keep this distinction for using SQL with another language. Though I have to admit, having used Oracle ProC, that this was quite horrible.

But there is no need for embedding the SQL into the syntax of every language, because these brilliant people invented the placeholder! So now we can also have parameters for SQL, an SQL string holds the code to be executed, and placeholders supply the values to be used from other parts of the program.

And it is so easy. No need for mysql_real_quote_string() and other horrors. Just do like this, using Perl and DBI as example:

sub mark_items {
    my ($dbh, $mark, @keys_list) = @_;
    my $placeholders = join(",", map("?", @keys_list));
    $dbh->do("UPDATE t SET flag = ? WHERE id IN (" . $placeholders . ")",
             undef,
             $mark, @keys_list);
}
And then you can sleep well at night not worrying about which kind of values are passed to your SQL, or whether "+" can maybe format your hard disk if someone passes in the wrong argument.

Just because modern dynamic languages make string concatenation easy, does not mean that confusing code and data is a good thing. Von Neumann architecture is good for CPUs, but at higher levels of abstraction we have moved on.

As Bjarne Stroustrup often says, just as plumbers need an education to be allowed to mess up pipes, why doesn't a programmer need an education that makes him or her understand these things before being allowed to release software?

Tags: , ,

(7 comments | Leave a comment)

November 27th, 2008
11:51 pm

[Link]

Selecting rows holding group-wise maximum of a field, part two

Selecting rows holding group-wise maximum is a favorite problem of mine, but one which only rarely pops up. But for some reason, after my last blog post on the subject, it seems to be mentioned almost daily around here.

Something that I forgot to mention in the previous post is that most of the examples there assume suitable indexing is available to get decent performance. Basically a composite index on both the column(s) in the GROUP BY and the column over which MAX is computed is needed. In the example I gave, such an index is available throught the primary key.

However, such an index may not be available in all cases. Maybe maintaining it would be too expensive, or maybe the data the max is computed over is itself the result of a (sub-)query, and no indexing is available. So it is worth it also to understand this case, as the performance of the different possible queries differ greatly from the indexed case.

So let us modify the original example to not have any useful indexes:

CREATE TABLE object_versions (
  id INT PRIMARY KEY AUTO_INCREMENT,
  object_id INT NOT NULL,
  version INT NOT NULL,
  data VARCHAR(1000)
) ENGINE=InnoDB;

This time, I will use a data set of size only 1% of the previous example, as without indexes some of the queries get ridiculously poor performance. So let us take 10,000 rows, 1000 object each with 10 versions. I use this Perl long^H^H^H^Hone-liner to load the data:

perl -MDBI -le '$vers=10; $groups=1000; $dbh=DBI->connect("dbi:mysql:", "test",
"pass", {RaiseError => 1}); $dbh->do("USE test"); foreach $o (1..$groups)
{ $dbh->do("INSERT INTO object_versions VALUES " . join(", ", map("(null, ?,?,?)",
1..$vers)), undef, map( ($o, $_, "data_${o}_$_"), 1..$vers)); }'
(Yes, I know... but I have a strange love for Perl one-liners).

Here are the results:

mysql> SELECT data FROM object_versions o1
WHERE version = (SELECT MAX(version) FROM object_versions o2 WHERE o1.object_id = o2.object_id);
1000 rows in set (25.55 sec)

mysql> SELECT o1.data FROM object_versions o1
INNER JOIN (SELECT object_id, MAX(version) AS version FROM object_versions GROUP BY object_id) o2
ON (o1.object_id = o2.object_id AND o1.version = o2.version);
1000 rows in set (0.72 sec)

mysql> SELECT o1.data FROM object_versions o1
LEFT JOIN object_versions o2 ON o1.object_id = o2.object_id AND o1.version < o2.version
WHERE o2.object_id IS NULL;
1000 rows in set (15.44 sec)

mysql> SELECT MAX(CONCAT(version, ":", data)) FROM object_versions GROUP BY object_id;
1000 rows in set (0.52 sec)

mysql> SELECT data FROM
(SELECT * FROM object_versions ORDER BY object_id DESC, version DESC) t GROUP BY object_id;
1000 rows in set (0.04 sec)

The only query that has any kind of decent performance here is last one using the "evil trick" of abusing the MySQL GROUP BY extensions in a way that is explicitly documented to not produce well-defined results. Which is really sad, since it is the only way I know of of getting the database to make the obvious execution plan in this case, which is to simply sort the data on the GROUP BY expression, and then loop over the rows picking the max row in each group on the way.

In fact, in many cases I think a decent alternative is to just select all rows into the client using ORDER BY, and do the aggregation there.

Now I just need someone to implement my SELECT MAX(object_id, version) ...

Tags: ,

(Leave a comment)

November 21st, 2008
11:36 pm

[Link]

Selecting rows holding group-wise maximum of a field

Today there was a question on the Freenode MySQL channel about a classical problem: Rows holding group-wise maximum of a column. This is a problem that I keep encountering every so often, so I thought I would write up something about it.

A good example of the problem is a table like the following holding versioned objects:

CREATE TABLE object_versions (
  object_id INT NOT NULL,
  version INT NOT NULL,
  data VARCHAR(1000),
  PRIMARY KEY(object_id, version)
) ENGINE=InnoDB
Now it is easy to get the latest version for an object:
SELECT data FROM object_versions WHERE object_id = ? ORDER BY version DESC LIMIT 1
The query will even be very fast as it can use the index to directly fetch the right row:
mysql> EXPLAIN SELECT data FROM object_versions
WHERE object_id = 42 ORDER BY version DESC LIMIT 1;
+----+-------------+-----------------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table           | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-----------------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | object_versions | ref  | PRIMARY       | PRIMARY | 4       | const |    3 | Using where | 
+----+-------------+-----------------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)

But what if we want to select the latest version of all (or some range of) objects? This is a problem that I think standard SQL (or any SQL dialect that I know of, including MySQL) has no satisfactory answer to.

Intuitively, the problem should be simple for the database engine to solve. Just traverse the BTree structure of the primary key (assume InnoDB clustered index storage here), and for each value of the first part of the primary key (object_id), pick the highest value of the second part (version) and return the corresponding row (this is similar to what I believe is sometimes called index skip scan). However, this idea is surprisingly difficult to express in SQL.

The first method suggested in the above link to the MySQL manual works in this case, but it is not all that great in my opinion. For example, it does not work well if the column that MAX is computed over is not unique per group (as it is in this example with versions); it will return all of the maximal rows which may or may not be what you wanted. And the query plan is not all that great either:

mysql> EXPLAIN SELECT data FROM object_versions o1 WHERE version =
(SELECT MAX(version) FROM object_versions o2 WHERE o1.object_id = o2.object_id);
+----+--------------------+-------+------+---------------+---------+---------+-----------------------+------+-------------+
| id | select_type        | table | type | possible_keys | key     | key_len | ref                   | rows | Extra       |
+----+--------------------+-------+------+---------------+---------+---------+-----------------------+------+-------------+
|  1 | PRIMARY            | o1    | ALL  | NULL          | NULL    | NULL    | NULL                  |    6 | Using where | 
|  2 | DEPENDENT SUBQUERY | o2    | ref  | PRIMARY       | PRIMARY | 4       | einstein.o1.object_id |    1 | Using index | 
+----+--------------------+-------+------+---------------+---------+---------+-----------------------+------+-------------+
2 rows in set (0.00 sec)
It is apparently doing a full table scan with an index lookup for every row in the table, which is not that bad, but certainly more expensive than necessary, especially if there are many versions per object. Still, it is probably the best method in most cases (or so I thought first, but see benchmarks below!).

The two other suggestions from the MySQL manual are not perfect either (though the first one is blazingly fast, see benchmarks below). One is to use an uncorrelated subquery with a join:

mysql> EXPLAIN SELECT o1.data FROM object_versions o1
INNER JOIN (SELECT object_id, MAX(version) AS version FROM object_versions GROUP BY object_id) o2
ON (o1.object_id = o2.object_id AND o1.version = o2.version);
+----+-------------+-----------------+--------+---------------+---------+---------+-------------------------+------+-------------+
| id | select_type | table           | type   | possible_keys | key     | key_len | ref                     | rows | Extra       |
+----+-------------+-----------------+--------+---------------+---------+---------+-------------------------+------+-------------+
|  1 | PRIMARY     | <derived2>      | ALL    | NULL          | NULL    | NULL    | NULL                    |    3 |             | 
|  1 | PRIMARY     | o1              | eq_ref | PRIMARY       | PRIMARY | 8       | o2.object_id,o2.version |    1 |             | 
|  2 | DERIVED     | object_versions | index  | NULL          | PRIMARY | 8       | NULL                    |    6 | Using index | 
+----+-------------+-----------------+--------+---------------+---------+---------+-------------------------+------+-------------+
At first, I actually did not know exactly how to interpret this plan output. After the benchmarks given below, I now think this plan is actually very good, apparently it is first using something like an index skip scan to compute the MAX() in the uncorrelated subquery, and then looking up each row using the primary key. It still has the issue with multiple rows if version was not unique per object.

The other suggestion uses an outer self-join:

mysql> EXPLAIN SELECT o1.data FROM object_versions o1
LEFT JOIN object_versions o2 ON o1.object_id = o2.object_id AND o1.version < o2.version
WHERE o2.object_id IS NULL;
+----+-------------+-------+------+---------------+---------+---------+-----------------------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref                   | rows | Extra                                |
+----+-------------+-------+------+---------------+---------+---------+-----------------------+------+--------------------------------------+
|  1 | SIMPLE      | o1    | ALL  | NULL          | NULL    | NULL    | NULL                  |    6 |                                      | 
|  1 | SIMPLE      | o2    | ref  | PRIMARY       | PRIMARY | 4       | einstein.o1.object_id |    1 | Using where; Using index; Not exists | 
+----+-------------+-------+------+---------------+---------+---------+-----------------------+------+--------------------------------------+
2 rows in set (0.00 sec)
The plan again looks reasonable, but not optimal. And somehow, all three methods feel unnatural for something that ought to be simple to express.

And in fact, there is a nice way to express this in SQL, except that it does not work (at least not in MySQL):

SELECT MAX(version, data) FROM object_versions GROUP BY object_id;
If there was just support for computing MAX() over multiple columns like this, this query would be a nice, natural, and simple way to express our problem. And it would be relatively easy for database engines to create the optimal plan, I think index skip scan is fairly standard already for single-column MAX() with GROUP BY. And the syntax feels very natural, even though it does bend the rules somehow by a single expression (MAX(version, data)) returning multiple columns. I have half a mind to try to implement it myself in MySQL or Drizzle one day ...

In fact, one can almost use this technique by an old trick-of-the-trade:

mysql> SELECT MAX(CONCAT(version, ":", data)) FROM object_versions GROUP BY object_id;
+---------------------------------+
| max(concat(version, ":", data)) |
+---------------------------------+
| 2:foo2                          | 
| 1:bar                           | 
| 3:baz2                          | 
+---------------------------------+
3 rows in set (0.00 sec)

mysql> EXPLAIN SELECT MAX(CONCAT(version, ":", data)) FROM object_versions GROUP BY object_id;
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------+
| id | select_type | table           | type  | possible_keys | key     | key_len | ref  | rows | Extra |
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------+
|  1 | SIMPLE      | object_versions | index | NULL          | PRIMARY | 8       | NULL |    6 |       | 
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------+
1 row in set (0.00 sec)
Though I consider this (and variations thereof) a hack with limited practical usage.

And speaking of hacks, there is actually another way to solve the problem, one which I learned about recently at a customer:

mysql> EXPLAIN SELECT data FROM
(SELECT * FROM object_versions ORDER BY object_id DESC, version DESC) t GROUP BY object_id;
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+---------------------------------+
| id | select_type | table           | type  | possible_keys | key     | key_len | ref  | rows | Extra                           |
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived2>      | ALL   | NULL          | NULL    | NULL    | NULL |    6 | Using temporary; Using filesort | 
|  2 | DERIVED     | object_versions | index | NULL          | PRIMARY | 8       | NULL |    6 |                                 | 
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+---------------------------------+
Get it? This cleverly/evilly (ab)uses the MySQL non-standard extension which allows SELECT of columns not in the GROUP BY clause even without using aggregate functions. MySQL will return a value for the column from an "arbitrary" row in the group. In practise, it chooses it deterministically from the first row in the group, which is why this trick seems to work well in practise. But it is clearly documented to not be supported, so not really something to recommend, though interesting to see.

Bonus benchmark

As a free bonus, I decided to run some quick benchmarks. As it turns out, the results are quite surprising!

So I filled the table above with 1,000,000 rows, 1000 objects each with 1000 versions. Total table size is about 50Mb or so. I then ran each of the five above queries:

mysql> SELECT data FROM object_versions o1
WHERE version = (SELECT MAX(version) FROM object_versions o2 WHERE o1.object_id = o2.object_id);
1000 rows in set (4 min 22.86 sec)

mysql> SELECT o1.data FROM object_versions o1
INNER JOIN (SELECT object_id, MAX(version) AS version FROM object_versions GROUP BY object_id) o2
ON (o1.object_id = o2.object_id AND o1.version = o2.version);
1000 rows in set (0.01 sec)

mysql> SELECT o1.data FROM object_versions o1
LEFT JOIN object_versions o2 ON o1.object_id = o2.object_id AND o1.version < o2.version
WHERE o2.object_id IS NULL;
1000 rows in set (2 min 42.72 sec)

mysql> SELECT MAX(CONCAT(version, ":", data)) FROM object_versions GROUP BY object_id;
1000 rows in set (0.63 sec)

mysql> SELECT data FROM
(SELECT * FROM object_versions ORDER BY object_id DESC, version DESC) t GROUP BY object_id;
1000 rows in set (15.61 sec)

The differences are huge!

The clear winner is query 2, the uncorrelated subquery. Apparently it can do an index skip scan for the inner MAX/GROUP BY query, followed with a primary key join, so it only ever has to touch 1000 rows. An almost optimal plan.

Query 1 and 3 (correlated subquery and outer join) are spectacularly bad. It looks as if they are doing something like for each 1,000,000 rows in the full table scan, do a 1000 row index range scan in the subquery/join, for a total of 1 billion rows examined. Or something. Not good.

Query 4 and 5, the trick queries, are doing so-so, probably they get away with a sort of a full table scan of 1,000,000 rows.

Conclusions: Uncorrelated subquery is the undisputed winner!

Tags: ,

(11 comments | Leave a comment)

November 8th, 2008
06:02 am

[Link]

Slides for my lightning talks at Open Source Days 2008

In case anyone is interested in a copy of my slides for the two lightning talks I gave at the Open Source Days 2008 conference, I have made them available here:

  • "Optimizing Large Databases Using InnoDB Clustered Indexes:" HTML and PDF.
  • "Profiling with OProfile and Intel Core 2 performance counters:" HTML and PDF.

I waqs quite pleased with the benchmark that I prepared for the InnoDB mini-talk, where I measure the performance difference between clustered and auto_increment primary key, both with data that fits in memory and with data that does not. I have wanted to do this benchmark for quite some time, as I have not really seen real results for this before, though the technique of using clustered primary keys for performance is well-known.

The results are quite interesting, with clustered indexes being faster for I/O bound load with more than an order of magnitude. It is also interesting to see how dead-cheap hardware can do 23k queries/second and read 1000000 rows/second with a properly tuned database.

Tags: , , , ,

(Leave a comment)

February 21st, 2008
09:54 pm

[Link]

25 years of Open Source

Today I realised that I have been programming for 25 years. Quite an anniversary! Of course, it does not really feel like that long, probably because a bit more than the first half of the time was really 'just for fun', as a kid's hobby and later during University.

I also realised that right from the beginning and through all the years I have had the Open Source mindset, even though I did not learn about Free Software until much later.

I believe there are two reasons:

  1. I love reading source code.
  2. I hate the wasted effort when code could be reused, but is not.

Right from the start I have been reading code. This was when software was distributed as BASIC listings in magazines that one would manually type into the computer. And I remember reading and modifying the dis-assembly of machine code programs, getting 'infinite lives' in games and learning the techniques they used to produce their state-of-the-art graphics.

I doubt there was ever a great novelist that did not learn from reading lots of literature. And I believe the same is true for coding, great programmers learn from reading lots of code.

And as soon as I could, I moved towards more portable languages and platforms, wanting my code to be used and reused as much as possible. I distributed my first GPL'ed program 17 years ago (XFH), after getting Internet access at the University.

It is the same mindset that made me decide to join MySQL, really.

Previously, I worked with applications on top of Oracle. Now in many ways, Oracle sucks less than most other software. But it is not free software. When we started using MySQL (version 4.1 beta I think), I was really pleased with it.

While Oracle had some advantages in terms of features and usability, MySQL had the advantage of being Open Source, and to me that made them about equal. And improving MySQL features and usability seemed rather more feasible to me than getting Oracle open-sourced. So I wanted to contribute my skills towards the first of these goals. Improving MySQL, enabling the use and reuse of the code by me and others. And getting to know the source code of the M in LAMP.

On to the next 25 years of Open Source, I guess ...

Tags: ,

(1 comment | Leave a comment)

May 31st, 2007
12:57 pm

[Link]

DGCov: A tool for checking test coverage of a patch
Today I published the DGCov tool on the MySQL Forge.

DGCov is a neat tool that I implemented last year for use internally at MySQL, an old idea of Monty's.

The idea is to help developers to check that a new patch for the server code has received adequate testing before pushing it to the main tree. The GCC compiler has the gcov tool that can check which lines of the source code were never executed even once. But suppose you change a few 1000 lines across a big source base like MySQL. Then GCov output is not all that useful, since it will report tons of lines as not executed, and it is difficult to manually check which of those lines were touched by your patch.

The DGCov tool takes the GCov output and filters it so that it only shows those lines that were touched by the patch being checked. This output is immediately applicable to the work done by the developer, and very useful to check what the test coverage is, especially when doing patches that make smaller changes across a big part of the source.

Of course, just checking code coverage is not sufficient to ensure that good testing has been done of the code! But it is very good for making sure that one's changes have at least been run once before releasing them. This way one can avoid embarrasments like publishing infinite loops or uncontitional crashes in the code.

Hopefully, others will be able to use this for MySQL or other development projects (it is not specific to MySQL at all really).

Tags: , ,

(1 comment | Leave a comment)

September 12th, 2006
09:01 pm

[Link]

Partitioned archive tables

Is there anyone using partitioned archive tables in MySQL 5.1 for storing logs (or other voluminous data)?

Storing large amounts of logs in a relational database can bring special challenges.

Logs can take up huge amounts of space on disk, and while disk space is cheap, disk I/O can be expensive, performance-wise. But many logs compress really well, and for this the MySQL archive storage engine (insert-only, no indexes, gzip'ed storage) can be used.

You often want to scan across a few hours or days worth of logs, and indexes are poor for this purpose as the large number of disk seeks can kill performance. And full table scans of years of logs is not all that much fun either. For this, table partioning (supported in MySQL 5.1) is very useful. Store each day or week of logs in a separate table partition, partitioned on day (maybe use the 5.1 event sceduler, or simply a cron job, to add new partitions automatically). Then efficient full-partition scans of just the days or weeks in question can be used.

So why not combine both approaches? MySQL 5.1 supports partitioned archive tables just fine:

  CREATE TABLE mylog (unix_time INT, msg VARCHAR(1000))
  ENGINE=archive
  PARTITION BY RANGE(unix_time)
  (PARTITION p0 VALUES LESS THAN (UNIX_TIMESTAMP('20060902000000')),
   PARTITION p1 VALUES LESS THAN (UNIX_TIMESTAMP('20060903000000')));

This has the potential to really boost performance of relational log management. Very low space usage, and still full SQL query access to the data, with good performance.

A quick test confirms this (after adding some more partitions). First load some rows:

  perl -MDBI -e '$dbh=DBI->connect("dbi:mysql:"); $dbh->do("INSERT INTO mylog values($_, \"log message on $_\")") for (1157148000 .. 1157148000 + 86400*5 - 1)'
Only about 10 bytes used per row, great (archive storage engine doing it's bit)! And queries in limited date ranges are faster:
mysql> select * from mylog where msg LIKE '%71234%';
+------------+---------------------------+
| unix_time  | msg                       |
+------------+---------------------------+
| 1157171234 | log message on 1157171234 |
| 1157271234 | log message on 1157271234 |
| 1157371234 | log message on 1157371234 |
| 1157471234 | log message on 1157471234 |
| 1157571234 | log message on 1157571234 |
+------------+---------------------------+
5 rows in set (2.15 sec)

mysql> select * from mylog where unix_time >= unix_timestamp('20060903100000') and unix_time < unix_timestamp('20060903105959') and msg LIKE '%71234%';
+------------+---------------------------+
| unix_time  | msg                       |
+------------+---------------------------+
| 1157271234 | log message on 1157271234 |
+------------+---------------------------+
1 row in set (0.66 sec)

So partition pruning is working as expected.

Tags:

(11 comments | Leave a comment)

June 11th, 2006
11:07 pm

[Link]

Data Corruption!

At the start of this week, we suffered a corruption of our main 5.1 source code repository at MySQL. No data was lost, but I spent most of four working days on cleaning up the corruption, Monty spent one day, and many other people had to spend time on this or were stalled in their work while the problem was being resolved. Including the usual stories with fetching off-site backup tapes only to find them broken, etc.

Our source code repository is the centre that all our work in Engineering revolves around, and it just has to be stable. The confidence in the revision control software that we use suffers greatly from such an experience, and the lost confidence can never really be restored.

But there is a good lesson in this for MySQL, I think.

Like revision control software, MySQL is used by our users to store their valuable data. The database is the center around which applications revolve, and it must be stable. If our customers suffer loss or corruption of data due to bugs in MySQL, the consequential loss of confidence will be impossible to restore.

There are a number of tools and procedures used to keep tight control of the quality of the server code. For example:

  • New code undergoes code review by two other developers before being accepted into the main repositories.
  • We have an open bug database. Everyone can open a bug, and everyone can see bugs that are open, or that were fixed in the past.
  • The server is available for community testing right from the early alpha versions. Users can test new versions early (and are rewarded for doing so; MySQL has a wow to fix every repeatable bug reported in the bug DB, so by testing early releases users can make sure that the server will work in their applications once it reaches GA).
  • We have a very comprehensive automatic test suite. For every bug fixed, we add a test to the test suite so that the same bug will never sneak in again un-noticed.
  • We have a tool 'autopush' that runs the entire test suite before code is pushed to the main repository, and rejects new code if the testsuite fails even a single test.
  • The test suite runs with a custom debugging memory allocator my_malloc that tests for memory leaks. Of course, a single missed my_free during the test suite is considered a testsuite failure. (Since a database server must be able to run uninterrupted for indefinite periods of time, any memory leak is a serious error).
  • We use Valgrind to catch memory leaks in third-party libraries (which do not use my_malloc) and to catch pointer errors and other memory related errors.
  • We use the GCov program to check that newly added code includes sufficient test cases to cover all aspects of new functionality (GCov is a tool for GCC that reports how many times each line of code is executed during a test program run).
  • We have a tool 'Pushbuild' that builds and tests the server source every time new code is pushed. Builds in pushbuild include multiple processors and OS'es (Pentium, Opteron, Sparc, PowerPC, ...; Linux, Windows, Solaris, HPUX, QNX, ...); building with full feature set or with just a few features enabled; debug and optimized builds; Valgrind tests; GCov tests; and others. (There has been talk about making reports from pushbuild available externally; if you think this is a good idea drop a comment, and it may happen sooner).

So overall, MySQL is in very good shape, quality wise. But it is still good to remind ourselves from time to time why we do this, and why it is important.

Tags: ,

(1 comment | Leave a comment)

May 7th, 2006
10:36 pm

[Link]

Full-index scan faster than full-table scan

At this years LinuxForum I was manning the MySQL booth together with Carsten Pedersen. We were kept quite busy with lots of people coming to tell about their use of the MySQL database for their particular project and ask about or discuss a particular issue of theirs. Which was fine, since the talks did not appeal a lot to me anyway.

One guy (I forgot who) had a small performance problem in his application. The application is a database of about 550,000 companies, storing name and various other bits of information about each company. What I would call a "small" database (since it is easily kept completely cached in ram), though not a trivial one.

This application has a facility to search for a company using any part of the company name:

    SELECT ... FROM company WHERE name LIKE '%pattern%&apos
The query was taking too long, I think something like 10 or 15 seconds. Clearly a full table scan is needed, since neither BTREE or full-text indexes can be used, but still >10 seconds seems way over the top on modern hardware.

He was using InnoDB, so the obvious suggestion was to make sure that InnoDB was configured with sufficiently large buffer pool to cache all of the database in ram (the machine had sufficient memory for that, but it is necessary to set the innodb_buffer_pool_size sufficiently large in order for MySQL to actually cache the data).

Still, the problem got me thinking: How fast can one actually do full table scans in MySQL on modern hardware? 15 seconds for a measly 550.000 records seems pretty poor given that modern CPUs will execute billions of instructions each second. Probably in this particular case insufficient caching was causing disk I/O, reducing the speed. But another issue might be that this table has lots of fields with different information, so the scan needs to not only search the 'apos' column, but also skip over all of the other fields in the table.

In this case, an index on the 'name' column can help. Normally you are told that in a query using LIKE '%...%', an index can not be used. But if all columns used in the query are available from a single index, MySQL can use an index scan instead of a table scan. Even though the whole of the index must be scanned, this can still be beneficial if the index is smaller than the table.

So lets run a quick test on some actual data. First a table definition:

    CREATE TABLE company (
            id INT AUTO_INCREMENT PRIMARY KEY,
            name VARCHAR(255),
            addr VARCHAR(255),
            phone VARCHAR(64),
            status INT
    ) ENGINE=innodb;
The table in the original application undoubtedly had many more columns, but this will do for a quick test.

Now some Perl code to fill the table with some realistic data:

    for (1..550000) {
      $dbh->commit() if($_ % 500 == 0);
      my $words = int(rand(3)) + 1;
      my @word = ();
      for (1..$words) {
        my $x = '';
        my $length = int(rand(13)) + 2;
        for (1..$length) {
          $x .= chr(int(rand(26)) + ord('a'));
        }
        push @word, $x;
      }
      my $name = join(' ', @word);
      $dbh->do(<<SQL, undef, $name, "foobar", "+1 234 56", int(rand(1000)));
    INSERT INTO company(name, addr, phone, status)
    VALUES (?, ?, ?, ?)
    SQL
    }
    $dbh->commit();
And lets set a reasonable InnoDB buffer pool size in my.cnf:
    innodb-buffer-pool-size = 262144000

Now let us try it:

    mysql> CREATE UNIQUE INDEX cover1 on company(name, id);
    mysql> EXPLAIN SELECT id FROM company WHERE name LIKE '%xgef%';
    +----+-------------+---------+-------+---------------+--------+---------+------+--------+--------------------------+
    | id | select_type | table   | type  | possible_keys | key    | key_len | ref  | rows   | Extra                    |
    +----+-------------+---------+-------+---------------+--------+---------+------+--------+--------------------------+
    |  1 | SIMPLE      | company | index | NULL          | cover1 | 262     | NULL | 548220 | Using where; Using index |
    +----+-------------+---------+-------+---------------+--------+---------+------+--------+--------------------------+
    1 row in set (0.00 sec)

    mysql> SELECT id FROM company WHERE name LIKE '%xgef%';
    +--------+
    | id     |
    +--------+
    | 310501 |
    | 408997 |
    | 274935 |
    | 418352 |
    | 335782 |
    |  98100 |
    | 239031 |
    | 454742 |
    +--------+
    8 rows in set (0.52 sec)
So half a second to search all 550,000 names. Much better than 15 seconds, and completely adequate for the application in question, where such a search is performed perhaps once every minute.

Of course half a second for a search would not be acceptable for all applications. To speed it up more, one idea might be to store all suffixes of company names in a separate table, along with the company id. This table would be quite big, but still of reasonable size: If names are on average 25 characters, the number of rows would be about 14 million. Then with an index on the column of name suffixes, a query like this:

    SELECT id FROM suffix_table WHERE name_suffix LIKE 'pattern%'
could use the index to immediately look up the required data with millisecond response time. But that kind of complexity was not required for the problem at hand.

Tags:

(Leave a comment)

April 18th, 2006
01:41 pm

[Link]

Visual Studio limitations

MySQL 5.1 is moving to a new system for building the source code on Windows, based on CMake (more on this another time).

While finishing up the last bits of this, I ran into a very annoying limitation in Visual Studio (version 7.1 I think). The library include path in Visual Studio project files (.vcproj) has a very tight limit on the maximum length, around 2000 characters it seems. Put a path longer than that in the AdditionalLibraryDirectories section, and you will not get an error but the build will fail to find required libraries. This problem is made more serious by the fact that CMake tends to use absolute paths, leading to longer paths in the project files.

Googling a bit for this problem turned up nothing, so in the end I had to resort to some extremely ugly Perl one-liner build cleanup:

perl -i.bak -pe "s/[a-z]:[^,; ]+\\\\$BUILD_DIR/../gi if /AdditionalLibraryDirectories/" sql/mysqld.vcproj
Not very nice :-(

I think there is a lesson here: Do not put arbitrary limits into development tools. People will try to use the tools in unexpected ways, and fail.

Tags: ,

(1 comment | Leave a comment)

Powered by LiveJournal.com

Advertisement