At the start of this week, we suffered a corruption of our main 5.1 source
code repository at MySQL. No data was lost, but I spent most of four working
days cleaning up the corruption, Monty spent one day, and many other people
had to spend time on this or were stalled in their work while the problem was
being resolved. This included the usual stories of fetching off-site backup
tapes only to find them broken, and so on.
Our source code repository is the centre that all our work in Engineering
revolves around, and it just has to be stable. The confidence in the
revision control software that we use suffers greatly from such an experience,
and the lost confidence can never really be restored.
But there is a good lesson in this for MySQL, I think.
Like revision control software, MySQL is used by our users to store their
valuable data. The database is the center around which applications revolve,
and it must be stable. If our customers suffer loss or corruption of
data due to bugs in MySQL, the confidence lost as a consequence will be
impossible to restore.
There are a number of tools and procedures used to keep tight control of
the quality of the server code. For example:
- New code undergoes code review by two other developers before being
accepted into the main repositories.
- We have an open bug database. Everyone can open a bug, and everyone can
see bugs that are open, or that were fixed in the past.
- The server is available for community testing right from the early alpha
versions. Users can test new versions early (and are rewarded for doing so;
MySQL has a vow to fix every repeatable bug reported in the bug DB, so by
testing early releases users can make sure that the server will work in their
applications once it reaches GA).
- We have a very comprehensive automatic test suite. For every bug fixed,
we add a test to the test suite so that the same bug will never sneak in again.
- We have a tool 'autopush' that runs the entire test suite
before code is pushed to the main repository, and rejects new code if the
test suite fails even a single test.
- The test suite runs with a custom debugging memory allocator, my_malloc,
that tests for memory leaks. Of course, a single missed my_free during the
test suite is considered a test suite failure. (Since a database server must
be able to run uninterrupted for indefinite periods of time, any memory leak
is a serious error.)
- We use Valgrind to catch memory leaks in third-party libraries (which do
not use my_malloc) and to catch pointer errors and other memory-related
errors.
- We use the GCov program to check that newly added code includes sufficient
test cases to cover all aspects of new functionality (GCov is a tool for GCC
that reports how many times each line of code is executed during a test
program run).
- We have a tool 'Pushbuild' that builds and tests the server
source every time new code is pushed. Builds in pushbuild include multiple
processors and OSes (Pentium, Opteron, SPARC, PowerPC, ...; Linux, Windows,
Solaris, HP-UX, QNX, ...); building with full feature set or with just a few
features enabled; debug and optimized builds; Valgrind tests; GCov tests; and
others. (There has been talk about making reports from pushbuild available
externally; if you think this is a good idea, drop a comment, and it may happen.)
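To make the autopush rule from the list above concrete (run the whole test suite, reject the push on a single failure), here is a minimal Python sketch. It is an illustration only, not the actual autopush tool: the function names `run_suite` and `may_push`, and the idea of representing each test as a command line whose exit status signals pass or fail, are assumptions made for the example.

```python
import subprocess
import sys


def run_suite(test_commands):
    """Run every test command and return the list of commands that failed.

    Each test is assumed (for this sketch) to be a command line that exits
    with status 0 on success and non-zero on failure.
    """
    failures = []
    for cmd in test_commands:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures.append(cmd)
    return failures


def may_push(test_commands):
    """Accept the push only if not a single test failed."""
    return not run_suite(test_commands)


if __name__ == "__main__":
    # Two stand-in "tests": one that passes and one that fails.
    passing = [sys.executable, "-c", "pass"]
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print("push accepted" if may_push([passing, failing]) else "push rejected")
```

The point of the design is that the gate is all-or-nothing: there is no threshold of acceptable failures, so a broken test has to be fixed before the code can reach the main repository.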
So overall, MySQL is in very good shape, quality-wise. But it is still good
to remind ourselves from time to time why we do this, and why it is important.
Tags: mysql, scm