Monday, September 19, 2016

Running FreeBSD in Travis-CI

Note for geospatial focused readers: this article has little to do with geo, although it is applied to GDAL, but more with software virtualization, hacks, software archeology and the value of free software. Note for virtualization experts: I'm not one, so please bear with my approximate language and inaccuracies.

Travis-CI is a popular continuous integration platform that can easily be used with software projects hosted at GitHub. Travis-CI has a free offer for software with a public repository at GitHub. Travis-CI provides cloud instances running Linux or Mac OS X. To increase portability testing of GDAL, I wondered if it was somehow possible to run another operating system with Travis-CI, for example FreeBSD. A search led me to this question in their bug tracker, but the outcome seems to be that it is not possible, nor in their medium or long term plans.


One idea that came quickly to mind was to use the QEMU machine emulator, which can simulate full machines (CPU, peripherals, memory, etc.) of several hardware architectures (Intel x86, ARM, MIPS, SPARC, PowerPC, etc.). To run QEMU, you mostly need a virtual hard drive, i.e. a file that replicates the content of the hard disk of the virtual machine you want to run. I found here a small ready-to-use x86_64 image of FreeBSD 9.2, with one nice property: the ssh server and DHCP are automatically started, making it possible to connect to it remotely.

So starting with a Travis-CI Ubuntu Trusty (14.04) image, here are the steps to launch our FreeBSD guest:

sudo apt-get install qemu
wget ftp://ftp.stacklet.com/archive/x86-64/FreeBSD/9.2/\
    freebsd.9-2.x86-64.20140103.raw.img.txz
tar xJvf freebsd.9-2.x86-64.20140103.raw.img.txz
qemu-system-x86_64 -daemonize -display none \
   freebsd.9-2.x86-64.20140103.raw.img \
   -m 1536 -smp 4 -net user,hostfwd=tcp::10022-:22 -net nic

The qemu invocation starts the virtual machine as a daemon without display, turns on networking and asks for the guest (i.e. FreeBSD) TCP port 22 (the ssh port) to be accessible from the host (Linux Trusty) as port 10022.

To ssh into the VM, there's one slight inconvenience: ssh login requires a password. The root password for this VM is "password". But ssh is secured and doesn't accept the password to be provided through files or piped in with "echo". I found that the sshpass utility was designed to overcome this in situations where security isn't really what matters. However, the version of sshpass bundled with Ubuntu Trusty didn't work with the corresponding ssh version (not surprisingly, since the authors of sshpass mention that it is full of assumptions about how ssh works, which can easily break with changes in ssh). I found that the latest version, 1.06, worked however.

With 4 extra lines, we can now log into our FreeBSD instance:


wget http://fossies.org/linux/privat/sshpass-1.06.tar.gz
tar xzf sshpass-1.06.tar.gz
cd sshpass-1.06 && ./configure && make -j3 && cd ..
export MYSSH="sshpass-1.06/sshpass -p password ssh \
   -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no \
    root@localhost -p10022" 

So now we can configure our FreeBSD VM a bit and install, with the 'pkg' package manager, a few dependencies to build GDAL:

$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg bootstrap'
$MYSSH 'mkdir /etc/pkg'
sshpass-1.06/sshpass -p password scp \
   -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no \
   -P 10022 FreeBSD.conf root@localhost:/etc/pkg/FreeBSD.conf
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg update'
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install gmake'
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install python27'
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install py27-numpy'
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install sqlite3 curl'
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install expat'
    
Here we go: ./configure && make ! That works, but 50 minutes later (the maximum length of a Travis-CI job), our job is killed with perhaps only 10% of the GDAL code base compiled. The reason is that we used the pure software emulation mode of QEMU, which involves on-the-fly disassembling and re-assembling of the code to be run. QEMU can for example emulate an ARM guest on an Intel host, and vice-versa, and there are no real shortcuts even when the guest and host architectures are the same. So your guest can typically run 10 times slower than it would on a real machine with its native architecture. Actually, that's not entirely true: with the addition of CPU instructions dedicated to virtualization (VT-x for Intel, AMD-V for AMD), a hypervisor called KVM (Kernel Virtual Machine) was added to the Linux kernel, and QEMU can use KVM to implement the above mentioned shortcuts and reach near bare-metal performance. You just have to use 'kvm' instead of 'qemu-system-x86_64'. Let's do that ! Sigh, our attempt fails miserably with a "failed to initialize KVM" error message. If we display the content of /proc/cpuinfo, we get:

flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc
rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1
sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
fsgsbase bmi1 avx2 smep bmi2 erms xsaveopt

A lot of nice-to-have things, but the important thing to notice is the absence of the 'vmx' (Intel virtualization instruction set) and 'svm' (similar for AMD) flags. So this machine has no hardware virtualization capabilities ! Or more precisely, this *virtual* machine has no such capabilities. The documentation of the Trusty Travis-CI environment mentions that it is based on Google Compute Engine as the hypervisor, and apparently it does not allow (or is not configured to allow) nested virtualization, despite GCE being based on KVM, and KVM potentially allowing nested virtualization. GCE allows Docker to run inside VMs, but Docker only runs Linux "guests". So it seems we are really stuck.

Here comes the time for good old memories and a bit of software archeology. QEMU was started by Fabrice Bellard. If you didn't know his name yet, F. Bellard created FFMPEG and QEMU, holds a world record for the number of decimals of Pi computed on a COTS PC, has ported QEMU to JavaScript to run the Linux kernel in your browser, devised BPG, a new image compression format based on HEVC, etc.

At the time when his interest was focused on QEMU, he created KQemu, a kernel module (for Linux, Windows and FreeBSD hosts) that could significantly enhance QEMU performance when the guest and host are x86/x86_64. KQemu requires QEMU to be modified to communicate with the kernel module (similarly to how QEMU works with the KVM kernel module). KQemu started as a closed source project and was eventually released as GPL v2. One of the key features of KQemu is that it does not require (nor use) hardware virtualization instructions. KQemu software virtualization involves complicated tricks, particularly for guest code that runs in "Ring 0", i.e. with the highest privileges, which you must patch to run as Ring 3 (non-privileged) code on the host. You can get an idea of what is involved by reading the documentation of VirtualBox regarding software virtualization. I will not pretend that QEMU+KQemu did the exact same tricks as VirtualBox, but that should give you at least a picture of the challenges involved. This complexity is what led to KQemu eventually being abandoned when CPUs with hardware virtualization became widespread on the market, since KVM based virtualization is much cleaner to implement. Starting with QEMU 0.12.0, KQemu support was finally dropped from the QEMU code base.

Since KQemu does not use hardware virtualization instructions, there is good hope that it can run inside a virtualized environment. So let's have a try with QEMU 0.11.1 and KQemu 1.4.0pre. Compiling QEMU 0.11.1 on Ubuntu Trusty runs quite well, except for a linking error easily fixed with this trivial patch. Building KQemu is a bit more involved, it being a kernel module and the (internal) Linux kernel API being prone to changes from time to time. One good news is that the Linux specific part of kqemu is a relatively small file and the API breaks were limited to 2 aspects. The way to get the memory management structure of the current task had changed in Linux 2.6.23 and I found this simple patch to solve it. Another change that occurred in a later Linux release is the removal of kernel semaphores, replaced by mutexes. My cumulated patch to fix all compilation issues is here. I don't pretend that it is technically correct as my knowledge of kernel internals is more than limited, but a local test seemed to confirm that adding -enable-kqemu to the qemu command line worked sufficiently well to start and do things in the FreeBSD VM, and at a very decent speed. I tried the -kernel-kqemu switch that turns on KQemu acceleration for guest kernel code, but that resulted in a crash of qemu near the end of the boot process of FreeBSD. Which is not surprising, as -kernel-kqemu makes some assumptions about the internal workings of the guest OS, which perhaps FreeBSD does not meet. Or perhaps this is just a bug of qemu/kqemu.

Running it on Travis-CI was successful too, with the compilation being done in 20 minutes, so probably half the speed of bare metal, which is good enough. kqemu does not support SMP guests (but this was listed in the potential "roadmap", so probably achievable), but if we wanted to speed up compilation, we could potentially launch 2 kqemu-enabled qemu instances (the Travis-CI VMs have 2 cores available) that would compile different parts of the software, with the build tree hosted on an NFS share. I said that compilation goes fine, except that the build process (actually the qemu instance) crashes at libgdal.so building time (I can also reproduce that locally). This is probably because the history of qemu & kqemu wasn't long enough to go from beta quality to production quality. I've worked around this issue by only doing the compilation in -enable-kqemu mode, restarting the VM in pure software emulation to do the linking, and then restarting in -enable-kqemu mode. Unfortunately running the GDAL Python autotest suite in kqemu mode also leads to a qemu crash (due to the design of kqemu only running code in ring 3, crashes do not affect the host), and running it completely in pure emulation mode reaches the 50 minute time-out, so for the sake of this demonstration, I only run one of the test files. And now we have our first successful build given this build recipe.

I could also have tried VirtualBox because, as mentioned above, it supports software virtualization with acceleration. But that is only for 32 bit guests (and I didn't find a ready-made FreeBSD 32 bit image that you can directly ssh into). For 64 bit guests, VirtualBox requires hardware virtualization to be available on the host. To the best of my knowledge, KQemu is (was) the only solution to enable acceleration of 64 bit guests without hardware requirements.

My main conclusion of this experiment is that it is a striking example of a key advantage of the open source model. If kqemu had not been released as GPL v2, I would never have been able to resurrect it and modify it to run on newer kernels (actually there was also QVM86, an attempt at developing an alternative to KQemu while KQemu was still closed source, which was abandoned when VirtualBox was open sourced).

Tuesday, July 19, 2016

Speeding up computation of raster statistics using SSE-2/AVX-2

GDAL offers a method ComputeStatistics() that given a raster band returns the minimum and maximum values of pixels, the mean value and the standard deviation.

For those not remembering how to compute mean and standard deviations, the basic formulas for values indexed from 0 to N-1 are :
mean = sum(value(i) for i = 0 to N-1) / N
std_dev = square root of the mean of the square of the differences of values to the mean
std_dev = sqrt(sum(i = 0 to N-1, (value(i) - mean)^2) / N)
A very naive version would first compute the mean, and in a second pass compute the standard deviation.

But it can be easily proven (by expanding the (value(i) - mean)^2 term) that it is also equivalent to:
std_dev = sqrt(sum(i = 0 to N-1, value(i)^2)/N - mean^2)
std_dev = sqrt(mean_of_square_values - square_of_mean)

std_dev = sqrt(sum(i = 0 to N-1, value(i)^2)/N - (sum_of_values/N)^2)
std_dev = sqrt(N^2 *(sum(i = 0 to N-1, value(i)^2)/N - (sum_of_values/N)^2)) / N
std_dev = sqrt(N * sum_of_square_values - sum_of_values^2) / N
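
For completeness, here is the expansion hinted at above (standard algebra, added here for clarity and written in LaTeX form, with \bar v denoting the mean):

    \sum_{i=0}^{N-1} (v_i - \bar v)^2
        = \sum_{i=0}^{N-1} v_i^2 - 2\bar v \sum_{i=0}^{N-1} v_i + N \bar v^2
        = \sum_{i=0}^{N-1} v_i^2 - 2 N \bar v^2 + N \bar v^2
        = \sum_{i=0}^{N-1} v_i^2 - N \bar v^2

Dividing by N gives mean_of_square_values - square_of_mean, i.e. the expression inside the square root above.
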
A less naive implementation would compute the sum of values and the sum of square values in a single pass. However, the standard deviation computed like that might be subject to numeric instability: even if the result is small, sum_of_square_values and sum_of_values can be very big for a large number of pixels, and thus, if represented with floating point numbers, the difference between both terms can lose most of its precision.

Welford algorithm

So in recent GDAL versions, the computation of the mean and standard deviation is done in a progressive and numerically stable way, thanks to the Welford algorithm.

The generic code is:
pixel_counter = 0
mean = 0
M2 = 0
foreach(value in all pixels):
    if value < minimum or pixel_counter == 0: minimum = value
    if value > maximum or pixel_counter == 0: maximum = value
    pixel_counter = pixel_counter + 1
    delta = value - mean
    mean = mean + delta / pixel_counter
    M2 = M2 + delta * (value - mean);

std_dev = sqrt( M2 / pixel_counter )
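
As an illustration, here is a minimal self-contained C++ transcription of that pseudo-code (a sketch only; the actual GDAL implementation differs):

#include <cmath>
#include <cstddef>

struct Stats { double min, max, mean, std_dev; };

// Welford's algorithm: single pass, numerically stable.
static Stats ComputeStats(const double* values, size_t count)
{
    Stats s = { 0.0, 0.0, 0.0, 0.0 };
    double M2 = 0.0;
    for (size_t i = 0; i < count; ++i)
    {
        const double value = values[i];
        if (i == 0 || value < s.min) s.min = value;
        if (i == 0 || value > s.max) s.max = value;
        const double delta = value - s.mean;
        s.mean += delta / static_cast<double>(i + 1);
        M2 += delta * (value - s.mean);   // uses the freshly updated mean
    }
    if (count > 0) s.std_dev = std::sqrt(M2 / static_cast<double>(count));
    return s;
}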

Proof of Welford algorithm

(You can skip this paragraph and still follow the rest of this article)

The magic of Welford algorithm lies in the following recurrence relations.

For the mean, it is rather obvious :

N*mean(N) = sum(i = 0 to N-1, value(i))
N*mean(N) = sum(i = 0 to N-2, value(i)) + value(N-1)
N*mean(N) = (N-1) * mean(N-1) + value(N-1)
mean(N) = (N-1)/N * mean(N-1) + value(N-1)/N
mean(N) = mean(N-1) + (value(N-1) - mean(N-1)) / N

Hence mean = mean + delta / pixel_counter

For the standard deviation, the proof is a little bit more lengthy :

N*stddev(N)^2 = sum(i=0 to N-1, (value(i) - mean(N))^2 )

N*stddev(N)^2 = sum(i=0 to N-1, (value(i) - (mean(N-1) + (value(N-1) - mean(N-1)) / N))^2 )

N*stddev(N)^2 = sum(i=0 to N-1, ((value(i) - mean(N-1)) - ((value(N-1) - mean(N-1)) / N))^2 )

N*stddev(N)^2 = sum(i=0 to N-1, (value(i) - mean(N-1))^2 + ((value(N-1) - mean(N-1)) / N)^2
             - 2 * (value(i) - mean(N-1))*((value(N-1) - mean(N-1)) / N)  )

N*stddev(N)^2 = sum(i=0 to N-1, (value(i) - mean(N-1))^2) + N * ((value(N-1) - mean(N-1)) / N)^2
              - 2 * sum(i=0 to N-1, (value(i) - mean(N-1))) * ((value(N-1) - mean(N-1)) / N)

N*stddev(N)^2 = (N-1)*stddev(N-1)^2 + (value(N-1) - mean(N-1))^2
                    + N * ((value(N-1) - mean(N-1)) / N)^2
              - 2 * sum(i=0 to N-1, (value(i) - mean(N-1))) * ((value(N-1) - mean(N-1)) / N)

N*stddev(N)^2 = (N-1)*stddev(N-1)^2 + (value(N-1) - mean(N-1))^2 * (1 + 1 / N)
              - 2 * N * (mean(N) - mean(N-1)) * ((value(N-1) - mean(N-1)) / N)

N*stddev(N)^2 = (N-1)*stddev(N-1)^2 + (value(N-1) - mean(N-1)) *
            ((1 + 1 / N) * (value(N-1) - mean(N-1)) - 2 * N * (mean(N) - mean(N-1)) / N)

N*stddev(N)^2 = (N-1)*stddev(N-1)^2 + (value(N-1) - mean(N-1)) *
            ((value(N-1) - mean(N-1)) + (value(N-1) - mean(N-1)) / N - 2 * N * (mean(N) - mean(N-1)) / N)

N*stddev(N)^2 = (N-1)*stddev(N-1)^2 + (value(N-1) - mean(N-1)) *
            ((value(N-1) - mean(N-1)) - (mean(N) - mean(N-1)))

N*stddev(N)^2 = (N-1)*stddev(N-1)^2 + (value(N-1) - mean(N-1)) * (value(N-1) - mean(N))


Hence M2 = M2 + delta * (value - mean)

Integer based computation of standard deviation

The Welford algorithm is good, but it involves floating point operations for each pixel to compute the progressive mean and variance, whereas fundamentally we would need those floating point operations only at the end if using the original formulas, and we could use integer arithmetic for the rest. Another drawback of the Welford approach is that it prevents any direct parallelization (there might still be ways to reconcile partial computations, but I have not explored those), whereas if you have a set of pixels, you can conceptually divide it in as many subsets as you want, and for each subset compute its local minimum, maximum, sum of values and sum of square values. Merging subsets is then trivial: take the minimum of minimums, the maximum of maximums, the sum of sums of values and the sum of sums of square values, as the sketch below illustrates.
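
For illustration, merging two partial results could look like this (a hedged sketch with made-up names, not GDAL API):

#include <algorithm>
#include <cstdint>

struct PartialStats
{
    uint8_t  min, max;
    uint64_t sum, sum_square, count;
};

// Combine the statistics of two disjoint subsets of pixels.
static PartialStats Merge(const PartialStats& a, const PartialStats& b)
{
    PartialStats r;
    r.min        = std::min(a.min, b.min);
    r.max        = std::max(a.max, b.max);
    r.sum        = a.sum + b.sum;
    r.sum_square = a.sum_square + b.sum_square;
    r.count      = a.count + b.count;
    return r;
}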

Let us consider the case of pixels whose type is unsigned byte, i.e. with values in the range [0,255]. We want to compute
std_dev = sqrt(N * sum_of_square_values - sum_of_values^2) / N
For practical reasons, we want N, sum_of_square_values and sum_of_values to fit on a 64 bit unsigned integer (uint64), which is the largest natural integral type that can be easily and efficiently used on today's CPUs. The most limiting factor will be sum_of_square_values. Given that in the worst case a square value is equal to 255*255, the maximum number of pixels N we can address is (2^64-1) / (255*255) = 283 686 952 306 183, which is large enough to represent a raster of 16 million pixels x 16 million pixels. Good enough.

We now need to be able to multiply two uint64 values and get the result as a uint128, and to compute the difference of two uint128 values. The multiplication on Intel/AMD CPUs in 64 bit mode natively yields a 128 bit wide result. It is just that there is no standardized way in C/C++ to get that result. For the GCC compiler in 64 bit mode, the __uint128_t type can be used in a transparent way to do that:
__uint128_t result = (__uint128_t)operand_64bit * other_operand_64bit
For Visual Studio compilers in 64 bit mode, a special intrinsic _umul128() is available.

What about non-Intel or non-64bit CPUs ? In that case, we have to do the multiplication by hand, by decomposing each uint64 value into its lower and upper uint32 parts, doing 4 uint32*uint32->uint64 multiplications, summing the intermediary results, handling the carries and building the resulting number. Not very efficient, but we do not really care about that, since it is just a final operation.
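
A hedged sketch of that decomposition (illustrative only; the actual GDALUInt128 class mentioned below is organized differently):

#include <cstdint>

struct UInt128 { uint64_t high, low; };

// Portable 64x64 -> 128 bit multiplication using four 32x32 -> 64 bit products.
static UInt128 Mul64(uint64_t a, uint64_t b)
{
    const uint64_t a_lo = a & 0xFFFFFFFFU, a_hi = a >> 32;
    const uint64_t b_lo = b & 0xFFFFFFFFU, b_hi = b >> 32;

    const uint64_t lo_lo = a_lo * b_lo;
    const uint64_t lo_hi = a_lo * b_hi;
    const uint64_t hi_lo = a_hi * b_lo;
    const uint64_t hi_hi = a_hi * b_hi;

    // Sum the cross terms, keeping track of the carry into the high part.
    const uint64_t mid = (lo_lo >> 32) + (lo_hi & 0xFFFFFFFFU) + (hi_lo & 0xFFFFFFFFU);

    UInt128 r;
    r.low  = (mid << 32) | (lo_lo & 0xFFFFFFFFU);
    r.high = hi_hi + (lo_hi >> 32) + (hi_lo >> 32) + (mid >> 32);
    return r;
}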

To make this easier, that partial 128 bit arithmetic is abstracted in a GDALUInt128 C++ class that has different implementations depending on CPU and compiler support.

Now that we have solved the final part of the computation, we can write
the computation loop as follows:

    minimum = maximum = value[0]
    foreach value:
        if value < minimum: minimum = value
        else if value > maximum: maximum = value
        sum = sum + value
        sum_square = sum_square + value * value


Can we do better ? A bit of loop unrolling can help :

    minimum = maximum = value[0]
    foreach value pair (value1, value2):
        if value1 < minimum: minimum = value1
        else if value1 > maximum: maximum = value1
        sum = sum + value1
        sum_square = sum_square + value1 * value1
        if value2 < minimum: minimum = value2
        else if value2 > maximum: maximum = value2
        sum = sum + value2
        sum_square = sum_square + value2 * value2
    (deal with potential remaining pixel if odd number of pixels)


If we start by comparing value1 and value2, we can actually save a comparison (resulting in 3 comparisons for each pair of pixels, instead of 4) :

    minimum = maximum = value[0]
    foreach value pair (value1, value2):
        if value1 < value2:
            if value1 < minimum: minimum = value1
            if value2 > maximum: maximum = value2
        else:
            if value2 < minimum: minimum = value2
            if value1 > maximum: maximum = value1
        sum = sum + value1
        sum_square = sum_square + value1 * value1
        sum = sum + value2
        sum_square = sum_square + value2 * value2
    (deal with potential remaining pixel if odd number of pixels)


This improvement can already dramatically reduce the computation time from
1m10 to 7s, to compute 50 times the statistics on a 10000 x 10000 pixel raster.

Parallelization with SSE2

We have not yet explored the parallelization of the algorithm. One way to do it would be to use multi-threading, but for Intel-compatible CPUs, we can also explore the capabilities of the SIMD (Single Instruction/Multiple Data) instruction sets. On 64 bit Intel/AMD, the SSE2 instruction set, which offers vectorized operations on integers, is guaranteed to always be present. 16 registers (XMM0 to XMM15) are available, each 128 bit wide.

So each register is wide enough to hold 16 packed int8/uint8, 8 packed int16/uint16, 4 packed int32/uint32 or 2 packed int64/uint64, depending on the wished representation. A diverse set of operations are offered and generally operate on the sub-parts of each register independently. For example c=_mm_add_epi8(a,b) will add independently c[i]=a[i]+b[i] for i=0 to 15, and that in just one CPU cycle. _mm_add_epi16() will work on packed uint16, etc. To add some salt, not all operators are available for all elementary subtypes however.

Compilers are supposed to be able to automatically vectorize some C code, but in practice they rarely manage to do so for real world code, hence requiring the programmer to use the SIMD instruction set by hand. All major compilers (gcc, clang, Visual Studio C/C++) offer access to the SSE2 instruction set through "intrinsics", which are C inline functions that wrap the corresponding assembly instructions while still being C/C++. This allows the compiler to do the register allocation and various other optimizations (such as re-ordering), which is a huge win over coding directly in assembly. The Intel intrinsics guide is a useful resource to find the appropriate intrinsics.
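
For readers who have never used intrinsics, a minimal (illustrative, unrelated to GDAL) example looks like this:

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Add two arrays of 16 bytes each, 16 lanes at a time.
void add16(const uint8_t* a, const uint8_t* b, uint8_t* out)
{
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i vc = _mm_add_epi8(va, vb);   // 16 independent byte additions
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), vc);
}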

So a tentative vectorized version of our algorithm would be :

    v_minimum = vector_of_16_bytes[0]
    v_maximum = vector_of_16_bytes[0]
    v_sum = vector_of_16_zeros
    v_sum_square = vector_of_16_zeros

    foreach vector_of_16_bytes v:
        v_minimum = vector_minimum(v_minimum, v)
        v_maximum = vector_maximum(v_maximum, v)
        v_sum = vector_add(v_sum, v)
        v_sum_square = vector_sum(v_sum_square, vector_mul(v, v))

    minimum = minimum_of_16_values(v_minimum)
    maximum = maximum_of_16_values(v_maximum)
    sum = sum_of_X??_values(v_sum)
    sum_square = sum_of_X??_values(v_sum_square)
    (deal with potential remaining pixels if number of pixels is not multiple of 16)


vector_minimum and vector_maximum do exist as _mm_min_epu8 and _mm_max_epu8. But for vector_add, which variant to use: _mm_add_epi8, _mm_add_epi16, _mm_add_epi32 or _mm_add_epi64 ? Well, none directly. We want to add uint8 values, but the result cannot fit on a uint8 (255+255=510). The same holds for sum_square. The result of each square multiplication requires at least a uint16, and we want to loop several times, so we need at least a uint32 to hold the accumulation. We designed the overall algorithm to be able to handle a uint64 accumulator, but using that in the tighter loop would decrease the performance of the vectorization. So we will decompose our loop into one outer loop and one inner loop. The inner loop will do as many iterations as possible, while still not overflowing a uint32 accumulator. So (2^32-1)/(255*255) = 66051.xxxx iterations. Which we round down to the closest multiple of 16. (A complete sketch that puts these pieces together is given after the discussion of _mm_madd_epi16 below.)

So what about v_sum = vector_add(v_sum, v) ?
The first idea would be to extract the 4 lowest order bytes of v, unpack them so that they fit each on a uint32 and then use _mm_add_epi32 to add them in the v_sum accumulator.

    v_sum = _mm_add_epi32(v_sum, _mm_unpacklo_epi16(_mm_unpacklo_epi8(v, zero), zero))
_mm_unpacklo_epi8(v, zero) expands the 8 lowest order bytes of v as 8 uint16. And similarly _mm_unpacklo_epi16(v, zero)  expands the 4 lowest order uint16 of v as 4 uint32.

And then repeat that with the 3 other groups of 4 bytes :

    v_sum = _mm_add_epi32(v_sum, _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_shuffle_epi32(v, 1), zero), zero))
    v_sum = _mm_add_epi32(v_sum, _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_shuffle_epi32(v, 2), zero), zero))
    v_sum = _mm_add_epi32(v_sum, _mm_unpacklo_epi16(_mm_unpacklo_epi8(_mm_shuffle_epi32(v, 3), zero), zero))


But we can do better thanks to the _mm_sad_epu8 intrinsic. It is designed to "compute the absolute differences of packed unsigned 8-bit integers in a and b, then horizontally sum each consecutive 8 differences to produce two unsigned 16-bit integers, and pack these unsigned 16-bit integers in the low 16 bits of 64-bit elements in dst." If we notice that ABS(x-0) = x when x >= 0, then it does just what we want.

    v_sum = _mm_add_epi64(v_sum, _mm_sad_epu8(v, zero))

Pedantic note: we can actually use _mm_add_epi32, since there is no risk of overflow : 8 * 66051 * 255 fits on a uint32. The advantage of using _mm_add_epi32 is that, as we will use it elsewhere, the compiler can re-order additions to group them in pairs and benefit from their 0.5 cycle throughput.

_mm_sad_epu8() has a relatively high latency (5 cycles), but it is still a big win since it replaces 14 intrinsics of our initial version.

What about the computation of the square value ? There is no intrinsic to directly multiply packed bytes and get the resulting packed uint16 (or even better uint32, since that is the type we want to operate on eventually to be able to do several iterations of our loop!). One approach would be to take the 8 lowest order bytes, un-pack them to uint16, and use the _mm_mullo_epi16() intrinsic that does uint16 x uint16->uint16. Then you would take the 4 lowest order uint16 of this intermediate result, un-pack them to uint32 and finally use _mm_add_epi32 to accumulate them in v_sum_square.

    v_low = _mm_unpacklo_epi8(v, zero)
    v_low_square = _mm_mullo_epi16(v_low, v_low)
    v_sum_square = _mm_add_epi32(v_sum_square, _mm_unpacklo_epi16(v_low_square, zero))


Then repeat the operation with the 4 upper order uint16 of the intermediate result.

    v_sum_square = _mm_add_epi32(v_sum_square,
        _mm_unpacklo_epi16(_mm_shuffle_epi32(v_low_square, 2 | (3 <<2)), zero) )


_mm_shuffle_epi32(v, 2 | (3 <<2)) is a trick to replicate the high 64 bits of a XMM register into its low 64 bits. We don't care about the values of the resulting high 64 bits since they will be lost with the later unpack operations.

And then repeat the whole process with the 8 highest order bytes.

    v_high = _mm_unpacklo_epi8(_mm_shuffle_epi32(v, 2 | (3 <<2)), zero)
    v_high_square = _mm_mullo_epi16(v_high, v_high)
    v_sum_square = _mm_add_epi32(v_sum_square, _mm_unpacklo_epi16(v_high_square, zero))
    v_sum_square = _mm_add_epi32(v_sum_square,
        _mm_unpacklo_epi16(_mm_shuffle_epi32(v_high_square, 2 | (3 <<2)), zero) )


We can actually do much better with the _mm_madd_epi16() intrinsic, which will "Multiply packed signed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Horizontally add adjacent pairs of intermediate 32-bit integers, and pack the results". This is really close to what we need. We just need to prepare uint16/int16 integers (the sign convention here does not matter since a uint8 zero-extended to 16 bit is both a uint16 and an int16).

    v_low_16bit = _mm_unpacklo_epi8(v, zero)
    v_sum_square = _mm_add_epi32(v_sum_square, _mm_madd_epi16(v_low_16bit, v_low_16bit))
    v_high_16bit = _mm_unpacklo_epi8(_mm_shuffle_epi32(v, 2 | (3 <<2)), zero)
    v_sum_square = _mm_add_epi32(v_sum_square, _mm_madd_epi16(v_high_16bit, v_high_16bit))


The latencies and throughputs of _mm_mullo_epi16 and _mm_madd_epi16 are the same, so the second version is clearly a big win.
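
Putting the pieces together, a hedged sketch of the inner/outer loops for byte rasters could look like the following. This is illustrative only, not the actual GDAL code: it uses _mm_unpackhi_epi8 instead of the shuffle shown above to reach the 8 high bytes, the function name is made up, and the remaining pixels (if the count is not a multiple of 16) are left to a scalar loop.

#include <emmintrin.h>
#include <cstdint>
#include <cstddef>

// Sum and sum of squares of a byte buffer, SSE2 version (sketch).
void SumAndSumSquare(const uint8_t* data, size_t nPixels,
                     uint64_t& sum, uint64_t& sum_square)
{
    sum = 0;
    sum_square = 0;
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    while (i + 16 <= nPixels)
    {
        // Each uint32 lane of v_sum_square gains at most 4 * 255 * 255 per
        // 16 bytes, so 16384 iterations cannot overflow it.
        size_t nIters = (nPixels - i) / 16;
        if (nIters > 16384) nIters = 16384;
        __m128i v_sum = zero;
        __m128i v_sum_square = zero;
        for (size_t j = 0; j < nIters; ++j, i += 16)
        {
            const __m128i v = _mm_loadu_si128(
                reinterpret_cast<const __m128i*>(data + i));
            v_sum = _mm_add_epi32(v_sum, _mm_sad_epu8(v, zero));
            const __m128i v_low = _mm_unpacklo_epi8(v, zero);
            const __m128i v_high = _mm_unpackhi_epi8(v, zero);
            v_sum_square = _mm_add_epi32(v_sum_square,
                                         _mm_madd_epi16(v_low, v_low));
            v_sum_square = _mm_add_epi32(v_sum_square,
                                         _mm_madd_epi16(v_high, v_high));
        }
        // Fold the 32 bit lanes into the 64 bit scalar accumulators.
        uint32_t lanes[4];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), v_sum);
        sum += lanes[0] + lanes[2];  // _mm_sad_epu8 only fills lanes 0 and 2
        _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), v_sum_square);
        sum_square += static_cast<uint64_t>(lanes[0]) + lanes[1]
                      + lanes[2] + lanes[3];
    }
    // The remaining (nPixels % 16) pixels would be handled by a scalar loop.
}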

Use of AVX2

We can tweak performance a bit by doing a 2x loop unrolling, which will enable the compiler to re-order some operations so that those which have a throughput of 0.5 cycle (such as _mm_add_epi32, _mm_unpacklo_epi8) are consecutive and two of them can thus be executed in a single cycle. When doing so, we can notice that we are operating on a virtual 256 bit register. But 256 bit registers do exist in the AVX2 instruction set, which was introduced in relatively recent hardware (2013 for Intel Haswell). AVX/AVX2 offers the YMM registers, equivalent to XMM registers but with a doubled bit width (the 128 bit low part of a YMM register is its corresponding XMM register). One particularity of the YMM registers is that they operate on two quite distinct "128 bit lanes", but you can still extract each lane.

The port to AVX2 is quite straightforward :

    v = _mm256_load_si256(data + i)
    v_sum = _mm256_add_epi32(v_sum, _mm256_sad_epu8(v, zero))
    v_low_16bit = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(v, 0));
    v_sum_square = _mm256_add_epi32(v_sum_square, _mm256_madd_epi16(v_low_16bit, v_low_16bit))
    v_high_16bit = _mm256_cvtepu8_epi16(_mm256_extracti128_si256(v, 1));
    v_sum_square = _mm256_add_epi32(v_sum_square, _mm256_madd_epi16(v_high_16bit, v_high_16bit))


_mm256_extracti128_si256(v,0) extracts the 128 bit lower part of the register,
and _mm256_extracti128_si256(v,1) the 128 bit upper part.

The good news is that we can have a single code base for the SSE2 and AVX2 variants, by using the AVX2 code. In the case of SSE2, we in fact define the _mm256 functions with their corresponding _mm 128bit functions operating on the low and high 128 bit parts. For example:

static inline GDALm256i GDALmm256_add_epi32(GDALm256i r1, GDALm256i r2)
{
    GDALm256i reg;
    reg.low = _mm_add_epi32(r1.low, r2.low);
    reg.high = _mm_add_epi32(r1.high, r2.high);
    return reg;
}

The AVX2-with-SSE2 emulation can be found in :
https://github.com/OSGeo/gdal/blob/trunk/gdal/gcore/gdal_avx2_emulation.hpp
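
For illustration, the GDALm256i wrapper used above can be thought of as just a pair of XMM registers; a hedged sketch (the actual header linked above may differ in its details):

#include <emmintrin.h>

// Emulated 256 bit register: two 128 bit SSE2 registers.
struct GDALm256i
{
    __m128i low;
    __m128i high;
};

// Hypothetical zero-initialization wrapper, following the same pattern.
static inline GDALm256i GDALmm256_setzero_si256()
{
    GDALm256i reg;
    reg.low = _mm_setzero_si128();
    reg.high = _mm_setzero_si128();
    return reg;
}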

Thanks to inlining and usual compiler optimizations, this will be equivalent to our hand 2x unrolled version ! The final code is here.

Regarding timings, our SSE2-emulated AVX2 version runs in 1.6s, so roughly a 4x improvement with respect to the portable optimized C version. On hardware capable of AVX2, the pure AVX2 version is 15% faster than the SSE2-emulated version. So definitely not enough to justify a dedicated code path, but here, as we have a unified one, it comes almost for free, provided that the code is explicitly compiled to enable AVX2.

Nodata values

Up to now, we have ignored the potential existence of nodata values. When computing statistics, we do not want pixels that match the nodata value to be taken into account in the minimum, maximum, mean or standard deviation.

In the pure C approach, this is easy. Just ignore pixels that match the nodata value:

    minimum = maximum = value[0]
    foreach value:
        if value != nodata:
            valid_pixels = valid_pixels + 1
            minimum = min(minimum, value)
            maximum = max(maximum, value)
            sum = sum + value
            sum_square = sum_square + value * value


We cannot directly translate that with SSE2/AVX2 intrinsics since the result of the value != nodata test can be different for each of the 32 packed bytes of the (virtual) AVX2 register, and making tests for each component of the vector register would kill performance to the point where it would be worse than the pure C approach !

We can however rewrite the above in a vector friendly way with :

    minimum = maximum = first value that is not nodata
    neutral_value = minimum (or any value in final [min,max] range that is not nodata)
    foreach value:
        validity_flag = if (value != nodata) 0xFF else 0
        value_potentially_set_to_zero = value & validity_flag
        value_potentially_set_to_neutral = (value & validity_flag) | (neutral_value & ~validity_flag)
        valid_pixels = valid_pixels + validity_flag / 255
        minimum = min(minimum, value_potentially_set_to_neutral)
        maximum = max(maximum, value_potentially_set_to_neutral)
        sum = sum + value_potentially_set_to_zero
        sum_square = sum_square + value_potentially_set_to_zero * value_potentially_set_to_zero


(value & validity_flag) | (neutral_value & ~validity_flag) is a quite common pattern in SIMD code to implement an if/then/else pattern without branches (for classic scalar code, if/then/else branches are more efficient due to the CPU being able to do branch prediction).

The only complication is that there is no SSE2 intrinsic for non-equality testing, so we have to transform that a bit to use equality testing only. And we will also remove the need for division in the loop :

    foreach value:
        invalidity_flag = if (value == nodata) 0xFF else 0
        value_potentially_set_to_zero = value & ~invalidity_flag
        value_potentially_set_to_neutral = (value & ~invalidity_flag) | (neutral_value & invalidity_flag)
        invalid_pixels_mul_255 = invalid_pixels_mul_255 + invalidity_flag
        minimum = min(minimum, value_potentially_set_to_neutral)
        maximum = max(maximum, value_potentially_set_to_neutral)
        sum = sum + value_potentially_set_to_zero
        sum_square = sum_square + value_potentially_set_to_zero * value_potentially_set_to_zero

    valid_pixels = total_pixels - invalid_pixels_mul_255 / 255


The computation of invalid_pixels_mul_255 in a vectorized way is the same as
v_sum, using the _mm_sad_epu8() trick. The resulting SSE2 code is :

    foreach vector_of_16_bytes v:
        v_invalidity_flag = _mm_cmpeq_epi8(v, v_nodata)
        v_value_potentially_set_to_zero = _mm_andnot_si128(v_invalidity_flag, v)
        v_value_potentially_set_to_neutral = _mm_or_si128(
            v_value_potentially_set_to_zero, _mm_and_si128(v_invalidity_flag, v_neutral))
        v_invalid_pixels_mul_255 = _mm_add_epi32(v_invalid_pixels_mul_255,
                                        _mm_sad_epu8(v_invalidity_flag, zero))
        [ code for min, max operating on v_value_potentially_set_to_neutral ]
        [ code for sum and sum_square operating on v_value_potentially_set_to_zero ]


The transposition to AVX2 is straightforward.

We can note that this version that takes the nodata value into account can only be used once we have hit a pixel that is not the nodata value, so as to be able to initialize the neutral value.

What about uint16 rasters ?


The same general principles apply. If we still want to limit ourselves to operating with at most uint64 accumulators, given that the maximum square value of a uint16 is 65535*65535, this limits us to rasters of 2^64/(65535*65535) ~= 4 billion pixels, which remains acceptable for common use cases.

One oddity of the SSE-2 instruction set is that it only includes _mm_min_epi16() / _mm_max_epi16() intrinsics, that is to say ones that operate on signed int16. _mm_min_epu16(), which operates on uint16, was only introduced in the later SSE 4.1 instruction set (which is nonetheless quite commonly found in CPUs that are not so recent).

There are tricks to emulate _mm_min_epu16() in pure SSE2 using saturated subtraction and masking :

    // if x <= y, then mask bits will be set to 1.
    mask = _mm_cmpeq_epi16( _mm_subs_epu16(x, y), zero )

    // select bits from x when mask is 1, y otherwise
    min(x,y) = _mm_or_si128(_mm_and_si128(mask, x), _mm_andnot_si128(mask, y));


Another way is to shift the unsigned values by -32768, so as to operate on signed 16bit values.
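
In intrinsics form, the shift trick could look like this (illustrative fragment; x and y are assumed to be __m128i registers holding packed uint16 values):

    // Adding -32768 (i.e. XORing the sign bit) maps [0, 65535] to
    // [-32768, 32767], so the signed min/max gives the unsigned ordering.
    const __m128i shift = _mm_set1_epi16(-32768);
    __m128i x_signed = _mm_add_epi16(x, shift);
    __m128i y_signed = _mm_add_epi16(y, shift);
    __m128i v_min = _mm_min_epi16(x_signed, y_signed);  // shift back at the end
    __m128i v_max = _mm_max_epi16(x_signed, y_signed);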

This -32768 shift trick is also necessary since, as in the byte case, we want to still be able to use the _mm_madd_epi16 intrinsic, which operates on signed int16, to compute the sum of square values. One subtlety to observe is that when you operate on 2 consecutive pixels at 0, _mm_madd_epi16 does :

 (0 - 32768) * (0 - 32768) + (0 - 32768) * (0 - 32768)
= 1073741824 + 1073741824
= 2147483648 = 2^31


Which actually overflows the range of signed int32 ( [-2^31, 2^31-1] ) ! The good news is that _mm_madd_epi16 does not saturate the result, so it will actually return 0x80000000. This should normally be interpreted as -2^31 in signed int32 convention, but as we know that the result of _mm_madd_epi16(x,x) is necessarily positive, we can still correctly interpret the result as a uint32 value. This is where you feel lucky that Intel chose the two's complement convention for signed integers.

To compute the sum of values, there is no nice trick equivalent to _mm_sad_epu8, so we just do it the boring way: unpack separately the 64 bit low and high parts of the value register from uint16 to uint32 and accumulate them with _mm_add_epi32.
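
That is, something like the following fragment (illustrative; v holds 8 packed uint16 and zero is an all-zeroes register):

    // Zero-extend the 4 low and the 4 high uint16 to uint32 and accumulate.
    v_sum = _mm_add_epi32(v_sum, _mm_unpacklo_epi16(v, zero));
    v_sum = _mm_add_epi32(v_sum, _mm_unpackhi_epi16(v, zero));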

Exactly as for the byte case, the uint16 case can be easily transposed to AVX2 or
emulated-AVX2.

Conclusion


Conversions between integer and floating-point values can be costly, so avoiding them as much as possible is a big win (provided that you make sure not to overflow your integer accumulators).

In theory, the gains offered by a SSE2/AVX2 optimized version are at most limited to a factor of width_of_vector_register / width_of_elementary_type, so, for bytes and SSE2, to 16. But often the gain is smaller, so do that only when you have already come to an optimized portable C version (or when the SIMD instruction set includes a dedicated intrinsic that just does what you want).

Monday, May 2, 2016

GDAL/OGR 2.1.0 released

On behalf of the GDAL/OGR development team and community, I am pleased to announce the release of GDAL/OGR 2.1.0.  GDAL/OGR is a C++ geospatial data access library for raster and vector file formats, databases and web services.  It includes bindings for several languages, and a variety of command line tools.

The 2.1.0 release is a major new feature release with the following highlights:
  • New GDAL/raster drivers:
    • CALS: read/write driver for CALS Type I rasters
    • DB2: read/write support for DB2 database (Windows only)
    • ISCE: read/write driver
    • MRF: read/write driver for Meta Raster Format
    • SAFE: read driver for ESA SENTINEL-1 SAR products
    • SENTINEL2: read driver for ESA SENTINEL-2 L1B/L1C/L2A products
    • WMTS: read driver for OGC WMTS services
  • New OGR/vector drivers:
    • AmigoCloud: read/write support for AmigoCloud mapping platform
    • DB2: read/write support for DB2 database (Windows only)
    • MongoDB: read/write driver
    • netCDF: read/write driver
    • VDV: read/write VDV-451/VDV-452 driver, with specialization for the Austrian official open government street graph format
  • Significantly improved drivers:
    • CSV: new options, editing capabilities of existing file
    • ElasticSearch: read support and support writing any geometry type
    • GeoJSON: editing capabilities of existing file, "native data" (RFC 60) support
    • MBTiles: add raster write support; fixes in open support
    • PDF: add PDFium library as a possible back-end.
    • PLScenes: add support for V1 API
    • VRT: on-the-fly pan-sharpening
    • GTiff: multi-threaded compression for some compression methods
  • Port library: add /vsis3/, /vsis3_streaming/, /vsicrypt/ virtual file systems
  • Upgrade to EPSG database v8.8 
  • General sanitization pass to clean up code, fix a lot of compiler warnings, as well as issues pointed out by static code analyzers.
  • Fixes in a number of drivers to be more robust against corrupted files.
You can also find more complete information on the new features and fixes in the 2.1.0 release notes.

The release can be downloaded from:
  * http://download.osgeo.org/gdal/2.1.0/gdal210.zip - source as a zip
  * http://download.osgeo.org/gdal/2.1.0/gdal-2.1.0.tar.gz - source as .tar.gz
  * http://download.osgeo.org/gdal/2.1.0/gdal-2.1.0.tar.xz - source as .tar.xz
  * http://download.osgeo.org/gdal/2.1.0/gdal-grass-2.1.0.tar.gz - source of GDAL GRASS plugin
  * http://download.osgeo.org/gdal/2.1.0/gdalautotest-2.1.0.tar.gz - test suite
  * http://download.osgeo.org/gdal/2.1.0/gdal210doc.zip - documentation/website



As there have been a few changes that affect the behaviour of the library, developers are strongly advised to read the migration guide.

Sunday, March 6, 2016

Paris OSGeo Code Sprint 2016 debrief

While my memories are still fresh, here is a report of this week of code sprinting. First, a big thanks to Olivier Courtin for organizing this event, to all the sponsors that brought in the money to make it happen, and to the Mozilla Foundation for hosting us in the most scenic coding venue I've ever seen.



As expected, I mostly concentrated on GDAL work. My main tasks were related to polishing and extending the work initiated by Ari Jolma for the support of the "M dimension" of geometries, M standing for Measurement, a numeric property attached to each point/vertex that can encode different attributes: time, length, or any other interesting property beyond x, y and z.
Those good old shapefiles are still a bit peculiar since they do not really distinguish between XYZ and XYZM geometries up-front. In fact, as soon as you have a Z component, the Shapefile specification requires an M value to be encoded, even if not used. There's consequently a nodata value (any value lower than -10^38) for such cases. As M geometries are a bit esoteric, we want to avoid reporting them when they are not actually used. Consequently a heuristic has been added to the shapefile driver: by default it probes the first shape in the file and checks whether it has meaningful M values. If not, the layer geometry type is just declared as being XYZ. This should help with backward compatibility of software using GDAL. Implemented per r33538 and r33539.
The support of M in the CSV driver was more straightforward (r33544), the bulk of the work being of course done in the WKT importer/exporter.
Regarding the GeoPackage driver, the main need was to be able to correctly parse geometry headers with XYM or XYZM bounding boxes that may be found. The main difficulty was testing that, since OGR itself only generates XY or XYZ bounding boxes, so hand-editing hexadecimal WKB was needed. Somewhat amusing with a broken laptop screen. Anyway, this was done through r33551.
Support for M geometries in SQLite/Spatialite required a number of small changes scattered through the driver code base, and new tests for the various variants (regular geometries vs compressed ones). The upgrade of this driver also makes it possible to use XYM/XYZM geometries with the SQLite SQL dialect usable by all other drivers. Implemented per r33554.
The upgrade of the FileGDB and OpenFileGDB drivers gave me some headaches, as it turned out that support for writing M values was broken in the older FileGDB SDK 1.3. After upgrading to v1.4, things went much more smoothly. Support for M with the FileGDB v9.X format was also handled. Implemented per r33563. For the nostalgics, the PGeo driver should also benefit from those changes, although this wasn't tested.

On the MapServer front, in the middle of many other things, Thomas Bonfort merged, in time for MapServer 7.0.1, an older pull request of mine that I had forgotten about, to support the 64 bit integer fields that may now come with GDAL 2.0. I also backported a fix to handle WMS TIME on contour layers, in time for MapServer 6.4.3.

Aside from my own coding, I enjoyed spending time with other developers to help them with their GDAL tasks. With Rob Emmanuele, we tried to figure out how to make the "driver" that handles files accessible through HTTP/HTTPS better report errors, especially on Amazon S3 storage, so that upper library or application layers can better deal with them. In particular, you want to be able to distinguish a non-existing resource (a typo in the URL for example) from a valid one for which you have not specified the right credentials. This turned out to be much more difficult than I would have anticipated myself, since there are a lot of situations where we want errors when accessing files to be silent (for example when drivers probe for potential "sidecar" files that accompany main files: think of the .prj, .wld, .aux files), and there's no way in the current design to know when to be verbose or not. Rob finally came up with a design for a file system error reporting mechanism that is not verbose by default, but that may be queried by the code paths that want to report errors in a verbose way. This is still work in progress, but hopefully Rob should be able to polish it to be included in the upcoming GDAL 2.1 release (feature freeze at the end of this month).

With Yann Chemin, we had quite some fun exploring how to better support the catalog of spatial reference systems published by the IAU (International Astronomical Union), which describes the SRS used for other planets and satellites. In particular, we discovered that some of those SRS use the Oblique Cylindrical Equal Area (OCEA) projection. This projection is supported by proj.4 (thanks to Howard Butler for designing a modern website for this not always sexy but so fundamental piece of software that is proj.4), but not by the OGR Spatial Reference (OSR) component of GDAL itself. The main challenge to make it available through OSR is to be able to map the proj.4 parameters of the projection to parameter names in WKT. Documentation to do that is generally scarce, and we ended up opening the bible of the projection experts, that is to say "Map Projections - A Working Manual", by John P. Snyder, USGS Professional Paper 1395, of which proj.4 is mostly the translation into C code. The book shed some light, at its page 80, on the OCEA projection. The interesting part of OCEA is that it comes with 2 variants... The gist of the support is now in this pull request, with some more work and research to clarify the remaining mysteries. In the meantime, GRASS can now benefit from IAU codes (r67950 and r67951).

Ever wondered about the possible command line switches of GDAL/OGR utilities ? Guillaume Pasero contributed a bash completion script to improve your user experience.

$ ogr2ogr - (TAB character pressed)
-append --debug -dsco --format --help-general --locale --optfile -preserve_fid -skipfailures -sql -update
-a_srs -dialect -f --formats -lco -nln -overwrite -progress -spat -s_srs --version
--config -dim -fid -geomfield --license -nlt --pause -select -spat_srs -t_srs -where


Regina Obe also worked on improving the ODBC support in OGR: build support of the Windows ODBC libraries with the mingw64 compiler, and the ability to support a large number of columns in tables.




Tuesday, January 5, 2016

Software quality improvements in GDAL

As a new year is popping up, it is time to make good resolutions. To help you, especially if you are a C/C++ developer, this article gives feedback on efforts made over the last few months to improve the quality of the GDAL/OGR code base, and hopefully make it more reliable for the end user.

Enable as many compiler warning options as possible


By default, C/C++ compilers enable a few warning categories, but this is far from sufficient. You want to turn on extra warnings. The set of available warning options depends on the compiler, so you may want to autodetect which ones are available in your configure/cmake script. The flags below are used by GDAL with GCC and CLang:


In the "Must have" category :
  • -Wall: basic set of extra warnings
  • -Wextra: a few more
  • -Wdeclaration-after-statement : for C code, to avoid compilation errors on Microsoft Visual Studio.
  • -Werror=vla: variable length arrays are a GCC extension not supported by V.S.
  • -Wformat: detects errors in the use of printf() like statements
  • -Werror=format-security: error out on such errors that have security implications
  • -Wno-format-nonliteral: this is an exception to allow the format string to be a non-constant. We need that in a few cases for GDAL, but try without it if you can.
  • -Winit-self: warn about uninitialized variables that are initialized with themselves

Good practice:
  • -Wmissing-prototypes: for C code
  • -Wmissing-declarations: helped fix mismatches between function prototypes and their implementation.
  • -Wnon-virtual-dtor: make it compulsory to define a destructor of a C++ class as virtual
  • -Wlogical-op: (GCC only) a few heuristics that detect wrong uses of the logical and/or operators. Can have some false positives sometimes, but helped find quite a few bugs, like a condition that always evaluates to false, another one that always evaluates to true (or this one too) and my preferred one (we were very lucky that, in ASCII, the single quote character is just before the open parenthesis, otherwise this check wouldn't have detected the issue). Interestingly, not all versions of GCC or CLang raise the same warnings, due to varying heuristics.
  • -Wunused-private-field: (CLang only, C++) detects unused private members in classes.
  • -Wunused-parameter: detects unused parameters. In C++, you can just omit the argument name if it is unused. In C, you can use a macro that will expand to __attribute__((__unused__)) on GCC compatible compilers.
  • -Wnull-dereference: detects situations where a NULL pointer dereference is possible. Only available in (unreleased at this time) GCC 6. A few false positives are possible (usually the warning can be worked around)
  • -Wduplicated-cond:  detects redundant conditions in if / else if constructs. Only available in GCC 6

Nice to have, but can require extensive efforts to fix:
  • -Wshorten-64-to-32: (CLang only) will detect potential portability problems for 64-bit builds. 
  • -Wsign-compare: (CLang only) warns on comparisons where the operands on either side of the comparison operator do not have the same signedness.
  • -Wshadow: detect variable "shadowing", for example a local variable with the same name as a global variable or a class member. This can help enforcing style conventions like using m_ to prefix member variables.
  • -ftrapv: generates a runtime error if an overflow occurs on signed integers. Such overflows are undefined behaviour in C/C++, so it is good to be able to catch them. It is somewhat redundant with -fsanitize=undefined, although it has been available in compilers for a longer time (but with ineffective implementations in older GCC versions. Recent clang versions have it really working well though). Perhaps only enable this in debug builds as we do in GDAL.

And once you have cleaned up your code base, you can add the magic -Werror flag that will cause any warning to be treated as an error so as to maintain it in a warning-free state.

Sometimes you unfortunately have to deal with external library headers that themselves trigger compiler warnings. There is nothing you can really do about that. GCC and clang have an interesting workaround for this: basically, create your own header, and use #pragma GCC system_header before including the third-party headers. Here's an example.
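
For instance, a wrapper header could look like the following (a hedged sketch with a made-up header name; GDAL's actual wrappers differ in the details):

/* thirdparty_wrapper.h: silence warnings coming from a third-party header. */
#if defined(__GNUC__)
#pragma GCC system_header
#endif
#include <some_thirdparty_library/header.h>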

In GDAL 2.0, enabling the above mentioned warning options caused 3865 warnings to be raised. In GDAL 2.1.0dev, we cut it down to 0.

For Visual Studio, enable /W4 for a more extensive set of warnings, and add exceptions when needed. In GDAL, we use the following exceptions (only enable them if really needed):
  • /wd4127: conditional expression is constant
  • /wd4251: related to export of symbols
  • /wd4275: related to export of symbols
  • /wd4100: unreferenced formal parameter (this would be the equivalent of -Wno-unused-parameter), since there's no way of tagging a function parameter as being unused in VS
  • /wd4245: to disable warnings about conversions between signed and unsigned integers
  • /wd4611: to disable warnings about using setjmp() and C++

Use different compilers

The more compilers you try, the more issues they will raise. In GDAL, we have continuous integration targets that use different versions of GCC, CLang and Microsoft Visual Studio.

Function annotations

GCC and CLang offer a set of attributes/annotations that can be added to a function declaration (a short usage sketch follows the list below).
  • __attribute__((warn_unused_result)): to warn when the return value of a function is not used. This will increase the reliability of your code by ensuring that error conditions are properly dealt with (in the case you deal with errors with return values rather than C++ exceptions)
  • __attribute__((__format__ (__printf__, format_idx, arg_idx))): to flag a function as behaving like printf() and related functions. The compiler will then check that the arguments passed in the variable list are of the correct type and number with respect to the formatting string.
  • __attribute__((__sentinel__)): to flag a function taking a variable list of arguments to expect the last argument to be a NULL pointer.
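
Here is a short sketch of how such annotations are attached to declarations (the function names are made up for illustration; they are not GDAL API):

char* MyStrdup(const char* s) __attribute__((warn_unused_result));

void MyErrorf(int err_class, const char* fmt, ...)
                            __attribute__((__format__(__printf__, 2, 3)));

char** MyStringListNew(const char* first, ...) __attribute__((__sentinel__));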

Static code analysis

Static code analysis is the logical extension of the checks done by the compiler, except that more complex checks can be done to try to detect logic errors beyond the checks that are strictly needed to compile the code.
CLang Static Analyzer, an add-on to the LLVM/CLang compiler, is such a tool. I must warn that it has a significant rate of false positive warnings, which with some effort can generally be worked around by adding extra assertions in your code. Or, if it takes you more than 10 seconds to figure out that the warning is in fact a false positive, it is a sign that your code is likely too complex, and by simplifying it, you will make your life and the one of the analyzer easier. Despite false positives, it can find real bugs such as an access beyond buffer bounds, a memory leak or the dereferencing of a NULL pointer. I'd recommend enabling the optional alpha.unix.cstring.OutOfBounds and alpha.unix.cstring.BufferOverlap checkers. We finally got to 0 warnings in the GDAL code base. You can even write your own checkers, to enforce specific development rules, as the Libreoffice developers did, but this can be rather involved and we haven't been up to that point yet in GDAL.

cppcheck is another free&open-source C/C++ static analysis tool that can be used, although I found it to be less powerful and slower than CLang Static Analyzer. You can run it with cppcheck --enable=all --force --inconclusive *.cpp and ignore warnings about unused functions.

In GDAL, we also use the commercial Coverity Scan tool, whose use is free (as in beer) for free&open source software. Our experience is that Coverity Scan has a reasonably low rate of false positives (probably around 10%). One of its strengths is its ability to correlate portions of code that are not in the same C/C++ file. In June 2015, we had more than 1100 warnings raised by Coverity Scan. With the help of Kurt Schwehr, who fixed several hundred of them, we finally reached 0 (with a bit less than 100 annotations to discard false positives).

Side note: as GDAL uses Python for a few of its utilities and its test suite, we use the pyflakes utility to do some basic checks on the Python code.

Automated test suite and code coverage

What we mentioned above are static checks, that is to say checks that are done just by "looking" at the code, but not by running it. To cover the dynamic aspect, GDAL has an extensive automated test suite that checks the behaviour of utilities, core functions and drivers. Writing tests is good, but knowing which parts of your code the tests cover is even better. In order to do so, we have a continuous integration target that compiles the code with profiling options (--coverage flag of gcc) so that you can get a report after test execution of which lines and branches of code have actually been run. Combined with the gcov and lcov utilities, this can produce a nice HTML output. With the default set of test data, 63% of the compiled lines of GDAL are executed at least once (and with a test-driven methodology, we can reach 90% or more in recently developed drivers such as Sentinel2, WMTS or VDV). Some projects that use GitHub / Travis-CI also go with the Coveralls service, which integrates well with those, to track code coverage (my tests with Coveralls roughly one year ago were not successful, apparently due to the size of the GDAL code base).

If you decide to fix compiler and static code analysis warnings, I would strongly recommend making sure your test suite covers the code you are fixing, as it is sometimes easy to introduce regressions while trying to fix static warnings.

Dynamic checkers


In C/C++, it is easy to misuse dynamically allocated memory, either by reading or writing outside of the allocated zones, using freed memory or forgetting to free memory. You can run your test suite through the Valgrind memory debugger (available on Linux and MacOSX) or Dr. Memory (Windows 32 bit). Valgrind is really an excellent tool. You should use it. Unfortunately it slows down execution by a significant factor (10 or more), due to on-the-fly code instrumentation, which can make it impractical for some continuous integration environments where runs are time limited.

Another option is to use the -fsanitize=address flag of recent GCC/CLang versions, which does similar checks to Valgrind, but the instrumentation is done at compile time, which makes the slowdown much more bearable. Other sanitizers such as -fsanitize=undefined can also be used to catch situations where undefined behaviour, as defined in the C/C++ standards, happens, and where you thus rely on the choices made by the compiler, or on the specific logic of the CPU architecture (as this bug report shows, not all CPU architectures deal the same way with overflows during signed/unsigned conversions). -fsanitize=thread can also be used to detect issues with thread usage.

Fuzz testing

American Fuzzy Lop, also known as AFL, is a quite recent fuzz testing tool that has gained a lot of popularity. The principle is that you feed it an initial file that you run through your utility, and AFL will do various random or not-so-random changes to it to try to trigger bugs. This is particularly well suited for command line utilities such as gdalinfo or ogrinfo that take a file as input.
An interesting innovation of AFL with respect to similar tools is that, through compile-time instrumentation, it checks which code branches have been taken or not, to determine which changes in the test file cause which code branches to be taken, so as to maximize the coverage of code branches. It can also be used to generate test cases for greater code coverage, by writing out the input file when you hit a branch you want covered.

AFL has for example identified this divide by zero bug or this improper use of data types for which the fix is more involved.

Lately I've played with the afl-clang-fast experimental module of AFL, which requires the code to be compiled with CLang. With special instrumentation (but very simple to put in place) in the GDAL binaries like gdalinfo or ogrinfo, AFL can run typically 10 times faster, reaching a thousand tests per second. Combined with -ftrapv (and possibly other sanitizers such as -fsanitize=undefined), it has for example caught dozens of situations where integer overflows could happen on corrupted data.

Continuous integration

To make all the above mentioned good practices really useful (perhaps except fuzz testing, which is a slow operation), it is highly recommended to have continuous integration practices, that is to say a mechanism that automates compilation and execution of your test suite each time a change is pushed to the code repository. In GDAL, we have 16 different configurations that are regularly run, using different versions of gcc/clang/Visual Studio, 32 bit vs 64 bit builds, Python 2.X vs Python 3.X, Linux/MacOSX/Windows/Android, a big endian host, C++11 compatibility, a target running the test suite with GCC -fsanitize=address -fsanitize=undefined, and also a CLang Static Analysis target. You can find the badges for those different targets on the GitHub mirror home page.
I am also thinking that a C++14 target could be useful, as sometimes upgrading to a newer standard can reveal bugs only at runtime, as this bug report shows.

To run all those configurations, we use the Travis-CI (all configurations except Visual Studio builds) and AppVeyor (Visual Studio builds) online services (sorry, neither uses free&open-source software). Alternatives such as GitLab-CI using F.O.S.S. exist.

Conclusion

When you decide to finally tackle compiler and static code analysis warnings in a code base as large as GDAL, this can be a huge effort (to be evaluated in weeks/months of manpower). But the effort is worth it. It helps uncover real bugs that were lying around and makes it more friendly for contributors to do further code changes.

This article has been written by Even Rouault (Spatialys), with contributions from Kurt Schwehr.