jeudi 2 juillet 2015

Reliable multithreading is hard

As the race to increase processor speed has stalled during the last decade, CPU manufacturors have rather increased the number of transistors and parallel execution units within a single CPU to keep up with Moore's law. Hence the multi-cores, hyper-threading technologies. This is all good but contrary to extra MHz, software has to be explicitely designed to deal with those parallelism mechanisms, in order to use the full capacity of the CPU. Multi-threaded programming is not a novelty, but even when you're aware of its caveats, it is easy to be caught in troubles and not trivial to diagnose and fix them.

I've spent a good deal of time those last days trying to understand the reason for a failure that began to occur randomly in the GDAL automated test suite that ran on the Travis continuous integration system. Step 27 of the tests of the JPEG2000 OpenJPEG driver crashed roughly 1/3 to 1/2 of the time. The crashes started just after I merged RFC 26 - GDAL block cache improvements in the development version (future 2.1.0).

A few words about this global block cache for rasters: a block is a rectangular region of a raster, generally one or several lines (strips), or squares of size 256x256 typically. When code fetches pixel values from a raster, GDAL caches the contents of the blocks of the raster that intersect the requested area, so that following requests in the same area are faster.

My first reaction when seeing the new failures was to try reproducing them locally on my machine, but that didn't work. I also tried on another machine: no better. Then I ran the tests under the Valgrind memory debugger, since random failures tend to be often associated with misuse of dynamic memory, and still no way of making the tests crash. This became a hint that multi-threading could be involved.

The main work involved in RFC 26 wasn't primarily about modifying multi-threaded behaviour, but there were indeed a few non-trivial changes done in that area, so as to preserve and even slightly improve performance of the global block cache in a multi-threaded context, by reducing to the minimum the scope of the mutex that protects the shared linked list of cached blocks. So it seemed likely I had introduced a regression in the multi-threaded case with my changes. Strangely, the test step that crashed did not involve multi-threading... But reviewing RFC 26 code again, I could not spot a potential race (and some of the work of this RFC actually consisted in removing some potential old races). However, even with careful review, it is easy to miss threading issues.

So I added a few assertion checks and debug traces just before the point where the crash occured, but by doing so, of course, the crash did not happen anymore. So I removed some of them, and today, I finally got again a crash with some usefull debug information I had kept fortunately. The reason for the crash was that a block remained in the block cache even after the raster band to which it belonged to had been destroyed, which should normally not happen, as at band destruction, the cached blocks that belongs to a band are removed from the cache. In step 27, due to the saturation of the cache, the block got evicted from the cache (expected behaviour since the cache has a finite size). Evicting a block requires some interaction with the band to which it belongs to. When the band is in a zombie state, crash is likely to happen. The particular characteristics of this zombie block helped me to identify it as coming from the earlier step 15 of the OpenJPEG tests. And interestingly, this test case exercises an optimization in the OpenJPEG driver to have multi-threaded reading of JPEG2000 tiles. Now investigating the OpenJPEG driver, I discovered it violated a general principle of GDAL, by concurrently accessing in a few places a dataset from several threads. Before RFC 26, the potential for this violation to result in a crash was low (those tests had run for 2 or 3 years without crashing), although theoretically possible. With the new code of RFC 26, an optimization of the TryGetLockedBlockRef() that no longer instanciated the rasterband cache caused it to be later concurrently initialized, which is in no way supported. Explanation found. The fix was simple: just protect with a mutex the few places in the OpenJPEG driver where concurrent use of the dataset could happen.

A few lessons :
  • when adding new code, if something fails, the culprit is not necessarily the new code by itself, but it might just reveal potential dormant defects in existing code.
  • debugging multi-threading issues sometimes looks like quantum physics: when you are looking at the system, you modify its behaviour.
  • extensive test suites (more than 180 000 lines of Python code for GDAL) are an investment that ends up paying off. In that case, the fact that the test steps do not run in isolation from each other, but are run by a long living process, probably made it possible to observe the bug frequently enough.
  • continuous integration helps uncovering problems close to their introduction, which is far easier to solve rather than few months later, just before releasing.

While we are talking about multi-threading, a change in Mapnik to improve efficiency of GDAL use turned to cause random crashes when using VRTs with node-mapnik. The ground reason was that VRT datasets are opened in the main thread of node-mapnik before being dispatched to worker threads. The issue is that, by default, VRTs open their underlying sources in a shared mode, i.e. if one or several VRTs points to the same source (e.g. a TIFF file), and, important detail, if the opening is done by the same thread, it is opened just once. Which was the case here. So dispatching those VRT dataset handles into multiple thread afterwards can cause concurrent access to the TIFF dataset objects. To solve this, an improvement had been introduced in GDAL 2.0 to add the possibility to open VRT sources in a non-shared mode. Is has been discovered today that in the case where you use VRTs whose sources are themselves VRT, an extra fix was needed.