GDAL 2.0.0 should be released soon (RC2 available), but the development version (2.1.0dev) has already received interesting improvements, especially for massive pixel crunchers.
The lossless DEFLATE compression scheme is notoriously known to be rather slow on the compression side. For example, converting with gdal_translate a 21600x21600 pixels RGB BMNG tile into a DEFLATE compressed GeoTIFF takes 44s on a Core i5 750 with the default setting. Using low values of the ZLEVEL creation option, that controls the speed vs compresssion ratio, makes things a bit faster :
- ZLEVEL=1, 1 thread: 26s, 486 MB
- ZLEVEL=6(default), 1 thread: 44s, 440 MB
With the development version, you can now use the NUM_THREADS creation option, whose value must be an integer number or ALL_CPUS to make all your cores busy. New results:
ZLEVEL=1, 4 threads: 12s, 486 MB- ZLEVEL=6, 4 threads: 16s, 440 MB
Using horizontal differencing prediction (PREDICTOR=2) leads to further compression gains (generally, but that depends on the nature of data) with a compression time slightly greater (almost unchanged at ZLEVEL=1, but noticeably slower at ZLEVEL=6) :
- ZLEVEL=1, PREDICTOR=2, 1 thread : 25 s, 354 MB
- ZLEVEL=1, PREDICTOR=2, 4 threads : 12 s, 354 MB
- ZLEVEL=6, PREDICTOR=2, 1 thread : 1m 5s, 301 MB
- ZLEVEL=6, PREDICTOR=2, 4 threads : 22 s, 301 MB
Multi-threaded compression has been implemeneted for DEFLATE, LZW, PACKBITS and LZMA compression schemes. JPEG could potentially be added as well, but probably with little benefits since it is already a very fast compression method, especially with libjpeg-turbo (10 s on above example, single-threaded).
Surprisingly, multi-threaded compression is easier to implement than multi-threaded decompression. This is due to the fact that the writing side of gdal_translate does not really care when data actually hits the disk : it pushes pixel blocks to the output driver and let it do what it needs to do with them (of course details are a bit more complicated since GDAL has a block cache that buffers actual writes, the GeoTIFF driver has a one-block cache to deal with pixel-interleaved files, and in some circumstances a driver may need to re-read data it has written). On the reading side, when a block is asked, the pixel values must be returned synchronously. So multi-threaded reading is less obvious and has been left aside for now in the GeoTIFF driver. It could be enabled if a large enough window of pixels is announced to be read (this is actually what is currently implemented in the JPEG2000 OpenJPEG driver when a request crosses several JPEG2000 tiles). When only small windows are requested, a heuristics could be used to recognize the read pattern and start asynchronous decoding of the next likely blocks that will be asked. For example, if the driver realizes that the blocks are asked sequentially from the top to the bottom of the image, it might anticipate that the full image will be eventually read and start spawning tasks to decode the next blocks. We could also imagine to have that as a generic mechanism: in that case a dedicated GDAL dataset would have to be opened for each worker thread so as to avoid messing up with the internal state.
Regarding lossless compression schemes, the LZ4 method might be worth considering as a new method for TIFF. It offers great compression performance and blazingly fast decompression. At the expense of the compression ratio.