Over this summer I have spent 40 days (*) in the guts of the OpenJPEG open-source library (BSD 2-clause licensed) optimizing the decoding speed and memory consumption. The result of this work is now available in the OpenJPEG 2.3.0 release.
For those who are not familiar with JPEG-2000, and they have a lot of excuse given its complexity, this is a standard for image compression, that supports lossless and lossy methods. It uses discrete wavelet transform for multi-resolution analysis, and a context-driven binary arithmetic coder for encoding of bit plane coefficients. When you go into the depths of the format, what is striking is the number of independent variables that can be tuned:
- use of tiling or not, and tile dimensions
- number of resolutions
- number of quality layers
- code-block dimensions
- 6 independent options regarding how code-blocks are encoded (code-block styles): use of Selective arithmetic coding bypass, use of Reset context probabilities on coding pass boundaries, use of Termination on each coding pass, use of Vertically stripe causal context, use of Predictable termination, use of Segmentation Symbols. Some can bring decoding speed advantages (notably selective arithmetic coding bypass), at the price of less compression efficiency. Others might help hardware based implementations. Others can help detecting corruption in the codestream (predictable termination)
- spatial partition of code-blocks into so-called precincts, whose dimension may vary per resolution
- progression order, ie the criterion to decide how packets are ordered, which is a permutation of the 4 variables: Precincts, Component, Resolution, Layer. The standard allows for 5 different permutations. To add extra fun, the progression order might be configured to change several times among the 5 possible (something I haven't yet had the opportunity to really understand)- division of packets into tile-parts
- use of multi-component transform or not
- choice of lossless or lossy wavelet transforms
- use of start of packet / end of packet markers
- use of Region Of Interest, to have higher quality in some areas
- choice of image origin and tiling origins with respect to a reference grid (the image and tile origin are not necessarily pixel (0,0))
And if that was not enough, some/most of those parameters may vary per-tile! If you already found that TIFF/GeoTIFF had too many parameters to tune (tiling or not, pixel or band interleaving, compression method), JPEG-2000 is probably one or two orders of magnitude more complex. JPEG-2000 is truly a technological and mathematical jewel. But needless to say that having a compliant JPEG-2000 encoder/decoder, which OpenJPEG is (it is an official reference implementation of the standard) is already something complex. Having it perform optimally is yet another target.
Previously to that latest optimization round, I had already worked at enabling multi-threaded decoding at the code-block level, since they can be decoded independently (once you've re-assembled from the code-stream the bytes that encode a code-block), and in the inverse wavelet transform as well (during the horizontal pass, resp vertical pass, rows, resp columns, can be transformed independently). But the single-thread use had yet to be improved. Roughly, around 80 to 90% of the time during JPEG-2000 decoding is spent in the context-driven binary arithmetic decoder, around 10% in the inverse wavelet transform and the rest in other operations such as multi-component transform. I managed to get around 10% improvement in the global decompression time by porting to the decoder an optimization that had been proposed by Carl Hetherington for the encoding side, in the code that determines which bit of wavelet transformed coefficient must be encoded during which coding pass. The trick here was to reduce the memory needed for the context flags, so as to decrease the pressure on the CPU cache. Other optimizations in that area have consisted in making sure that some critical variables are kept preferably in CPU registers rather than in memory. I've spent a good deal of time looking at the disassembly of the compiled code.
I've also optimized the reversible (lossless) inverse transform to use the Intel SSE2 (or AVX2) instruction sets to be able to process several rows, which can result up to 3x speed-up for that stage (so a global 3% improvement)
I've also worked on reducing the memory consumption needed to decode images, by removing the use of intermediate buffers when possible. The result is that the amount of memory needed to do full-image decoding was reduced by 2.4.
Another major work direction was to optimize speed and memory consumption for sub-window decoding. Up to now, the minimal unit of decompression was a tile. Which is OK for tiles of reasonable dimensions (let's say 1024x1024 pixels), but definitely not on images that don't use tiling, and that hardly fit into memory. In particular, OpenJPEG couldn't open images of more than 4 billion pixels. The work has consisted in 3 steps :
- identifying which precincts and code-blocks are needed for the reconstruction of a spatial region
- optimize the inverse wavelet transform to operate only on rows and columns needed
- reducing the allocation of buffers to the amount strictly needed for the subwindow of interest
The overall result is that the decoding time and memory consumption are now roughly proportional to the size of the subwindow to decode, whereas they were previously constant. For example decoding 256x256 pixels in a 13498x9944x3 bands image takes now only 190 ms, versus about 40 seconds before.
As a side activity, I've also fixed 2 different annoying bugs that could cause lossless encoding to not be lossless for some combinations of tile sizes and number of resolutions, or when some code-block style options were used.
I've just updated the GDAL OpenJPEG driver (in GDAL trunk) to be more efficient when dealing with untiled JPEG-2000 images.
There are many more things that could be done in the OpenJPEG library :
- port a number of optimizations on the encoding side: multi-threadig, discrete wavelet transform optimizations, etc...
- on the decoding side, reduce again the memory consumption, particularly in the untiled case. Currently we need to ingest into memory the whole codestream for a tile (so the whole compressed file, on a untiled image)
- linked to the above, use of TLM and PLT marker segments (kind of indexes to tiles and packets)
- on the decoding side, investigate further improvements for the code specific of irreversible / lossy compression
- make the opj_decompress utility do a better use of the API and consume less memory. Currently it decodes a full image into memory instead of proceeding by chunks (you won't have this issue if using gdal_translate)
- investigate how using GPGPU capabilities (CUDA or OpenCL) could help reduce the time spent in context-driven binary arithmetic decoder.
Contact me if you are interested in some of those items (or others !)
(*) funding provided by academic institutions and archival organizations, namely
- Wellcome Library
- Stanford University
- Nationale Bibliotheek van Nederland (KBNL)
- University of Michigan
- University of California, Los Angeles (UCLA)