dimanche 21 septembre 2014

GeoTIFF tile de-duplication

Have you ever had the opportunity to work with a raster dataset, that has world coverage, including oceans, and a resolution of 38 meters ? With World Mercator projection, the width and height of such a raster is 1 million pixels (1 048 576 exactly). If we also add 15 overviews (to go to a 32x32 thumbnail), how big would be such a raster ? More than 5 terabytes ?! 1 million * 1 million * 4 (for RGBA) * 1.33 (space for overviews : 1/4 + 1/16 +... = 0.333..). BigTIFF to the rescue ?

You are wrong! Such a raster dataset can be as small as 1 392 764 bytes (1.3 MB) in standard GeoTIFF format (that can be further compressed to 78838 bytes once put in a .zip). And in that size, it can feature 1 431 655 765 (1.4 billion) GDAL logos. If you don't believe me, just download it now and check by yourself !

http://even.rouault.free.fr/gdal_everywhere2.tif

You should be able to display it at light speed in any reasonable desktop GIS. QGIS is one of them.

What is the recipe for such a file ? Simply (ab)using possibilities offered by the TIFF specification. Namely, for a tiled TIFF, for each resolution, there are 2 arrays : one that contains the location (offsets) of each tile data (TileOffsets tag), and another one the size of each tile data (TileByteCounts tag). Here we simply put the same value for the location of all tiles, and write just once a 2048x2048 tile (compressed with DEFLATE codec) that mosaics the 32x32 GDAL logo. Using just that leads to a file of size 2 971 732 bytes, much larger than needed. So we are going to abuse the TIFF specification even more. First by noticing that if the offset of the tile matches also its size, then we can use the same array for TileOffsets and TileByteCounts, thus saving (1048576/2048)^2*4 = 1048576 bytes. In that instance, the tile is 172094 bytes large, so we place it at offset 172094. And finally, we can also make the overviews point their TileOffsets and TileByteCounts tags to the single array of the full resolution. Actually, as the definition of the 16 TIFF directories ends at offset 3446, we have also 172094-3446 = 168 648 spare bytes !

Letting aside this challenge, it could be interesting to have that tile de-duplication  capability directly incorporated into the GDAL GeoTIFF driver in a more user friendly way than the mix of GDAL and direct byte access that has been used to build that file. A typical use case is when creating raster with a lot of oceanic area where tiles are in solid blue. Such technique can be used when creating MBTiles, but the good old TIFF can also do it. If you are interested, contact me !