Thursday 6 November 2014

Hacking Python module constants

I spent a whole afternoon figuring out how to emit a deprecation warning when a constant attached to a Python module is used, so here is how to do it in case you need something similar.

The use case is the following: as GDAL RFC 49 introduces curve geometries, I wanted users to be warned when they use the now deprecated ogr.wkb25DBit constant, since the new geometry types no longer use the most significant bit of the value to indicate the Z dimension.

from osgeo import ogr

ogr.wkb25DBit
DeprecationWarning: deprecated: use ogr.GT_Flatten(), ogr.GT_HasZ() or ogr.GT_SetZ() instead
  warnings.warn("deprecated: use ogr.GT_Flatten(), ogr.GT_HasZ() or ogr.GT_SetZ() instead", DeprecationWarning)

The issue is that ogr.wkb25DBit was a variable assigned to a module, and you cannot do anything to "attach" a function that would be evaluated at runtime when the constant is used.

After several inconclusive attempts, the solution finally came from this article. Basically, you can use the sys.modules dictionary to replace the original module with a class instance (called the pseudo-module class instance in the rest of this post). On the class of that instance, you can define a property that calls a function whenever it is read. To keep all the other global functions and classes of the original module, you copy the global dictionary of the module into the pseudo-module class instance, and there you are! If this summary does not make sense, the above-mentioned article explains it in greater detail.

My personal touch is a further improvement. The above trick is nice, but when you call help(ogr) from the Python console, it displays the help of the pseudo-module class instance rather than that of the original module. So you lose the help for all the other constants, functions and classes. Almost everything, in fact.

But even Python builtin functions like help() can be replaced by a custom version. The custom help() tests whether the object passed is the pseudo-module class instance, in which case it temporarily substitutes the original module for it before calling the original help().

Apart from issuing deprecation warnings, such a technique can be useful to make module constants truly constant. Did you know that you can assign another value to math.pi?

The full code for all the above tricks (works with all Python 2 and 3 versions starting with 2.4):
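To make the math.pi remark concrete, here is a minimal sketch. The ConstantsHolder class is a hypothetical stand-in for the pseudo-module class instance, not part of the code below; it only shows that a property without a setter refuses assignment, while a plain module attribute does not.

```python
import math

# A plain module attribute is an ordinary variable: nothing prevents rebinding it.
original_pi = math.pi
try:
    math.pi = 3
    pi_was_rebound = (math.pi == 3)
finally:
    math.pi = original_pi  # always restore the real value


class ConstantsHolder(object):
    """Hypothetical stand-in for a pseudo-module exposing a read-only constant."""

    @property
    def my_constant(self):
        return 1


holder = ConstantsHolder()
try:
    holder.my_constant = 2
    rebinding_allowed = True
except AttributeError:
    # A property defined without a setter rejects assignment.
    rebinding_allowed = False
```

With this approach the constant stays readable through holder.my_constant, but any attempt to overwrite it raises AttributeError.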

# Original module constant
my_constant = 1

# Back up the original dictionary before doing anything else
_initial_dict = globals().copy()

# Declared as a property so that reading the attribute runs this function
@property
def my_constant(module):
    import warnings
    warnings.warn("my_constant is deprecated", DeprecationWarning)
    return module._initial_dict['my_constant']

# Inspired from
class _Module(object):
    def __init__(self):
        self.__dict__ = globals()
        self._initial_dict = _initial_dict

        # Transfer properties from the object to the Class
        for k, v in list(self.__dict__.items()):
            if isinstance(v, property):
                setattr(self.__class__, k, v)

        # Replace original module by our object
        import sys
        self._original_module = sys.modules[self.__name__]
        sys.modules[self.__name__] = self

# Custom help() replacement to display the help of the original module
# instead of the one of our instance object
class _MyHelper(object):

    def __init__(self, module):
        self.module = module
        self.original_help = help

        # Replace builtin help by ours
        try:
            import __builtin__ as builtins # Python 2
        except ImportError:
            import builtins # Python 3
        builtins.help = self

    def __repr__(self):
        return self.original_help.__repr__()

    def __call__(self, *args, **kwds):

        if args == (self.module,):
            import sys

            # Restore original module before calling help() otherwise
            # we don't get methods or classes mentioned
            sys.modules[self.module.__name__] = self.module._original_module

            ret = self.original_help(self.module._original_module, **kwds)

            # Reinstall our module
            sys.modules[self.module.__name__] = self.module

            return ret
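        elif args == (self,):
            return self.original_help(self.original_help, **kwds)
        else:
            return self.original_help(*args, **kwds)

# Install the pseudo-module and the custom help(), then clean up the namespace
_MyHelper(_Module())
del _MyHelper
del _Module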

Wednesday 22 October 2014

Blending metadata into vector formats

This post explores a few ideas I've had recently, and the resulting experiments, to put metadata (or arbitrary information) into vector GIS formats that have no provision for it. One typical such format is the good old Shapefile format. A shapefile generally consists of 3 files: a .shp file that contains the geometries, a .shx file that is an index from the shape number to the offset in the .shp file where the geometry is located (to allow fast retrieval by shape ID), and a .dbf file that contains the attributes of each shape.
Of course, the simplest way of adding metadata would be to put an additional file beside the 3 mentioned ones, but that would not be very challenging (plus there is the risk of losing it during a copy).
Most implementations require at least those 3 files to be present. Some allow the .dbf to be missing (e.g. GDAL/OGR). Some allow the .shx to be missing, like OpenJUMP, which doesn't read it even if it is available; this is both a feature and a drawback in situations where there are "holes" in the .shp due to editing.

A basic solution is to add our metadata at the end of one of those 3 files. I've done tests with GDAL/OGR (based on Shapelib), GeoTools 12.0, OpenJUMP 1.7.1 (whose shapefile reader is a forked version of the GeoTools one with changes), proprietary software code-named "GM" and proprietary software "AG":
.dbf: all 5 implementations are happy with extra content at the end of the file
.shp: all implementations are happy, except OpenJUMP, which opens the file but throws a warning because it tries to interpret the additional bytes as a shape
.shx: all 5 implementations are happy
So we have at least 2 possibilities that are rather portable.
It remains to be checked how they react in editing use cases, like adding new features to the shapefile. Regarding GDAL/OGR, I can say that it would overwrite the extra content at the end of the .dbf and the .shx. It would leave the extra content at the end of the .shp in place and write the new geometry after it.

What if we want to "link" the metadata to each feature in a way that is preserved when shapes are added? And for the sake of exploring more possibilities, we will rule out the data-at-end-of-file track. Interleaving data and metadata is not possible in the .dbf since the records are placed consecutively. The same goes for the .shx. In the .shp, we can try reserving some space between geometry records and make sure that the .shx index takes the holes into account. Because sizes and offsets in a shapefile are expressed in terms of 16-bit words, that extra space must be a multiple of 16 bits too. That works fine for all implementations, except OpenJUMP for the same reason as above. Hmm, and what if we incorporate the metadata not between the encoded geometries, but inside them? Each geometry record is indeed structured like this:

Shape Id: 4 bytes
Record length (number of 16 bit words after that field): 4 bytes
Record content: (2 * record length) bytes
    Shape Type: 4 bytes
    Variable payload according to shape type

We can try adding extra payload at the end of the record content while updating the record length to take it into account. One could have thought that implementations would strictly check that the declared record length is consistent with the shape type, but experiments (and code inspection of the 3 Open Source implementations) show that, when they check at all, they only check that the record length is greater than or equal to the minimum expected record length. So this works for the 5 implementations! At least on a layer with 2D polygons. That should also work for other 2D geometry types. 3D shapes consist of the 2D information, followed by the Z information, and optionally by the M (measure) information. M information is sometimes omitted when it is not present (this is the case with the OGR writer). So if we wanted to add metadata to 3D shapes, we would have to write dummy M information (writing not-a-number double values is commonly done to indicate that M information is invalid).
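The record layout above can be sketched with the struct module. This is an illustration under assumptions, not the code used for the experiments: it uses a 2D point record (shape type 1) rather than a polygon to stay short, the function names are hypothetical, and the byte orders (big-endian header, little-endian content) are those of the shapefile specification.

```python
import struct

POINT = 1  # shape type code for a 2D point


def build_point_record(shape_id, x, y, extra=b""):
    """Serialize one .shp point record, with optional extra bytes appended."""
    if len(extra) % 2:
        extra += b"\x00"  # record lengths are counted in 16-bit words
    content = struct.pack("<idd", POINT, x, y) + extra
    header = struct.pack(">ii", shape_id, len(content) // 2)
    return header + content


def parse_point_record(record):
    """Read a point record; whatever follows the point is returned as extra."""
    shape_id, length_words = struct.unpack(">ii", record[:8])
    content = record[8:8 + 2 * length_words]
    shape_type, x, y = struct.unpack("<idd", content[:20])
    return shape_id, shape_type, (x, y), content[20:]


record = build_point_record(1, 2.5, 48.0, b"meta!")
parsed = parse_point_record(record)
```

A reader that only checks that the record length is at least the minimum for the shape type (20 bytes here) reads the point and never looks at the trailing bytes.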

To go back to the .dbf file for a moment: the width of string fields is sometimes larger than strictly needed. The values are left-aligned in the field and the remaining space is padded with space characters. I've tried inserting a nul character right at the end of the string and putting the extra information after it. This works fine for the 3 C/C++ based shapefile readers (GDAL/OGR, GM, AG) since the nul character is conventionally used to terminate a string in C/C++. Unfortunately, that does not work with the 2 Java based implementations, which do not use that convention: the extra content is displayed after the field content.
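The nul-character trick can be illustrated on a single fixed-width field. The two reader functions below are simplified, hypothetical models of the two behaviours described above, not the actual code of any of the tested implementations.

```python
def pack_field(value, hidden, width):
    """Left-align value, append a nul and the hidden bytes, pad with spaces."""
    data = value + b"\x00" + hidden
    if len(data) > width:
        raise ValueError("field is too narrow for value and hidden content")
    return data.ljust(width, b" ")


def read_like_c(field):
    """Model of a reader that stops at the first nul character."""
    return field.split(b"\x00", 1)[0].rstrip(b" ")


def read_without_nul_convention(field):
    """Model of a reader that only strips the trailing padding."""
    return field.rstrip(b" ")


field = pack_field(b"Paris", b"src=survey", 20)
```

Here read_like_c(field) sees only the visible value, while read_without_nul_convention(field) also returns the nul and the hidden bytes, which matches the behaviour observed with the Java based readers.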

As we have started exploring modifying the data itself, let's return to the .shp file. One thing to consider is that coordinates in shapefiles are stored as double precision floating point numbers, stored on 64 bits using the IEEE-754 binary representation. Such numbers are decomposed as follows: 1 bit for the sign of the value, 11 bits for the exponent and its sign, and 52 bits for the mantissa. The mantissa is where the significant digits of the number are stored. How big is that? Let's go back to geography a bit. The Earth has a circumference of roughly 40 000 km. If we want to map features with a precision of 1 cm, we need 40 000 000 / 0.01 = 4 billion distinct numbers. 4 billion fits conveniently in a 32 bit integer (and the optimized OpenStreetMap .pbf format stores coordinates as 32 bit integers based on that observation). So 52 bits allow 2^(52-32) = 2^20, roughly 1 million times more numbers, i.e. a precision of 10^-8 meters = 10 nanometers! We could almost map every molecule located on the Earth's surface!
It is consequently reasonable to borrow the 16 least significant bits of the mantissa for another use. Said differently, for every 2D point/vertex we can reclaim 4 bytes without any noticeable loss of precision. Depending on the shape complexity, this might not be enough to store per-feature metadata. But on a typical shapefile, if we spread the metadata over the features, we can certainly store useful content. And the really great news is that this metadata would be preserved naturally in most format conversions (at least with GDAL/OGR, whose internal geometry representation also uses 64-bit floating point numbers, and probably with most other geometry engines), and by formats like Spatialite or GeoPackage that also use 64-bit floating point numbers. However, one must be aware that any other operation, like rescaling or reprojection, would completely change the least significant bits and erase our metadata.
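Borrowing the 16 least significant bits of a coordinate can be sketched as follows. The function names are hypothetical, and the sketch assumes finite coordinate values; it simply reinterprets the 64 bits of the double as an integer.

```python
import struct

LOW16 = 0xFFFF  # mask for the 16 least significant mantissa bits


def embed16(value, payload):
    """Return value with its 16 lowest mantissa bits replaced by payload."""
    bits, = struct.unpack("<Q", struct.pack("<d", value))
    bits = (bits & ~LOW16) | (payload & LOW16)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def extract16(value):
    """Return the 16 lowest mantissa bits of value."""
    bits, = struct.unpack("<Q", struct.pack("<d", value))
    return bits & LOW16


longitude = 2.3522219
tagged = embed16(longitude, 0xBEEF)
recovered = extract16(tagged)
```

Two bytes per coordinate give the 4 bytes per 2D vertex mentioned above. The tagged value differs from the original only far beyond the digits that matter for mapping, but any arithmetic on the coordinate would scramble those bits.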

Admittedly, this is not a new idea. People have explored similar ideas for digital watermarking and, more generally, steganography, typically used to embed copyright information or for source tracking (i.e. you generate a slightly different dataset for each customer, so that if a copy later becomes available for download, you can identify the origin of the leak), generally in an unnoticeable way. Using the least significant bits is the most basic technique, and it can be circumvented easily by just zeroing them or adding noise. More advanced techniques operate in the spectral domain, using the DCT (Discrete Cosine Transform), DFT (Discrete Fourier Transform) or DWT (Discrete Wavelet Transform). Some techniques have been specifically designed for GIS data, using topological properties for example. The common goal of those techniques is robustness against attempts to remove the watermark from the signal, at the expense of a reduced bandwidth for the inserted information. But for regular metadata, we do not need such guarantees, and using the least significant bits might be good enough and easy to implement.

Any other ideas ? Sure...

For polygons, the shapefile specification states that the vertices of the outer ring must be listed in clockwise order. But it does not specify which vertex of the outline must come first. Let's say that the top-most vertex of the polygon is numbered 0 (if several vertices share the same y coordinate, take the one with the minimum x), the following vertex in clockwise order is 1, etc. If our polygon has 16 vertices and we serialize it starting at vertex 11, we have encoded the number 11. Combined with the information from the following polygons, we can build a longer message. This idea could only work in practice for shapefiles with complex/dense enough polygons. If every polygon has 256 vertices, we can encode log2(256) = 8 bits per polygon. More generally, for a polygon with N vertices, we can encode log2(N) bits (rounded down to an integer). So we also need at least hundreds or thousands of polygons of that complexity to be able to encode something useful. The advantage of this technique is that it is robust to rescaling, and probably to most reprojections (at least those that globally preserve the appearance of shapes), provided that the shapes are rewritten in the same order as in the original data.
That technique could also be adapted to lines. Let's consider a line made of (V1,V2,...,Vn). We can for example simply build a multi-polyline of 2 parts, (V1,...,Vi) and (Vi,...,Vn), that will look visually like the original line and will encode the value i. The increase in binary encoding would be modest (4+8+8=20 extra bytes).
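The starting-vertex idea for polygons can be sketched like this. It is a minimal illustration with hypothetical function names, assuming the ring is given as a list of distinct (x, y) vertices in clockwise order, without the repeated closing vertex.

```python
def canonical_start(ring):
    """Index of vertex 0: the top-most vertex, ties broken by minimum x."""
    return min(range(len(ring)), key=lambda i: (-ring[i][1], ring[i][0]))


def capacity_bits(ring):
    """Number of whole bits that the choice of starting vertex can carry."""
    return len(ring).bit_length() - 1  # floor(log2(N))


def encode_start(ring, value):
    """Rotate the ring so that it starts at vertex number value."""
    n = len(ring)
    if not 0 <= value < n:
        raise ValueError("value must be in [0, number of vertices)")
    start = (canonical_start(ring) + value) % n
    return ring[start:] + ring[:start]


def decode_start(ring):
    """Recover the number encoded by the starting vertex of the ring."""
    return (-canonical_start(ring)) % len(ring)


square = [(0, 1), (1, 1), (1, 0), (0, 0)]  # clockwise, starts at vertex 0
rotated = encode_start(square, 2)
```

Decoding only depends on where the top-most vertex sits in the list, so it survives any transformation that keeps the same vertex on top and keeps the vertex order.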

Another technique might be to use repeated vertices. Let's consider a line or a polygon: while listing consecutive vertices, a repeated vertex encodes a 1 and a non-repeated one a 0. For example, a line made of the sequence of vertices (V1,V1,V2,V3,V4,V4,V5,V5,V6) would be equivalent to the binary number 100110. So we could encode as many bits as there are vertices in the geometry. If needed, we can also use more repetitions to encode more bits. For one bit per vertex, such a technique would increase the shapefile size by 50% on average (because on average, half of the bits in a message are 1). The metadata would be preserved perfectly by all coordinate transformations (geometry engines generally operate on vertices separately), but not by operations that remove duplicated vertices.
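The repeated-vertex encoding can be sketched as below. The function names are hypothetical, and the sketch assumes that the original geometry contains no consecutive duplicated vertices of its own.

```python
def encode_bits(vertices, bits):
    """Duplicate the i-th vertex when the i-th bit is 1."""
    if len(bits) > len(vertices):
        raise ValueError("not enough vertices to carry the message")
    out = []
    for i, vertex in enumerate(vertices):
        out.append(vertex)
        if i < len(bits) and bits[i]:
            out.append(vertex)
    return out


def decode_bits(vertices):
    """Read one bit per distinct vertex: 1 if it is repeated, else 0."""
    bits = []
    i = 0
    while i < len(vertices):
        repeated = i + 1 < len(vertices) and vertices[i + 1] == vertices[i]
        bits.append(1 if repeated else 0)
        i += 2 if repeated else 1
    return bits


line = ["V1", "V2", "V3", "V4", "V5", "V6"]
encoded = encode_bits(line, [1, 0, 0, 1, 1, 0])
```

This reproduces the example from the text: the encoded sequence is (V1,V1,V2,V3,V4,V4,V5,V5,V6) and decodes back to 100110.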

Finally, here's another idea, conceptually close to the one based on the starting vertex. Excluding implementations that don't rely on the .shx (I have no prejudice against them! Keep up the good work, folks!), we could use the order of shapes in the .shp to encode information. Traditionally, feature 1 appears first in the .shp, followed by feature 2, etc. But we could re-order the shapes as we wish, provided we make the .shx point to the right offsets in the .shp. If we have N shapes, there are N! (factorial(N) = N*(N-1)*(N-2)*...*2*1) ways of ordering them. So for N shapes, we can encode log2(N!) bits. In practice, for 10 shapes, that is 21 bits. For 100 shapes, 524 bits. For 1000 shapes, 8529 bits. And for 10000 shapes, 118458 bits. Advantages: works for all geometry types, no increase in file size. Drawbacks: possibly slower sequential reading because of apparently random seeking within the .shp, and it does not survive file conversion.

I've not mentioned it, but for nearly all the techniques above, especially the last ones, we would need to reserve a few bits for a CRC or some other integrity mechanism, so as to make sure that what we think is metadata really is. And all of them could potentially be combined!
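The capacity figures above, and one possible way of mapping a number to an ordering of shapes, can be sketched as follows. The mapping uses the factorial number system, which is my choice for the illustration and only one of several possible conventions; the function names are hypothetical.

```python
import math


def capacity_bits(n):
    """Whole number of bits that an ordering of n shapes can carry."""
    return math.factorial(n).bit_length() - 1  # floor(log2(n!))


def encode_order(n, message):
    """Return the ordering of shapes 0..n-1 that encodes the integer message."""
    if not 0 <= message < math.factorial(n):
        raise ValueError("message does not fit in an ordering of n shapes")
    remaining = list(range(n))
    order = []
    for radix in range(n, 0, -1):
        digit, message = divmod(message, math.factorial(radix - 1))
        order.append(remaining.pop(digit))
    return order


def decode_order(order):
    """Recover the integer encoded by an ordering of shapes."""
    remaining = sorted(order)
    message = 0
    for position, shape in enumerate(order):
        digit = remaining.index(shape)
        remaining.pop(digit)
        message += digit * math.factorial(len(order) - 1 - position)
    return message


capacities = [capacity_bits(n) for n in (10, 100, 1000, 10000)]
```

The natural order (0, 1, 2, ...) encodes the number 0, so a file that was never tampered with decodes to an all-zero message.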

Thursday 16 October 2014

Warping, overviews and... warped overviews

The development version of GDAL has lately received a few long awaited improvements in the area of warping and overview computation.

For those not familiar with GDAL, warping is mainly used for reprojecting datasets from a source coordinate system to a target one, or to create a "north-up" image from a rotated image or an image that has ground control points. Overviews in GDAL, also called pyramids in other GIS software, are sub-sampled (i.e. coarser resolution) versions of full-resolution datasets, mainly used for fast display when zooming out. Depending on the utility (warper or overview computation), different resampling methods are available: bilinear, cubic, cubicspline, lanczos, average, etc.

Cubic resampling

Up to now, the bi-cubic resampling algorithm used when computing warped images and overviews was a 4x4 convolution kernel. This was appropriate for warping when the dimensions of the target dataset are of the same order as those of the source dataset. However, if the target dataset was downsized (which is the nominal case for overview computation), the result was sub-optimal, not to say plainly bad, because not enough source pixels were captured, leading to a result close to what nearest neighbour would give. Now, the convolution kernel dynamically uses the subsampling ratio to take into account all source pixels that have an influence on each target pixel, e.g. 8x8 pixels when subsampling by a factor of 2.
Of course, this involves more computation and could be slower. Fortunately, for 64 bit builds, Intel SSE2 intrinsics come to the rescue to compute convolutions very efficiently.

For example, in GDAL 2.0dev, computing 5 overview levels on a 10474x4951 RGB raster with cubic resampling takes 2.4 seconds on a Core i5-750, compared with 3.8 seconds with GDAL 1.11:

$ gdaladdo -ro -r cubic world_4326.tif 2 4 8 16 32

To compare both results, we can select the 5th overview level with the brand new open option OVERVIEW_LEVEL=4 (indices are 0-based):

$ gdal_translate world_4326.tif out.tif -oo OVERVIEW_LEVEL=4

5th overview generated by GDAL 2.0dev

5th overview generated by GDAL 1.11.1

So yes, faster (a bit) and better (a lot) !

A similar result can also be obtained with:

$ gdalwarp -r cubic world_4326.tif out.tif -ts 328 155

The "-oo OVERVIEW_LEVEL=xxx" option can be used with gdalinfo, gdal_translate and gdalwarp, or with the new GDALOpenEx() API.

Related work could involve adding resampling method selection to the RasterIO() API, which currently only does nearest neighbour sampling. If that might interest you, please contact me.

Overviews in warping

Related to the OVERVIEW_LEVEL open option, another long overdue improvement is the selection of the appropriate overview level when warping. A typical use case is to start with a WMS or tiled dataset, e.g. the OpenStreetMap tiles, and to reproject the full or a partial extent to an image with reasonably small dimensions. Up to now, GDAL would always use the most precise dataset (typically zoom level 18 for OpenStreetMap), which made the operation terribly slow and impractical.

Now, the following will run in just a few seconds:

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.tif -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -ts 1000 1000

With the -ovr flag, you can modify the overview selection strategy and, for example, specify that you want to use the overview level immediately before the one that would have been automatically selected (i.e. with bigger dimensions, more precise):

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.tif -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -ts 1000 1000 -ovr AUTO-1

You can also specify a precise overview level to control the level of detail, which is particularly relevant in the case of OSM since the rendering depends on the scale:

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.tif -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -ts 1000 1000 -ovr 9

(Note: -ovr 9 is equivalent to OSM zoom level 8, since GDAL_overview_level = OSM_max_zoom_level - 1 - OSM_level, 9 = 18 - 1 - 8. )
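The relation in the note can be written as a tiny helper. The function name is hypothetical, and the default maximum zoom level of 18 is simply the value quoted in this post for the OpenStreetMap dataset.

```python
OSM_MAX_ZOOM_LEVEL = 18  # value quoted in the text for the OSM tile dataset


def overview_level_for_zoom(osm_zoom, max_zoom=OSM_MAX_ZOOM_LEVEL):
    """Return the value to pass to -ovr in order to read a given OSM zoom level."""
    level = max_zoom - 1 - osm_zoom
    if level < 0:
        # The maximum zoom level is the full-resolution dataset, not an overview.
        raise ValueError("zoom level %d has no overview index" % osm_zoom)
    return level


levels = [overview_level_for_zoom(zoom) for zoom in (8, 7, 6, 5)]
```

The resulting values, 9 to 12, are the ones used for the images that follow.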

With -ovr 9 (zoom level 8)

With -ovr 10 (zoom level 7)

With -ovr 11 (zoom level 6) or without any -ovr parameter

With -ovr 12 (zoom level 5)
(All above images are © OpenStreetMap contributors)

Overviews in warped VRT

Advanced GDAL users will perhaps know the Virtual Raster (.vrt) format. There are several flavors of VRT files; one of them is the so-called "warped VRT", which can be produced by "gdalwarp -of VRT". This is an XML file that captures the name of the source dataset being warped and the parameters of the warping: output resolution, extent, dimensions, transformer used, etc. This can be convenient for on-the-fly reprojection without needing to store the result of the reprojection. Similarly to regular warping, a warped VRT can now make use of the overviews of the source dataset to expose "implicit" overviews in the warped VRT dataset, which makes it possible to use a warped VRT in a GIS viewer with decent performance when zooming out. Among others, this will be beneficial to QGIS, which uses the "auto-warped-VRT" mechanism when opening a raster that is not a "north-up" dataset.

Still playing with our OpenStreetMap dataset, let's create a warped VRT around western Europe :

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.vrt -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -of VRT

We can see that the VRT now advertises overviews:

$ gdalinfo out.vrt
Size is 4767192, 4767192
Band 1 Block=512x128 Type=Byte, ColorInterp=Red
  Overviews: 2383596x2383596, 1191798x1191798, 595899x595899,
             297950x297950, 148975x148975, 74487x74487,
             37244x37244, 18622x18622, 9311x9311, 4655x4655,
             2328x2328, 1164x1164, 582x582, 291x291, 145x145,
             73x73, 36x36, 18x18

I'd like to thank Koordinates and Land Information New Zealand for funding those improvements.

Wednesday 1 October 2014

GDAL/OGR 1.11.1 released

On behalf of the GDAL/OGR development team, I am pleased to
announce the release of the GDAL/OGR 1.11.1 bug fix release.  This
release contains more than 80 bug fixes since the April 1.11.0 release.

The source is available at:

Details on the fixes in this release, and a security announcement, are
available at:

Sunday 21 September 2014

GeoTIFF tile de-duplication

Have you ever had the opportunity to work with a raster dataset that has world coverage, including oceans, and a resolution of 38 meters? With the World Mercator projection, the width and height of such a raster is 1 million pixels (1 048 576 exactly). If we also add 15 overviews (to go down to a 32x32 thumbnail), how big would such a raster be? More than 5 terabytes?! 1 million * 1 million * 4 (for RGBA) * 1.33 (space for overviews: 1/4 + 1/16 + ... = 0.333...). BigTIFF to the rescue?

You are wrong! Such a raster dataset can be as small as 1 392 764 bytes (1.3 MB) in standard GeoTIFF format (and can be further compressed to 78838 bytes once put in a .zip). And in that size, it can feature 1 431 655 765 (1.4 billion) GDAL logos. If you don't believe me, just download it now and check for yourself!

You should be able to display it at light speed in any reasonable desktop GIS. QGIS is one of them.

What is the recipe for such a file ? Simply (ab)using possibilities offered by the TIFF specification. Namely, for a tiled TIFF, for each resolution, there are 2 arrays : one that contains the location (offsets) of each tile data (TileOffsets tag), and another one the size of each tile data (TileByteCounts tag). Here we simply put the same value for the location of all tiles, and write just once a 2048x2048 tile (compressed with DEFLATE codec) that mosaics the 32x32 GDAL logo. Using just that leads to a file of size 2 971 732 bytes, much larger than needed. So we are going to abuse the TIFF specification even more. First by noticing that if the offset of the tile matches also its size, then we can use the same array for TileOffsets and TileByteCounts, thus saving (1048576/2048)^2*4 = 1048576 bytes. In that instance, the tile is 172094 bytes large, so we place it at offset 172094. And finally, we can also make the overviews point their TileOffsets and TileByteCounts tags to the single array of the full resolution. Actually, as the definition of the 16 TIFF directories ends at offset 3446, we have also 172094-3446 = 168 648 spare bytes !
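The figures quoted in this post can be re-derived with a few lines of arithmetic. This is only a sanity check of the numbers given in the text (dimensions, tile size, logo size, offsets); the variable names are mine.

```python
FULL_SIZE = 1048576   # width and height of the full resolution, in pixels
LEVELS = 16           # full resolution + 15 overviews
TILE_SIZE = 2048
LOGO_SIZE = 32
BYTES_PER_PIXEL = 4   # RGBA

# Uncompressed size of all levels: each level halves the previous dimensions.
total_raw_bytes = sum(
    (FULL_SIZE >> level) ** 2 * BYTES_PER_PIXEL for level in range(LEVELS)
)

# Number of 32x32 logos visible over all levels.
logo_count = sum(((FULL_SIZE >> level) // LOGO_SIZE) ** 2 for level in range(LEVELS))

# Size of one tile array (4 bytes per tile) at full resolution, i.e. what is
# saved by sharing a single array between TileOffsets and TileByteCounts.
shared_array_bytes = (FULL_SIZE // TILE_SIZE) ** 2 * 4

# Unused room between the end of the TIFF directories and the tile data.
spare_bytes = 172094 - 3446
```

total_raw_bytes comes out at a bit less than 5.9e12 bytes, which is where the "more than 5 terabytes" estimate comes from.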

Leaving this challenge aside, it could be interesting to have that tile de-duplication capability directly incorporated into the GDAL GeoTIFF driver, in a more user friendly way than the mix of GDAL and direct byte access that was used to build that file. A typical use case is creating a raster with a lot of oceanic area, where tiles are solid blue. Such a technique can be used when creating MBTiles, but the good old TIFF can also do it. If you are interested, contact me!

Friday 25 April 2014

GDAL/OGR 1.11.0 released

On behalf of the GDAL/OGR development team and community, I am pleased to
announce the release of GDAL/OGR 1.11.0.  GDAL/OGR is a C++ geospatial
data access library for raster and vector file formats, databases and
web services.  It includes bindings for several languages, and a variety
of command line tools.

The 1.11.0 release is a major new feature release with the following highlights:

 * New GDAL drivers:
    - KRO: read/write support for KRO KOLOR Raw format

 * New OGR drivers:
    - CartoDB : read/write support
    - GME / Google Map Engine : read/write support
    - GPKG / GeoPackage : read-write support (vector part of the spec.)
    - OpenFileGDB: read-only support (no external dependency)
    - SXF: read-only support
    - WALK: read-only support
    - WasP .map : read-write support

 * Significantly improved drivers: GML, LIBKML

 * RFC 40: enhanced RAT support
 * RFC 41: multiple geometry fields support
 * RFC 42: OGR Layer laundered field lookup
 * RFC 43: add GDALMajorObject::GetMetadataDomainList()       
 * RFC 45: GDAL datasets and raster bands as virtual memory mapping
 * Upgrade to EPSG 8.2 database

More complete information on the new features and fixes in the 1.11.0
release can be found at:

The new release can be downloaded from:
  * - source as a zip
  * - source as
  * - test
  * - documentation /

Wednesday 23 April 2014

Advanced JPEG-in-TIFF uses in GDAL

This post is about advanced uses of JPEG compression in TIFF/GeoTIFF files. We will call such files "JPEG-in-TIFF" for the sake of shortness.

JPEG-in-TIFF is a popular variation of TIFF, described in TIFF specification supplement 2, well-suited for aerial/satellite imagery, that exhibits an interesting quality / (size * decompression_time) ratio, while remaining a format simple to encode/decode with Free and Open Source software.

Side note: while JPEG 2000 is a much more capable compression format, F.O.S.S. is still trying to catch up with proprietary implementations, although the OpenJPEG library (which can be used through the GDAL JP2OpenJPEG driver) has made recent advances that make it worth considering.

JPEG-in-TIFF creation options

To go back to JPEG-in-TIFF, quality/size can be controlled by selecting :
  • appropriate subsampling and colorspace. For RGB "natural" images, a good choice is the YCbCr colorspace with subsampling by a factor of 2 on the chrominance difference components (YCbCr 4:2:0). This is the PHOTOMETRIC=YCbCr creation option in the GDAL GTiff driver. Using it typically makes the image 2 to 3 times smaller than with the default photometric interpretation (RGB)
  • the usual JPEG quality parameter that acts on the quantization coefficients. This is the JPEG_QUALITY creation option.
Generally, you will want to generate a tiled version of JPEG-in-TIFF (TILED=YES creation option), so as to allow efficient random access to parts of the image.

Implicit overviews

The very latest improvements added to the GDAL development version (trunk r27226 or later, already deprecating the soon-to-be-released GDAL 1.11) make it possible to get downsampled versions of JPEG-in-TIFF faster than before. Despite this improvement, the recommendation remains to generate overviews, either external or internal, with the gdaladdo utility, in order to have very fast access to downsampled versions of a raster (at the expense of increased storage space).

But what can we do when such overviews are not (yet) generated? Previously, the GTiff driver would decompress the queried part of the raster at its full resolution and compute a downsampled image from it. But this is slower than necessary.

Schematically (deliberately omitting the quantization and Huffman compression steps), a JPEG compressed stream is made of a sequence of squares of 8x8 (or 16x16 with YCbCr 4:2:0) pixels (the technical name for such a block is an MCU, Minimum Coded Unit) that contain the coefficients resulting from the Discrete Cosine Transform of the original 8x8 (16x16) pixels. To decompress an MCU to its full resolution, you need to compute the inverse DCT on the whole set of 8x8 (16x16) coefficients, which has some cost. But an interesting property of MCU coefficients is that you only need to operate on the low-frequency ones to compute a lower resolution version of the uncompressed block, and libjpeg, the software library that does the low-level job of compressing and decompressing the JPEG codestream, is capable of that! Actually, we had already used that capability in the JPEG driver of GDAL 1.10, to expose implicit overview levels (at x2, x4, x8 sub-sampling factors), but it was not yet plugged into the GTiff driver.
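The most extreme case of that property is easy to check by hand: for a x8 reduction, the single coefficient at position (0,0) of the DCT is enough, since it is proportional to the mean of the block. The sketch below uses a naive orthonormal DCT-II written for the illustration; it is not libjpeg's implementation, and it ignores the level shift and quantization applied by a real JPEG codec.

```python
import math

N = 8  # block size


def dct_2d(block):
    """Naive orthonormal 2D DCT-II of an N x N block."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs


# Arbitrary test block with values in the 0..255 range.
block = [[(3 * x + 5 * y) % 256 for x in range(N)] for y in range(N)]
coeffs = dct_2d(block)

block_mean = sum(sum(row) for row in block) / float(N * N)
pixel_from_dc = coeffs[0][0] / N  # x8 downsampled pixel from one coefficient
```

With this normalization the (0,0) coefficient equals 8 times the block mean, so the x8 downsampled pixel is obtained without computing any inverse DCT. The x2 and x4 reductions follow the same logic using a larger subset of the low-frequency coefficients.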

Now, JPEG-in-TIFF files, in all possible formulations (tiled / stripped / single-stripped, pixel-interleaved vs band-interleaved, single band vs YCbCr 4:2:0 vs RGB colorspace), will internally expose overview levels at x2, x4 and x8 sub-sampling factors for raster operations.

So computing the 1/16th reduction of a BMNG tile of size 21600x21600, with 256x256 tiling and JPEG RGB compression, now takes about 3.5s with the latest development version versus about 21s in GDAL 1.11:

GDAL trunk :
$ time gdal_translate world.topo.bathy.200406.3x21600x21600.B2.tif out.tif -outsize 6.25% 6.25%
real    0m3.441s

GDAL 1.11 :
$ time gdal_translate world.topo.bathy.200406.3x21600x21600.B2.tif out.tif -outsize 6.25% 6.25%
real    0m20.987s
Note that the whole JPEG codestream will still be read from storage, so the new optimization will be especially worthwhile when I/O speed is good relative to CPU speed (whereas with JPEG2000 compression, due to the way wavelet coefficients are packed, you only need to read a small portion of the file).

If you try gdalinfo on a JPEG-in-TIFF file, relax if you don't see the implicit overviews mentioned. They are hidden most of the time to avoid confusion: it would be difficult for users to distinguish between internal pre-computed overviews, which benefit from fast access, and the new implicit overviews. The latter are only made visible to the internals of the GTiff driver when a raster operation takes place.

Lossless conversion of JPEG into JPEG-in-TIFF

This is a feature that appeared in GDAL 1.10, released last year, but which has probably gone unnoticed in the NEWS. The conversion of a JPEG file to a JPEG-in-TIFF is done without a decompression and recompression cycle, through the preservation of the MCU coefficients, making it effectively lossless (the initial JPEG compression was lossy, but the conversion into JPEG-in-TIFF is lossless).
This optimized conversion path is taken if all the following conditions are met :
  • the source dataset is a JPEG file (or a VRT with a JPEG as a single SimpleSource)
  • the target dataset is a JPEG-in-TIFF file
  • no explicit target JPEG quality is specified
  • no change in colorspace is specified
  • no sub-windowing is requested
  • etc...
But it is compatible with the generation of a tiled JPEG-in-TIFF from the original JPEG image. Explicit assignment of the target SRS and bounds is also possible.

So, the following commands will use the lossless copy method :
$ gdal_translate in.jpg out.tif -co COMPRESS=JPEG

$ gdal_translate in.jpg out.tif -co COMPRESS=JPEG -co TILED=YES

$ gdal_translate in.jpg out.tif -co COMPRESS=JPEG -a_srs EPSG:4326 -a_ullr -180 90 180 -90
whereas the following commands will NOT :
$ gdal_translate in.jpg out.tif -co COMPRESS=JPEG -co JPEG_QUALITY=60

$ gdal_translate in.jpg out.tif -srcwin 0 0 500 500 -co COMPRESS=JPEG

Lossless extraction of JPEG tiles from JPEG-in-TIFF

The fresh new Python script (needs GDAL trunk) does (part of) the reverse operation. From a JPEG-in-TIFF, it can extract one particular tile/strip into a standalone JPEG file, and generate the companion .aux.xml file if the source JPEG-in-TIFF is georeferenced.

The following command will extract the tile at column 10 (count starts at 0), row 20 from a tiled JPEG-in-TIFF :
python world.topo.bathy.200406.3x21600x21600.B2.tif out_10_20.jpg 10 20
Or to extract all the tiles (filenames will have the out_X_Y.jpg pattern) :
python world.topo.bathy.200406.3x21600x21600.B2.tif out.jpg
This could be interesting for tiling servers that want to keep global mosaics as sources.

Note: this is not exactly the reverse operation from JPEG --> JPEG-in-TIFF conversion, since it will not merge several JPEG-in-TIFF strips/tiles into a single JPEG file.

Ideas for later...

Instead of the script, we could imagine that the lossless extraction of JPEG from JPEG-in-TIFF could be done, in a natural way, with :
gdal_translate -srcwin X Y XSIZE YSIZE in.tif out.jpg -of JPEG
That would require detecting a sub-windowing pattern in the temporary VRT generated by gdal_translate, and then reassembling the right MCU coefficients. X, Y, XSIZE and YSIZE should be multiples of 8 or 16 to match the MCU dimensions.

A more powerful, but even more complicated, idea would be to have first-class support in GDAL for the DCT coefficients (as raster bands?), but it would require some thinking to find the right model, and even more to implement it (with complications like YCbCr 4:2:0 subsampling).

In a similar vein, why not imagine:

gdal_translate mosaic_of_jpeg_images.vrt out.tif -co COMPRESS=JPEG
To make it easier, the VRT file should be made of JPEG tiles whose dimensions are a multiple of the MCU dimensions, and that are placed into the mosaic at offsets that are themselves multiple of the tile dimensions. An additional constraint is that all the JPEG tiles should share the same JPEG quantization and Huffman tables, since in JPEG-in-TIFF, those tables are common for all tiles/strips and placed in the JPEGTABLES TIFF tag.
Building a JPEG-in-TIFF from a mosaic in the GTiff driver might be tricky, but an ad-hoc Python script might be possible.

I will stop here with science-fiction. There is already enough to experiment !