lundi 25 mai 2015

GDAL 2.0 driver metadata

Among the many improvements that GDAL 2.0 will bring (hint: its beta2 is now available. Test it!) is the fact that vector drivers (OGR drivers) can now expose metadata about their capabilities, as a side effect of the GDAL/OGR unification. Previously, you had to refer only to the documentation page of each driver. Now, you can know them by exploring the driver metadata, which should make it possible to have automatically generated user interfaces (thinking to QGIS for example).

In addition to the existing dataset and layer creation options that were available for vector drivers, open options have been added. Those are user specified options provided to the driver when opening an existing dataset to modify its default behaviour. Up to now they used to be modeled through global configuration options, that were generally specified as environment variables. The new approach brings the benefit of being able to specify precisely for which dataset you want to apply the open option, and to be able to validate them against the published list of available options.

Let's look at the output of driver metadata for the PostgreSQL driver :

$ ogrinfo --format postgresql
Format Details:
  Short Name: PostgreSQL
  Long Name: PostgreSQL/PostGIS
  Supports: Vector
  Help Topic: drv_pg.html
  Supports: Open() - Open existing dataset.
  Supports: Create() - Create writeable dataset.
  Creation Field Datatypes: Integer Integer64 Real String Date DateTime Time IntegerList Integer64List RealList StringList Binary
  Supports: Creating fields with NOT NULL constraint.
  Supports: Creating fields with DEFAULT values.
  Supports: Creating geometry fields with NOT NULL constraint.

<CreationOptionList />

  <Option name="GEOM_TYPE" type="string-select" description="Format of geometry columns" default="geometry">
  <Option name="OVERWRITE" type="boolean" description="Whether to overwrite an existing table with the layer name to be created" default="NO" />
  <Option name="LAUNDER" type="boolean" description="Whether layer and field names will be laundered" default="YES" />
  <Option name="PRECISION" type="boolean" description="Whether fields created should keep the width and precision" default="YES" />
  <Option name="DIM" type="integer" description="Set to 2 to force the geometries to be 2D, or 3 to be 2.5D" />
  <Option name="GEOMETRY_NAME" type="string" description="Name of geometry column. Defaults to wkb_geometry for GEOM_TYPE=geometry or the_geog for GEOM_TYPE=geography" />
  <Option name="SCHEMA" type="string" description="Name of schema into which to create the new table" />
  <Option name="SPATIAL_INDEX" type="boolean" description="Whether to create a spatial index" default="YES" />
  <Option name="TEMPORARY" type="boolean" description="Whether to a temporary table instead of a permanent one" default="NO" />
  <Option name="UNLOGGED" type="boolean" description="Whether to create the table as a unlogged one" default="NO" />
  <Option name="NONE_AS_UNKNOWN" type="boolean" description="Whether to force non-spatial layers to be created as spatial tables" default="NO" />
  <Option name="FID" type="string" description="Name of the FID column to create" default="ogc_fid" />
  <Option name="FID64" type="boolean" description="Whether to create the FID column with BIGSERIAL type to handle 64bit wide ids" default="NO" />
  <Option name="EXTRACT_SCHEMA_FROM_LAYER_NAME" type="boolean" description="Whether a dot in a layer name should be considered as the separator for the schema and table name" default="YES" />
  <Option name="COLUMN_TYPES" type="string" description="A list of strings of format field_name=pg_field_type (separated by comma) to force the PG column type of fields to be created" />

  Connection prefix: PG:
  <Option name="DBNAME" type="string" description="Database name" />
  <Option name="PORT" type="int" description="Port" />
  <Option name="USER" type="string" description="User name" />
  <Option name="PASSWORD" type="string" description="Password" />
  <Option name="HOST" type="string" description="Server hostname" />
  <Option name="ACTIVE_SCHEMA" type="string" description="Active schema" />
  <Option name="SCHEMAS" type="string" description="Restricted sets of schemas to explore (comma separated)" />
  <Option name="TABLES" type="string" description="Restricted set of tables to list (comma separated)" />
  <Option name="LIST_ALL_TABLES" type="boolean" description="Whether all tables, including non-spatial ones, should be listed" default="NO" />

For drivers that do not work directly on filenames, they expose a connection prefix, "PG:" in that instance (can be discovered through the GDAL_DMD_CONNECTION_PREFIX="DMD_CONNECTION_PREFIX" driver metadata item). It also lists the various open options available (through GDAL_DMD_OPENOPTIONLIST="DMD_OPENOPTIONLIST" driver metadata item). In the above example, most open options had to be passed in the connection string, with the exception of LIST_ALL_TABLES that was available as the PG_LIST_ALL_TABLES configuration option.

This is enough to know that "ogrinfo PG: -oo DBNAME=autotest -oo PORT=5432" is a correct syntax. The historical "ogrinfo 'PG:dbname=autotest port=5432'" syntax is of course preserved for backward compatiblity, but if you switch between PostreSQL/MySQL/OCI, you had to remember subtle differences. For example "MYSQL:autotest,port=3306". Now by exploring metadata, you can see you can use "ogrinfo MYSQL: -oo DBNAME=autotest -oo PORT=3306"

Admitedly not all drivers have been modified to use those new capabilities, or some obscure configuration options might not yet be available through open options, but at least the mechanism now exists.

The -oo option can be used with most GDAL (a few raster drivers have also open options such as the PDF driver) and OGR utilities. For ogr2ogr, in update or append mode, you can also use -doo (destination open option) for the target dataset.

Worth noting: we also have now dual drivers that can accept both raster & vector data.

  PCIDSK -raster,vector- (rw+v): PCIDSK Database File
 JP2ECW -raster,vector- (rw+v): ERDAS JPEG2000 (SDK 3.x)
 JP2OpenJPEG -raster,vector- (rwv): JPEG-2000 driver based on OpenJPEG library
  JPEG2000 -raster,vector- (rwv): JPEG-2000 part 1 (ISO/IEC 15444-1), based on Jasper library
  PDF -raster,vector- (rw+vs): Geospatial PDF
  GPKG -raster,vector- (rw+vs): GeoPackage
  PLSCENES -raster,vector- (ro): Planet Labs Scenes API
  HTTP -raster,vector- (ro): HTTP Fetching Wrapper

Depending on the open flags provided to the new GDALOpenEx() and the actual content of the dataset, GDALOpenEx() will return a NULL handle for example if you asked for vector data only and there is only raster data (this is behaviour of the legacy OGROpen() format).
If you wonder why JPEG 2000 drivers are listed, this is because now that GDAL supports the GMLJP2 v2 standard, vector features can be embedded into/read from a GMLJP2 v2 box.

mardi 17 février 2015

GDAL 1.11.2 released

On behalf of the GDAL/OGR development team, I am pleased to
announce the release of the GDAL/OGR 1.11.2 bug fix release.  This
release contains more than 60 bug fixes since the October 2014 1.11.1 release.

The source is available at:

The GDAL GRASS plugin package has also been updated (GRASS
7.0 compatibility) :

Details on the the fixes in this release are available at:

jeudi 18 décembre 2014

JPEG-in-TIFF with mixed qualities

Some time ago, I wrote about advanced JPEG-in-TIFF uses in GDAL, i.e. TIFF files compressed with the JPEG codec. Recently, I've got again to that topic when I realized that recent libtiff versions produced files a bit bigger than necessary by repeating the quantization tables in each tile or strip, instead of omitting them in favor of the global quantization tables that are stored in a global location, the JpegTables TIFF tag, shared by all tiles or strips. This issue is now solved.

A quantization table is a set of 64 integer coefficients that are used to divide values coming from the Discrete Cosine Transform, before passing them to the Huffman compression stage and finally reaching the JPEG binary stream. The higher the coefficients, the lower the quality. The decoder must have the quantization table used by the encoder to properly recompute the values needed to do the Inverse Discrete Cosine Transform. The TIFF specification supplement 2 allows the quantization table to be stored in the JpegTables TIFF tag for all tiles or strips. Up to now, I thought this was a requirement, but indeed, reading more closely the specification, I found this was only an option. So there are several legit possibilities :
  1. centralized quantization tables stored in JpegTables TIFF tag, and none in each tile/strip. This is what GDAL generates with its internal version of libtiff (or if using older libtiff, such as 3.8.2, or the to-be-released libtiff 4.0.4) will now produce.
  2. centralized quantization tables stored in JpegTables TIFF tag, and redefined in some tile/strip if needed.
  3. no JpegTables TIFF tag, and quantization tables defined in each tile/strip.
Options 2 and 3 makes it possible to have TIFF files with strips/tiles of different qualities. Actually even option 1 could be used to have up to 2 different qualities. Why 2 ? The reason is that JPEG specification allows a maximum of 4 different quantization tables (also technically it could reference up to 16 tables, but the limit of 4 is actually enforced by libjpeg, so better not play with that!). For RGB imagery compressed in the YCbCr color space, libjpeg uses two quantization tables: one for the Y component and another one for the Cb and Cr components. Hence the capability of storing a set of table for 2 quality settings.

So, as mentionned in my previous article about GeoPackage, it would actually be possible to generate JPEG-in-TIFF with different qualities depending on area(s) of interest, for example medium quality (and high compression) on most areas, and high quality on specific spots. I would expect such variants to have good compatibility with existing readers of JPEG-in-TIFF.

If you are interested in exploring that, you can contact me.

lundi 15 décembre 2014

GDAL GeoPackage raster support


One of the recent additions in GDAL development version (2.0dev) is the support for raster tiles in the GeoPackage format.

A bit of history

The GeoPackage format has been adopted this year as a OGC standard and covers storage of both raster and vector data within the same SQLite3 container. This is not a completely revolutionary approach as there were precedents (I may have forgotten some!) :
- storage of vector format in SQLite3 database was at least done by FDO (with storage of geometries as WKT, WKB or Autodesk own geometry binary format "FGF") and Spatialite (in its own Spatialite geometry binary format, somewhat derived from WKB, but incompatible, and with compressed variants as well). Spatialite introduced the use of SQlite virtual RTree tables to implement spatial index.  Both formats have been supported for long by the OGR SQLite driver. GeoPackage vector adds yet another geometry binary format, GPB (GeoPackageBinary) that consists of a header followed by actual WKB, and borrowed the idea of RTree for spatial index from Spatialite (which was a candidate implementation in early draft versions)
- storage of rasters in SQLite3 database was at least done by MBTiles, with raster tiles being stored as PNG or JPEG tiles as BLOB records, and a multi zoom level tile indexing scheme. GeoPackage raster support clearly derives from that design choice, and use the same naming of columns in the tile tables, but with various improvements (that will be perceived as defects by proponents of MBTiles simplicity), such as support for arbitrary reference spatial systems (MBTiles is bound to Google Mercator projection) and custom tiling schemes... and also a subtle semantic difference that we will detail afterwards. GeoPackage can also support several raster tables within the same container. One current limitation of GeoPackage is that currently only images that have 8-bit depth per channel, limited to R,G,B,A color space are supported, which prevents from storing DEMs or multi-spectral imagery. So there is still place for other solutions, such as Rasterlite 2, the raster-side of Spatialite, which offers a variety of storage formats, supported bit depths, multiple regions with different resolutions within the same coverage, etc...

In addition to the above, GeoPackage introduces metadata (with a potential hierarchical organization), ways of expressing layer schema and constraints (beyond SQL capabilities provided by SQLite 3), and a formalism to define extensions to the specification, both for a few standardized official extensions and more "proprietary" extensions.

GDAL 1.11 already had support for most of GeoPackage vector specification. More recent developments have added support for spatial index, curve geometries, and aspatial tables.

Raster support was still missing. This is now the case in the latest development version.

GDAL GeoPackage raster support

GeoPackage is now one of the very few GDAL drivers to support both raster and vector with the same "Dataset" object, which is now possible since the unification of the GDAL and OGR driver models. This means that you only have to open it once to explore its content, and not once with the GDAL API, and another one with the OGR API. I should note that the way we handle multiple raster "layers" in the GDAL API through the subdataset mechanism, which requires multiple Open() calls, is probably not yet optimal compared to the vector layer API. Food for thought...

The GeoPackage raster driver has the following capabilities :
- reading, creation... and update of raster layers. That means that you can use GeoPackage as the direct output format of any GDAL utility or API, gdal_translate, gdalwarp, etc...
- on-the-fly conversion in both ways between R,G,B,A colorspace exposed by GDAL and the storage tile format :
    * grey level PNG or JPEG tiles,
    * grey level with alpha band PNG,
    * RGB PNG, JPEG or WebP tiles (storage as WebP tiles is one of the official extensions to the baseline specification),
    * RGBA PNG or WebP tiles
    * 8-bit quantized (256 color palette) PNG tiles
- on creation/update, it is possible to use a strategy where JPEG tiles are used when the tile content is fully opaque, and PNG when there is transparency involved. Note that fully transparent tiles are not stored at all, as allowed by the specification, to allow efficient sparse storage.
- use, creation and update of multiple zoom levels (known as overviews in GDAL, or pyramids) for fast zoom-in/zoom-out operations. Including support for overview levels whose resolution does not necessarily differ by power-of-two factors ( "gpkg_other_zoom" extension in GeoPackage terminology )
- reading, creation and update of metadata (in a "flat" way, i.e. ignoring the potential hierarchy of metadata records)
- reading and creation of several tiling schemes. By default, the driver will create rasters with a tiling scheme that perfectly fits the resolution and spatial registration of the input dataset being converted to GeoPackage, so that no loss of image quality (if using PNG storage) or georeferencing occurs. But for some uses, adoption of more "universal" tiling schemes with world coverage might be desirable to ease overlapping of several raster coverages, or extending the spatial extent of an existing raster. A few such tiling schemes are available, such as the popular GoogleMapsCompatible on (reused from WMTS specification). Note: although GDAL should have no problem with it, some tests have shown that another available tiling scheme, GoogleCRS84Quad, might be difficult to handle for other implementations of the specification, so better not use it until its relevance for GeoPackage has been clarified.

Full documentation of the driver page is available for those who want to explore its capabilities. All in all, pretty much everything in the raster part of the specification has been implemented.

Tiles and blocks

One of the toughest part of the implementation was the update of tiles when the origin of the GDAL area of interest does not match the corner of a tile. This is the general case when using a global tiling scheme, where the origin of a raster can be anywhere within a tile. GDAL internally uses its own tiling system, with raster "blocks" (in red in the below drawing). In the case of GeoPackage, of course we choose block dimensions that exactly matches GeoPackage tiles. But due to that possible shift of origins, a GDAL block can potentially overlap 4 GeoPackage tiles (in blue). And to add more fun, the dimensions of GDAL rasters (black rectangle) are not necessarily a multiple of the block size.

Filling GDAL blocks from GeoPackage tiles is relatively easy : you figure out which 4 tiles are needed, read them and composite the interesting pixels into the GDAL block. For more performance we cache the last read tiles so that if reading the file from left to right (typical way GDAL algorithms process a raster), we only need to load 2 new tiles, instead of 4, for each GDAL block.
Regarding writing of tiles from GDAL blocks, a naive implementation would reload, update and recompress each of those 4 tiles each time a block is updated, but besides the fact that this would be time consuming, this would introduce repeated steps of image quality loss when using lossy compression formats, such as JPEG or WebP (or 8-bit PNG), as the storage format. Instead, we use a temporary database of uncompressed and partially updated tiles, and wait for a tile to be completely updated (and that considering its 4 R,G,B,A components) before compressing it to its final storage.

What's next ?

Potential future enhancements, on the raster as well as vector side, could consist in :
- studying how de-duplication of tiles (e.g. avoiding storing multiple times a fully blue tile when mapping oceanic areas) could be done. This is likely possible through updatable views and is a feature of MBTiles.
- as tiles are completely independant from each other, it might possible to have different quality settings of JPEG/WebP compression per area(s) of interest.
- adding reading and writing of metadata at the vector layer level. This should now be possible with the GDAL-OGR unification. That would require an update in ogrinfo to display such metadata, and in ogr2ogr to define and propagate them.
- use of schema/column constraints in vector layers.
- more technical: creating the triggers that correspond to the gpkg_geometry_type_trigger and gpkg_srs_id_trigger extensions, that can be used to ensure consistency of geometries.
- and likely keeping up with future versions of the specification.


The fun part ! Two formats in one

Ah, and for those who remembered and reached that part of the article, I mentionned that there was a subtle difference between GeoPackage rasters and MBTiles. I can hear you : "which one??" Well, by using the GoogleMapsCompatible tiling scheme, mandatory for MBTiles, in a GeoPackage, could not we manage to have a same container that is at the same time both a valid GeoPackage and MBTiles ? The answer is yes, but that requires some trickery. A key difference is that MBTiles was based on the OSGeo TMS (Tile Map Service ) specification that decided to use a tile row number of 0 for the bottom most tile of the tiling, whereas the later OGC WMTS specification on which GeoPackage is based decided that it would be the top most tile! Fortunately with some SQL, we can do the renumbering. Demo :

$ gdal_translate test.tif test.gpkg -of GPKG -co TILE_FORMAT=PNG -co TILING_SCHEME=GoogleMapsCompatible

$ gdaladdo -r cubic test.gpkg 2 4
$ ogrinfo test.gpkg  -sql "CREATE VIEW tiles AS SELECT test.zoom_level, tile_column, tm.matrix_height-1-tile_row AS tile_row, tile_data FROM test JOIN
gpkg_tile_matrix tm ON test.zoom_level = tm.zoom_level AND tm.table_name = 'test'"
$ ogrinfo test.gpkg  -sql "CREATE TABLE metadata(name TEXT, value TEXT)"

$ ogrinfo test.gpkg  -sql "INSERT INTO metadata VALUES('name', 'my_tileset')"
$ ogrinfo test.gpkg  -sql "INSERT INTO metadata VALUES('type', 'overlay')" // or 'baselayer'
$ ogrinfo test.gpkg  -sql "INSERT INTO metadata VALUES('version', '1.1')"
$ ogrinfo test.gpkg  -sql "INSERT INTO metadata VALUES('description', 'description')"
$ ogrinfo test.gpkg  -sql "INSERT INTO metadata VALUES('format', 'PNG')"

And the result is also now a valid MBTiles dataset (while still being a valid GeoPackage), that can for example be opened by the GDAL MBTiles driver (after renaming to .mbtiles, or creating a symbolic link, since the MBTiles driver will not try to open a .gpkg file).

Finally, I would like to thank Safe Software for having financially supported this work on GeoPackage raster support.

jeudi 6 novembre 2014

Hacking Python module constants

I've spent a whole afternoon figuring out how to add a deprecation warning when using a constant attached to a Python module, so here's how to do it in case you would need to do something similar.

The use case is the following : as GDAL RFC 49 introduces curve geometries, I wanted the user to be warned if he uses the now deprecated ogr.wkb25DBit constant, since the new geometry types no longer use the most significant bit of the value to indicate the Z dimension.

from osgeo import ogr

ogr.wkb25DBit DeprecationWarning: deprecated: use ogr.GT_Flatten(), ogr.GT_HasZ() or ogr.GT_SetZ() instead
  warnings.warn("deprecated: use ogr.GT_Flatten(), ogr.GT_HasZ() or ogr.GT_SetZ() instead", DeprecationWarning)

The issue is that ogr.wkb25DBit was a variable assigned to a module, and you cannot do anything to "attach" a function that would be evaluated at runtime when the constant is used.

After several non conclusive attempts, the solution finally came from this article. Basically, you can use the sys.modules dictionnary to substitute the original module by a class instance (called pseudo-module class instance in the rest of this writing). And on the class of this class instance, you can define a property that will call a function when it is read. To get all other global functions and classes of the original module, you can copy the global dictionnary of the module to the pseudo-module class instance, and here you are ! If this summary does not make sense, the above mentionned article explains that in greater details.

My personal touch is a further improvement. The above trick is nice, but when from the Python console, you call help(ogr), it displays the help of the pseudo-module class instance, and not the one of the original module. So you loose the help of all other constants, functions and classes. Almost everything in fact.

But even Python builtins functions like help() can be replaced by a custom version. The custom help() tests if the object passed is the pseudo-module class instance, in which case it substitutes it temporarily with the original module  before calling the original help().

Apart from issuing deprecation warnings, such a technique can be usefull to make module constants really constants. Did you know that you can affect another value to math.pi... ?

The full code for all above tricks (works with all Python 2 and 3 versions starting with 2.4) :

# Original module constant
my_constant = 1

# Backup original dictionnary before doing anything else
_initial_dict = globals().copy()

def my_constant(module):
    import warnings
    warnings.warn("my_constant is deprecated", DeprecationWarning)
    return module._initial_dict['constant']

# Inspired from
class _Module(object):
    def __init__(self):
        self.__dict__ = globals()
        self._initial_dict = _initial_dict

        # Transfer properties from the object to the Class
        for k, v in list(self.__dict__.items()):
            if isinstance(v, property):
                setattr(self.__class__, k, v)

        # Replace original module by our object
        import sys
        self._original_module = sys.modules[self.__name__]
        sys.modules[self.__name__] = self

# Custom help() replacement to display the help of the original module
# instead of the one of our instance object
class _MyHelper(object):

    def __init__(self, module):
        self.module = module
        self.original_help = help

        # Replace builtin help by ours
            import __builtin__ as builtins # Python 2
        except ImportError:
            import builtins # Python 3 = self

    def __repr__(self):
        return self.original_help.__repr__()

    def __call__(self, *args, **kwds):

        if args == (self.module,):
            import sys

            # Restore original module before calling help() otherwise
            # we don't get methods or classes mentionned
            sys.modules[self.module.__name__] = self.module._original_module

            ret = self.original_help(self.module._original_module, **kwds)

            # Reinstall our module
            sys.modules[self.module.__name__] = self.module

            return ret
        elif args == (self,):
            return self.original_help(self.original_help, **kwds)
            return self.original_help(*args, **kwds)

del _MyHelper
del _Module

mercredi 22 octobre 2014

Blending metadata into vector formats

This post explores a few ideas, and the resulting experiments, I've had recently to put metadata (or arbitrary information) into vector GIS formats that have no provision for them. One typical such format is the good-old Shapefile format. A shapefile generally consists in 3 files, a .shp file that contains the geometries, a .shx that is an index from the shape number to the offset in the .shp file where the geometry is located (to allow fast retrieval by shape ID) and a .dbf file that contains the attributes of each shape.
Of course, the most simple way of adding metadata would be to but an additional file besides the 3 mentionned ones, but that would not be very challenging (plus the risk of losing it during copy).
Most implementations require at least those 3 files to be present. Some allow .dbf to be missing (e.g. GDAL/OGR). Some allow .shx to be missing, like OpenJUMP which doesn't read it even if it is available, which is both a feature and a drawback in situations when there are "holes" in the .shp due to editing.

A basic solution is to add our metadata at the end of one of those 3 files. I've done tests with GDAL/OGR (based on Shapelib), GeoTools 12.0, OpenJUMP 1.7.1 (whose shapefile reader is a forked version of the GeoTools one with changes), proprietary software code-named "GM" and proprietary software "AG"
.dbf : all 5 implementations are happy with extra content at the end of the file
.shp : all implementations happy, except OpenJUMP that opens the file, but throws a warning because it tries to interprete the additional bytes as shape.
.shx : all 5 implementations are happy
So we have at least 2 possibilities that are rather portable.
It should be checked how they react in editing use cases, like adding new features to the shapefile. Regarding GDAL/OGR, I can say that it would overwrite the extra content at the end of the .dbf and the .shx. It would let the extra content at the end of the .shp to write the new geometry afterwards.

What if we want to "link" the metadata per feature in a way where it is preserved if shapes are added ? And for the sake of exploring more possibilities, we will exclude using the data-at-end-of-file track. Interleaving data and metadata is not possible in .dbf since the records are placed consecutively. Same for .shx. In .shp, we can try reserving some space between all geometry records and make sure that the .shx index takes the holes into account. Due to the fact that size and offsets in shapefile are expressed in term of 16 bit words, that extra space must be a multiple of 16 bit too. That works fine for all implementations, except OpenJUMP for the same reason as above. Hum, and what if we incorporate the metadata, not between the encoded geometries, but inside them ? Each geometry record is indeed structured like this :

Shape Id: 4 bytes
Record length (number of 16 bit words after that field): 4 bytes
Record content: (2 * record length) bytes
    Shape Type: 4 bytes
    Variable payload according to shape type

We can try adding extra payload at the end of record content while still updating record length to take into account. We could have thought that implementations strictly checks that the declared record length is consistant with the shape type, but experimentations (and code inspection on the 3 Open Source implementations) show that, when they check, they check that the record length is at least greater or equal to the minimum expected record length. So this works for the 5 implementations ! At least on a layer with 2D polygons. That should also work for other 2D geometry type. 3D shapes consist in the 2D information, followed by the Z information, and optionaly by the M(eausre) information. M information is sometimes omitted when it is not present (this is the case of the OGR writer). So if we would want to add metadata for 3D shapes, we would have to write dummy M information (writting not-a-number double values is commonly done to indicate that M information is invalid).

To go back to .dbf file a bit, sometimes the width of fields of string type is larger than strictly needed. The values are left aligned in the field and remaining space is padded with space characters. I've tried to insert a nul character just at the end of the string, and put the extra information afterwards. This works fine for the 3 C/C++ based shapefile readers (GDAL/OGR, G.M., A.G) since nul character is conventionnaly used to terminate a string in C/C++. Unfortunately that does not work with the 2 Java based implementations that do not use that convention : the extra content is displayed after the field content.

As we have started exploring modifying the data itself, let's return to .shp file. One thing to consider is that coordinates in shapefiles are stored as double precision floating point numbers, stored on 64 bits using the IEEE-754 binary representation. Such numbers are decomposed like the following : 1 bit for the sign of the value, 11 bits for the exponent and its sign and 52 bits for the mantissa. The mantissa is where the significand precision of the number is stored. How big is that ? Let's go back to geography a bit. The Earth has rougly a circonference of 40 000 km. If we want to map features with a precision of 1 cm, we need 40 000 000 / 0.01 = 4 billion distinct numbers. 4 billion fits conveniently on a 32 bit integer (and OpenStreetMap .pbf optimized format store coordinates on 32 bit integers based on that observation). So 52 bits allow 2^(52-32)=2^20, roughly 1 million more numbers, i.e. a precision of 10^-8 meters = 10 nanometers ! We could almost map every molecule located on the Earth surface !
It is consequently reasonable to borrow the 16 least significant bits from the mantissa for other use. Said differently for every 2D point/vertex, we can get back 4 bytes without any noticeable loss of precision. Depending on the shape complexity, this might be not big enough to store per-feature metadata. But on a typical shapefile, if we spead the metadata over the features, we can certainly store useful content. And the really great news is that this metadata would be preserved naturally in most format conversions (at least with GDAL/OGR whose internal geometry representation also uses 64-bit floating point numbers, and probably most other geometry engines), and for formats like Spatialite or GeoPackage that also use 64-bit floating point numbers. However, one must be aware than any other operation like rescaling or reprojection would completely change the least significant bits and erase our metadata.

Admitedly this is not a new idea. People have explored similar ideas for digital watermarking and more generally steganography, typically used to embed copyright information or source tracking (i.e. you generate a slightly different dataset for each customer, hence if a copy is then available for download, you can identify the origin of the leak), generally in a not noticeable way. Using least significant bits is the very basic technique, that can be circumvented easily by just zeroing them or adding noise. More advanced technique operate in the spectral domain, like DCT (Discrete Cosine Transform), DFT (Discrete Fourier Transform) or DWT (Discrete Wavelet Tranform). Some techniques have been specifically designed for GIS data, using topological properties for example. The common target of those techniques is to have robustness against attempts of removing the watermark from the signal, at the expense of a reduced bandwith for the inserted information. But for regular metadata, we do not need such guarantee and the use of least-significant bits might be good enough and easily implemented.

Any other ideas ? Sure...

For polygons, the shapefile specification states that the vertices of the outer ring must be listed in clockwise order. But it does not specify which vertex of the outline must be the first one. Let's consider that the top-most vertex of the polygon is numbered 0 (if there are several vertices with the same y coordinate, let's take the one of them with the minimum x), the following vertex in clockwise order is 1, etc... If our polygon has 16 vertices, and we serialize it starting at vertex 11, we have coded the 11 number. Combined with information of following polygons, we can build a longer message. This idea could only work in practice for shapefiles of complex/dense enough polygons. If every polygon has 256 vertex, we can encode log2(256)=8 bits per polygon. More generally, for a polygon with N vertex, we could encode log2(N) bits (rounded to inferior integer). So we need also at least hundreds or thousands of polygons of that complexity to be able to encode something useful. The advantage of this technique is that it is robust to rescaling, and probably most reprojections (at least the one that globally preserve the appearance of shapes), provided that the shapes are rewritten in the same order as in the original data.
That technique could also be adapted for lines. Let's consider a line made of (V1,V2,....Vn). We can for example simply build a multi-polyline of 2 parts (V1,...Vi) and (Vi,....,VN) that will visually looks like the original line and will encode for the i value. The increase in binary encoding would be modest (4+8+8=20 extra bytes).

Another technique might be to use repeated vertices. Let's consider a line or a polygon: if while listing consecutive vertices, they are repeated, this would encode a 1 value. Otherwise 0. For example, if a line is made of the sequence of vertices (V1,V1,V2,V3,V4,V4,V5,V5,V6), it would be equivalent to binary number 100110. So we could encode as many bits as vertices in the geometry. If needed, we can also use more repetitions to encode more bits. For one bit per vertex, on average such a technique would increase shapefile size by 50% (because on average, half of bits in a message are 1). It would preserve metadata perfectly for all coordinate transformations (geometry engines generally operate on vertices separately). But not to operations that would remove duplicated vertices.

Finally, here's another idea, conceptually close to the one based on the starting vertex. Excluding implementations that don't rely on the .shx (I've no prejudice against such one ! Keep on good work folks !), we could use the order of shapes in the .shp to encode information. Traditionnaly, feature 1 appears first in the .shp, followed by feature 2, etc... But we could re-order the shapes as we wish, provided we make the .shx point to the right offset in the .shp. If we have N shapes, there are N! (factorial(N) = N*(N-1)*(N-2)*...2*1) ways of ordering them. So for N shapes, we can encode log2(N!) bits. In practice for 10 shapes, that is 21 bits. For 100 shapes, 524 bits. For 1000 shapes, 8529 bits. And for 10000, 118458. Advantages: works for all geometry types, no increase in file size. Inconvenients: possibly less performant sequential reading because of apparently random seeking within the .shp, doesn't resist to file conversion.

I've not mentionned it, but for nearly all mentionned techniques, especially the last ones, we would need to reserve a few bits to insert a CRC or any other integrity mechanism, so as to make sure that we think is metadata really is. And all them could be potentially combined !

jeudi 16 octobre 2014

Warping, overviews and... warped overviews

The development version of GDAL has lately received a few long awaited improvements in the area of warping and overview computation.

For those non familiar with GDAL, warping is mainly used for reprojecting datasets from one source coordinate system to a target one, or to create a "north-up" image from a rotated image or an image that has ground control points. Overviews in GDAL are also called pyramids in other GIS software and are sub-sampled (i.e. with coarser resolution) versions of full-resolution datasets, that are mainly used for fast display in zooming out operations. Depending on the utility (warper or overview computation), different resampling methods are available : bilinear, cubic, cubicspline, lanczos, average, etc..

Cubic resampling

Up to now, the bi-cubic resampling algorithm used when computing warped images and overviews was a 4x4 convolution kernel. This was appropriate for warping, when the dimensions of the target dataset are of the same order as the source dataset. However if the target dataset was downsized (which is the nominal case of overview computation), the result was sub-optimal, not to say plainly bad, because not enough source pixels were captured, leading to a result close to what nearest neighbour would give. Now, the convolution kernel dynamically uses the subsampling ratio to take into account all source pixels that have an influence on each target pixel, so e.g 8x8 pixels if subsampling by a factor of 2.
Of course, this involves more computation and could be slower. Fortunately, for 64 bit builds, Intel SSE2 intrinsics are at the rescue to compute convolutions in a very efficient way.

For example in GDAL 2.0dev, computing 5 overview levels on a 10474x4951 RGB raster with cubic resampling takes 2.4 seconds on a Core i5-750, to be compared with 3.8s with GDAL 1.11

$ gdaladdo -ro -r cubic world_4326.tif 2 4 8 16 32

To compare both results, we can select the 5th overview level with the fresh new open option OVERVIEW_LEVEL=4 (index are 0 based)

$ gdal_translate world_4326.tif out.tif -oo OVERVIEW_LEVEL=4

5th overview generated by GDAL 2.0dev

5th overview generated by GDAL 1.11.1

So yes, faster (a bit) and better (a lot) !

Similar result can also be obtained with :

$ gdalwarp -r cubic world_4326.tif out.tif -ts 328 155

The "-oo OVERVIEW_LEVEL=xxx" option can be used with gdalinfo, gdal_translate and gdalwarp, or with the new GDALOpenEx() API.

Related work could involve adding resampling method selection in the RasterIO() API that currently only does nearest neighbour sampling. If that might interest you, please contact me.

Overviews in warping

Related to the OVERVIEW_LEVEL open option, another long due improvement was the selection of the appropriate overview level when warping. A typical use case is to start with a WMS or tiled dataset, e.g the OpenStreetMap tiles, and wanting to reproject full or partial extent to an image with reasonably small dimensions. Up to now, GDAL would alway use the most precision dataset (typically zoom level 18 for OpenStreetMap), which would make the operation terribly slow and unpractical.

Now, the following will run in just a few seconds :

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.tif -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -ts 1000 1000

With the -ovr flag, you can modify the overview selection strategy, and for example specify you want to use the overview if the level immediately before the one that would have been automatically selected (i.e. with bigger dimensions, more precise)

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.tif -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -ts 1000 1000 -ovr AUTO-1

You can also specify a precise overview level to control the level of details, which is particuarly relevant in the case of OSM since the rendering depends on the scale :

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.tif -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -ts 1000 1000 -ovr 9

(Note: -ovr 9 is equivalent to OSM zoom level 8, since GDAL_overview_level = OSM_max_zoom_level - 1 - OSM_level, 9 = 18 - 1 - 8. )

With -ovr 9 (zoom level 8)

With -ovr 10 (zoom level 7)

With -ovr 11 (zoom level 6) or without any -ovr parameter

With -ovr 12 (zoom level 5)
(All above images are © OpenStreetMap contributors)

Overviews in warped VRT

GDAL advanced users will perhaps know the Virtual Raster (.vrt) format. There are several flavors of VRT files, one of them is the so-called "warped VRT", which can be produced by "gdalwarp -of VRT". This is an XML file that captures the name of the source dataset being warped and the parameters of the warping: output resolution, extent, dimensions, transformer used, etc... This can be convenient to do on-the-fly reprojection without needing to store the result of the reprojection. Similarly to regular warping, warped VRT can now make use of overviews of the source dataset to expose "implicit" overviews in the warped VRT dataset. Which make it possible to use warped VRT in a GIS viewer ith decent performance when zooming out. Among others, this will be  beneficial to QGIS that use the "auto-warped-VRT" mechanism when opening a raster that is not a "north-up" dataset.

Still playing with our OpenStreetMap dataset, let's create a warped VRT around western Europe :

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.vrt -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -of VRT

We can see that the VRT now advertizes overviews :

$ gdalinfo out.vrt
Size is 4767192, 4767192
Band 1 Block=512x128 Type=Byte, ColorInterp=Red
  Overviews: 2383596x2383596, 1191798x1191798, 595899x595899,
             297950x297950, 148975x148975, 74487x74487,
             37244x37244, 18622x18622, 9311x9311, 4655x4655,
             2328x2328, 1164x1164, 582x582, 291x291, 145x145,
             73x73, 36x36, 18x18

I'd like to thank Koordinates and Land Information New Zealand for funding those improvements.