Wednesday, October 22, 2014

Blending metadata into vector formats

This post explores a few ideas I've had recently, and the resulting experiments, about putting metadata (or arbitrary information) into vector GIS formats that have no provision for it. One typical such format is the good old Shapefile format. A shapefile generally consists of 3 files: a .shp file that contains the geometries, a .shx file that is an index from the shape number to the offset in the .shp file where the geometry is located (to allow fast retrieval by shape ID), and a .dbf file that contains the attributes of each shape.
Of course, the simplest way of adding metadata would be to put an additional file alongside the 3 mentioned ones, but that would not be very challenging (plus there would be the risk of losing it when copying the files).
Most implementations require at least those 3 files to be present. Some allow the .dbf to be missing (e.g. GDAL/OGR). Some allow the .shx to be missing, like OpenJUMP, which doesn't read it even when it is available; that is both a feature and a drawback in situations where there are "holes" in the .shp due to editing.

A basic solution is to add our metadata at the end of one of those 3 files. I've done tests with GDAL/OGR (based on Shapelib), GeoTools 12.0, OpenJUMP 1.7.1 (whose shapefile reader is a forked version of the GeoTools one with changes), proprietary software code-named "GM" and proprietary software "AG":
.dbf: all 5 implementations are happy with extra content at the end of the file
.shp: all implementations are happy, except OpenJUMP, which opens the file but throws a warning because it tries to interpret the additional bytes as a shape
.shx: all 5 implementations are happy
So we have at least 2 possibilities that are rather portable.
It should be checked how they react in editing use cases, like adding new features to the shapefile. Regarding GDAL/OGR, I can say that it would overwrite the extra content at the end of the .dbf and the .shx. It would leave the extra content at the end of the .shp and write the new geometry after it.
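As an illustration, here is a minimal sketch (the marker, file name and metadata content are hypothetical) of appending a metadata blob after the regular content of the .dbf and reading it back:

MARKER = b"\nXMETA\n"                  # arbitrary marker used to locate our blob again

def append_metadata(dbf_path, blob):
    with open(dbf_path, "ab") as f:    # append only: the regular records are untouched
        f.write(MARKER + blob)

def read_metadata(dbf_path):
    data = open(dbf_path, "rb").read()
    pos = data.rfind(MARKER)
    return data[pos + len(MARKER):] if pos != -1 else None

append_metadata("countries.dbf", b'{"source": "field survey, 2014"}')
print(read_metadata("countries.dbf"))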

What if we want to "link" the metadata per feature in a way that is preserved if shapes are added? And for the sake of exploring more possibilities, we will exclude the data-at-end-of-file track. Interleaving data and metadata is not possible in the .dbf since the records are placed consecutively. Same for the .shx. In the .shp, we can try reserving some space between all geometry records and make sure that the .shx index takes the holes into account. Since sizes and offsets in shapefiles are expressed in terms of 16-bit words, that extra space must be a multiple of 16 bits too. That works fine for all implementations, except OpenJUMP, for the same reason as above.

Hmm, and what if we incorporate the metadata, not between the encoded geometries, but inside them? Each geometry record is indeed structured like this:

Shape Id: 4 bytes
Record length (number of 16 bit words after that field): 4 bytes
Record content: (2 * record length) bytes
    Shape Type: 4 bytes
    Variable payload according to shape type

We can try adding extra payload at the end of the record content, while updating the record length to take it into account. One could have expected implementations to strictly check that the declared record length is consistent with the shape type, but experiments (and code inspection of the 3 open source implementations) show that, when they check, they only check that the record length is greater than or equal to the minimum expected record length. So this works for the 5 implementations! At least on a layer with 2D polygons. That should also work for other 2D geometry types. 3D shapes consist of the 2D information, followed by the Z information, and optionally by the M (measure) information. The M information is sometimes omitted when it is not present (this is the case of the OGR writer). So if we wanted to add metadata to 3D shapes, we would have to write dummy M information (writing not-a-number double values is commonly done to indicate that M information is invalid).
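Here is a rough sketch of that idea (file names are hypothetical and the shapefile is assumed to be well-formed): it enlarges the declared content length of one record, inserts the payload after its regular content, and patches the .shx offsets and both file lengths accordingly.

import struct

def hide_in_record(shp_path, shx_path, record_index, payload):
    if len(payload) % 2:
        payload += b"\x00"               # lengths are counted in 16-bit words
    extra_words = len(payload) // 2

    shp = bytearray(open(shp_path, "rb").read())
    shx = bytearray(open(shx_path, "rb").read())

    # .shx: 100-byte header, then 8 bytes per record (offset, content length),
    # both big-endian and expressed in 16-bit words.
    idx_pos = 100 + 8 * record_index
    offset_words, length_words = struct.unpack(">ii", shx[idx_pos:idx_pos + 8])

    # Patch this record's content length in the .shx and in the .shp record header.
    struct.pack_into(">i", shx, idx_pos + 4, length_words + extra_words)
    rec_pos = offset_words * 2           # byte offset of the record header in the .shp
    struct.pack_into(">i", shp, rec_pos + 4, length_words + extra_words)

    # Insert the payload right after the existing record content.
    insert_at = rec_pos + 8 + length_words * 2
    shp[insert_at:insert_at] = payload

    # Shift the offsets of all following records in the .shx.
    for pos in range(idx_pos + 8, len(shx), 8):
        off, = struct.unpack(">i", shx[pos:pos + 4])
        struct.pack_into(">i", shx, pos, off + extra_words)

    # Update the total file length (in 16-bit words) stored in both headers.
    struct.pack_into(">i", shp, 24, len(shp) // 2)
    struct.pack_into(">i", shx, 24, len(shx) // 2)

    open(shp_path, "wb").write(shp)
    open(shx_path, "wb").write(shx)

hide_in_record("countries.shp", "countries.shx", 0, b"owner=me")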

To go back to the .dbf file a bit: sometimes the width of fields of string type is larger than strictly needed. The values are left-aligned in the field and the remaining space is padded with space characters. I've tried inserting a NUL character just at the end of the string, and putting the extra information afterwards. This works fine for the 3 C/C++ based shapefile readers (GDAL/OGR, G.M., A.G.) since the NUL character is conventionally used to terminate a string in C/C++. Unfortunately that does not work with the 2 Java based implementations, which do not use that convention: the extra content is displayed after the field content.
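For the curious, a sketch of that trick (field name, record number and file name are hypothetical; only the basics of the dBase layout are handled):

import struct

def hide_in_dbf_field(dbf_path, record_idx, field_name, visible, hidden):
    data = bytearray(open(dbf_path, "rb").read())
    n_records, header_size, record_size = struct.unpack("<IHH", data[4:12])
    assert record_idx < n_records

    # Walk the 32-byte field descriptors to find the target field's offset and width.
    offset, pos = 1, 32                  # offset 0 within a record is the deletion flag
    while data[pos] != 0x0D:             # 0x0D terminates the list of descriptors
        name = data[pos:pos + 11].split(b"\x00")[0].decode("ascii")
        width = data[pos + 16]
        if name == field_name:
            break
        offset += width
        pos += 32
    else:
        raise KeyError(field_name)

    # Visible value, NUL terminator, hidden bytes, then the regular space padding.
    payload = visible + b"\x00" + hidden
    assert len(payload) <= width
    start = header_size + record_idx * record_size + offset
    data[start:start + width] = payload.ljust(width, b" ")
    open(dbf_path, "wb").write(data)

hide_in_dbf_field("countries.dbf", 0, "NAME", b"France", b"secret")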

As we have started exploring modifying the data itself, let's return to the .shp file. One thing to consider is that coordinates in shapefiles are stored as double precision floating point numbers, using the 64-bit IEEE-754 binary representation. Such numbers are decomposed as follows: 1 bit for the sign of the value, 11 bits for the exponent and its sign, and 52 bits for the mantissa. The mantissa is where the significant digits of the number are stored. How big is that? Let's go back to geography a bit. The Earth has roughly a circumference of 40,000 km. If we want to map features with a precision of 1 cm, we need 40,000,000 / 0.01 = 4 billion distinct numbers. 4 billion fits conveniently in a 32-bit integer (and the optimized OpenStreetMap .pbf format stores coordinates as 32-bit integers based on that observation). So the 52 bits allow 2^(52-32) = 2^20, roughly 1 million, times more numbers, i.e. a precision of 10^-8 meters = 10 nanometers! We could almost map every molecule located on the Earth's surface!
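A quick back-of-the-envelope check of those figures:

earth_circumference_m = 40000000.0
values_for_1cm = earth_circumference_m / 0.01     # 4e9: fits in an unsigned 32-bit integer
spare_bits = 52 - 32                              # mantissa bits left over
precision_m = 0.01 / 2 ** spare_bits              # ~9.5e-9 m, i.e. roughly 10 nanometers
print(values_for_1cm, spare_bits, precision_m)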
It is consequently reasonable to borrow the 16 least significant bits of the mantissa for other uses. Said differently, for every 2D point/vertex, we can get back 4 bytes without any noticeable loss of precision. Depending on the shape complexity, this might not be enough to store per-feature metadata, but on a typical shapefile, if we spread the metadata over the features, we can certainly store useful content. And the really great news is that this metadata would naturally be preserved in most format conversions (at least with GDAL/OGR, whose internal geometry representation also uses 64-bit floating point numbers, and probably most other geometry engines), and by formats like Spatialite or GeoPackage that also use 64-bit floating point numbers. However, one must be aware that any other operation, like rescaling or reprojection, would completely change the least significant bits and erase our metadata.
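A minimal sketch of the idea, operating on a single coordinate (the helper names are mine; real code would walk every vertex of every geometry and spread the metadata across them):

import struct

def embed_two_bytes(coord, two_bytes):
    # Replace the 16 least significant mantissa bits of a double with 2 metadata bytes.
    bits, = struct.unpack("<Q", struct.pack("<d", coord))
    bits = (bits & ~0xFFFF) | int.from_bytes(two_bytes, "big")
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def extract_two_bytes(coord):
    bits, = struct.unpack("<Q", struct.pack("<d", coord))
    return (bits & 0xFFFF).to_bytes(2, "big")

x = 2.348778
x_marked = embed_two_bytes(x, b"OK")
print(abs(x_marked - x))             # below ~3e-11 here: far smaller than any mapping precision
print(extract_two_bytes(x_marked))   # b'OK'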

Admittedly this is not a new idea. People have explored similar ideas for digital watermarking and more generally steganography, typically used to embed copyright information or source tracking (i.e. you generate a slightly different dataset for each customer, so that if a copy later shows up for download, you can identify the origin of the leak), generally in a way that is not noticeable. Using the least significant bits is the most basic technique, and it can be circumvented easily by just zeroing them or adding noise. More advanced techniques operate in the spectral domain, e.g. DCT (Discrete Cosine Transform), DFT (Discrete Fourier Transform) or DWT (Discrete Wavelet Transform). Some techniques have been specifically designed for GIS data, using topological properties for example. The common goal of those techniques is robustness against attempts at removing the watermark from the signal, at the expense of a reduced bandwidth for the inserted information. But for regular metadata, we do not need such a guarantee, and the use of least significant bits might be good enough and easily implemented.

Any other ideas ? Sure...

For polygons, the shapefile specification states that the vertices of the outer ring must be listed in clockwise order. But it does not specify which vertex of the outline must come first. Let's number the top-most vertex of the polygon 0 (if several vertices share the same y coordinate, take the one with the minimum x), the following vertex in clockwise order 1, etc. If our polygon has 16 vertices and we serialize it starting at vertex 11, we have encoded the number 11. Combined with the information of the following polygons, we can build a longer message. This idea can only work in practice for shapefiles of complex/dense enough polygons. If every polygon has 256 vertices, we can encode log2(256) = 8 bits per polygon. More generally, for a polygon with N vertices, we can encode log2(N) bits (rounded down to an integer). So we also need at least hundreds or thousands of polygons of that complexity to be able to encode something useful. The advantage of this technique is that it is robust to rescaling, and probably to most reprojections (at least those that globally preserve the appearance of shapes), provided that the shapes are rewritten in the same order as in the original data.
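A sketch of this encoding on a single ring (the helpers are mine; a ring is a list of (x, y) tuples in clockwise order, without the closing vertex repeated):

import math

def top_most_index(ring):
    # Vertex 0 of the canonical numbering: highest y, ties broken by lowest x.
    return max(range(len(ring)), key=lambda i: (ring[i][1], -ring[i][0]))

def encode_in_ring(ring, value):
    # Rotate the ring so that it starts `value` vertices after the canonical one.
    assert value < 2 ** int(math.log2(len(ring)))   # capacity of this ring in bits
    start = (top_most_index(ring) + value) % len(ring)
    return ring[start:] + ring[:start]

def decode_from_ring(ring):
    return (len(ring) - top_most_index(ring)) % len(ring)

ring = [(0, 0), (0, 4), (3, 5), (6, 4), (6, 0)]     # clockwise, 5 vertices
print(decode_from_ring(encode_in_ring(ring, 2)))    # 2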
The starting-vertex idea could also be adapted to lines. Consider a line made of (V1, V2, ..., VN). We can for example simply build a multi-polyline of 2 parts, (V1, ..., Vi) and (Vi, ..., VN), that will visually look like the original line and will encode the value i. The increase in binary encoding is modest (4 + 8 + 8 = 20 extra bytes).
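In pseudo-Python terms (a line being a list of (x, y) tuples, the split index being 0-based here):

def encode_split(line, i):
    # Both parts share vertex i, so the rendering is unchanged.
    return [line[:i + 1], line[i:]]

def decode_split(parts):
    return len(parts[0]) - 1

parts = encode_split([(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)], 2)
print(decode_split(parts))   # 2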

Another technique is to use repeated vertices. Consider a line or a polygon: while walking its vertices, a vertex that is immediately repeated encodes a 1, a non-repeated one a 0. For example, if a line is made of the sequence of vertices (V1, V1, V2, V3, V4, V4, V5, V5, V6), it encodes the binary number 100110. So we can encode as many bits as there are vertices in the geometry. If needed, we can also use more repetitions to encode more bits. At one bit per vertex, such a technique would on average increase the shapefile size by 50% (because, on average, half of the bits in a message are 1). It would preserve the metadata perfectly through all coordinate transformations (geometry engines generally operate on vertices separately), but not through operations that remove duplicated vertices.
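A sketch of that encoding and its decoding (again with lines as lists of (x, y) tuples, and assuming the original line has no consecutive duplicated vertices of its own):

def encode_bits(line, bits):
    out = []
    for vertex, bit in zip(line, bits):
        out.append(vertex)
        if bit == "1":
            out.append(vertex)            # a repeated vertex encodes a 1
    return out

def decode_bits(line):
    bits, i = [], 0
    while i < len(line):
        repeated = i + 1 < len(line) and line[i + 1] == line[i]
        bits.append("1" if repeated else "0")
        i += 2 if repeated else 1
    return "".join(bits)

line = [(0, 0), (1, 1), (2, 1), (3, 2), (4, 2), (5, 3)]
print(decode_bits(encode_bits(line, "100110")))    # 100110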

Finally, here's another idea, conceptually close to the one based on the starting vertex. Excluding implementations that don't rely on the .shx (I have no prejudice against them! Keep up the good work, folks!), we could use the order of shapes in the .shp to encode information. Traditionally, feature 1 appears first in the .shp, followed by feature 2, etc. But we could re-order the shapes as we wish, provided we make the .shx point to the right offsets in the .shp. If we have N shapes, there are N! (factorial(N) = N*(N-1)*(N-2)*...*2*1) ways of ordering them. So for N shapes, we can encode log2(N!) bits. In practice, for 10 shapes, that is 21 bits. For 100 shapes, 524 bits. For 1000 shapes, 8529 bits. And for 10000, 118458. Advantages: works for all geometry types, no increase in file size. Drawbacks: possibly slower sequential reading because of apparently random seeking within the .shp, and it does not survive file conversion.
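A quick check of those capacity figures, plus a sketch of how an integer message could be turned into a shape ordering (a factorial number system, Lehmer-code style, conversion; the function name is mine):

import math

for n in (10, 100, 1000, 10000):
    print(n, int(math.lgamma(n + 1) / math.log(2)))     # floor(log2(n!)) bits

def message_to_order(message, n):
    # Turn an integer in [0, n!) into a permutation of the record indices 0..n-1.
    remaining, order = list(range(n)), []
    for radix in range(n, 0, -1):
        message, pick = divmod(message, radix)
        order.append(remaining.pop(pick))
    return order

print(message_to_order(123456, 10))    # the order in which to write the 10 shapes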

I haven't mentioned it yet, but for nearly all of the techniques above, especially the last ones, we would need to reserve a few bits for a CRC or some other integrity mechanism, so as to make sure that what we think is metadata really is metadata. And all of them could potentially be combined!

Thursday, October 16, 2014

Warping, overviews and... warped overviews

The development version of GDAL has lately received a few long awaited improvements in the area of warping and overview computation.

For those not familiar with GDAL, warping is mainly used to reproject datasets from a source coordinate system to a target one, or to create a "north-up" image from a rotated image or an image that has ground control points. Overviews in GDAL, also called pyramids in other GIS software, are sub-sampled (i.e. coarser resolution) versions of full-resolution datasets, mainly used for fast display when zooming out. Depending on the utility (warper or overview computation), different resampling methods are available: bilinear, cubic, cubicspline, lanczos, average, etc.

Cubic resampling


Up to now, the bi-cubic resampling algorithm used when computing warped images and overviews was a 4x4 convolution kernel. This was appropriate for warping, when the dimensions of the target dataset are of the same order as the source dataset. However, if the target dataset was downsized (which is the nominal case of overview computation), the result was sub-optimal, not to say plainly bad, because not enough source pixels were captured, leading to a result close to what nearest neighbour would give. Now, the convolution kernel dynamically uses the subsampling ratio so as to take into account all source pixels that have an influence on each target pixel, e.g. 8x8 pixels when subsampling by a factor of 2.
Of course, this involves more computation and could be slower. Fortunately, for 64-bit builds, Intel SSE2 intrinsics come to the rescue to compute convolutions in a very efficient way.
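As a rough illustration (this is not GDAL's actual code), the footprint of the cubic kernel now scales with the ratio between source and target resolutions:

def kernel_width(subsampling_ratio):
    # The 4x4 base kernel grows so that every contributing source pixel is used.
    return 4 * max(1, subsampling_ratio)

print(kernel_width(1))   # 4: warping at a similar resolution
print(kernel_width(2))   # 8: downsampling by a factor of 2
print(kernel_width(4))   # 16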

For example, in GDAL 2.0dev, computing 5 overview levels on a 10474x4951 RGB raster with cubic resampling takes 2.4 seconds on a Core i5-750, to be compared with 3.8 s with GDAL 1.11:

$ gdaladdo -ro -r cubic world_4326.tif 2 4 8 16 32

To compare both results, we can select the 5th overview level with the brand new open option OVERVIEW_LEVEL=4 (indices are 0-based):

$ gdal_translate world_4326.tif out.tif -oo OVERVIEW_LEVEL=4

5th overview generated by GDAL 2.0dev

5th overview generated by GDAL 1.11.1


So yes, faster (a bit) and better (a lot)!

A similar result can also be obtained with:

$ gdalwarp -r cubic world_4326.tif out.tif -ts 328 155

The "-oo OVERVIEW_LEVEL=xxx" option can be used with gdalinfo, gdal_translate and gdalwarp, or with the new GDALOpenEx() API.

Related work could involve adding resampling method selection to the RasterIO() API, which currently only does nearest neighbour sampling. If that might interest you, please contact me.

Overviews in warping


Related to the OVERVIEW_LEVEL open option, another long overdue improvement was the selection of the appropriate overview level when warping. A typical use case is to start with a WMS or tiled dataset, e.g. the OpenStreetMap tiles, and to reproject a full or partial extent to an image with reasonably small dimensions. Up to now, GDAL would always use the most precise level (typically zoom level 18 for OpenStreetMap), which made the operation terribly slow and impractical.

Now, the following will run in just a few seconds:

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.tif -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -ts 1000 1000

With the -ovr flag, you can modify the overview selection strategy, and for example specify that you want to use the overview level immediately before the one that would have been automatically selected (i.e. with bigger dimensions, hence more precise):

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.tif -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -ts 1000 1000 -ovr AUTO-1

You can also specify a precise overview level to control the level of detail, which is particularly relevant in the case of OSM since the rendering depends on the scale:

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.tif -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -ts 1000 1000 -ovr 9

(Note: -ovr 9 is equivalent to OSM zoom level 8, since GDAL_overview_level = OSM_max_zoom_level - 1 - OSM_level, 9 = 18 - 1 - 8. )

With -ovr 9 (zoom level 8)

With -ovr 10 (zoom level 7)

With -ovr 11 (zoom level 6) or without any -ovr parameter

With -ovr 12 (zoom level 5)
(All above images are © OpenStreetMap contributors)

Overviews in warped VRT


GDAL advanced users will perhaps know the Virtual Raster (.vrt) format. There are several flavors of VRT files; one of them is the so-called "warped VRT", which can be produced by "gdalwarp -of VRT". This is an XML file that captures the name of the source dataset being warped and the parameters of the warping: output resolution, extent, dimensions, transformer used, etc. This can be convenient for on-the-fly reprojection without needing to store the result of the reprojection. Similarly to regular warping, a warped VRT can now make use of the overviews of the source dataset to expose "implicit" overviews in the warped VRT dataset. This makes it possible to use a warped VRT in a GIS viewer with decent performance when zooming out. Among others, this will be beneficial to QGIS, which uses the "auto-warped-VRT" mechanism when opening a raster that is not a "north-up" dataset.

Still playing with our OpenStreetMap dataset, let's create a warped VRT around western Europe:

$ gdalwarp frmt_wms_openstreetmap_tms.xml out.vrt -t_srs EPSG:4326 \
  -r cubic -te -10 35 10 55 -overwrite -of VRT

We can see that the VRT now advertises overviews:

$ gdalinfo out.vrt
[...]
Size is 4767192, 4767192
[...]
Band 1 Block=512x128 Type=Byte, ColorInterp=Red
  Overviews: 2383596x2383596, 1191798x1191798, 595899x595899,
             297950x297950, 148975x148975, 74487x74487,
             37244x37244, 18622x18622, 9311x9311, 4655x4655,
             2328x2328, 1164x1164, 582x582, 291x291, 145x145,
             73x73, 36x36, 18x18


I'd like to thank Koordinates and Land Information New Zealand for funding those improvements.

Wednesday, October 1, 2014

GDAL/OGR 1.11.1 released

On behalf of the GDAL/OGR development team, I am pleased to
announce the release of the GDAL/OGR 1.11.1 bug fix release.  This
release contains more than 80 bug fixes since the April 1.11.0 release.

The source is available at:

  http://download.osgeo.org/gdal/1.11.1/gdal-1.11.1.tar.xz
  http://download.osgeo.org/gdal/1.11.1/gdal-1.11.1.tar.gz
  http://download.osgeo.org/gdal/1.11.1/gdal1111.zip

Details on the fixes in this release, and a security announcement, are
available at:
  http://trac.osgeo.org/gdal/wiki/Release/1.11.1-News