mercredi 22 octobre 2014

Blending metadata into vector formats

This post explores a few ideas, and the resulting experiments, I've had recently to put metadata (or arbitrary information) into vector GIS formats that have no provision for them. One typical such format is the good-old Shapefile format. A shapefile generally consists in 3 files, a .shp file that contains the geometries, a .shx that is an index from the shape number to the offset in the .shp file where the geometry is located (to allow fast retrieval by shape ID) and a .dbf file that contains the attributes of each shape.
Of course, the most simple way of adding metadata would be to but an additional file besides the 3 mentionned ones, but that would not be very challenging (plus the risk of losing it during copy).
Most implementations require at least those 3 files to be present. Some allow .dbf to be missing (e.g. GDAL/OGR). Some allow .shx to be missing, like OpenJUMP which doesn't read it even if it is available, which is both a feature and a drawback in situations when there are "holes" in the .shp due to editing.

A basic solution is to add our metadata at the end of one of those 3 files. I've done tests with GDAL/OGR (based on Shapelib), GeoTools 12.0, OpenJUMP 1.7.1 (whose shapefile reader is a forked version of the GeoTools one with changes), proprietary software code-named "GM" and proprietary software "AG"
.dbf : all 5 implementations are happy with extra content at the end of the file
.shp : all implementations happy, except OpenJUMP that opens the file, but throws a warning because it tries to interprete the additional bytes as shape.
.shx : all 5 implementations are happy
So we have at least 2 possibilities that are rather portable.
It should be checked how they react in editing use cases, like adding new features to the shapefile. Regarding GDAL/OGR, I can say that it would overwrite the extra content at the end of the .dbf and the .shx. It would let the extra content at the end of the .shp to write the new geometry afterwards.

What if we want to "link" the metadata per feature in a way where it is preserved if shapes are added ? And for the sake of exploring more possibilities, we will exclude using the data-at-end-of-file track. Interleaving data and metadata is not possible in .dbf since the records are placed consecutively. Same for .shx. In .shp, we can try reserving some space between all geometry records and make sure that the .shx index takes the holes into account. Due to the fact that size and offsets in shapefile are expressed in term of 16 bit words, that extra space must be a multiple of 16 bit too. That works fine for all implementations, except OpenJUMP for the same reason as above. Hum, and what if we incorporate the metadata, not between the encoded geometries, but inside them ? Each geometry record is indeed structured like this :

Shape Id: 4 bytes
Record length (number of 16 bit words after that field): 4 bytes
Record content: (2 * record length) bytes
    Shape Type: 4 bytes
    Variable payload according to shape type

We can try adding extra payload at the end of record content while still updating record length to take into account. We could have thought that implementations strictly checks that the declared record length is consistant with the shape type, but experimentations (and code inspection on the 3 Open Source implementations) show that, when they check, they check that the record length is at least greater or equal to the minimum expected record length. So this works for the 5 implementations ! At least on a layer with 2D polygons. That should also work for other 2D geometry type. 3D shapes consist in the 2D information, followed by the Z information, and optionaly by the M(eausre) information. M information is sometimes omitted when it is not present (this is the case of the OGR writer). So if we would want to add metadata for 3D shapes, we would have to write dummy M information (writting not-a-number double values is commonly done to indicate that M information is invalid).

To go back to .dbf file a bit, sometimes the width of fields of string type is larger than strictly needed. The values are left aligned in the field and remaining space is padded with space characters. I've tried to insert a nul character just at the end of the string, and put the extra information afterwards. This works fine for the 3 C/C++ based shapefile readers (GDAL/OGR, G.M., A.G) since nul character is conventionnaly used to terminate a string in C/C++. Unfortunately that does not work with the 2 Java based implementations that do not use that convention : the extra content is displayed after the field content.

As we have started exploring modifying the data itself, let's return to .shp file. One thing to consider is that coordinates in shapefiles are stored as double precision floating point numbers, stored on 64 bits using the IEEE-754 binary representation. Such numbers are decomposed like the following : 1 bit for the sign of the value, 11 bits for the exponent and its sign and 52 bits for the mantissa. The mantissa is where the significand precision of the number is stored. How big is that ? Let's go back to geography a bit. The Earth has rougly a circonference of 40 000 km. If we want to map features with a precision of 1 cm, we need 40 000 000 / 0.01 = 4 billion distinct numbers. 4 billion fits conveniently on a 32 bit integer (and OpenStreetMap .pbf optimized format store coordinates on 32 bit integers based on that observation). So 52 bits allow 2^(52-32)=2^20, roughly 1 million more numbers, i.e. a precision of 10^-8 meters = 10 nanometers ! We could almost map every molecule located on the Earth surface !
It is consequently reasonable to borrow the 16 least significant bits from the mantissa for other use. Said differently for every 2D point/vertex, we can get back 4 bytes without any noticeable loss of precision. Depending on the shape complexity, this might be not big enough to store per-feature metadata. But on a typical shapefile, if we spead the metadata over the features, we can certainly store useful content. And the really great news is that this metadata would be preserved naturally in most format conversions (at least with GDAL/OGR whose internal geometry representation also uses 64-bit floating point numbers, and probably most other geometry engines), and for formats like Spatialite or GeoPackage that also use 64-bit floating point numbers. However, one must be aware than any other operation like rescaling or reprojection would completely change the least significant bits and erase our metadata.

Admitedly this is not a new idea. People have explored similar ideas for digital watermarking and more generally steganography, typically used to embed copyright information or source tracking (i.e. you generate a slightly different dataset for each customer, hence if a copy is then available for download, you can identify the origin of the leak), generally in a not noticeable way. Using least significant bits is the very basic technique, that can be circumvented easily by just zeroing them or adding noise. More advanced technique operate in the spectral domain, like DCT (Discrete Cosine Transform), DFT (Discrete Fourier Transform) or DWT (Discrete Wavelet Tranform). Some techniques have been specifically designed for GIS data, using topological properties for example. The common target of those techniques is to have robustness against attempts of removing the watermark from the signal, at the expense of a reduced bandwith for the inserted information. But for regular metadata, we do not need such guarantee and the use of least-significant bits might be good enough and easily implemented.

Any other ideas ? Sure...

For polygons, the shapefile specification states that the vertices of the outer ring must be listed in clockwise order. But it does not specify which vertex of the outline must be the first one. Let's consider that the top-most vertex of the polygon is numbered 0 (if there are several vertices with the same y coordinate, let's take the one of them with the minimum x), the following vertex in clockwise order is 1, etc... If our polygon has 16 vertices, and we serialize it starting at vertex 11, we have coded the 11 number. Combined with information of following polygons, we can build a longer message. This idea could only work in practice for shapefiles of complex/dense enough polygons. If every polygon has 256 vertex, we can encode log2(256)=8 bits per polygon. More generally, for a polygon with N vertex, we could encode log2(N) bits (rounded to inferior integer). So we need also at least hundreds or thousands of polygons of that complexity to be able to encode something useful. The advantage of this technique is that it is robust to rescaling, and probably most reprojections (at least the one that globally preserve the appearance of shapes), provided that the shapes are rewritten in the same order as in the original data.
That technique could also be adapted for lines. Let's consider a line made of (V1,V2,....Vn). We can for example simply build a multi-polyline of 2 parts (V1,...Vi) and (Vi,....,VN) that will visually looks like the original line and will encode for the i value. The increase in binary encoding would be modest (4+8+8=20 extra bytes).

Another technique might be to use repeated vertices. Let's consider a line or a polygon: if while listing consecutive vertices, they are repeated, this would encode a 1 value. Otherwise 0. For example, if a line is made of the sequence of vertices (V1,V1,V2,V3,V4,V4,V5,V5,V6), it would be equivalent to binary number 100110. So we could encode as many bits as vertices in the geometry. If needed, we can also use more repetitions to encode more bits. For one bit per vertex, on average such a technique would increase shapefile size by 50% (because on average, half of bits in a message are 1). It would preserve metadata perfectly for all coordinate transformations (geometry engines generally operate on vertices separately). But not to operations that would remove duplicated vertices.

Finally, here's another idea, conceptually close to the one based on the starting vertex. Excluding implementations that don't rely on the .shx (I've no prejudice against such one ! Keep on good work folks !), we could use the order of shapes in the .shp to encode information. Traditionnaly, feature 1 appears first in the .shp, followed by feature 2, etc... But we could re-order the shapes as we wish, provided we make the .shx point to the right offset in the .shp. If we have N shapes, there are N! (factorial(N) = N*(N-1)*(N-2)*...2*1) ways of ordering them. So for N shapes, we can encode log2(N!) bits. In practice for 10 shapes, that is 21 bits. For 100 shapes, 524 bits. For 1000 shapes, 8529 bits. And for 10000, 118458. Advantages: works for all geometry types, no increase in file size. Inconvenients: possibly less performant sequential reading because of apparently random seeking within the .shp, doesn't resist to file conversion.

I've not mentionned it, but for nearly all mentionned techniques, especially the last ones, we would need to reserve a few bits to insert a CRC or any other integrity mechanism, so as to make sure that we think is metadata really is. And all them could be potentially combined !

Aucun commentaire:

Enregistrer un commentaire