Sunday, December 30, 2012

A geocoding client API in GDAL/OGR

A new API has been added to GDAL/OGR to use geocoding services, such as OpenStreetMap Nominatim, MapQuest Nominatim, Yahoo! PlaceFinder, GeoNames or Bing.

From C, this is as simple as:
/* Create a session with default options */
OGRGeocodingSessionH hSession = OGRGeocodeCreateSession(NULL);

/* Now query the service */
OGRLayerH hLayer = OGRGeocode(hSession, "Paris", NULL, NULL);

/* Use the result layer (the first feature is generally the most relevant) */

/* Cleanup */
OGRGeocodeFreeResult(hLayer);
OGRGeocodeDestroySession(hSession);
More interestingly, you can use that capability in SQL with the ogr_geocode() function that has been added to the SQLite SQL dialect:
SELECT ST_Centroid(ogr_geocode('Paris'))


POINT (2.342878767069653 48.85661793020374)
or, a bit more elaborate, to add geocoded information to a CSV file:
ogrinfo cities.csv -dialect sqlite -sql \
   "SELECT *, ogr_geocode(city, 'country') AS country,
    ST_Centroid(ogr_geocode(city)) FROM cities" \
    --config OGR_GEOCODE_LANGUAGE en


id (Real) = 1
city (String) = Paris
country (String) = France
POINT (2.342878767069653 48.85661793020374)

id (Real) = 2
city (String) = London
country (String) = United Kingdom
POINT (-0.109369427546499 51.500506667319407)


id (Real) = 7
city (String) = Beijing
country (String) = People's Republic of China
POINT (116.391195 39.9064702)

For better efficiency, the results of geocoding queries are cached in a local datasource (by default, a simple SQLite database in the working directory, but the location can be overridden by a configuration option, and other formats, CSV or PostgreSQL, can be selected).
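The caching idea can be sketched in a few lines of pure Python. This is only an illustration of the principle (the table and column names below are made up for the example, not GDAL's actual cache schema): results are keyed by the query string, so a repeated query never hits the online service.

```python
import sqlite3

# Toy illustration of the geocoding cache: results are stored in a local
# SQLite database keyed by the query string, so repeating a query does not
# hit the online service again. Schema names here are hypothetical.

def make_cache(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS geocode_cache (query TEXT PRIMARY KEY, result TEXT)"
    )
    return conn

def cached_geocode(conn, query, fetch_online):
    row = conn.execute(
        "SELECT result FROM geocode_cache WHERE query = ?", (query,)
    ).fetchone()
    if row is not None:
        return row[0]                      # cache hit: no network request
    result = fetch_online(query)           # cache miss: query the service
    conn.execute("INSERT INTO geocode_cache VALUES (?, ?)", (query, result))
    return result

calls = []
def fake_service(q):
    """Stand-in for the online geocoding service."""
    calls.append(q)
    return "POINT (2.34 48.86)"

conn = make_cache()
cached_geocode(conn, "Paris", fake_service)
cached_geocode(conn, "Paris", fake_service)  # second call served from the cache
print(len(calls))  # the fake online service was queried only once: 1
```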

Reverse geocoding is also possible with OGRGeocodeReverse() or the ogr_geocode_reverse() SQL function:
SELECT ogr_geocode_reverse(-100,45,'country','zoom=3') AS country

returns:
  country (String) = United States of America
Note: the online services that you may use generally have Terms of Use that disallow bulk geocoding, so use the above capabilities with care.

Sunday, June 24, 2012

GDAL/OGR using Shapefile native .sbn spatial index

This is the end of a mystery and a frustration that have lasted for about 15 years. The Shapefile format has been documented since 1998, but the documentation was limited to the bare core, that is to say the .shp file that contains the geometries and the .shx file that is an index to the geometries. However, the format of the .sbn file, which was known to contain a spatial index (aimed at speeding up spatial filters), had never been published.

The FOSS community came up with an alternate spatial index format, the .qix format, originally used by Shapelib, MapServer and GDAL/OGR, and whose use has propagated into other software stacks, such as GeoTools.

But Joel Lawhead, Marc Pfister and another developer, Francisco, actively started reverse engineering the .sbn format at the end of 2011, and their effort has finally come to a conclusion. Many kudos to them for leading this great effort!

There is not yet a formal specification of the .sbn format, but with the help of their blog entries and the debug/testing code that they made available, it was relatively easy to put the pieces together. So, as an application of their work and a new step to increase interoperability between FOSS and proprietary software, I've just added support for reading and using .sbn files in the latest revision of the development version of GDAL/OGR (GDAL 2.0dev). You can refer to ticket 4719 for the details and the code.


As this code is rather new, crowd testing is of course much appreciated. Provided that you use the latest version of GDAL 2.0dev and have python-gdal ready, you can use the testsbn.py script to check whether the spatial index is correctly used with your own datasets.

Let's suppose that you have a shapefile "some_shapefile.shp" with an accompanying .sbn file "some_shapefile.sbn"; you can then try:
python testsbn.py some_shapefile.shp
This script will iterate over all the shapes of the shapefile and, for each shape, define a spatial filter whose extent is the bounding box of the geometry and issue a spatial request. It then checks that the result of the spatial request contains the shape that served to define the spatial filter.
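The consistency check used by the script can be sketched in plain Python, with axis-aligned bounding boxes standing in for shapes (this only illustrates the test's logic, it is not the actual script, which uses the OGR API):

```python
# Self-test idea behind testsbn.py, illustrated with plain bounding boxes:
# for each shape, use its own bounding box as the spatial filter and verify
# that the shape is among the results returned by the filtered query.

def intersects(a, b):
    """Axis-aligned bbox intersection test; boxes are (xmin, ymin, xmax, ymax)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def spatial_query(shapes, bbox):
    """Stand-in for iterating a layer with a spatial filter set to bbox."""
    return [i for i, s in enumerate(shapes) if intersects(s, bbox)]

shapes = [(0, 0, 1, 1), (2, 2, 3, 3), (0.5, 0.5, 2.5, 2.5)]

# Every shape must be found when filtering with its own bounding box;
# if the spatial index dropped a shape, this would be False.
ok = all(i in spatial_query(shapes, shapes[i]) for i in range(len(shapes)))
print(ok)  # True
```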


My initial testing of .sbn vs .qix queries tends to show that searching in a .sbn file is faster than in a .qix file (more than twice as fast on large datasets with many spatial filter requests, such as in testsbn.py), while the file is also smaller. An explanation would be that spatial requests against .qix files use floating-point comparisons, whereas equivalent requests against .sbn files only use integer comparisons.

As far as adding write support for .sbn files is concerned, I'm a bit ambivalent for now. According to Joel's blog, a few details still have to be solved in order to understand all the subtleties of the algorithm that dispatches shapes into tree nodes, and to ensure full interoperability with the software stack of the vendor at the origin of the .sbn format.

Friday, May 18, 2012

A new GDAL virtual file system to read streamed data (e.g. for OGR WFS)

GDAL/OGR can of course read data from regular file systems, but also from more exotic sources thanks to a "virtual file system" API.

Let's start with the /vsizip/ virtual file system. If you have a ZIP file, myzip.zip, that contains a shapefile myshape.shp (and its associated .shx and .dbf files), you can read it with:
ogrinfo -ro /vsizip/myzip.zip/myshape.shp
(or more simply /vsizip/myzip.zip as it is considered as a directory, and a directory is a valid datasource for the Shapefile driver).

If your data is located on a HTTP/FTP server, you can use the /vsicurl/ virtual file system, like this:
ogrinfo -ro /vsicurl/http://example.com/myshape.shp
As a bonus, you can combine both to read inside a remote ZIP file:
ogrinfo -ro /vsizip//vsicurl/http://example.com/myzip.zip/myshape.shp
Developers or advanced users may be interested in /vsimem/, to use a memory buffer as a datasource, or /vsisubfile/, to access a file located inside another one.

/vsicurl/ is convenient for accessing static files and provides random access to the data inside them, provided that the web server supports "range downloading", i.e. the capability of returning data in a range of offsets.
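What "range downloading" means at the HTTP level is simply a Range header asking the server for a window of bytes instead of the whole file. The snippet below only builds such a request (the URL is a placeholder; no server is contacted):

```python
import urllib.request

# /vsicurl/ relies on HTTP "range downloading": rather than fetching a whole
# remote file, the client requests a window of bytes with a Range header.
# Building the request is enough to show the mechanism; nothing is sent.
req = urllib.request.Request(
    "http://example.com/myshape.shp",          # placeholder URL
    headers={"Range": "bytes=100-199"},        # ask for bytes 100..199 only
)
print(req.get_header("Range"))  # bytes=100-199
```

A server that honors this replies with status 206 (Partial Content) and just those 100 bytes, which is what makes random access inside a remote file practical.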

Unfortunately, in some circumstances, the file is dynamically generated at the time you request it, so range downloading isn't supported. One such example in the scope of GDAL/OGR is the GML document generated by a WFS GetFeature request. Currently, the OGR WFS driver fetches the document as a whole with the CPLHTTPFetch() API, and passes the buffer to the GML driver (as a /vsimem/ file).

This behaviour has at least two drawbacks:
  • Even if you need to read one single feature, the driver will fetch the whole WFS GetFeature response, which can be long.
  • If the WFS GetFeature response is too long, it might not fit into memory at all.
It was possible to mitigate that by using the paging capability of some WFS servers, which is a non-standardized extension for WFS 1.0 and 1.1 (now standardized in the WFS 2.0 spec).

In the GDAL/OGR trunk (2.0dev >= r24460), you can find a /vsicurl_streaming/ virtual file system that can be used to read data from a streaming server. This works efficiently only if the access pattern to the data is linear rather than random. The OGR GML driver already natively parses data as a stream, so it works nicely with /vsicurl_streaming/:
ogrinfo -ro -al "/vsicurl_streaming/http://testing.deegree.org/deegree-wfs/services?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetFeature&TYPENAME=app:Springs"
Or, more simply, since the OGR WFS driver has been retrofitted to use it transparently:
ogrinfo -ro WFS:http://testing.deegree.org/deegree-wfs/services

Tuesday, January 31, 2012

Welcome to the 200th GDAL/OGR driver !

A few hours ago, I merged into the GDAL/OGR source tree the ElasticSearch driver, which was contributed by Adam Estrada. According to the driver testing status page, it happens to be the 200th driver!

To be honest, a few of them have been retired over the years (mainly because they were superseded by rewritten versions), so even if you try to enable all possible drivers, you won't reach 200. But you can get close: my build currently includes 124 GDAL drivers and 60 OGR drivers.

This 200th driver is a bit particular because it is write-only, whereas nearly all other drivers are read-only or read-write.

What about the other new drivers that have been committed to GDAL trunk since the 1.9.0 release?

As far as GDAL drivers are concerned, there is a driver to read MBTiles that Dane Springmeyer already blogged about.

More recently, new spreadsheet formats have also made their way into OGR. Namely, the ODS driver and the XLSX driver, which respectively handle files in the Open Document Spreadsheet format (used by applications like OpenOffice / LibreOffice) and in the Office Open XML format (generated by applications like Microsoft Office 2007 and later versions). If you are wondering about the fate of the older XLS format, an XLS driver is already included in GDAL/OGR 1.9.0, provided your GDAL build links against the FreeXL library written by Alessandro Furieri, the main author of libspatialite.

The ODS and XLSX drivers have very similar capabilities and source code, which is not surprising because the technologies behind the two formats are the same: XML files in a ZIP container (if you don't believe me, just try renaming your .ods and .xlsx files to .zip and open them with your favorite ZIP browser).
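You can see the "XML files in a ZIP container" idea directly with Python's zipfile module. The toy container below mimics a small subset of what a real ODS archive holds (member names and content are simplified for the example):

```python
import io
import zipfile

# An .ods or .xlsx file is nothing more than a ZIP archive of XML files.
# Build a toy "spreadsheet" container in memory and inspect its contents;
# the member names mimic a subset of a real ODS archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("content.xml", "<office:document-content/>")
    z.writestr("styles.xml", "<office:document-styles/>")

with zipfile.ZipFile(buf) as z:
    print(z.namelist())                 # ['content.xml', 'styles.xml']
    print(z.read("content.xml").decode())
```

Renaming a real .ods to .zip and listing it with any ZIP tool shows the same kind of layout, with content.xml carrying the cell data.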

Writing the drivers was surprisingly simpler than I had initially expected. In order to retrieve cell values, you just need to extract a few XML elements. From a developer's point of view, the award for the simplest format to read goes to ODS, with a nice separation between semantics and styling, and only one file (content.xml) to parse. XLSX is a bit more complicated to analyze because you have to read at least four different files (workbook.xml, sharedStrings.xml, styles.xml and a file for each sheet in the spreadsheet) and you need to understand some of the styling information to tell a regular numeric value apart from a date.

Those drivers also support creating ODS and XLSX files. Caution: only raw values will be written, no fancy styling! Updating existing files is also supported, but this uses the same serialization mechanism as the one used to create a new file, so be aware that existing formulas, charts, drawings, etc. will be lost.

Although not detailed in the documentation pages of the drivers, if you need some form of spatial support with those formats, you can combine them with an OGR VRT, in particular the GeometryField element, to be able to use column(s) of your spreadsheet as geometry columns.
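For instance, assuming a spreadsheet cities.ods whose first sheet has "lon" and "lat" columns (the file name, sheet name and column names below are illustrative, not fixed by the driver), a VRT along these lines exposes each row as a point feature:

```xml
<OGRVRTDataSource>
    <OGRVRTLayer name="cities">
        <SrcDataSource>cities.ods</SrcDataSource>
        <SrcLayer>Sheet1</SrcLayer>
        <GeometryField encoding="PointFromColumns" x="lon" y="lat"/>
    </OGRVRTLayer>
</OGRVRTDataSource>
```

Saved as cities.vrt, this can then be opened like any other OGR datasource, e.g. with ogrinfo cities.vrt.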

The good news is that those two new drivers have no dependency on third-party libraries other than the Expat XML parser library, which is already used by many other drivers in GDAL and which most binary distributions of GDAL link against. Typically, you will find them ready to use in Tamas Szekeres' automated Windows daily builds (fetch the -development packages at the top of the first table to get builds corresponding to the latest GDAL trunk version).

Testing is highly encouraged, as well as reporting of issues you might run into.

In particular, interpreting spreadsheets that make use of formulas can be tricky. Depending on the application that writes the files, the result of the evaluation of formulas might or might not be written to the file. The ODS and XLSX drivers will use the result of the evaluation if it is available. When it is missing, I've plugged into the ODS driver a simple formula evaluator that can understand and evaluate a restricted set of functions (readers interested in the details will find the full list in the first enumeration of the ods_formula.h header file). Based on my testing, OpenOffice always writes the evaluation of formulas, whereas the OpenOffice export of Google Spreadsheet documents will not.
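The principle of such a restricted evaluator can be sketched in a few lines of Python. This only illustrates the whitelist idea (the function set and behavior below are invented for the example; the actual evaluator is C++ code in ods_formula.h / ods_formula.cpp and works on parsed formula trees):

```python
# Tiny illustration of a restricted formula evaluator: only a whitelisted
# set of functions is understood, and anything else is rejected, leaving
# the cell value unevaluated. The whitelist here is hypothetical.

ALLOWED = {
    "SUM": lambda *args: sum(args),
    "MIN": min,
    "MAX": max,
    "CONCATENATE": lambda *args: "".join(str(a) for a in args),
}

def evaluate(name, *args):
    if name not in ALLOWED:
        # An unknown function means the formula cannot be evaluated.
        raise ValueError("unsupported function: " + name)
    return ALLOWED[name](*args)

print(evaluate("SUM", 1, 2, 3))            # 6
print(evaluate("CONCATENATE", "a", "b"))   # ab
```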

For now, there is nothing equivalent implemented in the XLSX driver, as I do not have access to a sufficiently representative set of files, and it is not yet clear whether having non-evaluated formulas is a common practice for that format. The good news is that, should the need arise, first tests tend to show that the ODS formula evaluator could be extended with just a few changes so that it also understands XLSX formulas.