dimanche 30 décembre 2012

A geocoding client API in GDAL/OGR

A new API has been added to GDAL/OGR to use geocoding services, such as OpenStreetMap Nominatim, MapQuest Nominatim, Yahoo! PlaceFinder, GeoNames or Bing.

From C, this is as simple as :
/* Create a session with default options */
OGRGeocodingSessionH hSession = OGRGeocodeCreateSession(NULL);

/* Now query the service */
OGRLayerH hLayer = OGRGeocode(hSession, "Paris", NULL, NULL);

/* Use the result layer (the first feature is generally the most relevant) */

/* Cleanup */
OGRGeocodeFreeResult(hLayer);
OGRGeocodeDestroySession(hSession);
More interesting, you can use that capability in SQL with the ogr_geocode() function that has been added to the SQLite SQL dialect :
SELECT ST_Centroid(ogr_geocode('Paris'))

returns:

OGRFeature(SELECT):0
POINT (2.342878767069653 48.85661793020374)
or a bit more evolved, to add geocoded information to a CSV file :
ogrinfo cities.csv -dialect sqlite -sql \
   "SELECT *, ogr_geocode(city, 'country') AS country,
    ST_Centroid(ogr_geocode(city)) FROM cities" \
    --config OGR_GEOCODE_LANGUAGE en

returns:

OGRFeature(SELECT):0
id (Real) = 1
city (String) = Paris
country (String) = France
POINT (2.342878767069653 48.85661793020374)

OGRFeature(SELECT):1
id (Real) = 2
city (String) = London
country (String) = United Kingdom
POINT (-0.109369427546499 51.500506667319407)

[...]

OGRFeature(SELECT):6
id (Real) = 7
city (String) = Beijing
country (String) = People's Republic of China
POINT (116.391195 39.9064702)

For better efficiency, the results of geocoding queries are cached into a local datasource (by default, a simple SQLite database in the working directory, but the location can be overriden by a configuration option, and other formats - CSV or PostgreSQL - can be selected).

Reverse geocoding is also possible with OGRGeocodeReverse() or the ogr_geocode_reverse() SQL function :
SELECT ogr_geocode_reverse(-100,45,'country','zoom=3') AS country

returns :
OGRFeature(SELECT):0
  country (String) = United States of America
Note: the online services that you may use have generally Term of Uses that disallow bulk geocoding, so use the above capability with care.

dimanche 24 juin 2012

GDAL/OGR using Shapefile native .sbn spatial index

This is the end of a mistery and a frustration that have lasted for about 15 years. The Shapefile format had been documented since 1998, but the documentation was limited to the minimum core, that is to say the .shp file that contains the geometries and the .shx that is an index to the geometries. However the format of the .sbn file, that was known to contain spatial index (aimed at speeding up spatial filters), has never been published.

The FOSS community came with an alternate spatial index format, the .qix format, originally used by Shapelib, MapServer and GDAL/OGR, and whose use has propagated into other software stacks, such as GeoTools.

But Joel Lawhead, Marc Pfister and another developer, Francisco, have actively started since the end of 2011 the reverse engineering of the .sbn format, and their effort finally came to a conclusion. Many kudos to them for leading this great effort !

There is not yet a formal specification of the .sbn format, but with the help of the blog entries and the debug/testing code that they made available, it was relatively easy to put pieces together. So, as an application of their work and a new step to increase interoperability between FOSS and proprietary software, I've just added support for reading and using .sbn files in the latest revision of the developement version of GDAL/OGR (GDAL 2.0dev). You can refer to ticket 4719 for the details and the code.

~~~

As this code is rather new, crowd testing is of course much appreciated. Provided that you use the latest version of GDAL 2.0dev and have python-gdal ready, you can use the testsbn.py script to check if the spatial index is correctly used with your own datasets.

Let's suppose that you have a shapefile "some_shapefile.shp", with an accompaying .sbn file "some_shapefile.sbn", you can try :
python testsbn.py some_shapefile.shp
This script will iterate over all the shapes of the shapefile, and for each shape, define a spatial filter whose extent is the bounding box of the geometry and issue a spatial request. It will check that the result of the spatial request contains the shape that served to define the spatial filter.

~~~

My initial testing of .sbn vs .qix queries would tend to show that search in .sbn is faster than in .qix (more than twice on large datasets, and with a lot of spatial filter request such as in testsbn.py), while being smaller. An explaination would be that spatial requests against .qix files use floating-point comparisons, whereas equivalent requests against .sbn files only use integer comparisons.

As far as adding write support for .sbn files, I'm a bit ambivalent for now. According to Joel's blog, there are still a few details that have to be solved, in order to understand all the subtelties of the algorithm that dispatches shapes into tree nodes, and ensure full interoperability with the software stack of the vendor at the origin of the .sbn format.

vendredi 18 mai 2012

A new GDAL virtual file system to read streamed data (e.g. for OGR WFS)

GDAL/OGR can of course read data from regular file systems, but also from more exotic sources thanks to a "virtual file system" API.

Let's start with the /vsizip/ virtual file system. If you have a ZIP file, myzip.zip, that contains a shapefile myshape.shp (and its associated .shx and .dbf files), you can read with :
ogrinfo -ro /vsizip/myzip.zip/myshape.shp
(or more simply /vsizip/myzip.zip as it is considered as a directory, and a directory is a valid datasource for the Shapefile driver).

If your data is located on a HTTP/FTP server, you can use the /vsicurl/ virtual file system, like this :
ogrinfo -ro /vsicurl/htttp://example.com/myshape.shp
As a bonus, you can combine both to read inside a remote ZIP file :
ogrinfo -ro /vsizip/vsicurl/htttp://example.com/myzip.zip/myshape.shp
Developers or advanced users can be interested by the /vsimem/, to use a memory buffer as a datasource, or /vsisubfile/ to access a file located instead another one.

/vsicurl/ is convenient to access static files and provides random access to data inside them provided that the web server supports "range downloading", i.e. the capability of returning data in a range of offsets.

Unfortunately, in some circumstances, the file is dynamically generated at the time you request it, so range downloading isn't supported. One such example in the scope of GDAL/OGR is the GML document generated by a WFS GetFeature request. Currently, the OGR WFS driver fetches the document as a whole with the CPLHTTPFetch() API, and passes the buffer to the GML driver (as a /vsimem/ file).

This behaviour has at least 2 drawbacks :
  • Even if you need to read one single feature, the driver will fetch the whole WFS GetFeature response, which can be long.
  • If the WFS GetFeature response is too long, it might not fit into memory at all.
It was possible to mitigate that by using the paging capability of some WFS servers, that is a non-standardized extension for WFS 1.0 or 1.1 (now normalized in WFS 2.0 spec).

In the GDAL/OGR trunk (2.0dev >= r24460), you can find a /vsicurl_streaming/ virtual file system that can be used to read data from a streaming server. This works efficiently only if the access pattern to the data is linear, and not random access. The OGR GML driver already natively parses data as a stream, so it can work nicely with /vsicurl_streaming/ :
ogrinfo -ro -al "/vsicurl_streaming/http://testing.deegree.org/deegree-wfs/services?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetFeature&TYPENAME=app:Springs"
Or, more simply, since the OGR WFS driver has been retrofitted to use it transparently :
ogrinfo -ro WFS:http://testing.deegree.org/deegree-wfs/services
app:Springs

mardi 31 janvier 2012

Welcome to the 200th GDAL/OGR driver !

A few hours ago, I merged into the GDAL/OGR source tree the ElasticSearch driver, which was contributed by Adam Estrada. According to the driver testing status page, it happens to be the 200th driver !

To be honest, a few of them have been retired over the years (mainly because being deprecated by rewritten versions), so even if you try to enable all possible drivers, you won't reach 200. But you can get close : my build currently includes 124 GDAL drivers and 60 OGR drivers.

This 200th driver is a bit particular, because it is a write-only driver, whereas 99% of drivers are generally read-only or read-write.

What about the other new drivers that have been committed in GDAL trunk since 1.9.0 release ?

As far as GDAL drivers are concerned, there is a driver to read MBTiles that Dane Springmeyer already blogged about.

More recently, new spreadsheet formats have also made their way into OGR. Namely the ODS driver and the XLSX driver that respectively handle files in the Open Document Spreadsheet format, used by applications like OpenOffice / LibreOffice and Office Open XML, generated by applications like Microsoft Office 2007 and later versions. If you are wondering about the fate of the older XLS format, a XLS driver is already included in GDAL/OGR 1.9.0, provided your GDAL build links against the FreeXL library written by Alessandro Furieri, the main author of libspatialite.

The ODS and XLSX drivers have very similar capabilities and source code, which is not surprising, because the technologies behind the 2 formats are the same : XML files in a ZIP container (if you don't believe me, you can just try renaming your .ods and .xlsx files into .zip, and open them with your favorite ZIP browser).

Writing the drivers was surprisingly simpler that I initially expected. In order to retrieve cell values, you just need to extract a few XML elements. From a developer point of view, the award of the most simple format to read goes to ODS with a nice separation between semantics and styling, and only one file (content.xml) to parse. XLSX is a bit more complicated to analyze because you have to read at least 4 different files (workbook.xml, sharedStrings.xml, styles.xml and a file for each sheet in the spreadsheet) and you need to understand some of the styling information to make the difference between a regular numeric value and a date.

Those drivers also support creating ODS and XLSX files. Caution: only raw values will be written. No fancy styling ! Update of existing files is also supported. But this uses the same serialization mechanism as the one used to create a new file, so be aware than existing formulas, charts, drawings, etc... will be lost.

Not detailed in the documentation page of the drivers, if you need some form of spatial support with those formats, you can combine them with the use of OGR VRT, in particular the GeometryField element, to be able to use column(s) of your spreadsheet as geometry columns.

The good news is that those 2 new drivers don't have any other dependency to third-party libraries than the Expat XML-parser library, that is also already used by many others drivers in GDAL and that most binary distributions of GDAL will link against. Typically, you will find them ready-to-use in Tamas Szekeres automated Windows daily builds (fetch the -development packages at the top of the first table to get builds corresponding to the latest GDAL trunk version).

Testing is highly encouraged, as well as reporting of issues you might run into.

In particular, interpreting spreadsheets that make use of formulas can be a tricky point. Depending on the application that writes the files, the result of the evaluation of formulas might or might not be written in the file. The ODS and XLSX drivers will use the result of the evaluation if available. In the case it is missing, I've plugged into the ODS driver a simple formula evaluator that can understand and evaluate a restricted set of functions (readers interested in the details will find the detailed list in the first enumeration of the ods_formula.h header file). Based on my testing, OpenOffice always writes the evaluation of formulas, whereas the OpenOffice export of Google Spreadsheet documents will not.

For now, there is nothing equivalent implemented in the XLSX driver, as I have not access to a sufficiently representative set of files, and it is not yet clear if it is a common practice or not to have non-evaluated formulas for that format. The good news is that, should the need arise, the first tests would tend to show that it should be possible to extend the ODS formula evaluator with just a few changes, so it can also understand XLSX formulas.

lundi 5 décembre 2011

Seamless access to remote Global Multi-resolution Terrain Elevation Data 2010 with GDAL

The USGS has recently released GMTED2010, a new dataset of elevation data, available in resolutions of 30, 15 and 7.5 arc-seconds. A look at the download page shows that the dataset is delivered as tiles whose dimensions are 30° of longitude x 20° of latitude.

There are also global grids available, but due to the fact they are contained in a .zip file, seeking to arbitrary offsets will undoubtedly be very slow. The content of the be75_grd.zip file can be obtained with the new gdal_ls.py sample script and a combination of /vsizip/ and /vsicurl/ virtual filesystem prefixes :

$ python swig/python/samples/gdal_ls.py -R /vsizip/vsicurl/http://igskmncngs506.cr.usgs.gov/gmted/Grid_ZipFiles/be75_grd.zip

Which shows that the .zip contains an Arc/Info Binary Grid :

$ gdalinfo /vsizip/vsicurl/http://igskmncngs506.cr.usgs.gov/gmted/Grid_ZipFiles/be75_grd.zip/be75_grd/hdr.adf -nofl -norat
Driver: AIG/Arc/Info Binary Grid
Files: /vsizip/vsicurl/http://igskmncngs506.cr.usgs.gov/gmted/Grid_ZipFiles/be75_grd.zip/be75_grd
Size is 172800, 67200
Coordinate System is:
GEOGCS["WGS 84",
DATUM["WGS_1984",
SPHEROID["WGS 84",6378137,298.257223563,
AUTHORITY["EPSG","7030"]],
TOWGS84[0,0,0,0,0,0,0],
AUTHORITY["EPSG","6326"]],
PRIMEM["Greenwich",0,
AUTHORITY["EPSG","8901"]],
UNIT["degree",0.0174532925199433,
AUTHORITY["EPSG","9108"]],
AUTHORITY["EPSG","4326"]]
Origin = (-180.000138888888898,83.999861111111116)
Pixel Size = (0.002083333333333,-0.002083333333333)
Corner Coordinates:
Upper Left  (-180.0001389,  83.9998611) (180d 0' 0.50"W, 83d59'59.50"N)
Lower Left  (-180.0001389, -56.0001389) (180d 0' 0.50"W, 56d 0' 0.50"S)
Upper Right ( 179.9998611,  83.9998611) (179d59'59.50"E, 83d59'59.50"N)
Lower Right ( 179.9998611, -56.0001389) (179d59'59.50"E, 56d 0' 0.50"S)
Center      (  -0.0001389,  13.9998611) (  0d 0' 0.50"W, 13d59'59.50"N)
Band 1 Block=256x4 Type=Int16, ColorInterp=Undefined
Description = Band_1
Min=-606.000 Max=8746.000
NoData Value=-32768
Metadata:
LAYER_TYPE=athematic


With the build_gmted.py Python script leveraging on GDAL Python and utilities, we create a GDAL virtual dataset that merges all the tiles into global datasets, so that we can access them seamlessly. Of course, we also take advantage of the multi-resolution nature of the dataset to use the tiles of resolutions 30 and 15 arc-seconds as overviews that can speed-up sub-sampling processing. The script currently uses the breakline emphasis layer of the GMTED dataset, but can easily be customized to use other layers (minimum, maximum, mean, medium).

For best performance, the script and the following operations must be used with the latest GDAL development version (1.9.0dev, >= r23468) to benefit from new optimizations.

$ python build_gmted.py

After a few minutes, the script has created a gmted subdirectory that contains several VRT files : one for each tile at each resolution, and 3 global ones : gmted/all075.vrt, gmted/all150.vrt and gmted/all300.vrt.

Let's have a look at the content of gmted/all075.vrt with the gdalinfo utility :

$ gdalinfo gmted/all075.vrt -nofl
Driver: VRT/Virtual Raster
Files: gmted/all075.vrt
Size is 172800, 86400
Coordinate System is:
GEOGCS["WGS 84",
    DATUM["WGS_1984",
        SPHEROID["WGS 84",6378137,298.257223563,
            AUTHORITY["EPSG","7030"]],
        AUTHORITY["EPSG","6326"]],
    PRIMEM["Greenwich",0],
    UNIT["degree",0.0174532925199433],
    AUTHORITY["EPSG","4326"]]
Origin = (-180.000138888888898,89.999861111111116)
Pixel Size = (0.002083333333333,-0.002083333333333)
Corner Coordinates:
Upper Left  (-180.0001389,  89.9998611) (180d 0' 0.50"W, 89d59'59.50"N)
Lower Left  (-180.0001389, -90.0001389) (180d 0' 0.50"W, 90d 0' 0.50"S)
Upper Right ( 179.9998611,  89.9998611) (179d59'59.50"E, 89d59'59.50"N)
Lower Right ( 179.9998611, -90.0001389) (179d59'59.50"E, 90d 0' 0.50"S)
Center      (  -0.0001389,  -0.0001389) (  0d 0' 0.50"W,  0d 0' 0.50"S)
Band 1 Block=128x128 Type=Int16, ColorInterp=Gray
  Min=-606.000 Max=8746.000
  NoData Value=-32768
  Overviews: 86400x43200, 43200x21600


The global dataset covers 172 200 x 86 400 elevation points, i.e. 29.8 GB of data. all075.vrt also references all150.vrt and all300.vrt as overviews. And we can verify that it is consistent with the global grid provided by USGS (except that our VRT extends more towards the poles).

With gdallocationinfo, we can query the elevation at one point :

$ gdallocationinfo gmted/all075.vrt -geoloc 2 49
Report:
  Location: (87360P,19679L)
  Band 1:
    <LocationInfo><File>gmted/30N000E_20101117_gmted_bln075.vrt</File></LocationInfo>
    Value: 183


Let's build a small overview of 2048x1024 pixels of the global dataset. (Note: this may need several dozen minutes to complete)

$ gdal_translate gmted/all075.vrt gmted/global_2048x1024.tif -outsize 2048 1024 --config CPL_VSIL_CURL_ALLOWED_EXTENSIONS ".tif" --config GTIFF_DIRECT_IO YES --debug on

Screen-shot of the overview of 2048x1024 pixels

(The white strip between Patagonia and Antarctica is filled with no-data values)

2 options (new in GDAL 1.9.0dev) have been used to speed up access :
  • The CPL_VSIL_CURL_ALLOWED_EXTENSIONS option is set to ".tif" so that only TIF files are queried from the server, and no other auxiliary files (.aux.xml, .ovr, .msk, etc..)
  • The GTIFF_DIRECT_IO option is set to YES so that the minimum number of bytes are fetched from the remote files. Partial TIFF strips can be read. It also takes advantage of the ability of the HTTP server to return multi-range of data at once, minimizing client-server round-trips. On the downside, it makes no use of the GDAL block cache mechanism, hence it is not an appropriate default choice for general purpose I/O. It is also restricted to simple configurations of TIFF files : single-band, untiled, uncompressed.
We can now edit gmted/all075.vrt with our favorite editor to take advantage of the local overview that we have just generated. We just have to insert the following snippet before the </VRTRasterBand> element at the end of the file :

    <Overview>
      <SourceFilename relativeToVRT="1">
global_2048x1024.tif</SourceFilename>
      <SourceBand>1</SourceBand>
    </Overview>

Let's open the global dataset with QGIS (linked against our development version of GDAL) with close to acceptable performances :

$ GTIFF_DIRECT_IO=YES CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".tif" CPL_DEBUG=ON qgis gmted/all075.vrt

Depending on the zoom level, either the local 2048x1024 dataset, or the remote 300, 15 and 7.5 arc-seconds datasets will be used.

For example, let's zoom in on Reunion island. We can easily distinguish its 3 magnificent cirques (the 3 dark areas near the center/west) and the Piton de la Fournaise volcano (white circle at the east) :

Screen-shot of Reunion Island in QGIS


We can extract a low-resolution window covering mainland France :

$ GTIFF_DIRECT_IO=YES CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".tif" gdal_translate gmted/all075.vrt france.tif -projwin -5 52 10 41 --debug on -outsize 10% 10%

Mainland France

On my PC, this command completes in about 20 seconds with GDAL 1.9.0dev, versus more than 3 minutes with GDAL 1.8.

If you don't want to do all the above steps, you can just use the following pre-generated set of .vrt files : 

$  ln -s /vsicurl/http://even.rouault.free.fr/gmted/all075.vrt remote_all075.vrt
$  CPL_DEBUG=ON GTIFF_DIRECT_IO=YES CPL_VSIL_CURL_ALLOWED_EXTENSIONS=".tif,.vrt" qgis remote_all075.vrt


Did you realize that you are now using a remote virtual dataset made of virtual datasets pointing to remote GeoTIFF tiles ;-) ?