Friday 18 May 2012

A new GDAL virtual file system to read streamed data (e.g. for OGR WFS)

GDAL/OGR can of course read data from regular file systems, but also from more exotic sources thanks to a "virtual file system" API.

Let's start with the /vsizip/ virtual file system. If you have a ZIP file, myzip.zip, that contains a shapefile myshape.shp (and its associated .shx and .dbf files), you can read it with:
ogrinfo -ro /vsizip/myzip.zip/myshape.shp
(or more simply /vsizip/myzip.zip, since the archive is treated as a directory, and a directory is a valid datasource for the Shapefile driver).

If your data is located on an HTTP/FTP server, you can use the /vsicurl/ virtual file system, like this:
ogrinfo -ro /vsicurl/http://example.com/myshape.shp
As a bonus, you can combine both to read inside a remote ZIP file:
ogrinfo -ro /vsizip/vsicurl/http://example.com/myzip.zip/myshape.shp
Developers or advanced users may be interested in /vsimem/, which uses a memory buffer as a datasource, or /vsisubfile/, which accesses a file located inside another one.

/vsicurl/ is convenient for accessing static files and gives random access to the data inside them, as long as the web server supports "range downloading", i.e. the ability to return only the bytes in a requested range of offsets.
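
To make "range downloading" concrete, here is a minimal Python sketch of the underlying HTTP mechanism. It is not how GDAL implements /vsicurl/; it only shows the kind of request involved. The helper names range_request and supports_ranges are invented for this illustration, and the request is built but never sent.

```python
import urllib.request


def range_request(url, offset, length):
    """Build (without sending) a GET request for `length` bytes starting at `offset`."""
    last = offset + length - 1  # HTTP byte ranges are inclusive on both ends
    return urllib.request.Request(url, headers={"Range": f"bytes={offset}-{last}"})


def supports_ranges(status):
    """Interpret the HTTP status code of a reply to a range request.

    206 (Partial Content) means the server honoured the range.
    200 means it ignored the header and sent the whole resource.
    """
    return status == 206


# Ask only for the first 100 bytes of the file, i.e. the fixed-size
# header of a .shp file, instead of downloading all of it.
req = range_request("http://example.com/myshape.shp", 0, 100)
print(req.get_header("Range"))  # bytes=0-99
```

A server that generates its response on the fly typically answers such a request with the full document, which is the situation described next.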

Unfortunately, in some circumstances, the file is dynamically generated at the time you request it, so range downloading isn't supported. One such example in the scope of GDAL/OGR is the GML document generated by a WFS GetFeature request. Currently, the OGR WFS driver fetches the document as a whole with the CPLHTTPFetch() API, and passes the buffer to the GML driver (as a /vsimem/ file).

This behaviour has at least two drawbacks:
  • Even if you need to read only a single feature, the driver will fetch the whole WFS GetFeature response, which can be large.
  • If the WFS GetFeature response is too large, it might not fit into memory at all.
It was possible to mitigate this by using the paging capability of some WFS servers, which is a non-standard extension in WFS 1.0 and 1.1 (now standardized in the WFS 2.0 specification).
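
As a rough illustration of paging, the Python sketch below builds a series of GetFeature URLs that each ask for one page of features. It assumes a server that accepts the WFS 2.0 COUNT and STARTINDEX parameters; the endpoint http://example.com/wfs and the helper paged_getfeature_urls are placeholders invented for this example, and servers offering paging as an extension to WFS 1.0 or 1.1 may expect different parameter names.

```python
from urllib.parse import urlencode


def paged_getfeature_urls(base_url, type_name, page_size, pages):
    """Yield one GetFeature URL per page (assumes WFS 2.0 style paging)."""
    for page in range(pages):
        params = {
            "SERVICE": "WFS",
            "VERSION": "2.0.0",
            "REQUEST": "GetFeature",
            "TYPENAMES": type_name,
            "COUNT": page_size,               # features per page
            "STARTINDEX": page * page_size,   # offset of the first feature
        }
        yield f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint; three pages of 100 features each.
for url in paged_getfeature_urls("http://example.com/wfs", "app:Springs", 100, 3):
    print(url)
```

Each page is still fetched as a whole, so paging bounds the memory used per request but does not remove the need to download complete responses.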

In the GDAL/OGR trunk (2.0dev >= r24460), you can find a /vsicurl_streaming/ virtual file system that can be used to read data from a streaming server. This works efficiently only if the data is accessed linearly rather than randomly. The OGR GML driver already natively parses data as a stream, so it works nicely with /vsicurl_streaming/:
ogrinfo -ro -al "/vsicurl_streaming/http://testing.deegree.org/deegree-wfs/services?SERVICE=WFS&VERSION=1.1.0&REQUEST=GetFeature&TYPENAME=app:Springs"
Or, more simply, since the OGR WFS driver has been retrofitted to use it transparently:
ogrinfo -ro WFS:http://testing.deegree.org/deegree-wfs/services app:Springs
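
To show why a linear access pattern pairs well with a stream parser, here is a minimal Python sketch. It is not GDAL code: the sample document, its element names and the first_feature helper are all invented for the illustration, and an in-memory buffer stands in for the network stream. The parser is fed small chunks in order and stops as soon as the first feature is complete, so most of the response is never read.

```python
import io
import xml.etree.ElementTree as ET

# Made-up response with 100 features (no namespaces, to keep it short).
SAMPLE = (
    b"<FeatureCollection>"
    + b"".join(
        b"<featureMember><Spring><name>s%d</name></Spring></featureMember>" % i
        for i in range(100)
    )
    + b"</FeatureCollection>"
)


def first_feature(stream, tag, chunk_size=64):
    """Read `stream` sequentially; return (element, bytes_read) for the
    first complete element named `tag`, or (None, bytes_read) if absent."""
    parser = ET.XMLPullParser(events=("end",))
    bytes_read = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream, nothing found
            return None, bytes_read
        bytes_read += len(chunk)
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                return elem, bytes_read


feature, consumed = first_feature(io.BytesIO(SAMPLE), "Spring")
print(feature.find("name").text, consumed, len(SAMPLE))
```

The same loop cannot jump backwards or seek to an arbitrary offset, which is why this approach only pays off for readers that consume the data from start to end.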