Thursday, 16 January 2014

OGR OpenFileGDB driver

Last year, I blogged about the reverse-engineering of the ESRI file geodatabase format.

What's new?

That work, thanks to funding, has led to the writing of an OGR OpenFileGDB driver, now available in the GDAL/OGR trunk repository.

Since the initial phase of the reverse-engineering, significant advances have been made and documented:
  • .gdbtablx files (which store the offsets of features within the .gdbtable files) can use an optimization when feature ids are sparse (that is, when a run of 1024 consecutive feature ids does not exist)
  • .gdbindexes files contain the list of fields that have an attribute index, and the filename of that index (only the ArcGIS v10 geodatabase format is understood for now; the v9 format is different and more complicated)
  • .atx files are those attribute index files. The OpenFileGDB driver can use them to speed up simple WHERE clauses of SQL requests, or attribute filters (SetAttributeFilter())
  • .spx files are partially deciphered, but not yet to the point of being usable. They share exactly the same base structure as attribute indexes. For each feature (or group of features, in the non-final pages of an index of depth greater than one), the indexed value is an 8-byte structure that describes the spatial footprint. The first 4 bytes seem to relate to the y coordinate, and the next 4 bytes to the x coordinate. Rather simple, no? Except that the encoding of those bytes is still not fully understood (I think I've captured the logic for point geometries at the final page level, but non-final pages and non-point geometries are still mysterious). So, for now, we use the minimum bounding rectangle found at the beginning of the geometry blob to speed up spatial queries. And during the first full scan, we build an in-memory spatial index that is used for later spatial queries in the same session.
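
To make the last point more concrete, here is a minimal Python sketch of the general idea: filter on bounding rectangles during a first full scan, and fill an in-memory index as a side effect so that later queries avoid the scan. This is not the driver's actual code. The uniform grid and the names GridIndex, first_scan and intersects are assumptions chosen purely for illustration; the real driver may well use a different index structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

# Bounding rectangle as (minx, miny, maxx, maxy).
BBox = Tuple[float, float, float, float]


def intersects(a: BBox, b: BBox) -> bool:
    """Return True if the two rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class GridIndex:
    """Toy uniform-grid index (hypothetical, for illustration only)."""

    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.cells: Dict[Tuple[int, int], Set[int]] = defaultdict(set)
        self.boxes: Dict[int, BBox] = {}

    def _cells(self, box: BBox) -> Iterator[Tuple[int, int]]:
        # Enumerate every grid cell covered by the rectangle.
        x0 = int(box[0] // self.cell_size)
        y0 = int(box[1] // self.cell_size)
        x1 = int(box[2] // self.cell_size)
        y1 = int(box[3] // self.cell_size)
        for cx in range(x0, x1 + 1):
            for cy in range(y0, y1 + 1):
                yield (cx, cy)

    def insert(self, fid: int, box: BBox) -> None:
        self.boxes[fid] = box
        for cell in self._cells(box):
            self.cells[cell].add(fid)

    def query(self, box: BBox) -> List[int]:
        # Collect candidates from the covered cells, then test exactly.
        candidates: Set[int] = set()
        for cell in self._cells(box):
            candidates |= self.cells.get(cell, set())
        return sorted(f for f in candidates if intersects(self.boxes[f], box))


def first_scan(features: Iterable[Tuple[int, BBox]], query: BBox,
               index: GridIndex) -> List[int]:
    """Full scan: filter on the rectangle and populate the index on the way."""
    hits = []
    for fid, box in features:
        index.insert(fid, box)
        if intersects(box, query):
            hits.append(fid)
    return hits


# Made-up feature ids and rectangles, standing in for what a scan would read.
features = [(1, (0.0, 0.0, 1.0, 1.0)),
            (2, (5.0, 5.0, 6.0, 6.0)),
            (3, (0.5, 0.5, 5.5, 5.5))]
index = GridIndex(cell_size=2.0)
first_hits = first_scan(features, (0.0, 0.0, 0.75, 0.75), index)
later_hits = index.query((5.2, 5.2, 7.0, 7.0))
```

The first query pays for the full scan; the second one only visits the grid cells that the query rectangle covers.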

FileGDB vs OpenFileGDB

In its current state, comparing the OpenFileGDB driver with the FileGDB driver (using FileGDB API SDK v1.3) gives the following:

On the plus side:
  • Can read ArcGIS 9.X Geodatabases, and not only 10 or above.
  • Can open layers with any spatial reference system.
  • Thread-safe (i.e. datasources can be processed in parallel).
  • Uses the VSI Virtual File API, enabling the user to read a Geodatabase in a ZIP file or stored on an HTTP server.
  • Faster on databases with a large number of fields.
  • Does not depend on a third-party library. Available on any platform supported by GDAL/OGR.
  • Robust against corrupted Geodatabase files.
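
As an illustration of the VSI point, here is a small Python sketch that builds /vsizip/ and /vsicurl/ style paths and opens them explicitly with the OpenFileGDB driver. The helper names, the archive name, the URL and the test.gdb directory are all made up for the example. The open function needs the GDAL Python bindings, and whether the remote variant works will depend on your GDAL build (curl support) and on the server.

```python
def vsizip_path(zip_path: str, gdb_dir: str) -> str:
    """Path to a .gdb directory stored inside a local ZIP archive."""
    return "/vsizip/" + zip_path.rstrip("/") + "/" + gdb_dir.strip("/")


def remote_vsizip_path(zip_url: str, gdb_dir: str) -> str:
    """Path to a .gdb directory inside a ZIP archive served over HTTP."""
    # /vsizip/ is chained on top of /vsicurl/ to read the remote archive.
    return "/vsizip//vsicurl/" + zip_url + "/" + gdb_dir.strip("/")


def open_with_openfilegdb(path: str):
    """Open a datasource with the OpenFileGDB driver rather than FileGDB."""
    # Imported here so the path helpers above work without GDAL installed.
    from osgeo import ogr

    driver = ogr.GetDriverByName("OpenFileGDB")
    if driver is None:
        raise RuntimeError("OpenFileGDB driver not available in this build")
    # Returns None if the datasource cannot be opened.
    return driver.Open(path)


# Hypothetical file names and URL, for illustration only.
local_path = vsizip_path("data/archive.zip", "test.gdb")
remote_path = remote_vsizip_path("http://example.com/archive.zip", "test.gdb")
```

Going through the driver object is a simple way to make sure the OpenFileGDB driver is the one that handles the file when the FileGDB driver is also compiled in.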

On the minus side:
  • Read-only.
  • Cannot use the native spatial indexes (.spx files).

And now?

In the testing process, I discovered datasets that have layers compressed using a so-called Smart Data Compression, which is a completely different beast from standard GDB tables. Not sure how "smart" that compression is, but the end result is particularly cryptic. The only thing that can be recognized is the field names... Those .gdbtable.sdc files are not supported by the OpenFileGDB driver and, guess what, not by the FileGDB API either. So we are on (non-)feature parity...

I've encountered a few raster File Geodatabase datasets (apparently tiled), and a quick inspection of the tables leads me to believe that a raster driver would be doable.

That's all for now. Testers are appreciated, as usual! Windows users can, for example, download the builds tagged "-development" kindly provided by Tamas Szekeres on gisinternals.