Wednesday, 9 October 2013

FileGDB format reverse-engineered

For those who cannot wait, you can jump directly to the resulting specification. Caution: this is a work in progress!

Now, the introduction. FileGDB is the ESRI File Geodatabase format used natively by ArcGIS to store datasets in a file system directory.

Since version 1.9, GDAL/OGR has had a FileGDB driver that provides read, update and creation support for FileGDB datasources, but it has a few limitations:
  • it relies on a free (as in beer) but closed-source library, the FileGDB API. Not ideal, philosophically or practically.
  • the FileGDB API is limited to opening FileGDB datasources created by ArcGIS 10 or later, but not the ones created by earlier versions.
  • the FileGDB API has some bugs that prevent it from opening valid FileGDB layers, e.g. if the SRS is a custom projection, and users can only wait for the vendor to eventually release a version that fixes them.
  • it seems to have (unnecessarily) quadratic performance as the number of fields grows, e.g. with some US Census datasources that have more than 1800 fields, which makes reading them very slow.
So I decided to open my favorite hexadecimal editor and have a closer look at what is in the guts of the files found in a .gdb directory. From the extensions, it is obvious that the .gdbtable and .gdbtablx files should be the most interesting.

A .gdbtable file corresponds to a layer / table. It contains the description of the fields (name, type, width, etc.), the geospatial information (geometry type, SRS, extent) and the content of the rows / features. It is the equivalent of the .shp and .dbf files of a shapefile. The .gdbtablx file is an index that contains the offset of each row in the .gdbtable, much like the .shx file of a shapefile.

To my own surprise, the reverse engineering went rather fast. Generating very trivial layers with the FileGDB API, with small variations, and analyzing the differences helped a lot. Most datatypes can be guessed in an obvious way with a hexadecimal editor (little-endian int16/int32, IEEE 754 float64, UTF-16 strings). A few non-obvious technical details:
  • the use of a variable-length encoding for integers (which AFAICS is identical to Protocol Buffers base 128 varints), mostly for coordinates (and also to specify the length of strings). The first coordinate tuple of a geometry is encoded in "absolute" form (which must later be offset and scaled by constants described in the geometry field definition). All following coordinates are encoded as the difference from the previous coordinates.
  • understanding how datetimes are encoded took me an astonishingly long time, compared to the outcome: it is just the number of days (possibly with decimals) since 30 December 1899 00:00:00, encoded as a float64.
  • how the flags indicating the absence or presence of field values work. The tricky part was to understand that only nullable fields are represented in the bit field.
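
The first two details above can be illustrated with a short Python 3 sketch. It is not taken from the dump script: the function names are made up for illustration, the varint reader only handles unsigned values (how negative differences are encoded is not covered here), and the formula real = raw / scale + origin is an assumption about how the offset and scale constants of the geometry field definition are applied.

```python
import struct
from datetime import datetime, timedelta


def read_varuint(buf, pos=0):
    """Decode one unsigned base-128 varint (Protocol Buffers style).

    Each byte carries 7 payload bits, least significant group first;
    the high bit tells whether another byte follows.
    Returns (value, position of the next unread byte).
    """
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def decode_points(first, deltas, origin, scale):
    """Rebuild real-world coordinates from already decoded integers.

    first  : (x, y) integer tuple stored in "absolute" form
    deltas : list of (dx, dy) integer differences with the previous tuple
    Assumption (not stated in the post): real = raw / scale + origin.
    """
    xorigin, yorigin = origin
    x, y = first
    points = [(x / scale + xorigin, y / scale + yorigin)]
    for dx, dy in deltas:
        x += dx
        y += dy
        points.append((x / scale + xorigin, y / scale + yorigin))
    return points


def decode_datetime(raw8):
    """Decode 8 bytes holding a little-endian float64 number of days
    (possibly fractional) since 30 December 1899 00:00:00."""
    (days,) = struct.unpack("<d", raw8)
    return datetime(1899, 12, 30) + timedelta(days=days)
```

For instance, read_varuint(bytes([0xAC, 0x02])) yields the value 300 and consumes 2 bytes, which is the classic example from the Protocol Buffers documentation.
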
I've developed a small Python script that dumps the content of a .gdbtable file (using its .gdbtablx companion file). So far, it has successfully dumped all the .gdbtable files I've tried, both technical tables (GDB_xxxxxx) and user tables (vector layers), including ones from GDB datasources that cannot be read by the FileGDB API (old v9.X datasources, or datasources with custom projections).

Here is an example on a sample layer distributed with the FileGDB API (possibly only sexy in the eyes of command line fans):

$ python dump_gdbtable.py /home/even/FileGDB_API/samples/data/Shapes.gdb/a00000013.gdbtable
nfeaturesx = 233
nfeatures = 233
header_offset = 40
header_length = 627
layer_geom_type = 3
polyline
nfields = 6

nbcar = 8
name = OBJECTID
nbcar_alias = 0
alias =
type = 6 (objectid)
magic1 = 4
magic2 = 2
nullable = 0

nbcar = 5
name = Shape
nbcar_alias = 0
alias =
type = 7 (geometry)
magic1 = 0
magic2 = 7
wkt = GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]]
magic3 = 7
xorigin = -400.000000000000000
yorigin = -400.000000000000000
xyscale = 11258999068426.238281250000000
zorigin = -10000.000000000000000
zscale = 10000.000000000000000
morigin = -10000.000000000000000
mscale = 10000.000000000000000
xytolerance = 0.000000008983153
ztolerance = 0.001000000000000
mtolerance = 0.001000000000000
xmin = -158.090073924418533
ymin = 21.277505248718398
xmax = -67.781176946816174
ymax = 62.145206966737987
magic4 = 3
25.6301473254
79.4534567087
246.305715797
nullable = 1

nbcar = 9
name = ROUTE_NUM
nbcar_alias = 0
alias =
type = 4 (string)
width = 8
flag = 5
magic = 0
nullable = 1

nbcar = 10
name = DIST_MILES
nbcar_alias = 0
alias =
type = 2 (float32)
width = 4
flag = 5
magic = 0
nullable = 1

nbcar = 7
name = DIST_KM
nbcar_alias = 0
alias =
type = 2 (float32)
width = 4
flag = 5
magic = 0
nullable = 1

nbcar = 12
name = Shape_Length
nbcar_alias = 0
alias =
type = 3 (float64)
width = 8
flag = 3
magic = 0
nullable = 1


FID = 1
feature_offset = 671
blob_len = 17880
flags = [224]
geom_len = 17856
geom_type = 3
polyline
nb_total_points: 1530
nb_geoms: 3
minx = -118.485959167123966
miny = 29.394035111730691
maxx = -81.682130704462821
maxy = 34.087306274650871
nb_points[0] = 35
nb_points[1] = 907
[1] -118.485959167123966 34.014739115964872
[2] -118.475133048781629 34.022692037809506
[...]
[34] -118.226133325056821 34.028554888549152
[35] -118.220945200338008 34.033902601472484

[1] -118.213980138991928 34.055149589953267
[2] -118.206975896201286 34.053569604619412
[...]
[906] -95.292373751724142 29.777536210780305
[907] -95.284004512870197 29.777996158193261

[1] -95.258536827702414 29.774727145411369
[2] -95.237242253618717 29.773808093496442
[...]
[587] -81.690798069370828 30.320550259572705
[588] -81.682130704462821 30.321330304058421

Field ROUTE_NUM : "I10"
Field DIST_MILES : 2449.120117
Field DIST_KM : 3941.479980
Field Shape_Length : 40.473531

[...]
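
Two values in this dump can be cross-checked with a few lines of Python 3. This is only a sketch under assumptions that the post does not state: the helper name null_flags is hypothetical, bits are assumed to be assigned to nullable fields starting from the least significant bit, and a set bit is assumed to mean that the value is absent. Under those assumptions, flags = [224] (binary 11100000) leaves the five low bits clear, which is consistent with all five nullable fields having a value in the first feature. The sketch also checks that the point count of the third part, which the dump does not print, apparently follows from the total.

```python
def null_flags(flag_bytes, nullable_fields):
    """Return {field name: True if the value is flagged as absent}.

    Only nullable fields get a bit, in field order.
    Assumptions: least significant bit first, set bit means NULL.
    """
    result = {}
    for i, name in enumerate(nullable_fields):
        bit = (flag_bytes[i // 8] >> (i % 8)) & 1
        result[name] = bool(bit)
    return result


# Fields reported with "nullable = 1" in the dump; OBJECTID has
# "nullable = 0" and therefore gets no bit.
nullable = ["Shape", "ROUTE_NUM", "DIST_MILES", "DIST_KM", "Shape_Length"]
flags = null_flags(bytes([224]), nullable)
print(flags)

# nb_geoms is 3, but only two nb_points values are listed: the size of
# the last part seems to be implied by the total.
nb_total_points = 1530
nb_points = [35, 907]
last_part = nb_total_points - sum(nb_points)
print(last_part)
```

The computed size of the last part matches the index of the last point listed for the third part of the geometry.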


Next steps?
  • A GDAL/OGR driver implementing this specification, without any third-party dependency, would be cool, wouldn't it? It would probably be read-only in the current state of knowledge, but it would likely remove all the limitations of the existing driver and offer the following benefits: free and open source, FileGDB v9.X compatibility, no SRS limitations. Hint: funding would be welcome to help develop it.
  • Understanding the meaning of some "magic" fields that might have a practical importance.
  • I've encountered a datasource with raster data. The dump utility would need some extra love to deal with binary fields, which I did not investigate further. But perhaps raster FileGDB datasets could be read as well.
  • For brave people: deciphering the format of the other categories of files that have been left aside: .gdbindexes, .freelist, .TablesByName.atx, .CatItemsByPhysicalName.atx, .CatItemsByType.atx, .FDO_UUID.atx, .spx. Various indexes must be in there, some potentially needed for creation/update operations. This might be much more difficult to guess than the .gdbtable itself, considering how difficult it was to reverse-engineer the shapefile .sbn spatial index.