Wednesday, October 9, 2013

FileGDB format reverse-engineered

For those who cannot wait, you can rush directly to the resulting specification. Caution: this is a work in progress!

Now, the introduction. FileGDB is the ESRI File Geodatabase format used natively by ArcGIS to store datasets in a file system directory.

Since GDAL/OGR 1.9, a FileGDB driver exists and provides read/update/creation support for FileGDB datasources, but it has a few limitations:
  • it relies on a free (as in beer) but closed-source library, the FileGDB API. Not ideal, philosophically or practically.
  • the FileGDB API can only open FileGDB datasources created by ArcGIS 10 or later, not the ones created by earlier versions.
  • the FileGDB API has some bugs that prevent it from opening valid FileGDB layers, e.g. if the SRS is a custom projection, and users can only wait for a vendor release that will eventually fix them.
  • it seems to have (unnecessary) quadratic performance as the number of fields grows, e.g. with some US Census datasources that have more than 1800 fields, which makes reading them very slow.
So I decided to open my favorite hexadecimal editor and take a closer look at what lies in the guts of the files found in a .gdb directory. From the extensions, it is obvious that the .gdbtable and .gdbtablx files should be the most interesting.

A .gdbtable file matches a layer / table, and contains the description of the fields (name, type, width, etc.), geospatial information (type of geometries, SRS, extent) as well as the content of the rows / features. This is the equivalent of the .shp and .dbf files of shapefiles. The .gdbtablx is an index that contains the offset of each row in the .gdbtable. This is the equivalent of the .shx file of shapefiles.

To my own surprise, the process of reverse engineering went rather fast. Generating very trivial layers with the FileGDB API, with small variations, and analyzing the differences helped a lot. Most datatypes can be guessed in an obvious way with a hexadecimal editor (little-endian int16/int32/IEEE 754 float64/UTF-16 strings). A few non-obvious technical details:
  • the use of a variable-length encoding for integers (which AFAICS is identical to Protocol Buffers base-128 varints), mostly for coordinates (and also to specify the length of strings). The first coordinate tuple of a geometry is encoded in "absolute" form (it must later be offset and scaled by constants described in the geometry field definition). All following coordinates are encoded as the difference with the previous coordinates. See the small C sketch after this list.
  • understanding how datetimes are encoded took me an astonishingly long time, compared to the outcome: it is just the number of days (possibly with decimals) since 1899-12-30 00:00:00, encoded as a float64.
  • how the flags indicating the absence or presence of field values work. The tricky part was to understand that only nullable fields are represented in the bit field.
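
To make the varint and datetime parts more concrete, here is a small self-contained C sketch (the helper names are mine, and it is not taken from the dump script mentioned below): it decodes an unsigned base-128 varint, under the assumption that the encoding matches Protocol Buffers varints, and converts a FileGDB datetime to Unix time.

/* Sketch of the two encodings described above (assumptions: Protocol
 * Buffers-style unsigned varints, datetime epoch of 1899-12-30). */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Decode an unsigned base-128 varint; *pbuf is advanced past the bytes read.
 * The low 7 bits of each byte carry data; the high bit means "more bytes follow". */
static uint64_t read_varuint(const uint8_t **pbuf)
{
    uint64_t val = 0;
    int shift = 0;
    const uint8_t *p = *pbuf;
    uint8_t byte;
    do
    {
        byte = *p++;
        val |= (uint64_t)(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    *pbuf = p;
    return val;
}

/* FileGDB datetime: float64 number of days (with decimals) since
 * 1899-12-30 00:00:00. 1970-01-01 is 25569 days after that epoch. */
static time_t filegdb_datetime_to_unix(double days)
{
    return (time_t)((days - 25569.0) * 86400.0);
}

int main(void)
{
    const uint8_t buf[] = { 0xAC, 0x02 };           /* varint encoding of 300 */
    const uint8_t *p = buf;
    printf("varint = %llu\n", (unsigned long long)read_varuint(&p));

    time_t t = filegdb_datetime_to_unix(41555.5);   /* 2013-10-08 12:00:00 UTC */
    printf("datetime = %s", asctime(gmtime(&t)));
    return 0;
}
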
I've developed a small Python script that dumps the content of a .gdbtable file (using its .gdbtablx companion file) and, up to now, it manages to successfully dump all the .gdbtable files I've tried, both technical tables (GDB_xxxxxx) and user tables (vector layers), including from GDB datasources that cannot be read by the FileGDB API (old v9.X datasources, or datasources with custom projections).

Here is an example run on a sample layer distributed with the FileGDB API (possibly only sexy to fans of command-line utilities):

$ python dump_gdbtable.py /home/even/FileGDB_API/samples/data/Shapes.gdb/a00000013.gdbtable
nfeaturesx = 233
nfeatures = 233
header_offset = 40
header_length = 627
layer_geom_type = 3
polyline
nfields = 6

nbcar = 8
name = OBJECTID
nbcar_alias = 0
alias =
type = 6 (objectid)
magic1 = 4
magic2 = 2
nullable = 0

nbcar = 5
name = Shape
nbcar_alias = 0
alias =
type = 7 (geometry)
magic1 = 0
magic2 = 7
wkt = GEOGCS["GCS_North_American_1983",DATUM["D_North_American_1983",SPHEROID["GRS_1980",6378137.0,298.257222101]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]]
magic3 = 7
xorigin = -400.000000000000000
yorigin = -400.000000000000000
xyscale = 11258999068426.238281250000000
zorigin = -10000.000000000000000
zscale = 10000.000000000000000
morigin = -10000.000000000000000
mscale = 10000.000000000000000
xytolerance = 0.000000008983153
ztolerance = 0.001000000000000
mtolerance = 0.001000000000000
xmin = -158.090073924418533
ymin = 21.277505248718398
xmax = -67.781176946816174
ymax = 62.145206966737987
magic4 = 3
25.6301473254
79.4534567087
246.305715797
nullable = 1

nbcar = 9
name = ROUTE_NUM
nbcar_alias = 0
alias =
type = 4 (string)
width = 8
flag = 5
magic = 0
nullable = 1

nbcar = 10
name = DIST_MILES
nbcar_alias = 0
alias =
type = 2 (float32)
width = 4
flag = 5
magic = 0
nullable = 1

nbcar = 7
name = DIST_KM
nbcar_alias = 0
alias =
type = 2 (float32)
width = 4
flag = 5
magic = 0
nullable = 1

nbcar = 12
name = Shape_Length
nbcar_alias = 0
alias =
type = 3 (float64)
width = 8
flag = 3
magic = 0
nullable = 1


FID = 1
feature_offset = 671
blob_len = 17880
flags = [224]
geom_len = 17856
geom_type = 3
polyline
nb_total_points: 1530
nb_geoms: 3
minx = -118.485959167123966
miny = 29.394035111730691
maxx = -81.682130704462821
maxy = 34.087306274650871
nb_points[0] = 35
nb_points[1] = 907
[1] -118.485959167123966 34.014739115964872
[2] -118.475133048781629 34.022692037809506
[...]
[34] -118.226133325056821 34.028554888549152
[35] -118.220945200338008 34.033902601472484

[1] -118.213980138991928 34.055149589953267
[2] -118.206975896201286 34.053569604619412
[...]
[906] -95.292373751724142 29.777536210780305
[907] -95.284004512870197 29.777996158193261

[1] -95.258536827702414 29.774727145411369
[2] -95.237242253618717 29.773808093496442
[...]
[587] -81.690798069370828 30.320550259572705
[588] -81.682130704462821 30.321330304058421

Field ROUTE_NUM : "I10"
Field DIST_MILES : 2449.120117
Field DIST_KM : 3941.479980
Field Shape_Length : 40.473531

[...]


Next steps?
  • A GDAL/OGR driver implementing this specification, without any third-party dependency, would be cool, wouldn't it? Probably read-only in the current state of knowledge. But it would likely solve all the limitations of the existing driver and offer the following benefits: free and open source, FileGDB v9.X compatibility, no SRS limitations. Hint: funding would be welcome to help develop it.
  • Understanding the meaning of some "magic" fields that might have practical importance.
  • I've encountered a datasource with raster data. The dump utility would need some extra love to deal with binary fields, which I did not investigate further. But perhaps raster FileGDB datasets could be read as well.
  • For brave people: deciphering the format of the other categories of files that have been left aside: .gdbindexes, .freelist, .TablesByName.atx, .CatItemsByPhysicalName.atx, .CatItemsByType.atx, .FDO_UUID.atx, .spx. Various indexes must be in there, some potentially needed for creation/update operations. This might be much more difficult to guess than the .gdbtable itself, if we take into account how difficult the reverse engineering of the shapefile .sbn spatial index was.

Sunday, August 25, 2013

A Linux sandbox for the benefit of GDAL/OGR binaries: seccomp_launcher

You probably already know that GDAL/OGR is a fantastic tool to read, convert and do other processing on raster and vector datasets.

But when you deal with more than 200 file formats, it is difficult (not to say impossible) to ensure that no defect exists in a code base of nearly 1 million lines (C and C++ files, empty and comment lines included), not to mention the sources of the many libraries, open source or sometimes closed source, that GDAL/OGR may depend on. Especially defects that are normally not triggered by correct datasets.

If you have to process data that can come from untrusted sources, you could find yourself in a situation where a hostile party submits a specially crafted dataset aimed at triggering a defect, with unfortunate consequences (e.g. arbitrary code execution, theft of data, ...). A page aimed at discussing security issues has recently been added on the Trac wiki to collect knowledge on that topic and provide a few recommendations (contributions from people who have deployed GDAL and wish to share the security measures they have taken are welcome).

I have recently discovered an interesting and very elegant security mechanism provided by the Linux kernel: seccomp. The principle of that mechanism is very simple to understand: once an executable (more exactly, a thread) has turned seccomp on, it can only run 4 (yes, four) system calls: read(), write(), exit() and sigreturn(). System calls are the interface between a user program (e.g. a GDAL/OGR utility) and the Linux kernel. From the names, you probably figured that read() and write() are used to... read and write files (regular files, but also pipes, network sockets). exit() is called at process termination, and sigreturn() is too obscure to be worth an explanation.

Reducing the number of system calls available to a binary considerably restricts what the user program can do, for good... or bad. In particular, once in seccomp mode ("strict seccomp", since there is also a relaxed and more customizable form of seccomp in newer Linux kernels), a program can no longer open files, create threads, initiate network connections, or even jump to an arbitrary position in a file (the seek() operation), etc.
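
To make that concrete, here is a minimal C program (a demonstration of the kernel mechanism itself, not part of seccomp_launcher; the file and message names are mine) that turns strict seccomp on and then attempts a forbidden system call:

/* strict_seccomp_demo.c: after the prctl() call, only read(), write(),
 * exit() and sigreturn() are allowed; any other system call causes the
 * kernel to kill the process with SIGKILL.
 * Build with: gcc -o strict_seccomp_demo strict_seccomp_demo.c */
#include <fcntl.h>
#include <linux/seccomp.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <unistd.h>

int main(void)
{
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
    {
        perror("prctl(PR_SET_SECCOMP)");
        return 1;
    }

    /* write() on an already-open file descriptor is still permitted... */
    write(STDOUT_FILENO, "in strict seccomp mode\n", 23);

    /* ... but open() is not: the process is killed right here by SIGKILL,
     * so the next line is never reached. */
    open("/etc/passwd", O_RDONLY);
    write(STDOUT_FILENO, "never printed\n", 14);
    return 0;
}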

I have recently started an experiment, seccomp_launcher, to use that mechanism to provide a sandbox for the benefit of GDAL/OGR utilities.

Using seccomp_launcher is very simple. It is just a matter of writing "seccomp_launcher", with an optional access mode, in front of the command. See the examples below:

$ ./seccomp_launcher gdalinfo some.tif -stats

$ ./seccomp_launcher -rw gdal_translate some.tif target.tif

$ ./seccomp_launcher -rw ogr2ogr -f filegdb out.gdb poly.gdb -progress

$ ./seccomp_launcher python swig/python/samples/gdalinfo.py some.tif

And now, a situation where it can prevent a confidential file (my private SSH key) from being accessed:

 $ ./seccomp_launcher gdal_translate hostile.vrt out.tif

INFO: in PR_SET_SECCOMP mode
AccCtrl: open(/home/even/.ssh/id_dsa,2,00) rejected. Not in white list
AccCtrl: open(/home/even/.ssh/id_dsa,0,00) rejected. Not in white list
ERROR 4: Unable to open /home/even/.ssh/id_dsa.
Permission denied
GDALOpen failed - 4
Unable to open /home/even/.ssh/id_dsa.
Permission denied

The software is made of two main parts:
  • the seccomp_launcher binary (source: seccomp_launcher.c), which, as implied by its name, launches the user binary and can run privileged system calls (opening a file, etc.) on its behalf, after having checked that they are authorized.
  • the libseccomp_preload.so dynamic library (source: seccomp_preload.c), which is "injected" into the user binary (e.g. gdalinfo) before it starts, to force it to run in seccomp mode and forward privileged system calls to seccomp_launcher, by overriding some interesting entry points of the GNU libc library (see the sketch just below).
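
To give an idea of what "overriding entry points" means, here is a deliberately simplified LD_PRELOAD library that intercepts the glibc open() call and just logs it. This is an illustration of the general technique only, with names of my own choosing: the actual seccomp_preload.c overrides many more entry points and forwards the requests to the launcher process instead of executing them directly.

/* preload_demo.c: a simplified illustration of the LD_PRELOAD technique
 * (not the actual seccomp_preload.c code).
 * Build with: gcc -shared -fPIC -o libpreload_demo.so preload_demo.c -ldl
 * Run with:   LD_PRELOAD=./libpreload_demo.so gdalinfo some.tif */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>

int open(const char *pathname, int flags, ...)
{
    /* Look up the real glibc open() the first time we are called. */
    static int (*real_open)(const char *, int, ...) = NULL;
    if (real_open == NULL)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");

    /* The optional third argument is only meaningful with O_CREAT. */
    mode_t mode = 0;
    if (flags & O_CREAT)
    {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }

    /* Here the real library would check a white list and/or delegate the
     * call to the seccomp_launcher process; we simply log and pass through. */
    fprintf(stderr, "intercepted open(%s)\n", pathname);
    return real_open(pathname, flags, mode);
}
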
You will find more technical details in the README file.

How to build it? (provided that you have a C compiler, gcc or clang, and make already installed)
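
Assuming the sources have been downloaded and unpacked into a seccomp_launcher/ directory (see the README for the details), building should boil down to:

$ cd seccomp_launcher
$ make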

 
What else to mention?
  • This is still alpha / experimental code made available under the "release early, release often" motto. In particular, no independent audit of the seccomp_launcher.c code has been made yet, which is the critical place where the security checks and delegated system calls are done. So if you intend to use it in production, please take some time to review it. Please also take time to read the README carefully, in particular the intended scope of the software (in short: do not use it to protect against hostile binaries, but only against hostile input data).
  • It is available under the same X/MIT licence as the GDAL/OGR sources.
  • Contributions (testing, bug reports, code contributions) are of course welcome!