mardi 31 janvier 2012

Welcome to the 200th GDAL/OGR driver !

A few hours ago, I merged into the GDAL/OGR source tree the ElasticSearch driver, which was contributed by Adam Estrada. According to the driver testing status page, it happens to be the 200th driver !

To be honest, a few of them have been retired over the years (mainly because being deprecated by rewritten versions), so even if you try to enable all possible drivers, you won't reach 200. But you can get close : my build currently includes 124 GDAL drivers and 60 OGR drivers.

This 200th driver is a bit particular, because it is a write-only driver, whereas 99% of drivers are generally read-only or read-write.

What about the other new drivers that have been committed in GDAL trunk since 1.9.0 release ?

As far as GDAL drivers are concerned, there is a driver to read MBTiles that Dane Springmeyer already blogged about.

More recently, new spreadsheet formats have also made their way into OGR. Namely the ODS driver and the XLSX driver that respectively handle files in the Open Document Spreadsheet format, used by applications like OpenOffice / LibreOffice and Office Open XML, generated by applications like Microsoft Office 2007 and later versions. If you are wondering about the fate of the older XLS format, a XLS driver is already included in GDAL/OGR 1.9.0, provided your GDAL build links against the FreeXL library written by Alessandro Furieri, the main author of libspatialite.

The ODS and XLSX drivers have very similar capabilities and source code, which is not surprising, because the technologies behind the 2 formats are the same : XML files in a ZIP container (if you don't believe me, you can just try renaming your .ods and .xlsx files into .zip, and open them with your favorite ZIP browser).

Writing the drivers was surprisingly simpler that I initially expected. In order to retrieve cell values, you just need to extract a few XML elements. From a developer point of view, the award of the most simple format to read goes to ODS with a nice separation between semantics and styling, and only one file (content.xml) to parse. XLSX is a bit more complicated to analyze because you have to read at least 4 different files (workbook.xml, sharedStrings.xml, styles.xml and a file for each sheet in the spreadsheet) and you need to understand some of the styling information to make the difference between a regular numeric value and a date.

Those drivers also support creating ODS and XLSX files. Caution: only raw values will be written. No fancy styling ! Update of existing files is also supported. But this uses the same serialization mechanism as the one used to create a new file, so be aware than existing formulas, charts, drawings, etc... will be lost.

Not detailed in the documentation page of the drivers, if you need some form of spatial support with those formats, you can combine them with the use of OGR VRT, in particular the GeometryField element, to be able to use column(s) of your spreadsheet as geometry columns.

The good news is that those 2 new drivers don't have any other dependency to third-party libraries than the Expat XML-parser library, that is also already used by many others drivers in GDAL and that most binary distributions of GDAL will link against. Typically, you will find them ready-to-use in Tamas Szekeres automated Windows daily builds (fetch the -development packages at the top of the first table to get builds corresponding to the latest GDAL trunk version).

Testing is highly encouraged, as well as reporting of issues you might run into.

In particular, interpreting spreadsheets that make use of formulas can be a tricky point. Depending on the application that writes the files, the result of the evaluation of formulas might or might not be written in the file. The ODS and XLSX drivers will use the result of the evaluation if available. In the case it is missing, I've plugged into the ODS driver a simple formula evaluator that can understand and evaluate a restricted set of functions (readers interested in the details will find the detailed list in the first enumeration of the ods_formula.h header file). Based on my testing, OpenOffice always writes the evaluation of formulas, whereas the OpenOffice export of Google Spreadsheet documents will not.

For now, there is nothing equivalent implemented in the XLSX driver, as I have not access to a sufficiently representative set of files, and it is not yet clear if it is a common practice or not to have non-evaluated formulas for that format. The good news is that, should the need arise, the first tests would tend to show that it should be possible to extend the ODS formula evaluator with just a few changes, so it can also understand XLSX formulas.