jeudi 26 juillet 2018

SRS barn raising: 2nd report

This is the second progress report of the GDAL SRS barn effort.

Since the previous report, a number of topics have been addressed:
- extension of the class hierarchy to implement BoundCRS (the generalization of the WKT1 TOWGS84 concept. This concept only exists in WKT 2 and has not been modeled in ISO-19111, so I went on my own modelling), TimeCRS, DerivedGeodeticCRS
- implementation of the exportToPROJ() method for CRS and related objects, and CoordinateOperation
- addition of all documentation needed at class and method level so that Doxygen passes without warnings
- implementation of CoordinateOperation::createOperation() method that can instanciate a transformation between two CRSs. For now, it does not use yet the operations of the EPSG database, but it can already take into account the hints of BoundCRS.
- implementation of a number of Transformations: geocentric translation, position vector transformation, coordinate frame rotation, NTv2,  GravityRelatedHeightToGeographic3D, VERTCON.
- start of mapping all GDAL currently supported projection methods. For now: UTM, Transverse Mercator, Transerve Mercator South Oriented, Two Point Equidistant, Tunisia Mapping Grid, Albers Conic Equal Area, Lambert Conic Conformal 1SP, New Zealand Map Grid. Several tens are obviously still missing.
- addition of a isEquivalentTo() method to compare the various objects.
- and of course, extensive unit testing of all the above work.

The result of this continued work can be followed in this pull request.

As a related effort, I've coordinated with the OGC CRS Working Group to provide my comments on the upcoming ISO:19168 / WKTv2 2018 revision.

lundi 18 juin 2018

The barn is raising

Thanks to the support given by the sponsors of the GDAL SRS barn effort, I have been able to kick in the first works in the past weeks. The work up to now has been concentrated on the PROJ front.

The first step was to set a foundation of C++ classes that implement the ISO-19111 / OGC Topic 2 "Referencing by coordinates" standard. Actually I have anticipated the future adoption of the 18-005r1 2018 revision of the standard that takes into account the latest advances in the modelling of coordinate reference systems (in particular dynamic reference frames, geoid-based vertical coordinate reference systems, etc.), which will be reflected in the corresponding update of the WKT2:2018 standard and future updates of the EPSG dataset. If you are curious, you can skim through the resulting PROJ class hierarchy which is really close to the abstract specification (a number of those classes currenty lack a real implementation for now). With the agreement of the newly born PROJ project steering committee, I have opted for C++11 which offers a number of useful modern features to reduce boilerplate and concentrate on the interesting aspects of the work.

On the functional front, there is already support to read WKT1 (its GDAL variant for now) and WKT2 strings and build a subset of the before mentionned C++ objects. And conversely to dump those C++ objects as WKT1 and WKT2 strings. In particular you can import from WKT1 and export to WKT2, or the reverse (within the limitations of each format). So this abstract modelling (quite close to WKT2 of course) effectively serves its purpose to help being independant from the actual representation of the CRS. As I mentionned an early adoption of the OGC Topic 2 standard, similarly I've taken into account the future WKT2:2018 (OGC 18-010) standard that aligns well with the abstract specification. In the API, the user can select if he wants to export according to the currently adopted version WKT2:2015 (OGC 12-063r5), or with the future WKT2:2018 revision.

The result of those first steps can be followed in this pull request.

Another task that has been accomplished is the addition of the Google Test C++ testing framework to PROJ (thanks to Mateusz Loskot for his help with the CMake integration), so all those new features can be correctly tested locally and on all platforms supported by PROJ continuous integration setup.

There are many future steps to do just on the PROJ front :
  • implement remaining classes
  • code documentation
  • comprehensive support of projection methods (at least the set currently supported by GDAL)
  • import from and export to PROJ strings for CRS definitions and coordinate operations
  • use of the EPSG dataset

vendredi 15 juin 2018

SQLite and POSIX advisory locks

Those past couple days, I was working on implementing multi-layer transaction support for GeoPackage datasources (for QGIS 3.4). Multi-layer transaction is an advanced functionality of QGIS (you have to enable it in project settings), initially implemented for PostgreSQL connections where several layers can be edited together so as to have atomic modifications when editing them. Modifications are automatically sent to the database, using SQL savepoints to implement undo/redo operations, instead of being queued in memory and committed at once when the user stops editing  the layer.

While debugging my work during development, I stumbled upon a heisenbug. From time to time, the two auxiliary files attached to a SQLite database opened in Write Ahead Logging (WAL) mode, suffixed -wal and -shm, would suddenly disappear, whereas the file was still being opened by QGIS. As those files are absolutely required, the consequence of this was that following operations on the database failed: new readers (in the QGIS process) would be denied opening the file, and QGIS could not commit any new change to it. When the file was closed, the file returned again in a proper state (which shows the robustness of sqlite). After some time, I realized that my issue arised exactly when I observed the database being edited by QGIS with an external ogrinfo on it (another way to reproduce the issue would be to open a second QGIS instance on the same file and close it). I indeed used ogrinfo to check that the state of the database was consistent during the editing operations. Okay, so instead of a random bug, I had now a perfectly reproducable bug. Half of the way to solve it, right ?

How come ogrinfo, which involves read-only operations, could cause those -wal and -shm files to disappear ? I had some recollection of code I had written in the OGR SQLite driver regarding this. When a dataset opened in read-only mode is closed by OGR, it checks if there's a -wal file still existing (which could happen if a database had not been cleanly closed, like a killed process), and if so, it re-opens it temporarily in update mode, does a dummy operation on it, and close it. If the ogrinfo process is the only one that had a connection on the database, libsqlite would remove the -wal and -shm files automatically (OGR does not directly remove the file, it relies on libsqlite wisdom to determine if they can be removed or not). But wait, in my above situation, ogrinfo was not the exclusive process operating on the database: QGIS was still editing it.... Would that be a bug in the venerable libsqlite ??? (spoiler: no)

I tried to reproduce the situation with replacing QGIS by a plain sqlite console opening the file, and doing a ogrinfo on it. No accidental removal of the -wal and -shm files. Okay, so what is the difference between QGIS and the sqlite console (beside QGIS having like one million extra line of code;-)). Well, QGIS doesn't directly use libsqlite3 to open GeoPackage databases, but uses the OGR GPKG driver. So instead of opening with QGIS or a sqlite3 console, what if I opened with the OGR GPKG driver ? Bingo, in that situation, I could also reproduce the issue. So something in OGR was the culprit. I will save you of the other details, but at the end it turned out that if OGR was opening itself a .gpkg file using standard file API, whereas libsqlite3 was opening it, chaos would result. This situation can happen since for example when opening a dataset, OGR has to open the underlying file to at least read its header and figure out which driver would handle it. So the sequence of operation is normally:
1) the GDALOpenInfo class opens the file
2) the OGR GeoPackage driver realizes this file is for it, and use the sqlite3_open() API to open it
3) the GDALOpenInfo class closes the file it has opened in step 1 (libsqlite3 still manages its own file handle)

When modifying the above sequence, so that 3) is executed before 2), the bug would not appear. At that point, I had some recollection that sqlite3 used POSIX advisory locks to handle concurrent accesses, and that there were some issues with that POSIX API. Digging into the sqlite3.c source code revealed a very interesting 86 line long comment about how POSIX advisory locks are broken by design. The main brokenness are they are advisory and not compulsory of course, but as this is indicated in the name, one cannot really complain about that being a hidden feature. The most interesting finding was: """If you close a file descriptor that points to a file that has locks, all locks on that file that are owned by the current process are released.""" Bingo: that was just what OGR was doing.
My above workaround (to make sure the file is closed before sqlite opens it and set its locks) was OK for a single opening of a file in a process. But what if the user wants to open a second connection on the same file (which arises easily in the QGIS context) ? The rather ugly solution I came off was that the OGR GPKG driver would warn the GDALOpenInfo not to try to open a given file while it was still opened by the driver and pass it the file header it would be supposed to find if it could open the file, so that the driver identification logic can still work. Those fixes are queued for GDAL 2.3.1, whose release candidate is planned next Friday.

  • never ever open (actually close) a SQLite database with regular file API while libsqlite3 is operating on it (in the same process)
  • POSIX advisory locks are awful.

jeudi 12 octobre 2017

Optimizing JPEG2000 decoding

Over this summer I have spent 40 days (*) in the guts of the OpenJPEG open-source library (BSD 2-clause licensed) optimizing the decoding speed and memory consumption. The result of this work is now available in the OpenJPEG 2.3.0 release.

For those who are not familiar with JPEG-2000, and they have a lot of excuse given its complexity, this is a standard for image compression, that supports lossless and lossy methods. It uses discrete wavelet transform for multi-resolution analysis, and a context-driven binary arithmetic coder for encoding of bit plane coefficients. When you go into the depths of the format, what is striking is the number of independent variables that can be tuned:

- use of tiling or not, and tile dimensions
- number of resolutions
- number of quality layers
- code-block dimensions
- 6 independent options regarding how code-blocks are encoded (code-block styles): use of Selective arithmetic coding bypass, use of Reset context probabilities on coding pass boundaries, use of Termination on each coding pass, use of Vertically stripe causal context, use of Predictable termination, use of Segmentation Symbols. Some can bring decoding speed advantages (notably selective arithmetic coding bypass), at the price of less compression efficiency. Others might help hardware based implementations. Others can help detecting corruption in the codestream (predictable termination)
- spatial partition of code-blocks into so-called precincts, whose dimension may vary per resolution
- progression order, ie the criterion to decide how packets are ordered, which is a permutation of the 4 variables: Precincts, Component, Resolution, Layer. The standard allows for 5 different permutations. To add extra fun, the progression order might be configured to change several times among the 5 possible (something I haven't yet had the opportunity to really understand)
- division of packets into tile-parts
- use of multi-component transform or not
- choice of lossless or lossy wavelet transforms
- use of start of packet / end of packet markers
- use of  Region Of Interest, to have higher quality in some areas
- choice of image origin and tiling origins with respect to a reference grid (the image and tile origin are not necessarily pixel (0,0))

And if that was not enough, some/most of those parameters may vary per-tile! If you already found that TIFF/GeoTIFF had too many parameters to tune (tiling or not, pixel or band interleaving, compression method), JPEG-2000 is probably one or two orders of magnitude more complex. JPEG-2000 is truly a technological and mathematical jewel. But needless to say that having a compliant JPEG-2000 encoder/decoder, which OpenJPEG is (it is an official reference implementation of the standard) is already something complex. Having it perform optimally is yet another target.

Previously to that latest optimization round, I had already worked at enabling multi-threaded decoding at the code-block level, since they can be decoded independently (once you've re-assembled from the code-stream the bytes that encode a code-block), and in the inverse wavelet transform as well (during the horizontal pass, resp vertical pass, rows, resp columns, can be transformed independently). But the single-thread use had yet to be improved. Roughly, around 80 to 90% of the time during JPEG-2000 decoding is spent in the context-driven binary arithmetic decoder, around 10% in the inverse wavelet transform and the rest in other operations such as multi-component transform. I managed to get around 10% improvement in the global decompression time by porting to the decoder an optimization that had been proposed by Carl Hetherington for the encoding side, in the code that determines which bit of wavelet transformed coefficient must be encoded during which coding pass. The trick here was to reduce the memory needed for the context flags, so as to decrease the pressure on the CPU cache. Other optimizations in that area have consisted in making sure that some critical variables are kept preferably in CPU registers rather than in memory. I've spent a good deal of time looking at the disassembly of the compiled code.
I've also optimized the reversible (lossless) inverse transform to use the Intel SSE2 (or AVX2) instruction sets to be able to process several rows, which can result up to 3x speed-up for that stage (so a global 3% improvement)

I've also worked on reducing the memory consumption needed to decode images, by removing the use of intermediate buffers when possible. The result is that the amount of memory needed to do full-image decoding was reduced by 2.4.

Another major work direction was to optimize speed and memory consumption for sub-window decoding. Up to now, the minimal unit of decompression was a tile. Which is OK for tiles of reasonable dimensions (let's say 1024x1024 pixels), but definitely not on images that don't use tiling, and that hardly fit into memory. In particular, OpenJPEG couldn't open images of more than 4 billion pixels. The work has consisted in 3 steps :
- identifying which precincts and code-blocks are needed for the reconstruction of a spatial region
- optimize the inverse wavelet transform to operate only on rows and columns needed
- reducing the allocation of buffers to the amount strictly needed for the subwindow of interest
The overall result is that the decoding time and memory consumption are now roughly proportional to the size of the subwindow to decode, whereas they were previously constant. For example decoding 256x256 pixels in a 13498x9944x3 bands image takes now only 190 ms, versus about 40 seconds before.

As a side activity, I've also fixed 2 different annoying bugs that could cause lossless encoding to not be lossless for some combinations of tile sizes and number of resolutions, or when some code-block style options were used.

I've just updated the GDAL OpenJPEG driver (in GDAL trunk) to be more efficient when dealing with untiled JPEG-2000 images.

There are many more things that could be done in the OpenJPEG library :
- port a number of optimizations on the encoding side: multi-threadig, discrete wavelet transform optimizations, etc...
- on the decoding side, reduce again the memory consumption, particularly in the untiled case. Currently we need to ingest into memory the whole codestream for a tile (so the whole compressed file, on a untiled image)
- linked to the above, use of TLM and PLT marker segments (kind of indexes to tiles and packets)
- on the decoding side, investigate further improvements for the code specific of irreversible / lossy compression
- make the opj_decompress utility do a better use of the API and consume less memory. Currently it decodes a full image into memory instead of proceeding by chunks (you won't have this issue if using gdal_translate)
- investigate how using GPGPU capabilities (CUDA or OpenCL) could help reduce the time spent in context-driven binary arithmetic decoder.

Contact me if you are interested in some of those items (or others !)

(*) funding provided by academic institutions and archival organizations, namely
… And logistic support from the International Image Interoperability Framework (IIIF), the Council on Library and Information Resources (CLIR), intoPIX, and of course the Image and Signal Processing Group (ISPGroup) from University of Louvain (UCL, Belgium) hosting the OpenJPEG project.

mercredi 11 octobre 2017

GDAL and cloud storage

In the past weeks, a number of improvements related to access to cloud storage have been committed to GDAL trunk (future GDAL 2.3.0)

Cloud based virtual file systems

There was already support to access private data in Amazon S3 buckets through the /vsis3/ virtual file system (VFS). Besides a few robustness fixes, a few new capabilities have been added, like creation and deletion of directories inside a bucket with VSIMkdir() / VSIRmdir(). The authentication methods have also been extended to support, beyond the AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID environment variables, the other ways accepted by the "aws" command line utilities, that is to say storing credentials in the ~/.aws/credentials or ~/.aws/config files. If GDAL is executed since a Amazon EC2 instance that has been assigned rights to buckets, GDAL will automatically fetch the instance profile credentials.

The existing read-only /vsigs/ VFS for Google Cloud Storage as being extended with write capabilities (creation of new files), to be on feature parity with /vsis3/. The authentication methods have also been extended to support OAuth2 authentication with a refresh token, or with service account authentication. The credentials can be stored in a ~/.boto configuration file. And when run from a Google Compute Engine virtual machine, GDAL will automatically fetch the instance profile credentials.

Two new VFS have also been added, /vsiaz/ for Microsoft Azure Blobs and /vsioss/ for Alibaba Cloud Object Storage Service. They support read and write operations similarly to the two previously mentioned VFS.

To make file and directory management easy, a number of Python sample scripts have been created or improved: my.tif /vsis3/mybucket/raster/ -r /vsis3/mybucket/raster /vsigs/somebucket -lr /vsis3/mybucket /vsis3/mybucket/raster/my.tif /vsis3/mybucket/newdir -r /vsis3/mybucket/newdir

Cloud Optimized GeoTIFF

Over the last past few months, there has been adoption by various actors of the cloud optimized formulation of GeoTIFF files, which enables clients to efficiently open and access portions of a GeoTIFF file available through HTTP GET range requests.

Source code for a online service that offers validation of cloud optimized GeoTIFF (using GDAL and the script underneath) and can run as a AWS Lambda function is available. Note: as the current definition of what is or is not a cloud optimized formulation has been uniteraly decided up to now, it cannot be excluded that it might be changed on some points (for example relaxing constraints on the ordering of the data of each overview level, or enforcing that tiles are ordered in a top-to-bottom left-to-right way)

GDAL trunk has received improvements to speed up access to sub windows of a GeoTIFF file by making sure that the tiles that participate to a sub-window of interest are requested in parallel (this is true for public files accessed through /vsicurl/ or with the four above mentioned specialized cloud VFS), by reducing the amount of data fetched to the strict minimum and merging requests for consecutive ranges. In some environments, particularly when accessing to the storage service of a virtual machine of the same provider, HTTP/2 can be used by setting the GDAL_HTTP_VERSION=2 configuration option (provided you have a libcurl recent enough and built against nghttp2). In that case, HTTP/2 multiplexing will be used to request and retrieve data on the same HTTP connection (saving time to establish TLS for example). Otherwise, GDAL will default to several parallel HTTP/1.1 connections. For long lived processes, efforts have been made to re-use as much as possible existing HTTP connections.

samedi 11 mars 2017

Dealing with huge vector GeoPackage databases in GDAL/OGR

Recently, I've fixed a bug in the OGR OpenFileGDB driver, the driver made from the reverse engineering the ESRI FileGeoDatabase format, so as to be able to read tables whose section that enumerates and describes fields is located beyond the first 4 GB of the file. This table from the 2016 TIGER database is indeed featuring all linear edges of the USA and is 15 GB large (feature and spatial indexes included), with 85 million features.

Some time before, I had to deal with a smaller database - 1.7 GB as GeoPackage - with 5.4 million polygons (bounding box) from the cadastre of an Italian province. One issue I noticed is that when you want to get the summary of the layer, with ogrinfo -al -so the.gpkg, it was very slow. The reason is that this summary includes the feature count, and there's no way to get this metadata quickly, apart from running the "SELECT COUNT(*) FROM the_table" request, which causes a full scan of the table. For small databases, this runs fast, but when going into the gigabyte realm, this can take several dozains of seconds. But getting the spatial extent of the layer, which is one of the other information displayed by the summary mode of ogrinfo, is fast since the gpkg_contents "system" table of a GeoPackage database includes the bounding box of the table. So my idea was to extend the definition of the gpkg_contents table to add a new column, ogr_feature_count, to store the feature count. I went to implement that, and it worked fine. The synchronization of the value of ogr_feature_count after edits can be done with 2 SQLite triggers, on row insertion and deletion, and that  works with implementations that are not aware of the existence of this new column. Like older OGR versions. Unfortunately it appears that at least one other implementation completely rejected such databases. There is some inconsistency in the GeoPackage specification if additional columns are accepted or not in system tables. From the /base/core/contents/data/table_def test case, "Column order, check constraint and trigger definitions, and other column definitions in the returned sql are irrelevant.", it would seem that additional columns should still be considered as a valid GeoPackage. Anyway, that's only the theory and we don't want to break interoperability for just a nice-to-have feature... So I went to change the design a bit and created a new table, gpkg_ogr_contents, with a table_name and feature_count columns. I'm aware that I should not borrow the gpkg_ prefix, but I felt it was safer to do so since other implementations will probably ignore any unknown gpkg_ prefixed table. And the addition of the ogr_ prefix makes collisions with future extension of the GeoPackage specification unlikely. The content of this table is also maintained in synchronization with the data table thanks to two triggers, and this makes the other software that rejected my first attempt happy. Problem solved.

Let's come back to our 13 GB FileGeoDatabase. My first attempt to convert is to GeoPackage with ogr2ogr resulted in converting the features in about half an hour, but once this 100% stage reached, the finalization, which includes building the spatial index took ages. So long, that after a whole night it wasn't yet finished and seriously making the computer non responsive due to massive I/O activity. In the GeoPackage driver, the spatial index is indeed created after feature insertion, so that the feature table and spatial index tables are well separated in the file, and from previous experiment with the Spatialite driver, it proved to be the right strategy. Populating the SQLite R-Tree is done with a simple statement: INSERT INTO my_rtree SELECT fid, ST_MinX(geom), ST_MaxX(geom), ST_MinY(geom), ST_MaxY(geom) FROM the_table. Analyzing what happens in the SQLite code is not easy when you are not familiar with that code base, but my intuition is that there was constant back and forth between the geometry data area and the RTree area in the file, making the SQLite page cache inefficient. So I decided to experiment with a more progressive approach. Let's iterate over the feature table and collect the fid, minx, max, miny, maxy by chunks of 100 000 rows, and the insert those 100 000 bounding boxes into the R-Tree, and loop again unil we have completely read the feature table. With such a strategy, the spatial index can now be built in 4h30. The resulting GeoPackage file weights 31.6 GB, so twice at large than the FileGeoDatabase. One of the reasons for the difference must be due to geometries in FileGeoDatabase being compressed (quantization for coordinate precision, delta encoding and use of variable integer) whereas GeoPackage uses a uncompressed SQLite BLOB based on OGC WKB.
My first attempt at opening it in QGIS resulted in the UI to be frozen, probably for hours. The reason is that QGIS always issues a spatial filter, even when requesting on a area of interest that is at least as large as the extent of the layer, where there is no performance gain to expect from using it. So the first optimization was in the OGR GeoPackage to detect that situation and to not translate the OGR spatial filter as SQLite R-Tree filter. QGIS could now open the database and progressively displays the features. Unfortunately when zooming in, the UI became frozen again. When applying a spatial filter, the GeoPackage driver created a SQL request like the following one:

SELECT * FROM the_table WHERE fid IN 
       (SELECT id FROM the_rtree WHERE 
        xmin <= bbox_xmax AND xmax >= bbox_xmin AND
        ymin <= bboy_ymay AND ymay >= bboy_ymin)

It turns out that the sub-select (the one that fetches the feature ID from the spatial index) is apparently entirely run before the outer select (the one that returns geometry and attributes) starts being evaluated. This way of expressing the spatial filter came from the Spatialite driver (since GeoPackage and Spatialite use the exact same mechanisms for spatial indexing), itself based on examples from an old Spatialite tutorial. For not too big databases, this runs well. After some experiment, it turns out that doing a JOIN between the feature table and the RTree virtual table makes it possible to have a non blocking request:

SELECT * FROM the_table t JOIN the_rtree r ON t.fid =
WHERE r.xmin <= bbox_xmax AND r.xmax >= bbox_xmin AND
      r.ymin <= bboy_ymax AND r.ymax >= bboy_ymin

Now QGIS is completely responsive, although I find that even on high zoom levels, the performance is somehow disappointing, ie features appear rather slowly. There seems to be some threshold effect on the size of the database, since the performance is rather good on the Italian province cadastral use case.

Another experiment showed that increasing the SQLite page size from 1024 bytes (the default in SQLite 3.11 or earlier) to 4096 bytes (the default since SQLite 3.12) decreases the database size to 28.8 GB. This new page size of 4096 bytes is now used by default by the OGR SQLite and GPKG drivers (unless OGR_SQLITE_PRAGMA=page_size=xxxx is specified as a configuration option).

I also discovered that increasing the SQLite page cache from its 2 MB default to 2 GB (with --config OGR_SQLITE_CACHE 2000) significantly improved the time to build the spatial index, decreasing the total conversion time from 4h30 to 2h10. 2GB is just a value selected at random. It might be too large or perhaps a larger value would help further.

All improvements mentionned above (faster spatial index creation, better use of spatial index and change of default page size) are now in GDAL trunk, and will be available in the upcoming 2.2.0 release.

lundi 19 septembre 2016

Running FreeBSD in Travis-CI

Note for geospatial focused readers: this article has little to do with geo, although it is applied to GDAL, but more with software virtualization, hacks, software archeology and the value of free software. Note for virtualization experts: I'm not one, so please bear with my approximate language and inaccuracies.

Travis-CI is a popular continuous integration platform, that can be easily used with software projects hosted at GitHub. Travis-CI has a free offer for software having public repository at GitHub. Travis-CI provides cloud instances running Linux or Mac OS X. To increase portability tests of GDAL, I wondered if it was somehow possible to run another operating system with Travis-CI, for example FreeBSD. A search lead me to this question in their bug tracker but the outcome seems to be that it is not possible, nor in their medium or long term plans.

One idea that came quickly to mind was to use the QEMU machine emulator that can simulate full machines (CPU, peripherals, memory, etc), of several hardware architectures (Intel x86, ARM, MIPS, SPARC, PowerPC, etc..). To run QEMU, you mostly need to have a virtual hard drive, i.e. a file that replicates the content of the hard disk of the virtual machine you want to run. I found here a small ready-to-use x86_64 image of FreeBSD 9.2, with one nice property: the ssh server and DHCP are automatically started, making it possible to remote connect to it.

So starting with a Travis-CI Ubuntu Trusty (14.04) image, here are the step to launch our FreeBSD guest:

sudo apt-get install qemu
tar xJvf freebsd.9-2.x86-64.20140103.raw.img.txz
qemu-system-x86_64 -daemonize -display none \
   freebsd.9-2.x86-64.20140103.raw.img \
   -m 1536 -smp 4 -net user,hostfwd=tcp::10022-:22 -net nic

The qemu invokation starts the virtual machine as a daemon without display, turn on networking and asks for the guest (ie FreeBSD) TCP port 22 (the ssh port) to be accessible by the host (Linux Trusty) as port 10022

To ssh into the VM, there's one slight inconvenience: ssh login requires a password. The root password for this VM is "password". But ssh is secured and doesn't accept the password to be provided through files or piped in with "echo". I found that the sshpass utility was designed to overcome this in situations where security isn't really what matters. However, the version of sshpass bundled with Ubuntu Trusty didn't work with the corresponding ssh version (not surprisingly since the authors of sshpass mention that it is full of assumptions about how ssh works, that can be easily breaks with changes of ssh). I found that the latest version 1.0.6 worked however.

With 4 extra lines, we can now login into our FreeBSD instance:

tar xzf sshpass-1.06.tar.gz
cd sshpass-1.06 && ./configure && make -j3 && cd ..
export MYSSH="sshpass-1.06/sshpass -p password ssh \
   -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no \
    root@localhost -p10022" 

So now we can configure a bit our FreeBSD VM to install with the 'pkg' package manager a few dependencies to build GDAL:

$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg bootstrap'
$MYSSH 'mkdir /etc/pkg'
sshpass-1.06/sshpass -p password scp \
   -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no \
   -P 10022 FreeBSD.conf root@localhost:/etc/pkg/FreeBSD.conf
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install gmake'
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install python27'
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install py27-numpy'
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install sqlite3 curl'
$MYSSH 'env ASSUME_ALWAYS_YES=YES pkg install expat'
Here we go: ./configure && make ! That works, but 50 minutes later (the maximum length of a Travis-CI job), our job is killed with perhaps only 10% of the GDAL code base being compiled. The reason is that we used the pure software emulation mode of QEMU that involves on-the-fly disassembling of the code to be run and re-assembling. QEMU can for example emulate a ARM guest on a Intel host, and vice-versa, and there's no really shortcuts when the guest and host architectures are the same. So your guest can typically run 10 times slower than it would on a real machine with its native architecture. Actually, that's not true, since with the addition of CPU instructions dedicated to virtualization (VT-x for Intel, AMD-V for AMD), an hypervisor called KVM (Kernel Virtual Machine) was added to the Linux kernel, and QEMU can use KVM to implement the above mentioned shortcuts to reach near bare-metal performance. It just takes to use 'kvm' instead of 'qemu-system-x86_64'. Let's do that ! Sigh, our attempt fails miserably with a "failed to initialize KVM" error message. If we display the content of /proc/cpuinfo, we get:

flags  : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc
rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1
sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm
fsgsbase bmi1 avx2 smep bmi2 erms xsaveopt

A lot of nice to have things, but the important thing to notice is the absence of the 'vmx' (Intel virtualization instruction set) and 'svm' (similar for AMD) flags. So this machine has no hardware virtualization capabilities ! Or more precisely, this *virtual* machine has no such capabilities. The documentation of the Trusty Travis-CI environment mentionned they are based on Google Computing Engine as the hypervisor, and apparently it does not allow (or is not configured to allow) nested virtualization, despite GCE being based on KVM, and KVM potentially allowing nested virtualization. GCE allows Docker to run inside VM, but Docker only runs Linux "guests". So it seems we are really stuck.

Here comes the time for good old memories and a bit of software archeology. QEMU was started by Fabrice Bellard. If you didn't know his name yet, F. Bellard created FFMPEG and QEMU, holds a world record for the number of decimals of Pi computed on a COTS PC, has ported QEMU in JavaScript to run the Linux kernel in your browser, devised BPG, a new compression based on HEVC, etc....

At the time where his interest was focused on QEMU, he created KQemu, a kernel module (for Linux, Windows, FreeBSD hosts), that could significantly enhance QEMU performance when the guest and hosts are x86/x86_64. KQemu requires QEMU to be modified to communicate with the kernel module (similarly to the working of QEMU with the KVM kernel module). KQemu started as a closed source project and was eventually released as GPL v2. One of the key feature of KQemu is that it does not require (nor use) hardware virtualization instructions. KQemu software virtualization involves complicated tricks, particularly for code in the guest that run in "Ring 0", ie with the highest priviledges, that you must patch to run as Ring 3 (non-priviledge) code in the host. You can get an idea of what is involved by reading the documentation of VirtualBox regarding software virtualization. I will not pretend that QEMU+KQemu did the exact same tricks as VirtualBox, but that should give you at least a picture of the challenges involved.  This complexity is what lead to KQemu to be eventually abandonned when CPUs with hardware virtualization became widespread to the market since KVM based virtualization is much cleaner to implement. Starting with QEMU 0.12.0, KQemu support was finally dropped from QEMU code base.

Due to KQemu not using hardware virtualization instructions, there is a good hope that it can run inside a virtualized environment. So let's have a try with QEMU 0.11.1 and KQemu 1.4.0pre. Compiling QEMU 0.11.1 on Ubuntu Trusty runs quite well, except a linking error easily fixed with this trivial patch. Building KQemu is a bit more involved, being a kernel module and the (internal) Linux kernel API being prone to changes from time to time. One good news is that the Linux specific part of kqemu is a relatively small file and the API breaks were limited to 2 aspects. The way to get the memory management structure of the current task had changed in Linux 2.6.23 and I found this simple patch to solve it. Another change that occured in a later Linux release is the removal of kernel semaphores to be replaced by mutexes. My cumulated patch to fix all compilation issues is here. I don't pretend that it is technically correct as my knowledge of kernel internals is more than limited, but a local test seemed to confirm that adding -enable-kqemu to the qemu command line worked sufficiently well to start and do things in the FreeBSD VM, and at a very decent speed. I tried the -kernel-qemu switch that turns on KQemu acceleration for kernel guest code, but that resulted in a crash of qemu near the end of the boot process of FreeBSD. Which is not surprising as kernel-qemu makes some assumptions on the internal working of the guest OS, which perhaps FreeBSD does not meet. Or perhaps this is just a bug of qemu/kqemu.

Running it on Travis-CI was successful too, with the compilation being done in 20 minutes, so probably half of the speed of bare metal, which is good enough. kqemu does not support SMP guests (but this was listed in the potential "roadmap", so probably achievable), but if we wanted to speed up compilation, we could potentially launch 2 kqemu-enabled qemu instances (the Travis-CI VM have 2 cores available) that would compile different parts of the software with the build tree being hosted in a NFS share. I said that compilation goes fine, except that the build process (actually the qemu instance) crashes at building time (I can also reproduce that locally). This is probably because the history of qemu & kqemu wasn't long enough to go from beta quality to production quality. I've workarounded this issue by only doing the compilation in -enable-kqemu mode, restarting the VM in pure software emulation to do the linking, and then restarting in -enable-kqemu mode. Unfortunately running the GDAL Python autotest suite in kqemu mode also leads to a qemu crash (due to the design of kqemu only runnnig code in ring 3, crashes do not affect the host), and running it completely in pure emulation mode reaches the 50 minute time-out, so for the sake of this demonstration, I only run one of the test file. And now we have our first succesful build given this build recipee.

I could also have potentially tried VirtualBox because, as mentionned above, it supports software virtualization with acceleration. But that is only for 32 bit guests (and I didn't find a ready-made FreeBSD 32bit image that you can directly ssh into). For 64 bit guests, VirtualBox require hardware virtualization to be available in the host. To the best of my knowledge, KQemu is (was) the only solution to enable acceleration of 64 bit guests without hardware requirements.

My main conclusion of this experiment is it is a striking example of a key advantage of the open source model. If kqemu had not been released as GPL v2, I would have never been able to resurrect it and modify it to run on newer kernels (actually there was also QVM86, an attempt of developing an alternative to Kqemu while Kqemu was still closed source and that was abandonned when VirtualBox was open sourced).