Thursday, September 26, 2024

Those concepts in the geospatial field that have cost us millions of $$$

Every domain carries its baggage of concepts which, at first sight, don't appear to be that terrible, but which turn out to be in practice.

Let's start with a non-geospatial example. A lot of programming languages have the concept of a "null" pointer. It is available in the C and C++ programming languages in particular, but also in Java, Python, etc. Tony Hoare, null's creator, regrets its invention: "I call it my billion-dollar mistake." While very useful in some contexts, that feature also happens to cause a lot of bugs, some of severe criticality. The Rust language has, for example, (almost) removed the null pointer concept and replaced it with optional references, where the compiler enforces that the code explicitly checks the validity of the reference before using it. At the very minute I'm writing this, I'm debating that very issue with a colleague.
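
To make the contrast concrete, here is a small illustrative C++ sketch (the function names are mine): a legacy-style function that may return a null pointer, versus one returning std::optional, where the absence of a value is explicit in the type and the caller is naturally pushed to check it. Rust goes one step further by having the compiler enforce that check.

#include <iostream>
#include <optional>
#include <string>

// Legacy style: the possibility of "no result" hides behind a raw pointer.
std::string* findUserPtr(bool found)
{
    static std::string user = "alice";
    return found ? &user : nullptr;
}

// The absence of a value is part of the return type.
std::optional<std::string> findUser(bool found)
{
    if (found)
        return "alice";
    return std::nullopt;
}

int main()
{
    // std::cout << findUserPtr(false)->size();  // null dereference: crash / undefined behavior
    if (auto user = findUser(false))              // explicit check before use
        std::cout << user->size() << "\n";
    else
        std::cout << "no user\n";
    return 0;
}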

The geospatial field is not free of concepts that are a never-ending source of trouble, and here is my own list, from my perspective as a software developer.

- Geometry model. Points, lines, polygons. What could be simpler than that? Well, the commonly-used OGC Simple Features model allows those objects to be empty. How is that useful in practice? Hard to tell: NULL geometries are (somewhat paradoxically given the above paragraph) actually a better replacement. My own perception of that "feature" is that it is mostly a huge headache that gets in your way when trying to write robust code. One piece of evidence is that the same standard that invented it neglected to specify how to export an empty point in the Well Known Binary representation. Over the years, the tradition has become to use the Not-a-Number value for the X and Y values of an empty point. Which by itself may have interesting consequences for applications that neglect to make a special case of it. A Not-a-Number is ... well ... not a number, and for example it is not equal to itself (and in the IEEE-754 representation, there are literally billions of different potential binary encodings of a NaN). Everything you learnt in your math courses at school breaks with it (this is actually quite general when crunching numbers with computers: finite precision breaks a lot of the expected properties of ideal numbers). A short snippet after this list illustrates that NaN weirdness. An empty line has the undesirable property of not having a start or end point: any algorithm must be ready for that. Another annoyance is that there is not just one "geometry empty" object, but a "point empty", a "line empty", a "polygon empty", etc. What is the expected intersection of an empty line with an empty polygon? An empty line, an empty polygon, an empty point, ...? The developers of the GEOS library or the PostGIS spatial extension unfortunately have to regularly debate at length about those topics. One can but think there would be a better use of their expertise and time than having to deal with such esoteric subjects (I didn't ask them, so they may actually be thrilled by them. You never know...)

- Coordinate reference system (CRS) axis order. CRSs, such as WGS 84 geographic, or UTM 31N / ETRS89, have several axes. For a geographic CRS, these will be the longitude, the latitude, and optionally the ellipsoidal height. When expressing coordinates in a CRS, one must decide in which order they are specified. There are lengthy debates on whether this should be longitude first, latitude second, or the reverse. The ISO 19111 / OGC Abstract Topic 2 specification and geodetic registries have decided not to take a firm stance on that, and have allowed the authorities responsible for CRS definition and maintenance to submit CRS definitions with whichever axis order they wish. Excellent! Well, no. The issue is that while non-geomatician users may choose to express a coordinate in prose like "50 degrees of latitude north, 15 degrees of longitude east", or "15 degrees of longitude east, 50 degrees of latitude north", that doesn't mean it is a good idea for software systems to reflect that liberty of speech. Some GIS formats have no way of clearly expressing the CRS, or if they have, they might use an incomplete way of specifying it, in particular lacking a way to express the axis order. The usual practice is to specify the longitude as the first value (corresponding to the X/horizontal axis of a Cartesian plane) and the latitude next (Y/vertical axis), reflecting the natural mapping for a graphical representation. Other formats (GML in particular) require that the coordinates are expressed in the order of the CRS definition, which requires access to a database to get the axis order, given that in GML vector files the CRS is only referenced through a code, and not defined inline. Depending on the persons responsible for designing the protocol/file format, the order may be the "GIS friendly" one (longitude-latitude), or the "CRS pedantic" one (latitude-longitude, for example, for geographic CRSs defined by the EPSG geodetic registry). This is an eternal source of confusion, sometimes with absurd situations. The OGC GeoPackage file format captures a full definition of the CRS used in the vector tables it contains, including in particular the official axis order, but, to reflect the long GIS tradition, specifies as an amendment that the encoding of coordinates in its (extension of the) WKB format mentioned in the previous paragraph should be longitude-latitude (for geographic CRSs) or easting-northing (for projected CRSs). I will not blame anyone in particular for this. This is an "overall system error". In an ideal world, a courageous geomatician in a standards organization or in a geodetic registry would have said "here, we are geomaticians: geographic CRSs are always longitude-latitude, and projected CRSs are always easting-northing. It is your responsibility as users of our system to always provide data in that order". Failing access to a time machine to warn my glorious predecessors in advance about the incoming catastrophe, the only solution I see to the issue is to ask the whole population to relocate onto the line where longitude equals latitude, and exclude any mapping outside of it.

- Raster cell registration issues, a.k.a. pixel-centre versus pixel-corner, or pixel-is-point versus pixel-is-area, a.k.a. the half-pixel shift error. A raster cell is both an entity you reference with a (column, line) integer pair, and thus perceived as a discrete point, and something that occupies a non-zero area on your display device when rendered. When that raster cell is registered against geospatial coordinates, one debate is: "what exact place in that cell does this (longitude, latitude) or (easting, northing) refer to? Is that the center of the pixel, or maybe its top-left corner?". Obviously, whenever there is a choice, file format and service specifications never agree with each other. The GDAL software has courageously decided to "standardize" its internal model on the convention where the georeferenced coordinate associated with a pixel qualifies the location of its top-left corner. GDAL format drivers do the necessary dance of applying and de-applying a half-pixel shift to convert to that convention ... when they have access to the convention of the underlying format. (A small sketch of that half-pixel arithmetic follows this list.)
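
To make the NaN oddities mentioned in the first bullet concrete, here is a short stand-alone C++ snippet (not taken from any particular library) showing why any code path that encodes empty-point coordinates as NaN must special-case them:

#include <cmath>
#include <iostream>

int main()
{
    const double x = std::nan("");
    std::cout << (x == x) << "\n";               // 0: NaN is not equal to itself
    std::cout << (x < 1.0 || x >= 1.0) << "\n";  // 0: trichotomy breaks as well
    std::cout << std::isnan(x) << "\n";          // 1: the only reliable test
    return 0;
}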
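
And here is a minimal sketch of the half-pixel bookkeeping behind the last bullet, assuming the usual six-coefficient affine geotransform and the pixel-corner convention described above (the type and function names are mine, for illustration only):

#include <array>
#include <utility>

// {originX, pixelWidth, rowRotation, originY, columnRotation, pixelHeight}
using GeoTransform = std::array<double, 6>;

// Georeferenced coordinates of the top-left corner of pixel (col, row).
std::pair<double, double> pixelCorner(const GeoTransform& gt, double col, double row)
{
    return { gt[0] + col * gt[1] + row * gt[2],
             gt[3] + col * gt[4] + row * gt[5] };
}

// Centre of pixel (col, row): shift by half a pixel in each direction.
// Formats using a pixel-is-point convention register *this* location instead,
// hence the recurring half-pixel shift errors when converting between the two.
std::pair<double, double> pixelCentre(const GeoTransform& gt, int col, int row)
{
    return pixelCorner(gt, col + 0.5, row + 0.5);
}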

A tentative conclusion could be that any proposed standard or specification should go as far as having an actual real-world implementation, not just a "working prototype" ("toy implementation" more casually), to check whether some apparently minor details are not a major source of inconvenience.

Sunday, September 26, 2021

Order of GeoJSON properties and graph theory

This whole story started with this QGIS bug about a crash when a user adds a new field to a GeoJSON file. Whether GDAL or QGIS is the culprit could be debated. Let's look at the GDAL side of things.

Let's consider this simple GeoJSON file:

{ "type": "FeatureCollection", "features": [
  { "type": "Feature", "geometry": null, "properties": { "A": "a", "C": "c" },
  { "type": "Feature", "geometry": null, "properties": { "A": "a", "B": "b", "C": "c" }
]}

GDAL versions up to 3.3.x will expose a layer definition with the fields in the following order: [A, C, B], and this is what confused QGIS, since the C field was appended as the last action, and QGIS expected it to be the last field.

One could also argue that we should not worry about the order of properties, since a JSON object such as the "properties" one is unordered. However, the JSON parser and writer used by GDAL preserve the order of appearance of the keys, which makes GeoJSON behave similarly to other formats such as Shapefile, GeoPackage or CSV, where the field order is fixed. But in a GeoJSON file, where there is no declaration of properties (no "header"), and where properties may be omitted in some features, presenting a consistent field order is more challenging.

How does current GDAL proceed to build the list of fields? That's very simple: iterate over features; for each feature, iterate over its properties; and as soon as a new property is found, append it to the list of properties of the layer. So in our above example, after the first feature, we have [A, C]. And when parsing the second feature, we append B to it, hence [A, C, B]. We can also notice that the order in which features appear in the collection influences the order of properties exposed in the layer definition.
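
Here is a minimal sketch of that naive logic (illustrative only, not GDAL's actual code):

#include <set>
#include <string>
#include <vector>

// Append a property name to "fields" the first time it is seen, in encounter order.
void CollectFields(const std::vector<std::vector<std::string>>& features,
                   std::vector<std::string>& fields)
{
    std::set<std::string> known;
    for (const auto& featureProps : features)
    {
        for (const auto& name : featureProps)
        {
            if (known.insert(name).second)  // first time this property is seen
                fields.push_back(name);
        }
    }
}

// With features = {{"A", "C"}, {"A", "B", "C"}}, fields ends up as [A, C, B].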

How to fix that? For this very simple case, we could say "B appears after A, which is already in our list, and before C, which is already in our list and was just after A, so let's insert it in between". Great! Problem solved? Unfortunately not.

What if you see features with the following properties:

- [A, D] ==> current consolidated list: [A, D]

- then [A, B, D] ==> current consolidated list: [A, B, D]

- then [C, D] ==> hmm, C is before D, but where relative to A and B?

- then [A, C] ==> we now know that C is after A. But before or after B?

- then [B, C] ==> OK, after B. So the final list is [A, B, C, D]

Our simple insertion logic no longer works here, and we can now feel that the problem is much more involved than it initially looked. We could also have situations where the dataset is so sparse that the relative order of some fields is indeterminate. Or where fields appear in inconsistent orders across features.

My initial thought was to use a sorting algorithm, such as std::sort in the C++ standard library, with a custom comparison function that would use the (i, j) order of appearance (where (i, j) means that field i immediately appears before j in a feature). But there are several issues:

- the sort algorithm may invoke the comparison function with any pair of field names, but we only have knowledge of a subset of those. If we build a graph where the nodes are the field names and a directed edge between i and j means that field i immediately appears before j in a feature, then we can know whether there's a path between any pair of nodes by using standard graph algorithms.

- the comparison function of std::sort implies that we have a strict weak ordering. When there's a path in the graph between two nodes, the ordering is clear. We note immediately that the graph must not have cycles, otherwise we would violate the asymmetry property of the strict weak ordering. So when building the graph, we should reject any new edge i->j that would result in a cycle, which boils down to finding whether the reverse path j->i already exists. But what to do for nodes that aren't related by a path? We could decide that they are incomparable. However, the "transitivity of incomparability" property could then be violated. For example, if there's no path between A and B, and none between B and C (for example B is an isolated node), there might still be a path between A and C. So it is far from obvious that we can define a comparison function that respects the strict weak ordering. And if it fails to respect it, then the std::sort() execution is undefined behaviour, meaning that anything can happen: at best, a wrong result; at worst, a crash, an infinite loop, etc. Look at this post for other examples of comparison functions that subtly violate strict weak ordering.
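
As a reminder of how easy it is to get this wrong, here is a tiny self-contained example (unrelated to the GeoJSON code) of a comparator that subtly violates strict weak ordering:

#include <algorithm>
#include <vector>

// Wrong: "<=" makes comp(x, x) return true, violating the irreflexivity
// required by strict weak ordering; std::sort() with it is undefined behavior.
bool badComp(int a, int b) { return a <= b; }

// Correct: a strict "less than".
bool goodComp(int a, int b) { return a < b; }

int main()
{
    std::vector<int> v{3, 1, 2, 2, 5};
    // std::sort(v.begin(), v.end(), badComp);  // may give wrong results, crash or loop
    std::sort(v.begin(), v.end(), goodComp);
    return 0;
}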

In the end, we want to somewhat "linearize" our graph, while respecting the directed edges.

So given a directed graph of field names like the one pictured in the original post (nodes A to G, with edges reflecting the observed orderings):


We want to get a result like [A, F, B, G, D, C, E], or [F, A, B, C, D, E, G], or ...

It appears that this is a known problem in graph theory: computing the topological ordering of a directed acyclic graph. As shown in the above example, there are in general several solutions. To make the result less "random", we use the lexicographical order of field names for the working set S of Kahn's algorithm, which acts as an additional sorting rule for nodes that have no direct relationship in the graph.
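
For illustration, here is a minimal self-contained sketch of Kahn's algorithm with that lexicographic tie-breaking (the function name and data structures are mine, not GDAL's; the actual implementation lives in the pull request below):

#include <map>
#include <set>
#include <string>
#include <vector>

// adjacency[i] contains the nodes j such that there is a directed edge i -> j.
std::vector<std::string> TopologicalSort(
    const std::map<std::string, std::set<std::string>>& adjacency)
{
    // Compute the in-degree of every node.
    std::map<std::string, int> inDegree;
    for (const auto& kv : adjacency)
    {
        inDegree[kv.first];  // make sure source nodes are present
        for (const auto& dst : kv.second)
            inDegree[dst]++;
    }

    // Working set S: nodes without incoming edges, kept in lexicographic order.
    std::set<std::string> S;
    for (const auto& kv : inDegree)
        if (kv.second == 0)
            S.insert(kv.first);

    std::vector<std::string> result;
    while (!S.empty())
    {
        // Pick the lexicographically smallest available node.
        const std::string node = *S.begin();
        S.erase(S.begin());
        result.push_back(node);

        const auto it = adjacency.find(node);
        if (it != adjacency.end())
        {
            for (const auto& dst : it->second)
                if (--inDegree[dst] == 0)
                    S.insert(dst);
        }
    }
    // If result.size() < inDegree.size(), the graph had a cycle.
    return result;
}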

The result of that work is in https://github.com/OSGeo/gdal/pull/4552


Tuesday, September 14, 2021

The ultimate guide for successful contribution to open source

(This is a post I started in December 2019 and didn't publish because I felt it was too negative. But recent events show that it is still a current topic, and at least this will document my own perception of things)
 
We have lately stumbled upon a few clumsy attempts by corporations at contributing to open source software. Needless to say, this is a bumpy ride not for the faint of heart, on both sides of the equation.

If you just want to stop your reading here, just remember that good old motto: When in Rome, do as the Romans do. And be prepared to become a Roman citizen.

We open source developers have our own hard-to-decipher customs. We pay tribute to the Almighty Git (*), preferably through his most revered (often with awe) incarnation called GitHub. You may also find followers of GitLab, Gitea or Gitorious.
In that case, never pronounce the word 'GitHub' before them. It is also said that some might still use email patches: be sure to set an 80-character limit to lines, use LF line endings and refrain from using PNG signatures. Be sure to stay away from SourceForge followers. They will continuously impose on you advertising messages that will destroy your sanity.

You should subscribe to mailing lists. Be wary of trolls. We may occasionally engage in flame wars, generally settled by motions. We are structured into groups that defer to a Project Steering Committee to decide (or not) which way to follow. Our groups regularly gather at solstice time in dark rooms for a week-long trance (note: this was written pre-pandemic), known as a hackfest, hackathon or code sprint. For the tribes involved in geospatial, we have an annual Pow-Wow, called FOSS4G, during which we temporarily bury the hatchet and occasionally thank our sponsors. You may stumble upon a C89 developer sharing a beer with a Java 11 one, or an adorer of Leaflet trying to convert an OpenLayers follower. If you want to join the Pow-Wow, remove your tie and suit, wear a t-shirt and sandals, and make sure to display a prominent "I Love Open Source" sticker on your laptop. You may also add an "I Love Shapefile" or "I Love GeoPackage" sticker, but only if you are confident about which one the group pays tribute to. Failure to display the appropriate sticker will expose you to terrifying insults about 10-character limits or obscure write-ahead log locking issues. If unsure, use the "I ? Shapefile" sticker.

We have taboo words. Thou Shalt Not Talk about business plans, intellectual property, patent portfolios, customer demand, strategy, costless SDKs, CVE numbers, internal policy, consolidated codebases, or education license fees. Planning and roadmap should also be pronounced with care, as some tribes will probably not have even the slightest idea of what you are talking about.

If despite all those strange habits you want to join us, be prepared to follow a long and demanding initiation path. You will have to demonstrate your (possibly affected) adoration of our principles. As strange as it may seem, we abhor being offered large presents that we cannot prevent ourselves from seeing as Trojan horses. You will rather have to locate a long list of problems pinned on a wall called the bug tracker, where each member of the community has written an issue or a wish. Start with a modest one that makes sense to you, solve it and offer its resolution to the Reviewer in a sufficiently argued Pull Request. The Reviewer, who often equates to the Maintainer, will scrutinize your gift, often at night time after Day Job. He may decide to accept it right away, or return it to you with a comment requesting a change, or just ignore it with the highest contempt. Be humble and obey his command. You may beg for enlightenment, but never object, even if you cannot make sense of his rebuttal. He is the almighty after the Almighty. Never Rebase if asked to Merge; never Merge if asked to Rebase. Refactor when asked, but don't when not! You should rewrite history before submitting, or maybe not. If unsure, consult Contributing.md, or be prepared for RTFM to be yelled at you (only for the tribes that haven't yet written a CodeOfConduct.md). Do not even consider objecting that you have not been tasked to address his demands. You must also make sure to listen to the complaints of the Continuous Integration (CI) half-gods: the Reviewer will probably not even look at your gift until CI has expressed its satisfaction. Retry as many times as needed until they are pleased. You may attempt to submit an RFC, but be prepared for lengthy and lively discussions. Listen and correct, or you may not survive your first "-1" spell!

We especially praise gifts that have no direct value for you: improved test suites (e.g. https://github.com/OSGeo/gdal/issues/4407), documentation additions and fixes, answering users on the mailing list. Only when you feel that you have built enough trust should you try to offer your initial gift. But, even if it is accepted, the most surprising habit is that your gift will remain yours. You will have to make sure you regularly remove the dust that accumulates on it over time, so it remains constantly shiny. While removing dust from your gift, never neglect removing it also from nearby gifts offered by others. Otherwise the Maintainer might threaten to invoke the terrible Revert incantation on you. Also consider that existing contributors to a project might see your new code, which they have no direct use for, as an extra burden on their daily tasks (studying the commit history of https://github.com/qgis/QGIS/commits/master/src/providers/hana, or even more crudely https://github.com/qgis/QGIS/commits/master/src/providers/db2, demonstrates that).

Ready for a ride, and possibly enjoying the feeling that "our" code can also become "yours"?


(*) some tribes, cut off from the rest of the world, are said to still pursue their adoration of more ancient divinities sometimes known as SVN or CVS. We cannot confirm, as we have eradicated the last remains of those old cults.

Tuesday, August 31, 2021

Analysis of OSS-Fuzz related data on GDAL

GDAL has now been put under the continuous scrutiny of OSS-Fuzz for more than 4 years. To keep it simple, OSS-Fuzz is a continuously running infrastructure that stresses software with (not-so-)random data to discover various flaws, and automatically files issues in a dedicated issue tracker, with reproducer test cases and stack traces when available. It is time to make some use of the accumulated data and extract a few trends.

First, we can see a total of 1787 issues having been found, which represents on average a bit more than one per day. Out of those, only 38 are still open (so 97.8% have been fixed or are no longer reproducible). Those 1787 issues are out of a total of 37,769 issues filed against all 530 enrolled software projects, hence representing 4.6% (significantly higher than the naive 1 / 530 = 0.2% proportion that we could expect, at least if all projects were of the same size, but GDAL is likely larger than the average). Does that mean that the quality of GDAL is lower than the average of enrolled projects, or that it is stressed in a more efficient way...? Since most of GDAL's code deals with a large number of file formats, it is an ideal fit for fuzz testing.

We should mention that a number of issues attributed to GDAL actually belong to some of its dependencies: PROJ (coordinate transformation), Poppler (PDF rendering), cURL (network access), SQLite3 (embedded database), Xerces-C (XML parsing), etc. And we regularly report or fix those issues in the upstream components.

Addressing those issues is now facilitated by the sponsorship program, which allows us to spend funded time on such unsexy, usually hard-to-fund topics that are nevertheless important to enhance the robustness of the software.

We have run a script that parses git commit logs to identify commits that explicitly reference an issue filed by OSS-Fuzz, and tries to analyze the commit message to find the category of the issue it addresses.

Here's its output:

Category                                                Count  Pct
----------------------------------------------------------------------
integer_overflow_signed                                  178  14.67 %
excessive_memory_use                                     174  14.34 %
other_issue                                              114   9.40 %
buffer_overflow_unspecified                              105   8.66 %
excessive_processing_time                                102   8.41 %
null_pointer_dereference                                  93   7.67 %
integer_overflow_unsigned                                 68   5.61 %
division_by_zero_integer                                  54   4.45 %
heap_buffer_overflow_read                                 37   3.05 %
unspecified_crash                                         32   2.64 %
stack_call_overflow                                       31   2.56 %
memory_leak_error_code_path                               23   1.90 %
heap_buffer_overflow_write                                22   1.81 %
division_by_zero_floating_point_unknown_consequence       19   1.57 %
stack_buffer_overflow_unspecified                         19   1.57 %
invalid_cast                                              14   1.15 %
invalid_shift_unspecified_dir                             14   1.15 %
memory_leak_unspecified                                   13   1.07 %
assertion                                                 13   1.07 %
invalid_enum                                              11   0.91 %
division_by_zero_floating_point_harmless                  10   0.82 %
infinite_loop                                             10   0.82 %
integer_overflow_unsigned_harmless                         8   0.66 %
invalid_memory_dereference                                 7   0.58 %
undefined_shift_left                                       7   0.58 %
double_free                                                6   0.49 %
stack_buffer_overflow_read                                 6   0.49 %
use_after_free                                             6   0.49 %
integer_overflow_harmless                                  5   0.41 %
undefined_behavior_unspecified                             3   0.25 %
negative_size_allocation                                   3   0.25 %
unhandled_exception                                        2   0.16 %
invalid_shift_right                                        2   0.16 %
unsigned_integer_underflow                                 1   0.08 %
uninitialized_variable                                     1   0.08 %
----------------------------------------------------------------------
Total                                                   1213


So 1213 commits for 1787 - 38 = 1749 fixed issues: the difference consists of duplicated issues, issues that do not reproduce reliably and end up being closed, or issues fixed in code bases other than GDAL.

Let's dig into those categories, from most frequently hit to lesser ones:

  • integer_overflow_signed: an arithmetic operation on a signed integer whose result overflows its size. This is undefined behavior in C/C++, meaning that anything can happen in theory. In practice, most reasonable compilers and common CPU architectures implementing two's-complement signed integers will have a wrap-around behavior, and will not crash when the overflow occurs. However this often has later consequences, like allocating an array of the wrong size, leading to out-of-bounds accesses (a short sketch after this list illustrates the typical pattern).

  • excessive_memory_use: this is not a vulnerability by itself. This issue is raised because processes that run under OSS-Fuzz are limited to 2 GB of RAM usage, which is reasonable given that OSS-Fuzz manipulates input buffers that are generally only a few tens of kilobytes large. It is thus expected that for those small inputs, RAM consumption should remain small. When that's violated, it is generally because the code puts too much trust in some fields of the input data that drive a memory allocation, without checking them against reasonable bounds, or against the file size. However, for some file formats, it is difficult to implement definitive bounds because they allow valid small files that need a lot of RAM to be processed. Part of the remaining open issues belong to that category.

  • other_issue: issues that could not be classified under a more precise category. Bonus points for anyone improving the script to analyze the diff and figure from the code the cases where the commit message lacks details! Or perhaps just parse the OSSFuzz issue itself which gives a categorization.

  • buffer_overflow_unspecified: an access outside the validity area of some buffer (string, array, vector, etc.), where we couldn't determine whether the buffer is heap-allocated or stack-allocated, nor whether it is a read or a write attempt. Can potentially result in arbitrary code execution.

  • excessive_processing_time: this is when a program exceeds the 60-second timeout granted by OSS-Fuzz to complete processing of an input buffer. This is often due to a sub-optimal algorithm (e.g. quadratic performance whereas linear is achievable), or an unexpected network access. Most of the remaining open issues are in that category. A significant number are also out-of-memory situations hit by the fuzzer while iterating over many inputs, which often cannot be reproduced on an individual test case. We suspect heap fragmentation to happen in some of those situations.

  • null_pointer_dereference: a classic programming issue: accessing a null pointer, which results in an immediate crash.

  • integer_overflow_unsigned: this one is interesting. Technically, in C/C++, overflow of unsigned integers is well-defined behavior: wrap-around is guaranteed by the standards. However we assumed that in most cases the overflow was unintended and could lead to bugs similar to signed integer overflow, hence we opted in for OSS-Fuzz to consider those overflows as issues. For the very uncommon cases where the overflow is valid (e.g. when applying a difference filter on a sequence of bytes), we can tag the function where it occurs with a __attribute__((no_sanitize("unsigned-integer-overflow"))) annotation.

  • division_by_zero_integer: an integer divided by zero, with zero being evaluated as an integer. Results in an immediate crash on at least the x86 architecture.

  • heap_buffer_overflow_read: read access outside the validity area of a heap allocated data structure. Results generally in a crash.

  • unspecified_crash: a crash for an unidentified reason. Same bonus points as above to help better categorize them.

  • stack_call_overflow: recursive calls to methods that end up blowing the size limit of the stack, which results in a crash.

  • memory_leak_error_code_path: a memory leak that is observed in a non-nominal code path, that is, on corrupted/hostile datasets.

  • heap_buffer_overflow_write: write access outside the validity area of a heap allocated data structure. Results generally in a crash. Could also be sometimes exploited for arbitrary code execution.

  • division_by_zero_floating_point_unknown_consequence: a variable is divided by zero, and this operation is done with floating point evaluation. In C/C++, this is undefined behavior, but on CPU architectures implementing IEEE-754 (that is pretty much all of them nowadays) with default settings, the result is either infinity or not-a-number. If that result is cast to an integer, this results in undefined behavior again (generally not crashing), and various potential bugs (which can be crashing)

  • stack_buffer_overflow: a read or write access outside of a stack-allocated buffer. Often results in crashes, and if a write access, can be sometimes exploited for arbitrary code execution.

  • invalid_cast: this may be an integer cast to an invalid value for an enumeration (unspecified behavior, which can result in bugs due to not catching that situation later), or an instance of a C++ class cast to an invalid type (unspecified behavior, crash likely).

  • invalid_shift_unspecified_direction: a left- or right-shift binary operation, generally on a signed integer. For a left-shift, this is when the most-significant bit ends up being set (either by shifting a positive value too much, which results in a negative value, or by shifting a negative value), or when shifting by a number of bits equal to or greater than the width of the integer. For a right-shift, this is when shifting by a number of bits equal to or greater than the width of the integer. Those undefined behaviors do not result in immediate crashes on common compilers/platforms, but can lead to subsequent bugs.

  • memory_leak_unspecified: self explanatory.

  • assertion: an assert() in the code is hit. A lot of what we initially think of as programming invariants can actually be violated by specially crafted input, and should be replaced by classic checks that error out in a clean way.

  • invalid_enum: a particular case of invalid_cast where an invalid value is stored in a variable of an enumeration type.

  • division_by_zero_floating_point_harmless: a floating-point division by zero, whose consequences are estimated to be harmless. For example the NaN or infinity value is directly returned to the user and does not influence further execution of the code.

  • infinite_loop: a portion of the code is executed endlessly. This is a denial of service.

  • integer_overflow_unsigned_harmless:  an example of that could be some_string.substr(some_string.find(' ') + 1) to extract the part of a string after the first space character, or the whole string if there's none. find() will return std::string::npos which is the largest positive size_t, and thus adding 1 to it will result in 0.

  • invalid_memory_dereference: generally the same as a heap_buffer_overflow_read.

  • invalid_shift_left: mentioned above.

  • double_free: a heap-allocated buffer is destroyed twice. Later crash is likely to occur. Could potentially be exploited for arbitrary code execution.

  • stack_buffer_overflow_read: mentioned above

  • use_after_free: subcase of heap_buffer_overflow_read where one accesses some buffer after it has been freed.

  • integer_overflow_harmless: mentioned above, but here, if we exclude the undefined behavior aspect of it, consequences are estimated to be harmless.

  • undefined_behavior_unspecified: some undefined behavior of a non identified category, restricted to those caught by the UBSAN analyzer of clang

  • negative_size_allocation: a negative size (actually an unsigned integer with its most significant bit set) is passed to a heap memory allocation routine. Not a bug by itself, but often the reflection of a previous issue.

  • unhandled_exception: a C++ exception that propagates up to main() without being caught. A crash is ensured for C callers, or C++ callers not expecting it. As the fuzzer programs used by GDAL use the C API, such an exception popping up is definitely a bug.

  • invalid_shift_right: mentioned above.

  • unsigned_integer_underflow: similar to the overflow, but going through the negative values. Well-defined behavior according to the standards, but often undesirable.

  • uninitialized_variable: access to a variable whose content is uninitialized. There are very few instances of that. The reason is probably that we make extensive use of strict compiler warnings and static code analyzers (cppcheck, clang static analyzer and Coverity Scan), which are very good at catching such issues.
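
To make the first category more concrete, here is an illustrative sketch (not actual GDAL code; the function name and values are mine) of the typical pattern behind integer_overflow_signed reports, together with one way to guard against it:

#include <cstdint>
#include <vector>

std::vector<uint8_t> AllocateRaster(int width, int height, int bytesPerPixel)
{
    // The naive "int nBytes = width * height * bytesPerPixel;" overflows for
    // e.g. 50000 x 50000 x 3: that is undefined behavior, which in practice
    // typically wraps to a small or negative value, leading to an undersized
    // buffer and later out-of-bounds accesses.
    // Compute on 64 bits and validate instead.
    const int64_t nBytes = static_cast<int64_t>(width) * height * bytesPerPixel;
    if (width <= 0 || height <= 0 || bytesPerPixel <= 0 || nBytes > INT32_MAX)
        return {};  // reject unreasonable sizes up front
    return std::vector<uint8_t>(static_cast<size_t>(nBytes));
}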

Despite the large number of categories caught by the sanitizers run by OSS-Fuzz, there are some gaps not covered. For example, casting an integer (or floating point value) to a narrower integer type with a value that does not fit into the target type. Such conversions are considered "implementation defined" (the compiler must do something, potentially different from another compiler, and document what it does), and thus do not fall into what is caught by undefined behavior sanitizers.
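
As a hedged illustration of that last point, restricted to integer narrowing (the variable names and values are mine, purely for demonstration):

#include <cstdint>
#include <cstdio>

int main()
{
    const int32_t big = 300;
    // 300 does not fit into int8_t: the result is implementation-defined
    // (typically two's-complement truncation, i.e. 300 - 256 = 44), and the
    // default sanitizer checks stay silent because this is not undefined behavior.
    const int8_t small = static_cast<int8_t>(big);
    std::printf("%d\n", static_cast<int>(small));
    return 0;
}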

PS: For those interested in academic research analyzing outputs of OSSFuzz, you can find this paper (where GDAL actually appears in one figure)

Thursday, May 2, 2019

Incremental Docker builds using ccache

These past days, I've experimented with Docker to be able to produce "official" nightly builds of GDAL.
Docker Hub is supposed to have an automated build mechanism, but the cloud resources they put behind that feature seem to be insufficient to sustain the demand, and builds tend to drag on forever.
Hence I decided to set up a local cron job to refresh my images and push them. Of course, as there are currently 5 different Dockerfile configurations and building both PROJ and GDAL from scratch can be time consuming, I wanted this to be as efficient as possible. One observation is that between two nightly builds very few files change on average, so ideally I would want to recompile only the ones that have changed, and have the minimum number of updated Docker layers refreshed and pushed.

There are several approaches I combined together to optimize the builds. For those already familiar with Docker, you can probably skip to the "Use of ccache" section of this post.

Multi-stage builds

This is a Docker 17.05 feature in which you can define several steps (each of which will form a separate image), where later steps can copy from the file system of the previous steps. Typically you use a two-stage approach.
The first stage installs development packages, builds the application and installs it in some /build directory.
The second stage starts from a minimal image, installs the runtime dependencies, and copies the binaries generated at the previous stage from /build to the root of the final image.
This approach avoids having any development packages in the final image, which keeps it lean.

Such a Dockerfile will look like:

FROM ubuntu:18.04 AS builder
RUN apt-get update && apt-get install -y g++ make
COPY . /src
WORKDIR /src
RUN ./configure --prefix=/usr && make && make install DESTDIR=/build

FROM ubuntu:18.04 AS finalimage
RUN apt-get update && apt-get install -y libstdc++6
COPY --from=builder /build/usr/ /usr/



Fine-grained layering of the final image

Each step in a Dockerfile generates a layer, and the layers chained together form an image.
When pulling/pushing an image, layers are processed individually, and only the ones that are not present on the target system are pulled/pushed.
One important note is that the refresh/invalidation of a step/layer causes the refresh/invalidation of later steps/layers (even if the content of the layer does not change in a user-observable way, its internal ID will change).
So one approach is to put first in the Dockerfile the steps that will change the least frequently, such as dependencies coming from the package manager, third-party dependencies whose versions rarely change, etc., and to put the applicative part at the end. Even the applications refreshed as part of the nightly builds can be decomposed into fine-grained layers.
In the case of GDAL and PROJ, the installed directories are:
$prefix/usr/include
$prefix/usr/lib
$prefix/usr/bin
$prefix/usr/share/{proj|gdal}

The lib directory is the most frequently varying one (each time a .cpp file changes, the .so changes), whereas the installed include files and resources tend to be updated less frequently.

So a better ordering of our Dockerfile is:
COPY --from=builder /build/usr/share/gdal/ /usr/share/gdal/
COPY --from=builder /build/usr/include/ /usr/include/
COPY --from=builder /build/usr/bin/ /usr/bin/
COPY --from=builder /build/usr/lib/ /usr/lib/


With one subtlety: as part of our nightly builds, the sha1sum of the HEAD of the git repository is embedded in a string of $prefix/usr/include/gdal_version.h. So in the builder stage, I separate that precise file from the other include files and put it in a dedicated /build_most_varying target, together with the .so files.

RUN [..] \
    && make install DESTDIR=/build \
    && mkdir -p /build_most_varying/usr/include \
    && mv /build/usr/include/gdal_version.h /build_most_varying/usr/include \
    && mv /build/usr/lib /build_most_varying/usr


And thus, the finalimage stage is slightly changed to:

COPY --from=builder /build/usr/share/gdal/ /usr/share/gdal/
COPY --from=builder /build/usr/include/ /usr/include/
COPY --from=builder /build/usr/bin/ /usr/bin/
COPY --from=builder /build_most_varying/usr/ /usr/


Layer depending on a git commit

In the builder stage, the step that refreshes the GDAL build depends on an
argument, GDAL_VERSION, that defaults to "master"

ARG GDAL_VERSION=master
RUN wget -q https://github.com/OSGeo/gdal/archive/${GDAL_VERSION}.tar.gz \
    && build instructions here...

Due to how Docker layer caching works, building this Dockerfile several times in a row would not refresh the GDAL build (unless you invoke docker build with the --no-cache switch, which disables all layer caching), so the script that triggers the docker build gets the sha1sum of the latest git commit and passes it with:

GDAL_VERSION=$(curl -Ls https://api.github.com/repos/OSGeo/gdal/commits/HEAD -H "Accept: application/vnd.github.VERSION.sha")
docker build --build-arg GDAL_VERSION=${GDAL_VERSION} -t myimage .

In the (unlikely) event where the GDAL repository would not have changed, no
new build would even be attempted.

Note: this part is not necessarily a best practice. Other Docker mechanisms,
such as using a Git URL as the build context, could potentially be used. But as
we want to be able to refresh both GDAL and PROJ, that would not be really suitable.
Another advantage of the above approach is that the Dockerfile is self-sufficient
to create an image with just "docker build -t myimage ."

Use of ccache

This is the part for which I could not find a ready-made, easy-to-deploy solution.

With the previous techniques, we have a black-and-white situation. A GDAL build is either entirely cached by the Docker layer caching when the repository did not change at all, or completely done from scratch as soon as the commit id has changed (even for a change that does not affect the installed files at all). It would be better if we could use ccache to minimize the number of files to be rebuilt.
Unfortunately it is not possible with docker build to mount a volume in which the ccache directory would be stored (apparently because of security concerns). There is an experimental RUN --mount=type=cache feature in Docker 18.06 that could perhaps be used to equivalent effect, but it requires both the client and the daemon to be started in experimental mode.

The trick I use, which has the benefit of working with a default Docker installation, is to have the Docker build container download the content of a ccache directory from the host, do the build, and upload the modified ccache back onto the host.

I use rsync for that, as it is simple to set up. Initially, I used an rsync daemon running directly on the host, but inspired by https://github.com/WebHare/ccache-memcached-server, which proposes an alternative, I've modified it to run in a Docker container, gdal_rsync_daemon, which mounts the host ccache directory. The benefit of my approach over the ccache-memcached-server one is that it does not require a patched version of ccache to run in the build instance.

So the synopsis is:

host cache directory <--> gdal_rsync_daemon (docker instance)  <------> Docker build instance
                  (docker volume mounting)                           (rsync network protocol)


You can consult here the relevant portion of the launching script which builds and launches the gdal_rsync_daemon. And the corresponding Dockerfile step in the builder stage is rather straightforward:

# for alpine. or equivalent with other package managers
RUN apk add --no-cache rsync ccache

ARG RSYNC_REMOTE
ARG GDAL_VERSION=master
RUN if test "${RSYNC_REMOTE}" != ""; then \
        echo "Downloading cache..."; \
        rsync -ra ${RSYNC_REMOTE}/gdal/ $HOME/; \
        export CC="ccache gcc"; \
        export CXX="ccache g++"; \
        ccache -M 1G; \
    fi \
    # omitted: download source tree depending on GDAL_VERSION
    # omitted: build
    && if test "${RSYNC_REMOTE}" != ""; then \
        ccache -s; \
        echo "Uploading cache..."; \
        rsync -ra --delete $HOME/.ccache ${RSYNC_REMOTE}/gdal/; \
        rm -rf $HOME/.ccache; \
    fi


I also considered a simplified variation of the above that would not use rsync, where after the build we would "docker cp" the cache from the build image to the host, and at the next build copy the cache into the build context. But that would have two drawbacks:
  • our build layers would contain the cache
  • any change in the cache would cause the build context to be different and subsequent builds to have their cached layers invalidated.

Summary

We have managed to create a Dockerfile that can be used in a standalone mode
to create a GDAL build from scratch, or can be integrated in a wrapper build.sh
script that offers incremental rebuild capabilities to minimize the use of CPU resources. The image has fine-grained layering, which also minimizes upload and download times for frequent push/pull operations.

Thursday, January 31, 2019

SRS barn raising: 8th report. Ready for your testing!

This is the 8th progress report of the GDAL SRS barn effort.

As the title implies, a decisive milestone has now been reached, with the "gdalbarn" branches of libgeotiff and GDAL having now been merged into their respective master branches.

On the PROJ side, a number of fixes and enhancements have been done:
- missing documentation for a few functions, for the evolution of cs2cs, and for the new projinfo utility has been added
- the parser of the WKT CONCATENATEDOPERATION construct can now understand steps presented in reverse order
- a few iterations to update the syntax parsing rules of WKT2:2018 following the latest adjustments done by the OGC WKT Standard Working Group
- in my previous work, I had introduced a "PROJ 5" convention to export CRSs using pipeline/unitconvert/axisswap, as an attempt at improving the PROJ.4 format used by GDAL and other products. However, after discussion with other PROJ developers, we realized that it is likely a dead-end, since it is still lossy in many respects and can cause confusion with coordinate operations. Consequently the PROJ_5 convention will be identical to PROJ_4 for CRS export, and the use of PROJ strings to express CRSs themselves is discouraged. It can still make sense when using the "early-binding" approach and specifying towgs84/nadgrids/geoidgrids parameters, but in a late-binding approach, WKT is much more powerful at capturing important information like geodetic datum names.
- when examining how the new code I added those past months interacts with the existing PROJ codebase, it became clear that there was a confusion between importing PROJ strings expressing coordinate operations and PROJ strings expressing a CRS. So for the latter use case, a "+type=crs" must now be added in the PROJ string. As a consequence, the proj_create_from_proj_string() and proj_create_from_user_input() functions have been removed, and proj_create() can now be used for all types of PROJ strings (a short usage sketch follows this list)
- The PROJ_LIB environment variable now supports multiple paths, separated by colon on Unix and semi-colon on Windows
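
As a small usage sketch of that distinction, using the PROJ C API described above (the PROJ strings are arbitrary examples of mine):

#include <proj.h>

int main(void)
{
    PJ_CONTEXT *ctx = proj_context_create();

    /* With "+type=crs", the string is interpreted as a CRS definition... */
    PJ *crs = proj_create(ctx, "+proj=longlat +datum=WGS84 +type=crs");

    /* ...without it, it is interpreted as a coordinate operation. */
    PJ *op = proj_create(ctx, "+proj=utm +zone=31 +ellps=GRS80");

    proj_destroy(op);
    proj_destroy(crs);
    proj_context_destroy(ctx);
    return 0;
}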

On the GDAL side,
  • the OGRCoordinateTransformation class now uses the PROJ API to automatically compute the best transformation pipeline, enabling late-binding capabilities. In the case where the user does not provide an explicit area of use and several coordinate operations are possible, the Transform() method can automatically switch between coordinate operations depending on the input coordinate values. This should offer behaviour similar to previous versions, for example for the NAD27 to NAD83 conversion, when PROJ had a hardcoded +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat rule. This dynamic selection logic has also been moved to PROJ's proj_create_crs_to_crs() function. Note however that this might not always lead to the desired results, so specifying a precise area of interest, or even a specific coordinate operation, is preferred when full control is needed.
  • gdalinfo, ogrinfo and gdalsrsinfo now output WKT2:2018 by default (this can be changed with a command line switch). On the API side, the exportToWKT() method will still export WKT1 by default (this can of course be changed too). The rationale is that WKT consumers might not be ready yet for WKT2, so this should limit backward compatibility issues. In the future (couple of years timeframe), this default WKT version might be upgraded when more consumers are WKT2-ready.
  • an RFC 73 document, "Integration of PROJ6 for WKT2, late binding capabilities, time-support and unified CRS database", was created to document all the GDAL changes. After discussion with the community, this RFC has been approved
  • As a result, all of the above-mentioned work has now been merged into GDAL master
  • Important practical discussion: GDAL master now depends on PROJ master (and ultimately PROJ 6.0 once it is released)
Consequently, on the pure development front, most of the work has now been completed. As all those changes done over the last months deeply impact SRS-related functionality in GDAL and PROJ, we now rely on your careful testing to spot the inevitable issues that have not yet been detected by their respective automatic regression test suites. The earlier they are detected, the easier they will be to fix, in particular if they impact the API.

Friday, December 28, 2018

SRS barn raising: 7th report

This is the 7th progress report of the GDAL SRS barn effort.

On the PROJ side, things are consolidating. Tens of "random" fixes have been pushed, due to the GDAL autotest suite triggering a number of interesting cases. The C API has also been enhanced to accommodate the needs of GDAL and libgeotiff. We have also received feedback from an early adopter (a developer of a pyproj binding based on PROJ master). The major development items have been the move of the WKT 1 syntax validation from GDAL to PROJ, as well as the development of an equivalent WKT 2 syntax validator in PROJ (this task has been useful to uncover a few minor issues in the draft of the future WKT2:2018 standard). A reorganization of the PROJ source tree (with a conversion of most C files to C++ files) has also been done, as a preliminary step for a pull request to better integrate the new ISO-19111 functionality I developed over the last months with the existing C API.

Regarding libgeotiff, a v1.4.3 maintenance version has been released with the fixes of the last two years, before the new work for the integration of PROJ master is done. libgeotiff development has been moved from Subversion to https://github.com/OSGeo/libgeotiff. As a preliminary step, continuous integration capability has been added to test compilation under Linux/GCC and Windows/Visual Studio, with a few runtime tests.
A pull request is ready with the integration of PROJ master with libgeotiff. It features:
  • PROJ master / PROJ 6 as a required dependency of libgeotiff
  • Use of the proj.db database to resolve the various coded values, mostly in the GTIFGetDefn() "normalization" function of libgeotiff, instead of using the .csv files previously generated from the EPSG database. Typically, an EPSG code identifying a projected CRS is resolved into its base elements: projection method, projection parameters, base geodetic CRS, etc.
  • Complete removal of those .csv files and associated functionality
This work will be merged once the above-mentioned PROJ pull request, which affects the naming of the new functions of the C API, has been merged and taken into account. This gdalbarn branch of libgeotiff has also been successfully integrated in the internal libgeotiff copy of the GDAL gdalbarn branch.

Regarding GDAL, maintenance v2.3.3 and feature v2.4.0 versions have been released, for the same reasons as above.
Most methods of the OGRSpatialReference class have now been re-implemented to rely on the PROJ C API to do queries and state changes, which avoids potentially lossy imports/exports to WKT 1. Similarly to the libgeotiff work, the ImportFromEPSG() functionality now relies on proj.db, and consequently all EPSG- or ESRI-related .csv files have been removed from the GDAL data directory. I've also drafted a plan on how GDAL raster drivers can take WKT 2 into account, and proposed changes regarding how to better handle the gap between the axis order mandated by the CRS authority (for example latitude first, longitude second for geographic CRSs in the EPSG dataset) and the actual order of the values in raster metadata or vector geometries. The first part (use of OGRSpatialReference in raster drivers) has been implemented in the gdalbarn branch, and the second part is in good progress, with the drivers now advertising their data axis to CRS axis mapping. The ongoing work is to make OGRCoordinateTransformation use the PROJ API to automatically compute the best transformation pipeline, enabling late-binding capabilities.