Sunday, September 26, 2021

Order of GeoJSON properties and graph theory

This whole story started with this QGIS bug about a crash when a user adds a new field to a GeoJSON file. Whether GDAL or QGIS is the culprit could be debated. Let's look at the GDAL side of things.

Let's consider this simple GeoJSON file:

{ "type": "FeatureCollection", "features": [
  { "type": "Feature", "geometry": null, "properties": { "A": "a", "C": "c" },
  { "type": "Feature", "geometry": null, "properties": { "A": "a", "B": "b", "C": "c" }
]}

GDAL versions up to 3.3.x will expose a layer definition with the fields in the following order: [A, C, B]. This is what confused QGIS, since the C field had been appended last, and QGIS expected it to be the last field.

One could also argue that we should not worry about the order of properties, since a JSON object such as "properties" is unordered. However, the JSON parser and writer used by GDAL preserve the order of appearance of the keys, which makes GeoJSON behave similarly to other formats such as Shapefile, GeoPackage or CSV where the field order is fixed. But in a GeoJSON file where there is no declaration of properties (no "header"), and where properties are omitted in some features, presenting a consistent field order is more challenging.

How does current GDAL proceed to build the list of fields? That's very simple: iterate over features; for each feature, iterate over its properties, and as soon as a new property is found, append it to the list of properties of the layer. So in our example above, after the first feature, we have [A, C]. And when parsing the second feature, we append B to it, hence [A, C, B]. We can also notice that the order in which features appear in the collection influences the order of properties exposed in the layer definition.
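
A minimal sketch of that logic (hypothetical names, not the actual GDAL implementation) could look like:

#include <set>
#include <string>
#include <vector>

// Naive approach (sketch of what GDAL <= 3.3.x does): append a field the
// first time it is seen while iterating over the features in file order.
std::vector<std::string> BuildFieldList(
    const std::vector<std::vector<std::string>>& featureProperties)
{
    std::vector<std::string> fields;
    std::set<std::string> seen;
    for (const auto& props : featureProperties)
    {
        for (const auto& name : props)
        {
            if (seen.insert(name).second)
                fields.push_back(name); // first apparition: append at the end
        }
    }
    return fields;
}

// BuildFieldList({{"A", "C"}, {"A", "B", "C"}}) returns [A, C, B]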

How to fix that ? For that very simple case, we could say "B appears after A, which is already in our list, and before C, which is already in our list and was just after A, so let's insert it in between". Great ! Problem solved ? Unfortunately not.

What if you see features with the following properties:

- [A, D] ==> current consolidated list [A, D]

- then [A, B, D] ==> current consolidated list [A, B, D]

- then [C, D] ==> hmm, C is before D, but where relative to A and B?

- then [A, C] ==> we now know that C is after A. But before or after B?

- then [B, C] ==> OK, C is after B. So the final list is [A, B, C, D]

Our simple insertion logic no longer works here, and we can now feel that the problem is much more involved than it initially looked. We could also have situations where the dataset is so sparse that the relative order of some fields is indeterminate, or where fields appear in inconsistent orders across features.

My initial thought was to use a sorting algorithm, such as std::sort in the standard C++ library, with a custom comparison function that would use the (i, j) order of appearance ( (i, j) meaning that field i immediately appears before field j in a feature ). But there are several issues:

- the sort algorithm may invoke the comparison function with any pair of field names, but we only have knowledge of a subset of those. If we build a graph where the nodes are the field names and a directed edge between i and j means that field i immediately appears before j in a feature, then we can know whether there is a path between any pair of nodes by using standard graph algorithms.

- the comparison function of std::sort implies that we have a strict weak ordering. When there's a path in the graph between 2 nodes, the ordering is clear. We note immediately that the graph should not have cycles, otherwise we would violate the asymmetry property of the strict weak ordering. So when building the graph, we should reject any new edge i->j that would result in a cycle, which boils down to checking whether the reverse path j->i already exists (as sketched below). But what to do for nodes that aren't related by a path? We could decide that they are incomparable. However the "transitivity of incomparability" property could then be violated. For example if there's no path between A and B, and none between B and C (for example B is an isolated node), there might still be a path between A and C. So it is far from obvious that we can define a comparison function that respects the strict weak ordering. And if it fails to respect it, then the std::sort() execution is undefined behaviour, meaning that anything can happen: at best, a wrong result; at worst, a crash, an infinite loop, etc. Look at this post for other examples of comparison functions that subtly violate strict weak ordering.
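
To illustrate the cycle-rejection part (a sketch with hypothetical names, not the actual code of the pull request):

#include <map>
#include <set>
#include <string>

// Directed graph: an edge i -> j means "field i immediately appears before field j".
using FieldGraph = std::map<std::string, std::set<std::string>>;

// Depth-first search: is there a path from 'from' to 'to'?
static bool HasPath(const FieldGraph& g, const std::string& from,
                    const std::string& to, std::set<std::string>& visited)
{
    if (from == to)
        return true;
    if (!visited.insert(from).second)
        return false; // already explored
    const auto it = g.find(from);
    if (it == g.end())
        return false;
    for (const auto& next : it->second)
    {
        if (HasPath(g, next, to, visited))
            return true;
    }
    return false;
}

// Add the edge i -> j, unless the reverse path j -> i already exists,
// which would create a cycle.
static bool AddEdgeIfNoCycle(FieldGraph& g, const std::string& i, const std::string& j)
{
    std::set<std::string> visited;
    if (HasPath(g, j, i, visited))
        return false; // rejected: would create a cycle
    g[i].insert(j);
    return true;
}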

In the end, we want to somehow "linearize" our graph, while respecting the directed edges.

So given:


We want to get a result like [A, F, B, G, D, C, E], or [F, A, B, C, D, E, G], or ...

It appears that this is a known problem in graph theory: computing the topological ordering of a directed acyclic graph. As shown in the above example, there are in general several solutions. To make the result less "random", we use the lexicographical order of field names for the working set S of Kahn's algorithm, which adds a sorting rule for nodes that have no direct relationship in the graph.
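
A condensed sketch of Kahn's algorithm with that lexicographical tie-breaking (hypothetical helper names; the actual implementation is in the pull request linked below):

#include <map>
#include <set>
#include <string>
#include <vector>

// Topological ordering of a DAG (Kahn's algorithm). The working set S is a
// std::set<std::string>, so among nodes with no pending predecessor the
// lexicographically smallest field name is always picked first.
// Assumes every node referenced in 'successors' is also present in 'nodes'.
std::vector<std::string> TopologicalSort(
    const std::set<std::string>& nodes,
    const std::map<std::string, std::set<std::string>>& successors)
{
    std::map<std::string, int> inDegree;
    for (const auto& n : nodes)
        inDegree[n] = 0;
    for (const auto& kv : successors)
        for (const auto& succ : kv.second)
            ++inDegree[succ];

    std::set<std::string> S; // nodes without any incoming edge left
    for (const auto& kv : inDegree)
        if (kv.second == 0)
            S.insert(kv.first);

    std::vector<std::string> result;
    while (!S.empty())
    {
        const std::string n = *S.begin(); // lexicographically smallest
        S.erase(S.begin());
        result.push_back(n);
        const auto it = successors.find(n);
        if (it == successors.end())
            continue;
        for (const auto& succ : it->second)
        {
            if (--inDegree[succ] == 0)
                S.insert(succ);
        }
    }
    return result; // result.size() < nodes.size() would indicate a cycle
}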

The result of that work is in https://github.com/OSGeo/gdal/pull/4552


Tuesday, September 14, 2021

The ultimate guide for successful contribution to open source

(This is a post I started in December 2019 and didn't publish because I felt it was too negative. But recent events show that it is still a current topic, and at least this will document my own perception of things)
 
We have lately stumbled upon a few clumsy attempts by corporations at contributing to open source software. Needless to say, this is a bumpy ride not for the faint of heart, on both sides of the equation.

If you just want to stop your reading here, just remember that good old motto: When in Rome, do as the Romans do. And be prepared to become a Roman citizen.

We open source developers have our own hard-to-decipher customs. We pay tribute to the Almighty Git (*), preferably his most revered (often with awe) incarnation called GitHub. You may also find followers of GitLab, gitea, gitorious.
In that case, never pronounce the word 'GitHub' before them. It is also said that some might still use email patches: be sure to set an 80-character limit to lines, use LF line endings and refrain from using PNG signatures. Be sure to stay away from SourceForge followers. They will continuously impose on you advertising messages that will destroy your sanity.

You should subscribe to mailing lists. Be wary of trolls. We may occasionally engage in flame wars, generally sanctioned by motions. We are structured into groups that defer to a Project Steering Committee to decide (or not) which way to follow. Our groups regularly gather at solstice time in dark rooms for a week-long trance (note: this was written pre-pandemic), known as a hackfest, hackathon or code sprint. For the tribes involved in geospatial, we have an annual Pow-Wow, called FOSS4G, during which we temporarily bury the hatchet and occasionally thank our sponsors. You may stumble upon a C89 developer sharing a beer with a Java 11 one, or an adorer of Leaflet trying to convert an OpenLayers follower. If you want to join the Pow-Wow, remove your tie and suit, wear a t-shirt and sandals, and make sure to display a prominent "I Love Open Source" sticker on your laptop. You may also add an "I Love Shapefile" or "I Love GeoPackage" sticker, but only if you are confident about which one the group pays tribute to. Failure to display the appropriate sticker will expose you to terrifying insults about 10-character limits or obscure write-ahead log locking issues. If unsure, use the "I ? Shapefile" sticker.

We have taboo words. Thou Shalt Not Talk about business plan, intellectual property, patent portfolio, customer demand, strategy, costless SDK, CVE number, internal policy, consolidated codebases, education license fees. Planning and roadmap should also be pronounced with care, as some tribes will probably not have even the slightest idea of what you are talking about.

If despite all those strange habits you want to join us, be prepared to follow a long and demanding initiation path. You will have to demonstrate your (possibly affected) adoration of our principles. As strange as it may seem, we abhor being offered large presents that we cannot prevent ourselves from seeing as Trojan horses. You will rather have to locate a long list of problems pinned on a wall called the bug tracker, where each member of the community has written an issue or a wish. Start with a modest one that makes sense to you, solve it, and offer its resolution to the Reviewer in a sufficiently argued Pull Request. The Reviewer, who often equates to the Maintainer, will scrutinize your gift, often at night time after Day Job. He may decide to accept it right away, or return it to you with a comment requesting a change, or just ignore it with the highest contempt. Be humble and obey his command. You may beg for enlightenment, but never object, even if you cannot make sense of his rebuttal. He is the almighty after the Almighty. Never Rebase if asked to Merge; never Merge if asked to Rebase. Refactor when asked, but don't when not ! You should rewrite history before submitting, or maybe not. If unsure, consult Contributing.md, or be prepared for RTFM to be yelled at you (only for the tribes that don't yet have a CodeOfConduct.md). Do not even consider objecting that you have not been tasked to address his demands. You must also make sure to listen to the complaints of the Continuous Integration (CI) half-gods: the Reviewer will probably not even look at your gift until CI has expressed its satisfaction. Retry as many times as needed until they are pleased. You may attempt to submit an RFC, but be prepared for lengthy and lively discussions. Listen, correct, or you may not survive your first "-1" spell !

We especially praise gifts that have no direct value for you: improved test suites (e.g. https://github.com/OSGeo/gdal/issues/4407), documentation additions and fixes, answering users on the mailing list. Only when you feel that you have built enough trust might you try to offer your initial gift. But, even if it is accepted, the most surprising habit is that your gift will remain yours. You will have to make sure you regularly remove the dust that accumulates on it over time, so that it remains constantly shiny. While removing dust from your gift, never neglect to also remove it from nearby gifts offered by others. Otherwise the Maintainer might threaten to invoke the terrible Revert incantation on you. Also consider that existing contributors to a project might see your new code, which they have no direct use for, as an extra burden on their daily tasks (studying the commit history of https://github.com/qgis/QGIS/commits/master/src/providers/hana, or even more crudely of https://github.com/qgis/QGIS/commits/master/src/providers/db2, demonstrates that).

Ready for a ride, and possibly enjoying the feeling that "our" code can also become "yours" ?


(*) some tribes, cut off from the rest of the world, are said to still pursue their adoration of more ancient divinities sometimes known as SVN or CVS. We cannot confirm and have eradicated the last remains of those old cults.

Tuesday, August 31, 2021

Analysis of OSS-Fuzz related data on GDAL

GDAL has now been under the continuous scrutiny of OSS-Fuzz for more than 4 years. To keep it simple, OSS-Fuzz is a continuously running infrastructure that stresses software with (not-so-)random data to discover various flaws, and automatically files issues in a dedicated issue tracker, with reproducer test cases and stack traces when available. It is time to make some use of the accumulated data and give a few trends.

First, we can see a total of 1787 issues having been found, which represents on average a bit more than one per day. Out of those, only 38 are still open (so 97.8% have been fixed or are no longer reproducible). Those 1787 issues are out of a total of 37,769 issues filed against all 530 enrolled projects, hence representing 4.6 % (so significantly higher than the naive 1 / 530 = 0.2 % proportion that we could expect, at least if all projects were of the same size, but GDAL is likely larger than the average). Does that mean that the quality of GDAL is lower than the average of enrolled software, or that it is stressed in a more efficient way...? Since most of the GDAL code deals with a lot of file formats, it is an ideal fit for fuzz testing.

We should mention that a number of issues attributed to GDAL actually belong to some of its dependencies: PROJ (coordinate transformation), Poppler (PDF rendering), cURL (network access), SQLite3 (embedded database), Xerces-C (XML parsing), etc. We regularly report or fix those issues in those upstream components.

Addressing those issues is now facilitated by the sponsorship program, which allows us to spend funded time on such unsexy and usually hard-to-fund topics, which are nevertheless important to enhance the robustness of the software.

We have run a script that parses git commit logs to identify commits that explicitly reference an issue filed by ossfuzz and tried to analyze the commit message to find the category of the issue that it addresses.

Here's its output:

Category                                                Count  Pct
----------------------------------------------------------------------
integer_overflow_signed                                  178  14.67 %
excessive_memory_use                                     174  14.34 %
other_issue                                              114   9.40 %
buffer_overflow_unspecified                              105   8.66 %
excessive_processing_time                                102   8.41 %
null_pointer_dereference                                  93   7.67 %
integer_overflow_unsigned                                 68   5.61 %
division_by_zero_integer                                  54   4.45 %
heap_buffer_overflow_read                                 37   3.05 %
unspecified_crash                                         32   2.64 %
stack_call_overflow                                       31   2.56 %
memory_leak_error_code_path                               23   1.90 %
heap_buffer_overflow_write                                22   1.81 %
division_by_zero_floating_point_unknown_consequence       19   1.57 %
stack_buffer_overflow_unspecified                         19   1.57 %
invalid_cast                                              14   1.15 %
invalid_shift_unspecified_dir                             14   1.15 %
memory_leak_unspecified                                   13   1.07 %
assertion                                                 13   1.07 %
invalid_enum                                              11   0.91 %
division_by_zero_floating_point_harmless                  10   0.82 %
infinite_loop                                             10   0.82 %
integer_overflow_unsigned_harmless                         8   0.66 %
invalid_memory_dereference                                 7   0.58 %
undefined_shift_left                                       7   0.58 %
double_free                                                6   0.49 %
stack_buffer_overflow_read                                 6   0.49 %
use_after_free                                             6   0.49 %
integer_overflow_harmless                                  5   0.41 %
undefined_behavior_unspecified                             3   0.25 %
negative_size_allocation                                   3   0.25 %
unhandled_exception                                        2   0.16 %
invalid_shift_right                                        2   0.16 %
unsigned_integer_underflow                                 1   0.08 %
uninitialized_variable                                     1   0.08 %
----------------------------------------------------------------------
Total                                                   1213


So 1213 commits for 1787-38 fixed issues: the difference is due to duplicated issues, issues that do not reproduce reliably and end up being closed, or issues fixed in code bases other than GDAL.

Let's dig into those categories, from most frequently hit to lesser ones:

  • integer_overflow_signed: an arithmetic operation on a signed integer whose result overflows its size. This is undefined behavior in C/C++, meaning that anything can happen in theory. In practice, most reasonable compilers and common CPU architectures implementing two's-complement signed integers will exhibit wrap-around behavior, and not crash when the overflow occurs. However this often has later consequences, like allocating an array of the wrong size, leading to out-of-bounds accesses.
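
An illustrative, made-up sketch of that pattern (not actual GDAL code):

#include <climits>
#include <cstdlib>

// width and height come from the (untrusted) input file.
unsigned char* AllocatePixelBuffer(int width, int height)
{
    // With width = height = 100000, width * height overflows a 32-bit int:
    // undefined behavior; in practice it typically wraps around to an
    // incorrect (possibly negative) value, so the allocated buffer does not
    // match the later width * height accesses.
    int nPixels = width * height;
    return static_cast<unsigned char*>(std::malloc(nPixels));
}

// A safer version checks the bounds before multiplying:
// if (width <= 0 || height <= 0 || width > INT_MAX / height) return nullptr;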

  • excessive_memory_use: this is not a vulnerability by itself. This issue is raised because processes that run under OSS-Fuzz are limited to 2 GB of RAM usage, which is reasonable given that OSS-Fuzz manipulates input buffers that are generally only a few tens of kilobytes large. It is thus expected that for those small inputs, RAM consumption should remain small. When that is violated, it is generally because the code puts too much trust in some fields of the input data that drive a memory allocation, without checking them against reasonable bounds, or against the file size. However for some file formats it is difficult to implement definitive bounds, because they allow valid small files that need a lot of RAM to be processed. Part of the remaining open issues belong to that category.
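
A typical mitigation (sketch with hypothetical names) is to validate such a field against the actual file size before allocating:

#include <cstdint>
#include <vector>

// nDeclaredCount comes from the file header, nRecordSize is the known on-disk
// size of one record, nFileSize is the real size of the input.
bool ReadRecords(uint64_t nDeclaredCount, uint64_t nRecordSize,
                 uint64_t nFileSize, std::vector<uint8_t>& out)
{
    // A file cannot contain more records than its size allows: reject
    // allocations driven by an absurd declared count.
    if (nRecordSize == 0 || nDeclaredCount > nFileSize / nRecordSize)
        return false;
    out.resize(static_cast<size_t>(nDeclaredCount * nRecordSize));
    // ... actual reading omitted ...
    return true;
}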

  • other_issue: issues that could not be classified under a more precise category. Bonus points for anyone improving the script to analyze the diff and figure out from the code the cases where the commit message lacks details! Or perhaps just parse the OSS-Fuzz issue itself, which gives a categorization.

  • buffer_overflow_unspecified: an access outside the validity area of some buffer (string, array, vector, etc.), where we could not determine whether the buffer is heap- or stack-allocated, nor whether it is a read or a write attempt. Potentially can result in arbitrary code execution.

  • excessive_processing_time: this is when a program exceeds the 60-second timeout granted by OSS-Fuzz to complete processing of an input buffer. This is often due to a sub-optimal algorithm (e.g. quadratic complexity where linear is achievable), or an unexpected network access. Most of the remaining open issues are in that category. A significant number are also out-of-memory situations hit by the fuzzer while iterating over many inputs, which often cannot be reproduced on an individual test case. We suspect heap fragmentation to happen in some of those situations.

  • null_pointer_dereference: a classic programming issue: accessing a null pointer, which results in an immediate crash.

  • integer_overflow_unsigned: this one is interesting. Technically in C/C++, overflow of unsigned integers is well-defined behavior: wrap-around is guaranteed by the standards. However we assumed that in most cases the overflow was unintended and could lead to similar bugs as signed integer overflow, hence we opted in for OSS-Fuzz to consider those overflows as issues. For the very uncommon cases where the overflow is valid (e.g. when applying a difference filter on a sequence of bytes), we can tag the function where it occurs with an __attribute__((no_sanitize("unsigned-integer-overflow"))) annotation.
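
For example, a delta-decoding routine where wrap-around is intended could be annotated like this (sketch, hypothetical name):

#include <cstddef>
#include <cstdint>

// Intentional modular arithmetic: wrap-around is part of the encoding, so we
// opt this function out of the unsigned-integer-overflow sanitizer check.
__attribute__((no_sanitize("unsigned-integer-overflow")))
static void UndoDifferenceFilter(uint32_t* panData, size_t nCount)
{
    for (size_t i = 1; i < nCount; ++i)
        panData[i] += panData[i - 1];
}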

  • division_by_zero_integer: an integer divided by zero, with zero being evaluated as an integer. Results in an immediate crash, on the x86 architecture at least.

  • heap_buffer_overflow_read: read access outside the validity area of a heap-allocated data structure. Generally results in a crash.

  • unspecified_crash: crash for an unidentified reason. Same bonus points as above for helping to better categorize them.

  • stack_call_overflow: recursive calls to methods that end up blowing the size limit of the stack, which results in a crash.

  • memory_leak_error_code_path: memory leak that is observed in a non-nominal code path, that is, on corrupted/hostile datasets.

  • heap_buffer_overflow_write: write access outside the validity area of a heap-allocated data structure. Generally results in a crash. Could also sometimes be exploited for arbitrary code execution.

  • division_by_zero_floating_point_unknown_consequence: a variable is divided by zero, with the operation done in floating point. In C/C++ this is undefined behavior, but on CPU architectures implementing IEEE-754 (that is, pretty much all of them nowadays) with default settings, the result is either infinity or not-a-number. If that result is later cast to an integer, this is again undefined behavior (generally not crashing), with various potential bugs as a consequence (which can be crashing).
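
An illustrative sketch of that chain of events (made-up example):

// dfNumerator and dfDenominator come from the input file.
int ComputeScaledValue(double dfNumerator, double dfDenominator)
{
    // With IEEE-754, dividing by 0.0 yields +/-infinity (or NaN for 0.0/0.0)
    // rather than crashing...
    double dfRatio = dfNumerator / dfDenominator;
    // ...but casting an infinite or NaN double to int is undefined behavior,
    // and the resulting garbage value can trigger bugs further down.
    return static_cast<int>(dfRatio);
}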

  • stack_buffer_overflow: a read or write access outside of a stack-allocated buffer. Often results in crashes, and, for a write access, can sometimes be exploited for arbitrary code execution.

  • invalid_cast: this may be an integer cast to an invalid value for an enumeration (unspecified behavior, which can result in bugs due to not catching that situation later), or an instance of a C++ class cast to an invalid type (unspecified behavior, crash likely).

  • invalid_shift_unspecified_direction: a left- or right-shift binary operation, generally on a signed integer. For a left-shift, this is when the operation ends up setting the most significant bit (either by shifting a positive value too far, which results in a negative value, or by shifting a negative value), or when shifting by a number of bits equal to or greater than the width of the integer. For a right-shift, this is when shifting by a number of bits equal to or greater than the width of the integer. Those undefined behaviors do not result in immediate crashes on common compilers/platforms, but can lead to subsequent bugs.

  • memory_leak_unspecified: self explanatory.

  • assertion: an assert() in the code is hit. A lot of what we initially think of as programming invariants can actually be violated by specially crafted input, and should be replaced by classic checks that error out in a clean way.
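
A typical fix pattern (made-up sketch):

// Before: the "invariant" can actually be violated by a crafted file.
//     assert(nBandCount >= 1);
// After: error out cleanly instead of aborting.
static bool ValidateBandCount(int nBandCount)
{
    if (nBandCount < 1)
    {
        // report the corruption to the caller in a clean way
        return false;
    }
    return true;
}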

  • invalid_enum: a particular case of invalid_cast where an invalid value is stored in a variable of an enumeration type.

  • division_by_zero_floating_point_harmless: a floating-point division by zero, whose consequences are estimated to be harmless. For example the NaN or infinity value is directly returned to the user and does not influence further execution of the code.

  • infinite_loop: a portion of the code is executed endlessly. This is a denial of service.

  • integer_overflow_unsigned_harmless: an example of that could be some_string.substr(some_string.find(' ') + 1) to extract the part of a string after the first space character, or the whole string if there is none. When there is no space, find() returns std::string::npos, which is the largest size_t value, and thus adding 1 to it results in 0.

  • invalid_memory_dereference: generally the same as a heap_buffer_overflow_read.

  • undefined_shift_left: mentioned above.

  • double_free: a heap-allocated buffer is destroyed twice. Later crash is likely to occur. Could potentially be exploited for arbitrary code execution.

  • stack_buffer_overflow_read: mentioned above

  • use_after_free: subcase of heap_buffer_overflow_read where one accesses some buffer after it has been freed.

  • integer_overflow_harmless: mentioned above, but here, if we exclude the undefined behavior aspect of it, consequences are estimated to be harmless.

  • undefined_behavior_unspecified: some undefined behavior of an unidentified category, restricted to those caught by clang's UBSan analyzer.

  • negative_size_allocation: a negative size (actually an unsigned integer with its most significant bit set) is passed to a heap memory allocation routine. Not a bug by itself, but often reflects a previous issue.

  • unhandled_exception: a C++ exception that propagates up to the main() without being caught. Crash ensured for C code, or C++ callers not expecting it. As the fuzzer programs used by GDAL use the C API, such exception popping up is definitely a bug.

  • invalid_shift_right: mentioned above.

  • unsigned_integer_underflow: similar to the overflow but going through negative values. Well-defined behavior according to the standards, but often undesirable.

  • uninitialized_variable: access to a variable whose content is uninitialized. There are very few instances of that. The reason is probably that we make extensive use of strict compiler warnings and static code analyzers (cppcheck, clang static analyzer and Coverity Scan), which are very good at catching such issues.

Despite the large number of categories caught by the sanitizers run by OSS-Fuzz, there are some gaps not covered. For example, casting an integer (or floating point value) to a narrower integer type when the value does not fit into the target type. Those conversions are considered "implementation-defined" (the compiler must do something, potentially different from another compiler, and document what it does), and thus do not fall within what is caught by undefined behavior sanitizers.
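
An example of such a narrowing conversion that the sanitizers stay silent about (made-up sketch):

#include <cstdint>

// Integer-to-integer narrowing: the value 300 does not fit into an int8_t.
// This is implementation-defined (typical compilers wrap it to 44), so no
// undefined behavior sanitizer will report it.
int8_t Narrow(int nValue /* e.g. 300 */)
{
    return static_cast<int8_t>(nValue);
}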

PS: For those interested in academic research analyzing outputs of OSSFuzz, you can find this paper (where GDAL actually appears in one figure)

Thursday, May 2, 2019

Incremental Docker builds using ccache

These past days, I've experimented with Docker to be able to produce "official" nightly builds of GDAL.
Docker Hub is supposed to have an automated build mechanism, but the cloud resources they put behind that feature seem to be insufficient to sustain the demand, and builds tend to drag on forever.
Hence I decided to set up a local cron job to refresh my images and push them. Of course, as there are currently 5 different Dockerfile configurations and building both PROJ and GDAL from scratch can be time consuming, I wanted this to be as efficient as possible. One observation is that between two nightly builds very few files change on average, so ideally I would want to recompile only the ones that have changed, and have the minimum number of updated Docker layers refreshed and pushed.

I combined several approaches to optimize the builds. For those already familiar with Docker, you can probably skip to the "Use of ccache" section of this post.

Multi-stage builds

This is a Docker 17.05 feature in which you can define several stages (each of which forms a separate image), and later stages can copy from the file system of the previous stages. Typically you use a two-stage approach.
The first stage installs development packages, builds the application and installs it in some /build directory.
The second stage starts from a minimal image, installs the runtime dependencies, and copies the binaries generated at the previous stage from /build to the root of the final image.
This approach avoids having any development packages in the final image, which keeps it lean.

Such a Dockerfile will look like:

FROM ubuntu:18.04 AS builder
RUN apt-get update && apt-get install -y g++ make
COPY . /src
RUN cd /src && ./configure --prefix=/usr && make && make install DESTDIR=/build

FROM ubuntu:18.04 AS finalimage
RUN apt-get update && apt-get install -y libstdc++6
COPY --from=builder /build/usr/ /usr/



Fine-grained layering of the final image

Each step in a Dockerfile generates a layer; chained together, the layers form an image.
When pulling/pushing an image, layers are processed individually, and only the ones that are not present on the target system are pulled/pushed.
One important note is that the refresh/invalidation of a step/layer causes the refresh/invalidation of later steps/layers (even if the content of the layer does not change in a user-observable way, its internal ID will change).
So one approach is to put first in the Dockerfile the steps that change the least frequently, such as dependencies coming from the package manager, third-party dependencies whose versions rarely change, etc., and at the end the applicative part. And even the applications refreshed as part of the nightly builds can be decomposed into fine-grained layers.
In the case of GDAL and PROJ, the installed directories are:
$prefix/usr/include
$prefix/usr/lib
$prefix/usr/bin
$prefix/usr/share/{proj|gdal}

The lib is the most varying one (each time a .cpp file changes, the .so changes).
But installed include files and resources tend to be less frequently updated.

So a better ordering of our Dockerfile is:
COPY --from=builder /build/usr/share/gdal/ /usr/share/gdal/
COPY --from=builder /build/usr/include/ /usr/include/
COPY --from=builder /build/usr/bin/ /usr/bin/
COPY --from=builder /build/usr/lib/ /usr/lib/


With one subtlety: as part of our nightly builds, the sha1sum of the HEAD of the git repository is embedded in a string of $prefix/usr/include/gdal_version.h. So in the builder stage, I separate that precise file from the other include files and put it in a dedicated /build_most_varying target, together with the .so files.

RUN [..] \
    && make install DESTDIR=/build \
    && mkdir -p /build_most_varying/usr/include \
    && mv /build/usr/include/gdal_version.h /build_most_varying/usr/include \
    && mv /build/usr/lib /build_most_varying/usr


And thus, the finalimage stage is slightly changed to:

COPY --from=builder /build/usr/share/gdal/ /usr/share/gdal/
COPY --from=builder /build/usr/include/ /usr/include/
COPY --from=builder /build/usr/bin/ /usr/bin/
COPY --from=builder /build_most_varying/usr/ /usr/


Layer depending on a git commit

In the builder stage, the step that refreshes the GDAL build depends on an argument, GDAL_VERSION, that defaults to "master":

ARG GDAL_VERSION=master
RUN wget -q https://github.com/OSGeo/gdal/archive/${GDAL_VERSION}.tar.gz \
    && build instructions here...

Due to how Docker layer caching works, building this Dockerfile several times in a row would not refresh the GDAL build (unless you invoke docker build with the --no-cache switch, which disables all layer caching), so the script that triggers the docker build gets the sha1sum of the latest git commit and passes it with:

GDAL_VERSION=$(curl -Ls https://api.github.com/repos/OSGeo/gdal/commits/HEAD -H "Accept: application/vnd.github.VERSION.sha")
docker build --build-arg GDAL_VERSION=${GDAL_VERSION} -t myimage .

In the (unlikely) event that the GDAL repository has not changed, no new build would even be attempted.

Note: this part is not necessarily a best practice. Other Docker mechanisms, such as using a Git URL as the build context, could potentially be used. But as we want to be able to refresh both GDAL and PROJ, that would not really be suitable. Another advantage of the above approach is that the Dockerfile is self-sufficient to create an image with just "docker build -t myimage ."

Use of ccache

This is the part for which I could not find an already-made and easy-to-deploy solution.

With the previous techniques, we have a black-and-white situation. A GDAL build is either entirely cached by the Docker layer caching when the repository did not change at all, or completely done from scratch if the commit id has changed (even for a change that does not affect the installed files at all). It would be better if we could use ccache to minimize the number of files to be rebuilt.
Unfortunately it is not possible with docker build to mount a volume where the ccache directory would be stored (apparently because of security issues). There is an experimental RUN --mount=type=cache feature in Docker 18.06 that could perhaps be used equivalently, but it requires both the client and the daemon to be started in experimental mode.

The trick I use, which has the benefit of working with a default Docker installation, is to download, from within the Docker build container, the content of a ccache directory stored on the host, do the build, and upload the modified ccache back onto the host.

I use rsync for that, as it is simple to set up. Initially I used an rsync daemon running directly on the host, but based on the inspiration given by https://github.com/WebHare/ccache-memcached-server, which proposes an alternative, I've modified it to run in a Docker container, gdal_rsync_daemon, which mounts the host ccache directory. The benefit of my approach over the ccache-memcached-server one is that it does not require a patched version of ccache to run in the build instance.

So the synopsis is:

host cache directory <--> gdal_rsync_daemon (docker instance)  <------> Docker build instance
                  (docker volume mounting)                           (rsync network protocol)


You can consult here the relevant portion of the launching script which builds and launches the gdal_rsync_daemon. And the corresponding Dockerfile step in the builder stage is rather straightforward:

# for alpine. or equivalent with other package managers
RUN apk add --no-cache rsync ccache

ARG RSYNC_REMOTE
ARG GDAL_VERSION=master
RUN if test "${RSYNC_REMOTE}" != ""; then \
        echo "Downloading cache..."; \
        rsync -ra ${RSYNC_REMOTE}/gdal/ $HOME/; \
        export CC="ccache gcc"; \
        export CXX="ccache g++"; \
        ccache -M 1G; \
    fi \
    # omitted: download source tree depending on GDAL_VERSION
    # omitted: build
    && if test "${RSYNC_REMOTE}" != ""; then \
        ccache -s; \
        echo "Uploading cache..."; \
        rsync -ra --delete $HOME/.ccache ${RSYNC_REMOTE}/gdal/; \
        rm -rf $HOME/.ccache; \
    fi


I also considered a simplified variation of the above that would not use rsync, where after the build we would "docker cp" the cache from the build image to the host, and at the next build copy the cache into the build context. But that would have two drawbacks:
  • our build layers would contain the cache
  • any change in the cache would cause the build context to be different and subsequent builds to have their cached layers invalidated.

Summary

We have managed to create a Dockerfile that can be used in a standalone mode to create a GDAL build from scratch, or can be integrated in a wrapper build.sh script that offers incremental rebuild capabilities to minimize the use of CPU resources. The image has fine-grained layering, which also minimizes upload and download times for frequent push / pull operations.

Thursday, January 31, 2019

SRS barn raising: 8th report. Ready for your testing !

This is the 8th progress report of the GDAL SRS barn effort.

As the title implies, a decisive milestone has now been reached, with the "gdalbarn" branches of libgeotiff and GDAL having now been merged into their respective master branches.

On the PROJ side, a number of fixes and enhancements have been done:
- missing documentation for a few functions, for the evolution of cs2cs and for the new projinfo utility has been added
- the parser of the WKT CONCATENATEDOPERATION construct can now understand steps presented in reverse order
- a few iterations to update the syntax parsing rules of WKT2:2018 following the latest adjustments done by the OGC WKT Standard Working Group
- in my previous work, I had introduced a "PROJ 5" convention to export CRS using pipeline/unitconvert/axisswap, as an attempt at improving the PROJ.4 format used by GDAL and other products. However, after discussion with other PROJ developers, we realized that it is likely a dead end since it is still lossy in many aspects and can cause confusion with coordinate operations. Consequently the PROJ_5 convention will be identical to PROJ_4 for CRS export, and the use of PROJ strings to express CRS themselves is discouraged. It can still make sense when using the "early-binding" approach and specifying towgs84/nadgrids/geoidgrids parameters. But in a late-binding approach, WKT is much more powerful at capturing important information like geodetic datum names.
- when examining how the new code I added these past months fits with the existing PROJ codebase, it became clear that there was confusion when importing PROJ strings expressing coordinate operations versus PROJ strings expressing a CRS. So for the latter use case, a "+type=crs" must be added to the PROJ string (a minimal sketch is shown after this list). As a consequence the proj_create_from_proj_string() and proj_create_from_user_input() functions have been removed, and proj_create() can now be used for all types of PROJ strings.
- The PROJ_LIB environment variable now supports multiple paths, separated by colon on Unix and semi-colon on Windows
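
A minimal sketch of the "+type=crs" usage with the new proj_create() (error handling mostly omitted):

#include <proj.h>

int main()
{
    // Without "+type=crs", this string would be interpreted as a coordinate
    // operation; with it, it is interpreted as a CRS definition.
    PJ* crs = proj_create(PJ_DEFAULT_CTX,
                          "+proj=longlat +datum=WGS84 +type=crs");
    if (crs == nullptr)
        return 1;
    proj_destroy(crs);
    return 0;
}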

On the GDAL side,
  • the OGRCoordinateTransformation class now uses the PROJ API to automatically compute the best transformation pipeline, enabling late-binding capabilities. In the case where the user does not provide an explicit area of use and several coordinate operations are possible, the Transform() method can automatically switch between coordinate operations given the input coordinate values. This should offer behaviour similar to previous versions, for example for NAD27 to NAD83 conversion, when PROJ had a +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat hardcoded rule. This dynamic selection logic has also been moved to the PROJ proj_create_crs_to_crs() function. Note however that this might not always lead to the desired results, so specifying a precise area of interest, or even a specific coordinate operation, is preferred when full control is needed.
  • gdalinfo, ogrinfo and gdalsrsinfo now output WKT2:2018 by default (this can be changed with a command line switch). On the API side, the exportToWKT() method will still export WKT1 by default (this can of course be changed too). The rationale is that WKT consumers might not be ready yet for WKT2, so this should limit backward compatibility issues. In the future (couple of years timeframe), this default WKT version might be upgraded when more consumers are WKT2-ready.
  • An RFC 73: Integration of PROJ6 for WKT2, late binding capabilities, time-support and unified CRS database document was created to document all the GDAL changes. After discussion with the community, this RFC has been approved
  • As a result, all of the above-mentioned work has now been merged into GDAL master
  • Important practical discussion: GDAL master now depends on PROJ master (and ultimately PROJ 6.0 once it is released)
Consequently, on the pure development front, most of the work has now been completed. As all those changes done these last months deeply impact SRS-related functionality in GDAL and PROJ, we now rely on your careful testing to spot the inevitable issues that have not yet been detected by the respective automatic regression test suites. The earlier they are detected, the easier they will be to fix, in particular if they impact the API.

Friday, December 28, 2018

SRS barn raising: 7th report

This is the 7th progress report of the GDAL SRS barn effort.

On the PROJ side, things are consolidating. Tens of "random" fixes have been pushed, the GDAL autotest suite having triggered a number of interesting cases. The C API has also been enhanced to accommodate the needs of GDAL and libgeotiff. We have also received feedback from an early adopter (a developer of a pyproj binding based on PROJ master). The major development items have been the move of the WKT 1 syntax validation from GDAL to PROJ, as well as the development of an equivalent WKT 2 syntax validator in PROJ (this task has been useful to uncover a few minor issues in the draft of the future WKT2:2018 standard). A reorganization of the PROJ source tree (with a conversion of most C files to C++ files) has also been done as a preliminary step for a pull request to better integrate the new ISO-19111 functionality I developed those last months with the existing C API.

Regarding libgeotiff, a v1.4.3 maintenance version has been released with the fixes of the last two years, before the new work for the integration of PROJ master is done. libgeotiff development has been moved from Subversion to https://github.com/OSGeo/libgeotiff. As a preliminary step, continuous integration capability has been added to test compilation under Linux/GCC and Windows/Visual Studio, with a few runtime tests.
A pull request is ready with the integration of PROJ master with libgeotiff. It features:
  • PROJ master / PROJ 6 as a required dependency of libgeotiff
  • Use of the proj.db database to resolve the various coded values, mostly in the GTIFGetDefn() "normalization" function of libgeotiff, instead of using the .csv files previously generated from the EPSG database. Typically, an EPSG code identifying a projected CRS is resolved into its base elements: projection method, projection parameters, base geodetic CRS, etc.
  • Complete removal of those .csv files and associated functionality
This work will be merged once the above-mentioned PROJ pull request that affects the naming of the new functions of the C API has been merged and taken into account. This gdalbarn branch of libgeotiff has also been successfully integrated into the internal libgeotiff copy of the GDAL gdalbarn branch.

Regarding GDAL, maintenance v2.3.3 and feature v2.4.0 versions have been released, for the same reasons as above.
Most methods of the OGRSpatialReference class have now been re-implemented to rely on the PROJ C API to do queries and state changes, which avoids potentially lossy import / export to WKT 1. Similarly to the libgeotiff work, the ImportFromEPSG() functionality now relies on proj.db, and consequently all EPSG or ESRI related .csv files have been removed from the GDAL data directory. I've also drafted a plan regarding how to take WKT 2 into account in GDAL raster drivers, and proposed changes regarding how to better handle the gap between the axis order mandated by the CRS authority (for example latitude first, longitude second for geographic CRS in the EPSG dataset) and the actual order of the values in raster metadata or vector geometries. The first part (use of OGRSpatialReference in raster drivers) has been implemented in the gdalbarn branch, and the second part is in good progress, with the drivers now advertising their data axis to CRS axis mapping. The ongoing work is to make OGRCoordinateTransformation use the PROJ API to automatically compute the best transformation pipeline, enabling late-binding capabilities.



Friday, November 30, 2018

SRS barn raising: 6th report

This is the sixth progress report of the GDAL SRS barn effort. The pace of changes has not yet slowed down, with a significant part of the work still being in PROJ, plus an initial integration in GDAL.

The major news item is that RFC2, implementing the new capabilities (WKT-2 support, late-binding approach, SQLite database), has now been merged into PROJ master.

An initial integration of PROJ master into GDAL has been started in a custom GDAL branch. This includes:
  • PROJ master, which will be released as 6.0, is now a required dependency of GDAL. It actually becomes the only required external third-party dependency (if we except the copies of a few libraries, such as libtiff, libgeotiff, etc., that have traditionally been included in the GDAL source tree)
  • The dozen continuous integration configurations have been modified to build PROJ master as a preliminary step.
  • Related to the above, we have included in PROJ a way to "version" its symbols. If PROJ is built with -DPROJ_RENAME_SYMBOLS in CFLAGS and CXXFLAGS, all its exported symbols are prefixed with "internal_". This enables GDAL to link against PROJ master while still using pre-compiled dependencies (such as libspatialite) that link against the system PROJ version, without a risk of symbol clash. This is particularly useful to be able to run GDAL autotests on continuous integration environments that use pre-packaged dependencies (or if you want to test the new GDAL without rebuilding all reverse dependencies of GDAL). This however remains a hack, and ultimately, once PROJ 6 has been released, all reverse dependencies should be built against it. (This solution has been successfully tested in the past, when GDAL had an internal libtiff 4.0 copy whereas the external libtiff used by some GDAL dependencies relied on the system libtiff 3.X)
  • Compatibility mechanisms which were required to support older PROJ versions have been removed. In particular, the runtime loading (using dlopen() / LoadLibrary() mechanism) has been removed. It proved to cause code complication, and users frequently ran into headaches with different PROJ versions being loaded and clashing/crashing at runtime.
  • The OGRSpatialReference class, which implements CRS manipulation in GDAL, has been modified to use the new PROJ functions to import and export between WKT and PROJ strings. Previously GDAL had such code, which is now redundant with what PROJ offers. This preliminary integration caused a number of fixes to be made in PROJ to ensure compatibility with GDAL's input and output for WKT 1 and PROJ strings. Besides "moving" code from GDAL to PROJ, a practical consequence is that the addition of a new projection method into PROJ will no longer require changes to GDAL for it to be usable for reprojection purposes.

There have been reflections on how to make the existing PROJ code use the new code developed in PROJ. A pull request is currently under review and implements:
  • the changes needed to remove from the data/ directory the now obsolete EPSG, IGNF, esri and esri.extra files, and to rely instead on the proj.db database
  • making the proj_create_crs_to_crs() API use the new late-binding approach to create transformation pipelines
  • updating cs2cs to use that new API. 
  • listing and addressing backward compatibility issues related to honouring the official axis order
The integration phase in GDAL continues, with the aim of using more of the new PROJ code. Typically, the OGRSpatialReference class that models the CRS/SRS in GDAL was up to now mostly a hierarchy of WKT nodes, where setter methods of OGRSpatialReference would directly create/modify/delete nodes, and getter methods would query them. This approach was fine when you had to manage just one WKT version (with the caveat that it was also easy to produce invalid WKT representations, lacking mandatory nodes). However, this is no longer appropriate now that we want to support multiple WKT versions. Our goal is to make OGRSpatialReference act instead on osgeo::proj::CRS objects (and its derived classes). Switching between the two abstractions is a non-trivial task, and doing it in a big-bang approach seemed risky, so we are doing it progressively by using a dual internal modelling. An OGRSpatialReference instance will maintain a osgeo::proj::CRS object as its primary source and, for operations not yet converted to the new approach, will fall back to translating it internally to WKT.1 to allow direct manipulation of the nodes, and then retranslate that updated WKT.1 representation back to a osgeo::proj::CRS object. Ultimately the proportion of methods using the fallback way should decrease (it is not completely clear we can remove all of them, since direct node manipulation is spread across a significant number of GDAL drivers). The task is progressing slowly, because each change can subtly modify the final WKT.1 representation (nodes being added, number of significant digits changing) and cause a number of unit tests to break (the GDAL autotest suite is made of 280 000 lines of Python code), which then need to be analyzed to see whether there was a bug or just an expected result that is slightly altered.
Because of all the above impacts, we have decided to do an early release in December of GDAL master as GDAL 2.4.0, with all the new features since GDAL 2.3, in order to be able to land this PROJ integration afterwards. A GDAL 2.5.0 release will hopefully follow around May 2019 with the result of the gdalbarn work.


Other side activities regarding collecting transformation grids:
  • Following a clarification from IGN France on the open licensing of their geodesy related resources, their CRS and transformation XML registry is now processed to populate the IGNF objects in the proj.db database (the previous import used the already-processed IGNF file containing PROJ strings, which caused information loss). The associated vertical shift grids have also been converted from their text-based format to the PROJ-digestible .gtx format, integrated in the proj-datumgrid-europe package, and referenced in the database for the transformations that use them.
  • The NGS GEOID 2012B vertical grids to convert between NAD83 ellipsoidal heights and NAVD88 heights have also been integrated in the proj-datumgrid-north-america package