dimanche 26 septembre 2021

Order of GeoJSON properties and graph theory

This whole story started with this QGIS bug about a crash when a user adds a new field to a GeoJSON file. Whether GDAL or QGIS is the culprit could be debated. Let's look at the GDAL side of things.

Let's consider this simple GeoJSON file:

{ "type": "FeatureCollection", "features": [
  { "type": "Feature", "geometry": null, "properties": { "A": "a", "C": "c" },
  { "type": "Feature", "geometry": null, "properties": { "A": "a", "B": "b", "C": "c" }
]}

GDAL versions up to 3.3.x will expose a layer definition with the fields in the following order, [A, C, B], and this is what confused QGIS: the C field had been appended as the last editing action, and QGIS expected it to be the last field.

One could also argue that we should not worry about the order of properties, since a JSON object such as "properties" is unordered. However, the JSON parser and writer used by GDAL preserve the order of appearance of the keys, which makes GeoJSON behave similarly to other formats such as Shapefile, GeoPackage or CSV, where the field order is fixed. But in a GeoJSON file where there is no declaration of properties (no "header"), and where properties are omitted in some features, presenting a consistent field order is more challenging.

How does current GDAL proceed to build the list of fields? That's very simple: iterate over the features and, for each feature, iterate over its properties; as soon as a new property is found, append it to the list of properties of the layer. So on the above example, after the first feature, we have [A, C]. And when parsing the second feature, we append B to it, hence [A, C, B]. We can also notice that the order in which features appear in the collection influences the order of properties exposed in the layer definition.
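
To make that concrete, here is a minimal sketch of that historical logic (with hypothetical function and variable names, not the actual GDAL code):

#include <set>
#include <string>
#include <vector>

// Pre-3.4 behaviour: append a field the first time it is seen while
// iterating over the features in file order.
std::vector<std::string> BuildFieldList(
    const std::vector<std::vector<std::string>>& featureProperties)
{
    std::vector<std::string> fields;
    std::set<std::string> seen;
    for (const auto& props : featureProperties)
    {
        for (const auto& name : props)
        {
            if (seen.insert(name).second)  // first time this property is seen
                fields.push_back(name);
        }
    }
    return fields;
}

// With features {A, C} then {A, B, C}, this returns [A, C, B].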

How to fix that? For that very simple case, we could say: "B appears after A, which is already in our list, and before C, which is already in our list just after A, so let's insert it in between". Great! Problem solved? Unfortunately not.

What if you see features with the following properties:

- [A, D] ==> current consolidated list: [A, D]

- then [A, B, D] ==> current consolidated list: [A, B, D]

- then [C, D] ==> hmm, C is before D, but where relative to A and B?

- then [A, C] ==> we now know that C is after A. But before or after B?

- then [B, C] ==> OK, B comes before C. So the final list is [A, B, C, D]

Our simple insertion logic no longer works here, and we can now feel that the problem is much more involved than it initially looked. We could also have situations where the dataset is so sparse that the relative order of some fields is indeterminate, or where fields appear in an inconsistent order across features.

My initial thought was to use a sorting algorithm, such as std::sort in the C++ standard library, with a custom comparison function that would use the (i, j) appearance order ((i, j) meaning that field i immediately appears before field j in a feature). But there are several issues:

- the sort algorithm may invoke the comparison function on any pair of field names, but we only have knowledge of a subset of those. If we build a graph where the nodes are the field names and a directed edge between i and j means that field i immediately appears before j in a feature, then we can know whether there is a path between any pair of nodes by using standard graph algorithms.

- the comparison function of std::sort implies that we have a strict weak ordering. When there is a path in the graph between two nodes, the ordering is clear. We note immediately that the graph should not have cycles, otherwise we would violate the asymmetry property of the strict weak ordering. So when building the graph, we should reject any new edge i->j that would result in a cycle, which boils down to checking whether the reverse path j->i already exists (see the sketch after this list). But what to do for nodes that aren't related by a path? We could decide that they are incomparable. However the "transitivity of incomparability" property could then be violated: for example, if there is no path between A and B, and none between B and C (for example B is an isolated node), there might still be a path between A and C. So it is far from obvious that we can define a comparison function that respects the strict weak ordering. And if it failed to respect it, then the std::sort() execution is undefined behaviour, meaning that anything can happen: at best, a wrong result; at worst, a crash, an infinite loop, etc. Look at this post for other examples of comparison functions that subtly violate strict weak ordering.
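
A possible way of implementing the "reject any edge that would create a cycle" rule is a simple reachability test on the graph built so far. This is only a sketch with made-up names, not the code of the actual pull request:

#include <map>
#include <set>
#include <string>
#include <vector>

// Directed graph stored as adjacency lists: an edge i -> j means
// "field i was seen immediately before field j in some feature".
typedef std::map<std::string, std::set<std::string>> Graph;

// Iterative depth-first search: is 'to' reachable from 'from'?
static bool IsReachable(const Graph& g, const std::string& from,
                        const std::string& to)
{
    if (from == to)
        return true;
    std::set<std::string> visited;
    std::vector<std::string> stack{from};
    while (!stack.empty())
    {
        const std::string cur = stack.back();
        stack.pop_back();
        if (!visited.insert(cur).second)
            continue;
        const auto it = g.find(cur);
        if (it == g.end())
            continue;
        for (const auto& next : it->second)
        {
            if (next == to)
                return true;
            stack.push_back(next);
        }
    }
    return false;
}

// Add the edge i -> j only if it does not create a cycle, i.e. only if
// there is no existing path j -> ... -> i.
bool AddEdgeIfNoCycle(Graph& g, const std::string& i, const std::string& j)
{
    if (IsReachable(g, j, i))
        return false;  // would create a cycle: ignore this ordering hint
    g[i].insert(j);
    return true;
}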

In the end, we want to somehow "linearize" our graph while respecting the directed edges.

So given a directed acyclic graph of fields A to G (the figure is not reproduced here):

We want to get a result like [A, F, B, G, D, C, E], or [F, A, B, C, D, E, G], or ...

It turns out that this is a known problem in graph theory: computing a topological ordering of a directed acyclic graph. As shown in the above example, there are in general several solutions. To make the result less "random", we use the lexicographical order of field names for the working set S of Kahn's algorithm, which adds a sorting rule for nodes that have no direct relationship in the graph.
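
For illustration, here is a sketch of Kahn's algorithm with a lexicographically ordered working set (again with hypothetical names, not the exact code of the pull request):

#include <map>
#include <set>
#include <string>
#include <vector>

// Topological ordering of an acyclic graph of field names, using a
// std::set as the working set S so that nodes without any ordering
// constraint between them come out in lexicographical order.
std::vector<std::string> TopologicalSort(
    const std::map<std::string, std::set<std::string>>& adjacency,
    const std::set<std::string>& allNodes)
{
    // Count the incoming edges of each node.
    std::map<std::string, int> inDegree;
    for (const auto& node : allNodes)
        inDegree[node] = 0;
    for (const auto& srcAndDsts : adjacency)
        for (const auto& dst : srcAndDsts.second)
            ++inDegree[dst];

    // Working set S: nodes that currently have no incoming edge.
    std::set<std::string> S;
    for (const auto& nodeAndDegree : inDegree)
        if (nodeAndDegree.second == 0)
            S.insert(nodeAndDegree.first);

    std::vector<std::string> result;
    while (!S.empty())
    {
        const std::string node = *S.begin();  // smallest name first
        S.erase(S.begin());
        result.push_back(node);
        const auto it = adjacency.find(node);
        if (it != adjacency.end())
        {
            for (const auto& dst : it->second)
            {
                if (--inDegree[dst] == 0)
                    S.insert(dst);
            }
        }
    }
    return result;  // contains all the nodes, since the graph has no cycle
}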

The result of that work is in https://github.com/OSGeo/gdal/pull/4552


mardi 14 septembre 2021

The ultimate guide for successful contribution to open source

(This is a post I started in December 2019 and didn't publish because I felt it was too negative. But recent events show that it is still a current topic, and at least this will document my own perception of things)
 
We have lately stumbled upon a few clumsy attempts by corporations at contributing to open source software. Needless to say, this is a bumpy ride not for the faint of heart, from both sides of the equation.

If you want to stop reading here, just remember the good old motto: When in Rome, do as the Romans do. And be prepared to become a Roman citizen.

We, open source developers, have our own hard-to-decipher customs. We pay tribute to the Almighty Git (*), preferably through his most revered (often with awe) incarnation called GitHub. You may also find followers of GitLab, gitea or gitorious.
In that case, never pronounce the word 'GitHub' before them. It is also said that some might still use email patches: be sure to set an 80-character limit on lines, use LF line endings and refrain from using PNG signatures. Be sure to stay away from SourceForge followers: they will continuously impose on you advertising messages that will destroy your sanity.

You should subscribe to mailing lists. Be wary of trolls. We may occasionally engage in flame wars, generally sanctioned by motions. We are structured into groups that defer to a Project Steering Committee to decide (or not) which way to follow. Our groups regularly gather at solstice time in dark rooms for a week-long trance (note: this was written pre-pandemic), known as a hackfest, hackathon or code sprint. For the tribes involved in geospatial, we have an annual Pow-Wow, called FOSS4G, during which we temporarily bury the hatchet and occasionally thank our sponsors. You may stumble upon a C89 developer sharing a beer with a Java 11 one, or an adorer of Leaflet trying to convert an OpenLayers follower. If you want to join the Pow-Wow, remove your tie and suit, wear a t-shirt and sandals, and make sure to display a prominent "I Love Open Source" sticker on your laptop. You may also add an "I Love Shapefile" or "I Love GeoPackage" sticker, but only if you are confident about which one the group pays tribute to. Failure to display the appropriate sticker will expose you to terrifying insults about 10-character limits or obscure write-ahead log locking issues. If unsure, use the "I ? Shapefile" sticker.

We have taboo words. Thou Shalt Not Talk about business plans, intellectual property, patent portfolios, customer demand, strategy, costless SDKs, CVE numbers, internal policies, consolidated codebases, or education license fees. Planning and roadmap should also be pronounced with care, as some tribes will probably not have even the slightest idea of what you are talking about.

If, despite all those strange habits, you want to join us, be prepared to follow a long and demanding initiation path. You will have to demonstrate your (possibly affected) adoration of our principles. As strange as it may seem, we abhor being offered large presents that we cannot prevent ourselves from seeing as Trojan horses. You will rather have to locate a long list of problems pinned on a wall called the bug tracker, where each member of the community has written an issue or a wish. Start with a modest one that makes sense to you, solve it and offer its resolution to the Reviewer in a sufficiently argued Pull Request. The Reviewer, who often equates to the Maintainer, will scrutinize your gift, often at night time after Day Job. He may decide to accept it right away, or return it to you with a comment requesting a change, or just ignore it with the highest contempt. Be humble and obey his command. You may beg for enlightenment, but never object, even if you cannot make sense of his rebuttal. He is the almighty after the Almighty. Never Rebase if asked to Merge; never Merge if asked to Rebase. Refactor when asked, but don't when not! You should rewrite history before submitting, or maybe not. If unsure, consult Contributing.md, or be prepared for RTFM to be yelled at you (only in the tribes that have not yet written CodeOfConduct.md). Do not even consider objecting that you have not been tasked to address his demands. You must also make sure to listen to the complaints of the Continuous Integration (CI) half-gods: the Reviewer will probably not even look at your gift until CI has expressed its satisfaction. Retry as many times as needed until they are pleased. You may attempt to submit an RFC, but be prepared for lengthy and lively discussions. Listen and correct, or you may not survive your first "-1" spell!

We especially praise gifts that have no direct value for you: improved test suites (e.g. https://github.com/OSGeo/gdal/issues/4407), documentation additions and fixes, answering users on the mailing list. Only when you feel that you have built enough trust might you try to offer your initial gift. But, even if it is accepted, the most surprising habit is that your gift will remain yours. You will have to make sure you regularly remove the dust that accumulates on it over time, so it remains constantly shiny. While removing dust from your gift, never neglect to also remove it from nearby gifts offered by others. Otherwise the Maintainer might threaten to invoke the terrible Revert incantation on you. Also consider that existing contributors to a project might see your new code, which they have no direct use for, as an extra burden on their daily tasks (studying the commit history of https://github.com/qgis/QGIS/commits/master/src/providers/hana, or even more crudely of https://github.com/qgis/QGIS/commits/master/src/providers/db2, demonstrates that).

Ready for the ride, and possibly to enjoy the feeling that "our" code can also become "yours"?


(*) some tribes, cut off from the rest of the world, are said to still pursue their adoration of more ancient divinities sometimes known as SVN or CVS. We cannot confirm this, and have eradicated the last remains of those old cults.

mardi 31 août 2021

Analysis of OSS-Fuzz related data on GDAL

GDAL has now been under the continuous scrutiny of OSS-Fuzz for more than 4 years. To keep it simple, OSS-Fuzz is a continuously running infrastructure that stresses software with (not-so-)random data to discover various flaws, and automatically files issues in a dedicated issue tracker, with reproducer test cases and stack traces when available. It is time to make use of the accumulated data to extract a few trends.

First, we can see a total of 1787 issues having been found, which represents on average a bit more than one per day. Out of those, only 38 are still open (so 97.8% have been fixed or are no longer reproducible). Those 1787 issues are out of a total of 37 769 issues filed against all 530 enrolled projects, hence representing 4.6% (significantly higher than the naive 1 / 530 = 0.2% proportion that we could expect, at least if all projects were of the same size; GDAL is likely larger than the average). Does that mean that the quality of GDAL is lower than the average of enrolled projects, or that it is stressed in a more efficient way...? Most of GDAL's code being about dealing with a lot of file formats, it is an ideal fit for fuzz testing.

We should mention that a number of issues attributed to GDAL actually belong to some of its dependencies: PROJ (coordinate transformation), Poppler (PDF rendering), cURL (network access), SQLite3 (embedded database), Xerces-C (XML parsing), etc. We regularly report or fix those issues in those upstream components.

Addressing those issues is now facilitated by the sponsorship program, which allows us to spend funded time on such unsexy, usually hard-to-fund, but important topics to enhance the robustness of the software.

We have run a script that parses git commit logs to identify commits that explicitly reference an issue filed by OSS-Fuzz, and tried to analyze the commit messages to find the category of issue each commit addresses.

Here's its output:

Category                                                Count  Pct
----------------------------------------------------------------------
integer_overflow_signed                                  178  14.67 %
excessive_memory_use                                     174  14.34 %
other_issue                                              114   9.40 %
buffer_overflow_unspecified                              105   8.66 %
excessive_processing_time                                102   8.41 %
null_pointer_dereference                                  93   7.67 %
integer_overflow_unsigned                                 68   5.61 %
division_by_zero_integer                                  54   4.45 %
heap_buffer_overflow_read                                 37   3.05 %
unspecified_crash                                         32   2.64 %
stack_call_overflow                                       31   2.56 %
memory_leak_error_code_path                               23   1.90 %
heap_buffer_overflow_write                                22   1.81 %
division_by_zero_floating_point_unknown_consequence       19   1.57 %
stack_buffer_overflow_unspecified                         19   1.57 %
invalid_cast                                              14   1.15 %
invalid_shift_unspecified_dir                             14   1.15 %
memory_leak_unspecified                                   13   1.07 %
assertion                                                 13   1.07 %
invalid_enum                                              11   0.91 %
division_by_zero_floating_point_harmless                  10   0.82 %
infinite_loop                                             10   0.82 %
integer_overflow_unsigned_harmless                         8   0.66 %
invalid_memory_dereference                                 7   0.58 %
undefined_shift_left                                       7   0.58 %
double_free                                                6   0.49 %
stack_buffer_overflow_read                                 6   0.49 %
use_after_free                                             6   0.49 %
integer_overflow_harmless                                  5   0.41 %
undefined_behavior_unspecified                             3   0.25 %
negative_size_allocation                                   3   0.25 %
unhandled_exception                                        2   0.16 %
invalid_shift_right                                        2   0.16 %
unsigned_integer_underflow                                 1   0.08 %
uninitialized_variable                                     1   0.08 %
----------------------------------------------------------------------
Total                                                   1213


So 1213 commits for 1787 - 38 = 1749 fixed issues: the difference consists of duplicated issues, issues that do not reproduce reliably and end up being closed, and issues fixed in code bases other than GDAL.

Let's dig into those categories, from most frequently hit to lesser ones:

  • integer_overflow_signed: an arithmetic operation on a signed integer whose result overflows its size. This is undefined behavior in C/C++, meaning that anything can happen in theory. In practice, most reasonable compilers and common CPU architectures implementing two's-complement signed integers will have a wrap-around behavior, and not crash when the overflow occurs. However this often has later consequences, like allocating an array of the wrong size and then having out-of-bounds accesses (see the illustrative sketch after this list).

  • excessive_memory_use: this is not a vulnerability by itself. This issue is raised because processes that run under OSS-Fuzz are limited to 2 GB of RAM usage, which is reasonable given that OSS-Fuzz manipulates input buffers that are generally at most a few tens of kilobytes large. It is thus expected that for those small inputs, RAM consumption should remain small. When that is violated, it is generally because the code trusts too much some fields in the input data that drive a memory allocation, without checking them against reasonable bounds or against the file size. However, for some file formats, it is difficult to implement definitive bounds because they allow valid small files that need a lot of RAM to be processed. Part of the remaining open issues belong to that category.

  • other_issue: issues that could not be classified under a more precise category. Bonus points for anyone improving the script to analyze the diff and figure out from the code the cases where the commit message lacks details! Or perhaps just parse the OSS-Fuzz issue itself, which gives a categorization.

  • buffer_overflow_unspecified: an access outside the validity area of some buffer (string, array, vector, etc.), where we could not determine whether the buffer is heap- or stack-allocated, nor whether it was a read or a write attempt. Can potentially result in arbitrary code execution.

  • excessive_processing_time: this is when a program exceeds the 60-second timeout granted by OSS-Fuzz to complete processing of an input buffer. This is often due to a sub-optimal algorithm (e.g. quadratic performance where linear can be achieved), or an unexpected network access. Most of the remaining open issues are in that category. A significant number are also out-of-memory situations hit by the fuzzer while iterating over many inputs, which often cannot be reproduced on an individual test case. We suspect heap fragmentation to happen in some of those situations.

  • null_pointer_dereference: a classic programming issue: accessing a null pointer, which results in an immediate crash.

  • integer_overflow_unsigned: this one is interesting. Technically in C/C++, overflow of unsigned integers is well-defined behavior: wrap-around is guaranteed by the standards. However we assumed that in most cases the overflow was unintended and could lead to similar bugs as signed integer overflow, hence we opted in for OSS-Fuzz to consider those overflows as issues. For the very uncommon cases where the overflow is valid (e.g. when applying a difference filter on a sequence of bytes), we can tag the function where it occurs with a __attribute__((no_sanitize("unsigned-integer-overflow"))) annotation.

  • division_by_zero_integer: an integer divided by zero, with zero being evaluated as an integer. Results in an immediate crash on at least the x86 architecture.

  • heap_buffer_overflow_read: read access outside the validity area of a heap-allocated data structure. Generally results in a crash.

  • unspecified_crash: a crash for an unidentified reason. Same bonus points as above for helping to categorize them better.

  • stack_call_overflow: recursive calls to methods that end up blowing the size limit of the stack, which results in a crash.

  • memory_leak_error_code_path: a memory leak that is observed in a non-nominal code path, that is, on corrupted/hostile datasets.

  • heap_buffer_overflow_write: write access outside the validity area of a heap-allocated data structure. Generally results in a crash. Could also sometimes be exploited for arbitrary code execution.

  • division_by_zero_floating_point_unknown_consequence: a variable is divided by zero, with the operation done in floating point. In C/C++, this is undefined behavior, but on CPU architectures implementing IEEE-754 (that is, pretty much all of them nowadays) with default settings, the result is either infinity or not-a-number. If that result is cast to an integer, this is undefined behavior again (generally not crashing), with various potential bugs (which can be crashing).

  • stack_buffer_overflow: a read or write access outside of a stack-allocated buffer. Often results in crashes and, in the case of a write access, can sometimes be exploited for arbitrary code execution.

  • invalid_cast: this may be an integer cast to an invalid value for an enumeration (unspecified behavior, which can result in bugs due to not catching that situation later), or an instance of a C++ class cast to an invalid type (unspecified behavior, crash likely).

  • invalid_shift_unspecified_direction: a left- or right-shift binary operation, generally on a signed integer. For a left shift, this is when the most-significant bit ends up being set (either by shifting a positive value too much, which results in a negative value, or by shifting a negative value), or when shifting by a number of bits equal to or greater than the width of the integer. For a right shift, this is when shifting by a number of bits equal to or greater than the width of the integer. Those undefined behaviors do not result in immediate crashes on common compilers/platforms, but can lead to subsequent bugs.

  • memory_leak_unspecified: self-explanatory.

  • assertion: an assert() in the code is hit. A lot of what we initially think of as programming invariants can actually be violated by specially crafted input, and should be replaced by classic checks that error out in a clean way.

  • invalid_enum: a particular case of invalid_cast where an invalid value is stored in a variable of an enumeration type.

  • division_by_zero_floating_point_harmless: a floating-point division by zero, whose consequences are estimated to be harmless. For example the NaN or infinity value is directly returned to the user and does not influence further execution of the code.

  • infinite_loop: a portion of the code executes endlessly. This is a denial of service.

  • integer_overflow_unsigned_harmless: an example of that could be some_string.substr(some_string.find(' ') + 1) to extract the part of a string after the first space character, or the whole string if there is none. find() will return std::string::npos, which is the largest positive size_t, and thus adding 1 to it will result in 0 (see the illustrative sketch after this list).

  • invalid_memory_dereference: generally the same as a heap_buffer_overflow_read.

  • undefined_shift_left: mentioned above.

  • double_free: a heap-allocated buffer is destroyed twice. Later crash is likely to occur. Could potentially be exploited for arbitrary code execution.

  • stack_buffer_overflow_read: mentioned above.

  • use_after_free: a subcase of heap_buffer_overflow_read where one accesses a buffer after it has been freed.

  • integer_overflow_harmless: mentioned above, but here, if we exclude the undefined behavior aspect of it, consequences are estimated to be harmless.

  • undefined_behavior_unspecified: some undefined behavior of an unidentified category, restricted to those caught by clang's UBSAN analyzer.

  • negative_size_allocation: a negative size (actually an unsigned integer with its most significant bit set) is passed to a heap memory allocation routine. Not a bug by itself, but it often reflects a previous issue.

  • unhandled_exception: a C++ exception that propagates up to main() without being caught. A crash is ensured for C code, or for C++ callers not expecting it. As the fuzzer programs used by GDAL use the C API, such an exception popping up is definitely a bug.

  • invalid_shift_right: mentioned above.

  • unsigned_integer_underflow: similar to the overflow, but going through the negative values. Well-defined behavior according to the standards, but often undesirable.

  • uninitialized_variable: access to a variable whose content is uninitialized. There are very few instances of that; the reason is probably that we extensively use strict compiler warnings and static code analyzers (cppcheck, clang static analyzer and Coverity Scan), which are very good at catching such issues.
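
To make a couple of the categories above more concrete, here is a purely illustrative snippet (artificial code, not taken from GDAL), showing a signed integer overflow that later turns into a heap buffer overflow, and the harmless npos + 1 unsigned overflow mentioned earlier:

#include <cstdint>
#include <string>
#include <vector>

// integer_overflow_signed leading to heap_buffer_overflow_write: with
// hostile values such as width = height = 65536, width * height overflows
// 'int' (undefined behavior); with the usual wrap-around the product
// becomes 0, the buffer is allocated empty, but the nested loops still
// iterate width * height times and write past the end of the buffer.
void ProcessImage(int width, int height, const uint8_t* data)
{
    std::vector<uint8_t> buffer(static_cast<size_t>(width * height));
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            buffer[static_cast<size_t>(y) * width + x] = *data++;
    // A robust version would validate width and height against sane limits
    // and do the multiplication on a 64-bit type before allocating.
}

// integer_overflow_unsigned_harmless: find() returns std::string::npos
// (the largest size_t) when there is no space; npos + 1 wraps around to 0,
// which is exactly the intended "return the whole string" behavior. Such a
// function could be tagged with the
// __attribute__((no_sanitize("unsigned-integer-overflow"))) annotation
// mentioned above.
std::string AfterFirstSpace(const std::string& s)
{
    return s.substr(s.find(' ') + 1);
}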

Despite the large number of categories caught by the sanitizers run by OSS-Fuzz, there are some gaps not covered. For example, casting an integer (or floating-point value) to a narrower integer type with a value that does not fit into the target type. Those are considered "implementation defined" (the compiler must do something, potentially different from another compiler, and document what it does), and thus do not fall into what is caught by undefined behavior sanitizers.
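
As a small illustration of that gap (again an artificial example, not taken from GDAL):

#include <cstdint>
#include <iostream>

int main()
{
    int bigValue = 70000;
    // Implementation-defined narrowing (well-defined modulo wrap-around
    // since C++20), hence not reported by the undefined behavior
    // sanitizers: 70000 does not fit into int16_t and silently becomes
    // 4464 (70000 - 65536) on usual platforms.
    int16_t truncated = static_cast<int16_t>(bigValue);
    std::cout << truncated << std::endl;
}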

PS: For those interested in academic research analyzing outputs of OSS-Fuzz, you can find this paper (where GDAL actually appears in one figure).