dimanche 26 septembre 2021

Order of GeoJSON properties and graph theory

All this story started with this QGIS bug about a crash when a user adds a new field to a GeoJSON file. Who is the culprit between GDAL and QGIS could be debated. Let's look at the GDAL side of things.

Let's consider this simple GeoJSON file:

{ "type": "FeatureCollection", "features": [
  { "type": "Feature", "geometry": null, "properties": { "A": "a", "C": "c" },
  { "type": "Feature", "geometry": null, "properties": { "A": "a", "B": "b", "C": "c" }

GDAL versions up to 3.3.x will expose a layer definition with the fields in the following order, [A, C, B], and this is what confused QGIS, since the C field was appended as the last action, and QGIS expected it to be the last field.

One could also argue that we should not worry about the order of properties since a JSON object such as the "properties" one is unordered. However the JSON parser and writer used by GDAL has the property of respecting the order of appearance of the keys which makes GeoJSON a format that behaves similarly to other formats such as Shapefile, GeoPackage or CSV where the field order is fixed. But in a GeoJSON file where there is no declaration of properties (no "header"), and when properties are omitted in some features, presenting a consistent field order is  more challenging.

How does current GDAL in current proceeds to build the list of fields ? That's very simple: iterate over features, for each feature, iterate over its properties, and as soon as a new property is found, append it to the list of properties of the layers. So on our above example, after the first feature, we have [ A, C ]. And when parsing the second feature, we append B to it, hence [A, C, B ]. We can also notice that the order in which features appear in the collections influences the order of properties exposed in the layer definition.

How to fix that ? For that very simple case, we could say "B appears after A, which is already in our list, and before C, which is already in our list and was just after A, so let's insert it in between". Great ! Problem solved ? Unfortunately not.

What if you see features with the following properties:

- [A, D], ==> current consolitated list [A, D]

- then [A, B, D], ==> current consolitated list [A, B, D]

- then [C, D], ==> hum C is before D, but where relatively to A and B ??

- then [A, C], ==> we know now that C is after A. But before or after B ?

- then [B, C], ==> ok, before B. So final list is [A, B, C, D]

Our simple logic of insertion no longer works here, and we can know feel that the problem is much more involved than what it initially looked like. We could also have situations where the dataset is so sparse that the relative order of some fields is indeterminate. Or where fields appear in non consistent order in several features.

My initial though was to use a sorting algorithm, such as std::sort in the standard C++ library, with a custom comparison function that would use the (i, j) apparition order ( (i, j) means that field i immediately appears before j in a feature ). But there are several issues:

- the sort algorithm may invoke the comparison function with any pair of field names, but we have only a knowledge of a subset of those. If we build a graph where the nodes are the field names and a directed edge between i and j means that field i immediately appears before j in a feature, then we can know if there's a path between any tuple of nodes by using standard graph algorithm.

- the comparison function of std::sort implies that we have a strict weak ordering. When there's a path in the graph between 2 nodes, the ordering is clear. We note immediately that the graph should not have cycles, otherwise we would violate the asymmetry property of the strict weak ordering. So when building the graph, we should reject any new edge i->j that would result in a cycle, which boils down to finding if the reverse path j->i already exists. But what to do for nodes that aren't related by a path ? We could decide that they are incomparable. However the "transitivity of incomparibility" property could then be violated. For example if there's no path between A and B, and between B and C (for example B is an isolated node), there might be a path between A and C. So it is far from obvious that we can define a comparison function that respects the strict weak ordering. And if it failed to respect it, then the std::sort() execution is undefined behaviour, meaning that anything can happen: at best, a wrong result; at worse, a crash, infinite loop, etc. Look at this post for other examples of comparision function that subtely violate strict weak ordering

At the end, we want to somewhat "linearize" our graph, by respecting the directed edge.

So given:

We want to get a rresult like [A, F, B, G, D, C, E], or [F, A, B, C, D, E, G], or ...

It appears that this is a known problem in graph theory: computing the topological ordering of a directed acyclic graph. As show in the above example, there are in general several solutions. To try to make the result less "random", we use lexicographical order of field names for the working set S of the Kahn's algorithm, which is an additional sorting rules for nodes that have no direct relationship in the graph.

The result of that work is in https://github.com/OSGeo/gdal/pull/4552

mardi 14 septembre 2021

The ultimate guide for successful contribution to open source

(This is a post I started on December 2019 and didn't publish because I felt it was too negative. But recent events show that it is still a current topic, and at least this will document my own preception of things) 
We have lately stumbled upon a few clumsy attempts of corporations at contributing to open source software. Needless to say this is a bumpy ride not for the faint of heart, from both side of the equation.

If you just want to stop your reading here, just remember that good old motto: When in Rome, do as the Romans do. And be prepared to become a Roman citizen.

Us, open source developers have our own hard-to-decipher customs. We pay tribute to the Allmighty Git (*), preferably his most revered (often with awe)  incarnation called GitHub. You may also find followers of GitLab, gitea, gitorious.
In that case, never pronounce the word 'GitHub' before them. It is also said that some might still use email patches: be sure to set a 80 character limit to lines, use LF line endings and refrain from using PNG signatures.  Be sure to stay away from SourceForge followers. They will continuously impose on you advertising messages that will destroy your sanity.

You should subscribe to mailing lists. Be wary of trolls. We may occasionally engage into flame wars, generally sanctionned by motions. We are structured into groups that defer to a Project Steering Committee to decide (or not) which way to follow. Our groups regularly gather at solstice time in dark rooms for a week-long trance (note: was written pre-pandemic), known as a hackfest, hackathon or code sprint. For the tribes involved in geospatial, we have an annual Pow-Wow, called FOSS4G, during which we temporarily bury the hatchet and occasionaly thank our sponsors. You may stumble upon a C89 developer sharing a beer with a Java 11 one, or an adorer of Leaflet trying to convert a OpenLayers follower. If you want to join the Pow-Wow, remove your tie and suit, wear a t-shirt and sandals, and make sure to display a prominent "I Love Open Source" sticker on your laptop. You may also add a "I Love Shapefile" or "I Love GeoPackage" sticker, but only if you are confident to which one the group pays tribute. Failure to display the appropriate sticker will expose you to terrific insults about 10-character limits or obscure write-ahead log locking issues. If unsure, use the "I ? Shapefile" sticker.

We have taboo words. Thou Should Not Talk about business plan, intellectual property, patent portfolio, customer demand, strategy, costless SDK, CVE number, internal policy, consolidated codebases, education license fees. Planning and roadmap should also be pronounced with care, as some tribes will probably not have even the slightest idea of what you talk about.

If despite all those strange habits, you want to join with us, be prepared to follow a long and demanding initiation path. You will have to demonstrate your (possibly affected) adoration to our principles. As strange as it can be, we abhor being offered large presents that we cannot prevent ourselves from seeing as Trojan horses. You will rather have to locate a long list of problems pinned on a wall called bug tracker where each member of the community has written an issue or a wish he has. Start by the modest one that makes sense to you, solve it and offer its resolution to the Reviewer in a sufficiently argumented Pull Request. The Reviewer, which often equates to the Maintainer, will scrutinize at your gift, often at night time after Day Job. He may decide to accept it right away, or return it to you with a comment requesting a change, or just ignore it with the highest contempt. Be humble and obey to his command. You may beg for enlightenment, but never object even if you cannot make sense of his rebutal. He is the allmighty after the Allmighty. Never Rebase if asked to Merge; never Merge if asked to Rebase. Refactor when asked, but don't when not ! You should rewrite history before submitting, or maybe not. If unsure, consult Contributing.md, or be prepared that RTFM will be yelled at you (only for the tribes that don't have yet written CodeOfConduct.md). Do not even consider objecting that you have not been tasked to address his demands. You must also make sure to listen to the complaints of the Continuous Integration (CI) half-gods: the Reviewer will probably not even look at your gift until CI has expressed its satisfaction. Retry as many times as needed until they are pleased. You may attempt at submitting a RFC, but be prepared for lengthy and lively discussions. Listen, correct or you may not survive to your first "-1" spell !

We praise especially gifts that have no direct value for you: improved test suite (e.g. https://github.com/OSGeo/gdal/issues/4407), documentation addition and fixes, answering to users on the mailing list. Only when you feel that you have built enough trust, you might try to offer your initial gift. But, even if it is accepted, the most surprising habit is that your gift will remain yours. You will have to make sure you regularly remove the dust that accumlates on it over time, so it remains constantly shiny. While removing dust on your gift, never neglect from removing it also from nearby gifts offered by others. Otherwise the Maintainer might threaten at invoking the terrible Revert incantation on you. Also consider that existing contributors to a project might see your new code, they won't have directly use of, as an extra burden to their daily tasks (studying the commit history of https://github.com/qgis/QGIS/commits/master/src/providers/hana, or even more crudly at https://github.com/qgis/QGIS/commits/master/src/providers/db2, demonstrates that)

Ready for a ride, and possibly enjoying the feeling that "our" code can also become "yours" ?

(*) some tribes, cut off from the rest of the world, are said to still pursue their adoration of more ancient divinities sometimes known as SVN or CVS. We cannot confirm and have eradicated the last remains of those old cults.