Thursday, May 2, 2019

Incremental Docker builds using ccache

These past few days, I've been experimenting with Docker to produce "official" nightly builds of GDAL.
Docker Hub is supposed to have an automated build mechanism, but the cloud resources put behind that feature seem insufficient to sustain the demand, and builds tend to drag on forever.
Hence I decided to set up a local cron job to refresh my images and push them. Of course, as there are currently 5 different Dockerfile configurations and building both PROJ and GDAL from scratch can be time consuming, I wanted this to be as efficient as possible. One observation is that between two nightly builds, very few files change on average, so ideally I would want to recompile only the ones that have changed, and have the minimum number of updated Docker layers refreshed and pushed.

There are several approaches I combined to optimize the builds. If you are already familiar with Docker, you can probably skip to the "Use of ccache" section of this post.

Multi-stage builds

This is a Docker 17.05 feature that lets you define several stages (each of which forms a separate image), where later stages can copy from the file system of previous stages. Typically you use a two-stage approach.
The first stage installs the development packages, builds the application and installs it into some /build directory.
The second stage starts from a minimal image, installs the runtime dependencies, and copies the binaries generated at the previous stage from /build to the root of the final image.
This approach keeps development packages out of the final image, which keeps it lean.

Such a Dockerfile looks like:

FROM ubuntu:18.04 AS builder
RUN apt-get update && apt-get install -y g++ make
COPY . /src
WORKDIR /src
RUN ./configure --prefix=/usr && make && make install DESTDIR=/build

FROM ubuntu:18.04 AS finalimage
RUN apt-get update && apt-get install -y libstdc++6
COPY --from=builder /build/usr/ /usr/



Fine-grained layering of the final image

Each step in a Dockerfile generates a layer; chained together, the layers form an image.
When pulling/pushing an image, layers are processed individually, and only the ones that are not already present on the target system are pulled/pushed.
One important note is that the refresh/invalidation of a step/layer causes the
refresh/invalidation of all later steps/layers (even if the content of a layer does
not change in a user-observable way, its internal ID will change).
So one approach is to put first in the Dockerfile the steps that change the least frequently, such as dependencies coming from the package manager, third-party dependencies whose versions rarely change, etc., and the applicative part at the end. Even the applications refreshed as part of the nightly builds can be decomposed into fine-grained layers.
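
As a side note, the layers that make up an image, and the step and size associated with each of them, can be inspected with docker history (the image name is just an example):

# List the layers of an image, with the Dockerfile step that created each one and its size
docker history myimage
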
In the case of GDAL and PROJ, the installed directories are:
$prefix/usr/include
$prefix/usr/lib
$prefix/usr/bin
$prefix/usr/share/{proj|gdal}

The lib directory is the one that varies most often (each time a .cpp file changes, the .so changes),
whereas installed include files and resources tend to be updated less frequently.

So a better ordering of our Dockerfile is:
COPY --from=builder /build/usr/share/gdal/ /usr/share/gdal/
COPY --from=builder /build/usr/include/ /usr/include/
COPY --from=builder /build/usr/bin/ /usr/bin/
COPY --from=builder /build/usr/lib/ /usr/lib/


With one subtlety: as part of our nightly builds, the sha1sum of the HEAD of the git repository is embedded in a string in $prefix/usr/include/gdal_version.h. So in the builder stage, I separate that particular file from the other include files and put it, together with the .so files, in a dedicated /build_most_varying target.

RUN [..] \
    && make install DESTDIR=/build \
    && mkdir -p /build_most_varying/usr/include \
    && mv /build/usr/include/gdal_version.h /build_most_varying/usr/include \
    && mv /build/usr/lib /build_most_varying/usr


And thus, the finalimage stage is slightly changed to:

COPY --from=builder /build/usr/share/gdal/ /usr/share/gdal/
COPY --from=builder /build/usr/include/ /usr/include/
COPY --from=builder /build/usr/bin/ /usr/bin/
COPY --from=builder /build_most_varying/usr/ /usr/


Layer depending on a git commit

In the builder stage, the step that refreshes the GDAL build depends on an
argument, GDAL_VERSION, which defaults to "master":

ARG GDAL_VERSION=master
RUN wget -q https://github.com/OSGeo/gdal/archive/${GDAL_VERSION}.tar.gz \
    && build instructions here...

Due to how Docker layer caching works, building this Dockerfile several times in a row would not refresh the GDAL build (unless you invoke docker build with the --no-cache switch, which disables all layer caching). So the script that triggers the docker build gets the sha1sum of the latest git commit and passes it with:

GDAL_VERSION=$(curl -Ls https://api.github.com/repos/OSGeo/gdal/commits/HEAD -H "Accept: application/vnd.github.VERSION.sha")
docker build --build-arg GDAL_VERSION=${GDAL_VERSION} -t myimage .

In the (unlikely) event that the GDAL repository has not changed, no
new build is even attempted.

Note: this part is not necessarily a best practice. Other Docker mechanisms,
such as using a Git URL as the build context, could potentially be used, but as
we want to be able to refresh both GDAL and PROJ, that would not really be suitable.
Another advantage of the above approach is that the Dockerfile is self-sufficient:
it can create an image with just "docker build -t myimage ."
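
For reference, a hedged sketch of that Git URL alternative (assuming the Dockerfile sits at the root of the referenced repository and branch, which may not match the actual GDAL layout):

# Use a Git URL directly as the build context; the part after '#' selects a branch or tag
docker build -t myimage https://github.com/OSGeo/gdal.git#master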

Use of ccache

This is the part for which I could not find a ready-made and easy-to-deploy solution.

With the previous techniques, we have a black-and-white situation. A GDAL build is either entirely cached by the Docker layer caching, in the case where the repository did not change at all, or completely done from scratch if the commit id has changed (even when the change does not affect the installed files at all). It would be better if we could use ccache to minimize the number of files to be rebuilt.
Unfortunately it is not possible with docker build to mount a volume where the ccache directory would be stored (apparently because of security concerns). There is an experimental RUN --mount=type=cache feature in Docker 18.06 that could perhaps be used to the same effect, but it requires both the client and the daemon to be started in experimental mode.
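
For reference, here is a rough sketch of what that experimental approach could look like (this is not what is used here; it assumes BuildKit is enabled with DOCKER_BUILDKIT=1 and an experimental daemon):

# syntax=docker/dockerfile:experimental
FROM ubuntu:18.04 AS builder
RUN apt-get update && apt-get install -y g++ make ccache
COPY . /src
WORKDIR /src
# The ccache directory lives in a BuildKit-managed cache volume that persists
# across builds, instead of being baked into a layer.
RUN --mount=type=cache,target=/root/.ccache \
    CC="ccache gcc" CXX="ccache g++" ./configure --prefix=/usr \
    && make && make install DESTDIR=/build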

The trick I use, which has the benefit of working with a default Docker installation, is to have the Docker build container download the content of a ccache directory from the host, do the build, and then upload the modified cache back to the host.

I use rsync for that, as it is simple to set up. Initially I ran an rsync daemon directly on the host, but inspired by https://github.com/WebHare/ccache-memcached-server, which proposes an alternative, I modified the setup to run the daemon in a Docker container, gdal_rsync_daemon, which mounts the host ccache directory. The benefit of my approach over the ccache-memcached-server one is that it does not require a patched version of ccache to run in the build instance.

So the synopsis is:

host cache directory <--> gdal_rsync_daemon (docker instance)  <------> Docker build instance
                  (docker volume mounting)                           (rsync network protocol)
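
For illustration, here is a minimal, hypothetical sketch of how such a daemon could be launched (the actual launching script linked below may differ; the image name, module name and mount path are made up):

# Run an rsync daemon in a container, with the host ccache directory mounted
# as a volume; "rsync_daemon_image" is assumed to serve a module named
# "ccache_mount" pointing at /ccache.
docker run -d --name gdal_rsync_daemon \
    -v "$HOME/gdal-docker-cache:/ccache" \
    rsync_daemon_image
# The builder stage can then be pointed at it with something like:
#   RSYNC_REMOTE=rsync://<daemon_ip>/ccache_mount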


You can consult the relevant portion of the launching script, which builds and launches the gdal_rsync_daemon, here. The corresponding Dockerfile step in the builder stage is rather straightforward:

# for alpine, or equivalent with other package managers
RUN apk add --no-cache rsync ccache

ARG RSYNC_REMOTE
ARG GDAL_VERSION=master
RUN if test "${RSYNC_REMOTE}" != ""; then \
        echo "Downloading cache..."; \
        rsync -ra ${RSYNC_REMOTE}/gdal/ $HOME/; \
        export CC="ccache gcc"; \
        export CXX="ccache g++"; \
        ccache -M 1G; \
    fi \
    # omitted: download source tree depending on GDAL_VERSION
    # omitted: build
    && if test "${RSYNC_REMOTE}" != ""; then \
        ccache -s; \
        echo "Uploading cache..."; \
        rsync -ra --delete $HOME/.ccache ${RSYNC_REMOTE}/gdal/; \
        rm -rf $HOME/.ccache; \
    fi
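
For completeness, here is a hypothetical sketch of how the wrapper script could pass both build arguments (the rsync URL and module name are placeholders):

GDAL_VERSION=$(curl -Ls https://api.github.com/repos/OSGeo/gdal/commits/HEAD \
    -H "Accept: application/vnd.github.VERSION.sha")
docker build \
    --build-arg RSYNC_REMOTE=rsync://<daemon_ip>/ccache_mount \
    --build-arg GDAL_VERSION=${GDAL_VERSION} \
    -t myimage .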


I also considered a simplified variation of the above that would not use rsync: after the build, we would "docker cp" the cache from the build image to the host, and at the next build, copy the cache into the build context. But that would have two drawbacks:
  • our build layers would contain the cache
  • any change in the cache would cause the build context to be different, and subsequent builds would have their cached layers invalidated.
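
For reference, a hedged sketch of what that rejected variant could have looked like (stage and path names are hypothetical):

# Build only the builder stage, then extract the ccache from a container
# created from it; the cache would then have to be copied back into the
# build context for the next build.
docker build --target builder -t myimage_builder .
CID=$(docker create myimage_builder)
docker cp ${CID}:/root/.ccache ./ccache
docker rm ${CID}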

Summary

We have managed to create a Dockerfile that can be used in a standalone mode
to create a GDAL build from scratch, or can be integrated in a wrapper build.sh
script that offers incremental rebuild capabilities to minimize the use of CPU resources. The image has fine-grained layering, which also minimizes upload and download times for frequent push/pull operations.

Thursday, January 31, 2019

SRS barn raising: 8th report. Ready for your testing!

This is the 8th progress report of the GDAL SRS barn effort.

As the title implies, a decisive milestone has now been reached: the "gdalbarn" branches of libgeotiff and GDAL have now been merged into their respective master branches.

On the PROJ side, a number of fixes and enhancements have been made:
- missing documentation for a few functions, the evolution of cs2cs, and the new projinfo utility has been added
- the parser of the WKT CONCATENATEDOPERATION construct can now understand steps presented in reverse order
- a few iterations on the syntax parsing rules of WKT2:2018, following the latest adjustments made by the OGC WKT Standard Working Group
- in my previous work, I had introduced a "PROJ 5" convention to export CRS using pipeline/unitconvert/axisswap, as an attempt at improving the PROJ.4 format used by GDAL and other products. However, after discussion with other PROJ developers, we realized that it is likely a dead end, since it is still lossy in many respects and can cause confusion with coordinate operations. Consequently the PROJ_5 convention will be identical to PROJ_4 for CRS export, and the use of PROJ strings to express CRS themselves is discouraged. It can still make sense when using the "early-binding" approach and specifying towgs84/nadgrids/geoidgrids parameters, but in a late-binding approach, WKT is much more powerful at capturing important information such as geodetic datum names.
- when examining how the new code I added these past months interacts with the existing PROJ codebase, it became clear that there was confusion when importing PROJ strings expressing coordinate operations versus PROJ strings expressing a CRS. So for the latter use case, a "+type=crs" must be added to the PROJ string (see the example after this list). As a consequence, the proj_create_from_proj_string() and proj_create_from_user_input() functions have been removed, and proj_create() can now be used for all types of PROJ strings.
- the PROJ_LIB environment variable now supports multiple paths, separated by a colon on Unix and a semicolon on Windows
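
As an illustration of the "+type=crs" point above, here is a hypothetical example using the projinfo utility (the string is chosen purely for demonstration):

# Without +type=crs this string would be seen as a coordinate operation;
# with it, it is interpreted as a CRS definition.
projinfo "+proj=longlat +datum=WGS84 +type=crs"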

On the GDAL side,
  • the OGRCoordinateTransformation class now uses the PROJ API to automatically compute the best transformation pipeline, enabling late-binding capabilities. In the case where the user does not provide an explicit area of use and several coordinate operations are possible, the Transform() method can automatically switch between coordinate operations given the input coordinate values. This should offer behaviour similar to previous versions, for example for NAD27 to NAD83 conversion, when PROJ had a hardcoded +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat rule. This dynamic selection logic has also been moved to PROJ's proj_create_crs_to_crs() function. Note however that this might not always lead to the desired results, so specifying a precise area of interest, or even a specific coordinate operation, is preferred when full control is needed.
  • gdalinfo, ogrinfo and gdalsrsinfo now output WKT2:2018 by default (this can be changed with a command line switch; see the example after this list). On the API side, the exportToWKT() method will still export WKT1 by default (which can of course be changed). The rationale is that WKT consumers might not be ready yet for WKT2, so this should limit backward compatibility issues. In the future (in a couple of years' timeframe), this default WKT version might be upgraded when more consumers are WKT2-ready.
  • RFC 73: Integration of PROJ6 for WKT2, late binding capabilities, time-support and unified CRS database was created to document all the GDAL changes. After discussion with the community, this RFC has been approved
  • as a result, all of the above mentioned work has now been merged into GDAL master
  • Important practical discussion: GDAL master now depends on PROJ master (and ultimately PROJ 6.0 once it is released)
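
As a quick, hypothetical illustration of the new command line defaults (exact output and option names may differ slightly):

# WKT2:2018 is now the default output of the command line utilities
gdalsrsinfo EPSG:4326
# WKT1 can still be requested explicitly
gdalsrsinfo -o wkt1 EPSG:4326
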
Consequently, on the pure development front, most of the work has now been completed. As all the changes made over these last months deeply impact SRS-related functionality in GDAL and PROJ, we now rely on your careful testing to spot the inevitable issues that have not yet been detected by the respective automatic regression test suites. The earlier they are detected, the easier they will be to fix, in particular if they impact the API.