Thursday, May 2, 2019

Incremental Docker builds using ccache

Over the past few days, I've experimented with Docker to be able to produce "official" nightly builds of GDAL.
Docker Hub is supposed to have an automated build mechanism, but the cloud resources they put behind that feature seem insufficient to sustain the demand, and builds tend to drag on forever.
Hence I decided to set up a local cron job to refresh my images and push them. Of course, as there are currently 5 different Dockerfile configurations and building both PROJ and GDAL from scratch can be time consuming, I wanted this to be as efficient as possible. One observation is that, between two nightly builds, very few files change on average, so ideally I would want to recompile only the ones that have changed, and have the minimum number of updated Docker layers refreshed and pushed.
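For illustration, the cron entry driving all of this can be as simple as the following; the script path and schedule are placeholders, not the actual setup:

# crontab entry: rebuild and push the images every night at 3am
0 3 * * * /path/to/build-and-push.sh >> /var/log/gdal-docker-builds.log 2>&1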

There are several approaches I combined to optimize the builds. If you are already familiar with Docker, you can probably skip to the "Use of ccache" section of this post.

Multi-stage builds

This is a Docker 17.05 feature in which you can define several build stages (each of which forms a separate image), and later stages can copy files from the file system of previous stages. Typically you use a two-stage approach.
The first stage installs the development packages, builds the application and installs it in some /build directory.
The second stage starts from a minimal image, installs the runtime dependencies, and copies the binaries generated at the previous stage from /build to the root of the final image.
This approach avoids having any development packages in the final image, which keeps it lean.

Such a Dockerfile will look like:

FROM ubuntu:18.04 AS builder
RUN apt-get update && apt-get install -y g++ make
# (retrieval of the source tree omitted)
RUN ./configure --prefix=/usr && make && make install DESTDIR=/build

FROM ubuntu:18.04 AS finalimage
RUN apt-get update && apt-get install -y libstdc++6
COPY --from=builder /build/usr/ /usr/
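As a side note (not specific to the GDAL setup), an intermediate stage can also be built on its own with the --target option, which is convenient when debugging the build stage:

docker build --target builder -t gdal-builder .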



Fine-grained layering of the final image

Each step in a Dockerfile generates a layer; chained together, the layers form an image.
When pulling/pushing an image, layers are processed individually, and only the ones that are not already present on the target system are pulled/pushed.
One important note is that the refresh/invalidation of a step/layer causes the
refresh/invalidation of all later steps/layers (even if the content of a later layer does
not change in a user-observable way, its internal ID will change).
So one approach is to put first in the Dockerfile the steps that change the least frequently, such as dependencies coming from the package manager, third-party dependencies whose versions rarely change, etc., and to put the applicative part at the end. Even the applications refreshed as part of the nightly builds can be decomposed into fine-grained layers.
In the case of GDAL and PROJ, the installed directories (under the /build staging directory of the builder stage) are:
/build/usr/include
/build/usr/lib
/build/usr/bin
/build/usr/share/{proj|gdal}

The lib directory is the one that varies the most (each time a .cpp file changes, the .so changes),
whereas installed include files and resources tend to be updated less frequently.

So a better ordering of our Dockerfile is:
COPY --from=builder /build/usr/share/gdal/ /usr/share/gdal/
COPY --from=builder /build/usr/include/ /usr/include/
COPY --from=builder /build/usr/bin/ /usr/bin/
COPY --from=builder /build/usr/lib/ /usr/lib/


With one subtlety: as part of our nightly builds, the sha1sum of the HEAD of the git repository is embedded in a string in the installed /build/usr/include/gdal_version.h. So in the builder stage, I separate that particular file from the other include files and put it in a dedicated /build_most_varying target directory, together with the .so files.

RUN [..] \
    && make install DESTDIR=/build \
    && mkdir -p /build_most_varying/usr/include \
    && mv /build/usr/include/gdal_version.h /build_most_varying/usr/include \
    && mv /build/usr/lib /build_most_varying/usr


And thus, the finalimage stage is slightly changed to:

COPY --from=builder /build/usr/share/gdal/ /usr/share/gdal/
COPY --from=builder /build/usr/include/ /usr/include/
COPY --from=builder /build/usr/bin/ /usr/bin/
COPY --from=builder /build_most_varying/usr/ /usr/
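As a quick sanity check (not part of the build scripts themselves), you can verify that only the last few layers differ between two successive nightly images, for instance by listing the layers:

docker history myimage
# or, to get the raw layer digests:
docker inspect --format '{{json .RootFS.Layers}}' myimage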


Layer depending on a git commit

In the builder stage, the step that refreshes the GDAL build depends on an
argument, GDAL_VERSION, that defaults to "master":

ARG GDAL_VERSION=master
RUN wget -q https://github.com/OSGeo/gdal/archive/${GDAL_VERSION}.tar.gz \
    && build instructions here...

Due to how Docker layer caching works, building this Dockerfile several times in a row would not refresh the GDAL build (unless you invoke docker build with the --no-cache switch, which disables all layer caching). So the script that triggers the docker build gets the sha1sum of the latest git commit and passes it with:

GDAL_VERSION=$(curl -Ls https://api.github.com/repos/OSGeo/gdal/commits/HEAD -H "Accept: application/vnd.github.VERSION.sha")
docker build --build-arg GDAL_VERSION=${GDAL_VERSION} -t myimage .

In the (unlikely) event that the GDAL repository has not changed, no new build
would even be attempted.

Note: this part is not necessarily a best practice. Other Docker mechanisms,
such as using a Git URL as the build context, could potentially be used. But as
we want to be able to refresh both GDAL and PROJ, that would not really be suitable.
Another advantage of the above approach is that the Dockerfile is self-sufficient
for creating an image with just "docker build -t myimage ."
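For reference, building directly from a Git URL looks like this; the optional "#ref:directory" suffix selects a branch or tag and a sub-directory of the repository (placeholder values below):

docker build -t myimage "https://github.com/OSGeo/gdal.git#master"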

Use of ccache

This is the part for which I could not find a ready-made, easy-to-deploy solution.

With the previous techniques, we have a black-and-white situation. A GDAL build is either entirely cached by Docker layer caching when the repository did not change at all, or completely redone from scratch as soon as the commit id has changed (even when the change does not affect the installed files at all). It would be better if we could use ccache to minimize the number of files to be rebuilt.
Unfortunately it is not possible with docker build to mount a volume in which the ccache directory would be stored (apparently because of security concerns). There is an experimental RUN --mount=type=cache feature in Docker 18.06 that could perhaps be used equivalently, but it requires both the client and the daemon to be started in experimental mode.
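For the record, here is a rough, untested sketch of what that BuildKit variant could look like; it assumes DOCKER_BUILDKIT=1, experimental mode enabled on both client and daemon, and a project laid out as in the earlier two-stage example:

# syntax=docker/dockerfile:experimental
FROM ubuntu:18.04 AS builder
RUN apt-get update && apt-get install -y g++ make ccache
# the cache mount persists /root/.ccache on the build host across builds
RUN --mount=type=cache,target=/root/.ccache \
    CC="ccache gcc" CXX="ccache g++" ./configure --prefix=/usr \
    && make && make install DESTDIR=/build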

The trick I use, which has the benefit of working with a default Docker installation, is to download, from within the Docker build container, the content of a ccache directory stored on the host, do the build, and then upload the modified ccache back onto the host.

I use rsync for that, as it is simple to set up. Initially, I used an rsync daemon running directly on the host but, taking inspiration from https://github.com/WebHare/ccache-memcached-server which proposes an alternative, I've modified it to run in a Docker container, gdal_rsync_daemon, which mounts the host ccache directory. The benefit of my approach over the ccache-memcached-server one is that it does not require a patched version of ccache to run in the build instance.

So the synopsis is:

host cache directory <--> gdal_rsync_daemon (docker instance)  <------> Docker build instance
                  (docker volume mounting)                           (rsync network protocol)
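A minimal sketch of how such a daemon could be launched is shown below; the image name, port, module layout and host cache path are assumptions, not the values used by the actual GDAL scripts (see the launch script referenced below for those):

# Hypothetical launch of the cache daemon. rsync_daemon_image is assumed to run
# "rsync --daemon --no-detach" with a writable module named "ccache" rooted at /ccache.
docker run -d --name gdal_rsync_daemon \
    -v "$HOME/gdal-docker-cache:/ccache" \
    -p 23985:873 \
    rsync_daemon_image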


You can consult here the relevant portion of the launching script which builds and launches the gdal_rsync_daemon. And the corresponding Dockerfile step in the builder stage is rather straightforward:

# for Alpine; or the equivalent with other package managers
RUN apk add --no-cache rsync ccache

ARG RSYNC_REMOTE
ARG GDAL_VERSION=master
RUN if test "${RSYNC_REMOTE}" != ""; then \
        echo "Downloading cache..."; \
        # fetch the host ccache content (into $HOME/.ccache) from the rsync daemon
        rsync -ra ${RSYNC_REMOTE}/gdal/ $HOME/; \
        # make the compilers go through ccache and cap the cache size
        export CC="ccache gcc"; \
        export CXX="ccache g++"; \
        ccache -M 1G; \
    fi \
    # omitted: download source tree depending on GDAL_VERSION
    # omitted: build
    && if test "${RSYNC_REMOTE}" != ""; then \
        ccache -s; \
        echo "Uploading cache..."; \
        # push the updated cache back to the host ...
        rsync -ra --delete $HOME/.ccache ${RSYNC_REMOTE}/gdal/; \
        # ... and remove it so that it does not end up in the image layer
        rm -rf $HOME/.ccache; \
    fi
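The wrapper script then only has to point the build at that daemon; with the hypothetical setup above, the invocation would look like this (the address and module name are again assumptions). When RSYNC_REMOTE is left unset, the if blocks are skipped and the Dockerfile still builds in standalone mode:

docker build \
    --build-arg RSYNC_REMOTE=rsync://<host-address>:23985/ccache \
    --build-arg GDAL_VERSION=${GDAL_VERSION} \
    -t myimage .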


I also considered a simplified variation of the above that would not use rsync, where after the build, we would "docker cp" the cache from the build image to the host, and at the next build, copy the cache into the build context. But that would have two drawbacks:
  • our build layers would contain the cache
  • any change in the cache would cause the build context to be different and subsequent builds to have their cached layers invalidated.

Summary

We have managed to create a Dockerfile that can be used in standalone mode
to create a GDAL build from scratch, or can be integrated into a wrapper build.sh
script that offers incremental rebuild capabilities to minimize the use of CPU resources. The image has fine-grained layering, which also minimizes upload and download times for frequent push / pull operations.