I’ve been playing with Docker images and their internal layers a little more over the last week. You can see some of my previous adventures at Manipulating Docker images without Docker installed. The general thrust of these adventures is understanding the image format by building a tool called Occy Strap, which can manipulate that format in useful ways. My eventual goal is to be able to build OCI compliant image bundles and then have a container runtime like runc execute them, and I must say I am getting a lot closer.
This time I was interested in the exact mechanisms used by whiteout files in those layers and how that interacts with Linux kernel overlay filesystem types.
Firstly, what is a whiteout file? Well, when you delete a file or directory that came from a lower layer in the Docker image, it doesn’t actually get removed from that lower layer, as layers are immutable. Instead, the uppermost layer records that the file or directory has been removed, and it is therefore no longer visible in the filesystem that the container sees. This has obvious security implications if you delete a file like a password you needed during your container build process, because the file is still present in the lower layer, although there are probably better ways to deal with those problems using multi-stage Dockerfiles.
An image might help with the description:
Here we have a container image which is composed of four layers. Layer 1 creates two files, /a and /b. Layer 2 creates a directory, /c. Layer 3 deletes /a and creates /c/d. Finally, Layer 4 deletes /c and /c/d; let’s assume that it does this by just deleting the /c directory recursively. As far as a container using this image would be concerned, only /b exists in the container image.
A Dockerfile (which wouldn’t actually work) to create this history might look like:
FROM scratch
RUN touch /a /b # Layer 1
RUN mkdir /c # Layer 2
RUN rm /a && touch /c/d # Layer 3
RUN rm -rf /c # Layer 4
The Docker image format stores each layer as a tarfile, with that tarfile being what a Linux filesystem called AUFS would have stored for this scenario. AUFS was an early Linux overlay filesystem from around 2006, which was never merged into the mainline Linux kernel, although it is available on Ubuntu because they maintain a patch. AUFS recorded deletion of a file by creating a “whiteout file”, which was the name of the file prefixed with .wh. (including the trailing dot). So when we deleted /a, AUFS would have created a file named .wh.a in Layer 3. Similarly, to recursively delete a directory, it used a whiteout file with the name of the directory.
What if I wanted to replace a directory? AUFS provided an “opaque directory” that ensured that the directory remained, but all of its previous content was hidden. This was done by adding a file named .wh..wh..opq (with no trailing dot) to the directory being made opaque.
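The naming rules above can be sketched in a few lines of Go. This is only an illustration of the convention, not code from Docker or containerd: the function name classify and the entryKind type are made up for this example. The one subtlety is that the opaque marker also starts with the .wh. prefix, so it has to be checked first.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

const (
	whiteoutPrefix = ".wh."         // prefix of an AUFS-style whiteout file
	opaqueMarker   = ".wh..wh..opq" // marker file inside an opaque directory
)

// entryKind says how a layer tar entry should be interpreted.
// (Hypothetical type, introduced for this sketch only.)
type entryKind int

const (
	regularEntry entryKind = iota
	whiteoutEntry
	opaqueEntry
)

// classify inspects the name of a tar entry and returns its kind together
// with the path it affects: the entry itself for a regular entry, the
// deleted path for a whiteout, or the directory for an opaque marker.
func classify(name string) (entryKind, string) {
	dir, base := path.Dir(name), path.Base(name)
	switch {
	case base == opaqueMarker:
		// Must be tested before the prefix check, as it shares the prefix.
		return opaqueEntry, dir
	case strings.HasPrefix(base, whiteoutPrefix):
		return whiteoutEntry, path.Join(dir, strings.TrimPrefix(base, whiteoutPrefix))
	default:
		return regularEntry, path.Clean(name)
	}
}

func main() {
	for _, name := range []string{"/b", "/.wh.a", "/c/.wh..wh..opq"} {
		kind, target := classify(name)
		fmt.Println(name, kind, target)
	}
}
```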
You can read quite a lot more about the Docker image format in the specification, as well as the quite interesting documentation on whiteout files.
To finish this example, the contents of the tarfile for each layer should look like this:
# Layer 1
/a # a file
/b # a file
# Layer 2
/c # a directory
/c/.wh..wh..opq # a file, created as a safety measure
# Layer 3
/.wh.a # a file
/c/d # a file
# Layer 4
/c/.wh.d # a file
/.wh.c # a file
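To check that these four listings really do leave only /b visible, here is a rough Go sketch that replays them in order. It is a simplification under stated assumptions: the function names applyLayers and hide are invented for this example, layers are plain lists of names rather than real tarfiles, and whiteouts are applied before the other entries in the same layer so that they only ever hide content from lower layers.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

const (
	whiteoutPrefix = ".wh."
	opaqueMarker   = ".wh..wh..opq"
)

// hide removes everything below target from the visible set, and target
// itself unless keepSelf is true (the opaque directory case).
func hide(visible map[string]bool, target string, keepSelf bool) {
	prefix := strings.TrimSuffix(target, "/") + "/"
	for p := range visible {
		if (p == target && !keepSelf) || strings.HasPrefix(p, prefix) {
			delete(visible, p)
		}
	}
}

// applyLayers replays AUFS-style layer listings, lowest layer first, and
// returns the sorted list of paths a container would see.
func applyLayers(layers [][]string) []string {
	visible := map[string]bool{}
	for _, layer := range layers {
		// Pass 1: whiteouts and opaque markers hide content from lower layers.
		for _, name := range layer {
			dir, base := path.Dir(name), path.Base(name)
			switch {
			case base == opaqueMarker:
				hide(visible, dir, true)
			case strings.HasPrefix(base, whiteoutPrefix):
				hide(visible, path.Join(dir, strings.TrimPrefix(base, whiteoutPrefix)), false)
			}
		}
		// Pass 2: everything that is not a marker becomes visible.
		for _, name := range layer {
			if !strings.HasPrefix(path.Base(name), whiteoutPrefix) {
				visible[path.Clean(name)] = true
			}
		}
	}
	out := make([]string, 0, len(visible))
	for p := range visible {
		out = append(out, p)
	}
	sort.Strings(out)
	return out
}

func main() {
	layers := [][]string{
		{"/a", "/b"},               // Layer 1
		{"/c", "/c/.wh..wh..opq"},  // Layer 2
		{"/.wh.a", "/c/d"},         // Layer 3
		{"/c/.wh.d", "/.wh.c"},     // Layer 4
	}
	fmt.Println(applyLayers(layers)) // prints [/b]
}
```

Stopping after the first three layers instead should leave /b, /c and /c/d visible, which matches the description of the example image.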
So that’s all great, but it’s not actually what got me bothered. You see, modern Docker uses overlayfs, which is the replacement for AUFS that actually made it into the Linux kernel. overlayfs has a similar whiteout mechanism, but it is not the same as the one in AUFS. Specifically, deleted files and directories are recorded as character devices with 0/0 device numbers, and opaque directories are marked with an extended filesystem attribute named “trusted.overlay.opaque” set to “y”. What I wanted to find was the transcode process in Docker which converts the AUFS style tarballs into this form in the filesystem while creating a container.
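As a rough illustration of the overlayfs side, here is a Go sketch that decides whether something in an overlayfs upper directory is a whiteout. The helper names isOverlayWhiteout and checkPath are hypothetical, the constants are the usual Linux file type bits written out by hand, and the code assumes a Unix-like system where os.Lstat exposes a *syscall.Stat_t. It does not look at the trusted.overlay.opaque attribute.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

const (
	sIFMT  = 0o170000 // mask for the file type bits of st_mode
	sIFCHR = 0o020000 // file type bits for a character device
)

// isOverlayWhiteout reports whether a raw st_mode and st_rdev pair describes
// an overlayfs whiteout: a character device whose device number is 0/0.
func isOverlayWhiteout(mode uint32, rdev uint64) bool {
	return mode&sIFMT == sIFCHR && rdev == 0
}

// checkPath stats a path without following symlinks and applies the test.
func checkPath(p string) (bool, error) {
	info, err := os.Lstat(p)
	if err != nil {
		return false, err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return false, fmt.Errorf("no raw stat data available for %s", p)
	}
	return isOverlayWhiteout(uint32(st.Mode), uint64(st.Rdev)), nil
}

func main() {
	// Usage: pass paths from an overlayfs upper directory as arguments.
	for _, p := range os.Args[1:] {
		whiteout, err := checkPath(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s whiteout=%v\n", p, whiteout)
	}
}
```

Keeping the decision in a function of plain numbers means it can be exercised without root, since actually creating a character device needs elevated privileges.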
After a bit of digging (the code is in containerd not moby as I expected), the answer is here:
func OverlayConvertWhiteout(hdr *tar.Header, path string) (bool, error) {
	base := filepath.Base(path)
	dir := filepath.Dir(path)

	// if a directory is marked as opaque, we need to translate that to overlay
	if base == whiteoutOpaqueDir {
		// don't write the file itself
		return false, unix.Setxattr(dir, "trusted.overlay.opaque", []byte{'y'}, 0)
	}

	// if a file was deleted and we are using overlay, we need to create a character device
	if strings.HasPrefix(base, whiteoutPrefix) {
		originalBase := base[len(whiteoutPrefix):]
		originalPath := filepath.Join(dir, originalBase)

		if err := unix.Mknod(originalPath, unix.S_IFCHR, 0); err != nil {
			return false, err
		}
		// don't write the file itself
		return false, os.Chown(originalPath, hdr.Uid, hdr.Gid)
	}

	return true, nil
}
Effectively, as a tar file is extracted the whiteout format is transcoded into overlayfs’ format. So there you go.
A final note for implementers of random Docker image tools: the test suite looks quite useful here if you want to validate that what you do matches what Docker does.