My team at work now has a daily personal learning time called "egg time" — it's a slightly silly story involving a manager who was good at taking some time to learn things each day, and an egg-shaped chair.
Today I decided that I should read this paper about container image loading in AWS Lambda, as recommended by Robert Collins on LinkedIn. The paper details the work AWS had to do to transition from all Lambda functions being packaged as relatively small zip files (250 MB) to relatively large Docker containers (10 GB+), while maintaining their aggressive target cold-start time of 50 ms.
The paper starts by making some relatively obvious points: that Docker images are very cacheable, and that their layers are often reused across images. It also throws out a statement that initially surprised me: only 6.4% of a container image's data is ever actually read. A referenced paper is given as the source for that number and definitely deserves a read later. They refer to this property as "sparsity".
It then moves on to explore how AWS was able to exploit the sparsity of images in a different manner than previous implementations (Slacker and Starlight specifically). Instead of creating a filesystem-oriented interface that uses overlayfs to mount each layer on top of the previous one, they pre-render container images into ext4 filesystem block devices. This is a concept I've played with a little in Occy Strap, although not as much as I had intended to. Specifically, Occy Strap can render a container image to a flat filesystem without using overlayfs, but it does that more so you can inspect the image contents than to avoid IO entirely. The pre-rendering is definitely an interesting idea, and conceptually similar to Shaken Fist's idea of cached transcodes of virtual machine images. I should note that AWS modified the ext4 implementation they use to be deterministic about the filesystem it creates, so that the differences between versions of a container image are also minimised.
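To make the flattening step concrete, here's a minimal sketch of applying an image's layer tarballs in order to a single directory tree without overlayfs, including OCI whiteout handling. This is my own illustrative code, not AWS's or Occy Strap's implementation:

```python
import os
import shutil
import tarfile

def flatten_layers(layer_tarballs, dest):
    """Apply OCI layer tarballs in order to one directory tree, so the
    result can later be packed into a block device (for example with
    mkfs.ext4 -d) instead of being assembled with overlayfs at runtime."""
    os.makedirs(dest, exist_ok=True)
    for layer in layer_tarballs:
        with tarfile.open(layer) as tar:
            for member in tar.getmembers():
                dirname, basename = os.path.split(member.name)
                if basename == '.wh..wh..opq':
                    # Opaque whiteout: this layer replaces the whole
                    # directory, so drop whatever lower layers put there.
                    target = os.path.join(dest, dirname)
                    if os.path.isdir(target):
                        shutil.rmtree(target)
                    os.makedirs(target, exist_ok=True)
                elif basename.startswith('.wh.'):
                    # Regular whiteout: this layer deletes a path which
                    # exists in a lower layer.
                    target = os.path.join(dest, dirname, basename[4:])
                    if os.path.islink(target) or os.path.isfile(target):
                        os.remove(target)
                    elif os.path.isdir(target):
                        shutil.rmtree(target)
                else:
                    tar.extract(member, dest)
```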
The next part is really interesting to me as well — the ext4 block device images are then split into chunks, and those chunks named for their content (think named with a hash of that content), so that chunks shared between container images are only stored once. This is exactly what Blockstash is doing with virtual machine images, except that I picked large chunk sizes to reduce the number of HTTP requests, while AWS picked 512 KiB to ensure storage efficiency.
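The chunking step itself is simple. A sketch, assuming content is named with SHA-256 (the paper uses content-addressed names; the exact hash is my assumption):

```python
import hashlib

CHUNK_SIZE = 512 * 1024  # AWS's choice; Blockstash uses larger chunks

def chunk_image(path):
    """Split a block device image into fixed-size chunks named by a hash
    of their content, so identical chunks are stored exactly once."""
    manifest = []
    with open(path, 'rb') as image:
        while True:
            chunk = image.read(CHUNK_SIZE)
            if not chunk:
                break
            name = hashlib.sha256(chunk).hexdigest()
            manifest.append(name)
            # A real store would upload the chunk under this name only
            # if it isn't already present.
    return manifest
```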
The chunks are then routed into the Firecracker micro VM which runs a Lambda function by way of virtio, FUSE, and a local agent, which loads chunks on demand as they are read.
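The core of that on-demand path is just arithmetic: a guest block read is translated into fetches of only the chunks it touches. A sketch, where fetch_chunk is a hypothetical helper that downloads (and presumably caches) a chunk by name:

```python
CHUNK_SIZE = 512 * 1024

def read_range(manifest, fetch_chunk, offset, length):
    """Serve a guest read of (offset, length) from a chunked image.
    manifest[i] is the content name of chunk i; only chunks the read
    actually touches are ever fetched."""
    data = b''
    while length > 0:
        index = offset // CHUNK_SIZE
        within = offset % CHUNK_SIZE
        take = min(CHUNK_SIZE - within, length)
        data += fetch_chunk(manifest[index])[within:within + take]
        offset += take
        length -= take
    return data
```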
AWS throws in a wild statistic at this point — 80% of uploaded Lambda functions contain zero unique chunks! That is, they're re-uploads of previously seen container images. AWS points the finger at CI/CD systems for this behaviour, which seems reasonable to me. Of the remaining 20% of uploads, the mean proportion of unique chunks is 4.3%, with a median of 2.5%. AWS is also quite clever here: chunks are stored encrypted, so that a given hypervisor only has access to the chunks it needs to run current workloads and the content of container images remains confidential, while those chunks can still be deduplicated across images from different customers. They do this via convergent encryption, as defined by FARSITE.
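Convergent encryption is a neat trick: because the key is derived from the plaintext itself, two customers who upload the same chunk produce the same ciphertext, which still deduplicates. A sketch using AES-GCM — AWS's exact construction follows FARSITE and may well differ; this is just the general pattern:

```python
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def convergent_encrypt(chunk):
    """Encrypt a chunk under a key derived from its own content."""
    key = hashlib.sha256(chunk).digest()  # 32 bytes, so AES-256
    # A fixed nonce is tolerable here only because each key encrypts
    # exactly one plaintext: the one the key was derived from.
    ciphertext = AESGCM(key).encrypt(b'\x00' * 12, chunk, None)
    # Dedupe on the ciphertext's name; only workloads that were handed
    # the key can decrypt the chunk.
    return hashlib.sha256(ciphertext).hexdigest(), key, ciphertext
```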
This raises two questions I don't have answers to right now: how deterministic are the filesystems created by diskimage-builder, which is the source of the images for Blockstash; and how much of those images is actually read in the average runtime of a virtual machine? I suspect a custom virtio block driver for qemu / KVM virtual machines would be an interesting way to waste a few weeks one day. AWS' initial implementation of their custom virtio driver used FUSE, but they report performance problems because of the context switches required. I wonder if NBD would perform reasonably?
Overall, this paper was excellent and well worth the time to read if you're interested in the performance of containerized systems.