Since about 2017, a group at Cisco has been working on an “OCI native operating system” under the title “project machine”, which is a terrible project name. I note that most of the people publicly involved in the project, according to GitHub commits, no longer work at Cisco, so I cannot vouch for the health of the overall project. That said, they did come up with some interesting ideas along the way, and given it’s a quiet time of year I figured I could do some reading.
Firstly, Docker / OCI images store their layers as tar files. This is quite inefficient: the tar format was really intended for ancient tape drives, doesn’t support concepts such as random seeks, isn’t particularly well defined (there are a few competing implementations), and generally wasn’t intended for this sort of use. So instead, that team wrote atomfs, which stores the layers as squashfs filesystems. It should be noted that the only container runtime which appears to actually support atomfs is project machine itself, so it’s not a super useful format in the real world.
Secondly, the team appears to have fairly rapidly moved on to puzzlefs instead of pursuing full support for atomfs in upstream runtimes. Puzzlefs uses the FastCDC content defined chunking (CDC) scheme to attempt to de-duplicate layers, which sounds like an idea I’d be interested in based on previous adventures in this space and an academic paper I read about 18 months ago.
Josh Leeb has written a nice set of introductory blog posts about CDC schemes including FastCDC, spread across four posts: content defined chunking; Gear Hashing explained; FastCDC; and RapidCDC / QuickCDC. I definitely had to read the section of the second post covering the hash judgment function a couple of times for it to click, so I’d describe these posts as very helpful. One of the nice things about the hash judgment functions described in Josh’s posts is that you don’t need a corpus of common chunks to train on — the definition of a chunk is entirely probabilistic, which is nice.
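To make the hash judgment function concrete, here’s a minimal Gear hashing sketch. The table values, mask width, and roughly 8 KiB target chunk size are all illustrative choices of mine, not anything taken from Josh’s posts or a real implementation:

```python
import random

# A 256-entry table of random 64-bit values, one per possible byte.
# Real implementations bake in a fixed table so that chunk boundaries
# are reproducible across runs and machines.
random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]

MASK = (1 << 13) - 1  # 13 mask bits -> expected chunk size around 8 KiB


def gear_chunks(data: bytes):
    """Yield chunks of data using plain Gear hashing."""
    h = 0
    start = 0
    for i, byte in enumerate(data):
        # Rolling update: each old byte's contribution is shifted one bit
        # left per step, so it eventually falls off the top of the word.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        # Hash judgment: declare a boundary when the low bits are all
        # zero, which happens with probability 1 / (MASK + 1) per byte.
        if h & MASK == 0:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]
```

Because the boundary decision depends only on a sliding window of recent bytes, the same content tends to produce the same chunks no matter where it sits in the file — which is the whole point.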
Now, there is a disclaimer to insert here — FastCDC, RapidCDC, and QuickCDC are not deduplication systems by themselves. What they are instead is a way of determining what chunks of a file you should consider as possible duplicates. That is, we still need to do some form of stronger (cryptographic?) hash on each of these chunks to determine if it is a duplicate. That actually works out quite well though, given that the OCI image registry expects to address content by cryptographic hashes anyway. RapidCDC / QuickCDC are performance improvements over FastCDC, but they gain those improvements by requiring some amount of pre-processing of the input files to provide “chunking hints”, which I think is probably cheating in most cases. That is, I think FastCDC is a pretty reasonable choice in the real world.
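A sketch of that two-level scheme might look like this — the function names and the in-memory dict are hypothetical, standing in for a real content-addressed store such as an OCI registry:

```python
import hashlib


def deduplicate(chunks):
    """Index chunks by a strong hash. CDC only proposes candidate
    boundaries; the cryptographic hash confirms actual duplicates."""
    store = {}   # digest -> chunk bytes, standing in for a content store
    recipe = []  # the ordered digests needed to reconstruct the input
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # only store the first copy
        recipe.append(digest)
    return store, recipe


def reconstruct(store, recipe):
    """Rebuild the original byte stream from the store and recipe."""
    return b"".join(store[digest] for digest in recipe)
```

The “recipe” here is doing the same job as a puzzlefs image manifest: a list of digests, each resolvable to a chunk held exactly once.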
Some important points from Josh’s posts — I appear to have independently rediscovered possibly the worst implementation of Gear Hashing with the initial implementation of blockstash. Firstly, I don’t solve the boundary-shift problem, although I note it exists. Secondly, I failed to realize that a cryptographically secure hash is actually not a great solution for the hashing of the chunks. Honestly, I’m ok with that as long as I learn something along the way. It seems like these algorithms have seen a bit of academic study since around 2016, when the FastCDC paper was published, so I am also only 9 years behind the state of the art!
If you’re interested, you can learn more about puzzlefs in this video:
One interesting observation from the video which I didn’t see in Josh’s blog posts is that ensuring that the file system inside the image (be that a tarball, qcow2, or something else) is written in a deterministic manner will improve the performance of these chunking schemes quite a lot.
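As an example of what “deterministic” means here, this is one way to write a tar archive reproducibly in Python — a sketch of the general idea, not how puzzlefs or project machine actually do it:

```python
import io
import tarfile


def deterministic_tar(files):
    """Build a tarball whose bytes depend only on the logical contents:
    members are added in sorted order, with timestamps and ownership
    normalized, so identical inputs always chunk identically."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):  # fixed ordering, not directory order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = 0          # fixed timestamp
            info.uid = info.gid = 0  # fixed ownership
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()
```

Without this sort of normalization, two builds of the “same” image can differ in member order or metadata, which shifts content around and defeats the chunker even though nothing meaningful changed.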
Overall, I don’t think that puzzlefs is particularly useful to me for various reasons, but reading these blog posts and watching the video has been a fun little tangent.