I don’t blog about every Shaken Fist release here, but I do feel like the 0.4 release (and the subsequent minor bug fix release 0.4.1) is a pretty big deal in the life of the project.
The focus of the v0.4 series is reliability — we’ve used behaviour in the continuous integration pipeline as a proxy for that, but it should be a significant improvement in the real world as well. This has included:
- much more extensive continuous integration coverage, including several new jobs.
- checksumming image downloads, and retrying downloads where the checksum fails.
- reworked locking.
- etcd reliability improvements.
- refactoring instances and networks to a new “non-volatile” object model where only immutable values are cached.
- images now track a state much like instances and networks.
- a reworked state model for instances, where it’s clearer why an instance ended up in an error state. This is documented in our developer docs.
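To make the checksum-and-retry item concrete, here is a minimal sketch of the general idea rather than Shaken Fist's actual code. The function name `fetch_verified`, the `fetch` callable, the `max_attempts` parameter and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from typing import Callable


def fetch_verified(fetch: Callable[[], bytes], expected_sha256: str,
                   max_attempts: int = 3) -> bytes:
    """Call fetch() until its payload matches the expected SHA-256 digest.

    fetch is a hypothetical stand-in for whatever actually downloads the
    image; it is passed in so the retry logic can be exercised offline.
    """
    for _ in range(max_attempts):
        data = fetch()
        # Only accept the download if its digest matches what we expect.
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
    raise ValueError(
        f'checksum did not match after {max_attempts} attempts')
```

A real implementation would likely stream the image to disk and hash it in chunks instead of holding it all in memory, but the shape of the loop is the same.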
In terms of new features, we also added:
- a network ping API, which sends ICMP ping packets from the network node onto your virtual network. We use this in testing to ensure instances booted and ended up online.
- networks are now checked to ensure that they have a reasonable minimum size.
- addition of a simple etcd backup and restore tool (sf-backup).
- improved data upgrade of previous installations.
- VXLAN IDs are now randomized, which has forced a new naming scheme for network interfaces and bridges.
- we are smarter about what networks we restore on startup, and don’t restore dead networks.
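As an illustration of the minimum network size check, a sketch along these lines would work using Python's standard `ipaddress` module. The threshold of eight addresses (a /29) and the name `network_is_large_enough` are hypothetical; the value Shaken Fist actually enforces may differ.

```python
import ipaddress

# Hypothetical threshold (a /29); not necessarily the value Shaken Fist uses.
MIN_ADDRESSES = 8


def network_is_large_enough(netblock: str,
                            minimum: int = MIN_ADDRESSES) -> bool:
    """Return True if the CIDR netblock holds at least `minimum` addresses."""
    # strict=False tolerates host bits being set, e.g. '192.168.0.1/24'.
    network = ipaddress.ip_network(netblock, strict=False)
    return network.num_addresses >= minimum
```

Rejecting tiny netblocks up front makes sense because a virtual network typically needs addresses for things like a gateway and a broadcast address before any instance gets one.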
We also now require Python 3.8.
Overall, Shaken Fist v0.4 makes me much more comfortable running workloads I care about than previous releases did. It’s far from perfect, but we’re definitely moving in the right direction.