If anyone were paying attention, which I suspect they are not, they would have noticed that there are a lot of queued-up pull requests for Shaken Fist right now. There are a couple of factors leading to that — there are now several bots which do automated pull requests for code hygiene purposes; and a couple of months ago I decided to give GitHub’s new merge queue functionality a go in order to keep the CI resource requirements for Shaken Fist under control. All CI runs on four machines in my home office, and there were periods of time where the testing backlog would be more than 24 hours long. I can’t simply buy more hardware, and I didn’t really want to test things less.
The basic idea of GitHub merge queues is that you have a quick set of initial tests which determine if the pull request smells, and then only run the full test suite on non-stinky code which a human has signed off on. Once the human signs off, the code will only merge if the full suite passes, and GitHub manages a queue of merge attempts to keep the load on CI reasonable.
One thing I’ve learnt in all of this is that having the initial tests be as fast as possible is super important. Shaken Fist likes to build entire clouds and then run functional tests on them, but those tests can take an hour, and running them on every pull request before a human has even looked at it isn’t desirable. Instead I’ve decided to provide fast, high-level feedback — lint, unit tests, and a targeted subset of the functional tests.
Because code cannot merge without a completely successful final test run, I have had to stop the habit of hitting the retry button when a single test failed on merge. While this is good in terms of avoiding flaky tests by forcing you to fix them, it has also made it much less likely (approximately 0% right now) that a given PR will merge automatically.
(As an aside, each merge attempt in GitHub land is implemented as a new branch and “silent” pull request under the hood. So once your actual pull request is ready to attempt merging, mapping it to the CI run for that attempt isn’t as easy as you’d think, because the attempt does not appear in the original pull request at all. That’s because GitHub will do things like attempting to merge more than one pull request in a single attempt, so it needs these synthetic pull requests to represent that. In the end I wrote some GitHub Actions magic which tells me about merge attempts in a Slack channel, which has been good enough for now.)
So I should just fix the flaky tests, right? That was the initial plan, and I waded through a bunch of changes doing just that. Then I hit the point where the biggest problem was gRPC crashes, because Shaken Fist likes processes and gRPC likes threads (and really really hates processes in Python), so I fixed that. Then the unreliability was that the queuing mechanism for work inside the cluster needed some love — something I’ve been meaning to address for literally years. So I fixed that. All of this is good because the code is now better than it was, and I learnt some stuff along the way. However, automated merges are still not working, because now the unreliable bit is the Shaken Fist in-guest agent code, which has also been on my hit list for at least two years.
What is this in-guest agent? Well, Shaken Fist is a bit unique in that you can use an out-of-band side channel to do things on the instance. So for example, if you had an instance with no networking and no user accounts, you could use the agent, if it was installed, to execute commands, fetch files into and out of the instance, collect telemetry, and so on. This can be useful when you want to perform work without outside influence — for example Amazon’s Nitro Enclaves — but it’s also super useful when you want to inspect the internal state of an instance during tests. That second case is what I find I use the agent for — the functional testing for Shaken Fist itself is a big user of the agent.
The hypervisor and the agent talk to each other over a virtio-serial connection provided by QEMU. At the moment it uses a custom protocol, so my naive assumption was that I could improve reliability by instead talking serialized protobuf through that connection. This is the approach the privexec daemon in Shaken Fist takes, but it uses a unix domain socket.
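To make the framing side of that concrete, here’s a minimal sketch of length-prefixed messages over a raw byte stream in Python. The struct-packed header and opaque bytes payload stand in for serialized protobuf, and none of these names come from Shaken Fist’s actual code:

```python
import struct

# A 4-byte big-endian length prefix ahead of each message body. In a real
# system the body would be a serialized protobuf; plain bytes stand in here.
HEADER = struct.Struct('>I')


def write_frame(stream, payload: bytes) -> None:
    # Write the length header followed by the payload in one go.
    stream.write(HEADER.pack(len(payload)) + payload)


def read_exactly(stream, count: int) -> bytes:
    # A raw stream can return short reads, so loop until we have it all.
    chunks = []
    while count:
        chunk = stream.read(count)
        if not chunk:
            raise ConnectionError('stream closed mid-frame')
        chunks.append(chunk)
        count -= len(chunk)
    return b''.join(chunks)


def read_frame(stream) -> bytes:
    # Read the header, then exactly that many bytes of payload.
    (length,) = HEADER.unpack(read_exactly(stream, HEADER.size))
    return read_exactly(stream, length)
```

A scheme like this only solves framing, though; interleaving several conversations on the one stream still needs something more.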
However, I then realized that it’s not my janky protocol which is the unreliable bit. While the protocol should probably go away and be replaced by protobufs, the bigger problem is that serial connections are not multiplexed like unix domain sockets. Instead of each request being a new socket connection which need only be concerned with the response to a single request, the serial connection is just a raw stream of bytes in each direction, and each end must handle the multiplexing itself. This drives a lot of complexity in the implementation, because there can be multiple operations occurring at once, especially on the hypervisor side, and that’s the code which I think is causing my issues and should go away. My realization this morning is that the multiplexing is independent of the protocol itself — TCP, for example, doesn’t know what’s happening inside each session; its job is just to get your byte stream from one correspondent to the other.
So how do I implement multiplexing over a serial connection in Python? I’m very much open to suggestions here, but it starts to sound a bit like SLiRP or something like that to me. Or should each packet on the serial connection just have a “session id” and somehow simulate a unix domain socket on each end?
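As a rough sketch of what that session-id option might look like, rather than a proposal: every frame carries a session id and a payload length, and a demultiplexer on each end routes payloads into per-session queues. All of the names here are hypothetical:

```python
import queue
import struct

# Each frame: 4-byte session id, 4-byte payload length, then the payload.
FRAME_HEADER = struct.Struct('>II')


class SerialDemultiplexer:
    """Routes frames from one raw byte stream into per-session queues."""

    def __init__(self, stream):
        self.stream = stream
        self.sessions = {}

    def send(self, session_id: int, payload: bytes) -> None:
        # Frames from different sessions can interleave freely on the wire.
        self.stream.write(
            FRAME_HEADER.pack(session_id, len(payload)) + payload)

    def _read_exactly(self, count: int) -> bytes:
        # Loop to cope with short reads from the raw stream.
        chunks = []
        while count:
            chunk = self.stream.read(count)
            if not chunk:
                raise ConnectionError('stream closed mid-frame')
            chunks.append(chunk)
            count -= len(chunk)
        return b''.join(chunks)

    def pump_one_frame(self) -> None:
        # Read one frame and deliver it to that session's queue, creating
        # the queue on first sight of a new session id.
        session_id, length = FRAME_HEADER.unpack(
            self._read_exactly(FRAME_HEADER.size))
        payload = self._read_exactly(length)
        self.sessions.setdefault(session_id, queue.Queue()).put(payload)

    def receive(self, session_id: int) -> bytes:
        return self.sessions.setdefault(
            session_id, queue.Queue()).get_nowait()
```

In a real implementation the pump would run in its own thread and receive() would block, but the core point stands: the framing and routing layer doesn’t need to know anything about the protocol the payloads themselves speak.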
Further thought required I think, but also… suggestions welcome!