One of the core tenets of Site Reliability Engineering (SRE) is that blameless postmortems (or retrospectives) should be held for on-call incidents. It’s part of the continuous improvement process, where we learn from what went wrong and try to create processes to ensure it doesn’t happen again. Very explicitly, it is not about blaming anyone for an error: the worst personal outcome for an engineer should be a decision that perhaps we failed to train them adequately. As a simple example, if a system broke because it ran out of disk, you might determine in the retrospective that it would be a super good idea to have an alert for low disk space that fires well before the system breaks, so you can intervene.
It occurs to me that I’ve never seen the same process used for SRE management though, and that seems like an obvious gap to me now. Surely the same process of asking what went wrong, and working out what mechanisms could be created to ensure that we’re at least making new mistakes next time, would be a good idea? Yet I’ve never seen an SRE management team willing to actually hold itself to the same bar that it holds its engineers to.
So… Has anyone ever seen this done? Did it work? Is it in fact a terrible idea?