The Culture of the Public Postmortem

Open source software (OSS) runs a big part of the world. It handles loads of every size, from small to unimaginably huge. This tiny blog you are reading is built on OSS. The mighty Instagram is (or was?) built on OSS. But this software is hardly perfect. It fails, and when it does, the ramifications can be huge and felt around the world. As a user of OSS, one of the most important ways you can contribute to the open source community is to publish a postmortem after an incident.

"The postmortem concept is well known in the technology industry. A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring."
-- Google SRE book
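
To make that definition concrete, here is a minimal sketch of a postmortem as a record with one field per item in the quote. The class name, field names, and the missing_sections helper are my own assumptions for illustration, not something the SRE book prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Written record of one incident; fields mirror the quoted definition."""

    incident: str   # what happened
    impact: str     # who or what was affected
    actions_taken: list[str] = field(default_factory=list)      # mitigation or resolution steps
    root_causes: list[str] = field(default_factory=list)
    follow_up_actions: list[str] = field(default_factory=list)  # work to prevent a recurrence

    def missing_sections(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Made-up example data, only to show the shape of a draft.
draft = Postmortem(
    incident="Primary database failover stalled",
    impact="Checkout was unavailable",
)
print(draft.missing_sections())
```

A draft that still reports missing sections is not ready for review yet, which is a cheap check to run before anyone reads it.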

One important thing to note is that a postmortem is not the same thing as incident management. During an incident, you need an incident management strategy that gets it resolved as fast as possible and with limited damage. The postmortem comes afterwards, so your incident management tools must make postmortem analysis easy.

Don't overdo it

Before I even talk about how postmortems should be handled, I want to point out that not every incident deserves a postmortem. As a team, decide the appropriate triggers for postmortems. For example, don't put effort into analysing an incident that was small and was resolved without customer impact. But analyse any and all incidents that meet your trigger criteria.
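
Writing the triggers down as code is one way to keep them from being argued about after every incident. The sketch below is only an illustration: the Incident fields, the 30-minute threshold, and the needs_postmortem name are all hypothetical, and your team should pick its own criteria.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    customer_impact: bool
    duration_minutes: int
    data_lost: bool = False


# Hypothetical threshold; agree on your own value as a team.
MAX_QUIET_MINUTES = 30


def needs_postmortem(incident: Incident) -> bool:
    """Return True when the incident meets any of the agreed triggers."""
    return (
        incident.customer_impact
        or incident.data_lost
        or incident.duration_minutes > MAX_QUIET_MINUTES
    )


# A short blip with no customer impact is skipped.
print(needs_postmortem(Incident(customer_impact=False, duration_minutes=5)))
# Anything customers noticed gets analysed, however short it was.
print(needs_postmortem(Incident(customer_impact=True, duration_minutes=2)))
```

The point of the function is that the decision is made once, up front, and then applied to every incident the same way.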

Tools

If you want people to consistently do something, you have to provide them with tools to make their task easier. I think there are a lot of companies that try to do the right thing but are let down by the tools they use. Jira tickets are a good start but they are just not good enough. They require people to copy conversations they have on other platforms and paste them into the tickets. That is extra work that no one wants to do. If you want postmortems to be part of your company culture try to find the right tool for them. Use the tool and evolve it.

Collaborate

Instead of having a dedicated team that handles all postmortems, have a roster and give everyone an opportunity to lead a postmortem. This ensures that everyone takes an active role in the culture and that knowledge spreads across the company. It also pushes the team handling the incident to communicate their findings and solutions clearly. Otherwise it would be difficult for anyone but the incident team to do the postmortem.

Share

The purpose of postmortems is to surface what you have learned from an incident and how you plan to prevent a recurrence. Your analysis should be intended to share that knowledge, not assign blame. When doing postmortems it can be easy to concentrate on the wrong things, like who was sloppy and who could have done better. Don't allow that toxicity into your postmortems. Instead, concentrate on the lessons learned and the way forward.

Review

Just as, I hope, incidents are reviewed, postmortems should be reviewed too. The purpose of the review should be to follow up on all action points and close all ongoing discussions and comments. As part of the review also ensure that you are now safe from whatever triggered the incident. If not, end the review process quickly and get back to preventing another incident.

Make it public

After all that, write a blog post about it, ideally a public one, even if you are not using OSS. Making postmortems public will help someone else out there. They might learn from how you recovered from a particular incident. If you are using OSS, other people will learn about the failure modes of the software they are using and be prepared for them.

Echo

Too often companies only look to share their successes and wins. That is great, and we are excited for your wins. However, there is a lot more to learn from the failures, both for you and for us, the community. Document your failures, learn from them and then share them with us. Hopefully we will learn from them too.

Personally, I enjoy reading these. I learn a lot from them. Mailchimp recently wrote a nice one about the Mandrill outage.

Ping me on Twitter with links to awesome ones you learned from.

Ciao for now!