Incidents are unplanned investments
John Allspaw
Postmortems are an opportunity to get a return on that investment: we have already paid the price, so let’s get something back from the firefighting stress we have already put in.
Avoid Blame and Keep It Constructive
Blamelessness is not the same as psychological safety
Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior.
https://sre.google/sre-book/postmortem-culture/
Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization.
https://sre.google/sre-book/bibliography/#Boy13
How to Write an Incident Report
- There is usually no single root cause; there are many contributing factors. Different groups of people will arrive at different root causes, and the 5 Whys technique may produce different results with different groups.
- Conduct one-on-one interviews to collect different perspectives without bias and to fully engage each person you talk to. Group meetings allow people to hide in the crowd.
- Focus on stories, not action items. Action items may not get done, so focus instead on learning and on changing our mindset for the future.
- Write an incident report that is meant to be read: a story to learn from, not an account of actions and timestamps.
- Focus on humans more, software less.
- Increase organisational resilience as a result, that is, your organisation's ability to weather the storm.
Interview Questions
The following questions are based on Amy Tobey's excellent presentation, One on One SRE.
- What was your role in the incident?
- What surprised you?
- How long did you work on the incident?
- Were you able to get the support you needed?
- Do you feel that the incident was preventable?
- What actions do you feel good about?
- What do you think could have been better?
- What did you learn from this incident?
- What do you think we can do to prevent re-occurrence?
- Did our tools and documentation serve you well?
- Did you practice self-care during this process?
- Can you think of anyone else we should talk to?