When a service or system experiences an outage, it is imperative to write a public postmortem to ensure accountability and transparency.
A postmortem examines and records an incident, with an emphasis on figuring out what went wrong, determining the root causes, and recommending how to keep the problem from recurring.
As an incident handler, you may be required to draft a postmortem report that can be distributed to employees, customers, and senior executives to explain what went wrong and how it was fixed.
Here’s a detailed guide on writing a successful incident postmortem to assist you with this.
Best Practices To Follow For Writing Impactful Incident Postmortems
1. No Blame Game
Blameless reviews concentrate on understanding what occurred and why, rather than assigning blame to individuals for the incident.
2. Collect Information in a Commonly Accessible Location
To streamline incident investigations and ensure everyone’s on the same page, gathering all pertinent information in a single, easily accessible location is crucial. A shared document or message stream serves as a central hub for updates, notes, and findings, promoting transparency and collaboration. This becomes even more vital when considering various on-call compensation models.
3. Consider the Big Picture
To identify the underlying causes of an incident, consider every factor that may have contributed to it, not only the most obvious one.
4. Promote Honesty
Foster an environment where individuals feel free to own up to their faults. Allow team members to freely communicate their mistakes to gain insightful feedback.
5. Automate the Postmortem Creation Process
Automating this process cuts down on the time you spend copying and pasting event data from different sources. The postmortem template functionality in Zenduty can be utilised to fill in the pertinent facts, allowing incident controllers to begin investigating the issue right away.
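As a rough illustration of what such automation involves, the sketch below fills a plain-text postmortem skeleton from an incident record. It does not use Zenduty's API; the field names (`incident_id`, `title`, `severity`, `started_at`, `resolved_at`) and the section headings are assumptions chosen for the example.

```python
from datetime import datetime
from string import Template

# Hypothetical skeleton; adjust the sections to match your own template.
POSTMORTEM_TEMPLATE = Template(
    "Postmortem: $title\n"
    "Incident ID: $incident_id\n"
    "Severity: $severity\n"
    "Started: $started_at\n"
    "Resolved: $resolved_at\n"
    "Duration (minutes): $duration_minutes\n"
    "\n"
    "Summary:\n"
    "Timeline:\n"
    "Root causes:\n"
    "Action items:\n"
)


def build_postmortem(incident: dict) -> str:
    """Return a pre-filled postmortem draft for one incident record."""
    # Timestamps are assumed to be ISO 8601 strings with an explicit offset.
    started = datetime.fromisoformat(incident["started_at"])
    resolved = datetime.fromisoformat(incident["resolved_at"])
    duration = int((resolved - started).total_seconds() // 60)
    return POSTMORTEM_TEMPLATE.substitute(incident, duration_minutes=duration)


if __name__ == "__main__":
    print(build_postmortem({
        "incident_id": "INC-1",
        "title": "API latency",
        "severity": "SEV1",
        "started_at": "2024-05-01T10:00:00+00:00",
        "resolved_at": "2024-05-01T10:45:00+00:00",
    }))
```

The facts are filled in mechanically, while the analysis sections stay empty for the review owner to complete.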
6. Learn from Past Mistakes
Living postmortems act as a constant source of information. Continuous improvement can be ensured by consulting historical incidents, reviewing the discussions around them, and drawing lessons from past experiences.
7. Add Statistics and Real-Time Graphs
Postmortems are more than static snapshots of data. With live charts, responders can isolate particular metrics or interactively examine data trends across several time intervals to see how the incident evolved.
8. Make It Simple to Locate Later
The findings in your postmortems must be easy to find, so that team members can draw on them when investigating future incidents or creating a runbook.
9. Recognise and Add Tags
For easier searching, use clear and concise tags and names for your incidents and postmortems. If you wish to investigate specific failure types of a certain service, relying only on incident IDs or dates may not be enough. By tagging postmortems with the relevant service names, you can easily locate the information you require.
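The value of tagging can be shown with a minimal sketch, assuming postmortems are stored as simple records with a set of tags. The `Postmortem` class and `find_by_tags` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    incident_id: str
    title: str
    tags: set = field(default_factory=set)


def find_by_tags(postmortems, required_tags):
    """Return the postmortems that carry every requested tag (case-insensitive)."""
    wanted = {tag.lower() for tag in required_tags}
    return [
        pm for pm in postmortems
        if wanted <= {tag.lower() for tag in pm.tags}
    ]


if __name__ == "__main__":
    archive = [
        Postmortem("INC-1", "Checkout timeouts", {"payments", "timeout"}),
        Postmortem("INC-2", "Login errors", {"auth", "deploy"}),
        Postmortem("INC-3", "Refund failures", {"Payments", "deploy"}),
    ]
    for pm in find_by_tags(archive, ["payments"]):
        print(pm.incident_id, pm.title)
```

Searching by service name or failure type then becomes a single lookup instead of a scan through incident IDs and dates.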
How To Conduct an Incident Postmortem
Like many things in IT, if you have a procedure and a few fundamental guidelines in place, incident postmortems go much more smoothly (and take a lot less time). So let us establish a few:
1. Make use of a template
Make a template that you will utilise for every review. This guarantees you won’t overlook anything. A template also serves as the foundation for communications with impacted customers and stakeholders, as well as for reporting to your management team.
2. Identify the owners and roles
The owner of the review is responsible for running the meeting and writing the report that follows. Owners should be familiar with the incident and have a sufficient grasp of the technical details.
3. Establish guidelines for which incidents require reviews
You need precise, well-defined guidelines on which incidents will trigger a postmortem. Any incident with a severity level of one is an excellent starting point. There may be other cases in which a review is beneficial. Think about creating a procedure that allows service owners to request reviews of incidents that don’t meet the severity requirements but could have had a significant negative impact on their customers or services.
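A rule like this can be written down explicitly so that it is applied the same way every time. The sketch below is one possible encoding, assuming severity 1 is the most severe level and that a request from the service owner is recorded as a boolean; the function and parameter names are hypothetical.

```python
def requires_postmortem(severity: int,
                        requested_by_owner: bool = False,
                        severity_threshold: int = 1) -> bool:
    """Decide whether an incident should trigger a postmortem.

    Assumes lower numbers mean higher severity (1 is the most severe).
    """
    meets_severity = severity <= severity_threshold
    return meets_severity or requested_by_owner


if __name__ == "__main__":
    print(requires_postmortem(1))                           # severity rule applies
    print(requires_postmortem(3))                           # below threshold, no request
    print(requires_postmortem(3, requested_by_owner=True))  # owner asked for a review
```

Keeping the threshold as a parameter makes it easy to widen the rule later, for example to include severity 2 incidents.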
4. Take prompt action
Your team will almost always need a little time to recover after a big incident, but don’t wait any longer than absolutely necessary. When you put things off too long, crucial information gets lost. When a major incident happens, get together as soon as possible, ideally within 24 to 48 hours.
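To make the 24 to 48 hour target concrete, here is a small sketch that computes the review window for an incident. It assumes the clock starts when the incident is resolved, which the guideline above does not specify; `review_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def review_window(resolved_at: datetime) -> tuple:
    """Return the (earliest, latest) times to hold the postmortem meeting.

    Assumption: the 24 to 48 hour window is measured from resolution time.
    """
    earliest = resolved_at + timedelta(hours=24)
    latest = resolved_at + timedelta(hours=48)
    return earliest, latest


if __name__ == "__main__":
    earliest, latest = review_window(datetime(2024, 5, 1, 10, 45))
    print("Hold the review between", earliest, "and", latest)
```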
Conclusion
This guide highlights the necessity of fostering a blame-free culture, as well as the benefit of treating reviews as an opportunity for learning and change rather than just a record of incidents.
Check out Zenduty to enhance your incident management process. From incident alerting to writing postmortems, it supports every step and helps speed up your incident response. Try it for free today!