Why and how to conduct a blameless retro for post-incident response
Issues are inevitable. In the software world, they’re not a matter of if, but when – and it’s how well teams respond (and how well that prepares them for future issues) that separates the best organizations from the pack.
The most successful organizations aim for a post-incident response that brings the team together to discuss what happened and how it was resolved, so they can learn from the problem and avoid repeating the same mistake.
Numerous options exist for this type of response, but many teams now adopt the approach of a “blameless retro.” Let’s explore why this approach works well and how to lead a blameless retro in your organization.
Why adopt a blameless retro approach for post-incident response
Traditionally, many organizations took a root cause analysis approach for post-incident response. However, instead of focusing on what caused the issue, these meetings typically devolved into sessions full of finger pointing and calling out each other's mistakes. Unfortunately, when the focus turns to blame rather than discovery, people become defensive. That hurts long-term incident response efforts because it doesn’t create a space where people feel comfortable hypothesizing to solve problems.
As a result of these shortcomings, many organizations have started to change their approach and instead adopt a blameless retro. The idea behind a blameless retro is to create a blameless, yet accountable environment with ample psychological safety for your team. In this type of environment, people should feel comfortable sharing ideas (both during and after the incident resolution) and talking about what happened.
Ultimately, the goal of a blameless retro is to have no repeat incidents. While issues will always come up, they should always arise due to new mistakes, rather than repeated ones.
8 best practices for conducting a blameless retro
In pursuit of better post-incident response and no repeated mistakes, how can you conduct a blameless retro? Here are eight best practices to get you started:
Schedule it immediately after the issue is resolved, but allow enough time to cool off
It’s important to schedule the blameless retro as close to the incident resolution as possible, so that the complexities are fresh in everyone’s minds and there is less opportunity for the issue to recur in the meantime. That said, you also need to balance that immediacy with a potential need for some time and space. Everyone should enter the meeting in a calm state of mind, so if a particular incident was extremely stressful and people are frustrated, you shouldn’t schedule the retro for the very next morning. Instead, give people enough time to cool off without scheduling it so far in the future that the details of the incident become fuzzy.
Invite only those involved in the incident
A blameless retro should not be an open meeting. Rather, it should include only those directly involved in resolving the incident and anyone who worked on code related to what might have caused the incident. Keeping this group small and confined to those directly involved creates a safe space for people to share mistakes with the team. Specifically, it makes sharing those mistakes and lessons learned along the way the norm rather than making people afraid to share their mistakes for fear of blame.
Get everyone to talk
Because the invite list for a blameless retro should be so purposeful, it’s important that everyone included in the meeting talks. After all, they are present because of their involvement in the incident and/or the resolution, and it’s important to get their perspective. Equally important is making sure that one person doesn’t dominate the entire conversation. Asking leading questions and using shared documents are helpful tools for ensuring that everyone contributes, and contributes roughly equally.
Recreate the timeline – with context
One of the most important agenda items of a blameless retro is recreating the timeline of the incident from beginning to end, noting key milestones like when an issue was introduced, when it was first discovered, when the team identified the proximate cause and how long it took to fix.
The more detailed this timeline, the better, as context is extremely important for everyone to understand what the team explored as potential solutions and why those solutions did or did not work. For example, it’s important to note when the team thought something was the cause and it turned out not to be, which can happen a lot with complex systems. This context provides good visibility for understanding future issues. When you do mark those down, it’s helpful to annotate that the idea was ultimately untrue as soon as you present it in the timeline, so that it doesn’t mislead anyone who reviews the timeline going forward.
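One way to keep such a timeline consistent is to record each event as structured data. The following is a minimal sketch, not a prescribed format: the `TimelineEntry` fields, the `[RULED OUT]` label and the sample events and timestamps are all hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class TimelineEntry:
    """One event in the incident timeline (hypothetical structure)."""
    at: datetime
    description: str
    is_hypothesis: bool = False  # True if this was a suspected cause
    ruled_out: bool = False      # True if the suspected cause proved untrue


def render_timeline(entries: List[TimelineEntry]) -> List[str]:
    """Return entries in chronological order, flagging disproven hypotheses up front."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.at):
        # Put the annotation at the start so a later reader is not misled.
        prefix = "[RULED OUT] " if entry.is_hypothesis and entry.ruled_out else ""
        lines.append(f"{entry.at:%H:%M} {prefix}{entry.description}")
    return lines


# Made-up sample events, entered out of order on purpose.
timeline = [
    TimelineEntry(datetime(2024, 1, 1, 9, 40),
                  "Suspected cause: cache misconfiguration",
                  is_hypothesis=True, ruled_out=True),
    TimelineEntry(datetime(2024, 1, 1, 9, 5), "First alert received"),
    TimelineEntry(datetime(2024, 1, 1, 10, 15), "Proximate cause identified"),
]
rendered = render_timeline(timeline)
for line in rendered:
    print(line)
```

Placing the annotation before the description means anyone skimming the timeline sees that a hypothesis was disproven before reading what it was.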
Highlight important metrics
In addition to the detailed timeline for the incident, it’s also important to call out a handful of important metrics that can help your team get better at resolving (and even predicting) incidents over time and track that progress. These metrics include:
- How, and how quickly, you discovered the problem: Identifying how you discovered the problem can help determine ways to find similar problems faster in the future. This speed to discovery is especially critical for more predictable issues, like those that center around exhausted resources (e.g. reaching database limits). In general, proactive monitoring, such as pinging a server to ensure it’s still running or regularly checking page load times to watch for degraded experiences, can help you find issues before your customers do. For predictable issues, it can even help you avoid the issue entirely.
- Total customers impacted: Pinpointing how many customers were impacted is important for understanding what’s required in terms of customer communications. Beyond the communication need, the team should always ask how you can reduce that number going forward.
- How impacted customers experienced the problem: Understanding how customers experienced the problem will also impact your team’s understanding of other critical elements, like customer communications and overall issue identification. For instance, did customers experience a slow down or were they locked out entirely? Did it only impact the experience of customers trying to log in and not those already logged in or did everyone experience it the same way?
- Total duration until the problem was fixed: It is critical to measure both the time from when the problem was identified until it was fixed and the time from when the solution was identified until it was fixed. Based on that information, you can look for ways to make both measures faster. Along with proactive monitoring to speed the time to identification, having clear communications (e.g. through a dedicated channel for incident response), written plans, access to information from previous incident resolutions and an overall safe space to hypothesize can all help improve this measure.
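The timing metrics above can be derived directly from the milestones recorded in the timeline. This is a rough sketch under stated assumptions: the `IncidentTimes` class, its field names and the sample timestamps are hypothetical, and your team may define the milestones differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimes:
    """Key milestones for one incident (hypothetical field names)."""
    introduced: datetime           # when the issue was introduced
    discovered: datetime           # when the problem was first identified
    solution_identified: datetime  # when the team identified the fix
    fixed: datetime                # when the fix was in place

    def time_to_discovery(self) -> timedelta:
        """How long the issue existed before anyone noticed it."""
        return self.discovered - self.introduced

    def discovery_to_fix(self) -> timedelta:
        """Time from identifying the problem until it was fixed."""
        return self.fixed - self.discovered

    def solution_to_fix(self) -> timedelta:
        """Time from identifying the solution until it was fixed."""
        return self.fixed - self.solution_identified


# Made-up timestamps for illustration only.
incident = IncidentTimes(
    introduced=datetime(2024, 1, 1, 8, 0),
    discovered=datetime(2024, 1, 1, 8, 30),
    solution_identified=datetime(2024, 1, 1, 9, 45),
    fixed=datetime(2024, 1, 1, 10, 0),
)
print("Time to discovery:", incident.time_to_discovery())
print("Discovery to fix: ", incident.discovery_to_fix())
print("Solution to fix:  ", incident.solution_to_fix())
```

Recording the same measures for every incident makes it possible to compare them over time and see whether the team is actually getting faster.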
Document everything
As you go through the detailed timeline and the key measures, it’s important to document everything. All of the information discussed in a blameless retro is extremely important for organizational context and for the team going forward as future incidents arise. Documenting everything ensures that the information is available for reference for those who need it and that no one has to rely on memory alone.
Develop clear action items
The primary takeaway from a blameless retro should be clear action items for the team to help shorten feedback loops and identify what to prioritize going forward when new issues arise. These action items should focus on how to improve upon what the team did or didn’t do during the current incident and should always help cement accountability without placing blame. You should make one person responsible for ensuring you don’t leave without developing (and of course documenting) these action items.
Build an executive summary and share the outcomes with everyone
Finally, build an executive summary based on the meeting that you can share with relevant stakeholders. The summary should cover what happened, what worked as expected, what failed, the key metrics and the action items, including how you will change or improve upon your approach for incident resolution going forward. This summary should come to about one paragraph and avoid calling out anyone specific unless it draws attention to good behavior.
While the summary is intended for executive stakeholders, you should make it available to everyone, as it can provide important business context. This availability becomes especially critical as future issues arise, since teams can hopefully use these summaries to find solutions faster.
What happens when you conduct an effective blameless retro
An effective blameless retro should help ensure that you don’t experience the exact same issue again. Beyond that most important outcome, it should also improve your team’s incident response by introducing better processes, clearer communications and a feeling of safety to hypothesize about potential causes and resolutions. In doing so, this approach should ultimately enable your team to identify and resolve issues faster.