
AoAD2 Practice: Incident Analysis

June 7, 2021

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Incident Analysis

Audience Whole Team

We learn from failure.

Despite your best efforts, your software will sometimes fail to work as it should. Some failures will be minor, such as a typo on a web page. Others will be more significant, such as code that corrupts customer data, or an outage that prevents customer access.

Some failures are called bugs or defects; others are called incidents. The distinction isn’t particularly important. Either way, once the dust has settled and things are running smoothly again, you need to figure out what happened and how you can improve. This is incident analysis.

The details of how to respond during an incident are out of the scope of this book. For an excellent and practical guide to incident response, see Site Reliability Engineering: How Google Runs Production Systems [Beyer et al. 2016], particularly chapters 12-14.

Key Idea: Embrace Failure

Agile teams understand failures are inevitable. People make mistakes; miscommunications occur; ideas don’t work out.

Rather than engaging in a quixotic quest to avoid failure, Agile teams embrace it. If failure is inevitable, then the important thing is to detect failure as soon as possible; to fail early, while there’s still time to recover; to contain failure, so the consequences are minimal; and to learn from failure, not place blame.

Continuous deployment is a good example of this philosophy. Teams using continuous deployment detect failures with monitoring. They deploy every hour or two, which reveals failures early while also reducing their impact. They use canary servers to minimize the consequences of failure, and they use each failure to understand their limits and improve. Counterintuitively, embracing failure leads to less risk and better results.

Failures may be inevitable, but that doesn’t prevent success.

The Nature of Failure

Failure is a consequence of your entire development system.

It’s tempting to think of failure as a simple sequence of cause and effect—A did this, which led to B, which led to C—but that’s not what really happens. In reality, failure is a consequence of the entire development system in which work is performed. (The development system is your entire approach to building software, from tools to organizational structure. It’s in contrast to your software system, which is the thing you’re building.) Each failure, no matter how minor, is a clue about the nature and weaknesses of that development system.1

1My discussion of the nature of failure is based on [Woods et al. 2010] and [Dekker 2014].

Ally No Bugs

Failure is the result of many interacting events. Small problems are constantly occurring, but the system has norms, such as the ones described in “No Bugs,” that keep them inside a safe boundary. A programmer makes an off-by-one error, but their pairing partner suggests a test to catch it. An on-site customer explains a story poorly, but notices the misunderstanding during customer review. A team member accidentally erases a file, but continuous integration rejects the commit.

When failure occurs, it’s not because of a single cause, but because multiple things go wrong at once. A programmer makes an off-by-one error, and their pairing partner was up late with a newborn and didn’t notice, and the team is experimenting with less frequent pair swaps, and the canary server alerts were accidentally disabled. Failure happens, not because of problems, but because the development system—people, processes, and business environment—allows problems to combine.

Furthermore, systems exhibit a drift toward failure. Ironically, for teams with a track record of containing failures, the threat isn’t mistakes, but success. Over time, as no failures occur, the team’s norms change. For example, they might make pairing optional so people have more choice in their work styles. Their safe boundaries shrink. Eventually, the failure conditions—which existed all along!—combine in just the right way to exceed these smaller boundaries, and a failure occurs.

It’s hard to see the drift toward failure. Each change is small, and is an improvement in some other dimension, such as speed, cost, convenience, or customer satisfaction. To prevent drift, you have to stay vigilant. Past success doesn’t guarantee future success.

Small failures are a “dress rehearsal” for large failures.

You might expect large failures to be the result of large mistakes, but that isn’t how failure works. There’s no single cause, and no proportionality. Large failures are the result of the same systemic issues as small failures. That’s good news, because it means small failures are a “dress rehearsal” for large failures. You can learn just as much from them as you do from big ones.

Therefore, treat every failure as an opportunity to learn and improve. A typo is still a failure. A problem detected before release is still a failure. No matter how big or small, if your team thinks something is “done,” and it later needs correction, it’s worthy of analysis.

But it goes even deeper. Failures are a consequence of your development system, as I said, but so are successes. You can analyze them, too.

Conducting the Analysis

Ally Retrospectives

Incident analysis is a type of retrospective: a joint look back at your development system for the purpose of learning and improving. As such, an effective analysis involves the five stages of a retrospective [Derby and Larsen 2006]:

  1. Set the stage;

  2. Gather data;

  3. Generate insights;

  4. Decide what to do; and

  5. Close the retrospective.

Include your whole team in the analysis, along with anyone else involved in the incident response. Avoid including managers and other observers; you want participants to be able to speak up and admit mistakes openly, and that requires limiting attendance to just the people who need to be there. When there’s a lot of interest in the analysis, you can produce a report, as I’ll describe later.

The time needed for the analysis session depends on the number of events leading up to the incident. It’s independent of the impact. In the beginning, expect it to take several hours. You’ll get faster with experience.

A neutral facilitator should lead the session. The more sensitive the incident, the more experienced your facilitator needs to be.

This practice, as with all practices in this book, is focused on team-level incidents—incidents which your team can analyze mainly on their own. You can apply this format to larger, multi-team incidents, but you’ll need to customize it first. It’s best to wait until you’ve had experience using this approach with smaller, team-level analyses first.

If a multi-team incident occurs and the official incident analysis isn’t useful to your team, your team can conduct their own, unofficial analysis, using this format, by focusing on your team’s contributions to the incident.

Set the stage
Ally Safety

Because incident analysis involves a critical look at successes and failures, it’s vital for every participant to feel safe to contribute, including having frank discussions about the choices they made. For that reason, start the session by reminding everyone that the goal is to use the incident to better understand the way you create software—the development system of people, processes, expectations, environment, and tools. You’re not here to focus on the failure itself or to place blame, but instead to learn how to make your development system more resilient.

Ask everyone to confirm that they can abide by that goal and assume good faith on the part of everyone involved in the incident. Norm Kerth’s prime directive is a good choice:

Regardless of what we discover, we must understand and truly believe that everyone did the best job he or she could, given what was known at the time, his or her skills and abilities, the resources available, and the situation at hand. [Kerth 2001] (ch. 1)

In addition, consider establishing the Vegas Rule: What’s said in the analysis session, stays in the analysis session. Don’t record the session, and ask participants to agree not to repeat any personal details shared in the session.

If the session includes people outside the team, or if your team is new to working together, you might also want to establish working agreements for the session. (See “Create Working Agreements” on p.XX.)

Gather data

Once the stage has been set, your first step is to understand what happened. You’ll do so by creating an annotated, visual timeline of events.

Stay focused on facts, not interpretations.

People will be tempted to interpret the data at this stage, but it’s important to keep everyone focused on “just the facts.” With the benefit of hindsight, it’s easy to fall into the trap of critiquing people’s actions, but that won’t help. A successful analysis focuses on understanding what people actually did, and how your development system contributed to them doing those things, not what they could have done differently.

As this stage progresses, participants will probably need multiple reminders to stay focused on facts, not interpretations. Be persistent.

To create the timeline, start by creating a long horizontal space on your virtual whiteboard. If you’re conducting the session in person, use blue tape on a large wall. Divide the timeline into columns representing different periods in time. The columns don’t need to be uniform; weeks or months are often best for the earlier part of the timeline, while hours or days might be more appropriate for the moments leading up to the incident.

Have participants use simultaneous brainstorming to think of events relevant to the incident. (See “Work Simultaneously” on p.XX.) Events are factual, non-judgemental statements about something that happened, such as “Deploy script stops all ServiceGamma instances,” “ServiceBeta returns 419 response code,” “ServiceAlpha doesn’t recognize 419 response code and crashes,” “On-call engineer is paged about system downtime,” and “On-call engineer manually restarts ServiceGamma instances.” (You can use people’s names, but only if they’re present and agree.) Be sure to capture events that went well, not just ones that went poorly.

Software logs, incident response records, and version control history are all likely to be helpful. Write each event on a separate sticky note and add it to the board. Use the same color sticky for each event.

Afterwards, invite everyone to step back and look at the big picture. Which events are missing? Working simultaneously, look at each event and ask, “What came before this? What came after?” Add each additional event as another sticky note. You might find it helpful to show before/after relationships with arrows.

How was the automation used? Configured? Programmed?

Be sure to include events about people, not just software. People’s decisions are an enormous factor in your development system. Find each event which involves automation your team controls or uses, then add preceding events about how people contributed to that event. How was the automation used? Configured? Programmed? Be sure to keep these events neutral in tone and blame-free. Don’t second-guess what people should have done; only write what they actually did.

For example, the event “Deploy script stops all ServiceGamma instances” might be preceded by “Op misspells --target command-line parameter as --tagret” and “Engineer inadvertently changes deploy script to stop all instances when no --target parameter found,” which in turn is preceded by “Team decides to clean up deploy script’s command-line processing.”
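If it helps to see the shape of that mistake in code, here’s a rough sketch. The script itself is invented—the function names, the hand-rolled option parsing, and the exact behavior are assumptions for illustration only, not an actual deploy script:

```python
import sys

def parse_target(argv):
    """Hand-rolled option parsing (hypothetical). Unrecognized flags are
    silently ignored, so a typo like --tagret=web-3 simply disappears."""
    target = None
    for arg in argv[1:]:
        if arg.startswith("--target="):
            target = arg.split("=", 1)[1]
        # Anything else (including --tagret=...) falls through unnoticed.
    return target

def stop_all_instances():
    print("stopping ALL ServiceGamma instances")

def stop_instance(name):
    print(f"stopping ServiceGamma instance {name}")

def deploy(argv):
    target = parse_target(argv)
    if target is None:
        # The inadvertent change: "no target" now means "all instances."
        # Failing fast instead -- sys.exit("error: --target is required") --
        # would have kept the typo from stopping everything.
        stop_all_instances()
    else:
        stop_instance(target)

if __name__ == "__main__":
    deploy(sys.argv)  # e.g., python deploy.py --tagret=web-3
```

Notice that neither event is a failure on its own: the typo is harmless with a stricter parser, and the lenient parser is harmless without the typo.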

Events can have multiple predecessors feeding into the same event. Each predecessor can occur at different points in the timeline. For example, the event “ServiceAlpha doesn’t recognize 419 response code and crashes” could have three predecessors: “ServiceBeta returns 419 response code” (immediately before); “Engineer inadvertently disables ServiceAlpha top-level exception handler” (several months earlier); and “Engineer programs ServiceAlpha to throw exception when unexpected response code received” (a year earlier).
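To see how predecessors that far apart can interlock, here’s a minimal sketch. ServiceAlpha’s real code isn’t shown anywhere in this example; the structure below is invented purely to illustrate how a deliberate fail-fast decision and a later, unrelated loss of the top-level handler combine into a crash:

```python
class UnexpectedResponse(Exception):
    pass

def call_service_beta():
    # Stand-in for the real network call; returns whatever ServiceBeta sent.
    return 419

def handle_response(status):
    if status == 200:
        return "ok"
    # A year earlier: the deliberate decision to fail fast on anything unexpected.
    raise UnexpectedResponse(f"unexpected response code {status}")

def request_loop():
    while True:
        # Several months earlier, the try/except that used to wrap this call
        # was inadvertently disabled, so the exception now escapes and the
        # process crashes instead of logging and degrading gracefully.
        result = handle_response(call_service_beta())
        print(result)

if __name__ == "__main__":
    request_loop()
```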

As events are added, encourage participants to share recollections of their opinions and emotions at the time. Don’t ask people to excuse their actions; you’re not here to assign blame. Ask them to explain what it was like to be there, in the moment, when the event occurred. This will help your team understand the social and organizational aspects of your development system—not just what choices were made, but why.

Ask participants to add additional stickies, in another color, for those thoughts. For example, if Jarrett says, “I had concerns about code quality, but I felt like I had to rush to meet our deadline,” he could write two sticky notes: “Jarrett has concerns about code quality” and “Jarrett feels he has to rush to meet deadline.” Don’t speculate about the thoughts of people who aren’t present, but you can record things they said at the time, such as “Layla says she has trouble remembering deploy script options.”

Keep these notes focused on what people felt and thought at the time. Your goal is to understand the system as it really was, not to second-guess people.

Finally, ask participants to highlight important events in the timeline—the ones that seem most relevant to the incident. Double-check whether people have captured all their recollections about those events.

Generate insights

Now it’s time to turn facts into insights. In this stage, you’ll mine your timeline for clues about your overall development system. Before you begin, give people some time to study the board. This can be a good point to call for a break.

The events aren’t the cause of failure; they’re a symptom of your system.

Begin by reminding attendees about the nature of failure. Problems are always occurring, but they don’t usually combine in a way that leads to failure. The events in your timeline aren’t the cause of the failure; they’re a symptom of how your development system functions. It’s that deeper system that you want to analyze.

Look at the events you identified as important during the “gather data” activity. Which of them involved people? To continue the example, you would choose the “Op misspells --target” and “Engineer inadvertently changes deploy script” events, but not “Deploy script stops all ServiceGamma instances,” because that event happened automatically.

Working simultaneously, assign one or more of the following categories to each people-involved event.2 Write each category on a third color of sticky note and add it to the timeline.

2This list was inspired by [Woods et al. 2010] and [Dekker 2014].

  • Knowledge and mental models: Involves information and decisions within the team involved in the event. For example, believing a service maintained by the team will never return a 419 response.

  • Communication and feedback: Involves information and decisions from outside the team involved in the event. For example, believing a third-party service will never return a 419 response.

  • Attention: Involves the ability to focus on relevant information. For example, ignoring an alert because several other alerts are happening at the same time, or misunderstanding the importance of an alert due to fatigue.

  • Fixation and plan continuation: Persisting with an assessment of the situation in the face of new information. For example, during an outage, continuing to troubleshoot a failing router after logs show that traffic successfully transitioned over to the backup router. Also involves continuing with an established plan; for example, releasing on the planned date despite beta testers saying the software isn’t ready.

  • Conflicting goals: Choosing between multiple goals, some of which may be unstated. For example, deciding to prioritize meeting a deadline over improving code quality.

  • Procedural adaptation: Involves situations in which established procedure doesn’t fit the situation. For example, abandoning a checklist after one of the steps reports an error. A special case is the responsibility-authority double bind, which requires people to make a choice between being punished for violating procedure or following a procedure that doesn’t fit the situation.

  • User experience: Involves interactions with computer interfaces. For example, providing the wrong command-line argument to a program.

  • Write-in: You can create your own category if the event doesn’t fit into the ones I’ve provided.

The categories apply to positive events, too. For example, “engineer programs back-end to provide safe default when ServiceOmega times out” is an example of a “knowledge and mental models” event.

After you’ve categorized the events, take a moment to consider the whole picture again, then break into small groups to discuss each event. What does each one say about your development system? Focus on the system, not the people.

For example, the event, “Engineer inadvertently changes deploy script to stop all instances when no --target parameter found,” sounds like it’s a mistake on the part of the engineer. But the timeline reveals that Jarrett, the engineer in question, felt he had to rush to meet a deadline, even though it reduced code quality. That means it was a “conflicting goals” event, and it’s really about how priorities are made and communicated. As the team discusses the event, they realize they all feel pressure from sales and marketing to prioritize deadlines over code quality.

Incident analysis looks at the system, not individuals.

On the other hand, let’s say the timeline analysis revealed that Jarrett also misunderstood the behavior of the team’s command-line processing library. That would make it a “knowledge and mental models” event, too, but you still wouldn’t put the blame on Jarrett. Incident analysis always looks at the system, not individuals. Individuals are expected to make mistakes. In this case, a closer look at the event reveals that, although the team used test-driven development and pairing for production code, they didn’t apply that standard to their scripts. They didn’t have any way to prevent mistakes in their scripts, and it was just a matter of time before one slipped through.

After the breakout groups have had a chance to discuss the events—for speed, you might want to divide the events among the groups, rather than having each group discuss every event—come back together to discuss what you’ve learned about the system. Write each conclusion on a fourth color of sticky note and put it on the timeline next to the corresponding event. Don’t make suggestions, yet; just focus on what you’ve learned. For example, “No systematic way to prevent programming mistakes in scripts,” “Engineers feel pressured to sacrifice code quality,” and “Deploy script requires long and error-prone command-line.”

Decide what to do

You’re ready to decide how to improve your development system. You’ll do so by brainstorming ideas, then choosing a few of your best options.

Start by reviewing the overall timeline again. How could you change your system to be more resilient? Consider all possibilities, without worrying about feasibility. Brainstorm simultaneously onto a table or a new area of your virtual whiteboard. You don’t need to match your ideas to specific events in the timeline. Some ideas will address multiple things at once.

To continue the example, your team might brainstorm ideas such as, “stop committing to deadlines,” “update forecast weekly and remove stories that don’t fit deadline,” “apply production coding standards to scripts,” “perform review of existing scripts for additional coding errors,” “simplify deploy script’s command line,” and “perform user experience review of command-line options across all the team’s scripts.” Some of these ideas are better than others, but at this stage, you’re generating ideas, not filtering them.

Once you have a set of options, group them into “control,” “influence,” and “soup” circles, depending on your team’s ability to make them happen, as described in “Circles and Soup” on p.XX. Have a brief discussion about the options’ pros and cons. Then use dot voting, followed by a consent vote (see “Work Simultaneously” on p.XX and “Seek Consent” on p.XX), to decide which options your team will pursue. You can choose more than one.

As you think about what to choose, remember that you shouldn’t fix everything. Sometimes, introducing a change adds more risk or cost than the thing it solves. In addition, although every event is a clue about the resiliency of your underlying system, not every event is bad. For example, one of the example events was, “Engineer programs ServiceAlpha to throw exception when unexpected response code received.” Even though that event directly led to the outage, it made diagnosing the failure faster and easier. Without it, something still would have gone wrong, and it would have taken longer to solve.

Make Errors Impossible

As you think of options for what your team could do, try to think of ways to make errors impossible. For example, imagine you’ve discovered data loss in a text field. People could enter 500 characters, but only the first 250 were stored in the database. An analysis reveals multiple errors: first, a programmer hard-coded the wrong maximum length in the front end; second, a programmer programmed the back end to accept front-end data without validating length; third, a programmer coded the database abstraction to silently truncate the data rather than failing fast.

In this example, you could make the front-end error impossible by changing it to automatically pull field lengths from back-end metadata.
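Here’s a minimal sketch of that idea, with invented field names and a stand-in for the real schema. In a real system the limits would come from the database schema or your ORM’s model definitions; the point is only that the front end reads the limit instead of hard-coding it:

```python
# Single source of truth for column lengths (a stand-in for the real schema).
SCHEMA = {
    "comment": {"max_length": 250},
    "title": {"max_length": 80},
}

def field_metadata(field):
    """Back-end metadata lookup (hypothetical): what the front end asks
    instead of hard-coding its own limit."""
    return SCHEMA[field]

def render_text_input(field):
    """Front-end template helper: maxlength always matches the database."""
    limit = field_metadata(field)["max_length"]
    return f'<input name="{field}" maxlength="{limit}">'

print(render_text_input("comment"))
# <input name="comment" maxlength="250">
```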

When you can’t make errors impossible, try to catch them automatically, typically by improving your build or test suite. For example, you could write a test to look at all front-end templates and check their length against the database.
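A sketch of such a test might look like this. It assumes the templates are HTML files with explicit maxlength attributes and that the column limits are available from the same kind of schema mapping as above; a real test would read the limits from the database or your ORM instead:

```python
import re
from pathlib import Path

SCHEMA = {"comment": {"max_length": 250}, "title": {"max_length": 80}}

# Simplified: assumes the attribute order produced by the render helper above.
FIELD_PATTERN = re.compile(r'<input name="(\w+)" maxlength="(\d+)"')

def test_template_lengths_match_database(template_dir="templates"):
    """Fail if any front-end field allows more characters than its column."""
    for template in Path(template_dir).glob("**/*.html"):
        for name, maxlength in FIELD_PATTERN.findall(template.read_text()):
            if name in SCHEMA:
                assert int(maxlength) <= SCHEMA[name]["max_length"], (
                    f"{template}: field '{name}' allows {maxlength} chars, "
                    f"but the column only stores {SCHEMA[name]['max_length']}"
                )
```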

If automatically detecting errors isn’t an option, change your process to make the error less likely. For example, you might add “double-check lengths of modified UI and database fields” to your definition of “done.” This is the slowest and least reliable option, so balance it against keeping your process simple and lightweight. Sometimes, the likelihood or impact of an error is so small, it’s not worth shielding against it.

Don’t stop with the immediate error. Think about what it says about your team’s development system. For example, if the back-end didn’t validate field lengths, does that mean there’s a problem with validating data in general? If the database wrapper didn’t fail fast, does that mean the database wrapper lacks data integrity checks in general, or that your team needs to be more proactive about failing fast overall?

Some of your “decide what to do” options can include further investigation. For example, you could suggest performing a code review of your database abstractions, or using exploratory testing to investigate data validation.

If you find more problems, it’s worthy of a followup discussion. For example, if you discover that none of your database wrappers fail fast, you could create an Architecture Decision Record (see “Architecture Decision Records” on p.XX) to incrementally add them. Alternatively, depending on how much risk the issue presents, you could prioritize a story to fix all of them at once.
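If you do decide to make the wrappers fail fast, the check itself can be small. This sketch uses invented names and isn’t any particular team’s design; it only shows a wrapper rejecting over-length values instead of silently truncating them:

```python
class FieldTooLong(Exception):
    pass

class TableWrapper:
    """Hypothetical database abstraction that fails fast instead of truncating."""

    def __init__(self, column_lengths):
        self.column_lengths = column_lengths  # e.g., {"comment": 250}

    def save(self, row):
        for column, value in row.items():
            limit = self.column_lengths.get(column)
            if limit is not None and len(value) > limit:
                raise FieldTooLong(
                    f"'{column}' is {len(value)} chars; column holds {limit}"
                )
        self._insert(row)

    def _insert(self, row):
        print(f"INSERT {row}")  # stand-in for the real database call

comments = TableWrapper({"comment": 250})
comments.save({"comment": "short enough"})   # succeeds
# comments.save({"comment": "x" * 500})      # raises FieldTooLong
```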

Close the retrospective

Incident analysis can be intense. Close the retrospective by giving people a chance to take a breath and gently shift back to their regular work. That breath can be metaphorical, or you can literally suggest that people stand up and take a deep breath.

Start by deciding what to keep. A screenshot or picture of the annotated timeline and other artifacts is likely to be useful for future reference. First, invite participants to review the timeline for anything they don’t want shared outside the room. Remove those stickies before taking the picture.

Next, decide who will follow through on your decisions, and how. If you will be producing a report, decide who will participate in writing it.

Finally, wrap up by expressing appreciations to each other for your hard work.3 Explain the exercise and provide an example: “(Name), I appreciate you for (reason).” Sit down and wait. Others will speak up as well. There’s no requirement to speak, but leave plenty of time at the end—a minute or so of silence—because people can take a little while to speak up.

3The “appreciations” activity is based on [Derby and Larsen 2006] (pp. 119-120).

Some people find the “appreciations” activity uncomfortable. An alternative activity is for each participant to take turns saying a few words about how they feel now that the analysis is over. It’s okay to pass.

Afterwards, thank everybody for their participation. Remind them of the Vegas rule (don’t share personal details without permission), and end.

Organizational Learning

Organizations will often require a report about the incident analysis’s conclusions. It’s usually called a post-mortem, although I prefer the more neutral incident report.

In theory, part of the purpose of the incident report is to allow other teams to use what you’ve learned to improve their own development systems. Unfortunately, people tend to dismiss lessons learned by other teams. This is called distancing through differencing. “Those ideas don’t apply to us, because we’re an internally-facing team, not externally-facing.” Or, “we have microservices, not a monolith.” Or, “we work remotely, not in-person.” It’s easy to latch onto superficial differences as a reason to avoid change.

Preventing this distancing is a matter of organizational culture, which puts it out of the scope of this book. Briefly, though, people have the most appetite for learning and change after a major failure. Other than that, I’ve had the most success from making the lessons personal. Show how the lessons affect things your audience cares about.

This is easier in person than with a written document. In practice, I suspect—but don’t know for sure!—that the most effective way to get people to read and apply the lessons from an incident report is to tell a compelling, but concise story. Make the stakes clear from the outset. Describe what happened and allow the mystery to unfold. Describe what you learned about your system and explain how it affects other teams, too. Describe the potential stakes for other teams and summarize what they can do to protect themselves.

Accountability

Another reason organizations want incident reports is to “hold people accountable.” This tends to be misguided at best.

That’s not to say teams shouldn’t be accountable for their work. They should be! And by performing an incident analysis and working on improving their development system, including working with the broader organization to make changes, they are showing accountability.

Searching for someone to blame makes big incidents worse.

Searching for a “single, wringable neck,” in the misguided parlance of Scrum, just encourages deflection and finger-pointing. It may lower the number of reported incidents, but that’s just because people hide problems. The big ones get worse.

“As the incident rate decreases, the fatality rate increases,” reports The Field Guide to Understanding ‘Human Error’, speaking about construction and aviation. [Dekker 2014] (ch. 7) “[T]his supports the importance... of learning from near misses. Suppressing such learning opportunities, at whatever level, and by whatever means, is not just a bad idea. It is dangerous.”

If your organization understands this dynamic, and genuinely wants the team to show how it’s being accountable, you can share what the incident analysis revealed about your development system. (In other words, the final stickies from the “Generate Insights” activity.) You can also share what you decided to do to improve the reliability of your system.

Often, your organization will have an existing report template, and you’ll just have to conform to that expectation. Do your best to avoid presenting a simplistic cause-and-effect view of the situation, and be careful to show how the system, not individuals, allowed problems to turn into failures.

Questions

What if we don’t have time to do a full analysis of every bug and incident?

Incident analysis doesn’t have to be a formal retrospective. You can use the basic structure to explore possibilities informally, with just a few people, or even in the privacy of your own thoughts, in just a few minutes. The core point to remember is that events are symptoms of your underlying development system. They’re clues to teach you how your system works. Start with the facts, discuss how they change your understanding of your development system, and only then think of what to change.

Prerequisites

Ally Safety

Successful incident analysis depends on psychological safety. Unless participants feel safe to share their perspective on what happened, warts and all, you’ll have trouble achieving a deep understanding of your development system.

The broader organization’s approach to incidents has a large impact on participants’ safety. Even companies that pay lip-service to “blameless postmortems” have trouble moving from a simplistic cause-effect view of the world to a systemic view. They tend to think of “blameless” as “not saying who’s to blame,” but to be truly blameless, they need to understand that no one is to blame. Failures and successes are a consequence of a complex system, not specific individuals’ actions.

You can conduct a successful incident analysis in organizations that don’t understand this, but you’ll need to be extra careful to establish ground rules about psychological safety, and ensure people who have a blame-oriented worldview don’t attend. You’ll also need to exercise care to make sure the incident report, if there is one, is written with a systemic view, not a cause-effect view.

Indicators

When you conduct incident analyses well:

  • Incidents are acknowledged and even incidents with no visible impact are analyzed.

  • Team members see the analysis as an opportunity to learn and improve, and even look forward to it.

  • Your system’s resiliency improves over time, resulting in fewer escaped defects and production outages.

Alternatives and Experiments

Many organizations approach incident analysis through the lens of a standard report template. This tends to result in shallow “quick fixes” rather than a systemic view, because people focus on what they want to report rather than studying the whole incident. The format I’ve described will help people expand their perspective before coming to conclusions. Conducting it as a retrospective will also ensure everybody’s voices are heard, and the whole team buys in to the conclusions.

Many of the ideas in this practice are inspired by books from the field of Human Factors and Systems Safety. Those books are concerned with life-and-death decisions, often made under intense time pressure, in fields such as aviation. Software development has different constraints, and some of those transplanted ideas may not apply perfectly.

In particular, the event categories I’ve provided are likely to have room for improvement. I suspect there’s room to split the “knowledge” category into several categories. Don’t just add categories arbitrarily, though. Check out the further reading section and ground your ideas in the underlying theory, first.

The retrospective format I’ve provided has the most room for experimentation. It’s easy to fixate on solutions or simplistic cause-effect thinking during an incident analysis, and the format is designed to avoid this mistake. But it’s just a retrospective. It can be changed. After you’ve conducted several analyses with it, see what you can improve by experimenting with new activities. For example, can you conduct parts of the “Gather Data” stage asynchronously? Are there better ways to analyze the timeline during the “Generate Insights” stage? Can you provide more structure to “Decide What to Do”?

Finally, incident analysis isn’t limited to analyzing incidents. You can also analyze successes. As long as you’re learning about your development system, you’ll achieve the same benefits. Try conducting an analysis of a time when the team succeeded under pressure. Find the events that could have led to failure, and the events that prevented failure from occurring. Discover what that teaches you about your system’s resiliency, and think about how you can amplify that sort of resiliency in the future.

Further Reading

The Field Guide to Understanding ‘Human Error’ [Dekker 2014] is a surprisingly easy read that does a great job of introducing the theory underlying much of this practice.

Behind Human Error [Woods et al. 2010] is a much denser read, but it covers more ground than The Field Guide. If you’re looking for more detail, this is your next step.

The previous two books are based on Human Factors and System Safety research. The website learningfromincidents.io is dedicated to bringing those ideas to software development. At the time of this writing, it’s fairly thin, but its heart is in the right place. I’m including it in the hopes that it will have more material by the time you read this.

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

