A few thoughts on Incident Response

June 23rd, 2016

[Image: ready to push the big red button]

Every now and then, the internet is set on fire. All too often when this happens -- and this does happen all too often -- we scramble, wave our arms in the air and enter panic mode, trying to patch systems left and right until eventually the excitement wears off and we continue with our day to day routine.

Unfortunately, this does not lead to an elimination of the given attack vector, and the next time a major vulnerability in a widely used library drops, we rinse and repeat.

A formal Incident Response Process can help reduce stress, improve the efficiency of your limited resources, and help yield actual results in keeping your users and data secure. The following is an outline of an incident response process as I might design it; it is based on some of my experiences and observations at different companies but does not reflect any one employer's specific process.


Table of Contents

  • Incident Types
  • Documentation
  • Communication
  • Issue Severity
  • Running an incident
    • Incident Lead
    • Master Ticket
    • Incident Chat Channel
    • Timeline Document
    • Incident Information Document
    • Incident Notifications
    • Incident Analysis
    • Information Dispatch
    • Follow-up and Resolution
    • Post Mortem


Incident Types

Within this document, I will focus on "major" incidents. The nature of computer security requires our Incident Response Process to be adaptive, since minor incidents may evolve to become major incidents as our understanding of the impact changes. As a result, your Incident Response Process should be applicable (and be followed!) for "minor" incidents as well.

Typical examples of "major" incidents include:

  • discovery and/or disclosure of a vulnerability in a widely used piece of software or library; e.g. Heartbleed, Shellshock, ImageTragick
  • identification of a new attack vector or protocol vulnerability; e.g. POODLE, Logjam
  • identification of an internal compromise or attack in progress

Some of these events are known a priori, such as by way of responsible disclosure through your Bug Bounty program or within the community; some of these events hit us without advance notice, such as the sudden disclosure of a 0-day vulnerability or an immediate alert condition (most frequently: a human going "huh, that's weird").


Documentation

Your Incident Response Process needs to be documented and accessible for everybody involved. Ensure that all participants know where to find it, be that on your wiki, as a formal policy document, a shared Google doc, or whatever works best for your organization. This document is the place people will go to during an incident, in a time of high stress and tremendous pressure, so it needs to be clearly written, easy to find and read, and properly linked.

Your Incident Response Process should follow a runbook style, allowing incident responders to walk a simple decision tree and execute the required steps. You may be able to automate many of the tasks involved, which has the added benefit that they are executed reliably and no steps are missed.
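
Many of these steps lend themselves to a simple script. The following is a minimal sketch of what such a scripted runbook might look like; the step names and their (empty) bodies are hypothetical placeholders, not part of any specific process:

    #!/usr/bin/env python3
    # Minimal runbook sketch: each step is a function, executed in order and
    # logged with a UTC timestamp.  The steps below are hypothetical
    # placeholders; fill them in with your organization's actual tasks.

    from datetime import datetime, timezone

    def identify_incident_lead():
        ...

    def create_master_ticket():
        ...

    def create_timeline_document():
        ...

    STEPS = [identify_incident_lead, create_master_ticket, create_timeline_document]

    def run():
        for step in STEPS:
            stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
            print(f"{stamp}: running {step.__name__}")
            step()  # a failure here should page a human, not be silently skipped

    if __name__ == "__main__":
        run()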


Communication

Incident Response is divided into the following steps:

  • Incident Identification
  • Incident Analysis
  • Information Dispatch
  • Follow-up and Resolution

It's critical to understand (and, in execution, remember) that the primary task of the Incident Response team is not to complete these tasks, but to coordinate them. As such, efficient communication between the IR team and other teams and individuals is crucial.

Communications amongst participants happen synchronously and in real-time (e.g. via IRC or some other online chat, face to face, or over the phone / video chat) as well as asynchronously (e.g. via email). Communications may be one-to-many (e.g. announcements), one-to-one (e.g. notification or dialog amongst individuals), or many-to-many (e.g. discussions); they may be confidential (initial disclosure or impact analysis), semi-confidential (internal discussions amongst different teams or organizations), internal-open (announcements to your company or organization at large), or public (on the internet or within public communities).

For each of these different types of communication, you will need a suitable channel. People will contact your IR team in a variety of ways. Not all alerts or disclosures necessarily trigger an incident and not all incidents are of equal importance or urgency. Your incident intake might be divided into:

  • an internal-open mailing list

    Such a list should be open for discussions of any security-relevant matter and be separate from your overall information security team's list: information security cannot exist in a vacuum, and many subject matter experts (SMEs), although not formally declared members of your information security team, have interests and insights in security topics.

    You should encourage an open forum for discussions to which non-experts and experts alike can contribute or ask questions. Somebody forwarding a security announcement from a software project they follow may trigger an incident.

    This list is also useful for your team to disseminate information to those who are interested and may be able to help, without spamming the entire organization or company.

    Your IR team should pay attention to this list, although no formal response is required, and no SLA for follow-up is needed.
  • an internal-open chat channel

    Offering your company or organization a method to engage your team synchronously -- or asynchronously with a reasonable delay -- by online chat is a good idea. This allows internal reporters to verify an issue or determine where to best report it.

    Your IR team should pay attention to this channel, although no formal response or SLA is necessary.
  • an internal-closed mailing list

    This list is intended for people within your organization to let you know about an issue or ask for your input. It should be closed to people outside of your team, to allow for internal discussions.

    As a method of contacting your team, this list should be monitored and responded to within a reasonable amount of time. Any incoming mail that is not marked by the sender as "FYI only" (or similar) should receive an explicit response.
  • an on-call rotation

    For certain incidents, an immediate response is necessary. For this, you should have an on-call rotation with a well-defined and quick SLA as well as a suitable escalation process.

This list of contact methods is in ascending order of priority and your on-call staff should track and respond to incoming requests appropriately.

In addition to the above, you also need to provide a method for people from outside the company to engage your team. This, however, goes a bit beyond the scope of just Incident Response, and may take the form of a Bug Bounty program, participation in community discussion forums or disclosure lists (such as Operations Security Trust), a public contact address with a public PGP key tied to it, or a variety of other possibilities.

Within this document, let us assume that incident intake begins with somebody within the company, regardless of how they became aware of the issue or by which channel they were notified.

It is important that everybody within the company knows how and where to report any security issues they encounter. You need to make sure that your contact information and engagement process is clearly spelled out and easy to find.


Issue Severity

Upon notification or discovery of an issue, the first task of the IR staff is to identify and classify the incident. For this, your team may need to consult with SMEs on the affected piece of software and the extent to which it applies to your infrastructure and software stack, with your larger infosec team on the perceived impact, with your internal red team on exploitability, and with your compliance team for input on any legal obligations resulting from the given issue.

Issue severity classification is a black art all by itself, requiring intimate understanding of all these factors. One factor is whether the vulnerability or issue is publicly known. Another is whether it is actively being exploited, believed to be practically exploitable, or unlikely to be exploited. For example, a vulnerability found to have been present for years and disclosed without advance notification to the software vendors should be assumed to be actively exploited, while a responsibly disclosed vulnerability that requires extraordinary capabilities may well not be considered as such.

All this therefore requires a reasonably accurate and realistic Threat Model. But you need more than just a designation based on vulnerability type: it is a common mistake to declare e.g. all Remote Code Execution (RCE) vulnerabilities to be of the highest priority, regardless of what might be exposed from the vulnerable system. Severity should include a combined scoring of at least the following factors (a minimal scoring sketch follows this list):

  • vulnerability type
  • exploitability and attack vector
  • public disclosure of the vulnerability
  • exposure of the systems in question
  • whether or not the same vulnerability has repeatedly been exploited
  • publicity damage
  • data at risk
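
For illustration only, a combined score might be computed along the following lines; the factor weights, ratings, and thresholds below are made-up examples, not a standard scoring model:

    # Illustrative sketch: combine several severity factors into one score.
    # Weights, ratings, and thresholds are made-up examples, not a standard.

    WEIGHTS = {
        "vulnerability_type":  3,   # e.g. RCE rates higher than an info leak
        "exploitability":      3,   # exploited in the wild vs. theoretical
        "public_disclosure":   1,
        "system_exposure":     3,   # internet-facing vs. internal only
        "repeat_exploitation": 1,
        "publicity_damage":    1,
        "data_at_risk":        3,   # user data vs. nothing sensitive
    }

    def severity(ratings):
        """ratings maps each factor name to a 0..3 rating for this incident."""
        score = sum(WEIGHTS[f] * ratings.get(f, 0) for f in WEIGHTS)
        if score >= 30:
            return "critical"
        if score >= 15:
            return "high"
        return "moderate"

    # Example: an internet-facing RCE with a public exploit and user data at risk
    print(severity({"vulnerability_type": 3, "exploitability": 3,
                    "public_disclosure": 1, "system_exposure": 3,
                    "data_at_risk": 3}))   # -> "critical"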

When identifying the severity of an issue, the team also needs to determine to what extent they are able to further disclose the vulnerability. One useful method of identifying with whom sensitive information may be shared is the Traffic Light Protocol or TLP. Considering and abiding by TLP designations even within your organization or your team(s) is critical to avoid accidental exposure of confidential information.


Running an incident

When processing the intake, your team may decide to treat a given issue as an incident. This decision may at times be based on incomplete information, such as a security pre-announcement.

At this point, your team should start the formal Incident Response Process, following the process outline you have described in your policy document.

The first steps of tracking an incident should include the identification of an incident lead, creation of a master or parent ticket, an incident chat channel, a timeline document, as well as an incident information document or wiki page. The order in which these are created does not matter, but all should be part of the list of items to check off as incident tracking begins:

Incident Lead

Even though incident response is a team effort and requires the collaboration of many individuals across different organizations, it is useful to identify one primary person to coordinate the incident. We will refer to this person as the Incident Lead. This person has the responsibility to ensure that the incident is tracked and the Incident Response Process is followed.

The Incident Lead may be one of the first responders or a more senior analyst, but she should be identified and involved early on.

Note: the Incident Lead is not responsible for doing all the work, but for making sure it gets done. That is, she needs to help coordinate the research, own the timeline document, review the documentation, and ensure updates and notifications are sent to the appropriate parties.

Note: first and primary responders may change through the course of an incident. Work may be coordinated and driven to completion by other individuals or teams, but in the end it is the responsibility of the Incident Lead to own the final resolution.

Master Ticket

This ticket should track all work relating to the incident. It is, by nature, a parent ticket, primarily used to provide a terse summary of the issue, include links to more information, and to link any and all other tickets or problem reports defining outstanding work within your company or organization.

Ideally, you will tag all tickets with a unique incident identifier (such as a CVE number, if available) and use an automated script to correlate and link tickets to the master ticket.
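
A minimal sketch of such a script might look like the following; the tracker URL, endpoints, and payload fields are hypothetical and would need to be adapted to your ticketing system's actual API:

    # Sketch: find all tickets tagged with the incident identifier and link
    # them to the master ticket.  Endpoints and payloads are hypothetical.

    import requests

    TRACKER = "https://tickets.example.com/api"   # hypothetical tracker API
    INCIDENT_TAG = "CVE-2016-XXXX"                # your incident identifier
    MASTER_TICKET = "SEC-1234"                    # hypothetical master ticket

    def link_children_to_master(session):
        tagged = session.get(f"{TRACKER}/search", params={"tag": INCIDENT_TAG}).json()
        for ticket in tagged["tickets"]:
            if ticket["id"] == MASTER_TICKET:
                continue
            session.post(f"{TRACKER}/tickets/{MASTER_TICKET}/links",
                         json={"child": ticket["id"], "type": "relates-to"})

    if __name__ == "__main__":
        with requests.Session() as s:
            s.headers["Authorization"] = "Bearer ..."   # credentials elided
            link_children_to_master(s)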

The Incident Lead should own the master ticket and only close it after all action items have been resolved and verified.

Incident Chat Channel

To ease discussions around a given incident, I recommend creating a dedicated internal-open chat channel early on in the incident response process. This will be the easiest way for people within your company or organization to ask for feedback, or for your team to coordinate resolution with different teams.

Unless the incident is classified as TLP Amber, you should open this channel to all members of your organization. The channel should be logged, and the title of the channel should be set to include the vulnerability or incident identifier (e.g. CVE number) as well as a link to the incident information document.

Timeline Document

As soon as the Incident Response Process starts, create a timeline document. This document will track important events and will be critical to help you analyze your response process in your post-mortem.

All too often, timelines are reconstructed after the incident. This necessarily leads to incongruities and misleading data, as either information is simply not available (any longer) or is (unintentionally and/or unconsciously) recorded incorrectly. This is why it's important to begin the timeline document early on and to continue to update it throughout the incident.

When creating a timeline document and recording events, you should:

  • use UTC for all dates and times

    All your systems should report times in UTC already. In most cases you will track events across multiple time zones, data centers, and office locations, and announcements are made and information is released by different parties on the internet.
  • use RFC3339 / ISO8601 time stamps

    Incidents span days, weeks, months, and may even cross year boundaries; different cultures may use different date conventions. To avoid any confusion, use a single, unambiguous date format with a 24-hour clock and easily sortable date stamps. The following date(1) invocation produces the preferred format:

    $ TZ=UTC date +%Y-%m-%dT%H:%M:%SZ
    2016-06-21T20:39:33Z
    $ 
  • use internal usernames to identify individuals unambiguously; full names, titles, team or function are only necessary to clarify engagement or role of the participants
  • record time and method of any advance notification or previously embargoed disclosure
  • record time and method of notification and/or escalation
  • record time and method of engagement with vendors or third parties
  • record time and method of major findings, vendor announcements (private or public), software release dates
  • record time of resolution for individual components, services etc.
  • record final time of overall resolution
  • record date and time of scheduled post mortem

This timeline document will be updated throughout the incident, and should be editable by all incident responders or participants. A shared Google doc or a plain text document under revision control is preferred, to ensure that changes can be tracked.

The basic structure of this document might be:

Incident Identifier and terse summary

Link to Incident Information Document

Link to Master Ticket

Link to Incident Chat Logs

YYYY-MM-DDTHH:MM:SSZ: first embargoed disclosure of vulnerability FOO via channel
YYYY-MM-DDTHH:MM:SSZ: incident identified; Incident Lead: alice@
YYYY-MM-DDTHH:MM:SSZ: incident master ticket created: link
YYYY-MM-DDTHH:MM:SSZ: first public disclosure via channel
YYYY-MM-DDTHH:MM:SSZ: service SMEs bob@, jdoe@, jane@ notified via channel
YYYY-MM-DDTHH:MM:SSZ: workaround identified
YYYY-MM-DDTHH:MM:SSZ: list of vulnerable systems identified
YYYY-MM-DDTHH:MM:SSZ: vendor X contacted via channel
YYYY-MM-DDTHH:MM:SSZ: service Y reported as patched by fritz@
YYYY-MM-DDTHH:MM:SSZ: vendor X publishes response via channel
YYYY-MM-DDTHH:MM:SSZ: all vulnerable systems patched; follow-up actions A, B, C identified: link
YYYY-MM-DDTHH:MM:SSZ: follow-up actions completed
YYYY-MM-DDTHH:MM:SSZ: post-mortem scheduled for YYYY-MM-DDTHH:MM:SSZ
YYYY-MM-DDTHH:MM:SSZ: incident closed by incident lead

Incident Information Document

You will need to collect a lot of information, answer many questions, and make sure your organization can read up on the best methods to address a given vulnerability. To collect this information, you should create an Incident Information Document early on. It will necessarily be incomplete in the beginning, so update it throughout the process.

This will be your go-to document. It should provide all the important information around the incident, the vulnerabilities in question, the work-arounds and solutions as well as answers to the most commonly asked questions.

The basic structure of this document might be:

One-sentence summary, including the incident identifier, and the name of or short link to this page.

Summary

A high-level description of the issue, links to public advisories (if any), summarized analysis and additional information.

Vulnerabilities

Breakdown of vulnerabilities (if more than one) by CVE identifier and terse summary of impact on your infrastructure.

Attack Vectors

Description of how the vulnerabilities might be exploited and what capabilities attackers need. Identify specifically how your infrastructure is at risk here.

Vulnerable Systems

Description of which systems/libraries/frameworks are vulnerable specifically; link to vendor announcements for each if applicable.

PoC / verification steps

Steps to identify a system as being vulnerable / affected, or to determine that a system is safe.

Remediation and mitigation

A description of what measures owners should take to fully remediate the issue (e.g. "upgrade to libfoo-1234, then restart service foo").

After this, include a description of what measures owners can take to mitigate the vulnerability if a full resolution is not possible or feasible, and note any required follow-up actions.

FAQ

Specific questions will come up from within your organization. If they cannot be answered by updating the above items, explicitly note them here.

Include pointers to your contact addresses and discussion list or channel.

Incident Notifications

As you classify the incident, determine who needs to be contacted to help identify impact and risk. Establish which teams are needed to help fix the problem. You should have at hand a list of SMEs for the most common issues (TLS and cryptographic protocols, your serving stack, your primary languages and frameworks, etc.); consult with your red team on the analysis of the attack vectors and realistic exploitability.

It's important to remember that in some cases, depending on which data may be at risk, notifications up the chain may be in order. Your Incident Response Process document should have clear guidelines on when to contact your CISO, and whether she ought to further escalate or notify executive staff.

In addition, publicly disclosed issues with major industry-wide impact (think DROWN, POODLE, Shellshock, Heartbleed, ...) may require you to give your PR team a heads-up, as your company or organization may come under external scrutiny and press inquiries may be expected. When something like this strikes, it's also useful to send out a quick note to your organization at large.

Incident Analysis

As the incident response process starts, a more detailed analysis takes place. During this time, your team will update the timeline as well as the information documents as needed. The analysis includes:

  • identification / development of a proof of concept (PoC) for the vulnerability

    This may include external services or in-house development. Sometimes a PoC may not be available or may be too hard to develop, but do not shy away from spending resources on this step: having a reliable PoC will help tremendously in assessing your attack surface as well as help others within your organization resolve their tickets.
  • identification / development of a test method

    Beyond the PoC, additional tools may be needed to determine the risk on a specific system or service, accounting for mitigating factors or testing for contributing elements only.
  • determination of a method to identify all affected systems

    In the case of a software library update, this might include a check in your inventory database or a scan of all systems. You may need to write a configuration management module to report identifying information, or you may use either of the tools mentioned above to establish the footprint of your vulnerable systems (a minimal sketch follows this list).
  • determination of methods to mitigate the issue

    Even though a fix may be possible, it may not be feasible to roll out (quickly), or it may not yet be available. Identify any methods to reduce the severity and limit the impact of the vulnerability.
  • identification or development of actual fixes

    This may include the identification of package versions to be upgraded or the development of patches in-house. In your Incident Information Document, link to where vendor supplied fixes will become available and create a clear, concise description of how to completely fix the issue.
  • identify SMEs and product leads in the affected products or organizations and provide them with a heads-up and required information

For each of these items, remember to note relevant details in the timeline or incident information documents.
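
As an example of identifying affected systems from inventory data, a minimal sketch might look like this; the inventory format, file name, package name, and fixed version are hypothetical:

    # Sketch: flag potentially vulnerable hosts by comparing the installed
    # version of a package (as recorded in inventory) against the first
    # fixed version.  Inventory format and package name are hypothetical.

    import json

    FIXED_VERSION = (1, 0, 2)    # first non-vulnerable version of "libfoo"

    def parse_version(v):
        return tuple(int(part) for part in v.split("."))

    def vulnerable_hosts(inventory_path="inventory.json"):
        with open(inventory_path) as f:
            inventory = json.load(f)   # e.g. {"host1": {"libfoo": "1.0.1"}, ...}
        for host, packages in inventory.items():
            version = packages.get("libfoo")
            if version and parse_version(version) < FIXED_VERSION:
                yield host, version

    if __name__ == "__main__":
        for host, version in vulnerable_hosts():
            print(f"{host}: libfoo {version} needs upgrading")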

Information Dispatch

While running the incident, the primary responsibility of the Incident Response team is to dispatch information about the incident. This includes notifying system or software owners of the vulnerability. For this, you need a comprehensive and accurate inventory database mapping systems and software components to the teams responsible for their maintenance.

The notification of work items most commonly takes the form of tracking tickets. These should:

  • explicitly identify the vulnerable systems under purview of the given team
  • include the incident identifier, a terse summary of the impact, a link to the incident information document, and the expected action to be taken
  • be tagged and linked to the master ticket
  • have an expected SLA
  • have a human owner responsible for the resolution; this may be the team's lead or an SME, but should not be a mailing list, bot, or other, non-human owner

Ticket creation likely requires some automation. As the incident progresses, you may have to adjust priorities or SLAs, guide owners to work-arounds, and identify follow-up actions. This is the long tail of the Incident Response Process, and care must be taken that work-arounds, fixes, and updates are verified and follow-up actions correctly classified and tracked.
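
A minimal sketch of such automation might look like the following; the tracker API, owner mapping, and field names are hypothetical placeholders:

    # Sketch: file one tracking ticket per owning team for its vulnerable
    # systems.  Tracker API, owner mapping, and fields are hypothetical.

    import requests

    TRACKER = "https://tickets.example.com/api"         # hypothetical tracker API
    INCIDENT_TAG = "CVE-2016-XXXX"
    MASTER_TICKET = "SEC-1234"
    INFO_DOC = "https://wiki.example.com/incident-foo"  # incident information document

    OWNERS = {"web": "alice", "infra": "bob"}           # hypothetical team -> lead mapping

    def team_lead(team):
        # a human owner, never a mailing list or bot
        return OWNERS.get(team, "security-oncall")

    def file_tickets(session, hosts_by_team, sla_days=7):
        for team, hosts in hosts_by_team.items():
            session.post(f"{TRACKER}/tickets", json={
                "assignee": team_lead(team),
                "tags": [INCIDENT_TAG],
                "summary": f"[{INCIDENT_TAG}] patch vulnerable systems ({team})",
                "description": f"Affected hosts: {', '.join(sorted(hosts))}\n"
                               f"Details and remediation: {INFO_DOC}",
                "due_in_days": sla_days,
                "parent": MASTER_TICKET,
            })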

A dashboard that shows the number and status of relevant tickets by organizational leader (e.g. by VP) can be useful in pushing for traction as well as in illustrating and understanding your attack surface and vulnerability.
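
A trivial sketch of the aggregation behind such a dashboard, assuming hypothetical ticket fields and a lookup from assignee to VP:

    # Sketch: summarize outstanding incident tickets per organizational leader.
    # The ticket fields and the assignee -> VP lookup are hypothetical.

    from collections import Counter

    def open_tickets_by_vp(tickets, vp_of):
        """tickets: iterable of dicts with 'assignee' and 'status' fields;
        vp_of: maps an assignee to their VP."""
        counts = Counter(vp_of[t["assignee"]]
                         for t in tickets if t["status"] != "closed")
        for vp, remaining in counts.most_common():
            print(f"{vp}: {remaining} open tickets")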

The other important means of communication around the incident include the notification of the different audiences we identified earlier. You might consider notifying:

  • your internal-open mailing list:
    • upon initial discovery / disclosure of the issue (unless restricted by an embargo / TLP)
    • when important new findings are discovered (new attack vectors, additional related vulnerabilities, the severity of the impact changes, important vendor announcements are made, ...)
    • periodic updates as major components are fixed
    • when the issue is resolved and a post-mortem is scheduled
    • if a retrospective or findings after the post-mortem are published
  • your developer-, devops-, ops-, etc. communities:
    • when the general scope is known (i.e. as a heads-up: "we'll need action from you")
    • when specific action items have been identified
    • when the issue is resolved and a post-mortem is scheduled
  • leaders or executives up the chain (sending these to your CISO and letting her make the decision as to when to further escalate may be an option):
    • if major new developments become public
    • if resources need to be mobilized company-wide to address the issue
    • if e.g. end-user data is exposed and your PR and Legal departments might need to get involved
    • when the issue is resolved and a post-mortem is scheduled
  • partners, vendors, mergers and acquisitions not covered in your general processes:
    • establish a list of security contacts in the different groups / domains
    • establish chat options with your partners
  • informal contacts in the PR and Legal departments:
    • if the issue is public and press inquiries are likely
    • if you expect a press release to be necessary; your team needs to help PR draft a correct message
  • company wide:
    • if the issue is being discussed in the press
    • if the issue affects people outside their professional affiliation with the company; provide some help on how they can protect themselves and what they should look for in the products they use privately. Examples: Heartbleed, POODLE

Follow-up and Resolution

Throughout the incident, individual tasks will be marked as completed (e.g. by closing a ticket), and it is the Incident Response team's responsibility to verify that they were completed correctly and do not require any follow-up actions or to schedule any additional work items that may be needed.

For example, if a code injection vulnerability is found in a given piece of software, then it may be immediately acceptable to e.g. remove the software from the exposed system, or to restrict access to the system. But this is not sufficient to fully resolve the issue: all too frequently, vulnerable packages are kept available and systems get resurrected or reimaged with the vulnerable software. A suitable follow-up item would then be to track the elimination of the vulnerable software version from your repository altogether.

Post Mortem

Major incidents require a post-mortem to allow your team(s) to review and learn how to improve the process. Having a complete timeline and a comprehensive incident information document is critical here.

Post-mortems should be scheduled soon after the incident has been completed. Due to the long tail in resolving all follow-up work, the post-mortem may frequently be held even before the full incident is marked as resolved, i.e. while follow-up actions are still outstanding.

During the post-mortem, the Incident Response Lead presents the detailed timeline of the incident, and participants help fill in any gaps. The primary goal is to create a meaningful description of the event, to identify any missed remediation actions or follow-up tasks, and to help refine the process.

Ideally, the flow of the post-mortem would follow this outline:

  • brief description of the vulnerability
  • brief summary of the impact on your organization or company
  • presentation of the incident timeline
  • presentation of high-level dashboards of outstanding work
  • list of follow-up items identified to help improve the Incident Response Process

Follow-up items to help improve the Incident Response Process should also be tracked in your ticketing system, be assigned to specific humans, and have a due date.

Post-mortems should be open to anybody within your company to observe (so long as no restricted or highly confidential information is disclosed), although it's important to avoid getting distracted or derailed in lengthy discussions about future prevention of similar incidents: when these occur, take note of the suggestions, create a ticket for somebody to follow up on or review, and move on.

Lastly, the final post mortem document should be linked to from the Incident Information document.


Incident Response is a hard, laborious, tedious, and frequently thankless job. All too often, we don't know whether we're making any progress, and tracking incidents across a diverse and large infrastructure may feel like an attempt to boil the proverbial ocean.

Having a formal Incident Response Process that responders can adhere to step-by-step may help in assuring that nothing is overlooked, but in the end it is the reflective step of analyzing our own responses that can help us make the biggest impact.

My sincere condolences (that is: thanks) go out to all Incident Responders trying to ensure that their systems are kept (or once again made) safe. It really is a dirty job, but somebody surely has got to do it.
