Scaling Incident Response At Affinity

Rohan Sahai
Affinity
6 min read · Jan 29, 2020


It’s 3pm. You’re surveying all your snack options because it’s that time of day. You feel your phone vibrate, take a peek — oh shit, “@channel the site is down right now”.

You start sprinting back to your desk. This is the moment you’ve been training your whole life for. You put on your firefighter hat, check the logs, and start furiously searching Stack Overflow to find out what the hell an “nginx 457 your server has escaped our time space consortium” means.

You have an idea of what’s going on and start a thread with the sales rep on your team who reported the issue. Meanwhile, you overhear the head of support talking to your manager about the issue. Your manager is giving the head of support info that is totally different from what you found. You ask her where she heard that, and she points to another engineer who knows their shit when it comes to nginx and the 4th dimension. Oh no: you told the sales rep the wrong thing, and now there’s misinformation spreading internally and maybe externally.

It’s 3:25pm and the issue is at least fixed for now. You talk to the other engineer about a long-term fix, and it sounds like they are taking care of it, so that’s good. You both hop into a 3:30 meeting about another important project with a potential customer. At 3:45 the issue pops up again; the customer support reps try to find other engineers with context on the issue but have no luck. Finally, after 15 minutes of other engineers with less context fumbling around, you both are out of the meeting and patch it up again. Ok, this time it’s fixed for real — turns out you didn’t exit vim correctly.

One week later, 3pm: “@channel the site is down right now”. You find the other engineer who worked on this issue with you. They thought you were handling the long-term fix; you thought they were. The rest of the company is second-guessing whether the engineering team knows its head from its toes.

Communication breakdown!! Even though the root cause of the above issue was presumably some technical bug, most of the disaster actually had to do with coordination and communication. At Affinity, as we scaled from a team that could all huddle around one engineer’s laptop to grill him or her about what had happened and why, to a team large enough that {insert food delivery service here} couldn’t possibly get all the lunch orders right, we quickly learned that the on-the-fly technical solutions are usually the easy part of incident response, and everything else is hard — and gets harder with scale.

Luckily, over time at Affinity we’ve made sure to be just as retrospective about our firefighting process as about the technical issues causing the fires under the hood. As a result, we’ve come up with a playbook, or list of suggested actions, that has worked well for us in critical times. Most of the line items are extremely simple, like determining a response lead on engineering and support, or creating a dedicated Slack channel for the issue, but little things like that go a long way in preventing crossed wires, spreading misinformation, and general confusion among those following the issue peripherally.

Note — we’re currently just under 100 people at Affinity. I’m sure some of our items here would have been overkill when we were 10 people, and will probably need to be refined once we are 200, but hopefully this simple list is helpful for other growing startups out there!

Section 1: Coordinate, Communicate, and Fix

Determine Engineering Response Lead & Support Lead

A clear point person on each relevant team avoids scenarios where internal stakeholders reach out to different people for information, which can lead to differing accounts of the incident spreading throughout the company. Setting a clear engineering response lead early on also gives the engineers doing the more technical firefighting time to just focus on fixing the issue at hand. The person we pick as engineering response lead also tackles the rest of the items in this playbook.

Create a Slack channel to coordinate the response

Similar to above — a centralized place for discussion avoids creating more avenues for repeat explanations and conflicting accounts of the issue at hand. We really like creating a dedicated Slack channel for each incident at Affinity (there’s a small scripting sketch after the list below) so that:

  • Anyone can optionally follow along.
  • We don’t end up clogging up other channels that are dedicated to other topics.
  • It’s easy to look back and create a timeline of events when we run the incident retrospective.
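Here’s the promised sketch of scripting this step, using Slack’s Web API via the slack_sdk Python package; the token, channel name, and topic are placeholders, and in practice you’d want some error handling around it:

```python
# Minimal sketch: create a dedicated incident channel and give it a topic.
# Assumes the slack_sdk package and a bot token with the channels:manage scope;
# the channel name and topic below are placeholders following a date-based convention.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

response = client.conversations_create(name="partial-outage-jan-22-2020")
channel_id = response["channel"]["id"]

client.conversations_setTopic(
    channel=channel_id,
    topic="Coordinating the response to the Jan 22 partial outage. Follow along here.",
)
```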

Post in the #psa-urgent channel in Slack

We have a dedicated urgent channel that the whole company is required to be in. This is a good way to notify everyone that an issue exists, that we are aware of it, and that we are working on a solution. We generally post something like “There is currently a partial site outage, follow along in #partial-outage-jan-22-2020 for more granular updates”.
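The PSA itself can also be scripted or templated so the wording stays calm and consistent even when the person typing it isn’t. Another small sketch, again assuming slack_sdk, with placeholder names:

```python
# Minimal sketch: announce the incident in the company-wide urgent channel.
# Assumes slack_sdk, a bot token with the chat:write scope, and that the bot
# is a member of #psa-urgent; channel names and wording are placeholders.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

client.chat_postMessage(
    channel="#psa-urgent",
    text=(
        "There is currently a partial site outage. We're aware of it and working on a fix. "
        "Follow along in #partial-outage-jan-22-2020 for more granular updates."
    ),
)
```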

Cancel ongoing & upcoming meetings for response team

This lifts a weight off the response team’s back. No one likes putting out a fire while also thinking about the fact that they have somewhere to be soon. Leaving meetings on the calendar can also lead to situations where the response team changes members halfway through the incident, and the new members don’t have enough context to effectively take over.

Huddle with all stakeholders to share state & plan response

A quick sync with stakeholders is important so everyone is aligned on priorities. An engineer may assume they should patch the issue and redeploy because reverting the changes may cause some data loss, while someone else may point out that we can write a quick script to save the relevant data and then revert. There are often many different directions to take, and tradeoffs to weigh, when trying to put out a fire. This sync can save people from spending too much time on the wrong solution.

Create an incident report on our status page and keep it up to date as we work on fixing the issue

Not only is a status page useful for customers trying to understand why your product may not be working as expected, but it’s also great for the internal support team: knowing exactly when issues were occurring tells them which bug tickets are likely related to the incident and which are not.
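Most hosted status pages expose an API for this, so opening the incident can be one more step in the same script instead of something someone has to remember in the moment. The sketch below assumes Atlassian Statuspage’s REST API; the page ID, API key, and wording are placeholders, and other providers will look different:

```python
# Minimal sketch: open an incident on a hosted status page.
# Assumes Atlassian Statuspage's REST API; the page ID, API key, and incident
# fields are placeholders. Update the incident's status as the response progresses.
import os
import requests

PAGE_ID = "your_page_id"
API_KEY = os.environ["STATUSPAGE_API_KEY"]

response = requests.post(
    f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents",
    headers={"Authorization": f"OAuth {API_KEY}"},
    json={
        "incident": {
            "name": "Partial site outage",
            "status": "investigating",
            "body": "We're investigating elevated error rates for some users.",
        }
    },
    timeout=10,
)
response.raise_for_status()
```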

Communicate scope of impact and projected resolution timeline to Support Lead

There are usually two major things the customer facing teams care about:

  • When can we expect a resolution?
  • How many people are affected?

Communicating both of these as you understand more about the issue is extremely important in enabling those teams to do their jobs to the best of their ability.

Section 2: Regroup, Learn, and Make Future Improvements

Run a retrospective meeting

Retrospective meetings probably deserve a blog post on their own, but here is a quick rundown on how our retrospective meetings happen at Affinity:

  • The meeting should happen within a week or so of the incident.
  • The meeting should include major stakeholders.
  • Before the meeting, the response lead should create a document with an overview of what happened, the scope of impact, and a timeline of all the events from when the incident was reported to when it was resolved (the incident Slack channel makes this easy to reconstruct; see the sketch after this list).
  • In the meeting itself, attendees should take a few minutes to read the timeline to refresh themselves on the incident, and then brainstorm: What went well? What didn’t go well? And what should we do next time?
  • It never hurts to emphasize blamelessness in these meetings, especially if your meeting will include people who are new to the process.
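One nice side effect of the dedicated incident channel is that the timeline mostly writes itself. Here’s a rough sketch of pulling it out with slack_sdk; the channel ID is a placeholder, and pagination and threads are ignored for brevity:

```python
# Rough sketch: dump a chronological, timestamped timeline from the incident channel.
# Assumes slack_sdk and a bot token with the channels:history scope;
# the channel ID is a placeholder, and pagination/threads are ignored for brevity.
import os
from datetime import datetime, timezone
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

history = client.conversations_history(channel="C0INCIDENT1")

# Slack returns messages newest-first, so reverse them for a chronological timeline.
for message in reversed(history["messages"]):
    ts = datetime.fromtimestamp(float(message["ts"]), tz=timezone.utc)
    print(f"{ts:%Y-%m-%d %H:%M} UTC  {message.get('text', '')}")
```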

Share notes from retro with the company

Transparency around major incidents goes a long way in building trust across departments. One department may start resenting another if they don’t have any insight into what happened and what will be done differently in the future. Sharing the retro notes with the whole company is a great way to address this.

We keep this playbook in the form of an Asana project template, so when an issue comes up we can easily copy it and complete/assign tasks as needed.

Anything you would add? Disagree with? Share your thoughts in the comments section!
