
AWS us-east-1 outage

source link: https://news.ycombinator.com/item?id=29473630


Looks like they've acknowledged it on the status page now. https://status.aws.amazon.com/

> 8:22 AM PST We are investigating increased error rates for the AWS Management Console.

> 8:26 AM PST We are experiencing API and console issues in the US-EAST-1 Region. We have identified root cause and we are actively working towards recovery. This issue is affecting the global console landing page, which is also hosted in US-EAST-1. Customers may be able to access region-specific consoles going to https://console.aws.amazon.com/. So, to access the US-WEST-2 console, try https://us-west-2.console.aws.amazon.com/

> This issue is affecting the global console landing page, which is also hosted in US-EAST-1

Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?

At a different (unnamed) FAANG, we considered it unacceptable to have anything depend on a single region. Even the dinky little volunteer-run thing which ran https://internal.site.example/~someEngineer was expected to be multi-region, and was, because there was enough infrastructure for making things multi-region that it was usually pretty easy.

Every damn Well-Architected Framework includes multi-AZ if not multi-region redundancy, and yet the single access point for their millions of customers is single-region. Facepalm in the form of $100Ms in service credits.
> Facepalm in the form of $100Ms in service credits.

It was also greatly affecting Amazon.com itself. I kept getting sporadic 404 pages, and one was during a purchase. Purchase history wasn't showing the product as purchased and I didn't receive an email, so I repurchased. Still no email; this time the purchase didn't end in a 404, but the product still didn't show up in my purchase history. I have no idea if I purchased anything or not. I have never had an issue purchasing. Normally I get a confirmation email within 2 or so minutes and the sale is immediately reflected in purchase history. I was unaware of the greater problem at that moment or I would have steered clear at the first 404.

Oh no... I think you may be in for a rough time, because I purchased something this morning and it only popped up in my orders list a few minutes ago.
They're also unable to refund Kindle book orders via their website. The "Request a refund" page has a 500 error, so they fall back to letting you request a call from a customer service rep. Initiating this request also fails, so they then fall back to showing a 1-888 number that the customer can call. Of course, when I tried to call, I got "All circuits are busy".
>Facepalm in the form of $100Ms in service credits.

Part of me wonders how much they're actually going to pay out, given that their own status page has only indicated five services with moderate ("Increased API Error Rates") disruptions in service.

That public status page has no bearing on service credits, it's a statically hosted page updated when there's significant public impact. A lot of issues never make it there.

Every AWS customer has a Personal Health Dashboard, which is updated much faster and links issues to their affected resources. Additionally, requests for credits are handled by the customer service team, who have even more information.

Utter lies on that page. Multiple services listed as green aren't working for me or my team.
This point is repeated often, and the incentives for Amazon to downplay the actual downtime are definitely there.

Wouldn't affected companies be incentivized to file a lawsuit over AMZ lying about status? It would be easy to prove and costly to defend from AWS's standpoint.

Suggesting that when the status page sends a status request and hears no response, it defaults to green: hear no evil and see no evil -> report no evil

Either way—overt lies or engineering incompetence—it’s disappointing!

Pretty low chance that the status page is automated, especially via health checks. I imagine it's a static asset updated by hand.
Or the service that updates the status page runs out of us-east-1.
It has customer relationship implications. I guarantee you it is updated by a support agent.
Don't think there is an SLA for the console, so you would not be claiming anything for the console, at least.
I don't know if that should surprise us. AWS hosted their status page in S3 so it couldn't even reflect its own outage properly ~5 years ago. https://www.theregister.com/2017/03/01/aws_s3_outage/
It's like three regions - when two of them explode.

Two is one & one is none.

the obvious solution is to put all internet in one region so that when that one explodes nobody notices your little service
> At a different (unnamed) FAANG

I'm guessing Google, on the basis of the recently published (to the public) "I just want to serve 5TB"[1] video. If it isn't Google, then the broccoli man video is still a cogent reminder that unyielding multi-region rigor comes with costs.

1. https://www.youtube.com/watch?v=3t6L-FlfeaI

It's salient that the video is from 2010. Where I was (not Google), the push to make everything multi-region only really started in, maybe, 2011 or 2012. And, for a long time, making services multi-region actually was a huge pain. (Exception: there was a way to have lambda-like code with access to a global eventually-consistent DB.)

The point is that we made it easier. By the time I left, things were basically just multi-region by default. (To be sure, there were still sharp edges. Services which needed to store data (like databases) were a nightmare to manage. Services which needed to be in the same region as specific instances of other services, e.g. something which wanted to be running in the same region as wherever the master shard of its database was running, were another nasty case.)

The point was that every service was expected to be multi-region, which was enforced by regular fire drills, and if you didn't have a pretty darn good story about why regular announced downtime was fine, people would be asking serious questions.

And anything external going down for more than a minute or two (e.g. for a failover) would be inexcusable. Especially for something like a bloody login page.

Maybe has something to do with CloudFront mandating certs to be in us-east-1?
YES! Why do they do that? It's so weird. I will deploy a whole config into us-west-1 or something; but then I need to create a new cert in us-east-1 JUST to let cloudfront answer an HTTPS call. So frustrating.
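(For illustration, a minimal boto3 sketch of requesting the CloudFront-facing certificate specifically in us-east-1 while everything else lives in another region; the domain name is just a placeholder.)

    import boto3

    # CloudFront only accepts ACM certificates issued in us-east-1, so this
    # client targets that region even though the rest of the stack is elsewhere.
    acm = boto3.client("acm", region_name="us-east-1")

    response = acm.request_certificate(
        DomainName="www.example.com",   # placeholder domain
        ValidationMethod="DNS",
    )
    print("Certificate ARN to attach to the CloudFront distribution:",
          response["CertificateArn"])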
Agreed - in my line of work regulators want everything in the country we operate from but of course CloudFront has to be different.
Wouldn't using a global CDN for everything be off the table to begin with, in that case?
Apparently it's okay for static data (like a website hosted in S3 behind CloudFront) but seeing non-Australian items in AWS billing and overviews always makes us look twice.
Forget the number of regions. Monitoring for X shouldn't even be hosted on X at all...
Exactly. And I’m surprised AWS doesn’t have failover. That’s basic SOP for an SRE team.
> Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?

They're cheap. HA is something their customers pay extra for, not something Amazon holds itself to, and Amazon often lies during major outages. They would lose money on HA and they would lose money on acknowledging downtimes. They will lie as long as they benefit from it.

I think I know specifically what you are talking about. The actual files an engineer could upload to populate their folder were not multi-region for a long time. The servers were, because they were stateless and that was easy to make multi-region, but the actual data wasn't until we replaced the storage service.
I think the storage was replicated by 2013? Definitely by 2014. It didn't have automated failover, but failover could be done, and was done during the relevant drills for some time.

I think it only stopped when the storage services got to the "deprecated, and we're not bothering to do a failover because dependent teams who care should just use something else, because this one is being shut down any year now". (I don't agree with that decision, obviously ;) but I do have sympathy for the team stuck running a condemned service. Sigh.)

After stuff was migrated to the new storage service (probably somewhere in the 2017-2019 range but I have no idea when), I have no idea how DR/failover worked.

Thank you for the sympathy. If we are talking about the same product then it was most likely backed by 3 different storage services over its lifespan, 2013/2014 was a third party product that had some replication/fail-over baked in, 2016-2019 on my team with no failover plans due to "deprecated, dont bother putting anything important here", then 2019 onward with "fully replicated and automatic failover capable and also less cost-per-GB to replicate but less flexible for the existing use cases".
MAANG*

How long before Meta takes over for Facebook?

Well, alphabet needs to take over for Google first.
I like MAGMA (Meta, Amazon, Google, Microsoft, Apple).

Especially when you are getting burned by an outage.

Yeah, but I still have a different understanding of what "Increased Error Rates" means.

IMHO it should mean that the rate of errors is increased but the service is still able to serve a substantial amount of traffic. If the rate of errors is bigger than, let's say, 90%, that's not an increased error rate, that's an outage.

They say that to try and avoid SLA commitments.
Some big customers should get together and make an independent org to monitor cloud providers and force them to meet their SLA guarantees without being able to weasel out of the terms like this…
They are still lying about it, the issues are not only affecting the console but also AWS operations such as S3 puts. S3 still shows green.
It's certainly affecting a wider range of stuff from what I've seen. I'm personally having issues with API Gateway, CloudFormation, S3, and SQS
Our corporate ForgeRock 2FA service is apparently broken. My services are behind distributed x509 certs so no problems there.
> We are experiencing API and console issues in the US-EAST-1 Region
I read it as console APIs. Each service API has its own indicator, and they are all green.
IAM is a "global" service for AWS, where "global" means "it lives in us-east-1".

STS at least has recently started supporting regional endpoints, but most things involving users, groups, roles, and authentication are completely dependent on us-east-1.
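(A minimal sketch of opting into a regional STS endpoint with boto3 so credential calls don't ride through the us-east-1-backed global endpoint; the region choice is arbitrary here.)

    import boto3

    # The global STS endpoint (https://sts.amazonaws.com) is served out of
    # us-east-1; pointing the client at a regional endpoint keeps token calls
    # inside the chosen region.
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",
    )
    print(sts.get_caller_identity()["Account"])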

Yep, I am seeing failures on IAM as well:
   aws iam list-policies
  
  An error occurred (503) when calling the ListPolicies operation (reached max retries: 2): Service Unavailable
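(Not a fix for the outage itself, but a minimal sketch of raising boto3's retry budget with exponential backoff so intermittent 503s like the one above don't give up after two retries.)

    import boto3
    from botocore.config import Config

    # "standard" retry mode backs off exponentially on transient 5xx and
    # throttling errors; max_attempts raises the ceiling shown in the error above.
    iam = boto3.client("iam", config=Config(retries={"max_attempts": 10,
                                                     "mode": "standard"}))

    for policy in iam.list_policies(Scope="Local", MaxItems=10)["Policies"]:
        print(policy["PolicyName"])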
I'm seeing errors for things that worked fine before; policies that had no issue are now saying "access denied".

I'm wondering if the cause of the outage has to do with something changing in the way IAM is interpreted?

Same here. Kubernetes pods running in EKS are (intermittently) failing to get IAM credentials via the ServiceAccount integration.
I still can't create/destroy/etc CloudFront distros. They are stuck in "pending" indefinitely.
Ok, we've changed the URL to that from https://us-east-1.console.aws.amazon.com/console/home since the latter is still not responding.

There are also various media articles but I can't tell which ones have significant new information beyond "outage".

When I brought up the status page (because we're seeing failures trying to use Amazon Pay) it had EC2 and Mgmt Console with issues.

I opened it again just now (maybe 10 minutes later) and it now shows DynamoDB has issues.

If past incidents are anything to go by, it's going to get worse before it gets better. Rube Goldberg machines aren't known for their resilience to internal faults.

As a user of Sagemaker in us-east-1, I deeply fucking resent AWS claiming the service is normal. I have extremely sensitive data, so Sagemaker notebooks and certain studio tools make sense for me. Or DID. After this I'm going back to my previous formula of EC2 and hosting my own GPU boxes.

Sagemaker is not working, I can't get to my work (notebook instance is frozen upon launch, with zero way to stop it or restart it) and Sagemaker Studio is also broken right now.

The length of this outage has blown my mind.

You don't use AWS because it has better uptime. If you've been around the block enough times, this story has always rung hollow.

Rather, you use AWS because when it is down, it's down for everybody else as well. (Or at least they can nod their head in sympathy for the transient flakiness everybody experiences.) Then it comes back up and everybody forgets about the outage like it was just background noise. This is what's meant by "nobody ever got fired for buying (IBM|Microsoft)". The point is that when those products failed, you wouldn't get blamed for making that choice; in their time they were the one choice everybody excused even when it was an objectively poor choice.

As for me, I prefer hosting all my own stuff. My e-mail uptime is better than GMail, for example. However, when it is down or mail does bounce, I can't pass the buck.

Looks like they removed some 9s from availability in one day. I wonder if more are considering moving away from cloud.
Uh, four minutes to identify the root cause? Damn, those guys are on fire.
Identify or to publicly acknowledge? Chances are the technical teams noticed this fairly quickly and had been working on the issue for some time. It probably wasn't until they had identified the root cause and had a handful of strategies to mitigate with confidence that they chose to publicly acknowledge the issue to save face.

I've broken things before and been aware of it, but didn't acknowledge them until I was confident I could fix them. It allows you to maintain an image of expertise to those outside who care about the broken things but aren't savvy to what's broken or why. Meanwhile you spent hours, days, weeks addressing the issue and suddenly pull a magic solution out of your hat to look like someone impossible to replace. Sometimes you can break and fix things without anyone even knowing, which is very valuable if breaking something posed some real risk to you.

This sounds very self-blaming. Are you sure that's what's really going through your head? Personally, when I get avoidant like that, it's because of anticipation of the amount of process-related pain I'm going to have to endure as a result, and it's much easier to focus on a fix when I'm not also trying to coordinate escalation policies that I'm not familiar with.
:) I imagine it went like this theoretical Slack conversation:

> Dev1: Pushing code for branch "master" to "AWS API".
> <slackbot> Your deploy finished in 4 minutes
> Dev2: I can't reach the API in east-1
> Dev1: Works from my computer

Outage started at 7:31 PST according to our monitoring. They are on fire, but not in a good way.
It was down as of 7:45am (we posted in our engineering channel), so that's a good 40 minutes of public errors before the root cause was figured out.
I'm trying to log in to the AWS Console from other regions but I'm getting HTTP 500. Has anyone managed to log in in other regions? Which ones?

Our backend is failing; it's on us-east-1 using AWS Lambda, API Gateway, S3.

I like how 6 hours in: "Many services have already recovered".
It's acting odd for me. Shows all green in Firefox, but shows the error in Chrome even after some refreshes. Not sure what's caching where to cause that.
firefox has more aggressive caching than other browsers I think
Haha, my developer called me in a panic telling me that he crashed Amazon - he was doing some load tests with Lambda.
Postmortem: unbounded auto-scaling of Lambda combined with an oversight on internal rate limits caused an unforeseen internal DDoS.
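(If that postmortem were real, one guardrail would be capping the function's concurrency so a fan-out bug can't scale without bound. A hedged boto3 sketch; the function name is made up.)

    import boto3

    # Reserve (and thereby cap) concurrency for the hypothetical load-test
    # function so a self-invoking or fan-out bug can't amplify indefinitely.
    boto3.client("lambda").put_function_concurrency(
        FunctionName="load-test-worker",          # made-up function name
        ReservedConcurrentExecutions=10,
    )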
Just wait for the medium article “How I ran up a $400 million AWS bill.”
I asked my friend who's a senior dev if he ever uses recursion at work. He said whenever he sees recursion in a code review, he tells the junior dev to knock it off.
He created a lambda function that spawned more lambda functions and the rest is history
If he actually knows how to crash Amazon, you have a new business opportunity, albeit not a very nice one...
It'd be hilarious if you kept that impression going for the duration of the outage.
"my boss" "my QA team" "my peers" "my partner"
I worked at a company that hired an ex-Amazon engineer to work on some cloud projects.

Whenever his projects went down, he fought tooth and nail against any suggestion to update the status page. When forced to update the status page, he'd follow up with an extremely long "post-mortem" document that was really just a long winded explanation about why the outage was someone else's fault.

He later explained that in his department at Amazon, being at fault for an outage was one of the worst things that could happen to you. He wanted to avoid that mark any way possible.

YMMV, of course. Amazon is a big company and I've had other friends work there in different departments who said this wasn't common at all. I will always remember the look of sheer panic he had when we insisted that he update the status page to accurately reflect an outage, though.

It's popular to upvote this during outages, because it fits a narrative.

The truth (as always) is more complex:

* No, this isn't the broad culture. It's not even a blip. These are EXCEPTIONAL circumstances from extremely bad teams that - if and when found out - would face dramatic intervention.

* The broad culture is blameless post-mortems. Not whose fault is it. But what was the problem and how to fix it. And one of the internal "Ten commandments of AWS availability" is you own your dependencies. You don't blame others.

* Depending on the service one customer's experience is not the broad experience. Someone might be having a really bad day but 99.9% of the region is operating successfully, so there is no reason to update the overall status dashboard.

* Every AWS customer has a PERSONAL health dashboard in the console that should indicate their experience.

* Yes, VP approval is needed to make any updates on the status dashboard. But that's not as hard as it may seem. AWS executives are extremely operation-obsessed, and when there is an outage of any size are engaged with their service teams immediately.

Well, the narrative is sort of what Amazon is asking for, heh?

The whole us-east-1 management console is gone, what is Amazon posting for the management console on their website?

"Service degradation"

It's not a degradation if it's outright down. Use the red status a little bit more often, this is a "disruption", not a "degradation".

Yeah no kidding. Is there a ratio of how many people it has to be working for to be in yellow rather than red? Some internal person going “it works on my machine” while 99% of customers are down.
I've always wondered why services are not counted down more often. Is there some sliver of customers who have access to the management console for example?

An increase in error rates - no biggie, any large system is going to have errors. But when 80%+ of customers' loads in the region are impacted (across availability zones, for whatever good those do) - that counts as down, doesn't it? Error rates in one AZ - degraded. Multi-AZ failures - down?

SLAs. Officially acknowledging an incident means that they now have to issue the SLA credits.
The outage dashboard is normally only updated if a certain $X percent of hosts / service is down. If the EC2 section were updated every time a rack in a datacenter went down, it would be red 24x7.

It's only updated when a large percentage of customers are impacted, and most of the time this number is less than what the HN echo chamber makes it appear to be.

I mean, sure, there are technical reasons why you would want to buffer issues so they're only visible if something big went down (although one would argue that's exactly what the "degraded" status means).

But if the official records say everything is green, a customer is going to have to push a lot harder to get the credits. There is a massive incentivization to “stay green”.

Yes, there were. I'm from central Europe and we were at least able to get some pages of the console in us-east-1, but I assume this was more caching related. Even though the console loaded and worked for listing some entries, we weren't able to post a support case or view SQS messages, etc.

So I agree that "degraded" is not the proper wording, but the console had not completely vanished either. So... it's hard to tell what a commonly acceptable wording is here.

From France, when I connect to "my personal health dashboard" in eu-west-3, it says several services are having "issues" in us-east-1.

To your point, for support center (which doesn't show a region) it says:

Description

Increased Error Rates

[09:01 AM PST] We are investigating increased error rates for the Support Center console and Support API in the US-EAST-1 Region.

[09:26 AM PST] We can confirm increased error rates for the Support Center console and Support API in the US-EAST-1 Region. We have identified the root cause of the issue and are working towards resolution.

I'm part of a large org with a large AWS footprint, and we've had a few hundred folks on a call nearly all day. We have only a few workloads that are completely down; most are only degraded. This isn't a total outage, we are still doing business in east-1. Is it "red"? Maybe! We're all scrambling to keep the services running well enough for our customers.
Because the console works just fine in us-east-2, and the console on the status page does not display regions.

If the console works 100% in us-east-2 and not in us-east-1, why would they mark the console as completely down in us-east?

Well you know, like when a rocket explodes, it's a sudden and "unexpected rapid disassembly" or something...

And a cleaner is called a "floor technician".

Nothing really out of the ordinary for a service to be called degraded while "hey, the cache might still be working right?" ... or "Well you know, it works every other day except today, so it's just degradation" :-)

If your statement is true, then why is the AWS status page widely considered useless, and everyone congregates on HN and/or Twitter to actually know what's broken on AWS during an outage?
> Yes, VP approval is needed to make any updates on the status dashboard. But that's not as hard as it may seem. AWS executives are extremely operation-obsessed, and when there is an outage of any size are engaged with their service teams immediately.

My experience generally aligns with amzn-throw, but this right here is why. There's a manual step here and there's always drama surrounding it. The process to update the status page is fully automated on both sides of this step, if you removed VP approval, the page would update immediately. So if the page doesn't update, it is always a VP dragging their feet. Even worse is that lags in this step were never discussed in the postmortem reviews that I was a part of.

It's intentional plausible deniability. By creating the manual step you can shift blame away. It's just like the concept of personal health dashboards, which are designed to keep an asymmetry in reliability information between the host and the client, limiting each customer to their own personal anecdata. On top of all of this, the metrics are pretty arbitrary.

Let's not pretend businesses haven't been intentionally advertising in deceitful ways for decades if not hundreds of years. This just happens to be the current strategy in tech for lying to and deceiving customers to limit liability, responsibility, and recourse actions.

To be fair, it's not just Amazon; they just happen to be the largest and most targeted whipping boys on the block. Few businesses will admit to liability under any circumstances. Liability has to always be assessed externally.

I have in the past directed users here on HN who were complaining about https://status.aws.amazon.com to the Personal Health Dashboard at https://phd.aws.amazon.com/ as well. Unfortunately, even though the account I was logged into this time only has a single S3 bucket in the EU, billed through the EU and with zero direct dependencies on the US, the Personal Health Dashboard was ALSO throwing "The request processing has failed because of an unknown error" messages. Whatever the problem was this time, it had global effects for the majority of users of the Console; the internet noticed for over 30 minutes before either the status page or the PHD was able to report it. There will be no explanation, and the official status page logs will say there were "increased API failure rates" for an hour.

Now I guess it's possible that the 1000s and 1000s of us who noticed and commented are some tiny fraction of the user base, but if that's so you could at least publish a follow-up like other vendors do that says something like 0.00001% of API requests failed, affecting an estimated 0.001% of our users at the time.

I haven't asked AWS employees specifically about blameless postmortems, but several of them have personally corroborated that the culture tends towards being adversarial and "performance focused." That's a tough environment for blameless debugging and postmortems. Like if I heard that someone has a rain forest tree-frog living happily in their outdoor Arizona cactus garden, I have doubts.
When I was at Google I didn't have a lot of exposure to the public infra side. However I do remember back in 2008 when a colleague was working on routing side of YouTube, he made a change that cost millions of dollars in mere hours before noticing and reverting it. He mentioned this to the larger team which gave applause during a tech talk. I cannot possibly generalize the culture differences between Amazon and Google, but at least in that one moment, the Google culture seemed to support that errors happen, they get noticed, and fixed without harming the perceived performance of those responsible.
While I support that, how are the people involved evaluated?
I was not informed of his performance reviews. However, given the reception, his work in general, and the attitudes of the team, I cannot imagine this even came up. More likely the ability to improve routing to actually make YouTube cheaper in the end was I'm sure the ultimate positive result.

This was also towards the end of the golden age of Google, when the percentage of top talent was a lot higher.

So on what basis is someone's performance reviewed, if such performance is omitted?
The entire point of blameless postmortems is acknowledging that the mere existence of an outage does not inherently reflect on the performance of the people involved. This allows you to instead focus on building resilient systems that avoid the possibility of accidental outages in the first place.
I know. That's not what I'm asking about, if you might read my question.
I'll play devil's advocate here and say that sometimes these incidents deserve praise because they uncovered an issue that was otherwise unknown previously. Also if the incident had a large negative impact then it shows to leadership how critical normal operation of that service is. Even if you were the cause of the issue, the fact that you fixed it and kept the critical service operating the rest of the time, is worth something good.
I know; that's not what I'm asking about. I'm talking about a different issue.
Mistakes happen, and a culture that insists too hard that "mistakes shouldn't happen, and so we can't be seen making mistakes" is harmful toward engineering.

How should their performance be evaluated, if not by the rote number of mistakes that can be pinned onto the person, and their combined impact? (Was that the question?)

If an engineer causes an outage by mistake and then ensures that would never happen again, he has made a positive impact.
I understand that, but eventually they need to evaluate performance, for promotions, demotions, raises, cuts, hiring, firing, etc. How is that done?
It’s standard. Career ladder [1] sets expectation for each level. Performance is measured against those expectations. Outages don’t negatively impact a single engineer.

The key difference is the perspective. If reliability is bad that’s an organizational problem and blaming or punishing one engineer won’t fix that.

[1] An example ladder from Patreon: https://levels.patreon.com/

> The key difference

The key difference between what and what?

Your approach and their approach. It sounded like you have a different perspective about who is responsible for an outage.
We knew us-east-1 was unusable for our customers for 45 minutes before Amazon acknowledged anything was wrong _at all_. We made decisions _in the dark_ to serve our customers, because Amazon dragged their feet communicating with us. Our customers were notified after 2 minutes.

It's not acceptable.

Can’t comment on most of your post but I know a lot of Amazon engineers who think of the CoE process (Correction of Error, what other companies would call a postmortem) as punitive
They aren't meant to be, but shitty teams are shitty. You can also create a COE and assign it to another team. When I was at AWS, I had a few COEs assigned to me by disgruntled teams just trying to make me suffer and I told them to pound sand. For my own team, I wrote COEs quite often and found it to be a really great process for surfacing systemic issues with our management chain and making real improvements, but it needs to be used correctly.
At some point the number of people who were on shitty teams becomes an indictment on the wider culture at Amazon.
Absolutely! Anecdotally, out of all the teams I interacted with in seven years at AWS across multiple arms of the company, I saw only a handful of bad teams. But like online reviews, the unhappy people are typically the loudest. I'm happy they are though, it's always important to push to be better, but I don't believe that AWS is the hellish place to work that HN discourse would lead many to believe.
I don't know any, and I have written or reviewed about 20
Even in a medium decent culture, with a sample of 20? You know at least one, you just don't know it.
I'm an ex-Amazon employee and approve of this response.

It reflects exactly my experience there.

Blameless post-mortem, stick to the facts and how the situation could be avoided/reduced/shortened/handled better for next time.

In fact, one of the guidelines for writing COE (Correction Of Error, Amazon's jargon for Post Mortem) is that you never mention names but use functions and if necessary teams involved:

1. Personal names don't mean anything except to the people who were there on the incident at the time. Someone reading the CoE on the other side of the world or 6 months from now won't understand who did what and why.

2. It stands in the way of honest accountability.

Because OTHERWISE people might think AMAZON is a DYSFUNCTIONAL company that is beginning to CRATER under its HORRIBLE work culture and constant H/FIRE cycle.

See, AWS is basically turning into a long standing utility that needs to be reliable.

Hey, do most institutions like that completely turn over their staff every three years? Yeah, no.

Great for building it out and grabbing market share.

Maybe not for being the basis of a reliable substrate of the modern internet.

There are dozens of bespoke systems that keep AWS afloat (disclosure: I have friends who worked there, and there are, and also Conway's law), but if the people who wrote them are three generations of HIRE/FIRE ago....

Not good.

> Maybe not for being the basis of a reliable substrate of the modern internet.

Maybe THEY will go to a COMPETITOR and THINGS MOVE ON if it's THAT BAD. I wasn't sure what the pattern for all caps was, so just giving it a shot there. Apologies if it's incorrect.

I was mocking the parent, who was doing that. Yes it's awful. Effective? Sigh, yes. But awful.
>* Every AWS customer has a PERSONAL health dashboard in the console that should indicate their experience.

You mean the one that is down right now?

Seems like it's doing an exemplary job of indicating their experience, then.
What?!

Everybody is very slow to update their outage pages because of SLAs. It's in a company's financial interest to deny outages and, when they are undeniable, to make them appear as short as possible. Status pages updating slowly is definitely by design.

There's no large dev platform I've used where this wasn't true of its status page.

> ...you own your dependencies. You don't blame others.

Agreed, teams should invest resources in architecting their systems in a way that can withstand broken dependencies. How do AWS teams account for "core" dependencies (e.g. auth) that may not have alternatives?

This is the irony of building a "reliable" system across multiple AZ's.
> * Depending on the service one customer's experience is not the broad experience. Someone might be having a really bad day but 99.9% of the region is operating successfully, so there is no reason to update the overall status dashboard.

https://rachelbythebay.com/w/2019/07/15/giant/

> you own your dependencies. You don't blame others.

I love that. Build your service to be robust. Never assume that dependencies are 100% reliable. Gracefully handle failures. Don't just go hard down, or worse, fail horribly in a way that you can't recover from automatically when your dependencies come back. I've seen a single database outage cause cascading failures across a whole site even though most services had no direct connection to the database. (And recovery had to be done in order of dependency, or else you're playing whack-a-mole for an hour.)

> VP approval is needed to make updates on the status board.

Isn't that normal? Updating the status has a cost (reparations to customers if you breach SLA). You don't want some on-call engineer stressing over the status page while trying to recover stuff.
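(A minimal sketch of the "don't go hard down" idea above: serve the last known-good value when a dependency fails instead of cascading the error. Names are hypothetical, and the real fetch call should carry its own timeout.)

    _last_good = {}

    def get_with_fallback(key, fetch):
        """fetch() is any callable that talks to the dependency."""
        try:
            value = fetch()
            _last_good[key] = value       # remember the latest healthy answer
            return value
        except Exception:
            if key in _last_good:
                return _last_good[key]    # degrade gracefully with stale data
            raise                         # nothing cached yet; surface the failure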

Oh, yes. Let me go look at the PERSONAL health dashboard and... oh, I need to sign into the console to view it... hmm
Come on, we all know managers don’t want to claim an outage till the last minute.
"Yes, VP approval is needed to make any updates on the status dashboard."

If services are clearly down, why is this needed? I can understand the oversight required for a company like Amazon, but this sounds strange to me. If services are clearly down, I want that damn status update right away as a customer.

Because "services down" also means SLA credits.
Hiding behind a throw away account does not help your point.
The person is unlikely to have been authorized as a spokesman for AWS. In many workplaces, doing that is grounds for disciplinary action. Hence, throwaway.
Well, when you talk about blameless post mortems and how they are valued at the company... A throw-away does make me doubt that the culture supports being blameless :)
Well, I understand that, but if you look at his account history it is only pro-Amazon comments. It feels like propaganda more than information, and all I am saying is that the throwaway does not add credibility or a feeling that his opinions are genuine.
That sounds like the exact opposite of human-factors engineering. No one likes taking blame. But when things go sideways, people are extra spicy and defensive, which makes them clam up and often withhold useful information, which can extend the outage.

No-blame analysis is a much better pattern. Everyone wins. It's about building the system that builds the system. Stuff broke; fix the stuff that broke, then fix the things that let stuff break.

I worked at Walmart Technology. I bravely wrote post-mortem documents owning the fault of my team (100+ people), owning it both technically and culturally as their leader. I put together a plan to fix it and executed it. Thought that was the right thing to do. This happened two times in my 10-year career there.

Both times I was called out as a failure in my performance eval. Second time, I resigned and told them to find a better leader.

Happy now I am out of such shitty place.

That's shockingly stupid. I also worked for a major Walmart IT services vendor in another life, and we always had to be careful about how we handled them, because they didn't always show a lot of respect for vendors.

On another note, thanks for building some awesome stuff -- walmart.com is awesome. I have both Prime and whatever-they're-currently-calling Walmart's version and I love that Walmart doesn't appear to mix SKU's together in the same bin which seems to cause counterfeiting fraud at Amazon.

walmart.com user design sucks. My particular grudge right now is: I'm shopping to go pick up some stuff (and indicate "in store pickup"), and each time I search for the next item, it resets that filter, making me click on that filter for each item on my list.
Almost every physical-store-chain company's website makes it way too hard to do the thing I nearly always want out of their interface, which is to search the inventory of the X nearest locations. They all want to push online orders or 3rd-party-seller crap, it seems.
Yes, I assume they intentionally make it difficult to push third party sellers, where they get to earn bigger profit margins and/or hide their low inventory.

Although, Amazon is the worst, then Walmart (still much better than Amazon since you can at least filter). The others are not bad in my experience.

Walmart.com, Am I the only one in the world who can't view their site on my phone? I tried it on a couple devices and couldn't get it to work. Scaling is fubar. I assumed this would be costing them millions/billions since it's impossible to buy something from my phone right now. S21+ in portrait on multiple browsers.
I believe he means a literal bin. E.g. Amazon takes products from all their sellers and chucks them in the same physical space, so they have no idea who actually sold the product when it's picked. So you could have gotten something from a dodgy 3rd party seller that repackages broken returns, etc, and Amazon doesn't maintain oversight of this.
Literally just a bin in a fulfillment warehouse.

An amazon listing doesn't guarantee a particular SKU.

Ah, whew. That's what I thought. Thanks! I asked because we make warehouse and retail management systems and every vendor or customer seems to give every word their own meanings (e.g., we use "bin" in our discounts engine to be a collection of products eligible for discounts, and "barcode" has at least three meanings depending on to whom you're speaking).
Props to you, and Walmart will never realize their loss. Unfortunately. But one day there will be a headline (or even a couple of them) and you will know that if you had been there it might not have happened, and that in the end it is Walmart's customers that will pay the price for that, not their shareholders.
Stories like this are why I'm really glad I stopped talking to that Walmart Technology recruiter a few years ago. I love working for places where senior leadership constantly repeat war stories about "that time I broke the flagship product" to reinforce the importance of blameless postmortems. You can't fix the process if the people who report to you feel the need to lie about why things go wrong.
that's awful. You should have been promoted for that.
is it just 'ceremony' to be called out on those things? (even if it is actually a positive sum total)
> Happy now I am out of such shitty place.

Doesn't sound like it.

I firmly believe in the dictum "if you ship it you own it". That means you own all outages. It's not just an operator flubbing a command, or a bit of code that passed review when it shouldn't. It's all your dependencies that make your service work. You own ALL of them.

People spend all this time threat modelling their stuff against malefactors, and yet so often people don't spend any time thinking about the threat model of decay. They don't do it when adding new dependencies (build- or runtime), and therefore are unprepared to handle an outage.

There's a good reason for this, of course: modern software "best practices" encourage moving fast and breaking things, which includes "add this dependency we know nothing about, and which gives an unknown entity the power to poison our code or take down our service, arbitrarily, at runtime, but hey its a cool thing with lots of github stars and it's only one 'npm install' away".

Just want to end with this PSA: Dependencies bad.

Should I be penalized if an upstream dependency, owned by another team, fails? Did I lack due diligence in choosing to accept the risk that the other team couldn't deliver? These are real problems in the micro-services world, especially since I own UI and there are dozens of teams pumping out services, and I'm at the mercy of all of them. The best I can do is gracefully fail when services don't function in a healthy state.
You and many others here may be conflating two concepts which are actually quite separate.

Taking blame is a purely punitive action and solves nothing. Taking responsibility means it's your job to correct the problem.

I find that the more "political" the culture in the organization is, the more likely everyone is to search for a scapegoat to protect their own image when a mistake happens. The higher you go up in the management chain, the more important vanity becomes, and the more you see it happening.

I have made plenty of technical decisions that turned out to be the wrong call in retrospect. I took _responsibility_ for those by learning from the mistake and reversing or fixing whatever was implemented. However, I never willfully took _blame_ for those mistakes because I believed I was doing the best job I could at the time.

Likewise, the systems I manage sometimes fail because something that another team manages failed. Sometimes it's something dumb that could have easily been prevented. In these cases, it's easy to point blame and say, "Not our fault! That team or that person is being a fuckup and causing our stuff to break!" It's harder but much more useful to reach out and say, "hey, I see x system isn't doing what we expect, can we work together to fix it?"

Every argument I have on the internet is between prescriptive and descriptive language.

People tend to believe that if you can describe a problem that means you can prescribe a solution. Often times, the only way to survive is to make it clear that the first thing you are doing is describing the problem.

After you do that, and it's clear that's all you are doing, then you follow up with a prescriptive description where you place clearly what could be done to manage a future scenario.

If you don't create this bright line, you create a confused interpretation.

My comment was made from the relatively simpler entrepreneurial perspective, not the corporate one. Corp ownership rests with people in the C-suite who are social/political lawyer types, not technical people. They delegate responsibility but not authority, because they can hire people, even smart people, to work under those conditions. This is an error mode where "blame" flows from those who control the money to those who control the technology. Luckily, not all money is stupid so some corps (and some parts of corps) manage to function even in the presence of risk and innovation failures. I mean the whole industry is effectively a distributed R&D budget that may or may not yield fruit. I suppose this is the market figuring out whether iterated R&D makes sense or not. (Based on history, I'd say it makes a lot of sense.)
I wish you wouldn't talk about "penalization" as if it was something that comes from a source of authority. Your customers are depending on you, and you've let them down, and the reason that's bad has nothing to do with what your boss will do to you in a review.

The injustice that can and does happen is that you're explicitly given a narrow responsibility during development, and then a much broader responsibility during operation. This is patently unfair, and very common. For something like a failed uService you want to blame "the architect" that didn't anticipate these system level failures. What is the solution? Have plan b (and plan c) ready to go. If these services don't exist, then you must build them. It also implies a level of indirection that most systems aren't comfortable with, because we want to consume services directly (and for good reason) but reliability requires that you never, ever consume a service directly, but instead from an in-process location that is failure aware.

This is why reliable software is hard, and engineers are expensive.

Oh, and it's also why you generally do NOT want to defer the last build step to runtime in the browser. If you start combining services on both the client and server, you're in for a world of hurt.
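(For what it's worth, a minimal sketch of that "failure aware" in-process indirection: a tiny circuit breaker that fails fast, or lets the caller switch to plan B, once a dependency keeps erroring. The thresholds are arbitrary.)

    import time

    class CircuitBreaker:
        """Wraps a dependency call; opens after repeated failures."""

        def __init__(self, call, max_failures=5, reset_after_s=30.0):
            self.call = call                  # the underlying service call
            self.max_failures = max_failures
            self.reset_after_s = reset_after_s
            self.failures = 0
            self.opened_at = None

        def __call__(self, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after_s:
                    raise RuntimeError("circuit open: dependency presumed down")
                self.opened_at = None         # cool-off elapsed; allow a trial call
                self.failures = 0
            try:
                result = self.call(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            return result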

Not penalised no, but questioned as to how well your graceful failure worked in the end.

Remember: it may not be your fault, but it still is your problem.

An analogy for illustrating this is:

You get hit by a car and injured. The accident is the other driver's fault, but getting to the ER is your problem. The other driver may help and call an ambulance, but they might not even be able to help you if they also got hurt in the car crash.

> Should I be penalized if an upstream dependency, owned by another team, fails?

> Did I lack due diligence in choosing to accept the risk that the other team couldn't deliver?

Say during due diligence two options are uncovered: use an upstream dependency owned by another team, or use that plus a 3P vendor for redundancy. Implementing parallel systems costs 10x more than the former and takes 5x longer. You estimate a 0.01% chance of serious failure for the former, and 0.001% for the latter.

Now say you're a medium sized hyper-growth company in a competitive space. Does spending 10 times more and waiting 5 times longer for redundancy make business sense? You could argue that it'd be irresponsible to over-engineer the system in this case, since you delay getting your product out and potentially lose $ and ground to competitors.

I don't think a black and white "yes, you should be punished" view is productive here.
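(A back-of-envelope sketch of that trade-off with entirely made-up numbers: build cost plus expected outage cost for each option.)

    p_fail_simple = 0.0001          # 0.01% chance of a serious failure
    p_fail_redundant = 0.00001      # 0.001%
    build_simple = 100_000          # arbitrary cost units
    build_redundant = 10 * build_simple
    outage_cost = 5_000_000         # assumed cost if the serious failure happens

    expected_simple = build_simple + p_fail_simple * outage_cost           # 100,500
    expected_redundant = build_redundant + p_fail_redundant * outage_cost  # 1,000,050
    print(expected_simple, expected_redundant)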

Where does this mindset end? Do I lack due diligence by choosing to accept that the cpu microcode on the system I’m deploying to works correctly?
If it's a brand new RISC-V CPU that was just released 5 minutes ago and nobody has really tested it, then yes.

If it's a standard CPU that everybody else uses, and it's not known to be bad, then no.

Same for software. Is it OK to have a dependency on AWS services? Their history shows yes. A dependency on a brand new SaaS product? Nothing mission critical.

Or npm/crates/pip packages. Packages that have been around and been steadily maintained for a few years, and have active users, are worth checking out. Some random project from a single developer? Consider vendoring it (and owning it if necessary).

Why? Intel has Spectre/Meltdown which erased like half of everyone's capacity overnight.
You choose the CPU and you choose what happens in a failure scenario. Part of engineering is making choices that meet the availability requirements of your service. And part of that is handling failures from dependencies.

That doesn't extend to ridiculous lengths but as a rule you should engineer around any single point of failure.

I think this is why we pay for support, with the expectation that if their product inadvertently causes losses for you they will work fast to fix it or cover the losses.
Yes? If you are worried about CPU microcode failing, then you do a NASA and have multiple CPU arch's doing calculations in a voting block. These are not unsolved problems.
JPL goes further and buys multiple copies of all hardware and software media used for ground systems, and keeps them in storage "just in case". It's a relatively cheap insurance policy against the decay of progress.
That's a great philosophy.

Ok, let's take an organization, let's call them, say Ammizzun. Totally not Amazon. Let's say you have a very aggressive hire/fire policy which worked really well in rapid scaling and growth of your company. Now you have a million odd customers highly dependent on systems that were built by people that are now one? two? three? four? hire/fire generations up-or-out or cashed-out cycles ago.

So.... who owns it if the people that wrote it are lllloooooonnnnggg gone? Like, not just long gone one or two cycles ago so some institutional memory exists. I mean, GONE.

A lot can go wrong as an organization grows, including loss of knowledge. At Amazon, "Ownership" officially rests with the non-technical money that owns voting shares. They control the board, who control the CEO. "Ownership" can be perverted to mean that you, a wage slave, are responsible for the mess that previous ICs left behind. The obvious thing to do in such a circumstance is quit (or don't apply). It is unfair and unpleasant to be treated in a way that gives you responsibility but no authority, and to participate in maintaining (and extending) that moral hazard, and as long as there are better companies you're better off working for them.
I worked on a project like this in government for my first job. I was the third butt in that seat in a year. Everyone associated with project that I knew there was gone by one year from my own departure date.

They are now on the 6th butt in that seat in 4 years. That poor fellow is entirely blameless for the mess that accumulated over time.

Having individuals own systems seems like a terrible practice. You're essentially creating a single point of failure if only one person understands how the system works.
If I were a black hat I would absolutely love GitHub and all the various language-specific package systems out there, giving me sooooo many ways to sneak arbitrary tailored malicious code into millions of installs around the world 24x7. Sure, some of my attempts might get caught, or might not lead to a valuable outcome for me. But the percentage that does? It can make it worth it. It's about scale and a massive parallelization of infiltration attempts. Logic similar to the folks blasting out phishing emails or scam calls.

I love the ubiquity of third-party software from strangers, and the lack of bureaucratic gatekeepers. But I also hate it in ways. And not enough people know about the dangers of this second thing.

And yet, oddly enough, the Earth continues to spin and the internet continues to work. I think the system we have now is necessarily the system that must exist (in this particular case, not in all cases). Something more centralized is destined to fail. And, while the open source nature of software introduces vulnerabilities, it also fixes them.
> And, while the open source nature of software introduces vulnerabilities it also fixes them.

dat gap tho... which was my point. smart black hats will be exploiting this gap, at scale. and the strategy will work because the majority of folks seem to be either lazy, ignorant or simply hurried for time.

and btw your 1st sentence was rude. constructive feedback for the future

For my vote, I don't think it was rude, I think it was making a point.
When working on CloudFiles, we often had monitoring for our limited dependencies that was better than their own monitoring. Don't just know what your stuff is doing, but what your whole dependency ecosystem is doing, and know when it all goes south. It also helps to learn where and how you can mitigate some of those dependencies.
This. We found very big, serious issues with our anti-DDOS provider because their monitoring sucked compared to ours. It was a sobering reality check when we realized that.
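(A minimal sketch of independently probing a dependency rather than trusting its status page; the URL and thresholds are placeholders.)

    import time
    import urllib.request

    DEPENDENCY_URL = "https://dependency.example.com/health"   # placeholder

    def probe(url, timeout_s=3.0):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout_s) as resp:
                ok = 200 <= resp.status < 300
        except Exception:
            ok = False
        return ok, time.monotonic() - start

    while True:
        healthy, latency = probe(DEPENDENCY_URL)
        if not healthy or latency > 1.0:
            print(f"dependency degraded: healthy={healthy} latency={latency:.2f}s")
        time.sleep(30)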
It's also a nightmare for software preservation. There's going to be a lot from this era that won't be usable 80 years from now because everything is so interdependent and impossible to archive. It's going to be as messy and irretrievable as the Web pre Internet Archive + Wayback are.
I don't think engineers can believe in no-blame analysis if they know it'll harm career growth. I can't unilaterally promote John Doe, I have to convince other leaders that John would do well the next level up. And in those discussions, they could bring up "but John has caused 3 incidents this year", and honestly, maybe they'd be right.
Would they? Having 3 outages in a year sounds like an organization problem. Not enough safeguards to prevent very routine human errors. But instead of worrying about that we just assign a guy to take the fall
If you work in a technical role and you _don't_ have the ability to break something, you're unlikely to be contributing in a significant way. Likely that would make you a junior developer whose every line of code is heavily scrutinized.

Engineers should be experts and you should be able to trust them to make reasonable choices about the management of their projects.

That doesn't mean there can't be some checks in place, and it doesn't mean that all engineers should be perfect.

But you also have to acknowledge that adding all of those safeties has a cost. You can be a competent person who requires fewer safeties or less competent with more safeties.

Which one provides more value to an organization?

The tactical point is to remove sharp edges, e.g. there's a tool that optionally takes a region argument.
    network_cli remove_routes [--region us-east-1]
Blaming the operator that they should have known that running
    network_cli remove_routes
will take down all regions because the region wasn't specified is exactly the kind of thing being called out here.

All of the tools need to not default to breaking the world. That is the first and foremost thing being pushed. If an engineer is remotely afraid to come forwards (beyond self-shame/judgement) after an incident, and say "hey, I accidentally did this thing", then the situation will never get any better.

That doesn't mean that engineers don't have the ability to break things, but it means it's harder (and very intentionally so) for a stressed out human operator to do the wrong thing by accident. Accidents happen. Do you just plan on never getting into a car accident, or do you wear a seat belt?
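(A minimal argparse sketch of that idea, reusing the hypothetical network_cli above: a single region is required, and acting on every region takes an explicit, deliberately awkward flag instead of being the silent default.)

    import argparse

    parser = argparse.ArgumentParser(prog="network_cli")
    sub = parser.add_subparsers(dest="command", required=True)

    remove = sub.add_parser("remove_routes")
    scope = remove.add_mutually_exclusive_group(required=True)
    scope.add_argument("--region", help="single region to act on, e.g. us-east-1")
    scope.add_argument("--all-regions-yes-really", action="store_true",
                       help="explicit opt-in to touching every region")

    args = parser.parse_args()
    print(args)   # a real tool would dispatch on args here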

> Which one provides more value to an organization?

Neither, they both provide the same value in the long term.

Senior engineers cannot execute on everything they commit to without having a team of engineers they work with. If nobody trains junior engineers, the discipline would go extinct.

Senior engineers provide value by building guardrails to enable junior engineers to provide value by delivering with more confidence.

Well, if John caused 3 outages and his peers Sally and Mike each caused 0, it's worth taking a deeper look. There's a real possibility he's getting screwed by a messed-up org; he could also be doing slapdash work, or he seriously might not understand the seriousness of an outage.
John’s team might also be taking more calculated risks and running circles around Sally and Mike’s teams with respect to innovation and execution. If your organization categorically punishes failures/outages, you end up with timid managers that are only playing defense, probably the opposite of what the leadership team wants.
Worth a look, certainly. Also very possible that this John is upfront about honest postmortems and like a good leader takes the blame, whereas Sally and Mike are out all day playing politics looking for how to shift blame so nothing has their name attached. Most larger companies that's how it goes.
s.gif
Or John's work is in frontline production use and Sally's and Mike's is not, so there's different exposure.
s.gif
You're not wrong, but it's possible that the organization is small enough that it's just not feasible to have enough safeguards that would prevent the outages John caused. And in that case, it's probably best that John not be promoted if he can't avoid those errors.
s.gif
Current co is small. We are putting in the safeguards from Day 1. Well, okay technically like day 120, the first few months were a mad dash to MVP. But now that we have some breathing room, yeah, we put a lot of emphasis on preventing outages, detecting and diagnosing outages promptly, documenting them, doing the whole 5-why's thing, and preventing them in the future. We didn't have to, we could have kept mad dashing and growth hacking. But very fortunately, we have a great culture here (founders have lots of hindsight from past startups).

It's like a seed for crystal growth. Small company is exactly the best time to implement these things, because other employees will try to match the cultural norms and habits.

s.gif
Well, I started at the small company I'm currently at around day 7300, where "source control" consisted of asking the one person who was in charge of all source code for a copy of the files you needed to work on, and then giving the updated files back. He'd write down the "checked out" files on a whiteboard to ensure that two people couldn't work on the same file at the same time.

The fact that I've gotten it to the point of using git with automated build and deployment is a small miracle in itself. Not everybody gets to start from a clean slate.

s.gif
> I have to convince other leaders that John would do well the next level up.

"Yes, John has made mistakes and he's always copped to them immediately and worked to prevent them from happening again in the future. You know who doesn't make mistakes? People who don't do anything."

s.gif
You know why SO-teams, firefighters and military pilots are so successful?

-You don't hide anything

-Errors will be made

-After training/mission everyone talks about the errors (or potential ones) and how to prevent them

-You don't make the same error twice

Being afraid to make errors and learn from them creates a culture of hiding, a culture of denial and especially being afraid to take responsibility.

s.gif
You can even make the same error twice, but you'd better have a much better explanation the second time around than you had the first time, because you already knew that what you did was risky and/or failure prone.

But usually it isn't the same person making the same mistake, usually it is someone else making the same mistake and nobody thought of updating processes/documentation to the point that the error would have been caught in time. Maybe they'll fix that after the second time ;)

s.gif
Yes. AAR process in the army was good at this up to the field grade level, but got hairy on G/J level staffs. I preferred being S-6 to G-6 for that reason.
s.gif
There is no such thing as "no-blame" analysis. Even in the best organizations with the best effort to avoid it, there is always a subconscious "this person did it". It doesn't help that these incidents serve as convenient places for others to leverage to climb their own career ladder at your expense.
s.gif
Or just take responsibility. People will respect you for doing that and you will demonstrate leadership.
s.gif
Cynical/realist take: Take responsibility, then hope your bosses already love you, that you can immediately come up with a way to prevent it from happening again, and that you can convince them to give you the resources to implement it. Otherwise your responsibility is, unfortunately, just blood in the water for someone else to do all of that, protect the company against you, and springboard their reputation on the descent of yours. There were already senior people scheming to take over your department from your bosses; now they have an excuse.
s.gif
This seems like an absolutely horrid way of working or doing 'office politics'.
s.gif
Yes, and I personally have worked in environments that do just that. They said they didn't, but with management "personalities" plus stack ranking, you know damn well that they did.
s.gif
And the guy who doesn't take responsibility gets promoted. Employees are not responsible for failures of management to set a good culture.
s.gif
The Gervais/Peter Principle is alive and well in many orgs. That doesn't mean that when you have the prerogative to change the culture, you just give up.

I realize that isn't an easy thing to do. Often the best bet is to just jump around till you find a company that isn't a cultural superfund site.

s.gif
You can work an entire career and maybe enjoy life in one healthy organization in all that time, even if you work at a variety of companies. It just isn't that common, though of course voicing the _ideals_ is very, very common.
s.gif
Once you reach a certain size there are surprisingly few healthy organizations; most of them turn into externalization engines with 4 beats per year.
s.gif
I love it when I share a mental model with someone in the wild.
s.gif
Way more fun argument: Outages just, uh… uh… find a way.
s.gif
> No-blame analysis is a much better pattern. Everyone wins. It's about building the system that builds the system. Stuff broke; fix the stuff that broke, then fix the things that let stuff break.

Yea, except it doesn't work in practice. I work with a lot of people who come from places with "blameless" post-mortem 'culture' and they've evangelized such a thing extensively.

You know what all those people have proven themselves to really excel at? Blaming people.

s.gif
Ok, and? I don't doubt it fails in places. That doesn't mean that it doesn't work in practice. Our company does it just fine. We have a high trust, high transparency system and it's wonderful.

It's like saying unit tests don't work in practice because bugs got through.

s.gif
Have you ever considered that the “no-blame” postmortems you are giving credit for everything are just a side effect of living in a high trust, high transparency system?

In other words, “no-blame” should be an emergent property of a culture of trust. It’s not something you can prescribe.

s.gif
Yes, exactly. Culture of trust is the root. Many beneficial patterns emerge when you can have that: more critical PRs, blameless post-mortems, etc.
s.gif
Sometimes, these large companies tack on too many "necessary" incident "remediation" actions with Arbitrary Due Date SLAs that completely derail any ongoing work. And ongoing, strategically defined ""muh high impact"" projects are what get you promoted, not doing incident remediations.

When you get to the level you want, you get to not really give a shit and actually do The Right Thing. However, for all of the engineers clamoring to get out of the intermediate brick-laying trenches, opening an incident can create perverse incentives.

s.gif
In my experience this is the actual reason for fear of the formal error correction process.
s.gif
I've worked for Amazon for 4 years, including stints at AWS, and even in my current role my team is involved in LSE's. I've never seen this behavior, the general culture has been find the problem, fix it, and then do root cause analysis to avoid it again.

Jeff himself has said many times in All Hands and in public "Amazon is the best place to fail". Mainly because things will break, it's not that they break that's interesting, it's what you've learned and how you can avoid that problem in the future.

s.gif
I guess the question is why can't you (AWS) fix the problem of the status page not reflecting an outage? Maybe acceptable if the console has a hiccup, but when www.amazon.com isn't working right, there should be some yellow and red dots out there.

With the size of your customer base, there were man-years spent confirming the outage after checking the status page.

s.gif
Because there's a VP approval step for updating the status page and no repercussions for VPs who don't approve updates in a timely manner. Updating the status page is fully automated on both sides of VP approval. If the status page doesn't update, it's because a VP wouldn't do it.
s.gif
Haha... This bring back memories. It really depends on the org.

I've had pushback on my postmortems before because of phrasing that could be construed as laying some of the blame on some person/team when it's supposed to be blameless.

And for a long time, it was fairly blameless. You would still be punished with the extra work of writing high quality postmortems, but I have seen people accidentally bring down critical tier-1 services and not be adversely affected in terms of promotion, etc.

But somewhere along the way, it became politicized. Things like the wheel of death, public grilling of teams on why they didn't follow one of the thousands of best practices, etc, etc. Some orgs are still pretty good at keeping it blameless at the individual level, but... being a big company, your mileage may vary.

s.gif
We're in a situation where the balls of mud made people afraid to touch some things in the system. As experience and processes have improved, we've started to crack back into those things and, guess what, when you are being groomed to own a process you're going to fuck it up from time to time. Objectively, we're still breaking production less often per year than other teams, but we are breaking it, and that's novel behavior, so we have to keep reminding people why.

The moment that affects promotions negatively, or your coworkers throw you under the bus, you should 1) be assertive and 2) proof-read your resume as a precursor to job hunting.

s.gif
Or problems just persisting, because the fix is easy, but explaining it to others who do not work on the system is hard. Especially justifying why it won't cause an issue, and being told that the fixes need to be done via scripts that will only ever be used once, but nevertheless need to be code reviewed and tested...

I wanted to be proactive and fix things before they became an issue, but such things just drained life out of me, to the point I just left.

s.gif
That’s idiotic; the service is down regardless. If you foster that kind of culture, why have a status page at all?

It makes AWS engineers look stupid, because it looks like they are not monitoring their services.

s.gif
The status page is as much a political tool as a technical one. Giving your service a non-green state makes your entire management chain responsible. You don't want to be one that upsets some VPs advancement plans.
s.gif
> It makes AWS engineers look stupid, because it looks like they are not monitoring their services.

Management.

s.gif
Former AWSser. I can totally believe that happened and continues to happen in some teams. Officially, it's not supposed to be done that way.

Some AWS managers and engineers bring their corporate cultural baggage with them when they join AWS and it takes a few years to unlearn it.

s.gif
Thanks for the perspective. I was beginning to regret posting this after so many people claiming this wouldn’t happen at AWS.

Amazon is a huge company so I have no doubt YMMV depending on your manager.

s.gif
When I worked for AMZN (2012-2015, Prime Video & Outbound Fulfillment), attempting to sweep issues under the rug was a clear path to termination. The Correction-Of-Error (COE) process can work wonders in a healthy, data-driven, growth-mindset culture. I wonder if the ex-Amazonian you're referring to did not leave AMZN of their own accord?

Blame deflection is a recipe for repeat outages and unhappy customers.

s.gif
> I wonder if the ex-Amazonian you're referring to did not leave AMZN by their own accord?

Entirely possible, and something I've always suspected.

s.gif
This is the exact opposite of my experience at AWS. Amazon is all about blameless fact finding when it comes to root cause analysis. Your company just hired a not so great engineer or misunderstood him.
s.gif
Adding my piece of anecdata to this.. the process is quite blameless. If a postmortem seems like it points blame, this is pointed out and removed.
s.gif
Blameless, maybe, but not repercussion-less. A bad COE was liable to upend the team's entire roadmap and put their existing goals at risk. To be fair, management was fairly receptive to "we need to throw out the roadmap and push our launch out to the following re:Invent", but it wasn't an easy position for teams to be in.
s.gif
Every incident review meeting I've ever been in starts out like, _"This meeting isn't to place blame..."_, then, 5 minutes later, it turns into the Blame Game.
s.gif
Manually updated status pages are an anti-pattern to begin with. At that point, why not just call it a blog?
s.gif
This gets posted every time there's an AWS outage. It might as well be copypasta at this point.
s.gif
Sorry. I'm probably to blame because I've posted this a couple times on HN before.

It strikes a nerve with me because it caused so much trouble for everyone around him. He had other personal issues, though, so I should probably clarify that I'm not entirely blaming Amazon for his habits. Though his time at Amazon clearly did exacerbate his personal issues.

s.gif
well, this is the first time I've seen it, so I am glad it was posted this time.
s.gif
First time I've seen it too. Definitely not my first "AWS us-east-1 is down but the status board is green" thread, either.
s.gif
Ditto, it's always annoyed me that their status page is useless, but glad someone else mentioned it.
s.gif
I had that deja vu feeling reading PragmaticPulp's comment, too.

And sure enough, PragmaticPulp did post a similar comment on a thread about Amazon India's alleged hire-to-fire policy 6 months back: https://news.ycombinator.com/item?id=27570411

You and I, we aren't among the 10000, but there are potentially 10000 others who might be: https://xkcd.com/1053/

s.gif
I mean, it's true at every company I've ever worked at too. If you can lawyer incidents into not being an outage, you avoid like 15 meetings with the business stakeholders about all the things we "have to do" to prevent things like this in the future, which get canceled the moment they realize how much dev/infra time it will take to implement.
s.gif
It's the "grandma got run over by a reindeer" of AWS outages. Really no outage thread would be complete without this anecdote.
s.gif
Perhaps the reward structure should be changed to incentivize post-mortems. There could be several flaws that go underreported otherwise.

We may run into the problem of everything being documented, possibly even deliberate acts, but for a service that relies heavily on uptime, that's a small price to pay for a bulletproof operation.

s.gif
Then we would drown in a sea of meetings and 'lessons learned' emails. There is a reason for post-mortems, but there has to be balance.
s.gif
I find post-mortems interesting to read through, especially when it's not my fault. Most of them would probably be routine to read through, but there are occasional ones that make me cringe or laugh.

Post-mortems can sometimes be thought of like safety training: there is a big imbalance of time dedicated to learning proper safety handling relative to those few small incidents.

s.gif
Does Disney still play the "Instructional Videos" series starring Goofy where he's supposed to be teaching you how to do something and instead we learn how NOT to do something? Or did I just date myself badly?
s.gif
On the retail/marketplace side this wasn't my experience, but we also didn't have any public dashboards. On Prime we occasionally had to refund in bulk, and when it was called for (internally or externally) we would write up a detailed post-mortem. This wasn't fun, but it was never about blaming a person and more about finding flaws in process or monitoring.
s.gif
I don't think anecdotes like this are even worth sharing, honestly. There's so much context lost here, so much that can be lost in translation. No one should be drawing any conclusions from this post.
s.gif
> explanation about why the outage was someone else's fault

In my experience, it's rarely clear who was at fault for any sort of non-trivial outage. The issue tends to be at interfaces and involve multiple owners.

s.gif
What if they just can't access the console to update the status page...
s.gif
They could still go into the data center, open up the status page servers' physical...ah wait, what if their keyfobs don't work?
s.gif
Yep I can confirm that. The process when the outage is caused by you is called COE (correction of errors). I was oncall once for two teams because I was switching teams and I got 11 escalations in 2 hours. 10 of these were caused by an overly sensitive monitoring setting. The 11th was a real one. Guess which one I ignored. :)
s.gif
This fits with everything I've heard about terrible code quality at Amazon and engineers working ridiculous hours to close tickets any way they can. Amazon as a corporate entity seems to be remarkably distrustful of and hostile to its labor force.
s.gif
> I will always remember the look of sheer panic

I don't know if you're exaggerating or not, but even if it's true, why would anyone show that much emotion when the worst case is losing a job?

You certainly had a lot of relevant-to-today's-top-HN-post stories throughout your career. And I'm less and less surprised to continuously find PragmaticPulp as one of the top commenters, if not the top one, that resonates with a good chunk of HN.

s.gif
This is weird, on my team it’s taken as a learning opportunity. I caused a pretty big outage and we just did a COE.
s.gif
I am finding that I have a very bimodal response to "He did it". When I write an RCA or just talk about near misses, I may give you enough details to figure out that Tom was the one who broke it, but I'm not going to say Tom on the record anywhere, with one extremely obvious exception.

If I think Tom has a toxic combination of poor judgement, Dunning-Kruger syndrome, and a hint of narcissism (I'm not sure but I may be repeating myself here), such that he won't listen to reason and he actively steers others into bad situations (and especially if he then disappears when shit hits the fan), then I will nail him to a fucking cross every chance I get. Public shaming is only a tool for getting people to discount advice from a bad actor. If it comes down to a vote between my idea and his, then I'm going to make sure everyone knows that his bets keep biting us in the ass. This guy kinda sounds like the Toxic Tom.

What is important when I turned out to be the cause of the issue is a bit like some court cases. Would a reasonable person in this situation have come to the same conclusion I did? If so, then I'm just the person who lost the lottery. Either way, fixing it for me might fix it for other people. Sometimes the answer is, "I was trying to juggle three things at once and a ball got dropped." If the process dictated those three things then the process is wrong, or the tooling is wrong. If someone was asking me questions we should think about being more pro-active about deflecting them to someone else or asking them to come back in a half hour. Or maybe I shouldn't be trying to watch training videos while babysitting a deployment to production.

If you never say "my bad" then your advice starts to sound like a lecture, and people avoid lectures so then you never get the whole story. Also as an engineer you should know that owning a mistake early on lets you get to what most of us consider the interesting bit of solving the problem instead of talking about feelings for an hour and then using whatever is left of your brain afterward to fix the problem. In fact in some cases you can shut down someone who is about to start a rant (which is funny as hell because they look like their head is about to pop like a balloon when you say, "yep, I broke it, let's move on to how do we fix it?")

s.gif
To me, the point of "blameless" PM is not to hide the identity of the person who was closest to the failure point. You can't understand what happened unless you know who did what, when.

"Blameless" to me means you acknowledge that the ultimate problem isn't that someone made a mistake that caused an outage. The problem is that you had a system in place where someone could make a single mistake and cause an outage.

If someone fat-fingers a SQL query and drops your database, the problem isn't that they need typing lessons! If you put a DBA in a position where they have to be typing SQL directly at a production DB to do their job, THAT is the cause of the outage; the actual DBA's error is almost irrelevant, because it would have happened eventually to someone.
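
A deliberately toy sketch of the kind of guardrail that turns that fat-fingered DROP into an error message rather than an outage (nothing here is anyone's real tooling; the class and the change-ticket check are made up for illustration):

    # Toy illustration: destructive SQL only runs through an approved path.
    import re

    DESTRUCTIVE = re.compile(r"^\s*(drop|truncate|delete|alter)\b", re.IGNORECASE)

    class GuardedSession:
        """Wraps a DB-API connection; hypothetical, for illustration only."""

        def __init__(self, connection, change_ticket=None):
            self.connection = connection
            self.change_ticket = change_ticket  # e.g. an approved change/migration ID

        def execute(self, sql, params=()):
            if DESTRUCTIVE.match(sql) and not self.change_ticket:
                raise PermissionError(
                    "Destructive statement without an approved change ticket; "
                    "run it through the migration pipeline instead."
                )
            cur = self.connection.cursor()
            try:
                cur.execute(sql, params)
                return cur.rowcount
            finally:
                cur.close()

The specific check doesn't matter; the point is that the system, not the typist, should be what stands between a slip and an outage.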

s.gif
That's true if the direct cause is an actual mistake, which often is the case but not always.

It may also be that the cause is willful negligence, intentionally circumventing barriers for some personal reason.

And, of course, it may be that the cause is explicitly malicious (e.g. internal fraud, or the intent to sabotage someone) and at least part of the blame directly lies on the culprit, and not only on those who failed to notice and stop them.

s.gif
Naming someone is how you discover that not everyone in the organization believes in Blamelessness. Once it's out it's out, you can't put it back in.

It's really easy for another developer to figure out who I'm talking about. Managers can't be arsed to figure it out, or at least pretend like they don't know.

s.gif
And this is exactly why you can expect these headlines to hit with great regularity. These things are never a problem at the individual level, they are always at the level of culture and organization.
s.gif
> being at fault for an outage was one of the worst things that could happen to you

Imagine how stressful life would be thinking that you had to be perfect all the time.

s.gif
That's been most of my life. Welcome to perfectionism.
s.gif
Maybe it’s more telling that that engineer no longer works at Amazon.
s.gif
That's a real shame; one of the leadership principles used to be "be vocally self-critical", which I think was supposed to explicitly counteract this kind of behaviour.

I think they got rid of it at some point though.

s.gif
This may not actually be that bad of a thing. If you think about it, if they're fighting tooth and nail to keep the status page still green, that tells you they were probably doing that at every step of the way before the failure became imminent. Gotta have respect for that.