

Ask HN: Best “I brought down production” story?
source link: https://news.ycombinator.com/item?id=27644387

I was trying to learn MySQL and the CTO made the mistake of giving me access to the prod database. This huge network that served most of the ads in the world ran off of only two huge servers running in an office outside Los Angeles.
MyISAM takes a table-level read lock on every SELECT query, blocking writes until the query finishes. I did not know this at the time. I was running a number of queries that were trying to pull historical performance data for all our ads across all time. They were taking a long time so I let them run in the background while working on a spreadsheet somewhere else.
A little while later I hear some murmuring. Apparently the whole network was down. The engineering team was frantically trying to find the cause of the problem. Eventually, the CTO approaches my desk. "Were you running some queries on the database?" "Yes." "The query you ran was trying to generate billions of rows of results and locked up the entire database. Roughly three quarters of the ads in the world have been gone for almost two hours."
After the second time I did this, he showed me the MySQL EXPLAIN command and I finally twigged that some kinds of JOINs can go exponential.
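For anyone who hasn't been bitten by this yet, a rough sketch of that EXPLAIN habit; the table and column names below are made up, not the real schema:

# EXPLAIN shows the estimated rows examined per table before the query ever runs;
# an unindexed join column makes those estimates multiply out into the billions.
mysql -e "EXPLAIN SELECT a.ad_id, SUM(s.impressions)
          FROM ads a JOIN ad_stats s ON s.ad_id = a.ad_id
          GROUP BY a.ad_id" ad_network
# If the 'rows' column looks enormous, add an index (or a date range) before letting it near prod:
#   mysql -e "ALTER TABLE ad_stats ADD INDEX (ad_id)" ad_network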
Kudos to him for never revoking my access and letting me learn things the hard way. Also, if he worked for me I would have fired him.

Sounds like you appreciated that your boss gave you space to learn, and understood that you made an honest mistake, but you’d fire someone who made this mistake if they were working for you?
How do you square those two things internally?

It is good to punish people who give access to production databases to people who shouldn't have it. And the guy learning MySQL should not be given that access.
Taking down prod is always a symptom of a systemic failure. The person responsible for the systemic failure should see the consequences, not the person responsible for the symptom.


Except that he typoed, and instead ran `sudo -c wsgi -s /bin/bash`. What that does is instead of launching the (-s)hell as the uwsgi (-u)ser, it interprets the rest as a (-c)ommand. Now, `wsgi` is also a binary, and unfortunately, it does support a `-s` switch. It tries to open a socket at that address - or a filesystem path, as the case may be. Meaning that the command (under root) overwrote /bin/bash with 0 bytes.
Within minutes, jobs started failing, the machine couldn't be SSH'd into, but funnily enough, as /bin/bash was the login shell for all users, not even logging in via a tty through KVM worked.
Perhaps not the best story, but certainly a fun way to blow your foot off on a Monday morning :)


But thanks, I've just added another technique to my toolbox.






In this case if you noticed and still had a shell, you could just copy another shell over ("cp /bin/sh /bin/bash"), to at least get back to probably able to login, until you could pull a copy from another machine or backups.
They made the usual mistake of wanting to jettison all the developer tooling and start from scratch. So there was a special request to just install a base O/S, put accounts on the box, and set up a plain old apache webserver with a simple /var/www/index.html (this was well outside of how Amazon normally deployed webservers, which was all customized apache builds and software deployment pipelines, with a completely different file system layout).
They didn't specify what was to go into the index.html files on the servers.
So I just put "FOO" in the index.html and validated that hitting them on port 80 produced "FOO".
Then I handed off the allocated IPs to the networking team to set up a single VIP in a loadbalancer that had these two behind it.
The network engineer brought up the VIP on a free public IP address, as asked.
What nobody knew was that the IP had been a decommissioned IP for www.amazon.com from a year or two earlier, when there was some great network renumbering project, and it had been pointing at a cluster of webservers on the old internal fabric.
The DNS loadbalancers were still configured for that IP address and they were still in the rotation for www.amazon.com. And all they did as a health check was pull GET / and look for 200s and then based on the speed of the returns they'd adjust their weighting.
They found that this VIP was incredibly well optimized for traffic and threw almost all the new incoming requests over to these two webservers.
I learned of this when my officemate said "what is this sev1... users reporting 'foo' on the website..."
This is why you always keep it professional, kids...
I didn't play muds and my experience was mostly limited to helping him fix C programming bugs from time to time and fielding an occasional irate phone call from users who got my number off the whois data. But because of the programming help I had some kind of god-access to the mud.
One afternoon I had a ladyfriend over that I was presumably trying to impress and she'd asked about the mud. We hopped on and I summoned into existence a harmless raggedy-ann doll. That was kind of boring so I thought it would be fun to attach an NPC script to it -- I went through the list and saw something called a "Zombie Lord" which sounded promising. I applied it, and suddenly the doll started nattering on about the apocalypse and ran off. Turned out that it killed everyone it encountered and turned them all into zombie lords, creating an exponential wave of destruction that rapidly took over the whole game.
I found the mental image of some little doll running around bringing on the apocalypse to be just too funny -- until my phone started ringing. Ultimately the game state had to be restored from a backup from a day or two prior.
[I've posted a couple examples, -- I dunno which one is best, but people can vote. :)]


I used to work for a major university as a student systems admin. The only thing that was "student" about it was the pay-- I had a whole lab of Sun and SGI servers/desktops, including an INCREDIBLE 1TB of storage-- we had 7x Sun A1000's (an array of arrays) if memory serves.
Our user directories were about 100GB at the time. I had sourced this special tape drive that could do that, but it was fidgety (which is not something you want in a backup drive admittedly). The backups worked, I'd say, 3/4ths of the time. I think the hardware was buggy, but the vendor could never figure it out. Also, before you lecture me, we were very constrained with finances, I couldn't just order something else.
So I graduated, and as such had to find a new admin. We interviewed two people; one was very sharp and wore black jeans and a black shirt-- it was obvious he couldn't afford a suit, which would have been the correct thing to wear. The other candidate had a suit, and he was punching below his weight. Over my objections, suit guy gets hired.
Friday night, my last day of employment I throw tapes into the machine and start a full L0 backup which would take all weekend to complete.
Monday morning I get a panicked phone calls from my former colleagues. "The new guy deleted the home directories!"
The suit guy had, literally in his first few hours, destroyed the entire lab's research. All of it. Anyways, I said something to the effect of, "Is the light on the AIT array green or amber?"
"Green."
"You're some lucky sons of bitches. I'll be down in an hour and we'll straighten it out."

> "Is the light on the AIT array green or amber?"
Can you explain this? What is an AIT array?

Anyways, thank you for reading my silly story!



By all reports, he eventually became a well liked and good admin. I just don't think he knew that much Unix when he started.
This caused the mainframe to run out of memory, page out to disk, and thrash, bringing other users to a crawl. It took a while to figure out why relatively simple theorems were doing this.
Boyer and Moore explained to me that the internal representation of numbers was exactly that of their constructive mathematics theory. 2 was (ADD1 (ADD1 (ZERO))). 65536 was a very long string of CONS cells. I was told that most of their theorems involved numbers like 1.
They went on to improve the number representation in their prover, after which it could prove useful theorems about bounded arithmetic.
(I still keep a copy of their prover around. It's on Github, at [1]. It's several thousand times faster on current hardware than it was on the DECSYSTEM 2060.)
On Solaris killall kills all processes.
To make matters worse, I used the command on a server with a hung console -- so it didn't apply immediately, but later in the middle of the day the console got unhung and the main database server went down.
Explaining that this was an earnest error and not something malicious to the PHBs was somewhat ... delicate. "So why did you kill all the processes?" "Because I didn't expect it to do that." "But the command name is kill all?" ...
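For anyone who hasn't hit this one, a quick sketch of the difference, and the habit that avoids it on both systems (the process names here are made up):

# On Linux, killall matches processes by name:
#   killall httpd            # kills only processes named httpd
# On Solaris, killall is the helper that shutdown uses to kill *all* active processes.
# pgrep/pkill behave the same on both and force you to name what you're signalling:
pgrep -l java                          # list what would match first
pkill -TERM -f 'java .*OrderService'   # then signal only the ones you meant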


I made the same mistake nullc made once, as I was more accustomed to linux than solaris. That was after hours and the effect was immediate, but it was still a pretty jarring and memorable moment.

Still, it's not something common enough to deserve its own program.




Our startup was based in the garden office of a large house and the production server was situated in a cupboard in the same room.
The day I started was a cold January day and I’d had to cycle through flooded pathways to get to work that morning - so by the time I arrived my feet were soaked.
Once I’d settled down to a desk I asked if I could plug a heater in to dry my shoes. As we were in a garden office every socket was an extension cable so I plugged the heater in to the one under my desk.
A few minutes later I noticed that I couldn’t access the live site I’d been looking through - and others were noticing the same.
It turned out the heater I was using had popped the fuse on the socket. The extension I was using was plugged into the UPS used by the servers. So the battery had warmed my feet for a few minutes before shutting down and taking the servers down too.
And that’s how I brought production down within 3 hours of starting my first job in the web industry…
I'd written the code to reformat the mainframe database of menu items, prices, etc, to the format used by the store systems. I hadn't accounted for the notion that the mainframe would run out of disk space. When the communications jobs ran, a flock of 0-byte files were downloaded to the stores. When the POS systems booted with their 0-byte files, they were... confused. As were the restaurant managers. As were the level 1, level 2, vendor, and executive teams back at headquarters. Once we figured it out, we re-spun the files, sent them out, and the stores were back in business. I added a disk space check, and have done much better with checking my return codes ever since.
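The original ran on a mainframe, but the two checks translate to a short shell sketch; the paths, function name and threshold below are all made up:

# Refuse to generate the store files if the disk is nearly full,
# and refuse to ship them if they came out empty.
set -euo pipefail

EXPORT_DIR=/data/pos-export
MIN_FREE_KB=$((500 * 1024))                      # require at least 500 MB free

free_kb=$(df -kP "$EXPORT_DIR" | awk 'NR==2 {print $4}')
if [ "$free_kb" -lt "$MIN_FREE_KB" ]; then
    echo "not enough space in $EXPORT_DIR; refusing to generate store files" >&2
    exit 1
fi

generate_store_files "$EXPORT_DIR"               # hypothetical reformatting step
test -s "$EXPORT_DIR/menu.dat" || { echo "menu.dat is empty; aborting the download job" >&2; exit 1; }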
I was so excited to meet a legit/professional dev team the first day of my career.
I was paired with a Sr dev and sat in his cubicle so he could show me some backend update to a prod app with 20K internal users... "normally I'd run this on the dev server, but it's quick & easy so I'll just do it on prod"
...watched him crash the whole thing & struggle the rest of the day to try and bring it up. I just sat there in awe, especially as everyone came running over and the emails poured in, while the Sr Dev casually brushed it all aside. He was more interested in explaining how the mobile game Ingress worked.
I couldn’t get the tape drive door open so I looked around and saw a key next to the door.
That didn’t open the door either. I was stood there scratching my head when the double doors burst open and half a dozen sysadmins came running in like a SWAT team.
I was a bit surprised until I glanced down and noticed all the lights were off.
Yes the key that turned the power off had been left in the machine.
Good news is I was already planning on restoring the test database from the production backup, so I had the database up in under 45 minutes (slower than it should have been because Oracle's docs were flat out wrong).
A more senior engineer told me he was impressed by how quickly I got things running again; apparently in the Bad Old Days (i.e. a year before I started) the database went down and, while everybody was pretty sure there were backups somewhere, nobody was sure where; customer interactions were tracked by pen and paper for almost 3 business days while this was figured out.
Recovery took 30 minutes (Through Heroku support as Heroku did not allow backups-via-replication outside their own backup system), but that was a very long 30 minutes.
Second worst was a cleanup system that removed old CloudFormation stacks automatically by only retaining the latest version of a specific stack. Deployed a canary version of an edge (Nginx+Varnish) stack for internal testing. Cleanup script helpfully removed the production stack entirely.
Fortunately, given the nature of media buys in that time, all placements were printed and faxed. My team sent me to my wedding rehearsal dinner and spent the next two days collecting printed orders and re-keying them into the system.
I am forever grateful to that team.

One night in bed I realized that if someone hit submit on the delete screen without filling in any criteria it would just delete the whole database.
Not a fun drive in.
Yes, we drove in in those days.

I have learned this from a very similar experience.




I.e., if there are no criteria, it could be sending a DELETE with no WHERE clause in SQL land.
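A sketch of that failure mode and one cheap guard, with a made-up table name (real code should build the criteria with parameterized queries rather than string concatenation):

#   DELETE FROM work_orders;                      <- empty criteria: every row goes
#   DELETE FROM work_orders WHERE order_id = 42;  <- what the form was meant to send
# Refuse to run at all when the generated criteria come back blank:
CRITERIA="$1"
if [ -z "$CRITERIA" ]; then
    echo "refusing to DELETE without criteria" >&2
    exit 1
fi
mysql -e "DELETE FROM work_orders WHERE $CRITERIA" app_db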


We got things restored and back online in a couple of hours. I let her go home afterwards; heh, she had suffered enough.




aaand it’s gone
I hit Shutdown, instead of restart. The server was in a colo a thousand miles away. Unfortunately at that time, it was a colo that didn't have overnight staff on site (on call yes, but not on site). Also we had no IPMI.
I had to page some poor dude at 1am to drive 30 minutes (each way) to the colo to push the "on" button. I felt terrible. (Small company, it was our one critical production server)
10 seconds later, my program finished... and everything snapped back to life.
Another time, I walked into a different, bigger lab, with 100 terminals... snooped around the system, saw that the compiler queue had about 40 minutes of entries... bumped the first one up a bit (the queue was set to lower priority than any of the users, which was a mistake)... it finished in 2 seconds, instead of 2 minutes...
15 minutes later, the queue was empty, 30 minutes after that the room was empty, because everyone had gotten their work done.

Sounds more like a "I brought up production" story...

sudo chown -R www-data:www-data [folder]
I’d made some changes and was ready to update the owner, only I was inside the folder that needed updating. In the moment I decided the correct way to refer to that folder was /
I noticed the command was taking far longer than usual to execute. I realised the mistake but by then the server was down with no way to bring it back up.
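For the record, the shape of the mistake and two habits that would have caught it (the path is made up, and --preserve-root is a GNU coreutils option, so check your platform):

#   sudo chown -R www-data:www-data /var/www/myapp   # intended
#   sudo chown -R www-data:www-data /                # typed, from inside that folder
sudo chown -R www-data:www-data .    # '.' already names the folder you're standing in
# GNU chown also accepts --preserve-root, which makes a recursive chown of '/' fail outright:
#   sudo chown -R --preserve-root www-data:www-data /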
TBH, my team was very gracious about it and the RCA focused purely on the events that occurred and how to never let it happen again. No blame game at all.

Which is how a PIR, PER or PCR should be. If you don't understand why someone makes a mistake, you can't avoid future mistakes.
Back in 2003 or so, I was in tech support for a company that used desktop computers running java applets to connect to a mainframe via Telnet (IBM Host-on-Demand IIRC). Most of the core business processes were handled by mainframe apps, which the company largely developed. I used to hang out in the data center with the mainframe guys who coded in COBOL all day.
On a Friday afternoon, I was working on testing deployment of an update to the java terminal client applet. Everything seemed to work fine in testing, and it was a minor update, so (idiot me) I went ahead and pushed it to the server.
Shortly after I pushed it out, the mainframe guys' phones started ringing with complaints that the mainframe was down. Then my phone started ringing. Then all of the phones started ringing.
Turns out, something I did in the update (I honestly can't remember the specifics now) reset every local user's mainframe connection information for the applet. Across the whole company. So as soon as they exited the applet, they couldn't get back in.
That was a fun weekend.
On another occasion I had a division operation happen on integers instead of floats, and the code was running on some hardware that steered antennas for radios on airplanes. Much time was spent by pilots flying in circles over LA while I gathered data and found the "oops". It was fixed by adding a period to an int literal.
On another occasion my machine learning demo API failed due to heavy load, but only when India's prime minister was looking at it.
Anyway, I hid that binary. In /etc, where they'd never think to look.
Gosh we do some dumb things, eh? LOL. That took a while to find a solution for, and no small amount of luck. The owner of the ISP walked back in the office a couple hours later and said "I heard you had some excitement?" I said, "Oh yes, it was pretty ugly for a bit." "Is it fixed now?" "Yup." "Carry on."
For sure thought my ass was fired and I'd only been on the job a month or so.



Ouch, indeed. We ended up getting lucky and found a workstation where someone had left themselves at a root prompt on another machine that had a shared NFS mount. This was before protection from this kind of attack, so we were able to create a setuid root script and run it on the main server to get root access to fix the broken passwd file.
Our next step was going to be rebooting the server. We were pretty sure that faced with a corrupt passwd file, SunOS would drop to single user mode. Never tested that theory. Glad we didn't have to, the server in question was a hack job as it was. Copied over (literally, as files) from a previous server, it wasn't even 100% in agreement with itself on its own hostname, so I always kinda wondered how it would react to any big changes.

I've seen it written *nix to grab Linux and Unix.

https://unix.stackexchange.com/questions/2342/why-is-there-a...
Doesn't explain why exactly the asterisk was put in that particular position. Maybe someone felt like it was odd to lead the word with an asterisk. :shrug:
My first job out of school, working away from home and learning the ropes of embedded software.
The office was using on-premises databases, email servers and the like, as was somewhat common at the time, but nothing much more than a few robustified PCs and some networking infra. We were having internet problems from being too far away from the exchange, so the telephone company was coming in to replace the exchange over the weekend, and everything was shut down on the Friday night.
Monday morning comes by and we boot things up again, but no connectivity… The office is dissolving into chaos as the phones were also down. British Telecom was summoned to return this very minute and figure it out!
An hour later a very flustered gentleman turns up and begins to debug a few sockets but finds them all dead. 1 minute later he is at the new exchange (that was inside our office), only to emerge from the room after 30 seconds looking extremely confused.
It turns out Dave, an extremely helpful chap who was in charge of some product final assembly, had turned up at the office as normal at 7am and thought he would helpfully uninstall the old exchange and throw it in the skip we had rented for just that purpose. A quick wander over to said skip found the exchange in there with a bunch of wiring - the helpful chap had really gone to town on this. Sadly, I was quick to identify that this was the new exchange, not the old one, simply by observing how fresh it looked, and the BT chap came over to confirm. Because of the damage that had been done to the wiring, it was not trivial to simply wire the old exchange back in, and so that was the end of office operations for a week.
A small company meeting was held where it was announced that “an error of judgement” had occurred and that we were to have some vacation - much of the in-office equipment was taken offsite to get temporary connectivity so that sales could continue whilst we vacationed. Internet remained terrible until I left that gig, now blamed on all the wire patches needed to get the office back online.
https://monzo.com/blog/2019/09/08/why-monzo-wasnt-working-on...
In summary, we were scaling up our production Cassandra data store and we didn't migrate/backfill the data properly which led to data being 'missing' for an hour.
In a typical Cassandra cluster, when scaled up, data moves around the ring a single node at a time. When you want to add multiple nodes, this can be an extremely time and bandwidth consuming process. There's a flag called auto_bootstrap which controls this behaviour. Our understood behaviour was that the node would not join the cluster until operators explicitly signalled for it to do so (and this is a valid scenario because, as an operator, you can potentially backfill data from backups for example). Unfortunately the flag was completely misunderstood when we originally changed the defaults many months prior to the scale up.
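The linked post-mortem has the full detail; as a rough sketch, this is the standard tooling that shows what a joining node is actually doing (config path and install details will differ):

# auto_bootstrap is usually absent from cassandra.yaml; absent means true, i.e. stream data before serving
grep -i auto_bootstrap /etc/cassandra/cassandra.yaml

# After adding a node, confirm it really streamed data rather than just claiming token ranges:
nodetool status      # the new node should go from UJ (joining) to UN (up/normal)
nodetool netstats    # shows whether bootstrap streaming actually ran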
Fortunately, we were able to detect data inconsistency within minutes of the original scale up and we were able to fully revert the ring to its original state within 2 hours (it took that long because we did not want to lose any new writes, so we had to carefully remove nodes in the reverse order that they came in and joined the ring).
Through a mammoth effort by the engineering team over two days, we were able to reconcile the vast majority of inconsistent data through the use of audit events.
This was a mega stressful day for everyone involved. On the plus side though, I've had a few emails telling me that the blog post has saved others from making a similar mistake.

I was always worried about something like this happening so only ever provisioned (via ansible) one server at a time. When the logs showed it was fully synced, we provisioned the next node. It could take two days to add 10 nodes but I always felt much safer
Narrator: Bind was not running.
Down goes a media organizations web site.
I was talking about what garbage our ethernet switches were (this was one of the earliest L3 switches-- full of blue wire bodges and buggy firmware), and how I'd already encountered a dozen different ways of crashing them.
While typing I started saying "and I bet if I send it an ICMP redirect from itself (typing) to itself (typing) it won't like that at all! (enter)" --- and over the cube partitions I hear the support desk phones start ringing. Fortunately it was a bit after the end of the day and it didn't take me long to reboot the switch.
I didn't actually expect it to kill it. I probably should have.


I was working in a different company back then and I was contacted by this consultancy that specialized in Google Cloud (which was always my personal favorite anyway). I was offered very handsome pay for what seemed to be just 4 days' worth of work. To me, it sounded really simple, like get in, migrate and get out.
After I signed the contract and everything, I got to know that the client had been promised a $200/mo budget and a very wrong technical solution proposal by an engineer in the consultancy, when what they were actually paying at the time was a multiple of what had been quoted. And to make matters more interesting, this guy just quit after realizing his mistake. And that's how I even got this project.
So, I went in, tried many cost effective combinations including various levels of caching and BAM!, the server kept going down. They had too much traffic for even something like Google's PaaS to hold (it has autoscaling and all the good stuff, but it would die even before it could autoscale!). Their WP theme wasn't even the best and made tons of queries for a single page load. Their MySQL alone cost them in the 1000s. So, I put them on a custom regular compute box, slapped some partial caching on bits they didn't need and managed to bring the cost to slightly higher than what they were paying with their previous cheap hosting company. All this led to a 4 hour downtime.
I apologized to them profusely and built them a CMS from scratch that held their traffic and more and dropped their cost to 1/4th of what their competitors are paying. Today, this client is one of my best friends. They went from "Fuck this guy" to "Can we offer you a CTO role?" :)
I make it sound like it's so easy, but it was almost a year-long fight, bundled with lots of humiliation for something I didn't do, just to earn their trust and respect. To this day, they don't know about the ex-consultant's screw up.
In retrospect, this downtime is the best thing that happened to me and helped me to understand how you handle such scenarios and what you should do and not to do. In such situations it is tempting to blame other people around you, but in the long term, it pays off if you don't and solve it yourself.
I wasn't oncall, and the oncall didn't have access to the query runner script -- it was on my laptop. So, oncall was desperately trying to fight a fire they couldn't put out while I slept like a baby .. that was a fun Monday morning meeting.
$ rm /etc/ld.so*


Took a long vacation weekend as my error proceeded to shut down all production due to network issues causing the AS/400 to freak out.
Can't run a conveyor belt, or robot, or sensor, or production line, or or or if your mainframe isn't working.

Fortunately, it was on a testing system, SNA continued to work, and the system was due to be rebooted on the weekend anyway, so it was not that bad.
Not myself, but a few years ago, a new coworker managed to accidentally delete all user accounts from our Windows domain while trying to "clean up" the Group Policies. Our backup solution, while working, was rather crappy, so we had to restore the entire domain controller (there only was the one), which took all day, even though it was not that big. Fortunately, most users took it rather well and decided to either take the day off (it was a Friday) or tidy up their desks and sort through the papers they had lying around. A few actually thanked us for giving them the opportunity to "actually get some work done".
Of course I blamed NetApp and called their tech support screaming for help with their OS "bug". After hours of troubleshooting we finally figured out that the NetApp OS upgrade had included a network performance optimization and it was now sending out packets fast enough to overflow the buffer on our gigabit Ethernet switch. The packet loss rate was huge. Fortunately we had a newer switch back in the office so after swapping that out and repairing some corrupt databases I was able to get production back on line. Didn't get any sleep that night though.
Well, in dev, the database was refreshed with prod data every night at midnight, so we never saw the bug in my code. I had a sign error in updating the customer's balance, so instead of lowering their balance by the payment amount, my code increased their balance. Geometric growth is an amazing thing. A few days later we had calls from angry customers because we had maxed out their credit cards. Miraculously, I was not fired. In retrospect, I think that it might have been because the manager would have then had to explain why he had not made sure there was adequate testing on something so central to the business.
So when it ran, it looked in a blank directory (defaulting to c:\windows\system32) for blank conditions: zip filetype matching blank, delete files over blank age.
These servers were rebuilt over a weekend, and the script was scheduled again, and broke these servers again, requiring another rebuild.
When I came in on Monday, I was told that the script had caused this carnage. I didn't believe them until I read the debug logs in horror.
Luckily the config was specific to a subset of servers, but they happened to be servers that were responsible for keeping police GPS radios functioning.
Suffice to say it now has a lot of defensive programming in it to test the config file and resulting config object before doing anything.
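The real script was on Windows, but the "refuse to run on blank config" idea looks roughly like this in shell; every name here is made up:

#!/usr/bin/env bash
set -euo pipefail

CONFIG_FILE=${1:?"usage: cleanup.sh <config-file>"}
source "$CONFIG_FILE"

# A blank TARGET_DIR is how "delete old zips" turns into "delete files in the system directory",
# so bail out loudly if any required setting is empty.
: "${TARGET_DIR:?TARGET_DIR is not set in $CONFIG_FILE}"
: "${FILE_GLOB:?FILE_GLOB is not set in $CONFIG_FILE}"
: "${MAX_AGE_DAYS:?MAX_AGE_DAYS is not set in $CONFIG_FILE}"
[ -d "$TARGET_DIR" ] || { echo "$TARGET_DIR is not a directory" >&2; exit 1; }

find "$TARGET_DIR" -maxdepth 1 -name "$FILE_GLOB" -mtime +"$MAX_AGE_DAYS" -print -delete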
One ISP router in production with 20k active connections... one "backup" router fresh from the box.
My job was to backup the production firmware and flash the config to the spare box.
The opposite happened and the customer support telephones lit up like a Christmas tree.
We didn't realize the issue until he admitted that he had run an update on the full user table (forgot a where clause) and every single email was now being funneled into his email account.

This wasn't a huge problem, but the Action Cable configuration (Rails' wrapper around websockets) logged the entire contents of the message to STDOUT. At a moderate scale, this combined with a memory leak bug in Docker that crashed our application every time one of our staff members tried to perform a routine action on our web app. This action resulted in a single log line of > 64kb, which Docker was unable to handle.
All of this would have been more manageable if it hadn't first surfaced while I was taxiing on a flight from Detroit to San Francisco (I was the only full time engineer). I managed to restart the application via our hosting provider's mobile web interface, and frantically instructed everyone to NOT TOUCH ANYTHING until I landed.
Best part is that I did the wiring in that building when it was built 5 years before that; I really should have realized it was there.
In eterm on my gentoo linux laptop with enlightenment desktop I typed: su - shutdown -h now
Because I was tired and I wanted to go to bed. Came back after brushing my teeth. F### laptops and linux! Screen still on. The thing didn’t shutdown!
Strange thing was: in the terminal something said it got a shutdown signal.
Then I realized I had shut down a remote server for a forum with 200k members.
It was on the server of an ISP employee, who happened to be a member of that site. All for free, so no remote support and no KVM switches. Went to bed and took a train early the next morning to fix it.

I use K8s and docker to run software on my server, but initiate these via SSH. I suppose CI is perhaps the modern approach, or what else is everyone using?

Does self DDOS count?
We worked for Flanders radio and television (the site was one of Flanders' biggest radio stations). The site was an AngularJS frontend with a CMS backend.
The 40x and 50x pages fetched content from the backend to show the relevant message (so editors can tweak it). The morning they started selling tickets for Tomorrowland, I deployed the frontend, breaking the JS so that it fetched a non-existent 50x page and looped doing this constantly. In a matter of seconds the servers were on fire and I was sitting sweating next to the operations people. Luckily they were very capable and were able to restore the peace quite quickly.
And also (other radio station) deleting the DB in production. And also (on a bank's DB2) my coworker changing the AMOUNT in all rows of cash plans instead of in 1 row (an OR and brackets, you know the kind).
My code reviewers didn't notice and we didn't have linting or warnings on that file, so I brought down production :)
At the time I was mortified, but in hindsight the fact that I was able to do that in the first place was really the issue, not my script.
So I crafted an .asp to do my maintenance.
Only I was calling CreateObject() in the for loop to get a new AdoDb.Connection for each of the array entries of the data.
That creaky IIS server crashed like the economy.
https://www.google.com/amp/s/techcrunch.com/2013/02/27/bug-i...
Customer fears were far worse than reality, but my management team (up to and including JeffB) were not amused.
I still think it was over-provisioned, but they told Ops to stop listening to me unless someone else agreed. Probably ran on the 50 machines till it was discontinued 10 or 15 years later, but I left so who knows.

However I’ve only ever heard stories where management lays the blame.

Setup was, mobile app -> detect beacon & ping web endpoint with customer-id+beacon-uuid -> WAF -> Web application -> Internal Firewall -> Kafka Cluster -> downstream applications/use cases
It was an experiment — I didn't have high expectations for the number of customers who'd opt in to sharing their location. The 3 node Kafka cluster was running in a non-production environment. Location feed was primarily used for determining flow rates through the airport which could then predict TSA wait times, provide turn by turn indoor navigation and provide walk times to gates and other POIs.
About a week in, the number of customers who enabled their location sharing ballooned and pretty soon we were getting very high chatty traffic. This was not an issue as the resource utilization on the application servers and especially the Kafka cluster was very low. As we learned more about the behavior of the users, movements and the application, mobile team worked on a patch to reduce the number of location pings and only transmit deltas.
One afternoon, I upgraded one of the Kafka nodes and, before I could complete the process, had to run to a meeting. When I came back about an hour later and started checking email, there were Sev-2/P-2 notifications being sent out due to a global slowdown of communications to airports and flight operations. For context, on a typical day the airline scheduled 5,000 flights. As time went on it became apparent that it was a Sev-1/P-1 that had caused a near ground stop of the airline, but the operations teams were unable to communicate or correctly classify the extent of the outage due to their internal communications also having slowed down to a crawl. I don't usually look into network issues, but logged into the incident call to see what was happening. From the call I gathered that a critical firewall was failing due to connections being maxed out and restarting the firewall didn't seem to help. I had a weird feeling — so, I logged into the Kafka node that I was working on and started the services on it. Not even 10 seconds in, someone on the call announced that the connections on the firewall were coming down, and another 60 seconds later the firewall went back to humming as if nothing had happened.
I couldn't fathom what had happened. It was still too early to determine if there was a relationship between the downed Kafka node and the firewall failure. The incident call ended without identifying a root cause, but teams were going to start on that soon. I spent the next 2 hours investigating, and the following is what I discovered. The ES/Kibana dashboard showed that there were no location events in the hour prior to me starting the node. Then I checked the other 2 nodes that are part of the Kafka cluster and discovered that, being a non-prod env, they had been patched during the previous couple of days by the IT-infra team and the Zookeeper and Kafka services didn't start correctly. Which meant the cluster was running on a single node. When I took it offline, the entire cluster was offline. I talked to the web application team who owned the location service endpoint and learned that their server was communicating with the Kafka cluster via the firewall that experienced the issue. Furthermore, we discovered that the Kafka producer library was set up to retry 3x in the event of a connection issue to Kafka. It became evident to us that the Kafka cluster being offline caused the web application cluster to generate an exponential amount of traffic and DDoS the firewall.
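As a rough sketch of the pre-maintenance checks that would have caught the single-node cluster, using stock Kafka tooling (the bootstrap address is made up, and older clusters may still need the --zookeeper form):

# Count the brokers that actually answer before taking one down:
kafka-broker-api-versions.sh --bootstrap-server kafka-1:9092 | grep -c ':9092'

# Any output here means some partitions have no spare replica left to fail over to:
kafka-topics.sh --bootstrap-server kafka-1:9092 --describe --under-replicated-partitions

# On the client side, retry storms are bounded by producer settings such as
# retries and retry.backoff.ms rather than left to hammer the network path.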
Looking back, there were many lessons learned from this incident beyond the obvious things like better isolation of non-prod and production envs. The affected firewall was replaced immediately and some of the connections were re-routed. Infra teams started doing better risk/dependency modeling of the critical infrastructure. On a side note, I was quite impressed by how well a single Kafka node performed and the amount of traffic it was able to handle. I owned up to my error and promptly moved the IoT infrastructure to cloud. In many projects that followed, these lessons were invaluable. Traffic modeling, dependency analysis, failure scenario simulation and blast radius isolation are etched into my DNA as a result of this incident.
Right?
The sense of dread that dawned on me as the former Navy SEAL turned network engineer (and later doctor) started sniffing around the switch I had just touched was palpable. Luckily for me, he kept my mistake quiet and fixed it quickly.
