Ask HN: Do you find working on large distributed systems exhausting?
181 points by wreath 5 hours ago | 148 comments

I've been working on large distributed systems for the last 4-5 years, with teams owning a few services or having different responsibilities to keep the system up and running. We run into very interesting problems due to scale (billions of requests per month for our main public APIs) and the large amount of data we deal with.
I think it has progressed my career and expanded my skills but I feel it's pretty damn exhausting to manage all this even when following a lot of the best-practices and working with other highly skilled engineers.
I've been wondering recently if others feel this kind of burnout (for lack of a better word). Is the expectation that your average engineer should now be able to handle all this?
but No, I fixed it :)
Among other things, I am team lead for a private search engine whose partner-accessible API handles roughly 500 million requests per month.
I used to feel powerless and stressed out by the complexity and the scale, because whenever stuff broke (and it always does at this scale), I had to start playing politics, asking for favors, or threatening people on the phone to get it fixed. Higher management would hold me accountable for the downtime even when the whole S3 AZ was offline and there was clearly nothing I could do except for hoping that we'll somehow reach one of their support engineers.
But over time, management's "stand on the shoulders of giants" brainwashing wore off so that they actually started to read all the "AWS outage XY" information that we forwarded to them. They started to actually believe us when we said "Nothing we can do, call Amazon!". And then, I found a struggling hosting company with almost compatible tooling and we purchased them. And I moved all of our systems off the public cloud and onto our private cloud hosting service.
Nowadays, people still hold me (at least emotionally) accountable for any issue or downtime, but I feel much better about it :) Because now it actually is within my circle of power. I have root on all relevant servers, so if shit hits the fan, I can fix things or delegate to my team.
Your situation sounds like you will constantly take the blame for other people's faults. I would imagine that to be disheartening and extremely exhausting.
My problems are all about convincing the company that I need 200 engineers to work on extremely large software projects before we hit a scalability wall. That wall might be 2 years in the future so usually it is next to impossible to convince anyone to take engineers out of product development. Even more so because working on this changes absolutely nothing for the end user, it is usually some internal system related to data storage or processing which can't cope anymore.
Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog. Immediately you have a million problems like data migration, reporting, data ingestion, making it work with all the related systems like search, recommendations, reviews and so on.
And even if you get the ball rolling you have to work across dozens of different teams which can be hard because naturally people resist change.
Why do large sites like Facebook, Amazon, Twitter and Instagram all essentially look the same after 10 years but some of them now have 10x the amount of engineers? I think they have so much data and so many dependencies between parts of the system that any fundamental change is extremely hard to pull off. They even cut back on features like API access. But I am pretty sure that most of them have rewritten the whole thing at least 3 times.
I used to work at a unicorn a few years ago, and this hits close to home. From 2016 to 2020 the pages didn't change a single pixel, yet we had 400 more engineers working on the code and three stack iterations: full-stack PHP, PHP backend + React SSR frontend, and Java backend + [redacted] SSR frontend (redacted because only two popular companies use this framework). All were rewrites, and the rewrites were justified because none of them was ever stable; the site was constantly going offline. However, each rewrite just added more bloat and failure points. At some point all three were running in tandem: PHP for legacy customers, another as the main stack, and another in an A/B test. (Yeah, it was a dysfunctional environment, and I obviously quit.)
What do you think management could have done better to make it not dysfunctional and keep people from quitting?
It seems to be the same story in fields like infrastructure maintenance, aircraft design (Boeing 737 MAX), and mortgage CDOs (2008). Was it always like this, or does the new crop of management not care until something explodes?
Higher management decided to migrate our proprietary, vendor-locked platform from one cloud provider to another. The majority of the migration fell on a single platform team that was constantly struggling with attrition.
Unfortunately, neither I nor our architects were able to explain to the higher-ups that we needed a bigger team and far more resources overall to pull that off.
Hope that someone that comes after me will be able to make the miracle happen.
I would have no idea how to coordinate 200 engineers. But then again, I have never worked on a project that truly needed 50+ engineers.
"Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog." Probably that's 4 friends in a basement, similar to the core Android team ;)
It's a whole different ballgame to build on top of an existing complex system already in production. It was made to satisfy the needs at the time it was built, but now it has to support new features, bug fixes, and existing features at scale, all while 50+ engineers avoid stepping on each other and breaking each other's code in the process. 4 friends in a basement will not achieve more than 50+ engineers in this scenario, even accounting for the communication overhead that comes with so many minds working on the same thing.
The visibility you will get after the capex when there’s a truly disastrous outage will be interesting.
The biggest hardware cost driver is that you need insane amounts of RAM so that you can mmap the bloom hash for the mapping from word_id to document_ids.
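A minimal sketch of why the RAM matters, assuming a plain bit-array Bloom filter keyed on word/document pairs (the actual index layout isn't described here, so the key format, hash choice, and parameters are all illustrative): every membership test probes a handful of random bit positions, so the whole array wants to be resident to avoid a page fault per lookup.

```python
import hashlib

def _bit_positions(num_bits, key, num_hashes):
    """Yield `num_hashes` pseudo-random bit positions for `key`."""
    for i in range(num_hashes):
        digest = hashlib.blake2b(key, salt=i.to_bytes(8, "little")).digest()
        yield int.from_bytes(digest[:8], "little") % num_bits

def bloom_add(buf, num_bits, key, num_hashes=7):
    for bit in _bit_positions(num_bits, key, num_hashes):
        buf[bit // 8] |= 1 << (bit % 8)

def bloom_contains(buf, num_bits, key, num_hashes=7):
    """False means definitely absent; True means possibly present.

    `buf` can be a bytearray or an mmap over an index file: either way,
    each lookup touches `num_hashes` scattered bytes, which is why the
    whole array needs to sit in RAM for this to be fast.
    """
    return all(buf[bit // 8] & (1 << (bit % 8))
               for bit in _bit_positions(num_bits, key, num_hashes))

# In production the buffer would be mmap'd from disk; a bytearray behaves
# identically for the lookup logic:
index = bytearray(1024)                 # 8192 bits
bloom_add(index, 8192, b"w42:d1001")    # hypothetical word_id:document_id key
```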
We don't use AWS because our use cases don't require that level of reliability and we simply cannot afford it. But if I needed to run a company whose revenue depended on its IT, I probably wouldn't argue about the AWS bill. For now, prepaid Hetzner + in-house works well enough, but I know what I can't offer my users at the click of a button!
I run two critical apps, one on-prem and one cloud. There is no difference in people cost, and the cloud service costs about 20% more on the infrastructure side. We went cloud because customer uptake was unknown and making capital investments didn’t make sense.
I’ve had a few scenarios where we’ve moved workloads from cloud to on-prem and reverse. These things are tools and it doesn’t pay to be dogmatic.
I wish I would hear this line more often.
So many things today are (pseudo-)religious: the right framework/language, cloud vs. on-prem, x vs. not-x.
Especially bad imho when somebody tries to tell you how you could do better with 'not x' instead of x you are currently using without even trying to understand the context this decision resides in.
But my point wasn't about how precisely the hardware is managed. My point was that with a large cloud, a mid-sized company has effectively NO SUPPORT. So anything that gives you more control is an improvement.
umm, what happens when one fails?
With large cloud my startup had excellent support. We negotiated a contract. That's how it works.
There are tradeoffs — cloud removes much of the physical security risks and gives you tools to help automated incident detection. Things like serverless functions let you build out security scaffolding pretty easily.
But in exchange you do have to give some trust. And I totally understand resistance there.
Doesn't cloud increase the physical security risks, rather than decrease/remove?
This one is text-only and used by influencers and brands to check which newspapers report about their events. As I said, it's internally used by a few partner companies who buy the API from my client and sell news alerts to their clients.
BTW, I'm hoping to one day build something similar as an open source search engine where people pay for the data generation and then effectively run their own ad-free Google clone, but so far interest has been _very_ low:
https://news.ycombinator.com/item?id=30374611 (1 upvote)
https://news.ycombinator.com/item?id=30361385 (5 upvotes)
EDIT: Out of curiosity I just checked and found my intuition wrong. The ImageRights API averages 316 rps = 819 million requests per month. So it's not that much bigger.
If the business can't afford to have downtime then they should be paying for enterprise support. You'll be able to connect to someone in < 10 mins and have dedicated individuals you can reach out to.
Previously: a 2k-employee company with the entire advertising back office on AWS.
Currently: >$1M/yr at AWS. You can get an idea of the scale and what is running here: https://www.youtube.com/playlist?list=PLf-67McbxkT6iduMWoUsh...
This sounds a bit arrogant. I think they found a better and overall cheaper solution.
The parent thread talks about how the business could not go down even with a triple AZ outage for S3, and I don't think it is arrogant to state they should be paying for enterprise support if that level of expectation is set.
>I think they found a better and overall cheaper solution.
Cheaper has to account not just for money but also for time: the time spent, regardless of department, to acquire the hosting company, migrate off of AWS, modify the code to work on their multi-private cloud, etc. I'd believe it if they're willing to say they did this, have been running for three years, and compiled the numbers in Excel. If you ask internally whether it was worth it, you'll commonly get a yes, because people staked their careers on it and want to have had a "successful" project.
The math doesn't work out in my experience with past clients. The scenarios that do work out: top 30 in the entire tech industry, significant GPU training, heavy egress bandwidth (CDN, video, assets), or businesses that are essentially selling the infrastructure itself (think Dropbox, Backblaze, etc.).
I'm sure someone will throw down some post where their cost, $x is less than $y at AWS, but that is such a tiny portion that if the cost is not >50% it isn't even worth looking at the rest of the math. The absolute total cost of ownership is much harder than most clickbait articles are willing to go into. I have not seen any developers talk about how it changes the income statement & balance sheet which can affect total net income and how much the company will lose just to taxes. One argument assumes that it evens out after the full amortization period in the end.
Here are just a handful of factors that get overlooked: supply chain delays, migration time, access to expertise, retaining staff, increased churn due to pager/on-call rotation, the opportunity cost of capital sitting in idle/spare inventory, and plenty more.
And yes, this won't be financially useful in every situation. But if the goal is to gain operational control, it's worthwhile nonetheless. That said, for a high-traffic API, you're paying through the nose for AWS egress bandwidth, so it is one of those cases where it also very much makes financial sense.
> If the business can't afford to have downtime then they should be paying for enterprise support.
It's simply stating that it's either cheaper for the business to have downtime, or cheaper to pay for premium support. Each business owner evaluates which it is for them.
If you absolutely can't afford downtime, chances are premium support will be cheaper.
For me, the most important metric would be time that me and my team spent fixing issues. And that went down significantly. After a year of everyone feeling burned out, now people can take extended vacations again.
One big issue for example was the connectivity between EC2 servers degrading, so that instead of the usual 1gbit/s they would only get 10mbit/s. It's not quite an outage, but it makes things painfully slow and that sluggishness is visible for end users. Getting reliable network speeds is much easier if all the servers are in the same physical room.
Not OP, but they're probably using Rook/MinIO.
One anti-pattern I've found is that most orgs ask a single team to handle on-call around the clock for their service. This rarely scales well, from a human standpoint. If you're getting paged at 2:00 in the morning on a regular basis you will start to resent it. There's not much you can do about that so long as only one team is responsible for uptime 24/7.
The solution is to hire operations teams globally and then set up follow-the-sun operations, whereby the people being paged are always naturally awake at that hour and can work normal eight-hour shifts. But this requires companies to, gasp, have specialized developers and specialized operators collaborate before allowing new feature work into production, to ensure that the operations teams understand what the services are supposed to do and can keep it all online. It requires (oh, the horror!) actually maintaining production standards, runbooks, and other documentation.
So naturally, many orgs would prefer to burn out their engineers instead.
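The routing logic itself is trivial; the hard part is the org structure. A sketch of follow-the-sun paging, with made-up regions and shift boundaries (all in UTC, purely illustrative):

```python
from datetime import datetime, timezone

# Hypothetical regions, each covering the hours its staff is awake.
SHIFTS = [
    ("EMEA", range(7, 15)),       # 07:00-14:59 UTC
    ("Americas", range(15, 23)),  # 15:00-22:59 UTC
]

def on_call_region(now=None):
    """Route the page to whichever region is in its working hours."""
    hour = (now or datetime.now(timezone.utc)).hour
    for region, hours in SHIFTS:
        if hour in hours:
            return region
    return "APAC"  # covers the remaining 23:00-06:59 UTC window
```

No one is ever paged at their 2:00 a.m.; every handoff lands inside someone's normal shift.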
Both of those can drive burn out. Personally, I find all that collaboration work very hard and stressful, so I work better in a situation where I get pages for the services I control; but that would change if pages were frequent and mostly related to dependencies outside of my control. It also helps to have been working in organizations that prioritize a working service over features. Getting frequent overnight issues that can't be resolved without third party effort that's not going to happen anytime soon is a major problem that I see reports of in threads like this.
I can also get behind a team that can manage the base operations issues like ram/storage/cpu faults on nodes and networking. The runbooks for handling those issues are usually pretty short and don't need much collaboration.
I think the anti-pattern is having one team responsible for another's burden. You want teams to both be responsible for fixing their own systems when they break, AND be empowered to build/fix their broken systems to minimize oncall incidents.
"But writing documentation is a waste of time because the code evolves so fast."
Yeah, I hear that, but there's also a lot of time lost: to people harried during their on-call and still exhausted for a week afterward, to training new people because the old ones burned out or just left for greener pastures, to maintaining old failed experiments because customers (perhaps at your insistence) still rely on them and backing them out would be almost as much work as adding them was, and so on.
That's not really moving fast. That's just flailing. You can actually go further faster if you maintain a bit of discipline. Yes, there will still be some "wasted" time, but it'll be a bounded, controlled waste like the ablative tiles on a re-entry vehicle - not the uncontrolled explosion of complexity and effort that seems common in many of the younger orgs building/maintaining such systems nowadays.
Yes, a million times yes. This is moving me. Where do I find a team that understands this wisdom?
Stress is often caused by a mismatch of what you feel responsible and accountable for and what you really control. The more you know the more you feel responsible for but you are rarely able to expand control as much or as fast as your knowledge. It helps to be very clear about where you have ultimate say (accountability) or control within some framework (responsibility) or simply know and contribute. Clear in your mind, others and your boss. Look at areas outside your responsibility with curiosity and willingness to offer support but know that you are not responsible and others need to worry.
When I started my career the engineers at our company were assigned a very specific part of the product that they were experts on. Usually there were 1 or 2 engineers assigned to a specific area and they knew it really well. Then we went Agile(tm) and the engineers were grouped into 6 to 9 person teams that were assigned features that spanned several areas of the product. The teams also got involved in customer interaction, planning, testing and documentation. The days when you could focus on a single part of the system and become really good at it were gone.
Next big change came when the teams moved from being feature teams to devops teams. None of the previous responsibilities were removed but we now became responsible also for setting up and running the (cloud) infrastructure and deploying our own software.
In some ways I agree that these changes have empowered us. But it is also, as you say, exhausting. Once I was simply a programmer; now I'm a domain expert, project manager, programmer, tester, technical writer, database admin, operations engineer, and so on.
If you look up articles about Team Topologies by Matthew Skelton and Manuel Pais, they outline a team structure that works for large, distributed systems.
On the flip side, in the olden days when one set of people churned out features and another set was handed a black box to run and made responsible for keeping it running, it was very hard to get the damn thing to work reliably, and the only recourse you often had was to "just be more careful", which often meant release aversion and multi-year release cycles.
Hence, some companies explored alternatives, found ways to make them work, wrote about their success but a lot of people copied only half of the picture and then complained that it didn't work.
Can you please share some details about what you think is missing from most "agile"/devops teams?
It begins with small requirements, such as coming up with a disaster recovery plan, only for it to be rejected because your stack must "automatically heal" and devs can't be trusted to restore a backup during an emergency.
Blink and you're implementing redundant networking (cross AZ route tables, DNS failover, SDN via gateways/load balancers), a ZooKeeper ensemble with >= 3 nodes in 3 AZs, per service health checks, EFS/FSX network mounts for persistent data that expensive enterprise app insists storing on-disk and some kind of HA database/multi-master SQL cluster.
... months and months of work because a 2 hour manual restore window is unacceptable. And when the dev work is finally complete after 20 zero-downtime releases over 6 months (bye weekend!) how does it perform? Abysmally - DNS caching left half the stack unreachable (partial data loss) and the mission critical Jira Server fail-over node has the wrong next-sequence id because Jira uses an actual fucking sequence table (fuck you Atlassian - fuck you!).
If only the requirement was for a DR run-book + regular fire drills.
Jira Server is deliberately cobbled by the sequence table + no Aurora support and now EOL (no security updates 1 year after purchase!). DC edition scales horizontally if you have 100k.
Jira in general is a poorly thought out product (looking at you customfield_3726!) but it's held in such a high regard by users it's impossible to avoid.
The second ten years of my career, I worked with (and continue to work on) much simpler systems, but the stack looks like this: React/Angular/Vue.js, Node.js/Spring Boot, MongoDB/MySQL/PostgreSQL, Elasticsearch, Redis, AWS (about a dozen services right here), Docker, Kubernetes. _This_ is exhausting.
When you spend so much time wrangling a zoo of commercial products, each with its own API and often own vocabulary for what should be industry standards (think TCP/IP, ANSI, ECMA, SQL), and being constantly obsoleted by competing "latest" products, that you don't have enough time to focus on code, then yes, it can be exhausting.
It wasn't that you technically couldn't choose another stack for a project, but to do so you had to justify the cost/benefit with hard data, and the data almost never bore out more benefit than cost.
The oncall was brutal. At some point I thought I should work on something else, perhaps even switch careers entirely. However this also forced us to separate user issues and system issues accurately. That’s only possible because we are a platform team. Since then I regained my love for distributed systems.
Another thing is, we had to cut down on the complexity - reduce number of services that talked to each other to a bare minimum. Weigh features for their impact vs. their complexity. And regularly rewrite stuff to reduce complexity.
Now Facebook being Facebook, valued speed and complexity over stability and simplicity. Specially when it comes to career growth discussions. So it’s hard to build good infra in the company.
Once you have a few lists some trends become clear and you can work with your manager to shift where you spend time.
Being forced to do things that absolutely did not make sense (CS-wise) was what I found most exhausting. Having no other way than writing shitty code or copying functionality into our app led me to an eventual burnout. My whole career felt pointless, as I was unable to apply any of the skills and expertise I had learned over all these years, because everything was designed in a complex way. Getting a single property into an internal API is not a trivial task and requires coordination across different teams, as there are a plethora of processes in place. However, I helped build that monstrous integration layer, and everything wrong with it is partly my doing. Hindsight is 20/20, and I now see there really was no other, better way to do it, which feels nice in a schadenfreude kind of way.
I sympathise with your point about not understanding what is expected of an average engineer nowadays. Should you take initiative and help manage things, are you allowed to simply write code and what should you expect from others were amongst my pain points. I certainly did not feel rewarded for going the extra mile, but somehow felt obliged because of my "senior" title.
I took therapy, worked on side projects and I'm now trying out a manager role. My responsibilities are pretty much the same, but I don't have to write code anymore. It feels empowering to close my laptop after my last Zoom meeting and not think about bugs, code, CI or merging tomorrow morning because it's release day tomorrow.
But hey, the grass is always greener on the other side! I think therapy was one of my life's best decisions after being put through the wringer. Perhaps it will help you as well!
After years of proving myself, earning trust, and strategic positioning, I am finally leading a system that will support millions of requests per second. I love my job, and this is the most intellectually stimulating work I have done in a long while.
I think this is far from the expectation of the average engineer. You can find many random companies with very menial and low stake work. However if you work at certain companies you sign up for this.
BTW, I don't think this is unreasonable. This is precisely why programmers get paid big bucks, definitely in the US. We have a set of skills that require a lot of talent and effort, and we are rewarded for it.
Bottom line this isn't for everyone, so if you feel you are done with it that's fair. Shop around for jobs and be deliberate about where you choose to work, and you will be fine.
This is the difference. Millions of things per second is a super hard problem to get right in any reality. Pulling this off with any technology at all is rewarding.
Most distributed systems are not facing this degree of realistic challenge. In most shops, the challenge is synthetic and self-inflicted. For whatever reason, people seem to think saying things like "we do billions of x per month" somehow justifies their perverse architectures.
I understand the value that developers bring to operational roles, and to some extent making developers feel the pain of their screwups is appropriate. But when DevOps is 80% Ops, you need a fundamentally different kind of developer.
There's an expectation that everyone is a night owl and that night time emergency work is fun, and that these fires are to be expected.
Finally, engineers seem to get this feeling of being important because they wake up and work at night. It's really a form of insanity.
There’s often not a lot of organizational pressure to change anything. So the status quo stays static. But the services change over time, so the status quo needs to change with them.
When getting anything done requires constant meetings, placing tickets, playing politics, and doing anything and everything to get other teams to accept that they need to work with you and prioritize your tasks so that you can get them done, you will burn out.
Lies, lies, and damn lies, I say!
Unless you have bright and experienced people at the top of a large distributed systems company, who have actually studied and built distributed systems at scale, your experience of working in such a company is going to suck, plain and simple. The only cure is a strong continuous learning culture, with experienced people around to guide and improve the others.
I feel very much like you do. I am nominally in grad school but really a full time engineer moonlighting as a student. We have no funding for a real development team. The app is coupled to things that constantly shift underfoot. I kind of feel like a single parent trying to keep a precocious toddler from accidentally dying.
It's not even clear how big your service is. You mention billions of requests per month. Every 1B requests/month translates to ~400 QPS, which isn't even that large. Like, that's single server territory. Obviously spikiness matters. I'd also be curious what you mean by "large amount of data".
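For reference, the arithmetic behind that conversion:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def avg_qps(requests_per_month):
    """Average queries per second implied by a monthly request count."""
    return requests_per_month / SECONDS_PER_MONTH

# 1B/month is ~386 QPS on average; even 10B/month is only ~3.9k QPS.
# Peak traffic is what actually sizes the fleet, so multiply by your
# observed peak-to-mean ratio.
```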
I said billions not one billion.
I guess what I find exhausting is the long feedback cycle. For example, writing a simple script that makes two calls to different APIs requires tons of wiring: telemetry, monitoring, logging, error handling, integrating with the two APIs, setting up the proper Kubernetes manifests, and setting up the required permissions and making them available to k8s. I find all of this exhausting. And we're not even talking about operating this thing yet (on-call, running into issues with the APIs owned by other teams, etc.).
It starts by watching Simple Made Easy by Rich Hickey. And then making every member of your team watch it. Seriously, it is the most important talk in software engineering.
https://www.infoq.com/presentations/Simple-Made-Easy/
Exhausting patterns:
- Mutable shared state
- distributed state
- distributed, mutable, shared state ;)
- opaque state
- nebulosity, soft boundaries
- dynamism
- deep inheritance, big objects, wide interfaces
- objects/functions which mix IO/state with complex logic
- code that needs creds/secrets/config/state/AWS just to run tests
- CI/CD deploy systems that don't actually tell you if they successfully deployed or not. I've had AWS task deploys that time out but actually worked, and ones that seemingly take, but destabilize the system.
Things that help me stay sane(r):
- pure functions
- declarative APIs/datatypes
- "hexagonal architecture" - stateful shell, functional core
- type systems, linting, autoformatting, autocomplete, a good IDE
- code does primarily either IO, state management, or logic, but minimal of the other ops
- push for unit tests over integration/system tests wherever possible
- dependency injection
- ability to run as much of the stack locally (in docker-compose) as possible
- infrastructure-as-code (terraform as much as possible)
- observability, telemetry, tracing, metrics, structured logs
- immutable event streams and reducers (vs mutable tables)
- make sure your team takes time periodically to refactor, design deliberately, and pay down tech debt.
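As a tiny illustration of the "stateful shell, functional core" item above (the order shape, discount rule, and `db` object are all invented for the example):

```python
def apply_discount(order, percent):
    """Functional core: pure pricing logic, trivially unit-testable."""
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    total = sum(item["price"] * item["qty"] for item in order["items"])
    return round(total * (1 - percent / 100), 2)

def handle_discount_request(order_id, percent, db):
    """Stateful shell: does the IO, delegates every decision to the core.

    `db` is a stand-in for whatever data-access layer you use; only this
    thin layer needs mocks or integration tests.
    """
    order = db.load_order(order_id)
    discounted = apply_discount(order, percent)
    db.save_total(order_id, discounted)
    return discounted
```

The core can be exercised by fast unit tests with plain dicts, which is what makes the "unit tests over integration tests" item achievable in practice.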
By integration/system tests, do you mean tests that you cannot run locally?
Typically on a "high scale" service spanning hundreds or thousands of servers you'll have to deal with problems like: "How much memory does this object consume?", "How many ms will adding this regex/class to the critical path cost?", "We need to add new integ/load/unit tests for X to prevent outage Y from recurring", and "I wish I could try new technique Z, but 90% of my time is occupied by upkeep".
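The first two questions can at least be ballparked locally. A rough sketch of the measurement approach (the object shape and regex are made up; this is not any particular service's tooling):

```python
import re
import sys
import timeit

# Shallow memory cost of an object; containers need a recursive walk
# over their elements for a true footprint.
doc = {"id": 123, "title": "example", "tokens": ["tok"] * 1000}
shallow_bytes = sys.getsizeof(doc) + sys.getsizeof(doc["tokens"])

# Microseconds per call for a regex you're considering adding to the
# critical path; at thousands of QPS per box this adds up.
pattern = re.compile(r"\b\w+@\w+\.\w+\b")
per_call_us = timeit.timeit(
    lambda: pattern.search("page ops@example.com about the alert"),
    number=100_000,
) / 100_000 * 1e6
```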
It can be immensely satisfying to flip to a low-scale, low-ops problem space and find that you can actually bang out 10x the features/impact when you're not held back by scale.
Source: Worked on stateful services handling 10 Million TPS, took a break to work on internal analytics tools and production ML modeling, transitioning back to high scale services shortly.
But your 'average engineer' is probably better served by asking themselves the question whether the system really needed to be that large and distributed rather than if working on them is exhausting. The vast bulk of the websites out there doesn't need that kind of overkill architecture, typically the non-scalable parts of the business preclude needing such a thing to begin with. If the work is exhausting that sounds like a mismatch between architecture choice and size of the workforce responsible for it.
If you're an average (or even sub-average) engineer in a mid-sized company, stick to what you know best and how to make that work to your advantage: KISS. A well-tuned non-distributed system with sane platform choices will outperform a distributed system put together by average engineers any day of the week, and will be easier to maintain and operate.
I think it has to do with the kind of engineer you are. Some engineers love iterating and improving such systems to be more efficient, more scalable, etc. But it can be limiting due to the slower release cycles, hyper focus on availability, and other necessary constraints.
In other organizations, individual teams have ICDs and SLAs for one or more micro-services and can therefore state they're meeting their interface requirements as well as capacity/uptime requirements. In these organizations, when a system problem occurs, someone who's less familiar with the internals of these services will have to debug complex interactions. In my experience, once the root-cause is identified, there will be one or more teams who get updated requirements - why not make them stakeholders at the system-level and expedite the process?
Could you share why you think that's true?
IMO that it's exactly the opposite - microservices have potential to simplify operations and processes (smaller artifacts, independent development/deployments, isolation, architectural boundaries easier to enforce) but when it comes to code and their internal architecture - they are always more complex.
If you take microservices and merge them into a monolith - it will still work, you don't need to add code or increase complexity. You actually can remove code - anything related to network calls, data replication between components if they share a DB, etc.
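To illustrate the code that disappears in that merge (a toy sketch; the URL, `http` client, and repository object are all hypothetical):

```python
# As a separate service, component A reaches the users service over the
# network, with all the ceremony that implies:
def get_user_remote(user_id, http):
    resp = http.get(f"https://users.internal/api/users/{user_id}", timeout=2)
    resp.raise_for_status()   # plus retries, backoff, serialization, auth...
    return resp.json()

# Merged into one process, the same dependency is a plain function call,
# and the timeout/retry/serialization code simply ceases to exist:
def get_user_local(user_id, user_repo):
    return user_repo.find(user_id)
```

The monolith version also fails in only one way (an exception), whereas the remote version adds timeouts, partial failures, and stale data as new failure modes.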
Also, for junior team members a lot of this stuff works via magic, because they can't yet see where the boundaries are or don't understand all the automagic configuration.
Also, the number of "works on my machine" issues with Docker is staggering, even when the developers' laptops are from the same batch of identically imaged machines.
So, if you are handling 10 billion requests per month, that would average out to about 4k per second.
Are these API calls data/compute intensive, or is this more pedestrian data like logging or telemetry?
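A quick sanity check on that arithmetic (assuming a 30-day month and perfectly uniform traffic, which real traffic never is; peak rates will be considerably higher):

```python
# Convert a monthly request volume to an average requests-per-second rate.
# Assumes a 30-day month and uniform traffic -- a simplifying assumption.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def avg_qps(requests_per_month: float) -> float:
    return requests_per_month / SECONDS_PER_MONTH

print(round(avg_qps(10e9)))  # 10 billion/month -> 3858 req/s, i.e. ~4k
```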
Any time I see someone having a rough time with a distributed system, I ask myself if that system had to be distributed in the first place. There is usually a valuable lesson to be learned by probing this question.
I've been in those situations. My solution was to ensure that enough effort went into systematically resolving long-known issues, in a way that not only solves them but also reduces the number of new, similar issues. If the strategy is instead predominantly firefighting, with 'no capacity' available for working on longer-term solutions, there is no end in sight unless/until you lose users or requests.
I am curious what the split is of problems being related to:
1. error rates, how many 9s per end-user-action, and per service endpoint
2. performance, request (and per-user-action) latency
3. incorrect responses, bugs/bad-data
4. incorrect responses, stale-data
5. any other categories
Another strategy that worked well was not to fix the problems reported but instead to fix the problems known. This is like the physicist looking for keys under the streetlamp instead of where they were dropped. Tracing a bug report to a root cause and then fixing it is very time consuming. That of course needs to continue, but if sufficient effort is put into resolving known issues, such as latency or error rates of key endpoints, it can have an overall lifting effect, reducing problems in general.
A specific example was how effort into performance was toward average latency for the most frequently used endpoints. I changed the effort instead to reduce the p99 latency of the worst offenders. This made the system more reliable in general and paid off in a trend to fewer problem reports, though it's not easy/possible to directly relate one to the other.
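For what "p99" means concretely, here is a minimal pure-stdlib sketch using the nearest-rank method (one of several common percentile definitions; libraries like NumPy offer interpolating variants):

```python
import math

def p99(latencies_ms):
    """Nearest-rank p99: the smallest sample such that at least
    99% of observations are less than or equal to it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# 1000 samples: 985 fast requests plus 15 slow outliers.
samples = [10] * 985 + [500] * 15
print(p99(samples))  # -> 500: the tail latency the outliers cause
```

The point of targeting p99 rather than the average is visible here: the mean of those samples is barely above 10 ms, while 1 in 100 users sees 500 ms.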
I expect you can hit burnout building services and systems at any scale, and that's more reflective of the local environment: the job and the day-to-day, the people you work with, formalized progression and career development conversations, the attitude to taking time off and decompressing, attitudes to on-call, compensation, and other facets.
That said, mental health and well-being is real and IMO needs to be taken very seriously, if you’re feeling burnout, figuring out why and fixing that is critical. There have been too many tragedies both during COVID and before :-(
Reading other comments from the thread, I see similar frustrations from teams I partner with. How to employ patterns like contract tests, hypothesis tests, test doubles, or shape/data checks (etc.) typically gets conflated with system testing. Teams often disagree on the boundaries of the system, start leaning towards system testing, and end up adding complexity in tests that could be avoided.
My thought is that I see the desire to control more scope presenting itself in test. I typically find myself doing some bounded context exercises to try to hone in on scope early.
"Back to the 70's with Serverless" is a good read:
https://news.ycombinator.com/item?id=25482410
The cloud basically has the productivity of a mainframe, not a workstation or PC. It's big and clunky.
I quote it in my own blog post on distributed systems
http://www.oilshell.org/blog/2021/07/blog-backlog-2.html
https://news.ycombinator.com/item?id=27903720 - Kubernetes is Our Generation's Multics
Basically I want basic shell-like productivity -- not even an IDE, just reasonable iteration times.
At Google I saw the issue where teams would build more and more abstraction and concepts without GUARANTEES. So basically you still have to debug the system with shell. It's a big tower of leaky abstractions. (One example is that I had to turn up a service in every data center at Google, and I did it with shell invoking low level tools, not the abstractions provided)
Compare that with the abstraction of a C compiler or Python, where you rarely have to dip under the hood.
IMO Borg is not a great abstraction, and Kubernetes is even worse. And that doesn't mean I think something better exists right now! We don't have many design data points, and we're still learning from our mistakes.
Maybe a bigger issue is incoherent software architectures. In particular, disagreements on where authoritative state is, and a lot of incorrect caches that paper over issues. If everything works 99.9% of the time, multiply those probabilities together and you end up with a system that requires A LOT of manual work to keep running.
So I think the cloud has to be more principled about state and correctness in order not to be so exhausting.
If you ask engineers working on a big distributed system where the authoritative state in their system is stored, then I think you will get a lot of different answers...
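The "multiply those probabilities" point compounds quickly. A sketch, assuming a hypothetical request path that touches N components, each independently 99.9% reliable:

```python
# End-to-end success of a chain of N components that each succeed
# 99.9% of the time, assuming independent failures (a simplification).
for n in (1, 10, 50, 100):
    print(f"{n:3d} components -> {0.999 ** n:.3f} end-to-end success")
```

At 100 components the chain succeeds only about 90.5% of the time: nearly one request in ten fails somewhere, even though every individual piece looks fine on its own dashboard.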
As you said, a benefit of large distributed systems is that it's usually a shared responsibility, with different teams owning different services.
The exhaustion comes into place when those services are not really independent, or when the responsibility is not really shared, which in turn is just a worse version of a typical system maintained by sysadmins.
One thing that helps is bringing the DevOps culture into the company, but the right way. It's not just about "oh cool, we are now agile and deploy a few times a day"; it all comes down to shared responsibility.
Sometimes it feels like everyone is focused on eventually working with Google-scale systems and following best practices that are more relevant at that scale, but you can pick your own path.
There are good reasons for wanting multiple services talking through APIs. Perhaps you have a Linux scheduler that is marshalling test suites running on Android, Windows, macOS and iOS?
If all these systems originate from a single repository, preferably with the top level written in a dynamic language that runs from its own source code, then life can be much easier. Being able to change multiple parts of the infrastructure in a single commit is a powerful proposition.
You also stand a chance of being able to model your distributed system locally, maybe even in a single Python process, which can help when you want to test new infrastructure ideas without needing the whole distributed environment.
Your development velocity will be faster and less painful. Slow and painful changes are what burn people out and grind progress to a halt.
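As a sketch of what "model your distributed system in a single Python process" might look like (all names here are hypothetical), each service becomes a plain object and the network becomes a function call, so orchestration logic can be tested with no infrastructure at all:

```python
# Hypothetical in-process model of a scheduler fanning test suites out to
# platform-specific runners. Swapping the method calls for real RPC later
# leaves the orchestration logic unchanged and locally testable.
class Runner:
    def __init__(self, platform: str):
        self.platform = platform

    def run_suite(self, suite: str) -> dict:
        # A real runner would execute tests; the model just records intent.
        return {"platform": self.platform, "suite": suite, "status": "passed"}

class Scheduler:
    def __init__(self, runners: list):
        self.runners = runners

    def dispatch(self, suite: str) -> list:
        return [r.run_suite(suite) for r in self.runners]

sched = Scheduler([Runner(p) for p in ("linux", "android", "windows")])
results = sched.dispatch("smoke-tests")
print([r["platform"] for r in results])  # ['linux', 'android', 'windows']
```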
This is a major source of frustration. Having to touch multiple repositories and syncing and waiting for their deployment/release (if it's a library) just to add a small feature easily wastes a few hours of the day and most importantly drains cognitive ability by context switching.
I should say, I've been a sysadmin, SRE, software engineer, open source creator, maintainer, founder and CEO. Worked at Google, bootstrapped startups, VC-funded companies, etc. My general feeling: the cloud is too complex, and I'm tired of waiting for others to fix it.
Is the 'r' in 'simpler' intentional? In what way are the building blocks simpler than simple blocks?
That said, I do think there’s a psychological overhead when working on something that serves high levels of production traffic. The stakes are higher (or at least, they feel that way), which can affect different people in different ways. I definitely recognise your feeling of exhaustion, but I wonder if it maybe comes from a lack of feeling “safe” when you deploy - either from insufficient automated testing or something else.
(For context - I’m an SRE who has worked in quite a few places exactly like this)
I don't develop stuff that runs billions of queries. More like thousands.
It is, however, important infrastructure on which thousands of people around the world rely, and in some cases it's not hyperbole to say that lives depend on its integrity and uptime.
One fairly unique feature of my work, is that it's almost all "hand-crafted." I generally avoid relying on dependencies out of my direct control. I tend to be the dependency, on which other people rely. This has earned me quite a few sneers.
I have issues...
These days, I like to confine myself to frontend work, and avoid working on my server code, as monkeying with it is always stressful.
My general posture is to do the highest Quality work possible; way beyond "good enough," so that I don't have to go back and clean up my mess. That seems to have worked fairly well for me, in at least the last fifteen years, or so. Also, I document the living bejeezus[0] out of my work, so, when I inevitably have to go back and tweak or fix, in six months, I can find my way around.
[0] https://littlegreenviper.com/miscellany/leaving-a-legacy/
My frontend work is native Swift work, using the built-in Apple frameworks (I ship classic AppKit/UIKit/WatchKit, using storyboards and MVC, but I will be moving onto the newer stuff, as it matures).
My backend work has chiefly been PHP. It works quite well, but is not where I like to spend most of my time.
I started out as a systems administrator and it's evolved into doing that more and faster. The tooling helps me get there, but I did have to learn how to give better estimates.
Everything has gotten too complicated and slow.
Exhaustion/burnout isn't uncommon but without more context it's hard to say if it's a product of the type of work or your specific work environment.
Even then, I would ask you to be more specific. I have a normal 40-hour-a-week uni job as a sysadmin, though it typically takes somewhat more or less (hey, sometimes I can get it done in 35, sometimes it's 50 hours). However, for the last several years we have been so shorthanded, faculty-wise, that I teach (at a minimum) two senior-level computer science classes every semester (I was a professor at another uni). About mid-semester, things will break, professors will make unreasonable demands for new systems/software/architecture, and I find myself doing (again, at a minimum) 80 hours a week. On the other hand, I am not exhausted, as I enjoy teaching quite a bit, and I have been a sysadmin for many years and also enjoy that work.
For my part, I love working at global scale on highly distributed systems, and find deep enjoyment in diving into the complexity that brings with it. What I didn’t enjoy was dealing with unrealistic expectations from management, mostly management outside my chain, for what the operations team I led should be responsible for. This culminated in an incident I won’t detail, but suffice to say I hadn’t left the office in more than 72 hours continuous, and the aftermath was I stopped giving a shit about what anyone other than my direct supervisor and my team thought about my work.
It’s not limited to operations or large systems, but every /job/ dissatisfaction I’ve had has been in retrospect caused by a disconnect between what I’m being held accountable for vs what I have control over. As long as I have control over what I’m responsible for, the complexity of the technology is a cakewalk in comparison to dealing with the people in the organization.
Now I’ve since switched careers to PM and I’ve literally taken on the role of doing things and being held responsible for things I have no control over and getting them done through influencing people rather than via direct effort. Pretty much the exact thing that made my life hell as an engineer is now my primary job.
Making that change made me realize a few things that helped actually ease my burn out and excite me again. Firstly, the system mostly reflects the organization rather than the organization reflecting the system. Secondly, the entire cultural balance in an organization is different for engineers vs managers, which has far-reaching consequences for WLB, QoL, and generally the quality of work. Finally, I realized that if you express yourself well you can set boundaries in any healthy organization which allows you to exert a sliding scale of control vs responsibility which is reasonable.
My #1 recommendation for you OP is to take all of your PTO yearly, and if you find work intruding into your time off realize you’re not part of a healthy organization and leave for greener pastures. Along the way, start taking therapy because it’s important to talk through this stuff and it’s really hard to find people who can understand your emotional context who aren’t mired in the same situation. Most engineers working on large scale systems I know are borderline alcoholics (myself too back then), and that’s not a healthy or sustainable coping strategy. Therapy can be massively helpful, including in empowering you to quit your job and go elsewhere.
Op said "billions of requests per month".
That's ~thousands of qps.
A key part of scaling at an org-level is continuously simplifying systems.
At a certain level of maturity, it's common for companies to introduce a horizontal infra team (that may or may not be embedded in each vertical team).
When things aren't a tire fire, people will still ask you to do too much work. The only way to deal with it without stress is to create a funnel.
Require all new requests come as a ticket. Keep a meticulously refined backlog of requests, weighted by priorities, deadlines and blockers. Plan out work to remove tech debt and reduce toil. Dedicate time every quarter to automation that reduces toil and enables development teams to do their own operations. Get used to saying "no" intelligently; your backlog is explanation enough for anyone who gets huffy that you won't do something out of the blue immediately.
Org and people are not.
But it just felt like a breath of fresh air
All code in the same repository: UI, back-end, SQL, MVC style. Fast from feature request to delivery in production. Change, test, fix bugs, deploy. We were happy, and the customers were too.
No cloud apps, no buckets, no secrets, no OAuth, little configuration, no Docker, no microservices, no proxies, no CI/CD. It does look like somewhere along the way we overcomplicated things.
If everybody would get a ticket number and do requests when they're supposed to do them, we wouldn't need load balancers.
The problem is that this division of the code base is really hard. It is really hard to find the time and the energy to properly section your code base in proper domains and APIs. Especially with the constantly moving target of what needs to be delivered next. Even in a monorepo it is exhausting.
Now, put on top of that the added burden brought by a distributed system (deployment, protocol, network issues, etc) and you have something that becomes even more taxing on your energy.