Ask HN: Do you find working on large distributed systems exhausting?
181 points by wreath 5 hours ago | 148 comments

I've been working on large distributed systems for the last 4-5 years, with teams owning a few services or having different responsibilities to keep the system up and running. We run into very interesting problems due to scale (billions of requests per month for our main public APIs) and the large amount of data we deal with.
I think it has progressed my career and expanded my skills but I feel it's pretty damn exhausting to manage all this even when following a lot of the best-practices and working with other highly skilled engineers.
I've been wondering recently if others feel this kind of burnout (for lack of a better word). Is the expectation that your average engineer should now be able to handle all this?
but No, I fixed it :)
Among other things, I am team lead for a private search engine whose partner-accessible API handles roughly 500 million requests per month.
I used to feel powerless and stressed out by the complexity and the scale, because whenever stuff broke (and it always does at this scale), I had to start playing politics, asking for favors, or threatening people on the phone to get it fixed. Higher management would hold me accountable for the downtime even when the whole S3 AZ was offline and there was clearly nothing I could do except for hoping that we'll somehow reach one of their support engineers.
But over time, management's "stand on the shoulders of giants" brainwashing wore off so that they actually started to read all the "AWS outage XY" information that we forwarded to them. They started to actually believe us when we said "Nothing we can do, call Amazon!". And then, I found a struggling hosting company with almost compatible tooling and we purchased them. And I moved all of our systems off the public cloud and onto our private cloud hosting service.
Nowadays, people still hold me (at least emotionally) accountable for any issue or downtime, but I feel much better about it :) Because now it actually is within my circle of power. I have root on all relevant servers, so if shit hits the fan, I can fix things or delegate to my team.
Your situation sounds like you will constantly take the blame for other people's faults. I would imagine that to be disheartening and extremely exhausting.
My problems are all about convincing the company that I need 200 engineers to work on extremely large software projects before we hit a scalability wall. That wall might be 2 years in the future so usually it is next to impossible to convince anyone to take engineers out of product development. Even more so because working on this changes absolutely nothing for the end user, it is usually some internal system related to data storage or processing which can't cope anymore.
Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog. Immediately you have a million problems like data migration, reporting, data ingestion, making it work with all the related systems like search, recommendations, reviews and so on.
And even if you get the ball rolling you have to work across dozens of different teams which can be hard because naturally people resist change.
Why do large sites like Facebook, Amazon, Twitter and Instagram all essentially look the same after 10 years but some of them now have 10x the amount of engineers? I think they have so much data and so many dependencies between parts of the system that any fundamental change is extremely hard to pull off. They even cut back on features like API access. But I am pretty sure that most of them have rewritten the whole thing at least 3 times.
I used to work at a unicorn a few years ago, and this hits close to home. From 2016 to 2020 the pages didn't change a single pixel, yet we had 400 more engineers working on the code and three stack iterations: full-stack PHP, PHP backend + React SSR frontend, and Java backend + [redacted] SSR frontend (redacted because only two popular companies use this framework). All were rewrites, and the rewrites were justified because none of them was ever stable; the site was constantly going offline. However, each rewrite just added more bloat and failure points. At some point all three were running in tandem: PHP for legacy customers, another as the main stack, and another in an A/B test. (Yeah, it was a dysfunctional environment, and I obviously quit.)
What do you think management could have done better to make it not dysfunctional and keep people from quitting?
It seems to be the same story in fields like infrastructure maintenance, aircraft design (Boeing 737 MAX), and mortgage CDOs (2008). Was it always like this, or does the new crop of management not care until something explodes?
Higher management decided to migrate our proprietary, vendor-locked platform from one cloud provider to another. The majority of the migration fell on a single platform team that was constantly struggling with attrition.
Unfortunately, neither I nor our architects were able to explain to the higher-ups that we needed a bigger team and far more resources overall to pull that off.
Hope that someone that comes after me will be able to make the miracle happen.
I would have no idea how to coordinate 200 engineers. But then again, I have never worked on a project that truly needed 50+ engineers.
"Imagine that you are Amazon and for some scalability reason you have to rewrite the storage layer of your product catalog." Probably that's 4 friends in a basement, similar to the core Android team ;)
It's a whole different ballgame to build on top of an existing complex system already in production. It was made to satisfy the needs at the time it was built, but now it has to support new features, bug fixes, and existing features at scale, all while 50+ engineers avoid stepping on each other and breaking each other's code in the process. 4 friends in a basement will not achieve more than 50+ engineers in this scenario, even accounting for the communication overhead that comes with so many minds working on the same thing.
The visibility you will get after the capex when there’s a truly disastrous outage will be interesting.
The biggest hardware cost driver is that you need insane amounts of RAM so that you can mmap the bloom hash for the mapping from word_id to document_ids.
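A minimal sketch of why the RAM matters, assuming a plain bit-array Bloom filter keyed on word/document pairs (the actual index layout isn't described here, so the key format, hash choice, and parameters are all illustrative): every membership test probes a handful of random bit positions, so the whole array wants to be resident to avoid a page fault per lookup.

```python
import hashlib

def _bit_positions(num_bits, key, num_hashes):
    """Yield `num_hashes` pseudo-random bit positions for `key`."""
    for i in range(num_hashes):
        digest = hashlib.blake2b(key, salt=i.to_bytes(8, "little")).digest()
        yield int.from_bytes(digest[:8], "little") % num_bits

def bloom_add(buf, num_bits, key, num_hashes=7):
    for bit in _bit_positions(num_bits, key, num_hashes):
        buf[bit // 8] |= 1 << (bit % 8)

def bloom_contains(buf, num_bits, key, num_hashes=7):
    """False means definitely absent; True means possibly present.

    `buf` can be a bytearray or an mmap over an index file: either way,
    each lookup touches `num_hashes` scattered bytes, which is why the
    whole array needs to sit in RAM for this to be fast.
    """
    return all(buf[bit // 8] & (1 << (bit % 8))
               for bit in _bit_positions(num_bits, key, num_hashes))

# In production the buffer would be mmap'd from disk; a bytearray behaves
# identically for the lookup logic:
index = bytearray(1024)                 # 8192 bits
bloom_add(index, 8192, b"w42:d1001")    # hypothetical word_id:document_id key
```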
We don't use AWS because our use cases don't require that level of reliability and we simply cannot afford it. But if I needed to run a company whose revenue depended on its IT, I probably wouldn't argue about the AWS bill. For now, prepaid Hetzner + in-house works well enough, but I know what I can't offer my users at the click of a button!
I run two critical apps, one on-prem and one cloud. There is no difference in people cost, and the cloud service costs about 20% more on the infrastructure side. We went cloud because customer uptake was unknown and making capital investments didn’t make sense.
I’ve had a few scenarios where we’ve moved workloads from cloud to on-prem and reverse. These things are tools and it doesn’t pay to be dogmatic.
I wish I would hear this line more often.
So many things today are (pseudo-)religious: the right framework/language, cloud vs. on-prem, x vs. not-x.
Especially bad imho when somebody tries to tell you how you could do better with 'not x' instead of x you are currently using without even trying to understand the context this decision resides in.
But my point wasn't about how precisely the hardware is managed. My point was that with a large cloud, a mid-sized company has effectively NO SUPPORT. So anything that gives you more control is an improvement.
umm, what happens when one fails?
With large cloud my startup had excellent support. We negotiated a contract. That's how it works.
There are tradeoffs — cloud removes much of the physical security risks and gives you tools to help automated incident detection. Things like serverless functions let you build out security scaffolding pretty easily.
But in exchange you do have to give some trust. And I totally understand resistance there.
Doesn't cloud increase the physical security risks, rather than decrease/remove?
This one is text-only and used by influencers and brands to check which newspapers report about their events. As I said, it's internally used by a few partner companies who buy the API from my client and sell news alerts to their clients.
BTW, I'm hoping to one day build something similar as an open source search engine where people pay for the data generation and then effectively run their own ad-free Google clone, but so far interest has been _very_ low:
https://news.ycombinator.com/item?id=30374611 (1 upvote)
https://news.ycombinator.com/item?id=30361385 (5 upvotes)
EDIT: Out of curiosity I just checked and found my intuition wrong. The ImageRights API averages 316 rps = 819 million requests per month. So it's not that much bigger.
If the business can't afford to have downtime then they should be paying for enterprise support. You'll be able to connect to someone in < 10 mins and have dedicated individuals you can reach out to.
Previously: a 2k-employee company with the entire advertising back office on AWS.
Currently: >$1M/yr at AWS. You can get an idea of the scale and what is running here: https://www.youtube.com/playlist?list=PLf-67McbxkT6iduMWoUsh...
This sounds a bit arrogant. I think they found a better and overall cheaper solution.
The parent thread talks about how the business could not go down even with a triple AZ outage for S3, and I don't think it is arrogant to state they should be paying for enterprise support if that level of expectation is set.
>I think they found a better and overall cheaper solution.
Cheaper has to account not just for money but also for time: the time spent, regardless of department, to acquire the hosting company, migrate off of AWS, modify the code to work on their multi-private cloud, etc. I'd believe it if they're willing to say they did this, have been running for three years, and compiled the numbers in Excel. If you ask internally whether it was worth it, you'll commonly get a yes, because people staked their careers on it and want to have had a "successful" project.
The math doesn't work out in my experience with past clients. The scenarios that do work out: top 30 in the entire tech industry, significant GPU training, heavy egress bandwidth (CDN, video, assets), or businesses that are essentially selling the infrastructure itself (think Dropbox, Backblaze, etc.).
I'm sure someone will throw down some post where their cost, $x is less than $y at AWS, but that is such a tiny portion that if the cost is not >50% it isn't even worth looking at the rest of the math. The absolute total cost of ownership is much harder than most clickbait articles are willing to go into. I have not seen any developers talk about how it changes the income statement & balance sheet which can affect total net income and how much the company will lose just to taxes. One argument assumes that it evens out after the full amortization period in the end.
Here are just a handful of factors that get overlooked: supply chain delays, migration time, access to expertise, retaining staff, increased churn due to pager/on-call rotation, the opportunity cost of capital sitting in idle/spare inventory, and plenty more.
And yes, this won't be financially useful in every situation. But if the goal is to gain operational control, it's worthwhile nonetheless. That said, for a high-traffic API, you're paying through the nose for AWS egress bandwidth, so it is one of those cases where it also very much makes financial sense.
> If the business can't afford to have downtime then they should be paying for enterprise support.
It's simply stating that it's either cheaper for the business to have downtime, or cheaper to pay for premium support. Each business owner evaluates which it is for them.
If you absolutely can't afford downtime, chances are premium support will be cheaper.
For me, the most important metric would be time that me and my team spent fixing issues. And that went down significantly. After a year of everyone feeling burned out, now people can take extended vacations again.
One big issue for example was the connectivity between EC2 servers degrading, so that instead of the usual 1gbit/s they would only get 10mbit/s. It's not quite an outage, but it makes things painfully slow and that sluggishness is visible for end users. Getting reliable network speeds is much easier if all the servers are in the same physical room.
Not OP, but they're probably using Rook/MinIO.
One anti-pattern I've found is that most orgs ask a single team to handle on-call around the clock for their service. This rarely scales well, from a human standpoint. If you're getting paged at 2:00 in the morning on a regular basis you will start to resent it. There's not much you can do about that so long as only one team is responsible for uptime 24/7.
The solution is to hire operations teams globally and then set up follow-the-sun operations, whereby the people being paged are always naturally awake at that hour and can work normal eight-hour shifts. But this requires companies to, gasp, have specialized developers and specialized operators collaborate before allowing new feature work into production, to ensure that the operations teams understand what the services are supposed to do and can keep it all online. It requires (oh, the horror!) actually maintaining production standards, runbooks, and other documentation.
So naturally, many orgs would prefer to burn out their engineers instead.
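The routing logic itself is trivial; the hard part is the org structure. A sketch of follow-the-sun paging, with made-up regions and shift boundaries (all in UTC, purely illustrative):

```python
from datetime import datetime, timezone

# Hypothetical regions, each covering the hours its staff is awake.
SHIFTS = [
    ("EMEA", range(7, 15)),       # 07:00-14:59 UTC
    ("Americas", range(15, 23)),  # 15:00-22:59 UTC
]

def on_call_region(now=None):
    """Route the page to whichever region is in its working hours."""
    hour = (now or datetime.now(timezone.utc)).hour
    for region, hours in SHIFTS:
        if hour in hours:
            return region
    return "APAC"  # covers the remaining 23:00-06:59 UTC window
```

No one is ever paged at their 2:00 a.m.; every handoff lands inside someone's normal shift.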
Both of those can drive burn out. Personally, I find all that collaboration work very hard and stressful, so I work better in a situation where I get pages for the services I control; but that would change if pages were frequent and mostly related to dependencies outside of my control. It also helps to have been working in organizations that prioritize a working service over features. Getting frequent overnight issues that can't be resolved without third party effort that's not going to happen anytime soon is a major problem that I see reports of in threads like this.
I can also get behind a team that can manage the base operations issues like ram/storage/cpu faults on nodes and networking. The runbooks for handling those issues are usually pretty short and don't need much collaboration.
I think the anti-pattern is having one team responsible for another's burden. You want teams to both be responsible for fixing their own systems when they break, AND be empowered to build/fix their broken systems to minimize oncall incidents.
"But writing documentation is a waste of time because the code evolves so fast."
Yeah, I hear that, but there's also a lot of time lost: to people harried during their on-call and still exhausted for a week afterward, to training new people because the old ones burned out or just left for greener pastures, to maintaining old failed experiments because customers (perhaps at your insistence) still rely on them and backing them out would be almost as much work as adding them was, and so on.
That's not really moving fast. That's just flailing. You can actually go further faster if you maintain a bit of discipline. Yes, there will still be some "wasted" time, but it'll be a bounded, controlled waste like the ablative tiles on a re-entry vehicle - not the uncontrolled explosion of complexity and effort that seems common in many of the younger orgs building/maintaining such systems nowadays.
Yes, a million times yes. This is moving me. Where do I find a team that understands this wisdom?
Stress is often caused by a mismatch of what you feel responsible and accountable for and what you really control. The more you know the more you feel responsible for but you are rarely able to expand control as much or as fast as your knowledge. It helps to be very clear about where you have ultimate say (accountability) or control within some framework (responsibility) or simply know and contribute. Clear in your mind, others and your boss. Look at areas outside your responsibility with curiosity and willingness to offer support but know that you are not responsible and others need to worry.
When I started my career the engineers at our company were assigned a very specific part of the product that they were experts on. Usually there were 1 or 2 engineers assigned to a specific area and they knew it really well. Then we went Agile(tm) and the engineers were grouped into 6 to 9 person teams that were assigned features that spanned several areas of the product. The teams also got involved in customer interaction, planning, testing and documentation. The days when you could focus on a single part of the system and become really good at it were gone.
Next big change came when the teams moved from being feature teams to devops teams. None of the previous responsibilities were removed but we now became responsible also for setting up and running the (cloud) infrastructure and deploying our own software.
In some ways I agree that these changes have empowered us. But it is also, as you say, exhausting. Once I was simply a programmer; now I'm a domain expert, project manager, programmer, tester, technical writer, database admin, operations engineer, and so on.
If you look up articles about Team Topologies by Matthew Skelton and Manuel Pais, they outline a team structure that works for large, distributed systems.
On the flip side, in the olden days when one set of people churned out features and another set was handed a black box to run and made responsible for keeping it running, it was very hard to get the damn thing to work reliably, and the only recourse you often had was to "just be more careful", which often meant release aversion and multi-year release cycles.
Hence, some companies explored alternatives, found ways to make them work, wrote about their success but a lot of people copied only half of the picture and then complained that it didn't work.
Can you please share some details about what you think is missing from most "agile"/devops teams?
It begins with small requirements, such as coming up with a disaster recovery plan, only for it to be rejected because your stack must "automatically heal" and devs can't be trusted to restore a backup during an emergency.
Blink and you're implementing redundant networking (cross AZ route tables, DNS failover, SDN via gateways/load balancers), a ZooKeeper ensemble with >= 3 nodes in 3 AZs, per service health checks, EFS/FSX network mounts for persistent data that expensive enterprise app insists storing on-disk and some kind of HA database/multi-master SQL cluster.
... months and months of work because a 2 hour manual restore window is unacceptable. And when the dev work is finally complete after 20 zero-downtime releases over 6 months (bye weekend!) how does it perform? Abysmally - DNS caching left half the stack unreachable (partial data loss) and the mission critical Jira Server fail-over node has the wrong next-sequence id because Jira uses an actual fucking sequence table (fuck you Atlassian - fuck you!).
If only the requirement was for a DR run-book + regular fire drills.
Jira Server is deliberately cobbled by the sequence table + no Aurora support and now EOL (no security updates 1 year after purchase!). DC edition scales horizontally if you have 100k.
Jira in general is a poorly thought out product (looking at you customfield_3726!) but it's held in such a high regard by users it's impossible to avoid.
The second ten years of my career, I worked with (and continue to work on) much simpler systems, but the stack looks like this: React/Angular/Vue.js, Node.js/Spring Boot, MongoDB/MySQL/PostgreSQL, Elasticsearch, Redis, AWS (about a dozen services right here), Docker, Kubernetes. _This_ is exhausting.
When you spend so much time wrangling a zoo of commercial products, each with its own API and often own vocabulary for what should be industry standards (think TCP/IP, ANSI, ECMA, SQL), and being constantly obsoleted by competing "latest" products, that you don't have enough time to focus on code, then yes, it can be exhausting.
It wasn't that you technically couldn't choose another stack for a project, but to do so you had to justify the cost/benefit with hard data, and the data almost never bore out more benefit than cost.
The oncall was brutal. At some point I thought I should work on something else, perhaps even switch careers entirely. However this also forced us to separate user issues and system issues accurately. That’s only possible because we are a platform team. Since then I regained my love for distributed systems.
Another thing is, we had to cut down on the complexity - reduce number of services that talked to each other to a bare minimum. Weigh features for their impact vs. their complexity. And regularly rewrite stuff to reduce complexity.
Now Facebook being Facebook, valued speed and complexity over stability and simplicity. Specially when it comes to career growth discussions. So it’s hard to build good infra in the company.
Once you have a few lists some trends become clear and you can work with your manager to shift where you spend time.
Being forced to do things that absolutely did not make sense (CS-wise) was what I found most exhausting. Having no other way than writing shitty code or copying functionality into our app led me to an eventual burnout. My whole career felt pointless, as I was unable to apply any of the skills and expertise I had learned over all these years, because everything was designed in a complex way. Getting a single property into an internal API is not a trivial task and requires coordination across different teams, as there are a plethora of processes in place. However, I helped build that monstrous integration layer, and everything wrong with it is partly my doing. Hindsight is 20/20, and I now see there really was no other, better way to do it, which feels nice in a schadenfreude kind of way.
I sympathise with your point about not understanding what is expected of an average engineer nowadays. Should you take initiative and help manage things, are you allowed to simply write code and what should you expect from others were amongst my pain points. I certainly did not feel rewarded for going the extra mile, but somehow felt obliged because of my "senior" title.
I took therapy, worked on side projects and I'm now trying out a manager role. My responsibilities are pretty much the same, but I don't have to write code anymore. It feels empowering to close my laptop after my last Zoom meeting and not think about bugs, code, CI or merging tomorrow morning because it's release day tomorrow.
But hey, the grass is always greener on the other side! I think therapy was one of my life's best decisions after being put through the wringer. Perhaps it will help you as well!
After years of proving myself, earning trust, and strategic positioning, I am finally leading a system that will support millions of requests per second. I love my job, and this is the most intellectually stimulating work I have done in a long while.
I think this is far from the expectation of the average engineer. You can find many random companies with very menial and low stake work. However if you work at certain companies you sign up for this.
BTW, I don't think this is unreasonable. This is precisely why programmers get paid big bucks, definitely in the US. We have a set of skills that require a lot of talent and effort, and we are rewarded for it.
Bottom line this isn't for everyone, so if you feel you are done with it that's fair. Shop around for jobs and be deliberate about where you choose to work, and you will be fine.
This is the difference. Millions of things per second is a super hard problem to get right in any reality. Pulling this off with any technology at all is rewarding.
Most distributed systems are not facing this degree of realistic challenge. In most shops, the challenge is synthetic and self-inflicted. For whatever reason, people seem to think saying things like "we do billions of x per month" somehow justifies their perverse architectures.
I understand the value that developers bring to operational roles, and to some extent making developers feel the pain of their screwups is appropriate. But when DevOps is 80% Ops, you need a fundamentally different kind of developer.
There's an expectation that everyone is a night owl and that night time emergency work is fun, and that these fires are to be expected.
Finally, engineers seem to get this feeling of being important because they wake up and work at night. It's really a form of insanity.
There’s often not a lot of organizational pressure to change anything. So the status quo stays static. But the services change over time, so the status quo needs to change with them.
When getting anything done requires constant meetings, placing tickets, playing politics, and doing anything and everything to get other teams to accept that they need to work with you and prioritize your tasks so that you can get them done, you will burn out.
Lies, lies, and damn lies, I say!
Unless you have bright and experienced people at the top of a large distributed systems company, who have actually studied and built distributed systems at scale, your experience of working in such a company is going to suck, plain and simple. The only cure is a strong continuous learning culture, with experienced people around to guide and improve the others.
I feel very much like you do. I am nominally in grad school but really a full time engineer moonlighting as a student. We have no funding for a real development team. The app is coupled to things that constantly shift underfoot. I kind of feel like a single parent trying to keep a precocious toddler from accidentally dying.
It's not even clear how big your service is. You mention billions of requests per month. Every 1B requests/month translates to ~400 QPS, which isn't even that large. Like, that's single server territory. Obviously spikiness matters. I'd also be curious what you mean by "large amount of data".
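For reference, the arithmetic behind that conversion:

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def avg_qps(requests_per_month):
    """Average queries per second implied by a monthly request count."""
    return requests_per_month / SECONDS_PER_MONTH

# 1B/month is ~386 QPS on average; even 10B/month is only ~3.9k QPS.
# Peak traffic is what actually sizes the fleet, so multiply by your
# observed peak-to-mean ratio.
```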
I said billions not one billion.
I guess what I find exhausting is the long feedback cycle. For example, writing a simple script that makes two calls to different APIs requires tons of wiring: telemetry, monitoring, logging, error handling, integrating with the two APIs, setting up the proper Kubernetes manifests, and setting up the required permissions and making them available to k8s. I find all of this exhausting. And we're not even talking about operating this thing yet (on-call, running into issues with the APIs owned by other teams, etc.).
It starts by watching Simple Made Easy by Rich Hickey. And then making every member of your team watch it. Seriously, it is the most important talk in software engineering.
https://www.infoq.com/presentations/Simple-Made-Easy/
Exhausting patterns:
- Mutable shared state
- distributed state
- distributed, mutable, shared state ;)
- opaque state
- nebulosity, soft boundaries
- dynamism
- deep inheritance, big objects, wide interfaces
- objects/functions which mix IO/state with complex logic
- code that needs creds/secrets/config/state/AWS just to run tests
- CI/CD deploy systems that don't actually tell you if they successfully deployed or not. I've had AWS task deploys that time out but actually worked, and ones that seemingly take, but destabilize the system.
Things that help me stay sane(r):
- pure functions
- declarative APIs/datatypes
- "hexagonal architecture" - stateful shell, functional core
- type systems, linting, autoformatting, autocomplete, a good IDE
- code does primarily either IO, state management, or logic, but minimal of the other ops
- push for unit tests over integration/system tests wherever possible
- dependency injection
- ability to run as much of the stack locally (in docker-compose) as possible
- infrastructure-as-code (terraform as much as possible)
- observability, telemetry, tracing, metrics, structured logs
- immutable event streams and reducers (vs mutable tables)
- make sure your team takes time periodically to refactor, design deliberately, and pay down tech debt.
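As a tiny illustration of the "stateful shell, functional core" item above (the order shape, discount rule, and `db` object are all invented for the example):

```python
def apply_discount(order, percent):
    """Functional core: pure pricing logic, trivially unit-testable."""
    if not 0 <= percent <= 100:
        raise ValueError("percent out of range")
    total = sum(item["price"] * item["qty"] for item in order["items"])
    return round(total * (1 - percent / 100), 2)

def handle_discount_request(order_id, percent, db):
    """Stateful shell: does the IO, delegates every decision to the core.

    `db` is a stand-in for whatever data-access layer you use; only this
    thin layer needs mocks or integration tests.
    """
    order = db.load_order(order_id)
    discounted = apply_discount(order, percent)
    db.save_total(order_id, discounted)
    return discounted
```

The core can be exercised by fast unit tests with plain dicts, which is what makes the "unit tests over integration tests" item achievable in practice.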
By integration/system tests, do you mean tests that you cannot run locally?
Typically on a "high scale" service spanning hundreds or thousands of servers you'll have to deal with problems like: "How much memory does this object consume?", "How many ms will adding this regex/class to the critical path cost?", "We need to add new integ/load/unit tests for X to prevent outage Y from recurring", and "I wish I could try new technique Z, but 90% of my time is occupied by upkeep".
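The first two questions can at least be ballparked locally. A rough sketch of the measurement approach (the object shape and regex are made up; this is not any particular service's tooling):

```python
import re
import sys
import timeit

# Shallow memory cost of an object; containers need a recursive walk
# over their elements for a true footprint.
doc = {"id": 123, "title": "example", "tokens": ["tok"] * 1000}
shallow_bytes = sys.getsizeof(doc) + sys.getsizeof(doc["tokens"])

# Microseconds per call for a regex you're considering adding to the
# critical path; at thousands of QPS per box this adds up.
pattern = re.compile(r"\b\w+@\w+\.\w+\b")
per_call_us = timeit.timeit(
    lambda: pattern.search("page ops@example.com about the alert"),
    number=100_000,
) / 100_000 * 1e6
```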
It can be immensely satisfying to flip to a low-scale, low-ops problem space and find that you can actually bang out 10x the features/impact when you're not held back by scale.
Source: Worked on stateful services handling 10 Million TPS, took a break to work on internal analytics tools and production ML modeling, transitioning back to high scale services shortly.
But your 'average engineer' is probably better served by asking themselves the question whether the system really needed to be that large and distributed rather than if working on them is exhausting. The vast bulk of the websites out there doesn't need that kind of overkill architecture, typically the non-scalable parts of the business preclude needing such a thing to begin with. If the work is exhausting that sounds like a mismatch between architecture choice and size of the workforce responsible for it.
If you're an average (or even sub-average) engineer in a mid-sized company, stick to what you know best and how to make that work to your advantage: KISS. A well-tuned non-distributed system with sane platform choices will outperform a distributed system put together by average engineers any day of the week, and will be easier to maintain and operate.
I think it has to do with the kind of engineer you are. Some engineers love iterating and improving such systems to be more efficient, more scalable, etc. But it can be limiting due to the slower release cycles, hyper focus on availability, and other necessary constraints.
In other organizations, individual teams have ICDs and SLAs for one or more micro-services and can therefore state they're meeting their interface requirements as well as capacity/uptime requirements. In these organizations, when a system problem occurs, someone who's less familiar with the internals of these services will have to debug complex interactions. In my experience, once the root-cause is identified, there will be one or more teams who get updated requirements - why not make them stakeholders at the system-level and expedite the process?
Could you share why you think that's true?
IMO that it's exactly the opposite - microservices have potential to simplify operations and processes (smaller artifacts, independent development/deployments, isolation, architectural boundaries easier to enforce) but when it comes to code and their internal architecture - they are always more complex.
If you take microservices and merge them into a monolith - it will still work, you don't need to add code or increase complexity. You actually can remove code - anything related to network calls, data replication between components if they share a DB, etc.
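To illustrate the code that disappears in that merge (a toy sketch; the URL, `http` client, and repository object are all hypothetical):

```python
# As a separate service, component A reaches the users service over the
# network, with all the ceremony that implies:
def get_user_remote(user_id, http):
    resp = http.get(f"https://users.internal/api/users/{user_id}", timeout=2)
    resp.raise_for_status()   # plus retries, backoff, serialization, auth...
    return resp.json()

# Merged into one process, the same dependency is a plain function call,
# and the timeout/retry/serialization code simply ceases to exist:
def get_user_local(user_id, user_repo):
    return user_repo.find(user_id)
```

The monolith version also fails in only one way (an exception), whereas the remote version adds timeouts, partial failures, and stale data as new failure modes.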
Also, for junior team members a lot of this stuff works via magic, because they can't yet see where the boundaries are or don't understand all the automagic configuration.
Also, the number of "works on my machine" issues with Docker is staggering, even when the developers' laptops are from the same batch of identically imaged machines.
So, if you are handling 10 billion requests per month, that would average out to about 4k per second.
Are these API calls data/compute intensive, or is this more pedestrian data like logging or telemetry?
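A quick sanity check on that arithmetic (assuming a 30-day month and perfectly uniform traffic, which real traffic never is; peak rates will be considerably higher):

```python
# Convert a monthly request volume to an average requests-per-second rate.
# Assumes a 30-day month and uniform traffic -- a simplifying assumption.
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000

def avg_qps(requests_per_month: float) -> float:
    return requests_per_month / SECONDS_PER_MONTH

print(round(avg_qps(10e9)))  # 10 billion/month -> 3858 req/s, i.e. ~4k
```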
Any time I see someone having a rough time with a distributed system, I ask myself if that system had to be distributed in the first place. There is usually a valuable lesson to be learned by probing this question.
I've been in those situations. My solution was to ensure that enough effort went into systematically resolving long-known issues, in a way that not only solves them but also reduces the number of new, similar issues. If the strategy is instead predominantly firefighting, with 'no capacity' available for working on longer-term solutions, there is no end in sight unless/until you lose users or requests.
I am curious what the split is of problems being related to:
1. error rates, how many 9s per end-user-action, and per service endpoint
2. performance, request (and per-user-action) latency
3. incorrect responses, bugs/bad-data
4. incorrect responses, stale-data
5. any other categories
Another strategy that worked well was not to fix the problems reported but instead to fix the problems known. This is like the physicist looking for keys under the streetlamp instead of where they were dropped. Tracing a bug report to a root cause and then fixing it is very time consuming. That of course needs to continue, but if sufficient effort is put into resolving known issues, such as latency or error rates of key endpoints, it can have an overall lifting effect, reducing problems in general.
A specific example was how effort into performance was toward average latency for the most frequently used endpoints. I changed the effort instead to reduce the p99 latency of the worst offenders. This made the system more reliable in general and paid off in a trend to fewer problem reports, though it's not easy/possible to directly relate one to the other.
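For what "p99" means concretely, here is a minimal pure-stdlib sketch using the nearest-rank method (one of several common percentile definitions; libraries like NumPy offer interpolating variants):

```python
import math

def p99(latencies_ms):
    """Nearest-rank p99: the smallest sample such that at least
    99% of observations are less than or equal to it."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# 1000 samples: 985 fast requests plus 15 slow outliers.
samples = [10] * 985 + [500] * 15
print(p99(samples))  # -> 500: the tail latency the outliers cause
```

The point of targeting p99 rather than the average is visible here: the mean of those samples is barely above 10 ms, while 1 in 100 users sees 500 ms.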
I expect you can hit burnout building services and systems at any scale, and that's more reflective of the local environment: the job and the day-to-day, the people you work with, formalized progression and career development conversations, the attitude to taking time off and decompressing, attitudes to on-call, compensation, and other facets.
That said, mental health and well-being is real and IMO needs to be taken very seriously, if you’re feeling burnout, figuring out why and fixing that is critical. There have been too many tragedies both during COVID and before :-(
Reading other comments from the thread, I see similar frustrations from teams I partner with. How to employ patterns like contract tests, hypothesis tests, test doubles, or shape/data checks (etc.) typically gets conflated with system testing. Teams often disagree on the boundaries of the system, start leaning towards system testing, and end up adding complexity in tests that could be avoided.
My thought is that I see the desire to control more scope presenting itself in test. I typically find myself doing some bounded context exercises to try to hone in on scope early.
"Back to the 70's with Serverless" is a good read:
https://news.ycombinator.com/item?id=25482410
The cloud basically has the productivity of a mainframe, not a workstation or PC. It's big and clunky.
I quote it in my own blog post on distributed systems
http://www.oilshell.org/blog/2021/07/blog-backlog-2.html
https://news.ycombinator.com/item?id=27903720 - Kubernetes is Our Generation's Multics
Basically I want basic shell-like productivity -- not even an IDE, just reasonable iteration times.
At Google I saw the issue where teams would build more and more abstraction and concepts without GUARANTEES. So basically you still have to debug the system with shell. It's a big tower of leaky abstractions. (One example is that I had to turn up a service in every data center at Google, and I did it with shell invoking low level tools, not the abstractions provided)
Compare that with the abstraction of a C compiler or Python, where you rarely have to dip under the hood.
IMO Borg is not a great abstraction, and Kubernetes is even worse. And that doesn't mean I think something better exists right now! We don't have many design data points, and we're still learning from our mistakes.
Maybe a bigger issue is incoherent software architectures. In particular, disagreements on where authoritative state is, and a lot of incorrect caches that paper over issues. If everything works 99.9% of the time, multiply those probabilities together and you end up with a system that requires A LOT of manual work to keep running.
So I think the cloud has to be more principled about state and correctness in order not to be so exhausting.
If you ask engineers working on a big distributed system where the authoritative state in their system is stored, then I think you will get a lot of different answers...
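The "multiply those probabilities" point compounds quickly. A sketch, assuming a hypothetical request path that touches N components, each independently 99.9% reliable:

```python
# End-to-end success of a chain of N components that each succeed
# 99.9% of the time, assuming independent failures (a simplification).
for n in (1, 10, 50, 100):
    print(f"{n:3d} components -> {0.999 ** n:.3f} end-to-end success")
```

At 100 components the chain succeeds only about 90.5% of the time: nearly one request in ten fails somewhere, even though every individual piece looks fine on its own dashboard.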
As you said, a benefit of large distributed systems is that it's usually a shared responsibility, with different teams owning different services.
The exhaustion comes into place when those services are not really independent, or when the responsibility is not really shared, which in turn is just a worse version of a typical system maintained by sysadmins.
One thing that helps is bringing the DevOps culture into the company, but the right way. It's not just about "oh cool, we are now agile and deploy a few times a day"; it all comes down to shared responsibility.
Sometimes it feels like everyone is focused on eventually working with Google-scale systems and following best practices that are more relevant at that scale, but you can pick your own path.
There are good reasons for wanting multiple services talking through APIs. Perhaps you have a Linux scheduler that is marshalling test suites running on Android, Windows, macOS and iOS?
If all these systems originate from a single repository, preferably with the top level written in a dynamic language that runs from its own source code, then life can be much easier. Being able to change multiple parts of the infrastructure in a single commit is a powerful proposition.
You also stand a chance of being able to model your distributed system locally, maybe even in a single Python process, which can help when you want to test new infrastructure ideas without needing the whole distributed environment.
Your development velocity will be faster and less painful. Slow and painful changes are what burn people out and grind progress to a halt.
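As a sketch of what "model your distributed system in a single Python process" might look like (all names here are hypothetical), each service becomes a plain object and the network becomes a function call, so orchestration logic can be tested with no infrastructure at all:

```python
# Hypothetical in-process model of a scheduler fanning test suites out to
# platform-specific runners. Swapping the method calls for real RPC later
# leaves the orchestration logic unchanged and locally testable.
class Runner:
    def __init__(self, platform: str):
        self.platform = platform

    def run_suite(self, suite: str) -> dict:
        # A real runner would execute tests; the model just records intent.
        return {"platform": self.platform, "suite": suite, "status": "passed"}

class Scheduler:
    def __init__(self, runners: list):
        self.runners = runners

    def dispatch(self, suite: str) -> list:
        return [r.run_suite(suite) for r in self.runners]

sched = Scheduler([Runner(p) for p in ("linux", "android", "windows")])
results = sched.dispatch("smoke-tests")
print([r["platform"] for r in results])  # ['linux', 'android', 'windows']
```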
This is a major source of frustration. Having to touch multiple repositories and syncing and waiting for their deployment/release (if it's a library) just to add a small feature easily wastes a few hours of the day and most importantly drains cognitive ability by context switching.
I should say, I've been a sysadmin, SRE, software engineer, open source creator, maintainer, founder and CEO. Worked at Google, bootstrapped startups, VC-funded companies, etc. My general feeling: the cloud is too complex, and I'm tired of waiting for others to fix it.
Is the 'r' in 'simpler' intentional? In what way are the building blocks simpler than simple blocks?
That said, I do think there’s a psychological overhead when working on something that serves high levels of production traffic. The stakes are higher (or at least, they feel that way), which can affect different people in different ways. I definitely recognise your feeling of exhaustion, but I wonder if it maybe comes from a lack of feeling “safe” when you deploy - either from insufficient automated testing or something else.
(For context - I’m an SRE who has worked in quite a few places exactly like this)
I don't develop stuff that runs billions of queries. More like thousands.
It is, however, important infrastructure on which thousands of people around the world rely, and in some cases it's not hyperbole to say that lives depend on its integrity and uptime.
One fairly unique feature of my work, is that it's almost all "hand-crafted." I generally avoid relying on dependencies out of my direct control. I tend to be the dependency, on which other people rely. This has earned me quite a few sneers.
I have issues...
These days, I like to confine myself to frontend work, and avoid working on my server code, as monkeying with it is always stressful.
My general posture is to do the highest Quality work possible; way beyond "good enough," so that I don't have to go back and clean up my mess. That seems to have worked fairly well for me, in at least the last fifteen years, or so. Also, I document the living bejeezus[0] out of my work, so, when I inevitably have to go back and tweak or fix, in six months, I can find my way around.
[0] https://littlegreenviper.com/miscellany/leaving-a-legacy/
My frontend work is native Swift work, using the built-in Apple frameworks (I ship classic AppKit/UIKit/WatchKit, using storyboards and MVC, but I will be moving onto the newer stuff, as it matures).
My backend work has chiefly been PHP. It works quite well, but is not where I like to spend most of my time.
I started out as a systems administrator and it's evolved into doing that more and faster. The tooling helps me get there, but I did have to learn how to give better estimates.
Everything has gotten too complicated and slow.
Exhaustion/burnout isn't uncommon but without more context it's hard to say if it's a product of the type of work or your specific work environment.
Even then, I would ask you to be more specific. I have a normal 40-hour-a-week uni job as a sysadmin, though it typically takes somewhat more or less (hey, sometimes I can get it done in 35, sometimes it's 50 hours). However, for the last several years we have been so shorthanded, faculty-wise, that I teach (at a minimum) two senior-level computer science classes every semester (I was a professor at another uni). About mid-semester, things will break, professors will make unreasonable demands for new systems/software/architecture, and I find myself doing (again, at a minimum) 80 hours a week. On the other hand, I am not exhausted, as I enjoy teaching quite a bit, and I have been a sysadmin for many years and also enjoy that work.
For my part, I love working at global scale on highly distributed systems, and find deep enjoyment in diving into the complexity that brings with it. What I didn’t enjoy was dealing with unrealistic expectations from management, mostly management outside my chain, for what the operations team I led should be responsible for. This culminated in an incident I won’t detail, but suffice to say I hadn’t left the office in more than 72 hours continuous, and the aftermath was I stopped giving a shit about what anyone other than my direct supervisor and my team thought about my work.
It’s not limited to operations or large systems, but every /job/ dissatisfaction I’ve had has been in retrospect caused by a disconnect between what I’m being held accountable for vs what I have control over. As long as I have control over what I’m responsible for, the complexity of the technology is a cakewalk in comparison to dealing with the people in the organization.
Now I’ve since switched careers to PM and I’ve literally taken on the role of doing things and being held responsible for things I have no control over and getting them done through influencing people rather than via direct effort. Pretty much the exact thing that made my life hell as an engineer is now my primary job.
Making that change made me realize a few things that helped actually ease my burn out and excite me again. Firstly, the system mostly reflects the organization rather than the organization reflecting the system. Secondly, the entire cultural balance in an organization is different for engineers vs managers, which has far-reaching consequences for WLB, QoL, and generally the quality of work. Finally, I realized that if you express yourself well you can set boundaries in any healthy organization which allows you to exert a sliding scale of control vs responsibility which is reasonable.
My #1 recommendation for you OP is to take all of your PTO yearly, and if you find work intruding into your time off realize you’re not part of a healthy organization and leave for greener pastures. Along the way, start taking therapy because it’s important to talk through this stuff and it’s really hard to find people who can understand your emotional context who aren’t mired in the same situation. Most engineers working on large scale systems I know are borderline alcoholics (myself too back then), and that’s not a healthy or sustainable coping strategy. Therapy can be massively helpful, including in empowering you to quit your job and go elsewhere.
Op said "billions of requests per month".
That's ~thousands of qps.
A key part of scaling at an org-level is continuously simplifying systems.
At a certain level of maturity, it's common for companies to introduce a horizontal infra team (that may or may not be embedded in each vertical team).
When things aren't a tire fire, people will still ask you to do too much work. The only way to deal with it without stress is to create a funnel.
Require all new requests come as a ticket. Keep a meticulously refined backlog of requests, weighted by priorities, deadlines and blockers. Plan out work to remove tech debt and reduce toil. Dedicate time every quarter to automation that reduces toil and enables development teams to do their own operations. Get used to saying "no" intelligently; your backlog is explanation enough for anyone who gets huffy that you won't do something out of the blue immediately.
Org and people are not.
But it just felt like a breath of fresh air
All code in the same repository: UI, back-end, SQL, MVC style. Fast from feature request to delivery in production. Change, test, fix bugs, deploy. We were happy, and the customers were too.
No cloud apps, no buckets, no secrets, no OAuth, little configuration, no Docker, no microservices, no proxies, no CI/CD. It does look like somewhere along the way we overcomplicated things.
If everybody would get a ticket number and do requests when they're supposed to do them, we wouldn't need load balancers.
The problem is that this division of the code base is really hard. It is really hard to find the time and the energy to properly section your code base in proper domains and APIs. Especially with the constantly moving target of what needs to be delivered next. Even in a monorepo it is exhausting.
Now, put on top of that the added burden brought by a distributed system (deployment, protocol, network issues, etc) and you have something that becomes even more taxing on your energy.