
The funniest performance regression you've seen

source link: https://lobste.rs/s/p7fo6i/funniest_performance_regression_you_ve

I want to hear the dumbest, silliest, most “ahahaha what was I/they thinking” slowdowns you’ve had to fix.

Here’s mine: we had a test suite that took 30 minutes to finish on every build. Okay, split it across multiple machines, right? With two machines it dropped to 20 minutes each, then with four machines it dropped to… 19 each. Eight machines, 18. So something weird was going on.

After some digging through logs and reading the setup code, I found the problem. The test runner was using Ruby to seed the database with two million sample organizations. Then it generated ten departments for each organization, and ten employees for each department, before finally dropping the whole database and starting the unit tests. The seed data wasn’t even used!

Removing that line brought our test runs to a more manageable level.
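
The general shape, for anyone curious: when every machine repeats the same fixed setup, per-machine time bottoms out near that setup cost no matter how many machines you add. A rough model with made-up numbers (not the real suite’s timings):

    # Each machine pays the full fixed seeding cost; only the actual test
    # time is divided among them. Numbers below are illustrative only.
    def wall_clock_minutes(machines, seed_minutes=17.0, test_minutes=13.0):
        return seed_minutes + test_minutes / machines

    for n in (1, 2, 4, 8):
        print(n, "machines:", round(wall_clock_minutes(n), 1), "minutes")
    # 1 -> 30.0, 2 -> 23.5, 4 -> 20.2, 8 -> 18.6; the curve flattens near the fixed cost.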

  1. One time I was making sure our testing framework (custom-written) was up to date with all dependencies, and it ended up being slower. It took a week to realize there was a breaking change in one API we used. Even worse—I made the breaking change in the API.

    1. Me: Ugh, what idiot broke this code.

      $ git blame

      Me: Oh. Oh no.

      1. Postmortems are blameless, but git is not 😒

  2. We’d written our own code to read from a socket in Java. Someone who was most definitely probably maybe not me had, for no good reason, hardcoded 128-byte reads on a socket that regularly had several megabytes to read. We didn’t catch it for several releases, until someone else who was definitely not me introduced a bug that caused that socket to be read once instead of until it closed or we hit the expected number of bytes. We celebrated a significant performance increase in the next release, having bumped the reads to something like 128 kilobytes if not megabytes.
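
    For illustration, the shape of it in Python rather than our Java (not the actual code): a read loop that keeps pulling from the socket until it has the expected byte count or the peer closes, with the buffer size as the knob that was hardcoded to 128 bytes.

        import socket

        def read_exactly(sock: socket.socket, expected: int, bufsize: int = 128 * 1024) -> bytes:
            """Read until `expected` bytes arrive or the peer closes the connection."""
            chunks = []
            remaining = expected
            while remaining > 0:
                chunk = sock.recv(min(bufsize, remaining))
                if not chunk:              # peer closed early
                    break
                chunks.append(chunk)
                remaining -= len(chunk)
            return b"".join(chunks)

    Hardcode bufsize=128 and you get the original sin; drop the while loop and you get the later bug that only ever did one recv.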

  3. lorddimwit

    Hm, I can think of a couple.

    One was a test of a regex engine. I wanted to test it against “big” inputs but I messed up and the automatically generated test inputs ended up being gigabytes in size instead of a couple of hundred KB.

    There was a database issue where one of the columns had been set to Latin-1 encoding (or something, I don’t remember exactly) whereas everything else was UTF-8. The DB engine was falling into a state where it did an entire table scan, transcoding each row to compare to the key.

    I also worked on a product where one developer insisted on a certain feature: logging events for replay later to rebuild the database if something happened. This was not a good fit for the product (a high-speed network monitor), but they kept insisting, eventually threatened to quit if we didn’t give the idea a try, and some office politics came into play, so…every new TCP connection of interest, every alert our system detected, every whatever, did a transactional insert of a JSON blob into a database table. The (large) JSON blob stored all the information about the event.

    Nothing ever read from that table. Ever. And the table rapidly grew to gigs and gigs, so our automated DB cleanup ran constantly. This stupid table had been prioritized over customer-visible tables, so customers’ old data was being deleted much sooner than it otherwise would have been, just to make room for this damn table. Performance tanked, etc. We had to add a flag to disable that subsystem, which we turned on by default, but politics prevented us from deleting the subsystem entirely.

    The developer decided that our complaining was because the rest of the system couldn’t keep up with their awesome idea and finally ended up quitting in a huff a few months later.

    Nuking all that code the day after they left was cathartic.

    1. imode

      Sounds like that developer heard “event sourcing” and didn’t really think about the size of the events they were storing.

      1. lorddimwit

        It was event sourcing and they also were not concerned with the frequency of these events, the latency introduced by logging them, the size of the events, or the extra database load.

  4. hank

    I worked at a backup provider in a previous job, and the dev team pushed a beta build to the beta hardware platform that hosted the company’s backups. They were trying a new on-disk metadata format based on a few RocksDB databases instead of a single h2 instance. We (the ops team) were testing a hardware configuration that did softraid instead of using the hated, finicky hardware RAID. The platform change meant that the drives no longer lied about fsync. The real fsyncs happening over a few thousand instances (a few hundred users * a few DBs) on a single filesystem made the system load shoot up over 10k, which was a fun afternoon of disbelief and debugging.

  5. Years ago, there was a script set up as a cron job that needed to run semi-frequently and did a kind of expensive task.

    And whoever wrote the script forgot to create/check a lockfile to prevent a new instance from starting up while an old instance was still running (which was a real possibility).

    Have you ever seen a machine report a load average of 700? I did, that day.
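
    The usual guard, sketched in Python (the original script may well have been shell; the lockfile path here is just illustrative):

        import fcntl
        import sys

        # Take an exclusive, non-blocking lock for the lifetime of the process.
        # It is released automatically when the process exits, even on a crash.
        lock = open("/tmp/expensive-job.lock", "w")
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            sys.exit("previous run still in progress; skipping this one")

        # ... the expensive task itself goes here ...

    If a previous run is still going, the new cron invocation exits immediately instead of piling on.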

  6. Once upon a 2015 I was working on making Terraform’s graph code more debuggable, and I thought it would be nice for the vertices to always be visited in the same order. It didn’t functionally matter for any graph operations, but it made examining debug logs a bit easier.

    So I dropped what I thought was an innocent sort operation into the graph walk.

    That code sat there for over two years (v0.7.7–v0.10.8) before one of my colleagues decided to audit performance on nontrivial Terraform configs and discovered a ridiculous amount of time spent diligently sorting vertices for an operation that had zero user benefit.

    I shudder to think about the sum total of time and energy cumulatively wasted by that single line of code. It’s burned into my memory as a reminder to always be cognizant of hot paths!
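
    The shape of the mistake, as a Python sketch rather than Terraform’s actual Go graph code: a sort that looks innocent per call, but runs once per vertex during the walk.

        import time

        def walk_sorting_every_step(vertices):
            """Re-sorts the full vertex list at every step of the walk (the 'innocent' line)."""
            visited = []
            for _ in vertices:
                ordered = sorted(vertices)      # O(V log V) work repeated V times
                visited.append(ordered[len(visited)])
            return visited

        def walk_sorting_once(vertices):
            """Sort once up front; the walk itself stays linear."""
            return sorted(vertices)

        vertices = [f"resource_{i}" for i in range(5_000)]
        for walk in (walk_sorting_every_step, walk_sorting_once):
            start = time.perf_counter()
            walk(vertices)
            print(f"{walk.__name__}: {time.perf_counter() - start:.3f}s")

    On a nontrivial config, that repeated factor of V is the kind of overhead the audit eventually surfaced.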

  7. Fresh off the press: opening the emoji picker on NixOS takes 500 ms for me. It seems that’s due to accidentally quadratic PATH manipulation in a shell script?

    Details: https://discourse.nixos.org/t/plasma-emojier-is-very-slow/27160
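
    The usual way PATH handling goes accidentally quadratic, as a Python illustration of the pattern (not the actual NixOS shell script): every “append if missing” rescans the whole accumulated string.

        def add_to_path_quadratic(path: str, new_dirs: list[str]) -> str:
            """Each membership test re-splits and rescans the growing PATH: O(n^2) overall."""
            for d in new_dirs:
                if d not in path.split(":"):
                    path = path + ":" + d
            return path

        def add_to_path_linear(path: str, new_dirs: list[str]) -> str:
            """Track what's already present in a set so each append is O(1)."""
            parts = path.split(":")
            seen = set(parts)
            for d in new_dirs:
                if d not in seen:
                    parts.append(d)
                    seen.add(d)
            return ":".join(parts)

    With the thousands of entries a wrapper-heavy environment can produce, the first version is the kind of thing that turns into a visible half-second pause.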

  8. I was building a web app and eventually tinkered with the “loading” state of the frontend (a spinner displayed on an overlay). But my local machine was pretty responsive, so the spinner just flashed for a fraction of a second. To actually test a meaningful loading state, I decided to add a sleep(3) to the backend code.

    Yeah…

    … and I left that sleep in the prod release :D So it was a 1000x speed-up when I removed it and the calls finished in 3ms :D

  9. badtuple

    We were doing a massive overhaul of a service that had been prototyped and launched on Postgres, moving it to DynamoDB. We had hit certain scaling issues that we couldn’t get around in the relational model due to some unusual domain constraints, but with some cleverness we could deconstruct it to make it work over a KV setup with some indexes. DynamoDB itself wasn’t necessarily ideal due to its batch-update and put-size limits, but overall it was a good choice because the company could easily afford buying Bezos a new yacht in exchange for the reliability.

    Because of the sensitivity of the service (the company bled money if it was down), we tested the update like crazy. We had internal staging environments, we had an external integration environment so our customers could test it out individually, we had testing environments and things dedicated to benchmarking…

    So of course we hit production deploy and ~80% of everything timed out. Background jobs were stuck, autoscalers were going crazy, everything ground to a halt. For the next 3 hours we looked for locking bugs, read logs, and stared at the pegged metrics. Everything seemed to be operating like it should, just…so much worse.

    Finally someone noticed a weird conditional tucked away in the Terraform config that relied on a weird shell script. Turns out every environment other than production had a stupidly massive machine allocated. Like dumb big. Someone had added it months back to get around an issue temporarily and it was never reverted. Due to the batch limits on Dynamo we were constantly spinning up so many goroutines that the Go runtime/scheduler just pegged the CPU on the smaller (but reasonable!) machine. We were young and naive and incorrectly assumed goroutines were as free as the hype said.

    We cranked up the specs in production and spent the next week basically not sleeping and optimizing. Goroutine pools, semaphores, adding extra backpressure, reusing resources we didn’t think were a big deal, etc. Really nothing crazy, just nothing we thought we had to worry about because you don’t prematurely optimize.
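
    The bounded-concurrency fix, sketched in Python’s asyncio rather than our Go (names and numbers here are made up): cap in-flight batch writes with a semaphore instead of spawning a task per batch.

        import asyncio

        async def write_batch(batch):
            """Stand-in for a DynamoDB batch write; it just sleeps here."""
            await asyncio.sleep(0.01)

        async def write_all(batches, max_in_flight=64):
            # The semaphore bounds concurrency so a burst of batches can't swamp the scheduler.
            sem = asyncio.Semaphore(max_in_flight)

            async def bounded(batch):
                async with sem:
                    await write_batch(batch)

            await asyncio.gather(*(bounded(b) for b in batches))

        asyncio.run(write_all([[i] for i in range(1_000)]))

    Same idea as a goroutine pool: the work still all happens, it just can’t all happen at once.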

  10. In one case, tests for a small Python backend took way too long to run for what it did. Some debugging pointed to the tests that used Minio to exercise our S3 interactions as the main hog; those tests took just over a minute each. It turned out to be an incompatibility between botocore and Minio, which meant Minio didn’t handle the Expect request header correctly, making it wait a while for the rest of a request that was never going to arrive (details). The fix was effectively a one-liner and sped the tests up from 10+ minutes to about 10 seconds.

    Another interesting case: Pylint gets much slower if you enable concurrency: https://github.com/pylint-dev/pylint/issues/2525.

  11. Two common classes of database query you might do in Python are “pull one row into a tuple” and “pull a column of IDs or such as a list”.

    Our DB utility library handled these two situations with 1) a function that would accumulate all the values from all the rows in a query into one big tuple (so that one function could handle a single-row or single-column query), and 2) a wrapper to call the tuple-returning function and convert its result to a list.

    In retrospect it’d’ve made more sense to handle those two use cases with totally independent functions, with the row one enforcing that the query returns exactly one row and the column one enforcing that it returns exactly one column. But ten years ago I was–uh, we were capricious and foolish.

    Unfortunately, adding lots of values to a tuple one-by-one is O(n^2). Retrieving lots of values through this code still completed quickly enough that it took surprisingly long to notice the problem–it might add a few seconds pulling a million IDs, and often those retrievals were the type of situation where a legitimate few-second runtime was plausible.

    When we did fix it, it was a very small diff.
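
    For anyone curious, here’s the core of it as a standalone Python illustration (not our actual library code): growing a tuple one element at a time copies the whole tuple on every append, while a list grows in amortized constant time.

        import time

        def fetch_values_as_tuple(rows):
            """Mimics the old helper: accumulate every value into one growing tuple."""
            result = ()
            for row in rows:
                result += row              # copies the entire tuple each time: O(n^2) total
            return result

        def fetch_values_as_list(rows):
            """The small fix: build a list instead."""
            result = []
            for row in rows:
                result.extend(row)
            return result

        rows = [(i,) for i in range(50_000)]    # bump this to feel the quadratic blow-up
        for fn in (fetch_values_as_tuple, fetch_values_as_list):
            start = time.perf_counter()
            fn(rows)
            print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")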

  12. Not sure I can think of any, tbh. Most of my noteworthy bugs involve things breaking horrifically, not performance regressions.

  13. I recently swapped DBs in a project from SQLite to PostgreSQL and all of my N+1 queries suddenly became relevant.

    It was fun though to see the number of queries in one of the pages drop from 4,000 to 8 or so.
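
    The N+1 shape and the fix, as a hedged sqlite3 sketch (made-up author/post schema, not the project’s): the per-row queries are nearly free against local SQLite and painfully slow once each one is a network round trip to Postgres.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        """)
        conn.executemany("INSERT INTO author VALUES (?, ?)",
                         [(i, f"author {i}") for i in range(100)])
        conn.executemany("INSERT INTO post VALUES (?, ?, ?)",
                         [(i, i % 100, f"post {i}") for i in range(1_000)])

        # N+1: one query for the posts, then one more query per post for its author.
        posts = conn.execute("SELECT id, author_id, title FROM post").fetchall()
        authors = [conn.execute("SELECT name FROM author WHERE id = ?", (a,)).fetchone()
                   for _, a, _ in posts]          # 1,000 extra round trips

        # The fix: one joined query.
        rows = conn.execute("""
            SELECT post.title, author.name
            FROM post JOIN author ON author.id = post.author_id
        """).fetchall()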

  14. bryce

    Not a regression, since it was basically greenfield code, but the first protobufs pass on Riak Time Series didn’t have a float64 field, just an ASCII-encoded numeric field. I begged and begged for feedback from existing protobufs experts since it was my first work on that, but didn’t hear anything until after it was merged into develop with field numbers assigned. We did fix it before it made it to customer-world, but still.

  15. apg

    I once had a job where^[1], for reasons of politics and of not wanting to do anything, they put “speed-up loops” in the code. That way, when a manager came around and said, “Customers are complaining our app is slow! Speed it up!”, our genius engineers could remove a million iterations from the speed-up loop, and we’d have a huge performance speed-up!

    ^[1]: I didn’t, but I like this story.

