Feedback: TCP lag, Mac crashes, site code, and octal dates

 3 years ago
source link: http://rachelbythebay.com/w/2020/10/17/feedback/

It's another round of reader feedback.

Regarding the TCP_NODELAY / Nagle post from Wednesday, one reader says:

There is another setting which should always be set on Linux & FreeBSD: TCP_QUICKACK. This stops problems with delayed acks.

Yep, this is one of a bunch of knobs that people should check out. I threw a mention of the man page in the bottom of the post since it would be really hard to do all of them justice in a relatively short single story. Also, the story in question happened several years ago and I wasn’t involved, so all we have to go on is what they told me. It seems that TCP_NODELAY was the key for them, but other settings may have further reduced the latency in certain situations.
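For anyone who wants to poke at these two knobs, here is a minimal Python sketch. The helper name make_low_latency_socket is made up for illustration, and this is not the code from the original story. TCP_QUICKACK is only set when Python exposes it, since it is a Linux option, and the Linux tcp(7) man page notes that the flag is not permanent, so real code may need to set it again later.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm turned off (hypothetical helper)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY: push small writes out right away instead of coalescing them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # TCP_QUICKACK is Linux-specific, so only touch it where the constant exists.
    if hasattr(socket, "TCP_QUICKACK"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
    return sock


if __name__ == "__main__":
    s = make_low_latency_socket()
    # A non-zero value means the option is enabled on this socket.
    print("TCP_NODELAY on:", s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
    s.close()
```

Whether either option actually helps depends on the traffic pattern, so measure before and after.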

Also on Nagle, another reader reports a fun-looking (to me, at least) trouble report for uWSGI which amounts to "it sometimes takes ~43 msec longer to respond" and wonders if it's related.

People need not wonder. This is why we have things like strace. Crank that sucker up in there and really look at those timestamps. See where it's actually spending that time when it's handling your request. You can compare that with the timestamps from sniffing the network to see what kind of latency there is between a write and it actually hitting the NIC.

If you're calling the syscall to write to a network file descriptor at 19:55:07.000000 and you only see the frame with the [P] flag set show up in tcpdump at 19:55:07.040000, well, that's your 40 milliseconds right there. You get the idea.
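That subtraction is easy to script once you have the two timestamps side by side. Here is a small sketch, assuming both tools print wall-clock times in the HH:MM:SS.microseconds form used above (strace -tt and tcpdump's default output look like this). The function name gap_ms is made up, and it does not handle a pair of timestamps that straddles midnight.

```python
from datetime import datetime, timedelta


def gap_ms(write_ts: str, wire_ts: str) -> float:
    """Milliseconds between the write() timestamp and the frame showing up on the wire."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(wire_ts, fmt) - datetime.strptime(write_ts, fmt)
    # Dividing one timedelta by another gives a plain float ratio.
    return delta / timedelta(milliseconds=1)


if __name__ == "__main__":
    # The two example timestamps from the paragraph above.
    print(gap_ms("19:55:07.000000", "19:55:07.040000"))  # 40.0
```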

I've gotten a bunch of people commenting about their Apple woes after my most recent post about Apple completely failing to do Thunderbolt properly even when it's all Apple-branded hardware involved.

Here's one of them:

Funny, I've noticed this with Apple in general just by using the sleep every night - on a macbook pro 15", 2016 and since 10.13 or 10.12. Just repeated sleeps, no peripheral devices. I notice it will go into a panic and not wake up from sleep after the TouchBar stops working, and the CapsLock stops working - from that point forward, the next time i sleep, it will crash on wakeup.

I don't know what it is about sleep mode, but it seems like everyone gets it wrong. Maybe they're used to "crash-only software", in which you only ever start it up from zero and never ever "go backwards" in terms of operating modes?

All of those cut corners start to add up when you start using features like suspend-to-RAM or suspend-to-disk. I can only wonder what's really going on here.

I haven't even mentioned the insanity I managed to create on my OLD (2015 model) MBP while trying some of the same sleep-wake tests over there just to see what would happen. Imagine a wifi connection that only lets existing TCP connections work, and nothing else even reaches the level that things like BPF (so libpcap/tcpdump) can see. That's highly screwed up, and I managed to trigger it twice on Sunday night within about 20 minutes. What the hell?

I'm beginning to wonder if I could make some money off this terrible luck I have with tech stuff. Plenty of people get early release stuff shipped to them so they can rave about it on YouTube. How about if someone shipped me an early release of something and included a consulting fee? Then I'd do my usual thing on it, and when it broke, they'd get the report first.

This would continue as long as I felt like I was being heard and my reports were accomplishing something useful. Once that dried up, that's it, and I'm back to doing my own thing again. (Sound familiar?)

Here's another report on the whole Apple sleep insanity thing:

There's a similar bug where the touchbar and screen flicker (go into sleep?) every second or two, also related to thunderbolt 3. The machine doesn't crash, but it's unusable, because the flickering resets the password dialog. Sometimes the fingerprint sensor or modifier keys can break the loop - but it's hard to reliably escape. (It's also hard to reliably reproduce the bug itself.)

I can also confirm triggering this on occasion. Certain combinations of plug/unplug/sleep/wake sequences along with certain timing elements seem to lead to slightly different results. One of them is where the external monitor will be awakened and will have stuff displayed, and then will be snuffed out less than a second later. This will happen maybe five or ten times, then it will give up.

While this is going on, the "log stream" output on the machine is just chock full of crap. I mean, I think it is. It's hard to tell when the damn OS ships from Apple with all kinds of terrible log diarrhea that shows that nobody truly cares about getting things right in that domain.

Honestly, if you look at the logs on a typical Mac these days, you'd think there was a civil war worthy of its own cinematic franchise going on inside Apple on the Mac OS side of the world. You can see one program doing something stupid, and another program logs something about it, like "such and such is deprecated, don't do that". Another program says "object X exists in both place A and place B, and which one you get is undefined, so you should fix this". A third thing reports that something is respawning too quickly and so it's being throttled.

All of this is the kind of stuff that's useful for third-party developers to know... but the things which are triggering it are all stock parts of the operating system. Holy Conway's Law, Batman!

An anonymous reader asks about what makes this site run:

What software do you use to write and publish your posts (and the feed)? What software runs behind this website (including this form)? I see a very old post about writing a tool for your blog as opposed to writing the HTML by hand on older posts, but couldn't find any other details. Whenever you're able to, would you mind writing a post about the tools and also about what JavaScript code is used and for what purposes? No hurry

That very old post, incidentally, is about five paragraphs and is from the first month or so of me doing this. It's light on details, so it's not surprising that people would have more questions about this.

Also, there are almost 1300 posts here now, so finding the one from eight years ago that happens to describe how things work is a real chore. It took *me* a good couple of minutes to find it, and I wasn't even fully sure it existed at first, put it that way.

But yeah, here it is, my post about "publog" from 2012.

A few things have changed since the days of that post. For one thing, the same code is also used to generate the books, and neither of them existed when I wrote that description. The book generation amounts to a big (ASCII protobuf!) config file that has the title, path info for certain texts (intro, about) and a bunch of sections. Each section has a title, content of its own (an introduction or one of the "book exclusive" stories), and then a list of posts.
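To make the shape of that config a little more concrete, here is a rough Python model of it. This is only a sketch of the structure described above: the real thing is an ASCII protobuf file, and every class name, field name, and path below is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    content_path: str  # an introduction or a "book exclusive" story
    posts: list[str] = field(default_factory=list)


@dataclass
class Book:
    title: str
    intro_path: str
    about_path: str
    sections: list[Section] = field(default_factory=list)


def posts_in_book_order(book: Book) -> list[str]:
    """Flatten the sections into the order the posts would appear in the book."""
    return [post for section in book.sections for post in section.posts]


if __name__ == "__main__":
    # All values here are placeholders, not real titles or paths.
    book = Book(
        title="Example Book",
        intro_path="texts/intro",
        about_path="texts/about",
        sections=[
            Section("First section", "texts/section1", ["post-a", "post-b"]),
            Section("Second section", "texts/section2", ["post-c"]),
        ],
    )
    print(posts_in_book_order(book))
```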

The very same post files that turn into the site here wind up driving the book generation. I had to add a couple of meta commands which let me make the book renderer skip certain chunks of the text which won't work in a book, even an electronic one - HTML audio being one of them, and the "this post has an update" note being another, since in the book, I just put that item next in the section.

Incidentally, the books are where you will find the "promised land" of posts grouped by topic and arranged in a meaningful order. Out here on the web, it's strictly chronological.

Besides the Atom feed for the site, there's one other fun thing I added back in 2013 that sees very little use: the "protofeed" output. Are you tired of parsing XML? Wouldn't you rather have a sensible way to get your posts? Well, for the past seven years, it's been right here, waiting for you to notice.

Finally, there's their question about the feedback stuff. That one was covered in a 2018 post and it's largely unchanged from then. The only difference is that people were surprised that it was so fast and so I added something that says "saved in XXX milliseconds (yes really)". This last part about people expecting web stuff to be slow and stupid generated a fair bit of thoughtful grumbling from other folks who are also annoyed by what's considered acceptable at the moment. We have the fastest computers and fattest pipes to our homes and mobile devices ever, and yet you still get jank when doing trivial tasks.

To that, I can only say: you have no idea how low the bar really is in terms of what people will use and actually pay for. This bubbles up through the stack into some so-called "tech companies", and, well, giant wobbly piles of garbage are the result. Then you need more garbage to manage that garbage, since those piles aren't going to organize themselves!

What's the JavaScript about? It puts up a "submitting feedback" message that most people never see because it's so quick. It POSTs the form back at the server here, and then it puts up the "feedback saved" with the timing info mentioned previously. There's also a little nuance where it unhides the textarea only after it's put in the handlers for submitting the form or clicking the button. This way, you won't accidentally race with the page loading. It's probably another thing nobody sees because it happens so quickly, but it's there! Check out the document.ready function in the bottom of the index.js if you're curious.

Why does it do that? I don't know, this is just how I build stuff. When I wrote this ages ago, I must have thought through the timeline and thought "hmm, the page loads and it has a bare form on it without my click handlers, and so until I get those installed, it won't go through the JS code and it'll do a default submit and that's not what I want". Then I just cranked things around so a visitor to the page *can't* get into that position without doing something deliberately hacky, at which point any breakage is on them.

You have to put on the "what could happen" hat and think your way through all kinds of stuff that allows user input, or your users will eventually end up in bad places. I personally think that most of my "bad luck" with tech stuff crashing is from me doing stuff at an interval or time when it's not otherwise "ready" for me to do it. There is definitely a timing element to a lot of this.

Regarding date and time anomalies, Jonathan writes:

Tangentially to your mention of the issues *someone* surely saw when we flipped from month 09 to month 10, I have this fun gem. It was years ago, so I've forgotten where I saw it, what language it was, and all of the other less important details. But, I hit a bug when we flipped from July to August. It took me a while, but I finally got it. We'd flipped from "07" to "08" on a system that was apparently understanding "07" as octal notation. And, well, "08" is obviously not a valid octal number. So, let's just go ahead and crash, m'kay? :facepalm:

This is great, since I also got another bit of feedback on this same topic from an anonymous reader:

That reminds me of a fun time in Bash. Step one: split a date string by hyphens into $year, $month and $day. Step two: use $month in an arithmetic context. I don't remember the exact code, but it was probably something trivial like '[[ "$month" -eq 1 ]]'. Step three: wait until August, when the script blows up with '[[: 08: value too great for base (error token is "08")'. Step four: learn about octal in Bash, smack forehead, and curse language designers who thought it was a good idea to make numbers ambiguous in order to save one character, instead of using something reasonable like 0o7.

And then there's also this:

Anything written in a language like sh/bash that uses textual interpolation and leading zero octal notation might break in August and September because 8 and 9 are not valid octal digits. All the other months work fine because 1 to 7 are identical and 10 to 12 have no leading zero.

I had never hit this one personally, so file this under "TIL"! That's one more thing for my time-related reliability list: some programs written in January through July will break the first day of August because "01" through "07" are valid as octal but "08" is not. It will continue to break in September ("09" is also invalid), but then will mysteriously stop crashing in October and will be fine for close to a year.

See, some problems DO fix themselves! (No, not really.)
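The failure pattern is easy to reproduce outside of a shell. Here is a small Python sketch that imitates the "leading zero means octal" rule and checks every two-digit month. The function names are made up, and this only mimics the parsing rule rather than anything Bash does internally.

```python
def parse_like_shell_arithmetic(text: str) -> int:
    """Parse digits the C-style way: a leading zero means the rest is octal."""
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)  # raises ValueError for "08" and "09"
    return int(text, 10)


def broken_months() -> list[str]:
    """Return the zero-padded month strings that fail to parse."""
    bad = []
    for month in range(1, 13):
        text = f"{month:02d}"
        try:
            parse_like_shell_arithmetic(text)
        except ValueError:
            bad.append(text)
    return bad


if __name__ == "__main__":
    print(broken_months())  # ['08', '09']
    # Forcing base 10 is the fix: the leading zero is then just padding.
    print(int("08", 10))    # 8
```

In Bash itself, the equivalent of forcing base 10 is the 10# prefix inside arithmetic, as in $((10#$month)).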

