3

Gabriele Svelto: "On Monday morning we (Mozilla)…" - Fosstodon

 1 year ago
source link: https://fosstodon.org/@gabrielesvelto/110592904713090347
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Gabriele Svelto: "On Monday morning we (Mozilla)…"

The crash started apparently out-of-the-blue, hitting thousands of Argentinian users on a Debian-based distro called Huayra, and specifically on version 5 which was based on Debian 10.

https://bugzilla.mozilla.org/show_bug.cgi?id=1839139

Everybody seemed to crash while searching for images on Google. All versions of Firefox - even very old ones - were affected suggesting that the change didn't happen on our side, but on Google's. 2/6

A colleague analyzed Firefox' behavior at the point of crash and realized that it happened during stack probing. The JIT touched the area that would hold the variables for the next JavaScript call and somehow hit an overflow.

https://bugzilla.mozilla.org/show_bug.cgi?id=1839139#c8

This is where things got weird, Google's code was allocating 20000 variables in a single frame. Ouch, that's probably some machine-generated code which went out of hand. Think twice before using ChatGPT to write code. 3/6

But why was it crashing? Linux automatically extends the stack and we had reserved more than enough space, something that we confirmed by looking at the memory map of the affected processes.

Well it turns out that the Linux kernel used to have a check that prevented stack accesses that were too far from the stack pointer. Specifically accesses 64KiB + 256 bytes away would crash instead of extending the stack.

https://github.com/torvalds/linux/blob/84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d/arch/x86/mm/fault.c#L1359 4/6

713433008d6dae56.png

This was fixed in kernel 4.20 so users of more recent distros are unaffected, and we'll see if we can deploy a workaround to help users of older systems.

https://github.com/torvalds/linux/commit/1d8ca3be86ebc6a38dad8236f45c7a9c61681e78

It is interesting though that we find ourselves working around a bug we did not introduce triggered by code we do not control. 5/6

d38cd81e738d19d3.png

@gabrielesvelto How did I miss hearing about TCP? I've been using 1p-isolate for years so it's not really relevant to me but I'd love to know the details on how they built a weaker 1p-isolate and whether it fully does the job or has exploitable weaknesses.

@somecanuckchick yes, we had TCP for a while, I think more improvements were made over time but that's the link I had on hand

@gabrielesvelto then it is not Google doing bad things, which has become crap.

@gabrielesvelto

And since we’re at it let’s shame Google for putting 20 thousand variables in a single function. Bad Google, no cookie.

I once worked on a game engine that used ODE as its physics layer. At the core of ODE collision detection and handling was a function that built a Jacobian matrix on the stack (using alloca) to compute the forces to apply to objects colliding to separate them. We crashed on touching the stack redzone in Windows when our engine ran as a plugin in Internet Explorer—not something we could fix easily on our end, since the size of a thread redzone is decided at compile time by the application configs (which, again, application is Internet Explorer).

Filed a ticket against ODE maintainers and their response was basically “We don’t consider that application domain to be a meaningful one to fix bugs in.” So we fixed it on our end by #define-ing alloca away to a heap allocation in a tiny buffer.

Point of this story is: no shame on Google. Google doesn’t consider the Firefox browser on old Linux configurations a meaningful application domain to fix bugs in. And if you can’t point to where in the JavaScript language spec it says 20,000 variables is disallowed… Shame on Mozilla for having a noncompliant JS implementation. ;)

At least it was easy to fix.

It is interesting though that we find ourselves working around a bug we did not introduce triggered by code we do not control.

Oh yeah… That’s the nature of Internet software. It is interesting every time. :) I’ve had to get up from the keyboard and take a walk twice in my career, and the first time was when I realized if I’m going to be writing web software, that’s going to be, like, my whole career: stuff breaking because someone changed something somewhere that I was relying on for their own reasons. Internet software is like 1/3 technology and 2/3 social network effects.

@mtomczak this is sadly a very common occurrence for us. Just in the past two months we dealt with a couple of CPU bugs and an issue in a Rust crate that would only occur to people running Windows 7 installations w/o the SP1 installed on AVX-ready CPUs (yes, in 2023).

As for Google they reverted the change before we contacted them, so chances are that it was either wildly inefficient or it also messed Chrome up.

@gabrielesvelto Compiling with stack clash protection will fix this 100% unless the extension is happening via JIT code rather than static native code.

@dalias we're building with -fstack-protector-strong and -fstack-clash-protection but this is happening in the JIT IIUC

@gabrielesvelto my fear as a developer is that this is basically my life. With more and more inter-dependent code being used everywhere at some point you're going to end up working around the platform you're tied into itself.

Good on Mozilla for having the ability to inspect that this was happening and provide a backported fix.

@gabrielesvelto I haven't tried to look at the code, but my guess is that that was the result of relatively old functional language compilation techniques (lambda lifting comes to mind, but this isn't something I've thought much about since grad school), or maybe old-fashioned code generation like what Firefox uses for XPCOM and IPC, rather than any kind of “AI”.

@xlerb yes, that was meant to be a bit of a joke... But also a warning, if a compiler phase can go *this* bad think how much crap "AI" could produce

@gabrielesvelto Google's JavaScript is produced by an optimizing compiler. It's machine-generated only in the sense that a compiler rearranged it in attempt to shrink code size or runtime.

@evmar yes I'm sure this is some scalarization pass that accidentally turned a fixed-size array into a bunch of variables or something along the lines. The mention of ChatGPT was meant as a joke.

@gabrielesvelto I remember hearing from someone who worked on gmail that Firefox used to limit your DOM to 255 levels of nesting, after which point new children in the HTML would appear in the DOM as siblings.

They apparently had discovered this because they had generated such awful HTML that they hit that limit. 😬

@gabrielesvelto
20K variables? So much for not being evil. Not that they've held themselves to that standard for years now. Sheesh.

@gabrielesvelto Wait a sec, in 2023, there are active users, on a desktop PC, connected to the internet and running a Linux version older than 4.20? That's 5 years old at this point!!!

It's bullshit like this that makes me just want to stop Debian! How much engineering time is wasted because users or package maintainers refuse to update? It's not like they would lose features and it's free! But no, "stability" by dust gathering is still a thing...

Well, enjoy your crashes I guess?

@mupuf @gabrielesvelto
1. Debian 10 is old stable but some people might not have upgraded to 11 yet.
2. 4.19 is still supported by upstream since it's an lts version and Debian follow the lts released at the moment a new stable is release, meaning it was this kernel when 10 got released.
3. You can install a newer kernel from the backport repository if needed.

I don't use debian on desktop (only on servers) but I informe myself before throwing rocks at a project ;)

@mupuf @gabrielesvelto I mean, first, Debian 10 is oldoldstable (as in, you're supposed to upgrade ASAP). Second, and more importantly, Debian doesn't have control over what a derivative distribution (like Huayra) does and when they update. Finally, buster-backports (https://packages.debian.org/buster-backports/linux-image-amd64) contains a much newer kernel.

Debian itself is not at fault here.

@gabrielesvelto That was interesting, thanks for detailing it. How many “nobody would ever do this” bugs do developers have to fix, eh?


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK