source link: http://rachelbythebay.com/w/2023/03/31/html/

Administrivia: HTML generation and my general clowniness

I've been kind of quiet these past few weeks. Part of that has been from plowing a bunch of work into getting serious about how all of the /w/ posts get generated. I figure if I'm going to start leaning on people to not do goofy things with their feed readers, the least I can do is make sure I'm not sending them broken garbage.

To really explain this, I need to back up to 2011 when this whole thing was just getting off the ground. I started writing here in order to keep the momentum from the writing I had been doing inside the company I was about to leave. I figured that anything was better than nothing, and so those early posts were entirely hand-formatted: I'd type one up, slap on the header and footer, and then it'd get a link from the top-level index page.

Then people asked for an Atom feed, and I delivered on that too... ALSO doing it by hand at first. Yeah, that was about as awful as you can possibly imagine. Obviously that could not stand, but it did get me through the first couple of days and posts, and then my little generator came together and it picked up most of the load for me.

But there's a dirty little secret here: this generator has been little more than a loop that slaps HTML paragraph ("p") tags around everything. It doesn't really understand what's going on, and any time it sees a blank line, it assumes one paragraph ended and another one just began.
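
If you want a mental model of it, it's roughly this sort of loop. This isn't the actual publog code - just a sketch of the shape of the thing:

  #include <iostream>
  #include <string>

  int main() {
      std::string line;
      bool in_para = false;
      while (std::getline(std::cin, line)) {
          if (line.empty()) {
              // Blank line: assume the paragraph just ended. Note that a line
              // holding a single space is NOT empty, which matters in a moment.
              if (in_para) { std::cout << "</p>\n"; in_para = false; }
              continue;
          }
          if (!in_para) { std::cout << "<p>\n"; in_para = true; }
          std::cout << line << "\n";   // body text passes through untouched
      }
      if (in_para) std::cout << "</p>\n";
      return 0;
  }

Everything between blank lines gets wrapped, and nothing inside the paragraph ever gets inspected or encoded.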

If you've ever looked at the source of some of the more complicated posts with embedded tables, audio players, PRE blocks or anything else of the sort, you've probably wondered what kind of crazy I was smoking. Now you know why. The only reason it works at all is because the web as a whole is terrible and browsers have had to adapt to our collective human clownery. HTML parsers tend to ignore the botched tags, and it generally looks right anyway.

I still find myself doing stupid things to work around the nuances of the ridiculous state machine that I created. If you've seen PRE blocks where for some reason there are lines with a single space in them, this is why! A blank line would trip the "stick on a /p and then a p" thing, but a line with a single space would not. So, I've been doing that.

Worse still, see how I'm calling it /p and p? I'm not using the actual angle brackets? Yeah, that's because there's no entity encoding in this thing at the moment. I'd have to manually do the whole "ampersand l t semicolon" thing... and HAVE been doing this all this time. I don't feel like doing that at the moment. (Because I'd have to fix it when it's time to convert this very post, but I'm getting ahead of myself.)
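
The escaping itself was never the hard part, to be clear. Something like this little sketch (not the actual routine) covers the text-content case:

  #include <string>

  // Encode the characters that matter inside text content. ('&' has to be
  // covered too, or there'd be no way to write a literal ampersand.)
  std::string escape_text(const std::string& in) {
      std::string out;
      for (char c : in) {
          if (c == '&')      out += "&amp;";
          else if (c == '<') out += "&lt;";
          else if (c == '>') out += "&gt;";
          else               out += c;
      }
      return out;
  }

The hard part is having a generator that knows when something is body text (encode it) and when it's markup that I emitted on purpose (leave it alone).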

Both publog (the thing that is responsible for what you're seeing now) and my own diary software share a similar heritage, and I've been bitten by the lack of proper handling of this stuff over the years. For whatever reason, I decided it was time to do something about it, and finally got traction with an approach around the time of the new year.

Here's what's coming: every single post will be run through a generator that actually functions like a "real" parser - tokens and rules and put_backs and all of this! It's not just an "am I in a paragraph right now" state machine. It'll accumulate text, and when it's ready to emit a paragraph, it will do that with all of the rules it's been told about, like how to handle attributes, their values, AND when (and what) to escape/encode in the actual body of the tag/container.
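
I'm not going to paste the real thing here, but the skeleton is the familiar one from any hand-rolled parser: a token stream you can read from and push back onto when a rule decides it has seen too much. Something like this, with the names invented for the sketch:

  #include <cstddef>
  #include <optional>
  #include <string>
  #include <vector>

  struct Token {
      enum class Kind { Text, BlankLine, Command } kind;
      std::string value;
  };

  class TokenStream {
   public:
      explicit TokenStream(std::vector<Token> tokens) : tokens_(std::move(tokens)) {}

      std::optional<Token> next() {
          if (!pushed_.empty()) {        // hand back anything that was put_back first
              Token t = pushed_.back();
              pushed_.pop_back();
              return t;
          }
          if (pos_ >= tokens_.size()) return std::nullopt;
          return tokens_[pos_++];
      }

      void put_back(Token t) { pushed_.push_back(std::move(t)); }

   private:
      std::vector<Token> tokens_;
      std::vector<Token> pushed_;
      std::size_t pos_ = 0;
  };

The paragraph, PRE, table and "command" rules all sit on top of something like that and decide what to emit, instead of one flag getting flipped every time a blank line shows up.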

This also goes for some of the "commands" that have been part of the input files all this time. When I include an image, I've been doing a special little thing that says "generate the IMG SRC gunk with the right path for this file with this height and width". This lets me ensure that the http and https feeds don't get cross-protocol URLs, among other things. The "this post has an update" lines and the backwards links to older posts also work this way.

This HAD been working with a bunch of nasty stuff that was basically building HTML from strings. You know the type, right? You print the left bracket, IMG SRC=, then you have to do a \" to get a literal " in there without ending the string... and then you end the string. Then you add the filename, and start another string and put a \" in it to cap off the SRC attribute of the IMG tag, and so on and so forth...
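
In concrete terms, the old approach looked more or less like the first function below, and the replacement pushes the quoting and escaping into one emitter that owns those rules. Invented names, sketch only:

  #include <string>
  #include <utility>
  #include <vector>

  // The old way, roughly: paste strings together, escape the quotes by hand.
  std::string old_img_tag(const std::string& path, int w, int h) {
      return "<img src=\"" + path + "\" width=\"" + std::to_string(w) +
             "\" height=\"" + std::to_string(h) + "\">";
  }

  // The new shape: one place that knows how to quote and encode attribute values.
  std::string escape_attr(const std::string& in) {
      std::string out;
      for (char c : in) {
          if (c == '&')      out += "&amp;";
          else if (c == '"') out += "&quot;";
          else if (c == '<') out += "&lt;";
          else               out += c;
      }
      return out;
  }

  std::string emit_tag(const std::string& name,
                       const std::vector<std::pair<std::string, std::string>>& attrs) {
      std::string out = "<" + name;
      for (const auto& [key, value] : attrs) {
          out += " " + key + "=\"" + escape_attr(value) + "\"";
      }
      return out + ">";
  }

The image "command" from the previous paragraph then boils down to a call like emit_tag("img", {{"src", path}, {"width", "640"}, {"height", "480"}}), with the per-feed path logic living upstream of it.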

I'm kind of wondering how many people are reading this and thinking I'm a clown vs. how many are just nodding their heads like "yeah, totally, that's how we do HTML all over the place". But I digress.

Now, actually doing this has meant coding it up, but it's also meant going back and converting all of the damn posts, too. Any place where I had raw HTML shenanigans going on (like doing my own "ampersand + l + t + semicolon" stuff) had to be found and changed back to the actual character I want there. The program itself will do that encoding for me now. It's nice to have, but it's a chore to go through it all without breaking anything - like the places where I actually WANT the literal gunk to show up.

With almost 5.5 MB of input text across 1400 posts, that was a non-trivial amount of work. I would not be surprised if I missed things that will pop up down the road and which will need to be hammered back down.

So yes, for a while, it will be "same clown, different circus". But, at least this time, I'll be trying to emit the right stuff.

I haven't set a date or anything for this. There's also the possibility of trying to solve some other dumb problems that vex certain (broken) feed readers at the same time, and I haven't decided whether to block the rollout of the one thing on the rollout of the other one. This matters because I'd rather not rewrite every single /w/YYYY/MM/DD/whatever/index.html page multiple times. Ideally, they'll only change the one time. (What can I say, I care about these things.)

While waiting on that, if you're a feed reader author, you can at least check on a few things. You aren't honestly taking the "updated" time from inside the feed and using that in the HTTP transaction (If-Modified-Since), right? Right?? You know those are two totally different things from different layers of the stack, and aren't interchangeable, right? The IMS value should come from the "Last-Modified" header I sent you in the first place.

Right, Akregator? Right, NextCloud-News?

It's crazy how long it took me to figure out why they were sending me reasonable-looking "IMS" values that I had never handed out. It wasn't until I looked inside the actual feed that the penny dropped.
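
To spell out what "correct" looks like from the client side: save the Last-Modified value the server hands you, and send exactly that back next time. Here's a rough sketch using libcurl, which will do the header bookkeeping for you - this is illustrative, not anybody's actual reader code, and the global init / error checking is omitted:

  #include <curl/curl.h>

  // last_modified is the value saved from the previous fetch (0 = none yet).
  // Returns the value to save for next time.
  long fetch_feed(const char* url, long last_modified) {
      CURL* curl = curl_easy_init();
      if (!curl) return last_modified;

      curl_easy_setopt(curl, CURLOPT_URL, url);
      curl_easy_setopt(curl, CURLOPT_FILETIME, 1L);    // record the Last-Modified header

      if (last_modified > 0) {
          // Conditional GET: If-Modified-Since comes from the stored header
          // value - never from the "updated" element inside the feed itself.
          curl_easy_setopt(curl, CURLOPT_TIMECONDITION, (long)CURL_TIMECOND_IFMODSINCE);
          curl_easy_setopt(curl, CURLOPT_TIMEVALUE, last_modified);
      }

      curl_easy_perform(curl);    // (body handling omitted)

      long new_last_modified = 0;
      curl_easy_getinfo(curl, CURLINFO_FILETIME, &new_last_modified);
      curl_easy_cleanup(curl);

      return new_last_modified > 0 ? new_last_modified : last_modified;
  }

When nothing has changed, the server answers 304 with no body, and the client keeps right on using the value it already had.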

Want to know how the sausage is made and why this happens? Okay, settle in.

The web pages and the feed files (yep, plural: http and https) are made by running the generator on my laptop. The wall time on that system winds up being used in the "updated" fields in the XML gunk that is the Atom feed. The files also get an mtime that's about the same... on the laptop. More on that in a bit.

This writes to a directory tree that's a git repo, and a few moments later there's a git add + git commit + git push that captures the changes and schleps them off to my usual git storage space.

Later on, I jump on snowgoose (that's my current web server machine) and have it pull from that same git storage space into a local directory and then rsync the new stuff out of that tree into the various document roots - there are multiple web sites on this box.

If you didn't know this already, git does not preserve mtimes. The mtimes on files it writes out are just "now", whatever that may be. It's usually a minute or two later than when I did the generation on my laptop, just because I don't usually push to "production" right away. I usually eyeball things on an internal machine first.

Now, rsync DOES preserve mtimes, but it's preserving values that aren't particularly interesting. They are just the time when "git pull" ran on the web server and brought in the new/updated versions of the files. It's not the same time that the actual feed was updated on my laptop.

Apache uses the mtime on the files, so it's handing out "Last-Modified: (whatever)" based on that "git pull". This is not going to match the "updated" XML blob in the feed itself.

So, what I get to consider is whether I want to go nuclear on this and come up with something that will actually *SET* the mtimes explicitly and make sure they stay set all the way to the document root, no matter where it is.
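
The actual stamping part is easy enough - something along these lines would pin a file to the time the generator ran, so Apache's Last-Modified would stop depending on when "git pull" happened to fire. (Sketch only; the real work is carrying the right timestamp through the whole pipeline.)

  #include <time.h>
  #include <utime.h>

  // Stamp a file's mtime (and atime) with the generator's own timestamp.
  // Returns 0 on success, -1 on error (check errno).
  int stamp_mtime(const char* path, time_t generated_at) {
      struct utimbuf times;
      times.actime  = generated_at;
      times.modtime = generated_at;   // this is what Apache derives Last-Modified from
      return utime(path, &times);
  }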

Besides the broken feed fetchers, there's another reason to care about this sort of thing. What if I get a second web server, and put it behind a load balancer? Requests could be served by one or the other. Imagine if the two web heads did their "git pull" at two different times. Clients would get one Last-Modified value from server #1 and another value from server #2. Chaos! Madness! Insanity!

Now, I don't have a second web server, and in fact have no plans to do that unless people want to start throwing a LOT of money at me to run one in a colocation rack somewhere. But, it's the principle of the thing: controlling important values explicitly instead of leaving them to chance, *especially* since I'm expecting other people to do their part with those same values.

It's funny, right. I never thought I'd miss XHP until I started doing this project, and I didn't even do that many (internal) web pages at FB - just the ones I absolutely needed because nothing else would do.

