Boiling the machines when they really needed to just chill

 3 years ago
source link: http://rachelbythebay.com/w/2020/04/28/boil/

Back in my school district sysadmin days, I used to get things done with whatever equipment was available. Most of my Linux boxes were old recycled classroom boxes that used to run Reader Rabbit and Kid Pix. Kindergarteners had possibly tried to feed them lollypops through their floppy drives. That kind of thing.

Most of the money went into the Windows side of the house, and the network equipment itself: routers, switches, and a bunch of telephone stuff. There was a "data room" of sorts which held most of that stuff, but not mine. It had been a copier repair room once upon a time, with a vent hood to suck out nasty air... and a nice sink in case you needed to wash something out.

This sink was eventually removed, but the fact remains that for a while, there was a large utility sink that probably could have been turned back on from the cut-off valves underneath in that room. It was maybe three feet from a multi-million dollar investment in telco and network stuff.

This room also had its own air conditioner, although apparently that came much later in the game. I can remember that room being very hot, and equipment constantly failing. Then someone decided to give it a dedicated AC and things got better.

Yeah, this is where your tax dollars were going in the '90s.

I tended to keep an eye on things remotely. I was far away, doing a contract gig. I also didn't know the meaning of "work/life balance" back then, and used to keep "tail -f" windows running 24/7 on my home workstation boxes. If you started up a spam run, started port scanning my networks, or really did anything else, I'd probably notice because I was probably there, doing my own thing and watching it out of the corner of my eye.

So one night a few years into this, I came back to my desk and noticed that the "rest of the district" had seemingly disappeared. The network topology back then was an interesting Conway's Law situation, where my stuff ruled the links to the outside world, and then connected inward to a big "core" router that connected further inward to all of the schools. Further reinforcing this situation was the fact that nearly all of my systems sat on what had been my desk in the office, nowhere near the aforementioned "data room".

When that thing dropped out, I found a way into another machine in that room and started poking around. Eventually it dawned on me: that room was HOT. Very very hot. That one big core router had started yelling, and sure enough, it must have given up and shut itself down.

I called the boss directly. He lived close by. He hopped in the car and drove over, and I heard the news a few minutes later: when he opened the door to that "data room", a wall of heat hit him in the face. He described it like a "blast furnace" and probably well over 130 degrees F.

Clearly, the AC in that room was no longer functional. He propped a door open and grabbed a fan somehow and proceeded to start airing it out. I could do nothing from my distant location and just waited to hear the story the next day.

It turned out that some kids had apparently gotten onto the roof of the (attached) elementary school, found a breaker box with a nice big 0/1 switch without a padlock keeping it in place, and threw it. That breaker obviously fed the AC, and without power, hey, no air conditioning.

The rest basically writes itself. I'm kind of surprised nothing died on the spot. We probably shortened the life of a bunch of equipment in there, though.

So then there was another night about a year later. It was around 1 in the morning and one of my machines died. It was the firewall box for everything leaving the district, so when it went down, a bunch of my systems buried on the inside couldn't hit the outside world to Do Stuff any more. Their rsync mirror scripts of various distant repos would fail, for instance.

Like good little boxes, they mailed me when it happened, and biff/comsat did the right thing and beeped my console when I got the mails. Beep! I can't get to rsync.foo.bar, said one. Beep! I can't get to some other thing, said another.

I hopped on the box and tried a traceroute, and sure enough, it croaked at the hop where the firewall box should be. I tried to get on the console using this odd little device called a PC Weasel which emulated an MDA adapter (!) to give you a serial console like a Real Computer would have already had built in. It didn't work. Okay, now that was bad news, since for that to be toast, the whole machine would have to be powered off, on fire, underwater, or in some other similarly bad state.

Before raising an alarm I tried to figure out the nature of the problem using what info I could find from far away. My switch had logged the firewall's port shutting down a few minutes before the rsyncs gave up and mailed me. An OS crash isn't going to bring the link down on a switch port. Something bigger obviously happened.

It was around this time I got the idea to look at the temperatures on adjacent devices. This server, like most of my other boxes, was still at my old desk location. Some of them had actual temperature readings from various spots inside the case. One usually ran in the mid-30s (C, so about 95F) and this morning it was up around 50C (122F) ... bad news!

Other systems were also giving me numbers that seemed high, but I didn't know what "normal" was for them. The only one I could compare was that first one, and it had an actual hourly log. Normally, in that log, it would top out at 40C (104F). That night, this is what it recorded:

  • 16:00:00 40.0C / 104.0F
  • 17:00:02 42.3C / 108.1F
  • 18:00:05 44.1C / 111.4F
  • 19:00:07 45.4C / 113.7F
  • 20:00:10 46.8C / 116.2F
  • 21:00:12 47.7C / 117.9F
  • 22:00:00 48.6C / 119.5F
  • 23:00:02 49.0C / 120.2F
  • 00:00:05 49.9C / 121.8F
  • 01:00:07 50.8C / 123.4F
  • 02:00:10 51.3C / 124.3F
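To make the trend in that log concrete, here is a minimal Python sketch that parses the eleven readings above, checks each one against the usual 40C ceiling, and cross-checks the Celsius and Fahrenheit columns. The readings are copied from the list with the bullets stripped; the regular expression, the function names, and the idea of post-processing the log this way are illustrative assumptions, not anything from the original setup.

```python
import re

# The eleven hourly readings, copied from the list above (bullets removed).
LOG = """\
16:00:00 40.0C / 104.0F
17:00:02 42.3C / 108.1F
18:00:05 44.1C / 111.4F
19:00:07 45.4C / 113.7F
20:00:10 46.8C / 116.2F
21:00:12 47.7C / 117.9F
22:00:00 48.6C / 119.5F
23:00:02 49.0C / 120.2F
00:00:05 49.9C / 121.8F
01:00:07 50.8C / 123.4F
02:00:10 51.3C / 124.3F
"""

# Assumed line layout: "HH:MM:SS <celsius>C / <fahrenheit>F".
LINE = re.compile(r"(\d{2}:\d{2}:\d{2}) (\d+\.\d+)C / (\d+\.\d+)F")

NORMAL_MAX_C = 40.0  # the usual ceiling mentioned in the text


def c_to_f(celsius):
    """Convert Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


def parse(log):
    """Return (timestamp, celsius, fahrenheit) tuples for each matching line."""
    readings = []
    for line in log.splitlines():
        match = LINE.fullmatch(line.strip())
        if match:
            stamp, c, f = match.groups()
            readings.append((stamp, float(c), float(f)))
    return readings


readings = parse(LOG)
celsius = [c for _, c, _ in readings]

# Readings strictly above the usual ceiling.
above_normal = [(stamp, c) for stamp, c, _ in readings if c > NORMAL_MAX_C]

# True when every reading is hotter than the one before it.
always_rising = all(later > earlier for earlier, later in zip(celsius, celsius[1:]))

# Total climb from the first reading to the last, in degrees C.
total_rise = round(celsius[-1] - celsius[0], 1)

# Largest gap between a logged Fahrenheit value and the converted Celsius one.
max_conversion_gap = max(abs(c_to_f(c) - f) for _, c, f in readings)

print(f"{len(above_normal)} of {len(readings)} readings above {NORMAL_MAX_C}C")
print(f"always rising: {always_rising}, total rise: {total_rise}C")
print(f"largest C/F rounding gap: {max_conversion_gap:.2f}F")
```

Ten of the eleven readings sit above the usual ceiling, every hour is hotter than the last, and the climb works out to 11.3C over ten hours, or a bit more than 1C per hour.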

Now, at the time, it was the middle of winter, and the outside air temperatures were freezing. This made no sense. I assumed the inside AC for the office had gotten whacked this time, and the inside was more or less boiling.

Once again, I woke up the boss, and he headed over. His initial report was that the office space was "well over 95 degrees" (F), which was quite a problem indeed! I assume he propped the doors open and got some fans going, since the temperature started coming down. Later, I got him to look at my machine, and sure enough, it was powered off. I guess it hit a thermal trip point and shut itself down rather than baking itself to death. Meanwhile, the other systems were still just clipping along, ready to turn into piles of slag at some point, I guess?

About an hour later, he had managed to raise the facilities people, and they figured it out: someone had been working on the "air handler" for the office space that afternoon. They had flipped it to some "manual" setting where it ignored the thermostat's request to stop sending heat. Since it was the middle of January, it was in "heat" mode.

With that combination of settings, it proceeded to pump out hot air from the furnace into the office space all night long. That's why the log shows the temperature roughly normal around quitting time, and then it just kept going up, and up, and up.

The immediate fix was to throw it over to "disabled" and just let the joint cool back down, and that's what they did. My machine was willing to come back up once it had cooled off, and it kept on working, never showing any signs of trouble. I guess the thermal safety did its job.

Oddly enough, when this happened, the data room was just fine, since it had that separate AC/thermostat setup... and no furnace.

Two different times, my tendency to stay up late and keep an eye on things caught situations that could have killed a bunch of equipment. How incredibly random!

A padlock would fix the first one, but the second? How do you deal with a tech who leaves a system in a state where it will do nothing but dump hot air at full speed all night long?

Trust but verify, I guess. Set up monitoring and hope it never fires.
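For what "set up monitoring" might look like at its smallest, here is a hedged Python sketch of a check that could run once an hour against recent temperature samples. It raises two kinds of alert: one when the latest sample is over a fixed limit, and one when the temperature has risen several hours in a row, which is the pattern a stuck furnace produces. The 40C limit is borrowed from the usual ceiling mentioned earlier; the three-hour streak, the function names, and the alert wording are all assumptions made for illustration. Reading the sensor and delivering the alert are left out, since the story doesn't say how either was done.

```python
def rising_streak(samples_c):
    """Count how many of the most recent samples each rose above the previous one."""
    streak = 0
    # Walk backwards from the newest sample until the rise is broken.
    for i in range(len(samples_c) - 1, 0, -1):
        if samples_c[i] > samples_c[i - 1]:
            streak += 1
        else:
            break
    return streak


def evaluate(samples_c, limit_c=40.0, max_streak=3):
    """Return alert strings for a list of hourly Celsius samples, oldest first.

    limit_c and max_streak are hypothetical defaults, not values from the story.
    """
    alerts = []
    if not samples_c:
        return alerts

    latest = samples_c[-1]
    if latest > limit_c:
        alerts.append(f"temperature {latest:.1f}C is above the {limit_c:.1f}C limit")

    streak = rising_streak(samples_c)
    if streak >= max_streak:
        alerts.append(f"temperature has risen {streak} hours in a row")

    return alerts


# A quiet evening: small wobbles under the limit, nothing to report.
print(evaluate([36.0, 35.5, 36.2]))

# The first four readings from the night of the furnace: both alerts fire.
print(evaluate([40.0, 42.3, 44.1, 45.4]))

# A steady climb that is still under the limit: only the trend alert fires.
print(evaluate([30.0, 31.0, 32.0, 33.0]))
```

The trend check is the part that earns its keep here: a fixed limit only fires once things are already too hot, while a run of consecutive increases can flag a room that is heading there.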

