We might want to regularly keep track of how important each server is

February 5, 2024

Today we had a significant air conditioning failure in our main machine room, one that certainly couldn't be fixed on the spot ('glycol all over the roof' is not a phrase you really want to hear about your AC's chiller). To keep the machine room's temperature down, we had to power off as many machines as possible without too badly affecting the services we offer to people here, which are rather varied. Some choices were obvious; all of our SLURM nodes that were in the main machine room got turned off right away. But other machines weren't ones we necessarily remembered right away, or we weren't clear on whether they were safe to turn off and what effects that would have. In the end we took several rounds of turning servers off, looking at what was left, spotting remaining machines, and turning more things off, and we're probably not done yet.

(We have secondary machine room space and we're probably going to have to evacuate servers into it, too.)

One thing we could do to avoid this flailing in the future is to explicitly (try to) keep track of which machines are important and which ones aren't, so that we can pre-plan which machines we could shut down if we had a limited amount of cooling or power. If we documented this, we could avoid having to rack our brains at the last minute and worry about dependencies or uses that we'd forgotten. Of course documentation isn't free; there's an ongoing amount of work to write it and keep it up to date. But possibly we could do this work as part of deploying machines or changing their configurations.

(This would also help identify machines that we didn't need any more but hadn't gotten around to taking out of service, which we found a couple of in this iteration.)
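As a rough illustration of what such a record could look like, here is a minimal Python sketch. Everything in it is an assumption made up for the example: the machine names, the three importance tiers, the wattage figures (standing in for heat load), and the 'needed_by' dependency field. It isn't something we actually run. It picks machines to power off, least important first, and holds off on a machine while anything that depends on it is still up.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    tier: int    # hypothetical scale: 1 = keep up if at all possible, 3 = first to go
    watts: int   # rough power draw, used here as a stand-in for heat load
    needed_by: list[str] = field(default_factory=list)  # machines that depend on this one
    purpose: str = ""


def shutdown_plan(inventory, watts_to_shed, protect_tier=1):
    """Return (names to power off in order, total watts shed).

    Machines at or below protect_tier are never selected. A machine is
    only selected once every machine listed in its needed_by is already
    in the power-off list.
    """
    # Most expendable first; within a tier, bigger power draw first.
    candidates = sorted(
        (s for s in inventory if s.tier > protect_tier),
        key=lambda s: (-s.tier, -s.watts),
    )
    off = []
    shed = 0
    progress = True
    while shed < watts_to_shed and progress:
        progress = False
        for s in candidates:
            if all(dep in off for dep in s.needed_by):
                off.append(s.name)
                shed += s.watts
                candidates.remove(s)
                progress = True
                break  # rescan from the most expendable machine
    return off, shed


# Entirely made-up inventory for illustration.
inventory = [
    Server("slurm-node1", tier=3, watts=400, purpose="compute node"),
    Server("slurm-node2", tier=3, watts=400, purpose="compute node"),
    Server("slurm-head", tier=3, watts=200,
           needed_by=["slurm-node1", "slurm-node2"], purpose="SLURM controller"),
    Server("testbox", tier=2, watts=150, purpose="unknown, ask around"),
    Server("fileserver", tier=1, watts=500,
           needed_by=["slurm-node1", "slurm-node2", "mail"], purpose="home directories"),
    Server("mail", tier=1, watts=300, purpose="email"),
]

plan, shed = shutdown_plan(inventory, watts_to_shed=900)
print(plan, shed)
```

The scheduling logic here is trivial; the valuable part would be the inventory itself, especially the 'purpose' and 'needed_by' fields, since those are exactly the things we had to reconstruct from memory today.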

Writing all of this down just in case of further AC failures is probably not all that great a choice of where to spend our time. But writing down this sort of thing can often help to clarify how your environment is connected together in general, including things like what will probably break or have problems if a specific machine (or service) is out, and perhaps which people depend on what service. This can be valuable information to have. The machine room archaeology of 'what is this machine, why is it on, and who is using it' can be fun occasionally, but you probably don't want to do it regularly.

(Will we actually do this? I suspect not. When we deploy and start using a machine, its purpose and so on feel obvious, because we have all of the context.)
