Puppet: Sleep through the night with self-healing infrastructure

What is self-healing infrastructure and why do you need it? The first part is easy; it’s exactly what the name implies: a methodology for creating automation that allows systems to identify and repair errors and misconfigurations without any human action. The “why” is a little more complex but, like self-healing infrastructure itself, well worth the effort.

Self-healing infrastructure is the next natural progression in automation. Automation has become a necessity in today’s environment where expectations on IT are high, demand for talent far outpaces the supply, and threat remediation requires immediate action. Self-healing infrastructure is also the basis of AIOps. Without the ability to automatically remediate errors and misconfigurations, an IT group won’t be able to scale.

The reason for implementing self-healing infrastructure should be (in my humble opinion, anyway) to remove interruptions from your workday (and your nights). The goal should be to adopt a well-scheduled work cycle where interruptions are the exception, not the norm. When I was a DevOps engineer, I had a list of projects I was going to complete “when things slowed down.” That list served me well for many years, and I passed it on, sepia-toned and faded, to my replacement when I moved on. Imagine a world where those projects get worked on, implemented, and checked off… and where the phone isn’t ringing at 3 a.m. because a disk drive is full.

How to get started

  • Start small: Identify common, repetitive support issues and compliance-breaking configuration changes. Those five-minute fixes that take very little effort individually but add up over time are a great way to cut down on the tickets!

  • Keep it simple: A complex, multi-step remediation process isn’t the place to start; a single resource that enforces a known-good state is plenty (see the sketch after this list).

  • Set guidelines and be transparent: Pick your tools, enforce their use, and have the automation stored in a code repository to make the solutions readily available. As your self-healing library expands, there will be more and more opportunity to reuse the code, keeping things simple and standardized.
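To make the first two points concrete, here is a minimal sketch of what a small self-healing building block can look like in Puppet. The service name myapp and the module file path are hypothetical; the point is that the agent enforces this desired state on every run, so a stopped service or a hand-edited config file gets repaired with no human action:

    # Hypothetical service: if it crashes or someone stops it,
    # the next agent run starts it again.
    service { 'myapp':
      ensure => running,
      enable => true,
    }

    # If the config file drifts from the version kept in the module,
    # Puppet restores it and restarts the service.
    file { '/etc/myapp/myapp.conf':
      ensure => file,
      owner  => 'root',
      group  => 'root',
      mode   => '0644',
      source => 'puppet:///modules/myapp/myapp.conf',
      notify => Service['myapp'],
    }

Stored in a module in your code repository, a resource like this is exactly the kind of readily available, reusable piece the third point describes.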

How NOT to get started

  • Don’t automate tasks that change frequently. Don’t trade being interrupted by errors and outages for being interrupted to debug the automation.

  • Avoid complex, multi-step solutions. Monolithic self-healing routines are brittle and high maintenance. It’s better to make the self-healing responses modular and, if appropriate, put them in a toolchain.

  • Mind the ROI. Spending days scripting and testing automation for a task that takes you five minutes once a month isn’t freeing up your time. Target quick hits and repeat offenders.

A concrete example

So what would be a practical, actionable example of a task or scenario that makes sense for creating self-healing infrastructure? Think of ways you can complete this sentence in your environment:

Every time X happens, I have to Y.

It looks like there are two variables to consider here, but there are actually three parts to this self-healing equation. You need to identify the incident (X) and the action taken every time to remediate it (Y), but also the signal that makes you aware of the incident in the first place. More often than not, that signal is an alert from your monitoring system. So the manual, human-driven workflow today is: 1) I get an alert, 2) the alert tells me (or I work out from the message and the symptoms) what is going on, and 3) I take action to resolve it.

When broken down like this, it becomes easier to see how to automate these three actions into a self-healing workflow. Identify the message or alert that indicates an incident has occurred, then tie an automated action to it. Optionally, add confirmation that the incident has, in fact, been resolved, and notify your monitoring platform to acknowledge and resolve the alert.
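As a hedged sketch of how that detect/act/verify loop can collapse into a single declarative resource (Puppet agent runs are periodic rather than alert-driven, and the myapp service with its local /health endpoint is an assumption for illustration):

    # Detect: the 'unless' check plays the role of the alert condition.
    # Act: the command is the remediation.
    # Verify: the same check runs again on the next agent run, and the
    # exec stays idle as long as the health endpoint answers.
    exec { 'restart myapp when unhealthy':
      command => 'systemctl restart myapp',
      unless  => 'curl -fsS http://localhost:8080/health',
      path    => ['/usr/bin', '/bin'],
    }

Closing the loop with your monitoring platform (acknowledging and resolving the alert) would typically go through that platform’s API rather than the manifest itself.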

A very common example is disk space utilization. Frequently, a human will be woken in the middle of the night by an alert that a disk is 90% full. Then the human will log into a system to delete old files, clean up large binaries left over from other users, compress old logs, or take some other space-saving measures to resolve the alert. A more permanent solution would be to also extend the size of a volume and filesystem to meet increasing demand. An important follow-up action for the next day is to contact the owner of the system and remind them (for the fourth time this week) that their system is filling up filesystems and can they please do something about it.
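Here is a hedged sketch of the cleanup half of that scenario in Puppet, assuming (hypothetically) application logs under /var/log/myapp on a /var volume: the tidy resource handles routine deletion, and the exec only fires when usage crosses the alert threshold:

    # Routinely delete logs older than two weeks.
    tidy { '/var/log/myapp':
      age     => '2w',
      recurse => true,
      matches => ['*.log', '*.log.gz'],
    }

    # If /var is at or above 90% used, compress logs older than a day.
    exec { 'compress old myapp logs':
      command => 'find /var/log/myapp -name "*.log" -mtime +1 -exec gzip {} +',
      onlyif  => 'test "$(df --output=pcent /var | tail -n1 | tr -dc 0-9)" -ge 90',
      path    => ['/usr/bin', '/bin'],
    }

The more permanent fixes (growing the volume, or that fourth conversation with the system owner) are better left as deliberate follow-ups than automated away, at least at first.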

Tying together monitoring, ChatOps, and configuration management into a self-healing system can seem daunting, but if you start small, build reusable, composable pieces, and gradually work toward a full solution, you can move the humans from performing the healing to managing the self-healing process itself.
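One way to keep those pieces modular, sketched with Puppet Bolt (the selfheal module name and the cleanup command are assumptions for illustration): wrap each remediation in a small plan that a monitoring webhook or ChatOps bot can trigger, and return the result so the caller can report back to the humans.

    # A hypothetical Bolt plan: one reusable remediation step.
    plan selfheal::free_disk_space (
      TargetSpec $targets,
    ) {
      # The same cleanup logic as the manifest above, run on demand.
      $results = run_command(
        'find /var/log/myapp -name "*.log" -mtime +14 -delete',
        $targets,
      )
      # The caller (webhook handler, chat bot, or a person at a
      # terminal) decides how to report success or failure.
      return $results
    }

While you build confidence in it, the same plan can be run by hand with bolt plan run selfheal::free_disk_space --targets <your test node>.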

Let your systems work for you

Hopefully we’ve been able to pique your interest in self-healing automation for your infrastructure. This is a very basic outline of the theory; the practice itself can vary widely depending on your environment and your needs. A good plan and a goal can help guide you in designing your self-healing infrastructure and keep it where it should be: a tool that improves your life and enhances the reliability of your IT estate. If you want to learn more, register for the webinar on self-healing infrastructure, or give us a shout! We’d love to hear from you.
