Programming to save time and to prove it could be done

Sometimes I write programs to solve problems I've been having. Once in a while I write programs to demonstrate to someone else that a given situation just does not require much drama. This usually happens after witnessing someone making a huge deal out of something when I am convinced it could be far simpler.

Version management was a case where both motivations applied. I was running a few dozen Linux boxes on my own, doing everything by hand. Meanwhile, just a few feet away, the "network engineers" were trying to make Microsoft's SMS (Systems Management Server) work on their fleet of NT machines. They were just trying to get a handle on what they had, where it was, and what was running.

They didn't even have so much as a list of machines, what they had running, and which versions of things were installed. Granted, I only had a list because I manually maintained my notebook, but I knew I could do more.

When my CD Tower project started up, it suddenly added 20 more machines to my fleet, and manual maintenance went from a minor annoyance to a full-on pain in the rear. I decided to do something about it. If it also meant that I could come up with a solution which would let me say "nyah nyah" to the NT guys, that would be great.

I called my solution "vmanage" in my long tradition of simple project names. It had two parts. One ran on each of my Linux boxes and had a config file which told it what to do. When it started, it would read that file, which held a series of labels and directives and looked a bit like this:

apache /www/httpd -v | cut -f ... | cut ... | ...
ssh /usr/bin/ssh -v | head ... | cut ... | ...
sendmail /usr/sbin/sendmail ... | grep ...

Each line was a name followed by a shell pipeline which would cause that program to emit its version number. My program would just popen() each command in sequence and hang onto the result in local storage. If a command hit an error, it considered that program not installed and stored a placeholder instead.
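For illustration, here is a minimal Python sketch of that collection loop. It is not the original code: subprocess.run() with shell=True stands in for popen(), and the names parse_config, collect_versions and NOT_INSTALLED are made up for this example. It also treats empty output as "not installed", which is an assumption, since a shell pipeline can exit cleanly even when the first command in it is missing.

```python
import subprocess

# Hypothetical placeholder; the text only says "a placeholder" was stored.
NOT_INSTALLED = "(not installed)"


def parse_config(text):
    """Split each non-empty line into a (label, shell pipeline) pair."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        label, _, pipeline = line.partition(" ")
        entries.append((label, pipeline.strip()))
    return entries


def collect_versions(entries, timeout=10):
    """Run each pipeline through the shell and keep its output per label."""
    versions = {}
    for label, pipeline in entries:
        output = ""
        try:
            result = subprocess.run(
                pipeline,
                shell=True,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            if result.returncode == 0:
                output = result.stdout.strip()
        except (OSError, subprocess.TimeoutExpired):
            output = ""
        # A failed command or empty output both count as "not installed".
        versions[label] = output or NOT_INSTALLED
    return versions
```

Calling collect_versions(parse_config(config_text)) returns a dictionary keyed by label, which plays the role of the "local storage" described above.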

It was ugly and evil, but it did work. This was actually about the only way to find out exactly what versions of key services were running on my systems back then. In those days, Slackware packages had names like "sendmail.tgz" or "ssh.tgz", and they were frequently squashed down to fit into a DOS-style 8.3 namespace. They didn't have any versioning whatsoever, so you had to figure it out yourself.

Also, at the time, a bunch of key services didn't run from official upstream Slackware packages. They either didn't exist as a package, or a proper patched version didn't exist, or whatever. In the span of about a year, both OpenSSH and then sendmail had a bunch of issues, and it became very important to keep tabs on those things.

The other side of this was a client which ran as a CGI program. By default, it would reach out to all of the servers (using a list in a local config file) to grab all of their version data. Then it would render it as a simple HTML table where hostnames were one axis, package names were the other, and each cell had the version data.
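The rendering step might look something like this in Python. This is only a sketch: how the client actually talked to the servers isn't described here, so the version data is assumed to have already been gathered into a dictionary, and render_table is a made-up name.

```python
import html


def render_table(data, packages):
    """Render {hostname: {package: version}} as a simple HTML table.

    Hostnames run down the side, package names across the top.
    """
    rows = ['<table border="1">']
    header = "".join(f"<th>{html.escape(p)}</th>" for p in packages)
    rows.append(f"<tr><th>host</th>{header}</tr>")
    for host in sorted(data):
        cells = "".join(
            # "?" marks a package the host did not report at all.
            f"<td>{html.escape(data[host].get(p, '?'))}</td>"
            for p in packages
        )
        rows.append(f"<tr><th>{html.escape(host)}</th>{cells}</tr>")
    rows.append("</table>")
    return "\n".join(rows)
```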

With that done, I could watch my entire fleet. This gave me a glorified "to-do" list. As the system grew, it learned about different system types, and different acceptable values for certain packages. The CD towers, for instance, were allowed to run kernel version X, but the other machines had to run at least kernel version Y. Anything which was out of spec for a given "tree" or "flavor" would show up as an anomaly. I think I rendered it as a red cell, or something like that.
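A rough Python sketch of that out-of-spec check follows. The flavor names, the literal "X", "Y" and "Z" version strings and the function name find_anomalies are all placeholders invented for the example. It models "acceptable values" as a set of exact strings per flavor and does not attempt the "at least version Y" comparison.

```python
# Hypothetical spec: flavor -> package -> set of acceptable version strings.
ACCEPTABLE = {
    "cdtower": {"kernel": {"X"}},
    "general": {"kernel": {"Y", "Z"}},
}


def find_anomalies(host_flavors, versions, acceptable):
    """Return (host, package, seen_version) for every out-of-spec cell.

    host_flavors: {hostname: flavor}
    versions:     {hostname: {package: version}}
    """
    anomalies = []
    for host, flavor in sorted(host_flavors.items()):
        spec = acceptable.get(flavor, {})
        for package, allowed in sorted(spec.items()):
            seen = versions.get(host, {}).get(package)
            if seen not in allowed:
                anomalies.append((host, package, seen))
    return anomalies
```

Each tuple in the result corresponds to one cell that would be flagged on the status page.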

But still, this was just read-only. It would tell me what was going on, but it wouldn't let me do anything about it. Even this was far more than the NT guys had managed to do with their fish-out-of-water thrashing about on their systems, but I wanted more. I was getting tired of logging into dozens of machines to do stuff.

I finally wound up adding some more code to both the client (CGI) and the server programs. When the client found a host which had a package which was out of spec, it would turn that table cell into a clickable link. That link would just call back to the same CGI program, but with new arguments which meant "make host X upgrade package Y". It knew that when you loaded it with those arguments, it was supposed to emit a command to that system instead of doing the usual status page.
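In Python, the "same CGI, different arguments" idea could be sketched roughly like this. The parameter names (action, host, package), the script path in the usage note and the function names are assumptions made for illustration; the actual argument format isn't shown in this post.

```python
import html
from urllib.parse import parse_qs, urlencode


def upgrade_link(script, host, package, label="out of spec"):
    """Build the clickable cell for a host/package that is out of spec."""
    query = urlencode({"action": "upgrade", "host": host, "package": package})
    href = html.escape(f"{script}?{query}", quote=True)
    return f'<a href="{href}">{html.escape(label)}</a>'


def dispatch(query_string):
    """Decide whether this request is an upgrade command or a status page."""
    args = parse_qs(query_string)
    if args.get("action") == ["upgrade"] and "host" in args and "package" in args:
        return ("upgrade", args["host"][0], args["package"][0])
    # No (or incomplete) arguments: fall back to the usual overview table.
    return ("status", None, None)
```

For example, upgrade_link("/cgi-bin/vmanage", "cdtower", "kernel") yields a link whose query string dispatch() turns back into ("upgrade", "cdtower", "kernel").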

The target machine would receive this request and would run a little package fetcher/installer script I had added as part of this scheme. That script would contact my "package server" and say "hey, I'm a machine of flavor X, and I need package Y". That "package server", which really was just another small CGI program, would then kick back the appropriate .tgz file. Assuming the transfer completed properly, the target machine would then install/upgrade that package using that freshly-downloaded file.

Finally, the target machine would refresh its local version cache by rerunning the full set of commands from the config file. This would make it pick up the newest version strings, including anything which may have changed as a result of the upgrade or installation it had just finished.
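The order of operations on the target machine (fetch, install only if the transfer worked, then refresh) can be sketched in Python as below. This is a skeleton under stated assumptions, not the original script: the real fetch, install and refresh steps are passed in as callables because their details aren't given here, and upgrade_package is a hypothetical name.

```python
def upgrade_package(flavor, package, fetch, install, refresh):
    """Fetch a package for this flavor, install it, then refresh versions.

    fetch(flavor, package) -> bytes   (should raise OSError on a failed transfer)
    install(package, payload)         (unpacks/installs the downloaded file)
    refresh()                         (reruns the version commands)

    Returns True if the package was installed, False if the fetch failed.
    """
    try:
        payload = fetch(flavor, package)
    except OSError:
        return False
    if not payload:
        # An empty download is treated as a failed transfer (an assumption).
        return False
    install(package, payload)
    # Only refresh the cached version strings after a successful install.
    refresh()
    return True
```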

Back in my web browser, I could just wait a bit and then refresh and it would usually show up as the right version. One click to kick it off and another to click back to my overview list was all it took. I made it a point to show this to the folks around me as a demonstration of "simple things should be simple".

One time, I even opened it up to a friend and got him to click on a few things in his browser. Since he could only kick off upgrades to the latest "blessed" versions of packages, he couldn't hurt anything. I showed him what happened as he clicked around.

Before:

cdtower:/# ls -lsa /vmlinuz
572 -rw-r--r-- 1 root root 581625 Sep 15 13:56 /vmlinuz

What he saw:

Requesting server update of kernel on (IP)...

What happened on the machine:

Jan 27 03:03:15 cdtower vmanaged[175]: Remote command: updating package kernel
cdtower:/# ls -lsa /vmlinuz
563 -rw-r--r-- 1 root root 572185 Jan 26 11:17 /vmlinuz

As I told him then, "congratulations, you just loaded Linux version (bar) on a server in Colorado". It was a big deal because he did it with just one click. Today, you'd go "so what?", naturally, but at the time, this was still considered special.

Yes, I still had to reboot the machine at some point to make the new kernel start. I didn't put that in as part of my installer. That was a little more than I wanted to automate at that point.

Other packages had their own "post-install" stuff which took care of things. Once I had proven a package's upgrade to be safe and reliable, its post-install step would restart whatever daemons might be involved.

Feb 25 23:14:06 cdtower vmanaged[185]: Remote command: updating package openssh
Feb 25 23:14:08 cdtower sshd[23995]: Received SIGHUP; restarting.
Feb 25 23:14:08 cdtower sshd[24036]: Server listening on 0.0.0.0 port 22.

My snarky line for all of this at the time was "just point and click!", and it was mostly intended to poke fun at the NT people.

Over the years, I gradually expanded and revised things. At one point, it had a "fix everything" button which would launch upgrade requests for anything which was out of spec. Then, I could just click that, watch all of those hosts pile on my package server CGI script, wait a few seconds, then refresh and see the results. Life was good, even if it did generate a local "Slashdot effect" for my little webserver box as all of my machines piled on at once to ask for updates!

At some point, I decided to abandon the "magic commands to get version data" schtick. There were some programs which would not yield a version number in any easy way. Also, just because something is version "1.23" doesn't mean it's necessarily the one I care about. Sometimes, a patch would be applied which did not change the version number. I really needed that patch, but how could I tell if it was running on some arbitrary machine? Both the patched and unpatched builds claimed to be "version 1.23".

My solution was to start tracking the versions of actual installed files by using MD5 hashes. It was a bit of a hack, but it did work.

[openssh]
  file /usr/local/bin/ssh
  sum 5a631821844d9c56392b793b9a44aa0f 
  info "[20020730] Custom OpenSSH 3.4p1 (OpenSSL 0.9.6e)"

The actual implementation of this system was pretty evil and messy, but the things it did for me were lovely and helpful. I probably wouldn't do it the same way today, but at the same time, I wouldn't try to build "the mother of all version control systems", either.

As for SMS on those NT machines, that turned into a disaster, but that story will have to wait for another post.
