
Chris's Wiki :: blog/sysadmin/LogMonitoringTarpit

source link: https://utcc.utoronto.ca/~cks/space/blog/sysadmin/LogMonitoringTarpit

Monitoring your logs is mostly a tarpit

August 6, 2023

One of the reactions to my entry on how systemd's auto-restarting of units can hide problems was a suggestion that we should monitor our logs to detect things like this. As it happens, one of my potentially unpopular views is that monitoring your logs is generally a tarpit that isn't worth it. Much of the time you'll spend a great deal of effort to get very little of worth.

The fundamental problem with general log monitoring is that logs are functionally unstructured. Anything can appear in them, and in a sufficiently complex environment, anything eventually will. This unstructured randomness means that sorting general signal from general noise is a large and often never-ending job, and if you don't do it well, you wind up with a lot of noise (which makes it almost impossible in practice to spot the signal).

One thing you can, in theory, sensibly monitor your logs for is specific, narrow signals of things of interest; for example, you might monitor the Linux kernel logs for machine check error (MCE) messages. The first problem with monitoring this way is that there's no guarantee that the message you're monitoring for won't change. Maybe someday the Linux kernel developers will decide to put more information in their MCE messages and change the format. One reason this happens is that almost no one considers log messages to be an API, so they feel free to change them at will.

(But in the meantime, maybe you'll derive enough value or reassurance from looking for the current MCE messages in your kernel logs. It's a tradeoff, which I'll get to.)
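
To make the 'narrow signal' idea concrete, here is a minimal sketch in Python of what such a match might look like, following the kernel log through journald and flagging anything that resembles an MCE message. The regular expression is only a guess at the current wording of these messages, which is exactly the fragility described above; treat the whole thing as illustrative rather than as a recipe.

    import re
    import subprocess

    # Follow kernel messages only ('-k'), as they arrive ('-f'),
    # with just the message text ('-o cat').
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-o", "cat"],
        stdout=subprocess.PIPE, text=True,
    )

    # Assumed wording of machine check messages; kernel developers
    # are free to change this at any time.
    mce_pattern = re.compile(r"mce:|Machine [Cc]heck")

    for line in proc.stdout:
        if mce_pattern.search(line):
            # A real setup would raise an alert here instead of printing.
            print("possible machine check error:", line.rstrip())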

The second problem with monitoring for specific narrow signals of interest in your logs is that you have to know what they look like. It's easy to say that we'll monitor for the Prometheus host agent crashing and systemd restarting it, but it's much harder to be sure that we have properly identified the complete collection of log messages that signal this happening. Remember, log messages are unstructured, which means it's hard to get a complete inventory of what to look for short of things like reading the (current) program source code to see what it logs.
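
As a rough sketch of what 'knowing what they look like' involves, here is the kind of pattern list you might end up writing for a unit that systemd has restarted. The unit name is hypothetical and the message wording is only approximately what current systemd versions log; there is no way to be sure this inventory is complete, which is the problem.

    import re

    UNIT = "prometheus-node-exporter.service"  # hypothetical unit name

    # Approximate wording of the systemd messages involved; the real set
    # may be larger, and the wording may change between systemd versions.
    crash_and_restart_signs = [
        re.compile(re.escape(UNIT) + r": Main process exited, code="),
        re.compile(re.escape(UNIT) + r": Scheduled restart job"),
    ]

    def looks_like_restart(log_line: str) -> bool:
        return any(p.search(log_line) for p in crash_and_restart_signs)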

Finally, all of this potential effort only matters if identifiable problems appear in your logs on a sufficiently regular basis and it's useful to know about them. In other words, problems that happen, that you care about, and that you can probably do something about. If a problem was probably a one-time occurrence or occurs only infrequently, the payoff from automated log monitoring for it can be quite low (you can see this as an aspect of how alerts and monitoring can never be comprehensive).

But monitoring your logs looks productive and certainly sounds good and proper. You can write some positive matches to find known problems, you can write some negative matches to discard noise, you can 'look for anomalies' and then refine your filtering, and so on. That's what makes it a tarpit; it's quite easy to thoroughly mire yourself in progressively more complex log monitoring.
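
For a picture of what this tends to turn into, here is a sketch of the logcheck-style structure such filtering usually grows: a list of patterns you alert on, an ever-growing list of known noise you discard, and a residue you review by hand. All the patterns here are made-up placeholders.

    import re

    alert_patterns = [
        re.compile(r"Machine [Cc]heck"),
        re.compile(r"Main process exited, code="),
    ]

    ignore_patterns = [
        re.compile(r"session opened for user"),  # routine login noise
        re.compile(r"Reached target "),          # routine systemd chatter
        # ...this is the list that never stops growing.
    ]

    def classify(line: str) -> str:
        if any(p.search(line) for p in alert_patterns):
            return "alert"
        if any(p.search(line) for p in ignore_patterns):
            return "ignore"
        return "unknown"  # the pile you end up reviewing by hand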

