
How good monitoring saved our ass ... again

Source: https://blog.jakubholy.net/2018/11/01/how-good-monitoring-saved-our-ass-again/


November 1, 2018
You know how it goes - suddenly people complain that your app does not work, you are getting plenty of timeouts or other errors in your error tracking tool, you find the backend app that is misbehaving, and finally "fix" the problem by restarting it. Phew!

But why? What caused the downtime? A glitch in an upstream system? Sudden overload due to a spike in concurrent users? Trolls?

You know that it helps sometimes to zoom out, to get the right perspective. Here the perspective was 7 days:

[Chart: the application's CPU usage over the last 7 days]

It was enough to look at this chart at the right zoom level to see at once that something happened on October 23rd that significantly changed the application's behavior. A quick search and, indeed, the change in CPU usage corresponded with a deployment. Reverting to the previous version quickly confirmed the culprit. (It would have been even easier if we displayed deployments on these charts.)
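As an aside, marking deployments on dashboards is cheap to automate. Here is a minimal sketch, assuming a Grafana-style annotations endpoint - the URL, environment variables, and tags are made up for illustration, not our actual setup:

```typescript
// Sketch: post a deployment marker to a Grafana-style annotations API so that
// deployments show up as vertical lines on the metric charts.
// The URL, env variables and tags below are illustrative, not a real setup.
async function markDeployment(version: string): Promise<void> {
  const response = await fetch('https://grafana.example.com/api/annotations', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.GRAFANA_TOKEN}`,
    },
    body: JSON.stringify({
      time: Date.now(),                 // when the deployment happened (epoch millis)
      tags: ['deployment', 'backend'],  // lets dashboards filter the annotations
      text: `Deployed version ${version}`,
    }),
  });
  if (!response.ok) {
    throw new Error(`Failed to create annotation: ${response.status}`);
  }
}

// Typically invoked from the CI/CD pipeline right after a successful deploy:
markDeployment(process.env.GIT_SHA ?? 'unknown').catch(console.error);
```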

This is not the first time good monitoring has saved us. A while ago the application regularly became sluggish and we had to keep restarting it. A graph of the Node.js event loop lag showed it increasing over time. Once it was on the same dashboard as Node's heap usage, we could see at once that it correlated with increasing memory usage - indicating a memory leak. A few hours of experimenting and heap dump analysis later, the problem was fixed.
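For the curious, event loop lag and heap usage are cheap to sample from inside the process. A minimal sketch - the report() function is a placeholder for whatever metrics client you use (StatsD, Prometheus, ...) and the metric names are made up:

```typescript
// Sketch: sample Node.js event loop lag and heap usage together so they can be
// plotted on the same dashboard. report() stands in for a real metrics client;
// the metric names are illustrative.
const SAMPLE_INTERVAL_MS = 1000;

function report(metric: string, value: number): void {
  console.log(`${metric}=${value.toFixed(2)}`); // replace with e.g. statsd.gauge(metric, value)
}

let lastTick = Date.now();
setInterval(() => {
  const now = Date.now();
  // The callback fires later than scheduled when the event loop is blocked by
  // long synchronous work or by the GC thrashing under a memory leak - that
  // extra delay is the lag.
  const lagMs = Math.max(0, now - lastTick - SAMPLE_INTERVAL_MS);
  lastTick = now;

  const heapUsedMb = process.memoryUsage().heapUsed / 1024 / 1024;

  report('eventloop.lag_ms', lagMs);
  report('heap.used_mb', heapUsedMb);
}, SAMPLE_INTERVAL_MS);
```

(Newer Node versions also ship perf_hooks.monitorEventLoopDelay, which gives a histogram of the lag without the manual timer trick.)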

So good monitoring is paramount.

Of course the trick is to know what to monitor and to display all relevant metrics in such a way that you can spot important relations. I am still working on improving that...
Tags: monitoring

Are you benefiting from my writing? Consider buying me a coffee or supporting my work via GitHub Sponsors. Thank you! You can also book me for a mentoring / pair-programming session via Codementor or (cheaper) email.

Allow me to write to you!

Let's get in touch! I will occasionally send you a short email with a few links to interesting stuff I found and with summaries of my new blog posts. Max 1-2 emails per month. I read and answer all replies.

