10

Check Your SQL Server Backup Performance & Safety with sp_BlitzBackups

 2 years ago
source link: https://www.brentozar.com/archive/2021/12/check-your-sql-server-backup-performance-safety-with-sp_blitzbackups/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Check Your SQL Server Backup Performance & Safety with sp_BlitzBackups

Out of all of the scripts in our free First Responder Kit, sp_BlitzBackups is probably the one you’ve used the least. Let’s talk about what comes out of it, and why you should use it more often.

First, let’s define two terms:

  • Recovery Point Objective (RPO) – measured in time, it’s how much data you would lose if you restored a backup. For example, if your last backup finished at 1AM this morning, and it’s currently 10AM, you’d lose 9 hours of data if the server went down right now. That’s a 9 hour RPO.
  • Recovery Time Objective (RTO) – also measured in time, it’s how much time it would take you to perform a recovery. For example, if the server went down right now, and it took you a total of 1 hour to decide to do a restore, start the restore process, finish the restore, and then let people back into the app again, that’s a 1 hour RTO.

Businesses have a really hard time agreeing on what their desired RPO and RTO are because they wanna lose as little data as possible, and they wanna be down for as short a time as possible. That’s where sp_BlitzBackups comes in. It can’t tell you what your goals are, obviously. Instead, it estimates what you’re actually delivering today.

When you run sp_BlitzBackups, it analyzes the last 7 days of backups. Here’s what it returns:

I know, tiny screenshot – you can click on it if you wanna see the results, but I’ll describe the non-obvious columns for you:

  • RPOWorstCaseMinutes – the longest length of time between two successful backups. Say you regularly do log backups every 15 minutes, but between 9:10AM and 9:40AM, the backup target was offline, and no backups could be done. You had successful log backups at 9:00AM and 9:45AM. Your worst case for RPO would be if the server went down at 9:44AM, just before the 9:45AM backup ran, so you’d have a 44-minute worst case RPO.
  • RTOWorstCaseMinutes – the longest length of time it would take you to do a restore and get back online. Say you do full backups every night, and log backups every 15 minutes. Your worst case RTO scenario would be if the server went down right before the nightly full backup because you’d have to restore the prior day’s full backup, plus all day long’s transaction log backups. In order to calculate this number, we take the total backup time spent during all of those backups. Technically, this isn’t accurate because your restores could take longer than the backups, especially due to the lack of instant file initialization on the transaction log. This just gives you a rough starting point idea, though.
  • Supporting info – when people see these “worst case” numbers, their next question is, “Holy smokes, when was the time where we might have lost 44 minutes worth of data?” The following columns give you information about what the dates/times were and the backups involved, plus more-info queries so you can examine the history stored in MSDB, like this:

Wanna performance tune your backups too?

Continue scrolling to the right in the result set, and you’ll find:

  • FullMBpsAvg, Min, Max – the throughput you’re getting in megabytes per second.
  • FullSizeMBAvg, Min, Max – how large your backups are, before compression.
  • FullCompressedSizeMBAvg, Min, Max – how large your backups are, after compression.
  • Similar columns for Diffs, Logs – these are useful if you need to estimate change rates.

I love using backup throughput as an early warning system, like a canary in the coal mine. If backup throughput suddenly drops, it’s a sign that something went wrong with the storage or the networking. We can’t tell you whether it’s a problem with slower read speeds, slower write speeds, or more traffic on the storage network, but it’s just your clue that it’s time to start troubleshooting – because user queries are likely running more slowly too. And when backup throughput drops simultaneously across multiple SQL Servers, that’s an even bigger clue that something went seriously awry with the company’s overall storage.

When you sit down at your desk each morning, backups aren’t usually the first thing on your mind. Just take a few minutes today to double-check that you’re actually backing this stuff up as frequently as you think, and that downtime will be as short as you think. I wish I had a dollar for every time a client’s DBA got caught with their pants down because backups weren’t scheduled correctly or were regularly failing at specific days/times.

Oh wait – I actually do have a dollar for each of those times. Several dollars, in fact. Look, let’s not get sidetracked.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK