5

SQL Server CPU 利用率毛刺的分析定位与解决

 3 years ago
source link: https://www.cnblogs.com/xuliuzai/p/15019204.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

1.1 问题描述

近期发现一台SQL Server的CPU利用率很不稳定,发现不定时的飙升到100%,更可怕的是在业务繁忙时,影响了业务调用,失败率明显增加,所以,减低CPU的利用率,是迫切需要解决的问题。 

CPU升高的原因直观上来说,就是CPU(中央处理器)的负载过高, 中央处理器忙不过来。进一步分析的话,可以从两个角度优化,1.减少单个CPU 的处理时间;2.减少单个任务占有的CPU核数,即一个任务不要分配太多的CPU核数。

1.2 优化的方法

1.表结构的优化,例如索引是否合理、关联表字段的定义是否一致等;

2.SQL 语句的优化;

3.表数据量是否归档、缩减;

4.将数据缓存到缓存层(如,Reids),减少对DB的访问;

5.DB实例配置是否需要优化;

6.升级硬件。 

二. 问题处理过程

 2.1 优化前

从这个监控图可以看出,CPU最大值为100%,平均值为19%,毛刺比较明显。

2.2 定位SQL语句

通过常用的SP,很快定位到了SQL语句,是关于一张表的查询,语句简单,但是表的数据量比较大(7600W),查询出的数据有(4000W)。这张表每天的写入量<100W。

并且和研发确认,此SQL的调用也是周期性的,比如5分钟查询一次,基本符合Zabbix周期性CPU毛刺突起。

2.3 处理步骤

Step 1 .考虑到,CPU突然飙升,毛刺陡然加剧,冲刺到100%,并且问题SQL 不是最近新上线,所以,我们的第一反映是 索引走偏,统计信息失效了。但是 刷新了 表统计信息 ,情况没有好转。

Step 2. 考虑到是不是索引失效了,我们坚持到业务低峰期,重建了表的索引,情况 依然没有好转。

Step 3.考虑是不是表的数据量到了一定规模,才导致的此问题,和业务研发确认后,将历史数据归档,归档了2800W,数据量由7600W减少到4800W。数据量减少后,情况有所好转,SQL事务的排队和阻塞 明显减少。但是毛刺突起依然明显,CPU 100% 依然存在。

..........

.........

Step 4 这时想到,最大并行度 。当SQL Server 发现一条指令比较复杂(不仅仅是SQL语句复杂,查询的数据量比较大也是复杂),会决定用多少个线程并行执行,从而提高整体相应时间。如果指令复杂,甚至需要所有CPU来运行这些线程,别的用户发过来的指令会受到影响,甚至可能拿不到CPU执行。即需要调整max degree of parallelism的值

查看问题实例 发现没有设置,即可以使用所有的CPU。修改参数,将最大并行度将至4.执行以下命令:

exec sp_configure 'max degree of parallelism',4
go
RECONFIGURE
GO

此时 毛刺消失了,问题解决了。

2.4 优化后

 优化后,从监控图中可以看出,CPU的最大值降至了25%,平局值为7%。

 三.定位问题SQL常用的SP

当我们遇到DB性能问题或DBServer监控指标异常时,以下四个SP,可以帮忙我们快速定位SQL语句。

 3.1 查看当前阻塞排队的情况

/*  
---------------------------------------------------------------------------------  
uspm_Block  
功能:查看阻塞和锁,阻塞源头 
参数:无 
---------------------------------------------------------------------------------  
*/  
create PROCEDURE  [dbo].[uspm_Block]  
as  

--查找有关被阻塞的请求的信息(含用户)
SELECT  s.loginame
       ,[Individual Query] = SUBSTRING (qr.text,qs.statement_start_offset/2, 
         (CASE WHEN qs.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX), qr.text)) * 2 
               ELSE qs.statement_end_offset 
          END - qs.statement_start_offset)/2) 
       ,qs.session_id ,s.counts AS [进程个数],qs.status ,qs.blocking_session_id
       ,qs.wait_type ,qs.wait_time ,qs.wait_resource 
       ,qs.transaction_id 
FROM SYS.DM_EXEC_REQUESTS qs (nolock)
  LEFT JOIN (
             SELECT spid,MAX(loginame)AS loginame,COUNT(0)AS counts FROM SYS.SYSPROCESSES (nolock) GROUP BY spid
            ) s ON qs.session_id=s.spid
  OUTER APPLY SYS.DM_EXEC_SQL_TEXT(qs.sql_handle) AS qr
WHERE qs.status = N'suspended'
--and s.loginame<>''
ORDER BY qs.wait_time DESC

--查找阻塞源头v3.0
SELECT SP.spid
      ,CASE WHEN ST1.text IS NULL THEN ST2.text
            ELSE SUBSTRING (ST1.text,SR.statement_start_offset/2, 
                 (
                  CASE WHEN SR.statement_end_offset = -1 THEN LEN(CONVERT(NVARCHAR(MAX), ST1.text)) * 2 
                       ELSE SR.statement_end_offset 
                  END - SR.statement_start_offset)/2
                  )
        END AS [T-sql]
      ,SP.loginame
      ,DB_NAME(SP.dbid) AS [db_name]
      ,SP.open_tran,SP.hostname,SP.program_name,SP.waitresource,SP.*
FROM SYS.SYSPROCESSES SP (nolock) 
  LEFT JOIN SYS.DM_EXEC_REQUESTS SR (nolock) ON SP.spid=SR.session_id
  LEFT JOIN SYS.DM_EXEC_CONNECTIONS SC (nolock) ON SP.spid=SC.session_id
  OUTER APPLY SYS.DM_EXEC_SQL_TEXT(SC.most_recent_sql_handle) AS ST2
  OUTER APPLY SYS.DM_EXEC_SQL_TEXT(SR.sql_handle) AS ST1
WHERE SP.spid IN
      (
         SELECT BLOCKED FROM SYS.SYSPROCESSES (nolock) WHERE BLOCKED<>0
      )
  AND SP.BLOCKED=0
GO

 3.2 查看 当前最消耗CPU的SQL

/*
---------------------------------------------------------------------------------
uspm_perf_topcpu
功能:取当前N个最耗CPU的SQL
参数:@topCount   --N的具体数量,默认取前20条
示例:
     1.取前10条
      exec uspm_perf_topcpu 

     2.取前20条最耗CPU的SQL
      exec uspm_perf_topcpu @topCount=20 
---------------------------------------------------------------------------------
*/
CREATE PROCEDURE  [dbo].[uspm_perf_topcpu]
(@topCount int=10)
as 
set nocount on
declare @cmd varchar(1000)
select @cmd='
        SELECT TOP '+ CAST(@topCount AS VARCHAR)+'  SUBSTRING(text, (statement_start_offset/2) + 1,
            ((CASE statement_end_offset
                                WHEN -1 THEN DATALENGTH(text)
                                ELSE statement_end_offset
                                END - statement_start_offset)/2) + 1
                            ) AS query_text
                            ,b.hostname
                            ,b.loginame
                            ,a.*
                            ,qr.text
                            ,qt.query_plan
            FROM sys.dm_exec_requests a (nolock)
               INNER JOIN sys.sysprocesses b (nolock) on a.session_id=b.spid
               CROSS APPLY sys.dm_exec_sql_text(a.sql_handle) as qr
               CROSS APPLY sys.dm_exec_query_plan(a.plan_handle)as qt
            ORDER BY a.total_elapsed_time DESC'
        
exec(@cmd)

GO

 3.3 查看执行时间最长的SQL

/*
---------------------------------------------------------------------------------
uspm_perf_topduration 
功能:取N个执行时间最长的SQL
参数:@topCount   --N的具体数量,默认取前50条
示例:
      --1.取前50条
      exec uspm_perf_topduration 

      --2.取前10条执行时间最长的SQL
      exec uspm_perf_topduration @topCount=10
---------------------------------------------------------------------------------
*/
CREATE PROCEDURE  [dbo].[uspm_perf_topduration] 
(@topCount int=50)
as 
set nocount on
declare @cmd varchar(600)
select @cmd='
            select  
                highest_cpu_queries.plan_handle,  
                highest_cpu_queries.total_worker_time, 
                q.dbid, 
                q.objectid, 
                q.number, 
                q.encrypted, 
                q.[text] 
            from  
                (select top '+ cast(@topCount as varchar)+'  
                    qs.plan_handle,  
                    qs.total_worker_time 
                from  
                    sys.dm_exec_query_stats qs  (nolock)
                order by qs.total_worker_time desc) as highest_cpu_queries 
                cross apply sys.dm_exec_sql_text(plan_handle) as q 
            order by highest_cpu_queries.total_worker_time desc'
exec(@cmd)
GO

 3.4 当前SQL执行概览

/*  
---------------------------------------------------------------------------------  
uspm_perf_cpudetail  
功能:查看CPU的任务数量,使用率,CPU瓶颈  
参数:无参  
---------------------------------------------------------------------------------  
*/  
CREATE PROCEDURE  [dbo].[uspm_perf_cpudetail]  
as   
set nocount on  
--1.Cpu_Task  
SELECT '查看cpu任务'  
SELECT scheduler_id, current_tasks_count, runnable_tasks_count  
FROM sys.dm_os_schedulers (nolock)  
WHERE scheduler_id < 255  
  
---2.CUP_USING  
SELECT '查看cpu使用情况'  
declare @ts_now bigint  
 --select @ts_now = cpu_ticks / convert(float, cpu_ticks_in_ms) from sys.dm_os_sys_info (nolock)  
 select @ts_now = cpu_ticks/(cpu_ticks/ms_ticks) from sys.dm_os_sys_info (nolock)  
 
 select top 50 record_id,  
  dateadd(ms, -1 * (@ts_now - [timestamp]), GetDate()) as EventTime,   
  SQLProcessUtilization as [CPU使用率,不能始终处于高位],  
  SystemIdle,  
  100 - SystemIdle - SQLProcessUtilization as OtherProcessUtilization  
 from (  
  select   
   record.value('(./Record/@id)[1]', 'int') as record_id,  
   record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') as SystemIdle,  
   record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') as SQLProcessUtilization,  
   timestamp  
  from (  
   select timestamp, convert(xml, record) as record   
   from sys.dm_os_ring_buffers  (nolock)  
   where ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'  
   and record like '%<SystemHealth>%') as x  
  ) as y   
 order by record_id desc  
  
  
--3.CPU_NECK  
SELECT '查看CPU瓶颈'  
select cast([signal_wait_time_ms] as decimal(30,2))/[wait_time_ms] as [百分比] ,*   
from sys.dm_os_wait_stats (nolock)   
where [wait_time_ms]<>0    
and cast([signal_wait_time_ms] as decimal(30,2))>([wait_time_ms]*0.25 )  
order by 1 desc  
  
SELECT '查看百分比是否>10%,如果大于10%,考虑降低并行度'  
select cast([signal_wait_time_ms] as decimal(30,2))/[wait_time_ms] as [百分比],*   
from sys.dm_os_wait_stats (nolock)  
where [wait_time_ms]<>0 AND  wait_type='CXPACKET'
GO

Recommend

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK