Log File Sync等待超高案例浅析
source link: http://server.51cto.com/sOS-645008.htm
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
本文转载自微信公众号「DBA闲思杂想录」,作者潇湘隐者 。转载本文请联系DBA闲思杂想录公众号。
监控工具DPA(SolarWinds的Database Performance Analyzer)发现海外一台Oracle数据库服务器DB Commit Time指标告警,超过红色告警线(40毫秒左右,黄色告警线是10毫秒,红色告警线是20毫秒),如下截图所示,生成了对应时段的AWR报告,分析时发现Top 5 Timed Events里面,log file sync等待事件的平均等待时间为37毫秒,log file parallel write等待事件的平均等待时间为40毫秒
如果对'Tanel Poder::Understanding LGWR, Log File Sync Waits and Commit Performance'这篇文章所讲述的内容很熟悉的话(经典图如下),那么通过等待事件log file sync与log file parallel write的Avg Wait(ms)指标:37ms & 40ms,基本上就可以判断redo log所在的磁盘I/O出现了性能问题
然后在用lfsdiag.sql脚本分析一下详细的统计数据,如下所示:
HISTOGRAM DATA FOR LFS AND OTHER RELATED WAITS: APPROACH: Look at the wait distribution for log file sync waits by looking at "wait_time_milli". Look at the high wait times then see if you can correlate those with other related wait events. INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT ---------- ---------------------------------------- --------------- ---------- 1 log file sync 1 4363 1 log file sync 2 835 1 log file sync 4 1650 1 log file sync 8 4937 1 log file sync 16 146252 1 log file sync 32 606674 1 log file sync 64 263377 1 log file sync 128 253254 1 log file sync 256 2 1 log file switch completion 1 124 1 log file switch completion 2 9 1 log file switch completion 4 19 1 log file switch completion 8 21 1 log file switch completion 16 35 1 log file switch completion 32 97 1 log file switch completion 64 133 1 log file switch completion 128 326 1 log file switch completion 256 1736 1 log file switch completion 512 3042 1 log file switch completion 1024 2020 1 log file parallel write 1 0 1 log file parallel write 2 0 1 log file parallel write 4 80 1 log file parallel write 8 2142 1 log file parallel write 16 170987 1 log file parallel write 32 779205 1 log file parallel write 64 311463 1 log file parallel write 128 79688 1 log file parallel write 256 42763 1 log file parallel write 512 13052 1 log file parallel write 1024 20468 1 log file parallel write 2048 14020 1 log file parallel write 4096 921 1 log file parallel write 8192 96 1 log file parallel write 16384 18 1 log file parallel write 32768 18 1 log file parallel write 65536 8 1 log file parallel write 131072 2 1 LGWR wait for redo copy 1 8516 1 LGWR wait for redo copy 2 20 1 LGWR wait for redo copy 4 19 1 LGWR wait for redo copy 8 20 1 LGWR wait for redo copy 16 11 ORDERED BY WAIT_TIME_MILLI INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT ---------- ---------------------------------------- --------------- ---------- 1 log file sync 1 4363 1 log file switch completion 1 124 1 log file parallel write 1 0 1 LGWR wait for redo copy 1 8516 1 log file sync 2 835 1 log file switch completion 2 9 1 log file parallel write 2 0 1 LGWR wait for redo copy 2 20 1 log file sync 4 1650 1 log file switch completion 4 19 1 log file parallel write 4 80 1 LGWR wait for redo copy 4 19 1 log file sync 8 4937 1 log file switch completion 8 21 1 log file parallel write 8 2142 1 LGWR wait for redo copy 8 20 1 log file sync 16 146252 1 log file switch completion 16 35 1 log file parallel write 16 170987 1 LGWR wait for redo copy 16 11 1 log file sync 32 606674 1 log file switch completion 32 97 1 log file parallel write 32 779205 1 log file sync 64 263377 1 log file switch completion 64 133 1 log file parallel write 64 311463 1 log file sync 128 253254 1 log file switch completion 128 326 1 log file parallel write 128 79688 1 log file sync 256 2 1 log file switch completion 256 1736 1 log file parallel write 256 42763 1 log file switch completion 512 3042 1 log file parallel write 512 13052 1 log file switch completion 1024 2020 1 log file parallel write 1024 20468 1 log file parallel write 2048 14020 1 log file parallel write 4096 921 1 log file parallel write 8192 96 1 log file parallel write 16384 18 1 log file parallel write 32768 18 1 log file parallel write 65536 8 1 log file parallel write 131072 2 REDO WRITE STATS "redo write time" in centiseconds (100 per second) 11.1: "redo write broadcast ack time" in centiseconds (100 per second) 11.2: "redo write broadcast ack time" in microseconds (1000 per millisecond) VERSION INST_ID NAME VALUE MILLISECONDS ----------------- ---------- ---------------------------------------- --------------------- ------------------- 10.2.0.5.0 1 redo write time 9551524 95515240.000 10.2.0.5.0 1 redo writer latching time 51 10.2.0.5.0 1 redo writes 1434931 AWR WORST AVG LOG FILE SYNC SNAPS:
从上面数据可以看到,log file sync等待事件数量最多的是32ms这个区间的,log file parallel write等待事件发生最多的也是32ms这个区间的,其实这个值已经远远超过了正常值7ms,极其不正常。log file parallel write事件是LGWR进程专属的等待事件,发生在LGWR将log_buffer中的重做日志信息写入联机重做日志文件组的成员文件,LGWR在该事件上等待该写入过程的完成。该事件等待时间过长,说明日志文件所在磁盘缓慢或存在争用。log file sync和log file parallel write是相互关联的。换句话讲,假设log file parallel write的时间非常长,那么必定导致log file sync等待时间拉长。如果log file parallel write 等待非常高,那么可能一般是物理磁盘I/O的问题
另外,我们也检查了一下redo log的切换频率,如下所示,redolog_sitch_time.sql查看发现redo log切换的次数并不频繁,生成的归档日志的量也并不大。大部分时候一小时切换零次或一次。
然后我们找了一台服务器(上述指标正常的服务器)简单对测试了一下IO的速度,这个方法极其简单,就是看看生成一个大文件需要多长时间,简单测试一下I/O性能(没有考虑cache等,测试采样也不详尽),但是对比数据也基本能验证、反馈磁盘IO存在问题)。
问题服务器:
# time dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct 2048+0 records in 2048+0 records out 1073741824 bytes (1.1 GB) copied, 88.271 seconds, 12.2 MB/s real 1m28.273s user 0m0.010s sys 0m0.655s
对比服务器(正常的服务器):
# time dd if=/dev/zero of=./test bs=512k count=2048 oflag=direct 2048+0 records in 2048+0 records out 1073741824 bytes (1.1 GB) copied, 2.48344 seconds, 432 MB/s real 0m2.485s user 0m0.004s sys 0m0.386s
如上对比所示,两台服务器生成同样一个大小文件,耗费的时间,I/O性能差别非常大,完全验证了告警的服务器所在的存储I/O存在性能问题,但是公司分工非常明确,DBA也不清楚底层存储出了什么问题,只能将这个问题反馈出来,等待海外负责维护系统和存储的同事的回复。
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK