
Analysis of an IPC Send Timeout Causing an Abnormal Database Crash



Copyright notice: this is an original article by Buddy Yuan; please credit the source when reposting. Original address: http://www.dboracle.com/archivers/%e6%95%b0%e6%8d%ae%e5%ba%93%e5%bc%82%e5%b8%b8%e5%ae%95%e6%9c%ba%e9%97%ae%e9%a2%98%e5%88%86%e6%9e%90.html
I. Symptoms

At 03:28 on the morning of November 9, the Ticket 1 and Ticket 2 nodes failed one after the other, and both database instances went down. Service did not begin to recover until the CRS stack on node 1 was restarted at around 04:10.

II. Root Cause Analysis

The database alert log of node 1 shows the following:

2020-11-09T03:18:41.948752+08:00
Thread 1 advanced to log sequence 91179 (LGWR switch)
  Current log# 8 seq# 91179 mem# 0: +REDODG/INST/ONLINELOG/group_8.259.1018275603
2020-11-09T03:18:42.453573+08:00
TT02: Standby redo logfile selected for thread 1 sequence 91179 for destination LOG_ARCHIVE_DEST_2
2020-11-09T03:18:47.708120+08:00
Archived Log entry 184766 added for T-1.S-91178 ID 0x9615ba1a LAD:1
2020-11-09T03:29:27.714630+08:00
IPC Send timeout detected. Sender: ospid 103568 [oracle@db-billing-inst01]
Receiver: inst 2 binc 64238 ospid 51127
2020-11-09T03:29:27.715828+08:00
IPC Send timeout to 2.2 inc 4 for msg type 65521 from opid 4524
2020-11-09T03:29:27.717655+08:00
IPC Send timeout: Terminating pid 4524 osid 103568
2020-11-09T03:30:00.506572+08:00
IPC Send timeout detected. Sender: ospid 424001 [oracle@db-billing-inst01]
Receiver: inst 2 binc 64244 ospid 51131
2020-11-09T03:30:00.507758+08:00
IPC Send timeout to 2.4 inc 4 for msg type 65521 from opid 1486
2020-11-09T03:30:00.509723+08:00
IPC Send timeout: Terminating pid 1486 osid 424001
2020-11-09T03:30:14.572035+08:00
IPC Send timeout detected. Sender: ospid 39594 [oracle@db-billing-inst01 (LMS4)]
Receiver: inst 2 binc 64243 ospid 51134
2020-11-09T03:30:14.588642+08:00
IPC Send timeout to 2.5 inc 4 for msg type 65521 from opid 27
2020-11-09T03:30:14.603966+08:00
Communications reconfiguration: instance_number 2 by ospid 39594
2020-11-09T03:30:17.624091+08:00
IPC Send timeout detected. Sender: ospid 39587 [oracle@db-billing-inst01 (LMS1)]

At 03:29:27, node 1 detected an IPC Send timeout to instance 2 and began terminating the sending processes (opid 4524, then opid 1486); at 03:30:14 the timeout hit LMS4, which triggered a communications reconfiguration for instance 2.
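To build a timeline like the one above, it helps to extract every IPC Send timeout line from the alert log together with the timestamp that precedes it. The following Python sketch assumes the alert-log layout shown above (an ISO timestamp line followed by message lines); the file path passed on the command line is only an example.

import re
import sys

TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")

def ipc_timeout_events(path):
    """Yield (timestamp, message) for every IPC Send timeout line in an alert log."""
    last_ts = None
    with open(path, errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if TS_RE.match(line):
                last_ts = line                      # remember the most recent timestamp line
            elif "IPC Send timeout" in line:
                yield last_ts, line

if __name__ == "__main__":
    for ts, msg in ipc_timeout_events(sys.argv[1]):
        print(ts, msg, sep="  ")

Run against node 1's alert log (for example, python3 ipc_timeouts.py alert_inst1.log, a hypothetical path), this prints each timeout event paired with the most recent timestamp line.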

Node 2's alert log:

2020-11-09T02:52:13.746641+08:00
Thread 2 advanced to log sequence 33765 (LGWR switch)
  Current log# 17 seq# 33765 mem# 0: +REDODG/INST/ONLINELOG/group_17.268.1018275633
2020-11-09T02:52:13.816393+08:00
Archived Log entry 184752 added for T-2.S-33764 ID 0x9615ba1a LAD:1
2020-11-09T02:52:15.443649+08:00
TT02: Standby redo logfile selected for thread 2 sequence 33765 for destination LOG_ARCHIVE_DEST_2
2020-11-09T03:08:16.894010+08:00
Thread 2 advanced to log sequence 33766 (LGWR switch)
  Current log# 18 seq# 33766 mem# 0: +REDODG/INST/ONLINELOG/group_18.269.1018275637
2020-11-09T03:08:17.203016+08:00
Archived Log entry 184760 added for T-2.S-33765 ID 0x9615ba1a LAD:1
2020-11-09T03:08:18.444763+08:00
TT02: Standby redo logfile selected for thread 2 sequence 33766 for destination LOG_ARCHIVE_DEST_2
2020-11-09T03:31:03.210073+08:00
Detected an inconsistent instance membership by instance 1
2020-11-09T03:31:03.368467+08:00
Received an instance abort message from instance 1
2020-11-09T03:31:03.368566+08:00
Received an instance abort message from instance 1
2020-11-09T03:31:03.368580+08:00
Please check instance 1 alert and LMON trace files for detail.
2020-11-09T03:31:03.368605+08:00
Received an instance abort message from instance 1
2020-11-09T03:31:03.368621+08:00
Received an instance abort message from instance 1
2020-11-09T03:31:03.368655+08:00
Received an instance abort message from instance 1
2020-11-09T03:31:03.368669+08:00
Please check instance 1 alert and LMON trace files for detail.
2020-11-09T03:31:03.368695+08:00
Please check instance 1 alert and LMON trace files for detail.
2020-11-09T03:31:03.368711+08:00
Please check instance 1 alert and LMON trace files for detail.
2020-11-09T03:31:03.368738+08:00
Please check instance 1 alert and LMON trace files for detail.
2020-11-09T03:31:03.369146+08:00
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
Errors in file /u01/app/oracle/diag/rdbms/inst/inst2/trace/inst2_lmon_51121.trc  (incident=1151529):
ORA-29740: evicted by instance number 1, group incarnation 6
Incident details in: /u01/app/oracle/diag/rdbms/inst/inst2/incident/incdir_1151529/inst2_lmon_51121_i1151529.trc
2020-11-09T03:31:03.403869+08:00
USER (ospid: 51129): terminating the instance due to error 481
2020-11-09T03:31:03.462705+08:00
System state dump requested by (instance=2, osid=51129 (LMS2)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/inst/inst2/trace/inst2_diag_51101_20201109033103.trc
2020-11-09T03:31:04.202068+08:00
License high water mark = 57
2020-11-09T03:31:04.242953+08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/inst/inst2/trace/inst2_lmon_51121.trc  (incident=1151530):
ORA-29740 [] [] [] [] [] [] [] [] [] [] [] []
Incident details in: /u01/app/oracle/diag/rdbms/inst/inst2/incident/incdir_1151530/inst2_lmon_51121_i1151530.trc
2020-11-09T03:31:12.313098+08:00
Dumping diagnostic data in directory=[cdmp_20201109033103], requested by (instance=2, osid=51129 (LMS2)), summary=[abnormal instance termination].
2020-11-09T03:31:12.843558+08:00
Instance terminated by USER, pid = 51129
2020-11-09T03:31:12.850723+08:00
Warning: 2 processes are still attach to shmid 491535:
 (size: 348160 bytes, creator pid: 50663, last attach/detach pid: 51103)
2020-11-09T03:31:13.203683+08:00
USER (ospid: 440980): terminating the instance
2020-11-09T03:31:13.205435+08:00
Instance terminated by USER, pid = 440980
2020-11-09T03:31:16.256927+08:00
ksxp_exafusion_enabled_dcf: ipclw_enabled=0 
Starting ORACLE instance (normal) (OS id: 448151)
2020-11-09T03:31:16.265983+08:00
CLI notifier numLatches:131 maxDescs:19835

Node 2's alert log shows that its trouble started at 03:31: it received instance-abort messages from instance 1, LMON reported ORA-29740 (evicted by instance number 1), and the instance was terminated due to error 481.
Next, look at the clusterware logs.
INST01

2020-09-19 00:05:26.729 [ORAAGENT(39731)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 39731
2020-11-09 03:28:36.757 [ORAROOTAGENT(19160)]CRS-5818: Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:5:3} in /u01/app/grid_base/diag/crs/db-billing-inst01/crs/trace/ohasd_orarootagent_root.trc.
2020-11-09 03:32:42.928 [ORAAGENT(39731)]CRS-5011: Check of resource "inst" failed: details at "(:CLSN00007:)" in "/u01/app/grid_base/diag/crs/db-billing-inst01/crs/trace/crsd_oraagent_oracle.trc"
2020-11-09 03:35:06.251 [CRSD(24131)]CRS-1008: Node db-billing-inst02 is not responding to OCR requests. Details in /u01/app/grid_base/diag/crs/db-billing-inst01/crs/trace/crsd.trc.
2020-11-09 03:35:52.612 [ORAROOTAGENT(25198)]CRS-5822: Agent '/u01/app/grid_home/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:37199} in /u01/app/grid_base/diag/crs/db-billing-inst01/crs/trace/crsd_orarootagent_root.trc.
2020-11-09 03:35:52.612 [ORAAGENT(25318)]CRS-5822: Agent '/u01/app/grid_home/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:5:4} in /u01/app/grid_base/diag/crs/db-billing-inst01/crs/trace/crsd_oraagent_grid.trc.
2020-11-09 03:36:07.821 [CRSD(246667)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 246667
2020-11-09 03:58:10.375 [OHASD(18260)]CRS-2878: Failed to restart resource 'ora.crsd'
2020-11-09 03:58:10.387 [GPNPD(19562)]CRS-2329: GPNPD on node db-billing-inst01 shut down. 
2020-11-09 03:58:12.087 [MDNSD(19306)]CRS-5602: mDNS service stopping by request.

INST02

2020-11-09 03:31:04.022 [ORAAGENT(50626)]CRS-5011: Check of resource "inst" failed: details at "(:CLSN00007:)" in "/u01/app/grid_base/diag/crs/db-billing-inst02/crs/trace/crsd_oraagent_oracle.trc"
2020-11-09 03:31:04.133 [ORAAGENT(440875)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 440875
2020-11-09 03:31:13.347 [ORAAGENT(448125)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 448125
2020-11-09 03:32:42.620 [ORAROOTAGENT(18386)]CRS-5818: Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:5:3} in /u01/app/grid_base/diag/crs/db-billing-inst02/crs/trace/ohasd_orarootagent_root.trc.
2020-11-09 03:35:52.597 [OCSSD(18821)]CRS-1663: Member kill issued by PID 22564 for 1 members, group ocr_db-bill-cluster. Details at (:CSSGM00044:) in /u01/app/grid_base/diag/crs/db-billing-inst02/crs/trace/ocssd.trc.
2020-11-09 03:35:52.598 [CRSD(22564)]CRS-1015: A request to terminate the Cluster Ready Service on node db-billing-inst01 completed successfully. Details at (:PROCR00000:) in /u01/app/grid_base/diag/crs/db-billing-inst02/crs/trace/crsd.trc.
2020-11-09 03:59:50.173 [OLOGGERD(93436)]CRS-8500: Oracle Clusterware OLOGGERD process is starting with operating system process ID 93436
2020-11-09 04:05:13.766 [CRSD(22564)]CRS-1024: The Cluster Ready Service on this node terminated because the ASM instance was not active on this node. Details at (:PROCR00009:) in /u01/app/grid_base/diag/crs/db-billing-inst02/crs/trace/crsd.trc.
2020-11-09 04:05:13.785 [ORAROOTAGENT(23629)]CRS-5822: Agent '/u01/app/grid_home/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:8550} in /u01/app/grid_base/diag/crs/db-billing-inst02/crs/trace/crsd_orarootagent_root.trc.
2020-11-09 04:05:13.785 [ORAAGENT(23492)]CRS-5822: Agent '/u01/app/grid_home/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:9} in /u01/app/grid_base/diag/crs/db-billing-inst02/crs/trace/crsd_oraagent_grid.trc.

The clusterware logs show that node 1's stack was already abnormal at 03:28:36, when the 'check' command for the ora.crsd resource was aborted (CRS-5818).

2020-11-09 03:28:37.758 :    AGFW:2642040576: {0:5:3} Command: check for resource: ora.crsd 1 1 has been aborted, updating resource state/label
2020-11-09 03:28:37.758 :    AGFW:2642040576: {0:5:3} Cleared pPEMsg in 0x7f5d78096ae0
2020-11-09 03:28:37.758 :    AGFW:2642040576: {0:5:3} Unlocking cmdQ after check abort
2020-11-09 03:28:37.758 :    AGFW:2633635584: {0:5:3} Processing aborted reply : check for resource: ora.crsd 1 1
2020-11-09 03:28:37.758 :    AGFW:2633635584: {0:5:3} ora.crsd 1 1 state changed from: ONLINE to: UNKNOWN
2020-11-09 03:28:37.758 :    AGFW:2633635584: {0:5:3} RECYCLE_AGENT attribute not found
2020-11-09 03:28:37.758 :    AGFW:2633635584: {0:5:3} ora.crsd 1 1 would be continued to monitored!
2020-11-09 03:28:37.758 :    AGFW:2633635584: {0:5:3} Resource ora.crsd 1 1 has transitioned to INTERMEDIATE
2020-11-09 03:28:37.758 :    AGFW:2633635584: {0:5:3} Started timer for [ora.crsd 1 1]  delay=720000
2020-11-09 03:28:37.758 :    AGFW:2633635584: {0:5:3} ora.crsd 1 1 state details has changed from:  to: CHECK TIMED OUT
2020-11-09 03:28:37.758 :    AGFW:2633635584: {0:2:9} Generating new Tint for unplanned state change. Original Tint: {0:5:3}
2020-11-09 03:28:37.758 :    AGFW:2633635584: {0:2:9} Resid = ora.crsd 1 1

Following the pointer to ohasd_orarootagent_root.trc, we can see that at around 03:28:37 the ora.crsd resource transitioned from ONLINE to UNKNOWN, with the state detail CHECK TIMED OUT. CRSD is one of the most important clusterware processes, so why did its state suddenly become abnormal?
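One quick way to confirm when and how a resource's state changed is to scan the agent trace for the AGFW "state changed from: ... to:" lines shown in the excerpt above. A minimal Python sketch, assuming that line format and a trace path given on the command line:

import re
import sys

# Matches AGFW lines such as:
#   "... ora.crsd 1 1 state changed from: ONLINE to: UNKNOWN"
STATE_RE = re.compile(r"(\S+ \d+ \d+) state changed from: (\S+) to: (\S+)")

def state_changes(trace_path):
    """Yield (resource, old_state, new_state) for every transition in the trace."""
    with open(trace_path, errors="replace") as f:
        for line in f:
            m = STATE_RE.search(line)
            if m:
                yield m.group(1), m.group(2), m.group(3)

if __name__ == "__main__":
    for res, old, new in state_changes(sys.argv[1]):
        if new != "ONLINE":                 # highlight the suspicious transitions
            print(f"{res}: {old} -> {new}")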

[Figure: crsd.trc excerpt showing repeated gipchaLowerInternalSend: send msg entries]

The crsd process trace shows that it was continuously performing gipchaLowerInternalSend: send msg, repeatedly transmitting the same two messages, send msg 0x7f2d4c039708 and send msg 0x7f2da802a008.
A search of MOS turns up very little about this function. The closest match is the note Node Evicted and Not Re-Joining Although OS Ping/Traceroute Works (Doc ID 2029731.1).
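A simple way to quantify the stuck sends is to count how often each message buffer address reappears in the gipchaLowerInternalSend lines of crsd.trc; in the problem window the same two addresses show up over and over. A minimal Python sketch, assuming the trace file is passed on the command line:

import re
import sys
from collections import Counter

# Count each buffer address in "gipchaLowerInternalSend: send msg 0x..." lines.
SEND_RE = re.compile(r"gipchaLowerInternalSend.*send msg (0x[0-9a-fA-F]+)")

def count_sends(trace_path):
    counts = Counter()
    with open(trace_path, errors="replace") as f:
        for line in f:
            m = SEND_RE.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    for addr, n in count_sends(sys.argv[1]).most_common(10):
        print(f"{addr}: seen {n} times")

Addresses such as 0x7f2d4c039708 that keep repeating correspond to sends that never complete, as discussed below.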

[Figure: gipcha send call chain from MOS Doc ID 2029731.1]

The call chain above shows that a send has to pass through many steps. When there is no problem, the trace always shows the final step of that chain, gipchaWorkerWork.

[Figure: crsd.trc excerpt around gipchaWorkerWork]

This indicates that the messages were never sent successfully. The possible causes are the following:

1. A problem with the send API at the OS level.

[Figure: OSWatcher output from the time of the incident]

The OSWatcher data from that time shows that system resources were sufficient, so a send failure caused by resource exhaustion can be ruled out. The exact behaviour of send still needs to be investigated by the OS engineers.

2. A network problem. Observation shows that IP fragmentation on the network was quite severe (a sketch for checking the kernel's fragmentation counters follows this list).

[Figure: network statistics showing IP fragmentation and reassembly]

A large number of fragments and reassemblies were being produced, which means the relevant kernel parameters need tuning.

[Figure: IP fragmentation kernel parameters to adjust]

3. An unknown bug in the Oracle software.
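The fragmentation called out in cause 2 can be measured directly from the kernel's own counters in /proc/net/snmp (the Reasm* and Frag* fields of the Ip line), which is the kind of evidence behind the statistics figure above. A minimal Python sketch:

def ip_fragmentation_stats(path="/proc/net/snmp"):
    """Return the Ip-layer fragmentation/reassembly counters as a dict."""
    with open(path) as f:
        ip_lines = [line.split() for line in f if line.startswith("Ip:")]
    header, values = ip_lines[0][1:], ip_lines[1][1:]   # header row, then the counter row
    stats = dict(zip(header, map(int, values)))
    return {k: v for k, v in stats.items() if "Frag" in k or "Reasm" in k}

if __name__ == "__main__":
    for name, value in sorted(ip_fragmentation_stats().items()):
        print(f"{name:>14}: {value}")

A steadily growing ReasmReqds together with non-zero ReasmFails on the cluster nodes is consistent with the heavy fragmentation and reassembly described above.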

III. Follow-up Recommendations

Based on the analysis above, the following steps are recommended.
1. For the OS send API issue, the host/OS team should investigate, following the MOS note referenced earlier.

2. Tune the network parameters to reduce IP fragmentation. The following operating system parameters should be reviewed (a sketch for inspecting and adjusting them follows this list).

/proc/sys/net/ipv4/ipfrag_low_thresh (default = 196608)
/proc/sys/net/ipv4/ipfrag_high_thresh (default = 262144)
/proc/sys/net/ipv4/ipfrag_time (default = 30)

3. Apply the latest PSU for Oracle 12c GI.
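The article quotes only the default values of the three ipfrag parameters; increasing the reassembly thresholds is a common adjustment when fragmentation is heavy, but the exact targets should be agreed with the OS team or taken from the relevant MOS note. The sketch below, assuming root access, reads the current settings and shows how a new value could be written, which is equivalent to echoing into the /proc files or using sysctl.

# The three kernel parameters listed above, with the defaults quoted in the text.
PARAMS = {
    "/proc/sys/net/ipv4/ipfrag_low_thresh":  196608,
    "/proc/sys/net/ipv4/ipfrag_high_thresh": 262144,
    "/proc/sys/net/ipv4/ipfrag_time":        30,
}

def show_current():
    """Print each parameter's current value next to the default quoted above."""
    for path, default in PARAMS.items():
        with open(path) as f:
            current = f.read().strip()
        print(f"{path}: current={current} (default={default})")

def set_param(path, value):
    """Write a new value as root; equivalent to 'echo <value> > <path>'."""
    with open(path, "w") as f:
        f.write(str(value))

if __name__ == "__main__":
    show_current()
    # Example only (hypothetical value, run as root):
    #   set_param("/proc/sys/net/ipv4/ipfrag_high_thresh", 4194304)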

