1

记一次 .NET某施工建模软件 卡死分析

 5 months ago
source link: https://www.cnblogs.com/huangxincheng/p/18082543
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

1. 讲故事

前几天有位朋友在微信上找到我,说他的软件卡死了,分析了下也不知道是咋回事,让我帮忙看一下,很多朋友都知道,我分析dump是免费的,当然也不是所有的dump我都能搞定,也只能尽自己最大能力帮助别人缩小问题范围吧,既然dump有了,接下来就开启分析之路。

二:WinDbg分析

1. 为什么会卡死

不同类型的程序卡死的解决思路是不一样的,朋友也说了是窗体程序,那就重点观察下主线程吧,使用 k 命令即可。


0:000> k 25
 # Child-SP          RetAddr               Call Site
00 00000000`007fc8d8 00007ffd`87439b13     ntdll!NtWaitForAlertByThreadId+0x14
01 00000000`007fc8e0 00007ffd`87439a06     ntdll!RtlpWaitOnAddressWithTimeout+0x43
02 00000000`007fc910 00007ffd`8743987d     ntdll!RtlpWaitOnAddress+0xae
03 00000000`007fc980 00007ffd`87435fdc     ntdll!RtlpWaitOnCriticalSection+0xd9
04 00000000`007fc9f0 00007ffd`87435ef0     ntdll!RtlpEnterCriticalSectionContended+0xdc
05 00000000`007fca20 00007ffd`536839ea     ntdll!RtlEnterCriticalSection+0x40
06 00000000`007fca50 00007ffd`5368470a     AcLayers!NS_VirtualRegistry::CRegLock::CRegLock+0x1a
07 00000000`007fca90 00007ffd`536726d2     AcLayers!NS_VirtualRegistry::APIHook_RegOpenKeyExW+0x2a
08 00000000`007fcb10 00007ffd`778e550b     AcLayers!NS_WRPMitigation::APIHook_RegOpenKeyExW+0x42
09 00000000`007fcb60 00007ffd`778e5437     xxx!GetCodePageForFont+0xa7
0a 00000000`007fcc90 00007ffd`778e5296     xxx!CToolTipsMgr::NewFont+0x113
0b 00000000`007fcda0 00007ffd`778e18f9     xxx!CToolTipsMgr::LoadTheme+0xb2
0c 00000000`007fcdd0 00007ffd`84b9ca66     xxx!CToolTipsMgr::s_ToolTipsWndProc+0x1b9
0d 00000000`007fce10 00007ffd`84b9c34b     user32!UserCallWinProcCheckWow+0x266
0e 00000000`007fcf90 00007ffd`4f36b1cc     user32!CallWindowProcW+0x8b
0f 00000000`007fcfe0 00007ffd`4f39ccac     System_Windows_Forms_ni!System.Windows.Forms.NativeWindow.DefWndProc+0x9c
10 00000000`007fd090 00007ffd`4f39cc05     System_Windows_Forms_ni!System.Windows.Forms.ToolTip.WndProc+0x9c
11 00000000`007fd260 00007ffd`4f36a3a3     System_Windows_Forms_ni!System.Windows.Forms.ToolTip.ToolTipNativeWindow.WndProc+0x15
12 00000000`007fd290 00007ffd`4f9e1161     System_Windows_Forms_ni!System.Windows.Forms.NativeWindow.Callback+0xc3
13 00000000`007fd330 00007ffd`52c8222e     System_Windows_Forms_ni+0x8d1161
14 00000000`007fd3a0 00007ffd`84b9ca66     clr!UMThunkStub+0x6e
15 00000000`007fd430 00007ffd`84b9c78c     user32!UserCallWinProcCheckWow+0x266
16 00000000`007fd5b0 00007ffd`84bb3b32     user32!DispatchClientMessage+0x9c
17 00000000`007fd610 00007ffd`874c22c4     user32!__fnINLPCREATESTRUCT+0xa2
18 00000000`007fd670 00007ffd`836a1f24     ntdll!KiUserCallbackDispatcherContinue
19 00000000`007fd7e8 00007ffd`84ba15df     win32u!NtUserCreateWindowEx+0x14
1a 00000000`007fd7f0 00007ffd`84ba11d4     user32!VerNtUserCreateWindowEx+0x20f
1b 00000000`007fdb80 00007ffd`84ba1012     user32!CreateWindowInternal+0x1b4
1c 00000000`007fdce0 00007ffd`4f3e8098     user32!CreateWindowExW+0x82
1d 00000000`007fdd70 00007ffd`4f3696f0     System_Windows_Forms_ni+0x2d8098
...

从卦象看,很明显主线程卡在 NtWaitForAlertByThreadId 上,这是有问题的,接下来我们仔细解读下线程栈。

  • DispatchClientMessage

这个方法表示当前从 queue 中拿到了别的线程通过 Invoke 送过来的信息,正在处理中。

  • LoadTheme

这个方法表示正在用主线程更新窗体样式

  • APIHook_RegOpenKeyExW

首先说一下 AcLayers.dll,专业名词叫 垫片,详情可以看一下《软件调试》,它主要用来处理一些系统级兼容性的问题,然后可以看到它在查询注册表时有一个lock操作。

在非托管代码中,lock 一般都用 临界区(CriticalSection) 实现,那到底它等待的临界区被谁持有着呢?

2. 谁持有着临界区锁

要想获取锁的持有信息,可以使用 !cs -l 或者 !locks,但这里要提醒一下,在真实的dump分析过程中,有时候不准,所以更好的办法就是从线程栈上提取,那怎么提取呢? 其实就是寻找 ntdll!RtlEnterCriticalSection 方法的第一个参数即可,方法签名如下:


VOID RtlEnterCriticalSection(
  PRTL_CRITICAL_SECTION CriticalSection
);

接下来反汇编下 00007ffd536839ea 处的代码,看看 rcx 寄存器是怎么传下来的。


0:000> ub 00007ffd`536839ea
AcLayers!NS_VirtualRegistry::OPENKEY::AddEnumEntries<NS_VirtualRegistry::VIRTUALVAL>+0x11a:
00007ffd`536839ce cc              int     3
00007ffd`536839cf cc              int     3
AcLayers!NS_VirtualRegistry::CRegLock::CRegLock:
00007ffd`536839d0 48895c2408      mov     qword ptr [rsp+8],rbx
00007ffd`536839d5 57              push    rdi
00007ffd`536839d6 4883ec30        sub     rsp,30h
00007ffd`536839da 488bf9          mov     rdi,rcx
00007ffd`536839dd 488d0d4c7f0300  lea     rcx,[AcLayers!NS_VirtualRegistry::csRegCriticalSection (00007ffd`536bb930)]
00007ffd`536839e4 ff15ae660100    call    qword ptr [AcLayers!_imp_EnterCriticalSection (00007ffd`5369a098)]

从卦象上看,很吉利,这个 rcx 原来是一个全局变量 AcLayers!NS_VirtualRegistry::csRegCriticalSection, 接下来用 !cs 观察下到底被谁持有着。


0:000> !cs AcLayers!NS_VirtualRegistry::csRegCriticalSection
-----------------------------------------
Critical section   = 0x00007ffd536bb930 (AcLayers!NS_VirtualRegistry::csRegCriticalSection+0x0)
DebugInfo          = 0x000000001c4e58e0
LOCKED
LockCount          = 0x2
WaiterWoken        = No
OwningThread       = 0x0000000000001d20
RecursionCount     = 0x1
LockSemaphore      = 0xFFFFFFFF
SpinCount          = 0x00000000020007ce

这又是一副吉卦,可以看到当前持有线程是 1d20,那这个线程正在做什么呢?

3. 1d20 线程为什么持锁不释放

案情往前推进了一步,我们切过去观察下这个线程栈。


0:000> ~~[1d20]s
ntdll!NtDelayExecution+0x14:
00007ffd`874bec14 c3              ret
0:028> kL
 # Child-SP          RetAddr               Call Site
00 00000000`33ccd948 00007ffd`83955381     ntdll!NtDelayExecution+0x14
01 00000000`33ccd950 00007ffd`6d4a2361     KERNELBASE!SleepEx+0xa1
02 00000000`33ccd9f0 00007ffd`8520a75c     perfts!CloseLagPerfData+0x21
03 00000000`33ccda30 00007ffd`85209ccd     advapi32!CloseExtObjectLibrary+0xec
04 00000000`33ccda90 00007ffd`8396dc6a     advapi32!PerfRegCloseKey+0x15d
05 00000000`33ccdae0 00007ffd`839715e6     KERNELBASE!BaseRegCloseKeyInternal+0x72
06 00000000`33ccdb10 00007ffd`83935209     KERNELBASE!ClosePredefinedHandle+0x96
07 00000000`33ccdb40 00007ffd`53685d71     KERNELBASE!RegCloseKey+0x149
08 00000000`33ccdba0 00007ffd`53683ae5     AcLayers!NS_VirtualRegistry::CVirtualRegistry::CloseKey+0xbd
09 00000000`33ccdbf0 00007ffd`51c7737e     AcLayers!NS_VirtualRegistry::APIHook_RegCloseKey+0x25
0a 00000000`33ccdc30 00007ffd`51bf4be2     mscorlib_ni+0x58737e
0b 00000000`33ccdce0 00007ffd`513c356a     mscorlib_ni!Microsoft.Win32.RegistryKey.Dispose+0x72
0c 00000000`33ccdd20 00007ffd`513c34b9     System_ni!System.Diagnostics.PerformanceCounterLib.GetStringTable+0x41a
...
13 00000000`33cce050 00007ffd`513bfe3c     System_ni!System.Diagnostics.PerformanceCounter..ctor+0xd7
14 00000000`33cce0a0 00007ffc`f45cb2ce     System_ni!System.Diagnostics.PerformanceCounter..ctor+0x1c
15 00000000`33cce0d0 00007ffc`f45cb14c     0x00007ffc`f45cb2ce
16 00000000`33cce120 00007ffc`f45cb023     0x00007ffc`f45cb14c
...

从卦中看,这个线程貌似在用 CloseLagPerfData 方法关闭一些东西时一直在Sleep等待,可以反汇编 00007ffd6d4a2361 处代码看看等待多久。


0:028> ub 00007ffd`6d4a2361
perfts!CloseLagPerfData+0x5:
00007ffd`6d4a2345 55              push    rbp
00007ffd`6d4a2346 488bec          mov     rbp,rsp
00007ffd`6d4a2349 4883ec30        sub     rsp,30h
00007ffd`6d4a234d e8720e0000      call    perfts!LagCounterManager::Cleanup (00007ffd`6d4a31c4)
00007ffd`6d4a2352 33db            xor     ebx,ebx
00007ffd`6d4a2354 eb0b            jmp     perfts!CloseLagPerfData+0x21 (00007ffd`6d4a2361)
00007ffd`6d4a2356 b964000000      mov     ecx,64h
00007ffd`6d4a235b ff15c74e0000    call    qword ptr [perfts!_imp_Sleep (00007ffd`6d4a7228)]
...

从卦中的 mov ecx,64h 可以看到是 Sleep(100) 毫秒,更多细节也没空继续追究了,但不管怎么样,它是由上层的计数器类 PerformanceCounter引发的,这里学一下 4S 店的做法,让朋友能不能不要调用 PerformanceCounter 这个类,咱躲开他就可以了,截图如下:

214741-20240319130730088-1906278703.png

去掉之后,朋友反馈问题消失。

说来也奇怪,最近发现了二起由 PerformanceCounter 引发的程序卡死,把经验留在这里,希望后来人少踩坑吧!

图片名称

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK