6

记一次 .NET 某企业OA后端服务 卡死分析

 1 year ago
source link: https://www.51cto.com/article/720592.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

记一次 .NET 某企业OA后端服务 卡死分析

作者:一线码农聊技术 2022-10-13 18:40:05
在某个时刻输出端窗口因为各种原因被他人独占,导致程序侧无法输出内容到窗口而一直被迫等待,之后朋友将日志输出切换到文件模式,问题得以解决。

​一、背景

1.讲故事

前段时间有位朋友微信找到我,说他生产机器上的 Console 服务看起来像是卡死了,也不生成日志,对方也收不到我的httpclient请求,不知道程序出现什么情况了,特来寻求帮助。

哈哈,一般来说卡死的情况在窗体程序(WinForm,WPF) 上特别多,在 Console,WebApi 中相对较少,既然找到我,那就上 WinDbg 分析吧。

二、WinDbg 分析

1. 程序真的卡死了吗

程序之所以能跑的梭梭响,全靠线程帮忙,如果玩不转可能就是线程上出了点什么问题,接下来使用 !t 展示下线程列表。


0:000> !t
ThreadCount:      124
UnstartedThread:  0
BackgroundThread: 105
PendingThread:    0
DeadThread:       18
Hosted Runtime:   no
                                                                                                            Lock  
 DBG   ID     OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1     65ec 000002097A4BF390    2a020 Preemptive  0000000000000000:0000000000000000 000002097a4bea20 -00001 MTA 
   3    2    15afc 000002097A50BA10    2b220 Preemptive  00000209060BE8F8:00000209060BEBE8 000002097a4bea20 -00001 MTA (Finalizer) 
   4    3    11d20 000002097A524EC0  102a220 Preemptive  0000000000000000:0000000000000000 000002097a4bea20 -00001 MTA (Threadpool Worker) 
   5    9     b8a4 000002097DE10290  3029220 Preemptive  0000000000000000:0000000000000000 000002097a4bea20 -00001 MTA (Threadpool Worker) 
   6   13    1b22c 000002097DE0ADB0  1029220 Preemptive  0000000000000000:0000000000000000 000002097a4bea20 -00001 MTA (Threadpool Worker) 
   7   16    1b0e8 000002097DE1A030  202b220 Preemptive  00000209063CB630:00000209063CC1F8 000002097a4bea20 -00001 MTA 
   8   17    138c8 000002097DE175C0  202b220 Preemptive  00000209063F71C0:00000209063F7BC8 000002097a4bea20 -00001 MTA 
   9   18    1afd0 000002097DE181E0  202b220 Preemptive  00000209064627E0:0000020906463598 000002097a4bea20 -00001 MTA 
  10   19    1ac48 000002097DE13310  202b220 Preemptive  000002090632D6C8:000002090632E0D8 000002097a4bea20 -00001 MTA 
  11   20    18704 000002097DE16390  202b220 Preemptive  00000209063FB5A8:00000209063FBBC8 000002097a4bea20 -00001 MTA 
  12   21    1ade4 000002097DE187F0  202b220 Preemptive  00000209062EA138:00000209062EA708 000002097a4bea20 -00001 MTA 
  13   22    13164 000002097DE13920  202b220 Preemptive  0000020906392108:0000020906392A38 000002097a4bea20 -00001 MTA 
  14   23    1b334 000002097DE169A0  202b220 Preemptive  00000209063CD9E8:00000209063CE1F8 000002097a4bea20 -00001 MTA 
  ...
 106  168    19e18 0000020927066770  a029220 Preemptive  0000000000000000:0000000000000000 000002097a4bea20 -00001 MTA (Threadpool Completion Port) 
 108  136     af74 0000020928590290  8029220 Preemptive  00000209063B60E0:00000209063B6408 000002097a4bea20 -00001 MTA (Threadpool Completion Port) 
 107   84     e7f0 000002097AD6B600  8029220 Preemptive  00000209063E3300:00000209063E3DD8 000002097a4bea20 -00001 MTA (Threadpool Completion Port) 

一般来说卦中的 Lock Count​ 列表示当前线程所持有的托管锁个数,现在显示的 -00001​ 应该是命令不起效果了。。。不过没关系,我们还可以通过 !syncblk​ 来看下 lock 的情况,毕竟 95% 的锁场景都会用到它。


0:000> !syncblk
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
  176 000002097E0DA1D8           17         1 00000209216A3500 132c8  71   00000209002c6a68 System.Object
  191 000002097E0DAC88           27         1 000002097ADC54F0 1ae90  53   00000209000185e8 System.IO.TextWriter+SyncTextWriter
-----------------------------
Total           270
CCW             0
RCW             0
ComClassFactory 0
Free            32

从卦中的 MonitorHeld=27​ 来看,表示这个 SyncTextWriter 对象当前有 13 个线程在等待,有 1 个线程在持有,那这个线程为什么没有退出呢?接下来可以切到 53 号线程上,查看下它的线程栈。


0:053> ~~[1ae90]s
ntdll!NtWriteFile+0x14:
00007ffd`a70df774 c3              ret
0:053> !clrstack 
OS Thread Id: 0x1ae90 (53)
        Child SP               IP Call Site
00000070505BDE88 00007ffda70df774 [InlinedCallFrame: 00000070505bde88] Interop+Kernel32.WriteFile(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
00000070505BDE88 00007ffd32a7cbe2 [InlinedCallFrame: 00000070505bde88] Interop+Kernel32.WriteFile(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
00000070505BDE50 00007ffd32a7cbe2 ILStubClass.IL_STUB_PInvoke(IntPtr, Byte*, Int32, Int32 ByRef, IntPtr)
00000070505BDF20 00007ffd3a2d6971 System.ConsolePal+WindowsConsoleStream.WriteFileNative(IntPtr, Byte[], Int32, Int32, Boolean)
00000070505BDF80 00007ffd3a2d672f System.ConsolePal+WindowsConsoleStream.Write(Byte[], Int32, Int32) [/_/src/libraries/System.Console/src/System/ConsolePal.Windows.cs @ 1131]
00000070505BDFD0 00007ffd377c05d9 System.IO.StreamWriter.Flush(Boolean, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs @ 260]
00000070505BE050 00007ffd3a2d6687 System.IO.StreamWriter.WriteLine(System.String)
00000070505BE0D0 00007ffd3a2d6472 System.IO.TextWriter+SyncTextWriter.WriteLine(System.String) [/_/src/libraries/System.Private.CoreLib/src/System/IO/TextWriter.cs @ 880]
00000070505BE130 00007ffd3a2d640d System.Console.WriteLine(System.String) [/_/src/libraries/System.Console/src/System/Console.cs @ 716]
00000070505BE160 00007ffd3a2375ab HCloud.xxx+d__2.MoveNext()
00000070505BEAE0 00007ffd37e19365 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 63]
00000070505BEB40 00007ffd3a235cc3 HCloud.xxxx.Execute(Quartz.IJobExecutionContext)
00000070505BEBB0 00007ffd3a231264 Quartz.xxx+d__9.MoveNext()
00000070505BEE90 00007ffd3a230183 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Quartz.Core.JobRunShell+d__9, Quartz]](d__9 ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 63]
00000070505BEF10 00007ffd3a2300db System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Quartz.Core.JobRunShell+d__9, Quartz]](d__9 ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilder.cs @ 33]
00000070505BEF40 00007ffd3a230082 Quartz.xxxxl.Run(System.Threading.CancellationToken)
00000070505BF020 00007ffd3a22ffc3 Quartz.Core.QuartzSchedulerThread+c__DisplayClass28_0.b__0()
00000070505BF060 00007ffd3939b71a System.Threading.Tasks.Task`1[[System.__Canon, System.Private.CoreLib]].InnerInvoke() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Future.cs @ 507]
00000070505BF0C0 00007ffd37d54431 System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 300]
00000070505BF110 00007ffd37d53657 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2352]
00000070505BF1C0 00007ffd37d50e04 System.Threading.ThreadPoolWorkQueue.Dispatch() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.cs @ 677]
00000070505BF690 00007ffd925ea7a3 [DebuggerU2MCatchHandlerFrame: 00000070505bf690] 

仔细观察线程栈会很惊讶的发现,居然还能卡在 System.Console.WriteLine​ 方法上,挺奇怪的,为了探究原因,我们使用 k 命令看下非托管栈。


0:053> k 5
 # Child-SP          RetAddr               Call Site
00 00000070`505bddd8 00007ffd`a32febda     ntdll!NtWriteFile+0x14
01 00000070`505bdde0 00007ffd`32a7cbe2     KERNELBASE!WriteFile+0x7a
02 00000070`505bde50 00007ffd`3a2d6971     0x00007ffd`32a7cbe2
03 00000070`505bdf20 00007ffd`3a2d672f     System_Console!System.ConsolePal.WindowsConsoleStream.WriteFileNative+0x61
04 00000070`505bdf80 00007ffd`377c05d9     System_Console!System.ConsolePal.WindowsConsoleStream.Write+0x3f

0:053> ub ntdll!NtWriteFile+0x14
ntdll!NtDeviceIoControlFile+0x15:
00007ffd`a70df755 cd2e            int     2Eh
00007ffd`a70df757 c3              ret
00007ffd`a70df758 0f1f840000000000 nop     dword ptr [rax+rax]
ntdll!NtWriteFile:
00007ffd`a70df760 4c8bd1          mov     r10,rcx
00007ffd`a70df763 b808000000      mov     eax,8
00007ffd`a70df768 f604250803fe7f01 test    byte ptr [SharedUserData+0x308 (00000000`7ffe0308)],1
00007ffd`a70df770 7503            jne     ntdll!NtWriteFile+0x15 (00007ffd`a70df775)
00007ffd`a70df772 0f05            syscall

从上面的 syscall​ 系统调用关键词看,代码是将用户态的 ntdll!NtWriteFile​ 切到入了内核态的 nt!NtWriteFile 方法,那进入了内核态为什么没有返回呢?这又是一个值得思索的问题。

2. 为什么进入了内核态无返回

其实 ntdll!NtWriteFile 这个 win32 api 方法的第一个参数是一个 handle 的文件句柄,签名如下。


__kernel_entry NTSYSCALLAPI NTSTATUS NtWriteFile(
  [in]           HANDLE           FileHandle,
  [in, optional] HANDLE           Event,
  [in, optional] PIO_APC_ROUTINE  ApcRoutine,
  [in, optional] PVOID            ApcContext,
  [out]          PIO_STATUS_BLOCK IoStatusBlock,
  [in]           PVOID            Buffer,
  [in]           ULONG            Length,
  [in, optional] PLARGE_INTEGER   ByteOffset,
  [in, optional] PULONG           Key
);

可能 handle 在内核中被别人占用了,可以用 !handle​ 查看下 rcx 寄存器。


0:053> r
rax=0000000000000008 rbx=00000070505bdf50 rcx=0000000000000418
rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000418
rip=00007ffda70df774 rsp=00000070505bddd8 rbp=00000070505bdf10
 r8=0000000002000805  r9=0000000000000000 r10=0000020906191620
r11=00000070505bc8f8 r12=0000000000000100 r13=0000000000000053
r14=0000000000000077 r15=000002097adc54f0
iopl=0         nv up ei pl zr na po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
ntdll!NtWriteFile+0x14:
00007ffd`a70df774 c3              ret
0:053> !handle 0000000000000418 f
Handle 0000000000000418
  Type          File
  Attributes    0
  GrantedAccess 0x120196:
         ReadControl,Synch
         Write/Add,Append/SubDir/CreatePipe,WriteEA,ReadAttr,WriteAttr
  HandleCount   3
  PointerCount  65483
  No object specific information available

哈哈,其实也看不出什么,也没法进入内核态,所以下一步只能到网上搜搜看,其实有经验的朋友肯定猜出来了,应该是控制台启用了 快捷编辑窗口 ,截图如下:

图片

快捷编辑窗口​ 简而言之就是用户可以在控制台上独占这个窗口,编辑一些内容, 可一旦被用户独占,那程序侧就没法输出内容到 控制台窗口​ 上了,只能在 内核态 傻傻等等,这应该就是形成原因,画个图大概就像下面这样。

图片

将信息告诉朋友后,朋友说他用的是 Windows 服务部署,但不管是什么模式部署,注释掉 Console.WriteLine 肯定没错。

这次卡死的事故,主要还是开发人员大量的使用 Console.WriteLine 来输出日志,在某个时刻输出端窗口因为各种原因被他人独占,导致程序侧无法输出内容到窗口而一直被迫等待,之后朋友将日志输出切换到文件模式,问题得以解决。

其实这个问题很多新手朋友都会犯,特此记录下来。

责任编辑:武晓燕 来源: 一线码农聊技术

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK