6

Diagnosing an ASP.NET Core hard crash

 2 years ago
source link: https://codeblog.jonskeet.uk/2022/02/20/diagnosing-an-asp-net-core-hard-crash/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Diagnosing an ASP.NET Core hard crash

As part of my church A/V system (At Your Service), I run a separate local web server to interact with the Zoom SDK. Initially this was because the Zoom SDK would only run in 32-bit processes and I needed a 64-bit process to handle the memory requirements for the rest of the app. However, it’s also proven useful in terms of keeping the Zoom meeting for a church service alive if At Your Service crashes. Obviously I try hard to avoid that happening, but when interoperating with a lot of native code (LibVLC, NDI, the Stream Deck, PowerPoint via COM) there are quite a few avenues for crashes. The web server runs ASP.NET Core within a WPF application to make it easy to interact with logs while it’s running, and to give the Zoom SDK a normal event dispatcher.

Yesterday when trying to change my error handling code significantly, I found that the web server was crashing hard, with no obvious trace of what’s going on. I’ve already spent a little time trying to figure out what’s going on, but I couldn’t get to the bottom of it. I know the immediate cause of the crash, and I’ve fixed that fairly easily – but I want to harden the web server against any further bugs I might introduce. I figured it would be useful to blog about that process as I went along.

What I know so far

The immediate crash was due to an exception being thrown in an async void method.

Relevant threading aspects:

  • I start the ASP.NET Core app in a separate thread (although that’s probably unnecessary anyway, now that I think about it) calling IHost.Start
  • I have handlers for Dispatcher.UnhandledException and TaskScheduler.UnobservedTaskException
  • I execute all Zoom-specific code on the WPF Dispatcher thread

The immediate error came from code like the following. You can ignore the way this effectively reproduces Web API to some extent… it’s the method body that’s important.

public abstract class CommandBase<TRequest, TResponse> : CommandBase
{
public override async Task HandleRequest(HttpContext context)
{
var reader = new StreamReader(context.Request.Body);
var text = await reader.ReadToEndAsync();
var request = JsonUtilities.Parse<TRequest>(text);
var dispatcher = Application.Current.Dispatcher;
try
{
var response = await dispatcher.Invoke(() => ExecuteAsync(request));
var responseJson = JsonUtilities.ToJson(response);
await context.Response.WriteAsync(responseJson);
}
catch (ZoomSdkException ex)
{
SetExceptionResponse(new ZoomExceptionResponse { /* More info here */ });
}
async void SetExceptionResponse(ZoomExceptionResponse response)
{
var responseJson = JsonUtilities.ToJson(response);
await context.Response.WriteAsync(responseJson);
context.Response.StatusCode = 500;
}
}
public abstract Task<TResponse> ExecuteAsync(TRequest request);
}

There are at least three problems here:

  • I’m trying to set HttpResponse.StatusCode after writing the body
  • The SetExceptionResponse method is async void (generally a bad idea)
  • I’m not awaiting the call to SetExceptionResponse (which I can’t, due to it returning void)

(It’s also a bit pointless having a local method there. This code could do with being rewritten when I don’t have Covid brain fog, but hey…)

The first of these causes an InvalidOperationException to be thrown. The second and third, between them, cause the app to crash. The debugger has been no help here in working out what’s going on.

Create with a console app to start ASP.NET Core

It feels like this should be really easy to demonstrate in a simple console app that does nothing but start a web server which fails in this particular way.

At this stage I should say how much I love the new top-level statements in C# 10. They make simple complete examples an absolute joy. So let’s create a console app, change the Sdk attribute in the project file to Microsoft.NET.Sdk.Web, and see what we can do. I’m aware that with ASP.NET Core 6 there are probably even simpler ways of starting the server, but this will do for now:

using System.Net;
var host = Host.CreateDefaultBuilder()
.ConfigureWebHostDefaults(builder => builder
.ConfigureKestrel((context, options) => options.Listen(IPAddress.Loopback, 8080))
.Configure(application => application.Run(HandleRequest)))
.Build();
host.Start();
host.WaitForShutdown();
async Task HandleRequest(HttpContext context)
{
await context.Response.WriteAsync("Testing");
}

Trying to run that initially brought up prompts about IIS Express and trusting SSL certificates – all very normal for a regular web app, but not what I want here. After editing launchSettings.json to a simpler set of settings:

{
"profiles": {
"AspNetCoreCrash": {
"commandName": "Project"
}
}
}

… I can now start the debugger, then open up localhost:8080 and get the testing page. Great.

Reproduce the exception

Next step: make sure I can throw the InvalidOperationException in the same way as the original code. This is easy, just replacing the body of the HandlRequest method:

async Task HandleRequest(HttpContext context)
{
await context.Response.WriteAsync("Testing");
context.Response.StatusCode = 500;
}

Sure enough the console logs show that it’s failed as expected:

System.InvalidOperationException: StatusCode cannot be set because the response has already started.
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ThrowResponseAlreadyStartedException(String value)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.set_StatusCode(Int32 value)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.Microsoft.AspNetCore.Http.Features.IHttpResponseFeature.set_StatusCode(Int32 value)
at Microsoft.AspNetCore.Http.DefaultHttpResponse.set_StatusCode(Int32 value)
at Program.<<Main>$>g__HandleRequest|0_1(HttpContext context) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\Program.cs:line 19
at Microsoft.WebTools.BrowserLink.Net.BrowserLinkMiddleware.ExecuteWithFilterAsync(IHttpSocketAdapter injectScriptSocket, String requestId, HttpContext httpContext)
at Microsoft.AspNetCore.Watch.BrowserRefresh.BrowserRefreshMiddleware.InvokeAsync(HttpContext context)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ProcessRequests[TContext](IHttpApplication`1 application)

… but (again, as expected) the server is still running. It’s interesting that BrowserLink occurs in the stack trace – I suspect that wouldn’t be the case in my

Let’s try making the failure occur in the same way as in At Your Service:

async Task HandleRequest(HttpContext context)
{
// In AYS we await executing code in the dispatcher;
// Task.Yield should take us off the synchronous path.
await Task.Yield();
WriteError();
async void WriteError()
{
await context.Response.WriteAsync("Testing");
context.Response.StatusCode = 500;
}
}

This time we get a longer stack trace, and the process quits, just like in AYS:

System.InvalidOperationException: StatusCode cannot be set because the response has already started.
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.ThrowResponseAlreadyStartedException(String value)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.set_StatusCode(Int32 value)
at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.Microsoft.AspNetCore.Http.Features.IHttpResponseFeature.set_StatusCode(Int32 value)
at Microsoft.AspNetCore.Http.DefaultHttpResponse.set_StatusCode(Int32 value)
at Program.c__DisplayClass0_0.<<<Main>$>g__WriteError|4>d.MoveNext() in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\Program.cs:line 19
--- End of stack trace from previous location ---
at System.Threading.Tasks.Task.c.b__128_1(Object state)
at System.Threading.QueueUserWorkItemCallback.c.b__6_0(QueueUserWorkItemCallback quwi)
at System.Threading.ExecutionContext.RunForThreadPoolUnsafe[TState](ExecutionContext executionContext, Action`1 callback, TState& state)
at System.Threading.QueueUserWorkItemCallback.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()

This happens in both the debugger and when running from the command line.

Setting a break point in the WriteError method shows a stack trace like this:

at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
at Program.c__DisplayClass0_0.<<Main>$>g__WriteError|4()
at Program.<<Main>$>g__HandleRequest|0_1(HttpContext context) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\Program.cs:line 14
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext()
at System.Runtime.CompilerServices.YieldAwaitable.YieldAwaiter.c.b__6_0(Action innerContinuation, Task continuationIdTask)
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.ContinuationWrapper.Invoke()
at System.Runtime.CompilerServices.YieldAwaitable.YieldAwaiter.RunAction(Object state)
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()

There’s nothing about ASP.NET Core in there at all… so I wonder if we can take that out of the equation too?

Reproducing the crash in a pure console app

To recap, I’m expecting at this stage that to reproduce the crash I should:

  • Write an async void method that throws an exception
  • Call that method from a regular async method

Let’s try:

#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
await NormalAsyncMethod();
Console.WriteLine("Done");
async Task NormalAsyncMethod()
{
await Task.Yield();
Console.WriteLine("Start ofNormalAsyncMethod");
BrokenAsyncMethod();
Console.WriteLine("End of NormalAsyncMethod");
}
async void BrokenAsyncMethod()
{
await Task.Yield();
throw new Exception("Bang");
}

Hmm. That exits normally:

$ dotnet run
Start ofNormalAsyncMethod
End of NormalAsyncMethod
Done

But maybe there’s a race condition between the main thread finishing and the problem crashing the process? Let’s add a simple sleep:

#pragma warning disable CS1998 // Async method lacks 'await' operators and will run synchronously
await NormalAsyncMethod();
Thread.Sleep(1000);
Console.WriteLine("Done");
// Remainder of code as before

Yup, this time it crashes hard:

Start ofNormalAsyncMethod
End of NormalAsyncMethod
Unhandled exception. System.Exception: Bang
at Program.<<Main>$>g__BrokenAsyncMethod|0_1() in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\ConsoleCrash\Program.cs:line 15
at System.Threading.Tasks.Task.<>c.<ThrowAsync>b__128_1(Object state)
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()

Interlude: what about async Task?

At this point I’m remembering some of what I’ve learned about how async void methods handle exceptions. What happens if we turn it into an async Task method instead? At that point, the Task returned by the method (which we ignore) will have the exception, and as by default unobserved task exceptions no longer crash the process, maybe we’ll be okay. So just changing BrokenAsyncMethod to:

async Task BrokenAsyncMethod()
{
throw new Exception("Bang");
}

(and ignoring the warning at the call site)… the program no longer crashes. (I could subscribe to TaskScheduler.UnobservedTaskException but I’m not that bothered… I’m pretty convinced it would fire, at least eventually.)

Do all ThreadPool exceptions crash the app?

We don’t need to use async methods to execute code on the thread pool. What happens if we just write a method which throws an exception, and call that from the thread pool?

ThreadPool.QueueUserWorkItem(ThrowException);
Thread.Sleep(1000);
Console.WriteLine("Done");
void ThrowException(object? state)
{
throw new Exception("Bang!");
}

Yup, that crashes:

Unhandled exception. System.Exception: Bang!
at Program.<<Main>$>g__ThrowException|0_0(Object state) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\ConsoleCrash\Program.cs:line 7
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()

At this point some readers (if there still are any…) may be surprised that this is a surprise to me. It’s been a long time since I’ve interacted with the thread pool directly, and taking down the process like this feels a little harsh to me. (There are pros and cons, certainly. I’m not trying to argue that Microsoft made the wrong decision here.)

Can we change the ThreadPool behaviour?

Given that we have things like TaskScheduler.UnobservedTaskException, I’d expect there to be something similar for the thread pool… but I can’t see anything. It looks like this is behaviour that changed with .NET 2.0 – back in 1.x, thread pool exceptions didn’t tear down the application.

After a bit more research, I found AppDomain.UnhandledException. This allows us to react to an exception that’s about to take down the application, but it doesn’t let us mark it as “handled”.

Here’s an example:

AppDomain.CurrentDomain.UnhandledException += (sender, args) =>
Console.WriteLine($"Unhandled exception: {((Exception)args.ExceptionObject).Message}");
ThreadPool.QueueUserWorkItem(ThrowException);
Thread.Sleep(1000);
Console.WriteLine("Done");
void ThrowException(object? state) =>
throw new Exception("Bang!");

Running this code a few times, I always get output like this:

Unhandled exception: Bang!
Unhandled exception. System.Exception: Bang!
at Program.<<Main>$>g__ThrowException|0_1(Object state) in C:\users\skeet\GitHub\jskeet\DemoCode\AspNetCoreCrash\ConsoleCrash\Program.cs:line 8
at System.Threading.QueueUserWorkItemCallbackDefaultContext.Execute()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
at System.Threading.Thread.StartCallback()

… but sometimes “Done” is printed too. So I guess there’s some uncertainty about how quickly the AppDomain is torn down.

Hardening At Your Service

Given what I know now, I don’t think I can easily stop the web server for Zoom interactions from terminating if I have a bug – but I can make it easier to find that bug afterwards. I already normally write the log out to a text file when the app exits, but that only happens for an orderly shutdown.

Fortunately, it looks like the AppDomain.UnhandledException is given enough time to write the log out before the process terminates. Temporarily reverting to the broken code allows me to test that – and yes, I get a log with an appropriate critical error.

Conclusion

I knew that async void methods were generally bad, but I hadn’t quite appreciated how dangerous they are, particularly when executed from a thread pool thread.

While I’m not thrilled with the final result, I at least understand it now, and can find similar errors more easily in the future. The “not understanding” part was the main motivation for this blog post – given that I’d already found the immediate bug, I could have just fixed it and ignored the worrying lack of diagnostic information… but I always find it tremendously unsettling when I can’t explain significant behaviour. It’s not always worth investigating immediately but it’s generally useful to come back to it later on and keep diving deeper until you’ve got to the bottom of it.

I haven’t put the source code for this blog post in GitHub as there are so many iterations – and because it’s all in the post itself. Shout if you would find it valuable, and I’m happy to add it after all.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK