
How We Accidentally DoS-ed Ourselves with Kotlin Coroutines

source link: https://medium.com/gooddata-developers/how-we-accidentally-dos-ed-ourselves-with-kotlin-coroutines-22cc4be60370

This story illustrates how things can go awry even when you follow best practices.

At GoodData, we have been using Kotlin to build our microservices for almost 4 years now. For most of our developers, learning Kotlin was quite straightforward, as they had been using Java before and the two languages are conceptually very similar. Also, some frameworks and libraries (most importantly Spring) are shared between our Java and Kotlin worlds, so at least in simple cases it was really just the syntax that changed.

Naturally, however, we wanted to take advantage of the advanced features that Kotlin introduces, which brings me to the topic of coroutines and reactive REST controllers. In the plain old Java world, we got used to the comforting fact that any HTTP request is handled by a single thread. And if you wanted to change that, you needed to do quite a lot of work yourself, such as using CompletableFuture to explicitly delegate work to other threads.

In Kotlin though, making things run in parallel is painfully simple. So simple in fact that sometimes, one might even forget that it is happening.

The Exporter Example

Let the code do the talking:

@RestController
class ExportController(...) {
    @GetMapping("/dashboards/{dashboardId}/export")
    suspend fun export(
        @PathVariable dashboardId: String,
        @RequestHeader(
            name = HttpHeaders.ACCEPT,
            defaultValue = "text/csv"
        ) acceptHeader: String,
    ): ResponseEntity<ExportResponse> {
        val result = when (acceptHeader.lowercase()) {
            "application/pdf" -> pdfExportService.doExport(dashboardId)
            "text/csv" -> csvExportService.doExport(dashboardId)
            // 406 Not Acceptable for unsupported media types
            else -> throw ResponseStatusException(HttpStatus.NOT_ACCEPTABLE)
        }
        return ResponseEntity(result, HttpStatus.OK)
    }
}

What we have here is an extremely simplified version of GoodData’s export controller. As you can see, the “export” HTTP endpoint simply checks the Accept header and based on its value, it either calls services responsible for PDF/CSV exports or throws an error. No tricky stuff so far.

Let’s check the implementation of the PDF export service:

class PdfExportService {
    suspend fun doExport(dashboardId: String): ExportResponse {
        // call PdfServer via HTTP
        val client = HttpClient.newBuilder().build()
        val request = HttpRequest.newBuilder()
            .uri(buildUri(dashboardId))
            .build()

        return withContext(Dispatchers.IO) {
            val responseStream = client.send(
                request,
                HttpResponse.BodyHandlers.ofInputStream()
            ).body()
            jacksonObjectMapper()
                .readValue(responseStream, ExportResponse::class.java)
        }
    }

    // illustrative PdfServer URL, simplified for this article
    private fun buildUri(dashboardId: String): URI =
        URI.create("http://pdf-server/export/$dashboardId")
}

Apparently, PDF exports are complicated enough that the class just delegates the business logic to a dedicated service called PdfServer by issuing another HTTP call to it. So the controller is essentially just forwarding the request to another server.

Good Intention Gone Bad

Things get interesting when we focus on the switch to Dispatchers.IO. It seems well justified, as per its documentation: “Dispatchers.IO is designed for offloading blocking IO tasks to a shared pool of threads”. Clearly we are doing an IO task (making an HTTP request), so we should do that in Dispatchers.IO, right?

Well, maintainers of the PdfServer microservice would disagree. Let’s see what happens when a lot of PDF export requests arrive at the controller:

The request is first handled by the “export” method. Since this is a suspending method, Dispatchers.Default is used to execute the controller’s method call. As per documentation, the thread pool backing that dispatcher has the same number of threads as there are CPU cores (at least 2). Let’s for now assume that there are really just 2 CPU cores, so at most 2 export requests can be handled by the controller at the same time.

But then, handling of the export request is switched to Dispatchers.IO, which as per documentation: “… defaults to the limit of 64 threads or the number of cores (whichever is larger)”.

(Figure: only two requests being processed in parallel by Dispatchers.Default, but up to 64 of them in Dispatchers.IO.)

So what is the problem? Well, Java developers, used to a single thread pool handling all HTTP request processing, would scale the PdfServer so that it can handle 2 PDF exports at a time, matching the original controller. Alas, it is perfectly possible that there will be up to 64 concurrent requests instead, as the threads from Dispatchers.Default simply delegate their work to Dispatchers.IO and start handling new requests before the actual exporting via PdfServer is done.
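To make the mismatch in numbers concrete, here is a small self-contained sketch (the helper function and its parameters are mine, not code from our services) that counts how many blocking tasks a dispatcher really runs at once:

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.atomic.AtomicInteger

// Launches `tasks` blocking jobs on `dispatcher` and reports the highest
// number of them that ever ran at the same moment.
suspend fun peakParallelism(dispatcher: CoroutineDispatcher, tasks: Int): Int {
    val active = AtomicInteger(0)
    val peak = AtomicInteger(0)
    coroutineScope {
        repeat(tasks) {
            launch(dispatcher) {
                val now = active.incrementAndGet()
                peak.updateAndGet { seen -> maxOf(seen, now) }
                Thread.sleep(100) // simulate a blocking HTTP call to PdfServer
                active.decrementAndGet()
            }
        }
    }
    return peak.get()
}

fun main() = runBlocking {
    // On a 2-core machine, Dispatchers.Default would peak at 2, while
    // Dispatchers.IO typically climbs all the way to its 64-thread ceiling.
    println("peak on Dispatchers.IO: ${peakParallelism(Dispatchers.IO, 200)}")
}
```

Running this on a small machine makes the gap obvious: the peak on Dispatchers.IO is bounded by its 64-thread default, not by the CPU count.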

Because of this, our PdfServer doing the actual heavy work of creating PDF exports was very easily DoS-ed, ran out of memory, and crashed. It was scaled with the original thread-per-request model in mind, not expecting that the actual load on the service might be much higher than just 2 requests at a time.

Thankfully in our case, our performance tests detected the issue very early, so production was not affected and we had plenty of time to discuss solutions.

The Ways Out

Once we identified the problem, there were multiple solutions under discussion:

  • Don’t switch to Dispatchers.IO at all.
  • Use the “kotlinx.coroutines.io.parallelism” system property to actually restrict the number of threads in Dispatchers.IO.
  • Use a custom dispatcher backed by a custom thread pool instead of Dispatchers.IO. This way, other usages of Dispatchers.IO (like interaction with a local filesystem) would remain unaffected.
  • Scale PdfServer up (a lot) so that it can actually handle 64 requests at the same time.

Expected production load at that point did not justify massive (and therefore expensive) scaling up of the PdfServer. At the same time, we wanted to keep the good practice of doing IO tasks in Dispatchers.IO, thus keeping the coroutine dispatcher switch in place. Interfering with the global property “kotlinx.coroutines.io.parallelism” seemed wrong as we might do more harm than good.
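As an aside not on the original list: if your project is on kotlinx.coroutines 1.6 or newer, limitedParallelism offers a middle ground between the global system property and a hand-rolled thread pool. A minimal sketch (the dispatcher name, the limit of 2, and the exportPdf stub are illustrative):

```kotlin
import kotlinx.coroutines.*

// A view of Dispatchers.IO that runs at most 2 tasks concurrently.
// Other users of Dispatchers.IO (e.g. filesystem access) keep the full
// 64-thread budget, and there is no extra thread pool to shut down.
@OptIn(ExperimentalCoroutinesApi::class)
val pdfDispatcher: CoroutineDispatcher = Dispatchers.IO.limitedParallelism(2)

suspend fun exportPdf(dashboardId: String): String =
    withContext(pdfDispatcher) {
        // the blocking call to PdfServer would go here
        "exported $dashboardId"
    }
```

Because the view shares threads with Dispatchers.IO rather than owning them, it avoids both the global-property concern and the lifecycle pitfalls described in the next section.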

This left us with just one option to try: create a custom coroutine dispatcher backed by a reasonably sized thread pool. The solution you would find in our production code today is different and much simpler, but before describing it, let me spend a chapter on the custom dispatcher.

The Custom Dispatcher

We eventually discarded the solution described in this section, but let me use the opportunity to point out a mistake that many people could make. In a different scenario, using a custom dispatcher would be perfectly reasonable, and in that case it is good to know what to watch out for.

When the first merge request arrived for a code review, it looked like this:

@RestController
class ExportController(...) {
    private val dispatcher = Executors.newFixedThreadPool(
        5,
        BasicThreadFactory.Builder().daemon(true).build()
    ).asCoroutineDispatcher()

    @GetMapping("/dashboards/{dashboardId}/export")
    suspend fun export(
        // ... same parameters as before ...
    ): ResponseEntity<ExportResponse> {
        val result = when (acceptHeader.lowercase()) {
            "application/pdf" -> pdfExportService.doExport(dispatcher, dashboardId)
            "text/csv" -> csvExportService.doExport(dispatcher, dashboardId)
            // 406 Not Acceptable for unsupported media types
            else -> throw ResponseStatusException(HttpStatus.NOT_ACCEPTABLE)
        }
        return ResponseEntity(result, HttpStatus.OK)
    }
}

class PdfExportService {
    suspend fun doExport(dispatcher: CoroutineDispatcher, dashboardId: String): ExportResponse {
        // call PdfServer via HTTP
        // ... same HTTP client setup as before ...
        return withContext(dispatcher) {
            // ... same request handling as before ...
        }
    }
}

A new dispatcher was added, backed by a fixed pool of 5 threads (the hardcoded magic number is here just for brevity; it is of course not the right approach). In the “doExport” method, this dispatcher is then used instead of Dispatchers.IO to execute the REST call.
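Note that BasicThreadFactory comes from Apache Commons Lang. If you would rather avoid that dependency, a plain-JDK equivalent might look like this (the thread name and pool size are illustrative):

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import java.util.concurrent.Executors
import java.util.concurrent.ThreadFactory

// JDK-only stand-in for BasicThreadFactory.Builder().daemon(true).build():
// every thread created for the pool is marked as a daemon.
val daemonThreadFactory = ThreadFactory { runnable ->
    Thread(runnable, "pdf-export-worker").apply { isDaemon = true }
}

val exportDispatcher = Executors
    .newFixedThreadPool(5, daemonThreadFactory)
    .asCoroutineDispatcher()
```

As the rest of this section explains, the daemon flag itself turns out to be the wrong fix here; the factory is shown only to make the snippet dependency-free.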

You might notice that the threads from the custom thread pool have the daemon flag set to true. Why is that? Well, apparently someone noticed that without it, the Spring Boot application containing the controller cannot be stopped. More precisely, it does not respond to the usual REST call to the “/actuator/shutdown” endpoint. That is because the application runs as long as any of its non-daemon threads are running: we had created a new thread pool that, without further changes, would keep the application alive forever (or at least until the whole JVM process was killed).

The incorrect approach to solving it is what you see in the snippet above. A thread with the daemon flag set does not prevent the application from stopping. Problem solved, one might say. But when the application exits, its daemon threads are killed immediately, without even executing their finally blocks. This is why daemon threads are not recommended for IO tasks, which is exactly what we are doing here.

So what is the correct way to shut down a custom thread pool? Well, we just need to close it:

@PreDestroy
fun destroy() {
    runCatching { dispatcher.close() }
}

Spring ensures that the method annotated with @PreDestroy gets executed on application shutdown. We just need to make sure that all custom resources the Spring bean (the controller in our case) might have created are properly closed. With this fix, the application correctly responds to calls to “/actuator/shutdown” even with a non-daemon thread pool.
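Why does this work? Closing an executor-backed coroutine dispatcher shuts down its backing pool, so its non-daemon threads no longer keep the JVM alive. A minimal sketch (the function name is mine, for illustration only):

```kotlin
import kotlinx.coroutines.*
import java.util.concurrent.Executors

fun closeReleasesThreads(): Boolean {
    val pool = Executors.newFixedThreadPool(2)
    val dispatcher = pool.asCoroutineDispatcher()

    runBlocking {
        withContext(dispatcher) { /* some export work */ }
    }

    dispatcher.close()     // shuts the backing pool down gracefully...
    return pool.isShutdown // ...so its threads no longer pin the JVM
}
```

The runCatching wrapper in the @PreDestroy method above is just defensive: a failure to close one resource should not prevent the rest of the shutdown sequence.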

The Right Way Out

The solution we eventually went for in our case was very different, and pretty straightforward after all: we decided to restrict the parallelism of PdfServer (the target service that does the heavy job of PDF exports) so it can protect itself from being overloaded. After all, any service should do that. A service should only take up as many requests as it can handle without crashing; the other requests should be queued or, in the worst case, rejected.

This approach was made even simpler by the fact that PdfServer is a Spring Web microservice with just one REST endpoint. We therefore set “server.tomcat.threads.max” to limit the number of threads the embedded Tomcat uses for handling requests. Unlike Dispatchers.Default, the default value of this property does not depend on the number of CPU cores; it is simply 200 (which is why the PdfServer originally seemed able to handle all 64 parallel requests from the controller). We also configured “server.tomcat.accept-count”, the maximum number of incoming connections allowed to queue while all handling threads are busy.
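In Spring Boot application.properties terms, the tuning might look like this (the values are illustrative, not our production settings):

```properties
# PdfServer: process at most 4 exports concurrently...
server.tomcat.threads.max=4
# ...and let up to 100 further connections wait in the accept queue
server.tomcat.accept-count=100
```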

Summing Up

Lessons learned? Following best practices (like switching to Dispatchers.IO for IO tasks) is no good on its own without fully understanding why they exist and what consequences they have. Creating performance tests for a new set of services is a must. And finally, Kotlin is not just shiny new syntax for the JVM; it has strong concepts of its own which deserve to be studied properly.

