
Network performance regressions from TCP SACK vulnerability fixes


On June 17, three vulnerabilities in Linux’s networking stack were published. The most severe one could allow remote attackers to impact the system’s availability. We believe in offering the most secure image available to our customers, so we quickly applied a kernel patch to address the issues.

Since the kernel patch was applied, we have observed certain workloads experiencing unexpected, nondeterministic performance regressions on the Amazon Web Services (AWS) platform, manifesting as lengthy or hung writes to S3. Even though these regressions occur in less than 0.2% of cases, we want to share what we have found so far along with a mitigation strategy.

Symptom

The behavior can manifest in any Databricks Runtime / Apache Spark version. Affected customers will see Spark jobs running on Databricks clusters slow down and potentially “hang” for 15 minutes, or fail entirely due to a timeout, when writing to Amazon S3. Customers may see a stack trace similar to this in Databricks cluster logs:

org.apache.spark.SparkException:
... truncated ...
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception:
Your socket connection to the server was not read from or written
to within the timeout period. Idle connections will be closed.

Customers can also look at the Spark web UI and see one or more tasks taking an abnormally long time compared with most other tasks in the same stage (see the sketch below for one way to surface such stragglers programmatically).

[Screenshot: Spark web UI showing one task in a stage running far longer than its peers]
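
When the web UI is impractical to inspect by hand, the same signal can be pulled from a Spark listener. Below is a minimal sketch in Scala; the class name StragglerLogger and the five-minute threshold are illustrative assumptions, not part of Spark or Databricks tooling:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs any finished task whose duration exceeds a (hypothetical) threshold,
// mirroring the "abnormally long task" pattern visible in the web UI.
class StragglerLogger(thresholdMs: Long = 5 * 60 * 1000L) extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    if (info != null && info.duration > thresholdMs) {
      println(s"Possible straggler: stage ${taskEnd.stageId}, task ${info.taskId}, " +
        s"${info.duration} ms on host ${info.host}")
    }
  }
}

// Register it once per cluster, for example from a notebook cell:
// sc.addSparkListener(new StragglerLogger())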

The security patch was applied to the Databricks platform on June 24th, so any performance change caused by this issue would coincide with that date. Jobs that performed normally on (or after) June 24th and only later experienced a performance regression are unrelated to this issue.

Root Cause

The TCP SACK denial-of-service vulnerability (CVE-2019-11477, “SACK Panic”) was disclosed on June 17, 2019. It enables a remote attacker to trigger a kernel panic on any server that accepts TCP connections.

Our Infrastructure Security Team immediately triaged the issue and decided to ship the patch for this CVE in our regular security release train. We shipped this update in the form of a new Amazon Machine Image (AMI), which forms the base operating system image on which we run the LXC containers hosting the Databricks Runtime.

Shortly after rolling out the patch, we identified nondeterministic abnormalities in a very small subset of our internal benchmarks and in some customer jobs. For example, short 5-minute data processing jobs that write to Amazon S3 were taking up to an hour to finish.

Using the reproduction from our internal benchmarks, we analyzed network traffic between Databricks clusters and Amazon S3. Clusters without the patch exhibited the expected behavior when writing to Amazon S3, consistently completing writes in less than 90 seconds:

[Figure: network traffic for an unpatched cluster; writes to Amazon S3 complete in under 90 seconds]

However, the same code running on clusters with the security patch experienced short periods of data transfer to Amazon S3 followed by long periods of inactivity. Here is an example job that took over 15 minutes to complete:

[Figure: network traffic for a patched cluster; short bursts of transfer to Amazon S3 separated by long idle periods]

Mitigation Strategy

We are still actively investigating the issue to determine the root cause. Fixing this nondeterministic performance regression might require another OS-level patch. In the meantime, security is our default, and we will continue to ship the most secure kernel we can offer. We will share updates as soon as we have them.

Fortunately, Spark and the Databricks platform have been designed from the beginning to mitigate these types of long-tail distributed-system problems. Customers can mitigate this issue by turning on task speculation in Apache Spark, setting “spark.speculation” to “true” in their cluster configuration (see the sketch below). This capability was originally designed to mitigate stragglers caused by machine slowdowns. When speculation is turned on, Spark launches a replica of a long-running slow task, with a high likelihood that the replica will finish quickly without hitting the performance regression.
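
For reference, here is a minimal sketch of what that configuration can look like when building a SparkSession in Scala. The application name is made up, and the multiplier and quantile values are included only for illustration (they match Spark's documented defaults), not as Databricks recommendations:

import org.apache.spark.sql.SparkSession

// Enable speculative execution: Spark launches a replica of a suspected
// straggler task, and whichever copy finishes first wins.
val spark = SparkSession.builder()
  .appName("s3-write-with-speculation")            // illustrative name
  .config("spark.speculation", "true")             // the setting described above
  .config("spark.speculation.multiplier", "1.5")   // a task counts as slow at 1.5x the median duration
  .config("spark.speculation.quantile", "0.75")    // only check after 75% of the stage's tasks finish
  .getOrCreate()

On Databricks, the same key/value pairs can instead be entered in the cluster's Spark configuration settings.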

For customers that do not want to leverage task speculation and can accept a different security threat model, our support team can work with you to provide alternate mitigation strategies. Please contact your Account Manager or [email protected] if you are affected and require assistance identifying a workaround.

