
Apache Spark 3.0 Sneak Peek


Apache Spark has remained strong over the years and is now coming back with one of its major releases, continuing its goal of Unified Analytics: blending the Batch and Streaming worlds into one. Let's look at some of its features.

  1. Improved Optimizer and Catalog
  2. Delta Lake (Acid Transactions) + Linux Foundation
  3. Koalas: Bringing spark scale to Pandas
  4. Python Upgrade
  5. Deep Learning
  6. Kubernetes
  7. Scala Version upgrade
  8. Graph API — Graph and Cypher Script.
  9. GPU Support and along with Project Hydrogen
  10. Java Upgrade
  11. Yarn Upgrade
  12. Binary Files

Improved Optimizer and Catalog:

i) Pluggable Data Catalog: (DataSourceV2)

  • Pluggable catalog integration
  • Improved pushdown
  • Unified APIs for streaming and batch
eg: df.writeTo("catalog.db.table").overwrite($"year" === "2019")
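
Beyond the overwrite shown above, the new DataFrameWriterV2 API exposes a few more catalog-aware operations through the same entry point. A minimal sketch (the table name is a hypothetical placeholder and df is any DataFrame):

// Spark 3.0 DataFrameWriterV2: catalog-aware writes
df.writeTo("catalog.db.events").create()               // create a new table
df.writeTo("catalog.db.events").append()               // append rows
df.writeTo("catalog.db.events").overwritePartitions()  // dynamically replace partitions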

ii) Adaptive Query Execution

  • Make better optimization decisions during query execution

eg: If one of the tables is small, it inspects the table's size at runtime and automatically changes a Sort-Merge Join into a Broadcast Join, and so on.

  • Dynamic Partition Pruning speeds up expensive joins

Based on the filter applied to the dimension table (the small table), the fact table (the large table) is also filtered, making the join easier and more optimal.
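
Both optimizations are driven by session configuration. A minimal sketch of enabling them explicitly (flag names as of the Spark 3.0 preview):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-demo")
  .config("spark.sql.adaptive.enabled", "true")                          // Adaptive Query Execution
  .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true") // Dynamic Partition Pruning
  .getOrCreate()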

Delta Lake:

Delta Lake has been open-sourced for quite some time and has gained popularity, given its ease of implementation and upgrade within any existing Spark application. I believe this is the next generation of Data Lake, which helps overcome the Data Swamp as well as the limitations of the Lambda and Kappa architectures. Now, with the Linux Foundation backing the project, it will step up a notch.

Here are some of the features which help us move one step closer towards Unified Analytics.

  • ACID transactions
  • Schema enforcement
  • Scalable metadata handling
  • Time Travel
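
As a quick illustration, here is a minimal sketch of writing a Delta table and reading an older version of it back with Time Travel; it assumes the delta-core package is on the classpath, df is any DataFrame, and the path is a hypothetical placeholder:

// Write a DataFrame out as a Delta table
df.write.format("delta").mode("overwrite").save("/data/events")

// Time Travel: read the table as it was at version 0
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/data/events")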

Note: More details related to Delta Lake will follow once I resume my upcoming daily posts (follow me or the hashtag #jayReddy in the meantime).

Koalas: Bringing Spark scale to pandas:

Koalas was released recently, and it is a big add-on for Python developers, both Data Engineers and Data Scientists, given the similarity between its DataFrames and pandas. They can now scale up from a single-node environment to a distributed environment without having to learn Spark DataFrames separately.

  • Integrated into the Python data science ecosystem, e.g. NumPy, Matplotlib
  • 60% of the DataFrame / Series API
  • 60% of the DataFrameGroupBy API
  • 15% of the Index / MultiIndex API
  • 80% of the plot functions
  • 90% of Multi-Index Columns


Python Upgrade:

Python 2 support is expected to be dropped completely in favor of Python 3.

Deep Learning:

  • Request GPUs in RDD operations, i.e. you can specify how many GPUs to use per task in an RDD operation, e.g. for DL training and inference (see the sketch after this list).
  • YARN + Docker support to launch Spark applications with GPU resources, so you can easily define the DL environment in your Dockerfile.
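
A minimal sketch of the task-level API: each task looks up the GPU addresses assigned to it through TaskContext. It assumes the job was submitted with GPU resources, e.g. spark.executor.resource.gpu.amount=1 and spark.task.resource.gpu.amount=1:

import org.apache.spark.TaskContext

val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
val tagged = rdd.mapPartitions { rows =>
  // Addresses of the GPUs assigned to this task, e.g. Array("0")
  val gpuAddrs = TaskContext.get().resources()("gpu").addresses
  // A real job would pin its DL framework to gpuAddrs.head here
  rows.map(row => (row, gpuAddrs.head))
}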

Kubernetes:

Hosting clusters via Kubernetes, whether on-premises or in the cloud, is the next big thing. The ease of deployment, management, and spin-up time will far exceed that of other container orchestrators such as Mesos and Docker Swarm.

  • Spark-submit with mutating webhook confs to modify pods at runtime
  • Auto-discovery of GPU resources
  • GPU isolation at the executor pod level
  • spark-submit with pod template
  • Specify the number of GPUs to use for a task (RDD stage, Pandas UDF)

Kubernetes orchestrates containers and supports several container runtimes, including Docker. Spark (version 2.3+) ships with a Dockerfile that can be used for this purpose and customized to specific application needs.

Scala Version upgrade:

  • Scala 2.12

Graph API — Graph and Cypher Script:

The Spark Graph API has a new add-on.

  • Support for Property Graphs and the Cypher query language.

The module covers Cypher query execution, query result handling, and Property Graph storing/loading. The idea behind having a separate module for the API is to allow multiple implementations of a Cypher query engine.

  • Graph queries will have their own Catalyst-style optimizer, following a similar principle to Spark SQL.

GPU Support along with Project Hydrogen:

NVIDIA has the best GPUs and has by far surpassed the other vendors, and Spark 3.0 works best with them. The NVIDIA RTX 2080 is something to watch out for. A configuration sketch follows the feature list below.

  • Listing GPU Resources
  • Auto discover GPU
  • GPU Allocation to a Job and Fall-back
  • GPU For Pandas UDF
  • GPU Utilisation and Monitoring
  • Support for heterogeneous GPUs (AMD, Intel, NVIDIA)
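
As a rough sketch, a job can declare its GPU needs and point Spark at a discovery script via configuration; the script path here is a hypothetical placeholder (Spark's examples include a getGpusResources.sh for NVIDIA GPUs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gpu-job")
  .config("spark.executor.resource.gpu.amount", "2")  // GPUs per executor
  .config("spark.task.resource.gpu.amount", "1")      // GPUs per task
  .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
  .getOrCreate()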

Java Upgrade:

With every new JDK version released by the Java community, we can see the language moving one step closer towards functional programming.

The Java 8 release was the beginning of this, starting with lambda expressions. Spark 3.0 upgrades its Java support accordingly, adding JDK 11.

Here’s an example of a variable declaration:

Prior to Java 10 version:

String text = "Hello Java 9";

From Java 10 and higher versions:

var text = "Hello Java 10 or Java 11";

Yarn Upgrade:

  • GPU scheduling support
  • Auto-discovery of GPU
  • GPU isolation at a process level

Here's the configuration setup to support GPUs from YARN 3.x onwards.

GPU scheduling

In resource-types.xml

<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>

In yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>

Binary Files:

Another file format has been added to support unstructured data. You can use it to load images, videos, and so on. The limitation is that it cannot perform write operations.

val df = spark.read.format("binaryFile").load("/path/to/files")
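
The source can also filter which files it picks up; the resulting DataFrame carries path, modificationTime, length, and content columns. A small sketch with a hypothetical directory:

val images = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.png") // only pick up PNG files
  .load("/data/images")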

Now that you've had a glimpse of the next major Spark release, you can check out the Spark 3.0 preview version.


Note: Delta Lake and Koalas can either become part of Spark 3.0 or remain separate entities under Databricks.

