
Apache Spark 3.0 Sneak Peek


Apache Spark has remained strong over the years and is now coming back with one of its major releases, continuing its goal of Unified Analytics: blending the Batch and Streaming worlds into one. Let's look at some of its features.

  1. Improved Optimizer and Catalog
  2. Delta Lake (Acid Transactions) + Linux Foundation
  3. Koalas: Bringing spark scale to Pandas
  4. Python Upgrade
  5. Deep Learning
  6. Kubernetes
  7. Scala Version upgrade
  8. Graph API — Graph and Cypher Script.
  9. GPU Support and along with Project Hydrogen
  10. Java Upgrade
  11. Yarn Upgrade
  12. Binary Files

Improved Optimizer and Catalog:

i) Pluggable Data Catalog: (DataSourceV2)

  • Pluggable catalog integration
  • Improved pushdown
  • Unified APIs for streaming and batch
eg: df.writeTo("catalog.db.table").overwrite($"year" === "2019")
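
Beyond the overwrite shown above, the new DataFrameWriterV2 API exposes a few more catalog-aware operations through the same entry point. A minimal sketch (the table name is a hypothetical placeholder and df is any DataFrame):

// Spark 3.0 DataFrameWriterV2: catalog-aware writes
df.writeTo("catalog.db.events").create()               // create a new table
df.writeTo("catalog.db.events").append()               // append rows
df.writeTo("catalog.db.events").overwritePartitions()  // dynamically replace partitions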

ii) Adaptive Query Execution

  • Make better optimization decisions during query execution

eg: If one of the tables is small, it inspects the table's size at runtime and automatically changes a Sort-Merge Join into a Broadcast Join, and so on.

  • Dynamic Partition Pruning speeds up expensive joins

Based on the filter applied to the dimension table (the small table), the fact table (the large table) is also filtered, making the join easier and more optimal.
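
Both optimizations are driven by session configuration. A minimal sketch of enabling them explicitly (flag names as of the Spark 3.0 preview):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-demo")
  .config("spark.sql.adaptive.enabled", "true")                          // Adaptive Query Execution
  .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true") // Dynamic Partition Pruning
  .getOrCreate()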

Delta Lake:

Delta Lake has been open-sourced for quite some time and has gained popularity, given its ease of implementation and upgrade within any existing Spark application. I believe this is the next generation of Data Lake, which helps overcome the Data Swamp as well as the limitations of the Lambda and Kappa architectures. Now, with the Linux Foundation backing the project, it will step up a notch.

Here are some of the features which help us move one step closer towards Unified Analytics.

  • ACID transactions
  • Schema enforcement
  • Scalable metadata handling
  • Time Travel
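
As a quick illustration, here is a minimal sketch of writing a Delta table and reading an older version of it back with Time Travel; it assumes the delta-core package is on the classpath, df is any DataFrame, and the path is a hypothetical placeholder:

// Write a DataFrame out as a Delta table
df.write.format("delta").mode("overwrite").save("/data/events")

// Time Travel: read the table as it was at version 0
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/data/events")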

Note: More details related to Delta Lake will follow once I resume my upcoming daily posts (follow me or the hashtag #jayReddy in the meantime).

Koalas: Bringing Spark scale to pandas:

Koalas was released recently, and it is a big add-on for Python developers, both Data Engineers and Data Scientists, given the similarity between its DataFrames and pandas. They can now scale up from a single-node environment to a distributed environment without having to learn Spark DataFrames separately.

  • Integrated into the Python data science ecosystem, e.g. NumPy, Matplotlib
  • 60% of the DataFrame / Series API
  • 60% of the DataFrameGroupBy API
  • 15% of the Index / MultiIndex API
  • 80% of the plot functions
  • 90% of Multi-Index Columns


Python Upgrade:

Python 2 support is expected to be dropped completely in favor of Python 3.

Deep Learning:

  • Request GPUs in RDD operations, i.e. you can specify how many GPUs to use per task in an RDD operation, e.g. for DL training and inference (see the sketch after this list).
  • YARN + Docker support to launch Spark applications with GPU resources, so you can easily define the DL environment in your Dockerfile.
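
A minimal sketch of the task-level API: each task looks up the GPU addresses assigned to it through TaskContext. It assumes the job was submitted with GPU resources, e.g. spark.executor.resource.gpu.amount=1 and spark.task.resource.gpu.amount=1:

import org.apache.spark.TaskContext

val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
val tagged = rdd.mapPartitions { rows =>
  // Addresses of the GPUs assigned to this task, e.g. Array("0")
  val gpuAddrs = TaskContext.get().resources()("gpu").addresses
  // A real job would pin its DL framework to gpuAddrs.head here
  rows.map(row => (row, gpuAddrs.head))
}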

Kubernetes:

Hosting clusters via Kubernetes, whether on-premises or in the cloud, is the next big thing. The ease of deployment, management, and spin-up time will far exceed that of other container orchestrators such as Mesos and Docker Swarm.

  • Spark-submit with mutating webhook confs to modify pods at runtime
  • Auto-discovery of GPU resources
  • GPU isolation at the executor pod level
  • spark-submit with pod template
  • Specify the number of GPUs to use for a task (RDD stage, Pandas UDF)

Kubernetes orchestrates containers and supports several container runtimes, including Docker. Spark (version 2.3+) ships with a Dockerfile that can be used for this purpose and customized to specific application needs.

Scala Version upgrade:

  • Scala 2.12

Graph API — Graph and Cypher Script:

The Spark Graph API has a new add-on.

  • Support for Property Graphs and the Cypher query language.

The module covers Cypher query execution, query result handling, and Property Graph storing/loading. The idea behind having a separate module for the API is to allow multiple implementations of a Cypher query engine.

  • Graph queries will have their own Catalyst-style optimizer, following a similar principle to Spark SQL.

GPU Support along with Project Hydrogen:

NVIDIA has the best GPUs and has by far surpassed the other vendors, and Spark 3.0 works best with them. The NVIDIA RTX 2080 is something to watch out for. A configuration sketch follows the feature list below.

  • Listing GPU Resources
  • Auto discover GPU
  • GPU Allocation to a Job and Fall-back
  • GPU For Pandas UDF
  • GPU Utilisation and Monitoring
  • Support for heterogeneous GPUs (AMD, Intel, NVIDIA)
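
As a rough sketch, a job can declare its GPU needs and point Spark at a discovery script via configuration; the script path here is a hypothetical placeholder (Spark's examples include a getGpusResources.sh for NVIDIA GPUs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gpu-job")
  .config("spark.executor.resource.gpu.amount", "2")  // GPUs per executor
  .config("spark.task.resource.gpu.amount", "1")      // GPUs per task
  .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
  .getOrCreate()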

Java Upgrade:

With every new JDK version released by the Java community, we can see the language moving one step closer towards functional programming.

The Java 8 release was the beginning of this, starting with lambda expressions. Spark 3.0 upgrades its Java support accordingly, adding JDK 11.

Here’s an example of a variable declaration:

Prior to Java 10 version:

String text = "Hello Java 9";

From Java 10 and higher versions:

var text = "Hello Java 10 or Java 11";

Yarn Upgrade:

  • GPU scheduling support
  • Auto-discovery of GPU
  • GPU isolation at a process level

Here's the configuration setup to support GPUs from YARN 3.x onwards.

GPU scheduling

In resource-types.xml

<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>

In yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.resource-plugins</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>

Binary Files:

Another file format has been added to support unstructured data. You can use it to load images, videos, and so on. The limitation is that it cannot perform write operations.

val df = spark.read.format("binaryFile").load("/path/to/files")
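
The source can also filter which files it picks up; the resulting DataFrame carries path, modificationTime, length, and content columns. A small sketch with a hypothetical directory:

val images = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.png") // only pick up PNG files
  .load("/data/images")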

Now that you've had a glimpse of the next major Spark release, you can check out the Spark 3.0 preview version.


Note: Delta Lake and Koalas can either become part of Spark 3.0 or remain separate entities under Databricks.

