Spark: RDD vs DataFrames

Reading Time: 3 minutes

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations.
One use of Spark SQL is to execute SQL queries. When running SQL from within another programming language, the results are returned as a Dataset/DataFrame.
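As a minimal sketch of this in Scala (the local master, the sample rows, and the people view name are illustrative assumptions, not from the original post):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("RddVsDataFrames")
      .master("local[*]")          // assumption: run locally for the example
      .getOrCreate()

    import spark.implicits._

    // Register a small in-memory DataFrame as a temporary view for SQL
    Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")
      .createOrReplaceTempView("people")

    // The result of spark.sql comes back as a DataFrame
    val adults = spark.sql("SELECT name FROM people WHERE age > 30")
    adults.show()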
Before exploring these APIs, let's understand the need for them.

RDDs:

An RDD (Resilient Distributed Dataset) is a distributed collection of elements. All work in Spark is expressed as either creating new RDDs, transforming existing RDDs, or calling actions on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
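A minimal sketch of those three kinds of work, reusing the spark session from the sketch above:

    val numbers = spark.sparkContext.parallelize(1 to 10) // create an RDD
    val squares = numbers.map(n => n * n)                 // transformation (lazy)
    val total   = squares.reduce(_ + _)                   // action (triggers execution)
    println(total)                                        // 385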

Problems with RDDs:

  • They express the how of a solution rather than the what, i.e., the RDD API is a bit opaque.

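The original example here was a screenshot; since the exact snippet is not recoverable, a classic word count stands in for it, showing how RDD code spells out the mechanics of reduceByKey:

    val lines = spark.sparkContext.parallelize(Seq("spark rdd", "spark sql"))
    val counts = lines
      .flatMap(_.split(" "))   // how: split each line into words
      .map(word => (word, 1))  // how: pair every word with a count of 1
      .reduceByKey(_ + _)      // how: sum the counts per word
    counts.collect().foreach(println)
    // (spark,2), (rdd,1), (sql,1)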

Looking at this code, we see how the reduceByKey transformation is performed, but not what the computation is meant to produce.

  • They cannot be optimized by Spark.
  • It’s too easy to build an inefficient RDD transformation chain.

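Again as an assumed reconstruction of the screenshot (reusing the numbers RDD from above), a chain with two separate filters:

    val filtered = numbers
      .filter(_ % 2 == 0) // kept as its own transformation
      .filter(_ > 4)      // not merged with the previous filter

    // The single-filter version that the programmer must write by hand:
    val filteredByHand = numbers.filter(n => n % 2 == 0 && n > 4)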

Here, we can see that the two filter operations could have been applied in a single transformation using the AND operator, but Spark does not take care of that optimization.


Inspecting the job in the Spark UI confirms that Spark did not optimize the transformation chain. So we conclude that the RDD API does not take care of query optimization. This is handled by the DataFrame API.

DataFrames:

A Spark DataFrame is a distributed collection of data organized into named columns. It provides operations to filter, group, or compute aggregates, and it can be used with Spark SQL. DataFrames can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.

Characteristics of DataFrames:

  • The DataFrame API provides a higher-level abstraction, allowing you to use a query language to manipulate data.
  • It makes SQL functionality available.
  • It focuses on the what rather than the how of a solution.

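As an assumed reconstruction of the screenshot, the same two-filter query written with the DataFrame API (reusing spark and spark.implicits._ from earlier):

    val df = spark.range(1, 11).toDF("n")
    val result = df
      .filter($"n" % 2 === 0) // what: keep even values
      .filter($"n" > 4)       // what: keep values greater than 4
    result.show()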

Here, query optimization is handled by Spark itself.

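The screenshot showed the output of explain; it can be reproduced with the call below, with the four plan sections summarized in comments:

    result.explain(true) // extended mode prints all four plans:
    // == Parsed Logical Plan ==     the query exactly as written (two filters)
    // == Analyzed Logical Plan ==   column names and types resolved
    // == Optimized Logical Plan ==  the two filters merged into a single AND
    // == Physical Plan ==           the operations Spark will actually execute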

As we can see, there are three logical plans (parsed, analyzed, and optimized) and one physical plan.

The parsed plan goes through a series of rules that resolve references, producing the analyzed logical plan. Spark then applies a set of optimization rules to produce the optimized logical plan, and you can plug in your own rules at this stage.

This optimized logical plan is then converted into a physical plan for execution. All of these plans live inside the DataFrame API.

In the optimized logical plan, Spark does the optimization itself. It sees that there is no need for two filters: the same work can be done with a single filter using the AND operator, so it executes only one filter.

The physical plan is the actual chain of RDD operations that Spark will execute.

Conclusion:

RDDs served well, with characteristics like

  • Immutability
  • Lazy evaluation, etc

But they lacked query optimization and focused more on the how rather than the what of a solution. We have seen how DataFrames overcome these shortcomings of RDDs.
