
Getting Started with Apache Spark Basic

Reading Time: 4 minutes

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing. Together with Spark Core, these are the core components of Spark.

All the functionality provided by Apache Spark is built on top of Spark Core. Its most important feature is that it overcomes the main limitation of MapReduce by using in-memory computation.

RDD in Apache Spark

The main abstraction Spark provides is the resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It is the fundamental data structure of Apache Spark. An RDD is an immutable collection of objects that is computed on the different nodes of the cluster.

The second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. The two types of shared variables Spark supports are broadcast variables and accumulators.
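As a rough sketch of the two kinds of shared variables, assuming a SparkContext named sc is already available:

val broadcastVar = sc.broadcast(Array(1, 2, 3))   // read-only value cached on each executor
broadcastVar.value                                // Array(1, 2, 3)

val accum = sc.longAccumulator("My Accumulator")  // counter that tasks add to and the driver reads
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
accum.value                                       // 10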

There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections in Apache Spark

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program, so that its elements can be operated on in parallel.
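For example, a minimal sketch, again assuming a SparkContext named sc:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)   // distribute the local collection as an RDD
distData.reduce((a, b) => a + b)      // 15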

External Datasets

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, and HBase.
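For instance, text files can be read with SparkContext’s textFile method; the path below is only a placeholder:

val distFile = sc.textFile("data.txt")                // one record per line
distFile.map(line => line.length).reduce(_ + _)       // total number of characters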

RDDs support two major types of operations:

1) Transformations, which create a new dataset from an existing one.

For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. All transformations in Spark are lazy, in that they do not compute their results right away. A few other transformations are filter, flatMap, distinct, union, intersection, groupByKey, reduceByKey, and join.

2) Actions, which return a value to the driver program after running a computation on the dataset. For example, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program. A few other actions are collect, count, first, take, saveAsTextFile, and foreach.
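A small sketch putting a few of these together, assuming a SparkContext named sc (the file path is a placeholder):

val lines = sc.textFile("data.txt")
val wordCounts = lines
  .flatMap(line => line.split(" "))   // transformation: split each line into words
  .filter(word => word.nonEmpty)      // transformation: drop empty strings
  .map(word => (word, 1))             // transformation: pair each word with 1
  .reduceByKey(_ + _)                 // transformation: sum the counts per word
wordCounts.collect()                  // action: nothing is computed until this point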

Apache Spark SQL, DataFrames and Datasets

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Spark SQL uses this extra information to perform extra optimizations.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.
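As an illustrative sketch, assuming a SparkSession named spark and a JSON file of people records (the path is a placeholder):

val df = spark.read.json("people.json")   // schema is inferred from the JSON records
df.printSchema()
df.select("name").show()
df.filter(df("age") > 21).show()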

Datasets

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize the objects for processing or transmission over the network.

While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations such as filtering, sorting and hashing without deserializing the bytes back into an object.

Spark SQL supports two different methods for converting existing RDDs into Datasets.

  1. Inferring the Schema Using Reflection
  2. Programmatically Specifying the Schema
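A brief sketch of both approaches, assuming a SparkSession named spark and a hypothetical Person case class:

case class Person(name: String, age: Long)

import spark.implicits._
// 1) Reflection: the schema is inferred from the Person case class
val peopleDS = spark.sparkContext
  .parallelize(Seq(Person("Andy", 32), Person("Justin", 19)))
  .toDS()

// 2) Programmatic: build the schema explicitly with StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schema = StructType(Seq(StructField("name", StringType), StructField("age", LongType)))
val rowRDD = spark.sparkContext.parallelize(Seq(Row("Andy", 32L), Row("Justin", 19L)))
val peopleDF = spark.createDataFrame(rowRDD, schema)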

How to Submit a Job

 1) Create an sbt project with the following dependency:
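A minimal build.sbt along these lines (the Scala and Spark versions here are assumptions; match them to your own setup):

name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.15"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1"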

2) Create a file SimpleApp.scala:
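A minimal SimpleApp.scala sketch in the spirit of the official quick start; the file path is a placeholder you should replace with a text file on your machine:

import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md"  // placeholder: any text file will do
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}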

3) Package the application, which will create a jar in the target/ folder:

sbt package

4) Submit the application:

YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.12/simple-project_2.12-1.0.jar
