
My 10 recommendations after getting the Databricks Certification for Apache Spark

source link: https://www.tuicool.com/articles/amM3mmE
Databricks Certified Developer badge

A few months ago I started to prepare myself for the Databricks Certification for Apache Spark. It was not easy because there is not much information about it, so to promote self-preparation I'm going to share ten useful recommendations.

Recommendation 1: Schedule your exam before you start preparing

This is the only non-technical recommendation, but it is as useful as the 9 remaining ones. When you have a deadline for taking an exam, you have more reasons and pressure to study. For this exam, 5-7 weeks of preparation should make you ready for a successful result, especially if you have work experience with Apache Spark. Register on the web page: it will cost you $300 and you get 1 additional chance if you fail the first attempt (my advice is to schedule the second attempt within the next 2 weeks at most). You need a minimum score of 65%, answering 40 multiple-choice questions in 3 hours, and you can take the exam in Python or Scala.

Basic exam information


Scheduling the exam makes you focus on practicing

Recommendation 2: The PySpark and Spark Scala APIs are almost the same for the exam

If you find yourself torn between which Spark language API to use, Python or Scala, my advice is not to worry too much, because the questions don't require deep knowledge of either programming language. For example, you can find questions where you are given snippets of code (Python or Scala) and you need to identify which of them is incorrect.

// Scala
val df = spark.read.format("parquet").load("/data/sales/june")
df.createOrReplaceTempView("table")

# Python
df = spark.read.orc().load("/data/bikes/june")
df.createGlobalTempView("table")

# Python
from pyspark.sql import Row
myRow = Row(3.14156, "Chicago", 7)

// Scala
import org.apache.spark.sql.functions.lit
df.select(lit("7.5"), lit("11.47")).show(2)
Could you find the incorrect code?

To face this kind of question, remember the core structures and main options in Spark DataFrames (20%-25% of the questions), RDDs, SQL, Streaming and GraphFrames. For example, these are the Read and Write core structures for a Spark DataFrame.

# Read
DataFrameReader.format(...).option("key", "value").schema(...).load()
# Read modes: permissive (default), dropMalformed and failFast

# Write
DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...).sortBy(...).save()
# Save modes: append, overwrite, errorIfExists (default) and ignore
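
As a concrete sketch of those structures (the paths, columns and options below are illustrative, not from the exam), this reads a CSV with an explicit read mode and writes it back out as partitioned Parquet with a save mode:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reader-writer-sketch").getOrCreate()

# Read: format + options + load (dropMalformed discards corrupt records)
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("mode", "dropMalformed") \
    .load("/data/sales/june.csv")

# Write: format + save mode + partitioning (overwrite replaces existing output;
# the "month" column is assumed to exist in this hypothetical dataset)
df.write.format("parquet") \
    .mode("overwrite") \
    .partitionBy("month") \
    .save("/data/sales/parquet")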

Recommendation 3: Practice 'executing' Spark DataFrame and SQL code in your head

If you worked out the output of the code above in your head, you are doing fine, because during the test you are not allowed to check any documentation or even take notes on paper. You will also find another kind of question where you need to identify the correct alternative (there could be more than one) that produces the output shown, based on one or more tables.

# Table
+---------+-----+
|     Name|  Age|
+---------+-----+
|    David|   71|
| Angelica|   22|
|   Martin|    7|
|      Sol|   12|
+---------+-----+

# Output needed: quantity of people older than 21

# Option 1
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/names/*.csv")
df.where(col("Age") > 21).count()

# Option 2
df = spark.read.parquet("/names/*.parquet") \
    .option("inferSchema", "true") \
df.where("Age > 21").count()

# Option 3
logic = "Age > 21"
df = spark.read.("/names/*.parquet") \
    .option("inferSchema", "true") \
df.where(logic).count()

# Option 4
df = spark.read.format("json").option("mode", "FAILFAST") \
    .option("inferSchema", "true") \
    .load("names.json")
df.where("Age > 21").count()
Could you find the correct code? Hint: there is more than one.

This kind of question is not only about DataFrames; it is also used for RDD questions, so study carefully functions like map, reduce, flatMap, groupBy, etc. My recommendation is to check the book Learning Spark, especially chapters 3 and 4.
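
To keep practicing 'in your head', here is a small RDD sketch (the input sentences are made up) that uses those functions; try to predict each result before reading the comments:

lines = sc.parallelize(["spark makes big data simple", "spark is fast"])

words = lines.flatMap(lambda line: line.split(" "))      # one element per word
pairs = words.map(lambda w: (w, 1))                      # ("spark", 1), ("makes", 1), ...
counts = pairs.reduceByKey(lambda a, b: a + b)           # count per word
counts.collect()                                         # [('spark', 2), ('is', 1), ...]

words.map(len).reduce(lambda a, b: a + b)                # total number of characters
words.groupBy(lambda w: w[0]).mapValues(list).collect()  # words grouped by first letter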

Recommendation 4: Understand the basic Spark Architecture


Do you feel familiar with these components? [ Spark documentation ]

Understanding the Spark architecture (15% of the exam) means not only reading the official documentation and knowing the modules, but also discovering:

  • How a simple or a complex query is executed in Spark
  • Spark's different cluster managers
  • What 'lazy evaluation', 'actions' and 'transformations' mean
  • The hierarchy of a Spark application
  • Cluster deployment choices
  • Basic knowledge of Spark's toolset


Spark’s toolset [ Spark: The Definitive Guide ]

The questions for Spark architecture try to check whether a concept or definition is correct or not.

What does RDD stand for? Which part of Spark do RDDs belong to?
  • Resilient Distributed Dataframes. Streaming API
  • Resilient Distributed Datasets. Structured APIs
  • Resilient Distributed Datasets. Low-level APIs

To go further in architecture, I recommend checking chapters 2, 3, 15 and 16 of the book Spark: The Definitive Guide.
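
A short sketch (with a made-up DataFrame) that helps internalize lazy evaluation: the transformations below only build a logical plan, and nothing runs until an action is called:

df = spark.range(1000)                               # DataFrame with ids 0..999

filtered = df.where("id % 2 = 0")                    # transformation: nothing runs yet
doubled = filtered.selectExpr("id * 2 AS doubled")   # still just a plan

doubled.explain()                                    # shows the plan Spark built lazily
print(doubled.count())                               # action: triggers the job (500)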

Recommendation 5: Identify the input sources, sinks, and output modes in Spark Structured Streaming

Around 10% of the questions are about Spark Structured Streaming*, mainly testing whether you can recognize the correct code that will not produce errors. To achieve that, you need to be clear on the basic components and definitions of this module.

In this case, the code below was obtained from the official Spark documentation repo on GitHub and shows a basic word count that gets the data from a socket, applies some basic logic and writes the result to the console with the complete output mode.

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()


The questions for this module will require you to identify correct or incorrect code. They will combine the different input sources (Apache Kafka, files, sockets, etc.) and/or sinks (Apache Kafka, any file format, console, memory, etc.), as well as the output modes: append, update and complete. To practice for these questions, read chapter 21 of the book Spark: The Definitive Guide.
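
As a hedged sketch of swapping the source, sink and output mode (the schema, paths and checkpoint location below are hypothetical), the same pattern could read a stream of JSON files and write it to a Parquet file sink, which only supports the append mode:

from pyspark.sql.types import StructType, StringType

schema = StructType().add("word", StringType())       # file sources require a schema

fileLines = spark.readStream \
    .format("json") \
    .schema(schema) \
    .load("/data/streaming/input/")                   # file source instead of a socket

query = fileLines.writeStream \
    .format("parquet") \
    .outputMode("append") \
    .option("path", "/data/streaming/output/") \
    .option("checkpointLocation", "/data/streaming/checkpoints/") \
    .start()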

*I know that DStreams exist, but they are a low-level API and are unlikely to come up in the exam.

Recommendation 6: Practice Spark Graph Algorithms

As with recommendation 5, there are a few graph algorithms that you need to identify, so my advice is to first check the concepts of graphs and GraphFrames* (5%-10% of the questions) and then practice these algorithms:

  • PageRank
  • In-Degree and Out-Degree Metrics
  • Breadth-First Search
  • Connected Components

Here, for example, we are creating a GraphFrame based on two DataFrames. If you want to practice more, you can find this code and a complete notebook in the GraphFrames user guide on Databricks.

from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *

vertices = sqlContext.createDataFrame([
    ("a", "Alice", 34),
    ("b", "Bob", 36),
    ("c", "Charlie", 30),
    ("d", "David", 29),
    ("e", "Esther", 32),
    ("f", "Fanny", 36),
    ("g", "Gabby", 60)
], ["id", "name", "age"])

edges = sqlContext.createDataFrame([
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend")
], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
print(g)
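
And, as a minimal sketch (parameter values are illustrative, not required by the exam), the algorithms listed above can be run directly on that GraphFrame:

g.inDegrees.show()                                     # in-degree per vertex
g.outDegrees.show()                                    # out-degree per vertex

ranks = g.pageRank(resetProbability=0.15, maxIter=10)  # PageRank
ranks.vertices.select("id", "pagerank").show()

paths = g.bfs("name = 'Esther'", "age < 32")           # breadth-first search
paths.show()

sc.setCheckpointDir("/tmp/graphframes-checkpoints")    # needed by connectedComponents
g.connectedComponents().show()                         # connected components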

*GraphX also existed before GraphFrames (and still exists), but GraphFrames is the more widely used of the two nowadays.

Recommendation 7: Understand the steps of building a Spark ML Pipeline

Spark ML* (10% of the exam) is the module bundled with many machine learning algorithms for classification, regression and clustering, as well as basic statistics, tuning, model selection and pipelines.

Here you need to focus on understanding some must-know concepts, like the steps to build, train and apply a model. For example, all the algorithms require numeric variables only, so if you have a string column you need to use a StringIndexer and then a OneHotEncoder, and all the variables need to be combined into a single vector with the VectorAssembler class, to finally group all the transformations in a Pipeline.

from pyspark.ml.feature import StringIndexer
indexer = StringIndexer() \
    .setInputCol("month") \
    .setOutputCol("month_index")

from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder() \
    .setInputCol("month_index") \
    .setOutputCol("month_encoded")

from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler() \
    .setInputCols(["Sales", "month_encoded"]) \
    .setOutputCol("features")

from pyspark.ml import Pipeline
transfPipeline = Pipeline() \
    .setStages([indexer, encoder, vectorAssembler])

fitPipeline = transfPipeline.fit(trainDataFrame)
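
As a follow-up sketch (trainDataFrame, the "label" column and the choice of LinearRegression here are assumptions for illustration), the fitted pipeline can then transform data and feed an estimator:

from pyspark.ml.regression import LinearRegression

# Apply the fitted transformations; adds the assembled "features" vector column
trainFeatures = fitPipeline.transform(trainDataFrame)

# Train a model on the assembled features ("label" column assumed to exist)
lr = LinearRegression(featuresCol="features", labelCol="label")
lrModel = lr.fit(trainFeatures)
print(lrModel.coefficients)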

*Spark MLlib is the RDD-based API that entered maintenance mode as of Spark 2.0, so Spark ML, which is DataFrame-based, is now the primary ML API.

Recommendation 8: Recognize Spark Transformations, Actions, and more

Going back to Spark RDDs (15% of the questions), a question could be like 'Select the alternative where all the items are Transformations (wide/narrow) or Actions', so you need to make sure you recognize the majority of them. You can find a good explanation in the Spark documentation.

Another important set of topics to understand well (a short sketch follows the list below):

  • Broadcast variables
broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value
  • Accumulators
  • RDD Persistence
  • Passing functions to Spark
  • Coalesce, repartition
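
Here is a minimal sketch (the data and values are made up) exercising a few of the items above:

rdd = sc.parallelize(range(100), 8)

evens = sc.accumulator(0)                         # accumulator: counter updated from tasks

def tag(x):
    if x % 2 == 0:
        evens.add(1)
    return x

tagged = rdd.map(tag)                             # narrow transformation (lazy)
tagged.persist()                                  # persistence (default storage level)
tagged.count()                                    # action: triggers execution
print(evens.value)                                # 50

print(tagged.coalesce(4).getNumPartitions())      # coalesce avoids a full shuffle -> 4
print(tagged.repartition(16).getNumPartitions())  # repartition shuffles -> 16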
Which storage level does the persistence method cache() use?
  • MEMORY_ONLY
  • MEMORY_AND_DISK
  • DISK_ONLY
  • OFF_HEAP

Recommendation 9: Don’t waste time building your Spark environment or getting training data

Yes! I know you want to follow the many fantastic tutorials that exist on Medium, but to prepare for this exam I strongly recommend choosing one of these options, which will let you focus on the content and not on configuration. I prefer Databricks because you get a small, preconfigured Spark cluster ready to start practicing for free.

If you need some data to practice with, I recommend this GitHub repository, where you can find CSV, JSON, Parquet and ORC files.

Recommendation 10: Consider reading these books if you are totally new to Spark

Your friends on this road to learning more about Apache Spark are the two books mentioned throughout this article:

  • Learning Spark (especially chapters 3 and 4)
  • Spark: The Definitive Guide (especially chapters 2, 3, 15, 16 and 21)

Now you’re ready to become a certified Apache Spark Developer :)


Always keep learning

PS: if you have any questions, or would like something clarified, you can find me on Twitter and LinkedIn.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK