My 10 recommendations after getting the Databricks Certification for Apache Spark
A few months ago I started preparing to achieve the Databricks Certification for Apache Spark. It was not easy because there is not much information about it, so to promote self-preparation I'm going to share ten useful recommendations.
Recommendation 1: Schedule your exam before you start preparing
This is the only non-technical recommendation, but it is as useful as the nine remaining ones. When you have a deadline for taking an exam, you have more reasons and pressure to study. For this exam, 5–7 weeks of preparation should make you ready for a successful result, especially if you have work experience with Apache Spark. Register on the web page: it will cost you $300, and you get 1 additional chance if you fail the first attempt (my advice is to schedule the second attempt within the next 2 weeks at most). You need a minimum score of 65%, answering 40 multiple-choice questions in 3 hours, and you can take the exam in Python or Scala.
Scheduling the exam makes you focus on practicing
Recommendation 2: The PySpark and Spark Scala APIs are almost the same for the exam
If you find yourself torn over which Spark language API to use, Python or Scala, my advice is not to worry too much, because the questions don't require deep knowledge of those programming languages. For example, you can find questions where you are given snippets of code (Python or Scala) and you need to identify which of them is incorrect.
//Scala
val df = spark.read.format("parquet").load("/data/sales/june")
df.createOrReplaceTempView("table")

#Python
df = spark.read.orc().load("/data/bikes/june")
df.createGlobalTempView("table")

#Python
from pyspark.sql import Row
myRow = Row(3.14156, "Chicago ", 7)

//Scala
import org.apache.spark.sql.functions.lit
df.select(lit("7.5"), lit("11.47")).show(2)
Could you find the incorrect code?
So, to face this kind of question, remember the structures and the main options in Spark DataFrames (20%–25% of the questions), RDDs, SQL, Streaming, and GraphFrames. For example, these are the core read and write structures for Spark DataFrames.
#Read
DataFrameReader.format(...).option("key","value").schema(...).load()
#Read modes: permissive (default), dropMalformed and failFast.

#Write
DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...).sortBy(...).save()
#Save modes: append, overwrite, errorIfExists (default) and ignore.
Recommendation 3: Practice 'executing' code with Spark DataFrames and SQL in your head
If you used your mind to get the output of the code above, you are doing fine, because during the test you are not allowed to check any documentation or even take notes on paper. You will also find questions where you need to identify the correct alternative (there could be more than one) that produces the output shown, based on one or more tables.
#Table
+---------+---------+
|     Name|      Age|
+---------+---------+
|    David|       71|
| Angelica|       22|
|   Martin|        7|
|      Sol|       12|
+---------+---------+

#Output needed
# Quantity of people greater than 21

# Alternative 1
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/names/*.csv")
df.where(col("Age")>21).count()

# Alternative 2
df = spark.read.parquet("/names/*.parquet") \
    .option("inferSchema", "true")
df.where("Age > 21").count()

# Alternative 3
logic = "Age > 21"
df = spark.read.("/names/*.parquet") \
    .option("inferSchema", "true")
df.where( logic ).count()

# Alternative 4
df = spark.read.format("json").option("mode", "FAILFAST") \
    .option("inferSchema", "true") \
    .load("names.json")
df.where("Age > 21").count()
Could you find the correct code? Hint: there is more than one.
This kind of question is not only about DataFrames; it also appears in RDD questions, so study carefully functions like map, reduce, flatMap, groupBy, etc. My recommendation is to check the book Learning Spark, especially chapters 3 and 4.
Recommendation 4: Understand the basic Spark Architecture
Do you feel familiar with these components? [ Spark documentation ]
Understanding the Spark architecture (15% of the exam) means not only reading the official documentation and knowing the modules, but also discovering:
- How is a simple or complex query executed in Spark?
- Spark's different cluster managers
- What do 'lazy evaluation', 'actions', and 'transformations' mean?
- Hierarchy of a Spark application
- Cluster deployment choices
- Basic knowledge of the Spark’s toolset
Spark’s toolset [ Spark: The Definitive Guide ]
The Spark architecture questions check whether you know if a concept or definition is correct or not.

What does RDD mean? What part of Spark does it belong to?
- Resilient Distributed Dataframes. Streaming API
- Resilient Distributed Datasets. Structured APIs
- Resilient Distributed Datasets. Low-level APIs
To go further in Architecture I recommend checking chapters 2, 3, 15 and 16 of the book Spark: The Definitive Guide .
Recommendation 5: Identify the input, sinks, and output in Spark Structured Streaming
Around 10% of the questions are about Spark Structured Streaming*, mainly checking that you recognize the correct code that will not produce errors. To achieve that, you need to be clear about the basic components and definitions of this module.
In this case, this code was obtained from the official Spark documentation repo on GitHub. It shows a basic word count that gets the data from a socket, applies some basic logic, and writes the result to the console with outputMode complete.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
[ Spark documentation ]
The questions for this module will require you to identify the correct or incorrect code. They combine the different input sources (Apache Kafka, files, sockets, etc.) and/or sinks (outputs), e.g. Apache Kafka, any file format, console, memory, etc., and also the output modes: append, update, and complete. To practice for these questions, read chapter 21 of the book Spark: The Definitive Guide.
*I know that DStreams exist, but they are a low-level API and are unlikely to come up in the exam.
Recommendation 6: Practice Spark Graph Algorithms
Like recommendation 5, there are a few graph algorithms that you need to identify, so my advice is to first check the concepts of graphs and GraphFrames* (5%–10% of the questions) and then practice these algorithms:
- PageRank
- In-Degree and Out-Degree Metrics
- Breadth-First Search
- Connected Components
Here, for example, we are creating a GraphFrame based on two DataFrames. If you want to practice more, you can find this code and a complete notebook in the GraphFrames user guide on Databricks.
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *

vertices = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
], ["id", "name", "age"])

edges = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)
print(g)
*Before GraphFrames there was GraphX (it still exists), but GraphFrames is more widely used nowadays.
Recommendation 7: Understand the steps of building a Spark ML Pipeline
Spark ML* (10% of the exam) is the module bundled with many machine learning algorithms for classification, regression, and clustering, as well as basic statistics, tuning, model selection, and pipelines.
Here you need to focus on understanding some must-know concepts, like the steps to build, train, and apply a trained model. For example, all the algorithms require numeric variables only, so if you have a string column you need to use a StringIndexer followed by a OneHotEncoder, and all the variables need to be combined into one vector, so we use the class VectorAssembler, finally grouping all the transformations in a Pipeline:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer() \
    .setInputCol("month") \
    .setOutputCol("month_index")

from pyspark.ml.feature import OneHotEncoder
encoder = OneHotEncoder() \
    .setInputCol("month_index") \
    .setOutputCol("month_encoded")

from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler() \
    .setInputCols(["Sales", "month_encoded"]) \
    .setOutputCol("features")

from pyspark.ml import Pipeline
transfPipeline = Pipeline() \
    .setStages([indexer, encoder, vectorAssembler])

fitPipeline = transfPipeline.fit(trainDataFrame)
*Spark MLlib is the RDD-based API that has been in maintenance mode since Spark 2.0, so Spark ML, which is DataFrame-based, is the primary ML API.
Recommendation 8: Recognize Spark Transformations, Actions, and more
Going back to Spark RDDs (15% of the questions), a question could be like 'Select the alternative with all the transformations (wide/narrow) or actions', so you need to make sure you recognize the majority of them. You have a good explanation in the Spark documentation.
It is also important to understand these topics well:
- Broadcast variables
broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value
- Accumulators
- RDD Persistence
- Passing functions to Spark
- Coalesce, repartition
Which storage level does the persistence method cache() use?
- MEMORY_ONLY
- MEMORY_AND_DISK
- DISK_ONLY
- OFF_HEAP
Recommendation 9: Don’t waste time building your Spark environment or getting training data
Yes! I know you want to follow the many fantastic tutorials that exist on Medium, but to prepare for this exam I strongly recommend choosing one of these options, which will let you focus on the content and not on configuration. I prefer Databricks because you get a small Spark cluster, configured and ready for you to start practicing, for free.
If you need some data to practice with, I recommend this GitHub repository, where you can find CSV, JSON, Parquet, and ORC files.
Recommendation 10: Consider reading these books if you are totally new to Spark
Your friends on this road to learning more about Apache Spark are:
- Spark: The Definitive Guide
- Learning Spark
- Spark Documentation
- 5 Tips for Cracking Databricks Apache Spark Certification
Now you’re ready to become a certified Apache Spark Developer :)
Always keep learning
PS if you have any questions, or would like something clarified, you can find me on Twitter and LinkedIn.