
Finding Burgers, Bars and The Best Yelpers in Town

A Digestible PySpark Tutorial for Avid Python Users — Part 1

Photo by Eaters Collective on Unsplash

Some time back, Yelp made the move to share its dataset with the public. As a Yelp user, when I learnt that it included a wealth of information on businesses, user reviews and user characteristics, I was eager to work on it. However, I quickly hit a roadblock when I realized that some of the files were too big to be uploaded onto my Jupyter Notebook. As someone with no prior coding experience, who was just beginning to learn the basics of data analysis and machine learning, I decided to put the project on hold.

Fast forward to yesterday. After spending the weekend trying to set up PySpark in Jupyter, I found out that Google Colab provides a much simpler solution for working with big data via PySpark, without all the frustrations of downloading and configuring it locally. Naturally, I decided to work with the Yelp users dataset.

So, once again, I’ve decided to put the puzzle pieces together to make the journey as painless as I can for someone else. Whether you’ve stumbled across this article by chance or are facing similar frustrations, I’ve put together a menu for you below, so you can decide for yourself whether this is worth your stay.

Menu

Photo by Louis Hansel on Unsplash

Appetizer

  • An understanding of how PySpark handles big data (if not, this video on latency should suffice)

Main Course

  • Importing PySpark onto Colab
  • Common operations on PySpark DataFrame objects
  • Data Visualization
  • Modeling with MLlib

Dessert

  • Model Comparison / Model Selection

How PySpark Works: A Brief Recap

In summary, PySpark is a distributed computing framework: it allows data to be processed in parallel by distributing it across several nodes.

Distributed Data Parallelism

In general, operations in memory are computationally much cheaper than those that go over the network or hit disk. Spark achieves significantly faster computation than systems like Hadoop by keeping as many operations in memory as possible and minimizing the amount of data that has to move across the network, thereby reducing network traffic.

A Smooth and Easy PySpark Setup on Colab

Photo by Fahrul Azmi on Unsplash

Here is a list of libraries you’ll need to get started:

  • OS — to set your environment variables
  • findspark — makes PySpark importable
  • pyspark.sql.SparkSession — to access Spark functionality and work with Spark DataFrames
  • google.colab.drive — to access the data file on my Drive
  • the file type you’re working with (in my case, it’s json)

Asif Ahmed wrote a great article [1] which I referenced to aid me in the installation of PySpark. This process is universal, so anyone can use the same block of code to download PySpark onto Colab. The only thing you should take note of is the version of Spark you’re downloading; it should be the latest one available. I’ve made a comment where this is relevant, and documented some of what is going on in the code below.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz #replace 2.4.3 with the latest version available
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

import os
import findspark
from pyspark.sql import SparkSession

#setting the environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7" #match the version downloaded above
#finds PySpark to make it importable
findspark.init()
#start a local Spark session that uses all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()

To avoid the long wait required to upload a huge file directly onto Colab, I uploaded my files to Google Drive and gave my notebook access to them.

from google.colab import drive
drive.mount('/content/drive') #produces a link with instructions to enter an authentication code

Once you’ve entered the authentication code, click on the arrow button on the left of your Colab Notebook, and search for your file.

Accessing Your Data File Via The Arrow Button

Once you’ve found your file, right click on it and copy its path. Pass that path string to spark.read.filetype() (in my case, spark.read.json()). Disclaimer: this will not work if you don’t add a “/” at the beginning of the path you copied!

To my pleasant surprise, spark.read.json() automatically infers the schema of my nested JSON file, and converts it into a PySpark DataFrame.
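For example, here is how the file can be read in (a minimal sketch; the Drive folder and file name below are assumptions, so replace them with the path you copied):

#the path below is illustrative; paste the path you copied, with a leading "/"
df_user = spark.read.json("/content/drive/My Drive/yelp_academic_dataset_user.json")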

Exploration

Photo by Andrew Neel on Unsplash

Below is a list of methods I’ve applied to my PySpark DataFrame to explore my dataset.

df.printSchema() prints the dataframe’s schema in a neat tree format.

Example of a dataframe schema in tree format

df.select() is a method for querying: you pass it the columns you want to display along with any SQL-like operations. Naming columns is not strictly necessary, but if you pass only a condition the query returns boolean values. Either way, collecting the result gives you a list of rows, not a dataframe.
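For instance (a small sketch; the column names are assumptions based on the Yelp user schema):

#query two columns and display the first five rows
df_user.select("name", "review_count").show(5)
#passing only a condition (no column names) returns boolean values
df_user.select(df_user.review_count > 100).show(5)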

.collect() gathers the distributed, fragmented data back to the driver as a list of rows. Avoid calling it until you actually need the collated results.

pyspark.sql.functions.udf, or a user-defined function, is a function that can be applied to every row of a dataframe column. Two arguments are needed: the transformation and the expected data type of the transformed variable. UDFs are used in conjunction with df.select() or df.withColumn(). In addition, beyond employing the functions found in pyspark.sql.functions, it is possible to define your own functions and pass them through udf. UDFs tend to be slower than built-in functions, though, so becoming well versed with PySpark’s built-in functions can save you a ton of time.
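As an illustration, the friend count computed later in this article with a UDF can also be expressed with built-in functions alone (a sketch, assuming friends is a comma-separated string column as in the Yelp user data):

from pyspark.sql import functions as F

#split the comma-separated string into an array and count its elements
df_with_count = df_user.withColumn("no_friends", F.size(F.split(df_user.friends, ", ")))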

spark.createDataFrame() transforms the output of a query (a list of rows) back into a dataframe.

df.show() displays the first 20 rows of a PySpark dataframe, while df.show(n) displays only the first n rows.

Output of df.show() on Colab

df.withColumn() returns a new dataframe consisting of the original dataframe plus a new column produced by a specified operation. The two arguments required for this method are the new column name and the operation.

Example Use Case

The dataset gave me a list of friends per user, and from it I created a column with the number of friends per user.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

#create a user-defined function that counts the entries in a comma-separated string
split_count = udf(lambda z: len(z.split(", ")), IntegerType())
#transform and attach a new column to the original dataframe
df_user1 = df_user.withColumn("no_friends", split_count(df_user.friends))

Visualization

Calling the display function (where it is available, such as on Databricks) lets you visualize a PySpark dataframe directly. Alternatively, by calling .toPandas(), you can easily use seaborn, matplotlib and any other visualization libraries you prefer.

Example Use Case

Histogram of the Average Rating By Each User
import seaborn as sns

#collect the average_stars column and convert it for plotting
avg_stars_query = df_user.select("average_stars").collect()
avg_stars_df = spark.createDataFrame(avg_stars_query)
sns.distplot(avg_stars_df.toPandas(), bins=4, kde=False)

Building a model with MLlib

PySpark’s ML library is very similar to sklearn’s, with some minor differences. Instead of train_test_split, for example, randomSplit is used. Here is a general breakdown of model building with MLlib [2], with a short sketch tying the steps together after the list:

  • Merge features into one column with pyspark.ml.feature.VectorAssembler or mllib.linalg.DenseVector
  • Scale features, e.g. with pyspark.ml.feature.StandardScaler
  • Perform a train-test split using randomSplit
  • Train the model with one “features” column and one “label” column
  • Evaluate with metrics, e.g. mllib.evaluation.RegressionMetrics
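Below is a minimal sketch of this workflow for the clustering use case that follows. The input columns are assumptions (df_user1 and no_friends come from the earlier UDF example), and k = 6 is just a placeholder:

from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

#1. merge the feature columns into a single vector column
assembler = VectorAssembler(inputCols=["review_count", "average_stars", "no_friends"],
                            outputCol="features_raw")
assembled = assembler.transform(df_user1)
#2. scale the assembled features
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
scaled = scaler.fit(assembled).transform(assembled)
#3. train-test split
train, test = scaled.randomSplit([0.8, 0.2], seed=42)
#4. fit the model on the "features" column
kmeans_model = KMeans(k=6, featuresCol="features", seed=42).fit(train)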

Use Case

Photo by Marvin Meyer on Unsplash

I was curious to know the different profiles of Yelpers, so I performed K-Means clustering and computed the within-set sum of squared errors (WSSE) at varying values of k. The optimal k is found at the “elbow”, where the drop in WSSE flattens out noticeably, in a way that imitates the bend of an arm. This is known as the elbow method.
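Here is a rough sketch of that search (assuming the scaled dataframe from the earlier workflow sketch, and Spark 2.4, where KMeansModel.computeCost is still available):

from pyspark.ml.clustering import KMeans

#compute the within-set sum of squared errors for k = 2 to 24
wsse = []
for k in range(2, 25):
    model = KMeans(k=k, featuresCol="features", seed=42).fit(scaled)
    wsse.append((k, model.computeCost(scaled))) #deprecated in Spark 3.0 in favour of ClusteringEvaluator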

Even after finding an optimal value of k, analyzing the clusters can still be tough with so many features. Dimensionality reduction would be useful in simplifying the model.

Dimensionality Reduction

Correlation Heatmap of Variables Based on Sample (n = 500,000)

It seems there isn’t a clear-cut way to select features based on this heatmap, since the majority of the correlations hover around a coefficient of .30.

In addition, I tinkered with dimensionality reduction in PySpark, but found no direct way to recover the feature names or rank the features by importance for the purposes of interpretation.

Yet another solution would be to transform the variables. By merging the compliments the user gave into one column and the compliments the user received into another, I was able to reduce the number of features to 6.
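A sketch of that transformation is below. The column groupings are assumptions based on the Yelp user schema (useful, funny and cool votes sent by the user versus the compliment_* counts received), so adjust them to the columns in your file:

from pyspark.sql import functions as F

#sum the "given" columns and the "received" columns into two new features
given_cols = ["useful", "funny", "cool"]
received_cols = ["compliment_hot", "compliment_more", "compliment_profile",
                 "compliment_cute", "compliment_note", "compliment_plain",
                 "compliment_cool", "compliment_funny", "compliment_writer",
                 "compliment_photos"]
df_user2 = (df_user1
            .withColumn("compliments_given", sum(F.col(c) for c in given_cols))
            .withColumn("compliments_received", sum(F.col(c) for c in received_cols)))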

Elbow Method: Finding the Optimal Number of Clusters

In the graph above, the number of clusters ranges from 2 to 24. The elbow occurs at k = 6, i.e. 6 distinct user profiles can be found.

Alternative Solutions

Alternative Clustering Algorithms

The other algorithms offered by PySpark’s MLlib package include Gaussian mixture models, LDA (often used in text mining) and bisecting k-means. Bisecting k-means is a hierarchical clustering algorithm that splits the data with a top-down approach, and it is preferable for big datasets.
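Swapping in bisecting k-means is a small change (a sketch, reusing the scaled dataframe and k = 6 from the earlier sketches):

from pyspark.ml.clustering import BisectingKMeans

bkm_model = BisectingKMeans(k=6, featuresCol="features", seed=42).fit(scaled)
clustered = bkm_model.transform(scaled) #adds a "prediction" column with each user's cluster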

Subsampling

Subsampling, say, about a third of the data (about 500,000 cases) and calling .toPandas() will allow you to perform feature selection and conduct training in scikit-learn on the newly converted Pandas dataframe.
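For example (a sketch; the sampling fraction is an assumption):

#take roughly a third of the rows without replacement, then convert to Pandas for scikit-learn
sample_pdf = df_user1.sample(withReplacement=False, fraction=0.33, seed=42).toPandas()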

Bonus: Hyperparameter Tuning

Photo by Rodion Kutsaev on Unsplash

PySpark also enables users to select the best model by providing pipelining and hyperparameter tuning functionality.

Here are some terms you are likely to come across while selecting your model:

  • ParamMap
  • ParamGridBuilder
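As a sketch of how these fit together (the estimator, grid values and evaluator below are illustrative, not the exact setup used in this article; ClusteringEvaluator is available from Spark 2.3 onwards):

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

kmeans = KMeans(featuresCol="features", seed=42)
#candidate values of k to search over
grid = ParamGridBuilder().addGrid(kmeans.k, [4, 6, 8]).build()
cv = CrossValidator(estimator=kmeans,
                    estimatorParamMaps=grid,
                    evaluator=ClusteringEvaluator(), #silhouette score by default
                    numFolds=3)
best_model = cv.fit(scaled).bestModel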

Note: there is also a spark-sklearn library for grid searching.

My Review

Photo by Alex Ware on Unsplash

Training the model on my laptop was very time consuming. Several attempts at other clustering algorithms and at grid searching were simply not feasible because of memory constraints.

Nevertheless, PySpark is a great tool if you want to handle large data for free, and many companies use it today. I’ve also recently learnt of the existence of PySparkling, which combines PySpark with H2O, an automated, easy-to-use machine learning platform.

What You Can Be Looking Forward To

In part 2, we will walk through the cluster analysis and present the findings from each cluster.

In summary, we recapped how PySpark handles big data, how to set the system up on Colab, some of the common methods used when working with dataframes (as opposed to RDDs), and how to train a model in PySpark.

Hope you enjoyed reading this article! Feel free to leave your thoughts (or tips) below.

References

[1] Asif Ahmed, PySpark in Google Colab, Towards Data Science

[2] Machine Learning Library (MLlib) Guide, Apache Spark 2.4.3

