
Finding Burgers, Bars and The Best Yelpers in Town

A Digestible PySpark Tutorial for Avid Python Users — Part 1

Photo by Eaters Collective on Unsplash

Some time back, Yelp made the move to share its dataset with the public. As a Yelp user, when I learnt that it included a wealth of information on businesses, user reviews and user characteristics, I was eager to work on it. However, I quickly hit a roadblock when I realized that some of the files were too big to be uploaded onto my Jupyter Notebook. As someone with no prior coding experience, who was just beginning to learn the basics of data analysis and machine learning, I decided to put the project on hold.

Fast forward to yesterday. After spending the weekend trying to set up PySpark in Jupyter, I found out that Google Colab provides a much simpler solution for working with big data via PySpark, without all the frustrations of downloading and configuring it locally. Naturally, I decided to work with the Yelp users dataset.

So, once again, I’ve decided to put the puzzle pieces together to make the journey as painless as I can for someone else. Whether you’ve stumbled across this article by chance or are facing similar frustrations, I’ve put together a menu for you below, so you can decide for yourself whether this is worth your stay.

Menu

Photo by Louis Hansel on Unsplash

Appetizer

  • An understanding of how PySpark handles big data (if not, this video on latency should suffice)

Main Course

  • Importing PySpark onto Colab
  • Common operations on PySpark DataFrame objects
  • Data Visualization
  • Modeling with MLlib

Dessert

  • Model Comparison / Model Selection

How PySpark Works: A Brief Recap

In summary, PySpark is a distributed computing framework: it allows data to be processed in parallel by distributing it across several nodes.

Distributed Data Parallelism

In general, operations in memory are computationally much cheaper than those that go over the network or hit disk. Spark achieves significantly faster computation than systems like Hadoop by keeping as many operations in memory as possible and minimizing the amount of data that has to move across the network, thereby reducing network traffic.

A Smooth and Easy PySpark Setup on Colab

Photo by Fahrul Azmi on Unsplash

Here is a list of libraries you’ll need to get started:

  • OS — to set your environment variables
  • findspark — makes PySpark importable
  • pyspark.sql.SparkSession — to access Spark functionality and work with Spark DataFrames
  • google.colab.drive — to access the data file on my Drive
  • the file type you’re working with (in my case, it’s json)

Asif Ahmed wrote a great article [1] which I referenced to aid me in the installation of PySpark. This process is universal, so anyone can use the same block of code to download PySpark onto Colab. The only thing you should take note of is the version of Spark you’re downloading; it should be the latest one available. I’ve made a comment where this is relevant, and documented some of what is going on in the code below.

!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz #replace 2.4.3 with the latest version available
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark

import os
import findspark
from pyspark.sql import SparkSession

#setting the environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7" #match the version downloaded above
#finds PySpark to make it importable
findspark.init()
#start a local Spark session that uses all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()

To avoid the long wait required to upload a huge file directly onto Colab, I uploaded my files to Google Drive and gave my notebook access to them.

from google.colab import drive
drive.mount('/content/drive') #produces a link with instructions to enter an authentication code

Once you’ve entered the authentication code, click on the arrow button on the left of your Colab Notebook, and search for your file.

Accessing Your Data File Via The Arrow Button

Once you’ve found your file, right click on it and copy its path. Pass that path string to spark.read.filetype() (in my case, spark.read.json()). Disclaimer: this will not work if you don’t add a “/” at the beginning of the path you copied!

To my pleasant surprise, spark.read.json() automatically infers the schema of my nested JSON file, and converts it into a PySpark DataFrame.
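For example, here is how the file can be read in (a minimal sketch; the Drive folder and file name below are assumptions, so replace them with the path you copied):

#the path below is illustrative; paste the path you copied, with a leading "/"
df_user = spark.read.json("/content/drive/My Drive/yelp_academic_dataset_user.json")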

Exploration

Photo by Andrew Neel on Unsplash

Below is a list of methods I’ve applied to my PySpark DataFrame to explore my dataset.

df.printSchema() prints the dataframe’s schema in a neat tree format.

Example of a dataframe schema in tree format

df.select() is a method for querying: you pass it the columns you want to display along with any SQL-like operations. Naming columns is not strictly necessary, but if you pass only a condition the query returns boolean values. Either way, collecting the result gives you a list of rows, not a dataframe.
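For instance (a small sketch; the column names are assumptions based on the Yelp user schema):

#query two columns and display the first five rows
df_user.select("name", "review_count").show(5)
#passing only a condition (no column names) returns boolean values
df_user.select(df_user.review_count > 100).show(5)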

.collect() gathers the distributed, fragmented data back to the driver as a list of rows. Avoid calling it until you actually need the collated results.

pyspark.sql.functions.udf, or a user-defined function, is a function that can be applied to every row of a dataframe column. Two arguments are needed: the transformation and the expected data type of the transformed variable. UDFs are used in conjunction with df.select() or df.withColumn(). In addition, beyond employing the functions found in pyspark.sql.functions, it is possible to define your own functions and pass them through udf. UDFs tend to be slower than built-in functions, though, so becoming well versed with PySpark’s built-in functions can save you a ton of time.
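As an illustration, the friend count computed later in this article with a UDF can also be expressed with built-in functions alone (a sketch, assuming friends is a comma-separated string column as in the Yelp user data):

from pyspark.sql import functions as F

#split the comma-separated string into an array and count its elements
df_with_count = df_user.withColumn("no_friends", F.size(F.split(df_user.friends, ", ")))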

spark.createDataFrame() transforms the output of a query (a list of rows) back into a dataframe.

df.show() displays the first 20 rows of a PySpark dataframe, while df.show(n) displays only the first n rows.

Output of df.show() on Colab

df.withColumn() returns a new dataframe consisting of the original dataframe plus a new column produced by a specified operation. The two arguments required for this method are the new column name and the operation.

Example Use Case

The dataset gave me a list of friends per user, and from it I created a column with the number of friends per user.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

#create a user-defined function that counts the entries in a comma-separated string
split_count = udf(lambda z: len(z.split(", ")), IntegerType())
#transform and attach a new column to the original dataframe
df_user1 = df_user.withColumn("no_friends", split_count(df_user.friends))

Visualization

Calling the display function (where it is available, such as on Databricks) lets you visualize a PySpark dataframe directly. Alternatively, by calling .toPandas(), you can easily use seaborn, matplotlib and any other visualization libraries you prefer.

Example Use Case

Histogram of the Average Rating By Each User
import seaborn as sns

#collect the average_stars column and convert it for plotting
avg_stars_query = df_user.select("average_stars").collect()
avg_stars_df = spark.createDataFrame(avg_stars_query)
sns.distplot(avg_stars_df.toPandas(), bins=4, kde=False)

Building a model with MLlib

PySpark’s ML library is very similar to sklearn’s, with some minor differences. Instead of train_test_split, for example, randomSplit is used. Here is a general breakdown of model building with MLlib [2], with a short sketch tying the steps together after the list:

  • Merge features into one column with pyspark.ml.feature.VectorAssembler or mllib.linalg.DenseVector
  • Scale features, e.g. with pyspark.ml.feature.StandardScaler
  • Perform a train-test split using randomSplit
  • Train the model with one “features” column and one “label” column
  • Evaluate with metrics, e.g. mllib.evaluation.RegressionMetrics
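Below is a minimal sketch of this workflow for the clustering use case that follows. The input columns are assumptions (df_user1 and no_friends come from the earlier UDF example), and k = 6 is just a placeholder:

from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

#1. merge the feature columns into a single vector column
assembler = VectorAssembler(inputCols=["review_count", "average_stars", "no_friends"],
                            outputCol="features_raw")
assembled = assembler.transform(df_user1)
#2. scale the assembled features
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
scaled = scaler.fit(assembled).transform(assembled)
#3. train-test split
train, test = scaled.randomSplit([0.8, 0.2], seed=42)
#4. fit the model on the "features" column
kmeans_model = KMeans(k=6, featuresCol="features", seed=42).fit(train)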

Use Case

Photo by Marvin Meyer on Unsplash

I was curious to know the different profiles of Yelpers, so I performed K-Means clustering and computed the within-set sum of squared errors (WSSE) at varying values of k. The optimal k is found at the “elbow”, where the drop in WSSE flattens out noticeably, in a way that imitates the bend of an arm. This is known as the elbow method.
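Here is a rough sketch of that search (assuming the scaled dataframe from the earlier workflow sketch, and Spark 2.4, where KMeansModel.computeCost is still available):

from pyspark.ml.clustering import KMeans

#compute the within-set sum of squared errors for k = 2 to 24
wsse = []
for k in range(2, 25):
    model = KMeans(k=k, featuresCol="features", seed=42).fit(scaled)
    wsse.append((k, model.computeCost(scaled))) #deprecated in Spark 3.0 in favour of ClusteringEvaluator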

Even after finding an optimal value of k, analyzing the clusters can still be tough with so many features. Dimensionality reduction would be useful in simplifying the model.

Dimensionality Reduction

Correlation Heatmap of Variables Based on Sample (n = 500,000)

It seems there isn’t a clear-cut way to select features based on this heatmap, since the majority of the correlations hover around a coefficient of .30.

In addition, I tinkered with dimensionality reduction in PySpark, but found no direct way to recover the feature names or rank the features by importance for the purposes of interpretation.

Yet another solution would be to transform the variables. By merging the compliments the user gave into one column and the compliments the user received into another, I was able to reduce the number of features to 6.
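A sketch of that transformation is below. The column groupings are assumptions based on the Yelp user schema (useful, funny and cool votes sent by the user versus the compliment_* counts received), so adjust them to the columns in your file:

from pyspark.sql import functions as F

#sum the "given" columns and the "received" columns into two new features
given_cols = ["useful", "funny", "cool"]
received_cols = ["compliment_hot", "compliment_more", "compliment_profile",
                 "compliment_cute", "compliment_note", "compliment_plain",
                 "compliment_cool", "compliment_funny", "compliment_writer",
                 "compliment_photos"]
df_user2 = (df_user1
            .withColumn("compliments_given", sum(F.col(c) for c in given_cols))
            .withColumn("compliments_received", sum(F.col(c) for c in received_cols)))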

Elbow Method: Finding the Optimal Number of Clusters

In the graph above, the number of clusters ranges from 2 to 24. The elbow occurs at k = 6, i.e. 6 distinct user profiles can be found.

Alternative Solutions

Alternative Clustering Algorithms

The other algorithms offered by PySpark’s MLlib package include Gaussian mixture models, LDA (often used in text mining) and bisecting k-means. Bisecting k-means is a hierarchical clustering algorithm that splits the data with a top-down approach, and it is preferable for big datasets.
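Swapping in bisecting k-means is a small change (a sketch, reusing the scaled dataframe and k = 6 from the earlier sketches):

from pyspark.ml.clustering import BisectingKMeans

bkm_model = BisectingKMeans(k=6, featuresCol="features", seed=42).fit(scaled)
clustered = bkm_model.transform(scaled) #adds a "prediction" column with each user's cluster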

Subsampling

Subsampling, say, about a third of the data (about 500,000 cases) and calling .toPandas() will allow you to perform feature selection and conduct training in scikit-learn on the newly converted Pandas dataframe.
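For example (a sketch; the sampling fraction is an assumption):

#take roughly a third of the rows without replacement, then convert to Pandas for scikit-learn
sample_pdf = df_user1.sample(withReplacement=False, fraction=0.33, seed=42).toPandas()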

Bonus: Hyperparameter Tuning

Photo by Rodion Kutsaev on Unsplash

PySpark also enables users to select the best model by providing pipelining and hyperparameter tuning functionality.

Here are some terms you are likely to come across while selecting your model:

  • ParamMap
  • ParamGridBuilder
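As a sketch of how these fit together (the estimator, grid values and evaluator below are illustrative, not the exact setup used in this article; ClusteringEvaluator is available from Spark 2.3 onwards):

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

kmeans = KMeans(featuresCol="features", seed=42)
#candidate values of k to search over
grid = ParamGridBuilder().addGrid(kmeans.k, [4, 6, 8]).build()
cv = CrossValidator(estimator=kmeans,
                    estimatorParamMaps=grid,
                    evaluator=ClusteringEvaluator(), #silhouette score by default
                    numFolds=3)
best_model = cv.fit(scaled).bestModel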

Note: there is also a spark-sklearn library for grid searching.

My Review

Photo by Alex Ware on Unsplash

Training the model on my laptop was very time consuming. Several attempts at other clustering algorithms and at grid searching were simply not feasible because of memory constraints.

Nevertheless, PySpark is a great tool if you want to handle large data for free, and many companies use it today. I’ve also recently learnt of the existence of PySparkling, which combines PySpark with H2O, an automated, easy-to-use machine learning platform.

What You Can Be Looking Forward To

In part 2, we will walk through the cluster analysis and present the findings from each cluster.

In summary, we recapped how PySpark handles big data, how to set the system up on Colab, some of the common methods used when working with dataframes (as opposed to RDDs), and how to train a model in PySpark.

Hope you enjoyed reading this article! Feel free to leave your thoughts (or tips) below.

References

[1] Asif Ahmed, PySpark in Google Colab, Towards Data Science

[2] Machine Learning Library (MLlib) Guide, Apache Spark 2.4.3

