Finding Burgers, Bars and the Best Yelpers in Town
A Digestible PySpark Tutorial for Avid Python Users — Part 1
Some time back, Yelp made its data repository available to the public. As a Yelp user, when I learnt that it included a wealth of information on businesses, user reviews and user characteristics, I was eager to work on it. However, I quickly hit a roadblock when I realized that some of the files were too big to be uploaded onto my Jupyter Notebook. As someone with no prior coding experience, who was just beginning to learn the basics of data analysis and machine learning, I decided to put that on hold.
Fast forward to yesterday. After spending the weekend trying to set up PySpark in Jupyter, I found out that Google Colab provides a much simpler solution to working with big data via PySpark, without all the frustrations of downloading and working with it locally. Naturally, I decided to work with the Yelp users dataset.
So again, I’ve decided to put the puzzle pieces together to make the journey as painless for someone else as I can make it. Whether you’ve stumbled across this article, or are facing similar frustrations, I’ve put together a menu for you below, so you can decide for yourself if this is worth your stay.
Menu
Appetizer
- An understanding of how PySpark handles big data (if not, this video on latency should suffice)
Main Course
- Importing PySpark onto Colab
- Common operations on PySpark DataFrame objects
- Data Visualization
- Modeling with MLlib
Dessert
- Model Comparison / Model Selection
How PySpark Works: A Brief Recap
In summary, PySpark is a distributed computing framework: it allows data to be processed in parallel by distributing it across several nodes.
In general, operations in memory are computationally cheaper than those over networks and disks. Spark offers significantly faster computing than systems like Hadoop by keeping as many operations as possible in memory and minimizing those that touch the network or disk, thereby reducing network traffic.
A Smooth and Easy PySpark Setup on Colab
Here is a list of libraries you’ll need to get started:
- OS — to set your environment variables
- findspark — makes PySpark importable
- pyspark.sql.SparkSession — to access Spark functionality and work with Spark DataFrames
- google.colab.drive — to access the data file on my Drive
- the file type you’re working with (in my case, it’s json)
Asif Ahmed wrote a great article which I referenced to aid me in the installation of PySpark. This process is universal, so anyone can use the same block of code to download PySpark onto Colab. The only thing you should take note of is the version of Spark you're downloading: it should be the latest one available. I've made a comment where this is relevant, and documented some of what is going on in the code below.
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz #based on latest version
!tar xf spark-2.4.3-bin-hadoop2.7.tgz
!pip install -q findspark
import os
import findspark
from pyspark.sql import SparkSession

#setting the environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.3-bin-hadoop2.7" #based on latest version

#findspark makes PySpark importable
findspark.init()
spark = SparkSession.builder.master("local[*]").getOrCreate()
To avoid the long wait time required to upload a huge file onto Colab, I uploaded my files to Google Drive and gave my notebook access to it.
from google.colab import drive
drive.mount('/content/drive') #produces a link with instructions to enter an authentication code
Once you’ve entered the authentication code, click on the arrow button on the left of your Colab Notebook, and search for your file.
Once you've found your file, right click on it and copy its path. Pass that string to spark.read.json() (or the reader for your file type). Disclaimer: this will not work if you don't add a "/" at the beginning of the path you copied!
To my pleasant surprise, spark.read.json() automatically infers the schema of my nested JSON file and converts it into a PySpark DataFrame.
Exploration
Below is a list of methods I’ve applied to my PySpark DataFrame to explore my dataset.
df.printSchema() returns a neat tree format of information about the dataframe.
df.select() is the method for querying: pass it the columns you want to display along with any SQL-like expressions. Naming columns is not necessary; however, if you pass only a condition, the query returns boolean values. Either way, the result is a new dataframe, which only becomes a list of rows once you collect it.
.collect() assembles the fragmented dataset that was distributed across the nodes earlier. Avoid calling it until you need collated results.
pyspark.sql.functions.udf, or user-defined function, wraps a function so that it can be applied to dataframe columns. Two arguments are needed: the transformation and the expected data type of the transformed variable. UDFs are used in conjunction with df.select(). In addition, beyond employing the functions found in pyspark.sql.functions, it is possible to define your own functions and pass them through the udf method. UDFs may be slower than built-in functions, though, so becoming well versed with PySpark functions can save you a ton of time.
spark.createDataFrame() transforms the output of a query (a list of rows) back into a dataframe.
df.show() displays the first 20 rows of a PySpark dataframe; df.show(n) displays only the first n rows.
df.withColumn() returns a new dataframe consisting of the original dataframe plus a new column derived from a specified operation. The two arguments required are the new column name and the operation.
Example Use Case
The dataset gave me a list of friends per user, and from it I created a column with the number of friends per user.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

#create a user-defined function
split_count = udf(lambda z: len(z.split(", ")), IntegerType())

#transform and attach a new column to the original dataframe
df_user1 = df_user.withColumn("no_friends", split_count(df_user.friends))
Visualization
Calling the display function lets you view results while they are still in PySpark dataframe form. Alternatively, by calling .toPandas(), you can easily employ seaborn, matplotlib and any other visualization libraries you desire to use.
Example Use Case
avg_stars_query = df_user.select("average_stars").collect()
avg_stars_df = spark.createDataFrame(avg_stars_query)
sns.distplot(avg_stars_df.toPandas(), bins = 4, kde = False)
Building a model with MLlib
PySpark's ML library is very similar to sklearn's, with some minor differences: instead of train_test_split, for example, randomSplit is used. Here is a general breakdown of model building with MLlib:
- Merge the features into one column with pyspark.ml.feature.VectorAssembler or mllib.linalg.DenseVector
- Scale the features, e.g. with pyspark.ml.feature.StandardScaler
- Perform the train test split using randomSplit
- Train the model with one "features" column and one "label" column
- Evaluate with metrics, e.g. mllib.evaluation.RegressionMetrics
Use Case
I was curious to know the different profiles of Yelpers, so I performed K-Means clustering and computed the within-set sum of squared errors (WSSE) at varying levels of k. The optimal k is found at the "elbow", the point where the curve bends sharply, imitating the bend of an arm, and further increases in k bring diminishing reductions in WSSE. This is known as the elbow method.
Despite finding an optimal level of k, analyzing the clusters can still be tough with so many features. Dimensionality reduction would be useful in simplifying the model.
Dimensionality Reduction
It seems there isn't a clear-cut way to select features based on this heatmap, since the majority of feature pairs had a correlation coefficient of around .30.
In addition, I tinkered with dimensionality reduction in PySpark, but found no direct way to retrieve the original feature names or rank features by importance for interpretation purposes.
Yet another solution is to transform the variables. By merging the compliments each user gave into one column and the compliments each user received into another, I was able to reduce the number of features to 6.
In the graph above, the number of clusters ranges from 2 to 24. The elbow occurs at k = 6, i.e. 6 distinct user profiles can be found.
Alternative Solutions
Alternative Clustering Algorithms
The other algorithms offered by PySpark's MLlib package include Gaussian mixture models, LDA (often used in text mining) and bisecting k-means. Bisecting k-means is a hierarchical clustering algorithm that takes a top-down approach to splitting the data, and is preferable for big datasets.
Subsampling
Subsampling, say, about a third of the data (about 500,000 cases) and calling .toPandas() will allow you to perform feature selection and conduct training in scikit-learn on the newly converted Pandas dataframe.
Bonus: Hyperparameter Tuning
PySpark also enables users to select best models by providing pipelining and hyperparameter tuning functionalities.
Here are some terms you are likely to come across while selecting your model:
ParamMap and ParamGridBuilder
Note: there is also a spark-sklearn library for grid searching.
My Review
Training the model on my laptop was very time consuming. Several attempts at other clustering algorithms and grid searching were simply not feasible because of memory constraints.
Nevertheless, PySpark is a great tool if you want to be able to handle large data for free, and many companies use it today. I’ve also recently learnt of the existence of PySparkling, which is a combination of H2O, an automated, easy-to-use machine learning platform, and PySpark.
What You Can Be Looking Forward To
In part 2, we will walk through cluster analysis and present to you the findings from each cluster.
In summary, we recapped how PySpark handles big data, how to set the system up on Colab, some of the common methods used when working with dataframes (as opposed to RDDs), and how to train a model in PySpark.
Hope you enjoyed reading this article! Feel free to leave your thoughts (or tips) below.
References
[1] Asif Ahmed, PySpark in Google Colab, Towards Data Science
[2] Machine Learning Library (MLlib) Guide, Apache Spark 2.4.3