6

Data Wrangling Solutions— Splitting Column with Each Cell Containing List of Val...

 3 years ago
source link: https://medium.com/analytics-vidhya/data-wrangling-solutions-splitting-column-with-each-cell-containing-list-of-values-fc95d14fa539
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Data Wrangling Solutions— Splitting Column with Each Cell Containing List of Values

A common data wrangling challenge and a one-step solution.

0*cv_yvgTzHVcg6t4j?q=20
data-wrangling-solutions-splitting-column-with-each-cell-containing-list-of-values-fc95d14fa539
Photo by Joshua Sukoff on Unsplash

During the data preparation stage of an analytics project, a common challenge is to have a list of values in a table’s column. Typically, in a scenario like this, an analyst would like to split it into multiple records. A visual representation of this case is as below:

1*4iCwLdP9ZqBqRYsQsCe8KQ.png?q=20
data-wrangling-solutions-splitting-column-with-each-cell-containing-list-of-values-fc95d14fa539
Sample Scenario (Image by Author)

In this tutorial, we will demonstrate a step-by-step approach to handle this case. Depending upon the version of the Pandas library, one can follow different solutions detailed below.

Assumption and Recommendation

Being hands-on is the key to master programming. We recommend that you continue to implement the codes as you follow through with the tutorial. To enable this, you can access the sample data used in this tutorial and the associated notebook from the Scenario_4 folder of this GitHub link.

If you are new to GitHub and want to learn it, please go through this tutorial. To set up a new Python environment on your system, please go through this tutorial.

Following is the list of pandas functions/ methods used in this tutorial:

  1. read_csv
  2. String Accessors
  3. apply
  4. merge
  5. dropna
  6. explode

Let’s get started

Step 1 — Keeping the data ready

For this tutorial, we have scraped news articles that talk about companies going bankrupt in Canada. We also used a few NLP techniques to extract keywords from each sentence of the scraped article. The data dictionary for this data set and the sample data snapshot is as follows:

  • Article_Number — Unique number to identify an article
  • Sentence_Number — Unique number to identify the sentence of the article
  • Article_Publish_Date — Publish date of the article
  • Keyword — Keyword used for web searching the article
  • Article_Body — Actual text of the article
  • Article Keywords — Important keywords identified from the article sentence
1*krBWH7_kIa6-WZqATKHVrA.png?q=20
data-wrangling-solutions-splitting-column-with-each-cell-containing-list-of-values-fc95d14fa539
Sample Data Snapshot (Image by Author)

Notice, the article keywords are present in the form of a list of comma-separated text values. Our objective is to split it into multiple records.

Step 2 — Importing packages and data files.

Once you have the data available, the next step is to import it into your Python environment.

#### Sample Code
#### Importing Pandas
import pandas as pd#### Importing Data File - Change the Windows Folder Location
imp_data = pd.read_csv("C:\\Ujjwal\\Analytics\\Git\\Scenario_4\\Sample_Data.csv")

We have used the Pandas read_csv function to read the data in Python.

Step 3 — Converting the keywords text into a list of values.

One of the critical steps of this task is to convert the comma-separated text into a Python sequence object (list in this case). Following is the sample code:

#### Converting the keywords to a Python list object
imp_data["Article_Keywords_Converted"] = imp_data["Article_Keywords"].str.split(",")

Notice the use of string accessors and the associated split method to achieve the desired result. The list of text values is now available as a list object, where each keyword is a list component.

Step 4a — Users using Pandas version less than 0.25.0

Before version 0.25.0, Pandas didn’t have any ready-made function/method to solve a scenario like the one presented above. If you are on the old Pandas version and can’t move to the latest version, follow these simple steps:

  • Convert the column with the list object to a Pandas series and store it in a temporary data frame. By converting the list object column to the Pandas series, each list component will separate into multiple columns of a new data frame. The sample code and the structure of the new data frame are as follows:
#### Sample Code
temp_df = imp_data["Article_Keywords_Converted"].apply(pd.Series)
1*y7urBURM-7MOfh0EYr6wzQ.png?q=20
data-wrangling-solutions-splitting-column-with-each-cell-containing-list-of-values-fc95d14fa539
Sample values after conversion (Image by Author)
  • Merge the temporary data frame with the original data frame. Since the newly created data frame will have the same index as the original data frame, we can merge the two using the left and right index.
#### Sample Code
imp_data = pd.merge(imp_data, temp_df, right_index=True, left_index=True, how = "left")
  • Melt the newly added columns. This step will bring all the individual keyword values into separate rows. To learn about melting in pandas, read through this tutorial.
#### Sample Code
imp_data = pd.melt(imp_data,
id_vars=['Article_Number','Sentence_Number',
'Article_Publish_Date','Keyword',
'Article_Body','Article_Keywords',
'Article_Keywords_Converted'],
value_name="Keywords")
1*9uAykMVLWxoJwOYtdTBf-w.png?q=20
data-wrangling-solutions-splitting-column-with-each-cell-containing-list-of-values-fc95d14fa539
Output after Melting dataframe (Image by Author)
  • Finally, drop all the records which have null values in the newly created Keywords field. Also, drop all the temporary or unwanted columns. The final data frame will look something like the snapshot shared below:
#### Sample Code
imp_data.dropna(inplace = True)
imp_data.drop(["Article_Keywords", "Article_Keywords_Converted", "variable"], axis = 1, inplace=True)
1*ILxGOk174ZG8E5ZysGAQpw.png?q=20
data-wrangling-solutions-splitting-column-with-each-cell-containing-list-of-values-fc95d14fa539
Final Output (Image by Author)

Step 4b— Users on Pandas version 0.25.0 and above

In version 0.25.0, Pandas introduced a new data frame method called explode. This function does the job for us in a single step. Use this method on the data frame and pass the column name as the argument. It will automatically split the sequence of values into multiple records. The sample code is as follows:

#### Converting the list of values into multiple records
imp_data = imp_data.explode("Article_Keywords_Converted")
imp_data.head()

Closing note

I am sure that the last section of the tutorial must have exploded your mind. With such ready-made solutions, our data wrangling tasks become easy to implement. I hope that the solutions presented above were helpful.

HAPPY LEARNING ! ! ! !


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK