4

Data Wrangling Solutions — Working With Dates — Part 2

 3 years ago
source link: https://towardsdatascience.com/data-wrangling-solutions-working-with-dates-part-2-58a4393c8218
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Data Wrangling Solutions — Working With Dates — Part 2

Convert columns from non-date to date datatypes

0*npZ8kJ9rAN1Y150f?q=20
data-wrangling-solutions-working-with-dates-part-2-58a4393c8218
Photo by Steinar Engeland on Unsplash

In the last tutorial, we looked at the various approaches to read the data files containing the date-time variables. In this tutorial, we will discuss how to handle columns of an existing dataframe that contains date-time data but are still read by Python as an object datatype.

Assumption and Recommendation

Being hands-on is the key to master programming. We recommend that you continue to implement the codes as you follow through with the tutorial. The sample data and the associated Jupiter notebook are available in the Scenario_6 folder of this GitHub link.

If you are new to GitHub and want to learn it, please go through this tutorial. To set up a new Python environment on your system, please go through this tutorial.

Following is the list of Python concepts and pandas functions/ methods used in the tutorial:

Pandas functions

  • read_csv
  • to_datetime

Challenges & Solutions

Knowing the dataset

As explained in the last tutorial, we have created a dummy dataset that contains multiple versions of a data field, release_date. The data dictionary of this dummy dataset is as follows:

  • release_date — Actual date column with first date value deleted.
  • release_date_int — Another column containing date information but in an integer format, for example, date 2020–02–12 is present as 20200212 in YYYYMMDD format.
  • release_date_text — Column containing dates in text format, and # as the separator. The format code is YYYY#DD#MM.
  • Year — Column containing only year values of the date data.
  • Month — Column containing only month values of the date data.
  • Day — Column containing information on days

A snapshot of the dataset is shared below:

1*KiUXQuKxjUfqRvp3v5sAwA.png?q=20
data-wrangling-solutions-working-with-dates-part-2-58a4393c8218
Sample Data Snapshot (Image by Author)

In all our scenarios explained below, we have assumed that this dummy data file is read in Python using the vanilla read_csv implementation. The code to read the dummy data and information on each column after reading it is as below:

#### Sample Code
#### Importing Pandas
import pandas as pd#### Importing Data File - Change the Windows Folder Location
imp_data = pd.read_csv("C:\\Ujjwal\\Analytics\\Git\\Scenario_6\\Date_Dataset.csv")#### Checking the dataset information
imp_data.info()#### Sample output<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 release_date 2999 non-null object
1 release_date_int 3000 non-null int64
2 release_date_text 3000 non-null object
3 year 3000 non-null int64
4 month 3000 non-null int64
5 day 3000 non-null int64
dtypes: int64(4), object(2)
memory usage: 140.8+ KB

Challenge 1 — Converting the column containing the date-time data in text format into a date-time object

In this scenario, we are trying to convert the release_date_text column from and object datatype into the date-time datatype. The sample code to demonstrate this is as below:

#### Sample Code
#### Datatype Conversion (Assuming the data is already read into a dataframe)
imp_data["release_date_text_converted"] = pd.to_datetime(imp_data["release_date_text"], format="%Y#%d#%m")#### Checking the dataset information
imp_data.info()#### Sample output<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 release_date 2999 non-null object
1 release_date_int 3000 non-null int64
2 release_date_text 3000 non-null object
3 year 3000 non-null int64
4 month 3000 non-null int64
5 day 3000 non-null int64
6 release_date_text_converted 3000 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 164.2+ KB

Explanation

  • The new column release_date_text_converted is now available as a date-time object.
  • to_datetime — The function expects the first argument to be the column (series in Pandas) that we want to convert into a date-time object. As the second parameter, we have to provide the format in which the data currently exists.

In our case, the data is in YYYY#DD#MM format. In this format, we have represented the year component as %Y, the month component as %m, and the date component as %d.

Challenge 2 — Converting the column containing the date in integral form into a date-time object

This scenario is a minor diversion from the previous one. Here, rather than having the source column in text format, we have it in integral format. The code to implement this solution is as follows:

#### Sample Code
#### Datatype Conversion (Assuming the data is already read into a dataframe)
imp_data["release_date_int_converted"] = pd.to_datetime(imp_data["release_date_int"], format="%Y%m%d")#### Checking the dataset information
imp_data.info()#### Sample output<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 release_date 2999 non-null object
1 release_date_int 3000 non-null int64
2 release_date_text 3000 non-null object
3 year 3000 non-null int64
4 month 3000 non-null int64
5 day 3000 non-null int64
6 release_date_text_converted 3000 non-null datetime64[ns]
7 release_date_int_converted 3000 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(4), object(2)
memory usage: 187.6+ KB

Explanation

  • The new column release_date_int_convertedis now available as a date-time object.
  • to_datetime— The code in this solution remains the same except the input to the format parameter. Since there were no separators between the date components, the revised format we have used is %Y%m%d.

Challenge 3 — Manually adding a new column with a constant date value to the data frame

Sometimes the situation warrants us to create a new date column with some fixed values. A minor modification to the above code can help us achieve this.

#### Sample Code
#### Datatype Conversion (Assuming the data is already read into a dataframe)
imp_data["new_constant_date_column"] = pd.to_datetime("1985-17-03", format="%Y-%d-%m")#### Checking the dataset information
imp_data.info()#### Sample output<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 release_date 2999 non-null object
1 release_date_int 3000 non-null int64
2 release_date_text 3000 non-null object
3 year 3000 non-null int64
4 month 3000 non-null int64
5 day 3000 non-null int64
6 release_date_text_converted 3000 non-null datetime64[ns]
7 release_date_int_converted 3000 non-null datetime64[ns]
8 new_constant_date_column 3000 non-null datetime64[ns]
dtypes: datetime64[ns](3), int64(4), object(2)
memory usage: 211.1+ KB
1*8_6wqkeEFa8Upml1VYRmiw.png?q=20
data-wrangling-solutions-working-with-dates-part-2-58a4393c8218
Sample Output (Image by Author)

Explanation

  • Observe the new column new_constant_date_column containing the constant date values.
  • to_datetime — This time, rather than passing the column containing the date-time objects, we have passed a constant date value to the function. The format parameter is also adjusted accordingly.

Closing note

Given the approach we have taken in this tutorial, can you think of a solution to convert the three date columns: year, month, and day into a single date column? Refer to the Jupyter notebook shared through GitHub Repository for the solution.

In this tutorial, we learned about converting dataframe columns from any data type into a date-time object. In the next tutorial, we will go through the various data-wrangling activities associated with the date-time columns.

HAPPY LEARNING ! ! ! !


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK