
SAP Tech Bytes: Feature Engineering using Data Wrangling

source link: https://blogs.sap.com/2021/08/17/sac-feature-engineering-data-wrangling-kaggle-titanic/
August 17, 2021 | 6 minute read


In the previous posts of the series, we built a predictive model using Smart Predict in SAP Analytics Cloud with just a few clicks, and then we looked at some key performance indicators of the model that Smart Predict built for us.

Improving Model Performance

Smart Predict greatly simplifies the process of training a Machine Learning model thanks to the automation of many steps and decisions. But there are still techniques available to us, such as feature engineering or the imputation of missing values, that we can use to improve model performance.


These and other techniques were presented by Stuart Clarke in unit 3 “Improving Model Performance” of week 5 of the openSAP course Getting Started with Data Science.

What is Feature Engineering

The post Feature Engineering Tips for Data Scientists gives a simple and intuitive explanation:

Feature engineering is simply the thoughtful creation of new input fields from existing input data. Thoughtful is the key word here. The newly created inputs must have some relevance to the model output and generally come from knowledge of the domain.

There are some variables in the Kaggle Titanic datasets, like the Name or the Cabin number, which seemed irrelevant at first. But can we derive more meaningful variables from them?

Data wrangling

Let’s try to apply our (intuitive and acquired) domain knowledge to improve model performance. We can create new variables in the training and test datasets using the data-wrangling capabilities of SAP Analytics Cloud.

Check Introduction to Smart Data Wrangling by Josef Hampp to learn more.

Titles

Honorifics, including common titles, were in widespread use at that time, and we have them as part of the Name column. Let's extract them as a new category.

We used the Custom Expression Editor and the Wrangling Expression Language (WEL) to define our own transformations in the previous post already.

While in the train dataset, open the Custom Expression Editor and define a new column Title based on the following expression.

[Title] = element(split(element(split([Name],'.',1),1),",",2),2)

It splits each name based on the observation that names follow a pattern: the last name at the beginning, followed by a comma, then a title (or honorific), then a dot and the rest of the name.
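If you want to double-check the split logic outside of SAP Analytics Cloud, here is a minimal sketch of the same transformation in Python with pandas (assuming the Kaggle train.csv file is available locally; the file name and column names are taken from the Kaggle dataset):

import pandas as pd

# Assumption: the Kaggle Titanic training file is stored locally as train.csv.
df = pd.read_csv("train.csv")

# Name follows the pattern "<Last name>, <Title>. <Rest of the name>":
# take everything before the first dot, then everything after the comma.
df["Title"] = (
    df["Name"]
    .str.split(".", n=1).str[0]
    .str.split(",", n=1).str[1]
    .str.strip()
)

print(df["Title"].value_counts())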


Frankly, I had never heard of Jonkheer before! We live and learn.


Save this modified dataset as train_engineered in the Titanic folder.

Age groups

The Age variable had a continuous statistical data type when we trained the previous model. Can we apply our human knowledge to split passengers into groups based on their known age?

Different approaches are possible, like five-year age groups or a life-cycle grouping. Based on a bit of research, I came up with the grouping below. But this is one of those features where several approaches can be modeled and tested to get the best model.

[AgeCategory] = if(
	[Age] >= 65, 
	'Elderly', 
	if(
		[Age] >= 45, 
		'Senior', 
		if(
			[Age] >= 18, 
			'Adult', 
			if(
				[Age] >= 2, 
				'Child', 
				if(
					[Age] > 0, 
					'Baby', 
					null
				)
			)
		)
	)
)
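For reference, the same buckets can be reproduced outside of SAC; a sketch using pandas.cut, assuming df is the Titanic training DataFrame loaded with pandas, could look like this:

import pandas as pd

# Assumption: df is the Titanic training DataFrame; Age may contain missing values.
# Left-closed bins matching the WEL expression above. An age of exactly 0 does not
# occur in the dataset, and missing ages stay missing, like the null branch above.
bins = [0, 2, 18, 45, 65, float("inf")]
labels = ["Baby", "Child", "Adult", "Senior", "Elderly"]
df["AgeCategory"] = pd.cut(df["Age"], bins=bins, labels=labels, right=False)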


Save the dataset.

Traveling alone

While the variables Parch (the number of parents or children traveling with a passenger) and SibSp (the number of siblings or spouses traveling with a passenger) do not seem to influence the target much on their own, we can use them to create a new feature Alone. It is a Boolean variable, where True represents a passenger traveling alone.

[Alone]=in([Parch]+[SibSp],0)

The function in( <search_expr> , <expr_1> , ... , <expr_20> ) indicates whether any of the specified <expr_N> values (here: 0) is equal to the <search_expr> value (here: [Parch]+[SibSp], the sum of the Parch and SibSp variables). You can use the in-place help of the Custom Expression Editor to find or check the available functions.
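Outside of SAC, the same flag is a one-liner; a pandas sketch (assuming df is the training DataFrame) is shown here only to illustrate the logic:

# Assumption: df is the Titanic training DataFrame with Parch and SibSp columns.
# A passenger travels alone when the total count of relatives on board is zero.
df["Alone"] = (df["Parch"] + df["SibSp"]) == 0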


Save the dataset.

Side and Deck

Even the seemingly useless Cabin variable can be used to derive new features once we know (after a little research and finding appropriate references) that its first letter represents the deck, and the parity of its last digit represents the side of the ship.

[Side] = in(toInteger(substr([Cabin], length([Cabin]), 1)), 1, 3, 5, 7, 9)

The value 1 represents the starboard (or "right") side of a vessel, and the value 0 represents the port (or "left") side.

[Deck] = if([Cabin]=='', 'Unknown', substr([Cabin], 1, 1))
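For comparison, a rough pandas equivalent of these two expressions (again assuming df is the training DataFrame) could look like the sketch below; note that it maps cabins without a trailing digit to False for Side, whereas the WEL expression may return null for them.

import pandas as pd

# Assumption: df is the Titanic training DataFrame; Cabin is mostly missing.
cabin = df["Cabin"].fillna("")

# Deck: the first letter of the cabin, or "Unknown" when no cabin is recorded.
df["Deck"] = cabin.str[:1].replace("", "Unknown")

# Side: odd last digit -> starboard (True), even last digit -> port (False).
# Missing cabins and cabins ending in a letter end up as False here.
last_digit = pd.to_numeric(cabin.str[-1:], errors="coerce")
df["Side"] = (last_digit % 2) == 1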

Unfortunately, with almost 77% of the Cabin values missing, we should not expect these variables to be influential, but it was worth a try.


I would not expect substituting the missing values with the mean or the most common category to make sense here, so I leave these values as calculated.

Save the dataset.

So, we created several new columns in the dataset while practicing writing expressions with the formulas provided by SAP Analytics Cloud's data wrangling capabilities.

Train new model(s)

We added five new variables, or features, to the original dataset. It is time to train a new predictive model and see what improvements, if any, we get!


  1. Go to Predictive Scenarios.
  2. Open Titanic, which we created in the first post.
  3. Create a new model. I added the description “A train dataset with new engineered features”.
  4. Select the train_engineered dataset as the training data source.
  5. Click Edit Column Details and check PassengerId as the key column.
  6. Select Survived as the Target.
  7. Exclude the following variables from Influencers: Age, SibSp, Parch, Name, Ticket, Cabin.
  8. Click Train.

A new Model 2 should be created, trained, and automatically deployed.

Debriefing

Once Model 2 is trained, we can compare its performance indicators with those of the previous model. Indeed, we achieved an improvement of about 2 percentage points in Predictive Power and about 3 percentage points in Prediction Confidence.


Let’s check Influencer Contributions.


The first thing I noticed is that while the original Age variable was the #3 contributor in Model 1, in the new Model 2 the AgeCategory feature contributes the least to explaining the target.

Another iteration

The sequence of activities in the machine learning process is not strict, and moving back and forth between different phases is normal.


In real life, we might go through a few or sometimes even dozens of iterations of training models to find the one with the best performance. That’s the reality of Data Science.

So, based on the observation that Age was more influential than our AgeCategory variable, we might either review and modify how we group the age values, or test the hypothesis that the plain Age variable is better categorized by Smart Predict's automated algorithms.

Once again, let's create a new model. It will be similar to the previous Model 2, with the only difference that this time we exclude AgeCategory instead of Age from the influencers. Train the model.


This time the third model's Prediction Confidence slightly improved over Model 2, but the Predictive Power dropped to the level of Model 1. In simple words, based on the validation subset of the training dataset, it makes slightly more incorrect predictions than Model 2, but it makes them more consistently than Model 1 🤓

Now, it is your turn!

If you followed this series and created these models based on the Titanic dataset, it is time to practice on your own: see if you can improve the model any further, maybe by coming up with more engineered features, removing or imputing missing values, or practicing feature selection.


Have fun data-sciencing, and please share your results in the comments below!
-Vitaliy, aka @Sygyzmundovych

