Automatic Feature Selection in Python: An Essential Guide



Davis David (@davisdavid)

Data Scientist | AI Practitioner | Software Developer. Giving talks, teaching, writing.

Feature selection in Python is the process of automatically or manually selecting the features in a dataset that contribute most to the prediction variable or output you are interested in.


One major reason is that machine learning follows the rule of “garbage in, garbage out”, which is why you need to pay close attention to the features being fed to the model. Keep in mind that not all features present in your dataset are important for achieving the best model performance.


Top 4 Reasons to Apply Feature Selection in Python:

  1. It improves the accuracy of a model if the right subset is chosen.
  2. It reduces overfitting.
  3. It enables the machine learning algorithm to train faster.
  4. It reduces the complexity of a model and makes it easier to interpret.

“I prepared a model by selecting all the features and I got an accuracy of around 65% which is not good for a predictive model and after doing some feature selection and feature engineering without doing any logical changes in my model code my accuracy jumped to 81% which is quite impressive” - Raheel Shaikh.

In this article, you will learn how to automatically select important features by using an open-source Python package called featurewiz.


What is Featurewiz?

Featurewiz is an open-source Python package that automatically creates and selects the important features in your dataset to build the best-performing model. It also uses advanced feature engineering strategies to create new features before selecting the best set of features, all with a single line of code.


Note: Featurewiz can automatically detect if the problem is regression or classification.


How does it work?

Featurewiz uses the SULOV algorithm and Recursive XGBoost to reduce the feature set and select the best features for the model.


(a) SULOV
SULOV stands for Searching for Uncorrelated List Of Variables. The algorithm works in the following steps (a minimal sketch follows the list).

  1. First step: find all the pairs of highly correlated variables that exceed a correlation threshold (say, an absolute correlation of 0.8).
  2. Second step: compute each variable's Mutual Information Score with the target variable. The Mutual Information Score is a non-parametric measure, so it is suitable for all kinds of variables and targets.
  3. Third step: take each pair of correlated variables and knock off the one with the lower Mutual Information Score.
  4. Final step: collect the variables with the highest Mutual Information Scores and the least correlation with each other.
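
To make these steps concrete, here is a minimal sketch of the SULOV idea for a classification target. This is only an illustration, not featurewiz's exact implementation; the function name sulov_sketch and the use of scikit-learn's mutual_info_classif are my own choices.

# a minimal SULOV-style sketch (illustration only, not featurewiz's code)
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def sulov_sketch(X, y, corr_threshold=0.8):
    # step 1: absolute pairwise correlations between features
    corr = X.corr().abs()
    # step 2: mutual information of each feature with the target
    mi = pd.Series(mutual_info_classif(X, y), index=X.columns)
    cols = list(X.columns)
    to_drop = set()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > corr_threshold:
                # step 3: knock off the variable with the lower MI score
                to_drop.add(a if mi[a] < mi[b] else b)
    # step 4: keep the uncorrelated, high-MI variables
    return [c for c in cols if c not in to_drop]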

(b) Recursive XGBoost
After keeping the features with low correlation and high Mutual Information Scores, Recursive XGBoost is used to find the best features among those that remain. Here is how it works (a minimal sketch follows the list).

  1. First step: select all features in the dataset and split the data into train and valid sets.
  2. Second step: find the top X features on the train set, using the valid set for early stopping (to prevent overfitting).
  3. Third step: take the next set of features and find the top X among them.
  4. Final step: repeat this five times, then combine all the selected features and de-duplicate them.
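
Here is a minimal sketch of the recursive-XGBoost idea, again an illustration rather than featurewiz's exact implementation: it walks through the features in chunks (the "next set of features" each round), keeps each round's top X by importance, and de-duplicates at the end. The function name and parameters are my own choices, the sketch assumes a multi-class target, and the early_stopping_rounds constructor argument needs xgboost >= 1.6.

# a recursive-XGBoost-style sketch (illustration only)
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

def recursive_xgboost_sketch(X, y, top_x=4, rounds=5):
    # step 1: split the data into train and valid sets
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)
    selected = []
    # steps 2-4: take the next set of features each round and find its top X
    for chunk in np.array_split(np.array(X.columns), rounds):
        chunk = list(chunk)
        model = xgb.XGBClassifier(n_estimators=100, early_stopping_rounds=10,
                                  eval_metric='mlogloss', verbosity=0)
        # the valid set is used for early stopping, to prevent overfitting
        model.fit(X_tr[chunk], y_tr, eval_set=[(X_va[chunk], y_va)], verbose=False)
        order = np.argsort(model.feature_importances_)[::-1]
        selected.extend(np.array(chunk)[order][:top_x])
    # final step: combine all selected features and de-duplicate them
    return list(dict.fromkeys(selected))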

Installation

The package requires xgboost, NumPy, pandas, and matplotlib. It should run on most Python 3 versions.


You can install Featurewiz from PyPI.

pip install featurewiz

How to Use Featurewiz for Feature Selection in Python

We will use the Mobile Price dataset to find the best features for predicting the price range, which has four classes:

  • 0 (low cost)
  • 1 (medium cost)
  • 2 (high cost)
  • 3 (very high cost)

You can download the dataset here.


Import the required Python packages.

# import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from featurewiz import featurewiz

# fix the random seed for reproducibility
np.random.seed(1234)

Load the Mobile Price dataset.

data = pd.read_csv('../data/train.csv')
 
data.shape

(2000, 21)


The dataset contains 21 columns (20 features and 1 target), and luckily it has no missing values.
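
You can confirm this with a quick check (an optional step, not part of the original walkthrough):

# total count of missing values across all columns; 0 confirms none
data.isnull().sum().sum()

0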


Split the data into independent features and target.

X = data.drop(['price_range'], axis=1)

y = data.price_range.values

Then standardize the features using StandardScaler from scikit-learn.

X_scaled = StandardScaler().fit_transform(X)

Split the data into train and validate sets. 20% of the data will be used for validation.

X_train, X_valid, y_train, y_valid = train_test_split(X_scaled, y, test_size=0.2, stratify=y, random_state=1)

Create and train the RandomForestClassifier on the train set.

classifier = RandomForestClassifier()
 
classifier.fit(X_train,y_train)

Make a prediction on the validation set and then check model performance.

# make prediction 
preds = classifier.predict(X_valid) 
# check performance
accuracy_score(preds,y_valid) 

The model accuracy is 88% when we use all 20 features available in the dataset.


Now you can use Featurewiz to automatically select the best set of features that will give the best model performance.

# automatic feature selection using the featurewiz package
target = 'price_range'

features, train = featurewiz(data, target, corr_limit=0.7, verbose=2, sep=",",
                             header=0, test_data="", feature_engg="", category_encoders="")

In the featurewiz call, we pass the dataset and the name of the target variable. You can also change the correlation limit via corr_limit (the default is 0.70).


During the selection process, it will show the following output.

Skipping feature engineering since no feature_engg input...
Skipping category encoding since no category encoders specified in input...
Loading train data...
Shape of your Data Set loaded: (2000, 21)
Loading test data...
    Filename is an empty string or file not able to be loaded
############## C L A S S I F Y I N G  V A R I A B L E S  ####################
Classifying variables in data set...
    20 Predictors classified...
        No variables removed since no ID or low-information variables found in data set
No GPU active on this device
    Running XGBoost using CPU parameters
Removing 0 columns from further processing since ID or low information variables
    columns removed: []
    After removing redundant variables from further processing, features left = 20
#### Single_Label Multi_Classification Feature Selection Started ####
Searching for highly correlated variables from 20 variables using SULOV method
#####  SULOV : Searching for Uncorrelated List Of Variables (takes time...) ############
    No highly correlated variables in data set to remove. All selected...
    Adding 0 categorical variables to reduced numeric variables  of 20
############## F E A T U R E   S E L E C T I O N  ####################
Current number of predictors = 20 
    Finding Important Features using Boosted Trees algorithm...
        using 20 variables...
        using 16 variables...
        using 12 variables...
        using 8 variables...
        using 4 variables...
Selected 16 important features from your dataset
    Time taken (in seconds) = 19
Returning list of 16 important features and dataframe.

As you can see, Featurewiz selects 16 important features from the dataset. The featurewiz function returns two objects:

  • features - a list of the selected feature names.
  • train - a dataframe that contains only the selected features and the target variable.
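
As a quick sanity check (optional), the returned dataframe should hold the 16 selected features plus the target, so its shape should be (2000, 17):

train.shape

(2000, 17)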

Now you can train the RandomForestClassifier again with the selected features and see if the model performance improves.


Let's see the list of selected features.

print(features)

['ram',
 'battery_power',
 'px_height',
 'px_width',
 'touch_screen',
 'mobile_wt',
 'int_memory',
 'three_g',
 'sc_h',
 'four_g',
 'sc_w',
 'n_cores',
 'fc',
 'pc',
 'talk_time',
 'wifi']


Split the returned dataframe into the selected features and the target.

# split the data into features and target
X_new = train.drop(['price_range'], axis=1)

y = train.price_range.values

Then standardize the selected features using StandardScaler from scikit-learn.

# standardize the selected features
X_scaled = StandardScaler().fit_transform(X_new)

Split the data into train and validate sets. 20% of the data will be used for validation.

# split data into train and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X_scaled, y, test_size=0.2, stratify=y, random_state=1)

Create and train the RandomForestClassifier on the train set again.

# create and train classifier 
classifier = RandomForestClassifier()
 
classifier.fit(X_train,y_train)

Make a prediction on the validation set and then check model performance.

# make prediction 
preds = classifier.predict(X_valid) 
# check performance
accuracy_score(preds,y_valid) 

0.905

The model accuracy has increased from 88% to 90.5% when we use the best selected features (16 out of 20) from the dataset.


Final Thoughts on Feature Selection in Python

In this article, you have learned how to automatically select important features with the Featurewiz package. You can also use Featurewiz on any multi-class or multi-label dataset, so you can have as many target labels as you want.
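
For example, for a multi-label dataset you can pass a list of target column names instead of a single string. The call below is a sketch; label_1 and label_2 are hypothetical column names standing in for your own targets.

# hypothetical multi-label call: 'label_1' and 'label_2' are placeholders
features, train = featurewiz(data, ['label_1', 'label_2'], corr_limit=0.7, verbose=2)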


There are more options available in the Featurewiz package. I recommend reading about them here.


If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!


You can also find me on Twitter @Davis_McDavid.


And you can read more articles like this here.


Want to keep up to date with all the latest in Python? Subscribe to our newsletter in the footer below.
