
Normalize before Training

source link: https://medium.com/@sunnerli/normalize-before-training-872858e332a1


In the previous article, I normalized the training data manually. However, after checking some kernels on Kaggle, I found some interesting methods worth learning. This article records them.

Now we want to predict whether a country has a nuclear weapon. The table above shows the data CSV file, named nuclear.csv. Each row represents one country: the first attribute is the number of soldiers in that country, the second is the total population, and the last indicates whether the country has a nuclear weapon (1 means it does).

The first function I find very useful is train_test_split. It splits the whole x and y into training data and testing data. The shape of x is [row_num, feature_num], while the shape of y is [row_num].
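A minimal sketch of that split, using made-up numbers in place of the article's nuclear.csv (the actual values here are hypothetical, chosen only to match the [row_num, feature_num] shape described above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data shaped like the article describes: x is [row_num, feature_num],
# y is [row_num]. The values are invented for illustration.
x = np.array([[100000, 700000],
              [250000, 900000],
              [50000, 300000],
              [400000, 1200000]])
y = np.array([0, 1, 0, 1])

# Hold out 25% of the rows as the testing set.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42)

print(x_train.shape)  # (3, 2)
print(y_test.shape)   # (1,)
```

Passing random_state makes the split reproducible between runs.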

The second idea worth learning is MinMaxScaler. This class scales each column into the range 0~1. The advantage of MinMaxScaler is that it makes the variation within a column more obvious.

The left side shows a simple example. A difference of 0.5 is relatively small compared to the mean (around seven hundred thousand), but the difference becomes obvious after MinMaxScaler is applied.
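The effect described above can be reproduced in a few lines (the column values are hypothetical, picked to match the seven-hundred-thousand example):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A column whose values differ by only 0.5 around seven hundred thousand.
col = np.array([[700000.0], [700000.5], [700001.0]])

# MinMaxScaler maps the column's minimum to 0 and maximum to 1.
scaler = MinMaxScaler()
scaled = scaler.fit_transform(col)
print(scaled.ravel())  # [0.  0.5 1. ]
```

The 0.5 gap, invisible next to the raw magnitude, becomes half of the scaled range.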

MinMaxScaler is a class I often see in Kaggle kernels. On the other hand, StandardScaler is another class that can help us arrange the data: it makes the mean of each column 0 and the variance of each column 1.
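A short sketch of StandardScaler's behavior, again on invented numbers, verifying the mean-0 / unit-variance property:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented [soldiers, population] rows for illustration.
x = np.array([[100000.0, 700000.0],
              [250000.0, 900000.0],
              [400000.0, 1100000.0]])

scaler = StandardScaler()
standardized = scaler.fit_transform(x)

# Each column now has mean 0 and standard deviation 1.
print(standardized.mean(axis=0))  # ~[0. 0.]
print(standardized.std(axis=0))   # ~[1. 1.]
```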

However, both scalers have their own disadvantages. MinMaxScaler constrains the values to the range 0~1, so if the testing data falls outside the range seen during fitting, the model might produce incorrect predictions. StandardScaler simply standardizes the column values, but as a result some of them become negative, and some architectures (e.g. those using ReLU) might be sensitive to this property.

The last part is training and testing the model! I use a random forest to make the prediction, even though the result is not good enough.
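Putting the pieces together, the whole pipeline might look like the sketch below. The data is a hypothetical stand-in for nuclear.csv, and the model settings are ordinary defaults rather than the article's actual configuration. Note that the scaler is fit on the training data only, then reused on the test data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for nuclear.csv: [soldiers, population] -> has_nuke.
x = np.array([[100000, 700000], [250000, 900000], [50000, 300000],
              [400000, 1200000], [80000, 500000], [300000, 1000000]])
y = np.array([0, 1, 0, 1, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0)

# Fit the scaler on the training split only, then apply it to the test split,
# so no information from the test data leaks into the preprocessing.
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
```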

In conclusion, we can use train_test_split to get the corresponding sets of data, and MinMaxScaler and StandardScaler can help us normalize the features before doing further work!

