
Avoid These Deadly Modeling Mistakes that May Cost You a Career



I love seeing data scientists use advanced packages, create dazzling exhibits, and experiment with different algorithms. A data scientist can keep the computer burning for a whole day; a cool T-shirt, a cup of coffee, and a laptop are all he or she needs. Yet as impressive as their titles sound, some novice data scientists keep committing a handful of common mistakes, mistakes I call deadly. These rudimentary mistakes will hurt the credibility of a data scientist and may cost a promising data science career. So my goal in this article is simple: I hope that, after reading it, you will never commit these types of mistakes.

(1) Why Is the “Datetime” Variable the Most Significant Variable?

Be careful with any datetime field in the yymmdd:hhmmss format. You cannot blindly feed this variable into any tree-based method. As shown in the exhibit, the variable appears at the top of the variable importance chart. The reason is that this field is almost a unique identifier for each record; using it is like using the 'id' field in your decision trees. I am almost certain you need to derive year, month, day, weekday, and so on from this field instead. Remember, the goal of feature engineering is to capture repeatable patterns, such as month or week of the year, in order to predict the future. In some models you may use 'year' as a variable just to explain special volatility in the past, but you should never use the raw datetime field as a predictor.

[Exhibit: variable importance chart with the raw datetime field ranked at the top]
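A minimal sketch of the kind of derivation I mean, assuming a hypothetical data frame `df` with a character column `dt` in yymmdd:hhmmss format:

```r
# Parse the raw field (hypothetical column `dt`), then derive
# repeatable calendar features from it.
df$dt_parsed <- as.POSIXct(df$dt, format = "%y%m%d:%H%M%S")

df$year    <- as.integer(format(df$dt_parsed, "%Y"))
df$month   <- as.integer(format(df$dt_parsed, "%m"))
df$day     <- as.integer(format(df$dt_parsed, "%d"))
df$weekday <- weekdays(df$dt_parsed)
df$hour    <- as.integer(format(df$dt_parsed, "%H"))

# Drop the raw datetime before modeling; keep only the derived features.
df$dt <- NULL
df$dt_parsed <- NULL
```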

(2) Be careful about the ‘0’, ‘-99’ or ‘-999’ in a variable


These are usually missing values that the system has encoded as extreme sentinel values. In a parametric regression, do not use them blindly as numeric values. Nor can this prevailing issue be handled blindly by imputation software such as library(mice); the sentinels must be recoded as missing first, because mice only imputes cells that are actually NA.
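A minimal sketch, assuming a hypothetical data frame `df` with a numeric column `income` that uses -99 and -999 as missing-value codes:

```r
library(mice)

# Recode the sentinel values as genuine missing values first;
# mice only imputes cells that are NA.
df$income[df$income %in% c(-99, -999)] <- NA

# Then impute (5 imputed data sets, fixed seed for reproducibility).
imp <- mice(df, m = 5, seed = 1)
df_complete <- complete(imp)
```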

(3) What to Do if a Continuous Variable Has ‘NA’, ‘0’, ‘-99’ or ‘-999’?

As a first step, I advise you to bin the continuous variable, leaving the special values '0', '-99', '-999', and 'NA' as their own categories (other methods are available as well). First, get the cut points of the variable:
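A minimal sketch, assuming a hypothetical continuous column `df$balance` that carries the special codes:

```r
# Exclude the special codes, then compute quantile-based cut points
# (quintiles here) on the remaining valid values.
valid <- df$balance[!is.na(df$balance) & !(df$balance %in% c(0, -99, -999))]
cuts  <- quantile(valid, probs = seq(0, 1, by = 0.2))
cuts
```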

The quantile looks like this:

[Exhibit: quantile output listing the cut points of the variable]

Then use the above cut points to bin the variable into a new variable. The code below also keeps the special values. I use the function cut() to turn the continuous variable into a categorical one, and case_when() to assign '-999', '-99', '0', and 'NoData' to their own categories. (Notice the language is R, but the concept applies to other languages as well.)

[Exhibit: R code binning the variable with cut() and case_when()]
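A sketch of that binning step, under the same assumptions about the hypothetical `df$balance` and the `cuts` computed above:

```r
library(dplyr)

df <- df %>%
  mutate(balance_bin = case_when(
    balance == -999 ~ "-999",              # system missing code
    balance == -99  ~ "-99",               # system missing code
    balance == 0    ~ "0",                 # often "no data" rather than a true zero
    is.na(balance)  ~ "NoData",
    TRUE ~ as.character(cut(balance, breaks = cuts, include.lowest = TRUE))
  ))
```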

(4) You force a categorical variable to be a cardinal variable

You want to convert a categorical variable to a numeric variable in order to run a regression, but you mistakenly force the category labels into numbers. Below is a categorical variable, "AP006". A data scientist mistakenly converts it to a numeric variable. If this new variable is employed in a regression, the model treats the brand 'android' as two times the value of 'h5', an ordering and scale that do not exist.

[Exhibit: the categorical variable AP006 with its levels forced into numeric codes]
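A minimal sketch of the wrong and right conversions, assuming a hypothetical data frame `df` with the column `AP006` and a numeric target `y`:

```r
# Wrong: arbitrary numeric codes imply an order and a scale.
df$AP006_num <- as.numeric(factor(df$AP006))  # e.g. levels coded 1, 2, 3, ... alphabetically

# Right: keep it a factor; lm() expands it into dummy variables,
# one indicator per level, with no implied ordering.
df$AP006 <- factor(df$AP006)
fit <- lm(y ~ AP006, data = df)
summary(fit)
```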

(5) Forget to Deal with the Outliers in a Regression

[Exhibit: regression line pulled toward a single outlier]

The outlier in the exhibit causes your regression line to tilt toward that point, and your predictions will be biased.
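A toy demonstration on simulated data of how a single outlier drags the fitted slope:

```r
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)   # true slope is 2

y[20] <- 200             # inject one extreme outlier

coef(lm(y ~ x))                  # slope pulled well above 2
coef(lm(y[-20] ~ x[-20]))        # slope close to 2 once the outlier is removed
```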

(6) Require the normal distribution in a regression

The dependent variable Y does not need to follow a normal distribution; that is not an assumption of multiple regression. However, the errors around the predicted Y should follow a normal distribution with a constant variance.

What about the predictors X? Regression does not assume that the predictors follow any particular distribution. The main thing to check is whether outliers exist (box plots work well for this). If they do, capping and flooring techniques should be applied to the predictors.
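A minimal capping-and-flooring (winsorizing) sketch, assuming a hypothetical numeric predictor `df$income`:

```r
# Cap at the 99th percentile and floor at the 1st percentile.
p01 <- quantile(df$income, 0.01, na.rm = TRUE)
p99 <- quantile(df$income, 0.99, na.rm = TRUE)

df$income_capped <- pmin(pmax(df$income, p01), p99)
```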

(7) Do you need to make the distribution assumptions in a decision tree?


In parametric models (such as linear regression), you MUST examine the distribution of the target variable to choose the right distributional family. For example, if the target variable shows a gamma distribution, you need to choose the gamma family in your generalized linear model (GLM). Decision trees, however, make no assumptions about the target variable. The basic principle of a decision tree is to split each parent node into child nodes that are as distinct as possible; it makes no assumption about the distribution of the original or the resulting populations. Hence, the shape of the distribution does not matter when implementing decision trees.
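For the GLM case, a minimal sketch, assuming a hypothetical data frame `df` with a strictly positive target `claim_amount` and predictors `age` and `income`:

```r
# A gamma GLM with a log link, a common choice for positive,
# right-skewed targets such as claim amounts.
fit <- glm(claim_amount ~ age + income,
           data   = df,
           family = Gamma(link = "log"))
summary(fit)
```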

(8) Do you do capping and flooring for the predictors in a decision tree?


In parametric models (such as linear regression), you MUST take care of outliers by capping them at the 99th (or 95th) percentile and flooring them at the 1st (or 5th) percentile. In tree-based algorithms you basically do not need capping and flooring; in other words, decision trees are robust to outliers. Tree algorithms split the data at threshold values based on the ordering of the points, so the magnitude of an outlier barely affects where a split occurs (the sketch below illustrates this). The exact behavior also depends on your hyper-parameter settings.
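A toy illustration on simulated data: blowing one point up into an extreme outlier does not move the split threshold, because the split depends only on which side of the threshold each point falls.

```r
library(rpart)

set.seed(42)
x <- rnorm(100)
y <- factor(ifelse(x > 0, "pos", "neg"))  # class depends only on the sign of x
x[which.max(x)] <- 1e6                    # make the largest x an extreme outlier

fit <- rpart(y ~ x, method = "class",
             control = rpart.control(minsplit = 2, cp = 0))
fit$splits                                # the chosen threshold stays near 0
```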

(9) I got no significant variables or very few variables

You may have set the complexity parameter (cp) too high. The complexity parameter (cp) in rpart is, in essence, the minimum improvement the model needs at each node: the amount by which splitting that node must improve the relative error. If splitting the original root node dropped the relative error from 1.0 to 0.5, the cp of the root node is 0.5.
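A minimal sketch, assuming a hypothetical data frame `df` with target `y`: grow the tree with a low cp, inspect the complexity table, then prune back.

```r
library(rpart)

# Grow a deliberately deep tree with a low cp so candidate splits
# are not filtered out prematurely.
fit <- rpart(y ~ ., data = df,
             control = rpart.control(cp = 0.001))

printcp(fit)                  # relative error at each value of cp

# Prune back to a sensible complexity once you have seen the table.
pruned <- prune(fit, cp = 0.01)
```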

