Data Scientist’s Tryst with Bell Curve

And why data scientists cannot escape it

For most of us, the most dreaded part of data science and machine learning is the math and statistics involved.

If you’re a scientist, and you have to have an answer, even in the absence of data, you’re not going to be a good scientist.
- Neil deGrasse Tyson

Everyone has their own way of developing a love for data and data science. For me, understanding the basics worked like magic. Once I mastered basic concepts like types of data, distributions, and the shapes of distributions, it was reasonably easy to take a deeper dive into advanced concepts.

Let’s break it down.

The main input in a data science project is observations: in other words “Feature Values”. These feature values (also called variables) can be Quantitative or Qualitative.

In case your anxiety level increased just by reading these two terms and you won’t move forward until you have a look at all the tentacles of Quantitative and Qualitative data, look at the below figure.

Figure: Welcome to the world of data. Image by Author

Don’t be too hard on yourself. Let’s understand these two types.

1. Quantitative/Numerical Data

Photo by Alexander Mils on Unsplash

If you can add, subtract, multiply, and divide the data, it is quantitative. Numerical data is further divided into:

Continuous Data: Measurable data that can take any value within a range. Ex: time in a race, income of a person, age of a person, etc. The time in a race can be any value, and it can be expressed in hours, minutes, or seconds to whatever precision you can measure. There is no constraint on the value.

Photo by Jonathan Chng on Unsplash

Discrete Data: Countable data that can take only certain values, typically integers. Ex: the result of rolling a die, the number of students in a class, the number of petals on a flower. If you roll a die you can get 1, 2, 3, and so on up to a maximum of 6. There are finite possibilities.

Photo by Guillermo Velarde on Unsplash

1.1 Continuous Data

If you are going to work for enterprises like financial institutions or retailers, chances are that you will spend most of your data science life with continuous data. As the name suggests, it is like water. Just as water can flow anywhere, continuous data can take any value.

To understand continuous data, you will have to find answers to the below questions.

What is the mean of the data?
How scattered are the data values? i.e. variance.
How is the data distributed overall with respect to the mean value?
Are there any outliers? i.e. standard deviation.

Although I don’t want to scare you with formulas, it doesn’t hurt to scratch the surface.

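How is the Mean of Continuous Data Distribution Calculated?

The mean is the sum of all values divided by the number of values: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

How is the Variance of Continuous Data Distribution Calculated?

Variance is calculated as the average of the squared differences between the mean and the individual values: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$. When you work with a sample rather than the whole population, the sum is divided by $N - 1$ instead of $N$.

How is the Standard Deviation of Continuous Data Distribution calculated?

The standard deviation is the square root of the variance: $\sigma = \sqrt{\sigma^2}$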
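If you prefer code to formulas, here is a minimal sketch of the three measures using plain Python. The race times are made-up values for illustration, and the variance here divides by N (the population version); divide by N - 1 if your data is only a sample.

```python
import math


def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def variance(values):
    """Average squared distance from the mean (population variance)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def std_dev(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


# Hypothetical race times in seconds, invented for this example.
race_times = [12.0, 15.0, 11.0, 14.0, 13.0]

mu = mean(race_times)
var = variance(race_times)
sigma = std_dev(race_times)

print(mu, var, round(sigma, 4))  # 13.0 2.0 1.4142
```

Python's built-in `statistics` module offers the same calculations as `mean`, `pvariance`, and `pstdev`, so in practice you would rarely write them by hand.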

Continuous Data Distribution:

Photo by Isaac Smith on Unsplash

Now that you understand how to measure specific details like Mean, Variance, and Standard Deviation of continuous data, let’s understand the nature of its distribution.

Two distributions you will meet most often when working with continuous data are listed below.

Normal Distribution
t-Distribution

1.1.1 Normal Distribution

Many of the things around us approximately follow a Normal Distribution.

Strange!!

How about this: if you take the heights of people in your country, create a table of height ranges with the count of persons in each range, and plot it, you will get a normal distribution, and the plot will look similar to the figure below.

Normal Distribution: Image by Author

You might be thinking, this is not possible.

It looks strange, but it is true. A lot of other things in nature, for example blood pressure, IQ, shoe size, birth weight, and to an extent technical stock market data, follow this bell curve shape, where the data centers around the mean and shows a roughly symmetric spread on either side of it.

While we are talking about symmetric spread, you should also know about skewness, which measures how asymmetric a data distribution is around its mean.

Normally distributed data will have 0 skewness.
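To get a feel for skewness, here is a rough sketch that computes it from scratch. It uses the moment-based definition (the third central moment divided by the second central moment raised to the power 1.5), which is one common convention among several and is an assumption on my part. The two small data sets are invented for illustration.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (one common convention)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up examples: one symmetric set, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))         # 0.0, perfectly symmetric
print(skewness(right_skewed) > 0)  # True, the tail is on the right
```

A positive value means a longer tail on the right, a negative value means a longer tail on the left, and a value near 0 means the data is roughly symmetric.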

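You will probably never need it, but in case you do, the equation for plotting this graph is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, where $\mu$ is the mean and $\sigma$ is the standard deviation.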
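If you do want to draw the bell curve yourself, the density can be sketched as a small function. The mean of 170 cm and standard deviation of 10 cm are assumed values picked only for illustration. The result is compared against `NormalDist.pdf` from Python's standard `statistics` module as a sanity check.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the bell curve at x for a given mean and standard deviation."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Assumed parameters for heights in cm, for illustration only.
MU, SIGMA = 170.0, 10.0
heights = NormalDist(mu=MU, sigma=SIGMA)

for x in (150, 160, 170, 180, 190):
    # The hand-written formula and the library value should agree.
    print(x, round(normal_pdf(x, MU, SIGMA), 5), round(heights.pdf(x), 5))
```

The printed values peak at the mean and are identical at equal distances on either side of it, which is the symmetry the bell shape comes from.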

The following are key characteristics of normal distribution.

The mean, mode, and median of the data population are the same.
Most of the data points are centered around the mean.
Data points are scattered around the mean in a symmetrical manner.

Photo by Evan Dennis on Unsplash

If you are still reading this article (I hope you are!), by now you must be wondering why you need to understand data distribution at all.

The answer is one word: generalization.

As a data scientist, you can expect a lot of junk data, outliers, etc. coming your way, and you will be pressed hard to make sense of this data and predict the next course of action based on it.

If you understand the overall nature of the data distribution, you can get rid of outliers and unwanted data and make sense of the information.

Remember this: “There is no chaos in the Universe!”

Data distribution follows a pattern. Barring decision trees, many machine learning models work best when features with continuous data follow a Normal Distribution. You might come across situations where the feature values, by themselves, do not follow a Normal Distribution, but if you apply a function like log to the values, the transformed values follow it much more closely.
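Here is a small sketch of the log trick. The "incomes" are simulated from a log-normal distribution, which is an assumption made purely so the example has right-skewed data to work with, and skewness (moment-based definition, one common convention) is used as a rough check of symmetry.

```python
import math
import random


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (one common convention)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


random.seed(42)  # fixed seed so the run is repeatable

# Simulated, strongly right-skewed feature (hypothetical incomes).
incomes = [random.lognormvariate(10.0, 0.8) for _ in range(10_000)]

# Apply the log function to every value.
log_incomes = [math.log(v) for v in incomes]

print("skewness before log:", round(skewness(incomes), 2))
print("skewness after log: ", round(skewness(log_incomes), 2))
```

The raw values show a large positive skewness, while the log-transformed values land close to 0. Real features will rarely be this well-behaved, so treat the transform as something to try and verify rather than a guarantee.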

Statisticians are fond of the normal distribution. Some statisticians will try to fit every set of continuous observations to a normal distribution. Some believe that if a data population doesn’t follow a normal distribution, it means we don’t have enough observations.

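Any discussion of the normal distribution is incomplete without a mention of the z score. The z score indicates how far a specific data value is from the mean of the data population, measured in standard deviations. The formula for the z score is $z = \frac{x - \mu}{\sigma}$, where $x$ is the data value, $\mu$ is the population mean, and $\sigma$ is the population standard deviation.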
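As a quick illustration, the z score of every value in a data set can be computed like this. The heights are invented numbers, and the sketch treats them as the full population (hence `pstdev` rather than `stdev`).

```python
from statistics import mean, pstdev


def z_scores(values):
    """Distance of each value from the mean, in standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]


# Hypothetical heights in cm, made up for this example.
heights_cm = [160, 165, 170, 175, 180]
scores = z_scores(heights_cm)

for height, z in zip(heights_cm, scores):
    print(height, round(z, 3))
```

The value equal to the mean gets a z score of 0, values below the mean get negative scores, and values above it get positive scores.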

If you calculate the z score of each data point in a normally distributed data population and plot their distribution, with the horizontal axis in units of standard deviation, it will look like the figure below.

https://www.intmath.com/counting-probability/14-normal-probability-distribution.php

This is called the Standard Normal Distribution. Key characteristics of the Standard Normal Distribution are

It follows a Normal distribution.
Mean, median, and mode values are 0, and the standard deviation is 1.
68.27% of the data resides within 1 standard deviation of the mean, 95.45% within 2 standard deviations, and 99.73% within 3 standard deviations.
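You do not have to take these percentages on faith. A minimal sketch using `NormalDist` from Python's standard `statistics` module reproduces them from the cumulative distribution function; `share_within` is just a helper name chosen for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of the data lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k) * 100:.2f}%")
```

The loop prints 68.27%, 95.45%, and 99.73%, matching the figures above.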

The z score will help you find outliers, test a null hypothesis (via the p-value), and perform backward elimination during feature engineering.

Example: at a 5% significance level (two-tailed test), if the z score is less than -1.96 or greater than 1.96, then reject the null hypothesis.
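To show where the 1.96 comes from, here is a sketch of a two-tailed z test on a sample mean. The scenario (a population mean of 170, a known population standard deviation of 10, and a sample of 100 with mean 172.5) is entirely made up, and `two_tailed_z_test` is a hypothetical helper written for this example.

```python
import math
from statistics import NormalDist


def two_tailed_z_test(sample_mean, pop_mean, pop_sigma, n, alpha=0.05):
    """Return the z score, critical value, p-value, and the reject decision."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    z = (sample_mean - pop_mean) / (pop_sigma / math.sqrt(n))
    # Critical value that leaves alpha / 2 in each tail (about 1.96 for 5%).
    critical = standard_normal.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - standard_normal.cdf(abs(z)))
    return z, critical, p_value, abs(z) > critical


# Made-up numbers for illustration.
z, critical, p_value, reject = two_tailed_z_test(172.5, 170.0, 10.0, 100)

print("z score:       ", round(z, 2))
print("critical value:", round(critical, 2))
print("p-value:       ", round(p_value, 4))
print("reject null:   ", reject)
```

Here the z score of 2.5 falls outside the range from -1.96 to 1.96, so the null hypothesis is rejected at the 5% level.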

Before I conclude my favorite topic, Normal Distribution, let me tell you about the Central Limit Theorem (CLT).

As per the central limit theorem, if you take many samples from a data population, calculate the mean of each sample, and plot the frequency of those means, the plot will look like a normal distribution. The larger each sample is, the more closely the sample means align with a normal distribution. This holds true even if the overall data population from which the samples are drawn does not follow a normal distribution.
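You can watch the central limit theorem at work with a short simulation. In this sketch the population is drawn from an exponential distribution, chosen only because it is clearly not bell shaped, and the sample size of 50 and the 2,000 samples are arbitrary choices. Skewness (moment-based definition) is used as a rough measure of how far a shape is from symmetric.

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5 (one common convention)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def sample_means(population, sample_size, num_samples):
    """Draw many samples and return the mean of each one."""
    return [
        fmean(random.sample(population, sample_size))
        for _ in range(num_samples)
    ]


random.seed(0)  # fixed seed so the run is repeatable

# A heavily right-skewed population, nothing like a bell curve.
population = [random.expovariate(1.0) for _ in range(100_000)]
means = sample_means(population, sample_size=50, num_samples=2_000)

print("population skewness:  ", round(skewness(population), 2))
print("sample-means skewness:", round(skewness(means), 2))
print("mean of sample means: ", round(fmean(means), 3))
print("population mean:      ", round(fmean(population), 3))
print("spread of the means:  ", round(pstdev(means), 3))
```

The population is strongly skewed, yet the sample means are far more symmetric and cluster around the population mean. Their spread is roughly the population standard deviation divided by the square root of the sample size.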

Isn’t this Strange!!!

This article is becoming too big. Let’s conclude Normal Distribution and move on to t-distribution.

1.1.2 t-distribution

Now that you understand Normal Distribution and CLT, it’s time to go over t-distribution.

As per the CLT, the sample mean follows an approximately normal distribution as long as the sample size is sufficiently large (a common rule of thumb is at least 30 observations). So, if you know the standard deviation of the data population, you can compute a z score and use the normal distribution to evaluate probabilities for the sample mean.

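What if sample sizes are small and you do not know the standard deviation of the population? When data scientists encounter such constraints, they rely on the t-distribution. The t statistic is calculated as $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $\mu$ is the population mean being tested, $s$ is the sample standard deviation, and $n$ is the sample size.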
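As a rough sketch, the t statistic for a small sample can be computed with the standard library alone. The twelve measurements and the hypothesized mean of 10.0 are invented for illustration, and `t_statistic` is a helper name made up for this example.

```python
import math
from statistics import mean, stdev


def t_statistic(sample, hypothesized_mean):
    """Return the t statistic and the degrees of freedom for one sample."""
    n = len(sample)
    sample_mean = mean(sample)
    sample_sd = stdev(sample)  # sample standard deviation (divides by n - 1)
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error, n - 1


# Twelve made-up measurements; too few to lean on the normal distribution.
sample = [9.8, 10.2, 10.4, 9.9, 10.1, 10.3, 10.0, 10.5, 9.7, 10.1, 10.2, 10.0]

t, dof = t_statistic(sample, hypothesized_mean=10.0)

print("t statistic:       ", round(t, 3))
print("degrees of freedom:", dof)
```

The t statistic is then compared against a critical value from the t-distribution with `dof` degrees of freedom, instead of the 1.96 used with the normal distribution. Python's standard library does not include the t-distribution, so that lookup needs a t table or a statistics package.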

Data scientists use t-distribution to analyze data sets where they cannot use the normal distribution. The data population should be approximately normal.

As a data scientist, you will use t-distribution in one of the following situations.

If you have a sample of more than 10 but fewer than 30 observations. With fewer than 30 observations, the sample is too small to rely on the normal distribution.
Quite often you will come across situations where you have millions of records to work with and you do not know the spread (standard deviation) of the data. In such a case, you first draw a few samples of the data (with the same sample size) and calculate their mean, median, mode, variance, and standard deviation. Based on these sample statistics, you then estimate the values for the complete population.

By the way, the t-distribution is also called Student’s t-distribution. However, it has nothing to do with the use of these statistics by students. Read the history behind this at the link below.

Student's T Distribution

William Gosset was an English statistician who worked for the brewery of Guinness. He developed different methods for…

365datascience.com

If you want to play around with some of these distributions in Excel, the following link contains interactive Excel templates you can use.

Introductory Business Statistics with Interactive Spreadsheets - 1st Canadian Edition

The interactive spreadsheets used throughout this book have been locked except for select cells that allow the student…

opentextbc.ca

2. Qualitative/Categorical

Photo by Andrew Stutesman on Unsplash

Categorical data doesn’t hold mathematical significance, as mathematical operations like addition, subtraction, multiplication, and division cannot be performed on such data. For example, the province of Canada a person lives in is a categorical variable. You cannot compare provinces the way you compare numbers. Categorical data can be further divided into:

Binomial Data
Nominal Data
Ordinal Data

Unfortunately, I need to conclude this article now. I am in love with understanding data, and I can go on and on about it. But too big an article means rejection by publishers :(

If you are an aspiring data scientist, make sure you develop your love for data. And love blooms with understanding, so spend the required time to understand the data and its nature.

Reference:

Data Scientist's Tryst with Bell Curve! | Towards Data Science
