
Heart Disease Prediction Using Python

source link: https://www.journaldev.com/57846/heart-disease-prediction-python


Filed Under: Machine Learning

Hey, fellow coders! In today's tutorial, we will try to predict the presence of a very common illness: heart disease.

Heart disease is one of the biggest causes of morbidity and mortality worldwide. The term refers to a group of disorders that affect the heart. According to the WHO, cardiovascular diseases are now the leading cause of death globally, accounting for roughly 17.9 million deaths per year.

Identifying and addressing the risk factors that have the greatest impact on heart disease is therefore very important in healthcare.


Understanding the Heart Disease Dataset

The dataset chosen for this tutorial is the CDC's 2020 annual survey data; the cleaned file, heart_2020_cleaned.csv, is the one used in the code below.

The original survey data consists of 401,958 rows and 279 columns, but the cleaned version reduces those nearly 300 variables to roughly 20. The target variable, HeartDisease, is treated as a binary label ("Yes" – the respondent reported heart disease; "No" – they did not).
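If you want a quick look at the data before writing any modelling code, the short optional sketch below lists the columns and shows how the HeartDisease label is distributed. It assumes you have already downloaded the cleaned file as heart_2020_cleaned.csv, the file name used throughout this tutorial.

import pandas as pd

# Assumes the cleaned CDC survey file has been downloaded to the working directory
data = pd.read_csv("heart_2020_cleaned.csv")
print(data.shape)                            # (rows, columns) after cleaning
print(list(data.columns))                    # the ~20 remaining variables
print(data["HeartDisease"].value_counts())   # distribution of the binary label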


Code for predicting Heart Disease

Our aim is to predict whether a person has heart disease using this dataset. The implementation includes the following steps:

  1. Importing the necessary libraries/modules
  2. Loading and pre-processing the data
  3. Splitting the data into a train set and a test set
  4. Training a logistic regression model with scikit-learn and evaluating its accuracy

Importing Dependencies

The very first step is to import the required libraries (pandas, NumPy, and Matplotlib's pyplot) into our program.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

Loading and Pre-processing Dataset

data = pd.read_csv('heart_2020_cleaned.csv')
print("Number of Datapoints : ",data.shape[0])
data.head()
Heart Disease Dataset (preview of the first five rows)

Many features in the dataset are stored as strings, which the logistic regression model used later for prediction cannot work with directly. We therefore convert these object columns into integer values: the unique values in each column are mapped to integers starting from either 0 or 1. For example, in columns that only contain "Yes" or "No", 1 is assigned to "Yes" and 0 to "No".

# Encode gender (column 8) as integers
data.iloc[:, 8] = data.iloc[:, 8].replace({"Female": 1, "Male": 0})
# Encode the ordered age bands (column 9) as integers 1-13
data.iloc[:, 9] = data.iloc[:, 9].replace({
    "18-24": 1, "25-29": 2, "30-34": 3, "35-39": 4, "40-44": 5,
    "45-49": 6, "50-54": 7, "55-59": 8, "60-64": 9, "65-69": 10,
    "70-74": 11, "75-79": 12, "80 or older": 13})
# Encode the race of the person (column 10) as integers
data.iloc[:, 10] = data.iloc[:, 10].replace({
    "White": 1, "Black": 2, "Asian": 3,
    "American Indian/Alaskan Native": 4, "Other": 5, "Hispanic": 6})
# Encode whether the person is diabetic (column 11) as integers
data.iloc[:, 11] = data.iloc[:, 11].replace({
    "Yes": 4, "Yes (during pregnancy)": 3, "No, borderline diabetes": 2, "No": 1})
# Encode the general health of the person (column 13) as ordered integers
data.iloc[:, 13] = data.iloc[:, 13].replace({
    "Excellent": 4, "Very good": 3, "Good": 2, "Fair": 1, "Poor": 0})
# Encode every remaining "Yes"/"No" column, including the heart disease label, as 1/0
data.replace("Yes", 1, inplace=True)
data.replace("No", 0, inplace=True)
Heart Disease Dataset after encoding (categorical values converted to integers)
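Before moving on, it can be worth a quick sanity check that no string values survived the encoding. The short, optional sketch below (not part of the original tutorial) counts any leftover strings and, if you like, forces every column to a numeric dtype.

# Optional sanity check: look for any string values that survived the encoding
object_cols = data.select_dtypes(include="object").columns
for col in object_cols:
    strings_left = data[col].map(lambda v: isinstance(v, str)).sum()
    print(col, "still contains", strings_left, "string values")

# If desired, force every column to a numeric dtype (raises if a string is left)
data = data.apply(pd.to_numeric)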

As the final pre-processing step, we separate the target label from the features and normalize the feature values with min-max scaling so that every column lies between 0 and 1. This keeps features with large numeric ranges from dominating the ones with small ranges.

# Separate the target label from the feature columns
y = data.HeartDisease.values
x_d = data.drop(["HeartDisease"], axis=1)
# Min-max normalize every feature column into the [0, 1] range
x = (x_d - x_d.min()) / (x_d.max() - x_d.min())
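To confirm the scaling did what we expect, a quick optional check of the per-column ranges can be printed:

# Each normalized feature should now span exactly the [0, 1] range
print(x.min().min(), x.max().max())      # expected: 0.0 1.0
print(x.describe().loc[["min", "max"]])  # per-column check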

Training and Testing Split of Dataset

We will use the standard 80-20 split: 80% of the data goes into the training set and the remaining 20% into the test set. To create the split, we use scikit-learn's train_test_split function.

from sklearn.model_selection import train_test_split
# 80% of the samples for training, 20% held out for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
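Since the label is heavily skewed towards "No", it is worth confirming the split sizes and the share of positive cases in each subset; you could also pass stratify=y to train_test_split to preserve the class ratio exactly (an optional tweak, not used in this tutorial). A quick check:

# Confirm the 80/20 split and the share of positive (heart disease) cases in each subset
print("Train shape:", x_train.shape, " Test shape:", x_test.shape)
print("Positive rate in train:", y_train.mean(), " in test:", y_test.mean())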

Applying Logistic Regression to predict Heart Disease

In this final section, we train a logistic regression classifier on the training data using scikit-learn's LogisticRegression class, as follows.

from sklearn.linear_model import LogisticRegression
# Fit a logistic regression classifier on the training split
lr = LogisticRegression()
lr.fit(x_train, y_train)
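Once the model is fitted, its learned weights give a rough sense of which features push a prediction towards or away from heart disease (a larger positive weight raises the predicted probability). The short optional sketch below assumes x_train kept its DataFrame column names from the split above.

# Pair each feature with its learned weight and list the most influential ones first
coef = pd.Series(lr.coef_[0], index=x_train.columns)
print(coef.reindex(coef.abs().sort_values(ascending=False).index).head(10))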

For the test set, we predict whether each person has the disease and compute the accuracy score of those predictions using the code below.

# Predict on the held-out test set and report the mean accuracy
all_pred = list(lr.predict(x_test))
score = lr.score(x_test, y_test)
print("Score of Logistic Regression : ", score)
Score of Logistic Regression :  0.9137572507387545

You can see that the score is pretty decent: more than 90% of the test predictions are correct. Keep in mind, though, that the dataset contains far more "No" than "Yes" respondents, so a high accuracy on its own does not tell the whole story.
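A short, optional sketch using scikit-learn's classification_report shows the per-class precision and recall behind that accuracy figure:

from sklearn.metrics import classification_report, confusion_matrix

# How the predictions break down per class on the held-out test set
print(confusion_matrix(y_test, all_pred))
print(classification_report(y_test, all_pred))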


Heart disease is one of society's key health concerns today. Manually estimating a person's chances of developing heart disease from individual risk factors is difficult, whereas machine learning techniques can learn to predict the outcome from existing data.

Thank you for reading!

I hope you liked the tutorial!



