Cardiovascular disease (CVD) comprises three main components: coronary artery disease, vascular disease, and cardiomyopathy. CVD and respiratory diseases are among the top five causes of death worldwide. The aim of this article is to predict heart disease from a given dataset; Naive Bayes, Random Forest, XGBoost, and Logistic Regression will be used and their accuracies compared.

Ischemic heart disease and stroke are the primary causes of about 70% of CVD deaths. Awareness of CVD and its associated risk factors is considerably lower in metropolitan and rural zones, as well as among school children. Family history and ethnicity are additional risk factors for CVD.

Alcohol use, smoking, and the habit of eating fatty or junk foods are commonly seen in young people and put them at risk of heart disease. The prevalence of heart disease in India in 2030 is projected to be double that of 2018. Air pollution also affects the lungs and heart patients. India is the seventh most polluted country in terms of air contamination; the harmful gases come mainly from vehicles. Polluted air contains dangerous chemical substances and affects multiple organs. Its effects range from minor upper respiratory irritation to coronary disease, lung cancer, and serious respiratory infections in young people, chronic bronchitis in adults, aggravation of existing heart and lung disease, and asthmatic attacks.

The correlation coefficient between PM2.5 and NO2 is 0.468688, and between PM2.5 and PM10 it is 0.8 (source: Springer).

Machine learning is used to predict heart disease from the given dataset to help cardiologists.

The dataset contains the following attributes:

age - age in years

sex - sex (1 = male; 0 = female)

cp - chest pain type

trestbps - resting blood pressure

chol - serum cholesterol in mg/dl

fbs - fasting blood sugar (> 120 mg/dl: 1 = true; 0 = false)

restecg - resting electrocardiographic results

thalach - maximum heart rate achieved

exang - exercise-induced angina (1 = yes; 0 = no)

oldpeak - ST depression induced by exercise relative to rest

slope - the slope of the peak exercise ST segment

ca - number of major vessels (0-3) colored by fluoroscopy

thal - 3 = normal; 6 = fixed defect; 7 = reversible defect

target - heart disease present or not (1 = yes; 0 = no)


import pandas as pd

# Load the heart disease dataset and inspect pairwise correlations
df = pd.read_csv("heart.csv")
df.corr()

Chest pain (cp) and target appear to be somewhat correlated.

From the graph above, people with chest pain values of 1, 2, or 3 are more likely to have heart disease than those with a value of 0. In other words, people with chest pain are the most likely to have heart disease, and those without chest pain mostly do not.
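This can also be checked numerically with a contingency table (a minimal sketch, assuming the df loaded above):

pd.crosstab(df["cp"], df["target"])
# Rows are the chest pain types 0-3; columns are target 0 (no disease) and 1 (disease)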

restecg - This graph shows that a person with an abnormal rhythm is more likely to have heart disease than a person with a normal rhythm.

import matplotlib.pyplot as plt

# df1 is assumed to be a working copy of the dataframe loaded above
df1 = df.copy()

plt.figure(figsize=(15, 15))

# Collect the categorical columns (10 or fewer unique values)
s = []
for column in df1.columns:
  print(f"{column} : {df1[column].unique()}")
  if len(df1[column].unique()) <= 10:
    s.append(column)
  else:
    print('==============================')

# For each categorical column, overlay histograms of the two target classes
for i, column in enumerate(s, 1):
  plt.subplot(3, 3, i)
  df1[df1["target"] == 0][column].hist(bins=35, color='blue', label='Heart Disease - NO', alpha=0.6)
  df1[df1["target"] == 1][column].hist(bins=35, color='red', label='Heart Disease - YES', alpha=0.6)
  plt.legend()
  plt.xlabel(column)

Healthy patients are represented by a target value of zero (no disease), while patients with disease are represented by one. In total, 526 patients have heart disease and 499 are healthy.
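These counts can be confirmed directly from the target column (a minimal sketch using the df loaded earlier):

df["target"].value_counts()
# Expected output: 526 rows with target 1 (disease) and 499 with target 0 (healthy)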

A clustered map of a few columns is used to check the correlation between them. The clustered map is created with Clustergrammer2, which can be installed in Python using pip. Age and chol show very little correlation.

pip install clustergrammer2

from clustergrammer2 import net

# df2 is assumed to be the subset of columns selected for the cluster map
net.load_df(df2)
net.cluster(enrichrgram=True)
net.widget()

Logistic Regression

Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable.
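As an illustration (a minimal sketch, not part of the original pipeline), the logistic (sigmoid) function maps any real-valued score into a probability between 0 and 1:

import numpy as np

def sigmoid(z):
  # Logistic function: squashes any real number into the (0, 1) range
  return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))  # 0.5, the decision boundary
print(sigmoid(4))  # ~0.982, strongly in favor of the positive class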

# Importing libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier

# Separate the features from the target and hold out 20% of the data for testing
x = df1.drop('target', axis=1)
y = df1.target

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)


# Standardize the features: fit the scaler on the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train logistic regression and evaluate it on the held-out test set
r = LogisticRegression()
model = r.fit(X_train, y_train)
m = r.predict(X_test)
accuracy = accuracy_score(y_test, m)

accuracy*100 - 80.97560975609757
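Since confusion_matrix and classification_report are already imported, the same predictions can be examined in more detail (a minimal sketch using the m predictions from above):

print(confusion_matrix(y_test, m))
print(classification_report(y_test, m))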

Naive Bayes

The Naive Bayes algorithm comes in handy for solving classification tasks. It is a simple yet powerful machine learning algorithm based on Bayes' Theorem.

Naive Bayes uses Bayes' Theorem and assumes that all predictors are independent. The classifier assumes that the presence of one particular feature in a class does not influence the presence of any other.

Here's an example: you would consider a fruit to be an orange if it is round, orange in color, and around 3.5 inches in diameter. Even if these features depend on each other in reality, they each contribute independently to the assumption that this particular fruit is an orange.
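For reference, under the naive independence assumption Bayes' Theorem gives P(class | x1, ..., xn) ∝ P(class) · P(x1 | class) · ... · P(xn | class), and the class with the highest resulting probability is predicted. GaussianNB, used below, models each per-feature likelihood P(xi | class) as a normal distribution.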

# Train Gaussian Naive Bayes and evaluate on the test set
n = GaussianNB()
nb = n.fit(X_train, y_train)
pred = nb.predict(X_test)
accuracy1 = accuracy_score(y_test, pred)

accuracy1*100 - 78.04878048780488

Random Forest

Random Forest is an ensemble of decision trees; a decision tree is a supervised learning algorithm used for classification and regression.

Each tree within the random forest does its own random train/test split of the data, referred to as bootstrap aggregation; the samples not included are called the 'out-of-bag' samples. Moreover, every tree performs feature bagging at each node split to reduce the influence of any feature strongly correlated with the response.

While an individual tree may be sensitive to outliers, the ensemble model is not.
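As an aside, the out-of-bag samples mentioned above give a built-in accuracy estimate (a minimal sketch; oob_score is a standard scikit-learn option, separate from the model trained below):

rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
rf_oob.fit(X_train, y_train)
print(rf_oob.oob_score_)  # accuracy estimated only from each tree's out-of-bag samples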

# Train a random forest and evaluate on the test set
f = RandomForestClassifier()
f.fit(X_train, y_train)
rf = f.predict(X_test)
accuracy2 = accuracy_score(y_test, rf)

accuracy2*100 - 100.0

XGBoost

XGBoost is a gradient-boosting algorithm that has recently been dominating applied machine learning.

# Train an XGBoost classifier and evaluate on the test set
xgb = XGBClassifier(eval_metric='mlogloss', use_label_encoder=False)
xgb.fit(X_train, y_train)
xgb_predicted = xgb.predict(X_test)
accuracy3 = accuracy_score(y_test, xgb_predicted)

accuracy3*100 - 100.0

The accuracies of Naive Bayes, Random Forest, Logistic Regression, and XGBoost were plotted in a bar chart for comparison.

The bar chart shows that Random Forest and XGBoost achieve the same accuracy of 100% in predicting heart disease.
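The chart can be reproduced along these lines (a minimal sketch using the accuracy variables computed above):

models = ['Logistic Regression', 'Naive Bayes', 'Random Forest', 'XGBoost']
scores = [accuracy * 100, accuracy1 * 100, accuracy2 * 100, accuracy3 * 100]

plt.bar(models, scores)
plt.ylabel('Accuracy (%)')
plt.title('Model accuracy comparison')
plt.show()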

Conclusion

CVD is found to be a leading cause of death in India and throughout the world. Among the algorithms compared, Random Forest and XGBoost show 100% accuracy.

Sources of Article

https://link.springer.com/chapter/10.1007/978-3-030-35252-3_11
