Cardiovascular disease (CVD) comprises three main components: coronary artery disease, vascular disease, and cardiomyopathy. CVD and respiratory diseases are among the top five causes of death worldwide. The aim of this article is to predict heart disease from a given dataset; Naive Bayes, Random Forest, XGBoost, and Logistic Regression will be compared on accuracy.
Ischemic heart disease and stroke are the primary causes of about 70% of CVD deaths. Awareness of CVD and its associated risk factors is considerably lower in urban and rural areas, as well as among school children. Family history and ethnicity are additional factors in CVD.
Alcohol use, smoking, and irregular eating habits involving fatty or junk foods are common among young people and put them at risk of heart disease. The prevalence of heart disease in India in 2030 is projected to be double that of 2018. Air pollution also affects the lungs and the heart. India is the seventh most contaminated country in terms of air pollution; the harmful gases come mainly from vehicles. Polluted air contains dangerous chemical substances and affects multiple organs: effects range from minor upper respiratory irritation to coronary disease, lung tumors, serious respiratory infections in young people, chronic bronchitis in adults, aggravation of existing heart and lung disease, and asthmatic attacks.
The correlation coefficient between PM2.5 and NO2 is 0.468688, and between PM2.5 and PM10 it is 0.8 (source: Springer, linked below).
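As a rough sketch of how such pairwise coefficients are computed with pandas (the file name air_quality.csv and the column names here are hypothetical, not the actual data behind the Springer figures):

import pandas as pd

# hypothetical pollutant readings; file and column names are assumptions
air = pd.read_csv("air_quality.csv")
print(air["PM2.5"].corr(air["NO2"]))   # Pearson correlation, ~0.47 in the cited study
print(air["PM2.5"].corr(air["PM10"]))  # ~0.8 in the cited study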
Machine learning is used here to predict heart disease from the given dataset, to assist cardiologists.
The dataset contains:
age - age in years
sex - (1 = male; 0 = female)
cp - chest pain type
trestbps - resting blood pressure
chol - serum cholesterol in mg/dl
fbs - fasting blood sugar
restecg - resting electrocardiographic results
thalach - maximum heart rate achieved
exang - exercise-induced angina (1 = yes; 0 = no)
oldpeak - ST depression induced by exercise relative to rest
slope - the slope of the peak exercise ST segment
ca - number of major vessels (0-3) colored by fluoroscopy
thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
target - has disease or not (1 = yes, 0 = no)
import pandas as pd

df = pd.read_csv("heart.csv")  # load the heart disease dataset
df.corr()  # pairwise Pearson correlations between all columns
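The matrix is easier to read as a heatmap; a minimal sketch using seaborn (an assumed choice, any plotting library works):

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")  # annotated correlation matrix
plt.show()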
Chest pain (cp) and target appear to be somewhat correlated.
From the graph above, people with chest pain values of 1, 2, or 3 are more likely to have heart disease than those with a value of 0. In other words, people with chest pain are most likely to have heart disease, while those without chest pain mostly do not.
restecg - This graph shows that a person with an abnormal heart rhythm is more likely to have heart disease than one with a normal rhythm.
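A minimal sketch of how such count plots can be produced with seaborn (an assumption; the original figures are not reproduced here):

import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.countplot(data=df, x="cp", hue="target", ax=axes[0])       # chest pain type vs. disease
sns.countplot(data=df, x="restecg", hue="target", ax=axes[1])  # resting ECG result vs. disease
plt.show()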
import matplotlib.pyplot as plt

# df1 is assumed to be the cleaned copy of df prepared earlier
plt.figure(figsize=(15, 15))
s = []  # columns with few unique values, i.e. the categorical features
for column in df1.columns:
    print(f"{column} : {df1[column].unique()}")
    if len(df1[column].unique()) <= 10:
        s.append(column)
    else:
        print('==============================')

# overlay class-conditional histograms for each categorical feature
for i, column in enumerate(s, 1):
    plt.subplot(3, 3, i)
    df[df["target"] == 0][column].hist(bins=35, color='blue', label='Heart Disease - NO', alpha=0.6)
    df[df["target"] == 1][column].hist(bins=35, color='red', label='Heart Disease - YES', alpha=0.6)
    plt.legend()
    plt.xlabel(column)
Healthy patients are represented by the value 0 (no disease) and diseased patients by the value 1. In total, about 526 patients have heart disease and 499 are healthy.
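These counts can be checked directly from the dataframe:

print(df["target"].value_counts())  # expected: 1 -> 526, 0 -> 499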
A clustered map of a few columns checks the correlation between them. The clustered map is created with Clustergrammer2, which can be installed in Python using pip. Age and chol show very little correlation.
pip install clustergrammer2

from clustergrammer2 import net

# df2 is assumed to be the subset of columns selected for the cluster map
net.load_df(df2)
net.cluster(enrichrgram=True)
net.widget()
Logistic Regression
Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable.
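Concretely, it estimates the probability of the positive class as p(y = 1 | x) = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn)), squashing a linear combination of the features into the range (0, 1).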
# Importing libraries
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
x = df1.drop('target', axis=1)  # features
y = df1.target                  # labels
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

# standardize: fit the scaler on the training set only, then apply to both
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

r = LogisticRegression()
model = r.fit(X_train, y_train)
m = r.predict(X_test)
accuracy = accuracy_score(y_test, m)
accuracy * 100  # 80.97560975609757
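The confusion_matrix and classification_report imported above can give a fuller picture than accuracy alone; a quick sketch for the logistic model:

print(confusion_matrix(y_test, m))       # counts of true/false positives and negatives
print(classification_report(y_test, m))  # per-class precision, recall, and F1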
Naive Bayes
For solving a classification task, the Naive Bayes algorithm comes in handy. It is one of the simple yet strong machine learning algorithms based on Bayes' theorem.
Naive Bayes applies Bayes' theorem and assumes that all predictors are independent: the classifier assumes the presence of one particular feature in a class does not affect the presence of any other.
Here is an example: you would consider a fruit to be an orange if it is round, orange, and about 3.5 inches in diameter. Even if these features depend on each other, each contributes independently to the assumption that this particular fruit is an orange.
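Formally, for features x1, ..., xn the classifier picks the class y that maximizes P(y) * P(x1|y) * ... * P(xn|y): Bayes' theorem with the constant denominator dropped and the independence assumption applied to the likelihood.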
n = GaussianNB()
nb = n.fit(X_train, y_train)
pred = nb.predict(X_test)
accuracy1 = accuracy_score(y_test, pred)
accuracy1 * 100  # 78.04878048780488
Random Forest
Random Forest is an ensemble of decision trees; a decision tree is a supervised learning algorithm used for classification and regression.
Each tree within the random forest does its own random train/test split of the data, known as bootstrap aggregation; the samples not included are called the out-of-bag samples. Moreover, every tree performs feature bagging at every node split to reduce the influence of features strongly correlated with the response.
While an individual tree may be sensitive to outliers, the ensemble as a whole is not.
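Those out-of-bag samples give a built-in validation estimate; a minimal sketch using scikit-learn's oob_score option (a variant of the model below, not part of the original experiment):

f_oob = RandomForestClassifier(oob_score=True, random_state=1)
f_oob.fit(X_train, y_train)
print(f_oob.oob_score_)  # accuracy measured on each tree's out-of-bag samples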
f = RandomForestClassifier()
f.fit(X_train, y_train)
rf = f.predict(X_test)
accuracy2 = accuracy_score(y_test, rf)
accuracy2 * 100  # 100.0
XGBoost
XGBoost is a gradient boosting algorithm that has recently been dominating applied machine learning.
xgb = XGBClassifier(eval_metric='mlogloss', use_label_encoder=False)
xgb.fit(X_train, y_train)
xgb_predicted = xgb.predict(X_test)
accuracy3 = accuracy_score(y_test, xgb_predicted)
accuracy3 * 100  # 100.0
The accuracies of Naive Bayes, Random Forest, Logistic Regression, and XGBoost are plotted in a bar chart to compare the algorithms. The chart shows that Random Forest and XGBoost achieve the same accuracy of 100% in predicting heart disease.
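A minimal sketch of such a bar chart, reusing the accuracy variables computed above:

import matplotlib.pyplot as plt

names = ["Logistic Regression", "Naive Bayes", "Random Forest", "XGBoost"]
scores = [accuracy * 100, accuracy1 * 100, accuracy2 * 100, accuracy3 * 100]
plt.bar(names, scores, color=["steelblue", "orange", "green", "red"])
plt.ylabel("Accuracy (%)")
plt.show()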
Conclusion
CVD is found to be a leading cause of death in India and across the world. Random Forest and XGBoost show 100% accuracy compared to the other algorithms.
Reference: https://link.springer.com/chapter/10.1007/978-3-030-35252-3_11