
ML with Scikit-Learn

Predicting the Onset of Heart Disease - SVM, K-Nearest Neighbour, ANN Multilayer Perceptron


Let's walk through an example of predicting coronary heart disease with three different ML algorithms.

 

Using SVM (Support Vector Machine), KNN (K-Nearest Neighbours), and an ANN (Multilayer Perceptron), we will find out which model is the best approach.

 

 ● Data Description

The data comes from the South African Heart Disease dataset: https://www.openml.org/d/1498

 


The dataset has the following ten attributes: systolic blood pressure (sbp), cumulative tobacco consumption in kg, low-density lipoprotein cholesterol (ldl), adiposity, family history of heart disease (Present/Absent), type-A behaviour, obesity, current alcohol consumption, age at onset of the condition, and the CHD response.

 

 ● Data Processing

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


# Read the data from OpenML
data = pd.read_csv('https://www.openml.org/data/get_csv/1592290/phpgNaXZe')

# Rename the columns to descriptive names
column = ['sbp','tobacco','ldl','adiposity','famhist','type','obesity','alcohol','age','chd']
data.columns = column
print(data.head())
print(data.describe())

# Check for missing values
print(data.isnull().sum())

 

Running the code above shows that there are no missing values. Next, to bring the attributes onto a comparable scale, we apply feature scaling in the form of Min-Max scaling.

 

from sklearn.preprocessing import MinMaxScaler
scale = MinMaxScaler(feature_range=(0, 100))

# Rescale the sbp column into the 0-100 range (min-max normalisation)
data['sbp'] = scale.fit_transform(data['sbp'].values.reshape(-1, 1))

print(data.head())

# Data after modification
print(data.describe())
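
Under the hood, MinMaxScaler with feature_range=(0, 100) maps each value x to (x - min) / (max - min) * 100. A quick check of that formula on made-up toy values:

# Toy example: min-max scaling by hand (values are illustrative only)
s = pd.Series([101.0, 124.0, 160.0])
manual = (s - s.min()) / (s.max() - s.min()) * 100
print(manual.tolist())   # [0.0, 38.98..., 100.0]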

 

 ● Data Visualization

# Area plot of the first 50 rows to eyeball the scaled attributes
data.head(50).plot(kind='area', figsize=(10,5))

# Scatter plots of obesity, tobacco and alcohol against age
data.plot(x='age', y='obesity', kind='scatter', figsize=(10,5))
data.plot(x='age', y='tobacco', kind='scatter', figsize=(10,5))
data.plot(x='age', y='alcohol', kind='scatter', figsize=(10,5))

# Histogram and box plot of all attributes
data.plot(kind='hist', figsize=(10,5))
color = dict(boxes='DarkGreen', whiskers='DarkOrange', medians='DarkBlue', caps='Gray')
data.plot(kind='box', figsize=(10,6), color=color, ylim=[-10,90])
plt.show()
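
seaborn is imported above but never used; a correlation heatmap is one natural addition for spotting related attributes. A sketch, assuming (as the model-fitting code below implies) that every column in this CSV is numeric:

# Correlation heatmap of the attributes (uses the seaborn import from earlier)
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()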

 

These plots give a quick look at how the data is distributed.

 ● Prediction - Train/Test Split

# Split the data into train and test sets (80% train, 20% test)
from sklearn.model_selection import train_test_split
col = ['sbp','tobacco','ldl','adiposity','famhist','type','obesity','alcohol','age']
X_train, X_test, y_train, y_test = train_test_split(data[col], data['chd'], test_size=0.2, random_state=1234)
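
The split above is not stratified, so it is worth checking that the class ratio is roughly preserved across the two sets (a quick check, reusing the variables above):

# Compare class proportions in the train and test sets
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))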

 

 ● Prediction - SVM

from sklearn import svm
# Train a linear-kernel SVM and predict on the held-out test set
svm_clf = svm.SVC(kernel='linear')
svm_clf.fit(X_train, y_train)
y_pred_svm = svm_clf.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_svm = confusion_matrix(y_test, y_pred_svm)

from sklearn.metrics import accuracy_score
svm_result = accuracy_score(y_test,y_pred_svm)
recall_svm = cm_svm[0][0]/(cm_svm[0][0] + cm_svm[0][1])     # recall for the first class (row 0)
precision_svm = cm_svm[0][0]/(cm_svm[0][0] + cm_svm[1][0])  # precision for the first class (column 0)
print("Accuracy :",svm_result)
print("Recall :",recall_svm)
print("Precision :",precision_svm)

 

Accuracy : 0.7419354838709677
Recall : 0.85
Precision : 0.7727272727272727
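
As a cross-check, scikit-learn can compute the same per-class metrics directly from the predictions above; the first row should match the manual calculation:

# Per-class precision, recall and F1 for the SVM predictions
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_svm))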

 

 ● Prediction - KNN

from sklearn.neighbors import KNeighborsClassifier
# leaf_size only affects tree-based neighbour search, so it is dropped here: algorithm='brute' ignores it
knn_clf = KNeighborsClassifier(n_neighbors=5, n_jobs=-1, algorithm='brute')
knn_clf.fit(X_train, y_train)

y_pred_knn = knn_clf.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_knn = confusion_matrix(y_test, y_pred_knn)
knn_result = accuracy_score(y_test, y_pred_knn)
recall_knn = cm_knn[0][0]/(cm_knn[0][0] + cm_knn[0][1])     # recall for the first class (row 0)
precision_knn = cm_knn[0][0]/(cm_knn[0][0] + cm_knn[1][0])  # precision for the first class (column 0)
print("Accuracy :", knn_result)
print("Recall :", recall_knn)
print("Precision :", precision_knn)

 

Accuracy : 0.6451612903225806
Recall : 0.8166666666666667
Precision : 0.6901408450704225
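
n_neighbors=5 is an arbitrary choice; a quick cross-validated sweep over k (a sketch, reusing the training split above) shows how sensitive KNN is to it:

# Try odd k values and report mean 5-fold CV accuracy on the training set
from sklearn.model_selection import cross_val_score
for k in range(1, 16, 2):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(k, round(scores.mean(), 3))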

 

 ● Prediction - ANN

Finally, the artificial neural network approach: a multilayer perceptron tuned with a grid search.

from sklearn.metrics import make_scorer, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
ann_clf = MLPClassifier()

# Parameters: each entry is a list of candidate values for the grid search
parameters = {'solver': ['lbfgs'],
              'alpha': [1e-4],
              # two hidden layers of 14 neurons each; the input (9 features) and
              # output layers are sized automatically, so only hidden layers are listed
              'hidden_layer_sizes': [(14, 14)],
              'random_state': [1]}
# Type of scoring to compare parameter combos
acc_scorer = make_scorer(accuracy_score)

# Run grid search
grid_obj = GridSearchCV(ann_clf, parameters, scoring=acc_scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Pick the best combination of parameters
ann_clf = grid_obj.best_estimator_
# Fit the best algorithm to the data
ann_clf.fit(X_train, y_train)
y_pred_ann = ann_clf.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm_ann = confusion_matrix(y_test, y_pred_ann)
ann_result = accuracy_score(y_test,y_pred_ann)
recall_ann = cm_ann[0][0]/(cm_ann[0][0] + cm_ann[0][1])     # recall for the first class (row 0)
precision_ann = cm_ann[0][0]/(cm_ann[0][0] + cm_ann[1][0])  # precision for the first class (column 0)
print("Accuracy :",ann_result)
print("Recall :",recall_ann)
print("Precision :",precision_ann)

 

Accuracy : 0.7096774193548387
Recall : 0.8166666666666667
Precision : 0.7538461538461538
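
To see which parameter combination the grid search settled on, the fitted GridSearchCV object can be inspected directly:

# Winning hyperparameters and their mean cross-validated accuracy
print(grid_obj.best_params_)
print(grid_obj.best_score_)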

 

 ● Result

# Collect the metrics (as percentages) into a DataFrame and plot them side by side
results = {'Accuracy': [svm_result*100, knn_result*100, ann_result*100],
           'Recall': [recall_svm*100, recall_knn*100, recall_ann*100],
           'Precision': [precision_svm*100, precision_knn*100, precision_ann*100]}
index = ['SVM','KNN','ANN']
results = pd.DataFrame(results, index=index)
fig = results.plot(kind='bar', title='Comparison of models', figsize=(9,9)).get_figure()
fig.savefig('Final Result.png')
plt.show()
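
Printing the same DataFrame gives the exact numbers behind the chart:

# Exact metric values (percentages) for the three models
print(results.round(2))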

Looking at the results, SVM achieves the highest accuracy of the three models, so the SVM model is the best fit for this problem.

 
