Mastering machine learning algorithms isn't a myth at all. Most beginners start by learning regression because it is simple to learn and use, but does that solve our purpose? Of course not, because there is a lot more to machine learning than linear and logistic regression! For instance, have you heard of support vector regression and support vector machines (SVM)?
SVM stands for Support Vector Machine, which is a type of supervised machine learning algorithm used for classification and regression analysis. SVM works by finding the optimal hyperplane that separates the data points into different classes.
In this article, we will walk through a practical implementation of SVM using the heart disease prediction dataset from Kaggle.
1. What is a Support Vector Machine (SVM)?
Support Vector Machines (SVM) is a popular and powerful machine learning algorithm used for classification and regression analysis. SVM works by finding the optimal hyperplane that separates the data points into different classes. The hyperplane is chosen in such a way that it maximizes the margin, which is the distance between the hyperplane and the closest data points of the different classes. The data points that are closest to the hyperplane are known as support vectors.
SVM is a type of binary classifier, which means it classifies data into two groups or classes. However, it can also be extended to multi-class classification problems by using various techniques such as one-vs-one or one-vs-all classification.
SVM is effective in handling high-dimensional data and can provide good accuracy even with relatively small training datasets. Additionally, SVM can handle non-linearly separable data by using a kernel function that maps the input data to a higher-dimensional feature space where it is more likely to be linearly separable.
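As an illustration of the kernel idea (not part of the original tutorial), scikit-learn's SVC exposes the kernel choice as a single parameter. The sketch below uses a made-up toy dataset purely for demonstration:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy, non-linearly separable data (two concentric circles), for illustration only
X_toy, y_toy = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=42)

linear_svc = SVC(kernel='linear').fit(X_toy, y_toy)   # struggles on circular data
rbf_svc = SVC(kernel='rbf').fit(X_toy, y_toy)         # kernel maps data to a higher-dimensional space

print("Linear kernel accuracy:", linear_svc.score(X_toy, y_toy))
print("RBF kernel accuracy:", rbf_svc.score(X_toy, y_toy))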
2. Equation of Support Vector Machine
The equation of the Support Vector Machine (SVM) classifier depends on the type of SVM being used. However, in general, the decision boundary of the SVM can be represented by the following equation:
# Equation of Support Vector Machine
f(x) = sign(w^T x + b)
where:
x represents the input vector
w is the weight vector
b is the bias term
sign is the sign function that returns either +1 or -1 depending on the sign of its argument.
The weight vector w and bias term b are learned during the training phase of the SVM algorithm, and they determine the position and orientation of the decision boundary. The goal of SVM is to find the values of w and b that maximize the margin between the two classes of data points.
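To make the equation concrete, here is a small NumPy sketch of the decision rule; the values of w, b, and x below are made up purely for illustration:

import numpy as np

w = np.array([0.4, -1.2, 0.7])   # weight vector learned during training (hypothetical)
b = -0.5                         # bias term learned during training (hypothetical)
x = np.array([1.0, 0.3, 2.1])    # input vector (hypothetical)

f_x = np.sign(np.dot(w, x) + b)  # +1 or -1, i.e. which side of the hyperplane x lies on
print(f_x)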
3. Support Vector Machine Practical Implementation for Heart Disease
Data Set Link: https://github.com/Narenderbeniwal/Spark-By-Example
#Import all necessary Libs
import numpy as np
import pandas as pd
#For Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
%matplotlib inline
#For the Counter lib, used to check the counts of the classification column
from collections import Counter
#For Train Test Split
from sklearn.model_selection import train_test_split
#For labeling the dataset cols
from sklearn.preprocessing import LabelEncoder
#For Feature Scaling
from sklearn.preprocessing import StandardScaler
#For Handling imbalanced dataset
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
#For importing Support Vector Classifier
from sklearn.svm import SVC
#For Error Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
#For Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV
#For Evaluating the final Model itself
from eli5 import show_weights
from eli5.sklearn import PermutationImportance
#Import Dataset from the corresponding path
df = pd.read_csv('/Users/narenderbeniwal/Downloads/Machine_Learning-main/Classification/SVM/Heart_Disease_Data/Data/heart.csv')
df.head()
3.1 Let's have an overview of the dataset
It's a clean, easy-to-understand dataset. However, the meaning of some of the column headers is not obvious. Here's what they mean:
age: The person’s age in years
sex: The person’s sex (1 = male, 0 = female)
cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)
trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)
chol: The person’s cholesterol measurement in mg/dl
fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
thalach: The person’s maximum heart rate achieved
exang: Exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest ('ST' refers to positions on the ECG plot)
slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
ca: The number of major vessels (0-3)
thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
target: Heart disease (0 = no, 1 = yes)
#Let's rename the cols
df.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
              'exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']
df.head()
Yields below output.
#Checking the info of the dataset -> it has 303 rows and 14 cols
df.info()
Yields below output.
3.2 EDA (Exploratory Data Analysis) & Feature Engineering
NULL Check
#Checking the Nulls in Dataset
df.isnull().sum()
So, we don’t have any NULL in the dataset.
If we find any NULLs, we need to either remove them or fill them with the mean, median, or mode. Removing records causes data loss, so it is usually better to fill NULL values with the mean, median, or mode.
Pandas methods: fillna, dropna
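Since there are no nulls here this step is only illustrative, but a minimal sketch of both options looks like this:

# Illustrative only: the dataset has no NULLs, but if it did we could either
# drop the affected rows or fill them with a summary statistic.
df_dropped = df.dropna()                              # remove rows that contain NULLs
df_filled = df.fillna(df.median(numeric_only=True))   # fill numeric NULLs with column medians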
3.3 Checking for Duplicates
#Identifying Duplicates
df.duplicated().sum()
#Removing Duplicate Data
df.drop_duplicates(inplace=True)
#checking for duplicates again
df.duplicated().sum()
We have identified duplicate data and removed it.
3.4 Outlier detection and Analysis
# Univariate analysis age.
f = plt.figure(figsize=(20,4))
f.add_subplot(1,2,1)
sns.distplot(df['age'])
f.add_subplot(1,2,2)
sns.boxplot(df['age'])
Similarly, we can check for outliers in resting_blood_pressure and max_heart_rate_achieved, as shown in the sketch below.
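A minimal sketch of that check, reusing the renamed columns from earlier:

# Univariate outlier check for resting blood pressure and max heart rate
f = plt.figure(figsize=(20, 4))
f.add_subplot(1, 2, 1)
sns.boxplot(df['resting_blood_pressure'])
f.add_subplot(1, 2, 2)
sns.boxplot(df['max_heart_rate_achieved'])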
3.5 Handle outliers with SVMs
There are two variants of SVM: the hard-margin SVM and the soft-margin SVM.
The hard-margin variant does not deal with outliers. It looks for the maximum-margin hyperplane such that every training point is correctly classified with a margin of at least 1, so it handles outliers poorly.
The soft-margin variant allows a few points to be misclassified or classified with a margin of less than 1, but every such point incurs a penalty controlled by the C parameter. A low C allows more margin violations, making the model more tolerant of outliers, while a high C allows fewer violations and pushes the model towards a hard margin.
The message is that since the dataset contains outliers, C needs to be chosen carefully: a lower C lets the model tolerate the outliers, while a very high C forces the model to fit them and risks overfitting.
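As a minimal illustration, here is how C could be set in scikit-learn's SVC; the specific values are assumptions for demonstration, not tuned for this dataset:

# Softer vs. harder margin: the C values below are hypothetical examples
from sklearn.svm import SVC

soft_svc = SVC(kernel='linear', C=0.1)    # softer margin, more tolerant of outliers
hard_svc = SVC(kernel='linear', C=100.0)  # harder margin, tries to classify every point correctly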
3.6 Convert Categorical features to numerical values
#Let's have a look at the data; the categorical features in this dataset are already encoded as numbers, so no extra conversion is needed
df.head()
Let's use pairplot to visualize the relationships between the features.
sns.pairplot(df, hue='target', diag_kws={'bw': 0.2})
3.7 Feature Selection
As we can see from the pair plot above, all of the features are needed to draw a conclusion, so we keep all of them.
# Defining X and y from the feature dataset
X = df.drop(['target'], axis=1)
y = df['target']
3.8 Train Test Split
For Train & Test split, we are going to use ‘train_test_split’ from ‘sklearn.modelselection’
#Splitting data for training and testing
#please import 'from sklearn.model_selection import train_test_split' if you haven't already
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
3.9 Handling Imbalanced Dataset
#Target Classification
sns.countplot(df['target'])
#Import Counter if you haven't already
from collections import Counter
#Let's check the train data for sampling
print("'y' Samples : ", Counter(y_train))
print("X_train Shape: ", X_train.shape)
print("y_train Shape: ", y_train.shape)
## Output
'y' Samples :  Counter({1: 123, 0: 103})
X_train Shape: (226, 13)
y_train Shape: (226,)
In the full dataset we have around 140 negative samples and 160 positive samples, so the data is almost balanced. Below are the techniques we would use in case of an imbalanced dataset.
There are three major techniques for balancing an imbalanced dataset (a short oversampling sketch follows the list):
Undersampling – It reduces the number of majority-class samples to match the minority class. For example, if the dependent feature has 1000 positive values and 100 negative values, it downsamples the positive values to 100 to match the number of negative values. With this technique, there is a chance of losing a large part of the actual data.
Oversampling – It increases the number of minority-class samples to match the majority class by duplicating them. For example, with 1000 positive values and 100 negative values, it duplicates the negative values from 100 up to 1000 to match the positive values. With this technique, there is a chance of getting an overfitted model.
SMOTE (Synthetic Minority Oversampling Technique) – It increases the number of minority-class samples to match the majority class by synthesizing new samples, interpolating between existing minority samples and their nearest neighbours rather than duplicating them. For example, with 1000 positive values and 100 negative values, it generates synthetic negative samples to bring them from 100 up to 1000.
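Our data is close to balanced, so this step is optional here; still, here is a minimal sketch of how the two oversamplers imported earlier could be applied to the training split:

# Illustrative only: oversampling the training data with RandomOverSampler and SMOTE
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X_train, y_train)
print("After RandomOverSampler:", Counter(y_ros))

smote = SMOTE(random_state=42)
X_sm, y_sm = smote.fit_resample(X_train, y_train)
print("After SMOTE:", Counter(y_sm))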
3.10 Feature Scaling
#Save the feature names before scaling
cols = X_train.columns
#Scale the features with StandardScaler (imported above) so they are on a comparable range
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Fitting the scaled arrays back into dataframes
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
3.11 Model Creation with Linear Kernel Hyperparams
# import SVC classifier
from sklearn.svm import SVC
# instantiate the classifier with a linear kernel
svc = SVC(kernel='linear', random_state=42)
#Training the model.
svc.fit(X_train, y_train)
#Predict test data set.
y_pred = svc.predict(X_test)
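GridSearchCV was imported earlier for hyperparameter tuning. It is not part of the original run, but a minimal sketch of how C and the kernel could be tuned looks like this (the grid values are illustrative assumptions):

# Illustrative hyperparameter search over C and the kernel
param_grid = {'C': [0.1, 1, 10, 100], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)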
3.12 Error Metrics for model with Linear kernel Hyperparams
#Checking the performance of our model with a confusion matrix
print(confusion_matrix(y_test, y_pred))
#Checking the performance of our model with a classification report
print(classification_report(y_test, y_pred))
#Checking the performance of our model with recall
recall_score(y_test, y_pred)
#Checking the performance of our model with the accuracy score
accuracy_score(y_test, y_pred)
#Checking the performance of our model with the F1 score
f1_score(y_test, y_pred)
##Output
[[29 6]
[ 7 34]]
precision recall f1-score support
0 0.81 0.83 0.82 35
1 0.85 0.83 0.84 41
accuracy 0.83 76
macro avg 0.83 0.83 0.83 76
weighted avg 0.83 0.83 0.83 76
Recall   : 0.8292682926829268
Accuracy : 0.8289473684210527
F1 score : 0.8395061728395061
As we can see, our model is performing well, with a test accuracy of about 83%.
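The eli5 utilities imported at the top can be used to inspect which features drive the predictions, which is one way to obtain the feature importances mentioned in the conclusion below. A minimal sketch, assuming a Jupyter environment:

# Permutation importance of the fitted SVC on the test split (display in a notebook cell)
perm = PermutationImportance(svc, random_state=42).fit(X_test, y_test)
show_weights(perm, feature_names=X_test.columns.tolist())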
4. Conclusion
The model we have used to explain Support Vector Machines is generalized and scores well. However, I am not sure how it would perform on a much larger dataset. Still, it allowed us to create a simple model. At the start, I thought cholesterol and blood pressure would influence the model the most, but the dataset did not show that. Instead, 'num_major_vessels', 'thalassemia' and 'chest_pain_type' were given more importance.