
Classifying data using Support Vector Machines(SVMs) in R

Last Updated : 03 May, 2025

Support Vector Machines (SVMs) are supervised learning models used mainly for classification, but they can also be applied to regression tasks. In this approach, each data point is represented as a point in an n-dimensional space, where n is the number of features. The goal is to find a hyperplane that best separates the two classes.

Working of SVM Algorithm

A Support Vector Machine (SVM) is a classifier that finds a separating hyperplane to differentiate between the classes in the data. A hyperplane is a flat subspace that divides the feature space into two parts. In a two-dimensional space this is simply a line, while in higher dimensions it is a plane or hyperplane that separates the data into different categories.

Mathematically, the hyperplane can be represented as:

[Tex]w \cdot x + b = 0[/Tex]

Where:

  • [Tex]w[/Tex] is the weight vector (normal to the hyperplane).
  • [Tex]x[/Tex] is a point in the feature space.
  • [Tex]b[/Tex] is the bias term that shifts the hyperplane.

For classification, SVM aims to maximize the margin between the classes. The margin is the distance between the hyperplane and the closest data points from each class, known as support vectors. SVM chooses the hyperplane that maximizes this margin, which is given by:

[Tex]\text{Margin} = \frac{2}{\|w\|}[/Tex]

This ensures the largest possible separation between the classes while minimizing classification errors.
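To see where this expression comes from: the support vectors lie on the two parallel hyperplanes [Tex]w \cdot x + b = 1[/Tex] and [Tex]w \cdot x + b = -1[/Tex]. Since the perpendicular distance from a point [Tex]x_0[/Tex] to the hyperplane [Tex]w \cdot x + b = 0[/Tex] is [Tex]\frac{|w \cdot x_0 + b|}{\|w\|}[/Tex], each of these margin hyperplanes sits at distance [Tex]\frac{1}{\|w\|}[/Tex] from the separator, giving a total margin of [Tex]\frac{2}{\|w\|}[/Tex]. Maximizing the margin is therefore equivalent to minimizing [Tex]\|w\|[/Tex].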

Selecting the Best Hyperplane

To determine the optimal hyperplane, the algorithm analyzes labeled training data and evaluates candidate hyperplanes based on how well they separate the classes. Consider the following scenarios for selecting the best hyperplane:

Scenario 1:

In this case we have three hyperplanes: A, B and C. The goal is to find the hyperplane that best separates the two classes, i.e., stars and circles. In this scenario, hyperplane B divides the two classes cleanly, making it the optimal choice.

Scenario 2:

In this situation all three hyperplanes A, B and C separate the classes correctly. To identify the best one, we calculate the margin, i.e., the distance between the hyperplane and the nearest data points. The hyperplane with the largest margin provides the best separation; here hyperplane C has the largest margin, making it the optimal choice.
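To make the margin concrete, here is a minimal sketch (it assumes the e1071 package that we install in the implementation below) that fits a linear SVM on toy 2D data and recovers the margin width 2/||w|| from the fitted model. The w reconstruction from coefs and SV only applies to linear kernels.

R
# Toy example: fit a linear SVM and compute the margin 2 / ||w||
library(e1071)

set.seed(42)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),  # cluster for class -1
           matrix(rnorm(20, mean = 3), ncol = 2))  # cluster for class +1
y <- factor(rep(c(-1, 1), each = 10))

fit <- svm(x, y, kernel = "linear", scale = FALSE)

w <- t(fit$coefs) %*% fit$SV   # weight vector (linear kernel only)
b <- -fit$rho                  # bias term
2 / sqrt(sum(w^2))             # margin width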

Implementation of SVM in R

We are going to implement the SVM algorithm in R using the following steps:

1. Install and Load Required Packages

We need to install and load the e1071 package, which contains the svm() function for training the model, along with caTools (for the train/test split), caret (for model evaluation) and ggplot2 (for visualization).

R
install.packages("e1071")
install.packages("caTools")
install.packages("ggplot2")
install.packages("caret")

library(e1071)
library(caTools)
library(ggplot2)
library(caret)

2. Loading the dataset 

We will use a dataset of social network ads stored in the file social.csv. We read it with the read.csv() function and display the first 6 rows using the head() function.

R
data = read.csv('/content/social.csv')
head(data)

Output:

[Sample data: the first six rows of the dataset]
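Before going further, it can also help to confirm the column names and types we rely on below (Gender, Age, EstimatedSalary and the target Purchased); str() is a quick base-R way to do this:

R
# Compact overview of column names, types and example values
str(data)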

3. Exploring the Data

We will explore our dataset by using the summary() function which provides a statistical summary of the dataset including measures like minimum, maximum, mean and quartiles.

R
summary(data)

Output:

[Statistical summary of each column]
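Accuracy numbers are only meaningful relative to the class balance, so it is also worth checking how the target variable is distributed. A quick sketch in base R:

R
# Counts and proportions of the target classes
table(data$Purchased)
prop.table(table(data$Purchased))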

4. Data Preprocessing

We prepare the data by encoding the categorical variable Gender, scaling the continuous features Age and EstimatedSalary, and splitting the dataset into training (75%) and test (25%) sets.

R
set.seed(123)

# Encode Gender numerically: Male = 0, Female = 1
data$Gender <- as.numeric(factor(data$Gender, levels = c("Male", "Female"))) - 1

# Standardize the continuous features to mean 0 and unit variance
data[, c("Age", "EstimatedSalary")] <- scale(data[, c("Age", "EstimatedSalary")])

# 75/25 train-test split, stratified on the target variable
split <- sample.split(data$Purchased, SplitRatio = 0.75)
training_set <- subset(data, split == TRUE)
test_set <- subset(data, split == FALSE)
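Note that scale() is applied before the split, so a little information about the test set leaks into the scaling; a stricter workflow would compute scaling statistics on the training set only. As a quick sanity check (a minimal sketch), the scaled columns should have mean roughly 0 and standard deviation roughly 1, up to sampling noise from the split:

R
# Scaled features should be approximately standardized
colMeans(training_set[, c("Age", "EstimatedSalary")])
apply(training_set[, c("Age", "EstimatedSalary")], 2, sd)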

5. Training the SVM Model

Now we train the SVM model using the svm() function, with a radial (RBF) kernel and gamma = 0.1. The model predicts whether a user purchased the product (Purchased) from the features Age, EstimatedSalary and Gender.

R
classifier <- svm(Purchased ~ Age + EstimatedSalary + Gender, 
                  data = training_set, 
                  type = 'C-classification', 
                  kernel = 'radial', 
                  gamma = 0.1)
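The cost and gamma hyperparameters largely determine how a radial-kernel SVM behaves. One way to choose them is a cross-validated grid search with e1071's tune() helper; the grid values below are illustrative assumptions, not tuned recommendations.

R
# Illustrative grid search over cost and gamma (10-fold CV by default)
set.seed(123)
tuned <- tune(svm, Purchased ~ Age + EstimatedSalary + Gender,
              data = training_set,
              type = 'C-classification', kernel = 'radial',
              ranges = list(cost = c(0.1, 1, 10),
                            gamma = c(0.01, 0.1, 1)))
summary(tuned)                   # CV error for each parameter pair
best_classifier <- tuned$best.model  # Could be used in place of classifier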

6. Making Predictions

Once the model is trained, we can use it to make predictions on the test set and compare them against the actual labels.

R
y_pred <- predict(classifier, newdata = test_set)

table(test_set$Purchased, y_pred)

Output:

[Confusion matrix of actual vs. predicted classes]

7. Evaluating the Model

We evaluate the model's performance using a confusion matrix, accuracy and other metrics such as precision, recall and F1-score.

R
# Confusion matrix with predictions in rows (the orientation caret expects)
cm_table <- table(Predicted = y_pred, Actual = test_set$Purchased)

# Accuracy = correctly classified cases / total cases
accuracy <- sum(diag(cm_table)) / sum(cm_table)
cat("Accuracy: ", accuracy)

confusionMatrix(cm_table)

Output:

[Accuracy and confusion-matrix statistics]
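By default, confusionMatrix() reports sensitivity and specificity. To surface the precision, recall and F1-score mentioned above, it accepts a mode argument; the positive = "1" setting below assumes the purchasing class is the one of interest:

R
# Precision / recall / F1 for the purchasing class (assumed positive = "1")
cm <- confusionMatrix(cm_table, positive = "1", mode = "prec_recall")
cm$byClass[c("Precision", "Recall", "F1")]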

8. Visualizing the Decision Boundary

We can also visualize the decision boundary using ggplot2. In the code below:

  • X1 = seq(min(training_set$Age) - 1, max(training_set$Age) + 1, by = 0.01): creates a sequence covering the Age range in small steps.
  • grid_set = expand.grid(Age = X1, EstimatedSalary = X2): generates a grid of Age and EstimatedSalary combinations, with named columns so predict() can find the features.
  • grid_set$Gender = median(training_set$Gender): sets a default Gender value for all grid points (the median here).
  • y_grid = predict(classifier, newdata = grid_set): predicts the class for each grid point using the trained classifier.
  • geom_tile(..., alpha = 0.3): fills the grid cells with the predicted class colors to draw the decision regions.
  • geom_point(..., size = 3, shape = 21): plots the training points, colored by their actual class.
  • scale_fill_manual(values = c('coral1', 'aquamarine')): sets the colors for the predicted classes in the grid.
  • scale_color_manual(values = c('green4', 'red3')): sets the colors for the actual training points.
R
X1 = seq(min(training_set$Age) - 1, max(training_set$Age) + 1, by = 0.01)
X2 = seq(min(training_set$EstimatedSalary) - 1, max(training_set$EstimatedSalary) + 1, by = 0.01)

# Name the grid columns so predict() can match the model's features
grid_set = expand.grid(Age = X1, EstimatedSalary = X2)
grid_set$Gender = median(training_set$Gender)  # Default Gender value for grid

y_grid = predict(classifier, newdata = grid_set)

ggplot() +
  geom_tile(data = grid_set, aes(x = Age, y = EstimatedSalary, fill = as.factor(y_grid)), alpha = 0.3) +
  geom_point(data = training_set, aes(x = Age, y = EstimatedSalary, color = as.factor(Purchased)), size = 3, shape = 21) +
  scale_fill_manual(values = c('coral1', 'aquamarine')) +
  scale_color_manual(values = c('green4', 'red3')) +
  labs(title = 'SVM Decision Boundary (Training set)', x = 'Age', y = 'Estimated Salary') +
  theme_minimal() +
  theme(legend.position = "none")

Output:

[Decision boundary plot on the training set]
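As an alternative to the manual ggplot2 grid, e1071 ships a plot() method for svm objects. With three predictors you choose the two plotted axes via a formula and hold the rest fixed with slice; the Gender = 0 value (Male under our encoding) is an illustrative assumption, and since the method expects a factor response we refit on a copy of the training set:

R
# Built-in e1071 plot; refit on a copy with a factor response
ts <- training_set
ts$Purchased <- factor(ts$Purchased)
fit <- svm(Purchased ~ Age + EstimatedSalary + Gender, data = ts,
           type = 'C-classification', kernel = 'radial', gamma = 0.1)
plot(fit, ts, Age ~ EstimatedSalary, slice = list(Gender = 0))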

In this article we implemented the SVM algorithm in R, from data preparation and model training to evaluating its performance with accuracy, precision, recall and F1-score.


