Linear Discriminant Analysis was originally developed by R.A. Fisher to classify subjects into one of the two clearly defined groups. It was later expanded to classify subjects inoto more than two groups. It helps to find linear combination of original variables that provide the best possible separation between the groups. Linear Discriminant Analysis is focused on maximizing the separability among known categories. The problem is when 2 features are not sufficient to capture the most of variation. In PCA, we solve this problem reducing the dimensionality by focusing on the feature with the most variation. LDA is like PCA, but is focused to maximize the separability between the two groups. PCA is unsupervised, but LDA is supervised.
data(iris)
library(psych)
pairs.panels(iris[1:4],
gap = 0,
bg = c("red", "green", "blue")[iris$Species],
pch = 21)
We are using the iris dataset with 4 numerical variables and 1 factor which has 3 levels as described above. We can also see that the numerical variables have different ranges, it is a good pratice to normalize the data. From the graph above we have scatterplots of each combination of variabels. In the upper triangle we have correlation coefficients. We can see that Sepal.Length and Petal.Length are good to separate between thr three Species. In other cases, there is a overlapping and not a clear separation between the three Species.
# Data Partitioning
set.seed(123)
ind <- sample(2, nrow(iris),
replace = TRUE,
prob = c(0.6, 0.4))
training <- iris[ind==1,]
testing <- iris[ind==2,]
# Linear Discriminant Analysis
library(MASS)
linear <- lda(Species~., data=training)
linear
Call:
lda(Species ~ ., data = training)
Prior probabilities of groups:
setosa versicolor virginica
0.3370787 0.3370787 0.3258427
Group means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa 4.946667 3.380000 1.443333 0.250000
versicolor 5.943333 2.803333 4.240000 1.316667
virginica 6.527586 2.920690 5.489655 2.048276
Coefficients of linear discriminants:
LD1 LD2
Sepal.Length 0.3629008 0.05215114
Sepal.Width 2.2276982 1.47580354
Petal.Length -1.7854533 -1.60918547
Petal.Width -3.9745504 4.10534268
Proportion of trace:
LD1 LD2
0.9932 0.0068
From the resul above we have the Coefficients of linear discriminants for each of the four variables. The first discriminant function LD1 is a linear combination of the four variables: (0.3629008 x Sepal.Length) + (2.2276982 x Sepal.Width) + (-1.7854533 x Petal.Length) + (-3.9745504 x Petal.Width). Note that Discriminant functions are scaled. We have aslo the Proportion of trace, the percentage separations archived by the first discriminant function LD1 is 99.32%.
Now we can create a Stacked Histogram of Discriminant Function values.
# Histogram
p <- predict(linear, training)
ldahist(data = p$x[,1], g = training$Species) # p$x[,1] give data from LD1
From the graph above, we have histogram from LD1, and w can see that the separatin between setosa and the oder two Species is quite large with no overlap. On the contrary, there is a certan amont of overlapping between versicolor and virginica. We already said that the percentage of separation archived by LD1 is 99.32%, that is we he can see a very clear separation from the histogram above. Now, we can try to do the same for LD2.
# Histogram
p <- predict(linear, training)
ldahist(data = p$x[,2], g = training$Species) # p$x[,1] give data from LD1
As we can see from the histogram here above LD2 we have a lot of overlap, which is not great. Now we can try to create the Bi-Plot.
# # Bi-Plot
library(ggord)
ggord(linear, training$Species, ylim = c(-10, 10))
From the Bi-Plot above, we have in the x-axis the LD1 and is able to separate the three Species quite well. There is some amount of overlap between versicolor in green and virginica in blue. We can also see that Sepal.Width and Sepal.Length are both in a positive direction. The contrary is for Petal.Width and Petal.Length.
Now we can build the Partition Plot.
# # Bi-Plot with Linear Discriminant Analysis Model
library(klaR)
partimat(Species~., data=training, method="lda")
From the Partition Plot above, we can see classification for eachof observation in the training dataset based on the Linear Discriminant Analysis Model, and for every combination of two variables. From the right bottom graph, we can see that setosa s is quite far away from the other two Species, and bewtween versicolor and virginica there is some amount of overlap. The graph above is for a Linear Discriminant, we can also use a Quadratic Discriminant Analysis Model.
# Bi-Plot with Quadratic Discriminant Analysis Model
partimat(Species~., data=training, method="qda")
# Confusion Matrix and Accuracy
p1 <- predict(linear, testing)$class
tab1 <- table(Predicted = p1, Actual = testing$Species)
tab1
Actual
Predicted setosa versicolor virginica
setosa 20 0 0
versicolor 0 19 1
virginica 0 1 20
accuracy1 <- sum(diag(tab1))/sum(tab1)
accuracy1
[1] 0.9672131
# Quadratic Discriminant Analysis
quadratic <- qda(Species~., data=training)
p2 <- predict(quadratic, testing)$class
tab2 <- table(Predicted = p2, Actual = testing$Species)
tab2
Actual
Predicted setosa versicolor virginica
setosa 20 0 0
versicolor 0 16 2
virginica 0 4 19
accuracy2 <- sum(diag(tab2))/sum(tab2)
accuracy2
[1] 0.9016393
From the Partition Plot above, now we have a curve to discriminate between Species. From the Accuracy estimation of the testing data, we can see that is higher with Linear Discriminant Analysis Model (96.72% vs. 90.16%), which is also confirmed comparing the confusion matrix for the linear discriminat (tab1) vs. the confusion matrix of the quadratic discriminant (tab2).