Classification — Statistical Learning
This is the third post in my blog series ‘Statistical Learning Notes’. It contains my notes on ‘Chapter 4 — Classification’ of ‘An Introduction to Statistical Learning (ISLR)’. Here I have tried to give an intuitive understanding of the key concepts and how they relate to classification.
Note: I suggest readers refer to the ISLR book if they want to dig further or look at worked examples.
Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category or class.
Given a feature vector X and a qualitative response Y taking values in the set C, the classification task is to build a function C(X) that takes the feature vector X as input and predicts the value of Y. Often we are more interested in estimating the probabilities that X belongs to each category in C.
We might think that linear regression is suitable for the classification task as well. However, linear regression can produce probability estimates less than zero or greater than one. Logistic regression is more appropriate.
Now suppose we have a response variable with three possible values. If we code the categories numerically (say, 1, 2, 3), the coding suggests an ordering and, in fact, implies that the differences between the categories are all the same. So we can see that linear regression is not appropriate here. Multi-class logistic regression or discriminant analysis are more appropriate.
Logistic regression models the probability that Y belongs to a particular category rather than modeling the response itself. It uses the logistic function to ensure a prediction between 0 and 1. The logistic function takes the form:

p(X) = e^(β0 + β1X) / (1 + e^(β0 + β1X))
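As a quick sketch (not from the book; the coefficient values here are made up for illustration), the logistic function can be computed in a few lines of NumPy, and its output always stays strictly between 0 and 1:

```python
import numpy as np

def logistic(x, beta0=-1.0, beta1=2.0):
    """Logistic function p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)).

    The coefficients are illustrative, not fitted values.
    """
    z = beta0 + beta1 * x
    return 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(-5, 5, 11)
ps = logistic(xs)
# Every value lies strictly in (0, 1), unlike linear regression output.
print(ps.min(), ps.max())
```

This is exactly why the S-shaped curve is preferred over a straight line for modeling probabilities.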
Estimating Regression Coefficients
Logistic regression uses a strategy called maximum likelihood to estimate the regression coefficients β0 and β1. The likelihood function is:

ℓ(β0, β1) = ∏(i: yi=1) p(xi) × ∏(i′: yi′=0) (1 − p(xi′))

This likelihood gives the probability of the observed 0s and 1s in the data. We pick β0 and β1 to maximize the likelihood of the observed data.
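As an illustration (not from the book), maximum likelihood estimation for logistic regression can be carried out with Newton's method on simulated data; the true coefficients below are invented so we can check that the estimates land near them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with (made-up) true coefficients beta0 = -1, beta1 = 2.
n = 500
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(-1 + 2 * x)))
y = rng.binomial(1, p_true)

# Maximize the log-likelihood
#   l(b0, b1) = sum_i [ y_i log p(x_i) + (1 - y_i) log(1 - p(x_i)) ]
# via Newton's method: gradient X^T (y - p), Hessian X^T W X with W = p(1 - p).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    grad = X.T @ (y - p)
    H = X.T @ (X * W[:, None])
    beta += np.linalg.solve(H, grad)

print(beta)  # estimates should land near the true values (-1, 2)
```

Statistical software performs essentially this iteration (often called iteratively reweighted least squares) under the hood.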
Logistic regression measures the accuracy of coefficient estimates using a quantity called the z-statistic. The z-statistic is similar to the t-statistic. The z-statistic for β1 is:

z = β^1 / SE(β^1)
A large z-statistic offers evidence against the null hypothesis.
In logistic regression, the null hypothesis H0: β1 = 0 implies that

p(X) = e^β0 / (1 + e^β0)

which means p(X) does not depend on X.
Once coefficients have been estimated, predictions can be made by plugging the coefficients into the model equation.
In general, the estimated intercept (β0) is of limited interest since it mainly captures the ratio of positive and negative classifications in the given data set.
Similar to linear regression, dummy variables can be used to accommodate qualitative predictors.
Multiple Logistic Regression
Using a strategy similar to that employed for linear regression, multiple logistic regression can be generalized as:

log(p(X) / (1 − p(X))) = β0 + β1X1 + β2X2 + ⋯ + βpXp
Maximum likelihood is also used to estimate β0,β1,…,βp in the case of multiple logistic regression.
In general, the scenario in which the result obtained with a single predictor does not match the result with multiple predictors, especially when there is correlation among the predictors, is referred to as confounding. More specifically, confounding describes situations in which the experimental controls do not adequately allow for ruling out alternative explanations for the observed relationship between the predictors and the response.
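Confounding is easiest to see in a simulation. In this illustrative sketch (not from the book; for brevity the fits use ordinary least squares, but the same phenomenon arises in logistic regression), the response truly depends only on x2, yet x1 looks important on its own because it is correlated with x2:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + 0.3 * rng.normal(size=n)   # x1 is strongly correlated with x2
y = 2.0 * x2 + rng.normal(size=n)          # the response depends only on x2

# Single-predictor fit: x1 appears to matter because it proxies for x2.
b_single = np.linalg.lstsq(np.column_stack([np.ones(n), x1]), y, rcond=None)[0]

# Multiple-predictor fit: once x2 is included, x1's coefficient shrinks toward 0.
b_multi = np.linalg.lstsq(np.column_stack([np.ones(n), x1, x2]), y, rcond=None)[0]

print(b_single[1], b_multi[1])
```

The single-predictor coefficient on x1 is large, while the multiple-regression coefficient on x1 is near zero — the apparent effect of x1 was entirely due to its correlation with x2.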
Logistic Regression For More Than Two Classes
Though multi-class logistic regression is possible, discriminant analysis tends to be the preferred means of handling multi-class classification.
Linear Discriminant Analysis (LDA)
While logistic regression models the conditional distribution of the response Y given the predictor(s) X, linear discriminant analysis (LDA) models the distribution of X in each of the classes separately, and then uses Bayes’ theorem to flip things around and obtain Pr(Y|X).
Linear discriminant analysis is popular when there are more than two response classes, because it also provides low-dimensional views of the data. Beyond its popularity, linear discriminant analysis also benefits from not being susceptible to some of the problems that logistic regression suffers from:
The parameter estimates for logistic regression can be surprisingly unstable when the response classes are well separated. Linear discriminant analysis does not suffer from this problem.
When n is small and the distribution of the predictors X is approximately normal in each of the response classes, linear discriminant analysis is more stable than logistic regression.
When we use normal (Gaussian) distributions for each class, this leads to linear or quadratic discriminant analysis. However, this approach is quite general, and other distributions can be used as well.
Classification With Bayes’ Theorem
Assuming a qualitative variable Y that can take on K ≥ 2 distinct, unordered values, the prior probability πk describes the probability that a given observation is associated with the kth class of the response variable Y.
The density function of X for an observation that comes from the kth class is defined as:

fk(x) = Pr(X = x | Y = k)
This means that fk(X) should be relatively large if there’s a high probability that an observation from the kth class features X=x. Conversely, fk(X) will be relatively small if it is unlikely that an observation in class k would feature X=x.
Following this intuition, Bayes’ theorem states:

Pr(Y = k | X = x) = πk fk(x) / Σ(l=1..K) πl fl(x)
where πk denotes the prior probability that the chosen observation comes from the kth class. This equation is sometimes abbreviated as pk(x).
pk(x)=Pr(Y=k|X) is also known as the posterior probability, or the probability that an observation belongs to the kth class, given the predictor value for that observation.
Estimating the prior probability (πk) is easy given a random sample of responses from the population.
Estimating the density function fk(X) tends to be harder, but making some assumptions about the form of the densities can simplify things. A good estimate for fk(X) allows for developing a classifier that approximates the Bayes’ classifier which has the lowest possible error rate since it always selects the class for which pk(x) is largest.
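As a small sketch (not from the book; the priors, means, and variance below are hypothetical known quantities rather than estimates), the posterior probabilities pk(x) can be computed directly from Bayes' theorem once the densities are assumed Gaussian:

```python
import numpy as np

# Hypothetical two-class setup: each class density f_k is a Gaussian with a
# known mean, the classes share a standard deviation, and the priors pi_k
# reflect assumed class proportions.
priors = np.array([0.7, 0.3])   # pi_1, pi_2
means = np.array([0.0, 2.0])
sd = 1.0                        # shared standard deviation

def posterior(x):
    """pk(x) = pi_k f_k(x) / sum_l pi_l f_l(x)  (Bayes' theorem)."""
    f = np.exp(-(x - means) ** 2 / (2 * sd ** 2)) / np.sqrt(2 * np.pi * sd ** 2)
    return priors * f / np.sum(priors * f)

p = posterior(1.0)
print(p)  # the two posterior probabilities; they sum to 1
```

At x = 1.0 the two class densities happen to be equal, so the posterior simply reproduces the priors — a nice sanity check that the priors matter even when the data are uninformative.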
Linear Discriminant Analysis For One Predictor
When only considering one predictor, if we assume that fk(x) has a normal or Gaussian distribution, the normal density is described by:

fk(x) = (1 / (√(2π) σk)) exp(−(x − μk)² / (2σ²k))

where μk is the mean parameter for the kth class and σ²k is the variance parameter for the kth class.
The density function can be further simplified by assuming that the variance terms, σ²1, …, σ²K, are all equal, in which case the shared variance is denoted by σ².
Plugging the simplified normal density function into Bayes’ theorem yields:

pk(x) = πk (1/(√(2π) σ)) exp(−(x − μk)² / (2σ²)) / Σ(l=1..K) πl (1/(√(2π) σ)) exp(−(x − μl)² / (2σ²))
It can be shown that by taking the log of both sides and removing terms that are not class specific, a simpler equation can be extracted:

δk(x) = x · (μk/σ²) − μk²/(2σ²) + log(πk)

Using the above equation, an observation is classified by assigning it to the class for which the discriminant value is largest.
Linear discriminant analysis uses the following estimated values for μ^k and σ²:

μ^k = (1/nk) Σ(i: yi=k) xi

σ² = (1/(n − K)) Σ(k=1..K) Σ(i: yi=k) (xi − μ^k)²
where n is the total number of training observations and nk is the number of training observations in class k. The estimate of μ^k is the average value of x for all training observations in class k. The estimate of σ² can be seen as a weighted average of the sample variance for all k classes.
When the class prior probabilities, π1, …, πK, are not known, they can be estimated using the proportion of training observations that fall into the kth class:

π^k = nk / n
Plugging the estimates for μ^k, σ², and π^k into the discriminant function yields the linear discriminant analysis classifier:

δ^k(x) = x · (μ^k/σ²) − (μ^k)²/(2σ²) + log(π^k)
which assigns an observation X=x to whichever class yields the largest value.
This classifier is described as linear because the discriminant function δ^k(x) is linear in terms of x and not a more complex function.
The linear discriminant analysis classifier assumes that the observations from each class follow a normal distribution with a class specific average vector and constant variance (σ²), and uses these simplifications to build a Bayes’ theorem based classifier.
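Putting the one-predictor pieces together, here is an illustrative from-scratch LDA classifier on simulated data (the class means, shared variance, and sample sizes are all made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated training data: two classes, different means, shared variance.
x0 = rng.normal(-1.0, 1.0, size=200)   # class 0
x1 = rng.normal(1.5, 1.0, size=300)    # class 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(200), np.ones(300)]).astype(int)

n, K = len(x), 2
pi_hat = np.array([np.mean(y == k) for k in range(K)])     # n_k / n
mu_hat = np.array([x[y == k].mean() for k in range(K)])    # class-wise mean of x
# Pooled variance: weighted average of the within-class sample variances.
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)

def delta(xval):
    """Discriminant: delta_k(x) = x*mu_k/s2 - mu_k^2/(2*s2) + log(pi_k)."""
    return xval * mu_hat / sigma2_hat - mu_hat ** 2 / (2 * sigma2_hat) + np.log(pi_hat)

# Assign each observation to the class with the largest discriminant value.
pred = np.array([np.argmax(delta(v)) for v in x])
print((pred == y).mean())  # training accuracy
```

With a class separation of 2.5 standard deviations, the classifier recovers most labels, approximating the Bayes classifier as the estimates approach the true parameters.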
Linear Discriminant Analysis with Multiple Predictors
Multivariate linear discriminant analysis assumes that X=(X1,X2,…,Xp) comes from a multivariate normal distribution with a class-specific mean vector and a common covariance matrix.
The multivariate Gaussian distribution used by linear discriminant analysis assumes that each predictor follows a one-dimensional normal distribution with some correlation between the predictors. The more correlation between predictors, the more the bell shape of the normal distribution will be distorted.
A p-dimensional variable X can be indicated to have a multivariate Gaussian distribution with the notation X∼N(μ,Σ) where E(X)=μ is the mean of X (a vector with p components) and Cov(X)=Σ is the p x p covariance matrix of X.
As was the case for one-dimensional linear discriminant analysis, it is necessary to estimate the unknown parameters μ1,…,μk, π1,…,πk, and Σ. The formulas used in the multi-dimensional case are similar to those used with just a single dimension.
Since, even in the multivariate case, the linear discriminant analysis decision rule relates to X in a linear fashion, the name linear discriminant analysis holds. And as with other methods, the higher the ratio of the number of parameters p to the number of samples n, the more likely overfitting becomes.
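The multivariate case is a direct generalization: the discriminant δk(x) = xᵀΣ⁻¹μk − ½μkᵀΣ⁻¹μk + log πk is still linear in x. Here is a hedged from-scratch sketch on simulated two-dimensional data (the mean vectors and common covariance matrix are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two 2-D classes with different mean vectors and a common covariance matrix.
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])
mu = np.array([[-1.0, 0.0], [1.0, 1.0]])
L = np.linalg.cholesky(Sigma)
X0 = rng.normal(size=(200, 2)) @ L.T + mu[0]
X1 = rng.normal(size=(200, 2)) @ L.T + mu[1]
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 200)

# Parameter estimates: class means, priors, and the pooled covariance matrix.
mu_hat = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
pi_hat = np.array([np.mean(y == k) for k in (0, 1)])
centered = np.vstack([X[y == k] - mu_hat[k] for k in (0, 1)])
Sigma_hat = centered.T @ centered / (len(X) - 2)
Sinv = np.linalg.inv(Sigma_hat)

# delta_k(x) = x^T S^-1 mu_k - 0.5 mu_k^T S^-1 mu_k + log pi_k  (linear in x).
deltas = (X @ Sinv @ mu_hat.T
          - 0.5 * np.einsum('ki,ij,kj->k', mu_hat, Sinv, mu_hat)
          + np.log(pi_hat))
pred = np.argmax(deltas, axis=1)
print((pred == y).mean())
```

Because every term multiplying x is a fixed vector, the resulting decision boundary is a straight line (a hyperplane in higher dimensions).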
In general, binary classifiers are subject to two kinds of error: false positives and false negatives. A confusion matrix can be a useful way to display these error rates. Class-specific performance is also important to consider because in some cases a particular class will contain the bulk of the error.
The term sensitivity refers to the fraction of actual positives that are correctly classified as positive (the true positive rate), while specificity refers to the fraction of actual negatives that are correctly classified as negative (the true negative rate).
A ROC curve is a useful graphic for displaying the two types of error rates for all possible thresholds. ROC is a historic acronym that comes from communications theory and stands for receiver operating characteristics.
The overall performance of a classifier summarized over all possible thresholds is quantified by the area under the ROC curve.
A more ideal ROC curve will hold more tightly to the top left corner which, in turn, will increase the area under the ROC curve. A classifier that performs no better than chance is expected to have an area under the ROC curve of about 0.5 when evaluated against a test data set.
In summary, varying the classifier threshold changes its true positive and false positive rate, also called sensitivity and (1−specificity).
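An ROC curve is easy to build by hand: sweep the threshold over every observed score and record the true positive and false positive rates at each one. This illustrative sketch (not from the book; the score distributions are invented) also computes the area under the curve with the trapezoidal rule:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical classifier scores: positives tend to score higher than negatives.
scores = np.concatenate([rng.normal(1.0, 1.0, 300), rng.normal(0.0, 1.0, 300)])
labels = np.repeat([1, 0], 300)

# Sweep thresholds from high to low; at each one compute the
# true positive rate (sensitivity) and false positive rate (1 - specificity).
thresholds = np.sort(scores)[::-1]
tpr = np.array([(scores[labels == 1] >= t).mean() for t in thresholds])
fpr = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])

# Area under the ROC curve via the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(auc)
```

With overlapping score distributions like these, the AUC lands well above the chance level of 0.5 but well below the perfect value of 1.0.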
Quadratic Discriminant Analysis
Quadratic discriminant analysis offers an alternative approach to linear discriminant analysis that makes most of the same assumptions, except that quadratic discriminant analysis assumes that each class has its own covariance matrix.
The quadratic discriminant analysis Bayes classifier gets its name from the fact that it is a quadratic function in terms of x.
The choice between a shared covariance matrix (like that assumed in linear discriminant analysis) and a class-specific covariance matrix (like that assumed in quadratic discriminant analysis) amounts to a bias-variance trade-off.
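One concrete way to see the bias-variance trade-off is to count covariance parameters. A p x p symmetric covariance matrix has p(p+1)/2 free parameters; LDA estimates one such matrix while QDA estimates K of them, so QDA's parameter count (and hence its variance) grows much faster with p. A tiny sketch:

```python
# Number of covariance parameters estimated, for p predictors and K classes.
def cov_params_lda(p):
    """LDA: one shared p x p covariance matrix."""
    return p * (p + 1) // 2

def cov_params_qda(p, K):
    """QDA: one p x p covariance matrix per class."""
    return K * p * (p + 1) // 2

for p in (2, 10, 50):
    print(p, cov_params_lda(p), cov_params_qda(p, K=3))
```

At p = 50 with three classes, QDA must estimate 3825 covariance parameters versus LDA's 1275, which is why QDA needs substantially more training data to avoid overfitting.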
Comparing Classification Methods
Since logistic regression and linear discriminant analysis are both linear in terms of x, the primary difference between the two methods is their fitting procedures. Linear discriminant analysis assumes that observations come from a Gaussian distribution with a common covariance matrix, and as such, outperforms logistic regression in cases where these assumptions hold true.
K-nearest neighbors can outperform logistic regression and linear discriminant analysis when the decision boundary is highly non-linear, but at the cost of a less interpretable model.
Quadratic discriminant analysis falls somewhere between the linear approaches of linear discriminant analysis and logistic regression and the non-parametric approach of K-nearest neighbors. Since quadratic discriminant analysis models a quadratic decision boundary, it has more capacity for modeling a wider range of problems. Quadratic discriminant analysis is not as flexible as K-nearest neighbors; however, it can perform better than K-nearest neighbors when there are fewer training observations, because of its higher bias and lower variance.
Linear discriminant analysis and logistic regression will perform well when the true decision boundary is linear. Quadratic discriminant analysis may give better results when the decision boundary is moderately non-linear. Non-parametric approaches like K-nearest neighbors may give better results when the decision boundary is more complex and the right level of smoothing is employed.
As was the case in the regression setting, it is possible to apply non-linear transformations to the predictors to better accommodate non-linear relationships between the response and the predictors. The effectiveness of this approach will depend on whether or not the increase in variance introduced by the increase in flexibility is offset by the reduction in bias.
It is possible to add quadratic terms and cross products to the linear discriminant analysis model such that it has the same form as quadratic discriminant analysis, however the parameter estimates for each of the models would be different. In this fashion, it’s possible to build a model that falls somewhere between linear discriminant analysis and quadratic discriminant analysis.
Thank you for reading my post. I regularly write about Data & Technology on LinkedIn & Medium. If you would like to read my future posts then simply ‘Connect’ or ‘Follow’. Also feel free to listen to me on SoundCloud.