
5 Reasons You Should Never Use PCA For Feature Selection

Source: https://blog.kxy.ai/5-reasons-you-should-never-use-pca-for-feature-selection/

Principal Component Analysis, or PCA, is one of the most consequential dimensionality reduction algorithms ever invented.

Unfortunately, like all popular tools, PCA is often used for unintended purposes, sometimes abusively. One such purpose is feature selection.

In this article, we give you 5 key reasons never to use PCA for feature selection. But first, let's briefly review the inner workings of PCA.

What Is PCA?

The Problem

Let us assume we have a vector of inputs $x := (x_1, \dots, x_d) \in \mathbb{R}^d$, which has mean 0 to simplify the argument (i.e. $\mathbb{E}(x) = 0$).

We are interested in reducing the size $d$ of our vector, without losing too much information. Here, a proxy for the information content of $x$ is its energy, defined as $E(x) := \mathbb{E}(\|x\|^2)$.

The challenge is that the information content of x is usually unevenly spread out across its coordinates.

In particular, coordinates can be positively or negatively correlated, which makes it hard to gauge the effect of removing a coordinate on the overall energy.

Let's take a concrete example. In the simplest bivariate case (d=2),

$$E(x) = \text{Var}(x_1) + \text{Var}(x_2) + 2\rho(x_1, x_2)\sqrt{\text{Var}(x_1)\text{Var}(x_2)},$$

where $\rho(x_1, x_2)$ is the correlation between the two coordinates, and $\text{Var}(x_i)$ is the variance of $x_i$.

Let's assume that $x_1$ has a higher variance than $x_2$. Clearly, the effect of removing $x_2$ on the energy, namely
$$E(x) - \text{Var}(x_1) = \text{Var}(x_2) + 2\rho(x_1, x_2)\sqrt{\text{Var}(x_1)\text{Var}(x_2)},$$
does not just depend on $x_2$; it also depends on the correlation between $x_1$ and $x_2$, and on the variance/energy of $x_1$!

When $d > 2$, things get even more complicated. The energy now reads
$$E(x) = \sum_{i=1}^d \sum_{j=1}^d \rho(x_i, x_j) \sqrt{\text{Var}(x_i)\text{Var}(x_j)},$$
and analyzing the effect of removing any coordinate on the energy becomes a lot harder.

The aim of PCA is to find a feature vector $z := (z_1, \dots, z_d) \in \mathbb{R}^d$, obtained from $x$ by a linear transformation, namely $z = Wx$, satisfying the following conditions:

  1. $z$ has the same energy as $x$: $\mathbb{E}(\|x\|^2) = \mathbb{E}(\|z\|^2)$.
  2. $z$ has decorrelated coordinates: $\forall i \neq j, \ \rho(z_i, z_j) = 0$.
  3. Coordinates of $z$ have decreasing variances: $\text{Var}(z_1) \geq \text{Var}(z_2) \geq \dots \geq \text{Var}(z_d)$.

When the 3 conditions above are met, we have
$$E(x) = E(z) = \sum_{i=1}^d \text{Var}(z_i).$$
Thus, dimensionality reduction can be achieved by using the features $z_p := (z_1, \dots, z_p)$ instead of the original features $x := (x_1, \dots, x_d)$, where $p < d$ is chosen so that the energy loss, namely
$$E(z) - E(z_p) = \sum_{i=p+1}^d \text{Var}(z_i),$$
is only a small fraction of the total energy $E(z)$:
$$\frac{\sum_{i=p+1}^d \text{Var}(z_i)}{\sum_{i=1}^d \text{Var}(z_i)} \ll 1.$$
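To make this concrete, here is a minimal sketch of the rule above using scikit-learn's PCA on synthetic data; the 95% threshold and the data-generating process are arbitrary illustrative choices.

```python
# Choosing p from the cumulative explained variance (synthetic data; the 95%
# threshold is an arbitrary choice).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10)) @ rng.standard_normal((10, 10))  # correlated features

pca = PCA().fit(X)
retained = np.cumsum(pca.explained_variance_ratio_)   # energy kept by the first p components
p = int(np.searchsorted(retained, 0.95)) + 1          # smallest p keeping at least 95%
print(p, retained[p - 1])
```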

The Solution

The three conditions above induce a unique solution.

The conservation of energy equation implies: $\mathbb{E}(\|z\|^2) = \mathbb{E}(x^T W^T W x) = \mathbb{E}(x^T x) = \mathbb{E}(\|x\|^2)$. A sufficient condition for this to hold is that $W$ be an orthogonal matrix: $W^T W = W W^T = I$. In other words, its columns (resp. rows) form an orthonormal basis of $\mathbb{R}^d$.

As for the second condition, it implies that the autocovariance matrix
$$\text{Cov}(z) = W\,\mathbb{E}(xx^T)\,W^T = W\,\text{Cov}(x)\,W^T$$
should be diagonal.

Let us write $\text{Cov}(x) = U D U^T$, the Singular Value Decomposition of $\text{Cov}(x)$, where the columns of the orthogonal matrix $U$ are orthonormal eigenvectors of the (positive semidefinite) matrix $\text{Cov}(x)$, sorted in decreasing order of eigenvalues, and $D$ is the diagonal matrix of the corresponding eigenvalues.

Plugging $\text{Cov}(x) = U D U^T$ into the equation $\text{Cov}(z) = W\,\text{Cov}(x)\,W^T$, we see that, to satisfy the second condition, it is sufficient that $W U = I = U^T W^T$, which is equivalent to $W = U^{-1} = U^T$.

Note that, because $U$ is orthogonal, the choice $W = U^T$ also satisfies the first condition.

Finally, given that the columns of $U$ are sorted in decreasing order of eigenvalues, the variances $\text{Var}(z_i) = \text{Cov}(z)[i, i] = D[i, i]$ also form a decreasing sequence, which satisfies the third condition.

Interestingly, it can be shown that $W = U^T$ is the only loading matrix of a linear transformation satisfying the three conditions above.

Coordinates of $z$ are called principal components, and the transformation $x \to U^T x$ is the Principal Component Analysis.
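As a sanity check, here is a small numpy sketch of the construction above on synthetic data: we take $W = U^T$ from the eigendecomposition of $\text{Cov}(x)$ and verify that $z = Wx$ satisfies the three conditions.

```python
# Building W = U^T from Cov(x) and checking the three PCA conditions
# (synthetic data; all choices are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 4)) @ rng.standard_normal((4, 4))  # correlated inputs
X = X - X.mean(axis=0)               # enforce the zero-mean assumption

C = np.cov(X, rowvar=False)
eigvals, U = np.linalg.eigh(C)       # eigh returns eigenvalues in increasing order
order = np.argsort(eigvals)[::-1]    # re-sort in decreasing order of eigenvalue
U = U[:, order]
W = U.T                              # the loading matrix W = U^T

Z = X @ W.T                          # z = W x, applied to every row of X
Cz = np.cov(Z, rowvar=False)

print(np.isclose(np.trace(Cz), np.trace(C)))                  # condition 1: energy conserved
print(np.allclose(Cz - np.diag(np.diag(Cz)), 0, atol=1e-8))   # condition 2: decorrelated coordinates
print(np.all(np.diff(np.diag(Cz)) <= 1e-8))                   # condition 3: decreasing variances
```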

5 Reasons Not To Use PCA For Feature Selection

Now that we are on the same page about what PCA is, let me give you 5 reasons why it is not suitable for feature selection.

When used for feature selection, data scientists typically regard $z_p := (z_1, \dots, z_p)$ as a feature vector that contains fewer but richer representations than the original input $x$ for predicting a target $y$.

Reason 1: Conservation of energy does not guarantee conservation of signal

The essence of PCA is that the extent to which dimensionality reduction is lossy is driven by the information content (energy in this case) that is lost in the process. However, for feature selection, what we really want is to make sure that reducing dimensionality will not reduce performance!

Unfortunately, maximizing the information content or energy of the features $z_p := (z_1, \dots, z_p)$ does not necessarily maximize their predictive power!

Think of the predictive power of $z_p$ as the signal part of its overall energy or, equivalently, the fraction of its overall energy that is useful for predicting the target $y$.

We may decompose an energy into signal and noise as
$$S(z_p) + N(z_p) = E(z_p) \leq E(x) = S(x) + N(x),$$
where $N(z_p) := \mathbb{E}(\|z_p\|^2 \,|\, y)$ is the noise component, and $S(z_p) := \mathbb{E}(\|z_p\|^2) - \mathbb{E}(\|z_p\|^2 \,|\, y)$ is the signal.

Clearly, while PCA ensures that $E(x) \approx E(z_p)$, we may easily find ourselves in a situation where PCA has wiped out all the signal that was originally in $x$ (i.e. $S(z_p) \approx 0$)! The lower the Signal-to-Noise Ratio (SNR), the more likely this is to happen.

Fundamentally, for feature selection, what we want is conservation of signal, $S(x) \approx S(z_p)$, not conservation of energy.

Note that, if instead of using the energy as the measure of information content we used the entropy, the noise would have been the conditional entropy $h(z_p | y)$, and the signal would have been the mutual information $I(y; z_p)$.
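To illustrate reason 1 on a toy example (synthetic data and scikit-learn estimators; the variances and noise level are arbitrary choices), below the high-variance direction carries no signal while a low-variance direction carries all of it, so keeping the top principal component preserves most of the energy but destroys the predictive power.

```python
# Toy illustration of reason 1 (synthetic data; all choices are arbitrary).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
x1 = 10.0 * rng.standard_normal(n)   # high variance, unrelated to the target
x2 = rng.standard_normal(n)          # low variance, carries all the signal
X = np.column_stack([x1, x2])
y = x2 + 0.1 * rng.standard_normal(n)

# Keeping one principal component preserves ~99% of the energy (the x1 direction) ...
Z1 = PCA(n_components=1).fit_transform(X)

# ... but throws away nearly all the signal.
print(cross_val_score(LinearRegression(), X, y, cv=5).mean())   # R^2 close to 1
print(cross_val_score(LinearRegression(), Z1, y, cv=5).mean())  # R^2 close to 0
```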

Reason 2: Conservation of energy is antithetical to feature selection

Fundamentally, preserving the energy of the original feature vector conflicts with the objectives of feature selection.

Feature selection is most needed when the original feature vector x contains coordinates that are uninformative about the target y, whether they are used by themselves, or in conjunction with other coordinates.

In such a case, removing the useless feature(s) is bound to reduce the energy of the feature vector. The more useless features there are, the more energy we will lose, and that's OK!

Let's take a concrete example in the bivariate case $x = (x_1, x_2)$ to illustrate this. Let's assume $x_2$ is uninformative about $y$ and $x_1$ is almost perfectly correlated with $y$.

Saying that $x_2$ is uninformative about the target $y$ means that it ought to be independent of $y$ both unconditionally (i.e. $I(y; x_2) = 0$) and conditionally on $x_1$ (i.e. $I(y; x_2 | x_1) = 0$).

This can occur, for instance, when $x_2$ is completely random (i.e. independent of both $y$ and $x_1$). In such a case, we absolutely need to remove $x_2$, but doing so would inevitably reduce the energy by $\mathbb{E}(\|x_2\|^2)$.

Note that, when both $x_1$ and $x_2$ have been standardized, as is often the case before applying PCA, removing $x_2$, which is the optimal thing to do from a feature selection standpoint, would result in a 50% energy loss!

Even worse, in this example, $x_1$ and $x_2$ happen to be principal components (i.e. $U = I$) associated with the exact same eigenvalue. Thus, PCA is unable to decide which one to keep, even though $x_2$ is clearly useless and $x_1$ is almost perfectly correlated with the target!
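Here is a small simulation of that bivariate example (synthetic data; the correlation level is an arbitrary choice): after standardization, PCA assigns the useless feature and the highly predictive feature virtually the same eigenvalue.

```python
# Simulation of the bivariate example above (synthetic data; arbitrary choices).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 10_000
y = rng.standard_normal(n)
x1 = y + 0.05 * rng.standard_normal(n)  # almost perfectly correlated with the target
x2 = rng.standard_normal(n)             # independent of both y and x1
X = StandardScaler().fit_transform(np.column_stack([x1, x2]))

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)                       # roughly [0.5, 0.5]
print(np.corrcoef(x1, y)[0, 1], np.corrcoef(x2, y)[0, 1])  # ~1.0 vs ~0.0
# PCA sees two components with (nearly) equal eigenvalues and cannot tell the
# useless feature from the highly predictive one; dropping either component
# costs about half of the energy.
```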

Reason 3: Decorrelation of features does not imply maximum complementarity

It is easy to think that, because two features are decorrelated, each must bring something new to the table. That is certainly true, but the 'new thing' decorrelated features bring is energy or information content, not necessarily signal!

Much of that new energy can be pure noise. In fact, features that are completely random are decorrelated with useful features, yet they cannot possibly complement them for predicting the target y; they are useless.
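A quick sketch of this point (synthetic data, scikit-learn estimators): a pure-noise feature is decorrelated from a useful one, yet appending it brings no additional predictive power.

```python
# Sketch of reason 3 (synthetic data; arbitrary choices).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000
useful = rng.standard_normal(n)
noise = rng.standard_normal(n)            # decorrelated from `useful` and from y
y = useful + 0.1 * rng.standard_normal(n)

print(np.corrcoef(useful, noise)[0, 1])   # ~0: the two features are decorrelated

X_one = useful.reshape(-1, 1)
X_two = np.column_stack([useful, noise])
print(cross_val_score(LinearRegression(), X_one, y, cv=5).mean())  # R^2 ...
print(cross_val_score(LinearRegression(), X_two, y, cv=5).mean())  # ... essentially unchanged
```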

Reason 4: Learning patterns from principal components could be harder than from original features

When PCA is used for feature selection, new features are constructed.

In general, the primary goal of feature construction is to simplify the relationship between inputs and the target into one that the models in our toolbox can reliably learn.

By linearly combining previously constructed features, PCA creates new features that can be harder to interpret and that can have a more complex relationship with the target.

The questions you should be asking yourself before applying PCA are:

  • Does linearly combining my features make any sense?
  • Can I think of an explanation for why the linearly combined features could have as simple a relationship to the target as the original features?

If the answer to either question is no, then PCA features would likely be less useful than original features.

As an illustration, imagine we want to predict a person's income using, among other features, the GPS coordinates of her primary residence, her age, her number of children, and the number of hours she works per week.

While it is easy to see how a tree-based learner could exploit these features, linearly combining them would produce features that make little sense, and from which tree-based methods would struggle to learn anything meaningful.
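The following toy comparison (synthetic data, scikit-learn estimators; the depth limit is an arbitrary choice) makes the point: a shallow tree learns an axis-aligned rule on the raw features easily, but struggles once PCA has linearly recombined them.

```python
# Toy comparison for reason 4 (synthetic data; arbitrary choices).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, d = 5000, 10
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated features
y = (X[:, 0] > 0).astype(int)   # axis-aligned rule on a single raw feature

Z = PCA().fit_transform(X)      # same information, linearly recombined

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(cross_val_score(tree, X, y, cv=5).mean())  # ~1.0 on the raw features
print(cross_val_score(tree, Z, y, cv=5).mean())  # lower on the principal components
```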

Reason 5: Feature selection ought to be model-specific

Feature selection serves one primary goal: removing useless features from a set of candidates. As explained in this article, feature usefulness is a model-specific notion. A feature can very well be useful for a model, but not so much for another.

PCA, however, is model-agnostic. In fact, it does not even utilize any information about the target.
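By contrast, a model-specific selector uses both the target and a model of interest. The sketch below (scikit-learn's RFE with a logistic regression, used purely as an illustration) highlights the difference: RFE is fit on $(x, y)$ with a specific model, while PCA never sees $y$.

```python
# Contrast with a model-specific selector (illustration only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = (X[:, 2] + 0.1 * rng.standard_normal(1000) > 0).astype(int)  # only feature 2 matters

# RFE ranks features for a *given* model, using the target.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1).fit(X, y)
print(selector.support_)          # expected to single out feature 2 for this model

# PCA never sees y; its components are driven by the energy of X alone.
pca = PCA(n_components=1).fit(X)
print(pca.components_)
```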

