
It’s not, but the typical use of PCA as a dimensionality-reduction step before a downstream classifier is to introduce a bottleneck that prevents overfitting. In some sense, that can be thought of as a way to avoid explicit feature selection by using an unsupervised method to discover the population covariance structure.


That's precisely what the blog post deals with.

In that PCA pre-processing step, nothing guarantees that principal components are better representations for your problem than the original inputs. In fact, PCA has nothing to do with your target, so how could it guarantee that principal components are a better representation for predicting it?

Similarly, understanding the covariance structure of your original inputs will not necessarily help you predict your target better.

Here's a simple example illustrating this. Let x be a single feature that is highly informative about y, and let z be a (large) d-dimensional, highly structured vector that is independent of both x and y. Now consider using [x, z] to predict y.

In this case, x happens to be a principal component because x and z are independent; it is associated with eigenvector [1, 0, ..., 0] and eigenvalue Var(x). All other eigenvectors are of the form [0, u_1, ..., u_d], where [u_1, ..., u_d] is an eigenvector of Cov(z).
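
To spell out the step: independence of x and z makes their cross-covariance zero, so the covariance matrix of [x, z] is block diagonal and its eigenvectors split exactly as described. Here u and λ are just notation for an arbitrary eigenpair of Cov(z):

```latex
\operatorname{Cov}\!\left(\begin{bmatrix} x \\ z \end{bmatrix}\right)
  = \begin{bmatrix} \operatorname{Var}(x) & 0^\top \\ 0 & \operatorname{Cov}(z) \end{bmatrix}
  =: \Sigma,
\qquad
\Sigma \begin{bmatrix} 1 \\ 0 \end{bmatrix}
  = \operatorname{Var}(x) \begin{bmatrix} 1 \\ 0 \end{bmatrix},
\qquad
\Sigma \begin{bmatrix} 0 \\ u \end{bmatrix}
  = \lambda \begin{bmatrix} 0 \\ u \end{bmatrix}
  \quad \text{whenever } \operatorname{Cov}(z)\, u = \lambda u .
```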

All it would take for x to be the very last (i.e. 'least important') principal component is for Var(x) to be smaller than all eigenvalues of Cov(z), which is easily conceivable, irrespective of y! In your quest for a lower-dimensional 'bottleneck' using PCA, you would end up removing x, the only useful feature for predicting y! That will certainly not reduce overfitting.
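
A quick way to check this numerically is a small simulation. The sketch below is only illustrative: the sample size, the scales, the low-rank construction of z, and helper names such as `r_squared` are my own assumptions, and it uses plain NumPy (eigendecomposition of the sample covariance plus in-sample least squares) rather than any particular PCA library.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 10  # assumed sample size and dimension of z

# x: low-variance feature that drives y (Var(x) is about 0.01)
x = 0.1 * rng.standard_normal(n)
y = x + 0.01 * rng.standard_normal(n)

# z: structured (3 latent factors), high variance, independent of x and y.
# Cov(z) = 9 * M^T M + I, so every eigenvalue of Cov(z) is at least 1.
latent = rng.standard_normal((n, 3))
mixing = rng.standard_normal((3, d))
z = 3.0 * latent @ mixing + rng.standard_normal((n, d))

X = np.column_stack([x, z])
Xc = X - X.mean(axis=0)

# PCA via eigendecomposition of the sample covariance, sorted by
# decreasing eigenvalue (index 0 is the 'most important' component).
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Which component carries x? Row 0 of eigvecs holds the loadings on x.
x_component = int(np.argmax(np.abs(eigvecs[0, :])))

def r_squared(features, target):
    """In-sample R^2 of an ordinary least-squares fit with intercept."""
    design = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1.0 - residual.var() / target.var()

k = 5  # size of the PCA 'bottleneck'
scores = Xc @ eigvecs[:, :k]

r2_raw = r_squared(X, y)       # all original inputs
r2_pca = r_squared(scores, y)  # top-k principal components only

print("component index carrying x:", x_component, "of", d)
print("R^2 with raw inputs:", round(r2_raw, 3))
print("R^2 with top-k PCs: ", round(r2_pca, 3))
```

With these assumed scales, x lands in the last component, so any bottleneck that keeps fewer than d + 1 components discards essentially all of the signal about y.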

PCA and other autoencoders work well as a pre-processing step when there are structural reasons to believe that low energy loss (or low reconstruction error) coincides with low signal loss. In tabular data, this tends to be the exception, not the norm.



