The article is silly because PCA cannot select features. It is all about dimensionality reduction. You should think of PCA as the equivalent of VAEs from the neural network world. The idea would be something like this: you have big images (let's say 4k) and they are too expensive to train with or store forever. So, you collect a training set, train a PCA on these images, and then you can convert your 4k images to 720p or even to 10 numbers, which you then use to predict/train whatever you want. Of course, we have algorithms that scale images, but maybe all your images are of cats and there is a specific linear transformation that preserves more of the information in the 4k image than simple scaling does. The implicit point here is that you are still collecting 4k images but immediately compressing them down using your trained PCA transformation.
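A minimal sketch of that compression idea (scaled way down so it actually runs; the random array is just a stand-in for a real training set of flattened cat images, and the sizes are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((500, 64 * 64))        # 500 flattened "images", 4096 pixels each

pca = PCA(n_components=10).fit(X)     # learn the linear transformation once, offline

codes = pca.transform(X)              # shape (500, 10): each image is now 10 numbers
approx = pca.inverse_transform(codes) # best linear reconstruction from those 10 numbers
```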
So, although you have fewer numbers than before, you still need to collect the original data. A real feature selection process would be able to say something like: "the proximity of the closest Applebees is not important for predicting house prices, you should probably stop wasting your time calculating this number". As others have mentioned, L1-regularized regression (lasso) or some statistical procedure to identify useless features is typically how this is done. I would also add that domain knowledge is probably your #1 feature selection tool, because we have to restrict which variables we feed in in the first place, and deciding which data to prioritize is inherently selecting features.
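For example, here is a hedged sketch of the lasso route on a made-up housing dataset where the Applebees distance is irrelevant by construction (the feature names, scales, and alpha are all hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
sqft = rng.normal(1500, 400, n)
age = rng.normal(30, 10, n)
applebees_dist = rng.normal(5, 2, n)                      # irrelevant by construction
price = 200 * sqft - 1000 * age + rng.normal(0, 5000, n)  # price ignores applebees_dist

X = np.column_stack([sqft, age, applebees_dist])
X = (X - X.mean(axis=0)) / X.std(axis=0)                  # put features on a common scale
model = Lasso(alpha=1000.0).fit(X, price)

# The coefficient on applebees_dist should be driven to (near) zero,
# telling you that you can stop collecting that number.
print(dict(zip(["sqft", "age", "applebees_dist"], model.coef_)))
```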
Dimensionality reduction is compressing data in a way that retains the most important information for the task.
Feature selection is removing unimportant information (keeping/collecting, or selecting, only the important parts).
Both cut down on the amount of data you end up with, but one does it by finding a smaller representation, while the other does it by discarding unnecessary data (or, rather, telling you which data is necessary, so you can stop collecting the unnecessary data).
But still, if some inputs are redundant, shouldn't this be somehow apparent in the eigenvectors/eigenvalues of the covariance matrix (making PCA an indirect feature selection algorithm)?
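A quick sketch of that intuition on synthetic data: a perfectly redundant column does show up as a (near-)zero eigenvalue of the covariance matrix, but the corresponding eigenvector mixes the original columns rather than pointing at a single feature you could stop collecting:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = 2 * a + 3 * b                       # redundant: an exact linear combination of the others

X = np.column_stack([a, b, c])
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

print(eigvals)        # the smallest is ~0, flagging that X is effectively rank 2
print(eigvecs[:, 0])  # ...but that direction involves all three columns, not just c
```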
Indeed, it's predictable and disappointing how the discussion devolved into pedantry. It should have been obvious what the author meant (plus it is clarified at the very beginning of the article). I'm not sure if this is an ML practitioner vs. statistician thing or what.