Nice article and interesting comparison.
Yet, I have a minor issue with the title: deep learning methods are also statistical methods. A title along the lines of "univariate models vs. ..." would be better.
You could argue that deep learning is not a statistical method in the traditional sense, in that a typical neural network model is not a probability model, and some neural networks are well known to produce badly miscalibrated probability estimates, requiring post-processing before their outputs can be treated as properly "calibrated" probability predictions.
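For what it's worth, one common such post-processing step is temperature scaling; here is a minimal sketch, assuming `val_logits` and `val_labels` (placeholder names) come from a held-out validation set of an already-trained classifier:

```python
# Minimal sketch of temperature scaling as one post-processing calibration step.
# Assumes an already-trained classifier and a held-out validation set.
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Find a single scalar T so that softmax(logits / T) is better calibrated."""
    log_t = torch.zeros(1, requires_grad=True)  # optimize log(T) so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: T = fit_temperature(val_logits, val_labels)
# calibrated_probs = F.softmax(test_logits / T, dim=1)
```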
However, I don't like that there is often a strict dichotomy presented between "deep learning" and "statistics". There is a whole world of gray areas and hybrid techniques, which tend to be more accessible, easier to reason about, and more effective in practice, especially on smaller "tabular" datasets. What about generalized additive models, random forests, gradient boosted trees, etc.?
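To make the tabular point concrete, here is a rough sketch of the kind of gradient-boosted baseline I have in mind, reframing a univariate series as lagged-features regression (the synthetic `series`, lag window, and holdout length are just illustrative choices):

```python
# Turn a univariate series into a tabular (lag window -> next value) problem,
# then fit a gradient-boosted tree ensemble on it.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

def make_lagged(series: np.ndarray, n_lags: int = 12):
    """Build training pairs: the previous n_lags values predict the next value."""
    X = np.column_stack([series[i: len(series) - n_lags + i] for i in range(n_lags)])
    return X, series[n_lags:]

# Synthetic stand-in for a real univariate series.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 300)) + 0.1 * rng.normal(size=300)

X, y = make_lagged(series)
model = HistGradientBoostingRegressor(max_iter=300)
model.fit(X[:-24], y[:-24])              # hold out the last 24 points
one_step_preds = model.predict(X[-24:])  # one-step-ahead forecasts on the holdout
```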
The author of the document I'm sure is aware of these techniques, and I assume they are left out because they didn't perform well enough to be considered here. But I don't think it does the discourse any favors to promulgate the false dichotomy.
Statistical models and probabilistic models are not synonymous.
Vanilla deep learning models are statistical models (a la linear regression) and not probabilistic models (a la Gaussian mixture). It is important to maintain the distinction.
But to your point about the dichotomy between deep learning and more "traditional" statistical methods: this confusion in common parlance clearly has negative effects on model-building among engineers. You are right that when people think "deep learning" they think of very specific architectures with very specific features, and don't seem to conceive of the possibility that automatic differentiation means you can incorporate all sorts of new model components that blur the line between deep learning and older methods. For instance, you could feed the results of a kernel SVM into an ARIMA model in such a way that the whole thing is end-to-end differentiable. In fact, the great long-term benefit of deep learning is (in my opinion) that the ability to build these compositional models lets you bake much more inductive bias into the models you build, so they can be smaller and more stable in training.
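I don't have a literal kernel-SVM-into-ARIMA handy, but here is a toy sketch of the compositional point under my own made-up names (`RBFFeatures`, `KernelAR`): a learnable RBF feature map over a window of lags feeds a linear autoregressive head, and autograd trains the whole pipeline end to end.

```python
# Toy illustration of composing "kernel-flavoured" and "AR-flavoured" pieces
# into one differentiable model; autograd trains everything jointly.
import torch
import torch.nn as nn

class RBFFeatures(nn.Module):
    def __init__(self, n_lags: int, n_centers: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_centers, n_lags))
        self.log_gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):                        # x: (batch, n_lags)
        d2 = torch.cdist(x, self.centers) ** 2   # squared distances to the centers
        return torch.exp(-self.log_gamma.exp() * d2)

class KernelAR(nn.Module):
    """RBF feature map composed with a linear autoregressive head."""
    def __init__(self, n_lags: int = 12, n_centers: int = 20):
        super().__init__()
        self.features = RBFFeatures(n_lags, n_centers)
        self.head = nn.Linear(n_centers + n_lags, 1)  # kernel features plus raw lags

    def forward(self, lags):
        return self.head(torch.cat([self.features(lags), lags], dim=1)).squeeze(-1)

# model = KernelAR()
# loss = nn.functional.mse_loss(model(lag_windows), targets)
# loss.backward() differentiates through the kernel features and the AR head alike.
```

The only point being made is that any differentiable piece, kernel-flavoured or ARIMA-flavoured, can be dropped into the same computation graph.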
"Vanilla deep learning models are statistical models (a la linear regression) and not probabilistic models (a la Gaussian mixture). It is important to maintain the distinction."
Isn't this just a matter of interpretation of the models? You can interpret linear regression in a Bayesian way and say that the prediction of the linear model is the MAP estimate of the mean; you can also calculate the variance; the L2-norm objective says the distribution of the errors is normal; L2 regularisation is a normal prior on the coefficients; and so on. All the same machinery can be applied to deep learning models.
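To spell out the correspondence I mean (the standard textbook derivation, nothing specific to the article):

```latex
% Squared error + L2 penalty = Gaussian likelihood + Gaussian prior on the weights,
% so the ridge estimate is the MAP estimate of w:
\begin{aligned}
p(y \mid X, w) &= \prod_i \mathcal{N}\!\left(y_i \mid x_i^\top w,\ \sigma^2\right),
\qquad p(w) = \mathcal{N}\!\left(w \mid 0,\ \tau^2 I\right) \\
-\log p(w \mid X, y) &= \frac{1}{2\sigma^2}\sum_i \left(y_i - x_i^\top w\right)^2
  + \frac{1}{2\tau^2}\lVert w \rVert_2^2 + \text{const} \\
\hat{w}_{\mathrm{MAP}} &= \arg\min_w\ \lVert y - Xw \rVert_2^2 + \lambda\lVert w \rVert_2^2,
\qquad \lambda = \sigma^2/\tau^2
\end{aligned}
```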
Maybe I don't understand your distinction between statistical and probabilistic though?
> Isn't this just a matter of interpretation of the models?
Not really. This is the classic frequentist vs Bayesian debate. In frequentist-land, you are computing point estimates of the model parameters. In Bayesian-land, you are computing distribution estimates of the model parameters. It is true that there is a difference in interpretation of the generative process, but the two choices demand fundamentally different models, because of the decision about which of the parameters or data are considered "real" and which are considered "generated".
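A tiny concrete example of that contrast, using a coin's bias (the Beta(1, 1) prior is just an illustrative choice):

```python
# Point estimate vs. distribution estimate of a coin's bias theta.
from scipy import stats

heads, tails = 7, 3

theta_mle = heads / (heads + tails)            # frequentist answer: a single number, 0.7

posterior = stats.beta(1 + heads, 1 + tails)   # Bayesian answer: a whole distribution
print(theta_mle, posterior.mean(), posterior.interval(0.95))
```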
I think a more abstract/general way to put it is: "statistics" is concerned with statistical summary values (i.e. mean-field estimates over measures) while "probability" is concerned more with distributions (i.e., topologies of measures). I'm not sure this is a rigorously correct way to characterize it, but it illustrates the intuition I'm trying to convey.
Statistics as practiced today (1930s until now?) consists almost entirely of making inferences about unobserved probability distributions. That includes nonparametric statistics, and frequentist versus Bayesian has nothing to do with it.
There are some probability models that are not really statistical models, but there are few or no statistical models that are not also probability models.
Least-squares regression is a probability model. Even if you don't particularly care about the error distribution, you are still estimating a conditional expectation and imposing a conditional independence assumption on the residuals. If that's not a probability model, then I don't know what it is!
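That is, the standard identity underneath squared-error loss:

```latex
% Minimizing expected squared error over all functions recovers the
% conditional expectation, i.e. the regression function:
\arg\min_{f}\ \mathbb{E}\!\left[(Y - f(X))^2\right] \;=\; \mathbb{E}\left[\,Y \mid X\,\right]
```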
Statistics can be summarized as one thing: n -> N. Does "little n" represent "big N"? In other words, does the sample generalize to the population? Statistics means something like "description of the state". It was born out of census samples, where larger populations had to be estimated from them. "n" could be a handful of fish in an "N" lake. "n" could also be the parameter estimated in a linear regression from the sample of data collected, while "N" is the true parameter of the relationship if we had all the data. Point estimation is about finding the needle in the haystack, but much more often statistics is about finding the haystack given the needle. One tool statistics uses to get to the haystack is probability.
A point estimate of distribution parameters describing a population is frequentist. A point estimate of distribution parameters describing another distribution's parameter is Bayesian.
How the parameters are estimated is not the message.
In statistics there are Latin letters and Greek letters. When you see a symbol denoted by a Greek letter, that is a population parameter; when you see a Latin letter, that is a sample estimate. It could be frequentist, Bayesian, likelihoodist, fiducial, empirical Bayes, etc. Theoretical population Greeks or sample-calculated Latins.
I have a very limited statistical background, but doesn't variational inference applied to neural networks make them probabilistic models? The modelling certainly suggests so: the math in those papers doesn't even specify that the model is a network (it implies it can be any model).
Yes indeed. This synthesis of concepts is a great illustration of moving beyond hardened dichotomies in this research space and I believe similar approaches will be fruitful in the years to come.
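For readers following along, here is a minimal sketch of the variational idea applied to a single layer (my own naming, `VariationalLinear`; the unit-Gaussian prior and mean-field Gaussian posterior are the usual simplifying assumptions):

```python
# Minimal sketch of variational inference over one weight matrix:
# the weights get a mean-field Gaussian posterior q(w) = N(mu, sigma^2),
# sampled with the reparameterization trick, with a KL term against a N(0, 1) prior.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalLinear(nn.Module):
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -3.0))  # sigma = softplus(rho)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)   # reparameterization trick
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights
        self.kl = 0.5 * (sigma**2 + self.mu**2 - 1 - 2 * sigma.log()).sum()
        return F.linear(x, w)

# ELBO-style loss: data negative log-likelihood plus the layer's kl term (scaled by 1/N).
# Nothing here cares that the likelihood model is a neural network.
```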
They are all univariate models: some are trained offline on a bunch of different series before being applied (deep learning, "global" models), while others are fit directly to each series being forecast ("statistical", "local" models), but the task, univariate time series prediction, is the same for every model there.
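A small sketch of that local-vs-global distinction, with made-up synthetic series and ridge regression standing in for either kind of learner:

```python
# Local vs. global on the same univariate task: a "local" model is fit per series,
# a "global" model is fit once on windows pooled from every series, but both
# map a window of lags to the next value.
import numpy as np
from sklearn.linear_model import Ridge

def windows(series: np.ndarray, n_lags: int = 8):
    X = np.column_stack([series[i: len(series) - n_lags + i] for i in range(n_lags)])
    return X, series[n_lags:]

rng = np.random.default_rng(0)
all_series = [np.cumsum(rng.normal(size=200)) for _ in range(50)]  # synthetic dataset

# Local: one model per series, fit at forecast time.
local_models = [Ridge().fit(*windows(s)) for s in all_series]

# Global: one model trained offline on windows pooled across all series.
Xg = np.vstack([windows(s)[0] for s in all_series])
yg = np.concatenate([windows(s)[1] for s in all_series])
global_model = Ridge().fit(Xg, yg)
```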