> You mention Newton’s method, but of course that requires second order information which, as I mentioned, is not generally workable in high dimensions.

Why would you say that second-order information is "not generally workable in high dimensions"? We regularly run Newton's method on problems with tens of millions of variables today.

And Newton's method isn't the only way to use second-order information. It is easy to access, for example, Hessian-times-vector information using the same reverse-mode differentiation that's so popular today, using only a constant factor more time.
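To make that concrete, here is a minimal sketch using JAX (any autodiff framework would look much the same); the objective f is an arbitrary stand-in and hvp is just a name I'm using for the helper:

    import jax
    import jax.numpy as jnp

    def f(x):
        # Stand-in for any smooth scalar-valued objective.
        return jnp.sum(jnp.sin(x) ** 2) + 0.5 * jnp.dot(x, x)

    def hvp(f, x, v):
        # Hessian-vector product as the gradient of the directional
        # derivative grad(f)(x) . v: two reverse-mode sweeps, so only a
        # constant factor more work than a single gradient.
        return jax.grad(lambda x: jnp.vdot(jax.grad(f)(x), v))(x)

    x = jnp.ones(1_000_000)
    v = jnp.ones(1_000_000)
    print(hvp(f, x, v)[:5])

No Hessian is ever materialised, and memory stays linear in the number of variables.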

> You have to be careful with quasi-Newton methods like conjugate gradient for the same reason.

What reason, exactly, is that?



I meant that, in general settings (i.e., no special problem structure), you need full Hessians for Newton’s method. And regarding conjugate gradient: in the (non-DL) settings I’m used to, good results require preconditioners, which are also second order.

Could you provide a reference to a 10^7-variable problem that is being optimized with Newton’s method? I’d be indebted.


> I meant that, in general settings (i.e., no special problem structure), you need full Hessians for Newton’s method. And regarding conjugate gradient: in the (non-DL) settings I’m used to, good results require preconditioners, which are also second order.

Yep, but often sparsity is present or the objective function is reasonably well-behaved. The former can save straight Newton; the latter will make nonlinear CG or quasi-Newton methods converge rapidly. (Quasi-Newton methods build a simple model of the Hessian from gradient and step information, usually via repeated low-rank modifications of a diagonal matrix. There are variants, like L-BFGS, that have an explicit, tunable limit on the number of additional vectors stored. These work remarkably well for reasons that aren't fully understood---usually far better than gradient descent, and almost never more than a small constant factor slower.)
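If it helps, here's a minimal sketch of that memory limit in action, using SciPy's L-BFGS-B on the standard Rosenbrock test function (nothing problem-specific here, purely an illustration):

    import numpy as np
    from scipy.optimize import minimize

    def rosenbrock(x):
        return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2
                      + (1.0 - x[:-1]) ** 2)

    def rosenbrock_grad(x):
        g = np.zeros_like(x)
        g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1] ** 2) - 2.0 * (1.0 - x[:-1])
        g[1:] += 200.0 * (x[1:] - x[:-1] ** 2)
        return g

    x0 = np.zeros(10_000)
    # maxcor is the number of stored correction vectors (the tunable memory).
    res = minimize(rosenbrock, x0, jac=rosenbrock_grad,
                   method="L-BFGS-B", options={"maxcor": 10})
    print(res.fun, res.nit)

The extra storage is a handful of length-n vectors rather than an n-by-n matrix.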

> Could you provide a reference to a 10^7-variable problem that is being optimized with Newton’s method? I’d be indebted.

Interior-point methods for linear programming form the Hessian of a barrier function at each iteration, then (sparse) Cholesky factorise it, then do a few backsolves to find a search direction. (Special hacks generally go into these "few backsolves" to speed convergence.) This is a damped Newton method. Commercial implementations include CPLEX, Gurobi, and MOSEK; there are a great many less-commercial implementations as well.
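To show the shape of one such iteration, here is a toy damped-Newton centering step on the log barrier of an inequality-form LP, min c^T x subject to Ax <= b. This is only a sketch: the data is random, the newton_step name is mine, and SciPy's sparse LU stands in for the sparse Cholesky a real solver would use:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    rng = np.random.default_rng(0)
    n, m = 1_000, 2_000
    A = sp.random(m, n, density=0.002, random_state=0, format="csr")
    x = np.zeros(n)
    b = A @ x + 1.0               # x = 0 is strictly feasible
    c = rng.standard_normal(n)
    t = 10.0                      # barrier weight

    def newton_step(x):
        s = b - A @ x                             # slacks, must stay > 0
        grad = t * c + A.T @ (1.0 / s)
        # Barrier Hessian A^T diag(1/s^2) A, with a tiny regulariser so the
        # factorisation can't fail on empty columns of this random toy A.
        H = (A.T @ sp.diags(1.0 / s**2) @ A + 1e-8 * sp.eye(n)).tocsc()
        d = -spla.splu(H).solve(grad)             # factorise, then backsolve
        # Damping: halve the step until the new iterate is strictly
        # feasible (a fuller line search would also check for decrease).
        alpha = 1.0
        while np.any(b - A @ (x + alpha * d) <= 0):
            alpha *= 0.5
        return x + alpha * d

    x = newton_step(x)
    print("smallest slack after one step:", np.min(b - A @ x))

Nearly all of the cost is in forming and factorising the sparse n-by-n matrix, which is exactly where the commercial codes spend their engineering.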

Chih-Jen Lin's LIBLINEAR uses a truncated Newton method (solve Hessian * direction = -gradient with linear conjugate gradient, stopping early once the direction is accurate enough to take a step) to train linear classifiers.
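For flavour, the same idea can be written with SciPy's Newton-CG solver, which takes a Hessian-vector product callback and truncates the inner CG loop. This mirrors the structure of that approach on an L2-regularised logistic regression; it is not LIBLINEAR's actual implementation, and the data below is synthetic:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5_000, 200))
    y = np.sign(X @ rng.standard_normal(200) + 0.1 * rng.standard_normal(5_000))
    lam = 1.0                       # L2 regularisation strength

    def loss(w):
        z = y * (X @ w)
        return np.sum(np.logaddexp(0.0, -z)) + 0.5 * lam * w @ w

    def grad(w):
        sig = 1.0 / (1.0 + np.exp(-y * (X @ w)))
        return X.T @ (y * (sig - 1.0)) + lam * w

    def hessp(w, v):
        # Hessian-vector product X^T D X v + lam*v with D diagonal;
        # the Hessian itself is never formed.
        sig = 1.0 / (1.0 + np.exp(-y * (X @ w)))
        d = sig * (1.0 - sig)
        return X.T @ (d * (X @ v)) + lam * v

    res = minimize(loss, np.zeros(200), jac=grad, hessp=hessp,
                   method="Newton-CG")
    print(res.fun, res.nit)

The outer iteration count stays small because each step uses a (rough) Newton direction, while the inner CG only ever touches the Hessian through products like the one above.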



