We don't compute per-example gradients, so in your second code snippet there would not be a loop across examples. We compute the batch-averaged gradient in the same time it would take to compute a single example's gradient, so it's much more efficient than your proposal, which is equivalent to using a batch size of 1.
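To illustrate the point (a minimal numpy sketch; the linear model, squared-error loss, and all names here are mine, just for demonstration): since the loss averages over the batch, the gradient of the mean equals the mean of the per-example gradients, so one pass over the batch gives the batch-averaged gradient without any per-example loop:

```python
import numpy as np

# Toy setup: linear model x.w with squared-error loss, batch of 4 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 examples, 3 features
y = rng.normal(size=4)
w = rng.normal(size=3)

# Batch-averaged gradient in one vectorized pass (what frameworks do):
# loss = mean_i (x_i.w - y_i)^2, so dloss/dw = mean_i 2*(x_i.w - y_i)*x_i
residual = X @ w - y
grad_batch = 2 * (X.T @ residual) / len(y)

# Equivalent per-example loop (the sequential proposal):
grad_loop = np.zeros_like(w)
for x_i, y_i in zip(X, y):
    grad_loop += 2 * (x_i @ w - y_i) * x_i
grad_loop /= len(y)

assert np.allclose(grad_batch, grad_loop)
```

Both give the same vector; the vectorized version is just one matrix product instead of a Python loop.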
You're right, what you propose is not quite equivalent to batch size 1, since you don't update the parameters until you've processed the entire batch.
Still, having to process the examples in a batch sequentially seems like a very costly concession to make. Traditionally, the reason to use batches has been that GPU-style parallelism makes them cheap. If you take away that reason by making the computation sequential, large batches become much harder to justify. Moreover, it's not clear what you gain by making the computation sequential in this way -- do you think Adam actually has trouble keeping up with the mean/variance of the gradients, so it needs more frequent updates? I would be surprised if so.
This seems a bit weird to me. You're accumulating statistics in an order-dependent way without updating the parameters. You're also doing k updates of the statistics with noisier estimates of the gradient. I'm not really a stats guy, but this doesn't seem like it would provide better estimates of the Adam statistics like you suggest. I'm sure this would have some impact, but it doesn't seem like it would be better than tuning the beta hyperparameters to incorporate gradient changes more quickly. But you probably just want to try it if you believe in it, not try to get traction with it in the HN comments.
https://twitter.com/theshawwn/status/1355343951033602057
https://news.ycombinator.com/item?id=25964420
Unfortunately, no one seems to understand it, which isn't a great sign. I'm either not explaining it very well, or the idea doesn't make sense.
In short:
That way, Adam statistics are updated for every training example.

Traditional gradient accumulation looks like this:
... which only updates Adam once. (It's equivalent to a bigger batch size.)
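For concreteness, here's how I read the difference between the two schemes (a minimal numpy sketch; the `adam_update` function follows the standard Adam moment updates with bias correction, but the hyperparameters, names, and the exact accumulation loop are my own interpretation, not a reference implementation):

```python
import numpy as np

beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

def adam_update(m, v, g, t):
    # One update of Adam's running moments, plus the resulting step.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    return m, v, step

rng = np.random.default_rng(0)
grads = rng.normal(size=(4, 3))  # 4 microbatch gradients for 3 parameters

# Traditional gradient accumulation: average the microbatch gradients,
# then update Adam's statistics (and the parameters) once.
m = v = np.zeros(3)
m, v, step_trad = adam_update(m, v, grads.mean(axis=0), t=1)

# "Adam accumulation" as I understand it: feed every microbatch gradient
# through Adam's moment updates, but only apply a parameter step once,
# after the whole batch has been processed.
m = v = np.zeros(3)
for t, g in enumerate(grads, start=1):
    m, v, step_acc = adam_update(m, v, g, t)
# step_acc is the single step actually applied to the parameters.
```

The two resulting steps generally differ, since the second scheme's moments depend on the order of the microbatches and on k separate noisy updates, which is exactly the objection raised above.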
Probably best to just implement Adam accumulation and see if it works, I suppose.
(Sorry for rambling about this here. I was just hoping to find some prior work along these lines, if anyone knew of something.)