We don't compute per-example gradients, so in your second code snippet there would not be a loop across examples. We compute the batch-averaged gradient in the same time it would take to compute a single example's gradient, so it's much more efficient than your proposal, which is equivalent to using a batch size of 1.
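To illustrate the point (a minimal numpy sketch; the linear model, squared-error loss, and all names here are mine, just for demonstration): since the loss averages over the batch, the gradient of the mean equals the mean of the per-example gradients, so one pass over the batch gives the batch-averaged gradient without any per-example loop:

```python
import numpy as np

# Toy setup: linear model x.w with squared-error loss, batch of 4 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 examples, 3 features
y = rng.normal(size=4)
w = rng.normal(size=3)

# Batch-averaged gradient in one vectorized pass (what frameworks do):
# loss = mean_i (x_i.w - y_i)^2, so dloss/dw = mean_i 2*(x_i.w - y_i)*x_i
residual = X @ w - y
grad_batch = 2 * (X.T @ residual) / len(y)

# Equivalent per-example loop (the sequential proposal):
grad_loop = np.zeros_like(w)
for x_i, y_i in zip(X, y):
    grad_loop += 2 * (x_i @ w - y_i) * x_i
grad_loop /= len(y)

assert np.allclose(grad_batch, grad_loop)
```

Both give the same vector; the vectorized version is just one matrix product instead of a Python loop.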
You're right, what you propose is not quite equivalent to batch size 1, since you don't update the parameters until you've processed the entire batch.
Still, having to process the examples in a batch sequentially seems like a very costly concession to make. Traditionally, the reason to use batches has been that GPU-style parallelism makes them cheap. If you take away that reason by making the computation sequential, large batches become much harder to justify. Moreover, it's not clear what you gain by making the computation sequential in this way -- do you think Adam actually has trouble keeping up with the mean/variance of the gradients, so it needs more frequent updates? I would be surprised if so.
This seems a bit weird to me. You're accumulating statistics in an order-dependent way without updating the parameters. You're also doing k updates of the statistics with noisier estimates of the gradient. I'm not really a stats guy, but this doesn't seem like it would provide better estimates of the Adam statistics like you suggest. I'm sure this would have some impact, but it doesn't seem like it would be better than tuning the beta hyperparameters to incorporate gradient changes more quickly. But you probably just want to try it if you believe in it, not try to get traction with it in the HN comments.
https://twitter.com/theshawwn/status/1355343951033602057
https://news.ycombinator.com/item?id=25964420
Unfortunately, no one seems to understand it, which isn't a great sign. I'm either not explaining it very well, or the idea doesn't make sense.
In short:
That way, Adam statistics are updated for every training example.

Traditional gradient accumulation looks like this:
... which only updates Adam once. (It's equivalent to a bigger batch size.)
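For concreteness, here's how I read the difference between the two schemes (a minimal numpy sketch; the `adam_update` function follows the standard Adam moment updates with bias correction, but the hyperparameters, names, and the exact accumulation loop are my own interpretation, not a reference implementation):

```python
import numpy as np

beta1, beta2, lr, eps = 0.9, 0.999, 1e-3, 1e-8

def adam_update(m, v, g, t):
    # One update of Adam's running moments, plus the resulting step.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = lr * m_hat / (np.sqrt(v_hat) + eps)
    return m, v, step

rng = np.random.default_rng(0)
grads = rng.normal(size=(4, 3))  # 4 microbatch gradients for 3 parameters

# Traditional gradient accumulation: average the microbatch gradients,
# then update Adam's statistics (and the parameters) once.
m = v = np.zeros(3)
m, v, step_trad = adam_update(m, v, grads.mean(axis=0), t=1)

# "Adam accumulation" as I understand it: feed every microbatch gradient
# through Adam's moment updates, but only apply a parameter step once,
# after the whole batch has been processed.
m = v = np.zeros(3)
for t, g in enumerate(grads, start=1):
    m, v, step_acc = adam_update(m, v, g, t)
# step_acc is the single step actually applied to the parameters.
```

The two resulting steps generally differ, since the second scheme's moments depend on the order of the microbatches and on k separate noisy updates, which is exactly the objection raised above.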
Probably best to just implement Adam accumulation and see if it works, I suppose.
(Sorry for rambling about this here. I was just hoping to find some prior work along these lines, if anyone knew of something.)