Their video training set is still small (10M + 10M). A lot of the interpolation artifacts seem to come from the model not having acquired enough real-world understanding of "natural" movement (see the running horse and sailboat examples). I suspect scaling the data up 10x would produce far fewer artifacts.
Reading the paper, this seems to be the "right" approach (separating temporal and spatial processing for both convolution and attention). Thus, I am optimistic that what remains is to scale it up.
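To make the factorization concrete, here is a minimal sketch of what "separating temporal / spatial for both convolution and attention" typically looks like; this is my own illustrative PyTorch code, not the paper's implementation, and all module names, kernel sizes, and shapes are assumptions:

```python
# Hypothetical sketch of a factorized spatio-temporal block: a full 3-D
# operation is split into a spatial part (within each frame) and a
# temporal part (across frames), for both convolution and attention.
import torch
import torch.nn as nn


class FactorizedSpatioTemporalBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Convolution: a spatial (1 x k x k) kernel followed by a
        # temporal (k x 1 x 1) kernel, instead of one full 3-D kernel.
        self.spatial_conv = nn.Conv3d(channels, channels,
                                      kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal_conv = nn.Conv3d(channels, channels,
                                       kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Attention: spatial attention mixes pixels within a frame;
        # temporal attention mixes the same pixel across frames.
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads,
                                                  batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads,
                                                   batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        x = x + self.temporal_conv(self.spatial_conv(x))

        # Spatial attention: fold time into the batch, attend over H*W tokens.
        s = x.permute(0, 2, 3, 4, 1).reshape(b * t, h * w, c)
        s = self.spatial_attn(s, s, s)[0]
        x = x + s.reshape(b, t, h, w, c).permute(0, 4, 1, 2, 3)

        # Temporal attention: fold space into the batch, attend over T tokens.
        u = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        u = self.temporal_attn(u, u, u)[0]
        return x + u.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)


if __name__ == "__main__":
    block = FactorizedSpatioTemporalBlock(channels=32)
    video = torch.randn(2, 32, 8, 16, 16)  # (B, C, T, H, W)
    print(block(video).shape)  # torch.Size([2, 32, 8, 16, 16])
```

The appeal of this design is that each attention pass is over either H*W or T tokens rather than T*H*W, which is what makes scaling it up plausible.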
The horse leg motion in these examples is really poor. But to be fair, horse legs are very difficult for human animators too, generally necessitating reference photos to get right. The first time somebody captured a series of photographs showing a horse in motion, it caused quite a stir: https://en.wikipedia.org/wiki/The_Horse_in_Motion