True, but the attention layers still need to be able to look at all the shots, for example to make sure the background of a room shown at the start of the movie matches the background of the same room at the end.
Obviously you could do 'human-assisted' movie making, where humans decide the storyboard and give directions for each shot, and then that isn't necessary.
Yes, but consider that most films are made up of many different shots, each of which is often just seconds long.