Imagine you cut the image into 32x32 pixel blocks. Then, for each block, you choose 1 out of 128,000 variations. A post-processing step then smooths out the borders between blocks and adjusts small details. That's basically how a transformer image generation model works.
As such, the process is remarkably similar to old fixed-font ASCII art. It's just that modern AIs have a larger alphabet and, thus, more character shapes to choose from.
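To make the analogy concrete, here's a minimal sketch of the "pick 1 out of N variations" step, i.e. quantizing a patch to its nearest codebook entry. The sizes and the use of raw pixels are illustrative assumptions; real models quantize learned feature vectors and use much larger codebooks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: 32x32 RGB patches and a codebook
# of 1024 entries (real models use far more, e.g. ~128,000, and operate
# on learned embeddings rather than raw pixels).
patch = rng.random(32 * 32 * 3)               # one flattened patch
codebook = rng.random((1024, 32 * 32 * 3))    # the "alphabet" of patch vectors

# Tokenize: pick the codebook entry closest to the patch (L2 distance).
distances = np.linalg.norm(codebook - patch, axis=1)
token_id = int(np.argmin(distances))
```

The image then becomes a grid of such token ids; the transformer predicts them one at a time, and a decoder maps them back to pixels, smoothing the seams between blocks.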
I don't get how this would produce consistent images. In the article, the text could be on a grid, but the window and doorway and sofa don't seem to be grid-aligned. (Or maybe the text is overlaid?)
The model looks ahead, just like LLMs do. An LLM outputs token by token but can still produce a fully coherent and consistent story, for example. This new crop of auto-regressive image models does the same.