Imagine you cut the image into 32x32 pixel blocks. Then, for each block, you choose 1 out of 128,000 variations. A post-processing step then smooths out the borders between blocks and adjusts small details. That's basically how a transformer image generation model works.
As such, the process is remarkably similar to old fixed-font ASCII art. It's just that modern AIs have a larger alphabet and, thus, more character shapes to choose from.
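To make the analogy concrete, here's a minimal sketch of the "pick 1 out of N variations" step, i.e. quantizing a patch to its nearest codebook entry. The sizes and the use of raw pixels are illustrative assumptions; real models quantize learned feature vectors and use much larger codebooks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration: 32x32 RGB patches and a codebook
# of 1024 entries (real models use far more, e.g. ~128,000, and operate
# on learned embeddings rather than raw pixels).
patch = rng.random(32 * 32 * 3)               # one flattened patch
codebook = rng.random((1024, 32 * 32 * 3))    # the "alphabet" of patch vectors

# Tokenize: pick the codebook entry closest to the patch (L2 distance).
distances = np.linalg.norm(codebook - patch, axis=1)
token_id = int(np.argmin(distances))
```

The image then becomes a grid of such token ids; the transformer predicts them one at a time, and a decoder maps them back to pixels, smoothing the seams between blocks.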
I don't get how this would produce consistent images. In the article, the text could be on a grid, but the window and doorway and sofa don't seem to be grid-aligned. (Or maybe the text is overlaid?)
The model looks ahead, just like LLMs do. An LLM outputs token by token but can still produce a fully coherent and consistent story, for example. This new crop of auto-regressive image models does the same.