Hacker News

Imagine you cut the image into 32x32 pixel blocks. Then, for each block, you choose 1 out of 128,000 variations. Finally, a post-processing step smooths out the borders between blocks and adjusts small details. That's basically how a transformer image generation model works.
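A toy sketch of the block-to-token idea, in the spirit of vector quantization. Everything here is made up for illustration: a tiny 8-entry codebook stands in for the ~128,000 variations, and nearest-neighbor matching stands in for a learned encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 64x64 grayscale "image", 32x32 blocks,
# and a tiny stand-in codebook (real models use on the order of 100k entries).
PATCH = 32
CODEBOOK_SIZE = 8

image = rng.random((64, 64))
codebook = rng.random((CODEBOOK_SIZE, PATCH * PATCH))  # each row is one block "shape"

def tokenize(img):
    """Map each PATCH x PATCH block to the index of its nearest codebook entry."""
    h, w = img.shape
    tokens = []
    for y in range(0, h, PATCH):
        for x in range(0, w, PATCH):
            block = img[y:y + PATCH, x:x + PATCH].reshape(-1)
            dists = np.linalg.norm(codebook - block, axis=1)
            tokens.append(int(np.argmin(dists)))
    return tokens

def detokenize(tokens, h, w):
    """Rebuild a coarse image by pasting the chosen codebook blocks back in place.
    A real model would follow this with the smoothing/detail pass described above."""
    out = np.zeros((h, w))
    for i, t in enumerate(tokens):
        y = (i // (w // PATCH)) * PATCH
        x = (i % (w // PATCH)) * PATCH
        out[y:y + PATCH, x:x + PATCH] = codebook[t].reshape(PATCH, PATCH)
    return out

tokens = tokenize(image)        # a 64x64 image becomes 4 block tokens
recon = detokenize(tokens, 64, 64)
```

So the whole image is reduced to a short list of integers, one per block, which is exactly the kind of sequence a transformer can work with.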

As such, the process is remarkably similar to old fixed-font ASCII art. It's just that modern AIs have a larger alphabet and, thus, more character shapes to choose from.



I don't get how this would produce consistent images. In the article, the text could be on a grid, but the window and doorway and sofa don't seem to be grid-aligned. (Or maybe the text is overlaid?)


The model looks ahead, just like LLMs look ahead. An LLM outputs token by token but can still output a fully coherent and consistent story for example. This new crop of auto-regressive image models does the same.
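The autoregressive loop itself is simple; the coherence comes from the model conditioning each choice on everything chosen so far. A minimal sketch, where `next_token_logits` is a made-up stand-in for the trained network:

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB = 8   # tiny stand-in for the image-token vocabulary
SEQ = 4     # number of block tokens per "image"

def next_token_logits(prefix):
    # Stand-in for the trained model: in a real system this scores every
    # candidate token given the whole prefix, which is what keeps the
    # blocks consistent with each other.
    return rng.random(VOCAB)

def sample_image_tokens():
    """Sample block tokens one at a time, each conditioned on the prefix."""
    tokens = []
    for _ in range(SEQ):
        logits = next_token_logits(tokens)
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax over candidates
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens

tokens = sample_image_tokens()
```

The random logits here obviously produce incoherent "images"; the point is only the loop structure, which is the same one an LLM uses to emit a coherent story token by token.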




