> okay that’s just because you’re not good at using the models.
You literally cannot respond with "you are holding it wrong", especially when I'm claiming that even for the popular _example prompts_ the SD authors used, they had to hand-pick the best random result out of a sea of extremely shitty images.
And even in the original paper they disclaim it by saying "oh, our model is just bad at limbs". No, it's not just bad at limbs. They just happened to pick examples where it could particularly show how terrible it is at limbs (e.g. spider-legged horses and the like). But in truth, it's just bad at everything.
"It's bad at everything" ... bad by what standards? Just a few years ago it would have been regarded as unbelievable science fiction that a model with such capabilities would soon be available. As soon as they are here, people stop being impressed. But the objective impressiveness of a technology is determined by how unlikely it was regarded in the past, not by how impressed people are now. People get used to things pretty quickly.
Besides, there are models that are much more capable than Stable Diffusion. The best one currently seems to be Midjourney V5.
> Just a few years ago it would have been regarded as unbelievable science fiction that a model with such capabilities would soon be available. As soon as they are here, people stop being impressed.
I don't know. I've had chatbots for decades before "a few years ago", so I have never been particularly impressed. And for someone who was already impressed by the fact that you could practically describe a landscape in plain old 2000s Google Images and get a result, SD feels like just an incremental improvement over it -- the ability to create very surreal-looking 'melanges', at the cost of it almost always generating nonsensical ones. Add to that that Google Images is much easier to use than SD...
> Just a few years ago it would have been regarded as unbelievable science fiction that a model with such capabilities would soon be available.
No, it wouldn't have – not to people in the know. We just didn't have powerful enough computers back in the 90s. Sure, the techniques we've got now are better, but 90s algorithms (with modern supercomputers) can get you most of the way.
Transformers are awesome, but they're not that much of a stretch from 90s technology. GANs are… ridiculously obvious, in hindsight, and people have been doing similar things since the dawn of AI; I imagine the people who came up with the idea were pretty confident of its capabilities even before they tested them.
Both these kinds of system – and neural-net-based systems in general – are based around mimicry. Their inability to draw limbs, or to tell the truth, or to count, is fundamental to how they function, and iterative improvement isn't going to fix it. Iterative improvement would be happening faster if researchers (outside of OpenAI and similar corporations) thought it was worthwhile to focus on improving these systems specifically.
ChatGPT is not where transformers shine. StyleGAN3 is not where GANs shine. Midjourney is not where diffusion models shine. They're really useful lenses for visualising the way the architectures work, so they are useful test-beds for iterative algorithmic improvements…¹ but they aren't all that they're made out to be.
¹: See the 3 in StyleGAN3. Unlike the 4 in GPT-4, it actually means something more than "we made it bigger and changed the training data a bit".
What's special about that day? That's after the algorithms were developed, models and drivers were built, and most of these behaviours were discovered. I've got fairly-photorealistic "AI-generated" photos on my laptop timestamped September 2019, and that was before I started learning how it all worked.
If you're talking about popular awareness of GPT-style autocomplete, then I agree. If you're talking about academic awareness of what these things can and can't do, we've had that for a while.
What photorealistic AI-generated image? In September 2019 this must have been a GAN face. I admit those are impressive, but incredibly limited compared to today's text-to-image models. If you look at an iPhone from 2019, or a car, or a video game... they all still look about the same today.
Three years ago there was nothing remotely as impressive as modern GPT-style or text-to-image models. Basically nobody predicted what was about to happen. The only exception I know of is Scott Alexander [1]. I don't know of any similar predictions from experts, but I'm happy to be proven wrong.
> In September 2019 this must have been a GAN face.
Well, yes,¹ but actually no. StyleGAN1's public release was February 2019, and it's capable of far more than just faces.
> Three years ago there was nothing remotely as impressive as modern GPT style
I predicted that! Albeit not publicly, so I can't prove it.
My predictions claimed it would have certain limitations, which GPT-3 (and, later, GPT-4) exhibit. I can show that those still exist, but few people on Hacker News seem to understand when I try to communicate that.²
> or text to image models.
Artbreeder (then called Ganbreeder) existed in early 2020, and it didn't take me by surprise when it came out. It parameterises the output of the model by mapping sliders to regions of the latent space; quite an obvious thing to do, if you want to try getting fine-grained control over the output. (A 2015 paper built on this technique: https://arxiv.org/abs/1508.06576)
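To make "sliders over latent space" concrete, here's a toy numpy sketch (the dimensions and direction vectors are made up for illustration; this isn't a description of Artbreeder's actual internals):

```python
import numpy as np

# Toy sketch: each UI slider is a scalar coefficient on one latent direction.
# Nothing here is taken from Artbreeder; sizes and directions are illustrative.
rng = np.random.default_rng(42)
LATENT_DIM = 512                                  # StyleGAN-sized latent, for example
z_base = rng.standard_normal(LATENT_DIM)          # starting point in latent space

# Pretend these directions were found, e.g. by PCA over sampled latents.
directions = rng.standard_normal((3, LATENT_DIM))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def latent_for_sliders(sliders: np.ndarray) -> np.ndarray:
    """Each slider value scales one direction; the sum offsets the base latent.
    The resulting vector is what you'd feed to a pretrained generator."""
    return z_base + sliders @ directions

z = latent_for_sliders(np.array([0.5, -1.2, 2.0]))  # one point per slider setting
```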
I was using spaCy back around 2017–2018. It represents sentences as vectors that you can do stuff like cosine similarity on.
If I'd been more interested in the field back then, I could have put two and two together, and realised you could train a net on labelled images (with supervised learning) to map a spaCy model's space to StyleGAN's, which would be a text-to-image model. It was very much imaginable back before April of 2020; a wealthy non-researcher hobbyist could've made one, using off-the-shelf tools.
If I were better at literature searches, I could probably find you an example of someone who'd done that, or something like it!
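For the curious, a toy sketch of that spaCy-to-StyleGAN mapping idea might look something like the following (the layer sizes, model names, and training setup are illustrative assumptions, not any real system; the latents for training would have to come from labelled images, e.g. via GAN inversion):

```python
import spacy
import torch
import torch.nn as nn

# Use spaCy's 300-d doc vectors as the text representation and learn a
# supervised mapping into a 512-d GAN latent space. Everything below is a
# hypothetical sketch, not code from any existing text-to-image system.
nlp = spacy.load("en_core_web_md")      # medium model ships with word vectors

def caption_vector(text: str) -> torch.Tensor:
    """doc.vector is the average of token vectors; cosine similarity on these
    is what spaCy's .similarity() computes."""
    return torch.tensor(nlp(text).vector)

text_to_latent = nn.Sequential(         # small MLP: 300-d text -> 512-d latent
    nn.Linear(300, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
)
optimizer = torch.optim.Adam(text_to_latent.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(captions: list[str], latents: torch.Tensor) -> float:
    """One supervised step on a batch of (caption, known GAN latent) pairs."""
    x = torch.stack([caption_vector(c) for c in captions])
    optimizer.zero_grad()
    loss = loss_fn(text_to_latent(x), latents)
    loss.backward()
    optimizer.step()
    return loss.item()
```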
---
¹: That file was, because that's what I was playing with in September. I do have some earlier ones, of landscapes; they're in a different folder.
²: See e.g. here: they tell me GPT-4 can translate more than just explicitly-specified meaning, and the "evidence" doesn't even manage that. https://news.ycombinator.com/item?id=35530316 (They also think translating the title of a game is the same as translating the game, for some reason; that confusion was probably my fault.)
Nope. You literally just tried out one prompt, saw one good image and several bad ones, shook your fist at the computer, and gave up.
I'll repeat myself. You have to play around with the models and learn how to use them (just like you have to do for everything).
> But in truth, it's just bad at everything
Thousands of people (including myself) have had the complete opposite result and have gotten amazing pictures. You can also play around with fine-tuned models from civitai and get completely different art styles.
Like, this is so dumb I don't even know how to respond lol.
You're like some guy who got a computer for the first time and couldn't figure out how to open the web browser, so he just dismissed it as useless.
I don't think you understand the point. Your claims that "all of this needs extensive tuning and hand-holding and picking results" don't help your argument; they help _mine_.
Most egregious is if you are doing even more tuning and cherry-picking than the authors of the models did, which you definitely are.
I could spend 20,000 hours trying to learn to draw and I would still be far worse than what I could generate with Stable Diffusion + Control Net + etc.
I doubt you would be better than someone who used Stable Diffusion for 7 years, and that's not even counting the technological advancements of the next 7 years.
I don't think writing SD prompts compounds like actually making art from scratch does. It's kind of inherent, right? Because one is derivative (with a bunch of computer science) and one is from people. You can be cynical and say all art is derivative, I guess.
Learning to draw at the level of what Stable Diffusion can generate would take thousands of hours of practice, and the individual drawings would take hours.
But if you do learn, you can then render a photorealistic image with nothing but pencil and paper, instead of being reliant on a beefy computer running a black-box model trained at enormous cost :)
SD will never compare to the power of pencil and paper imo. Drawing is an essential skill for any visual artist, not just for mechanics but for developing style, taste, and a true visual understanding of the world around you.
I recommend Freehand Figure Drawing for Illustrators as a good starting point (along with some beginner art lessons). It won't take 1k hours before you see results. It's also fun!
> But if you do learn, you can then render photorealistic image with nothing but pencil and paper instead of being reliant on a beefy computer running a blackbox model trained at enormous cost :)
Why do I want to avoid that reliance, other than to be smug to the nerds? And as far as general self-satisfaction, let's assume I would rather master a different skill with the time it would take.
Especially because that training cost only has to be done once, so on a per-person basis it beats learning to draw by a lot.
If you said something about flexibility and specificity of what you can create I could get behind that, but I think the arguments you're making are very unconvincing.
Having spent several orders of magnitude more time working on drawing than with SD, I’ll say “Drawing isn’t hard for some people”.
If drawing was that easy, no one would worry about disruption from AI image generators, because everyone who wanted images would be knocking them out by hand, not paying people for them, so there’d be nothing to disrupt.
SD isn't creation imo. I have used it + followed stuff made with it, and I don't care how much people rebrand it as prompt engineering. It's consumption. Just because some imagination is involved in the query doesn't make it creation.
> SD isn't creation imo. I have used it + followed stuff made with it, and I don't care how much people rebrand it as prompt engineering. It's consumption. Just because some imagination is involved in the query doesn't make it creation.
No, the fact that something is made with it makes it creation. Imagination makes it creative, on top of being creation.
The problem I see here is that you are trying to insert your moral/aesthetic judgement of the quality/value of the mechanism/process of creation into the objective description of whether creation is happening, probably because you have subscribed to a worldview which valorizes creation and denigrates consumption, so accepting something you don't want valorized as "creation" is incompatible with that worldview.
You can just say you don't like Generative AI and wish people wouldn’t use it. You don’t need to declare that creating things with it isn’t creation to try to mask your (valid as any other) aesthetic preference in a very silly circumlocution designed to resemble an objective description, albeit a patently self-contradictory one.
I mean, I actually do think AI art isn't art, philosophically. It's not me being snooty; it's based on years of studying and thinking about art from a philosophical perspective.
The best artists are the ones that adapt to both, generating an initial image and using it as scaffolding to paint over. Drawing and using a diffusion model are not mutually exclusive concepts.
I guess the difference is that hammers are logical, simple tools with a known use case. They're fairly hard to use incorrectly, although it does take some practice to use one, I'll admit.
> You literally cannot respond with "you are holding it wrong", especially when I'm claiming that even for the popular _example prompts_ the SD authors used, they had to hand-pick the best random result out of a sea of extremely shitty images.
I do a lot of my SD work with a fixed seed and 1-image batches; once you know the specific model you are using, getting decent pictures isn't hard, and zeroing in on a specific vision is easier with a fixed seed. Once I'm happy with it, I might do multiple images without a fixed seed, using the final prompt, to see if I get something better.
If you are using a web interface that only uses the base SD models and doesn't allow negative prompts, yes, it's harder (negative prompts, and in particular good, model-specific negative embeddings, are an SD superpower).
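As a minimal sketch of that workflow using the Hugging Face diffusers library (just one possible toolchain; the checkpoint name, prompt, seed, and settings below are placeholders, not a recipe):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an SD 1.5-class checkpoint (illustrative choice, swap in any fine-tune).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Fixed seed: the same prompt + seed reproduces the same image, so the only
# thing changing between runs is your edit to the prompt.
generator = torch.Generator(device="cuda").manual_seed(1234)

image = pipe(
    prompt="a watercolor painting of a lighthouse at dusk",
    negative_prompt="blurry, extra limbs, deformed hands, watermark",
    generator=generator,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]

image.save("lighthouse.png")
```

Once the prompt is dialed in with the fixed seed, you drop the `generator` argument (or loop over seeds) to sample variations of the final prompt.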