Most ML engineers know that many want more fine grained control. But the straight forward way to train such models is incredibly data demanding. The datasets used for whole image generation consist of several billion images. I do not think anyone has compiled any DAW project / stems projects that are anywhere close to this size. So that is a limiting factor right now. But we will find ways to get there, probably a lot of progress over the next 5 years. Maybe even the next 2.