
> inferences at up to 15-95X (!) RT on my 4090

That's incredible!

Are infill and outpainting equivalents possible? Super-RT TTS at this level of quality opens up a diverse array of uses, especially for indie/experimental gamedev, that I'm excited about.



It is theoretically possible to train a model that, given some speech, attempts to continue the speech, e.g. Spectron: https://michelleramanovich.github.io/spectron/spectron/. Similarly, it is possible to train a model to edit the content, a la Voicebox: https://voicebox.metademolab.com/edit.html.


Great. :P

Me: Won’t it be great when AI can-

Computer: Finish your sentences for you? OMG that’s exactly what I was thinking!


>Are infill and outpainting equivalents possible?

Do you mean outpainting as in you still specify what words to say, or as in the model just extends the audio unconditionally, the way some image models expand past an image's borders without a specific prompt (in audio, something like https://twitter.com/jonathanfly/status/1650001584485552130)?


Not sure what you mean: if you mean could inpainting and outpainting with image models be faster, it's a "not even wrong" question, similar to asking if the United Airlines app could get faster because American Airlines' did. (Yes, getting faster is an option available to ~all code.)

If you mean could you inpaint and outpaint text...yes, by inserting and deleting characters.

If you mean could you use an existing voice clip to generate speech by the same speaker in the clip: yes, part of the article is demonstrating generating speech by speakers not seen at training time.


I'm not sure I understand what you mean to say. To me it's a reasonable question asking whether text to speech models can complete a missing part of some existing speech audio, or make it go on for longer, rather than only generating speech from scratch. I don't see a connection to your faster apps analogy.

Fwiw, I imagine this is possible, at least to some extent. I was recently playing with xtts, and it can generate speaker embeddings from short periods of speech, so you could use those to provide a logical continuation to existing audio. However, I'm not sure it's easy to manage the "seams" between what is generated and what is preexisting yet.
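Roughly like this, as a minimal sketch (assuming the Coqui TTS Python package with its XTTS v2 model; the file paths and continuation text are made up):

    # Minimal sketch, assuming `pip install TTS` (Coqui) with XTTS v2.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

    # Clone the voice from a short reference clip, then synthesize the
    # "continuation" as a fresh utterance in that voice. Splicing the
    # result onto the original audio without an audible seam is the
    # part that is still hard.
    tts.tts_to_file(
        text="...and that is why the bridge collapsed.",
        speaker_wav="existing_clip.wav",  # hypothetical reference clip
        language="en",
        file_path="continuation.wav",
    )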

It's certainly not a misguided question to me. Perhaps you could be less curt and offer your domain knowledge to contribute to the discussion?

Edit: I see you've edited your post to be more informative, thanks for sharing more of your thoughts.


It imposes a cost on others when you make false claims, like that I said or felt the question was unreasonable.

I didn't and don't.

It is a hard question to understand and an interesting mind-bender to answer.

Less policing of the metacontext and more focusing on the discussion at hand will help ensure there are interlocutors around to, at the very least, continue policing.


Sorry but it was pretty obvious what he meant.


It's not, at all.

He could have meant speed, text, audio, words, or phonemes, with images the least probable.

He probably didn't mean phonemes or he wouldn't be asking.

He probably didn't mean arbitrarily slicing 'real' audio and stitching on fake audio - he made repeated references to a video game.

He probably didn't mean inpainting and outpainting imagery, even though he made reference to a video game, because it's an audio model.

Thank you for explaining I deserve to get downvoted through the floor multiple times for asking a question because it's "obvious". Maybe you can explain to the rest of the class what he meant then? If it was obviously phonemes, will you then advocate for them being downvoted through the floor since the answer was obvious? Or is it only people who assume good faith and ask what they meant who deserve downvotes?


Inpainting and outpainting of images is when the model generates bits inside or outside the image that don't exist. By analogy, he was talking about generating sound inside (i.e. filling gaps) or outside (extrapolating beyond the end of) the audio.
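In spectrogram terms the analogy is almost literal. A toy sketch (plain numpy, no real model; shapes are made up) of what the masked inputs to a hypothetical audio infill model would look like:

    import numpy as np

    mel = np.random.rand(80, 500)  # stand-in for a real (n_mels, frames) spectrogram

    # "Inpainting": zero out a gap in the middle for the model to fill.
    inpaint_input = mel.copy()
    inpaint_input[:, 200:260] = 0.0

    # "Outpainting": append blank frames past the end for the model to extrapolate into.
    outpaint_input = np.concatenate([mel, np.zeros((80, 100))], axis=1)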

I don't know why you would think he was talking about inpainting images or words. This whole discussion is about speech synthesis.


Right, _until he brought up inpainting and outpainting_. And as I already laid out, the audio options made just about as much sense as the art.

I honestly can't believe how committed you are to explaining to me that as the only person who bothered answering, I'm the problem.

I've been in AI art since it was 10 people in an IRC room trying to figure out what to do with a bunch of GPUs an ex-hedge-fund manager snapped up, and I've spent the last week working on porting eSpeak, the bedrock of ~all TTS models, from C++.

It wasn't "obvious" they didn't mean art, and it definitely was not obvious that they wanted to splice real voice clips at arbitrary points and insert new words without the result being a detectable fake, for a video game. I needed more info to answer. I'm sorry.


I'll be the first to admit that it was an off-the-cuff, vague, and unclear question, and I'm lucky some people got it.

Wait 'till you learn I'm a woman though. :>


Ignore the speed comment; it is unrelated to my question.

What I mean is, can output be conditioned on antecedent audio as well as text, analogous to how image diffusion models can condition inpainting and outpainting on static parts of an image and CLIP embeddings?


Yes: the paper and Eleven Labs both have, as a major feature, "given $AUDIO_SET, generate speech for $TEXT in the same style as $AUDIO_SET".

No, in that you can't cut it at an arbitrary mid-word point, say at "what tim" in "what time is it in beijing", give it the string "what time is it in beijing", and have it recover seamlessly.

Yes, in that you can cut it at an arbitrary phoneme boundary: 'this, I.S. a; good: test! ok?' in IPA is 'ðˈɪs, ˌaɪˌɛsˈeɪ; ɡˈʊd: tˈɛst! ˌoʊkˈeɪ?', so I can cut it between phonemes, give the model the remaining phoneme string, and have it complete the utterance.
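To make the phoneme-boundary point concrete, here's a sketch assuming the phonemizer package with an espeak-ng backend installed; the synthesize_from_phonemes call at the end is hypothetical, standing in for any phoneme-conditioned TTS:

    # Assumes `pip install phonemizer` plus an espeak-ng install.
    from phonemizer import phonemize

    ipa = phonemize(
        "this, I.S. a; good: test! ok?",
        language="en-us",
        backend="espeak",
        with_stress=True,
        preserve_punctuation=True,
    )
    # -> roughly 'ðˈɪs, ˌaɪˌɛsˈeɪ; ɡˈʊd: tˈɛst! ˌoʊkˈeɪ?'

    # Cut at a phoneme boundary and hand the tail to a
    # phoneme-conditioned model (hypothetical function, not a real API):
    prefix, tail = ipa[:4], ipa[4:]
    # audio_tail = synthesize_from_phonemes(tail, speaker_wav="clip.wav")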


Perfect! Thank you




