Hacker News
Voice2json: Offline speech and intent recognition on Linux (voice2json.org)
486 points by easrng on May 21, 2021 | 108 comments


Has anyone had any success getting the software to work?

It's entirely unpackaged: https://repology.org/projects/?search=voice2json https://pkgs.org/search/?q=voice2json

Docker image is broken, how'd that happen?

    $ voice2json --debug train-profile
    ImportError: numpy.core.multiarray failed to import
    Traceback (most recent call last):
      File "/usr/lib/voice2json/.venv/lib/python3.7/site-packages/deepspeech/impl.py", line 14, in swig_import_helper
        return importlib.import_module(mname)
      File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
      File "<frozen importlib._bootstrap>", line 983, in _find_and_load
      File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 670, in _load_unlocked
      File "<frozen importlib._bootstrap>", line 583, in module_from_spec
      File "<frozen importlib._bootstrap_external>", line 1043, in create_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
    ImportError: numpy.core.multiarray failed to import


I tried Docker (both Debian versions of the Dockerfile) and building from scratch; none of them work.


The source package does have installation instructions and appears to use Autotools: https://voice2json.org/install.html#from-source. Hopefully at least building from source works.


Building the v2.0 tag (or even master) using docker does not:

    E: The repository 'http://security.ubuntu.com/ubuntu eoan-security Release' does not have a Release file.
And just bumping the image tag to ":groovy" caused subsequent silliness, so this project is obviously only for folks who enjoy fighting with build systems (and that matches my experience of anything in the world that touches NumPy and friends).


It's generally the fault of pip rather than numpy, but yeah, this is pretty common.


Good FLOSS speech recognition and TTS is badly needed. Such interaction should not be left to an oligopoly with a bad history of not respecting users' freedoms and privacy.


Mozilla Common Voice is definitely trying. I always do a few validations and record a few clips if I have a few minutes to spare, and I recommend everyone do the same. They need volunteers to validate and upload speech clips to create a dataset.

https://commonvoice.mozilla.org/en


I like the idea, and decided to try doing some validation. The first thing I noticed is that it asks me to make a yes-or-no judgment of whether the sentence was spoken "accurately", but nowhere on the site is it explained what "accurate" means, or how strict I should be.

(The first clip I got was spoken more or less correctly, but a couple of words are slurred together and the prosody is awkward. Without having a good idea of the standards and goals of the project, I have no idea whether including this clip would make the overall dataset better or worse. My gut feeling is that it's good for training recognition, and bad for training synthesis.)

This seems to me like a major issue, since it should take a relatively small amount of effort to write up a list of guidelines, and it would be hugely beneficial to establish those guidelines before asking a lot of volunteers to donate their time. I don't find it encouraging that this has been an open issue for four years, with apparently no action except a bunch of bikeshedding: https://github.com/common-voice/common-voice/issues/273


After listening to about 10 clips your point becomes abundantly clear.

One speaker, who sounded like they were from the midwestern United States, was dropping the S off words in a couple of clips. I wasn't sure if it was misreads or some accent I'd never heard.

Another speaker, with a thick accent that sounded European, sounded out all the vowels in "circuit". Had I not had the line being read, I don't think I'd have understood the word.

I heard a speaker with an Indian accent who added a preposition to the sentence that was inconsequential but incorrect nonetheless.

I hear these random prepositions added as flourishes frequently from some Indian coworkers; does anyone know the reason? It's kind of like how Americans interject "Umm..." or drop prepositions (e.g. "Are you done your meal?"), and I almost didn't pick up on it. For that matter, where did the American habit of dropping prepositions come from? It seems like it's primarily people in the Northeast.


I can't quite imagine superfluous prepositions (could you give an example?), but I have found it slightly amusing, while learning Hindi, to come across things where I think: Oh! That's why you sometimes hear X from Indian English speakers; it's just a slightly 'too' literal¹ mapping from Hindi, or an attempt to use a grammatical construction that doesn't really exist in English, like 'topic marking'.

[¹] If that's even fair, given it's a dialect in its own right; Americans also say things differently than I would as a 'Britisher'.


Are you talking about phrases like "please do the needful"?


That's not one I've heard. Examples that come to mind are 'even I' (which seems closer to 'I too' than the 'you'd scarcely believe it, but I' that it naturally sounds like to me), 'he himself' (or similar subject emphasis), and adverb repetition.

I'd say it's mostly subtler things I've noticed, though (I suppose that should be the expected distribution!); they're just harder to recall as a result.

(Just want to emphasise I'm not making fun of anybody or saying anything's wrong, in case it's not clear in text. I'm just enjoying learning Hindi, fairly interested in language generally, and interested/amused to notice these things.)


Just thought of another - '[something is] very less' - which comes, presumably, from कम being used for both little/few and less than.

Hindi is much more economical; to put it literally, one says things like 'than/from/compared to orange, lemon is sour', and 'orange is little/less [without comparison] sour'.

Which, I believe, is what gives rise to InE sentences like 'the salt in this is very less' (it needs more salt, there's very little).


I have never heard that example. Maybe it is regional. I have heard "you done?" as an example.


I downloaded the (unofficial) Common Voice app [1] and it provides a link to some guidelines [2], which also aren't official but look sensible and seem like the best there is at the moment.

[1] https://f-droid.org/packages/org.commonvoice.saverio/

[2] https://discourse.mozilla.org/t/discussion-of-new-guidelines...


If you read the docs, it says voice2json is a layer on top of the actual voice recognition engine, and it supports Mozilla DeepSpeech, Pocketsphinx, and a few others as the underlying engine.


I've used the DeepSpeech project a fair amount and it is good. It's not perfect, certainly, and honestly it isn't good enough yet for accurate transcription in my mind, but it's good: easy to work with, pretty good results, and all the right kinds of free.

Thanks for taking time to contribute!


I wonder if they use movies and TV: recordings where the script is already available.


That's fine for training your own model, but I don't think you could distribute the training set. That seems like a clear copyright violation, against one of the groups that cares most about copyright.

Maybe you could convince a couple of indie creators or state-run programs to licence their audio? But I'm not sure if negotiating that is more efficient than just recording a bit more audio, or promoting the project to get more volunteers.


It would likely be a lot easier for someone from within the BBC, CBC, PBS, or another public broadcaster to convince their employer to contribute to the models. These organizations often have accessibility mandates with real teeth and real costs implementing that mandate. The work of closed captioning, for example, can realistically be improved by excellent open source speech recognition and TTS models without handing all of the power over to Youtube and the like.

It would still be an uphill battle to convince them to hand over the training set but the legal department can likely be convinced if the data set they contribute back is heavily chopped up audio of the original content, especially if they have the originals before mixing. I imagine short audio files without any of the music, sound effects, or visual content are pretty much worthless as far as IP goes.


> That's fine for training your own model, but I don't think you could distribute the training set. That seems like a clear copyright violation, against one of the groups that cares most about copyright.

I'm not sure that is a clear copyright violation. Sure, at a glance it seems like a derivative work, but it may be altered enough that it is not. I believe that collages and reference guides like CliffsNotes are both legal.

I think a bigger problem would be that the scripts, and even the closed captioning, rarely match the recorded audio 100%.


And also... it's not like the program actually contains a copy of the training data, right? The training data is a tool which is used to build a model.


How is it different from things like GPT3 which (unless I’m mistaken) is trained on a giant web scrape? I thought they didn’t release the model out of concerns for what people would do with a general prose generator rather than any copyright concerns?


Does using copyrighted works to train a machine learning model make that model infringing?


Generally an ML model transforms the copyrighted material to the point where it isn't recognizable, so it should be treated as its own unrelated work that isn't infringing or derivative. But then you have e.g. GPT that is reproducing some (largeish) parts of the training set word-for-word, which might be infringing.

Also I don't think there have been any major court cases about this, so there's no clear precedent in either direction.


> But then you have e.g. GPT that is reproducing some (largeish) parts of the training set word-for-word, which might be infringing.

Easy fix: keep a Bloom filter of hashed n-grams, ensuring you don't repeat more than N words from the training set.
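
Something like this, roughly - a toy Python sketch; the sizes, hash count, and the actual rejection step at decode time are all hand-waved:

    import hashlib

    class NgramBloom:
        """Bloom filter over hashed n-grams; sizes are arbitrary."""

        def __init__(self, num_bits=1 << 27, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, ngram):
            key = " ".join(ngram).encode()
            for seed in range(self.num_hashes):
                digest = hashlib.blake2b(key, person=str(seed).encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, ngram):
            for pos in self._positions(ngram):
                self.bits[pos >> 3] |= 1 << (pos & 7)

        def __contains__(self, ngram):
            return all(self.bits[pos >> 3] & (1 << (pos & 7))
                       for pos in self._positions(ngram))

    # Populate with every N-word window of the training text; at generation
    # time, reject (or resample) any candidate whose last N words hit the filter.
    N = 4
    training_tokens = "the quick brown fox jumps over the lazy dog".split()  # toy corpus
    filt = NgramBloom()
    for i in range(len(training_tokens) - N + 1):
        filt.add(tuple(training_tokens[i:i + N]))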


There are some who say that the Google Books court case is precedent for ML model stuff; if you search back through my comment history you will find links.


Thanks!


GP is not talking about the model but about the training data set.


I am aware; I'm asking whether the model, however, is infringing. Surely you can't distribute the works in a dataset, but is training on copyrighted data legal, and can you distribute that model?


All text written by a human in the US is automatically copyrighted to the author. So if an engine trained on works under copyright is a derivative work, GPT-3 and friends have serious problems.


I expect that wouldn't be perfect, though. Sometimes the cut that makes it into the final product doesn't exactly match the script. Sometimes it's due to an edit; other times an actor says something similar to, but not exactly, what the script says, and the director decides to just go with it.

What might work better is using closed captions or subtitles, but I've also seen enough cases where those don't exactly match the actual speech either.


They might work even better for interpreting the intent of spoken text. Not great for dictation though.


He meant subtitles when he talked of the script.


Weird sentences


Good speech recognition generally requires massive mountains of training data, both labelled and unlabelled.

Massive mountains of data tend to be incompatible with open source projects. Even Mozilla collecting user statistics is pretty controversial. Imagine someone like Mozilla trying to collect hundreds of voice clips from each of tens of millions of users!!


Really complicated question, but considering the free world got Wikipedia and OpenStreetMap, I'd bet we'll find a way.


> Really complicated question, but considering the free world got Wikipedia and OpenStreetMap, I'd bet we'll find a way.

Both of those involve entering data about external things. Asking people to share their own data is another thing entirely; I suspect most people, myself included, are much more suspicious about that.


> Imagine someone like Mozilla trying to collect hundreds of voice clips from each of tens of millions of users!!

They do, and it's working! https://commonvoice.mozilla.org/en


Except they have 12k hours of audio, when really they could do with 12B hours of audio...


Then you need a lot of people to listen to those 12B hours of audio, and multiple listeners to agree, for each chunk of audio, that what is spoken corresponds to the transcript.


Lots of machine learning systems can use unsupervised and semi-supervised learning. Then nobody has to listen to and annotate all that audio.


Yes, but then you don't need Mozilla collecting read speech samples. You can just scrape any audio out there, run speech activity detection, and there you go.
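
The speech activity detection step is cheap to sketch; here's a rough Python version using the webrtcvad package as one possible option (assuming 16 kHz, 16-bit mono PCM and 30 ms frames):

    import webrtcvad

    def speech_frames(pcm_bytes, sample_rate=16000, frame_ms=30, aggressiveness=2):
        """Yield only the frames the VAD classifies as speech."""
        vad = webrtcvad.Vad(aggressiveness)  # 0 (lenient) .. 3 (aggressive)
        frame_len = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
        for start in range(0, len(pcm_bytes) - frame_len + 1, frame_len):
            frame = pcm_bytes[start:start + frame_len]
            if vad.is_speech(frame, sample_rate):
                yield frame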


Good point. I'm doing my part to contribute to it, though, not much else I can do!


Didn't Moz fire all the people who worked on this, and they have started their own project now?


Well speech recognition for personal use doesn't have to recognise everyone. In fact it's a feature, not a bug if it recognises only me as the user.


Not an expert on any of this, but wouldn’t already published content (public or proprietary) such as Youtube videos, audiobooks, tv interviews, movies, tv programs, radio programs, podcasts, etc. be useful and exempt from privacy concerns?

Do user-collected clips have something so special that it's critical to collect them?


Movies etc. would need to be transcribed accurately to be useful for training, and even then they just provide a single sample for the specific item.


Another problem is that the models tend to get very, very large from what I've seen. A gigabyte to tens of gigabytes is an undesirable requirement on your local machine.


Not sure about others, but DeepSpeech also distributes a "lite" model that's much smaller and suitable for mobile devices. Not sure how its accuracy compares to the full model though.


With insane amounts of computation, making models much smaller while having minimal impact on performance is possible.


I'd check out coqui https://coqui.ai/

It's well documented and works basically out of the box. I wish the bundled STT models were closer to the quality of Kaldi, but the ease of use has no comparison.

And maybe with time it will surpass Kaldi in quality too.


Is this free as in speech and beer, and self-hostable? And not open core?


Free as in speech and beer, self-hostable, and I don't know about open core, but everything seems freely available right now.


There are a bunch of good libraries that work offline out there for speech recognition -- CMUSphinx[0] has been around a long time and work seems to have shifted a little bit to Kaldi[1] and Vosk[2] (?). Julius is still going strong as well[3].

CMUSphinx and Julius have been around for ~10+ years at this point.

[EDIT] - there's even a useful Quora post[4]

[0]: https://cmusphinx.github.io/

[1]: https://www.kaldi-asr.org/doc/about.html

[2]: https://github.com/alphacep/vosk-server

[3]: https://github.com/julius-speech/julius

[4]: https://www.quora.com/Are-there-any-open-source-APIs-for-spe...


Speech recognition algorithms today require lots of data, lots of training computation, and a decent design.

Decent designs are in published papers all over the place, so that's a solved issue.

Lots of compute requires lots of $$$, which isn't opensource-friendly.

Lots of data also isn't really opensource friendly.

Sadly this is a niche that the opensource business model doesn't really fit.


People would probably have said the same about Wikipedia 20 years ago. People said similar things about GNU, GCC, and Linux 30 years ago.


GNU is heavily skewed toward developer tools and infrastructure, and GCC is no counterexample. There are so many classes of software where this does not work. Pretty much anything for heavily regulated industries is not well served by FLOSS: there is little open source insurance or medical records software* (the few systems that exist are highly niche and/or limited), and EDA/CAD is not well served by FLOSS either (I've toyed with FreeCAD, but even hobbyists gravitate to Fusion). Outside of developer tooling and infrastructure, commercial, generally closed source, closed-development software is king.

* Besides, the hard part of standing up an EMR is not installing prepackaged software.


> Lots of compute requires lots of $$$, which isn't opensource-friendly.

Not really, look up BOINC.


There's more involved than just raw CPU cycles. It's not something that is easily adapted to BOINC, but trying to offload things to BOINC to free up clusters better suited to training models might make sense.


Sounds like a viable model for certain universities, though.


Indeed, and it doesn't have to be as "machine learning" as the big ones.

A FLOSS system would only have my voice to recognise, and I would be willing to spend some time training it. Very different use case from a massive cloud that should recognise everyone's voice and accent.


pico2wave with the en-gb voice seems not too bad for TTS. I had reasonable luck with limited domain speech recognition using pocketsphinx, but it does need some custom vocabulary.

Example: https://www.youtube.com/watch?v=tfcme7maygw

Granted, maybe this is "not good enough", but I feel like I got pretty far with pico2wave, pocketsphinx, plus 1980s Zork-level "comprehension" technology.

And the open source status of pico2wave is a bit questionable, I'll grant you that.

A bit more detail about the implementation here: https://scaryreasoner.wordpress.com/2016/05/14/speech-recogn...


> should not be left to an oligopoly with a bad history of not respecting users' freedoms and privacy

So companies with a lot of data, then.


That's not what this is. This is more like the system you use for phone-answering systems ("Do you want help with a bill, payment, order, or refund?")


Indeed, this is what I got from it too. It seems like an alternative to VoiceXML, which is used by companies like Nuance.


Author here. Thanks to everyone for checking out voice2json!

The TLDR of this project is: a unified command-line interface to different offline speech recognition projects, with the ability to train your own grammar/intent recognizer in one step.

My apologies for the broken packages; I'll get those fixed shortly. My focus lately has been on Rhasspy (https://github.com/rhasspy/rhasspy), which has a lot of the same ideas but a larger scope (full voice assistant).

Questions, comments, and suggestions are welcomed and appreciated!


Is the primary use case NLP interfaces? I'm looking for a good tool for automated transcription of long-form (10-60 minute) audio.


I'd recommend using Vosk directly for that: https://alphacephei.com/vosk/

voice2json is better suited for limited domain speech, where each sentence is a specific voice command (think home automation).


Have you seen anyone using this for vim? Do you have any example of how that might look, or insight into whether it would work?


I haven't seen this yet, but I imagine it would involve running at least:

    voice2json record-command | voice2json transcribe-wav | jq .text

This will record a single command (until silence) and output the text transcription.


Hey there, this looks great. I was wondering, why DeepSpeech 0.6? Why not the latest version, DeepSpeech 0.9?


I need to cycle back and update voice2json. Rhasspy (the full voice assistant) supports DeepSpeech 0.9.3.


Awesome, thanks.


I wonder if it would be possible to map vim keybindings to sounds and effectively drive the editor with the mouth when the hands are otherwise occupied. It might be possible to use sounds that compose into pronounceable words with minimal syllables for combinations. What would vim bindings look like as a concise command language suited to human vocalization?

E.g. maybe "dine" maps to d$ and "chine" to c$. So as in keyboard vim you can guess what "dend" and "chend" do.


I do this successfully for work using https://talonvoice.com/ - the initial learning curve is steep, but once you learn how to configure and hack on the commands, you can be very effective. I use it maybe half the day to combat lingering RSI symptoms, and with some work I could probably use it for 98% of my input to the computer. Some people do use it for 100%, afaik.


You do this on Linux?


Yeah, although it's also available for Mac


https://youtu.be/8SkdfdXWYaI?t=600

this guy is already there: Slurp slap scratch buff yank


I now get the joke about Emacs and OS


I wonder if this would pass the Debian Deep Learning Team's Machine Learning policy, which requires public data under a libre license that is retrainable using software under a libre license, without any proprietary drivers:

https://salsa.debian.org/deeplearning-team/ml-policy


I think many of the available Kaldi/DeepSpeech models would pass, at least with the "Type-F Reproducibility". The pocketsphinx models would not, however, since they were trained on private datasets.

My aim has been to train "good enough" models from any public/free data I can get my hands on.


Really interesting use of intents and entities. I feel like some of this is reinventing the wheel, since there is already a grammar specification, but novel use of intents/entities. https://www.w3.org/TR/speech-grammar/


Yeah, in my experience no one uses or supports that specification, which is a shame, because if you're using something like AWS Connect with AWS Lex for telephony IVR, you can't just create a grammar and then have AWS Lex figure out how to turn its recognized speech-to-text into something that matches a grammar rule. Instead, Lex will return speech-to-text results according to general English grammar, rather than what you might have prompted the user to reply with. You'll be unpleasantly surprised if you think that defining a custom entity as alphanumeric prevents the utterance "[wʌn]" from sometimes matching "won" instead of "one" or "1".

Edit - Sorry, I realize that's a tangent. What I'm saying is that when I was evaluating speech to text engines for things like IVR systems using AWS and Google, neither of them supported SRGS. Microsoft does, I think, but they didn't have a telephony component, and IBM was ignored from the get go, so "no one" really means "two very large companies."


Some do, some don't, sure. Google STT for example supports class tokens natively. There are also services like uniMRCP that allow for certain SRGS grammar features to be used with Google STT, but they are limited in what constructs they support. I've worked pretty extensively with a platform called Verbio, and they fully support the SRGS grammar specification. I work in conversational AI, and when I do implementations, I have to evaluate the complexity of the use case and whether or not a full grammar will be needed and choose an STT provider based on that.


My templating language was inspired by JSGF, which seems to have informed the ABNF version of the W3C Speech Grammars. I don't support probabilities, though, since those are derived during the n-gram model generation.

I would have preferred to use a standard. Perhaps this is something for a future version.


Fantastic.

Might use this with a Raspberry Pi to set up some projects around the house. Is it possible to buy higher quality voice data?


If you're interested in projects on a Pi then you might just be interested in this: https://github.com/rhasspy/rhasspy

It's from the same author.


I like Rhasspy, but the problem I have with it is that it's too much of a toolkit and not enough of an application. There are too many choices to pick for the different components. I think they should pick one of each and really tune them so it works really well. That way they'd take a lot of complexity away from the user.


Agreed. I've at least added a "Recommended" option in the web UI that's language-specific.

Part of the problem is that language support varies dramatically between components. There's usually a pretty obvious "best" set for English, but it gets more difficult with other languages.


For those who care: MIT license.


How does it compare to Vosk and other open source models/APIs?


I plan to add Vosk support soon.

The goal of voice2json is to provide a common layer on top of existing open source engines. This common layer lets you train custom speech/intent models without having to know the details of each engine.


I am working on something to compare at least 10 different ASRs, both open source and production ones.


Just posted a Medium article on using this tool to execute terminal commands with your voice: https://mitchellharle.medium.com/how-to-execute-terminal-com... It was surprisingly straightforward!


Excellent! I just installed Mycroft the other day to play around with it; while it looks like a great start, I noticed two odd things. The first is obvious, which is the online/offline thing.

The second was a little surprising (and maybe I missed it?): there was not much in the way of easily getting transcribed output to and from shell scripts.


Not familiar with any of this tech, but would it be better to get the intent via

    voice2txt command.wav | txt2intent
? Or does intent analysis actually require the sound data? (What are the cases of the same phrase expressing different intents, and how do we even define/categorize intent in this context?)


This should come with pre-trained templates to create new templates via voice commands ;)


Would love to see a demo integrating this with an IDE, for either voice-to-code or voice commands to navigate menus. I think the killer application would layer voice and traditional input rather than replace it.


I would really like this technology to take off in mobile apps. When interacting with my mobile phone, it is often more convenient to navigate by voice than by finger.


Neat. It would be even neater if it used state to provide a prior on likely intents (i.e. in its simplest form, if you know the light is on, "turn on the light" has a prior of 0).
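
In its most naive form it could just rescore the recognizer's candidate list before committing to one intent; a rough sketch (all names here are made up, voice2json doesn't do this itself):

    def apply_state_prior(nbest, state):
        """Rescore intent hypotheses with a state-dependent prior.

        nbest: list of (intent_name, confidence) pairs from the recognizer.
        state: current device state, e.g. {"light": "on"}.
        """
        def prior(intent):
            # A command that would be a no-op given the current state gets prior 0.
            if intent == "TurnOnLight" and state.get("light") == "on":
                return 0.0
            if intent == "TurnOffLight" and state.get("light") == "off":
                return 0.0
            return 1.0

        rescored = [(name, conf * prior(name)) for name, conf in nbest]
        return max(rescored, key=lambda item: item[1])

    # With the light already on, the "turn on" hypothesis loses to a
    # lower-confidence but plausible alternative.
    print(apply_state_prior([("TurnOnLight", 0.6), ("TurnOffLight", 0.4)], {"light": "on"}))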


Things like state would probably be under the scope of whatever you're feeding the intents into.


Yes, but by the time you've generated an intent, it's too late to improve recognition accuracy using the prior.


I can finally build the Jarvis home assistant I dreamt of when first learning to code in high school. Too bad I now know that voice assistant widgets are generally useless.


Speech recognition is actually orthogonal to AI. In my day the AI prototypes (like ELIZA) were basically chat bots. Speech recognition is now very sophisticated and accurate. Determining meaning from human language (spoken or written) is far more advanced than it used to be but still kinda sucks.


Yes, text to speech has become impressively good and can basically be relied upon. But Siri/Alexa still really disappoint me and don't seem to be much better than a list of basic rules one could program up.


It's not quite clear, but do you need to sacrifice your privacy in any way to use it? E.g. by sending data to some service in order to get a trained model?


The description clarifies the underlying systems:

>> Supported speech to text systems include:

>> CMU’s pocketsphinx

>> Dan Povey’s Kaldi

>> Mozilla’s DeepSpeech 0.6

>> Kyoto University’s Julius

In case you're not aware, those are all run locally (thus not sending data off and not sacrificing privacy, as you mention).



