I'm not familiar with any of this tech, but would it be better to get the intent with
voice2txt command.wav | txt2intent
? Or does intent analysis actually require the sound data? (In what cases does the same phrase express different intents, and how do we even define / categorize intent in this context?)