Not familiar with any of this tech, but would it be better to get the intent by

    voice2txt command.wav | txt2intent

? Or does intent analysis actually require the sound data? (In what cases does the same phrase express different intents, and how do we even define or categorize intent in this context?)
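To make the question concrete, here is a rough sketch of the two-stage idea. Everything in it is made up for illustration: `Audio`, `voice2txt`, `txt2intent`, and the keyword table are hypothetical stand-ins, not real tools, and the "transcriber" is just a stub. The point is only that the second stage sees nothing but text, so anything carried by the sound itself is gone by then.

```python
from dataclasses import dataclass


@dataclass
class Audio:
    """Hypothetical stand-in for a recording."""
    spoken_words: str   # what was said
    rising_pitch: bool  # crude placeholder for prosody (e.g. question intonation)


def voice2txt(audio: Audio) -> str:
    # Stub transcriber (assumption): a real one would run speech recognition.
    # Note that rising_pitch is dropped here; only the words survive.
    return audio.spoken_words.lower().strip()


# Hypothetical intent table: intent name -> phrases that trigger it.
INTENT_KEYWORDS = {
    "lights_on": ("turn on", "switch on"),
    "lights_off": ("turn off", "switch off"),
}


def txt2intent(text: str) -> str:
    # Naive keyword matching, purely illustrative.
    for intent, phrases in INTENT_KEYWORDS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return "unknown"


def pipeline(audio: Audio) -> str:
    # Equivalent of: voice2txt command.wav | txt2intent
    return txt2intent(voice2txt(audio))


statement = Audio("Turn on the lights", rising_pitch=False)
question = Audio("Turn on the lights", rising_pitch=True)

# Both recordings produce the same transcript, so the text stage
# cannot tell them apart, whatever the speaker actually meant.
print(pipeline(statement), pipeline(question))
```

If intent really does depend on how something is said, a pipeline shaped like this would need to pass extra information along with the transcript, or work on the audio directly.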