Instead of triggering a device by voice, you could trigger commands by fingerspelling them in ASL and letting natural-language (NL) prediction either guess the intended command or offer a few choices, then wait for the user's response signal.
https://motionsavvy.com/uni.html
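A rough sketch of the predict-or-offer-choices step, assuming the recognizer already hands you the fingerspelled letters as a plain string. The command list and the function names `candidates` and `resolve` are made up for illustration, and simple prefix matching stands in for real NL prediction:

```python
from typing import List, Optional, Tuple

# Hypothetical command vocabulary, for illustration only.
COMMANDS = ["lights on", "lights off", "lock door", "play music", "pause music"]


def candidates(spelled: str, commands: List[str] = COMMANDS) -> List[str]:
    """Return the commands that start with the letters spelled so far.

    Spaces are ignored on both sides, since fingerspelling is a stream of letters.
    """
    prefix = spelled.replace(" ", "").lower()
    if not prefix:
        return []
    return [c for c in commands if c.replace(" ", "").startswith(prefix)]


def resolve(spelled: str, commands: List[str] = COMMANDS) -> Tuple[Optional[str], List[str]]:
    """Return (command, choices).

    One match: return it as the predicted command with no choices.
    Zero or several matches: return None plus the list of choices,
    so the caller can show them and wait for the user's response signal.
    """
    matches = candidates(spelled, commands)
    if len(matches) == 1:
        return matches[0], []
    return None, matches
```

With this vocabulary, spelling "lo" is enough to pick "lock door", while "li" is ambiguous and would come back as a list of choices to confirm.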
The OP is more about 3D recognition of mimed actions, like turning dials and pushing buttons (virtual reality applications), than about recognizing symbols.