Wiring DeepSpeech to Live Microphone Input

Sam Berg, project research

I know I can get DeepSpeech to read a .wav file and output a JSON transcript. How can I get DeepSpeech to take in live audio, and how can I get the generated transcript to output to the command line?

I found out DeepSpeech's Python API has a Model class whose stt() method takes in a NumPy array of 16-bit samples. Also, there's a library that can capture live audio as a NumPy ndarray: sounddevice. I had to install the PortAudio library to get it working. That was a bit complicated. You have to compile and configure it yourself:

# download and extract PortAudio tar; go to that directory
sudo apt-get install libasound-dev # I didn't have this dependency installed
./configure && make
sudo make install # copies the library into /usr/local/lib
sudo ldconfig # refresh the linker cache so libportaudio can be found
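With PortAudio in place, a minimal capture sketch might look like the following. The model filename and recording length are my own placeholder choices; DeepSpeech's released English models expect 16 kHz, 16-bit mono PCM, so the float samples sounddevice returns get scaled to int16 first.

```python
import numpy as np

def float_to_int16(samples):
    """Convert float32 samples in [-1.0, 1.0] to the int16 PCM DeepSpeech expects."""
    return (np.clip(samples, -1.0, 1.0) * 32767).astype(np.int16)

def record_and_transcribe(seconds=5, rate=16000):
    """Record from the default microphone, then transcribe the whole buffer."""
    import sounddevice as sd
    from deepspeech import Model

    model = Model("deepspeech-0.9.3-models.pbmm")  # placeholder model path
    audio = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype="float32")
    sd.wait()  # block until the recording finishes
    return model.stt(float_to_int16(audio.flatten()))
```

This is still batch-style (record, then transcribe), but it proves the mic-to-NumPy-to-DeepSpeech path works before adding anything fancier.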

On second thought...

I found out through looking at an example on DeepSpeech usage that instead of using a raw audio library I can use a voice activity detection (VAD) library. This way, I can just send voice data to DeepSpeech, which should improve the results, or at least lessen the load on the system, since it won't have to waste effort trying to parse words when there are none.

The VAD library I found for Python is py-webrtcvad, which wraps the voice activity detector from WebRTC, Google's open-source real-time communication project. The fact that WebRTC is also available in browsers excites me. Maybe I can create a little widget on my blog that can output speech from the user's mic...

The Design

With what I now know, this is my plan for implementing my STT app.

  1. Use py-webrtcvad to detect speech
  2. Route the speech data to DeepSpeech
  3. Transcribe the speech to the console, live
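The three steps above might fit together roughly like this. The DeepSpeech streaming calls (createStream, feedAudioContent, intermediateDecode, finishStream) are from its Python API; the mic_frames source is left abstract, since it would come from sounddevice in practice:

```python
import numpy as np

FRAME_MS = 30
RATE = 16000
SAMPLES_PER_FRAME = RATE * FRAME_MS // 1000  # 480 samples per 30 ms frame

def frames(audio):
    """Split an int16 mono buffer into the fixed-size frames webrtcvad requires."""
    for start in range(0, len(audio) - SAMPLES_PER_FRAME + 1, SAMPLES_PER_FRAME):
        yield audio[start:start + SAMPLES_PER_FRAME]

def transcribe_live(model, vad, mic_frames):
    """Feed only voiced frames into a DeepSpeech stream, printing partials live."""
    stream = model.createStream()
    for frame in mic_frames:
        if vad.is_speech(frame.tobytes(), RATE):      # step 1: detect speech
            stream.feedAudioContent(frame)            # step 2: route to DeepSpeech
            print(stream.intermediateDecode(), end="\r")  # step 3: live transcript
    print(stream.finishStream())
```

Dropping non-speech frames before they reach the model is what should cut the wasted decoding work described above.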
© Sam Berg