Setting Up DeepSpeech

Sam Berg

I've identified the DeepSpeech engine as either the core of my STT application or a component of it. I don't know much about the library's implementation (FYI, it's based on this paper), but before I dig into that I want to figure out how to use DeepSpeech. Maybe there's a way to route live audio into it and have it spit out text. That would meet my acceptance criteria for this project - although maybe I'll want to try doing more from scratch.

This is the most painful part of every project for me, so wish me luck. Before we can do anything fun with DeepSpeech, we have to get it up and running.

DeepSpeech has a section on installation on its documentation home page. Supposedly all I need to do is run these commands:

# Create and activate a virtualenv
virtualenv -p python3 $HOME/tmp/deepspeech-venv/
source $HOME/tmp/deepspeech-venv/bin/activate

# Install DeepSpeech
pip3 install deepspeech

After giving the ol' copy/paste a shot... I find out I need to install virtualenv.
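For anyone following along: virtualenv comes from PyPI, so pip can install it; alternatively, Python 3 ships a built-in venv module that does the same job for this purpose. (The exact paths here mirror the docs' example; adjust to taste.)

```shell
# Install virtualenv from PyPI (may need --user or sudo depending on your setup)
python3 -m pip install --user virtualenv

# Or skip virtualenv entirely and use the stdlib venv module instead:
python3 -m venv "$HOME/tmp/deepspeech-venv/"
source "$HOME/tmp/deepspeech-venv/bin/activate"
```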

Running DeepSpeech

Okay, now it should be set up, so I'm going to try to run the engine to transcribe an audio file.

# Download pre-trained English model files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

# Download example audio files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
tar xvf audio-0.9.3.tar.gz

# Transcribe an audio file
deepspeech \
  --model deepspeech-0.9.3-models.pbmm \
  --scorer deepspeech-0.9.3-models.scorer \
  --audio audio/2830-3980-0043.wav

It worked! I'll be honest, though: I don't know what happened. Where is the transcription?

(deepspeech-venv) ~/Programming/deepspeech $ deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav
Loading model from file deepspeech-0.9.3-models.pbmm
TensorFlow: v2.3.0-6-g23ad988
DeepSpeech: v0.9.3-0-gf2e9c85
2022-05-13 15:36:53.241530: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Loaded model in 0.0154s.
Loading scorer from files deepspeech-0.9.3-models.scorer
Loaded scorer in 0.000225s.
Running inference.
experience proves this
Inference took 1.221s for 1.975s audio file.

I haven't found any documentation explaining that output. There's an example Python client referenced on this page, though, and unlike the command-line client, I can actually read the Python file to see what it's doing.
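From skimming the client, the step before inference is just WAV loading: open the file, sanity-check the sample rate, and unpack the raw 16-bit samples that get handed to the model. Here's a stdlib-only sketch of that loading step - I'm generating a synthetic one-second tone as a stand-in for the sample audio, since I can't bundle the real files here:

```python
import math
import struct
import wave

# Write a 1-second synthetic 16 kHz mono WAV as a stand-in for the sample audio
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # the pre-trained English models expect 16 kHz audio
    frames = b"".join(
        struct.pack("<h", int(0.3 * 32767 * math.sin(2 * math.pi * 440 * t / 16000)))
        for t in range(16000)
    )
    w.writeframes(frames)

# This mirrors what the example client does before running inference:
# read the WAV, check the rate, and unpack the raw 16-bit PCM samples.
with wave.open("tone.wav", "rb") as w:
    assert w.getframerate() == 16000, "model expects 16 kHz audio"
    pcm = w.readframes(w.getnframes())

samples = struct.unpack("<%dh" % (len(pcm) // 2), pcm)
print(len(samples))  # one second of audio -> 16000 samples
```

In the real client those samples end up in a 16-bit integer array that gets passed to the model's speech-to-text call.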

Ah, I figured it out! I can add a --json flag to the command, and voilà:

(deepspeech-venv) ~/Programming/deepspeech $ deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/4507-16021-0012.wav --json
Loading model from file deepspeech-0.9.3-models.pbmm
...
{
  "transcripts": [
    {
      "confidence": -26.321409225463867,
      "words": [
        {
          "word": "why",
          "start_time": 0.76,
          "duration": 0.12
        },
...
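That JSON gives per-word timing, and joining the words back together recovers the plain transcript. A quick sketch of pulling that out with the stdlib - the output above is truncated, so I'm using a made-up two-word result in the same shape:

```python
import json

# A miniature result in the same shape as the --json output above
# (the confidence value and second word here are made up for illustration)
result = json.loads("""
{
  "transcripts": [
    {
      "confidence": -26.32,
      "words": [
        {"word": "why", "start_time": 0.76, "duration": 0.12},
        {"word": "should", "start_time": 0.92, "duration": 0.18}
      ]
    }
  ]
}
""")

# Take the highest-confidence candidate and join its words into a sentence
best = max(result["transcripts"], key=lambda t: t["confidence"])
transcript = " ".join(w["word"] for w in best["words"])
print(transcript)  # -> why should
```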
© Sam Berg.