The Work Before the Work
Before I start coding, there are a couple of things I need to iron out.
As a person who can get side-tracked, I need to define the scope of this project. What I intend to make is a program that takes in live audio and outputs text. This text can go to the console or to a text file, but I'm not interested in making a GUI, at least not at this point.
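To pin that scope down, here is a minimal sketch of the shape I have in mind: audio chunks go in, lines of text come out to the console or a file. The names `transcribe_stream` and `fake_transcriber` are hypothetical placeholders, and the stand-in transcriber does no real recognition; it only marks the spot where an STT engine would plug in.

```python
import sys
from typing import Callable, Iterable, TextIO


def transcribe_stream(
    chunks: Iterable[bytes],
    transcriber: Callable[[bytes], str],
    out: TextIO = sys.stdout,
) -> int:
    """Feed audio chunks to a transcriber and write any text it returns.

    Returns the number of non-empty lines written.
    """
    written = 0
    for chunk in chunks:
        text = transcriber(chunk).strip()
        if text:  # skip chunks that produced no text
            out.write(text + "\n")
            written += 1
    out.flush()
    return written


def fake_transcriber(chunk: bytes) -> str:
    """Placeholder (assumption): a real STT engine would decode audio here."""
    return f"<{len(chunk)} bytes of audio>" if chunk else ""


if __name__ == "__main__":
    # Writes to the console; pass an open file as `out` for a text file.
    transcribe_stream([b"\x00" * 320, b"", b"\x00" * 640], fake_transcriber)
```

Keeping the output as a plain file-like object means switching between the console and a text file is a one-argument change, which fits the no-GUI scope.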
Additionally, it's worth noting that with machine learning applications, you define models and train them with data. Then you use the trained model to create content (e.g. text-to-speech) or, in my case, interpret content. I am trying to get this app up and running ASAP, so I'll use pre-trained models where possible. In my experience, finding the big data sets required to train models is a pain, and it's an exercise I don't enjoy.
Next, I need to answer some of the high-level questions about making a speech-to-text (hereafter STT) application:
- What open source STT apps are out there that I can use as guideposts?
- What libraries can I use?
Open Source Speech to Text Applications and Libraries
- Mozilla DeepSpeech (Python)
- Athena (Python)
- Kaldi (C++)
Thoughts
I'm gonna start with Mozilla DeepSpeech. DeepSpeech has great documentation and can be used from Python, which I know a hell of a lot better than C++. There are plenty of other differences between the applications I listed, and plenty of other applications that could be on that list. I just want something that's easy to use, and DeepSpeech seems to be the easiest.
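For reference, here is a rough sketch of what transcribing a WAV file with a pre-trained DeepSpeech model might look like. It assumes the `deepspeech` and `numpy` packages are installed and that the recording is 16-bit mono at the model's sample rate. The calls used (`Model`, `enableExternalScorer`, `sampleRate`, `stt`) follow the DeepSpeech 0.9 Python API as I understand it, so check them against whichever version you install. The helper names `read_wav_pcm16` and `transcribe_wav` are my own.

```python
import wave


def read_wav_pcm16(path):
    """Return (sample_rate, raw_bytes) for a 16-bit mono WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono audio")
        return wav.getframerate(), wav.readframes(wav.getnframes())


def transcribe_wav(model_path, wav_path, scorer_path=None):
    """Run a pre-trained DeepSpeech model over one WAV file."""
    # Imported here so the WAV helper above works without these packages.
    import numpy as np
    from deepspeech import Model

    model = Model(model_path)
    if scorer_path:
        # Optional external scorer (language model) shipped with releases.
        model.enableExternalScorer(scorer_path)

    rate, raw = read_wav_pcm16(wav_path)
    if rate != model.sampleRate():
        raise ValueError(f"model expects {model.sampleRate()} Hz, got {rate} Hz")

    # DeepSpeech takes the audio as a 1-D array of 16-bit samples.
    audio = np.frombuffer(raw, dtype=np.int16)
    return model.stt(audio)
```

A call would look something like `transcribe_wav("model.pbmm", "recording.wav", "model.scorer")`, where all three file names are hypothetical stand-ins for the model files and recording you actually have.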
Post Research Discoveries
Finding datasets for speech recognition isn't as hard as I'd expected. The open source tools I found link to plenty, including:
Terms
During research, I found some terms I didn't understand:
- ASR - automatic speech recognition
- hot word - a special word (e.g. "hey", "google", "alexa", "qt") you want an STT app to be sensitive to
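To make the hot word idea concrete, here is a toy sketch that checks a finished transcript against a set of hot words. It only works on plain text, and `find_hot_words` and `HOT_WORDS` are hypothetical names of my own. As far as I can tell, a real STT engine would deal with hot words while decoding the audio rather than afterwards.

```python
import re

# Example hot words taken from the definition above.
HOT_WORDS = {"hey", "google", "alexa", "qt"}


def find_hot_words(transcript, hot_words=HOT_WORDS):
    """Return the hot words found in a transcript, in order of appearance."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return [w for w in words if w in hot_words]


print(find_hot_words("Hey Google, what time is it?"))  # ['hey', 'google']
```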