How does speech recognition technology work on a basic level?
OK, so fundamentally, the way the field has gone over the last couple of decades is more and more toward data-driven, or statistical-modeling, approaches. What I mean by that is, rather than having people go in and try to program all of the rules or descriptions of how language works, we try to build models that we can feed lots and lots of data, and the models learn about the structure of speech from that data. So data-driven approaches are approaches based on building large statistical models of the language by feeding them lots of data.
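To make the data-driven idea concrete, here is a minimal sketch in Python. It is not a speech recognizer and not the method used by any particular system; it only illustrates the principle of learning structure from examples rather than from hand-written rules. The function name `train_bigram_model` and the toy sentences are hypothetical, chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Estimate P(next_word | previous_word) purely from example sentences.

    No grammar rules are written by hand; the only knowledge the model has
    comes from counting what actually occurs in the data.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Count each adjacent word pair (previous word, next word).
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1

    # Turn raw counts into relative frequencies.
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {word: count / total for word, count in followers.items()}
    return model


# Toy data (an assumption for illustration; real systems use vastly more).
toy_corpus = [
    "recognize speech",
    "recognize speech today",
    "recognize speakers",
]
model = train_bigram_model(toy_corpus)
print(model["recognize"])
```

With only three sentences the estimates are crude, but adding more sentences changes the probabilities without anyone editing a rule, which is the point of the data-driven approach described above.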
That's the first principle, and that movement toward machine learning, or data-driven or statistical approaches, was actually one of the most important advances in the history of the speech-recognition field. And so the question becomes: what kind of model should we start with and feed this data to, so that we get good performance out of a speech recognizer? What we do is use a model with three fundamental components, each of which models a different aspect of the speech signal. The first piece is called the acoustic model, and basically what that is, is a model of all of the basic sounds of language.