Expert Stuff: Google's Mike Cohen

What else do you need besides the acoustic model?

Longer strings of words and sentences present challenges to voice recognition software.

The next part of the model is called the lexicon, the dictionary. What that is, is a definition, for all of the words in the language, of how they get pronounced. In other words, which fundamental sounds we string together, or even which of those acoustic models we string together, to create the words. So for example, that lexicon would have information like, you know, you could say "eh-conomics" or "ee-conomics" in English, and they're both valid ways -- or typical ways -- of pronouncing the word "economics."
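A lexicon of this kind can be pictured as a lookup table from words to one or more sound sequences. The sketch below is only an illustration: the ARPAbet-style phoneme labels, the `LEXICON` table, and the helper names are assumptions made for the example, not details from the interview.

```python
# Toy pronunciation lexicon: each word maps to one or more phoneme sequences.
# The phoneme labels are illustrative ARPAbet-style symbols (an assumption).
LEXICON = {
    "economics": [
        ("EH", "K", "AH", "N", "AA", "M", "IH", "K", "S"),  # "eh-conomics"
        ("IY", "K", "AH", "N", "AA", "M", "IH", "K", "S"),  # "ee-conomics"
    ],
    "dog": [("D", "AO", "G")],
}


def pronunciations(word):
    """Return every listed pronunciation of a word (empty list if unknown)."""
    return LEXICON.get(word.lower(), [])


def words_for(phonemes):
    """Return the words that have this exact phoneme sequence as a pronunciation."""
    target = tuple(phonemes)
    return sorted(w for w, prons in LEXICON.items() if target in prons)
```

Both pronunciations of "economics" lead back to the same word, which is the point of listing the variants side by side.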

The third piece of the model is the model of how we put words together into phrases and sentences in the language. All of these are statistical models, and so for example, this model, although it's capturing, sort of, the grammatical constraints of the language, it's doing it in a statistical way based on feeding it lots of data. So for example, that model might learn that if the recognizer thinks it just recognized "the dog" and now it's trying to figure out what the next word is, it may know that "ran" is more likely than "pan" or "can" as the next word just because of what we know about the usage of language in English.

Dogs run more than they do things with pans. We call this the language model: it's the statistical model of word sequences, of how likely the different words are to occur given what the recent words have been. By feeding the model lots of data, it just computes all of those statistics about what's likely to occur next, and that's the language model. So now these three models get compiled together: the acoustic model, or the model with all those fundamental sounds; the lexicon, or the model of how all the words get pronounced; and finally the language model, or how all those words get strung together.
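The idea of computing next-word statistics from data can be sketched with simple word-pair counts. This is a minimal illustration that looks only at the single previous word; the tiny `CORPUS` and the function names are made up for the example, and a production language model would be trained on far more data with more context.

```python
from collections import Counter, defaultdict

# Tiny invented corpus, standing in for "lots of data" (an assumption).
CORPUS = [
    "the dog ran home",
    "the dog ran away",
    "the dog ran fast",
    "the dog can bark",
    "the pan is hot",
]


def train_bigrams(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_prob(counts, prev, word):
    """Estimate P(word | prev) as a relative frequency (0.0 if prev is unseen)."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[word] / total if total else 0.0


bigrams = train_bigrams(CORPUS)
for candidate in ("ran", "can", "pan"):
    print(candidate, next_word_prob(bigrams, "dog", candidate))
```

With this corpus, "ran" comes out as the most likely word after "dog", ahead of "can", while "pan" never follows "dog" at all.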

So the lexical models are built by stringing together acoustic models, the language model is built by stringing together word models, and it all gets compiled into one enormous representation of spoken English, let's say. That becomes the model that gets learned from data, and that the recognizer searches when some acoustics come in and it needs to find out, "What's my best guess at what just got said?"
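The "best guess" step can be sketched as combining how well each candidate word matches the sound with how likely it is given the preceding words. This is a heavily simplified illustration, not the compiled representation described above: it scores just three candidate words, and every number in `ACOUSTIC` and `LANGUAGE` is hypothetical.

```python
import math

# Hypothetical acoustic scores: how well each word matches the incoming sound.
# The three words sound alike, so the acoustics alone barely separate them.
ACOUSTIC = {"ran": 0.30, "pan": 0.35, "can": 0.35}

# Hypothetical language-model probabilities for the word following "the dog".
LANGUAGE = {"ran": 0.75, "can": 0.24, "pan": 0.01}


def best_guess(acoustic, language):
    """Pick the candidate with the highest combined log score."""
    def score(word):
        return math.log(acoustic[word]) + math.log(language[word])
    return max(acoustic, key=score)


print(best_guess(ACOUSTIC, LANGUAGE))
```

Here the acoustic scores slightly favor "pan" and "can", but the language model tips the combined score toward "ran".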