Google's Manager of Speech Technologies Mike Cohen understands speech on a level most of us don't think about. He understands it on a basic level of sound combinations and contextual clues. He has to -- he's in charge of a department at Google that works on speech-recognition technology.
Teaching a computer to recognize speech is tricky. To understand English, a machine must overcome many hurdles. The English language has a lot of homophones -- words that sound the same but mean different things. Think of "to," "two" and "too." People speaking with an accent or in a regional dialect may pronounce words in a way that's dramatically different from the standard pronunciation. And then there are words like "route" that have alternate pronunciations -- you can say "root" or "rout" and both are correct.
How do you teach a computer to make these distinctions? How can a machine understand what we say and respond appropriately? These are the challenges Cohen and his team face at Google. We spoke with Cohen and asked him to give more detail about his work in speech-recognition research and applications.
How does speech recognition technology work on a basic level?
OK, so fundamentally, the way that the field has gone over the last couple of decades is more and more towards data-driven or statistical-modeling approaches. What I mean by that is rather than having people go in and try to program all these rules or all of these descriptions of how language works, we try to build models that we can feed lots and lots of data, and the models learn about the structure of speech from the data. So data-driven approaches are approaches based on building large statistical models of the language by feeding them lots of data.
That's the first principle, and that movement towards machine learning, or data-driven or statistical approaches was actually one of the most important advances in the history of the speech-recognition field. And so the question becomes what kind of model should we start with that we can then feed this data to so we can get good performance out of a speech recognizer? What we do is we basically have a model that has three fundamental components to it that model different aspects of the speech signal. The first piece is called the acoustic model, and basically what that is, is a model of all of the basic sounds of language.
What exactly is an acoustic model?
So we're building an acoustic model for U.S. English, and we have a model for "ah," and "uh," and "buh," and "tuh," and "mm," and "nn" and so on and so forth for all of the basic sounds of the language. Actually, it's a little bit more complicated than that because it turns out -- take the "aa" sound in English. The "aa" in the word "math," versus the "aa" in the word "tap." They're produced somewhat differently, and they sound a bit different, and so we actually need different models for the "aa" sound, depending on whether it's following an M versus following a T. The production of those fundamental sounds, or phonemes, varies depending on their context.
We have many, many models for the "aa" sound, and it's a different model if the predecessor is "mm" versus "tuh," for example. So that's the first piece of the model, the acoustic model: the model of all of the fundamental sounds, each in its context.
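The context-dependent units Cohen describes are often called triphones. Here is a minimal sketch of the relabeling idea; the phoneme symbols, the `sil` edge marker, and the `left-phone+right` label format are illustrative conventions, not Google's actual inventory or notation.

```python
# Sketch of context-dependent phone units ("triphones"): each phoneme gets a
# separate model per left/right neighbor. Symbols here are illustrative only.

def to_triphones(phones):
    """Expand a phoneme sequence into left-context_phone_right-context labels."""
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"    # "sil" marks an utterance edge
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        units.append(f"{left}-{p}+{right}")
    return units

# The "ae" vowel gets a different model after "m" (as in "math")
# than after "t" (as in "tap"):
print(to_triphones(["m", "ae", "th"]))  # ['sil-m+ae', 'm-ae+th', 'ae-th+sil']
print(to_triphones(["t", "ae", "p"]))   # ['sil-t+ae', 't-ae+p', 'ae-p+sil']
```

With a full phoneme inventory this expansion multiplies the number of acoustic models dramatically, which is why large training sets matter.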
What else do you need besides the acoustic model?
The next part of the model is called the lexicon, the dictionary. And what that is, is a definition for all of the words in the language of how they get pronounced. In other words, which fundamental sounds we string together, or even which of those acoustic models we string together to create the words. So for example, that lexicon would have information like, you know, you could say, "eh-conomics" or "ee-conomics" in English and they're both valid ways -- or typical ways -- of pronouncing the word "economics."
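A lexicon like the one described can be sketched as a mapping from words to one or more phoneme-string pronunciations. The ARPAbet-style phoneme symbols below are purely illustrative, not Google's actual dictionary.

```python
# Toy lexicon: each word maps to a list of valid pronunciations, where a
# pronunciation is a sequence of phoneme symbols (ARPAbet-style, illustrative).
LEXICON = {
    "economics": [
        ["eh", "k", "ah", "n", "aa", "m", "ih", "k", "s"],  # "eh-conomics"
        ["iy", "k", "ah", "n", "aa", "m", "ih", "k", "s"],  # "ee-conomics"
    ],
    "route": [
        ["r", "uw", "t"],  # "root"
        ["r", "aw", "t"],  # "rout"
    ],
}

def pronunciations(word):
    """Return all pronunciation variants for a word (empty list if unknown)."""
    return LEXICON.get(word.lower(), [])

print(len(pronunciations("economics")))  # 2 valid variants
print(pronunciations("route")[0])        # ['r', 'uw', 't']
```

Each phoneme in these sequences then points at the corresponding acoustic model, which is how the lexicon stitches the two layers together.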
The third piece of the model is the model of how we put words together into phrases and sentences in the language. All of these are statistical models, and so for example, this model, although it's capturing, sort of, the grammatical constraints of the language, it's doing it in a statistical way based on feeding it lots of data. So for example, that model might learn that if the recognizer thinks it just recognized "the dog" and now it's trying to figure out what the next word is, it may know that "ran" is more likely than "pan" or "can" as the next word just because of what we know about the usage of language in English.
Dogs run more than they do things with pans, and so by feeding lots of data to this model -- we call it the language model. It's the statistical model of word sequences, how likely the different words are to occur given what the recent words have been. By feeding the model lots of data, it just computes all of those statistics about what's likely to occur next, and that's the language model. So now, these three models, the acoustic model, or the model with all those fundamental sounds, the lexicon, or the model of how all the words get pronounced, and finally the language model, or how all those words get strung together get compiled together.
So the lexical models are built by stringing together acoustic models, the language model is built by stringing together word models, and it all gets compiled into one enormous representation of spoken English, let's say. That becomes the model that gets learned from data, and the model that gets searched when some acoustics come in and the recognizer needs to find its best guess at what just got said.
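The "dog ran" example above is a bigram language model: count word pairs in training text, then estimate the probability of each next word by relative frequency. The toy corpus below is invented; real systems train on hundreds of billions of words and apply smoothing, but the counting idea is the same.

```python
from collections import defaultdict

# Minimal bigram language model: count adjacent word pairs in a toy corpus,
# then estimate P(next | previous) by relative frequency.
corpus = [
    "the dog ran home",
    "the dog ran away",
    "the dog ran fast",
    "the cook used a pan",
]

counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def prob(nxt, prev):
    """Relative-frequency estimate of P(nxt | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

# After "dog", "ran" is far more likely than "pan":
print(prob("ran", "dog"))  # 1.0 in this toy corpus
print(prob("pan", "dog"))  # 0.0
```

During recognition these probabilities are combined with the acoustic and lexical scores, so an acoustically ambiguous "ran"/"pan" gets resolved by what usually follows "dog."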
How do you take into account accents and dialects when designing speech recognizers?
One of the fundamental things, given the kind of data-driven approach that we take, is we try to have very large, broad training sets. We have large amounts of data coming in from all kinds of people with all kinds of accents, saying all kinds of things, and so on and so forth, and the most important thing is to have good coverage in your training set of whatever is coming in. We have enough instances of Brooklyn accents -- and not just thanks to me -- but we have people from Brooklyn that have spoken to our systems such that we do a good job when people with Brooklyn accents talk to our system.
On the other hand, if somebody came along and had very peculiar and unusual ways of pronouncing things that was not well-covered in our data, we'd have more trouble recognizing them.
Sometimes pronunciations are radically different -- let's say in U.K. English versus U.S. English -- and we may build a separate model, or a partially blended model, or whatever. That's sort of an area of research: when should we build separate models versus combine everything into one big model, or any compromise in between? That variation is one of a number of big challenges in the field that make it more difficult. Having training sets with broad coverage of all those things that happen is one of the ways that we deal with it.
What's the difference between a computational linguist and a speech technologist?
Wow. That's a good question, because the boundaries really have blurred. I mean, these days, we all work side by side and do similar things. Twenty or 30 years ago, there were sort of two camps. There were linguists that were trying to build speech recognizers by explicitly programming up knowledge about the structure of language, and then there were engineers who came along and said, "Language is so complex, nobody understands it well enough, and there's just too much there to ever be able to explicitly program it, so instead, we'll build these big statistical models, feed them data, and then let them learn." For a while, the engineers were winning, but nobody was doing a great job.
So more recently, like, in the last 25 years, those communities came together and we learned certain things from the linguists about the structure of speech, like the fact that I mentioned earlier, which is the production of any particular phoneme is very influenced by the phonemes that surround it. Linguists have been publishing on that, calling it co-articulation, for years. Finally, the statisticians or engineers took that to heart and built models that are context dependent so that they can learn and add a separate model for "ah" as it occurs following an "mm" versus a "duh," and so on and so forth.
Those communities really came together, and so -- maybe I've thrown these terms around too loosely referring to speech technologists versus computational linguists. We all work on the boundary of trying to understand language, the structure of language, trying to develop algorithms, machine learning style algorithms where we figure out how do we come up with a better model that can better capture the structure of speech, and then have an algorithm such that we feed that model lots and lots of data, and the model both changes its structure and alters its internal parameters to become a better, richer model of language, given the data that's being fed to it.
What is a hidden Markov model and how does it play into speech recognition?
In a hidden Markov model there are certain assumptions about the data that comes in, some of which are not that accurate. So for example, there's a conditional -- this is going to get too technical -- but yes, there are some challenges in modeling longer-distance constraints. That's an active research area. How do we alter the model so that we can do a better job of capturing those longer-distance constraints that matter? For example, we have something called delta features, so we not only look at what's the acoustics at this moment, but what's the trajectory of those acoustics? Is this part of it rising, falling or whatever?
So that tells us something about what's happening at longer distance, even within these constraints of the assumptions about the statistics of what we're able to model with that kind of a model.
What are grammars?
Yeah, that word has been used loosely, and it has meant a couple different things over time. In the most general sense, you could think of it as a description of what we might expect in terms of what word strings can happen. In some systems, and this was very true for a lot of call-center systems, we would have a reasonably good idea of what people were pretty likely to say, right? You have a system that is a menu, do you want A, B, or C? You might expect most people will say either "A," "B," or "C," or they might say, "I want A" or "B please," or things like that, things that because of the application were fairly predictable.
But there were languages by which people could specify "here are the rules or the set of strings that people might say in this particular context." That would be a case where the recognizer was very limited. It would only recognize a certain number of variations in how you might say things. Let's say, "do you want your account balance or to make a transfer?" It's not like people will mimic exactly those words, but it's reasonably predictable, so somebody with experience, and after listening to some of the data, could have a reasonable chance of writing an explicit grammar that said, "Here are 50 variations in how people might make that two-way choice."
Whereas, as you get to more difficult applications like, for example, voice search, it's way more difficult to predict all of those different strings of words that people might utter. So instead, the grammar becomes what's called a statistical grammar, or what we often call a statistical language model. That would be something more in the form of, given the last two words were A, B, here are the probabilities across all of the words in my language of what might happen next.
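An explicit, hand-written grammar like the call-center one Cohen describes can be sketched as a pattern over the strings a designer expects. The regular expression below is an invented miniature standing in for the "50 variations" a designer might enumerate; real grammar languages for speech systems are richer than a regex, but the idea of enumerating the allowed strings is the same.

```python
import re

# Hand-written grammar for a two-way call-center choice ("account balance or
# transfer?"), written as a regex over a few invented phrase variations.
GRAMMAR = re.compile(
    r"^(please\s+)?"
    r"(i('d| would) like (my|to)\s+|give me (my|a)\s+)?"
    r"(account balance|balance|(make a )?transfer)"
    r"(,? please)?$"
)

for utterance in ["account balance",
                  "i'd like to make a transfer, please",
                  "pizza"]:
    print(utterance, "->", bool(GRAMMAR.match(utterance)))
```

Anything outside the enumerated strings (like "pizza") is simply not recognizable, which is exactly the limitation that pushes open-ended applications like voice search toward statistical language models instead.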
How many words are in the Google voice search database?
So let me put it this way. For English, on the vocabulary side, the number of different words in our vocabulary is roughly a million, and over time that evolves because, obviously, new words enter the language, new names come along, so on and so forth, so the vocabulary gets re-derived from time to time and new words get added. Then, those words can be put together in any imaginable order, and for any length of word string. So you might come up with a 10-word query, picking randomly from those million words, so it turns out to be an astronomically large number. However, by using this kind of statistical language model I just mentioned, and training it on lots and lots of queries, hundreds of billions of queries, we end up with reasonable predictive power about what's likely.
How much computational power does a speech recognition system require?
It depends on which stage you mean. When we're actually doing recognition, at that moment the recognizer's running on a CPU, so on a CPU, we'll, in real time, do recognition. But in order to achieve the performance we get, in order to build these models, we may spend many, many decades of computer time to compute the language model for English as it works right now. It evolves over time because we get more data, CPUs get faster and so on, but just to train one language model for English we might use 230 billion words' worth of data, for example, and that might take multiple decades of time if it were running on one CPU. But obviously, we'll apply thousands of CPUs to it so we can keep training these all the time.
Why is Google interested in speech recognition?
There are two fundamental reasons, and it goes right back to Google's mission. The first part of Google's mission is to organize the world's information. It turns out that a lot of the world's information is spoken, and we need to make that discoverable, searchable, organizable -- even if it's the audio track of a YouTube video, or a voicemail or whatever. The other part of Google's mission is to make all that information universally accessible and useful, and so as an example, one really key part of that is how do you interact with the Internet when you're mobile? When you're mobile, you have small keypads, you may be walking down a street, riding your bike, driving a car, or whatever, and it's just often more convenient to talk than type.
So we want to make speech a ubiquitously available input/output mode, so that whenever the end user feels that's the mode by which they want to interact, we want it to be available, and available with such high performance that when they prefer speech, they just naturally use it.
What Google applications are currently using speech recognition?
A little over two years ago, we released voice search, so basically all Google searches on smartphones can be done by speaking. That's widespread. It gets lots and lots of usage. A little over a year ago, we released something called Voice Input, and what that means is on Android, anytime the keypad pops up, there's also a little microphone button. So whether that keypad pops up in the middle of some application, or when you're surfing the Web and filling out a form, if the keypad pops up you can also hit that microphone button and speak. To me, that was a very important step towards this vision of really ubiquitous speech access. That was released in January, a little over a year ago.
This past August, we released something called Voice Actions, and for eight or nine typical things that people do, like place calls, search for businesses, get navigation, go to Web pages, set an alarm, listen to music, send an SMS message, send an e-mail, things like that, they can now, from that search bar, make that request and have that action happen. So I can say, "Send text message to Steve Smith. Meet me at 7 p.m." and it will send him a text message, things like that. So those are the main things right now for mobile. More generally, what we're moving towards in the future is truly ubiquitous input. Any time you want to be able to speak, we want it to be available.
In the Android developer's kit, Google makes two models available: the Freeform model and the Web search model. What's the difference?
Right, so the Web search language model is specifically trained on lots of queries that have come into Google.com, all those word strings and stuff. So if it's a query-like application, be it voice search or things that are relatively short inputs where you're looking for something, even if it's more specialized -- I'm looking for this book, I'm looking for whatever -- then it may be more sensible to try the Web-search language model, if it's that style of application. If it's broader, people are dictating e-mails, letters and who knows what, then the dictation language model may be better, but people may very well want to experiment with it, try both, and see what works.
Could we see Google develop a universal translator akin to something you'd see on Star Trek?
In Google research, as well as the speech group, there's also a group working on machine translation, and we now collaborate, so you've probably heard a bit about speech-to-speech translation, where you can say something in English, and it gets recognized in English, translated to Spanish, and then synthesized in Spanish so you can carry on a conversation. That's an active research area with some initial deployments getting out there.