One by one, the skills that separate us from machines are falling into the machines’ column. First there was chess, then Jeopardy!, then Go, then object recognition, face recognition, and video gaming in general. You could be forgiven for thinking that humans are becoming obsolete.
But try any voice recognition software and your faith in humanity will be quickly restored. Though good and getting better, these systems are by no means perfect. Are you ordering “ice cream” or saying “I scream”? Probably both, if it’s a machine you are talking to.
So it ought to be reassuring to know that ordinary conversational speech recognition is something machines still struggle at—that humans are still masters of their own language.
That view may have to change. Quickly. Today, Geoff Zweig and buddies at Microsoft Research in Redmond, Washington, say they’ve cracked this kind of speech recognition and that their machine-learning algorithms now outperform humans for the first time in recognizing ordinary conversational speech.
Speech recognition research has a long history. In the 1950s, early computers could recognize up to 10 words spoken clearly by a single speaker. In the 1980s, researchers built machines that could transcribe simple speech with a vocabulary of 1,000 words. In the 1990s they progressed to recordings of a person reading the Wall Street Journal, and then onto broadcast news speech.
These scenarios are all increasingly ambitious. But they are also simpler than ordinary speech because of various constraints. The vocabulary in the Wall Street Journal is limited to business and finance, and the sentences are well structured and grammatically correct, which is not necessarily true of ordinary speech. Broadcast news speech is less formal but still high structured and clearly pronounced. All of these examples have eventually been conquered by machines.
But the most difficult task—transcribing ordinary conversational speech—has steadfastly resisted the onslaught.
Ordinary speech is significantly more difficult because of the vocabulary size and also because of the noises other than words that people make when they speak. Humans use a range of noises to manage turn-taking in conversation, a type of communication that linguists call a backchannel.
For example, uh-huh is used to acknowledge the speaker and signal that he or she should keep talking. But uh is a hesitation indicating that the speaker has more to say, a warning that there is more to come. In turn management, uh plays the opposite role to uh-huh.
Humans have little difficulty parsing these sounds and understanding their role in a conversation. But machines have always struggled with them.
In 2000, the National Institute of Standards and Technology released a data set to help researchers tackle this problem. The data consisted of recordings of ordinary conversations on the telephone. Some of these were conversations between individuals on an assigned topic. The rest were conversations between friends and relatives on any topic.
Most of the data was to help train a machine-learning algorithm to recognize speech. The rest was a test that the machines had to transcribe.
The measure of performance was the number of words that the machine got wrong, and the ultimate goal was to do the task better than humans.
So how good are humans? The general consensus is that when it comes to transcription, humans have an error rate of about 4 percent. In other words, they incorrectly transcribe four words in every hundred. In the past, machines have got nowhere near this benchmark.
Now Microsoft says it has finally matched human performance, albeit with an important caveat. The Microsoft researchers began by reassessing human performance in transcription tasks. They did this by sending the telephone recordings in the NIST data set to a professional transcription service and measuring the error rate.
Read more here: www.technologyreview.com/s/602714/first-computer-to-match-humans-in-conversational-speech-recognition/