Now machines are going on internet-watching sprees too – but with something to show for it. After viewing a year’s worth of online videos, a computer model has learned to distinguish between sounds such as bird chirps, door knocks, snoring and fireworks.
Such technology could transform how we interact with machines and make it easier for our cellphones, smart homes and robot assistants to understand the world around them.
Computer vision has dramatically improved over the past few years thanks to the wealth of labelled data machines can tap into online. They can now recognise faces or cats as accurately as a human can.
But their listening abilities still lag behind because there is not nearly as much useful sound data available.
One group of computer scientists wondered if they could piggyback on the advances made in computer vision to improve machine listening.
Sound and vision

“We thought: ‘We can actually transfer this visual knowledge that’s been learned by machines to another domain where we don’t have any data, but we do have this natural synchronisation between images and sounds,’” says Yusuf Aytar at the Massachusetts Institute of Technology.
Aytar and his colleagues Carl Vondrick and Antonio Torralba downloaded over two million videos from Flickr, representing a total running time of more than a year. The computer effectively marathoned through the videos, first picking out the objects in the shot, then comparing what it saw to the raw sound.
If it picked up on the visual features of babies in different videos, for example, and found they often appeared alongside babbling noises, it learned to identify that sound as a baby’s babble even without the visual clue.
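The idea described above is a form of cross-modal teacher-student learning: a pretrained vision model produces soft labels for each video frame, and a sound model is trained so that its predictions from the audio track alone match those labels. Here is a minimal toy sketch of that idea using NumPy; the linear "teacher" and "student", the synthetic features, and all variable names are illustrative stand-ins, not the actual SoundNet architecture or data.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy setup: 3 hypothetical classes (say baby, dog, fireworks).
n_classes, feat_dim, n_clips = 3, 8, 200

# Frozen "vision teacher": a fixed linear map from frame features to
# class probabilities. In the real system this role is played by a
# pretrained image-recognition network; here it is a placeholder.
W_teacher = rng.normal(size=(feat_dim, n_classes))

# Paired data: each clip has a visual feature and a correlated audio
# feature, mimicking the natural synchronisation of image and sound.
visual = rng.normal(size=(n_clips, feat_dim))
audio = visual + 0.1 * rng.normal(size=(n_clips, feat_dim))

# Soft labels come from the teacher watching the video: no human labels.
teacher_probs = softmax(visual @ W_teacher)

# "Sound student": trained by gradient descent to match the teacher's
# distribution, minimising cross-entropy against the soft labels.
W_student = np.zeros((feat_dim, n_classes))
lr = 0.5
for _ in range(300):
    student_probs = softmax(audio @ W_student)
    grad = audio.T @ (student_probs - teacher_probs) / n_clips
    W_student -= lr * grad

# After training, predictions from audio alone should largely agree
# with what the teacher inferred from the video frames.
agreement = np.mean(
    student_probs.argmax(axis=1) == teacher_probs.argmax(axis=1)
)
print(float(agreement))
```

Because the audio features track the visual ones, the student ends up recognising from sound alone the categories the vision model identified on screen, which is the essence of the transfer Aytar describes.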
“It’s learning from these videos without any human in the loop,” says Vondrick. “It’s learning in some sense on its own to recognise sound from just a year of video.”
The researchers tested several versions of their SoundNet model on three data sets, asking it to distinguish between sounds such as rain, sneezes, ticking clocks and roosters. At its best, the computer was 92.2 per cent accurate. Humans scored 95.7 per cent on the same challenge.
Laughing hens?

A few sounds still give SoundNet trouble, however. It might mistake footsteps for door knocks, for instance, or insects for washing machines. It sometimes also confuses laughter with the sound of hens. But more training could help it sort out those fine details.
The study will be presented next month at the Neural Information Processing Systems conference in Barcelona, Spain.
“This is like nothing we’ve seen before,” says Ian McLoughlin at the University of Kent in the UK.
Read more here: www.newscientist.com/article/2111363-binge-watching-videos-teaches-computers-to-recognise-sounds/