What’s the big deal with speech enhancement?

Writer and physician Oliver Wendell Holmes once said, “Speak clearly, if you speak at all; carve every word before you let it fall.” He, of course, was campaigning for thoughtfulness in speech, but comprehensibility of speech remains crucially important. We all want dearly to understand and be understood by other people — and increasingly by our devices that connect us to our world.

Speech has always been humans' preferred interface method — in fact, the emergence of hominid speech a couple of hundred thousand years ago coincided with the origins of the species Homo sapiens. But we now live in sonic chaos — a noisy world that is getting noisier every day. We talk in cars, in restaurants, on the street, in crowds, in echoey rooms and in busy kitchens — there are always kids yelling, horns honking or loud music playing. What can we do to combat the noise, and make speech and understanding easier?

Part of the emerging answer is speech enhancement, the electronic processing of natural human speech to pare back the noise that makes speech hard to comprehend and unpleasant to listen to. Electronic engineers have been concerned about speech quality for almost a century, really since the emergence of radio broadcasts. Most speech improvement methods have evolved relatively slowly over the last couple of decades, through gradual refinement of DSP algorithms that try to separate speech sounds from background noise sounds. The emergence of deep neural network ...

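To make that classical DSP lineage concrete, here is a minimal sketch of spectral subtraction, one of the oldest and simplest noise-reduction techniques of the kind described above. It is illustrative only: the function name, the 512-sample frame size, and the assumption that the first 0.3 seconds of the recording are speech-free are my simplifications, not details of any particular product.

```python
# A minimal sketch of spectral subtraction, assuming the opening noise_seconds
# of the recording contain no speech (a simplification real systems avoid).
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, fs, noise_seconds=0.3, floor=0.05):
    # Work frame-by-frame in the frequency domain (512-sample frames, 50% overlap).
    f, t, Z = stft(noisy, fs=fs, nperseg=512)
    mag, phase = np.abs(Z), np.angle(Z)

    # Estimate the noise spectrum from the assumed speech-free leading frames.
    hop = 256  # nperseg // 2, scipy's default hop for 50% overlap
    noise_frames = max(1, int(noise_seconds * fs / hop))
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

    # Subtract the noise estimate, keeping a small spectral floor to limit
    # the "musical noise" artifacts this method is known for.
    clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)

    # Reuse the noisy phase and convert back to a time-domain waveform.
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return enhanced
```

Deep-network approaches typically replace that hand-tuned subtraction rule with a learned mapping from noisy to clean spectra, which is the shift the excerpt above points toward.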

Living in a speech-triggered world

Like many people in the developed world, I spend every day surrounded by technology — laptops, phones, car touchscreens, networked gadgets, and smart TVs. Of course I want to interact with all these things to get my work done and to enjoy their services and entertainment. And I have learned to use them pretty well — I have learned to type on keyboards, move mice, finger touchpads, pinch, swipe, and peck. The devices themselves have not learned much of anything!

A wave of change is sweeping towards us, with the potential to dramatically shift the basic nature of interactions between people and our electronic environment. Deep learning-based algorithms are transforming many aspects of computation, by allowing good solutions to bigger, more complex, more ambiguous problems than before. The transformations are particularly striking across vision, language analysis, pattern finding in large data sets, and speech. I have written a lot about vision and the potential for neural network vision algorithms to meet or exceed human capabilities in recognizing and tracking objects, summarizing scenes and sequences, and finding safe navigational channels in two and three dimensions. These vision systems are becoming veritable substitutes for human vision in tasks like driving, surveillance, and inspection.

Deep learning advances in speech have a completely different character from those in vision — these advances are rarely about substituting for humans, but center instead on paying more attention to humans. Speech recognition, speaker identification, and speech enhancement are all functions that enhance human-machine interaction. The ...


How well does speech recognition cope with noise?

Speech is rapidly becoming a primary interface for the connected world. Apple’s Siri, Amazon’s Alexa services running on Echo devices, and a wealth of cloud-based speech recognition APIs are all bringing speech-based user interfaces into the mainstream. Moreover, the spectrum of speech-driven information services is likely to expand rapidly over the next couple of years, bringing speech up to the level of mobile touchscreen web browsing and phone apps.

The speech interface revolution is starting for good reason — typing on keyboards, clicking mice, and touching glass are neither natural nor efficient for humans. Most importantly, they usually need our eyes engaged too, not just to get the responses, but also to make accurate inputs. Speaking is convenient and quite high bandwidth. Humans can routinely talk at 140-160 words per minute, but few of us can type that fast, especially while driving (though we seem to keep trying). With the advent of good automatic speech recognition (ASR) based on deep neural networks, we can start to exploit speech interfaces in all the situations where eyes and hands aren’t available, and where natural language can plausibly express the questions, commands, responses and discussion at the heart of key interactions.

Many of the worst impediments to using speech widely come from noise. Noise undermines usability across all speech interfaces and all likely speech application types. Siri and Alexa-based services work remarkably well in a quiet room, but performance degrades significantly as the noise level rises.
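As a rough illustration of how one might measure that degradation, the sketch below mixes recorded background noise into clean speech at a chosen signal-to-noise ratio and leaves the recognition step abstract. The helper names mix_at_snr, recognize, and word_error_rate are hypothetical placeholders for whatever ASR engine and metric you happen to use; only the mixing math is standard.

```python
# Mix background noise into clean speech at a controlled SNR so that ASR
# accuracy can be compared across increasingly noisy conditions.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, speech.shape)      # crude length matching for the sketch
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical evaluation loop: sweep the SNR and watch the word error rate climb.
# `recognize` and `word_error_rate` stand in for your ASR API and scoring metric.
# for snr_db in [20, 10, 5, 0]:
#     noisy = mix_at_snr(clean_speech, cafe_noise, snr_db)
#     print(snr_db, word_error_rate(reference_text, recognize(noisy)))
```

Sweeping the SNR from a quiet 20 dB down to 0 dB and plotting the resulting error rates gives a quick picture of how gracefully a given recognizer handles the noisy settings described above.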

My team at BabbleLabs has ...
