What does it sound like to you?

The Internet has been abuzz this past week with discussion of a single utterance. People everywhere are talking about “Laurel versus Yanny”, a recording that sounds more like “Laurel” to some and more like “Yanny” to others. See the NY Times article. In fact, you can get from something very clearly “Laurel” to something clearly “Yanny” by changing the gain on different frequency bands. “Laurel” is emphasized when the high frequencies are attenuated, and “Yanny” stands out when the high frequencies are boosted. Find a fun comparison tool here. Whether you hear “Laurel” or “Yanny” is partially determined by your auditory system’s sensitivity to high frequencies.
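To make that concrete, here is a minimal sketch of such a spectral tilt, assuming a mono 16-bit clip named laurel.wav; the filename, the 1 kHz crossover, and the ±12 dB gains are illustrative choices, not taken from the article or its comparison tool:

```python
# A minimal sketch (not the article's tool): tilt the spectrum of a clip to
# hear the "Laurel"/"Yanny" shift. Filename and crossover are assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

rate, audio = wavfile.read("laurel.wav")  # assumed mono, 16-bit PCM
audio = audio.astype(np.float64)

# Split the signal near 1 kHz with complementary 4th-order Butterworth filters.
sos_lo = butter(4, 1000, btype="lowpass", fs=rate, output="sos")
sos_hi = butter(4, 1000, btype="highpass", fs=rate, output="sos")
low_band, high_band = sosfilt(sos_lo, audio), sosfilt(sos_hi, audio)

def tilt(high_gain_db):
    """Re-mix the two bands, scaling the high band by high_gain_db decibels."""
    mix = low_band + 10 ** (high_gain_db / 20) * high_band
    return np.clip(mix, -32768, 32767).astype(np.int16)

wavfile.write("more_laurel.wav", rate, tilt(-12))  # highs cut: hear "Laurel"
wavfile.write("more_yanny.wav", rate, tilt(+12))   # highs boosted: "Yanny"
```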

This lively but inconsequential debate exposes the tip of an iceberg of more substance to speech experts. People want clearer speech in every situation where intelligibility or comfort matters. Nobody wants to listen to noise so overwhelmingly loud, or reverberation so heavy, that all intelligibility is lost. Alexander Graham Bell famously first transmitted voice on a wired system in 1876, and Reginald Aubrey Fessenden transmitted voice by wireless in 1900. Needless to say, the voice quality was lousy. Since then, speech and communications engineers working on speech-based systems have developed a range of metrics for comparing the noise level and the degree of intelligibility of electronic speech reproduction.
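As a taste of what the simplest of those metrics looks like, here is a hedged sketch of global signal-to-noise ratio (SNR); modern intelligibility measures such as STOI and PESQ are far more elaborate, but this is the starting point:

```python
# Hedged illustration: global signal-to-noise ratio in dB, the most basic
# metric for comparing noise levels in reproduced speech. Assumes a clean
# reference aligned sample-for-sample with the degraded copy.
import numpy as np

def snr_db(clean, degraded):
    """Return 10*log10 of signal power over noise power."""
    noise = degraded - clean
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))
```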

What is “good” sound in speech? Low noise level? High clarity? Good comprehensibility? Speech captured under ideal conditions — good microphones, an anechoic recording environment, and zero additive noise — ...


What’s the big deal with speech enhancement?

Writer and physician Oliver Wendell Holmes once said, “Speak clearly, if you speak at all; carve every word before you let it fall.” He, of course, was campaigning for thoughtfulness in speech, but comprehensibility of speech remains crucially important. We all dearly want to understand and be understood by other people — and increasingly by the devices that connect us to our world.

Speech has always been humans’ preferred interface method — in fact, the emergence of hominid speech a couple of hundred thousand years ago coincided with the origins of the species Homo sapiens. But we now live in sonic chaos — a noisy world that is getting noisier every day. We talk in cars, in restaurants, on the street, in crowds, in echo-y rooms, and in busy kitchens — there are always kids yelling, horns honking, or loud music playing. What can we do to combat the noise, and make speech and understanding easier?

Part of the emerging answer is speech enhancement, the electronic processing of natural human speech to pare back the noise that makes speech hard to comprehend and unpleasant to listen to. Electronic engineers have been concerned with speech quality for almost a century, really since the emergence of radio broadcasting. Most speech-improvement methods have evolved relatively slowly over the last couple of decades, through gradual refinement of DSP algorithms that try to separate speech sounds from background noise. The emergence of deep neural network ...
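As a rough illustration of the classic DSP approach alluded to above, here is a hedged sketch of spectral subtraction, one venerable technique for peeling steady background noise away from speech. The 512-sample frames, the 0.5 s noise-estimation window, and the assumption that the clip opens with speech-free noise are illustrative choices, not the article’s:

```python
# Hedged sketch of spectral subtraction, a classic DSP noise-reduction method.
# Assumes the first `noise_seconds` of the clip contain noise only, no speech.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(noisy, fs, noise_seconds=0.5):
    """Subtract the average noise magnitude spectrum from every STFT frame."""
    f, t, spec = stft(noisy, fs=fs, nperseg=512)   # hop = 256 samples
    mag, phase = np.abs(spec), np.angle(spec)

    # Average the magnitude over the leading, speech-free frames.
    n_noise_frames = max(1, int(noise_seconds * fs / 256))
    noise_mag = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)

    # Subtract, flooring at a small fraction of the noise to limit artifacts.
    clean_mag = np.maximum(mag - noise_mag, 0.05 * noise_mag)
    _, clean = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return clean
```

Real systems refine this idea with adaptive noise tracking and perceptually tuned gain rules; deep networks, as the excerpt goes on to suggest, learn the separation directly from data.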


Living in a speech-triggered world

Like many people in the developed world, I spend every day surrounded by technology — laptops, phones, car touch-screens, networked gadgets, and smart TVs. Of course I want to interact with all these things to get my work done and to enjoy their services and entertainment. And I have learned to use them pretty well — I have learned to type on keyboards, move mice, finger touchpads, pinch, swipe, and peck. The devices themselves have not learned much of anything!

A wave of change is sweeping towards us, with the potential to dramatically shift the basic nature of interactions between people and our electronic environment. Deep learning-based algorithms are transforming many aspects of computation, by allowing good solutions to bigger, more complex, more ambiguous problems than before. The transformations are particularly striking across vision, language analysis, pattern finding in large data sets, and speech. I have written a lot about vision and the potential for neural network vision algorithms to meet or exceed human capabilities in recognizing and tracking objects, summarizing scenes and sequences, and finding safe navigational channels in two and three dimensions. These vision systems are becoming veritable substitutes for human vision in tasks like driving, surveillance, and inspection.

Deep learning advances in speech have a completely different character from those in vision — these advances are rarely about substituting for humans, but center instead on paying more attention to humans. Speech recognition, speaker identification, and speech enhancement are all functions that deepen human-machine interaction. The ...
