Clear Cloud: Behind the Curtain

The conventional wisdom in deep learning is that GPUs are essential compute tools for advanced neural networks. I’m always skeptical of conventional wisdom, but for BabbleLabs, it seems to hold true. The sheer performance of GPUs, combined with their robust support in deep learning programming environments, allows us to train bigger, more complex networks with vastly more data and deploy them commercially at low cost. GPUs are a key element in BabbleLabs’ delivery of the world’s best speech enhancement technology.

The deep learning computing model gives us powerful new tools for extracting fresh insights from masses of complex data — data that has long defied systematic analysis with explicit algorithms. The model has already transformed vision and speech processing, rendering most conventional classification and generation methods obsolete in just the last five years. Deep learning is now being applied — often with spectacular results — across transportation, public safety, medicine, finance, marketing, manufacturing, and social media. This already makes it one of the most significant developments in computing in the past two decades. In time, we may rate its impact in the same category as the “superstars” of tech transformation — the emergence of high-speed Internet and smart mobile devices.

The promise of deep learning is matched with a curse: it demands huge data sets and correspondingly huge computing resources for successful training and use. For example, a single full training of BabbleLabs’ most advanced speech enhancement network requires between 10¹⁹ and 10²⁰ floating point operations, using ...
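To put those exponents in perspective, here is a rough back-of-envelope sketch of how long such a run would take. Only the 10¹⁹ FLOP total comes from the figure above; the sustained throughput and GPU counts are illustrative assumptions, not BabbleLabs benchmarks:

```python
# Back-of-envelope training time. Only the 1e19 FLOP total comes from
# the post; the sustained throughput and GPU counts are assumptions.
TOTAL_FLOPS = 1e19          # low end of the quoted training cost
SUSTAINED_FLOPS = 10e12     # assumed: ~10 TFLOP/s sustained per GPU

for num_gpus in (1, 8, 64):
    seconds = TOTAL_FLOPS / (SUSTAINED_FLOPS * num_gpus)
    print(f"{num_gpus:3d} GPU(s): {seconds / 86400:6.2f} days")
```

At the assumed 10 TFLOP/s, a single GPU would need about 11.6 days for the low-end 10¹⁹ FLOP figure — which is exactly why training is spread across many GPUs.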

What does it sound like to you?

The Internet has been abuzz this past week with discussion of a single utterance. People everywhere are talking about “Laurel versus Yanny”, an utterance that sounds more like “Laurel” to some, and more like “Yanny” to others. See the NY Times article. In fact, you can get from something very clearly “Laurel” to something clearly “Yanny” by changing the gain on different frequency bands. “Laurel” is emphasized when the high frequencies are attenuated, and “Yanny” stands out when the high frequencies are boosted. Find a fun comparison tool here. Whether you hear “Laurel” or “Yanny” is partially determined by your auditory system’s sensitivity to high frequencies.
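For the curious, here is a minimal sketch of that band-gain trick using SciPy. The file name laurel_yanny.wav, the 1 kHz crossover, and the gain values are illustrative assumptions, not the exact processing behind the viral clip:

```python
# Split a recording into low and high bands, then remix with different
# high-band gains. Attenuated highs lean "Laurel"; boosted highs, "Yanny".
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

rate, audio = wavfile.read("laurel_yanny.wav")   # assumed: 16-bit mono
audio = audio.astype(np.float32) / 32768.0

# Split the signal at an assumed 1 kHz crossover.
sos_lo = butter(4, 1000, btype="lowpass",  fs=rate, output="sos")
sos_hi = butter(4, 1000, btype="highpass", fs=rate, output="sos")
low, high = sosfilt(sos_lo, audio), sosfilt(sos_hi, audio)

def remix(high_gain):
    """Recombine the bands with the high band scaled by high_gain."""
    out = low + high_gain * high
    return (np.clip(out, -1, 1) * 32767).astype(np.int16)

wavfile.write("laurel.wav", rate, remix(0.1))   # highs cut    -> "Laurel"
wavfile.write("yanny.wav",  rate, remix(4.0))   # highs boosted -> "Yanny"
```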

This lively but inconsequential debate exposes the tip of an iceberg of more substance to speech experts. People want clearer speech in every situation where intelligibility or comfort matters. Nobody wants to listen through noise so loud, or reverberation so heavy, that all intelligibility is lost. Alexander Graham Bell famously first transmitted voice on a wired system in 1876, and Reginald Aubrey Fessenden transmitted voice by wireless in 1900. Needless to say, the voice quality was lousy. Since then, engineers working on speech-based systems have developed a range of metrics for comparing the noise level and intelligibility of electronic speech reproduction.
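The simplest of those metrics is the signal-to-noise ratio (SNR). Modern intelligibility measures such as PESQ and STOI are far more involved, but as a hedged illustration of the idea, here is SNR in dB computed from a clean reference and its noisy counterpart (the tone-plus-noise example is purely synthetic):

```python
# SNR in dB: ratio of speech power to noise power, assuming the
# degradation is purely additive noise.
import numpy as np

def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB, given a clean reference."""
    noise = noisy - clean                  # additive-noise assumption
    return 10 * np.log10(np.sum(clean**2) / np.sum(noise**2))

# Toy usage: a sine "voice" plus white noise at roughly 10 dB SNR.
t = np.linspace(0, 1, 16000)
clean = np.sin(2 * np.pi * 220 * t)
noisy = clean + 0.22 * np.random.randn(t.size)
print(f"{snr_db(clean, noisy):.1f} dB")
```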

What is “good” sound in speech? Low noise level? High clarity? Good comprehensibility? Speech captured under ideal conditions — good microphones, anechoic recording environment, and zero additive noise — ...

What’s the big deal with speech enhancement?

Writer and physician Oliver Wendell Holmes once said, “Speak clearly, if you speak at all; carve every word before you let it fall.” He, of course, was campaigning for thoughtfulness in speech, but comprehensibility of speech remains crucially important. We all dearly want to understand and be understood by other people — and increasingly by the devices that connect us to our world.

Speech has always been humans’ preferred interface — in fact, the emergence of hominid speech a couple of hundred thousand years ago coincided with the origins of the species Homo sapiens. But we now live in sonic chaos — a noisy world that is getting noisier every day. We talk in cars, in restaurants, on the street, in crowds, in echo-y rooms, and in busy kitchens — there are always kids yelling, horns honking, or loud music playing. What can we do to combat the noise and make speaking and understanding easier?

Part of the emerging answer is speech enhancement: the electronic processing of natural human speech to pare back the noise that makes it hard to comprehend and unpleasant to listen to. Electronic engineers have been concerned with speech quality for almost a century, really since the emergence of radio broadcasting. Over the last couple of decades, most speech improvement methods have evolved relatively slowly, through gradual refinement of DSP algorithms that try to separate speech sounds from background noise. The emergence of deep neural network ...
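For a concrete sense of what those conventional DSP methods look like, here is a minimal sketch of spectral subtraction, one classic technique of the kind alluded to above (and emphatically not BabbleLabs’ network). The file names, frame size, and the assumption that the first half second contains only noise are all illustrative:

```python
# Classic spectral subtraction: estimate the noise spectrum from a
# speech-free stretch, subtract it frame by frame, and resynthesize.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft, istft

rate, x = wavfile.read("noisy.wav")                # assumed: 16-bit mono
x = x.astype(np.float32) / 32768.0

f, t, X = stft(x, fs=rate, nperseg=512)
mag, phase = np.abs(X), np.angle(X)

# Estimate noise magnitude from the assumed speech-free leading frames.
noise_frames = int(0.5 * rate / 256)               # hop = nperseg // 2
noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

# Subtract the noise estimate, flooring at 2% to limit "musical noise".
clean_mag = np.maximum(mag - noise_mag, 0.02 * mag)
_, y = istft(clean_mag * np.exp(1j * phase), fs=rate, nperseg=512)

wavfile.write("enhanced.wav", rate,
              (np.clip(y, -1, 1) * 32767).astype(np.int16))
```

The floor constant and the noise-only assumption are exactly where such methods break down in real, non-stationary noise — which is the opening deep learning exploits.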
