We Love Noise

During my training as an engineer, one of the most basic lessons I had to learn was “always check your assumptions”. The maxim applies in many circumstances: all too often, you find a mismatch between how things should behave and how they actually function. That brings us to our main subject: why those of us building BabbleLabs, a speech company, care so deeply about noise in speech. We care so much that we derived our company's name from noise; we are BabbleLabs, not SpeechLabs, after all.

When I hear stories about how computers have achieved better-than-human speech recognition results, I wonder how that can be true when I still cannot successfully dictate a number to my phone. Even if short-message transcription works in the comfort of your private office, it completely falls apart in your car.

These observations are screaming, “Check your assumptions!” It turns out that most automatic speech recognition (ASR) work has historically focused on anechoic conditions (i.e., without echoes or reverberation). Making ASR robust in real-world environments has been a second step, built on top of the anechoic work. To be clear, that approach is not wrong; you need to crawl before you can walk. It does, however, have inherent limitations. At BabbleLabs, we have started from the other end of the problem, by addressing the noise.

Noise, of course, means a lot of things: additive noise, modulated ...


Clear Cloud: Behind the Curtain

The conventional wisdom in deep learning is that GPUs are essential compute tools for advanced neural networks. I’m always skeptical of conventional wisdom, but for BabbleLabs, it seems to hold true. The sheer performance of GPUs, combined with their robust support in deep learning programming environments, allows us to train bigger, more complex networks with vastly more data and deploy them commercially at low cost. GPUs are a key element in BabbleLabs’ delivery of the world’s best speech enhancement technology.

The deep learning computing model gives us powerful new tools for extracting fresh insights from masses of complex data, data that has long defied good systematic analysis with explicit algorithms. The model has already transformed vision and speech processing, making most conventional classification and generation methods obsolete in just the last five years. Deep learning is now being applied, often with spectacular results, across transportation, public safety, medicine, finance, marketing, manufacturing, and social media. This already makes it one of the most significant developments in computing in the past two decades. In time, we may rate its impact in the same category as the “superstars” of tech transformation: the emergence of high-speed Internet and smart mobile devices.

The promise of deep learning is matched with a curse: it demands huge data sets and correspondingly huge computing resources for successful training and use. For example, a single full training of BabbleLabs’ most advanced speech enhancement network requires between 10^19 and 10^20 floating point operations, using ...
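To get a feel for that scale, here is a rough back-of-the-envelope sketch that reads the operation counts above as 1e19 to 1e20. The sustained throughput of 10 TFLOP/s is purely an illustrative assumption, not a figure from BabbleLabs, and real training rarely sustains a device's peak rate.

```python
SECONDS_PER_DAY = 86_400


def training_days(total_flops, sustained_flops_per_second):
    """Convert a total operation count into wall-clock days at a given rate."""
    return total_flops / sustained_flops_per_second / SECONDS_PER_DAY


# Hypothetical sustained rate for one accelerator (assumption for illustration).
ASSUMED_THROUGHPUT = 1e13  # 10 TFLOP/s

low_days = training_days(1e19, ASSUMED_THROUGHPUT)
high_days = training_days(1e20, ASSUMED_THROUGHPUT)
print(f"{low_days:.1f} to {high_days:.1f} days on a single device")
```

Under that assumption, one full training run would occupy a single device for somewhere between a couple of weeks and a few months, which is why this kind of workload depends on high-throughput hardware.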


What does it sound like to you?

The Internet has been abuzz this past week with discussion of a single utterance. People everywhere are talking about “Laurel versus Yanny”, an utterance that sounds more like “Laurel” to some and more like “Yanny” to others. See the NY Times article. In fact, you can get from something very clearly “Laurel” to something clearly “Yanny” by changing the gain on different frequency bands. “Laurel” is emphasized when the high frequencies are attenuated, and “Yanny” stands out when the high frequencies are boosted. Find a fun comparison tool here. Whether you hear “Laurel” or “Yanny” is partially determined by your auditory system's sensitivity to high frequencies.
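As a minimal sketch of what "changing the gain on different frequency bands" can mean in practice, the snippet below scales everything above a split frequency by a chosen number of decibels. The function name, the 1 kHz split, the ±20 dB gains, and the two-tone test signal are all assumptions made for illustration; the comparison tools for this clip are not described here and likely use smoother filters than this hard split.

```python
import numpy as np


def apply_band_gain(signal, sample_rate, split_hz, high_gain_db):
    """Scale all frequency content at or above split_hz by high_gain_db decibels."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    # Convert decibels to a linear amplitude factor; leave the low band untouched.
    gain = np.where(freqs >= split_hz, 10.0 ** (high_gain_db / 20.0), 1.0)
    return np.fft.irfft(spectrum * gain, n=len(signal))


# Synthetic stand-in for a recording: one low tone plus one high tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
low = np.sin(2 * np.pi * 300 * t)
high = np.sin(2 * np.pi * 3000 * t)
mixture = low + high

# Attenuate the highs (the "Laurel" direction) or boost them (the "Yanny" direction).
darker = apply_band_gain(mixture, sample_rate, split_hz=1000, high_gain_db=-20.0)
brighter = apply_band_gain(mixture, sample_rate, split_hz=1000, high_gain_db=20.0)
```

With a real recording loaded into `mixture`, sweeping `high_gain_db` from negative to positive would move the balance from the low band toward the high band, which is the effect described above.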

This lively but inconsequential debate exposes the tip of an iceberg that is of real substance to speech experts. People want clearer speech in every situation where intelligibility or comfort is important. Nobody wants to listen to noise so overwhelmingly loud, or reverberation so heavy, that all intelligibility is lost. Alexander Graham Bell famously first transmitted voice on a wired system in 1875, and Reginald Aubrey Fessenden transmitted voice by wireless in 1900. Needless to say, the voice quality was lousy. Since then, speech and communications engineers have developed a range of metrics to allow better comparison of the noise level and intelligibility of electronic speech reproduction.

What is “good” sound in speech? Low noise level? High clarity? Good comprehensibility? Speech captured under ideal conditions — good microphones, anechoic recording environment, and zero additive noise — certainly ...
