Expected Surprises | Part IV: Better = Cheaper

Part I: Restoring the Past
Part II: Rebuilding Speech 
Part III: What’s Missing?
Part IV: Better = Cheaper

Part IV: Better = Cheaper

Just in the last week, we realized another important way to leverage the remarkable speech enhancement progress of BabbleLabs. Normally, people think of better speech enhancement as delivering just that — an improved experience. But improvements in quality can sometimes be transmogrified into reductions in cost. This turns out to be true for both communicating and storing speech. Communications systems have employed speech coding for decades to deliver adequate speech quality over narrow bandwidth. However, as the encoding becomes more aggressive — squeezing speech into the fewest possible bits per second — the quality of the speech suffers. Moreover, the most ambitious speech coding methods attempt to model the characteristics of human speech production.

Modern speech codecs assume a "source-filter" model of speech production, typically using two excitation components: white Gaussian noise for unvoiced phonemes and a periodic pulse train for voiced speech sounds. A linear predictive coding (LPC) filter then represents the resonances of the vocal tract. These concise models work well in the absence of noise, but non-speech noise does not fit them, so noisy speech is often distorted by speech coding, especially at lower bit rates.
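The source-filter idea can be sketched in a few lines of Python. This is not code from any actual codec — just an illustrative toy: two excitation signals (a pulse train for voiced speech, white Gaussian noise for unvoiced speech) driving a single second-order all-pole resonator standing in for the LPC vocal-tract filter. The sample rate, pitch, and filter coefficients are all invented for the example.

```python
import numpy as np

def all_pole_filter(coeffs, excitation):
    """Toy all-pole (IIR) synthesis: s[n] = e[n] - sum_k a[k] * s[n-k]."""
    out = np.zeros(len(excitation))
    for n in range(len(excitation)):
        acc = excitation[n]
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out[n] = acc
    return out

fs = 8000                       # narrowband telephone rate (Hz)
n = fs // 10                    # 100 ms of signal

# Voiced excitation: pulse train at a 100 Hz pitch
voiced = np.zeros(n)
voiced[:: fs // 100] = 1.0

# Unvoiced excitation: white Gaussian noise
rng = np.random.default_rng(0)
unvoiced = 0.1 * rng.standard_normal(n)

# A single vocal-tract-like resonance near 500 Hz (pole radius 0.95)
r, f0 = 0.95, 500.0
coeffs = [-2 * r * np.cos(2 * np.pi * f0 / fs), r * r]

voiced_out = all_pole_filter(coeffs, voiced)      # buzzy, pitched sound
unvoiced_out = all_pole_filter(coeffs, unvoiced)  # hissy, noise-like sound
```

A real codec transmits only the filter coefficients plus a compact description of the excitation per frame, which is why the bit rate can be so low — and why broadband street noise, which matches neither excitation model, comes out distorted.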

Combining state-of-the-art speech codecs with state-of-the-art speech enhancement addresses these limitations quite well. We can use BabbleLabs Clear to remove noise, so that speech codecs (e.g., ...


Expected Surprises | Part III: What’s Missing?


Part III: What’s Missing?

I have long wondered why we see a proliferation of video cameras without sound capture. Sound is fundamentally cheaper to capture and there are many potential benefits of capturing the sounds of public spaces, with or without video:

- Traffic monitoring
- Assessment of road conditions — the sound of tires on pavement directly depends on the amount of water or snow on the road surface
- Estimation of weather — wind, rain, and lightning have distinctive sounds
- Localization of loud vehicles, drones, and aircraft
- Detection and localization of explosions, gunshots, and other anomalies

In fact, with the wide adoption of deep learning methods, we have even more powerful tools for extracting this kind of information from the cacophony of sounds found in public. And since microphones are much less expensive and audio recordings are so much lower bandwidth, a mesh of microphones in public spaces could be a cost-effective source of priceless insights to make our environment safer, healthier, and more efficient. And the same potential exists for recording in the home, on the factory floor, in transportation systems, in office settings, hospitals and other complex environments.

So why don’t we do it? Principally because it is illegal!

In most jurisdictions around the world, it is forbidden to record a private conversation — even one in a public space — unless some or all participants in the conversation consent to recording. ...


Expected Surprises | Part II: Rebuilding Speech


Part II: Rebuilding Speech

The BabbleLabs community has been delighted with how the availability of the product eases the task of explaining what BabbleLabs does. In just over a minute, someone can listen to "before" and "after" clips, like our "Raul-in-Montevideo" recording. Please listen to this original, single-microphone recording:

Then compare it to exactly the same single-microphone track after it has passed through our real-time API:


Even the defects in noise suppression help us educate people about the subtle issues in advanced speech enhancement. In the loudest parts, a little of the traffic noise leaks into the final recording. In other parts, the careful listener can detect slight distortion of Raul's voice and minor variations in volume. These reflect both some conscious choices — leaving a small fraction of the original noise in place actually sounds more comfortable to most listeners — and some lingering technical challenges for the BabbleLabs speech science team.

The spectrogram — a plot of a sound's energy at each frequency over time — is an essential visual tool for understanding the impact of speech processing. Consider this comparison of the noisy audio track (above) and the enhanced audio track (below) over the 38 seconds of the recording, for fine-grained frequency bands from 0 Hz to 8 kHz, where time-frequency samples with the highest energy are dark red, intermediate samples ...
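For readers who want to see how such a plot is built, a spectrogram is just a short-time Fourier transform: slice the signal into overlapping windows, apply a taper, and take the magnitude of each window's FFT. This minimal sketch (window length, hop size, and the 1 kHz test tone are all arbitrary choices for illustration, not parameters from our pipeline) produces the frequency-by-time energy matrix that such figures visualize:

```python
import numpy as np

def spectrogram(signal, win_len=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform.

    Returns an array with one row per frequency bin (0 .. fs/2) and
    one column per time frame.
    """
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames).T

fs = 16000                                   # 16 kHz sampling -> 0..8 kHz bins
t = np.arange(fs)                            # 1 second of samples
tone = np.sin(2 * np.pi * 1000 * t / fs)     # pure 1 kHz test tone
spec = spectrogram(tone)
# Each column should show a single bright band at the 1 kHz bin.
```

In a plot like the one described above, each column of this matrix becomes a vertical slice of the image, with energy mapped to color.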
