Expected Surprises | Part II: Rebuilding Speech

Part I: Restoring the Past
Part II: Rebuilding Speech 
Part III: What’s Missing?
Part IV: Better = Cheaper

Part II: Rebuilding Speech

The BabbleLabs community has been delighted by how much the product's availability eases the task of explaining what BabbleLabs does. In just over a minute, someone can listen to the “before” and “after” clips like our “Raul-in-Montevideo” recording. Please listen to this original, single-microphone recording:

Then compare it to exactly the same single-microphone track after it has passed through our real-time API:

Even the defects in noise suppression help us educate people about the subtle issues in advanced speech enhancement. In the loudest parts, a little of the traffic noise leaks into the final recording. In other parts, the careful listener can detect slight distortion of Raul’s voice and minor variations in volume. These reflect both some conscious choices — letting a small fraction of the original noise through actually sounds more comfortable to most listeners — and some lingering technical challenges for the BabbleLabs speech science team.

The spectrogram — the plot of sound’s energy in each frequency over time — is an essential visual tool for understanding the impact of speech processing. Consider this comparison of the noisy audio track (above) and the enhanced audio track (below) over the 38 seconds of the track, for fine-grained frequency categories from 0 Hz to 8 kHz, where time-frequency samples with the highest energy are dark red, intermediate samples are yellow, and quiet time-frequency samples are blue:

You can easily see the noise reduction in the increase of blue in the higher frequencies and between the words — this is particularly striking in the last 10 seconds of the clip, when the raw traffic noise increases dramatically.
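For readers who want to produce this kind of plot themselves, here is a minimal sketch using `scipy.signal.spectrogram` over the same 0 Hz–8 kHz range. The signal below is synthetic (a tone plus noise standing in for the actual recording, which is not included here); the log-power values correspond to the red/yellow/blue coloring described above.

```python
import numpy as np
from scipy.signal import spectrogram

# Synthetic stand-in for the 38-second recording: a 300 Hz tone plus noise,
# sampled at 16 kHz so the Nyquist limit matches the 0-8 kHz plot range.
fs = 16000
t = np.arange(0, 38.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 300 * t) + 0.3 * np.random.randn(t.size)

# Energy in each time-frequency bin; a log (dB) scale mirrors the plot's
# dark-red (high energy) through blue (quiet) color mapping.
freqs, times, Sxx = spectrogram(signal, fs=fs, nperseg=512, noverlap=384)
Sxx_db = 10 * np.log10(Sxx + 1e-12)

print(freqs[-1])   # top frequency bin: 8000.0 Hz
```

Plotting `Sxx_db` with `matplotlib.pyplot.pcolormesh(times, freqs, Sxx_db)` yields the familiar spectrogram image.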

A closer examination shows that the enhancer is not just removing noise, but also reconstructing missing speech. Consider an expanded look at just the period from 28-31 seconds, especially between the black arrows.

Normally, speech includes a fundamental frequency (a horizontal stripe of red at the bottom of the frequency spectrum at 200-500 Hz), plus more than a dozen layers of harmonic frequencies on top (extending up to 2000-3000 Hz). In the noisy speech above, the fundamental frequency is essentially lost in that noisy section. In the enhanced version, the fundamental appears to be not just unmasked, but reconstructed, making the speech richer and more natural.
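The harmonic structure described above can be made concrete with a small sketch: stack a dozen harmonics on a fundamental in the 200–500 Hz range, then recover the fundamental from the autocorrelation peak — a classic pitch-estimation trick, not BabbleLabs' actual method. All values here are illustrative.

```python
import numpy as np

fs = 16000
f0 = 250.0  # fundamental, in the 200-500 Hz range described above
t = np.arange(0, 0.5, 1.0 / fs)

# A dozen harmonics on top of the fundamental, amplitudes falling off as 1/k,
# extending up toward the 2000-3000 Hz region mentioned in the text.
voiced = sum((1.0 / k) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 13))

# Estimate the fundamental from the first autocorrelation peak,
# skipping lags that would correspond to pitches above 500 Hz.
ac = np.correlate(voiced, voiced, mode="full")[voiced.size - 1:]
min_lag = fs // 500
lag = np.argmax(ac[min_lag:]) + min_lag
print(fs / lag)  # ≈ 250 Hz
```

A speech enhancer that has learned this harmonic regularity can exploit it in the other direction: once the harmonics above the noise are identified, the masked fundamental is strongly constrained and can be re-synthesized, which is consistent with the reconstruction visible in the enhanced spectrogram.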

BabbleLabs is deeply committed to continuously improving the algorithms and implementations, to keep raising the standard for speech clarity, intelligibility and accessibility. And we’re adding expanded coverage for new noise classes, languages and use-cases for compelling opportunities in business, consumer, industry, and communications. Keep listening.

Last Updated: September 7, 2018 8:06pm
