Expected Surprises | Part II: Rebuilding Speech

Part I: Restoring the Past
Part II: Rebuilding Speech 
Part III: What’s Missing?
Part IV: Better = Cheaper

Part II: Rebuilding Speech

The BabbleLabs community has been delighted with how much the availability of the product eases the task of explaining what BabbleLabs does. In just over a minute, someone can listen to “before” and “after” clips like our “Raul-in-Montevideo” recording. Please listen to this original, single-microphone recording:

Then compare it to exactly the same single-microphone track after it has passed through our real-time API:


Even the defects in noise suppression help us educate people about the subtle issues in advanced speech enhancement. In the loudest parts, a little of the traffic noise does leak into the final recording. In other parts, the careful listener can detect slight distortion of Raul’s voice and minor variations in volume. These reflect both some conscious choices (letting a small fraction of the original noise through actually sounds more comfortable to most listeners) and some lingering technical challenges for the BabbleLabs speech science team.

The spectrogram, a plot of a sound’s energy at each frequency over time, is an essential visual tool for understanding the impact of speech processing. Consider this comparison of the noisy audio track (above) and the enhanced audio track (below) over the 38 seconds of the track, for fine-grained frequency bins from 0 Hz to 8 kHz, where time-frequency samples with the highest energy are dark red, intermediate samples ...
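For readers who want to see what a spectrogram computes, here is a minimal sketch in plain Python. It is not BabbleLabs code: the function name `spectrogram`, the frame length, the hop size, and the Hann window are all illustrative assumptions, and the 16 kHz sample rate is inferred only from the 0 Hz to 8 kHz range mentioned above. It slices a signal into overlapping frames and measures the energy in each frequency bin of each frame.

```python
import cmath
import math


def spectrogram(samples, frame_len=64, hop=32):
    """Return energy per (frame, frequency bin) for a mono signal.

    Bin k of a frame corresponds to frequency k * sample_rate / frame_len.
    All parameter names and defaults here are illustrative assumptions.
    """
    # Hann window tapers each frame's edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # bins from 0 Hz up to the Nyquist frequency
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        energies = []
        for k in range(n_bins):
            # Naive DFT: correlate the frame with a complex sinusoid at bin k.
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            energies.append(abs(coeff) ** 2)
        frames.append(energies)
    return frames


# Toy input (an assumption, not the Raul recording): a 1 kHz tone
# sampled at 16 kHz, so the bins span 0 Hz to 8 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
spec = spectrogram(tone)

# Locate the strongest bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` is one time slice and each column one frequency bin. Coloring the values by magnitude gives the kind of picture described above. A practical implementation would use an FFT library instead of the naive transform shown here.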


Expected Surprises | Part I: Restoring the Past

Just a few weeks ago, BabbleLabs launched its Clear Cloud speech enhancement product, available both as a cloud streaming API and through a web interface for handling individual files. The early results, as we expected, have been quite surprising ;-)

Find out what I mean in this four-part blog series. We’ll post a new part every few days over the next two weeks:

Part I: Restoring the Past
Part II: Rebuilding Speech 
Part III: What’s Missing?
Part IV: Better = Cheaper

Part I: Restoring the Past

First, we’ve been struck by the uniform response from people who make video (ordinary consumers, video bloggers and professional video producers): “Wow! That’s pretty amazing.” The amateurs, I think, are just shocked when you strip away the noise from a video. It feels as if the speaker has been transported out of the noisy scene to some quiet place, yet there they are, immersed in traffic, the windy South Atlantic, a club, an ice skating rink. The professionals know that with great care in recording and post-production, distracting noises from field recording can be painstakingly eliminated, but it is expensive and uncertain. For them, the impact of Clear Cloud is how close it gets to professional, manual sound editing for no effort. It has become obvious to us that Clear Cloud has great potential in the video production, distribution, and sharing world.

Second, we’ve started to pick up anecdotes that suggest a whole ...


We Love Noise

During my training as an engineer, one of the most obvious concepts I had to learn was “always check your assumptions”. This maxim is applicable in many circumstances. All too often, you find a mismatch between how things should behave and how they actually function. And this leads us to our main subject: why those of us building BabbleLabs, a speech company, care so deeply about noise in speech. We care so much that we derived our company's name from noise; we are BabbleLabs, not SpeechLabs, after all.

When I hear stories about how computers have achieved better-than-human speech recognition results, I wonder how that can be true when I still cannot successfully dictate a number to my phone. Even if short message transcription works in the comfort of your private office, it completely falls apart in your car.

These observations are screaming, “Check your assumptions!” It turns out that most automatic speech recognition (ASR) work has historically focused on ASR in anechoic circumstances (i.e., without echoes or reverberation). Making ASR robust for real-world environments has been a second step built on top of the work done for anechoic ASR. To be clear, that approach is not wrong; you need to crawl before you can walk. However, it has inherent limitations. At BabbleLabs, we have started from the other end of the problem, by addressing the noise.

Noise, of course, means a lot of things: additive noise, modulated ...
