Part IV: Better = Cheaper
Just in the last week, we realized another important way to leverage the remarkable speech enhancement progress of BabbleLabs. Normally, people think of better speech enhancement as delivering just that: an improved experience. But improvements in quality can sometimes be transmogrified into reductions in cost. This turns out to be true for both communicating and storing speech. Communications systems have employed speech coding for decades to deliver adequate speech quality over narrow bandwidth. However, as the aggressiveness of the encoding increases, squeezing speech into the fewest possible bits per second, the quality of the speech suffers. On top of that, the most ambitious speech coding methods attempt to model the characteristics of human speech production.
Modern speech codecs assume a "source-filter" model of speech production and typically use two excitation sources: white Gaussian noise for unvoiced phonemes and a periodic pulse train for voiced speech sounds. They use linear predictive coding (LPC) for the filter that represents the resonances of the vocal tract. These concise models work pretty well in the absence of noise, but non-speech noise doesn’t encode well in them, so noisy speech is often distorted by these speech coding methods, especially at lower bit rates.
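To make the source-filter idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any particular codec is implemented: the function names, the 8 kHz sample rate, the 100 Hz pitch, and the single 700 Hz resonance are all hypothetical choices made for the example.

```python
import math
import random

def excitation(n_samples, voiced, pitch_period=80, seed=0):
    """Source signal: a periodic pulse train (voiced) or white Gaussian noise (unvoiced)."""
    if voiced:
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(source, lpc_coeffs):
    """LPC synthesis filter 1/A(z): y[n] = x[n] - sum_k a[k] * y[n-k]."""
    out = []
    for n, x in enumerate(source):
        y = x
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                y -= a * out[n - k]
        out.append(y)
    return out

def resonator(freq_hz, bandwidth_hz, sample_rate):
    """Two-pole coefficients [a1, a2] for one vocal-tract-like resonance (assumed values)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)   # pole radius, < 1 so the filter is stable
    theta = 2 * math.pi * freq_hz / sample_rate           # pole angle
    return [-2 * r * math.cos(theta), r * r]

SAMPLE_RATE = 8000                                        # assumed narrowband rate
coeffs = resonator(700, 100, SAMPLE_RATE)                 # hypothetical single formant
voiced_frame = all_pole_filter(excitation(400, voiced=True, pitch_period=80), coeffs)
unvoiced_frame = all_pole_filter(excitation(400, voiced=False), coeffs)
```

The encoder only has to transmit the handful of filter coefficients plus a few excitation parameters per frame, which is why the model is so compact. It is also why background noise, which is neither a clean pulse train nor shaped like a vocal tract, fits the model poorly.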
Combining state-of-the-art speech codecs with state-of-the-art speech enhancement addresses these limitations quite well. We can use BabbleLabs Clear to remove noise, so that speech codecs (e.g., the open-source Opus family of codecs) can operate on cleaner input. The result is an end-to-end system that improves, rather than degrades, speech while sharply reducing bandwidth requirements. And that saves money, both in communication and in storage of speech.
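A rough back-of-the-envelope calculation shows the scale of the savings. This sketch assumes the uncompressed baseline is 16 kHz, 16-bit, mono PCM, which this post does not specify, and compares it with the 6 kbps Opus setting discussed in this post. It ignores container and packet overhead, and the function names are illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels=1):
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000

def megabytes_per_hour(bitrate_kbps):
    """Storage needed for one hour of audio at the given bitrate (1 MB = 10**6 bytes)."""
    bits_per_hour = bitrate_kbps * 1000 * 3600
    return bits_per_hour / 8 / 1_000_000

pcm_kbps = pcm_bitrate_kbps(16_000, 16)    # assumed baseline: 16 kHz, 16-bit, mono
opus_kbps = 6.0                            # low-bitrate Opus setting

print(f"PCM:  {pcm_kbps:.0f} kbps, {megabytes_per_hour(pcm_kbps):.1f} MB/hour")
print(f"Opus: {opus_kbps:.0f} kbps, {megabytes_per_hour(opus_kbps):.1f} MB/hour")
print(f"Reduction: about {pcm_kbps / opus_kbps:.0f}x")
```

Under these assumptions the uncompressed stream needs 256 kbps, so the encoded stream is roughly 40 times smaller, for both transmission and storage.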
Let’s compare a set of 1000 noisy speech samples, without any compression, to the same noisy signals after the combination of speech enhancement and Opus encoding and decoding at 6 kbps, the most economical supported setting. We’ll use speech clarity, as measured by the PESQ metric, to compare the results. Across the whole range of noise levels (as shown in the figure below), from very noisy signals at -5 dB signal-to-noise ratio (SNR) to quite clean signals at 15 dB, the cleaned and compressed signals (red points and trendline) score distinctly better than the original speech signals (blue points and trendline). (And without speech enhancement, compression makes the signal noticeably worse than the original.)
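For readers unfamiliar with SNR, the sketch below shows how the measure is defined and how a noisy test sample at a chosen SNR could be constructed. It is only an illustration: this post does not describe how its 1000 samples were prepared, and the sine-wave "speech" and Gaussian noise here are stand-ins invented for the example.

```python
import math
import random

def power(signal):
    """Mean power of a signal (average of squared samples)."""
    return sum(x * x for x in signal) / len(signal)

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10 * math.log10(power(speech) / power(noise))

def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so speech + noise has the requested SNR; return (mixture, scaled_noise)."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (target_snr_db / 10)))
    scaled = [gain * n for n in noise]
    return [s + n for s, n in zip(speech, scaled)], scaled

# Stand-in signals (assumptions): a 200 Hz tone at 8 kHz and white Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(8000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8000)]

for target in (-5, 0, 15):
    mixture, scaled_noise = mix_at_snr(speech, noise, target)
    print(f"target {target:>3} dB -> measured {snr_db(speech, scaled_noise):.2f} dB")
```

At -5 dB the noise carries about three times the power of the speech, while at 15 dB the speech carries about thirty times the power of the noise, which is why the two ends of the range sound so different.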
This approach emphasizes the fact that improved speech quality can be translated into many different benefits, by giving the system designer the quality margin to cut costs elsewhere. Those savings might be gained by using fewer or cheaper microphones, lower sample rates or bit precision, more aggressive encoding, lower-quality communications channels, or less reliable storage media. So sometimes better is just better, but sometimes better is also faster, or more reliable, or lower power…or cheaper!