Writer and physician Oliver Wendell Holmes once said, “Speak clearly, if you speak at all; carve every word before you let it fall.” He, of course, was campaigning for thoughtfulness in speech, but comprehensibility of speech remains crucially important. We all want dearly to understand and be understood by other people — and increasingly by our devices that connect us to our world.
Speech has always been humans' preferred interface method. In fact, the emergence of hominid speech a couple of hundred thousand years ago coincided with the origins of the species Homo sapiens. But we now live in sonic chaos, a noisy world that is getting noisier every day. We talk in cars, in restaurants, on the street, in crowds, in echoey rooms and in busy kitchens; there are always kids yelling, horns honking or loud music playing. What can we do to combat the noise and make speech and understanding easier?
Part of the emerging answer is speech enhancement, the electronic processing of natural human speech to pare back the noise that makes speech hard to comprehend and unpleasant to listen to. Electronic engineers have been concerned about speech quality for almost a century, really since the emergence of radio broadcasts. Most speech improvement methods have evolved relatively slowly over the last couple of decades, representing gradual refinement of DSP algorithms that try to separate speech sounds from background noise. The emergence of deep neural network methods is now helping algorithm designers take a fresh look at the problem and develop much more ambitious and effective speech enhancement methods to reduce noise, remove reverberation and improve the intelligibility of speech.
Speech Enhancement ≠ Speech Recognition
Speech enhancement is distinct from speech recognition. Speech recognition focuses on rendering human speech audio into equivalent text for the purpose of capturing transcriptions and controlling devices and information systems. It is also being effectively extended into automated translation of languages. Speech enhancement concentrates instead on the human auditory experience. That experience suggests two goals for speech enhancement. First, the enhancement should make the speech easier to understand, bringing out the distinguishing sounds from the background of noise and acoustic impairments. Second, the enhancement should make the speech more pleasant to listen to, cutting back the distracting, annoying and even painful interference of noise.
These two goals are not as inevitably linked as you might imagine. Intelligibility is actually quite challenging to improve, since the human auditory system and language centers are remarkably adept at piecing together the most likely meaning of even heavily impaired speech sounds. Intelligibility is best measured by human (aka subjective) testing, but it turns out to be well correlated with objective speech quality metrics like the Normalized Covariance Metric (NCM) and several flavors of Short-Time Objective Intelligibility (STOI). Ironically, many of the classical signal processing algorithms for speech enhancement do relatively little for intelligibility. Listening comfort is better modeled by metrics like Perceptual Evaluation of Speech Quality (PESQ), a family of methods captured in ITU-T recommendation P.862. PESQ correlates well with reduction of the perceived noise level, but noise reduction is not the same thing as intelligibility improvement.
To illustrate this, the BabbleLabs team has implemented some 19 different speech enhancement algorithms from the literature and looked at the results on noisy speech.
Here are two examples comparing the original noisy recording with the “enhanced” results from these algorithms, scored on PESQ and NCM.
(The benefits are so modest, especially on intelligibility, that I haven’t broken out the results by name, but the methods include Martin, MCRA, MCRA2, IMCRA, Doblinger, Hirsch, Conn_Freq, KLT, PKLT, Wiener AS, Audnoise, MMSE, LogMMSE, stsa_weuclid, stsa_wcosh, and stsa_mis. Let me know if you want to dig deeper on this.)
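As a rough illustration of what many classical noise suppressors do, here is a minimal Python sketch of a gain-based approach: estimate the noise power in each frequency bin, then attenuate the bins where noise dominates. It is a sketch under stated assumptions, not a reproduction of any of the algorithms listed above. The function names (`estimate_noise_power`, `wiener_gain`, `frame_gains`), the gain floor of 0.1, and the shortcut of averaging frames assumed to contain only noise are all choices made for this example, and the inputs are assumed to be per-bin power values already computed from a short-time spectrum.

```python
from typing import List, Sequence


def estimate_noise_power(noise_frames: Sequence[Sequence[float]]) -> List[float]:
    """Average per-bin power over frames ASSUMED to contain noise only."""
    n_frames = len(noise_frames)
    n_bins = len(noise_frames[0])
    return [sum(frame[k] for frame in noise_frames) / n_frames for k in range(n_bins)]


def wiener_gain(noisy_power: float, noise_power: float, floor: float = 0.1) -> float:
    """Wiener-style gain 1 - N/Y for one bin, clamped to the range [floor, 1]."""
    if noisy_power <= 0.0:
        return floor
    gain = 1.0 - noise_power / noisy_power
    return min(1.0, max(floor, gain))


def frame_gains(
    noisy_power: Sequence[float],
    noise_power: Sequence[float],
    floor: float = 0.1,
) -> List[float]:
    """Per-bin gains for one frame; multiply the noisy magnitudes by these."""
    return [wiener_gain(y, n, floor) for y, n in zip(noisy_power, noise_power)]


if __name__ == "__main__":
    # Two hypothetical noise-only frames with three frequency bins each.
    noise = estimate_noise_power([[1.0, 2.0, 1.0], [3.0, 2.0, 1.0]])
    print(noise)  # [2.0, 2.0, 1.0]

    # One hypothetical noisy-speech frame: bin 0 is speech-dominated,
    # bins 1 and 2 are at or below the estimated noise level.
    print(frame_gains([8.0, 2.0, 0.5], noise))  # [0.75, 0.1, 0.1]
```

Note that a rule like this can only turn bins down. It cannot restore speech cues that the noise has already masked, which is one plausible way to see why a lower noise level does not automatically mean better intelligibility.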
Noise plays a leading role as the main villain in speech understanding by both humans and machines. As I’ve written about before, today’s speech recognition systems are often extremely sensitive to the presence of noise, and some of the same neural network training methods used for noise resilience in deep learning-based speech enhancement can be effectively applied to speech recognition systems too. Unless we make better speech systems (ones with higher speech quality, better intelligibility, more noise robustness and improved speaker identification), all the talk about a revolution in speech interfaces is just talk. A good example is the relatively low satisfaction of smartphone users with voice assistants. PwC reports that only 38% of users are “very satisfied” with mobile voice assistants. We must deploy significantly improved algorithms to address the serious limitations we see today!
Who Needs Speech Enhancement?
The applications of speech enhancement are endless. It can play a major role in better telephony systems, public address systems, and hearing aids. Even if we focus on just one small subset — speech enhancement in internet-based systems — the range is impressive.
- Call centers capture voice dialog for lots of reasons: for live interaction, for automated response using speech recognition, for logging and training, and for off-line assessment and response.
- Home and office voicemail used to be something hosted on a box owned by a family or business, but that service has now moved almost entirely to the cloud. Voice calls are often captured with both nasty background noise and channel impairments that make voicemails notoriously difficult to understand.
- Video and audio production, amateur and professional alike, often confronts tough issues in noisy and reverberant speech material. When audio professionals have the time, they can manually filter the speech track second-by-second or even phoneme-by-phoneme to reduce the noise using tools like Adobe Audition, or even re-record the audio under studio conditions. Few amateurs have the tools and skills to do sophisticated speech improvement, so automated speech enhancement makes sense for situations ranging from professional editing of field-recorded audio and video to parents fixing up the cellphone video of their kids’ birthday party before posting it for relatives on YouTube. In fact, an increasing fraction of all social media revolves around video, usually produced by amateurs, on Snapchat, Instagram, Twitter, YouTube, and WeChat.
- Audio archiving is becoming a critical information access and compliance method in business. Customers, investors, and partners all value having ready access to the best possible recordings of conference calls and earnings calls. In Europe, new MiFID II/MiFIR finance rules require greater transparency and access by investors to key business information, and audio archiving is likely to be an element of this new regime.
- Games and entertainment have developed rich interactive speech elements. The Talking Tom franchise and apps like “My Pet Can Talk” turn voice recordings into animal animations. Massively multiplayer online games (MMOGs) are turning to enhanced speech to make game-play more fun, more comprehensible, and more immersive.
The list goes on and on, and the direction is clear. Enhanced speech is rapidly growing as a key ingredient in speech interaction systems of all kinds. Better speech systems will be used by billions of people, and make a critical business difference in any situation where better understanding is needed.
Humans are extremely demanding with regard to the quality of speech they will listen to; after all, we’ve been refining our tastes for hundreds of thousands of years. Better speech enhancement, especially methods that can improve intelligibility (not just turn down the noise), is critical to progress across these applications.
Let us “carve every word before we let it fall.”