Is Speech Recognition Ready for a Noisy World?

Speech-centric user interfaces abound. Siri on the iPhone introduced us to the potential of using speech to control our phones. Amazon’s Alexa service brought a wealth of new information services into our living rooms. Google’s high-quality speech recognition now spans phones and smart speakers and is starting to bring on-the-fly speech translation within our reach. On the surface, it seems that speech interfaces are now ready for the real world.

The reality, alas, is not so rosy. The real world is a chaotic place with serious impairments to understanding caused by loud background noises, acoustic reverberation and faulty communications channels. Both human-to-human and human-to-machine communications suffer from severe limitations in comprehensibility. It is worst, naturally, when the noise is loudest, for example outdoors, in crowded cafes, in cars, in the kitchen and on the factory floor. Moreover, comprehension by people and by machines (automatic speech recognition, or ASR) is worse when there is little context. We often have an easier time understanding a whole sentence than a short sequence of words because the longer utterance gives us more context to use in disambiguating the speech.

A simple experiment illustrates the problem. I recently captured a string of voice commands in both a noisy and a noise-free environment. The noisy environment was created by playing back recorded crowd noise at a power level comparable to my speech (the exact levels are given below). I fed the audio stream into the excellent IBM Watson speech recognition system. Then I spoke the same series of commands into a low-cost implementation of the BabbleLabs Clear Command application, built for small-vocabulary recognition and trained to recognize those target commands. The target script consisted of ten commands, each made up of a trigger phrase, in this case “Hey NXP”, followed by a generic command (a sketch of how such a clip can be submitted to Watson follows the list):

Command one
Command two
Command three
Command four
Command five
Power on
Power off
Previous
Next
Go left
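
For reference, a recorded clip can be submitted to the Watson service through its Python SDK roughly as follows. This is a minimal sketch rather than the exact setup used in the test; the API key, service URL and file name are placeholders.

```python
# A minimal sketch (assumed setup, not the exact test configuration) of sending
# a recorded clip to IBM Watson Speech to Text with the ibm-watson Python SDK.
# The API key, service URL and file name below are placeholders.
from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")
stt = SpeechToTextV1(authenticator=authenticator)
stt.set_service_url("https://api.us-south.speech-to-text.watson.cloud.ibm.com")

with open("commands_noisy.wav", "rb") as audio_file:   # hypothetical recording
    result = stt.recognize(
        audio=audio_file,
        content_type="audio/wav",
        model="en-US_BroadbandModel",   # general-purpose large-vocabulary model
    ).get_result()

# Print the top transcript for each recognized utterance.
for chunk in result["results"]:
    print(chunk["alternatives"][0]["transcript"])
```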

The IBM Watson speech recognition system (https://speech-to-text-demo.ng.bluemix.net) performs pretty well with low noise. In this case, I tested it at my desk at home, where the ambient noise was about 55dB SPL. My voice signal was about 71dB SPL, with my mouth approximately 40cm from the microphone (the laptop microphone for IBM Watson, or the microphone on the prototype command recognizer running on an NXP i.MX RT 1060-based board).

Despite having no foreknowledge of my likely vocabulary, Watson got most of the words right in this environment, with its 16dB signal-to-noise ratio (71 – 55 = 16dB).

The results for the noisy case are less impressive. For this test I added some fairly difficult noise, a recording of a café environment with background speech and clanking dishes (https://youtu.be/BOdLmxy06H0). The playback was set to generate noise of about 77dB SPL, for a -6dB signal-to-noise ratio.
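
The SNR figures quoted here are simply the difference between the measured speech and noise levels in dB SPL. A minimal sketch of that arithmetic follows; the waveform-based helper is an illustrative extra, not part of the original test.

```python
# The SNR figures in the text are the difference between speech and noise
# levels measured in dB SPL; snr_from_waveforms() illustrates the equivalent
# calculation from recorded sample buffers.
import numpy as np

speech_spl = 71.0   # measured speech level, dB SPL
desk_spl = 55.0     # ambient noise at the desk, dB SPL
cafe_spl = 77.0     # café noise playback, dB SPL

print(speech_spl - desk_spl)   # 16.0 dB SNR, the noise-free case
print(speech_spl - cafe_spl)   # -6.0 dB SNR, the noisy case

def snr_from_waveforms(speech: np.ndarray, noise: np.ndarray) -> float:
    """Estimate SNR in dB from speech-only and noise-only sample buffers."""
    def rms(x: np.ndarray) -> float:
        return float(np.sqrt(np.mean(np.asarray(x, dtype=np.float64) ** 2)))
    return 20.0 * np.log10(rms(speech) / rms(noise))
```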

Noise-free: 16dB SNR

IBM Watson – text output:
Hey NXP Command one
Hey NXP Man to
Hey NXP Command three
Hey NXP A man for
Hey NXP Command five
Hey NXP Power on
Hey NXP Power off
Hey NXP Yes
Hey NXP Next
Hey NXP Go left

BabbleLabs – command acknowledged:
Hey NXP Command one
Hey NXP Command two
Hey NXP Command three
Hey NXP Command four
Hey NXP Command five
Hey NXP Power on
Hey NXP Power off
Hey NXP Previous
Hey NXP Next
Hey NXP Go left

Noisy: -6dB SNR

IBM Watson – text output:
Be a man run
Hey he command to.
Hey Three.
Hey XP command for
May be on the inside.
He won
---
Previous.
He knew.
Hey XP though that

BabbleLabs – command acknowledged:
Hey NXP Command one
Hey NXP Command two
Hey NXP Command three
Hey NXP Command four
Hey NXP Command five
Hey NXP Power on
Hey NXP Power off
Hey NXP Previous
Hey NXP Next
Hey NXP Go left

Why was the IBM system’s performance so different between the quiet and noisy environments?

One possibility is that the IBM system isn’t adequately trained for noisy speech, so that it is confused by unfamiliar audio distractions. It is also possible that the essential structure of large-vocabulary speech recognition has difficulty coping with the increased ambiguity of jumbled sounds and is forced to guess the most likely text sequence. As noise increases, more possible word sequences could fit the underlying speech, but the large-vocabulary system has few clues for selecting which plausible sequence is best. It simply attempts to find a phrase that fits common word sequences in the target language. The large-vocabulary system is merely constrained to find word sequences that COULD occur in the target language; it is not constrained to look specifically for the target commands. Only the natural language processing stage, which sees only the selected sequence of words, knows what set of phrases is of interest to the application. Unfortunately, there is rarely a connection from the natural language application back to the acoustic model, so the system can’t resolve acoustic ambiguities by preferring interpretations that more often yield target phrases.
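
To see why a downstream text-matching stage can’t fully repair this, consider fuzzy-matching the noisy transcripts from the table above against the ten target commands after the recognizer has already committed to a word sequence. This is purely an illustration using Python’s difflib, not how either production system works.

```python
# Illustration: once an unconstrained recognizer has committed to a transcript,
# a downstream matcher can only work with that text. Some garbled transcripts
# still land near a target command; others are too far gone to recover.
from difflib import SequenceMatcher

COMMANDS = [
    "hey nxp command one", "hey nxp command two", "hey nxp command three",
    "hey nxp command four", "hey nxp command five", "hey nxp power on",
    "hey nxp power off", "hey nxp previous", "hey nxp next", "hey nxp go left",
]

def best_match(transcript: str):
    """Return (similarity score 0..1, closest target command)."""
    scores = [(SequenceMatcher(None, transcript.lower(), c).ratio(), c)
              for c in COMMANDS]
    return max(scores)

# Transcripts Watson produced at -6dB SNR (from the table above).
for t in ["Hey XP command for", "May be on the inside.", "He won"]:
    score, cmd = best_match(t)
    print(f"{t!r:28} -> {cmd!r} (similarity {score:.2f})")
# "Hey XP command for" still lands close to "hey nxp command four", while
# "May be on the inside." and "He won" sit far from every target phrase.
```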

By contrast, the BabbleLabs command recognition model effectively combines the acoustic model, the language model and the natural-language phrase-of-interest recognition application into a single neural-network recognizer. The recognizer knows all about the target phrases and is typically trained with hundreds of thousands of example utterances, both target phrases to accept and other phrases, including similar-sounding ones, to reject. The disambiguation of noisy speech is done with complete knowledge of the targets, so the recognizer can find them more easily in the chaos.
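
For concreteness, here is a generic sketch of what a single-network, closed-vocabulary recognizer can look like: one model maps audio features directly to the ten target phrases plus an explicit reject class, so acoustic disambiguation happens with full knowledge of the phrases of interest. This is an illustrative toy in PyTorch with made-up layer sizes and feature dimensions; it is not BabbleLabs’ actual architecture, and it omits the large-scale accept/reject training described above.

```python
# Generic sketch (not BabbleLabs' architecture) of a closed-vocabulary command
# recognizer: a single network scores ten target phrases plus a reject class.
import torch
import torch.nn as nn

NUM_COMMANDS = 10                 # "Hey NXP Command one" ... "Hey NXP Go left"
NUM_CLASSES = NUM_COMMANDS + 1    # +1 reject class for everything else

class CommandRecognizer(nn.Module):
    def __init__(self, n_mels: int = 40, n_frames: int = 150):
        super().__init__()
        # Small convolutional front end over a log-mel spectrogram
        # (n_mels x n_frames), e.g. ~1.5 s of audio at a 10 ms hop.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * (n_mels // 4) * (n_frames // 4), 64), nn.ReLU(),
            nn.Linear(64, NUM_CLASSES),
        )

    def forward(self, log_mel: torch.Tensor) -> torch.Tensor:
        # log_mel: (batch, 1, n_mels, n_frames) -> per-class scores
        return self.classifier(self.features(log_mel))

# Inference: pick the best class; treat the reject class as "no command heard".
model = CommandRecognizer()
scores = model(torch.randn(1, 1, 40, 150))   # placeholder features
predicted = scores.argmax(dim=-1).item()
print("reject" if predicted == NUM_COMMANDS else f"command #{predicted}")
```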

This command recognition approach serves an important role that barely overlaps with the role of cloud-based continuous speech recognition systems. It thrives on finite vocabularies where all of the target utterances can be enumerated and where noise immunity is critical. The big cloud-based ASR systems do a remarkable job of handling the huge vocabularies of major languages like English, which is important for use cases like web queries and shopping, where user vocabulary and sentence structure can vary wildly.

For some systems, the advantages of local command recognition go beyond noise tolerance. The BabbleLabs unified neural-network model consumes much less compute and memory than the three-stage pipeline of large-vocabulary ASR systems (acoustic model, language model and natural language processing), often more than 100x less. This means it can be implemented in tiny microcontrollers and DSPs embedded in IoT products, wearable devices, vehicles and other personal gadgets. Local command recognizers don’t require continuous network availability, operate with lower response latency and inherently maintain privacy: they do not need to share anything with the cloud, so all the speech data remains local.

Noise-immune command recognition can resolve a thorny problem for system builders. It delivers the convenience and intuitiveness of speech control even in difficult conditions, while reducing cost and power and improving response time, uptime and data protection. I expect to see this class of interface proliferate rapidly.

Learn more about how BabbleLabs is helping to create the next generation of voice-driven products.
