The Internet has been abuzz this past week with discussion of a single utterance. People everywhere are talking about “Laurel versus Yanny”, an utterance that sounds more like “Laurel” to some, and more like “Yanny” to others. See the NY Times article. In fact, you can get from something very clearly “Laurel” to something clearly “Yanny” by changing the gain on different frequency bands. “Laurel” is emphasized when the high frequencies are attenuated, and “Yanny” stands out when the high frequencies are boosted. Find a fun comparison tool here. Whether you hear “Laurel” or “Yanny” is partially determined by your auditory system’s sensitivity to high frequencies.
This lively but inconsequential debate exposes the tip of an iceberg of more substance to speech experts. People want clearer speech in every situation where intelligibility or comfort is important. Nobody wants to listen to noise so overwhelmingly loud, or reverberation so heavy, that all intelligibility is lost. Alexander Graham Bell famously first transmitted voice on a wired system in 1876, and Reginald Aubrey Fessenden transmitted voice by wireless in 1900. Needless to say, the voice quality was lousy. Since then, engineers working on speech-based communication systems have developed a range of metrics for comparing the noise level and the intelligibility of electronic speech reproduction.
What is “good” sound in speech? Low noise level? High clarity? Good comprehensibility? Speech captured under ideal conditions (good microphones, an anechoic recording environment, and zero additive noise) certainly ranks well on these metrics. Many environments, however, make speech capture, transmission and reproduction very difficult. We need to be prepared for imperfect speech, and to analyze it in a way that gives insight into how to improve our speech systems. Good metrics are one key to good speech.
Broadly speaking, there are two sets of criteria we may want to apply to speech:
- Comfort/Quality: How does it feel to listen to the speech? Is it annoying or uncomfortable? Is the noise or reverberation distracting?
- Intelligibility: Regardless of how noisy or unpleasant the speech may be, how completely can you actually make out the words and the speaker’s intent? Can you understand?
As you’d expect, intelligible speech is often also comfortable to listen to, but the two criteria are not the same. High noise, especially outside of the typical speech frequency spectrum, can be quite uncomfortable and annoying, but may not interfere much with comprehension. But impairments don’t need to be annoying to interfere with comprehension. Moderate noise in the speech frequency bands, for example from other speakers in the background, can ruin intelligibility. Reverberation of rapid speech can also make comprehension difficult, even without disturbing levels of noise. Over time, many metrics have emerged. Let me briefly introduce and compare a few of the more relevant ones. The following are, in our opinion, the best metrics to use to measure comfort, quality, and intelligibility:
- MOS: The Mean Opinion Score is the gold standard for speech assessment, because, ironically, it is based on human judgment. It is a method for aggregating actual human listening experience to determine which speech is more satisfactory. The 1-5 scale is strikingly subjective, and that goes to the real heart of human perception of speech: how annoying is it? Many other metrics are based on trying to predict MOS without the expense of collecting human data. There are a number of variants of MOS, including comparative scoring of two speech samples, continuous scoring, and scoring of specific impairments (for example, echo).
| Score | Quality | Impairment |
|---|---|---|
| 5 | Excellent | Imperceptible |
| 4 | Good | Perceptible but not annoying |
| 3 | Fair | Slightly annoying |
| 2 | Poor | Annoying |
| 1 | Bad | Very annoying |
- SNR: Everyone working on audio or speech is familiar with Signal to Noise Ratio. It measures the ratio of the power of the intended signal to the power of the noise. It is usually computed from the average power over a time period and is expressed on a log scale, so that a doubling of the ratio translates to a 3-decibel improvement in the SNR. This is really a measure of how badly the noise distorts the waveform relative to the clean signal. It is not well correlated with perception of signal quality or intelligibility. Usually SNR is measured directly by comparing the noisy speech with the same speech without noise, when that clean version is available. Sometimes SNR is estimated by approximating the magnitude of the noise, for example during speech gaps, and comparing it to the total audio magnitude.
- PESQ: Perceptual Evaluation of Speech Quality is a group of standards from the telecommunications industry, published as ITU-T Recommendation P.862. It attempts to model human assessment of voice quality more directly, so it tries to estimate results on roughly the same 1-5 scale as MOS. PESQ results are computed with a “full-reference” or “intrusive” algorithm: it needs access to the original clean version of each speech sample as the ideal, and compares the noisy version to it. A newer variation on PESQ-like metrics is captured in the P.863 standard, Perceptual Objective Listening Quality Assessment (POLQA). There are analogous “no-reference” or “non-intrusive” metrics for some quality measures. These use no information about the clean version, so they can only estimate the degradation relative to a hypothetical clean version.
- NCM and STOI: Normalized Covariance Metric (NCM) and Short-Time Objective Intelligibility (STOI) are metrics specifically designed to correlate well with human intelligibility. They are distant descendants of French and Steinberg’s Articulation Index, developed in the 1940s. NCM was developed in the 1990s and STOI a decade later, as different approximations to intelligibility on a 0-1 scale. Both are full-reference algorithms: they compare clean to dirty signals to estimate the degree of intelligibility. An improved version of STOI, Extended Short-Time Objective Intelligibility (ESTOI), has come into use in the past couple of years. ESTOI is designed to handle highly modulated and variable noise sources, so it may provide good insights for some of the most challenging noise types.
- WER: On the surface, the emergence of good speech recognition systems seems to offer a fine way to measure speech comprehensibility. Some researchers have tried to use Word Error Rate (WER), for example, as a stand-in for human comprehension. Unfortunately, automatic recognition systems are trained on finite distributions of speech and noise, and their performance depends heavily on whether the input speech follows similar distributions. Speech processing of the voice signal may actually improve the comfort and the true intelligibility of the speech, but shift the signal pattern (often in inaudible ways) outside the scope of the recognition training. Thus, the WER metric can degrade even while subjective and other objective metrics of intelligibility improve.
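Of the metrics above, SNR and WER are simple enough to sketch in a few lines; PESQ, NCM and STOI involve perceptual models that are well beyond a short example. The following is a minimal Python sketch, not a reference implementation. The function names (`snr_db`, `word_error_rate`) are our own, the SNR version assumes the clean and noisy signals are time-aligned, equally scaled lists of samples, and the WER version assumes plain whitespace-separated transcripts with no text normalization.

```python
import math


def mean_power(samples):
    """Average power of a signal: the mean of its squared samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(clean, noisy):
    """Full-reference SNR in decibels.

    Assumption: the noise is the sample-wise difference (noisy - clean),
    which only holds if both signals are time-aligned and equally scaled.
    """
    if len(clean) != len(noisy) or not clean:
        raise ValueError("signals must be non-empty and of equal length")
    noise = [n - c for c, n in zip(clean, noisy)]
    noise_power = mean_power(noise)
    if noise_power == 0:
        return math.inf  # identical signals: no measurable noise
    return 10.0 * math.log10(mean_power(clean) / noise_power)


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words.

    Assumption: transcripts are compared as whitespace-separated tokens,
    with no case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For instance, a clean signal with unit power and noise with power 0.01 gives a ratio of 100, or about 20 dB, and dropping one word from a six-word reference transcript gives a WER of 1/6. Note that WER can exceed 1.0 when the recognizer inserts many extra words.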
I must mention an important caveat about all these objective or algorithmic metrics. The correlation between the computed metric (say, PESQ) and the human perception tests (say, MOS) depends on the impairments in the speech under assessment. The correlation is excellent for the kinds of impairments originally used in characterizing and tuning the metric, but it can be expected to degrade as new and different distortions are measured.
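One way to check this caveat on your own data is to compute the correlation separately for each class of impairment rather than over the pooled set. The sketch below uses the Pearson correlation coefficient, which is our choice for illustration (rank correlations are also common). The function names and the input layout, a dictionary mapping an impairment label to a list of (objective score, MOS) pairs, are assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_by_impairment(scores):
    """Correlate objective scores with MOS separately per impairment class.

    `scores` maps a (hypothetical) impairment label such as "babble" or
    "reverb" to a list of (objective_score, mos) pairs for that class.
    """
    result = {}
    for impairment, pairs in scores.items():
        objective = [p[0] for p in pairs]
        mos = [p[1] for p in pairs]
        result[impairment] = pearson(objective, mos)
    return result
```

A metric that tracks MOS closely on one impairment class and poorly on another would show up here as very different per-class coefficients, even if the pooled correlation looks acceptable.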
The proliferation of speech analysis metrics reflects an inherent conflict of goals for speech work. On one hand, we want metrics that are easy to compute, reproducible and objective. On the other hand, we want perfect prediction of human experience. No algorithmic metric captures all the complexity of human perception for all the possible forms of naturally occurring and artificial impairments. The only robust way to determine if humans can comprehend a speech sample is to get a statistically significant group of humans to judge it. The “Laurel” vs. “Yanny” debate further highlights the variability of human perception, so sample sizes must be significant to be truly accurate. The expense and time for subjective scoring force researchers to keep going back to come up with better methods — simple formulas that correlate well to real MOS. In the meantime, researchers and developers of real speech systems use an ensemble of metrics of comfort, quality and intelligibility, together with human listening, to make real progress on real speech experiences.
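The point about sample size can be made concrete by reporting MOS with a confidence interval rather than as a bare average. The sketch below is a simplified illustration under stated assumptions: ratings are independent integers on the 1-5 scale, and the interval uses a normal approximation with z = 1.96 for roughly 95% coverage. It is not the procedure from any particular listening-test standard, and the function name is our own.

```python
import math
import statistics


def mos_with_confidence(ratings, z=1.96):
    """Return (mean, lower, upper) for a list of 1-5 opinion scores.

    Assumption: a normal approximation, mean +/- z * s / sqrt(n), which is
    crude for small panels and can extend beyond the 1-5 scale.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) is the standard error.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, mean - half_width, mean + half_width
```

Because the half-width shrinks with the square root of the number of ratings, halving the uncertainty takes about four times as many listeners, which is a large part of why subjective testing is expensive.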
BabbleLabs is driving to create a speech-powered world, where humans are universally understood and in command. We use the full spectrum of metrics, subjective and objective, to measure and optimize speech systems. Using metrics for intelligibility, like NCM and STOI, is essential to improving real intelligibility. Optimizing only for SNR and PESQ may yield speech with a good sound, but it won’t do much for human intelligibility. The result of all the effort with metrics is worth it: clearer voice, greater comfort, higher comprehension – and ultimately, better understanding among people.