Human-in-the-loop approach for AI-based speech enhancement assessments

In the age of remote collaboration and videoconferencing, AI plays an important role in speech processing: removing background noise, enhancing speech, and extracting insights from speech and audio streams. But how do you assess the efficacy of such technology in the context of decades of legacy hardware and software solutions? At BabbleLabs, in addition to objective metrics, we use a listener perception-based framework to evaluate the efficacy of our AI-based speech enhancement models and products.

Subjective assessments with listener perceptions on sound quality

The gold standard for assessing speech and sound quality relies on the subjective opinions of a large, diverse panel of human listeners. Traditionally, this process is laborious and expensive to conduct. There are established objective measures based on predictive models; however, their predictions may be valid only for very specific types of distortion and are most useful in the earlier stages of audio algorithm development.

At BabbleLabs, we use objective measures such as PESQ, ESTOI, and SNR in the early stages of algorithm development, since they are fast and inexpensive to apply; a minimal sketch of computing these metrics appears after the list below. In the intermediate and, especially, the final stages, we place greater reliance on subjective human listener opinions in order to:

  1. Leverage the gold standard in sound quality assessment
  2. Get feedback from real listeners in real-life listening environments
  3. Gather reliable opinions for many types of audio distortions
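
For readers who want to see what the objective stage looks like in practice, here is a minimal sketch of computing the metrics named above with openly available Python packages (`pesq`, `pystoi`, `soundfile`, `numpy`). The file names are hypothetical, and this is an illustration under stated assumptions, not BabbleLabs' internal tooling.

```python
# Minimal sketch: objective speech-quality metrics on a clean/enhanced pair.
# Assumes 16 kHz mono recordings of equal length and the third-party
# `pesq`, `pystoi`, and `soundfile` packages; file paths are hypothetical.
import numpy as np
import soundfile as sf
from pesq import pesq      # ITU-T P.862 PESQ
from pystoi import stoi    # (E)STOI intelligibility

def snr_db(reference, estimate):
    """Signal-to-noise ratio of an enhanced signal against its clean reference."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

clean, fs = sf.read("clean.wav")        # hypothetical clean reference recording
enhanced, _ = sf.read("enhanced.wav")   # hypothetical model output, same fs and length

print("PESQ :", pesq(fs, clean, enhanced, "wb"))           # wideband mode, fs = 16 kHz
print("ESTOI:", stoi(clean, enhanced, fs, extended=True))
print("SNR  :", snr_db(clean, enhanced), "dB")
```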

To this end, we have developed a simple, comparative, crowdsourced, subjective testing framework that can be performed at scale with a large number of individuals in our globally connected world. Participants need no prior experience in speech quality assessment; anyone from around the world can take part. Thanks to classical statistics and probability theory, along with careful study design, we can effectively characterize the distribution of listener opinions.

A survey framework and scoring metric for AI model assessments

To assess listeners’ responses objectively, we at BabbleLabs developed a scoring methodology to quantify human listener perceptions, which we refer to as the “consensus response score.” The methodology is based on the ITU-T P.808 recommendation for subjective evaluation of speech quality using human listeners. We use a large cohort of participants to compare many audio pairs – original noisy versus enhanced. We gather the responses and average them over a particular treatment, noise type, or speech utterance. Based on these responses, we can also draw conclusions about model efficacy for listeners.
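
As a rough illustration of how such paired-comparison ratings might be averaged per condition, here is a sketch using pandas. The column names, toy data, and the simple mean are illustrative assumptions, not the exact consensus response score computation.

```python
# Rough sketch of aggregating paired-comparison ratings into per-condition means.
# Column names and the plain mean are illustrative assumptions, not the exact
# "consensus response score" formula described in the post.
import pandas as pd

# Each row: one listener's rating of an (original noisy, enhanced) pair,
# on the comparative -2 .. +2 scale described later in the post.
responses = pd.DataFrame({
    "listener_id": [1, 1, 2, 2, 3, 3],
    "noise_type":  ["babble", "street", "babble", "street", "babble", "street"],
    "rating":      [+2, +1, +1, 0, +2, -1],
})

# Average over a chosen grouping (noise type, treatment, utterance, ...).
consensus = responses.groupby("noise_type")["rating"].agg(["mean", "count"])
print(consensus)
```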

We can roll these tests out within hours to a day, so a new speech enhancement model is assessed quickly by hundreds of listeners. By aggregating thousands of opinions, we expedite our research and development and adjust design decisions in model development, whether the choice of training data, model architecture, audio signal processing, or a combination of these.

In a typical survey workflow, we first train participants to understand the instructions before administering the actual test. In the rating phase, the system delivers randomized pairs of audio stimuli along with a rating scale for scoring. A typical survey consists of:

  • A short practice session, where listeners familiarize themselves with the task.
  • The test itself, which usually consists of pairs of noisy and enhanced audio stimuli.

Figure 1. Survey flow for a participant in our crowdsourced testing.
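
To make the randomized delivery step above concrete, here is a minimal sketch of how pairs might be shuffled and counterbalanced for one participant; the file names and the specific ordering rule are illustrative assumptions, not our production delivery system.

```python
# Sketch of randomized A/B presentation for one participant: shuffle the trial
# order and randomly swap which clip plays first, so listeners cannot learn
# which position holds the enhanced audio. File names are hypothetical.
import random

pairs = [("noisy_01.wav", "enhanced_01.wav"),
         ("noisy_02.wav", "enhanced_02.wav"),
         ("noisy_03.wav", "enhanced_03.wav")]

def build_trials(pairs, seed=None):
    rng = random.Random(seed)
    trials = []
    for noisy, enhanced in pairs:
        # Randomly assign which clip is presented as A and which as B.
        a, b = (noisy, enhanced) if rng.random() < 0.5 else (enhanced, noisy)
        trials.append({"clip_a": a, "clip_b": b})
    rng.shuffle(trials)  # randomize the order of trials as well
    return trials

for trial in build_trials(pairs, seed=42):
    print(trial)
```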

A listener’s decision is highly subjective, with factors such as the amount of noise reduction versus the naturalness of the enhanced sound influencing their perception. From the listener ratings, we create a comparative five-point scale from -2 to +2, with zero indicating no apparent difference. Below is an example result from one such recent survey.

Figure 2. Example survey result.

Benefits of scaling out subjective assessments with diverse listeners globally

Traditionally, subjective listener tests are conducted by trained professionals (e.g., hearing or perception scientists, audiologists, and/or audio research engineers) in sound-treated, quiet environments under a standardized research protocol, with controls to prevent introducing unwanted variability into the collected responses. However, such studies are laborious, time-consuming, and expensive, and they do not scale well for the rapid innovation needed in AI-based speech enhancement development. Meanwhile, objective measures can fail or be misleading in the intermediate and final stages of model development.

In recent years, the scientific community has embraced scaling out listening experiments to many diverse listeners around the world for cost savings and expedience. These tests can be conducted concurrently with ease, at orders of magnitude lower cost than on-site testing. Recent studies show this approach can be very effective when unwanted variability from differences in equipment, environment, and listener focus is managed appropriately and a large number of responses and listeners is included. This framework also captures the perceptions of the general public, which makes it more representative of actual user experience.
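
As one example of managing that variability, here is a sketch of screening out unreliable responses using known “gold” trials before aggregation. ITU-T P.808 describes related screening mechanisms such as trapping questions, but the column names, toy data, and pass-rate threshold below are illustrative assumptions rather than a prescribed procedure.

```python
# Sketch of screening crowdsourced responses: drop listeners who fail obvious
# "gold" trials (e.g., identical clips that should be rated as no difference).
# Column names and the 0.5 pass-rate threshold are illustrative assumptions.
import pandas as pd

responses = pd.DataFrame({
    "listener_id": [1, 1, 2, 2, 3, 3],
    "is_gold":     [True, False, True, False, True, False],
    "passed_gold": [True, False, False, False, True, False],  # meaningful only where is_gold
    "rating":      [0, +2, -2, +1, 0, +2],
})

# Keep listeners who pass most of their gold trials.
gold = responses[responses["is_gold"]]
pass_rate = gold.groupby("listener_id")["passed_gold"].mean()
reliable = pass_rate[pass_rate >= 0.5].index

# Aggregate only the real (non-gold) trials from reliable listeners.
screened = responses[~responses["is_gold"] & responses["listener_id"].isin(reliable)]
print(screened)
```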

Key benefits:

  • Recruitment of many diverse listeners globally for broad coverage of accents and languages
  • Limited investment of resources (time and money) on the part of the data science and AI teams
  • Rapid deployment of the testing, including automation with many cloud-based crowdsourcing tools and platforms
  • Alignment on scientific methods and metrics to obtain comparable results
BabbleLabs advantage with subjective listening tests and metrics

As a company leading the way in speech science and AI research, we spend much of our time and effort developing better models and algorithms. We also continually look to improve the quality of assessment for our models, which includes listening in representative use-case settings. BabbleLabs chose the crowdsourcing approach to expedite our speech AI technology development and improve the efficacy of our models. Our scalable, listener-based testing approach and the quantitative metric we developed enable us to assess algorithmic improvements quickly and effectively with diverse listeners globally. Further, we can judiciously invest in value-added algorithm and model development steps to serve our customers and partners quickly.

Want to hear how our models perform? Try our Clear Edge for Client speech enhancement product, developed using this human-in-the-loop approach. Clear Edge for Client works with any communication or collaboration application, enhancing speech and removing background noise better than any other software available.
