Enhancing Speech to Solve the Pervasive Problem in Conferencing

Seventy years ago, the journalist William H. Whyte coined a popular adage, “The single biggest problem in communication is the illusion that it has taken place.” Regardless of who the quote is ascribed to (sometimes even George Bernard Shaw is given credit), it gets at the perennial tension between the necessity of communication and the daunting difficulty of making it happen. This is especially true in large organizations with distributed teams.

Large organizations emerge because they make humans more effective. Corporations, volunteer groups and the military all harness the coordinated energy and diverse talents of teams to create benefits unavailable to individuals. Everything that organizations need for success – shared vision, efficient allocation of resources, coordinated action, communal learning processes – is ultimately built on investment in good communication.

How does the modern organization communicate?
With a marvelous and complex diversity of methods – face-to-face meetings, mail, email, texts, live meetings, phone calls, video and audio conferences, video broadcasts and more. While many of these methods are asynchronous, live video and especially live audio are particularly pervasive, yet often problematic.

We can roughly break this list of communication methods down into two broad categories – non-real-time content sharing methods and real-time audio-video methods. Within audio-video collaboration channels, it’s pretty clear that audio is central. After all, you can have a productive audio conferencing experience without video, but video conferencing without good audio is sadly ineffective. All these tools play different roles in the overall team collaboration experience. Text ...

Is Speech Recognition Ready for a Noisy World?

Speech-centric user interfaces abound. Siri on the iPhone introduced us to the potential for using speech to control our phones. Amazon’s Alexa service brought a wealth of new information services into our living rooms. Google’s high-quality speech recognition now spans phones and smart speakers and is starting to bring on-the-fly speech translation within our reach. On the surface, it seems that speech interfaces are now ready for the real world.

The reality, alas, is not so rosy. The real world is a chaotic place, where loud background noise, acoustic reverberation and faulty communications channels seriously impair understanding. Both human-to-human and human-to-machine communications suffer from severe limitations in comprehensibility. It is worst, naturally, when the noise is loudest, for example, outdoors, in crowded cafes, in cars, in the kitchen and on the factory floor. Moreover, comprehension by people and by machines (automatic speech recognition or ASR) is worse when there is little context. We often have an easier time understanding a whole sentence than a short sequence of words because the longer utterance gives us more context to use in disambiguating the speech.

A simple experiment illustrates the problem. I recently captured a string of voice commands in a noisy and in a noise-free environment. For the noisy environment, I played back recorded crowd noise at the same power level as my speech, which gives a 0 dB signal-to-noise ratio. I fed the audio stream into the excellent IBM Watson speech recognition system. ...
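
For readers who want to reproduce a 0 dB condition like the one in this experiment, here is a minimal sketch in plain Python of how noise can be scaled to a target signal-to-noise ratio before it is mixed with speech. It is an illustration only, not the procedure used for the recordings described here: the function names (`power`, `snr_db`, `mix_at_snr`) are hypothetical, and the two sine tones are synthetic stand-ins for real speech and crowd noise. The ratio is computed as 10 · log10(P_speech / P_noise), so 0 dB means the two signals carry equal power.

```python
import math

def power(samples):
    """Mean power (mean of squared amplitudes) of a sample sequence."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))

def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise to reach target_snr_db against the speech, then add them.

    Returns (mixed, scaled_noise). Both inputs are assumed to have equal length.
    """
    # The gain g must satisfy P_speech / (g^2 * P_noise) = 10^(target / 10).
    ratio = 10.0 ** (target_snr_db / 10.0)
    gain = math.sqrt(power(speech) / (power(noise) * ratio))
    scaled_noise = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled_noise)]
    return mixed, scaled_noise

# Synthetic stand-ins (assumptions): one second of audio at 16 kHz.
rate = 16000
speech = [0.5 * math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
noise = [0.1 * math.sin(2 * math.pi * 1000 * t / rate) for t in range(rate)]

# 0 dB: the noise is scaled up until it carries the same power as the speech.
mixed, scaled_noise = mix_at_snr(speech, noise, 0.0)
```

With real recordings, the same scaling would be applied to decoded sample arrays, and the power would usually be measured only over the segments where speech is actually present.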

Investing in Speech

Speech is an essential human behavior. It ties our families and our societies together at the most fundamental level. Speech isn’t easy, however. It takes years to learn, and even modest impairments from an unfamiliar accent, noise or reverberation can render it incomprehensible. But when it works, people can harvest an enormous amount of information from a single utterance. From one snippet of speech, we can identify the speaker or at least the speaker’s demographics (gender, age, national origin); we can pick up on emotions; and we can (usually) make out the words. We even pick out clues about the immediate environment of the speaker. It’s a treasure trove of information.

We’re right in the middle of a fundamental technology transformation.
BabbleLabs invents and invests in technologies fundamental to improving our understanding of speech. We enhance the clarity of recorded and transmitted speech. We recognize speech intent. We are extracting an increasing range of insights from speech streams. The volume of speech from billions of people multiplied by the value of speech — especially critical communications within organizations and with customers — creates a huge latent market. The voice and speech recognition market alone is expected to produce $30B in revenue over the next seven years (Tractica 2018). And that is just the tip of the iceberg when it comes to leveraging improved speech communications and analytics.

We’ve done a lot in our first two years.

We have laid a technical foundation of core algorithms, huge datasets ...
