Living in a speech-triggered world

Like many people in the developed world, I spend every day surrounded by technology — laptops, phones, car touchscreens, networked gadgets, and smart TVs. Of course I want to interact with all these things to get my work done and to enjoy their services and entertainment. And I have learned to use them pretty well — I have learned to type on keyboards, move mice, finger touchpads, pinch, swipe, and peck. The devices themselves have not learned much of anything!

A wave of change is sweeping towards us, with the potential to dramatically shift the basic nature of interactions between people and our electronic environment. Deep learning-based algorithms are transforming many aspects of computation, by allowing good solutions to bigger, more complex, more ambiguous problems than before. The transformations are particularly striking across vision, language analysis, pattern finding in large data sets, and speech. I have written a lot about vision and the potential for neural network vision algorithms to meet or exceed human capabilities in recognizing and tracking objects, summarizing scenes and sequences, and finding safe navigational channels in two and three dimensions. These vision systems are becoming veritable substitutes for human vision in tasks like driving, surveillance, and inspection.

Deep learning advances in speech have a completely different character from those in vision — they are rarely about substituting for humans, but center instead on paying more attention to humans. Speech recognition, speaker identification, and speech enhancement all serve to improve human-machine interaction. The ...

How well does speech recognition cope with noise?

Speech is rapidly becoming a primary interface for the connected world. Apple’s Siri, Amazon’s Alexa services running on Echo devices, and a wealth of cloud-based speech recognition APIs are all bringing speech-based user interfaces into the mainstream. Moreover, the spectrum of information services is likely to expand rapidly over the next couple of years, bringing speech up to the level of mobile touchscreen web browsing and phone apps.

The speech interface revolution is starting for good reason — typing on keyboards, clicking mice, and touching glass are neither natural nor efficient for humans. Most importantly, they usually need our eyes engaged too, not just to take in the responses, but also to make accurate inputs. Speaking is convenient and quite high-bandwidth. Humans can routinely talk at 140-160 words per minute, but few of us can type that fast, especially while driving (though we seem to keep trying). With the advent of good automatic speech recognition (ASR) based on deep neural networks, we can start to exploit speech interfaces in all the situations where eyes and hands aren’t available, and where natural language can plausibly express the questions, commands, responses, and discussion at the heart of key interactions.

Many of the worst impediments to using speech widely come from noise problems. Noise impacts the usability of all interfaces and all likely speech application types. Siri and Alexa-based services work remarkably well in a quiet room, but performance degrades significantly as noise goes up.
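That degradation is usually quantified with word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the recognizer's output into the reference transcript, divided by the reference length. A minimal sketch of the standard computation (via Levenshtein distance over words — the phrases below are illustrative, not from any benchmark):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed as word-level Levenshtein edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word against a five-word reference -> WER of 0.2
print(word_error_rate("turn on the kitchen lights", "turn on the lights"))
```

Scoring the same recordings clean and with added background noise, and comparing the two WER figures, is the usual way to put a number on how much a given recognizer suffers in the wild.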

My team at BabbleLabs has ...

Building a speech start-up

A startup is a source of excitement at any time, but we don't live at just "any time". Right now, we are experiencing an era of particularly rapid evolution in computing technology, and a period of dramatic emergence of new business models and real-world applications. The blanket term "AI" has captured the world's imagination, and it is tempting to dismiss much of the breathless enthusiasm (and doomsaying) as just so much hype. While there is a large dose of hype circulating, we must not overlook the very real and very potent emergence of deep learning, or neural network, methods as a substantially fresh approach to computing. The pace of improvement in algorithms, applications, and computing platforms over just the past five years shows that this approach — more statistical, more parallel, and more suitable for complex, essentially ambiguous problems — really is a big deal.

Not surprisingly, a great deal of today's deep learning work is being directed at problems in computer vision: locating, classifying, segmenting, tagging, and captioning images and videos. Roughly half of all deep learning start-up companies are focused on one sort of vision problem or another. Deep learning is a great fit for these problems. Other developers and researchers have fanned out across a broad range of complex, data-intensive tasks in modeling financial markets, network security, recruiting, drug discovery, and transportation logistics. One domain is showing particular promise: speech processing. Speech has all the key characteristics of big data and deep complexity that ...
