Like many people in the developed world, I spend every day surrounded by technology — laptops, phones, car touch-screens, networked gadgets, and smart TVs. Of course I want to interact with all these things to get my work done and to enjoy their services and entertainment. And I have learned to use them pretty well — I have learned to type on keyboards, move mice, finger touchpads, pinch, swipe, and peck. The devices themselves have not learned much of anything!
A wave of change is sweeping towards us, with the potential to dramatically shift the basic nature of interactions between people and our electronic environment. Deep learning-based algorithms are transforming many aspects of computation by enabling good solutions to bigger, more complex, and more ambiguous problems than before. The transformations are particularly striking in vision, language analysis, pattern finding in large data sets, and speech. I have written a lot about vision and the potential for neural network vision algorithms to meet or exceed human capabilities in recognizing and tracking objects, summarizing scenes and sequences, and finding safe navigational channels in two and three dimensions. These vision systems are becoming veritable substitutes for human vision in tasks like driving, surveillance, and inspection.
Deep learning advances in speech have a completely different character from those in vision — they are rarely about substituting for humans, but center instead on paying more attention to humans. Speech recognition, speaker identification, and speech enhancement are all functions that enrich human-machine interaction. The ...