The conventional wisdom in deep learning is that GPUs are essential compute tools for advanced neural networks. I’m always skeptical of conventional wisdom, but for BabbleLabs, it seems to hold true. The sheer performance of GPUs, combined with their robust support in deep learning programming environments, allows us to train bigger, more complex networks with vastly more data and deploy them commercially at low cost. GPUs are a key element in BabbleLabs’ delivery of the world’s best speech enhancement technology.
The deep learning computing model gives us powerful new tools for extracting fresh insights from masses of complex data, data that has long defied good systematic analysis with explicit algorithms. The model has already transformed vision and speech processing, rendering most conventional classification and generation methods obsolete in just the last five years. Deep learning is now being applied, often with spectacular results, across transportation, public safety, medicine, finance, marketing, manufacturing, and social media. This already makes it one of the most significant developments in computing in the past two decades. In time, we may rate its impact in the same category as the "superstars" of tech transformation: the emergence of high-speed Internet and smart mobile devices.
The promise of deep learning is matched with a curse: it demands huge data sets and correspondingly huge computing resources for successful training and use. For example, a single full training of BabbleLabs' most advanced speech enhancement network requires between 10¹⁹ and 10²⁰ floating point operations, using hundreds of thousands of hours of labeled noisy speech. Conventional processors are not adequate for this task. A high-end, general-purpose CPU, even one with high-end vector floating point, would still require more than five years to complete one training run. Without higher training performance, BabbleLabs couldn't get its product off the ground.
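To see where the multi-year figure comes from, here is a minimal back-of-envelope sketch. The sustained CPU rate is an assumption for illustration, not a measured BabbleLabs number:

```python
# Back-of-envelope check of the CPU training-time claim.
# The sustained throughput below is an illustrative assumption, not a benchmark.

SECONDS_PER_YEAR = 365 * 24 * 3600

def training_years(total_flops: float, sustained_flops_per_sec: float) -> float:
    """Wall-clock years for one training run at a given sustained rate."""
    return total_flops / sustained_flops_per_sec / SECONDS_PER_YEAR

# Assume a vectorized high-end CPU sustains on the order of 100 GFLOP/s
# on this workload (hypothetical figure).
cpu_rate = 1e11

for total in (1e19, 1e20):
    print(f"{total:.0e} FLOPs on CPU: {training_years(total, cpu_rate):.1f} years")
```

At 10¹⁹ operations this assumed rate already implies roughly three years of continuous compute, and the upper end of the range pushes past thirty, which is why a single CPU is a non-starter.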
BabbleLabs is a tech startup built entirely around the breakthrough potential of deep learning to solve one of the thorniest problems in everyday life: “How do we make ourselves understood?” Automatic speech recognition for web services is garnering lots of attention right now. Amazon Alexa, Google Voice and Apple Siri all address a useful task, yet those services represent a tiny fraction of the potential for speech processing to improve both human-to-human communication and human-machine interaction. We’re uncovering a world of opportunities in noise reduction, speech recognition, speaker identification and authentication, and real-time dialogue.
We see deep desire for speech systems that handle noise and reverberation better, not only by reducing the distracting interference, but also by improving actual intelligibility. We see deep desire for low latency systems that operate locally in phones, cars, and appliances for more natural control and more robustness. We see deep desire for more personalized speech systems that recognize and adapt to individual voices, vocabularies, and noise environments, while enhancing security and privacy.
Historically, speech developers have made algorithm simplifications in order to fit the necessary processing onto available digital signal processors. Moreover, the common algorithms use little temporal context for speech enhancement and often work with idealized models of acoustics. These cannot capture all the complexities and peculiarities of the ear, the brain, and real languages. Higher accuracy requires a new class of sophisticated neural network-based speech algorithms. And the only way to build and deploy such algorithms quickly and cost-effectively is with enormous attention to computing efficiency at every level: the networks, the data, the training methods, and the computing platforms.
On the first day of BabbleLabs' existence, in October 2017, we installed deskside servers, each equipped with dual NVIDIA GeForce GTX 1080 Ti GPUs, in the home of each technical founder. This is how we started developing the core algorithms. When we took office space in San Jose, we brought most of those machines together and added more GPUs, so that we now have six personal compute servers with a total of 19 GTX 1080 Ti GPUs, up to four per server. These machines are typically used for initial model development and experimental training.
In order to grow our capacity beyond what we can handle in our offices, we have turned to a set of cloud compute service providers, including AWS and Google Cloud, typically using a dozen additional machines with up to eight GPUs each. We carefully benchmark each NVIDIA GPU to understand the particular strengths of each implementation and apply different GPUs to different tasks. For example, we found that the NVIDIA K80's combination of relatively large memory per unit of compute and affordable cost per operation makes particular sense for inference tasks, so we leverage it for an important subset of our production use.
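The "memory per unit of compute" trade-off can be made concrete with published peak specifications. The figures below are approximate public spec-sheet numbers used for illustration, not BabbleLabs benchmark results:

```python
# Illustrative comparison of memory capacity relative to peak FP32 throughput
# for the GPUs discussed here, using approximate published specs.
gpu_specs = {
    # name: (memory in GB, peak FP32 TFLOP/s)
    "Tesla K80 (both dies)": (24.0, 8.7),
    "GeForce GTX 1080 Ti":   (11.0, 11.3),
    "Tesla V100":            (16.0, 15.7),
}

def gb_per_tflop(mem_gb: float, tflops: float) -> float:
    """Memory available per unit of peak compute."""
    return mem_gb / tflops

for name, (mem, tflops) in gpu_specs.items():
    print(f"{name}: {gb_per_tflop(mem, tflops):.2f} GB per peak TFLOP/s")
```

By this rough metric the K80 offers well over twice the memory per unit of peak compute of the newer parts, which is one way a slower, cheaper GPU can still be the right fit for memory-bound inference.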
Wide support for optimized deep learning programming environments is also essential. We run several environments across our machines, in order to support development in both the MATLAB Deep Learning Toolbox and Google TensorFlow in Python. We also develop optimized implementations of the most mission-critical inference functions, both in C with the NVIDIA cuDNN libraries and in native CUDA (for even better performance).
This rich computation environment has allowed us to push the envelope on model sophistication and training, and profoundly change what's possible in speech processing. In recent weeks we have turned to AWS p3.16xlarge instances, built on the latest NVIDIA Volta V100, to accelerate our training. By parallelizing training across eight GPUs, we have reduced our end-to-end training time to less than two weeks, even as we have expanded our training set to hundreds of thousands of hours of unique noisy speech. V100 access allows us both to extend the window of innovation for our architects before we lock down the network for final production training, and to pull in our product launch date. The much faster turnaround of V100 training and inference experiments lets us run more tests, explore more options for network structure, training, and verification, and improve our overall "time-to-insight" during development. Fully parallelized V100 use should allow us to turn around experiments up to 15x faster.
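The two-week, eight-GPU figure can be sanity-checked against the FLOP budget stated earlier. This sketch simply inverts the arithmetic to ask what sustained per-GPU rate the upper end of that budget implies; it assumes ideal parallel scaling, so it is a bound, not a measurement:

```python
# Sanity check: what sustained per-GPU rate does a two-week, 8-GPU run imply
# at the upper end of the stated 1e19-1e20 FLOP training budget?
# Assumes ideal scaling across GPUs (illustrative only).
def required_tflops_per_gpu(total_flops: float, days: float, n_gpus: int) -> float:
    """Sustained TFLOP/s each GPU must deliver to finish in the given time."""
    seconds = days * 24 * 3600
    return total_flops / (seconds * n_gpus) / 1e12

rate = required_tflops_per_gpu(1e20, days=14, n_gpus=8)
print(f"~{rate:.1f} TFLOP/s sustained per GPU")
```

The answer lands near 10 TFLOP/s sustained per GPU, comfortably within a V100's published peak throughput, which is consistent with the two-week claim.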
The GPUs also play a key role in deployment of these systems. Cost analysis of different cloud platforms, combined with benchmarking of our optimized production inference code, shows that the best current cloud provider and GPU option is almost 8x cheaper than running the same inference application on a low-cost non-GPU configuration. This in turn means we can price our speech-as-a-service product more aggressively and enable advanced speech in a wider variety of high-throughput cloud applications.
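A minimal sketch of this kind of cost comparison follows. The instance prices and throughputs are hypothetical placeholders chosen only to illustrate the method, not BabbleLabs' measured figures:

```python
# Sketch of a cost-per-unit-of-work comparison between a GPU and a non-GPU
# cloud configuration. All prices and throughputs below are hypothetical.
def cost_per_audio_hour(instance_price_per_hr: float, audio_hrs_per_hr: float) -> float:
    """Cloud cost to enhance one hour of audio on a given instance."""
    return instance_price_per_hr / audio_hrs_per_hr

# Hypothetical: a pricier GPU instance processes far more audio per hour.
gpu_cost = cost_per_audio_hour(instance_price_per_hr=3.0, audio_hrs_per_hr=240.0)
cpu_cost = cost_per_audio_hour(instance_price_per_hr=0.4, audio_hrs_per_hr=4.0)

print(f"GPU: ${gpu_cost:.4f}/audio-hr, CPU: ${cpu_cost:.4f}/audio-hr, "
      f"ratio {cpu_cost / gpu_cost:.0f}x")
```

The point of the exercise is that the raw instance price is misleading: what matters for a high-throughput service is cost per unit of work, and a GPU's throughput advantage can outweigh its higher hourly rate.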
A close look at the daunting computational challenge of advanced speech highlights the need for massive computation combined with great convenience. Leveraging the full capabilities of deep learning compute platforms is absolutely key to BabbleLabs’ success. Having access to NVIDIA’s GPU solutions has been critical to our mission — developing world-class speech enhancement technology and innovative voice-centric applications.