Part III: What’s Missing?
I have long wondered why we see a proliferation of video cameras without sound capture. Sound is fundamentally cheaper to capture and there are many potential benefits of capturing the sounds of public spaces, with or without video:
- Traffic monitoring
- Assessment of road conditions: the sound of tires on pavement depends directly on the amount of water or snow on the road surface
- Estimation of weather: wind, rain and thunder have distinctive sounds
- Localization of loud vehicles, drones and aircraft
- Detection and localization of explosions, gunshots, and other anomalies.
In fact, with the wide adoption of deep learning methods, we have even more powerful tools for extracting this kind of information from the cacophony of sounds found in public. Since microphones are much less expensive than cameras and audio needs far less bandwidth than video, a mesh of microphones in public spaces could be a cost-effective source of priceless insights to make our environment safer, healthier, and more efficient. The same potential exists for recording in the home, on the factory floor, in transportation systems, in office settings, hospitals and other complex environments.
So why don’t we do it? Principally because it is illegal!
In most jurisdictions around the world, it is forbidden to record a private conversation — even one in a public space — unless some or all participants in the conversation consent to recording. I agree strongly with this protection of privacy, but it does make recording of sound for other purposes quite problematic. Installing audio recording equipment in public places would inevitably pick up private conversations without permission.
Note: there are some experimental audio mesh systems for localizing events like gunshots, but these must be carefully engineered to capture only the gunshot information. That specialization inhibits innovative reuse of the data for other legitimate purposes.
However, with a technology like Clear Cloud that can fully separate the noise from the speech, the speech can be cut out before any sounds are recorded or heard by humans.
To appreciate the thoroughness of the noise reduction, listen to the version of the video whose audio track contains just the removed noise. This is simply the original signal minus the enhanced speech signal. Here is the video without the speech:
The speech removal is so nearly perfect that the original speech is completely unintelligible, but every non-speech sound is perfectly retained. At BabbleLabs, we are coming to the realization that speechless noise has key applications, just as noiseless speech does. In some uses, it may make sense to distribute audio with the speech entirely isolated in one audio channel or set of channels, and all the noise isolated in another channel or set of channels. Then the end user or end application can choose the appropriate mix of speech and noise.
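To make the arithmetic concrete, here is a minimal Python sketch of the "original minus enhanced speech" idea and of remixing the two separated channels. It is an illustration only, not BabbleLabs' implementation: the function names `residual` and `remix`, the gain parameters, and the sample values are all hypothetical, and it assumes the enhanced speech has already been produced by some separation step (not shown) and is time-aligned with the original at the same sample rate.

```python
def residual(original, enhanced_speech):
    """Return the noise-only track: original minus enhanced speech, per sample.

    Assumes both signals are time-aligned lists of samples of equal length.
    """
    if len(original) != len(enhanced_speech):
        raise ValueError("signals must have the same length")
    return [o - s for o, s in zip(original, enhanced_speech)]


def remix(speech, noise, speech_gain=1.0, noise_gain=1.0):
    """Mix the separated speech and noise channels with caller-chosen gains."""
    if len(speech) != len(noise):
        raise ValueError("channels must have the same length")
    return [speech_gain * s + noise_gain * n for s, n in zip(speech, noise)]


# Hypothetical sample values, chosen only for illustration.
original = [0.5, -0.25, 0.75, 0.0]
enhanced_speech = [0.25, -0.25, 0.5, 0.25]

# Speechless noise: what remains once the enhanced speech is subtracted.
noise = residual(original, enhanced_speech)

# A privacy-preserving consumer would keep only the noise channel.
speech_free = remix(enhanced_speech, noise, speech_gain=0.0)

# A listener who wants clean speech would do the opposite.
noise_free = remix(enhanced_speech, noise, noise_gain=0.0)
```

With both gains left at 1.0, `remix` reconstructs the original signal, which is why distributing the two channels separately loses nothing: the end application decides the mix.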
The raw audio could either be streamed through cloud-based speech-removal or the audio could be cleansed in-place, right at the point of capture by the microphone, ensuring maximal privacy.
You can probably tell that I’m excited about all these interesting use cases we’re uncovering for BabbleLabs — audio restoration, near-professional audio noise reduction, even new “smart city” applications that protect privacy. I think we’re just at the start of an explosion of new smart audio applications that serve human needs for understanding, for health, and for productivity. Just by digging a little deeper into the underlying structure of sound and speech, we’re finding treasures.