During my training as an engineer, one of the most enduring lessons I had to learn was "always check your assumptions". This maxim applies in many circumstances: all too often, you find a mismatch between how things should behave and how they actually function. And this leads us to our main subject — why those of us building BabbleLabs, a speech company, care so deeply about noise in speech. We care so much that we derived our company's name from noise; we are BabbleLabs, not SpeechLabs, after all.
When I hear stories about computers achieving better-than-human speech recognition results, I wonder how that can be true when I still cannot reliably dictate a phone number to my phone. Even if short message transcription works in the comfort of your quiet office, it completely falls apart in your car.
These observations are screaming, Check Your Assumptions! It turns out that most automatic speech recognition (ASR) work has historically focused on anechoic conditions (i.e., without echoes or reverberation). Making ASR robust to real-world environments has been a second step, built on top of the work done for anechoic ASR. To be clear, that approach is not wrong — you need to crawl before you can walk. But it has inherent limitations. At BabbleLabs, we have started from the other end of the problem: the noise itself.
Noise, of course, means a lot of things: additive noise, modulated ...