Speech recognition is an established technology, but it tends to fail when we need it most, such as in noisy or crowded environments, or when the speaker is far from the microphone. At Baidu we are working to enable truly ubiquitous, natural speech interfaces. To achieve this, we must improve the accuracy of speech recognition, especially in these challenging environments. We set out to make progress toward this goal by applying Deep Learning to speech recognition in a new way.
Deep Learning has transformed many important tasks. It has been successful because it scales well: it can absorb large amounts of data to create highly accurate models. Indeed, most industrial speech recognition systems rely on Deep Neural Networks (DNNs) as a component, usually combined with other algorithms. Many researchers have long believed that DNNs could provide even better accuracy for speech recognition if they were used for the entire system, rather than just as the acoustic modeling component. However, it has proven difficult to find an end-to-end speech recognition system based on Deep Learning that improves on the state of the art.
Model and Data Co-design
One reason this has been difficult is that training these networks on large datasets is very computationally intensive. The process of training DNNs is iterative: we express ideas about models in code, train the model on a training set, and test it, which gives us new ideas about how to improve the model or the training set. The latency of this loop is the rate-limiting step that gates progress. Our models are relatively large, containing billions of connections, and we train them on thousands of hours of data, so training them takes a lot of computation.