deep_speech_thumb

Deep Speech: Accurate Speech Recognition with GPU-Accelerated Deep Learning

Speech recognition is an established technology, but it tends to fail when we need it the most, such as in noisy or crowded environments, or when the speaker is far away from the microphone. At Baidu we are working to enable truly ubiquitous, natural speech interfaces. In order to achieve this, we must improve the accuracy of speech recognition, especially in these challenging environments. We set out to make progress towards this goal by applying Deep Learning in a new way to speech recognition.

Figure 1: The structure of our deep neural network, showing the layers (top to bottom) and how we parallelize training across GPUs (left to right). The fourth layer is a bidirectional recurrent layer. Blue and red arrows indicate the forward and backward direction and the communication required between GPUs in these layers.
Figure 1: The structure of our deep neural network, showing the layers (top to bottom) and how we parallelize training across GPUs (left to right). The fourth layer is a bidirectional recurrent layer. Blue and red arrows indicate the forward and backward direction and the communication required between GPUs in these layers.

Deep Learning has transformed many important tasks; it has been successful because it scales well: it can absorb large amounts of data to create highly accurate models. Indeed, most industrial speech recognition systems rely on Deep Neural Networks as a component, usually combined with other algorithms. Many researchers have long believed that Deep Neural Networks (DNNs) could provide even better accuracy for speech recognition if they were used for the entire system, rather than just as the acoustic modeling component. However, it has proven difficult to find an end-to-end speech recognition system based on Deep Learning that improves on the state of the art.

Model and Data Co-design

One of the reasons this has been difficult is that training these networks on large datasets is computationally very intensive. The process of training DNNs is iterative: we instantiate ideas about models in computer code that trains a model, then we train the model on a training set and test it, which gives us new ideas about how to improve the model or training set. The latency of this loop is the rate limiting step that gates progress. Our models are relatively large, containing billions of connections, and we train them on thousands of hours of data, which means that training our models takes a lot of computation. Continue reading

cuda_spotlight

CUDA Spotlight: GPU-Accelerated Speech Recognition

Ian-Lane-CMUThis week’s Spotlight is on Dr. Ian Lane of Carnegie Mellon University. Ian is an Assistant Research Professor and leads a speech and language processing research group based in Silicon Valley. He co-directs the CUDA Center of Excellence at CMU with Dr. Kayvon Fatahalian.

The following is an excerpt from our interview (read the complete Spotlight here).

NVIDIA: Ian, what is Speech Recognition?
Ian: Speech Recognition refers to the technology that converts an audio signal into the sequence of words that the user spoke. By analyzing the frequencies within a snippet of audio, we can determine what sounds within spoken language a snippet most closely matches, and by observing sequences of these snippets we can determine what words or phrases the user most likely uttered.

Speech Recognition spans many research fields, including signal processing, computational linguistics, machine learning and core problems in computer science, such as efficient algorithms for large-scale graph traversal. Speech Recognition also is one of the core technologies required to realize natural Human Computer Interaction (HCI). It is becoming a prevalent technology in interactive systems being developed today.

NVIDIA: What are examples of real-world applications?
Ian: In recent years, speech-based interfaces have become much more prevalent, including applications such as virtual personal assistants, which include systems such as Siri from Apple or Google Voice Search, as well as speech interfaces for smart TVs and in-vehicle systems. Continue reading