Deep learning is becoming extremely popular due to several breakthroughs in various well-known tasks in artificial intelligence. For example, at the ImageNet Large Scale Visual Recognition Challenge, the introduction of deep learning algorithms into the challenge reduced the top-5 error by 10% in 2012. Every year since then, deep learning models have dominated the challenges, significantly reducing the top-5 error rate every year (see Figure 1). In 2015, researchers have trained very deep networks (for example, the Google “inception” model has 27 layers) that surpass human performance.
Moreover, at this year’s Computer Vision and Pattern Recognition (CVPR) conference, deep neural networks (DNNs) were being adapted to increasingly more complicated tasks. For example, in semantic segmentation, instead of predicting a single category for a whole image, a DNN is trained to classify each pixel in the image, essentially producing a semantic map indicating every object and its shape and location in the given image (see Figure 2).
The need to train their deep neural networks as fast as possible led the Evolving Artificial Intelligence Laboratory at the University of Wyoming to harness the power of NVIDIA Tesla GPUs starting in 2012 to accelerate their research.
“The speedups GPUs provide for training deep neural networks are well-documented and allow us to train models in a week that would otherwise take months,” said Jeff Clune, Assistant Professor, Computer Science Department and Director of the Evolving Artificial Intelligence Laboratory. “And algorithms continuously improve. Recently, NVIDIA’s cuDNN technology allowed us to speed up our training time by an extra 20% or so.”
Clune’s Lab, which focuses on evolving artificial intelligence with a major focus on large-scale, structurally organized neural networks, has garnered press from some of the largest media outlets, including BBC, National Geographic, NBC News, The Atlantic and featured on the cover of Nature in May 2015.
[The following video shows off work from the Evolving AI Lab on visualizing deep neural networks. Keep reading to learn more about this work!]
For this Spotlight interview, I had the opportunity to talk with Jeff Clune and two of his collaborators, Anh Nguyen, a Ph.D. student at the Evolving AI Lab and Jason Yosinski, a Ph.D. candidate at Cornell University.
Brad: How are you using deep neural networks (DNNs)?
We have many research projects involving deep neural networks. Our Deep Learning publications to date involve better understanding DNNs. Our lab’s research covers: Continue reading →
This year’s ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is about to begin. Every year, organizers from the University of North Carolina at Chapel Hill, Stanford University, and the University of Michigan host the ILSVRC, an object detection and image classification competition, to advance the fields of machine learning and pattern recognition. Competitors are given more than 1.2 million images from 1,000 different object categories, with the objective of developing the most accurate object detection and classification approach. After the data set is released, teams have roughly three months to create, refine and test their approaches. This year, the data will be released on August 14, and the final result submission deadline is November 13. Read on to learn how competing teams can get free access to the latest GPU-accelerated cloud servers from NVIDIA and IBM Cloud.
About the Challenge
Teams compete annually to develop the most accurate recognition systems, and each year the sub-tasks are more complex and challenging. Since none of the algorithms developed to date are able to classify all images correctly 100% of the time, classification accuracy is measured in terms of error rates. In this way, the winning submission will exhibit the lowest overall percentage of incorrectly classified images. When ILSVRC started in 2010, teams had a single assignment: classify the contents of an image (identify objects) and provide a list of the top five classifications with their respective probabilities. Since then, the complexity of assignments has grown as the organizers have added sub-tasks for classification with localization and with detection (identifying objects and specifying their locations with bounding boxes). Specifying bounding boxes is important because most pictures contain multiple objects. To illustrate the requirements of this type of task, Figure 1 shows two sample ImageNet training images annotated with their object categories and bounding boxes. Continue reading →
Labellio is the world’s easiest deep learning web service for computer vision. It aims to provide a deep learning environment for image data where non-experts in deep learning can experiment with their ideas for image classification applications. Watch our video embedded here to see how easy it is.
The challenges in deep learning today are not just in configuring hyperparameters or designing a suitable neural network structure; they are also in knowing how to prepare good training data to fine-tune a well-known working classification model, and in knowing how to set up a computing environment without hassle. Labellio aims to address these problems by supporting the deep learning workflow from beginning to end.
The Labellio training data collector lets users upload images, or downloads images from the internet from specified URL lists or keyword searches on external services like Flickr and Bing. While some users have their own domain-specific training data in hand already, others don’t. Labellio’s data ingestion capabilities help both types of users start building their own deep learning models immediately. It is extremely important to assign the correct class to each training image to build a more accurate classification model, so Labellio’s labelling capability helps users cleanse input training data.
Labellio helps users build and manage multiple classification models. Creating a classification model is typically not just a one-time operation; you need trial and error to develop the most accurate model by replacing some of the training data. Continue reading →
Note: This is the final part of a detailed three-part series on machine translation with neural networks by Kyunghyun Cho. You may enjoy part 1 and part 2.
In the previous post in this series, I introduced a simple encoder-decoder model for machine translation. This simple encoder-decoder model is excellent at English-French translation. However, in this post I will briefly discuss the weakness of this simple approach, and describe a recently proposed way of incorporating a soft attention mechanism to overcome the weakness and significantly improve the translation quality.
Furthermore, I will present some more recent works that utilize this neural machine translation approach to go beyond machine translation of text, such as image caption generation and video description generation. I’ll finish the blog series with a brief discussion of future research directions and a pointer to the open source code implementing these neural machine translation models.
The Trouble with Simple Encoder-Decoder Architectures
In the encoder-decoder architecture, the encoder compresses the input sequence as a fixed-size vector from which the decoder needs to generate a full translation. In other words, the fixed-size vector, which I’ll call a context vector, must contain every single detail of the source sentence. Intuitively, this means that the true function approximated by the encoder has to be extremely nonlinear and complicated. Furthermore, the dimensionality of the context vector must be large enough that a sentence of any length can be compressed.
In my paper “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches” presented at SSST-8, my coauthors and I empirically confirmed that translation quality dramatically degrades as the length of the source sentence increases when the encoder-decoder model size is small. Together with a much better result from Sutskever et al. (2014), using the same type of encoder-decoder architecture, this suggests that the representational power of the encoder needed to be large, which often means that the model must be large, in order to cope with long sentences (see Figure 1).
Of course, a larger model implies higher computation and memory requirements. The use of advanced GPUs, such as NVIDIA Titan X, indeed helps with computation, but not with memory (at least not yet). The size of onboard memory is often limited to several Gigabytes, and this imposes a serious limitation on the size of the model. (Note: it’s possible to overcome this issue by using multiple GPUs while distributing a single model across those GPUs, as shown by Sutskever et al. (2014). However, let’s assume for now that we have access to a single machine with a single GPU due to space, power and other physical constraints.)
Then, the question is “can we do better than the simple encoder-decoder based model?” Continue reading →
Originally trained as a veterinary surgeon, Chris Jewell, a Senior Lecturer in Epidemiology at Lancaster Medical School in the UK became interested in epidemics through his experience working on the foot and mouth disease outbreak in the UK in 2001. His work so far has been on livestock epidemics such as foot and mouth disease, theileriosis, and avian influenza with government organizations in the UK, New Zealand, Australia, and the US. Recently, he has refocused his efforts into the human field where populations and epidemics tend to be larger and therefore need more computing grunt.
Epidemic forecasting centers around Bayesian inference on dynamical models, using Markov Chain Monte Carlo (MCMC) as the model fitting algorithm. As part of this algorithm Chris has had to calculate a statistical likelihood function which itself involves a large sum over pairs of infected and susceptible individuals. He is currently using CUDA technology to accelerate this calculation and enable real-time inference, leading to timely forecasts for informing control decisions.
“Without CUDA technology, the MCMC is simply too slow to be of practical use during a disease outbreak,” he says. “With the 380x speedup over a single core non-vector CPU code, real-time forecasting is now a reality!”
DIGITS is an interactive deep learning development tool for data scientists and researchers, designed for rapid development and deployment of an optimized deep neural network. NVIDIA introduced DIGITS in March 2015, and today we are excited to announce the release of DIGITS 2, which includes automatic multi-GPU scaling. Whether you are developing an optimized neural network for a single data set or training multiple networks on many data sets, DIGITS 2 makes it easier and faster to develop optimized networks in parallel with multiple GPUs.
Deep learning uses deep neural networks (DNNs) and large datasets to teach computers to detect recognizable concepts in data, to translate or understand natural languages, interpret information from input data, and more. Deep learning is being used in the research community and in industry to help solve many big data problems such as similarity searching, object detection, and localization. Practical examples include vehicle, pedestrian and landmark identification for driver assistance; image recognition; speech recognition; natural language processing; neural machine translation and mitosis detection.
DNN Development and Deployment with DIGITS
Developing an optimized DNN is an iterative process. A data scientist may start from a popular network configuration such as “AlexNet” or create a custom network, and then iteratively modify it into a network that is well-suited for the training data. Once they have developed an effective network, data scientists can deploy it and use it on a variety of platforms, including servers or desktop computers as well as mobile and embedded devices such as Jetson TK1 or Drive PX. Figure 1 shows the overall process, broken down into two main phases: development and deployment.
Note: This is part two of a detailed three-part series on machine translation with neural networks by Kyunghyun Cho. You may enjoy part 1 and part 3.
In my previous post, I introduced statistical machine translation and showed how it can and should be viewed from the perspective of machine learning: as supervised learning where the input and output are both variable-length sequences. In order to introduce you to neural machine translation, I spent half of the previous post on recurrent neural networks, specifically about how they can (1) summarize a sequence and (2) probabilistically model a sequence. Based on these two properties of recurrent neural networks, in this post I will describe in detail an encoder-decoder model for statistical machine translation.
Encoder-Decoder Architecture for Machine Translation
I’m not a neuroscientist or a cognitive scientist, so I can’t speak authoritatively about how the brain works. However, if I were to guess what happens in my brain when I try to translate a short sentence in English to Korean, my brain encodes the English sentence into a set of neuronal activations as I hear them, and from those activations, I decode the corresponding Korean sentence. In other words, the process of (human) translation involves the encoder which turns a sequence of words into a set of neuronal activations (or spikes, or whatever’s going on inside a biological brain) and the decoder which generates a sequence of words in another language, from the set of activations (see Figure 1).
This idea of encoder-decoder architectures is the basic principle behind neural machine translation. In fact, this type of architecture is at the core of deep learning, where the biggest emphasis is on learning a good representation. In some sense, you can always cut any neural network in half, and call the first half an encoder and the other a decoder.
Starting with the work by Kalchbrenner and Blunsom at the University of Oxford in 2013, this encoder-decoder architecture has been proposed by a number of groups, including the Machine Learning Lab (now, MILA) at the University of Montreal (where I work) and Google, as a new way to approach statistical machine translation [Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015]. (There is also older, related work by Mikel Forcada at the University of Alcante from 1997! [Forcada and Neco, 1997].) Although there is no restriction on which particular type of neural network is used as either an encoder or a decoder, I’ll focus on using a recurrent neural network for both.
Let’s build our first neural machine translation system! But, before I go into details, let me first show you a big picture of the whole system in Figure 2. Doesn’t it look scarily complicated? Nothing to worry about, as I will walk you through this system one step at a time.
Note: This is the first part of a detailed three-part series on machine translation with neural networks by Kyunghyun Cho. You may enjoy part 2 and part 3.
Neural machine translation is a recently proposed framework for machine translation based purely on neural networks. This post is the first of a series in which I will explain a simple encoder-decoder model for building a neural machine translation system [Cho et al., 2014; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013]. In a later post I will describe how an attention mechanism can be incorporated into the simple encoder-decoder model [Bahdanau et al., 2015], leading to the state-of-the-art machine translation model for a number of language pairs including En-Fr, En-De, En-Tr and En-Zh [Gulcehre et al., 2015; Jean et al., 2015]. Furthermore, I will introduce recent work which has applied this framework of neural machine translation to image and video description generation [Xu et al., 2015; Li et al., 2015].
Statistical Machine Translation
First, let’s start with a brief overview of machine translation. In fact, the name, machine translation, says everything. We want a machine to translate text in one language, which we will call the source sentence, to corresponding text in another language, which we call the target sentence. (Although ideally the machine should be able to translate a whole document from one language to another, let us concentrate in this blog post on sentence-level machine translation.)
There are multiple ways to build such a machine that can translate languages. For instance, we can ask a bilingual speaker to give us a set of rules transforming a source sentence into a correct translation. This is not a great solution, as you can imagine, because we don’t even know the set of rules underlying a single language, not to mention the rules underlying a pair of languages. It is simply hopeless to write an exhaustive set of rules for translating a source sentence into a correct translation. Hence, in this blog post, we focus on a statistical approach where those rules, either implicitly or explicitly, are automatically extracted from a large corpus of text.
This statistical approach to machine translation is called statistical machine translation. The goal is the same (build a machine that translates a sentence from one language to another), but we let the machine learn from data how to translate rather than design a set of rules for the machine (See Fig. 1 for a graphical illustration.) Learning is based on statistical methods, which should sound familiar to anyone who has taken a basic course on machine learning. In fact, statistical machine translation is nothing but a particular application of machine learning, where the task is to find a function that maps from a source sentence to a corresponding target.
Tell us about your research at The University of Arizona
We are working on developing a tool that can automatically identify various geological processes on the surface of Mars. Examples of geological processes include impact cratering and volcanic activity; however, these processes can generate landforms that look very similar, even though they form via vastly different mechanisms. For example, small impact craters and volcanic craters can be easily confused because they can both exhibit a prominent rim surrounding a central topographic depression.
Of particular interest to our research group is the automated mapping of volcanic rootless cones as Figure 2 shows. These landforms are generated by explosive interactions between lava and ground ice, and therefore mapping the global distribution of rootless cones on Mars would contribute to a better understanding of the distribution of near-surface water on the planet. However, to do this we must first develop algorithms that can correctly distinguish between landforms of similar appearance. This is a difficult task for planetary geologists, but we are already having great success by applying state-of-the-art artificial neural networks to data acquired by the High Resolution Imaging Science Experiment (HiRISE) camera, which is onboard the Mars Reconnaissance Orbiter (MRO) satellite.