
Accelerate Recommender Systems with GPUs

Wei Tan, Research Staff Member at IBM T. J. Watson Research Center, shares how IBM is using NVIDIA GPUs to accelerate recommender systems, which use ratings or user behavior to recommend new products, items or content to users. Recommender systems are important in applications such as recommending products on retail sites, recommending movies or music on streaming media services, and recommending news items or posts on social media and networking services. Wei Tan’s team developed cuMF, a highly optimized matrix factorization system that uses CUDA to accelerate recommendations in applications like these and more.

Brad: Can you talk a bit about your current research?

Wei: Matrix factorization (MF) is at the core of many popular algorithms, such as collaborative-filtering-based recommendation, word embedding, and topic modeling. Matrix factorization factors a sparse ratings matrix R (m-by-n, with N_z non-zero ratings) into an m-by-f matrix (X) and an f-by-n matrix (Θ^T), as Figure 1 shows.

Figure 1. Matrix factorization factors a sparse ratings matrix R (m-by-n, with N_z non-zero ratings) into an m-by-f matrix (X) and an f-by-n matrix (Θ^T).

Suppose we obtained m users’ ratings on n items (say, movies). If user u rated item v, we use r_{uv} as the non-zero element of R at position (u, v). We want to minimize the following cost function J. Continue reading
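The excerpt ends before the cost function itself; as a reference point, a standard regularized squared-error objective for this kind of factorization (a common textbook form, not necessarily the exact variant cuMF minimizes) is

    J = \sum_{(u,v)\,:\,r_{uv} \neq 0} \left( r_{uv} - x_u^\top \theta_v \right)^2
        + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{v} \lVert \theta_v \rVert^2 \right)

where x_u is the row of X for user u, θ_v is the column of Θ^T for item v, and λ is the regularization weight.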


Train Your Reinforcement Learning Agents at the OpenAI Gym

Today OpenAI, a non-profit artificial intelligence research company, launched OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go.

John Schulman is a researcher at OpenAI.

OpenAI researcher John Schulman shared some details about his organization, and how OpenAI Gym will make it easier for AI researchers to design, iterate and improve their next-generation applications. John studied physics at Caltech, and went to UC Berkeley for graduate school. There, after a brief stint in neuroscience, he studied machine learning and robotics under Pieter Abbeel, eventually homing in on reinforcement learning as his primary topic of interest. John lives in Berkeley, California, where he enjoys running in the hills and occasionally going to the gym.

What is OpenAI? What is your mission?

OpenAI is a non-profit artificial intelligence research company. Day to day, we are working on research projects in unsupervised learning and reinforcement learning. Our mission and long-term goal is to advance artificial intelligence in the ways that will maximally benefit humanity as a whole.

What is reinforcement learning? How is it different from supervised and unsupervised learning?

Figure: The agent-environment loop in reinforcement learning.

Reinforcement learning (RL) is the branch of machine learning that is concerned with making sequences of decisions. It assumes that there is an agent that is situated in an environment. At each step, the agent takes an action, and it receives an observation and reward from the environment. An RL algorithm seeks to maximize the agent’s total reward, given a previously unknown environment, through a learning process that usually involves lots of trial and error.
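To make the loop concrete, here is a minimal sketch using the Gym Python API as it shipped at launch; the CartPole-v0 environment and the random policy are illustrative choices, not part of the interview.

    import gym

    # Create an environment; the agent is situated in it.
    env = gym.make("CartPole-v0")
    observation = env.reset()

    total_reward = 0.0
    for t in range(1000):
        # The agent takes an action (here: a random one, standing in for a policy).
        action = env.action_space.sample()
        # The environment returns an observation and a reward at each step.
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            print("Episode finished after %d steps, total reward %.1f" % (t + 1, total_reward))
            break

An RL algorithm would replace the random action choice with a learned policy that improves as more reward is collected.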

The reinforcement learning problem sketched above, involving a reward-maximizing agent, is extremely general, and RL algorithms have been applied in a variety of different fields. They have been applied in business management problems such as deciding how much inventory a store should hold or how it should set prices. They have also been applied to robotic control problems, and rapid development is currently occurring in this area. The following video shows Hopper: a two-dimensional one-legged robot trained to hop forward as fast as possible with OpenAI Gym.

Continue reading


Optimizing Recurrent Neural Networks in cuDNN 5

Figure 1: cuDNN 5 + Torch speedup vs. Torch-rnn implementation, M40, Intel® Xeon® Processor E5-2698 Network A: RNN size 2560, input size 2560, 1 layer, Seq length 200, batch size 64. Network B: RNN size 256, input size 64, 3 layers, batch size 64. Network C: RNN size 256, input size 256, 1 layer, batch size 32, Seq length 1000

This week at GTC 2016, we announced the latest update to NVIDIA Deep Learning SDK, which now includes cuDNN 5. Version 5 offers new features, improved performance and support for the latest generation NVIDIA Tesla P100 GPU. New features in cuDNN 5 include:

  • Faster forward and backward convolutions using the Winograd convolution algorithm;
  • 3D FFT Tiling;
  • Spatial Transformer Networks;
  • Improved performance and reduced memory usage with FP16 routines on Pascal GPUs;
  • Support for LSTM recurrent neural networks for sequence learning that deliver up to 6x speedup.

One of the new features we’ve added in cuDNN 5 is support for Recurrent Neural Networks (RNN). RNNs are a powerful tool used for sequence learning in a number of fields, from speech recognition to image captioning. For a brief high-level introduction to RNNs, LSTM and sequence learning, I recommend you check out Tim Dettmers’ recent post Deep Learning in a Nutshell: Sequence Learning, and for more depth, Soumith Chintala’s post Understanding Natural Language with Deep Neural Networks Using Torch.

I’m excited about the RNN capabilities in cuDNN 5; we’ve put a lot of effort into optimizing their performance on NVIDIA GPUs, and I’ll go into some of the details of these optimizations in this blog post.

cuDNN 5 supports four RNN modes: ReLU activation function, tanh activation function, Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM). In this case study I’ll look at the performance of an LSTM network, but most of the optimizations can be applied to any RNN. Continue reading
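For reference, the LSTM mode follows the standard LSTM recurrence without peepholes; the notation below is mine, and the cuDNN documentation gives the authoritative equations:

    i_t = \sigma(W_i x_t + R_i h_{t-1} + b_i)
    f_t = \sigma(W_f x_t + R_f h_{t-1} + b_f)
    o_t = \sigma(W_o x_t + R_o h_{t-1} + b_o)
    \tilde{c}_t = \tanh(W_c x_t + R_c h_{t-1} + b_c)
    c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t
    h_t = o_t \circ \tanh(c_t)

Each time step therefore involves eight matrix multiplications (four against the input x_t and four against the recurrent state h_{t-1}), which is the kind of structure the optimizations discussed in the post operate on.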


Deep Learning in a Nutshell: Sequence Learning

This series of blog posts aims to provide an intuitive and gentle introduction to deep learning that does not rely heavily on math or theoretical constructs.

Be sure to check out the other Deep Learning in a Nutshell posts: Part 1, Part 2.
The first part of this series provided an overview of the field of deep learning, covering fundamental and core concepts. The second part of the series provided an overview of training neural networks efficiently and gave a background on the history of the field. In this post, we’ll look at sequence learning with a focus on natural language processing.

Figure 1: A Long Short-Term Memory (LSTM) unit. The LSTM unit has four input weights (from the data to the input and three gates) and four recurrent weights (from the output to the input and the three gates). Peepholes are extra connections between the memory cell and the gates, but they do not increase the performance by much and are often omitted for simplicity. Image by Klaus Greff and colleagues as published in LSTM: A Search Space Odyssey.

Sequence Learning

Everything in life depends on time and therefore represents a sequence. To perform machine learning with sequential data (text, speech, video, etc.) we could use a regular neural network and feed it the entire sequence, but the input size of our data would be fixed, which is quite limiting. Other problems with this approach occur if important events in a sequence lie just outside of the input window. What we need is (1) a network to which we can feed sequences of arbitrary length, one element of the sequence per time step (for example, a video is just a sequence of images; we feed the network one image at a time); and (2) a network which has some kind of memory to remember important events which happened many time steps in the past. These problems and requirements have led to a variety of different recurrent neural networks. Continue reading
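As a toy illustration of requirement (1), the following sketch (plain NumPy, made-up sizes, a vanilla recurrent cell rather than an LSTM) feeds a sequence to the network one element per time step while a hidden state carries information forward:

    import numpy as np

    rng = np.random.RandomState(0)

    input_size, hidden_size, seq_length = 8, 16, 100
    W_x = 0.1 * rng.randn(hidden_size, input_size)   # input-to-hidden weights
    W_h = 0.1 * rng.randn(hidden_size, hidden_size)  # recurrent (hidden-to-hidden) weights
    b = np.zeros(hidden_size)

    sequence = rng.randn(seq_length, input_size)     # e.g. 100 frames with 8 features each
    h = np.zeros(hidden_size)                        # the hidden state acts as memory

    for x_t in sequence:
        # One element of the sequence per time step; h carries context forward.
        h = np.tanh(W_x @ x_t + W_h @ h + b)

    print(h.shape)  # (16,): a fixed-size summary of an arbitrary-length sequence

Requirement (2), remembering events that happened many time steps ago, is where this plain recurrence struggles and where LSTM units such as the one in Figure 1 come in.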


Understanding Aesthetics with Deep Learning

To me, photography is the simultaneous recognition, in a fraction of a second, of the significance of an event. — Henri Cartier-Bresson

As a child I waited anxiously for the arrival of each new issue of National Geographic Magazine. The magazine had amazing stories from around the world, but the stunningly beautiful photographs were more important to me. The colors, shadows and composition intrigued and wowed me, and there was a cohesion of visual arrangement and storytelling.

Photograph by Porter Yates, EyeEm photographer of the year 2015.

This childhood fascination with photographs aroused in me a curiosity to understand the behavior, nuances and semantics embedded inside them. Ultimately, this curiosity drove me to study computer vision, which is empowering me to develop systems for understanding images from a computational and scientific perspective. Further, my job at EyeEm allows me to interact with technologists, designers, photo curators, photographers and product managers who are busy building the photography of the future!

EyeEm is a community and marketplace for passionate photographers. More than 15 million photographers use EyeEm to share their photos, connect with other photographers, improve their skills through masterclasses, get recognition through our photography missions and exhibitions, and earn money by licensing their photos. The following video shows the impact of our deep-learning-based automatic aesthetic curation on the EyeEm search experience—read on to learn more about how it is done.

Continue reading


Deep Learning in a Nutshell: History and Training

This series of blog posts aims to provide an intuitive and gentle introduction to deep learning that does not rely heavily on math or theoretical constructs. The first part in this series provided an overview of the field of deep learning, covering fundamental and core concepts. The third part of the series covers sequence learning topics such as recurrent neural networks and LSTM.

In this second part, we look briefly into the history of deep learning and then proceed to methods of training deep learning architectures quickly and efficiently. The third part focuses on learning algorithms, unsupervised and sequence learning.

I wrote this series in a glossary style so it can also be used as a reference for deep learning concepts.

History

A Short History of Deep Learning

The earliest deep-learning-like algorithms that had multiple layers of non-linear features can be traced back to Ivakhnenko and Lapa in 1965 (Figure 1), who used thin but deep models with polynomial activation functions, which they analyzed with statistical methods. In each layer, they selected the best features through statistical methods and forwarded them to the next layer. They did not use backpropagation to train their network end-to-end, but used layer-by-layer least squares fitting, where earlier layers were fitted independently of later layers.
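To make the scheme concrete, here is a rough, hypothetical reconstruction of the idea (a toy sketch of mine, not Ivakhnenko and Lapa’s actual procedure): each layer fits small polynomial units on pairs of the previous layer’s features by least squares, and only the best units, judged on held-out data, are forwarded.

    import numpy as np
    from itertools import combinations

    rng = np.random.RandomState(0)

    # Toy data: 6 input features and a target that mixes them non-linearly.
    X = rng.randn(200, 6)
    y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + 0.1 * rng.randn(200)
    X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

    def quadratic_design(a, b):
        # A small polynomial unit built from a pair of features.
        return np.column_stack([np.ones_like(a), a, b, a * b, a ** 2, b ** 2])

    feats_train, feats_val = X_train, X_val
    for layer in range(3):
        candidates = []
        for i, j in combinations(range(feats_train.shape[1]), 2):
            design = quadratic_design(feats_train[:, i], feats_train[:, j])
            coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)  # layer-wise least squares
            val_out = quadratic_design(feats_val[:, i], feats_val[:, j]) @ coef
            val_err = np.mean((val_out - y_val) ** 2)                # statistical selection
            candidates.append((val_err, design @ coef, val_out))
        candidates.sort(key=lambda c: c[0])
        best = candidates[:4]                                        # forward only the best features
        feats_train = np.column_stack([c[1] for c in best])
        feats_val = np.column_stack([c[2] for c in best])
        print("layer", layer + 1, "best validation MSE:", best[0][0])

Note that nothing is trained end-to-end: each layer is fitted once and then frozen, which is precisely the limitation the paragraph above points out.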

Figure 1: The architecture of the first known deep network, which was trained by Alexey Grigorevich Ivakhnenko in 1965. The feature selection steps after every layer lead to an ever-narrowing architecture, which terminates when no further improvement can be achieved by the addition of another layer. Image of Prof. Alexey Ivakhnenko courtesy of Wikipedia.

The earliest convolutional networks were used by Fukushima in 1979. Fukushima’s networks had multiple convolutional and pooling layers similar to modern networks, but the network was trained by using a reinforcement scheme where a trail of strong activation in multiple layers was increased over time. Additionally, one would assign important features of each image by hand by increasing the weight on certain connections.

Backpropagation of errors to train deep models was lacking at this point. Backpropagation had already been derived in the early 1960s, but in an inefficient and incomplete form. The modern form was first derived by Linnainmaa in his 1970 master’s thesis, which included FORTRAN code for backpropagation but did not mention its application to neural networks. Even at this point, backpropagation was relatively unknown, and very few documented applications of backpropagation existed in the early 1980s (e.g. Werbos in 1982). Rumelhart, Hinton, and Williams showed in 1985 that backpropagation in neural networks could yield interesting distributed representations. At this time, this was an important result in cognitive psychology, where the question was whether human cognition can be thought of as relying on distributed representations (connectionism) or symbolic logic (computationalism).

The first true, practical application of backpropagation came about through the work of LeCun in 1989 at Bell Labs. He used convolutional networks in combination with backpropagation to classify handwritten digits (MNIST), and this system was later used to read large numbers of handwritten checks in the United States. The video above shows Yann LeCun demonstrating digit classification using the “LeNet” network in 1993.

Continue reading


Optimizing Warehouse Operations with Machine Learning on GPUs

Recent advances in deep learning have enabled research and industry to master many challenges in computer vision and natural language processing that were out of reach until just a few years ago. Yet computer vision and natural language processing represent only the tip of the iceberg of what is possible. In this article, I will demonstrate how Sebastian Heinz, Roland Vollgraf and I (Calvin Seward) used deep neural networks in steering operations at Zalando’s fashion warehouses.

As Europe’s leading online fashion retailer, Zalando offers many exciting opportunities to apply the latest results from data science, statistics, and high-performance computing. Zalando’s vertically integrated business model means that I have dealt with projects as diverse as computer vision, fraud detection, recommender systems and, of course, warehouse management.

To solve the warehouse management problem that I’ll discuss in this post, we trained a neural network that very accurately estimates the length of the shortest possible route that visits a set of locations in the warehouse. I’ll demonstrate how we used this neural network to greatly accelerate a processing bottleneck, which in turn enabled us to more efficiently split work between workers.

The core idea is to use deep learning to create a fast, efficient estimator for a slow and complex algorithm. This is an idea that can (and will) be applied to problems in many areas of industry and research. Continue reading
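The post explains the Zalando-specific details; purely to illustrate the general pattern, here is a hedged sketch in which a hypothetical slow_exact_route_length function stands in for the slow, complex algorithm and a small neural network learns to approximate its output (all names, sizes and data below are made up):

    import itertools
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def slow_exact_route_length(points):
        # Hypothetical "slow algorithm": brute-force shortest path visiting all points.
        best = np.inf
        for order in itertools.permutations(range(len(points))):
            length = sum(np.linalg.norm(points[order[k]] - points[order[k + 1]])
                         for k in range(len(order) - 1))
            best = min(best, length)
        return best

    rng = np.random.RandomState(0)
    n_points = 5  # small enough that the exact algorithm is still tractable

    # Label random location sets with the exact (slow) route length.
    X = rng.rand(500, n_points, 2)
    y = np.array([slow_exact_route_length(p) for p in X])

    # Train a neural network to estimate the slow algorithm's output quickly.
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000)
    model.fit(X.reshape(len(X), -1), y)

    # The trained model now gives near-instant estimates for new location sets.
    print(model.predict(X[:3].reshape(3, -1)), y[:3])

The estimator trades a little accuracy for much faster evaluation, which is the kind of trade-off the post describes.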


Inference: The Next Step in GPU-Accelerated Deep Learning

At 45 images/s/W, Jetson TX1 is super efficient at deep learning inference. Read the whitepaper.
Deep learning is revolutionizing many areas of machine perception, with the potential to impact the everyday experience of people everywhere. On a high level, working with deep neural networks is a two-stage process: first, a neural network is trained, meaning its parameters are determined using labeled examples of inputs and desired outputs. Then, the network is deployed to run inference, using its previously trained parameters to classify, recognize and process unknown inputs.

Figure 1: Deep learning training compared to inference. In training, many inputs, often in large batches, are used to train a deep neural network. In inference, the trained network is used to discover information within new inputs that are fed through the network in smaller batches.

It is widely recognized within academia and industry that GPUs are the state of the art in training deep neural networks, due to both speed and energy efficiency advantages compared to more traditional CPU-based platforms. A new whitepaper from NVIDIA takes the next step and investigates GPU performance and energy efficiency for deep learning inference.

The results show that GPUs provide state-of-the-art inference performance and energy efficiency, making them the platform of choice for anyone wanting to deploy a trained neural network in the field. In particular, the NVIDIA GeForce GTX Titan X delivers between 5.3 and 6.7 times higher performance than the 16-core Intel Xeon E5 CPU while achieving 3.6 to 4.4 times higher energy efficiency. The NVIDIA Tegra X1 SoC also achieves impressive results, with higher performance (258 vs. 242 images/second) and much higher energy efficiency (45 vs. 3.9 images/second/Watt) than the state-of-the-art Intel Core i7 6700K. Continue reading


Deep Learning in a Nutshell: Core Concepts

This post is the first in a series I’ll be writing for Parallel Forall that aims to provide an intuitive and gentle introduction to deep learning. It covers the most important deep learning concepts and aims to provide an understanding of each concept rather than its mathematical and theoretical details. While the mathematical terminology is sometimes necessary and can further understanding, these posts use analogies and images whenever possible to provide easily digestible bits comprising an intuitive overview of the field of deep learning.

I wrote this series in a glossary style so it can also be used as a reference for deep learning concepts.

Part 1 focuses on introducing the main concepts of deep learning. Part 2 provides historical background and delves into the training procedures, algorithms and practical tricks that are used in training for deep learning. Part 3 covers sequence learning, including recurrent neural networks, LSTMs, and encoder-decoder systems for neural machine translation.

Core Concepts

Machine Learning

In machine learning we (1) take some data, (2) train a model on that data, and (3) use the trained model to make predictions on new data. The process of training a model can be seen as a learning process where the model is exposed to new, unfamiliar data step by step. At each step, the model makes predictions and gets feedback about how accurate its generated predictions were. This feedback, which is provided in terms of an error according to some measure (for example distance from the correct solution), is used to correct the errors made in prediction.

The learning process is often a game of back-and-forth in the parameter space: If you tweak a parameter of the model to get a prediction right, the model may then get a previously correct prediction wrong. It may take many iterations to train a model with good predictive performance. This iterative predict-and-adjust process continues until the predictions of the model no longer improve.
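As a deliberately tiny example of this predict-and-adjust loop, the sketch below fits a single-parameter linear model: it predicts, measures the error against the correct answers, and nudges the parameter to reduce that error, repeating until the predictions stop improving (illustrative only; real models have far more parameters):

    import numpy as np

    # Toy data generated from y = 3x plus noise; the model must discover the 3.
    rng = np.random.RandomState(0)
    x = rng.rand(100)
    y = 3.0 * x + 0.05 * rng.randn(100)

    w = 0.0                 # the single model parameter, initially wrong
    learning_rate = 0.1

    for step in range(200):
        predictions = w * x                   # (1) predict on the data
        error = predictions - y               # (2) feedback: distance from the correct solution
        gradient = 2.0 * np.mean(error * x)   # direction that reduces the squared error
        w -= learning_rate * gradient         # (3) adjust the parameter slightly

    print(w)  # close to 3.0 after enough predict-and-adjust iterations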

Feature Engineering

Feature engineering is the art of extracting useful patterns from data that will make it easier for machine learning models to distinguish between classes. For example, you might take the number of greenish vs. bluish pixels as an indicator of whether a land or water animal is in some picture. This feature is helpful for a machine learning model because it limits the number of classes that need to be considered for a good classification. Continue reading
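A quick sketch of that example feature (assuming an RGB image stored as a NumPy array; the helper name and the simple threshold rule are made up for illustration):

    import numpy as np

    def green_vs_blue_feature(image):
        # Fraction of pixels where the green channel dominates the blue channel.
        # A single hand-engineered number like this can already hint at
        # "land animal" (green foliage) versus "water animal" (blue water).
        greenish = image[:, :, 1] > image[:, :, 2]
        return float(np.mean(greenish))

    # Example with a fake 64x64 RGB image.
    image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
    print(green_vs_blue_feature(image))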


Deep Learning for Computer Vision with MATLAB and cuDNN

Deep learning is becoming ubiquitous. With recent advancements in deep learning algorithms and GPU technology, we are able to solve problems once considered impossible in fields such as computer vision, natural language processing, and robotics.

Figure 1: Pet detection and recognition system.

Deep learning uses deep neural networks which have been around for a few decades; what’s changed in recent years is the availability of large labeled datasets and powerful GPUs. Neural networks are inherently parallel algorithms and GPUs with thousands of cores can take advantage of this parallelism to dramatically reduce computation time needed for training deep learning networks. In this post, I will discuss how you can use MATLAB to develop an object recognition system using deep convolutional neural networks and GPUs.

Why Deep Learning for Computer Vision?

Machine learning techniques use data (images, signals, text) to train a machine (or model) to perform a task such as image classification, object detection, or language translation. Classical machine learning techniques are still being used to solve challenging image classification problems. However, they don’t work well when applied directly to images, because they ignore the structure and compositional nature of images. Until recently, state-of-the-art techniques made use of feature extraction algorithms that extract interesting parts of an image as compact low-dimensional feature vectors. These were then used along with traditional machine learning algorithms.

Enter deep learning. Deep convolutional neural networks (CNNs), a specific type of deep learning algorithm, address the gaps in traditional machine learning techniques, changing the way we solve these problems. CNNs not only perform classification, but they can also learn to extract features directly from raw images, eliminating the need for manual feature extraction. For computer vision applications you often need more than just image classification; you need state-of-the-art computer vision techniques for object detection, a bit of domain expertise, and the know-how to set up and use GPUs efficiently. Through the rest of this post, I will use an object recognition example to illustrate how easy it is to use MATLAB for deep learning, even if you don’t have extensive knowledge of computer vision or GPU programming. Continue reading