
Using GPUs to Accelerate Digital Rock Physics

James McClure, a Computational Scientist with Advanced Research Computing at Virginia Tech, shares how his group uses the NVIDIA Tesla GPU-accelerated Titan supercomputer at Oak Ridge National Laboratory to combine mathematical models with 3D visualization, providing insight into how fluids move below the surface of the earth.

McClure spoke with us about his research at the 2015 Supercomputing Conference.

Brad Nemire: Can you talk a bit about your current research? 

Figure 1: Complex 3D microstructure showing fluid ganglia within a sandstone sample imaged using x-ray micro-tomography. The connected components of the oil phase are shown in various shades of blue.

James McClure: Digital Rock Physics is a relatively new computational discipline that relies on high-performance computing to study the behavior of fluids within rock and other geologic materials. Understanding how fluids move within rock is essential for applications like geologic carbon sequestration, oil and gas recovery, and environmental contaminant transport. New technologies such as synchrotron-based x-ray micro-computed tomography enable the collection of 3D images that reveal the structure of rocks at the micron scale. Using these images, we can make predictions about the complex rock-fluid interactions that take place within natural systems. Continue reading


EGL Eye: OpenGL Visualization without an X Server

If you’re like me, you have a GPU-accelerated in-situ visualization toolkit that you need to run on the latest-generation supercomputer. Or maybe you have a fantastic OpenGL application that you want to deploy on a server farm for offline rendering. Even though you have access to all that amazing GPU power, you’re often out of luck when it comes to GPU-accelerated rendering. The reason is that enabling OpenGL rendering on the GPUs is not sufficient (see my previous blog post and this white paper for more details); you also need an X server running on each node. Especially in an HPC setting, system administrators are often reluctant to have X server processes running on the compute nodes. Until recently, however, an X server was the only way to manage an OpenGL context. That’s where EGL comes in.

Figure 1: The visualization toolkit VTK can use EGL for context management and off-screen rendering. Images courtesy of Kitware.

Over the past few years, a new standard for managing OpenGL contexts has emerged: EGL. Initially driven by the requirements of the embedded space, EGL gained support in NVIDIA driver release 331, enabling context creation for OpenGL ES applications without the need for a running X server. However, it was still not possible to run legacy OpenGL applications under such contexts.

With the release of NVIDIA Driver 355, full (desktop) OpenGL is now available on every GPU-enabled system, with or without a running X server. The latest driver (358) enables multi-GPU rendering support.

In this post, I will briefly describe the steps necessary to create an OpenGL context in order to enable OpenGL accelerated applications on systems without an X server.

Creating an OpenGL context

Electron flow in a nanotransistor (Visualization: Jean Favre, CSCS; Simulation: Mathieu Luisier, Mauro Caldelara, Joost VandeVondele, ETH Zurich).

The most common use case for OpenGL on EGL is to create an OpenGL context and use it for off-screen rendering. Another use case is a CUDA or OpenACC application that performs an operation that can benefit from the graphics-specific functionality of the GPU as part of the computation. Examples include solvers with domain boundaries expressed as triangulated surfaces, simulators that trace particle trajectories on a computational grid, and codes that need to perform visibility tests for geometrical structures.

The good news is that creating an OpenGL context with EGL is not rocket science! Just follow these five basic steps: Initialize EGL, select an appropriate screen, create a surface, bind the correct API and obtain a context from it. The following code outlines the steps. Continue reading
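As a minimal sketch of those five steps (assuming an EGL 1.4-capable NVIDIA driver and a small off-screen pbuffer as the render target; this is not the post’s full listing, and error checking is omitted for brevity):

```cpp
#include <EGL/egl.h>

static const EGLint configAttribs[] = {
    EGL_SURFACE_TYPE, EGL_PBUFFER_BIT,
    EGL_BLUE_SIZE, 8,
    EGL_GREEN_SIZE, 8,
    EGL_RED_SIZE, 8,
    EGL_DEPTH_SIZE, 8,
    EGL_RENDERABLE_TYPE, EGL_OPENGL_BIT,
    EGL_NONE
};

static const EGLint pbufferAttribs[] = {
    EGL_WIDTH,  256,
    EGL_HEIGHT, 256,
    EGL_NONE,
};

int main()
{
    // 1. Initialize EGL on the default display (no X server required).
    EGLDisplay dpy = eglGetDisplay(EGL_DEFAULT_DISPLAY);
    EGLint major, minor;
    eglInitialize(dpy, &major, &minor);

    // 2. Select an appropriate configuration (the "screen" format to render to).
    EGLint numConfigs;
    EGLConfig cfg;
    eglChooseConfig(dpy, configAttribs, &cfg, 1, &numConfigs);

    // 3. Create an off-screen (pbuffer) surface to render into.
    EGLSurface surf = eglCreatePbufferSurface(dpy, cfg, pbufferAttribs);

    // 4. Bind the desktop OpenGL API (rather than OpenGL ES).
    eglBindAPI(EGL_OPENGL_API);

    // 5. Create a context and make it current.
    EGLContext ctx = eglCreateContext(dpy, cfg, EGL_NO_CONTEXT, nullptr);
    eglMakeCurrent(dpy, surf, surf, ctx);

    // ... issue regular OpenGL calls here ...

    eglTerminate(dpy);
    return 0;
}
```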


Developing New Materials with GPU-Accelerated Supercomputers

Dr. Joshua A. Anderson is a Research Area Specialist at the University of Michigan who was an early user of GPU computing technology. He began his career developing software on the first CUDA-capable GPU and now runs simulations on one of the world’s most powerful supercomputers.

Anderson’s “contributions to the development and dissemination of the open source, GPU-enabled molecular simulation software, HOOMD-blue, which enables scientific computations with unprecedented speed” earned him the 2015 CoMSEF Young Investigator Award for Modeling and Simulation.

Brad Nemire: Can you talk a bit about your current research?

Joshua Anderson: I work with the Glotzer Group at the University of Michigan. We use computer simulation to discover the fundamental principles of how nanoscale systems of building blocks self-assemble, and to discover how to control the assembly process to engineer new materials. Specifically, we focus on the role of particle shape and how changing the shape can result in different material properties.

Figure 1: An example system configuration from the shape allophiles project: Eric S. Harper, Ryan Marson, Joshua A. Anderson, Greg van Anders, and Sharon C. Glotzer. Shape Allophiles Improve Entropic Assembly. Soft Matter, 2015. (doi:10.1039/C5SM01351H).

Over the past few years, I have been focusing on two-dimensional systems, using large scale simulations to study hexatic phase transitions for hard disks, and how patterning surfaces of polygons can create shape allophiles that improve self-assembly. The hexatic phase is an intermediate between the fluid and hexagonally ordered solid. In the hexatic phase, the orientation of bonds between particles has long range order, but translational order is short range and there is no crystal lattice. Shape allophiles are polygonal shapes cut so they fit together like puzzle pieces. These research projects are computationally demanding and could not have been run on any existing code. So before I could even begin the science research, I needed to develop, implement, and optimize the parallel algorithms necessary for these studies. Continue reading


VR SLI: Accelerating OpenGL Virtual Reality with Multi-GPU Rendering

High-performance stereo head-mounted display (HMD) rendering is a fundamental component of the virtual reality ecosystem. HMD rendering requires substantial graphics horsepower to deliver high-quality, high-resolution stereo rendering with a high frame rate.

Today, NVIDIA is releasing VR SLI for OpenGL via a new OpenGL extension called “GL_NVX_linked_gpu_multicast” that can be used to greatly improve the speed of HMD rendering. With this extension, it is possible to control multiple GPUs that are in an NVIDIA SLI group with a single OpenGL context to reduce overhead and improve frame rates.

Autodesk VRED has successfully integrated NVIDIA’s new multicast extension into its stereo rendering code used for HMDs like the Oculus Rift and the HTC Vive, achieving a 1.7x speedup (see Figure 1). The multicast extension also helps Autodesk VRED with rendering for stereo displays or projectors, with nearly a 2x speedup.

Figure 1: Autodesk VRED Professional 2016 SR1-SP4 rendering a scene for Oculus Rift.

Continue reading


Deep Learning in a Nutshell: History and Training

This series of blog posts aims to provide an intuitive and gentle introduction to deep learning that does not rely heavily on math or theoretical constructs. The first part in this series provided an overview of the field of deep learning, covering fundamental and core concepts.

In this second part, we take a brief look at the history of deep learning and then proceed to methods of training deep learning architectures quickly and efficiently. The third part focuses on learning algorithms, unsupervised learning, and sequence learning.

I wrote this series in a glossary style so it can also be used as a reference for deep learning concepts.

History

A Short History of Deep Learning

The earliest deep-learning-like algorithms that had multiple layers of non-linear features can be traced back to Ivakhnenko and Lapa in 1965 (Figure 1), who used thin but deep models with polynomial activation functions which they analyzed with statistical methods. In each layer, they selected the best features through statistical methods and forwarded them to the next layer. They did not use backpropagation to train their network end-to-end, but used layer-by-layer least-squares fitting, where earlier layers were fitted independently of later layers.

Figure 1: The architecture of the first known deep network, trained by Alexey Grigorevich Ivakhnenko in 1965. The feature selection steps after every layer lead to an ever-narrowing architecture, which terminates when no further improvement can be achieved by the addition of another layer. Image of Prof. Alexey Ivakhnenko courtesy of Wikipedia.

The earliest convolutional networks were used by Fukushima in 1979. Fukushima’s networks had multiple convolutional and pooling layers similar to modern networks, but the network was trained by using a reinforcement scheme where a trail of strong activation in multiple layers was increased over time. Additionally, one would assign important features of each image by hand by increasing the weight on certain connections.

Backpropagation of errors to train deep models was lacking at this point. Backpropagation had already been derived in the early 1960s, but in an inefficient and incomplete form. The modern form was first derived by Linnainmaa in his 1970 master’s thesis, which included FORTRAN code for backpropagation but did not mention its application to neural networks. Even at this point, backpropagation was relatively unknown, and very few documented applications of backpropagation existed in the early 1980s (e.g. Werbos in 1982). Rumelhart, Hinton, and Williams showed in 1985 that backpropagation in neural networks could yield interesting distributed representations. At the time, this was an important result in cognitive psychology, where the question was whether human cognition can be thought of as relying on distributed representations (connectionism) or symbolic logic (computationalism).

The first true, practical application of backpropagation came about through the work of LeCun at Bell Labs in 1989. He used convolutional networks in combination with backpropagation to classify handwritten digits (MNIST), and this system was later used to read large numbers of handwritten checks in the United States. The video above shows Yann LeCun demonstrating digit classification with the “LeNet” network in 1993.

Continue reading


Optimizing Warehouse Operations with Machine Learning on GPUs

Recent advances in deep learning have enabled research and industry to master many challenges in computer vision and natural language processing that were out of reach until just a few years ago. Yet computer vision and natural language processing represent only the tip of the iceberg of what is possible. In this article, I will demonstrate how Sebastian Heinz, Roland Vollgraf and I (Calvin Seward) used deep neural networks to steer operations at Zalando’s fashion warehouses.

As Europe’s leading online fashion retailer, Zalando offers many exciting opportunities to apply the latest results from data science, statistics, and high-performance computing. Zalando’s vertically integrated business model means that I have dealt with projects as diverse as computer vision, fraud detection, recommender systems and, of course, warehouse management.

To solve the warehouse management problem that I’ll discuss in this post, we trained a neural network that very accurately estimates the length of the shortest possible route that visits a set of locations in the warehouse. I’ll demonstrate how we used this neural network to greatly accelerate a processing bottleneck, which in turn enabled us to more efficiently split work between workers.

The core idea is to use deep learning to create a fast, efficient estimator for a slow and complex algorithm. This is an idea that can (and will) be applied to problems in many areas of industry and research. Continue reading
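None of the code below is Zalando’s; as a heavily simplified sketch of that idea, the following stands in for the deep network with a one-parameter least-squares fit on a handcrafted feature. A slow, brute-force routing function labels randomly sampled pick lists offline, and the fitted estimator then predicts route lengths for new pick lists without any search.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <limits>
#include <random>
#include <vector>

struct Point { double x, y; };

double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// The "slow, complex algorithm": brute-force length of the shortest open route
// visiting every location. Only feasible for a handful of points.
double exactShortestRoute(const std::vector<Point>& pts) {
    std::vector<int> idx(pts.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
    double best = std::numeric_limits<double>::max();
    do {
        double len = 0.0;
        for (size_t i = 1; i < idx.size(); ++i)
            len += dist(pts[idx[i - 1]], pts[idx[i]]);
        best = std::min(best, len);
    } while (std::next_permutation(idx.begin(), idx.end()));
    return best;
}

// Handcrafted feature: route length tends to scale like sqrt(stops * area).
double feature(const std::vector<Point>& pts) {
    double minx = 1e300, maxx = -1e300, miny = 1e300, maxy = -1e300;
    for (const auto& p : pts) {
        minx = std::min(minx, p.x); maxx = std::max(maxx, p.x);
        miny = std::min(miny, p.y); maxy = std::max(maxy, p.y);
    }
    return std::sqrt(pts.size() * (maxx - minx) * (maxy - miny));
}

int main() {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> uni(0.0, 100.0);
    auto randomPickList = [&](int n) {
        std::vector<Point> pts(n);
        for (auto& p : pts) p = {uni(rng), uni(rng)};
        return pts;
    };

    // Offline: label sampled pick lists with the slow algorithm, then fit the
    // one-parameter estimator length ~= c * feature by least squares.
    double sfy = 0.0, sff = 0.0;
    for (int sample = 0; sample < 200; ++sample) {
        std::vector<Point> pts = randomPickList(5 + static_cast<int>(rng() % 4));
        double y = exactShortestRoute(pts);   // expensive ground truth
        double f = feature(pts);
        sfy += f * y;
        sff += f * f;
    }
    double c = sfy / sff;

    // Online: the fitted estimator answers instantly, with no search at all.
    std::vector<Point> pickList = randomPickList(8);
    std::printf("estimated route length: %.1f  (exact: %.1f)\n",
                c * feature(pickList), exactShortestRoute(pickList));
    return 0;
}
```

The real system replaces the one-parameter fit with a deep neural network trained on far richer features, but the offline-label/online-estimate structure is the same.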


Inference: The Next Step in GPU-Accelerated Deep Learning

At 45 images/s/W, Jetson TX1 is super efficient at deep learning inference. Read the whitepaper.

Deep learning is revolutionizing many areas of machine perception, with the potential to impact the everyday experience of people everywhere. At a high level, working with deep neural networks is a two-stage process. First, a neural network is trained: its parameters are determined using labeled examples of inputs and desired outputs. Then, the network is deployed to run inference, using its previously trained parameters to classify, recognize and process unknown inputs.

Figure 1: Deep learning training compared to inference. In training, many inputs, often in large batches, are used to train a deep neural network. In inference, the trained network is used to discover information within new inputs that are fed through the network in smaller batches.
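To make the two stages concrete, here is a minimal sketch of the inference side in isolation: the weights below are hypothetical placeholders standing in for values that a training stage would have produced, and deployment reduces to a single forward pass over each new input.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Inference only: the parameters are frozen; no learning happens here.
int main() {
    // Tiny network: 2 inputs -> 3 hidden units (ReLU) -> 1 output (sigmoid).
    const double W1[3][2] = {{ 0.8, -0.4}, {-0.2, 0.9}, { 0.5, 0.5}};
    const double b1[3]    = { 0.1, -0.3, 0.0};
    const double W2[3]    = { 1.2, -0.7, 0.6};
    const double b2       = -0.2;

    const double input[2] = {0.25, 0.75};     // a new, unlabeled example

    double hidden[3];
    for (int j = 0; j < 3; ++j) {
        double z = b1[j];
        for (int i = 0; i < 2; ++i) z += W1[j][i] * input[i];
        hidden[j] = std::max(0.0, z);         // ReLU activation
    }

    double z = b2;
    for (int j = 0; j < 3; ++j) z += W2[j] * hidden[j];
    double prob = 1.0 / (1.0 + std::exp(-z)); // sigmoid -> class probability

    std::printf("predicted probability of the positive class: %.3f\n", prob);
    return 0;
}
```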

It is widely recognized within academia and industry that GPUs are the state of the art in training deep neural networks, due to both speed and energy efficiency advantages compared to more traditional CPU-based platforms. A new whitepaper from NVIDIA takes the next step and investigates GPU performance and energy efficiency for deep learning inference.

The results show that GPUs provide state-of-the-art inference performance and energy efficiency, making them the platform of choice for anyone wanting to deploy a trained neural network in the field. In particular, the NVIDIA GeForce GTX Titan X delivers between 5.3 and 6.7 times higher performance than the 16-core Intel Xeon E5 CPU while achieving 3.6 to 4.4 times higher energy efficiency. The NVIDIA Tegra X1 SoC also delivers impressive results, achieving higher performance (258 vs. 242 images/second) and much higher energy efficiency (45 vs. 3.9 images/second/Watt) than the state-of-the-art Intel Core i7 6700K. Continue reading


NVIDIA® Jetson™ TX1 Supercomputer-on-Module Drives Next Wave of Autonomous Machines

Figure 1. The 50x87mm embedded Jetson TX1 module and thermal plate, featuring integrated Maxwell GPU, ARMv8 CPU, and H.265 video processor.

Today NVIDIA introduced Jetson TX1, a small form-factor Linux system-on-module destined for demanding embedded applications in visual computing. Designed for developers and makers everywhere, the miniature Jetson TX1 (Figure 1) deploys teraflop-level supercomputing performance onboard platforms in the field. Backed by the Jetson TX1 Developer Kit, a premier developer community, and a software ecosystem including JetPack, Linux for Tegra R23.1, CUDA Toolkit 7, cuDNN, and VisionWorks, Jetson gives machines everywhere the proverbial brains required to achieve advanced levels of autonomy in today’s world.

Aimed at developers interested in computer vision and on-the-fly sensing, Jetson TX1’s credit-card footprint and low power consumption mean that it’s geared for deployment onboard embedded systems with constrained size, weight, and power (SWaP). Jetson TX1 exceeds the performance of Intel’s high-end Core i7-6700K Skylake in deep learning classification with Caffe and, while drawing only a fraction of the power, achieves more than ten times the performance per watt.

Jetson provides superior efficiency while maintaining a developer-friendly environment for agile prototyping and product development, removing extra legwork typically associated with deploying power-limited embedded systems. Jetson TX1’s small form-factor module enables developers everywhere to deploy Tegra into embedded applications ranging from autonomous navigation to deep learning-driven inference and analytics. Continue reading


Accelerating Hyperscale Data Center Applications with Tesla GPUs

The internet has changed how people consume media. Rather than just watching television and movies, the combination of ubiquitous mobile devices, massive computation, and available internet bandwidth has led to an explosion in user-created content: users are recreating the internet, producing exabytes of content every day.

Exabytes of content produced daily

Periscope, a mobile application that lets users broadcast video to followers, has 10 million users who broadcast over 40 years of video per day. Twitch, a popular game broadcasting service, revealed last month that 1.7 million users have live-streamed 7.5 billion minutes of content. China’s biggest search engine, Baidu, processes 6 billion queries per day, and 10% of those queries use speech. About 300 hours of video are uploaded to YouTube every minute. And just last week, Mark Zuckerberg shared that Facebook users view 8 billion videos every day, a number that has grown by a factor of 8 in about a year.

This massive scale of content requires massive amounts of processing, and due to the volume of media content involved, data center workloads are changing. Increasing resources are spent on video and image processing: resizing, transcoding, filtering and enhancement. Likewise, large-scale machine learning and deep learning techniques increasingly rely on what’s known as “inference”: applying trained models to tasks such as image classification, object detection, machine translation, and speech recognition.

Continue reading


Deep Learning in a Nutshell: Core Concepts

This post is the first in a series I’ll be writing for Parallel Forall that aims to provide an intuitive and gentle introduction to deep learning. It covers the most important deep learning concepts and aims to provide an understanding of each concept rather than its mathematical and theoretical details. While the mathematical terminology is sometimes necessary and can further understanding, these posts use analogies and images whenever possible to provide easily digestible bits comprising an intuitive overview of the field of deep learning.

I wrote this series in a glossary style so it can also be used as a reference for deep learning concepts.

Part 1 focuses on introducing the main concepts of deep learning. Future posts will provide historical background and delve into the training procedures, algorithms and practical tricks that are used in training for deep learning.

Core Concepts

Machine Learning

In machine learning we (1) take some data, (2) train a model on that data, and (3) use the trained model to make predictions on new data. The process of training a model can be seen as a learning process where the model is exposed to new, unfamiliar data step by step. At each step, the model makes predictions and gets feedback about how accurate its generated predictions were. This feedback, which is provided in terms of an error according to some measure (for example distance from the correct solution), is used to correct the errors made in prediction.

The learning process is often a game of back-and-forth in the parameter space: if you tweak a parameter of the model to get one prediction right, the model may change in such a way that it gets a previously correct prediction wrong. It may take many iterations to train a model with good predictive performance. This iterative predict-and-adjust process continues until the predictions of the model no longer improve.
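As a minimal illustration of that predict-and-adjust loop (a toy one-parameter linear model rather than a deep network), the sketch below repeatedly predicts, measures the error against the labels, and nudges the parameter until the predictions stop improving:

```cpp
#include <cstdio>

// A toy one-parameter model y = w * x, fit to data whose true slope is 2.
int main() {
    const double xs[4] = {1.0, 2.0, 3.0, 4.0};
    const double ys[4] = {2.0, 4.0, 6.0, 8.0};  // the "correct solutions"

    double w = 0.0;           // initial guess for the model parameter
    const double lr = 0.01;   // how strongly each error corrects the parameter

    for (int step = 0; step < 200; ++step) {
        double grad = 0.0, loss = 0.0;
        for (int i = 0; i < 4; ++i) {
            double pred  = w * xs[i];         // 1) predict
            double error = pred - ys[i];      // 2) feedback: distance from the answer
            loss += error * error;
            grad += 2.0 * error * xs[i];      // 3) direction to adjust the parameter
        }
        w -= lr * grad / 4;                   // 4) adjust and repeat
        if (step % 50 == 0)
            std::printf("step %3d  loss %.5f  w %.4f\n", step, loss / 4, w);
    }
    std::printf("learned w = %.4f (true slope: 2)\n", w);
    return 0;
}
```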

Feature Engineering

Feature engineering is the art of extracting useful patterns from data that will make it easier for machine learning models to distinguish between classes. For example, you might take the number of greenish vs. bluish pixels as an indicator of whether a land or water animal is in some picture. This feature is helpful for a machine learning model because it limits the number of classes that need to be considered for a good classification. Continue reading
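As a toy sketch of that green-versus-blue example (the pixel values below are invented purely for illustration), the engineered feature collapses raw pixels into a single number that a downstream classifier could use:

```cpp
#include <cstdio>
#include <vector>

struct Pixel { unsigned char r, g, b; };

// Engineered feature: the share of greenish pixels among greenish and bluish
// ones, a crude hint about land (vegetation) vs. water scenes.
double greenVsBlueFraction(const std::vector<Pixel>& image) {
    int greenish = 0, bluish = 0;
    for (const auto& p : image) {
        if (p.g > p.b && p.g > p.r) ++greenish;
        else if (p.b > p.g && p.b > p.r) ++bluish;
    }
    int counted = greenish + bluish;
    return counted == 0 ? 0.5 : static_cast<double>(greenish) / counted;
}

int main() {
    // Invented pixel values, purely for illustration.
    std::vector<Pixel> meadow = {{30, 120, 40}, {50, 140, 60}, {20, 100, 30}};
    std::vector<Pixel> lake   = {{20, 60, 150}, {30, 70, 160}, {10, 50, 140}};
    std::printf("meadow feature: %.2f   lake feature: %.2f\n",
                greenVsBlueFraction(meadow), greenVsBlueFraction(lake));
    return 0;
}
```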