Leveraging OpenACC to Compute High-Accuracy Molecular Energies

For this interview, I reached out to Janus Juul Eriksen, a Ph.D. fellow at Aarhus University in Denmark. Janus is a chemist by trade with no formal education in computer science, yet he is getting up to a 12x speedup over his CPU-only code after modifying fewer than 100 lines of code in one week of programming effort.

How did he do this? He used OpenACC.

OpenACC is a simple, powerful and portable approach for researchers and scientists who need to rapidly boost application performance for faster science while minimizing programming. With OpenACC, the original source code is kept intact, making the implementation intuitively transparent and leaving most of the hard work to the compiler.

NVIDIA recently announced the new OpenACC Toolkit, an all-in-one suite of parallel programming tools that helps researchers and scientists quickly accelerate applications.

“OpenACC is much easier to learn than OpenMP or MPI. It makes GPU computing approachable for domain scientists,” says Janus. “Our initial OpenACC implementation required only minor efforts, and more importantly, no modifications of our existing CPU implementation.”

Figure 1: The active site of the Bacterial Leucine Transporter (LeuT) protein, on which the LSDalton team has been doing a number of high-accuracy calculations. The LeuT protein belongs to a family of transporters that are linked to neurological diseases.

Janus is part of the research team developing the quantum chemistry code LSDalton, a massively parallel and linear-scaling program for the accurate determination of energies and other molecular properties for large molecular systems.

In need of speed, the LSDalton team was awarded an INCITE allocation which gave them access to Oak Ridge National Laboratory’s Titan supercomputer. With this, they needed to find a way to use the power of the supercomputer: enter OpenACC. Demonstrating success on Titan with their GPU-accelerated code, they were recently one of 13 application code projects selected to join the Center for Accelerated Application Readiness (CAAR) program. This means they will be among the first applications to run on Summit, the new supercomputer debuting in 2018 which will deliver more than five times the computational performance of Titan’s 18,688 nodes.

This access will enable the research team to simulate larger molecular structures at higher accuracy, ultimately accelerating discoveries in materials and quantum chemistry.

Continue reading

Introduction to Neural Machine Translation with GPUs (part 3)

Note: This is the final part of a detailed three-part series on machine translation with neural networks by Kyunghyun Cho. You may enjoy part 1 and part 2.

In the previous post in this series, I introduced a simple encoder-decoder model for machine translation. This simple encoder-decoder model is excellent at English-French translation. However, in this post I will briefly discuss the weakness of this simple approach, and describe a recently proposed way of incorporating a soft attention mechanism to overcome the weakness and significantly improve the translation quality.

Furthermore, I will present some more recent works that utilize this neural machine translation approach to go beyond machine translation of text, such as image caption generation and video description generation. I’ll finish the blog series with a brief discussion of future research directions and a pointer to the open source code implementing these neural machine translation models.

The Trouble with Simple Encoder-Decoder Architectures

In the encoder-decoder architecture, the encoder compresses the input sequence into a fixed-size vector from which the decoder needs to generate a full translation. In other words, the fixed-size vector, which I’ll call a context vector, must contain every single detail of the source sentence. Intuitively, this means that the true function approximated by the encoder has to be extremely nonlinear and complicated. Furthermore, the dimensionality of the context vector must be large enough that a sentence of any length can be compressed.

In my paper “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches”, presented at SSST-8, my coauthors and I empirically confirmed that translation quality degrades dramatically as the length of the source sentence increases when the encoder-decoder model is small. Together with a much better result from Sutskever et al. (2014), who used the same type of encoder-decoder architecture, this suggests that the representational power of the encoder needs to be large, which often means that the model must be large, in order to cope with long sentences (see Figure 1).

Figure 1: Dramatic drop of performance w.r.t. the length of sentence with a small encoder-decoder model.

Of course, a larger model implies higher computation and memory requirements. The use of advanced GPUs, such as NVIDIA Titan X, indeed helps with computation, but not with memory (at least not yet). The size of onboard memory is often limited to several Gigabytes, and this imposes a serious limitation on the size of the model. (Note: it’s possible to overcome this issue by using multiple GPUs while distributing a single model across those GPUs, as shown by Sutskever et al. (2014). However, let’s assume for now that we have access to a single machine with a single GPU due to space, power and other physical constraints.)

Then, the question is “can we do better than the simple encoder-decoder based model?” Continue reading


Using GPUs to Accelerate Epidemic Forecasting

Originally trained as a veterinary surgeon, Chris Jewell, a Senior Lecturer in Epidemiology at Lancaster Medical School in the UK, became interested in epidemics through his experience working on the foot and mouth disease outbreak in the UK in 2001. His work so far has been on livestock epidemics such as foot and mouth disease, theileriosis, and avian influenza with government organizations in the UK, New Zealand, Australia, and the US. Recently, he has refocused his efforts on the human field, where populations and epidemics tend to be larger and therefore need more computing grunt.

Epidemic forecasting centers around Bayesian inference on dynamical models, using Markov Chain Monte Carlo (MCMC) as the model fitting algorithm. As part of this algorithm Chris has had to calculate a statistical likelihood function which itself involves a large sum over pairs of infected and susceptible individuals. He is currently using CUDA technology to accelerate this calculation and enable real-time inference, leading to timely forecasts for informing control decisions.

“Without CUDA technology, the MCMC is simply too slow to be of practical use during a disease outbreak,” he says. “With the 380x speedup over a single core non-vector CPU code, real-time forecasting is now a reality!”
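To give a feel for why the pairwise sum dominates the cost, here is a minimal Python sketch of a double loop over infected and susceptible individuals. It is not Chris's actual likelihood: the exponential distance kernel, the `beta` and `scale` parameters, and the coordinates are all hypothetical, chosen only to illustrate the shape of the computation.

```python
import math

def pairwise_pressure(infected, susceptible, beta=1.0, scale=1.0):
    """Sum a hypothetical distance-decay kernel over all (infected, susceptible) pairs."""
    total = 0.0
    for xi, yi in infected:
        for xj, yj in susceptible:
            distance = math.hypot(xi - xj, yi - yj)
            # Assumed kernel: contribution decays exponentially with distance.
            total += beta * math.exp(-distance / scale)
    return total

# Hypothetical coordinates in arbitrary units.
infected = [(0.0, 0.0), (3.0, 4.0)]
susceptible = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
print(pairwise_pressure(infected, susceptible))
```

The work grows with the product of the two population sizes, and every pair can be evaluated independently, which is the kind of structure that maps well onto a GPU.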

Figure 1: Predicting the risk of infection by the tick-borne cattle parasite Theileria orientalis (Ikeda) for uninfected farms. Source: Jewell, CP and Brown RG (2015) Bayesian data assimilation provides rapid decision support for vector-borne diseases. J. Roy. Soc. Interface 12:20150367.

Continue reading


GPU-Accelerated Cosmological Analysis on the Titan Supercomputer

Ever looked up in the sky and wondered where it all came from? Cosmologists are in the same boat, trying to understand how the Universe arrived at the structure we observe today. They use supercomputers to follow the fate of very small initial fluctuations in an otherwise uniform density. As time passes, gravity causes the small fluctuations to grow, eventually forming the complex structures that characterize the current Universe. The numerical simulations use tracer particles representing lumps of matter to carry out the calculations. The distribution of matter at early times is known only in a statistical sense so we can’t predict exactly where galaxies will show up in the sky. But quite accurate predictions can be made for how the galaxies are distributed in space, even with relatively simplified simulations.

Figure 1: Visualization of the Q Continuum simulation generated with the vl3 parallel volume rendering system using a point sprite technique. Image courtesy of Silvio Rizzi and Joe Insley, Argonne National Laboratory.

As observations become increasingly detailed, the simulations need to become more detailed as well, requiring huge numbers of simulation particles. Today, top-notch simulation codes like the Hardware/Hybrid Accelerated Cosmology Code (HACC) [1] can follow more than half a trillion particles in a volume of more than 1 Gpc³ (1 cubic gigaparsec, a cube with sides 3.26 billion light years long) on the largest GPU-accelerated supercomputers, like Titan at the Oak Ridge National Laboratory. The “Q Continuum” simulation [2] that Figure 1 shows is an example.

While simulating the Universe at this resolution is by itself a challenge, it is only half the job. The analysis of the simulation results is equally challenging. It turns out that GPUs can help with accelerating both the simulation and the analysis. In this post we’ll demonstrate how we use Thrust and CUDA to accelerate cosmological simulation, analysis, and visualization tasks, and how we’re generalizing this work into libraries and toolkits for scientific visualization.

An object of great interest to cosmologists is the so-called “halo”. A halo is a high-density region hosting galaxies and clusters of galaxies, depending on the halo mass. The task of finding billions of halos and determining their centers is computationally demanding. Continue reading


Getting Started with OpenACC

This week NVIDIA released the NVIDIA OpenACC Toolkit, a starting point for anyone interested in using OpenACC. OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing without significant programming effort. The toolkit includes the PGI OpenACC Compiler, the NVIDIA Visual Profiler with CPU and GPU profiling, and the new OpenACC Programming and Best Practices Guide. Academics can get a free renewable license to the PGI C, C++ and Fortran compilers with support for OpenACC.

Figure 1: LS-DALTON: Benchmark on Oak Ridge Titan Supercomputer, AMD CPU vs Tesla K20X GPU. Test input: Alanine-3 on CCSD(T) module. Additional information: NICAM COSMO

OpenACC is an open specification for compiler directives for parallel programming. By using OpenACC, developers can rapidly accelerate existing C, C++, and Fortran applications using high-level directives that help retain application portability across processor architectures. Figure 1 shows some examples of real code speedups with OpenACC. The OpenACC specification is designed and maintained with the cooperation of many industry and academic partners, such as Cray, AMD, PathScale, University of Houston, Oak Ridge National Laboratory and NVIDIA.

When I program with and teach OpenACC, I like to use a four-step cycle to progressively accelerate the code.

  1. Identify Parallelism: Profile the code to understand where the program is spending its time and how much parallelism is available to be accelerated in those important routines. GPUs excel when there’s a significant amount of parallelism to exploit, so look for loops and loop nests with a lot of independent iterations.
  2. Express Parallelism: Placing OpenACC directives on the loops identified in step 1 tells the compiler to parallelize them. OpenACC is all about giving the compiler enough information to effectively accelerate the code, so during this step I add directives to as many loops as I believe I can and move as much of the computation to the GPU as possible.
  3. Express Data Locality: The compiler needs to know not just what code to parallelize, but also which data will be needed on the accelerator by that code. After expressing available parallelism, I often find that the code has slowed down. As you’ll see later in this post, this slowdown comes from the compiler making cautious decisions about when data needs to be moved to the GPU for computation. During this step, I’ll express to the compiler my knowledge of when and how the data is really needed on the GPU.
  4. Optimize: The compiler usually does a very good job accelerating code, but sometimes you can get more performance by giving the compiler a little more information about the loops or by restructuring the code to increase parallelism or improve data access patterns. Most of the time this leads to small improvements, but sometimes gains can be bigger.

Continue reading


Introducing the NVIDIA OpenACC Toolkit

Programmability is crucial to accelerated computing, and NVIDIA’s CUDA Toolkit has been critical to the success of GPU computing. The CUDA Toolkit has been downloaded over 3 million times since its first launch. However, there are many scientists and researchers yet to benefit from GPU computing. These scientists have limited time to learn and apply a parallel programming language, and they often have huge existing code bases that must remain portable across platforms. Today NVIDIA is introducing the new OpenACC Toolkit to help these researchers and scientists achieve science and engineering goals faster.

Over the last few years OpenACC has established itself as a higher-level approach to GPU acceleration that is simple, powerful, and portable. The membership of the OpenACC organization has grown to include accelerator manufacturers, tools vendors, supercomputing centers and education institutions. The OpenACC 2.0 specification significantly expands the functionality and improves the portability of OpenACC and is now available in many commercial tools.

The NVIDIA OpenACC toolkit provides the tools and documentation that scientists and researchers need to be successful with OpenACC. The toolkit includes a free OpenACC compiler for university developers to remove any barriers for use by academics.

The new OpenACC Toolkit includes the following in a single package. Continue reading


New Features in CUDA 7.5

Today I’m happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger data sets and reduced memory bandwidth, cuSPARSE GEMVI routines, instruction-level profiling and more. Read on for full details.

16-bit Floating Point (FP16) Data

CUDA 7.5 expands support for 16-bit floating point (FP16) data storage and arithmetic, adding new half and half2 datatypes and intrinsic functions for operating on them. 16-bit “half-precision” floating point types are useful in applications that can process larger datasets or gain performance by choosing to store and operate on lower-precision data. Some large neural network models, for example, may be constrained by available GPU memory; and some signal processing kernels (such as FFTs) are bound by memory bandwidth.

Many applications can benefit by storing data in half precision, and processing it in 32-bit (single) precision. At GTC 2015 in March, NVIDIA CEO Jen-Hsun Huang announced that future Pascal architecture GPUs will include full support for such “mixed precision” computation, with FP16 (half) computation at higher throughput than FP32 (single) or FP64 (double).

With CUDA 7.5, applications can benefit by storing up to 2x larger models in GPU memory. Applications that are bottlenecked by memory bandwidth may get up to 2x speedup. And applications on Tegra X1 GPUs bottlenecked by FP32 computation may benefit from 2x faster computation on half2 data.
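The storage-versus-arithmetic split can be illustrated without a GPU. The following is a minimal Python sketch, not CUDA code: it uses the standard `struct` module's `e` (IEEE half-precision) format for storage and Python's built-in floats for arithmetic. Python floats are double precision, so they only stand in for the FP32 arithmetic described above, and the helper names are hypothetical.

```python
import struct

def pack_half(values):
    """Store values as IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)

def pack_single(values):
    """Store values as IEEE 754 single precision (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)

def dot_from_half(buf_a, buf_b):
    """Read half-precision storage, but multiply and accumulate at higher precision."""
    count = len(buf_a) // 2
    a = struct.unpack(f"<{count}e", buf_a)
    b = struct.unpack(f"<{count}e", buf_b)
    return sum(x * y for x, y in zip(a, b))

weights = [0.5, 1.5, -2.0, 1024.0]
inputs = [2.0, 4.0, 0.25, 0.5]

half_a, half_b = pack_half(weights), pack_half(inputs)
print(len(half_a), len(pack_single(weights)))  # half storage needs half the bytes
print(dot_from_half(half_a, half_b))

# Values that are not exactly representable are rounded when stored as half.
rounded = struct.unpack("<e", struct.pack("<e", 0.1))[0]
print(rounded)
```

Halving the bytes per element is where the "up to 2x larger models" and "up to 2x" bandwidth figures come from. The price is the rounding visible in the last line, which is why accumulating in a wider type is attractive.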

CUDA 7.5 provides 3 main FP16 features: Continue reading


Easy Multi-GPU Deep Learning with DIGITS 2

DIGITS is an interactive deep learning development tool for data scientists and researchers, designed for rapid development and deployment of an optimized deep neural network. NVIDIA introduced DIGITS in March 2015, and today we are excited to announce the release of DIGITS 2, which includes automatic multi-GPU scaling. Whether you are developing an optimized neural network for a single data set or training multiple networks on many data sets, DIGITS 2 makes it easier and faster to develop optimized networks in parallel with multiple GPUs.

Deep learning uses deep neural networks (DNNs) and large datasets to teach computers to detect recognizable concepts in data, to translate or understand natural languages, interpret information from input data, and more. Deep learning is being used in the research community and in industry to help solve many big data problems such as similarity searching, object detection, and localization. Practical examples include vehicle, pedestrian and landmark identification for driver assistance; image recognition; speech recognition; natural language processing; neural machine translation and mitosis detection.

This is a short sample clip promoting a 7 minute introduction to the DIGITS 2 deep learning training system. Watch the full-length video.

DNN Development and Deployment with DIGITS

Developing an optimized DNN is an iterative process. A data scientist may start from a popular network configuration such as “AlexNet” or create a custom network, and then iteratively modify it into a network that is well-suited for the training data. Once they have developed an effective network, data scientists can deploy it and use it on a variety of platforms, including servers or desktop computers as well as mobile and embedded devices such as Jetson TK1 or Drive PX. Figure 1 shows the overall process, broken down into two main phases: development and deployment.

Figure 1: Deep Learning Neural Network Development and Deployment Workflow Process
Continue reading


GPU Pro Tip: Fast Great-Circle Distance Calculation in CUDA C++

This post demonstrates the practical utility of CUDA’s sinpi() and cospi() functions in the context of distance calculations on Earth. With the advent of location-aware and geospatial applications and geographical information systems (GIS), these distance computations have become commonplace.

A great circle divides a sphere into two hemispheres. Image: Jhbdel at en.wikipedia [CC BY-SA 3.0], via Wikimedia Commons

Wikipedia defines a great circle as:

A great circle, also known as an orthodrome or Riemannian circle, of a sphere is the intersection of the sphere and a plane which passes through the center point of the sphere.

For almost any pair of points on the surface of a sphere, the shortest (surface) distance between these points is the path along the great circle between them. If you have ever flown from Europe to the west coast of North America and wondered why you passed over Greenland, your flight most likely followed a great circle path in order to conserve fuel. Continue reading
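As a rough illustration of such a distance calculation, here is a minimal Python sketch using the haversine formula, which is one common way to compute great-circle distance and not necessarily the formulation used in the full post. Python's `math` module has no `sinpi` or `cospi`, so the helpers below are hypothetical stand-ins that simply multiply by π and therefore lack the accuracy advantages of the CUDA functions. The mean Earth radius of 6371 km is also an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def sinpi(x):
    """Stand-in for CUDA's sinpi(): sin(pi * x), without its extra accuracy."""
    return math.sin(math.pi * x)

def cospi(x):
    """Stand-in for CUDA's cospi(): cos(pi * x), without its extra accuracy."""
    return math.cos(math.pi * x)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    # Dividing degrees by 180 yields the argument in units of pi radians,
    # which is exactly what sinpi/cospi expect.
    half_dlat = (lat2 - lat1) / 360.0
    half_dlon = (lon2 - lon1) / 360.0
    a = (sinpi(half_dlat) ** 2
         + cospi(lat1 / 180.0) * cospi(lat2 / 180.0) * sinpi(half_dlon) ** 2)
    # Clamp to guard against rounding pushing the square root above 1.
    return 2.0 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# Equator to North Pole: a quarter of a great circle.
print(haversine_km(0.0, 0.0, 90.0, 0.0))
```

Because latitude and longitude usually arrive in degrees, working in units of π avoids an explicit degrees-to-radians multiplication before every trigonometric call.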


MapD: Massive Throughput Database Queries with LLVM on GPUs

Note: this post was co-written by Alex Şuhan and Todd Mostak of MapD.

At MapD our goal is to build the world’s fastest big data analytics and visualization platform that enables lag-free interactive exploration of multi-billion row datasets. MapD supports standard SQL queries as well as a visualization API that maps OpenGL primitives onto SQL result sets.

Although MapD is fast running on x86-64 CPUs, our real advantage stems from our ability to leverage the massive parallelism and memory bandwidth of GPUs. The most powerful GPU currently available is the NVIDIA Tesla K80 Accelerator, with up to 8.74 teraflops of compute performance and nearly 500 GB/sec of memory bandwidth. By supporting up to eight of these cards per server we see orders-of-magnitude better performance on standard data analytics tasks, enabling a user to visually filter and aggregate billions of rows in tens of milliseconds, all without indexing. The following video shows the MapD dashboard animating 750 million tweets in real time. Nothing in this demo is pre-computed or canned. Our big data visual analytics platform is running on 8 NVIDIA Tesla K40 GPUs on a single server to power the dashboard.

Fast hardware is only half of the story, so at MapD we have invested heavily in optimizing our code so that a wide range of analytic workloads run optimally on GPUs. In particular, we have worked hard so that common SQL analytic operations, such as filtering (WHERE) and GROUP BY, run as fast as possible. One of the biggest payoffs in this regard has been moving from the query interpreter that we used in our prototype to a JIT (Just-In-Time) compilation framework built on LLVM. LLVM allows us to transform query plans into architecture-independent intermediate code (LLVM IR) and then use any of the LLVM architecture-specific “backends” to compile that IR code for the needed target, such as NVIDIA GPUs, x64 CPUs, and ARM CPUs.

Query compilation has the following advantages over an interpreter:

  1. Since it is inefficient to evaluate a query plan for a single row at a time (in one “dispatch”), an interpreter requires the use of extra buffers to store the intermediate results of evaluating an expression. For example, to evaluate the expression x*2+3, an interpreter-based query engine would first evaluate x*2 for a number of rows, storing that to an intermediate buffer. The intermediate results stored in that buffer would then be read and summed with 3 to get the final result. Writing and reading these intermediate results to memory wastes memory bandwidth and/or valuable cache space. Compare this to a compiled query which can simply store the result of the first subexpression (x*2) into a register before computing the final result, allowing the cache to be used for other purposes, for example to create the hash table necessary for a query’s GROUP BY clause. This is related to loop fusion and kernel fusion compiler optimizations. Continue reading
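The x*2+3 example can be sketched in a few lines of Python. This is only an illustration of the two evaluation strategies, not MapD's engine: the function names are hypothetical, Python lists stand in for memory buffers, and a local variable stands in for a register.

```python
def evaluate_interpreted(rows):
    """Operator-at-a-time evaluation: each subexpression fills a whole buffer."""
    # First pass: evaluate x*2 for every row and write it to an intermediate buffer.
    intermediate = [x * 2 for x in rows]
    # Second pass: read the buffer back and add 3 to get the final result.
    return [value + 3 for value in intermediate]

def evaluate_fused(rows):
    """Fused evaluation: the subexpression result never leaves a local variable."""
    result = []
    for x in rows:
        doubled = x * 2             # stays in a "register"
        result.append(doubled + 3)  # only the final value is written out
    return result

rows = [1, 2, 3, 4]
print(evaluate_interpreted(rows))
print(evaluate_fused(rows))
```

Both functions return the same values. The difference is that the first writes and re-reads a buffer as large as the input, while the second touches each row once, which is the memory traffic a compiled query avoids.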