Open, Reproducible Computational Chemistry with Python and CUDA

High-performance computing is relatively new to computational chemistry, and researchers are now using GPUs to push the boundaries of discovery. This motivated Christopher Cooper, an Instructor at Universidad Técnica Federico Santa María in Chile, to move to a Python-based software stack.

Cooper’s paper, “Probing protein orientation near charged nanosurfaces for simulation-assisted biosensor design,” was recently accepted by the Journal of Chemical Physics.

Brad: Can you talk a bit about your current research?

Christopher: I am interested in developing fast and accurate algorithms to study the effect of electrostatics in protein systems. We use continuum models to represent the solvent around the protein (water with salt) via the Poisson-Boltzmann equation, and solve it with an accelerated boundary element method. We call the resulting code PyGBe, which is open-source software with an MIT license, and is available to download via the Github account of the research group where I did my Ph.D. at Boston University.

Figure 1: Electrostatic potential around a peptide derived from an HIV-1 capsid.

An important part of this effort has been related to open science and reproducibility. Not only is PyGBe open source, but the figures in our papers are available through Figshare under a Creative Commons CC-BY license. In the Figshare repository, we also have scripts that call PyGBe with the correct input parameters to reproduce the results of our paper with a single command. In fact, Professor Barba’s research group has made a consistent effort for reproducibility and open science for a few years now.

An interesting application we’ve been working on lately is label-free nanoscale biosensors. These are devices that detect antigens with very high sensitivity and selectivity. The antigen is sensed when it binds to a ligand that is adsorbed on the sensor, and that ligand needs to be properly oriented. We are using our model to find the conditions that yield favorable ligand orientations and, in that way, help the development of better sensors. This work has led to very interesting conversations with people outside our field, such as the experimentalists who fabricate these devices; those conversations have been fun and challenging.

B: What tools do you use to develop Python code for GPUs?

C: We develop Python applications for GPUs using PyCUDA. This is a very nice combination because Python makes it very easy to move data around and sort it such that we get the most out of the GPU in the computationally intensive parts.

Figure 2: A close-up view of the mesh for a lysozyme molecule. The colors represent the electrostatic potential on the molecular surface.

B: Why did you start looking at using NVIDIA GPUs?

C: I started working with NVIDIA GPUs over five years ago while a graduate student in Professor Lorena Barba’s group. We wanted to develop an efficient Poisson-Boltzmann solver that would be able to run on a local workstation using a boundary element method. It turns out that the most time-consuming part of the algorithm can be formulated as an N-body problem, which maps very well to the GPU architecture. Moreover, it was important for us to have good double-precision performance, which made NVIDIA GPUs the natural choice.

B: What is your GPU Computing experience and how have GPUs impacted your research?

C: I have been involved with CUDA in several ways. In 2011, I taught a month-long workshop on GPU computing with CUDA at Universidad Técnica Federico Santa María in Valparaíso, Chile (which was recently named a GPU Education Center). I also presented the early stages of my research at NVIDIA’s booth at the SC11 conference in Seattle, Washington.

GPUs and CUDA have had a huge impact on my research. For example, in the work on ligand orientation we sampled the free energy of antibodies near charged surfaces for nearly 1,000 different configurations – this would have taken over a week with the CPU version of the code, but we had the results in under a day!

B: Can you share some performance results?

C: The results in Figure 3 compare the time taken to calculate the solvation energy of a lysozyme molecule for a varying number of boundary elements, using one NVIDIA Tesla C2075 GPU and one Intel Xeon X5650 CPU.

Figure 3: Time taken to calculate the solvation energy of a lysozyme molecule for a varying number of boundary elements, using one GPU and one CPU.

The O(N log N) scaling is expected from the treecode algorithm, and you can see a 10x speedup of the GPU over the CPU.

B: How are you using GPUs?

C: We boil the matrix-vector multiplication down to an N-body problem, which we accelerate with a treecode algorithm. The treecode performs a direct N-body calculation for nearby interactions and approximates the far field. Each evaluation point interacts with the near- and far-field bodies independently, so we compute those interactions in parallel on the GPU.
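
To make that concrete, here is a minimal sketch (not PyGBe's actual kernel) of the kind of direct near-field interaction kernel one can write in CUDA C and compile at run time through PyCUDA: one thread per evaluation point, each looping over its nearby sources, with a bare 1/r kernel standing in for the real boundary element integrals.

// Illustrative direct near-field kernel: one thread per evaluation (target)
// point, accumulating 1/r contributions from a list of nearby source points.
// The far-field contribution comes separately from the treecode's multipole
// approximation.
__global__ void direct_near_field(const float *xt, const float *yt, const float *zt,
                                  const float *xs, const float *ys, const float *zs,
                                  const float *q, float *phi,
                                  int n_targets, int n_sources)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_targets) return;

    float acc = 0.0f;
    for (int j = 0; j < n_sources; ++j) {
        float dx = xt[i] - xs[j];
        float dy = yt[i] - ys[j];
        float dz = zt[i] - zs[j];
        float r2 = dx * dx + dy * dy + dz * dz + 1e-12f;  // small offset avoids r = 0
        acc += q[j] * rsqrtf(r2);                         // fast reciprocal square root
    }
    phi[i] += acc;
}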

B: What NVIDIA technologies are you using?

C: Besides the GPUs themselves, I take advantage of CUDA. One of the goals of the project was to develop a straightforward, desktop-level tool that would be easy to use for someone without a computer science background. This is why we did most of the development in Python and interfaced with CUDA through PyCUDA.

B: What types of challenges have you encountered in your research?

C: The software development was a big challenge. Even though the treecode algorithm maps very well to the GPU architecture, the implementation was not straightforward and we had to be very careful in order to get the most from the GPU.

Figure 4: A close-up view of the mesh for protein G B1 near a surface charged with 0.05 C/m^2. The colors represent the electrostatic potential on the molecular surface. The mesh on the bottom corresponds to the triangulation of the biosensor plate.

In our case, the key was to arrange the data so that points of our N-body problem that are close in space are also close in memory. That way, we take advantage of the memory bandwidth and make efficient use of shared memory.
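
PyGBe's actual ordering follows its tree construction, but as an illustrative sketch of the idea, one can assign each point a spatial key (for example, a Morton/Z-order code) and sort the per-point arrays by that key with Thrust, so that threads in the same block read contiguous memory:

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <stdint.h>

// Spread the bits of a 10-bit cell index so they can be interleaved.
__host__ __device__ inline uint32_t expand_bits(uint32_t v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code from quantized x, y, z cell indices (0..1023 each).
__host__ __device__ inline uint32_t morton3d(uint32_t x, uint32_t y, uint32_t z)
{
    return (expand_bits(x) << 2) | (expand_bits(y) << 1) | expand_bits(z);
}

// Sort a permutation index by the spatial keys (keys[i] = morton3d of point i's
// quantized coordinates). Applying the resulting permutation to every per-point
// array (coordinates, charges, ...) puts spatially close points close in memory.
void spatial_sort(thrust::device_vector<uint32_t>& keys,
                  thrust::device_vector<int>& perm)
{
    thrust::sequence(perm.begin(), perm.end());
    thrust::sort_by_key(keys.begin(), keys.end(), perm.begin());
}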

Rearranging the data gave us important speedups, but it was not the end of the story. We went through the CUDA code line by line to identify bottlenecks and find better ways to perform each operation. The CUDA fast math instructions were very useful at this point! It was an exhausting process, but it definitely paid off.

Another challenge was linking the computational work that we were doing to a real application where it could have an impact. Once we saw an opportunity to study biosensors with PyGBe, we had to educate ourselves on the topic quickly to understand specifically where our code would have the largest impact. In the end, some of the most interesting conversations I’ve had on this were with experimentalists who are not part of the computational world.

B: What's next?

C: There are many interesting lines of work that come out of this. On the biosensor application, we looked at the orientation of antibodies near the sensor surface. Even though antibodies are widely used as ligands, there is active research on engineering new ligands to enhance the sensitivity even further.

We plan to use PyGBe to study the orientation of these new ligands adsorbed on the sensor surface, which could help identify better ligands.

I think the future of numerical modeling in biophysics will be dictated by how realistic and fast our simulations can get. There are quite a few ways to do biomolecular modeling (Molecular Dynamics, Generalized Born, Poisson-Boltzmann, etc.), with various levels of detail. The questions are: how realistic do we need to be, and how long are we willing to wait for an answer?

The model in PyGBe has limitations, and we are interested in developing our code further toward more realistic simulations. In that context, memory constraints currently allow us to model only one or two biomolecules at a time (depending on their size); however, in real life, biomolecules interact with many other molecules immersed in the same solvent. How do they interact? What is the effect of this crowding? To answer these questions, we need to develop a large-scale, distributed-memory version of PyGBe. Perhaps it won't be as easy to use as the original version, but it will allow us to add realism to our simulations.

B: What impact do you think your research will have on this scientific domain?

C: The biosensor application that we have been working on is a nice example of this. Most developments in this domain are guided by experiments and rely on trial-and-error. These biosensors are nanoscale devices, which are difficult and expensive to manipulate. An analytical tool like PyGBe can support those efforts and perhaps limit the number of trials to obtain a satisfying result.

B: Where do you obtain technical information related to NVIDIA GPUs and CUDA?

C: NVIDIA GPUs and CUDA have been around for a while now so information is readily available online. A good source to keep up with the latest is CUDA Zone in the NVIDIA Developer Zone. Also, when I started coding in CUDA I found the book “Programming Massively Parallel Processors,” by David Kirk and Wen-mei Hwu very useful.

About Christopher Cooper

Christopher is an Academic Instructor (an entry-level faculty position) in the Mechanical Engineering Department at Universidad Técnica Federico Santa María, in Valparaíso, Chile. He earned a Ph.D. (2015) and M.S. (2012) in mechanical engineering from Boston University, and a B.S. (2007) and P.Eng. (2009) in mechanical engineering from Universidad Técnica Federico Santa María. His research focuses on developing fast and accurate numerical algorithms for the solution of boundary integral equations, with applications in biophysics and fluid dynamics. More specifically, he studies the electrostatics behind biomolecule solvation by solving the Poisson-Boltzmann equation with accelerated boundary element methods on graphics processing units (GPUs).

Customize CUDA Fortran Profiling with NVTX

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated by tools such as the NVIDIA Visual Profiler (NVVP) and Nsight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.

Two previous Pro Tip posts on Parallel Forall showed how to use NVTX in CUDA C++ and MPI codes. In this post, I’ll show how to use NVTX to annotate the profiles of Fortran codes (with either CUDA Fortran or OpenACC).

NVTX has a lot of features, but here I’ll focus on using it to annotate the profiler output with timeline markers using nvtxRangePush() and nvtxRangePop(). I’ll show you how to insert markers with custom labels and colors. Continue reading
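
The Fortran bindings described in the post ultimately call NVTX's C API; for reference, a minimal C++ sketch of those underlying calls looks like this (the range names and colors are arbitrary):

#include <nvToolsExt.h>   // link with -lnvToolsExt
#include <stdint.h>

// Push a named, colored range onto the NVTX stack; it appears as a bar on the
// profiler timeline until the matching nvtxRangePop().
static void push_range(const char *name, uint32_t argb)
{
    nvtxEventAttributes_t attrib = {};
    attrib.version       = NVTX_VERSION;
    attrib.size          = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
    attrib.colorType     = NVTX_COLOR_ARGB;
    attrib.color         = argb;
    attrib.messageType   = NVTX_MESSAGE_TYPE_ASCII;
    attrib.message.ascii = name;
    nvtxRangePushEx(&attrib);
}

int main()
{
    push_range("initialization", 0xFF00FF00);   // green bar in the timeline
    // ... set up data ...
    nvtxRangePop();

    push_range("solver", 0xFF0000FF);           // blue bar in the timeline
    // ... launch kernels ...
    nvtxRangePop();
    return 0;
}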


Simple, Portable Parallel C++ with Hemi 2 and CUDA 7.5

The last two releases of CUDA have added support for the powerful new features of C++. In the post The Power of C++11 in CUDA 7 I discussed the importance of C++11 for parallel programming on GPUs, and in the post New Features in CUDA 7.5 I introduced a new experimental feature in the NVCC CUDA C++ compiler: support for GPU Lambda expressions. Lambda expressions, introduced in C++11, provide concise syntax for anonymous functions (and closures) that can be defined in line with their use, can be passed as arguments, and can capture variables from surrounding scopes. GPU Lambdas bring that power and convenience to writing GPU functions, letting you launch parallel work on the GPU almost as easily as writing a for loop.
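
As a minimal sketch of that idea (assuming nvcc's experimental --expt-extended-lambda flag from CUDA 7.5; the kernel and variable names are illustrative), a device lambda defined in host code can be passed directly to a generic kernel template:

#include <cstdio>

// Generic kernel that applies a callable f to every index i < n.
template <typename F>
__global__ void for_each_kernel(int n, F f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) f(i);
}

int main()
{
    const int n = 1 << 20;
    float a = 2.0f, *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // SAXPY expressed as a GPU lambda passed straight to the kernel launch.
    for_each_kernel<<<(n + 255) / 256, 256>>>(n, [=] __device__ (int i) {
        y[i] = a * x[i] + y[i];
    });
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);   // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}

Hemi, introduced below, wraps essentially this pattern behind a portable parallel_for interface.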

In this post, I want to show you how modern C++ features combine to enable a higher-level, more portable approach to parallel programming for GPUs. To do so, I’ll show you Hemi 2, the second release of a simple open-source C++ library that I developed to explore approaches to portable parallel C++ programming. I have written before about Hemi on Parallel Forall, but Hemi 2 is easier to use, more portable, and more powerful.

Introducing Hemi 2

Hemi simplifies writing portable CUDA C/C++ code. With Hemi,

  • you can write parallel kernels like you write for loops—in line in your CPU code—and run them on your GPU;
  • you can launch C++ Lambda functions as GPU kernels;
  • you can easily write code that compiles and runs either on the CPU or GPU;
  • kernel launch configuration is automatic: details like thread block size and grid size become optimization parameters rather than requirements.

With Hemi, parallel code for the GPU can be as simple as the parallel_for loop in the following code, which can also be compiled and run on the CPU.

void saxpy(int n, float a, const float *x, float *y)
{
  hemi::parallel_for(0, n, [=] HEMI_LAMBDA (int i) {
    y[i] = a * x[i] + y[i];   // runs in parallel on the GPU, or sequentially on the CPU
  });
}

Hemi is BSD-licensed, open-source software, available on Github. Continue reading


Combine OpenACC and Unified Memory for Productivity and Performance

The post Getting Started with OpenACC covered four steps to progressively accelerate your code with OpenACC. It’s often necessary to use OpenACC directives to express both loop parallelism and data locality in order to get good performance with accelerators. After expressing available parallelism, excessive data movement generated by the compiler can be a bottleneck, and correcting this by adding data directives takes effort. Sometimes expressing proper data locality is more effort than expressing parallelism with loop directives.

Wouldn’t it be nice if programs could manage data locality automatically? Well, this is possible today with Unified Memory (on Kepler and newer GPU architectures). In this post I demonstrate how to combine OpenACC with Unified Memory to GPU-accelerate your existing applications with minimal effort. You can download the source code for the example in this post from the Parallel Forall GitHub repository.

Jacobi Iteration with Heap Memory

I’ll use the popular Jacobi iteration example code, which is representative of many real-world stencil computations. In contrast to the previous OpenACC post, I modified the data allocation to use heap memory instead of automatic stack-allocated arrays. This is a more common scenario for real applications, since real-world data arrays are often too large for stack memory. This change also makes it a more challenging case for OpenACC, since the compiler no longer knows the size of the arrays. The following excerpt shows the main loop of the Jacobi iteration with 2D index computation. Continue reading
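
The excerpt itself sits behind the link, but a representative sketch of such a sweep (an assumption of its shape, not the post's exact code) uses heap-allocated arrays with flattened 2D indexing and an OpenACC loop directive, leaving data movement to Unified Memory (for example via PGI's -ta=tesla:managed option):

#include <math.h>

// One Jacobi relaxation sweep over an n x m grid. A and Anew are heap-allocated
// arrays indexed as A[j*m + i]; with Unified Memory, no explicit OpenACC data
// directives are needed for them.
double jacobi_sweep(double *Anew, const double *A, int n, int m)
{
    double error = 0.0;
    #pragma acc parallel loop reduction(max:error)
    for (int j = 1; j < n - 1; ++j) {
        #pragma acc loop reduction(max:error)
        for (int i = 1; i < m - 1; ++i) {
            Anew[j*m + i] = 0.25 * (A[j*m + (i+1)] + A[j*m + (i-1)] +
                                    A[(j-1)*m + i] + A[(j+1)*m + i]);
            error = fmax(error, fabs(Anew[j*m + i] - A[j*m + i]));
        }
    }
    return error;
}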


Increasing the Luminosity of Beam Dynamics with GPUs

What is dark matter? We can neither see it nor detect it with any instrument. CERN is upgrading the LHC (Large Hadron Collider), the world's largest and most powerful particle accelerator, to explore the new high-energy frontier.

The most technically challenging aspects of the upgrade cannot be done by CERN alone and require collaboration and external expertise. There are 7,000 scientists from over 60 countries working to extend the LHC's discovery potential; the accelerator will need a major upgrade around 2020 to increase its luminosity by a factor of 10 beyond the original design value.

Ph.D. student Adrian Oeftiger attends EPFL (École Polytechnique Fédérale de Lausanne) in Switzerland, which is one of the High Luminosity LHC beneficiaries. His research group is working to parallelize their algorithms to create software that will enable new kinds of beam dynamics studies that have not been possible with current technology.

Brad: How is your research related to the upgrade of the LHC?

Adrian: My world is all about luminosity; increasing the luminosity of particle beams. It is all about making ultra-high-energy collisions of protons possible, and at the same time providing enough collisions to enable fundamental particle physics research. That means increasing the luminosity. I’m doing my Ph.D. in beam dynamics in the field of accelerator physics.

These days, high-energy particle accelerators are the tools of choice to analyze and understand the fundamental building blocks of our universe. The huge detectors at the Large Hadron Collider (LHC) at CERN, buried about a hundred meters underground in the countryside near Geneva, need ever-increasing collision rates (hence luminosity!): they gather statistics of collision events to explore new realms of physics, to detect extremely rare interaction combinations and the tiniest quantities of new particles, and to find explanations for some of the numerous wonders of the universe we live in. What is the dark matter that makes up 27% of our universe made of? Why is the symmetry between antimatter and ordinary matter broken, and why do we find only the latter in the universe?

CERN is preparing for the High Luminosity LHC, a powerful upgrade of the present accelerator to increase the chances to answer some of these fundamental questions. Increasing the chances translates to: we need more collisions, so we need higher luminosity. Continue reading


CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling

[Note: Thejaswi Rao also contributed to the code optimizations shown in this post.]

Today NVIDIA released CUDA 7.5, the latest release of the powerful CUDA Toolkit. One of the most exciting new features in CUDA 7.5 is new Instruction-Level Profiling support in the NVIDIA Visual Profiler. This powerful new feature, available on Maxwell (GM200) and later GPUs, helps pinpoint performance bottlenecks, letting you quickly identify the specific lines of source code (and assembly instructions) limiting the performance of GPU code, along with the underlying reason for execution stalls.

In this post, I demonstrate Instruction-Level Profiling by showing how it helped understand and improve the performance limitations of a CUDA kernel that implements the Iterative Closest Point algorithm (the original source code, by Thomas Whelan, is available on Github). I’ll show how instruction-level profiling makes it easier to apply advanced optimizations, helping speed up the example kernel by 2.7X on an NVIDIA Quadro M6000 GPU.

Profiling the kernel using the Guided Analysis feature of the Visual Profiler showed that the kernel performance was bound by instruction and memory latency. Latency issues indicate that the hardware resources are not used efficiently since most warps are stalled by a dependency on a data value from a previous math or memory instruction. Figure 1 shows that the compute units are only 40% utilized and memory units are around 25% utilized, so there is definitely room for improvement.

Figure 1: Kernel performance limiter (bound by instruction and memory latency).

Stall Analysis in Previous Profiler Versions

Before CUDA 7.5, the Visual Profiler was only capable of pointing out performance issues at the application or CUDA kernel level. For stall latency analysis, the CUDA 7.0 Visual Profiler produces the pie chart in Figure 2 by collecting various stall reason events for the entire kernel.

Figure 2: Legacy (CUDA 7.0) pie chart for stall reasons (generated using events collected at the kernel level).

This pie chart shows that the two primary stall reasons in this kernel are synchronization and memory dependencies. But if I look into the kernel code, there are lots of memory accesses and __syncthreads() calls, so this high-level analysis doesn’t provide any specific insight into which instructions are potential bottlenecks. In general it can be very difficult to find exact bottleneck causes in complex kernels using kernel-level profiling analysis. This is where CUDA 7.5 can help, as you’ll see. Continue reading


Mocha.jl: Deep Learning for Julia

Deep learning is becoming extremely popular due to several breakthroughs in well-known artificial intelligence tasks. For example, at the ImageNet Large Scale Visual Recognition Challenge, the introduction of deep learning algorithms into the challenge reduced the top-5 error by 10% in 2012. Deep learning models have dominated the challenge every year since, significantly reducing the top-5 error rate each time (see Figure 1). In 2015, researchers trained very deep networks (for example, the Google “inception” model has 27 layers) that surpass human performance.

Figure 1: The top-5 error rate in the ImageNet Large Scale Visual Recognition Challenge has been falling rapidly since the introduction of deep neural networks in 2012.

Moreover, at this year’s Computer Vision and Pattern Recognition (CVPR) conference, deep neural networks (DNNs) were being adapted to increasingly complicated tasks. For example, in semantic segmentation, instead of predicting a single category for a whole image, a DNN is trained to classify each pixel in the image, essentially producing a semantic map indicating every object and its shape and location in the given image (see Figure 2).

Figure 2: An example of generating a semantic map by classifying each pixel in a source image. Source: Shuai Zheng et al. Conditional Random Fields as Recurrent Neural Networks.

Continue reading


ATCOM: Real-Time Enhancement of Long-Range Imagery

Imaging over long distances is important for many defense and commercial applications. High-end ground-to-ground, air-to-ground, and ground-to-air systems now routinely image objects several kilometers to several dozen kilometers away; however, this increased range comes at a price. In many scenarios, the limiting factor becomes not the quality of your camera but the atmosphere through which the light travels to the camera. Dynamic changes in the atmospheric density between the camera and object impart time-variant distortions resulting in loss of contrast and detail in the collected imagery (see Figure 1 and Figure 2).

Figure 1: Wavefront distortion by the atmosphere causes the image captured to be blurred. The effect becomes more pronounced when imaging over long distances or through more turbulent atmospheric conditions.

Several approaches have been developed to combat this effect; they can be roughly divided into two categories: hardware-based approaches and signal-processing approaches. The primary hardware technique is adaptive optics (AO), an approach favored by the astronomical community to observe stellar phenomena. AO techniques generally employ a deformable mirror to correct the incoming wavefront before it is captured on the sensor. While this provides the ability to improve imagery to account for distortions, the equipment required is fragile and expensive, and is therefore not suitable for many applications. In contrast, signal-processing techniques are limited only by the computational hardware they run on. In our case, we have leveraged the processing power of GPUs to achieve the performance necessary for real-time processing of high-definition video. Thanks to modern GPUs, we are now able to process live 720p video streams at over 30 fps, as the video below shows.

Continue reading


Harnessing the Caffe Framework for Deep Visualization

Jeff Clune, Lab Director, Evolving Artificial Intelligence Laboratory at the University of Wyoming.

The need to train their deep neural networks as fast as possible led the Evolving Artificial Intelligence Laboratory at the University of Wyoming to harness the power of NVIDIA Tesla GPUs starting in 2012 to accelerate their research.

“The speedups GPUs provide for training deep neural networks are well-documented and allow us to train models in a week that would otherwise take months,” said Jeff Clune, Assistant Professor, Computer Science Department and Director of the Evolving Artificial Intelligence Laboratory. “And algorithms continuously improve. Recently, NVIDIA’s cuDNN technology allowed us to speed up our training time by an extra 20% or so.”

Clune’s lab, which focuses on evolving artificial intelligence with a major emphasis on large-scale, structurally organized neural networks, has garnered press from some of the largest media outlets, including the BBC, National Geographic, NBC News, and The Atlantic, and its work was featured on the cover of Nature in May 2015.

[The following video shows off work from the Evolving AI Lab on visualizing deep neural networks. Keep reading to learn more about this work!]

For this Spotlight interview, I had the opportunity to talk with Jeff Clune and two of his collaborators: Anh Nguyen, a Ph.D. student at the Evolving AI Lab, and Jason Yosinski, a Ph.D. candidate at Cornell University.

Brad: How are you using deep neural networks (DNNs)?

We have many research projects involving deep neural networks. Our Deep Learning publications to date involve better understanding DNNs. Our lab’s research covers: Continue reading


High-Performance MATLAB with GPU Acceleration

In this post, I will discuss techniques you can use to maximize the performance of your GPU-accelerated MATLAB® code. First I explain how to write MATLAB code that is inherently parallelizable. This technique, known as vectorization, benefits all your code whether or not it uses the GPU. Then I present a family of function wrappers (bsxfun, pagefun, and arrayfun) that take advantage of GPU hardware yet require no specialist parallel programming skills. The most advanced function, arrayfun, allows you to write your own custom kernels in the MATLAB language.

If these techniques do not provide the performance or flexibility you were after, you can still write custom CUDA code in C or C++ that you can run from MATLAB, as discussed in our earlier Parallel Forall posts on MATLAB CUDA Kernels and MEX functions.

All of the features described here are available out of the box with MATLAB and Parallel Computing Toolbox™.

Mobile phone signal strength example

Throughout this post, I will use an example to help illustrate the techniques. A cellular phone network wants to map its coverage to help plan for new antenna installations. We imagine an idealized scenario with M = 25 cellphone masts, each H = 20 meters in height, evenly spaced on an undulating 10km x 10km terrain. Figure 1 shows what the map looks like.

Figure 1: 3D plot of the map data and antenna positions.

In the following listing, we define a number of variables on the GPU, including:

  • map: An N x 3 height field in a 10km x 10km grid (N = 10,000);
  • masts: An M x 3 array of antenna positions, at height H;
  • AntennaDirection: A 3 x M array of vectors representing the orientation of each antenna.

Continue reading