Developing New Materials with GPU-Accelerated Supercomputers

Dr. Joshua A. Anderson is a Research Area Specialist at the University of Michigan who was an early user of GPU computing technology. He began his career developing software on the first CUDA-capable GPU and now runs simulations on one of the world’s most powerful supercomputers.

Anderson’s “contributions to the development and dissemination of the open source, GPU-enabled molecular simulation software, HOOMD-blue, which enables scientific computations with unprecedented speed” earned him the 2015 CoMSEF Young Investigator Award for Modeling and Simulation.

Brad Nemire: Can you talk a bit about your current research?

Joshua Anderson: I work with the Glotzer Group at the University of Michigan. We use computer simulation to discover the fundamental principles of how nanoscale systems of building blocks self-assemble, and to discover how to control the assembly process to engineer new materials. Specifically, we focus on the role of particle shape and how changing the shape can result in different material properties.

Figure 1: an example system configuration from the shape allophiles project: Eric S. Harper, Ryan Marson, Joshua A. Anderson, Greg van Anders, and Sharon C Glotzer. Shape Allophiles Improve Entropic Assembly. Soft Matter, 2015. (doi:10.1039/C5SM01351H).

Over the past few years, I have been focusing on two-dimensional systems, using large-scale simulations to study hexatic phase transitions for hard disks, and how patterning the surfaces of polygons can create shape allophiles that improve self-assembly. The hexatic phase is an intermediate between the fluid and the hexagonally ordered solid. In the hexatic phase, the orientation of bonds between particles has long-range order, but translational order is short-range and there is no crystal lattice. Shape allophiles are polygonal shapes cut so they fit together like puzzle pieces. These research projects are computationally demanding and could not have been run on any existing code. So before I could even begin the scientific research, I needed to develop, implement, and optimize the parallel algorithms necessary for these studies.
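One common way to quantify the bond-orientational order mentioned above is the sixfold order parameter, psi6 = (1/n) * sum over neighbors of exp(6i * theta), where theta is the angle of the bond to each neighbor. The following is a minimal pure-Python sketch of that textbook definition. It is not code from HOOMD-blue or the Glotzer Group, and the function name `psi6` and the hand-picked neighbor lists are assumptions made for illustration.

```python
import cmath
import math


def psi6(center, neighbors):
    """Return the sixfold bond-orientational order parameter for one particle.

    center    -- (x, y) position of the particle
    neighbors -- list of (x, y) positions of its nearest neighbors
    """
    total = 0j
    for nx, ny in neighbors:
        # Angle of the bond from the particle to this neighbor.
        theta = math.atan2(ny - center[1], nx - center[0])
        total += cmath.exp(6j * theta)
    return total / len(neighbors)


# Six neighbors on a regular hexagon: every bond contributes the same phase.
hexagon = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(6)]

# Four neighbors on a square: the phases cancel.
square = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]

hex_order = abs(psi6((0.0, 0.0), hexagon))
square_order = abs(psi6((0.0, 0.0), square))
```

A magnitude near 1 indicates local hexagonal bond order around the particle, and a magnitude near 0 indicates none. How far correlations of this quantity extend across the system is what separates the fluid, hexatic, and solid phases.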


Open, Reproducible Computational Chemistry with Python and CUDA

Increasingly, computational chemistry researchers use GPUs to push the boundaries of discovery. This motivated Christopher Cooper, an Instructor at Universidad Técnica Federico Santa María in Chile, to move to a Python-based software stack.

Cooper’s paper, “Probing protein orientation near charged nanosurfaces for simulation-assisted biosensor design,” was recently accepted by the Journal of Chemical Physics.

Brad: Can you talk a bit about your current research?

Christopher: I am interested in developing fast and accurate algorithms to study the effect of electrostatics in protein systems. We use continuum models to represent the solvent around the protein (water with salt) via the Poisson-Boltzmann equation, and solve it with an accelerated boundary element method. We call the resulting code PyGBe; it is open-source software under an MIT license and is available for download from the GitHub account of the research group at Boston University where I did my Ph.D.
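For readers unfamiliar with the model, a common textbook statement of the implicit-solvent problem is sketched below. It assumes the linearized form of the Poisson-Boltzmann equation, and the symbols are chosen here for illustration: potentials \phi_1 and \phi_2, permittivities \epsilon_1 and \epsilon_2, inverse Debye length \kappa, and point charges q_k at positions \mathbf{r}_k. The excerpt does not say which exact formulation PyGBe uses.

```latex
% Region 1: protein interior (point charges q_k at positions r_k)
\nabla^2 \phi_1(\mathbf{r}) = -\frac{1}{\epsilon_1} \sum_k q_k \, \delta(\mathbf{r} - \mathbf{r}_k)

% Region 2: solvent (water with salt), linearized Poisson-Boltzmann
\nabla^2 \phi_2(\mathbf{r}) = \kappa^2 \phi_2(\mathbf{r})

% Interface conditions on the molecular surface \Gamma
\phi_1 = \phi_2, \qquad
\epsilon_1 \frac{\partial \phi_1}{\partial \mathbf{n}} = \epsilon_2 \frac{\partial \phi_2}{\partial \mathbf{n}}
```

Because both regions are governed by linear equations with constant coefficients, the problem can be recast as integral equations on the surface alone, which is what makes a boundary element method a natural fit.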

Figure 1: Electrostatic potential around a peptide derived from an HIV-1 capsid.



Increasing the Luminosity of Beam Dynamics with GPUs

What is dark matter? We can neither see it nor detect it with any instrument. CERN is upgrading the LHC (Large Hadron Collider), the largest and most powerful particle accelerator ever built, to explore the new high-energy frontier.

The most technically challenging aspects of the upgrade cannot be done by CERN alone and require collaboration and external expertise. There are 7,000 scientists from over 60 countries working to extend the LHC discovery potential; the accelerator will need a major upgrade around 2020 to increase its luminosity by a factor of 10 beyond the original design value.

Ph.D. student Adrian Oeftiger attends EPFL (École Polytechnique Fédérale de Lausanne) in Switzerland, which is one of the High Luminosity LHC beneficiaries. His research group is parallelizing its algorithms to create software that enables new kinds of beam dynamics studies that have not been possible with current technology.

Brad: How is your research related to the upgrade of the LHC?

Adrian: My world is all about luminosity; increasing the luminosity of particle beams. It is all about making ultra-high-energy collisions of protons possible, and at the same time providing enough collisions to enable fundamental particle physics research. That means increasing the luminosity. I’m doing my Ph.D. in beam dynamics in the field of accelerator physics.

These days, high-energy particle accelerators are the tools of choice to analyze and understand the fundamental building blocks of our universe. The huge detectors at the Large Hadron Collider (LHC) at CERN, buried about a hundred meters underground in the countryside near Geneva, need ever-increasing collision rates (hence luminosity!): they gather statistics of collision events to explore new realms of physics, to detect extremely rare interaction combinations and the tiniest quantities of new particles, and to find explanations for some of the numerous wonders of the universe we live in. What is the dark matter that makes up 27% of our universe made of? Why is the symmetry between anti-matter and ordinary matter broken, and why do we find only the latter in the universe?

CERN is preparing for the High Luminosity LHC, a powerful upgrade of the present accelerator to increase the chances of answering some of these fundamental questions. Increasing the chances translates to: we need more collisions, so we need higher luminosity.


GPU-Accelerated Graph Analytics in Python with Numba

Numba is an open-source just-in-time (JIT) Python compiler that generates native machine code for x86 CPUs and CUDA GPUs from annotated Python code. (Mark Harris introduced Numba in the post “NumbaPro: High-Performance Python with CUDA Acceleration”.) Numba specializes in Python code that makes heavy use of NumPy arrays and loops. In addition to JIT compiling NumPy array code for the CPU or GPU, Numba exposes “CUDA Python”: the CUDA programming model for NVIDIA GPUs in Python syntax.

By speeding up Python, we extend its ability from a glue language to a complete programming environment that can execute numeric code efficiently.

From Prototype to Full Dataset with @cuda.jit

When doing exploratory programming, the interactivity of IPython Notebook and a comprehensive collection of scientific libraries (e.g. SciPy, Scikit-Learn, Theano, etc.) allow data scientists to process and visualize their data quickly. There are times when a fast implementation of what you need isn’t in a library, and you have to implement something new. Numba helps by letting you write pure Python code and run it with speed comparable to a compiled language, like C++. Your development cycle shortens when your prototype Python code can scale to process the full dataset in a reasonable amount of time.
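As a rough illustration of what "pure Python at compiled speed" looks like in practice, the sketch below decorates a loop-heavy function with Numba's `njit` decorator. The function `pi_midpoint` is a hypothetical example, not taken from the DkS application, and the `ImportError` fallback is an assumption added so that the snippet also runs (without any speed-up) where Numba is not installed.

```python
import math

try:
    from numba import njit  # Numba's nopython-mode JIT decorator
except ImportError:
    # Assumption for this sketch: fall back to plain Python when Numba is
    # not installed, so the example still runs (without any speed-up).
    def njit(func):
        return func


@njit
def pi_midpoint(n):
    """Estimate pi by integrating 4 / (1 + x^2) over [0, 1] with n midpoint slices."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        total += 4.0 / (1.0 + x * x)
    return total * h


estimate = pi_midpoint(100_000)
error = abs(estimate - math.pi)
```

With Numba present, the first call triggers compilation for the argument types it sees, and later calls with the same types reuse the compiled machine code. The body of the function is unchanged Python either way.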

Figure 1: The DkS result of the 2012 Web Data Commons pay-level domain hyperlink graph.

Working with Dr. Alex Dimakis and his team at UT Austin, we implemented their densest-k-subgraph (DkS) algorithm [1]. Our goal was to extract the densest domain from the 2012 Web Data Commons pay-level domain hyperlink graph using one NVIDIA Tesla K20 GPU accelerator. We developed the entire application using NumPy for array operations, Numba to JIT compile Python to CUDA, NumbaPro for GPU sorting and cuBLAS routines, and Bokeh for plotting the results.

GTC attendees learn from the brightest minds in accelerated computing with hundreds of talks and hands-on tutorials.

Learn GPU Computing with Hands-On Labs at GTC 2015

Every year NVIDIA’s GPU Technology Conference (GTC) gets bigger and better. One of the aims of GTC is to give developers, scientists, and practitioners opportunities to learn with hands-on labs how to use accelerated computing in their work. This year we are nearly doubling the amount of hands-on training compared with last year, with almost 2,400 lab hours available to GTC attendees!

We have two types of training this year at GTC: instructor-led labs and self-paced labs. And to help you keep up with one of the hottest trends in computing, this year we’re featuring a Deep Learning training track. Keep reading for details and, if you haven’t registered for GTC yet this year, for a discount code.

Deep Learning Track

There is an explosion of Deep Learning topics at GTC, and it’s not limited to the keynotes, talks and tutorial sessions. We’ll feature at least six hands-on labs related to accelerating facets of Deep Learning on GPUs. From an introduction to Deep Learning on GPUs to cutting-edge techniques and tools, there will be something for everyone. Be sure to get to these labs early to get yourself a seat! Here are a few of the labs available in this track:

  • Introduction to Machine Learning with GPUs: Handwritten digit classification (S5674)
  • DIY Deep Learning for Vision with Caffe (S5647)
  • Applied Deep Learning for Vision, Natural Language and Audio with Torch7 (S5574)
  • Deep Learning with the Theano Python Library (S5732)
  • Deep Belief Networks Using ArrayFire (S5722)
  • Accelerate a Machine Learning C++ example with Thrust (S5822)

Instructor-led Labs

Just like GTC last year, there will be twenty hands-on instructor-led labs. These are 80-minute labs led by an expert on the topic.


CUDACasts Episode #12: Programming GPUs using CUDA Python

So far in the CUDA Python mini-series on CUDACasts, I have introduced you to the @vectorize decorator and CUDA libraries, two different methods for accelerating code using NVIDIA GPUs. In today’s CUDACast, I’ll demonstrate how to use the NumbaPro compiler from Continuum Analytics to write CUDA Python code that runs on the GPU.

In CUDACast #12, we’ll continue using the Monte Carlo options pricing example, and I’ll show how to write the step function in CUDA Python rather than using the @vectorize decorator. In addition, by using the nvprof command-line profiler, we’ll see the speed-up we achieve by writing the code explicitly in CUDA.
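For context, a Monte Carlo options pricer repeatedly applies a step function that advances every simulated price path by one time step, which is the part a GPU kernel would take over. The sketch below is a plain CPU version using only the standard library. It assumes geometric Brownian motion and a European call payoff, and the names `step` and `price_european_call` and all parameter values are illustrative choices, not the code used in the video.

```python
import math
import random


def step(prices, dt, rate, volatility, rng):
    """Advance every simulated price by one time step of geometric Brownian motion."""
    drift = (rate - 0.5 * volatility ** 2) * dt
    scale = volatility * math.sqrt(dt)
    # One standard normal draw per path: this is the random-number workload.
    return [p * math.exp(drift + scale * rng.gauss(0.0, 1.0)) for p in prices]


def price_european_call(spot, strike, rate, volatility, maturity,
                        n_paths, n_steps, seed=0):
    """Estimate a European call price as the discounted mean payoff over all paths."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    prices = [spot] * n_paths
    for _ in range(n_steps):
        prices = step(prices, dt, rate, volatility, rng)
    mean_payoff = sum(max(p - strike, 0.0) for p in prices) / n_paths
    return math.exp(-rate * maturity) * mean_payoff


price = price_european_call(spot=100.0, strike=100.0, rate=0.05, volatility=0.2,
                            maturity=1.0, n_paths=20_000, n_steps=10, seed=42)
```

For these parameters the closed-form Black-Scholes value is about 10.45, so the estimate should land close to that, within sampling error. Every path is updated independently inside `step`, which is why the function maps naturally onto one GPU thread per path.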



CUDACasts Episode #11: GPU Libraries for CUDA Python

In the previous episode of CUDACasts I introduced you to NumbaPro, the high-performance Python compiler from Continuum Analytics, and demonstrated how to accelerate simple Python functions on the GPU. Continuing the Python theme, today’s CUDACast demonstrates NumbaPro’s support for CUDA libraries.

The optimized algorithms in GPU-accelerated libraries often provide the easiest way to accelerate applications. NumbaPro includes a Python API interface to the cuBLAS, cuFFT, and cuRAND libraries. In CUDACasts episode #11 I show you how to use cuRAND to accelerate random-number generation for a Python Monte Carlo options pricing example, achieving a 17x overall speed-up.



CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler

CUDA 5 added a powerful new tool to the CUDA Toolkit: nvprof. nvprof is a command-line profiler available for Linux, Windows, and OS X. At first glance, nvprof seems to be just a GUI-less version of the graphical profiling features available in the NVIDIA Visual Profiler and Nsight Eclipse Edition. But nvprof is much more than that; to me, nvprof is the lightweight profiler that reaches where other tools can’t.

Use nvprof for Quick Checks

I often find myself wondering if my CUDA application is running as I expect it to. Sometimes this is just a sanity check: is the app running kernels on the GPU at all? Is it performing excessive memory copies? By running my application with nvprof ./myApp, I can quickly see a summary of all the kernels and memory copies that it used, as shown in the following sample output.

    ==9261== Profiling application: ./tHogbomCleanHemi
    ==9261== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     58.73%  737.97ms      1000  737.97us  424.77us  1.1405ms  subtractPSFLoop_kernel(float const *, int, float*, int, int, int, int, int, int, int, float, float)
     38.39%  482.31ms      1001  481.83us  475.74us  492.16us  findPeakLoop_kernel(MaxCandidate*, float const *, int)
      1.87%  23.450ms         2  11.725ms  11.721ms  11.728ms  [CUDA memcpy HtoD]
      1.01%  12.715ms      1002  12.689us  2.1760us  10.502ms  [CUDA memcpy DtoH]

In its default summary mode, nvprof presents an overview of the GPU kernels and memory copies in your application. The summary groups all calls to the same kernel together, presenting the total time and percentage of the total application time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that let you see a complete list of all kernel launches and memory copies, and in the case of API-Trace mode, all CUDA API calls.
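The three modes differ only in the flag passed to nvprof: no flag for the summary, `--print-gpu-trace` for GPU-Trace, and `--print-api-trace` for API-Trace. The small wrapper below is a sketch of how one might drive them from a Python script. The helper names `nvprof_command` and `profile` and the mode keys are hypothetical, and `./myApp` stands in for your own executable.

```python
import shutil
import subprocess

# nvprof flags for the two trace modes; summary mode is the default and needs no flag.
MODE_FLAGS = {
    "summary": [],
    "gpu-trace": ["--print-gpu-trace"],
    "api-trace": ["--print-api-trace"],
}


def nvprof_command(app, mode="summary"):
    """Build the nvprof command line for the given application and mode."""
    return ["nvprof", *MODE_FLAGS[mode], app]


def profile(app, mode="summary"):
    """Run the application under nvprof if the profiler is on the PATH."""
    if shutil.which("nvprof") is None:
        return None  # nvprof not installed; nothing to run
    return subprocess.run(nvprof_command(app, mode), check=False)
```

Calling `profile("./myApp", "gpu-trace")` is then equivalent to typing `nvprof --print-gpu-trace ./myApp` in a shell.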


CUDACasts Episode #10: Accelerate Python on GPUs

This week’s CUDACast continues the Parallel Forall Python theme kicked off in last week’s post by Mark Harris, demonstrating exciting new support for CUDA acceleration in Python with NumbaPro. This video is the first in a 3-part series showing various ways to accelerate your Python code on NVIDIA GPUs.

Tomorrow you won’t want to miss the chance to learn about Python GPU acceleration with NumbaPro from its creators, in a GTC Express Webinar called “Pythonic Parallel Patterns for the GPU with NumbaPro” from Siu Kwan Lam, NumbaPro’s primary author at Continuum Analytics. Click the link to sign up now!



NumbaPro: High-Performance Python with CUDA Acceleration

Python is a high-productivity dynamic programming language that is widely used in science, engineering, and data analytics applications. There are a number of factors influencing the popularity of Python, including its clean and expressive syntax and standard data structures, comprehensive “batteries included” standard library, excellent documentation, broad ecosystem of libraries and tools, availability of professional support, and large and open community. Perhaps most important, though, is the high productivity enabled by a dynamically typed, interpreted language like Python. Python is nimble and flexible, making it a great language for quick prototyping, but also for building complete systems.

But Python’s greatest strength can also be its greatest weakness: its flexibility and dynamically typed, high-level syntax can result in poor performance for data- and computation-intensive programs. For this reason, Python programmers concerned about efficiency often rewrite their innermost loops in C and call the compiled C functions from Python. There are a number of projects aimed at making this optimization easier, such as Cython, but they often require learning a new syntax. Ideally, Python programmers would like to make their existing Python code faster without using another programming language, and, naturally, many would like to use accelerators to get even higher performance from their code.

NumbaPro: High Productivity for High-Performance Computing

In this post I’ll introduce you to NumbaPro, a Python compiler from Continuum Analytics that can compile Python code for execution on CUDA-capable GPUs or multicore CPUs. Since Python is not normally a compiled language, you might wonder why you would want a Python compiler. The answer is of course that running native, compiled code is many times faster than running dynamic, interpreted code. NumbaPro works by allowing you to specify type signatures for Python functions, which enables compilation at run time (this is “Just-in-Time”, or JIT compilation). NumbaPro’s ability to dynamically compile code means that you don’t give up the flexibility of Python. This is a huge step toward providing the ideal combination of high productivity programming and high-performance computing.
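To make the idea of a type signature concrete, the sketch below uses the `jit` decorator from the open-source Numba package as a stand-in for NumbaPro (an assumption; this excerpt does not show NumbaPro's own decorators). The function `hypotenuse` is a hypothetical example, and the `ImportError` fallback is added only so the snippet still runs as ordinary Python where Numba is not installed.

```python
import math

try:
    from numba import jit
except ImportError:
    # Assumption for this sketch: without Numba, `jit(...)` becomes a no-op
    # decorator factory so the example still runs as ordinary Python.
    def jit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator


# The signature string declares a float64 return type and two float64
# arguments, so compilation can happen as soon as the function is defined.
@jit("float64(float64, float64)", nopython=True)
def hypotenuse(a, b):
    return math.sqrt(a * a + b * b)


result = hypotenuse(3.0, 4.0)
```

The decorated function is still called like any other Python function. The signature simply tells the compiler which concrete types to generate machine code for.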