James McClure, a Computational Scientist with Advanced Research Computing at Virginia Tech, shares how his group uses the NVIDIA Tesla GPU-accelerated Titan Supercomputer at Oak Ridge National Laboratory to combine mathematical models with 3D visualization to provide insight into how fluids move below the surface of the earth.
McClure spoke with us about his research at the 2015 Supercomputing Conference.
Brad Nemire: Can you talk a bit about your current research?
James McClure: Digital Rock Physics is a relatively new computational discipline that relies on high-performance computing to study the behavior of fluids within rock and other geologic materials. Understanding how fluids move within rock is essential for applications like geologic carbon sequestration, oil and gas recovery, and environmental contaminant transport. New technologies such as synchrotron-based x-ray micro-computed tomography enable the collection of 3D images that reveal the structure of rocks at the micron scale. Using these images, we can make predictions about the complex rock-fluid interactions that take place within natural systems.
Dr. Joshua A. Anderson is a Research Area Specialist at the University of Michigan who was an early user of GPU computing technology. He began his career developing software on the first CUDA-capable GPU and now runs simulations on one of the world’s most powerful supercomputers.
Brad Nemire: Can you talk a bit about your current research?
Joshua Anderson: I work with the Glotzer Group at the University of Michigan. We use computer simulation to discover the fundamental principles of how nanoscale systems of building blocks self-assemble, and to discover how to control the assembly process to engineer new materials. Specifically, we focus on the role of particle shape and how changing the shape can result in different material properties.
Over the past few years, I have been focusing on two-dimensional systems, using large-scale simulations to study hexatic phase transitions for hard disks, and how patterning surfaces of polygons can create shape allophiles that improve self-assembly. The hexatic phase is an intermediate between the fluid and the hexagonally ordered solid. In the hexatic phase, the orientation of bonds between particles has long-range order, but translational order is short-range and there is no crystal lattice. Shape allophiles are polygonal shapes cut so they fit together like puzzle pieces. These research projects are computationally demanding and could not have been run on any existing code. So before I could even begin the science research, I needed to develop, implement, and optimize the parallel algorithms necessary for these studies.
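The bond-orientational order mentioned above is commonly quantified with the sixfold order parameter, the average of exp(6iθ) over the bonds from a particle to its neighbors. The following is a minimal sketch of that standard definition, not the code used in the research; the function name `psi6` and the assumption that a neighbor list is already available are illustrative.

```python
import cmath
import math


def psi6(center, neighbors):
    """Sixfold bond-orientational order parameter for one 2D particle.

    `center` is an (x, y) pair and `neighbors` is a non-empty list of
    (x, y) pairs. How neighbors are chosen is left to the caller
    (an assumption of this sketch).
    """
    cx, cy = center
    total = 0j
    for nx, ny in neighbors:
        theta = math.atan2(ny - cy, nx - cx)  # angle of the bond
        total += cmath.exp(6j * theta)        # sixfold-symmetric phase
    return total / len(neighbors)


# Six neighbors on a regular hexagon: perfectly sixfold-ordered bonds.
hexagon = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3))
           for k in range(6)]
# Four neighbors on a square: no sixfold order.
square = [(1, 0), (0, 1), (-1, 0), (0, -1)]

print(abs(psi6((0, 0), hexagon)))
print(abs(psi6((0, 0), square)))
```

The magnitude is close to 1 for hexagonally arranged neighbors and close to 0 for the square arrangement. Roughly speaking, long-range orientational order means these local values stay correlated between particles that are far apart.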
Leyuan Wang, a Ph.D. student in the UC Davis Department of Computer Science, presented one of only two “Distinguished Papers” of the 51 accepted at Euro-Par 2015. Euro-Par is a European conference devoted to all aspects of parallel and distributed processing, held August 24-28 at Austria’s Vienna University of Technology.
Leyuan’s paper Fast Parallel Suffix Array on the GPU, co-authored by her advisor John Owens and Sean Baxter, a research scientist at New York’s DE Shaw Research, details their efforts to implement a linear-time suffix array construction algorithm on NVIDIA GPUs, resulting in algorithmic improvements and significant speedups over the existing state of the art.
Wang completed her master’s degree in electrical and computer engineering at UC Davis in October 2014, after having earned her undergraduate degree in electronics science and technology at China’s Zhejiang University.
Brad: Can you talk a bit about your current research?
Leyuan Wang: I work on high-performance string processing and graph processing algorithms, mostly in string and graph queries. My current research focus is on GPGPU (general-purpose computing on graphics processing units) and the benchmark I care about most is speed. I’ve been working on designing and improving parallel suffix array construction algorithms (SACAs) and incorporating the implementations in a Burrows-Wheeler transform-based lossless data compression scheme (bzip2) and a parallel FM index for pattern searching. The suffix array (SA) of a string lists the starting positions of all suffixes of the string in lexicographically sorted order. The inverse suffix array (ISA) gives the lexicographic rank of each suffix.
The Burrows-Wheeler transform (BWT) of a string is generated by lexicographically sorting the cyclic shifts of the string to form a string matrix and taking the last column of the matrix. The BWT groups repeated characters together by permuting the string; it is also reversible, which means the original string can be recovered. These two characteristics make BWT a popular choice for a compression pipeline stage (for instance, bzip2). It is directly related to the suffix array: the sorted rows in the matrix are essentially the sorted suffixes of the string and the first column of the matrix reflects a suffix array. Table 1 shows an example of the SA, ISA and BWT of the input string “banana$”.
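To make these definitions concrete, here is a small sequential sketch that computes the SA, ISA and BWT of “banana$” by direct sorting, and then inverts the BWT to show it is reversible. It is a naive illustration of the definitions, not the GPU algorithm discussed in the interview; the function names are made up for this example, and it assumes the string ends with a unique sentinel `$` that sorts before every other character.

```python
def suffix_array(s):
    """Start positions of all suffixes of s, in lexicographic order (naive)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def inverse_suffix_array(sa):
    """isa[i] is the lexicographic rank of the suffix starting at i."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def bwt_from_sa(s, sa):
    """Last column of the sorted rotation matrix: the character that
    precedes each sorted suffix (wrapping around for position 0)."""
    return "".join(s[i - 1] for i in sa)


def inverse_bwt(bwt, sentinel="$"):
    """Recover the original string by rebuilding the sorted rotations."""
    table = [""] * len(bwt)
    for _ in range(len(bwt)):
        table = sorted(c + row for c, row in zip(bwt, table))
    return next(row for row in table if row.endswith(sentinel))


text = "banana$"
sa = suffix_array(text)
isa = inverse_suffix_array(sa)
bwt = bwt_from_sa(text, sa)

print(sa)                # [6, 5, 3, 1, 0, 4, 2]
print(isa)               # [4, 3, 6, 2, 5, 1, 0]
print(bwt)               # annb$aa
print(inverse_bwt(bwt))  # banana$
```

Note how the BWT output places the repeated characters next to each other, which is what makes it useful ahead of a compression stage.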
The suffix array data structure is a building block in a spectrum of applications, including data compression, bioinformatics, and text indexing. I’ve studied the taxonomy of all classes of SACAs and compared them in order to find the best candidate for the GPU. I revisited the previous conclusion that skew SACAs are best suited to the GPU by demonstrating that prefix-doubling SACAs are actually better in both theoretical analysis and experimental benchmarks. Our hybrid skew/prefix-doubling suffix array implementation (with our amazing research collaborator Sean Baxter, formerly of NVIDIA Research) using a Tesla K20 achieves a 7.9x speedup against the previous state-of-the-art skew implementation. Our optimized skew SACA implementation has been added as a primitive to CUDPP 2.2 (CUDA Data Parallel Primitives Library) and incorporated into the BWT and bzip2 data compression application, resulting in great speedups compared with bzip2 in CUDPP 2.1. Figure 1 shows pseudocode for our two approaches.
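For readers unfamiliar with prefix doubling, the sketch below shows the general idea in sequential form: suffixes are ranked by their first character, and each round sorts by a pair of existing ranks to double the compared prefix length until all ranks are distinct. This is a textbook-style illustration under that assumption, not the hybrid GPU implementation described above; the function name is hypothetical.

```python
def suffix_array_prefix_doubling(s):
    """Build a suffix array by repeatedly doubling the sorted prefix length."""
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]  # initial ranks: order by first character
    k = 1
    while True:
        # Sort key: (rank of first k chars, rank of the next k chars).
        # -1 marks "past the end", which sorts before any real rank.
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)

        # Re-rank: equal keys share a rank, different keys get the next one.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank

        # Stop once every suffix has a unique rank.
        if n == 0 or rank[sa[-1]] == n - 1:
            return sa
        k *= 2


print(suffix_array_prefix_doubling("banana$"))  # [6, 5, 3, 1, 0, 4, 2]
```

Each round is dominated by a sort over fixed-size keys, which is the kind of step that maps naturally onto data-parallel sorting primitives.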
Increasingly, computational chemistry researchers use GPUs to push the boundaries of discovery. This motivated Christopher Cooper, an Instructor at Universidad Técnica Federico Santa María in Chile, to move to a Python-based software stack.
Brad: Can you talk a bit about your current research?
Christopher: I am interested in developing fast and accurate algorithms to study the effect of electrostatics in protein systems. We use continuum models to represent the solvent around the protein (water with salt) via the Poisson-Boltzmann equation, and solve it with an accelerated boundary element method. We call the resulting code PyGBe, which is open-source software with an MIT license, and is available to download via the GitHub account of the research group where I did my Ph.D. at Boston University.
What is dark matter? We can neither see it nor detect it with any instrument. CERN is upgrading the LHC (Large Hadron Collider), the largest and most powerful particle accelerator ever built, to explore the new high-energy frontier.
The most technically challenging aspects of the upgrade cannot be done by CERN alone and require collaboration and external expertise. There are 7,000 scientists from over 60 countries working to extend the LHC discovery potential; the accelerator will need a major upgrade around 2020 to increase its luminosity by a factor of 10 beyond the original design value.
Ph.D. student Adrian Oeftiger attends EPFL (École Polytechnique Fédérale de Lausanne) in Switzerland, which is one of the High Luminosity LHC beneficiaries. His research group is working to parallelize their algorithms to create software that will enable new kinds of beam dynamics studies that have not been possible with the current technology.
Brad: How is your research related to the upgrade of the LHC?
Adrian: My world is all about luminosity; increasing the luminosity of particle beams. It is all about making ultra-high-energy collisions of protons possible, and at the same time providing enough collisions to enable fundamental particle physics research. That means increasing the luminosity. I’m doing my Ph.D. in beam dynamics in the field of accelerator physics.
These days, high-energy particle accelerators are the tools of choice to analyze and understand the fundamental building blocks of our universe. The huge detectors at the Large Hadron Collider (LHC) at CERN, buried about a hundred meters underground in the countryside near Geneva, need ever-increasing collision rates (hence luminosity!): they gather statistics of collision events to explore new realms of physics, to detect extremely rare interaction combinations and the tiniest quantities of new particles, and to find explanations for some of the numerous wonders of the universe we live in. What is the dark matter which makes up 27% of our universe made of? Why is the symmetry between anti-matter and ordinary matter broken, and why do we find only the latter in the universe?
CERN is preparing for the High Luminosity LHC, a powerful upgrade of the present accelerator to increase the chances of answering some of these fundamental questions. Increasing the chances translates to: we need more collisions, so we need higher luminosity.
The need to train their deep neural networks as fast as possible led the Evolving Artificial Intelligence Laboratory at the University of Wyoming to harness the power of NVIDIA Tesla GPUs starting in 2012 to accelerate their research.
“The speedups GPUs provide for training deep neural networks are well-documented and allow us to train models in a week that would otherwise take months,” said Jeff Clune, Assistant Professor, Computer Science Department and Director of the Evolving Artificial Intelligence Laboratory. “And algorithms continuously improve. Recently, NVIDIA’s cuDNN technology allowed us to speed up our training time by an extra 20% or so.”
Clune’s lab, which focuses on evolving artificial intelligence with a major focus on large-scale, structurally organized neural networks, has garnered press from some of the largest media outlets, including BBC, National Geographic, NBC News, and The Atlantic, and was featured on the cover of Nature in May 2015.
[The following video shows off work from the Evolving AI Lab on visualizing deep neural networks. Keep reading to learn more about this work!]
For this Spotlight interview, I had the opportunity to talk with Jeff Clune and two of his collaborators: Anh Nguyen, a Ph.D. student at the Evolving AI Lab, and Jason Yosinski, a Ph.D. candidate at Cornell University.
Brad: How are you using deep neural networks (DNNs)?
We have many research projects involving deep neural networks. Our Deep Learning publications to date involve better understanding DNNs. Our lab’s research covers:
For this interview, I reached out to Janus Juul Eriksen, a Ph.D. fellow at Aarhus University in Denmark. Janus is a chemist by trade without any formal education in computer science, but he is getting up to a 12x speedup compared to his CPU-only code after modifying fewer than 100 lines of code with one week of programming effort.
OpenACC is a simple, powerful and portable approach for researchers and scientists who need to rapidly boost application performance for faster science while minimizing programming. With OpenACC, the original source code is kept intact, making the implementation intuitively transparent and leaving most of the hard work to the compiler.
NVIDIA recently announced the new OpenACC Toolkit, an all-in-one suite of parallel programming tools that helps researchers and scientists quickly accelerate applications.
“OpenACC is much easier to learn than OpenMP or MPI. It makes GPU computing approachable for domain scientists,” says Janus. “Our initial OpenACC implementation required only minor efforts, and more importantly, no modifications of our existing CPU implementation.”
Janus is part of the research team developing the quantum chemistry code LSDalton, a massively parallel and linear-scaling program for the accurate determination of energies and other molecular properties for large molecular systems.
In need of speed, the LSDalton team was awarded an INCITE allocation, which gave them access to Oak Ridge National Laboratory’s Titan supercomputer. With this, they needed to find a way to use the power of the supercomputer: enter OpenACC. Having demonstrated success on Titan with their GPU-accelerated code, they were recently one of 13 application code projects selected to join the Center for Accelerated Application Readiness (CAAR) program. This means they will be among the first applications to run on Summit, the new supercomputer debuting in 2018, which will deliver more than five times the computational performance of Titan’s 18,688 nodes.
This access will enable the research team to simulate larger molecular structures at higher accuracy, ultimately accelerating discoveries in materials and quantum chemistry.
Originally trained as a veterinary surgeon, Chris Jewell, a Senior Lecturer in Epidemiology at Lancaster Medical School in the UK, became interested in epidemics through his experience working on the foot-and-mouth disease outbreak in the UK in 2001. His work so far has been on livestock epidemics such as foot-and-mouth disease, theileriosis, and avian influenza with government organizations in the UK, New Zealand, Australia, and the US. Recently, he has refocused his efforts on the human field, where populations and epidemics tend to be larger and therefore need more computing grunt.
Epidemic forecasting centers on Bayesian inference on dynamical models, using Markov Chain Monte Carlo (MCMC) as the model fitting algorithm. As part of this algorithm, Chris has had to calculate a statistical likelihood function, which itself involves a large sum over pairs of infected and susceptible individuals. He is currently using CUDA technology to accelerate this calculation and enable real-time inference, leading to timely forecasts for informing control decisions.
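To give a feel for why the pairwise sum is the bottleneck, the sketch below evaluates a sum over every (infected, susceptible) pair. Everything specific here is an assumption for illustration only: the exponential distance kernel, the parameter names `beta` and `scale`, and the use of 2D coordinates are hypothetical and are not taken from Chris's model.

```python
import math


def pairwise_pressure(infected, susceptible, beta=0.5, scale=1.0):
    """Sum a hypothetical distance kernel over all infected/susceptible pairs.

    `infected` and `susceptible` are lists of (x, y) locations.
    The kernel beta * exp(-distance / scale) is an assumed stand-in.
    """
    total = 0.0
    for xi, yi in infected:
        for xj, yj in susceptible:
            distance = math.hypot(xi - xj, yi - yj)
            total += beta * math.exp(-distance / scale)
    return total


infected = [(0.0, 0.0), (1.0, 0.0)]
susceptible = [(0.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
print(pairwise_pressure(infected, susceptible))
```

The work grows with the product of the two population sizes, and an MCMC run re-evaluates a sum like this at every iteration. Because each pair's term is independent of the others, the double loop is a natural candidate for splitting across many GPU threads.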
“Without CUDA technology, the MCMC is simply too slow to be of practical use during a disease outbreak,” he says. “With the 380x speedup over a single core non-vector CPU code, real-time forecasting is now a reality!”
Tell us about your research at The University of Arizona.
We are working on developing a tool that can automatically identify various geological processes on the surface of Mars. Examples of geological processes include impact cratering and volcanic activity; however, these processes can generate landforms that look very similar, even though they form via vastly different mechanisms. For example, small impact craters and volcanic craters can be easily confused because they can both exhibit a prominent rim surrounding a central topographic depression.
Of particular interest to our research group is the automated mapping of volcanic rootless cones as Figure 2 shows. These landforms are generated by explosive interactions between lava and ground ice, and therefore mapping the global distribution of rootless cones on Mars would contribute to a better understanding of the distribution of near-surface water on the planet. However, to do this we must first develop algorithms that can correctly distinguish between landforms of similar appearance. This is a difficult task for planetary geologists, but we are already having great success by applying state-of-the-art artificial neural networks to data acquired by the High Resolution Imaging Science Experiment (HiRISE) camera, which is onboard the Mars Reconnaissance Orbiter (MRO) satellite.
As you are probably aware, CUDA 7 was officially released during the 2015 GPU Technology Conference. For this Spotlight I took a few minutes to pick the brain of an early adopter of CUDA 7 to see how his work benefits from the new C++11 support.
I interviewed Yu-Hang Tang, a Ph.D. candidate in the Division of Applied Mathematics at Brown University in Providence, Rhode Island.
What breakthrough project is currently taking up all of your brain’s time?
At this moment we are finalizing a particle-based simulator for the in silico investigation of microfluidic devices used in cancer diagnostics. The code enables us to predict the behavior of cancer cells as well as blood cells in various microfluidic channels. It could significantly speed up the process of microfluidic device design, which is usually time-consuming due to the large number of trial-and-error experiments.
We will release the work by the end of April and I will be happy to share more details then.
Tell me a bit about your GPU Computing background.
I started programming GeForce GTX 460 GPUs with OpenCL in 2010, and in 2012 I shifted entirely to CUDA C++.
Right now, I use mostly Kepler GPUs with high double-precision floating-point performance. I have been focused on accelerating particle-based simulations including All-Atom Molecular Dynamics (AAMD), Dissipative Particle Dynamics (DPD) and Smoothed Particle Hydrodynamics (SPH).
In fact, I have developed an entire GPU package (our USERMESO package) for the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) particle simulator for DPD and SPH simulations. The package achieves a 20x to 30x speedup on a single K20 GPU over 16 AMD CPU cores on a Cray XK7 compute node.
How has GPU computing impacted your research?
Our USERMESO package allows us to simulate DPD systems containing several million particles for millions of time steps on a daily basis during the study of the self-assembly behavior of amphiphilic polymers. The multi-compartment multi-walled vesicle (think of it as a miniature cell), which Figure 1 shows, is only observable at spatial and temporal scales that are tens of times larger and longer than those covered by typical contemporary DPD simulations. With the USERMESO code we can perform such simulations daily with just 16 GPUs!