# Accelerate Recommender Systems with GPUs

Wei Tan, Research Staff Member at IBM T. J. Watson Research Center shares how IBM is using NVIDIA GPUs to accelerate recommender systems, which use ratings or user behavior to recommend new products, items or content to users. Recommender systems are important in applications such as recommending products on retail sites, recommending movies or music on streaming media services, and recommending news items or posts on social media and networking services. Wei Tan’s team developed cuMF, a highly optimized matrix factorization system using CUDA to accelerate recommendations used in applications like these and more.

Wei: Matrix factorization (MF) is at the core of many popular algorithms, such as collaborative-filtering-based recommendation, word embedding, and topic modeling. Matrix factorization factors a sparse ratings matrix  (m-by-n, with  non-zero ratings) into a m-by-f matrix (X) and a f-by-n matrix (ΘT), as Figure 1 shows.

Suppose we obtained m users’ ratings on  items (say, movies). If user u rated item v, we use $r_{uv}$ as the non-zero element of R at position (u, v). We want to minimize the following cost function JContinue reading

# Train Your Reinforcement Learning Agents at the OpenAI Gym

Today OpenAI, a non-profit artificial intelligence research company, launched OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go.

OpenAI researcher John Schulman shared some details about his organization, and how OpenAI Gym will make it easier for AI researchers to design, iterate and improve their next generation applications. John studied physics at Caltech, and went to UC Berkeley for graduate school. There, after a brief stint in neuroscience, he studied machine learning and robotics under Pieter Abbeel, eventually honing in on reinforcement learning as his primary topic of interest. John lives in Berkeley, California, where he enjoys running in the hills and occasionally going to the gym.

What is OpenAI? What is your mission?

OpenAI is a non-profit artificial intelligence research company. Day to day, we are working on research projects in unsupervised learning and reinforcement learning. Our mission and long-term goal is to advance artificial intelligence in the ways that will maximally benefit humanity as a whole.

What is reinforcement learning? How is it different from supervised and unsupervised learning?

Reinforcement learning (RL) is the branch of machine learning that is concerned with making sequences of decisions. It assumes that there is an agent that is situated in an environment. At each step, the agent takes an action, and it receives an observation and reward from the environment. An RL algorithm seeks to maximize the agent’s total reward, given a previously unknown environment, through a learning process that usually involves lots of trial and error.

The reinforcement learning problem sketched above, involving a reward-maximizing agent, is extremely general, and RL algorithms have been applied in a variety of different fields. They have been applied in business management problems such as deciding how much inventory a store should hold or how it should set prices. They have also been applied to robotic control problems, and rapid development is currently occurring in this area. The following video shows Hopper: a two-dimensional one-legged robot trained to hop forward as fast as possible with OpenAI Gym.

# Using GPUs to Accelerate Digital Rock Physics

James McClure, a Computational Scientist with Advanced Research Computing at Virginia Tech shares how his group uses the NVIDIA Tesla GPU-accelerated Titan Supercomputer at Oak Ridge National Laboratory to combine mathematical models with 3D visualization to provide insight on how fluids move below the surface of the earth.

McClure spoke with us about his research at the 2015 Supercomputing Conference.

James McClure: Digital Rock Physics is a relatively new computational discipline that relies on high-performance computing to study the behavior of fluids within rock and other geologic materials. Understanding how fluids move within rock is essential for applications like geologic carbon sequestration, oil and gas recovery, and environmental contaminant transport. New technologies such as synchrotron-based x-ray micro-computed tomography enable the collection of 3D images that reveal the structure of rocks at the micron scale. Using these images, we can make predictions about the complex rock-fluid interactions that take place within natural systems. Continue reading

# Developing New Materials with GPU-Accelerated Supercomputers

Dr. Joshua A. Anderson is a Research Area Specialist at the University of Michigan who was an early user of GPU computing technology. He began his career developing software on the first CUDA capable GPU and now runs simulations on one of the world’s most powerful supercomputers.

Anderson’s “contributions to the development and dissemination of the open source, GPU-enabled molecular simulation software, HOOMD-blue, which enables scientific computations with unprecedented speed” earned him the 2015 CoMSEF Young Investigator Award for Modeling and Simulation.

Joshua Anderson: I work with the Glotzer Group at the University of Michigan. We use computer simulation to discover the fundamental principles of how nanoscale systems of building blocks self-assemble, and to discover how to control the assembly process to engineer new materials. Specifically, we focus on the role of particle shape and how changing the shape can result in different material properties.

Over the past few years, I have been focusing on two-dimensional systems, using large scale simulations to study hexatic phase transitions for hard disks, and how patterning surfaces of polygons can create shape allophiles that improve self-assembly. The hexatic phase is an intermediate between the fluid and hexagonally ordered solid. In the hexatic phase, the orientation of bonds between particles has long range order, but translational order is short range and there is no crystal lattice. Shape allophiles are polygonal shapes cut so they fit together like puzzle pieces. These research projects are computationally demanding and could not have been run on any existing code. So before I could even begin the science research, I needed to develop, implement, and optimize the parallel algorithms necessary for these studies. Continue reading

# Cutting Edge Parallel Algorithms Research with CUDA

Leyuan Wang, a Ph.D. student in the UC Davis Department of Computer Science, presented one of only two “Distinguished Papers” of the 51 accepted at Euro-Par 2015.  Euro-Par is a European conference devoted to all aspects of parallel and distributed processing held August 24-28 at Austria’s Vienna University of Technology.

Leyuan’s paper Fast Parallel Suffix Array on the GPU, co-authored by her advisor John Owens and Sean Baxter, a research scientist at New York’s DE Shaw Research, details their efforts to implement a linear-time suffix array construction algorithm on NVIDIA GPUs, resulting in algorithmic improvements and significant speedups over the existing state of the art.

Wang completed her master’s degree in electrical and computer engineering at UC Davis in October 2014, after having earned her undergraduate degree in electronics science and technology at China’s Zhejiang University.

Leyuan Wang: I work on high-performance string processing and graph processing algorithms, mostly in string and graph queries. My current research focus is on GPGPU (general-purpose computing on graphics processing units) and the benchmark I care about most is speed. I’ve been working on designing and improving parallel suffix array construction algorithms (SACAs) and incorporating the implementations in a Burrows-Wheeler transform-based lossless data compression (bzip2) and a parallel FM index for pattern searching. The suffix array (SA) of a string is the sorted set of all suffixes of the string. The inverse suffix array (ISA) is also the lexicographic ranks of suffixes.

The Burrows-Wheeler transform (BWT) of a string is generated by lexicographically sorting the cyclic shift of the string to form a string matrix and taking the last column of the matrix. The BWT groups repeated characters together by permuting the string; it is also reversible, which means the original string can be recovered. These two characteristics make BWT a popular choice for a compression pipeline stage (for instance, bzip2). It is directly related to the suffix array: the sorted rows in the matrix are essentially the sorted suffixes of the string and the first column of the matrix reflects a suffix array. Table 1 shows an example of the SA, ISA and BWT of the input string “banana\$”

The suffix array data structure is a building block in a spectrum of applications, including data compression, bioinformatics, text indexing, etc. I’ve studied the taxonomy of all classes of SACAs and compared them in order to find the best candidate for the GPU. I revisited the previous conclusion that skew SACAs are best suited on the GPU by demonstrating that prefix-doubling SACAs are actually better both in theoretical analysis and experimental benchmarks. Our hybrid skew/prefix-doubling suffix array implementation (with our amazing research collaborator Sean Baxter, formerly of NVIDIA Research) using a Tesla K20 achieves a 7.9x speedup against the previous state-of-the-art skew implementation. Our optimized skew SACA implementation has been added as a primitive to CUDPP 2.2 (CUDA Data Parallel Primitives Library) and incorporated into the BWT and bzip2 data compression application, resulting in great speedups compared with bzip2 in CUDPP 2.1. Figure 1 shows pseudocode for our two approaches. Continue reading

# Open, Reproducible Computational Chemistry with Python and CUDA

Increasingly, computational chemistry researchers use GPUs to push the boundaries of discovery. This motivated Christopher Cooper, an Instructor at Universidad Técnica Federico Santa María in Chile, to move to a Python-based software stack.

Cooper’s recent paper, “Probing protein orientation near charged nanosurfaces for simulation-assisted biosensor design,” was recently accepted in J. Chemical Physics.

Christopher: I am interested in developing fast and accurate algorithms to study the effect of electrostatics in protein systems. We use continuum models to represent the solvent around the protein (water with salt) via the Poisson-Boltzmann equation, and solve it with an accelerated boundary element method. We call the resulting code PyGBe, which is open-source software with an MIT license, and is available to download via the Github account of the research group where I did my Ph.D. at Boston University.

# Increasing the Luminosity of Beam Dynamics with GPUs

What is dark matter? We can neither see it nor detect it with any instrument. CERN is upgrading the LHC (Large Hadron Collider), which is the world’s largest and most powerful particle accelerator ever built, to explore the new high-energy frontier.

The most technically challenging aspects of the upgrade cannot be done by CERN alone and requires collaboration and external expertise. There are 7,000 scientists from over 60 countries working to extend the LHC discovery potential; the accelerator will need a major upgrade around 2020 to increase its luminosity by a factor of 10 beyond the original design value.

Ph.D. student Adrian Oeftiger attends EPFL (École Polytechnique Fédérale de Lausanne) in Switzerland which is one of the High Luminosity LHC beneficiaries. His research group is working to parallelize their algorithms to create software that will offer the possibility of new kinds of beam dynamics studies that have not been possible with the current technology.

Adrian: My world is all about luminosity; increasing the luminosity of particle beams. It is all about making ultra-high-energy collisions of protons possible, and at the same time providing enough collisions to enable fundamental particle physics research. That means increasing the luminosity. I’m doing my Ph.D. in beam dynamics in the field of accelerator physics.

These days, high-energy particle accelerators are the tools of choice to analyze and understand the fundamental building blocks of our universe. The huge detectors at the Large Hadron Collider (LHC) at CERN, buried about a hundred meters underground in the countryside near Geneva, need ever-increasing collision rates (hence luminosity!): they gather statistics of collision events to explore new realms of physics, to detect extremely rare interaction combinations and the tiniest quantities of new particles, and to find explanations for some of the numerous wonders of the universe we live in. What is the dark matter which makes up 27% of our universe made of? Why is the symmetry between anti-matter and ordinary matter broken, and why do we find only the latter in the universe?

CERN is preparing for the High Luminosity LHC, a powerful upgrade of the present accelerator to increase the chances to answer some of these fundamental questions. Increasing the chances translates to: we need more collisions, so we need higher luminosity. Continue reading

# Harnessing the Caffe Framework for Deep Visualization

The need to train their deep neural networks as fast as possible led the Evolving Artificial Intelligence Laboratory at the University of Wyoming to harness the power of NVIDIA Tesla GPUs starting in 2012 to accelerate their research.

“The speedups GPUs provide for training deep neural networks are well-documented and allow us to train models in a week that would otherwise take months,” said Jeff Clune, Assistant Professor, Computer Science Department and Director of the Evolving Artificial Intelligence Laboratory. “And algorithms continuously improve. Recently, NVIDIA’s cuDNN technology allowed us to speed up our training time by an extra 20% or so.”

Clune’s Lab, which focuses on evolving artificial intelligence with a major focus on large-scale, structurally organized neural networks, has garnered press from some of the largest media outlets, including BBC, National Geographic, NBC News, The Atlantic and featured on the cover of Nature in May 2015.

For this Spotlight interview, I had the opportunity to talk with Jeff Clune and two of his collaborators, Anh Nguyen, a Ph.D. student at the Evolving AI Lab and Jason Yosinski, a Ph.D. candidate at Cornell University.

Brad: How are you using deep neural networks (DNNs)?

We have many research projects involving deep neural networks. Our Deep Learning publications to date involve better understanding DNNs. Our lab’s research covers: Continue reading

# Leveraging OpenACC to Compute High-Accuracy Molecular Energies

For this interview, I reached out to Janus Juul Eriksen, a Ph.D. fellow at Aarhus University in Denmark. Janus is a chemist by trade without any formal education in computer science; but he is getting up to 12x speed-up compared to his CPU-only code after modifying less than 100 lines of code with one week of programming effort.

How did he do this? He used OpenACC.

OpenACC is a simple, powerful and portable approach for researchers and scientists who need to rapidly boost application performance for faster science while minimizing programming. With OpenACC, the original source code is kept intact, making the implementation intuitively transparent and leaving most of the hard work to the compiler.

NVIDIA recently announced the new OpenACC Toolkit, an all-in-one suite of parallel programming tools, that helps researchers and scientists quickly accelerate applications.

“OpenACC is much easier to learn than OpenMP or MPI. It makes GPU computing approachable for domain scientists,” says Janus. “Our initial OpenACC implementation required only minor efforts, and more importantly, no modifications of our existing CPU implementation.”

Janus is part of the research team developing the quantum chemistry code LSDalton, a massively parallel and linear-scaling program for the accurate determination of energies and other molecular properties for large molecular systems.

In need of speed, the LSDalton team was awarded an INCITE allocation which gave them access to Oak Ridge National Laboratory’s Titan supercomputer. With this, they needed to find a way to use the power of the supercomputer: enter OpenACC. Demonstrating success on Titan with their GPU-accelerated code, they were recently one of 13 application code projects selected to join the Center for Accelerated Application Readiness (CAAR) program. This means they will be among the first applications to run on Summit, the new supercomputer debuting in 2018 which will deliver more than five times the computational performance of Titan’s 18,688 nodes.

This access will enable the research team to simulate larger molecular structures at higher accuracy, ultimately accelerating discoveries in materials and quantum chemistry.

# Using GPUs to Accelerate Epidemic Forecasting

Originally trained as a veterinary surgeon, Chris Jewell, a Senior Lecturer in Epidemiology at Lancaster Medical School in the UK became interested in epidemics through his experience working on the foot and mouth disease outbreak in the UK in 2001. His work so far has been on livestock epidemics such as foot and mouth disease, theileriosis, and avian influenza with government organizations in the UK, New Zealand, Australia, and the US. Recently, he has refocused his efforts into the human field where populations and epidemics tend to be larger and therefore need more computing grunt.

Epidemic forecasting centers around Bayesian inference on dynamical models, using Markov Chain Monte Carlo (MCMC) as the model fitting algorithm. As part of this algorithm Chris has had to calculate a statistical likelihood function which itself involves a large sum over pairs of infected and susceptible individuals. He is currently using CUDA technology to accelerate this calculation and enable real-time inference, leading to timely forecasts for informing control decisions.

“Without CUDA technology, the MCMC is simply too slow to be of practical use during a disease outbreak,” he says. “With the 380x speedup over a single core non-vector CPU code, real-time forecasting is now a reality!”