Fast Spectral Graph Partitioning on GPUs

Graphs are mathematical structures used to model many types of relationships and processes in physical, biological, social and information systems. They are also used in the solution of various high-performance computing and data analytics problems. The computational requirements of large-scale graph processing for cyberanalytics, genomics, social network analysis and other fields demand powerful and efficient computing performance that only accelerators can provide. With CUDA 8, NVIDIA is introducing nvGRAPH, a new library of GPU-accelerated graph algorithms. Its first release, nvGRAPH 1.0, supports 3 key graph algorithms (PageRank, Single-Source Shortest Path, and Single-Source Widest Path), and our engineering and research teams are already developing new parallel algorithms for future releases. I’ll discuss one of them in detail in this blog post.

Many applications need to partition graphs into subgraphs, or to find clusters within them. For example, graph partitioning can be used in the numerical solution of partial differential equations (PDEs) to perform more efficient sparse matrix-vector multiplications, and graph clustering can be used to identify communities in social networks and for cybersecurity (see Figure 1).

Figure 1: Applications of Graph Partitioning
Figure 1: Applications of Graph Partitioning

The quality of graph partitioning or clustering can have a significant impact on the overall performance of an application. Therefore it is very important not only to find the splitting into subgraphs fast by taking advantage of GPUs (our GPU spectral graph partitioning scheme performs up to 7x faster than a CPU implementation), but also to find the best possible splitting, which requires development of new algorithms.  Continue reading


Accelerate Recommender Systems with GPUs

Wei TanWei Tan, Research Staff Member at IBM T. J. Watson Research Center shares how IBM is using NVIDIA GPUs to accelerate recommender systems, which use ratings or user behavior to recommend new products, items or content to users. Recommender systems are important in applications such as recommending products on retail sites, recommending movies or music on streaming media services, and recommending news items or posts on social media and networking services. Wei Tan’s team developed cuMF, a highly optimized matrix factorization system using CUDA to accelerate recommendations used in applications like these and more.

Brad: Can you talk a bit about your current research?

Wei: Matrix factorization (MF) is at the core of many popular algorithms, such as collaborative-filtering-based recommendation, word embedding, and topic modeling. Matrix factorization factors a sparse ratings matrix  (m-by-n, with  non-zero ratings) into a m-by-f matrix (X) and a f-by-n matrix (ΘT), as Figure 1 shows.

Figure 1. Matrix factorization factors a sparse ratings matrix R (m-by-n, with N_z non-zero ratings) into a m-by-f matrix (R) and a f-by-n matrix (Θ^T).
Figure 1. Matrix factorization factors a sparse ratings matrix R (m-by-n, with Nz non-zero ratings) into a m-by-f matrix (R) and a f-by-n matrix (ΘT ).

Suppose we obtained m users’ ratings on  items (say, movies). If user u rated item v, we use r_{uv} as the non-zero element of R at position (u, v). We want to minimize the following cost function JContinue reading

Figure 5: Ring order of GPUs in PCIe tree.

Fast Multi-GPU collectives with NCCL

Today many servers contain 8 or more GPUs. In principle then, scaling an application from one to many GPUs should provide a tremendous performance boost. But in practice, this benefit can be difficult to obtain. There are two common culprits behind poor multi-GPU scaling. The first is that enough parallelism has not been exposed to efficiently saturate the processors. The second reason for poor scaling is that processors exchange too much data and spend more time communicating than computing. To avoid such communication bottlenecks, it is important to make the most of the available inter-GPU bandwidth, and this is what NCCL is all about.

NCCL (pronounced “Nickel”) is a library of multi-GPU collective communication primitives that are topology-aware and can be easily integrated into your application. Initially developed as an open-source research project, NCCL is designed to be light-weight, depending only on the usual C++ and CUDA libraries. NCCL can be deployed in single-process or multi-process applications, handling required inter-process communication transparently. Finally, the API will be very familiar to anyone with experience using MPI’s collectives.

Figure 1: Illustration of the All-Reduce collective.
Figure 1: Illustration of the All-Reduce collective.

Collective Communication

Collective communication routines are common patterns of data transfer among many processors. If you have experience with MPI, then you are probably already familiar with several collective operations. For example, all-reduce starts with independent arrays $latex V_k$ of N values on each of K processors and ends with identical arrays S of N values, where $S[k] = V_0[k]+V_1[k]+…+V_k[k]$, for each processor k. See Figure 1. Continue reading


Optimizing Recurrent Neural Networks in cuDNN 5

Figure 1: cuDNN 5 + Torch speedup vs. Torch GPU implementation, M40, Intel® Xeon® Processor E5-2698 Network A: RNN size 2560, Wordvec size 2560, num layers 1, Seq length 200, max epochs 1 Network B: RNN size 256, num layers 3, max epochs 50, batch size 64 Network C: RNN size 256, Wordvec size 256, num layers 1, Seq length 1000, max epochs 1
Figure 1: cuDNN 5 + Torch speedup vs. Torch-rnn implementation, M40, Intel® Xeon® Processor E5-2698 Network A: RNN size 2560, input size 2560, 1 layer, Seq length 200, batch size 64. Network B: RNN size 256, input size 64, 3 layers, batch size 64. Network C: RNN size 256, input size 256, 1 layer, batch size 32, Seq length 1000

This week at GTC 2016, we announced the latest update to NVIDIA Deep Learning SDK, which now includes cuDNN 5. Version 5 offers new features, improved performance and support for the latest generation NVIDIA Tesla P100 GPU. New features in cuDNN 5 include:

  • Faster forward and backward convolutions using the Winograd convolution algorithm;
  • 3D FFT Tiling;
  • Spatial Transformer Networks;
  • Improved performance and reduced memory usage with FP16 routines on Pascal GPUs;
  • Support for LSTM recurrent neural networks for sequence learning that deliver up to 6x speedup.

One of the new features we’ve added in cuDNN 5 is support for Recurrent Neural Networks (RNN). RNNs are a powerful tool used for sequence learning in a number of fields, from speech recognition to image captioning. For a brief high-level introduction to RNNs, LSTM and sequence learning, I recommend you check out Tim Dettmers recent post Deep Learning in a Nutshell: Sequence Learning, and for more depth, Soumith Chintala’s post Understanding Natural Language with Deep Neural Networks Using Torch.

I’m excited about the RNN capabilities in cuDNN 5; we’ve put a lot of effort into optimizing their performance on NVIDIA GPUs, and I’ll go into some of the details of these optimizations in this blog post.

cuDNN 5 supports four RNN modes:  ReLU activation function, tanh activation function, Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM). In this case study I’ll look at the performance of an LSTM network, but most of the optimizations can be applied to any RNN. Continue reading


CUDA 8 Features Revealed

This week at the GPU Technology Conference developers are getting a preview of some powerful new features coming in CUDA 8 later this year. In this post I wanted to give a quick overview of the major new features of CUDA 8:

  • Support for the Pascal GPU architecture, including the new Tesla P100 accelerator;
  • New Unified Memory capabilities;
  • The new nvGRAPH GPU-Accelerated Graph Analytics library;
  • Powerful new profiling capabilities; and
  • Improved compiler performance and heterogeneous lambda support.

To learn more at GTC 2016, you can check out my talk from GTC 2016, “CUDA 8 and Beyond” (GTC on-demand site registration required).

CUDA 8 Supports the new NVIDIA Pascal Architecture

A crucial goal for CUDA 8 is to provide support for the powerful new Pascal architecture, the first incarnation of which was launched at GTC today: Tesla P100. For full details on P100 and the Pascal GP100 GPU architecture, check out the blog post “Inside Pascal“.

pascal_key_imageIn a nutshell, Tesla P100 provides massive double-, single- and half-precision computational performance, 3x the memory bandwidth of Maxwell GPUs via HBM2 stacked memory, and with its support for NVLink, up to 5x the GPU-GPU communication performance of PCI Express. Pascal also improves support for Unified Memory thanks to a larger virtual address space and a new page migration engine.

CUDA 8 will enable CUDA applications to get high performance on Tesla P100 out of the box. Moreover, improvements in CUDA 8 enable developing efficient code for new Tesla P100 features such as NVLink and improved Unified Memory. Continue reading


Developing New Materials with GPU-Accelerated Supercomputers

Josh_AndersonDr. Joshua A. Anderson is a Research Area Specialist at the University of Michigan who was an early user of GPU computing technology. He began his career developing software on the first CUDA capable GPU and now runs simulations on one of the world’s most powerful supercomputers.

Anderson’s “contributions to the development and dissemination of the open source, GPU-enabled molecular simulation software, HOOMD-blue, which enables scientific computations with unprecedented speed” earned him the 2015 CoMSEF Young Investigator Award for Modeling and Simulation.

Brad Nemire: Can you talk a bit about your current research?

Joshua Anderson: I work with the Glotzer Group at the University of Michigan. We use computer simulation to discover the fundamental principles of how nanoscale systems of building blocks self-assemble, and to discover how to control the assembly process to engineer new materials. Specifically, we focus on the role of particle shape and how changing the shape can result in different material properties.

Figure 1: an example system configuration from the shape allophiles project: Eric S. Harper, Ryan Marson, Joshua A. Anderson, Greg van Anders, and Sharon C Glotzer. Shape Allophiles Improve Entropic Assembly. Soft Matter, 2015. (doi:10.1039/C5SM01351H).
Figure 1: example system configuration from the shape allophiles project: Eric S. Harper, Ryan Marson, Joshua A. Anderson, Greg van Anders, and Sharon C Glotzer. Shape Allophiles Improve Entropic Assembly. Soft Matter, 2015. (doi:10.1039/C5SM01351H).

Over the past few years, I have been focusing on two-dimensional systems, using large scale simulations to study hexatic phase transitions for hard disks, and how patterning surfaces of polygons can create shape allophiles that improve self-assembly. The hexatic phase is an intermediate between the fluid and hexagonally ordered solid. In the hexatic phase, the orientation of bonds between particles has long range order, but translational order is short range and there is no crystal lattice. Shape allophiles are polygonal shapes cut so they fit together like puzzle pieces. These research projects are computationally demanding and could not have been run on any existing code. So before I could even begin the science research, I needed to develop, implement, and optimize the parallel algorithms necessary for these studies. Continue reading


Open, Reproducible Computational Chemistry with Python and CUDA

SONY DSCIncreasingly, computational chemistry researchers use GPUs to push the boundaries of discovery. This motivated Christopher Cooper, an Instructor at Universidad Técnica Federico Santa María in Chile, to move to a Python-based software stack.

Cooper’s recent paper, “Probing protein orientation near charged nanosurfaces for simulation-assisted biosensor design,” was recently accepted in J. Chemical Physics.

Brad: Can you talk a bit about your current research?

Christopher: I am interested in developing fast and accurate algorithms to study the effect of electrostatics in protein systems. We use continuum models to represent the solvent around the protein (water with salt) via the Poisson-Boltzmann equation, and solve it with an accelerated boundary element method. We call the resulting code PyGBe, which is open-source software with an MIT license, and is available to download via the Github account of the research group where I did my Ph.D. at Boston University.

Figure 1: Electrostatic potential around a peptide derived from an HIV-1 capsid.
Figure 1: Electrostatic potential around a peptide derived from an HIV-1 capsid.

Continue reading


Combine OpenACC and Unified Memory for Productivity and Performance

The post Getting Started with OpenACC covered four steps to progressively accelerate your code with OpenACC. It’s often necessary to use OpenACC directives to express both loop parallelism and data locality in order to get good performance with accelerators. After expressing available parallelism, excessive data movement generated by the compiler can be a bottleneck, and correcting this by adding data directives takes effort. Sometimes expressing proper data locality is more effort than expressing parallelism with loop directives.

Wouldn’t it be nice if programs could manage data locality automatically? Well, this is possible today with Unified Memory (on Kepler and newer GPU architectures). In this post I demonstrate how to combine OpenACC with Unified Memory to GPU-accelerate your existing applications with minimal effort. You can download the source code for the example in this post from the Parallel Forall GitHub repository.

Jacobi Iteration with Heap Memory

I’ll use the popular Jacobi iteration example code which is representative of many real-world stencil computations. In contrast to the previous OpenACC post, I modified the array data allocation to use heap memory instead of using automatic stack-allocated arrays. This is a more common scenario for real applications since real-world data arrays are often too large for stack memory. This change also makes it a more challenging case for OpenACC since the compiler no longer knows the size of the arrays. The following excerpt shows the main loop of the Jacobi iteration with 2D index computation. Continue reading


Increasing the Luminosity of Beam Dynamics with GPUs

Adrian_CERNWhat is dark matter? We can neither see it nor detect it with any instrument. CERN is upgrading the LHC (Large Hadron Collider), which is the world’s largest and most powerful particle accelerator ever built, to explore the new high-energy frontier.

The most technically challenging aspects of the upgrade cannot be done by CERN alone and requires collaboration and external expertise. There are 7,000 scientists from over 60 countries working to extend the LHC discovery potential; the accelerator will need a major upgrade around 2020 to increase its luminosity by a factor of 10 beyond the original design value.

Ph.D. student Adrian Oeftiger attends EPFL (École Polytechnique Fédérale de Lausanne) in Switzerland which is one of the High Luminosity LHC beneficiaries. His research group is working to parallelize their algorithms to create software that will offer the possibility of new kinds of beam dynamics studies that have not been possible with the current technology.

Brad: How is your research related to the upgrade of the LHC?

Adrian: My world is all about luminosity; increasing the luminosity of particle beams. It is all about making ultra-high-energy collisions of protons possible, and at the same time providing enough collisions to enable fundamental particle physics research. That means increasing the luminosity. I’m doing my Ph.D. in beam dynamics in the field of accelerator physics.

High Luminosity LHCThese days, high-energy particle accelerators are the tools of choice to analyze and understand the fundamental building blocks of our universe. The huge detectors at the Large Hadron Collider (LHC) at CERN, buried about a hundred meters underground in the countryside near Geneva, need ever-increasing collision rates (hence luminosity!): they gather statistics of collision events to explore new realms of physics, to detect extremely rare interaction combinations and the tiniest quantities of new particles, and to find explanations for some of the numerous wonders of the universe we live in. What is the dark matter which makes up 27% of our universe made of? Why is the symmetry between anti-matter and ordinary matter broken, and why do we find only the latter in the universe?

CERN is preparing for the High Luminosity LHC, a powerful upgrade of the present accelerator to increase the chances to answer some of these fundamental questions. Increasing the chances translates to: we need more collisions, so we need higher luminosity. Continue reading


Mocha.jl: Deep Learning for Julia

Deep learning is becoming extremely popular due to several breakthroughs in various well-known tasks in artificial intelligence. For example, at the ImageNet Large Scale Visual Recognition Challenge, the introduction of deep learning algorithms into the challenge reduced the top-5 error by 10% in 2012. Every year since then, deep learning models have dominated the challenges, significantly reducing the top-5 error rate every year (see Figure 1). In 2015, researchers have trained very deep networks (for example, the Google “inception” model has 27 layers) that surpass human performance.

Figure 1: The top-5 error rate in the ImageNet Large Scale Visual Recognition Challenge has been rapidly reducing since the introduction of deep neural networks in 2012.
Figure 1: The top-5 error rate in the ImageNet Large Scale Visual Recognition Challenge has been rapidly reducing since the introduction of deep neural networks in 2012.

Moreover, at this year’s Computer Vision and Pattern Recognition (CVPR) conference, deep neural networks (DNNs) were being adapted to increasingly more complicated tasks. For example, in semantic segmentation, instead of predicting a single category for a whole image, a DNN is trained to classify each pixel in the image, essentially producing a semantic map indicating every object and its shape and location in the given image (see Figure 2).

Figure 2: An example of generating a semantic map by classifying each pixel in a source image. Source: Shuai Zheng et al. Conditional Random Fields as Recurrent Neural Networks.
Figure 2: An example of generating a semantic map by classifying each pixel in a source image. Source: Shuai Zheng et al. Conditional Random Fields as Recurrent Neural Networks.

Continue reading