CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran

When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM).  In this post I’ll show you how to leverage these batched routines from CUDA Fortran.

The C interface batched cuBLAS functions use an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type.  Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.

Writing Interfaces to Batched cuBLAS Routines

At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:

interface 
  integer(c_int) function &
      cublasSgetrfBatched(h,n,Aarray,lda,ipvt,info,batchSize) &
      bind(c,name='cublasSgetrfBatched') 
    use iso_c_binding 
    use cublas 
    type(cublasHandle), value :: h 
    integer(c_int), value :: n 
    type(c_devptr), device :: Aarray(*) 
    integer(c_int), value :: lda
    integer(c_int), device :: ipvt(*) 
    integer(c_int), device :: info(*) 
    integer(c_int), value :: batchSize 
  end function cublasSgetrfBatched
end interface

The arguments of cublasSgetrfBatched() are the cuBLAS handle h, the matrix dimension n, the array of device pointers Aarray (one pointer per matrix in the batch), the leading dimension lda, the device arrays ipvt and info that receive the pivots and the factorization status, and the number of matrices batchSize.
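
For comparison, here is a minimal sketch of the same batched call from CUDA C, which is the function the Fortran interface above binds to. The matrix size, batch size, and the contiguous allocation of the whole batch are illustrative choices, not taken from the original post.

// Minimal CUDA C sketch: LU-factor a batch of n x n matrices with one call.
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 4, lda = 4, batch = 8;
    cublasHandle_t handle;
    cublasCreate(&handle);

    // One contiguous allocation holds every matrix in the batch.
    float *d_A;
    cudaMalloc(&d_A, sizeof(float) * lda * n * batch);
    int *d_ipvt, *d_info;
    cudaMalloc(&d_ipvt, sizeof(int) * n * batch);
    cudaMalloc(&d_info, sizeof(int) * batch);

    // Build the array of device pointers (one per matrix) and copy it to the device.
    float *h_Aptrs[batch];
    for (int i = 0; i < batch; i++) h_Aptrs[i] = d_A + i * lda * n;
    float **d_Aptrs;
    cudaMalloc(&d_Aptrs, sizeof(float*) * batch);
    cudaMemcpy(d_Aptrs, h_Aptrs, sizeof(float*) * batch, cudaMemcpyHostToDevice);

    // ... fill d_A with the batch of matrices ...

    // Factor all matrices in the batch with a single call.
    cublasSgetrfBatched(handle, n, d_Aptrs, lda, d_ipvt, d_info, batch);

    cudaFree(d_Aptrs); cudaFree(d_info); cudaFree(d_ipvt); cudaFree(d_A);
    cublasDestroy(handle);
    return 0;
}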

CUDACasts Episode 16: Thrust Algorithms and Custom Operators

Continuing the Thrust mini-series (see Part 1), today’s episode of CUDACasts focuses on a few of the algorithms that make Thrust a flexible and powerful parallel programming library. You’ll also learn how to use functors, or C++ “function objects”, to customize how Thrust algorithms process data.
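
As a taste of what the episode covers, here is a minimal sketch (not the episode’s exact code) of a functor passed to thrust::transform to customize the per-element operation:

#include <thrust/device_vector.h>
#include <thrust/transform.h>

// A SAXPY "function object": state (a) plus an operator() that Thrust
// applies to each pair of elements on the device.
struct saxpy_functor
{
    float a;
    saxpy_functor(float a_) : a(a_) {}
    __host__ __device__
    float operator()(float x, float y) const { return a * x + y; }
};

int main(void)
{
    thrust::device_vector<float> x(1 << 20, 1.0f);
    thrust::device_vector<float> y(1 << 20, 2.0f);

    // y <- 2*x + y, with the functor defining how each pair of elements is combined
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(2.0f));
    return 0;
}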

In the next CUDACast in this Thrust mini-series, we’ll take a look at how fancy iterators increase the flexibility Thrust has for expressing parallel algorithms in C++.

CUDACasts Episode 15: Introduction to Thrust

Whenever I hear about a developer interested in accelerating a C++ application on a GPU, I make sure to tell them about Thrust. Thrust is a parallel algorithms library loosely based on the C++ Standard Template Library. Thrust provides a number of building blocks, such as sorts, scans, transforms, and reductions, to enable developers to quickly embrace the power of parallel computing. In addition to targeting the massive parallelism of NVIDIA GPUs, Thrust supports multiple system back-ends such as OpenMP and Intel’s Threading Building Blocks. This means that it’s possible to compile your code for different parallel processors with a simple flick of a compiler switch.

For this first in a mini-series of screencasts about Thrust, we’ll write a simple sorting program and execute it on both a GPU and a multi-core CPU.  In upcoming episodes, we’ll explore more capabilities of Thrust which really show its flexibility and power. For more examples of using Thrust, read the post Expressive Algorithmic Programming with Thrust, and check out the Thrust Quick Start Guide.
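
To give a flavor of what the screencast demonstrates, here is a minimal sketch (not the episode’s exact code) of a sort that can be retargeted at compile time; the build commands in the comments assume Thrust’s OpenMP back-end is available on your system.

// Build for the GPU:       nvcc -O2 sort.cu -o sort_gpu
// Build for multi-core:    nvcc -O2 -Xcompiler -fopenmp \
//     -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP sort.cu -o sort_cpu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <algorithm>
#include <cstdlib>

int main(void)
{
    thrust::host_vector<int> h(16 << 20);
    std::generate(h.begin(), h.end(), rand);

    // "device" is whichever back-end was selected at compile time
    thrust::device_vector<int> d = h;
    thrust::sort(d.begin(), d.end());
    return 0;
}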

AmgX: Multi-Grid Accelerated Linear Solvers for Industrial Applications

Many industries use Computational Fluid Dynamics (CFD) to predict fluid flow forces on products during the design phase, using only numerical methods. A famous example is Boeing’s 777 airliner, which was designed and built without the construction (or destruction) of a single model in a wind tunnel, an industry first. This approach dramatically reduces the cost of designing new products for which aerodynamics is a large part of the value add. Another good example is Formula 1 racing, where a fraction of a percentage point reduction in drag forces on the car body can make the difference between a winning and a losing season.

Users of CFD models crave higher accuracy and faster run times. The key enabling algorithm for realistic models in CFD is Algebraic Multi-Grid (AMG). This algorithm allows solution times to scale linearly with the number of unknowns in the model; it can be applied to arbitrary geometries with highly refined and unstructured numerical meshes; and it can be run efficiently in parallel. Unfortunately, AMG is also very complex and requires specialty programming and mathematical skills, which are in short supply. Add in the need for GPU programming skills, and GPU-accelerated AMG seems a high mountain to climb. Existing GPU-accelerated AMG implementations (most notably the one in CUSP) are more proofs of concept than industrial-strength solvers for real-world CFD applications, and highly tuned multi-threaded and/or distributed CPU implementations can outperform them in many cases. Industrial CFD users had few options for GPU acceleration, so NVIDIA decided to do something about it.

NVIDIA partnered with ANSYS, provider of the leading CFD software Fluent, to develop a high-performance, robust and scalable GPU-accelerated AMG library. We call the library AmgX (for AMG Accelerated). Fluent 15.0 uses AmgX as its default linear solver, and it takes advantage of a CUDA-enabled GPU when it detects one. AmgX can even use MPI to connect clusters of servers to solve very large problems that require dozens of GPUs. The aerodynamics problem in Figure 1 required 48 NVIDIA K40X GPUs, and involved 111 million cells and over 440 million unknowns.

CUDA Spotlight: GPU-Accelerated Genomics

This week’s Spotlight is on Dr. Knut Reinert. Knut is a professor at Freie Universität in Berlin, Germany, and chair of the Algorithms in Bioinformatics group in the Institute of Computer Science. Knut and his team focus on the development of novel algorithms and data structures for problems in the analysis of biomedical mass data. In particular, the group develops mathematical models for analyzing large genomic sequences and data derived from mass spectrometry experiments (for example, for detecting differential expression of proteins between normal and diseased samples). Previously, Knut was at Celera Genomics, where he worked on bioinformatics algorithms and software for the Human Genome Project, which assembled the very first human genome.

On Oct. 22, 2013, Knut will deliver a GTC Express Webinar presentation titled: Intro to SeqAn, an Open-Source C++ Template Library.

Following is an excerpt from our interview (you can read the complete Spotlight here).

Knut Reinert, Freie Univ. Berlin

NVIDIA: Knut, tell us about the SeqAn library.
Knut: Before setting up the Algorithmic Bioinformatics group at Freie Universität, I had been working for years at a U.S. company – Celera Genomics in Maryland – where I worked on the assembly of both the Drosophila (fruit fly) and human genomes. A central part of these projects was the development of large software packages containing algorithms for assembly and genome analysis developed by the Informatics Research team at Celera.

Although successful, the endeavor clearly showed the lack of available implementations in sequence analysis, even for standard tasks. Implementations of much needed algorithmic components were either not available, or hard to access in third-party, monolithic software products.

With this experience in mind, and having been educated at the Max-Planck Institute for Computer Science in Saarbrücken (the home of very successful software libraries like LEDA and CGAL), I put the development of such a software library high on my research agenda.

CUDACasts Episode #8: Accelerate FFTW Apps with CUFFT 5.5

GPU libraries provide an easy way to accelerate applications without writing any GPU-specific code. With the new CUDA 5.5 version of the NVIDIA CUFFT Fast Fourier Transform library, FFT acceleration gets even easier, with new support for the popular FFTW API. It is now extremely simple for developers to accelerate existing FFTW library calls on the GPU, sometimes with no code changes! By simply changing the linker command line to link the CUFFT library instead of the FFTW library, you can take advantage of the GPU with only a re-link. In today’s CUDACast, we take a simple application that uses the standard FFTW library, and accelerate the function calls on the GPU by simply changing which library we link. In fact, the only code change we will make is to use the cufftw.h header file. This ensures that, at compile time, we are not calling any unsupported functions.
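
As a rough sketch of that workflow (not the CUDACast’s exact code), the program below uses only standard FFTW-style calls; building it against cuFFT’s FFTW interface is a matter of including cufftw.h in place of fftw3.h and re-linking, for example against -lcufftw -lcufft instead of -lfftw3f.

#include <cufftw.h>   // drop-in for <fftw3.h> when using cuFFT's FFTW interface

int main(void)
{
    const int n = 1 << 20;
    fftwf_complex *in  = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex) * n);
    fftwf_complex *out = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex) * n);

    // ... fill 'in' with signal data ...

    fftwf_plan plan = fftwf_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftwf_execute(plan);            // runs on the GPU when linked against cuFFT

    fftwf_destroy_plan(plan);
    fftwf_free(in);
    fftwf_free(out);
    return 0;
}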

Do More, Code Less with ArrayFire GPU Matrix Library

This is a guest post by Chris McClanahan from AccelerEyes.

ArrayFire is a fast and easy-to-use GPU matrix library developed by AccelerEyes. ArrayFire wraps GPU memory into a simple “array” object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.

ArrayFire Feature Highlights

  • ArrayFire provides a high-level array notation and an extensive set of functions for easily manipulating N-dimensional GPU data (see the brief sketch after this list).
  • ArrayFire provides all basic arithmetic operations (element-wise arithmetic, trigonometric, logical operations, etc.), higher-level primitives (reductions, matrix multiply, set operations, sorting, etc.), and even domain-specific functions (image and signal processing, linear algebra, etc.).
  • ArrayFire can be used as a self-contained library, or integrated into existing CUDA code to supplement it. The array object can wrap data from CUDA device pointers and existing CPU memory.
  • ArrayFire contains built-in graphics functions for data visualization. The graphics library in ArrayFire provides easy rendering of 2D and 3D data, and leverages CUDA-OpenGL interoperation, so visualization is fast and efficient. Various visualization algorithms make it easy to explore complex data.
  • ArrayFire offers a unique “gfor” construct that can drastically speed up conventional “for” loops over data. The gfor loop essentially auto-vectorizes the code inside, and executes all iterations of the loop simultaneously.
  • ArrayFire supports C, C++, and Fortran on top of the CUDA platform.
  • ArrayFire is built on top of a custom just-in-time (JIT) compiler for efficient GPU memory usage. The JIT back-end in ArrayFire automatically combines many operations behind the scenes, and executes them in batches to minimize GPU kernel launches.
  • AccelerEyes strives to include only the best-performing code in ArrayFire. This means that ArrayFire uses existing implementations of functions when they are faster, such as Thrust for sorting, CULA for linear algebra, and CUFFT for FFTs.
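
Here is a rough sketch of the array-notation style described above. It uses the af:: API as it exists today, so exact names and headers may differ slightly in the release that was current when this post was written.

#include <arrayfire.h>
#include <cstdio>

int main(void)
{
    af::array A = af::randu(512, 512);       // 512x512 random matrix on the device
    af::array B = af::matmul(A, A) + 1.0f;   // matrix multiply plus an element-wise add
    float total = af::sum<float>(B);         // full reduction, returned to the host
    printf("sum = %f\n", total);
    return 0;
}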

Six Ways to SAXPY

This post is a GPU program chrestomathy. What’s a chrestomathy, you ask?

In computer programming, a program chrestomathy is a collection of similar programs written in various programming languages, for the purpose of demonstrating differences in syntax, semantics and idioms for each language. [Wikipedia]

There are several good examples of program chrestomathies on the web, including Rosetta Code and NBabel, which demonstrates gravitational N-body simulation in multiple programming languages. In this post I demonstrate six ways to implement a simple SAXPY computation on the CUDA platform. Why is this interesting? Because it demonstrates the breadth of options you have today for programming NVIDIA GPUs, and it covers the three main approaches to GPU computing: GPU-accelerated libraries, GPU compiler directives, and GPU programming languages.

SAXPY stands for “Single-Precision A·X Plus Y”. It is a function in the standard Basic Linear Algebra Subroutines (BLAS) library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. A simple C implementation looks like this.

void saxpy(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
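
To give a flavor of one of the six approaches, here is the CUDA C kernel version of the same operation; d_x and d_y are assumed to be device copies of x and y.

__global__ void saxpy(int n, float a, float *x, float *y)
{
    // one thread per element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements: 4096 blocks of 256 threads
saxpy<<<4096, 256>>>(1<<20, 2.0f, d_x, d_y);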

Expressive Algorithmic Programming with Thrust

Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust’s high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs. Interoperability with established technologies (such as CUDA, TBB, and OpenMP) facilitates integration with existing software. Develop high-performance applications rapidly with Thrust!

This excerpt from the Thrust home page perfectly summarizes the benefits of the Thrust library. Thrust enables expressive algorithmic programming via a vocabulary of parallel building blocks that let you rapidly develop fast, portable parallel algorithms. If you are a C++ programmer, and especially if you use template libraries like the STL and Boost C++ libraries, then you will find Thrust familiar. Like the STL, Thrust helps you focus on algorithms, rather than on platform-specific implementation details. At the same time, Thrust’s modular design allows low-level customization and interoperation with custom platform-specific code such as CUDA kernels and libraries.
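
As a small illustration of that last point, here is a minimal sketch (my example, not from the post) that hands a device_vector’s storage to a hand-written CUDA kernel via thrust::raw_pointer_cast.

#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

int main(void)
{
    thrust::device_vector<float> d_vec(1 << 20, 1.0f);

    // Hand the raw device pointer to a custom kernel: 4096 blocks of 256 threads
    float *raw = thrust::raw_pointer_cast(d_vec.data());
    scale<<<4096, 256>>>(raw, (int)d_vec.size(), 2.0f);
    cudaDeviceSynchronize();
    return 0;
}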

Thrust is High-Level

As described in the article “Thrust, a Productivity-Oriented Library for CUDA”, Thrust aims to solve two types of problems: problems that can be “implemented efficiently without a detailed mapping to the target architecture”, and problems that don’t merit or won’t receive (for whatever reason) significant optimization attention from the programmer. High-level primitives make it easier to capture programmer intent; developers describe what to compute, without dictating how to compute it. This allows the library to make informed decisions about how to implement the intended computation.

Thrust provides an STL-style vector container (with host_vector and device_vector implementations), and a suite of high-level algorithms including searching, sorting, copying, merging, transforming, reordering, reducing, prefix sums, and set operations. Here is an oft-repeated complete example program from the Thrust home page, which generates random numbers serially and then transfers them to the GPU where they are sorted.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>

int main(void)
{
  // generate 32M random numbers serially
  thrust::host_vector<int> h_vec(32 << 20);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;

  // sort data on the device
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  return 0;
}

In the Trenches at GTC: Languages, APIs and Development Tools for GPU Computing

By Michael Wang, The University of Melbourne, Australia (GTC ’12 Guest Blogger)

It’s 9 am, the first morning session of the pre-conference Tutorial Day. The atmosphere in the room is one of quiet anticipation. NVIDIA’s Will Ramey takes the stage and says: “this is going to be a great week.”

I couldn’t agree more. A quick show of hands reveals that more than 90% of the 200-strong audience had used CUDA in the past week. The prophetic words of Jack Dongarra aptly sum up why we are all here:

GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.

And things couldn’t be easier if you consider the three broad categories of tools available to you today: GPU-accelerated libraries, compiler directives, and programming languages.