Today we’re excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more.
CUDA on ARM64
Last year we introduced CUDA on ARM, and in March we released the Jetson TK1 developer board, which enables development of CUDA on the NVIDIA Tegra K1 system-on-a-chip, which includes a quad-core 32-bit ARM CPU and an NVIDIA Kepler GPU. There is a lot of excitement about developing mobile and embedded parallel computing applications on Jetson TK1. And this week at the Hot Chips conference, we provided more details about our upcoming 64-bit Denver ARM CPU architecture. CUDA 6.5 takes the next step, enabling CUDA on 64-bit ARM platforms.
The heritage of ARM64 is in low-power, scale-out data centers and microservers, while GPUs are built for ultra-fast compute performance. When we combine the two, we have a compelling solution for HPC. ARM64 provides power efficiency, system configurability, and a large, open ecosystem. GPUs bring to the table high-throughput, power-efficient compute performance, a large HPC ecosystem, and hundreds of CUDA-accelerated applications. For HPC applications, ARM64 CPUs can offload the heavy lifting of computational tasks to GPUs. CUDA and GPUs make ARM64 competitive in HPC from day one.
Development platforms available now for CUDA on ARM64 include the Cirrascale RM1905D HPC Development Platform and the E4 ARKA EK003. Eurotech has announced a system available later this year. These platforms are built on Applied Micro X-Gene 8-core 2.4GHz ARM64 CPUs, Tesla K20 GPU Accelerators, and CUDA 6.5. As Figure 1 shows, performance of CUDA-accelerated applications on ARM64+GPU systems is competitive with x86+GPU systems.
cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS cannot be used as a direct BLAS replacement for applications originally intended to run on the CPU. To use the cuBLAS API:
a CUDA context first needs to be created
a cuBLAS handle needs to be initialized
all relevant data needs to be copied to preallocated GPU memory, followed by deallocation after the computation
Such an API permits the fine tuning required to minimize redundant data copies to and from the GPU in arbitrarily complicated scenarios such that maximum performance is achieved. But it is less convenient when just a few BLAS routines need to be accelerated (simple data copy) or when vast amounts of code need to be modified (large programmer effort). In these cases it would be useful to have an API which managed the data transfer to and from the GPU automatically and could be used as a direct replacement for CPU BLAS libraries.
Additionally, there is the common case where the input matrices to the BLAS operations are too large to fit on the GPU. While using the cuBLAS API to write a tiled BLAS implementation (which achieves even higher performance) is straightforward, a GPU BLAS library which implemented and managed such tiling in a near optimal way would certainly facilitate access to the computing power of the GPU.
To address these issues, CUDA 6 adds new multi-GPU extensions, implemented for the most compute-intensive BLAS Level 3 routines. They are called cuBLAS-XT and can work directly with host data, removing the need to manually allocate and copy data to the GPU’s memory. NVBLAS is a dynamic library built on top of these extensions which offers transparent BLAS Level 3 acceleration with zero coding effort. That is, CPU BLAS libraries can be directly replaced with NVBLAS. As such, NVBLAS can be used to easily accelerate any application which uses BLAS Level 3 routines.
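To make the tiling idea concrete, here is a minimal host-only C++ sketch of a tiled matrix multiply. It is not the cuBLAS-XT implementation and calls no GPU API; the function name `tiled_gemm` and its parameters are hypothetical. It only illustrates how a large product decomposes into small tile products, each of which is the kind of unit a GPU-backed library could copy to the device and multiply there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of cuBLAS): computes C += A * B for n x n
// row-major matrices, working on tile x tile blocks. Each (ii, jj, kk) step
// reads one tile of A and one tile of B and updates one tile of C.
void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t n, std::size_t tile) {
    assert(tile > 0);
    assert(A.size() == n * n && B.size() == n * n && C.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t jj = 0; jj < n; jj += tile) {
            for (std::size_t kk = 0; kk < n; kk += tile) {
                // Clamp tile bounds so n need not be a multiple of tile.
                const std::size_t iEnd = std::min(ii + tile, n);
                const std::size_t jEnd = std::min(jj + tile, n);
                const std::size_t kEnd = std::min(kk + tile, n);
                for (std::size_t i = ii; i < iEnd; ++i)
                    for (std::size_t k = kk; k < kEnd; ++k)
                        for (std::size_t j = jj; j < jEnd; ++j)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
            }
        }
    }
}
```

Calling `tiled_gemm(A, B, C, n, tile)` with `C` zero-initialized yields the plain product; the tile size only changes the order of the work, not the result for exactly representable inputs.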
Back in January I wrote a post about the public beta availability of AmgX, a linear solver library for large-scale industrial applications. Since then, AmgX has grown up! Now we can solve problems that were impossible for us before, due to the addition of “classical” Algebraic Multi-Grid (often called Ruge-Stueben AMG). V1.0 comes complete with classical AMG multi-GPU support and greatly improved scalability, and we have some nice performance numbers to back it up.
Models of Flow
One specific class of problem has eluded us, until now. In the oil and gas industry, reservoir simulation is used to predict the behavior of wells producing from large hydrocarbon deposits, and more recently from shale gas or shale oil fields. These problems are models of flow through porous media, coupled with flow through networks of fractures, piping and processing equipment, but it is the media that makes all the difference. Oil and gas deposits aren’t like big caves with lakes of oil; they are more like complex, many-layered sponges, each with different pore sizes, stiffness and hydrocarbon content.
When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM). In this post I’ll show you how to leverage these batched routines from CUDA Fortran.
The C interface batched cuBLAS functions use an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type. Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.
Writing Interfaces to Batched cuBLAS Routines
At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:
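interface
  ! Binds directly to the C entry point; the result is the cuBLAS status code
  integer(c_int) function &
      cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchSize) &
      bind(c, name='cublasSgetrfBatched')
    use iso_c_binding
    use cublas
    type(cublasHandle), value :: h
    integer(c_int), value :: n
    ! Device array of device pointers, one per matrix in the batch
    type(c_devptr), device :: Aarray(*)
    integer(c_int), value :: lda
    integer(c_int), device :: ipvt(*)
    integer(c_int), device :: info(*)
    integer(c_int), value :: batchSize
  end function cublasSgetrfBatched
end interface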
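To illustrate the array-of-pointers calling convention, here is a minimal host-only C++ sketch of what a batched LU factorization does with its arguments. It is a CPU reference, not cuBLAS code: the function name `getrf_batched_reference` is hypothetical, and the argument list only mirrors the interface above without the handle. Matrices are column-major, pivots are 1-based, and `info` holds 0 on success or the index of the first zero pivot, which are assumptions chosen to follow LAPACK conventions.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Hypothetical CPU reference: LU-factorizes each n x n column-major matrix in
// place with partial pivoting. Aarray[b] points to the b-th independent
// matrix; ipvt holds n pivot indices per matrix; info holds one status each.
void getrf_batched_reference(int n, float* const Aarray[], int lda,
                             int* ipvt, int* info, int batchSize) {
    assert(lda >= n);
    for (int b = 0; b < batchSize; ++b) {      // each matrix is independent
        float* A = Aarray[b];
        int* piv = ipvt + b * n;
        info[b] = 0;
        for (int k = 0; k < n; ++k) {
            // Pick the row with the largest magnitude in column k as pivot.
            int p = k;
            for (int i = k + 1; i < n; ++i)
                if (std::fabs(A[i + k * lda]) > std::fabs(A[p + k * lda])) p = i;
            piv[k] = p + 1;                    // 1-based pivot index
            if (A[p + k * lda] == 0.0f) {      // singular: record first zero pivot
                if (info[b] == 0) info[b] = k + 1;
                continue;
            }
            if (p != k)
                for (int j = 0; j < n; ++j)
                    std::swap(A[k + j * lda], A[p + j * lda]);
            // Store multipliers below the diagonal and update the trailing block.
            for (int i = k + 1; i < n; ++i) {
                A[i + k * lda] /= A[k + k * lda];
                for (int j = k + 1; j < n; ++j)
                    A[i + j * lda] -= A[i + k * lda] * A[k + j * lda];
            }
        }
    }
}
```

The outer loop over `b` is the part a batched GPU routine runs in parallel: because every pointer in `Aarray` refers to an independent matrix, no iteration depends on another.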
Continuing the Thrust mini-series (see Part 1), today’s episode of CUDACasts focuses on a few of the algorithms that make Thrust a flexible and powerful parallel programming library. You’ll also learn how to use functors, or C++ “function objects”, to customize how Thrust algorithms process data.
In the next CUDACast in this Thrust mini-series, we’ll take a look at how fancy iterators increase the flexibility Thrust has for expressing parallel algorithms in C++.
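As a rough illustration of the functor idea, here is a minimal sketch that uses the C++ standard library rather than Thrust, so it runs on the CPU only. The names `SaxpyFunctor` and `saxpy` are hypothetical. Thrust's `thrust::transform` accepts a function object in the same way, although a functor meant to run on the GPU would also need `__host__ __device__` on its call operator.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A function object (functor): it carries state (the scalar a) set at
// construction, and the algorithm invokes operator() once per element pair.
struct SaxpyFunctor {
    float a;
    explicit SaxpyFunctor(float a_) : a(a_) {}
    float operator()(float x, float y) const { return a * x + y; }
};

// Hypothetical helper: returns a * x + y element-wise by handing the functor
// to the binary form of std::transform.
std::vector<float> saxpy(float a, const std::vector<float>& x,
                         const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    std::transform(x.begin(), x.end(), y.begin(), out.begin(), SaxpyFunctor(a));
    return out;
}
```

Swapping in a different functor changes what the algorithm computes without touching the loop, which is the customization point the episode describes.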
Whenever I hear about a developer interested in accelerating his or her C++ application on a GPU, I make sure to tell them about Thrust. Thrust is a parallel algorithms library loosely based on the C++ Standard Template Library. Thrust provides a number of building blocks, such as sorts, scans, transforms, and reductions, to enable developers to quickly embrace the power of parallel computing. In addition to targeting the massive parallelism of NVIDIA GPUs, Thrust supports multiple system back-ends such as OpenMP and Intel’s Threading Building Blocks. This means that it’s possible to compile your code for different parallel processors with a simple flick of a compiler switch.
For this first in a mini-series of screencasts about Thrust, we’ll write a simple sorting program and execute it on both a GPU and a multi-core CPU. In upcoming episodes, we’ll explore more capabilities of Thrust which really show its flexibility and power. For more examples of using Thrust, read the post Expressive Algorithmic Programming with Thrust, and check out the Thrust Quick Start Guide.
Many industries use Computational Fluid Dynamics (CFD) to predict fluid flow forces on products during the design phase, using only numerical methods. A famous example is Boeing’s 777 airliner, which was designed and built without the construction (or destruction) of a single model in a wind tunnel, an industry first. This approach dramatically reduces the cost of designing new products for which aerodynamics is a large part of the value add. Another good example is Formula 1 racing, where a fraction of a percentage point reduction in drag forces on the car body can make the difference between a winning and a losing season.
Users of CFD models crave higher accuracy and faster run times. The key enabling algorithm for realistic models in CFD is Algebraic Multi-Grid (AMG). This algorithm allows solution times to scale linearly with the number of unknowns in the model; it can be applied to arbitrary geometries with highly refined and unstructured numerical meshes; and it can be run efficiently in parallel. Unfortunately, AMG is also very complex and requires specialty programming and mathematical skills, which are in short supply. Add in the need for GPU programming skills, and GPU-accelerated AMG seems a high mountain to climb. Existing GPU-accelerated AMG implementations (most notably the one in CUSP) are more proofs of concept than industrial strength solvers for real world CFD applications, and highly tuned multi-threaded and/or distributed CPU implementations can outperform them in many cases. Industrial CFD users had few options for GPU acceleration, so NVIDIA decided to do something about it.
NVIDIA partnered with ANSYS, provider of the leading CFD software Fluent, to develop a high-performance, robust and scalable GPU-accelerated AMG library. We call the library AmgX (for AMG Accelerated). Fluent 15.0 uses AmgX as its default linear solver, and it takes advantage of a CUDA-enabled GPU when it detects one. AmgX can even use MPI to connect clusters of servers to solve very large problems that require dozens of GPUs. The aerodynamics problem in Figure 1 required 48 NVIDIA K40X GPUs, and involved 111 million cells and over 440 million unknowns.
This week’s Spotlight is on Dr. Knut Reinert. Knut is a professor at Freie Universität in Berlin, Germany, and chair of the Algorithms in Bioinformatics group in the Institute of Computer Science. Knut and his team focus on the development of novel algorithms and data structures for problems in the analysis of biomedical mass data. In particular, the group develops mathematical models for analyzing large genomic sequences and data derived from mass spectrometry experiments (for example, for detecting differential expression of proteins between normal and diseased samples). Previously, Knut was at Celera Genomics, where he worked on bioinformatics algorithms and software for the Human Genome Project, which assembled the very first human genome.
Following is an excerpt from our interview (you can read the complete Spotlight here).
NVIDIA: Knut, tell us about the SeqAn library.
Knut: Before setting up the Algorithmic Bioinformatics group at Freie Universität, I had been working for years at a U.S. company – Celera Genomics in Maryland – where I worked on the assembly of both the Drosophila (fruit fly) and human genomes. A central part of these projects was the development of large software packages containing algorithms for assembly and genome analysis developed by the Informatics Research team at Celera.
Although successful, the endeavor clearly showed the lack of available implementations in sequence analysis, even for standard tasks. Implementations of much needed algorithmic components were either not available, or hard to access in third-party, monolithic software products.
With this experience in mind, and being educated at the Max-Planck Institute for Computer Science in Saarbrücken (the home of very successful software libraries like LEDA and CGAL), I put the development of such a software library high on my research agenda.
GPU libraries provide an easy way to accelerate applications without writing any GPU-specific code. With the new CUDA 5.5 version of the NVIDIA CUFFT Fast Fourier Transform library, FFT acceleration gets even easier, with new support for the popular FFTW API. It is now extremely simple for developers to accelerate existing FFTW library calls on the GPU, sometimes with no code changes! By simply changing the linker command line to link the CUFFT library instead of the FFTW library, you can take advantage of the GPU with only a re-link. In today’s CUDACast, we take a simple application that uses the standard FFTW library, and accelerate the function calls on the GPU by simply changing which library we link. In fact, the only code change we will make is to use the cufftw.h header file. This ensures that, at compile time, we are not calling any unsupported functions.
This is a guest post by Chris McClanahan from Accelereyes.
ArrayFire is a fast and easy-to-use GPU matrix library developed by Accelereyes. ArrayFire wraps GPU memory into a simple “array” object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.
ArrayFire can be used as a self-contained library, or integrated into existing CUDA code to supplement it. The array object can wrap data from CUDA device pointers and existing CPU memory.
ArrayFire contains built-in graphics functions for data visualization. The graphics library in ArrayFire provides easy rendering of 2D and 3D data, and leverages CUDA OpenGL interoperation, so visualization is fast and efficient. Various visualization algorithms make it easy to explore complex data.
ArrayFire offers a unique “gfor” construct that can drastically speed up conventional “for” loops over data. The gfor loop essentially auto-vectorizes the code inside, and executes all iterations of the loop simultaneously.
ArrayFire supports C, C++, and Fortran on top of the CUDA platform.
ArrayFire is built on top of a custom just-in-time (JIT) compiler for efficient GPU memory usage. The JIT back-end in ArrayFire automatically combines many operations behind the scenes, and executes them in batches to minimize GPU kernel launches.
Accelereyes strives to include only the best performing code in ArrayFire. This means that ArrayFire uses existing implementations of functions when they are faster, such as Thrust for sorting, CULA for linear algebra, and CUFFT for FFTs.