[Note: Lung Sheng Chien from NVIDIA also contributed to this post.]

A key bottleneck for most science and engineering simulations is the solution of sparse linear systems of equations, which can account for up to 95% of total simulation time. There are two types of solvers for these systems: iterative and direct solvers. Iterative solvers are favored for the largest systems these days (see my earlier posts about AmgX), while direct solvers are useful for smaller systems because of their accuracy and robustness.

CUDA 7 expands the capabilities of GPU-accelerated numerical computing with cuSOLVER, a powerful new suite of direct linear system solvers. These solvers provide highly accurate and robust solutions for smaller systems, and cuSOLVER offers a way of combining many small systems into a ‘batch’ and solving all of them in parallel, which is critical for the most complex simulations today. Combustion models, bio-chemical models and advanced high-order finite-element models all benefit directly from this new capability. Computer vision and object detection applications need to solve many least-squares problems, so they will also benefit from cuSOLVER.

Direct solvers rely on algebraic factorization of a matrix, which breaks a hard-to-solve matrix into two or more easy-to-solve factors, and a solver routine which uses the factors and a right hand side vector and solves them one at a time to give a highly accurate solution. Figure 1 shows an example of factorization of a dense matrix. A solver for this factorization would first solve the transpose of L part, then apply the inverse of the D (diagonal) part in parallel, then solve again with L to arrive at the final answer. The benefit of direct solvers is that (unlike iterative solvers), they always find a solution (when the factors exist; more on this later) and once a factorization is found, solutions for many right-hand sides can be performed using the factors at a much lower cost per solution. Also, for small systems, direct solvers are typically faster than iterative methods because they only pass over the matrix once.

In this post I give an overview of cuSOLVER followed by an example of using batch QR factorization for solving many sparse systems in parallel. In a followup post I will cover other aspects of cuSOLVER, including dense system solvers and the cuSOLVER refactorization API.

The ArrayFire library is a high-performance software library with a focus on portability and productivity. It supports highly tuned, GPU-accelerated algorithms using an easy-to-use API. ArrayFire wraps GPU memory into a simple “array” object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.

ArrayFire Capabilities

ArrayFire is an open sourceC/C++ library, with language bindings for R, Java and Fortran. ArrayFire has a range of functionality, including

ArrayFire has three back ends to enable portability across many platforms: CUDA, OpenCL and CPU. It even works on embedded platforms like NVIDIA’s Jetson TK1.

In a past post about ArrayFire we demonstrated the ArrayFire capabilities and how you can increase your productivity by using ArrayFire. In this post I will tell you how you can use ArrayFire to exploit various kind of parallelism on NVIDIA GPUs. Continue reading →

As CUDA Educator at NVIDIA, I work to give access to massively parallel programming education & training to everyone, whether or not they have access to GPUs in their own machines. This is why, in partnership with qwikLABS, NVIDIA has made the hands-on content we use to train thousands of developers at the Supercomputing Conference and the GPU Technology Conference online and accessible from anywhere with an internet connection. Using any supported browser, you can easily get started learning how to program for massively parallel GPUs at nvidia.qwiklab.com.

Using the powerful IPython Notebook technology, NVIDIA hands-on labs are immersive, self-paced experiences that run on real GPUs in the cloud. Lab instructions, editing and execution of code, and even interaction with visual tools are all weaved together into a single web application.

You may already know NVIDIA Tesla as a line of GPU accelerator boards optimized for high-performance, general-purpose computing. They are used for parallel scientific, engineering, and technical computing, and they are designed for deployment in supercomputers, clusters, and workstations. But it’s not just the GPU boards that make Tesla a great computing solution. The combination of the world’s fastest GPU accelerators, the widely used CUDA parallel computing model, and a comprehensive ecosystem of software developers, software vendors, and data center system OEMs make Tesla the leading platform for accelerating data analytics and scientific computing.

The Tesla Accelerated Computing Platform provides advanced system management features and accelerated communication technology, and it is supported by popular infrastructure management software. These enable HPC professionals to easily deploy and manage Tesla accelerators in the data center. Tesla-accelerated applications are powered by CUDA, NVIDIA’s pervasive parallel computing platform and programming model, which provides application developers with a comprehensive suite of tools for productive, high-performance software development.

This post gives an overview of the broad range of technologies, tools, and components of the Tesla Accelerated Computing Platform that are available to application developers. Here’s what you need to know about the Tesla Platform. Continue reading →

Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the input data is supplied as low-resolution samples from an 8-bit analog-to-digital (A/D) converter, the samples may first have to be expanded into 32-bit floating point numbers before the FFT and the rest of the processing pipeline can start.

The cuFFT library included with CUDA 6.5 introduces device callbacks to improve performance of this sort of transforms. Callback routines are user-supplied device functions that cuFFT calls when loading or storing data. You can use callbacks to implement many pre- or post-processing operations that required launching separate CUDA kernels before CUDA 6.5.

Example DSP Pipeline

In this blog post we will implement the first stages of a typical DSP pipeline as depicted in Figure 1. We will first discuss a solution without callbacks using multiple custom kernels which we then use as a stepping stone towards a solution based on cuFFT device callbacks. The source code for both versions is available on github.

Batches of 8-bit fixed-point samples are input to the DSP pipline from an A/D converter. Each sample consists of 1024 data points. For more efficient processing, we group samples into batches of 1000 samples each. Therefore, you can think of this input as a 1000×1024 matrix of 8-bit fixed-point values. Continue reading →

Machine Learning (ML) has its origins in the field of Artificial Intelligence, which started out decades ago with the lofty goals of creating a computer that could do any work a human can do. While attaining that goal still appears to be in the distant future, many useful tools have been developed and successfully applied to a wide variety of problems. In fact, ML has now become a pervasive technology, underlying many modern applications. Today the world’s largest financial companies, internet firms and foremost research institutions are using ML in applications including internet search, fraud detection, gaming, face detection, image tagging, brain mapping, check processing and computer server health-monitoring, to name a few. The US Postal Service uses machine learning techniques for hand-writing recognition, and leading applied-research government agencies such as IARPA and DARPA are funding work to develop the next generation of ML systems.

There is a wide variety of algorithms and processes for implementing ML systems. The hottest area in ML today however, is the area of Deep Neural Networks (DNNs). The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training large, complex DNN-based ML systems. Pioneers in this area include luminaries like Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and Andrew Ng. Their success over the past 30 years has inspired a groundswell of research and development in academia, including universities such as Carnegie Mellon, NYU, Oxford, Stanford, University of California at Berkeley, University of Montreal, and the University of Toronto. More recently, many commercial enterprises have also started investing aggressively in this technology. A few that have publicly acknowledged using GPUs with deep learning include Adobe, Baidu, Nuance, and Yandex.

Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, NVIDIA is introducing a library of primitives for deep neural networks called cuDNN. The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, and provides other important benefits.

Machine Learning with DNNs

A ML system may be thought of as a system that learns to recognize things of interest to us, without being told explicitly what the things are ahead of time. Classic examples of such a system are the spam classifier, which scans your incoming messages and quarantines spam emails, and product recommender systems which suggest new products (books, movies, etc.) that you might like based on your prior purchases and ratings. Continue reading →

Today we’re excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more.

CUDA 6.5 takes the next step, enabling CUDA on 64-bit ARM platforms. The heritage of ARM64 is in low-power, scale-out data centers and microservers, while GPUs are built for ultra-fast compute performance. When we combine the two, we have a compelling solution for HPC. ARM64 provides power efficiency, system configurability, and a large, open ecosystem. GPUs bring to the table high-throughput, power-efficient compute performance, a large HPC ecosystem, and hundreds of CUDA-accelerated applications. For HPC applications, ARM64 CPUs can offload the heavy lifting of computational tasks to GPUs. CUDA and GPUs make ARM64 competitive in HPC from day one.

Development platforms available now for CUDA on ARM64 include the Cirrascale RM1905D HPC Development Platform and the E4 ARKA EK003. Eurotech has announced a system available later this year. These platforms are built on Applied Micro X-Gene 8-core 2.4GHz ARM64 CPUs, Tesla K20 GPU Accelerators, and CUDA 6.5. As Figure 1 shows, performance of CUDA-accelerated applications on ARM64+GPU systems is competitive with x86+GPU systems.

cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS can not be used as a direct BLAS replacement for applications originally intended to run on the CPU. In order to use the cuBLAS API:

a CUDA context first needs to be created

a cuBLAS handle needs to be initialized

all relevant data needs to be copied to preallocated GPU memory, followed by deallocation after the computation

Such an API permits the fine tuning required to minimize redundant data copies to and from the GPU in arbitrarily complicated scenarios such that maximum performance is achieved. But it is less convenient when just a few BLAS routines need to be accelerated (simple data copy) or when vast amounts of code need to be modified (large programmer effort). In these cases it would be useful to have an API which managed the data transfer to and from the GPU automatically and could be used as a direct replacement for CPU BLAS libraries.

Additionally, there is the common case where the input matrices to the BLAS operations are too large to fit on the GPU. While using the cuBLAS API to write a tiled BLAS implementation (which achieves even higher performance) is straightforward, a GPU BLAS library which implemented and managed such tiling in a near optimal way would certainly facilitate access to the computing power of the GPU.

To address these issues, CUDA 6 adds new Multi-GPU extensions, implemented for the most compute intensive BLAS Level 3 routines. They are called cuBLAS-XT and can work directly with host data, removing the need to manually allocate and copy data to the GPU’s memory. NVBLAS is a dynamic library built on top of these extensions which offers a transparent BLAS Level 3 acceleration with zero coding effort. That is, CPU BLAS libraries can be directly replaced with NVBLAS. As such, NVBLAS can be used to easily accelerate any application which uses level-3 BLAS routines. Continue reading →

Back in January I wrote a post about the public beta availability of AmgX, a linear solver library for large-scale industrial applications. Since then, AmgX has grown up! Now we can solve problems that were impossible for us before, due to the addition of “classical” Algebraic Multi-Grid (often called Ruge-Stueben AMG). V1.0 comes complete with classical AMG multi-GPU support, greatly improved scalability, and we have some nice performance numbers to back it up.

Models of Flow

One specific class of problem has eluded us, until now. In the oil and gas industry, reservoir simulation is used to predict the behavior of wells producing from large hydrocarbon deposits, and more recently from shale gas or shale oil fields. These problems are models of flow through porous media, coupled with flow through networks of fractures, piping and processing equipment, but it is the media that makes all the difference. Oil and gas deposits aren’t like big caves with lakes of oil, they are more like complex, many-layered sponges, each with different pore sizes, stiffness and hydrocarbon content.

When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM). In this post I’ll show you how to leverage these batched routines from CUDA Fortran.

The C interface batched cuBLAS functions use an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type. Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.

Writing Interfaces to Batched cuBLAS Routines

At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:

interface
integer(c_int) function &
cublasSgetrfBatched(h,n,Aarray,lda,ipvt,info,batchSize) &
bind(c,name='cublasSgetrfBatched')
use iso_c_binding
use cublas
type(cublasHandle), value :: h
integer(c_int), value :: n
type(c_devptr), device :: Aarray(*)
integer(c_int), value :: lda
integer(c_int), device :: ipvt(*)
integer(c_int), device :: info(*)
integer(c_int), value :: batchSize
end function cublasSgetrfBatched
end interface