As CUDA Educator at NVIDIA, I work to give access to massively parallel programming education & training to everyone, whether or not they have access to GPUs in their own machines. This is why, in partnership with qwikLABS, NVIDIA has made the hands-on content we use to train thousands of developers at the Supercomputing Conference and the GPU Technology Conference online and accessible from anywhere with an internet connection. Using any supported browser, you can easily get started learning how to program for massively parallel GPUs at nvidia.qwiklab.com.
Using the powerful IPython Notebook technology, NVIDIA hands-on labs are immersive, self-paced experiences that run on real GPUs in the cloud. Lab instructions, editing and execution of code, and even interaction with visual tools are all weaved together into a single web application.
You may already know NVIDIA Tesla as a line of GPU accelerator boards optimized for high-performance, general-purpose computing. They are used for parallel scientific, engineering, and technical computing, and they are designed for deployment in supercomputers, clusters, and workstations. But it’s not just the GPU boards that make Tesla a great computing solution. The combination of the world’s fastest GPU accelerators, the widely used CUDA parallel computing model, and a comprehensive ecosystem of software developers, software vendors, and data center system OEMs make Tesla the leading platform for accelerating data analytics and scientific computing.
The Tesla Accelerated Computing Platform provides advanced system management features and accelerated communication technology, and it is supported by popular infrastructure management software. These enable HPC professionals to easily deploy and manage Tesla accelerators in the data center. Tesla-accelerated applications are powered by CUDA, NVIDIA’s pervasive parallel computing platform and programming model, which provides application developers with a comprehensive suite of tools for productive, high-performance software development.
This post gives an overview of the broad range of technologies, tools, and components of the Tesla Accelerated Computing Platform that are available to application developers. Here’s what you need to know about the Tesla Platform. Continue reading →
Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the input data is supplied as low-resolution samples from an 8-bit analog-to-digital (A/D) converter, the samples may first have to be expanded into 32-bit floating point numbers before the FFT and the rest of the processing pipeline can start.
The cuFFT library included with CUDA 6.5 introduces device callbacks to improve performance of this sort of transforms. Callback routines are user-supplied device functions that cuFFT calls when loading or storing data. You can use callbacks to implement many pre- or post-processing operations that required launching separate CUDA kernels before CUDA 6.5.
Example DSP Pipeline
In this blog post we will implement the first stages of a typical DSP pipeline as depicted in Figure 1. We will first discuss a solution without callbacks using multiple custom kernels which we then use as a stepping stone towards a solution based on cuFFT device callbacks. The source code for both versions is available on github.
Batches of 8-bit fixed-point samples are input to the DSP pipline from an A/D converter. Each sample consists of 1024 data points. For more efficient processing, we group samples into batches of 1000 samples each. Therefore, you can think of this input as a 1000×1024 matrix of 8-bit fixed-point values. Continue reading →
Machine Learning (ML) has its origins in the field of Artificial Intelligence, which started out decades ago with the lofty goals of creating a computer that could do any work a human can do. While attaining that goal still appears to be in the distant future, many useful tools have been developed and successfully applied to a wide variety of problems. In fact, ML has now become a pervasive technology, underlying many modern applications. Today the world’s largest financial companies, internet firms and foremost research institutions are using ML in applications including internet search, fraud detection, gaming, face detection, image tagging, brain mapping, check processing and computer server health-monitoring, to name a few. The US Postal Service uses machine learning techniques for hand-writing recognition, and leading applied-research government agencies such as IARPA and DARPA are funding work to develop the next generation of ML systems.
There is a wide variety of algorithms and processes for implementing ML systems. The hottest area in ML today however, is the area of Deep Neural Networks (DNNs). The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training large, complex DNN-based ML systems. Pioneers in this area include luminaries like Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and Andrew Ng. Their success over the past 30 years has inspired a groundswell of research and development in academia, including universities such as Carnegie Mellon, NYU, Oxford, Stanford, University of California at Berkeley, University of Montreal, and the University of Toronto. More recently, many commercial enterprises have also started investing aggressively in this technology. A few that have publicly acknowledged using GPUs with deep learning include Adobe, Baidu, Nuance, and Yandex.
Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, NVIDIA is introducing a library of primitives for deep neural networks called cuDNN. The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, and provides other important benefits.
Machine Learning with DNNs
A ML system may be thought of as a system that learns to recognize things of interest to us, without being told explicitly what the things are ahead of time. Classic examples of such a system are the spam classifier, which scans your incoming messages and quarantines spam emails, and product recommender systems which suggest new products (books, movies, etc.) that you might like based on your prior purchases and ratings. Continue reading →
Today we’re excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more.
CUDA 6.5 takes the next step, enabling CUDA on 64-bit ARM platforms. The heritage of ARM64 is in low-power, scale-out data centers and microservers, while GPUs are built for ultra-fast compute performance. When we combine the two, we have a compelling solution for HPC. ARM64 provides power efficiency, system configurability, and a large, open ecosystem. GPUs bring to the table high-throughput, power-efficient compute performance, a large HPC ecosystem, and hundreds of CUDA-accelerated applications. For HPC applications, ARM64 CPUs can offload the heavy lifting of computational tasks to GPUs. CUDA and GPUs make ARM64 competitive in HPC from day one.
Development platforms available now for CUDA on ARM64 include the Cirrascale RM1905D HPC Development Platform and the E4 ARKA EK003. Eurotech has announced a system available later this year. These platforms are built on Applied Micro X-Gene 8-core 2.4GHz ARM64 CPUs, Tesla K20 GPU Accelerators, and CUDA 6.5. As Figure 1 shows, performance of CUDA-accelerated applications on ARM64+GPU systems is competitive with x86+GPU systems.
cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS can not be used as a direct BLAS replacement for applications originally intended to run on the CPU. In order to use the cuBLAS API:
a CUDA context first needs to be created
a cuBLAS handle needs to be initialized
all relevant data needs to be copied to preallocated GPU memory, followed by deallocation after the computation
Such an API permits the fine tuning required to minimize redundant data copies to and from the GPU in arbitrarily complicated scenarios such that maximum performance is achieved. But it is less convenient when just a few BLAS routines need to be accelerated (simple data copy) or when vast amounts of code need to be modified (large programmer effort). In these cases it would be useful to have an API which managed the data transfer to and from the GPU automatically and could be used as a direct replacement for CPU BLAS libraries.
Additionally, there is the common case where the input matrices to the BLAS operations are too large to fit on the GPU. While using the cuBLAS API to write a tiled BLAS implementation (which achieves even higher performance) is straightforward, a GPU BLAS library which implemented and managed such tiling in a near optimal way would certainly facilitate access to the computing power of the GPU.
To address these issues, CUDA 6 adds new Multi-GPU extensions, implemented for the most compute intensive BLAS Level 3 routines. They are called cuBLAS-XT and can work directly with host data, removing the need to manually allocate and copy data to the GPU’s memory. NVBLAS is a dynamic library built on top of these extensions which offers a transparent BLAS Level 3 acceleration with zero coding effort. That is, CPU BLAS libraries can be directly replaced with NVBLAS. As such, NVBLAS can be used to easily accelerate any application which uses level-3 BLAS routines. Continue reading →
Back in January I wrote a post about the public beta availability of AmgX, a linear solver library for large-scale industrial applications. Since then, AmgX has grown up! Now we can solve problems that were impossible for us before, due to the addition of “classical” Algebraic Multi-Grid (often called Ruge-Stueben AMG). V1.0 comes complete with classical AMG multi-GPU support, greatly improved scalability, and we have some nice performance numbers to back it up.
Models of Flow
One specific class of problem has eluded us, until now. In the oil and gas industry, reservoir simulation is used to predict the behavior of wells producing from large hydrocarbon deposits, and more recently from shale gas or shale oil fields. These problems are models of flow through porous media, coupled with flow through networks of fractures, piping and processing equipment, but it is the media that makes all the difference. Oil and gas deposits aren’t like big caves with lakes of oil, they are more like complex, many-layered sponges, each with different pore sizes, stiffness and hydrocarbon content.
When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM). In this post I’ll show you how to leverage these batched routines from CUDA Fortran.
The C interface batched cuBLAS functions use an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type. Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.
Writing Interfaces to Batched cuBLAS Routines
At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:
integer(c_int) function &
type(cublasHandle), value :: h
integer(c_int), value :: n
type(c_devptr), device :: Aarray(*)
integer(c_int), value :: lda
integer(c_int), device :: ipvt(*)
integer(c_int), device :: info(*)
integer(c_int), value :: batchSize
end function cublasSgetrfBatched
Continuing the Thrust mini-series (see Part 1), today’s episode of CUDACasts focuses on a few of the algorithms that make Thrust a flexible and powerful parallel programming library. You’ll also learn how to use functors, or C++ “function objects”, to customize how Thrust algorithms process data.
In the next CUDACast in this Thrust mini-series, we’ll take a look at how fancy iterators increase the flexibility Thrust has for expressing parallel algorithms in C++.
Whenever I hear about a developer interested in accelerating his or her C++ application on a GPU, I make sure to tell them about Thrust. Thrust is a parallel algorithms library loosely based on the C++ Standard Template Library. Thrust provides a number of building blocks, such as sort, scans, transforms, and reductions, to enable developers to quickly embrace the power of parallel computing. In addition to targeting the massive parallelism of NVIDIA GPUs, Thrust supports multiple system back-ends such as OpenMP and Intel’s Threading Building Blocks. This means that it’s possible to compile your code for different parallel processors with a simple flick of a compiler switch.
For this first in a mini-series of screencasts about Thrust, we’ll write a simple sorting program and execute it on both a GPU and a multi-core CPU. In upcoming episodes, we’ll explore more capabilities of Thrust which really show its flexibility and power. For more examples of using Thrust, read the post Expressive Algorithmic Programming with Thrust, and check out the Thrust Quick Start Guide.