Managing complexity in large programs requires breaking them down into components that are responsible for small, well-defined portions of the overall program. Separate compilation is an integral part of the C and C++ programming languages which allows portions of a program to be compiled into separate objects and then linked together to form an executable or library. Developing large and complex GPU programs is no different, and starting with CUDA 5.0, separate compilation and linking are now important tools in the repertoire of CUDA C/C++ programmers.
In this post, we explore separate compilation and linking of device code and highlight situations where it is helpful. In the process, we’ll walk through a simple example to show how device code linking can let you move existing code to the GPU with minimal changes to your class hierarchy and build infrastructure.
One of the key limitations that device code linking lifts is the need to have all the code for a GPU kernel present when compiling the kernel, including all the device functions that the kernel calls. As C++ programmers, we are used to calling externally defined functions simply by declaring the functions’ prototypes (or including a header that declares them).
Managing Complexity with Separate Compilation
The common approach to organizing and building C++ applications is to define all member functions of a single class in one or more .cpp source files, and compile each .cpp source file into a separate .o object file. Other classes and functions may call these member functions from anywhere in the program by including the class header file; the function implementation is not needed to compile another function that calls it. After compiling all code, the linker connects calls to functions implemented in other files as part of the process of generating the executable.
Let’s imagine a very simple example application which has two classes: a particle class and a three-dimensional vector class, v3, that it uses. Our main task is moving the particle objects along randomized trajectories. Particle filters and Monte Carlo simulations frequently involve operations of this sort. We’ll use a CUDA C++ kernel in which each thread calls particle::advance() on a particle. Continue reading →
Today I’m excited to announce the release of CUDA 6, a new version of the CUDA Toolkit that includes some of the most significant new functionality in the history of CUDA. In this brief post I will share with you the most important new features in CUDA 6 and tell you where to get more information. You may also want to watch the recording of my talk “CUDA 6 and Beyond” from last month’s GPU Technology Conference, embedded below.
Without further ado, if you are ready to download the CUDA Toolkit version 6.0 now, by all means, go get it on CUDA Zone. The five most important new features of CUDA 6 are
support for Unified Memory;
CUDA on Tegra K1 mobile/embedded system-on-a-chip;
This week’s Spotlight is on Dr. Paul Richmond, a Vice Chancellor’s Research Fellow at the University of Sheffield (a CUDA Research Center). Paul’s research interests relate to the simulation of complex systems and to parallel computer hardware.
NVIDIA: Paul, tell us about FLAME GPU. Paul: Agent-Based Simulation is a powerful technique used to assess and predict group behavior from a number of simple interacting rules between communicating autonomous individuals (agents). Individuals typically represent some biological entity such as a molecule, cell or organism and can therefore be used to simulate systems at varying biological scales.
The Flexible Large-scale Agent Modelling Environment for the GPU (FLAME GPU) is a piece of software which enables high level descriptions communicating agents to be automatically translated to GPU hardware. With FLAME GPU, simulation performance is enormously increased over traditional agent-based modeling platforms and interactive visualization can easily be achieved. The GPU architecture and the underlying software algorithms are abstracted from users of the FLAME GPU software, ensuring accessibility to users in a wide range of domains and application areas.
NVIDIA: How does FLAME GPU leverage GPU computing? Paul: Unlike other agent-based simulation frameworks, FLAME GPU is designed from the ground up with parallelism in mind. As such it is possible to ensure that agents and behavior are mapped to the GPU efficiently in a way which minimizes data transfer during simulation. Continue reading →
Today, cars are learning to see pedestrians and road hazards; robots are becoming higher functioning; complex medical diagnostic devices are becoming more portable; and unmanned aircraft are learning to navigate autonomously. As a result, the computational requirements for these devices are increasing exponentially, while their size, weight, and power limits continue to decrease. Aimed at these and other embedded parallel computing applications, last week at the 2014 GPU Technology Conference NVIDIA announced an awesome new developer platform called Jetson TK1.
Jetson TK1 is a tiny but full-featured computer designed for development of embedded and mobile applications. Jetson TK1 is exciting because it incorporates Tegra K1, the first mobile processor to feature a CUDA-capable GPU. Jetson TK1 brings the capabilities of Tegra K1 to developers in a compact, low-power platform that makes development as simple as developing on a PC.
Jetson TK1 is aimed at two groups of people. The first are OEMs, including robotics, avionics, and medical device companies, who would like to develop new products that use Tegra K1 SoCs, and need a development platform that makes it easy to write software for these products. Once these companies are ready to move to production, they can work with one of our board partners to design the exact board that they need for their product. The second group is the large number of independent developers, researchers, makers, and hobbyists who would like a platform that will enable them to create amazing technology such as robots, security devices, or anything that needs substantial parallel computing or computer vision in a small, flexible and low-power platform. For this group, Jetson TK1 offers the size and adaptability of Raspberry Pi or Arduino, with the computational capability of a desktop computer. We’re excited to see what developers create with Jetson TK1!
Tegra K1 is NVIDIA’s latest mobile processor. It features a Kepler GPU with 192 cores, Continue reading →
NVIDIA GPU accelerators have emerged in High-Performance Computing as an energy-efficient way to provide significant compute capability. The Green500 supercomputer list makes this clear: the top 10 supercomputers on the list feature NVIDIA GPUs. Today at the 2014 GPU Technology Conference, NVIDIA announced a new interconnect called NVLink which enables the next step in harnessing the full potential of the accelerator, and the Pascal GPU architecture with stacked memory, slated for 2016.
Pascal will support stacked memory, a technology which enables multiple layers of DRAM components to be integrated vertically on the package along with the GPU. Stacked memory provides several times greater bandwidth, more than twice the capacity, and quadrupled energy efficiency, compared to current off-package GDDR5. Stacked memory lets us combine large, high-bandwidth memory in the same package with the GPU, allowing us to place the place the voltage regulators close to the chip for efficient power delivery. Stacked Memory, combined with a new Pascal module that is one-third the size of current PCIe boards, will enable us to build denser solutions than ever before.
Outpacing PCI Express
Today a typical system has one or more GPUs connected to a CPU using PCI Express. Continue reading →
NVIDIA GPU Boost™ is a feature available on NVIDIA® GeForce® products and NVIDIA® Tesla® products. It makes use of any power headroom to boost application performance. In the case of Tesla, the NVIDIA GPU Boost feature is customized for compute intensive workloads running on clusters. This application note is useful for anyone who wants to take advantage of the power headroom on the Tesla K40 in a server or within a workstation. Note that GPU Boost is a system setting, which means that this Pro Tip applies to any user of a CUDA-accelerated application, not just developers.
The Tesla K40 board targets a specific power budget (235W) when running a highly optimized compute workload, but HPC workloads vary in power consumption and profile, as the graph in Figure 1 shows. This shows that for many applications there is power headroom. NVIDIA GPU Boost for Tesla allows customers to use available power headroom to select higher graphics clocks using NVML or nvidia-smi.
A great post by Saad Rahim on the Acceleware Blog covers everything you need to know to use GPU Boost. In the post, Saad benchmarks two applications with varying clocks on K40: Reverse Time Migration (RTM), a depth migration algorithm used to image complex geologies; and a Finite-difference time-domain (FDTD) electromagnetic solver. Continue reading →
CUDA 6 introduces Unified Memory, which dramatically simplifies memory management for GPU computing. Now you can focus on writing parallel kernels when porting code to the GPU, and memory management becomes an optimization.
The CUDA 6 Release Candidate is now publicly available. In today’s CUDACast, I will show you some simple examples showing how easy it is to accelerate code on the GPU using Unified Memory in CUDA 6, and how powerful Unified Memory is for sharing C++ data structures between host and device code. If you’re interested in looking at the code in detail, you can find it in the Parallel Forall repository on GitHub. You can also check out the great Unified Memory post by Mark Harris.
This week’s Spotlight is on Dr. Ian Lane of Carnegie Mellon University. Ian is an Assistant Research Professor and leads a speech and language processing research group based in Silicon Valley. He co-directs the CUDA Center of Excellence at CMU with Dr. Kayvon Fatahalian.
NVIDIA: Ian, what is Speech Recognition? Ian: Speech Recognition refers to the technology that converts an audio signal into the sequence of words that the user spoke. By analyzing the frequencies within a snippet of audio, we can determine what sounds within spoken language a snippet most closely matches, and by observing sequences of these snippets we can determine what words or phrases the user most likely uttered.
Speech Recognition spans many research fields, including signal processing, computational linguistics, machine learning and core problems in computer science, such as efficient algorithms for large-scale graph traversal. Speech Recognition also is one of the core technologies required to realize natural Human Computer Interaction (HCI). It is becoming a prevalent technology in interactive systems being developed today.
NVIDIA: What are examples of real-world applications? Ian: In recent years, speech-based interfaces have become much more prevalent, including applications such as virtual personal assistants, which include systems such as Siri from Apple or Google Voice Search, as well as speech interfaces for smart TVs and in-vehicle systems. Continue reading →
In finance, an option (or derivative) is the common name for a contract that, under certain conditions, gives a firm the right or obligation to receive or supply certain assets or cash flows. A financial firm uses options to hedge risks when it operates in the markets. It is critical for a firm to be able to accurately price those instruments and understand their dynamics to evaluate its positions, balance its portfolio and limit exposure to potential threats. The calculation of risk and prices for options is a computationally intensive task for which GPUs have a lot to offer. This post describes an efficient implementation of American Option Pricing using Monte Carlo Simulation with a GPU-optimized implementation of the Longstaff Schwarz algorithm.
NVIDIA recently partnered with IBM and STAC to implement the STAC-A2™ benchmark on two NVIDIA Tesla K20X GPUs. It is the first system that was able to calculate the risk and pricing of this particular complex option in less than a second. A system with two Tesla K20X GPUs is up to 6 times faster than a state-of-the-art configuration using only CPUs. Even more interestingly, adding one or two Tesla K20X GPUs to a system offers speedups of slightly more than 5x and 9x, respectively, compared to the same system without GPUs. Continue reading →
When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM). In this post I’ll show you how to leverage these batched routines from CUDA Fortran.
The C interface batched cuBLAS functions use an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type. Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.
Writing Interfaces to Batched cuBLAS Routines
At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:
integer(c_int) function &
type(cublasHandle), value :: h
integer(c_int), value :: n
type(c_devptr), device :: Aarray(*)
integer(c_int), value :: lda
integer(c_int), device :: ipvt(*)
integer(c_int), device :: info(*)
integer(c_int), value :: batchSize
end function cublasSgetrfBatched