Today I’m excited to announce the release of CUDA 6, a new version of the CUDA Toolkit that includes some of the most significant new functionality in the history of CUDA. In this brief post I will share with you the most important new features in CUDA 6 and tell you where to get more information. You may also want to watch the recording of my talk “CUDA 6 and Beyond” from last month’s GPU Technology Conference, embedded below.
Without further ado, if you are ready to download the CUDA Toolkit version 6.0 now, by all means, go get it on CUDA Zone. The five most important new features of CUDA 6 are
support for Unified Memory;
CUDA on Tegra K1 mobile/embedded system-on-a-chip;
This week’s Spotlight is on Dr. Paul Richmond, a Vice Chancellor’s Research Fellow at the University of Sheffield (a CUDA Research Center). Paul’s research interests relate to the simulation of complex systems and to parallel computer hardware.
NVIDIA: Paul, tell us about FLAME GPU. Paul: Agent-Based Simulation is a powerful technique used to assess and predict group behavior from a number of simple interacting rules between communicating autonomous individuals (agents). Individuals typically represent some biological entity such as a molecule, cell or organism and can therefore be used to simulate systems at varying biological scales.
The Flexible Large-scale Agent Modelling Environment for the GPU (FLAME GPU) is a piece of software which enables high-level descriptions of communicating agents to be automatically translated to GPU hardware. With FLAME GPU, simulation performance is enormously increased over traditional agent-based modeling platforms and interactive visualization can easily be achieved. The GPU architecture and the underlying software algorithms are abstracted from users of the FLAME GPU software, ensuring accessibility to users in a wide range of domains and application areas.
NVIDIA: How does FLAME GPU leverage GPU computing? Paul: Unlike other agent-based simulation frameworks, FLAME GPU is designed from the ground up with parallelism in mind. As such it is possible to ensure that agents and behavior are mapped to the GPU efficiently in a way which minimizes data transfer during simulation. Continue reading →
Today, cars are learning to see pedestrians and road hazards; robots are becoming higher functioning; complex medical diagnostic devices are becoming more portable; and unmanned aircraft are learning to navigate autonomously. As a result, the computational requirements for these devices are increasing exponentially, while their size, weight, and power limits continue to decrease. Aimed at these and other embedded parallel computing applications, last week at the 2014 GPU Technology Conference NVIDIA announced an awesome new developer platform called Jetson TK1.
Jetson TK1 is a tiny but full-featured computer designed for development of embedded and mobile applications. Jetson TK1 is exciting because it incorporates Tegra K1, the first mobile processor to feature a CUDA-capable GPU. Jetson TK1 brings the capabilities of Tegra K1 to developers in a compact, low-power platform that makes development as simple as developing on a PC.
Jetson TK1 is aimed at two groups of people. The first are OEMs, including robotics, avionics, and medical device companies, who would like to develop new products that use Tegra K1 SoCs, and need a development platform that makes it easy to write software for these products. Once these companies are ready to move to production, they can work with one of our board partners to design the exact board that they need for their product. The second group is the large number of independent developers, researchers, makers, and hobbyists who would like a platform that will enable them to create amazing technology such as robots, security devices, or anything that needs substantial parallel computing or computer vision in a small, flexible and low-power platform. For this group, Jetson TK1 offers the size and adaptability of Raspberry Pi or Arduino, with the computational capability of a desktop computer. We’re excited to see what developers create with Jetson TK1!
Tegra K1 is NVIDIA’s latest mobile processor. It features a Kepler GPU with 192 cores, Continue reading →
This week’s Spotlight is on Dr. Ian Lane of Carnegie Mellon University. Ian is an Assistant Research Professor and leads a speech and language processing research group based in Silicon Valley. He co-directs the CUDA Center of Excellence at CMU with Dr. Kayvon Fatahalian.
NVIDIA: Ian, what is Speech Recognition? Ian: Speech Recognition refers to the technology that converts an audio signal into the sequence of words that the user spoke. By analyzing the frequencies within a snippet of audio, we can determine what sounds within spoken language a snippet most closely matches, and by observing sequences of these snippets we can determine what words or phrases the user most likely uttered.
Speech Recognition spans many research fields, including signal processing, computational linguistics, machine learning and core problems in computer science, such as efficient algorithms for large-scale graph traversal. Speech Recognition also is one of the core technologies required to realize natural Human Computer Interaction (HCI). It is becoming a prevalent technology in interactive systems being developed today.
NVIDIA: What are examples of real-world applications? Ian: In recent years, speech-based interfaces have become much more prevalent, including applications such as virtual personal assistants, which include systems such as Siri from Apple or Google Voice Search, as well as speech interfaces for smart TVs and in-vehicle systems. Continue reading →
In finance, an option (or derivative) is the common name for a contract that, under certain conditions, gives a firm the right or obligation to receive or supply certain assets or cash flows. A financial firm uses options to hedge risks when it operates in the markets. It is critical for a firm to be able to accurately price those instruments and understand their dynamics to evaluate its positions, balance its portfolio and limit exposure to potential threats. The calculation of risk and prices for options is a computationally intensive task for which GPUs have a lot to offer. This post describes an efficient implementation of American Option Pricing using Monte Carlo Simulation with a GPU-optimized implementation of the Longstaff-Schwartz algorithm.
NVIDIA recently partnered with IBM and STAC to implement the STAC-A2™ benchmark on two NVIDIA Tesla K20X GPUs. It is the first system that was able to calculate the risk and pricing of this particular complex option in less than a second. A system with two Tesla K20X GPUs is up to 6 times faster than a state-of-the-art configuration using only CPUs. Even more interestingly, adding one or two Tesla K20X GPUs to a system offers speedups of slightly more than 5x and 9x, respectively, compared to the same system without GPUs. Continue reading →
The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such as notebooks and small form factor computers. What is exciting about this announcement for HPC and other GPU computing developers is the great leap in energy efficiency that Maxwell provides: nearly twice that of the Kepler GPU architecture. That makes Maxwell a great architecture for future NVIDIA Tesla products, so stay tuned.
This post will tell you five things that you need to know about Maxwell as a GPU computing programmer, including high-level benefits of the architecture, specifics of the new Maxwell multiprocessor, guidance on tuning and pointers to more resources.
1. The Heart of Maxwell: More Efficient Multiprocessors
Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves power efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA’s GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called “SMM”) to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled us to increase the number of SMs to five in GM107, compared to two in GK107, with only a 25% increase in die area.
Improved Instruction Scheduling
The number of CUDA Cores per SM has been reduced to a power of two. However, with Maxwell’s improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of the SM means CUDA cores per GPU will be substantially higher versus comparable Fermi or Kepler chips. The Maxwell SM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design. Continue reading →
Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but this six-year-old document bears updating. In this post I will show you some features of the Kepler GPU architecture which make reductions even faster: the shuffle (SHFL) instruction and fast device memory atomic operations.
Efficient parallel reductions exchange data between threads within the same thread block. On earlier hardware this meant using shared memory, which involves writing data to shared memory, synchronizing, and then reading the data back from shared memory. Kepler’s shuffle instruction (SHFL) enables a thread to directly read a register from another thread in the same warp (32 threads). This allows threads in a warp to collectively exchange or broadcast data. As described in the post “Do the Kepler Shuffle”, there are four shuffle intrinsics: __shfl(), __shfl_down(), __shfl_up(), and __shfl_xor(). This post uses only __shfl_down(); you can find a complete description of the other shuffle functions in the CUDA C Programming Guide. __shfl_down() is defined as follows:
int __shfl_down(int var, unsigned int delta, int width=warpSize);
__shfl_down() calculates a source lane ID by adding delta to the caller’s lane ID (the lane ID is a thread’s index within its warp, from 0 to 31). The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If the source lane ID is out of range or the source thread has exited, the calling thread’s own var is returned. The ID number of the source lane will not wrap around the value of width and so the upper delta lanes will remain unchanged. Note that width must be one of (2, 4, 8, 16, 32). For brevity, the diagrams that follow show only 8 threads in a warp even though the warp size of all current CUDA GPUs is 32.
As an example, Figure 1 shows the effect of the following two lines of code, where we can see that values are shifted down by 2 threads.
int i = threadIdx.x % 32;
int j = __shfl_down(i, 2, 8);
There are three main advantages to using shuffle instead of shared memory: Continue reading →
Continuing the Thrust mini-series (see Part 1), today’s episode of CUDACasts focuses on a few of the algorithms that make Thrust a flexible and powerful parallel programming library. You’ll also learn how to use functors, or C++ “function objects”, to customize how Thrust algorithms process data.
In the next CUDACast in this Thrust mini-series, we’ll take a look at how fancy iterators increase the flexibility Thrust has for expressing parallel algorithms in C++.
When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a warp can read each others’ registers by using a new instruction called SHFL, or “shuffle”.
In upcoming posts here on Parallel Forall we will demonstrate uses of shuffle. To prepare, I highly recommend watching the following recording of a GTC 2013 talk by Julien Demouth entitled “Kepler’s SHUFFLE (SHFL): Tips and Tricks”. In the talk, Julien covers many uses for shuffle, including reductions, scans, transpose, and sorting, demonstrating that shuffle is always faster than safe uses of shared memory, and never slower than unsafe uses of shared memory.
This week’s Spotlight is on Professor Todd Martínez of Stanford.
Professor Martínez’ research lies in the area of theoretical chemistry, emphasizing the development and application of new methods which accurately and efficiently capture quantum mechanical effects.
Professor Martínez pioneered the use of GPU technology for computational chemistry, culminating in the TeraChem software package that uses GPUs for first principles molecular dynamics. He is a founder of PetaChem, the company that distributes this software.
The following is an excerpt from our interview (you can read the complete Spotlight here).
NVIDIA: Todd, tell us about TeraChem. Todd: TeraChem simulates the dynamics and motion of molecules, solving the electronic Schrödinger equation to determine the forces between atoms. This is often called first principles molecular dynamics or ab initio molecular dynamics.
The primary advantage over empirical force fields (for example, often used for protein structure) is that chemical bond rearrangements and electron transfer can be described seamlessly. Continue reading →