Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter
CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, its important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency hiding ability of a kernel.
Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementating the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each supported GPU architecture.
CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor produces an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage. Continue reading →
A Givens rotation  represents a rotation in a plane represented by a matrix of the form
where the intersections of the th and th columns contain the values and . Multiplying a vector by a Givens rotation matrix represents a rotation of the vector in the plane by radians.
According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in vectors or matrices. Importantly, that means Givens rotations can be used to compute the QR decomposition of a matrix. An important advantage over Householder transformations is that Givens rotations are easy to parallelize. Continue reading →
Today I’m excited to announce the release of CUDA 6, a new version of the CUDA Toolkit that includes some of the most significant new functionality in the history of CUDA. In this brief post I will share with you the most important new features in CUDA 6 and tell you where to get more information. You may also want to watch the recording of my talk “CUDA 6 and Beyond” from last month’s GPU Technology Conference, embedded below.
Without further ado, if you are ready to download the CUDA Toolkit version 6.0 now, by all means, go get it on CUDA Zone. The five most important new features of CUDA 6 are
support for Unified Memory;
CUDA on Tegra K1 mobile/embedded system-on-a-chip;
Today, cars are learning to see pedestrians and road hazards; robots are becoming higher functioning; complex medical diagnostic devices are becoming more portable; and unmanned aircraft are learning to navigate autonomously. As a result, the computational requirements for these devices are increasing exponentially, while their size, weight, and power limits continue to decrease. Aimed at these and other embedded parallel computing applications, last week at the 2014 GPU Technology Conference NVIDIA announced an awesome new developer platform called Jetson TK1.
Jetson TK1 is a tiny but full-featured computer designed for development of embedded and mobile applications. Jetson TK1 is exciting because it incorporates Tegra K1, the first mobile processor to feature a CUDA-capable GPU. Jetson TK1 brings the capabilities of Tegra K1 to developers in a compact, low-power platform that makes development as simple as developing on a PC.
Jetson TK1 is aimed at two groups of people. The first are OEMs, including robotics, avionics, and medical device companies, who would like to develop new products that use Tegra K1 SoCs, and need a development platform that makes it easy to write software for these products. Once these companies are ready to move to production, they can work with one of our board partners to design the exact board that they need for their product. The second group is the large number of independent developers, researchers, makers, and hobbyists who would like a platform that will enable them to create amazing technology such as robots, security devices, or anything that needs substantial parallel computing or computer vision in a small, flexible and low-power platform. For this group, Jetson TK1 offers the size and adaptability of Raspberry Pi or Arduino, with the computational capability of a desktop computer. We’re excited to see what developers create with Jetson TK1!
Tegra K1 is NVIDIA’s latest mobile processor. It features a Kepler GPU with 192 cores, Continue reading →
NVIDIA GPU Boost™ is a feature available on NVIDIA® GeForce® products and NVIDIA® Tesla® products. It makes use of any power headroom to boost application performance. In the case of Tesla, the NVIDIA GPU Boost feature is customized for compute intensive workloads running on clusters. This application note is useful for anyone who wants to take advantage of the power headroom on the Tesla K40 in a server or within a workstation. Note that GPU Boost is a system setting, which means that this Pro Tip applies to any user of a CUDA-accelerated application, not just developers.
The Tesla K40 board targets a specific power budget (235W) when running a highly optimized compute workload, but HPC workloads vary in power consumption and profile, as the graph in Figure 1 shows. This shows that for many applications there is power headroom. NVIDIA GPU Boost for Tesla allows customers to use available power headroom to select higher graphics clocks using NVML or nvidia-smi.
A great post by Saad Rahim on the Acceleware Blog covers everything you need to know to use GPU Boost. In the post, Saad benchmarks two applications with varying clocks on K40: Reverse Time Migration (RTM), a depth migration algorithm used to image complex geologies; and a Finite-difference time-domain (FDTD) electromagnetic solver. Continue reading →
The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such as notebooks and small form factor computers. What is exciting about this announcement for HPC and other GPU computing developers is the great leap in energy efficiency that Maxwell provides: nearly twice that of the Kepler GPU architecture.
This post will tell you five things that you need to know about Maxwell as a GPU computing programmer, including high-level benefits of the architecture, specifics of the new Maxwell multiprocessor, guidance on tuning and pointers to more resources.
1. The Heart of Maxwell: More Efficient Multiprocessors
Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves power efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA’s GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called “SMM”) to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled us to increase the number of SMs to five in GM107, compared to two in GK107, with only a 25% increase in die area.
Improved Instruction Scheduling
The number of CUDA Cores per SM has been reduced to a power of two, however with Maxwell’s improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of the SM means CUDA cores per GPU will be substantially higher versus comparable Fermi or Kepler chips. The Maxwell SM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design. Continue reading →
When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a warp can read each others’ registers by using a new instruction called SHFL, or “shuffle”.
In upcoming posts here on Parallel Forall we will demonstrate uses of shuffle. To prepare, I highly recommend watching the following recording of a GTC 2013 talk by Julien Demouth entitled “Kepler’s SHUFFLE (SHFL): Tips and Tricks”. In the talk, Julien covers many uses for shuffle, including reductions, scans, transpose, and sorting, demonstrating that shuffle is always faster than safe uses of shared memory, and never slower than unsafe uses of shared memory.
As a CUDA developer, you will often need to control which devices your application uses. In a short-but-sweet post on the Acceleware blog, Chris Mason writes:
Does your CUDA application need to target a specific GPU? If you are writing GPU enabled code, you would typically use a device query to select the desired GPUs. However, a quick and easy solution for testing is to use the environment variable CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application sees. This can be useful if you are attempting to share resources on a node or you want your GPU enabled executable to target a specific GPU
As Chris points out, robust applications should use the CUDA API to enumerate and select devices with appropriate capabilities at run time. To learn how, read the section on Device Enumeration in the CUDA Programming Guide. But the CUDA_VISIBLE_DEVICES environment variable is handy for restricting execution to a specific device or set of devices for debugging and testing. You can also use it to control execution of applications for which you don’t have source code, or to launch multiple instances of a program on a single machine, each with its own environment and set of visible devices. Continue reading →
It’s that time of year again! Here at NVIDIA we’re hard at work getting ready for the 2014 GPU Technology Conference, the world’s most important GPU developer conference. Taking place in the heart of Silicon Valley, GTC offers unmatched opportunities to learn how to harness the latest GPU technology including 500 sessions, hands-on labs and tutorials, technology demos, and face-to-face interaction with industry luminaries and NVIDIA technologists.
Come to the epicenter of computing technology March 24-27, and see how your peers are using GPUs to accelerate impactful results in various disciplines of scientific and computational research. Register for GTC now, because the Early Bird discount for GTC registration ends in one week on Wednesday, January 29th. The Early Bird discount is 25% on a full-conference registration, and to sweeten the deal I can offer Parallel Forall readers an extra 20% off using the code GM20PFB. That gets you four days of complete access to GTC for just $720, or $360 for academic and government employees. Don’t miss it, register now!
Here are a few talks to give you an idea of the breadth and quality of talks you will see at GTC: Continue reading →
Alex St. John has a new post on his blog “The Saint” about his first experience porting C++ classes to run on the GPU with CUDA 6 and Unified Memory.
The introduction of Unified Memory in CUDA, for the first time makes it practical to move huge bodies of general C++ code entirely up to the GPU and to write and run entire complex code systems entirely on the GPU with minimal CPU governance. In theory a big leap, but not without some new challenges.
Alex extends the example I provided in my post Unified Memory in CUDA 6 to make it portable between the CPU, with a switch to select managed memory or host memory allocation. He also touches on an approach to making the member functions of the class portable (using __host__ __device__—see my post about Hemi for more ideas on this topic).
Overall it looks like Alex had a very positive experience with Unified Memory: “Using this approach I ported several thousand lines of C++ code and half a dozen objects to CUDA 6.0 in a couple days.” I expect many programmers to have similar good experiences in the future.