Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter
With the US Department of Energy’s announcement of plans to base two future flagship supercomputers on IBM POWER CPUs, NVIDIA GPUs, NVIDIA NVLink interconnect, and Mellanox high-speed networking, many developers are getting started building GPU-accelerated applications that run on IBM POWER processors. The good news is that porting existing applications to this platform is easy. In fact, smooth sailing is already being reported by software development leaders such as Erik Lindahl, Professor of Biophysics at the Science for Life Laboratory, Stockholm University & KTH, developer of the GROMACS molecular dynamics package:
The combination of POWER8 CPUs & NVIDIA Tesla accelerators is amazing. It is the highest performance we have ever seen in individual cores, and the close integration with accelerators is outstanding for heterogeneous parallelization. Thanks to the little endian chip and standard CUDA environment it took us less than 24 hours to port and accelerate GROMACS.
The NVIDIA CUDA Toolkit version 5.5 is now available with POWER support, and all future CUDA Toolkits will support POWER, starting with CUDA 7 in 2015. The Tesla Accelerated Computing Platform enables multiple approaches to programming accelerated applications: libraries (cuBLAS, cuFFT, Thrust, AmgX, cuDNN and many more), and depending on platform, compiler directives (OpenACC), and programming languages (CUDA C++, CUDA Fortran, Python). Developers have a choice of approaches for programming GPU-accelerated systems, and system builders have a choice of technologies for deployment: Tesla GPUs can now be paired with POWER, x86, or ARM CPUs.
Accelerated systems have become the new standard for high performance computing (HPC) as GPUs continue to raise the bar for both performance and energy efficiency. In 2012, Oak Ridge National Laboratory announced what was to become the world’s fastest supercomputer, Titan, equipped with one NVIDIA® GPU per CPU – over 18 thousand GPU accelerators. Titan established records not only in absolute system performance but also in energy efficiency, with 90% of its peak performance being delivered by the GPU accelerators. This week, the U.S. Department of Energy (DoE) announced the award to IBM and NVIDIA to build two new flagship supercomputers, the Summit system at Oak Ridge National Laboratory and the Sierra system at Lawrence Livermore National Laboratory.
You may already know NVIDIA Tesla as a line of GPU accelerator boards optimized for high-performance, general-purpose computing. They are used for parallel scientific, engineering, and technical computing, and they are designed for deployment in supercomputers, clusters, and workstations. But it’s not just the GPU boards that make Tesla a great computing solution. The combination of the world’s fastest GPU accelerators, the widely used CUDA parallel computing model, and a comprehensive ecosystem of software developers, software vendors, and data center system OEMs make Tesla the leading platform for accelerating data analytics and scientific computing.
The Tesla Accelerated Computing Platform provides advanced system management features and accelerated communication technology, and it is supported by popular infrastructure management software. These enable HPC professionals to easily deploy and manage Tesla accelerators in the data center. Tesla-accelerated applications are powered by CUDA, NVIDIA’s pervasive parallel computing platform and programming model, which provides application developers with a comprehensive suite of tools for productive, high-performance software development.
This post gives an overview of the broad range of technologies, tools, and components of the Tesla Accelerated Computing Platform that are available to application developers. Here’s what you need to know about the Tesla Platform. Continue reading →
Today NVIDIA introduced the new GM204 GPU, based on the Maxwell architecture. GM204 is the first GPU based on second-generation Maxwell, the full realization of the Maxwell architecture. The GeForce GTX 980 and 970 GPUs introduced today are the most advanced gaming and graphics GPUs ever made. But of course they also make fantastic CUDA development GPUs, with full support for CUDA 6.5 and all of the latest features of the CUDA platform, including Unified Memory and Dynamic Parallelism.
GM204′s 16 SMs make it over 3 times faster than the first-generation GM107 GPU that I introduced earlier this year on Parallel Forall, and additional architectural improvements help GM204 pack an even bigger punch.
SMM: The Maxwell Multiprocessor
As I discussed in my earlier Maxwell post, the heart of Maxwell’s power-efficient performance is it’s Streaming Multiprocessor, known as SMM. Maxwell’s new datapath organization and improved instruction scheduler provide more than 40% higher delivered performance per CUDA core, and overall twice the efficiency of Kepler GK104. The new SMM, shown in Figure 1, includes all of the architectural benefits of its first-generation Maxwell predecessor, including improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and more.
SMM uses a quadrant-based design with four 32-core processing blocks each with a dedicated warp scheduler capable of dispatching two instructions per clock. Each SMM provides eight texture units, one polymorph engine (geometry processing for graphics), and dedicated register file and shared memory.
Today we’re excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more.
CUDA 6.5 takes the next step, enabling CUDA on 64-bit ARM platforms. The heritage of ARM64 is in low-power, scale-out data centers and microservers, while GPUs are built for ultra-fast compute performance. When we combine the two, we have a compelling solution for HPC. ARM64 provides power efficiency, system configurability, and a large, open ecosystem. GPUs bring to the table high-throughput, power-efficient compute performance, a large HPC ecosystem, and hundreds of CUDA-accelerated applications. For HPC applications, ARM64 CPUs can offload the heavy lifting of computational tasks to GPUs. CUDA and GPUs make ARM64 competitive in HPC from day one.
Development platforms available now for CUDA on ARM64 include the Cirrascale RM1905D HPC Development Platform and the E4 ARKA EK003. Eurotech has announced a system available later this year. These platforms are built on Applied Micro X-Gene 8-core 2.4GHz ARM64 CPUs, Tesla K20 GPU Accelerators, and CUDA 6.5. As Figure 1 shows, performance of CUDA-accelerated applications on ARM64+GPU systems is competitive with x86+GPU systems.
Unified Memory is a CUDA feature that we’ve talked a lot about on Parallel Forall. CUDA 6 introduced Unified Memory, which dramatically simplifies GPU programming by giving programmers a single pointer to data which is accessible from either the GPU or the CPU. But this enhanced memory model has only been available to CUDA C/C++ programmers, until now. The new PGI Compiler release 14.7 enables Unified Memory in CUDA Fortran.
In a PGInsider article called CUDA Fortran Managed Memory, PGI Applications and Services Manager Brent Leback writes “using managed memory simplifies many coding tasks, makes source code cleaner, and enables a unified view of complicated data structures across host and device memories.” PGI 14.7 adds the managed keyword to the language, which you can use in host code similarly to the device keyword. Here’s part of an example Brent included in his article, showing the allocation of managed arrays. Continue reading →
CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, its important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency hiding ability of a kernel.
Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementating the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each supported GPU architecture.
CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor produces an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.
A Givens rotation  represents a rotation in a plane represented by a matrix of the form
where the intersections of the th and th columns contain the values and . Multiplying a vector by a Givens rotation matrix represents a rotation of the vector in the plane by radians.
According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in vectors or matrices. Importantly, that means Givens rotations can be used to compute the QR decomposition of a matrix. An important advantage over Householder transformations is that Givens rotations are easy to parallelize. Continue reading →
Today I’m excited to announce the release of CUDA 6, a new version of the CUDA Toolkit that includes some of the most significant new functionality in the history of CUDA. In this brief post I will share with you the most important new features in CUDA 6 and tell you where to get more information. You may also want to watch the recording of my talk “CUDA 6 and Beyond” from last month’s GPU Technology Conference, embedded below.
Without further ado, if you are ready to download the CUDA Toolkit version 6.0 now, by all means, go get it on CUDA Zone. The five most important new features of CUDA 6 are
support for Unified Memory;
CUDA on Tegra K1 mobile/embedded system-on-a-chip;
Today, cars are learning to see pedestrians and road hazards; robots are becoming higher functioning; complex medical diagnostic devices are becoming more portable; and unmanned aircraft are learning to navigate autonomously. As a result, the computational requirements for these devices are increasing exponentially, while their size, weight, and power limits continue to decrease. Aimed at these and other embedded parallel computing applications, last week at the 2014 GPU Technology Conference NVIDIA announced an awesome new developer platform called Jetson TK1.
Jetson TK1 is a tiny but full-featured computer designed for development of embedded and mobile applications. Jetson TK1 is exciting because it incorporates Tegra K1, the first mobile processor to feature a CUDA-capable GPU. Jetson TK1 brings the capabilities of Tegra K1 to developers in a compact, low-power platform that makes development as simple as developing on a PC.
Jetson TK1 is aimed at two groups of people. The first are OEMs, including robotics, avionics, and medical device companies, who would like to develop new products that use Tegra K1 SoCs, and need a development platform that makes it easy to write software for these products. Once these companies are ready to move to production, they can work with one of our board partners to design the exact board that they need for their product. The second group is the large number of independent developers, researchers, makers, and hobbyists who would like a platform that will enable them to create amazing technology such as robots, security devices, or anything that needs substantial parallel computing or computer vision in a small, flexible and low-power platform. For this group, Jetson TK1 offers the size and adaptability of Raspberry Pi or Arduino, with the computational capability of a desktop computer. We’re excited to see what developers create with Jetson TK1!
Tegra K1 is NVIDIA’s latest mobile processor. It features a Kepler GPU with 192 cores, Continue reading →