
GPU Pro Tip: CUDA 7 Streams Simplify Concurrency

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA applications manage concurrency by executing asynchronous commands in streams: sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with respect to each other. (See the post How to Overlap Data Transfers in CUDA C/C++ for an example.)

When you execute asynchronous CUDA commands without specifying a stream, the runtime uses the default stream. Before CUDA 7, the default stream is a special stream called the NULL stream, which implicitly synchronizes with all other streams on the device.

CUDA 7 introduces a ton of powerful new functionality, including a new option to use an independent default stream for every host thread, which avoids the implicit serialization imposed by the NULL stream. In this post I’ll show you how this can simplify achieving concurrency between kernels and data copies in CUDA programs.
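To see what this looks like in practice, here is a minimal sketch (not taken from the full post; the kernel name and sizes are illustrative) in which several host threads each launch a kernel into their own per-thread default stream. Build it with nvcc --default-stream per-thread --std=c++11 so that launches without an explicit stream argument no longer serialize on the NULL stream.

```
#include <thread>
#include <vector>

__global__ void spin(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            x[i] = x[i] * 0.5f + 1.0f;
}

int main()
{
    const int n = 1 << 20, nthreads = 4;
    std::vector<float*> bufs(nthreads);
    for (auto &p : bufs) cudaMalloc(&p, n * sizeof(float));

    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&, t] {
            // No stream argument: with --default-stream per-thread this launch
            // goes into this host thread's own default stream, so the four
            // kernels can overlap instead of serializing on the NULL stream.
            spin<<<(n + 255) / 256, 256>>>(bufs[t], n);
        });
    for (auto &t : threads) t.join();

    cudaDeviceSynchronize();   // wait for all per-thread default streams
    for (auto &p : bufs) cudaFree(p);
    return 0;
}
```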


CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More

It’s almost time for the next major release of the CUDA Toolkit, so I’m excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA Registered Developers. The CUDA Toolkit version 7 expands the capabilities and improves the performance of the Tesla Accelerated Computing Platform and of accelerated computing on NVIDIA GPUs.

Recently, NVIDIA released the CUDA Toolkit version 5.5 with support for the IBM POWER architecture, and starting with CUDA 7, all future CUDA Toolkit releases will support POWER CPUs.

CUDA 7 is a huge update to the CUDA platform; there are too many new features and improvements to describe in one blog post, so I’ll touch on some of the most significant ones today. Please refer to the CUDA 7 release notes and documentation for more information. We’ll be covering many of these features in greater detail in future Parallel Forall posts, so check back often!

Support for Powerful C++11 Features

C++11 is a major update to the popular C++ language standard. C++11 includes a long list of new features for simpler, more expressive C++ programming with fewer errors and higher performance. I think Bjarne Stroustrup, the creator of C++, put it best:

C++11 feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.
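For a taste of what this enables on the device side, here is a small sketch (mine, not from the release notes; the kernel and variable names are just illustrative) that uses C++11 auto and a lambda defined inside a kernel. Compile it with nvcc --std=c++11.

```
#include <cstdio>

__global__ void scale_and_shift(float *x, int n)
{
    // C++11 in device code: 'auto' type deduction and a lambda defined
    // right inside the kernel, both accepted by nvcc --std=c++11 in CUDA 7.
    auto saxpy = [] (float a, float xi, float yi) { return a * xi + yi; };
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        auto y = saxpy(2.0f, x[i], 1.0f);
        x[i] = y;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));
    scale_and_shift<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    printf("done\n");
    return 0;
}
```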


How We Achieved Record Finance Benchmark Performance on Tesla K80

STAC Research develops financial benchmarks in partnership with leading banks and software or hardware vendors. The STAC-A2 suite of benchmarks aims to represent the standard risk analysis workload that banks and insurance companies use to measure exposure on the financial markets. Earlier this year we published a Parallel Forall post on Monte Carlo simulation for the pricing of American options in STAC-A2.

Record Performance with Tesla K80

Recently, STAC Research published astonishing performance results for the STAC-A2 benchmarks on an NVIDIA Tesla K80. In short, a single Tesla K80 driven by two CPU cores outperforms all previously audited systems in terms of pure performance and power efficiency.

For more on these results, read “Bank on It: Tesla Platform Shatters Record on Risk-Management Benchmark” on the NVIDIA Blog.

We obtained these new results after several optimizations of our previously audited code. First, a large fraction of the computations are now avoided due to a better factorization of the underlying mathematical process. Second, we tuned some of the kernel parameters to take advantage of the larger register file of the Tesla K80. Finally, we significantly reduced the latency in one of the main loops of the benchmark. Let’s take a look at these optimizations.
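As a purely hypothetical illustration of the second point (this is not the audited STAC-A2 code, and all names are made up), __launch_bounds__ is the usual knob for trading threads per block against registers per thread. GK210, the GPU in the Tesla K80, doubles the per-SM register file relative to GK110, so a kernel can be given a larger register budget at the same occupancy.

```
// Hypothetical kernel, for illustration only: the first argument caps the
// block size, the second requests a minimum number of resident blocks per SM,
// and together they set the register budget the compiler may use per thread.
// On GK210 (Tesla K80) the larger register file lets these bounds be relaxed
// relative to a GK110 (Tesla K40) build without spilling to local memory.
__global__ void __launch_bounds__(128, 8)
path_update(float *paths, const float *drift, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        paths[i] += drift[i];
}
```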


ArrayFire: A Portable Open-Source Accelerated Computing Library

The ArrayFire library is a high-performance software library with a focus on portability and productivity. It supports highly tuned, GPU-accelerated algorithms using an easy-to-use API. ArrayFire wraps GPU memory into a simple “array” object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.

ArrayFire Capabilities

ArrayFire is an open-source C/C++ library with language bindings for R, Java, and Fortran, and it provides a broad range of functionality.

ArrayFire has three back ends to enable portability across many platforms: CUDA, OpenCL and CPU. It even works on embedded platforms like NVIDIA’s Jetson TK1.

In a past post about ArrayFire we demonstrated its capabilities and how it can increase your productivity. In this post I will show you how to use ArrayFire to exploit various kinds of parallelism on NVIDIA GPUs.
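To give a flavor of the API described above, here is a minimal sketch (the array sizes and names are illustrative, not from the original post); the same source runs on the CUDA, OpenCL, or CPU back end depending on which ArrayFire library you link against.

```
#include <arrayfire.h>
#include <cstdio>

int main()
{
    af::info();                                  // report the active back end and device
    af::array A = af::randu(1024, 1024);         // random single-precision matrix on the device
    af::array B = af::matmul(A, A.T());          // dense matrix multiply, no kernel code needed
    float total = af::sum<float>(B);             // reduction brought back to the host
    af_print(B(af::seq(4), af::seq(4)));         // print a 4x4 corner of the result
    printf("sum = %f\n", total);
    return 0;
}
```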


Porting GPU-Accelerated Applications to POWER8 Systems

With the US Department of Energy’s announcement of plans to base two future flagship supercomputers on IBM POWER CPUs, NVIDIA GPUs, NVIDIA NVLink interconnect, and Mellanox high-speed networking, many developers are getting started building GPU-accelerated applications that run on IBM POWER processors. The good news is that porting existing applications to this platform is easy. In fact, smooth sailing is already being reported by software development leaders such as Erik Lindahl, Professor of Biophysics at the Science for Life Laboratory, Stockholm University & KTH, developer of the GROMACS molecular dynamics package:

The combination of POWER8 CPUs & NVIDIA Tesla accelerators is amazing. It is the highest performance we have ever seen in individual cores, and the close integration with accelerators is outstanding for heterogeneous parallelization. Thanks to the little endian chip and standard CUDA environment it took us less than 24 hours to port and accelerate GROMACS.

The NVIDIA CUDA Toolkit version 5.5 is now available with POWER support, and all future CUDA Toolkits will support POWER, starting with CUDA 7 in 2015. The Tesla Accelerated Computing Platform enables multiple approaches to programming accelerated applications: libraries (cuBLAS, cuFFT, Thrust, AmgX, cuDNN and many more), and depending on platform, compiler directives (OpenACC), and programming languages (CUDA C++, CUDA Fortran, Python). Developers have a choice of approaches for programming GPU-accelerated systems, and system builders have a choice of technologies for deployment: Tesla GPUs can now be paired with POWER, x86, or ARM CPUs.
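As a small, hedged example of the libraries path (mine, not from the white paper): this Thrust-based SAXPY contains no architecture-specific code, so the same source builds with nvcc whether the host CPU is x86, POWER, or ARM.

```
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <cstdio>

struct saxpy_op
{
    float a;
    saxpy_op(float a) : a(a) {}
    __host__ __device__ float operator()(float x, float y) const { return a * x + y; }
};

int main()
{
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f), y(n, 2.0f);

    // y = 2*x + y, computed on the GPU; the host architecture never appears.
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_op(2.0f));

    printf("y[0] = %f\n", static_cast<float>(y[0]));
    return 0;
}
```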




Interactive Supercomputing with In-Situ Visualization on Tesla GPUs

So, you just got access to the latest supercomputer with thousands of GPUs. Obviously this is going to help you a lot with accelerating your scientific calculations, but how are you going to analyze, reduce and visualize this data?  Historically, you would be forced to write everything out to disk, just to later read it back into another data analysis cluster.

Figure 1: NVIDIA OptiX ray tracing helped scientists at the University of Illinois, Urbana-Champaign to visualize and analyze the world’s first complete atomic-level model of the chemical structure of the HIV capsid with 4.2 million atoms.

Wouldn’t it be nice if you could analyze and visualize your data as it is being generated, without having to go through a file system? And wouldn’t it be cool to interact with the simulation, maybe even modifying parameters while the simulation is running?

And wouldn’t it be nice to use your GPU for that as well? As it turns out, you can. It’s called in-situ visualization: visualizing datasets in place, where they are computed. High-quality, high-performance rendering and visualization is just one of the capabilities of the Tesla Accelerated Computing Platform. Depending on the site where you’re running, it just takes a couple of steps to get your system configured correctly, and in this post I’ll tell you how.

But before walking you through the steps necessary to get your system set up for remote, in-situ visualization, I’ll describe a few use cases for in-situ visualization and show you some of the tools that can help you add visualization capability to your application.


Increase Performance with GPU Boost and K80 Autoboost

NVIDIA® GPU Boost™ is a feature available on NVIDIA® GeForce® and Tesla® GPUs that boosts application performance by increasing GPU core and memory clock rates when sufficient power and thermal headroom are available (See the earlier Parallel Forall post about GPU Boost by Mark Harris).  In the case of Tesla GPUs, GPU Boost is customized for compute-intensive workloads running on clusters. In this post I describe GPU Boost in more detail and show you how you can take advantage of it in your applications. I also introduce Tesla K80 autoboost and demonstrate that it can automatically match the performance of explicitly controlled application clocks.

Tesla GPUs target a specific power budget; for example, the Tesla K40 has a TDP (Thermal Design Power) of 235W and the Tesla K80 has a TDP of 300W. These TDP ratings are upper limits, and the graph in Figure 1 shows that many HPC workloads do not come close to this power limit. NVIDIA GPU Boost for Tesla allows users to increase application performance by using the available power headroom to select higher graphics clock rates.

Figure 1: Average GPU Power Consumption for Real Applications

NVIDIA GPU Boost is exposed for Tesla accelerators via application clock settings, and on the new Tesla K80 accelerator it can also be engaged via the new autoboost feature, which is enabled by default. A user or system administrator can disable autoboost and manually set the right clocks for an application by either:

  1. running the command line tool nvidia-smi locally on the node, or
  2. programmatically using the NVIDIA Management Library (NVML), as sketched below.
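Here is a minimal, hedged sketch of the NVML route (the clock values are the K80’s published application-clock maximums and are only illustrative; query the supported pairs for your board first, e.g. with nvidia-smi -q -d SUPPORTED_CLOCKS, and note that changing clocks typically requires administrative privileges). Link against -lnvidia-ml.

```
#include <nvml.h>
#include <cstdio>

int main()
{
    nvmlReturn_t r = nvmlInit();
    if (r != NVML_SUCCESS) {
        printf("NVML init failed: %s\n", nvmlErrorString(r));
        return 1;
    }

    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);   // first GPU on this node

    // On Tesla K80, turn autoboost off if you want full manual control
    // (it is enabled by default on K80).
    nvmlDeviceSetAutoBoostedClocksEnabled(dev, NVML_FEATURE_DISABLED);

    // Pin application clocks: memory clock then graphics clock, in MHz.
    // 2505/875 are illustrative values for a K80; use a supported pair.
    nvmlDeviceSetApplicationsClocks(dev, 2505, 875);

    nvmlShutdown();
    return 0;
}
```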



How NVLink Will Enable Faster, Easier Multi-GPU Computing

Accelerated systems have become the new standard for high performance computing (HPC) as GPUs continue to raise the bar for both performance and energy efficiency.  In 2012, Oak Ridge National Laboratory announced what was to become the world’s fastest supercomputer, Titan, equipped with one NVIDIA® GPU per CPU – over 18 thousand GPU accelerators.  Titan established records not only in absolute system performance but also in energy efficiency, with 90% of its peak performance being delivered by the GPU accelerators. This week, the U.S. Department of Energy (DoE) announced the award to IBM and NVIDIA to build two new flagship supercomputers, the Summit system at Oak Ridge National Laboratory and the Sierra system at Lawrence Livermore National Laboratory.

A new NVIDIA white paper explores key features of these new supercomputers and the technologies enabled by the Tesla® accelerated computing platform that will drive the U.S. DoE’s push toward exascale. Here’s a description of Summit and Sierra from the white paper.


Embedded Machine Learning with the cuDNN Deep Neural Network Library and Jetson TK1

GPUs have quickly become the go-to platform for accelerating machine learning applications for training and classification. Deep Neural Networks (DNNs) have grown in importance for many applications, from image classification and natural language processing to robotics and UAVs. To help researchers focus on solving core problems, NVIDIA introduced a library of primitives for deep neural networks called cuDNN. The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, but, until now, only for workstation- and server-based machine learning applications.

In the meantime, the Jetson TK1 development kit has become a must-have for mobile and embedded parallel computing due to the amazing level of performance packed into such a low-power board. Demand for embedded machine learning has been incredible, so to address it we’ve released cuDNN for ARM (Linux for Tegra—L4T).

The combination of these two powerful tools enables industry-standard machine learning frameworks, such as Berkeley’s Caffe or NYU’s Torch7, to run on a mobile device with excellent performance. Numerous machine learning applications will benefit from this platform, enabling advances in robotics, autonomous vehicles and embedded computer vision.


Learn GPU Programming in Your Browser with NVIDIA Hands-On Labs

As CUDA Educator at NVIDIA, I work to make massively parallel programming education and training accessible to everyone, whether or not they have access to GPUs in their own machines. This is why, in partnership with qwikLABS, NVIDIA has made the hands-on content we use to train thousands of developers at the Supercomputing Conference and the GPU Technology Conference available online, accessible from anywhere with an internet connection. Using any supported browser, you can easily get started learning how to program massively parallel GPUs at nvidia.qwiklab.com.

Built on the powerful IPython Notebook technology, NVIDIA hands-on labs are immersive, self-paced experiences that run on real GPUs in the cloud. Lab instructions, code editing and execution, and even interaction with visual tools are all woven together into a single web application.

