
Interactive Supercomputing with In-Situ Visualization on Tesla GPUs

So, you just got access to the latest supercomputer with thousands of GPUs. Obviously this will do a lot to accelerate your scientific calculations, but how are you going to analyze, reduce, and visualize the data they produce? Historically, you would be forced to write everything out to disk, only to read it back later on a separate data analysis cluster.

Figure 1: NVIDIA OptiX ray tracing helped scientists at the University of Illinois, Urbana-Champaign to visualize and analyze the world’s first complete atomic-level model of the chemical structure of the HIV capsid with 4.2 million atoms.

Wouldn’t it be nice if you could analyze and visualize your data as it is being generated, without having to go through a file system? And wouldn’t it be cool to interact with the simulation, maybe even modifying parameters while the simulation is running?

And wouldn’t it be nice to use your GPUs for that as well? As it turns out, you can. It’s called in-situ visualization: visualization of datasets in place, where they are computed. High-quality, high-performance rendering and visualization is just one of the capabilities of the Tesla Accelerated Computing Platform. Depending on the site where you’re running, it takes just a couple of steps to get your system configured correctly, and in this post I’ll tell you how.

But before walking you through the steps necessary to set up your system for remote, in-situ visualization, I’ll describe a few use cases for in-situ visualization and show you some of the tools that can help you add visualization capability to your application.


Increase Performance with GPU Boost and K80 Autoboost

NVIDIA® GPU Boost™ is a feature available on NVIDIA® GeForce® and Tesla® GPUs that boosts application performance by increasing GPU core and memory clock rates when sufficient power and thermal headroom are available (see the earlier Parallel Forall post on GPU Boost by Mark Harris). In the case of Tesla GPUs, GPU Boost is customized for compute-intensive workloads running on clusters. In this post I describe GPU Boost in more detail and show you how you can take advantage of it in your applications. I also introduce Tesla K80 autoboost and demonstrate that it can automatically match the performance of explicitly controlled application clocks.

Tesla GPUs target a specific power budget; for example, the Tesla K40 has a TDP (Thermal Design Power) of 235 W and the Tesla K80 has a TDP of 300 W. These TDP ratings are upper limits, and the graph in Figure 1 shows that many HPC workloads do not come close to this power limit. NVIDIA GPU Boost for Tesla allows users to increase application performance by using the available power headroom to select higher graphics clock rates.

Figure 1: Average GPU Power Consumption for Real Applications

NVIDIA GPU Boost is exposed for Tesla accelerators via application clock settings, and on the new Tesla K80 accelerator it is also available through the new autoboost feature, which is enabled by default. A user or system administrator can disable autoboost and manually set the right clocks for an application by either of the following (a short example follows the list):

  1. running the command line tool nvidia-smi locally on the node, or
  2. programmatically using the NVIDIA Management Library (NVML).
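As a concrete sketch of the programmatic route, the C++ snippet below uses the NVML API to disable autoboost and pin the application clocks on device 0. The calls shown (nvmlDeviceSetAutoBoostedClocksEnabled and nvmlDeviceSetApplicationsClocks) are real NVML entry points, but the 2505 MHz memory / 875 MHz graphics pairing is only an illustrative assumption for a Tesla K80; query the supported clocks for your own device (for example with nvidia-smi -q -d SUPPORTED_CLOCKS), and note that changing clocks may require administrator privileges.

    // Minimal sketch: disable autoboost and set application clocks via NVML.
    // Clock values are an assumption (a Tesla K80 pairing); substitute values
    // reported as supported for your device. Build with: g++ clocks.cpp -lnvidia-ml
    #include <cstdio>
    #include <nvml.h>

    static void check(nvmlReturn_t r, const char *what)
    {
        if (r != NVML_SUCCESS)
            std::fprintf(stderr, "%s failed: %s\n", what, nvmlErrorString(r));
    }

    int main()
    {
        nvmlDevice_t dev;

        check(nvmlInit(), "nvmlInit");
        check(nvmlDeviceGetHandleByIndex(0, &dev), "nvmlDeviceGetHandleByIndex");

        // Disable autoboost (Tesla K80) so the clocks we set stay in effect.
        check(nvmlDeviceSetAutoBoostedClocksEnabled(dev, NVML_FEATURE_DISABLED),
              "nvmlDeviceSetAutoBoostedClocksEnabled");

        // Application clocks take the memory clock first, then the graphics
        // clock, both in MHz. May require root or relaxed clock permissions.
        check(nvmlDeviceSetApplicationsClocks(dev, 2505, 875),
              "nvmlDeviceSetApplicationsClocks");

        check(nvmlShutdown(), "nvmlShutdown");
        return 0;
    }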



How NVLink Will Enable Faster, Easier Multi-GPU Computing

Accelerated systems have become the new standard for high-performance computing (HPC) as GPUs continue to raise the bar for both performance and energy efficiency. In 2012, Oak Ridge National Laboratory announced what was to become the world’s fastest supercomputer, Titan, equipped with one NVIDIA® GPU per CPU, more than 18,000 GPU accelerators in total. Titan established records not only in absolute system performance but also in energy efficiency, with 90% of its peak performance being delivered by the GPU accelerators. This week, the U.S. Department of Energy (DoE) announced an award to IBM and NVIDIA to build two new flagship supercomputers: the Summit system at Oak Ridge National Laboratory and the Sierra system at Lawrence Livermore National Laboratory.

A new NVIDIA white paper explores key features of these new supercomputers and the technologies enabled by the Tesla® accelerated computing platform that will drive the U.S. DoE’s push toward exascale. Here’s a description of Summit and Sierra from the white paper.


Embedded Machine Learning with the cuDNN Deep Neural Network Library and Jetson TK1

GPUs have quickly become the go-to platform for accelerating machine learning applications for both training and classification. Deep Neural Networks (DNNs) have grown in importance for many applications, from image classification and natural language processing to robotics and UAVs. To help researchers focus on solving core problems, NVIDIA introduced cuDNN, a library of primitives for deep neural networks. The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, but until now only for workstation- and server-based machine learning applications.
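To give a sense of the library’s style, here is a minimal C++ sketch that just creates a cuDNN handle and describes a batch of images with a tensor descriptor. The descriptor calls follow the naming used in later cuDNN releases (the earliest release named these slightly differently), and the 1 x 3 x 224 x 224 shape is purely an illustrative assumption.

    // Minimal cuDNN sketch: create a handle and describe an input tensor.
    // Build with: g++ cudnn_hello.cpp -lcudnn
    #include <cstdio>
    #include <cudnn.h>

    int main()
    {
        cudnnHandle_t handle;
        cudnnTensorDescriptor_t input;

        // A cuDNN handle ties the library to the current CUDA context.
        cudnnCreate(&handle);

        // Describe one RGB image of 224 x 224 pixels in NCHW layout.
        cudnnCreateTensorDescriptor(&input);
        cudnnSetTensor4dDescriptor(input, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                                   1, 3, 224, 224);

        std::printf("cuDNN version %zu initialized\n", (size_t)cudnnGetVersion());

        cudnnDestroyTensorDescriptor(input);
        cudnnDestroy(handle);
        return 0;
    }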

In the meantime, the Jetson TK1 development kit has become a must-have for mobile and embedded parallel computing due to the amazing level of performance packed into such a low-power board. Demand for embedded machine learning has been incredible, so to address this demand, we’ve released cuDNN for ARM (Linux for Tegra—L4T).

The combination of these two powerful tools enables industry-standard machine learning frameworks, such as Berkeley’s Caffe or NYU’s Torch7, to run on a mobile device with excellent performance. Numerous machine learning applications will benefit from this platform, enabling advances in robotics, autonomous vehicles, and embedded computer vision.


Learn GPU Programming in Your Browser with NVIDIA Hands-On Labs

As CUDA Educator at NVIDIA, I work to make massively parallel programming education and training accessible to everyone, whether or not they have access to GPUs in their own machines. This is why, in partnership with qwikLABS, NVIDIA has made the hands-on content we use to train thousands of developers at the Supercomputing Conference and the GPU Technology Conference available online, accessible from anywhere with an internet connection. Using any supported browser, you can easily get started learning how to program massively parallel GPUs at nvidia.qwiklab.com.

Built on the powerful IPython Notebook technology, NVIDIA hands-on labs are immersive, self-paced experiences that run on real GPUs in the cloud. Lab instructions, editing and execution of code, and even interaction with visual tools are all woven together into a single web application.




12 Things You Should Know about the Tesla Accelerated Computing Platform

You may already know NVIDIA Tesla as a line of GPU accelerator boards optimized for high-performance, general-purpose computing. They are used for parallel scientific, engineering, and technical computing, and they are designed for deployment in supercomputers, clusters, and workstations. But it’s not just the GPU boards that make Tesla a great computing solution. The combination of the world’s fastest GPU accelerators, the widely used CUDA parallel computing model, and a comprehensive ecosystem of software developers, software vendors, and data center system OEMs makes Tesla the leading platform for accelerating data analytics and scientific computing.

The Tesla Accelerated Computing Platform provides advanced system management features and accelerated communication technology, and it is supported by popular infrastructure management software. These enable HPC professionals to easily deploy and manage Tesla accelerators in the data center. Tesla-accelerated applications are powered by CUDA, NVIDIA’s pervasive parallel computing platform and programming model, which provides application developers with a comprehensive suite of tools for productive, high-performance software development.

This post gives an overview of the broad range of technologies, tools, and components of the Tesla Accelerated Computing Platform that are available to application developers. Here’s what you need to know about the Tesla Platform.

Optimizing the High Performance Conjugate Gradient Benchmark on GPUs

[This post was co-written by Everett Phillips and Massimiliano Fatica.]

The High Performance Conjugate Gradient Benchmark (HPCG) is a new benchmark intended to complement the High-Performance Linpack (HPL) benchmark currently used to rank supercomputers in the TOP500 list. This new benchmark solves a large sparse linear system using a multigrid-preconditioned conjugate gradient (PCG) algorithm. The PCG algorithm better represents the computational and communication patterns prevalent in modern application workloads, which rely more heavily on memory system and network performance than HPL does.

GPU-accelerated supercomputers have proven to be very effective for compute-intensive applications like HPL, especially in terms of power efficiency. Obtaining good acceleration on the GPU for the HPCG benchmark is more challenging, due to the limited parallelism and the memory access patterns of the computational kernels involved. In this post we present the steps taken to obtain high performance for the HPCG benchmark on GPU-accelerated clusters, and demonstrate that our GPU-accelerated HPCG results are the fastest per-processor results reported to date.

The PCG Algorithm

The PCG algorithm solves a sparse linear system \mathbf{A}\mathbf{x} = \mathbf{b} given an initial guess \mathbf{x}_0. The particular sparse linear system used in HPCG is a simple elliptic partial differential equation discretized with a 27-point stencil on a regular 3D grid. Rows in the sparse matrix \mathbf{A} represent points in the grid. Each processor is responsible for a subset of rows corresponding to a local domain of size N_{x} \times N_{y} \times N_{z}, chosen by the user in the setup file. The number of processors is automatically detected at runtime and decomposed into P_{x} \times P_{y} \times P_{z}, where P=P_{x}P_{y}P_{z} is the total number of processors. This creates a global domain of size G_{x} \times G_{y} \times G_{z}, where G_{x} = P_{x}N_{x}, G_{y} = P_{y}N_{y}, and G_{z} = P_{z}N_{z}. For example, 8 processors decomposed as 2 \times 2 \times 2 with a 128^3 local domain produce a 256^3 global domain. Although the matrix has a simple structure, that structure is intended only to facilitate problem setup and validation of the solution. Implementations may not use assumptions about the matrix structure to optimize the solver; they must treat the matrix as a general sparse matrix.

Following is pseudocode for the PCG algorithm.
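What follows here is a schematic, single-process C++ rendering of textbook preconditioned CG, not the benchmark’s own pseudocode or implementation: a CSR matrix built from a 1D Laplacian stands in for the 27-point 3D stencil, a Jacobi preconditioner stands in for HPCG’s multigrid, and in the benchmark the dot products additionally become global reductions across processors.

    // Schematic preconditioned CG (not the HPCG implementation): a CSR matrix,
    // a 1D Laplacian in place of the 27-point 3D stencil, and a Jacobi
    // preconditioner in place of HPCG's multigrid.
    #include <cstdio>
    #include <cmath>
    #include <vector>

    struct Csr {                            // compressed sparse row matrix
        int n;
        std::vector<int> rowPtr, col;
        std::vector<double> val;
    };

    static void spmv(const Csr &A, const std::vector<double> &x, std::vector<double> &y)
    {
        for (int i = 0; i < A.n; ++i) {
            double s = 0.0;
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
                s += A.val[k] * x[A.col[k]];
            y[i] = s;
        }
    }

    static double dot(const std::vector<double> &a, const std::vector<double> &b)
    {
        double s = 0.0;                     // in HPCG this is a global reduction
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    int main()
    {
        const int n = 1000;                 // 1D Laplacian: tridiagonal (-1, 2, -1)
        Csr A{n, {0}, {}, {}};
        for (int i = 0; i < n; ++i) {
            if (i > 0)     { A.col.push_back(i - 1); A.val.push_back(-1.0); }
            A.col.push_back(i);              A.val.push_back( 2.0);
            if (i < n - 1) { A.col.push_back(i + 1); A.val.push_back(-1.0); }
            A.rowPtr.push_back((int)A.col.size());
        }

        std::vector<double> x(n, 0.0), b(n, 1.0), r(b), z(n), p(n), Ap(n);

        // z = M^{-1} r with a Jacobi (diagonal) preconditioner; diag(A) = 2.
        auto precond = [&](const std::vector<double> &rr, std::vector<double> &zz) {
            for (int i = 0; i < n; ++i) zz[i] = rr[i] / 2.0;
        };

        precond(r, z);
        p = z;
        double rz = dot(r, z);

        for (int iter = 0; iter < 500 && std::sqrt(dot(r, r)) > 1e-8; ++iter) {
            spmv(A, p, Ap);
            double alpha = rz / dot(p, Ap);                     // step length
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
            precond(r, z);
            double rzNew = dot(r, z);
            double beta = rzNew / rz;                           // direction update
            rz = rzNew;
            for (int i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
        }
        std::printf("final residual norm: %g\n", std::sqrt(dot(r, r)));
        return 0;
    }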


Deep Learning for Computer Vision with Caffe and cuDNN

Deep learning models are making great strides in research papers and industrial deployments alike, but it’s helpful to have a guide and toolkit to join this frontier. This post serves to orient researchers, engineers, and machine learning practitioners on how to incorporate deep learning into their own work. This orientation pairs an introduction to model structure and learned features, for general understanding, with an overview of the Caffe deep learning framework, for practical know-how. References highlight recent and historical research for perspective on current progress. The framework survey points out key elements of the Caffe architecture, reference models, and worked examples. Through collaboration with NVIDIA, drop-in integration of the cuDNN library accelerates Caffe models. Follow this post to join the active deep learning community around Caffe.

Automating Perception by Deep Learning

Deep learning is a branch of machine learning that is advancing the state of the art for perceptual problems like vision and speech recognition. We can pose these tasks as mapping concrete inputs such as image pixels or audio waveforms to abstract outputs like the identity of a face or a spoken word. The “depth” of deep learning models comes from composing functions into a series of transformations from input, through intermediate representations, and on to output. The overall composition gives a deep, layered model, in which each layer encodes progress from low-level details to high-level concepts. This yields a rich, hierarchical representation of the perceptual problem. Figure 1 shows the kinds of visual features captured in the intermediate layers of the model between the pixels and the output. A simple classifier can recognize a category from these learned features while a classifier on the raw pixels has a more complex decision to make.
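Purely as an illustration of this idea of composition (this is not Caffe code), the small C++ sketch below treats a network as a sequence of layer functions and a forward pass as threading the data through them, from pixels toward the output; the toy layers are an assumption for demonstration only.

    // Toy sketch: a deep model as a composition of layer functions,
    //   output = layerK( ... layer2( layer1( input ) ) ).
    #include <functional>
    #include <vector>

    using Blob  = std::vector<float>;                  // a flat data array
    using Layer = std::function<Blob(const Blob &)>;   // one transformation

    static Blob forward(const std::vector<Layer> &layers, Blob x)
    {
        for (const Layer &layer : layers) x = layer(x); // input -> ... -> output
        return x;
    }

    int main()
    {
        std::vector<Layer> net = {
            [](const Blob &in) {                        // "low level": difference features
                Blob out(in.size(), 0.0f);
                for (size_t i = 0; i + 1 < in.size(); ++i) out[i] = in[i + 1] - in[i];
                return out;
            },
            [](const Blob &in) {                        // nonlinearity (ReLU)
                Blob out(in);
                for (float &v : out) v = v > 0.0f ? v : 0.0f;
                return out;
            },
            [](const Blob &in) {                        // "high level": pool to a summary
                float s = 0.0f;
                for (float v : in) s += v;
                return Blob{s / in.size()};
            }
        };

        Blob pixels = {0.1f, 0.5f, 0.2f, 0.8f};
        Blob output = forward(net, pixels);             // abstract output, schematically
        (void)output;
        return 0;
    }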

Figure 1: Visualization of deep features by example. Each 3 x 3 array shows the nine image patches from a standard data set that maximize the response of a given feature from a low-level (left) and high-level (right) layer of the popular Zeiler-Fergus network [8]. Similarly rich features are found in concurrent work by Girshick et al. [4]. The low-level features capture color, simple shapes, and similar textures. The high-level features respond to parts like eyes and wheels, flowers in different colors, and text in various styles.



The Next Wave of Enterprise Performance with Java, POWER Systems and NVIDIA GPUs

The Java ecosystem is the leading enterprise software development platform, with widespread industry support and deployment on platforms like the IBM WebSphere Application Server product family. Java provides a powerful object-oriented programming language with a large developer ecosystem and developer-friendly features like automated memory management, program safety, security, and runtime portability, as well as high-performance features like just-in-time (JIT) compilation.

Java application developers face increasingly complex challenges, with big data and analytics workloads that require next-generation performance. Big data pushes the scale of the problem to a new level, with hundreds of gigabytes of information common in these applications, while analytics drives the need for higher computation speeds. The Java platform has evolved by adding developer support for simpler parallel programming via the fork/join framework and concurrent collection APIs. Most recently, Java 8 added support for lambda expressions, which can simplify the creation of highly parallel applications using Java.

IBM’s new Power S824L is a data processing powerhouse that integrates the NVIDIA Tesla Accelerated Computing Platform (Tesla GPUs and enabling software) with IBM’s POWER8 processor.

IBM’s POWER group has partnered with NVIDIA to make GPUs available on a high-performance server platform, promising the next generation of parallel performance for Java applications. We decided to bring GPU support to Java incrementally using three approaches.

Enabling CUDA for Java Developers

Our first step brings capabilities of the CUDA programming model into the Java programming environment. Java developers familiar with CUDA concepts can use the new IBM CUDA4J library, which provides a Java API for managing and accessing GPU devices, libraries, kernels, and memory. Using these new APIs, it is possible to write Java programs that manage GPU device characteristics and offload work to the GPU with the convenience of the Java memory model, exceptions, and automatic resource management that Java developers expect.


Benchmarking GPUDirect RDMA on Modern Server Platforms

NVIDIA GPUDirect RDMA is a technology that enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Examples of third-party devices include network interfaces, video acquisition devices, storage adapters, and medical equipment. Enabled on Tesla- and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory in a PCI Express Base Address Register (BAR) region; see this white paper for more technical details.

Both Open MPI and MVAPICH2 now support GPUDirect RDMA, exposed via CUDA-aware MPI. Since January 2014, the Mellanox InfiniBand software stack has supported GPUDirect RDMA on Mellanox ConnectX-3 and Connect-IB devices. See this post on the Mellanox blog for a nice introduction to the topic.
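To show what CUDA-aware MPI looks like from the application side, here is a minimal C++ sketch in which device pointers are passed directly to MPI calls. It assumes an MPI library built with CUDA support (such as the Open MPI and MVAPICH2 builds mentioned above); whether the transfer actually takes the GPUDirect RDMA path depends on the hardware and configuration examined in this post.

    // Minimal CUDA-aware MPI sketch: rank 0 sends a GPU buffer to rank 1.
    // Requires an MPI build with CUDA support; device pointers go straight to
    // MPI, with no explicit staging through host memory in user code.
    // Build with: mpicxx gdr.cpp -lcudart
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                          // 1M floats (4 MB)
        float *d_buf = nullptr;
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        if (rank == 0) {
            cudaMemset(d_buf, 0, n * sizeof(float));    // pretend this is a result
            MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);   // device pointer
        } else if (rank == 1) {
            MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d floats into GPU memory\n", n);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }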

This post is a detailed look at the performance obtainable with available hardware platforms. The main audience for this post is designers and users of GPU-accelerated clusters employing CUDA-aware MPI, and architects and designers of GPU-accelerated low-latency systems, such as in healthcare, aviation, and high-energy physics. It is also complementary to a recent post (Exploring the PCIe Bus Routes) by Cirrascale.

Though the details may change in future hardware, this post suggests expected levels of performance and gives useful hints for performance verification.