Recently, STAC Research published astonishing performance results for the STAC-A2 benchmarks on an NVIDIA Tesla K80. In short, a single Tesla K80 driven by two CPU cores outperforms all previously audited systems in terms of pure performance and power efficiency.
We obtained these new results after several optimizations of our previously audited code. First of all, a large fraction of the computations are now avoided due to a better factorization of the underlying mathematical process. Secondly, we tuned some of the kernel parameters to take advantage of the larger register file of the Tesla K80. Finally, we were able to significantly reduce the latency in one of the main loops of the benchmark. Let’s take a look at these optimizations.
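As one illustration of the kind of kernel-parameter tuning involved (a hypothetical sketch, not the audited STAC-A2 code): GK210, the GPU in Tesla K80, doubles the register file per multiprocessor relative to GK110, so a kernel can be given a larger per-thread register budget via __launch_bounds__ before values spill to local memory.

```cuda
// Hypothetical kernel, for illustration only: with the larger GK210 register
// file, __launch_bounds__ can request fewer threads per block while still
// keeping enough blocks resident, giving each thread more registers to hold
// loop state instead of spilling it to local memory.
__global__ void __launch_bounds__(128, 8)   // max 128 threads/block, >= 8 blocks/SM
pathUpdate(float *paths, const float *drift, const float *vol, int nSteps)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float p = paths[tid];                    // path state lives in a register
    for (int s = 0; s < nSteps; ++s)
        p += drift[s] + vol[s] * p;          // stand-in for the real path update
    paths[tid] = p;
}
```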
With the US Department of Energy’s announcement of plans to base two future flagship supercomputers on IBM POWER CPUs, NVIDIA GPUs, NVIDIA NVLink interconnect, and Mellanox high-speed networking, many developers are getting started building GPU-accelerated applications that run on IBM POWER processors. The good news is that porting existing applications to this platform is easy. In fact, smooth sailing is already being reported by software development leaders such as Erik Lindahl, Professor of Biophysics at the Science for Life Laboratory, Stockholm University & KTH, developer of the GROMACS molecular dynamics package:
The combination of POWER8 CPUs & NVIDIA Tesla accelerators is amazing. It is the highest performance we have ever seen in individual cores, and the close integration with accelerators is outstanding for heterogeneous parallelization. Thanks to the little endian chip and standard CUDA environment it took us less than 24 hours to port and accelerate GROMACS.
The NVIDIA CUDA Toolkit version 5.5 is now available with POWER support, and all future CUDA Toolkits will support POWER, starting with CUDA 7 in 2015. The Tesla Accelerated Computing Platform enables multiple approaches to programming accelerated applications: libraries (cuBLAS, cuFFT, Thrust, AmgX, cuDNN, and many more) and, depending on the platform, compiler directives (OpenACC) and programming languages (CUDA C++, CUDA Fortran, Python). Developers have a choice of approaches for programming GPU-accelerated systems, and system builders have a choice of technologies for deployment: Tesla GPUs can now be paired with POWER, x86, or ARM CPUs.
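As a simple illustration of the library approach, the sketch below (a generic example, not taken from any particular application) uses Thrust to compute a reduction on the GPU; the same source builds unchanged whether the host CPU is POWER, x86, or ARM.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

// Sum of squares of a million values, computed entirely on the GPU.
int main()
{
    thrust::device_vector<float> x(1 << 20, 2.0f);

    float result = thrust::transform_reduce(x.begin(), x.end(),
                                            thrust::square<float>(),  // per-element op
                                            0.0f,                     // initial value
                                            thrust::plus<float>());   // reduction op
    printf("sum of squares = %f\n", result);
    return 0;
}
```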
NVIDIA® GPU Boost™ is a feature available on NVIDIA® GeForce® and Tesla® GPUs that boosts application performance by increasing GPU core and memory clock rates when sufficient power and thermal headroom are available (See the earlier Parallel Forall post about GPU Boost by Mark Harris). In the case of Tesla GPUs, GPU Boost is customized for compute-intensive workloads running on clusters. In this post I describe GPU Boost in more detail and show you how you can take advantage of it in your applications. I also introduce Tesla K80 autoboost and demonstrate that it can automatically match the performance of explicitly controlled application clocks.
Tesla GPUs target a specific power budget; for example, the Tesla K40 has a TDP (Thermal Design Power) of 235W and the Tesla K80 has a TDP of 300W. These TDP ratings are upper limits, and the graph in Figure 1 shows that many HPC workloads do not come close to this power limit. NVIDIA GPU Boost for Tesla allows users to increase application performance by using available power headroom to select higher graphics clock rates.
NVIDIA GPU Boost is exposed for Tesla accelerators via application clock settings, and on the new Tesla K80 accelerator it can also be engaged via the new autoboost feature, which is enabled by default. A user or system administrator can disable autoboost and manually set the right clocks for an application by either:
running the command line tool nvidia-smi locally on the node, or
setting the clocks programmatically through NVML (the NVIDIA Management Library).
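The snippet below is a minimal sketch of the programmatic route, assuming sufficient permissions on the node; the 2505 MHz memory / 875 MHz graphics pair is one of the supported application clock combinations on Tesla K80 and is used here purely as an example.

```cuda
#include <nvml.h>
#include <cstdio>

// Minimal sketch (error handling abbreviated): fix application clocks on
// device 0 and disable autoboost via NVML. Requires appropriate permissions.
int main()
{
    nvmlDevice_t dev;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Equivalent of `nvidia-smi -ac 2505,875`: set memory and graphics clocks.
    nvmlReturn_t r = nvmlDeviceSetApplicationsClocks(dev, 2505, 875);
    if (r != NVML_SUCCESS)
        printf("failed to set application clocks: %s\n", nvmlErrorString(r));

    // On Tesla K80, autoboost can likewise be disabled (or re-enabled) per device.
    nvmlDeviceSetAutoBoostedClocksEnabled(dev, NVML_FEATURE_DISABLED);

    nvmlShutdown();
    return 0;
}
```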
As CUDA Educator at NVIDIA, I work to make massively parallel programming education and training accessible to everyone, whether or not they have access to GPUs in their own machines. This is why, in partnership with qwikLABS, NVIDIA has put the hands-on content we use to train thousands of developers at the Supercomputing Conference and the GPU Technology Conference online, accessible from anywhere with an internet connection. Using any supported browser, you can easily get started learning how to program for massively parallel GPUs at nvidia.qwiklab.com.
Using the powerful IPython Notebook technology, NVIDIA hands-on labs are immersive, self-paced experiences that run on real GPUs in the cloud. Lab instructions, editing and execution of code, and even interaction with visual tools are all woven together into a single web application.
You may already know NVIDIA Tesla as a line of GPU accelerator boards optimized for high-performance, general-purpose computing. They are used for parallel scientific, engineering, and technical computing, and they are designed for deployment in supercomputers, clusters, and workstations. But it’s not just the GPU boards that make Tesla a great computing solution. The combination of the world’s fastest GPU accelerators, the widely used CUDA parallel computing model, and a comprehensive ecosystem of software developers, software vendors, and data center system OEMs makes Tesla the leading platform for accelerating data analytics and scientific computing.
The Tesla Accelerated Computing Platform provides advanced system management features and accelerated communication technology, and it is supported by popular infrastructure management software. These enable HPC professionals to easily deploy and manage Tesla accelerators in the data center. Tesla-accelerated applications are powered by CUDA, NVIDIA’s pervasive parallel computing platform and programming model, which provides application developers with a comprehensive suite of tools for productive, high-performance software development.
This post gives an overview of the broad range of technologies, tools, and components of the Tesla Accelerated Computing Platform that are available to application developers. Here’s what you need to know about the Tesla Platform.
The Java ecosystem is the leading enterprise software development platform, with widespread industry support and deployment on platforms like the IBM WebSphere Application Server product family. Java provides a powerful object-oriented programming language with a large developer ecosystem and developer-friendly features like automated memory management, program safety, security, and runtime portability, as well as high-performance features like just-in-time (JIT) compilation.
Java application developers face increasingly complex challenges, with big data and analytics workloads that require next-generation performance. Big data pushes the scale of the problem to a new level, with hundreds of gigabytes of data common in these applications, while analytics drive the need for higher computation speeds. The Java platform has evolved by adding developer support for simpler parallel programming via the fork/join framework and concurrent collection APIs. Most recently, Java 8 added support for lambda expressions, which can simplify the creation of highly parallel applications using Java.
Our first step brings capabilities of the CUDA programming model into the Java programming environment. Java developers familiar with CUDA concepts can use the new IBM CUDA4J library, which provides a Java API for managing and accessing GPU devices, libraries, kernels, and memory. Using these new APIs, it is possible to write Java programs that manage GPU device characteristics and offload work to the GPU with the convenience of the Java memory model, exceptions, and automatic resource management that Java developers expect.
Today NVIDIA introduced the new GM204 GPU, based on the Maxwell architecture. GM204 is the first GPU based on second-generation Maxwell, the full realization of the Maxwell architecture. The GeForce GTX 980 and 970 GPUs introduced today are the most advanced gaming and graphics GPUs ever made. But of course they also make fantastic CUDA development GPUs, with full support for CUDA 6.5 and all of the latest features of the CUDA platform, including Unified Memory and Dynamic Parallelism.
GM204’s 16 SMMs make it over 3 times faster than the first-generation GM107 GPU that I introduced earlier this year on Parallel Forall, and additional architectural improvements help GM204 pack an even bigger punch.
SMM: The Maxwell Multiprocessor
As I discussed in my earlier Maxwell post, the heart of Maxwell’s power-efficient performance is its Streaming Multiprocessor, known as SMM. Maxwell’s new datapath organization and improved instruction scheduler provide more than 40% higher delivered performance per CUDA core, and overall twice the efficiency of Kepler GK104. The new SMM, shown in Figure 1, includes all of the architectural benefits of its first-generation Maxwell predecessor, including improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and more.
SMM uses a quadrant-based design with four 32-core processing blocks each with a dedicated warp scheduler capable of dispatching two instructions per clock. Each SMM provides eight texture units, one polymorph engine (geometry processing for graphics), and dedicated register file and shared memory.
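If you want to confirm at runtime that you are running on a second-generation Maxwell part and see its per-SMM resources for yourself, a quick device query is enough. The following sketch simply prints the relevant fields of cudaDeviceProp for device 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query device 0 and report compute capability, SMM count, and per-SMM resources.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    printf("multiprocessors:       %d\n",        prop.multiProcessorCount);
    printf("registers per SM:      %d\n",        prop.regsPerMultiprocessor);
    printf("shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("max threads per SM:    %d\n",        prop.maxThreadsPerMultiProcessor);
    return 0;
}
```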
Our Spotlight is on Dr. Cris Cecka, a research scientist and lecturer in the new Institute for Applied Computational Science (IACS) at Harvard University. Harvard has been a CUDA Center of Excellence since 2009, led by Dr. Hanspeter Pfister, IACS Director. Cris is currently also performing research with the Mathematics Department at the Massachusetts Institute of Technology. Previously, Cris was a graduate student in the Institute for Computational and Mathematical Engineering (ICME) at Stanford University with Prof. Eric Darve.
The following is an excerpt from our interview (read the complete Spotlight here).
NVIDIA: Cris, what are your primary research interests?
Cris: My research focuses on computational mathematics, particularly for interdisciplinary applications in science and engineering. In the past, I’ve used CUDA for non-linear PDEs (partial differential equations) and real-time computing with applications in simulation and virtual surgery.
More recently, I have become interested in mathematical and computational abstractions to produce efficient, library-quality scientific software. Specifically, I have focused on generalized n-body problems, including integral equation methods, particle methods, and structured dense matrices.
As part of my work, I’ve released several software libraries, including FMMTL to aid in the research, development, and use of kernel matrices and CrowdCL to aid in the use of GPU computing within a browser.
NVIDIA: Tell us more about FMMTL. Is it GPU-accelerated?
Cris: FMMTL is a research code that is exploring fast algorithms (like Treecode, FMM, H-matrix, and Butterfly) for kernel matrices and other structured dense matrices. Why structured? Well, plenty of algorithms exist for dense matrices, e.g. all of BLAS and LAPACK. These use values of the matrix to compute products, eigenvalues, factorizations, etc. But there are huge classes of problems, known as generalized n-body problems, where we never actually want to construct all of the elements of the matrix; these can be accelerated either by compressing rows, columns, or blocks of the matrix or by avoiding computing elements of the matrix altogether.
By avoiding the computation of all of the elements or delaying the computation until the matrix element is requested, the amount of data required to define the matrix is reduced to O(N), which is great in terms of computational intensity! There is very little data to access and lots and lots of computation.
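To make that concrete, here is a generic sketch (not FMMTL code) of a matrix-free matrix-vector product for a kernel matrix: the N points define the whole operator, and each element K(x_i, x_j) is computed the moment it is needed, so nothing of size O(N^2) is ever stored.

```cuda
#include <cuda_runtime.h>

// Example Gaussian kernel K(x_i, x_j); any smooth pairwise kernel works here.
__device__ float kernelEntry(float xi, float xj)
{
    float r = xi - xj;
    return expf(-r * r);
}

// y = K * c, with K never materialized: O(N) data, O(N^2) flops per product,
// which is exactly the high computational intensity described above.
__global__ void kernelMatVec(const float *x, const float *c, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi  = x[i];
    float acc = 0.0f;
    for (int j = 0; j < n; ++j)
        acc += kernelEntry(xi, x[j]) * c[j];   // element computed on the fly
    y[i] = acc;
}
```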
Machine Learning (ML) has its origins in the field of Artificial Intelligence, which started out decades ago with the lofty goals of creating a computer that could do any work a human can do. While attaining that goal still appears to be in the distant future, many useful tools have been developed and successfully applied to a wide variety of problems. In fact, ML has now become a pervasive technology, underlying many modern applications. Today the world’s largest financial companies, internet firms and foremost research institutions are using ML in applications including internet search, fraud detection, gaming, face detection, image tagging, brain mapping, check processing and computer server health-monitoring, to name a few. The US Postal Service uses machine learning techniques for handwriting recognition, and leading applied-research government agencies such as IARPA and DARPA are funding work to develop the next generation of ML systems.
There is a wide variety of algorithms and processes for implementing ML systems. The hottest area in ML today, however, is Deep Neural Networks (DNNs). The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training large, complex DNN-based ML systems. Pioneers in this area include luminaries like Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and Andrew Ng. Their success over the past 30 years has inspired a groundswell of research and development in academia, including universities such as Carnegie Mellon, NYU, Oxford, Stanford, University of California at Berkeley, University of Montreal, and the University of Toronto. More recently, many commercial enterprises have also started investing aggressively in this technology. A few that have publicly acknowledged using GPUs with deep learning include Adobe, Baidu, Nuance, and Yandex.
Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, NVIDIA is introducing a library of primitives for deep neural networks called cuDNN. The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, and provides other important benefits.
Machine Learning with DNNs
An ML system may be thought of as a system that learns to recognize things of interest to us, without being told explicitly what the things are ahead of time. Classic examples of such a system are the spam classifier, which scans your incoming messages and quarantines spam emails, and product recommender systems, which suggest new products (books, movies, etc.) that you might like based on your prior purchases and ratings.
We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule that will help you avoid it within your own software (spoiler alert: it’s in the title!).
Let’s review how to select which GPU to execute CUDA calls on. The CUDA runtime API is state-based, and threads execute cudaSetDevice() to set the current GPU.
cudaError_t cudaSetDevice(int device)
After this call, all CUDA API commands go to the currently set device until cudaSetDevice() is called again with a different device ID. The CUDA runtime API is thread-safe, which means it maintains per-thread state about the current device. This is very important, as it allows threads to concurrently submit work to different devices, but forgetting to set the current device in each thread can lead to subtle and hard-to-find bugs like the following example.
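Here is a minimal sketch of that failure mode (an illustrative example with a placeholder kernel, assuming OpenMP host threads): the main thread selects device 1, but OpenMP worker threads other than the master start with the default current device, device 0, so their allocations and launches silently land on the wrong GPU unless each thread calls cudaSetDevice() itself.

```cuda
#include <cuda_runtime.h>

__global__ void fill(float *data) { data[threadIdx.x] = threadIdx.x; }

int main()
{
    cudaSetDevice(1);                  // affects only the calling (main) thread

    #pragma omp parallel num_threads(4)
    {
        // BUG: without this call, worker threads (other than the master)
        // allocate and launch on device 0, not device 1.
        // cudaSetDevice(1);

        float *d_data;
        cudaMalloc(&d_data, 64 * sizeof(float));
        fill<<<1, 64>>>(d_data);
        cudaDeviceSynchronize();
        cudaFree(d_data);
    }
    return 0;
}
```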