So far in the CUDA Python mini-series on CUDACasts, I introduced you to using the
@vectorize decorator and CUDA libraries, two different methods for accelerating code using NVIDIA GPUs. In today’s CUDACast, I’ll be demonstrating how to use the NumbaPro compiler from Continuum Analytics to write CUDA Python code which runs on the GPU.
In CUDACast #12, we’ll continue using the Monte Carlo options pricing example, and I’ll show how to write the
step function in CUDA Python rather than using the @vectorize decorator. In addition, by using the nvprof command-line profiler, we’ll be able to see the speed-up we’re able to achieve by writing the code explicitly in CUDA.
In the previous episode of CUDACasts I introduced you to NumbaPro, the high-performance Python compiler from Continuum Analytics, and demonstrated how to accelerate simple Python functions on the GPU. Continuing the Python theme, today’s CUDACast demonstrates NumbaPro’s support for CUDA libraries.
The optimized algorithms in GPU-accelerated libraries often provide the easiest way to accelerate applications. NumbaPro includes a Python API interface to the cuBLAS, cuFFT, and cuRAND libraries. In CUDACasts episode #11 I show you how to use cuRAND to accelerate random-number generation for a Python Monte Carlo options pricing example, achieving a 17x overall speed-up.
CUDA 5 added a powerful new tool to the CUDA Toolkit:
nvprof is a command-line profiler available for Linux, Windows, and OS X. At first glance,
nvprof seems to be just a GUI-less version of the graphical profiling features available in the NVIDIA Visual Profiler and NSight Eclipse edition. But
nvprof is much more than that; to me,
nvprof is the light-weight profiler that reaches where other tools can’t.
nvprof for Quick Checks
I often find myself wondering if my CUDA application is running as I expect it to. Sometimes this is just a sanity check: is the app running kernels on the GPU at all? Is it performing excessive memory copies? By running my application with
nvprof ./myApp, I can quickly see a summary of all the kernels and memory copies that it used, as shown in the following sample output.
==9261== Profiling application: ./tHogbomCleanHemi
==9261== Profiling result:
Time(%) Time Calls Avg Min Max Name
58.73% 737.97ms 1000 737.97us 424.77us 1.1405ms subtractPSFLoop_kernel(float const *, int, float*, int, int, int, int, int, int, int, float, float)
38.39% 482.31ms 1001 481.83us 475.74us 492.16us findPeakLoop_kernel(MaxCandidate*, float const *, int)
1.87% 23.450ms 2 11.725ms 11.721ms 11.728ms [CUDA memcpy HtoD]
1.01% 12.715ms 1002 12.689us 2.1760us 10.502ms [CUDA memcpy DtoH]
In its default summary mode, nvprof presents an overview of the GPU kernels and memory copies in your application. The summary groups all calls to the same kernel together, presenting the total time and percentage of the total application time for each kernel. In addition to summary mode,
nvprof supports GPU-Trace and API-Trace modes that let you see a complete list of all kernel launches and memory copies, and in the case of API-Trace mode, all CUDA API calls. Continue reading
This week’s CUDACast continues the Parallel Forall Python theme kicked off in last week’s post by Mark Harris, demonstrating exciting new support for CUDA acceleration in Python with NumbaPro. This video is the first in a 3-part series showing various ways to accelerate your Python code on NVIDIA GPUs.
Tomorrow you won’t want to miss the chance to learn about Python GPU acceleration with NumbaPro from its creators, in a GTC Express Webinar called “Pythonic Parallel Patterns for the GPU with NumbaPro” from Siu Kwan Lam, NumbaPro’s primary author at Continuum Analytics. Click the link to sign up now!
Python is a high-productivity dynamic programming language that is widely used in science, engineering, and data analytics applications. There are a number of factors influencing the popularity of python, including its clean and expressive syntax and standard data structures, comprehensive “batteries included” standard library, excellent documentation, broad ecosystem of libraries and tools, availability of professional support, and large and open community. Perhaps most important, though, is the high productivity enabled by a dynamically typed, interpreted language like Python. Python is nimble and flexible, making it a great language for quick prototyping, but also for building complete systems.
But Python’s greatest strength can also be its greatest weakness: its flexibility and typeless, high-level syntax can result in poor performance for data- and computation-intensive programs. For this reason, Python programmers concerned about efficiency often rewrite their innermost loops in C and call the compiled C functions from Python. There are a number of projects aimed at making this optimization easier, such as Cython, but they often require learning a new syntax. Ideally, Python programmers would like to make their existing Python code faster without using another programming language, and, naturally, many would like to use accelerators to get even higher performance from their code.
NumbaPro: High Productivity for High-Performance Computing
In this post I’ll introduce you to NumbaPro, a Python compiler from Continuum Analytics that can compile Python code for execution on CUDA-capable GPUs or multicore CPUs. Since Python is not normally a compiled language, you might wonder why you would want a Python compiler. The answer is of course that running native, compiled code is many times faster than running dynamic, interpreted code. NumbaPro works by allowing you to specify type signatures for Python functions, which enables compilation at run time (this is “Just-in-Time”, or JIT compilation). NumbaPro’s ability to dynamically compile code means that you don’t give up the flexibility of Python. This is a huge step toward providing the ideal combination of high productivity programming and high-performance computing. Continue reading
Programming environments like C and Fortran allow complete and unrestricted access to computing hardware, but often require programmers to understand the low-level details of the hardware they target. Although these efficiency-oriented systems are essential to every computing platform, many programmers prefer to use higher level programming environments like Python or Ruby, focused on productivity rather than absolute performance. Productivity-focused programmers solving large or intensive problems do need high performance, and many seek to exploit parallel computing, but without the costs of understanding low-level hardware details or programming directly to a particular machine.
Copperhead is a project that aims to enable productivity-focused programmers to take advantage of parallel computing, without explicitly coding to any particular machine. Copperhead programs use familiar Python primitives such as map and reduce, and they execute in parallel on both CUDA-enabled GPUs as well as multicore
Parallel Hello World: axpy
Let’s start with an example: below find Copperhead code for axpy, the “hello world” of parallel programs. (axpy is the type-generic form of saxpy. See Six Ways to SAXPY for more.)
from copperhead import *
import numpy as np
def axpy(a, x, y):
return [a * xi + yi for xi, yi in zip(x, y)]
n = 1000000
a = 2.0
x = np.random.rand(n)
y = np.random.rand(n)
gpu_result = axpy(a, x, y)
cpu_result = axpy(a, x, y)
This post is a GPU program chrestomathy. What’s a Chrestomathy, you ask?
In computer programming, a program chrestomathy is a collection of similar programs written in various programming languages, for the purpose of demonstrating differences in syntax, semantics and idioms for each language. [Wikipedia]
There are several good examples of program chrestomathies on the web, including Rosetta Code andNBabel, which demonstrates gravitational N-body simulation in multiple programming languages. In this post I demonstrate six ways to implement a simple SAXPY computation on the CUDA platform. Why is this interesting? Because it demonstrates the breadth of options you have today for programming NVIDIA GPUs, and it covers the three main approaches to GPU computing: GPU-accelerated libraries, GPU compiler directives, and GPU programming languages.
SAXPY stands for “Single-Precision A·X Plus Y”. It is a function in the standard Basic Linear Algebra Subroutines (BLAS)library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. A simple C implementation looks like this.
void saxpy(int n, float a, float *x, float *y)
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);