
NumbaPro: High-Performance Python with CUDA Acceleration

Python is a high-productivity dynamic programming language that is widely used in science, engineering, and data analytics applications. A number of factors influence the popularity of Python, including its clean and expressive syntax and standard data structures, comprehensive “batteries included” standard library, excellent documentation, broad ecosystem of libraries and tools, availability of professional support, and large and open community. Perhaps most important, though, is the high productivity enabled by a dynamically typed, interpreted language like Python. Python is nimble and flexible, making it a great language for quick prototyping, but also for building complete systems.

But Python’s greatest strength can also be its greatest weakness: its flexibility and dynamically typed, high-level syntax can result in poor performance for data- and computation-intensive programs. For this reason, Python programmers concerned about efficiency often rewrite their innermost loops in C and call the compiled C functions from Python. A number of projects, such as Cython, aim to make this optimization easier, but they often require learning a new syntax. Ideally, Python programmers would like to make their existing Python code faster without using another programming language, and, naturally, many would like to use accelerators to get even higher performance from their code.

NumbaPro: High Productivity for High-Performance Computing

In this post I’ll introduce you to NumbaPro, a Python compiler from Continuum Analytics that can compile Python code for execution on CUDA-capable GPUs or multicore CPUs. Since Python is not normally a compiled language, you might wonder why you would want a Python compiler. The answer, of course, is that running native, compiled code is many times faster than running dynamic, interpreted code. NumbaPro works by allowing you to specify type signatures for Python functions, which enables compilation at run time (this is “Just-in-Time”, or JIT, compilation). NumbaPro’s ability to dynamically compile code means that you don’t give up the flexibility of Python. This is a huge step toward providing the ideal combination of high-productivity programming and high-performance computing.

With NumbaPro, it is now possible to write standard Python functions and run them on a CUDA-capable GPU. NumbaPro is designed for array-oriented computing tasks, much like the widely used NumPy library. The data parallelism in array-oriented computing tasks is a natural fit for accelerators like GPUs. NumbaPro understands NumPy array types, and uses them to generate efficient compiled code for execution on GPUs or multicore CPUs. The programming effort required can be as simple as adding a function decorator to instruct NumbaPro to compile for the GPU. For example, the @vectorize decorator in the following code generates a compiled, vectorized version of the scalar function Add at run time so that it can be used to process arrays of data in parallel on the GPU.

import numpy as np
from numbapro import vectorize

@vectorize(['float32(float32, float32)'], target='gpu')
def Add(a, b):
  return a + b

# Initialize arrays
# (N was not defined in the original listing; 100000 is an arbitrary example size)
N = 100000
A = np.ones(N, dtype=np.float32)
B = np.ones(A.shape, dtype=A.dtype)
C = np.empty_like(A, dtype=A.dtype)

# Add arrays on GPU
C = Add(A, B)

To compile and run the same function on the CPU, we simply change the target to ‘cpu’, which yields performance at the level of compiled, vectorized C code on the CPU. This flexibility helps you produce more reusable code, and lets you develop on machines without GPUs.
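As a rough, CPU-only illustration of what “vectorizing” a scalar function means, the sketch below lifts a two-argument scalar function so that it applies element by element to plain Python lists. The decorator name vectorize_sketch is hypothetical and is not part of NumbaPro. The real @vectorize compiles the function and handles NumPy arrays, which this sketch does not attempt.

```python
def vectorize_sketch(scalar_func):
    """Lift a scalar function so it applies elementwise to equal-length sequences.

    Hypothetical helper for illustration only; it runs serially in the
    interpreter and does no compilation.
    """
    def elementwise(*arrays):
        lengths = set(len(a) for a in arrays)
        if len(lengths) != 1:
            raise ValueError("all input arrays must have the same length")
        # Each output element depends only on the matching input elements,
        # which is what makes this pattern easy to run in parallel.
        return [scalar_func(*args) for args in zip(*arrays)]
    return elementwise


@vectorize_sketch
def add(a, b):
    return a + b


A = [1.0] * 8
B = [1.0] * 8
C = add(A, B)
```

Because every output element is computed independently, a compiler is free to spread the work of the inner loop across CPU vector lanes or GPU threads without changing the result.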

GPU-Accelerated Libraries for Python

One of the strengths of the CUDA parallel computing platform is its breadth of available GPU-accelerated libraries. NumbaPro provides a Python interface to the CUDA cuBLAS (dense linear algebra), cuFFT (Fast Fourier Transform), and cuRAND (random number generation) libraries. Many applications will be able to get significant speedup just from using these libraries, without writing any GPU-specific code. For example, the following code generates a million uniformly distributed random numbers on the GPU using the “XORWOW” pseudorandom number generator.

from numbapro.cudalib import curand
from numpy import empty

prng = curand.PRNG(rndtype=curand.PRNG.XORWOW)
rand = empty(1000000)
prng.uniform(rand)  # fill the array with uniform random numbers generated on the GPU
print rand[:10]
array([ 0.43845084,  0.4603647 ,  0.25021471,  0.49474377,  0.05301112,
        0.33769926,  0.39676252,  0.87441866,  0.48216683,  0.0428398 ])
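For readers curious about what “XORWOW” refers to, here is a minimal pure-Python sketch of the xorwow recurrence published by George Marsaglia: an xorshift generator combined with an additive counter. It is not cuRAND's implementation. The class name XorwowSketch and its uniform method are hypothetical, the initial state uses the constants commonly quoted from Marsaglia's paper rather than cuRAND's seeding, and the values will not match what cuRAND produces.

```python
MASK32 = 0xFFFFFFFF  # keep every intermediate value within 32 bits


class XorwowSketch(object):
    """Illustrative xorwow generator (hypothetical class, not a cuRAND API)."""

    def __init__(self):
        # Default state from Marsaglia's published example (an assumption
        # here; cuRAND derives its state from a seed differently).
        self.x, self.y, self.z = 123456789, 362436069, 521288629
        self.w, self.v = 88675123, 5783321
        self.d = 6615241

    def next_u32(self):
        t = self.x ^ (self.x >> 2)
        # Shift the state words down by one position; w receives the old v.
        self.x, self.y, self.z, self.w = self.y, self.z, self.w, self.v
        self.v = (self.v ^ ((self.v << 4) & MASK32)) ^ (t ^ ((t << 1) & MASK32))
        # The additive counter ("Weyl sequence") gives the generator its name.
        self.d = (self.d + 362437) & MASK32
        return (self.d + self.v) & MASK32

    def uniform(self, n):
        """Return n floats in [0, 1) by scaling the 32-bit outputs."""
        return [self.next_u32() / 4294967296.0 for _ in range(n)]


rand_sketch = XorwowSketch().uniform(10)
```

The state update uses only shifts, XORs, and one addition, which is part of why this family of generators suits hardware with many lightweight threads.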

Massive Parallelism with CUDA Python

Continuum Analytics recognized that achieving large speedups on some computations requires a more expressive programming interface with more detailed control over parallelism than libraries and automatic loop vectorization can provide. Therefore, NumbaPro has another important set of features that make up what is unofficially known as “CUDA Python”. NumbaPro exposes the CUDA programming model, just like in CUDA C/C++, but using pure Python syntax, so that programmers can create custom, tuned parallel kernels without leaving the comforts and advantages of Python behind. NumbaPro’s CUDA JIT (available via decorator or function call) compiles CUDA Python functions at run time, specializing them for the types you use, and its CUDA Python API provides explicit control over data transfers and CUDA streams, among other features.

[Figure: The Mandelbrot Set]

The following code example demonstrates this with a simple Mandelbrot set kernel. Notice that the mandel_kernel function uses the cuda.threadIdx, cuda.blockIdx, cuda.blockDim, and cuda.gridDim structures provided by NumbaPro to compute the global X and Y pixel indices for the current thread. As in other CUDA languages, we launch the kernel by inserting an “execution configuration” (CUDA-speak for the number of threads and blocks of threads to use to run the kernel) in brackets, between the function name and the argument list: mandel_kernel[griddim, blockdim](-2.0, 1.0, -1.0, 1.0, d_image, 20). You can also see the use of the to_host and to_device API functions to copy data to and from the GPU.

import numpy as np
from timeit import default_timer as timer
from numbapro import cuda
from numba import *

@cuda.jit(restype=uint32, argtypes=[f8, f8, uint32], device=True)
def mandel(x, y, max_iters):
  """
  Given the real and imaginary parts of a complex number,
  determine if it is a candidate for membership in the Mandelbrot
  set given a fixed number of iterations.
  """
  c = complex(x, y)
  z = 0.0j
  for i in range(max_iters):
    z = z*z + c
    if (z.real*z.real + z.imag*z.imag) >= 4:
      return i

  return max_iters

@cuda.jit(argtypes=[f8, f8, f8, f8, uint8[:,:], uint32])
def mandel_kernel(min_x, max_x, min_y, max_y, image, iters):
  height = image.shape[0]
  width = image.shape[1]

  pixel_size_x = (max_x - min_x) / width
  pixel_size_y = (max_y - min_y) / height

  startX = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x
  startY = cuda.blockDim.y * cuda.blockIdx.y + cuda.threadIdx.y
  gridX = cuda.gridDim.x * cuda.blockDim.x
  gridY = cuda.gridDim.y * cuda.blockDim.y

  for x in range(startX, width, gridX):
    real = min_x + x * pixel_size_x
    for y in range(startY, height, gridY):
      imag = min_y + y * pixel_size_y 
      image[y, x] = mandel(real, imag, iters)

gimage = np.zeros((1024, 1536), dtype = np.uint8)
blockdim = (32, 8)
griddim = (32,16)

start = timer()
d_image = cuda.to_device(gimage)
mandel_kernel[griddim, blockdim](-2.0, 1.0, -1.0, 1.0, d_image, 20)
d_image.to_host()  # copy the result back into gimage
dt = timer() - start

print "Mandelbrot created on GPU in %f s" % dt


On my Tesla K20c GPU, this CUDA Python Mandelbrot code runs over 1000 times faster than the pure Python version. 1000x may seem an unrealistic speedup, but keep in mind that we are comparing compiled, parallel, GPU-accelerated Python code to interpreted, single-threaded Python code on the CPU.
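The pure Python version mentioned above is not listed in this post, so the following is only a sketch of what such a baseline might look like, together with a serial CPU emulation of the kernel launch. The function names reference_image and emulated_launch, and the small image and grid sizes, are assumptions chosen for illustration. The emulation visits each (block, thread) pair in turn and reuses the kernel's index arithmetic, which makes it easy to check that the strided loops touch every pixel exactly once.

```python
def mandel(x, y, max_iters):
    """Escape-time count for the point x + yi (same logic as the device function)."""
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i
    return max_iters


def reference_image(min_x, max_x, min_y, max_y, width, height, iters):
    """Plain nested loops over all pixels: a serial, interpreted baseline."""
    px = (max_x - min_x) / width
    py = (max_y - min_y) / height
    return [[mandel(min_x + x * px, min_y + y * py, iters) for x in range(width)]
            for y in range(height)]


def emulated_launch(griddim, blockdim, min_x, max_x, min_y, max_y,
                    width, height, iters):
    """Run the kernel body once per (block, thread) pair, serially on the CPU."""
    image = [[None] * width for _ in range(height)]
    writes = 0
    px = (max_x - min_x) / width
    py = (max_y - min_y) / height
    grid_x = griddim[0] * blockdim[0]  # total threads along x (the x stride)
    grid_y = griddim[1] * blockdim[1]  # total threads along y (the y stride)
    for bx in range(griddim[0]):
        for by in range(griddim[1]):
            for tx in range(blockdim[0]):
                for ty in range(blockdim[1]):
                    # Global starting pixel for this emulated thread.
                    start_x = blockdim[0] * bx + tx
                    start_y = blockdim[1] * by + ty
                    for x in range(start_x, width, grid_x):
                        real = min_x + x * px
                        for y in range(start_y, height, grid_y):
                            image[y][x] = mandel(real, min_y + y * py, iters)
                            writes += 1
    return image, writes


WIDTH, HEIGHT, ITERS = 36, 24, 20
expected = reference_image(-2.0, 1.0, -1.0, 1.0, WIDTH, HEIGHT, ITERS)
image, writes = emulated_launch((2, 2), (4, 2), -2.0, 1.0, -1.0, 1.0,
                                WIDTH, HEIGHT, ITERS)
```

Here the image is larger than the emulated grid of 8 x 4 threads, so each thread handles several pixels, stepping by the total thread count in each dimension. On a GPU those threads run concurrently instead of one after another.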

NumbaPro provides Python developers with an easy entry into GPU-accelerated computing and a path for using increasingly sophisticated CUDA code with a minimum of new syntax and jargon. You can start with simple function decorators to automatically compile your functions, or use the powerful CUDA libraries exposed by NumbaPro. As you advance your understanding of parallel programming concepts and when you need expressive and flexible control of parallel threads, CUDA is available without requiring you to jump in on the first day.

NumbaPro is part of Anaconda Accelerate, a commercially licensed product from Continuum Analytics (NumbaPro is also available under a free license for academic users). It is built on top of the BSD-licensed, open-source Numba project, which itself relies heavily on the capabilities of the LLVM compiler. The GPU backend of NumbaPro utilizes the LLVM-based NVIDIA Compiler SDK.

To get started with NumbaPro, the first step is to download and install the Anaconda Python distribution, a “completely free enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing” that includes many popular packages (NumPy, SciPy, Matplotlib, IPython, etc.) and “conda”, a powerful package manager. Once you have Anaconda installed, install Anaconda Accelerate by typing conda install accelerate. Then check out the example programs available in Continuum Analytics’ GitHub repository. I also recommend that you check out the NumbaPro posts on Continuum Analytics’ blog.

To learn more about NumbaPro from its creators, please sign up for a GTC Express Webinar called “Pythonic Parallel Patterns for the GPU with NumbaPro” from Siu Kwan Lam, NumbaPro’s primary author at Continuum Analytics. We’ll also be showing you more about NumbaPro on Parallel Forall soon, starting with Mark Ebersole’s next CUDACast, which will take you through installing Anaconda Accelerate and writing your first GPU-accelerated Python function with NumbaPro.

Update: Check out CUDACasts Episode 10 for a video intro to NumbaPro.


About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started to provide a forum for those working in the field to share and discuss their work.
  • Giampiero Salvi

    Could you please indicate which modules you import in the fractal example (the mandel kernel)? I tried with

    from numpy import uint32
    from numbapro import cuda

    but I am still missing something
    Thank you

    • Check the iPython notebook here:

      But to answer your question, you need:

      import numpy as np
      from pylab import imshow, show
      from timeit import default_timer as timer
      from numbapro import cuda
      from numba import *

      • Giampiero Salvi

        Hi Mark,
        thank you! Now it’s working, provided that I change the “ThreadIdx” attributes to “threadIdx”, that is with lower case “T”.

        • That’s a typo, which I have now fixed — thanks!