Numba is an open-source just-in-time (JIT) Python compiler that generates native machine code for X86 CPU and CUDA GPU from annotated Python Code. (Mark Harris introduced Numba in the post “NumbaPro: High-Performance Python with CUDA Acceleration”.) Numba specializes in Python code that makes heavy use of NumPy arrays and loops. In addition to JIT compiling NumPy array code for the CPU or GPU, Numba exposes “CUDA Python”: the CUDA programming model for NVIDIA GPUs in Python syntax.
By speeding up Python, we extend its ability from a glue language to a complete programming environment that can execute numeric code efficiently.
From Prototype to Full Dataset with @cuda.jit
When doing exploratory programming, the interactivity of IPython Notebook and a comprehensive collection of scientific libraries (e.g. SciPy, Scikit-Learn, Theano, etc.) allow data scientists to process and visualize their data quickly. There are times when a fast implementation of what you need isn’t in a library, and you have to implement something new. Numba helps by letting you write pure Python code and run it with speed comparable to a compiled language, like C++. Your development cycle shortens when your prototype Python code can scale to process the full dataset in a reasonable amount of time.
Working with Dr. Alex Dimakis and his team at UT Austin, we implemented their densest-k-subgraph (DkS) algorithm . Our goal was to extract the densest domain from the 2012 WebDataCommon pay-level-domain hyperlink graph using one NVIDIA Tesla K20 GPU accelerator. We developed the entire application using NumPy for array operations, Numba to JIT compile Python to CUDA, NumbaPro for GPU sorting and cuBLAS routines, and Bokeh for plotting the results. Continue reading →
So far in the CUDA Python mini-series on CUDACasts, I introduced you to using the @vectorize decorator and CUDA libraries, two different methods for accelerating code using NVIDIA GPUs. In today’s CUDACast, I’ll be demonstrating how to use the NumbaPro compiler from Continuum Analytics to write CUDA Python code which runs on the GPU.
In CUDACast #12, we’ll continue using the Monte Carlo options pricing example, and I’ll show how to write the step function in CUDA Python rather than using the @vectorize decorator. In addition, by using the nvprof command-line profiler, we’ll be able to see the speed-up we’re able to achieve by writing the code explicitly in CUDA.
In the previous episode of CUDACasts I introduced you to NumbaPro, the high-performance Python compiler from Continuum Analytics, and demonstrated how to accelerate simple Python functions on the GPU. Continuing the Python theme, today’s CUDACast demonstrates NumbaPro’s support for CUDA libraries.
The optimized algorithms in GPU-accelerated libraries often provide the easiest way to accelerate applications. NumbaPro includes a Python API interface to the cuBLAS, cuFFT, and cuRAND libraries. In CUDACasts episode #11 I show you how to use cuRAND to accelerate random-number generation for a Python Monte Carlo options pricing example, achieving a 17x overall speed-up.
This week’s CUDACast continues the Parallel Forall Python theme kicked off in last week’s post by Mark Harris, demonstrating exciting new support for CUDA acceleration in Python with NumbaPro. This video is the first in a 3-part series showing various ways to accelerate your Python code on NVIDIA GPUs.
Python is a high-productivity dynamic programming language that is widely used in science, engineering, and data analytics applications. There are a number of factors influencing the popularity of python, including its clean and expressive syntax and standard data structures, comprehensive “batteries included” standard library, excellent documentation, broad ecosystem of libraries and tools, availability of professional support, and large and open community. Perhaps most important, though, is the high productivity enabled by a dynamically typed, interpreted language like Python. Python is nimble and flexible, making it a great language for quick prototyping, but also for building complete systems.
But Python’s greatest strength can also be its greatest weakness: its flexibility and typeless, high-level syntax can result in poor performance for data- and computation-intensive programs. For this reason, Python programmers concerned about efficiency often rewrite their innermost loops in C and call the compiled C functions from Python. There are a number of projects aimed at making this optimization easier, such as Cython, but they often require learning a new syntax. Ideally, Python programmers would like to make their existing Python code faster without using another programming language, and, naturally, many would like to use accelerators to get even higher performance from their code.
NumbaPro: High Productivity for High-Performance Computing
In this post I’ll introduce you to NumbaPro, a Python compiler from Continuum Analytics that can compile Python code for execution on CUDA-capable GPUs or multicore CPUs. Since Python is not normally a compiled language, you might wonder why you would want a Python compiler. The answer is of course that running native, compiled code is many times faster than running dynamic, interpreted code. NumbaPro works by allowing you to specify type signatures for Python functions, which enables compilation at run time (this is “Just-in-Time”, or JIT compilation). NumbaPro’s ability to dynamically compile code means that you don’t give up the flexibility of Python. This is a huge step toward providing the ideal combination of high productivity programming and high-performance computing. Continue reading →