Copperhead: Data Parallel Python

Programming environments like C and Fortran allow complete and unrestricted access to computing hardware, but often require programmers to understand the low-level details of the hardware they target. Although these efficiency-oriented systems are essential to every computing platform, many programmers prefer to use higher level programming environments like Python or Ruby, focused on productivity rather than absolute performance. Productivity-focused programmers solving large or intensive problems do need high performance, and many seek to exploit parallel computing, but without the costs of understanding low-level hardware details or programming directly to a particular machine.

Copperhead is a project that aims to enable productivity-focused programmers to take advantage of parallel computing, without explicitly coding to any particular machine. Copperhead programs use familiar Python primitives such as map and reduce, and they execute in parallel on both CUDA-enabled GPUs as well as multicore
CPUs.

Parallel Hello World: `axpy`

Let’s start with an example: below find Copperhead code for axpy, the “hello world” of parallel programs. (axpy is the type-generic form of saxpy. See Six Ways to SAXPY for more.)

from copperhead import *
import numpy as np

@cu
def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

n = 1000000
a = 2.0
x = np.random.rand(n)
y = np.random.rand(n)
with places.gpu0:
    gpu_result = axpy(a, x, y)
with places.openmp:
    cpu_result = axpy(a, x, y)

Copperhead programs are embedded in Python programs using a clearly delineated subset of the Python language. We use a Python function decorator @cu, to mark the functions which are written in the subset of Python supported by Copperhead. Python programs execute normally, until the first call from the Python interpreter to a Copperhead decorated function. From that point, the program executes via the Copperhead runtime until it returns from the original entry-point function back to the Python interpreter. Data is garbage collected using Python’s standard garbage collector.

Looking at the body of axpy, we see an element-wise operation being performed over the two input arrays, applied via a Python list comprehension. This list comprehension will be executed in parallel. Equivalently, the user could have written axpy using map and a lambda anonymous function.

@cu
def axpy(a, x, y):
    return map(lambda xi, yi: a * xi + yi, x, y)

Or via a named, nested function.

@cu
def axpy(a, x, y):
    def el(xi, yi):
        return a * xi + yi
    return map(el, x, y)

All three forms perform identically, programmers are free to use whichever form they find most convenient.

Python’s map operates element-wise over sequences, passing elements from the input sequences to the function it invokes. What about auxiliary data that is not operated on element-wise, but needs to be used in an element-wise computation? Copperhead programs use closures to broadcast this data; in this example, you can see that the element-wise function uses the scalar a from the enclosing scope. This is a common idiom in Python code, and is how Copperhead programs broadcast data to element-wise operations.

Another point of interest in this example is that it interoperates with Python’s widely-used numpy library to provide homogeneous, well-typed arrays. In this example, axpy is called with 64-bit floating point values (making it a daxpy), since that’s the default data type for floating point literals in Python (a = 2.0) and also for random arrays using numpy.random.rand. To call this program with 32-bit floating point values (making it a saxpy), the programmer would use the standard numpy mechanisms:

a = np.float32(2.0)
x = np.array(np.random.rand(n), dtype=np.float32)
y = np.array(np.random.rand(n), dtype=np.float32)

Finally, note that this function contains no loops. In general, Copperhead does not support side effects, where the value of a variable is changed. This is a significant restriction, but enables safe parallelization of Copperhead programs. For example, the following program is a legal implementation of axpy in Python that overwrites one of the inputs using a loop:

def side_effect_axpy(a, x, y):
    for i in range(len(y)):
        y[i] = a * x[i] + y[i]

This style of programming is not supported in Copperhead. In general, safely parallelizing loops such as these is a difficult problem, and there are loops that cannot be parallelized. Instead of using loops, Copperhead programs use data parallel primitives that are designed to always be parallelizeable, giving the runtime flexibility to schedule computations as necessary for various parallel platforms. For cases where sequential iteration is required, Copperhead programs use tail recursion.

As a consequence of this decision, all branches in Copperhead must terminate with a return statement.

Copperhead Primitives

Copperhead programs use a set of familiar data parallel primitives. Where possible, we use the same interface as the native Python versions of the same primitives. Here are some of the most important.

map(f, ...)
Applies the function f element-wise across a number of sequences. The function f should have arity equal to the number of arguments. Each sequence must have the same length. For practical reasons, this arity is limited to 10. For programmer convenience, list comprehensions are treated equivalently to map, so the following two lines of code are equivalent.
```
z = [fn(xi, yi) for xi in zip(x, y)]
z = map(fn, x, y)
```
reduce(fn, x, init)
Applies binary function fn cumulatively to the items of x so as to reduce them to a single value. The given function fn is required to be both associative and commutative, and unlike Python’s built-in reduce, parallel semantics mean that the elements are not guaranteed to be reduced from left to right.
filter(fn, x)
Returns a sequence containing those items of x for which fn(x) returns True. The order of items in sequence x is preserved.
gather(x, indices)
Returns the sequence [x[i] for i in indices]
scatter(src, indices, dst) Creates a copy of dst and updates it by scattering each src[i] to location indices[i] of the copy. If any indices are duplicated, one of the corresponding values from src will be chosen arbitrarily and placed in the result. The updated copy is returned.
scan(fn, x)
Returns the inclusive scan (also known as prefix sum) of fn over sequence x. Also, rscan, exclusive_scan, and exclusive_rscan.
zip(...)
Returns a sequence of tuples, given several sequences. All input sequences must have the same length.
unzip(x)
Returns a tuple of sequences, given a sequence of tuples.
sort(fn, x) Returns a sorted copy of x, given binary comparator fn(a,b) that returns True if a < b.
indices(x)
Returns a sequence containing all the indices for elements in x.
replicate(x, n)
Returns a sequence containing n copies of x, where x is a scalar.
range(n)
Returns a sequence containing [0, n).
bounded_range(a, b)
Returns a sequence containing [a, b)

Copperhead Type System

Python programs are famously duck-typed: if it walks like a duck and swims like a duck and quacks like a duck, it must be a duck! Python lists, for example, can contain elements of any and all types. This makes programming easier, but imposes large performance penalties due to the cost of introspection. A native Python implementation of axpy would require inspecting every element of the input vectors in order to discover what kind of add operation to perform. To avoid these costs, Copperhead uses type inference. All procedures marked with the @cu function decorator are type checked by the Copperhead runtime. Procedures which don’t pass type inference are rejected.

Copperhead has a simple type system that operates on only a few basic types: scalars, tuples, and sequences. Classes and other user defined types are not supported.

Copperhead’s type system interoperates with the numpy library, widely used to do numeric computing in Python. numpy provides numeric types and homogeneously typed arrays. Copperhead uses the following five scalar types from numpy.

np.float32 : 32-bit floating point number
np.float64 : 64-bit floating point number
np.int32 : 32-bit integer number
np.int64 : 64-bit integer number
np.bool : Boolean

Copperhead also interprets Python types using numpy conventions:

the bool type is interpreted naturally as a Boolean;
the float type is interpreted as a 64-bit floating point number, following numpy convention;
the int type is interpreted as a 64-bit integer, again following numpy convention. Copperhead currently does not implement an infinitely large integer type with the same meaning as Python’s int type.

In addition to the five basic scalar types, Copperhead programs can use tuples of these types, and these tuples can be nested. Although Copperhead doesn’t support classes, tuples can often serve as data structures which combine multiple data elements. Using tuples is fairly straightforward. For example, the following procedure compares the k element of two key-value pairs, returning the key-value pair with the smallest key (note I used two different ways of expressing the tuples here just to demonstrate that both are supported).

@cu
def compare(kv0, (k1, v1)):
    k0, v0 = kv0
    if k0 < k1:
       return kv0
    else:
       return k1, v1

You access elements of a tuple by unpacking it into multiple elements, as commonly done in Python code. For example, k0, v0 = kv0 unpacks kv0 into two elements. This unpacking can be done in a bind statement, as above, or in the parameters to a function.

You create tuples by assigning multiple elements to one variable, or by returning multiple elements. For example kv0 = k0, v0 packs two elements into a tuple. return x, y returns a tuple of two elements.

Unlike standard Python, Copperhead does not support dynamic, random access to tuples via the [] operator. Since tuples are heterogeneously typed, random access to tuples would require dynamic typing, which Copperhead does not support.

Along with scalars and tuples, Copperhead programs also use sequences. Sequences are similar to Python lists, with the restriction that, like numpy arrays, they must be homogeneously typed. Sequences can be indexed using the [] operator, but only when reading elements. They cannot be assigned to using the [] operator in Copperhead code, since Copperhead does not support side effects.

When calling a Copperhead function, the inputs must be types that the Copperhead runtime understands. These types are:

the 5 scalar types from numpy, along with the 3 scalar types from Python that we outlined earlier;
homogeneously typed Python lists, where each element is one of the scalar types mentioned earlier, or a tuple of those scalar types;
one-dimensional numpy arrays, where each element is a supported scalar type;
tuples of the five scalar types and of sequence types;
copperhead.cuarray. This sequence type is returned by Copperhead programs. It manages memory automatically across heterogeneous memory spaces, conserving bandwidth by lazily transferring data.

The runtime converts all sequence types to copperhead.cuarray types before calling a Copperhead function. This is done to prepare the data and marshal it for GPU execution, if necessary. Although the runtime can consume Python lists and numpy arrays, conversion overheads can dominate if they’re done repeatedly. If Copperhead functions are being called from within a loop in the Python interpreter, we recommend explicitly constructing copperhead.cuarray types to avoid unnecessary data conversions.

Literals in Copperhead are weakly typed: an integer literal will be given the type of np.int64, and a floating point literal will be given the type of np.float64, unless it is used in a context with some
other type.

Functions can be used as values in Copperhead programs, but at present lambdas cannot escape the scope at which they are defined. In practice, this means that Copperhead programs make use of nested functions that close over values as arguments to primitives like map, but a program cannot accept a function as an argument when called from Python, and it cannot produce a function and return it to Python. This restriction could be lifted in the future.

Copperhead Runtime

The programmer controls the execution “place” (e.g. CPU or GPU) using Python with statements.

@cu
def foo():
    #...

def on_gpu():
    with places.gpu0:
        foo()

def on_cpu():
    with places.openmp:
        foo()

The Copperhead runtime lazily moves data between memory spaces as necessary. For example, when you call a Copperhead function on the GPU, the runtime makes sure the data is on the GPU before the function starts, and the result will also be in the GPU memory. If a subsequent function uses that result as an input, and is invoked on the same GPU, the data will not be moved. If it is invoked on a different place, for example on the CPU, the data will be moved to the CPU before executing the function.

Returned values from Copperhead procedures are futures: in some cases, control returns immediately to the Python interpreter, while the computation proceeds outside of the interpreter. Synchronization may occur when dereferencing a copperhead.cuarray object from the Python interpreter, and successive calls to Copperhead functions always wait for previous calls to finish before beginning, to preserve program order.

The Copperhead runtime automatically invokes compilers to create binaries for Copperhead functions. These binaries are cached persistently in a __pycache__ directory, which contains Python extensions for every Copperhead function that has been called from the module, as well as the generated C++/CUDA source code for the function. Once a function has been compiled and cached, function call overhead from the Python interpreter is generally between 10-100 microseconds, depending on the number of arguments.

The Copperhead compiler aggressively fuses primitive operations together, to optimize memory bandwidth usage. It generates Thrust code to implement computations, and the output of the compiler can be examined in the __pycache__ directory after a program has been executed.

Conclusion

Copperhead is a Python environment for writing platform independent, parallel code. Copperhead is built on Thrust, and exposes a good deal of the functionality of Thrust to programmers, without the need to understand complex C++ iterator types (for more about Thrust, see Expressive Algorithmic Programming with Thrust). At present, Copperhead programs can only be invoked on flat, one-dimensional sequences, although we envision removing these restrictions with time. Copperhead is an experimental project at NVIDIA Research, and can be downloaded at http://copperhead.github.com.

In a follow up to this post, I plan to provide a more in depth example of using Copperhead, so keep reading Parallel Forall!