CUDA Spotlight: Dr. Cris Cecka on GPU-Accelerated Computational Mathematics

Our Spotlight is on Dr. Cris Cecka, a research scientist and lecturer in the new Institute for Applied Computational Science (IACS) at Harvard University. Harvard has been a CUDA Center of Excellence since 2009, led by Dr. Hanspeter Pfister, IACS Director. Cris is currently also performing research with the Mathematics Department at the Massachusetts Institute of Technology. Previously, Cris was a graduate student in the Institute for Computational and Mathematical Engineering (ICME) at Stanford University with Prof. Eric Darve.

NVIDIA: Cris, what are your primary research interests?
Cris: My research focuses on computational mathematics, particularly for interdisciplinary applications in science and engineering. In the past, I’ve used CUDA for non-linear PDEs (partial differential equations) and real-time computing with applications in simulation and virtual surgery.

More recently, I have become interested in mathematical and computational abstractions to produce efficient, library-quality scientific software. Specifically, I have focused on generalized n-body problems, including integral equation methods, particle methods, and structured dense matrices.

As part of my work, I’ve released several software libraries, including FMMTL to aid in the research, development, and use of kernel matrices and CrowdCL to aid in the use of GPU computing within a browser.

NVIDIA: Tell us more about FMMTL. Is it GPU-accelerated?

Cris: FMMTL is a research code that explores fast algorithms (such as Treecode, FMM, H-matrix, and Butterfly) for kernel matrices and other structured dense matrices. Why structured? Well, plenty of algorithms exist for dense matrices, e.g. all of BLAS and LAPACK. These use the values of the matrix to compute products, eigenvalues, factorizations, etc. But there are huge classes of problems, the generalized n-body problems, where we never actually want to construct all of the elements of the matrix. These problems can be accelerated either by compressing rows, columns, or blocks of the matrix or by avoiding computing elements of the matrix altogether.

By avoiding the computation of all of the elements or delaying the computation until the matrix element is requested, the amount of data required to define the matrix is reduced to O(N), which is great in terms of computational intensity! There is very little data to access and lots and lots of computation.
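To make the idea of an implicit matrix concrete, here is a minimal Python sketch of a direct n-body matrix-vector product in which each matrix element is computed on demand and never stored. It is not FMMTL code: the function names and the 1/|x - y| kernel are assumptions chosen only for illustration.

```python
import math

def laplace_kernel(x, y):
    """Hypothetical kernel K(x, y) = 1 / |x - y|, with K(x, x) taken as 0."""
    r = math.dist(x, y)
    return 1.0 / r if r > 0.0 else 0.0

def implicit_matvec(kernel, targets, sources, charges):
    """Compute result[i] = sum_j K(targets[i], sources[j]) * charges[j].

    Each matrix element is generated when it is needed and then discarded,
    so storage stays O(N) while the work is O(N^2).
    """
    return [
        sum(kernel(t, s) * c for s, c in zip(sources, charges))
        for t in targets
    ]

# Three points in the plane, used as both sources and targets.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
charges = [1.0, 2.0, 4.0]
potentials = implicit_matvec(laplace_kernel, points, points, charges)
```

Only the points and charges are ever held in memory; the 3-by-3 matrix itself never exists as an array.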

For this reason, these computations are a great fit for GPUs. Notoriously, GPUs are hard to “feed” — they compute much, much faster than they can access data to compute on. Indeed, sparse linear algebra usually gets only 1-5% of the peak performance because there is approximately one operation per piece of data read from memory.

Dense linear algebra does much, much better, but can still be difficult to tune. With implicit dense matrices, where the elements are computed on demand and each piece of data read from memory can be reused N times or more, it is very easy to reach 80% of peak performance almost immediately. For this reason, FMMTL attempts to use GPUs for these direct n-body computations whenever possible.
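A rough back-of-the-envelope estimate shows why the reuse is so high. It assumes N sources and N targets in d dimensions, one charge per source, and one output value per target; these counts are illustrative assumptions, not measurements from FMMTL.

```latex
\[
\frac{\text{kernel evaluations}}{\text{values moved}}
  = \frac{N^2}{(2d + 2)\,N}
  = \frac{N}{2d + 2},
\qquad
d = 3,\ N = 10^4 \;\Rightarrow\; \frac{10^4}{8} = 1250 .
\]
```

Under these assumptions, every value moved to or from memory supports on the order of a thousand kernel evaluations, compared with roughly one operation per value in the sparse case described above.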

NVIDIA: What approaches have you used to apply the CUDA platform to your work?
Cris: My research with GPU computing has really changed the way I approach software and algorithm development. For maximum robustness, our algorithms should be expressed in terms of parallel primitives that can be performed efficiently on nearly any architecture.

With the development of the Thrust library and the recently proposed parallel computing additions to the C++ standards being led by NVIDIA, expressing our algorithms in terms of these optimized primitives has never been more important. Primitives like partition, prefix sum, reduce, and map, when fully abstracted, act as powerful components of efficient cross-platform generic algorithms. This is a more functional-programming-like way of writing code, but it is compact and efficient. I wish more people used the C++ std:: library (or equivalents) and NVIDIA’s Thrust.
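As a small illustration of the four primitives named above, here is a sketch that uses only the Python standard library. It is a sequential stand-in, not Thrust code; the helper names `partition` and `exclusive_scan` are assumptions written for this example. In Thrust, the rough counterparts would be `thrust::transform`, `thrust::reduce`, `thrust::exclusive_scan`, and `thrust::partition`.

```python
from functools import reduce
from itertools import accumulate
import operator

def partition(pred, items):
    """Stable partition: items satisfying pred come first, then the rest."""
    items = list(items)
    return [x for x in items if pred(x)] + [x for x in items if not pred(x)]

def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[:i]."""
    return list(accumulate(values, initial=0))[:-1]

values = [3, 8, 1, 6, 4]

squares = list(map(lambda v: v * v, values))            # map
total = reduce(operator.add, squares, 0)                # reduce
offsets = exclusive_scan(values)                        # prefix sum
evens_first = partition(lambda v: v % 2 == 0, values)   # partition

print(squares)      # [9, 64, 1, 36, 16]
print(total)        # 126
print(offsets)      # [0, 3, 11, 12, 18]
print(evens_first)  # [8, 6, 4, 3, 1]
```

Because the algorithm is written only in terms of these four operations, swapping in a parallel implementation of each primitive would leave the calling code unchanged.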

Using the same kind of advanced C++ techniques, we also create robust and reusable primitives that dispatch based on data structure, available co-processors, available threading, etc. Currently, I’m working with Wesley Chen to wrap up a generalized n-body direct evaluation that is provably optimal in terms of distributed memory communication and takes advantage of available GPUs and threading appropriately. This is written with a general std::-like interface and can be applied far beyond the typically considered n-body problems. Writing code in this way allows us to think about the concepts and algorithms we’re trying to express rather than the details of the implementation, while still benefiting from CPU and GPU parallelization.

NVIDIA: Tell us about the course you are teaching.
Cris: I developed and teach Harvard’s parallel computing course, CS205, which is a core course for the IACS master’s program and is also offered online through the Harvard Extension School.

We wanted to make this course accessible to a broad range of students, and we chose to teach the course in Python. We cover MapReduce using MRJob, distributed computing using MPI4py, and CUDA with PyCUDA.

The choice to run the course in Python made it much more accessible for a broad range of students — we’ve had students from the social sciences and government who are simply interested in accelerating their data processing Python/Matlab scripts and producing better software.

Clearly, the long exalted notion that the average programmer will need to understand and develop parallel programs is becoming a reality.

NVIDIA: How has learning the CUDA programming model helped your students?
Cris: One student in particular, Rebecca Perry, had a fantastic final project on accelerating 3D digital holographic microscopy using CUDA.

This is an imaging technique of encoding 3D volumes in 2D images that allows the recording frame rate to be increased dramatically. The 3D data must then be reconstructed with back-propagation of the scattered light, which is very computationally demanding. Furthermore, experimentalists must fit parameters of the back-propagation and have an intuition for how the resulting data changes due to the parameters. The initial guesses for the parameters were constructed by hand.

Her team was using Holopy, an open source Python package, to recover the 3D information given the data and parameters… slowly. Fitting the parameters to a single frame could take minutes. Using the GPU computing techniques learned in class, Rebecca supplemented Holopy and achieved real-time manipulation of the input parameters via a GUI.

The speedup led to new insights into the imaging technique. Failures and discrepancies became much more obvious once the data could be manipulated in real time; previously, these features were shrouded by computation.

NVIDIA: How did you first hear about CUDA?
Cris: I first worked with CUDA in my graduate research at Stanford University in 2008. My first GPU was the GeForce 8800 GTX, one of the first GPUs to be truly CUDA-enabled. I know I had it a lot easier than pre-2007 “graphics hijackers,” but the GeForce 8 series still had its challenges in terms of GPU computing.

I’ve been impressed with how NVIDIA’s programming model has evolved since then. CUDA forces a user to inherently think about their problem in a hierarchical way — breaking it into smaller and smaller problems with higher and higher degrees of parallelism.

This programming model follows the architecture of NVIDIA’s GPUs closely. With algorithm-architecture co-design, the same hierarchical strategies appear in CPU caches, distributed memory algorithms, threading strategies, etc. I think the skills acquired in mapping ideas to CUDA and GPU computing are similar to the skills required to write cache-efficient, parallel code in other contexts.
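The hierarchical way of thinking described above can be sketched with a block-wise reduction. This is plain sequential Python, not CUDA: each call to `block_sums` stands in for a level of independent blocks that could run in parallel, and the function names and default block size are assumptions made for this example.

```python
def block_sums(values, block_size):
    """One level: split the input into blocks and reduce each block independently."""
    return [
        sum(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]

def hierarchical_sum(values, block_size=4):
    """Reduce block-wise partial sums level by level until one value remains."""
    if block_size < 2:
        raise ValueError("block_size must be at least 2")
    partial = list(values)
    if not partial:
        return 0
    while len(partial) > 1:
        # Every block at this level is independent of the others.
        partial = block_sums(partial, block_size)
    return partial[0]

result = hierarchical_sum(range(1, 101))
```

The same pattern of independent work on small pieces followed by a combination step is what shows up again in cache blocking and in distributed-memory reductions.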

NVIDIA: What excites you the most about your work?
Cris: Enabling new, better, and faster science by developing new and easier-to-use fast algorithms for classes of dense matrices!