In this post, I will discuss techniques you can use to maximize the performance of your GPU-accelerated MATLAB® code. First I explain how to write MATLAB code which is inherently parallelizable. This technique, known as vectorization, benefits all your code whether or not it uses the GPU. Then I present a family of function wrappers—bsxfun, pagefun, and arrayfun—that take advantage of GPU hardware, yet require no specialist parallel programming skills. The most advanced function, arrayfun, allows you to write your own custom kernels in the MATLAB language.
If these techniques do not provide the performance or flexibility you were after, you can still write custom CUDA code in C or C++ that you can run from MATLAB, as discussed in our earlier Parallel Forall posts on MATLAB CUDA Kernels and MEX functions.
All of the features described here are available out of the box with MATLAB and Parallel Computing Toolbox™.
Mobile phone signal strength example
Throughout this post, I will use an example to help illustrate the techniques. A cellular phone network wants to map its coverage to help plan for new antenna installations. We imagine an idealized scenario with M = 25 cellphone masts, each H = 20 meters in height, evenly spaced on an undulating 10km x 10km terrain. Figure 1 shows what the map looks like.
On the GPU, in the following listing we define a number of variables including:
map: An N x 3 height field in a 10km x 10km grid (N = 10,000);
masts: An M x 3 array of antenna positions, at height H;
AntennaDirection: A 3 x M array of vectors representing the orientation of each antenna.
In an earlier post we showed how MATLAB® can support CUDA kernel prototyping and development by providing an environment for quick evaluation and visualization using the CUDAKernel object. In this post I will show you how to integrate an existing library of both host and device code implemented in C++ or another CUDA-accelerated language using MEX. With MEX you can extend and customize MATLAB, or use MATLAB as a test environment for your production code.
The MATLAB MEX compiler allows you to expose your libraries to the MATLAB environment as functions. You write your entry point in C, C++ or Fortran as a modified main() function which MATLAB invokes. MEX provides a framework for compiling this code, as well as an API for interacting with MATLAB and MATLAB data in your source code.
MATLAB’s Parallel Computing Toolbox™ provides constructs for compiling CUDA C and C++ with nvcc, and new APIs for accessing and using the gpuArray datatype which represents data stored on the GPU as a numeric array in the MATLAB workspace.
Feature Detection Example
I am going to use a feature detection example from MATLAB’s documentation for Computer Vision System Toolbox™. This uses tracked features to remove camera shake from an in-car road video. You will need MATLAB®, Parallel Computing Toolbox™, Image Processing Toolbox™ and Computer Vision System Toolbox™ to run the code. You can request a trial of these products at www.mathworks.com/trial. This example also depends on the OpenCV Computer Vision library, compiled with CUDA support.
Features are an essential prerequisite for many Computer Vision tasks; in this case, for instance, they might also be used to determine the motion of the car or to track other cars on the road.
To set up the example environment, I am using the following MATLAB code to load the video and display the first two frames superimposed (Figure 1). Continue reading →
This week’s Spotlight is on Dr. Adam Gazzaley of UC San Francisco, where he is the founding director of the Neuroscience Imaging Center and an Associate Professor in Neurology, Physiology and Psychiatry. His work was featured in Nature in September 2013.
NVIDIA: Adam, how are you using GPU computing in your research? Adam: We are working with a distributed team (UCSF, Stanford, UCSD and Eye Vapor) to CUDA-enable EEG (electroencephalography) processing to increase the fidelity of real-time brain activity recordings.
The goal is to more accurately represent the brain sources and neural networks, as well as to perform real-time artifact correction and mental state decoding. Not only will this improve the visualization capabilities, but more importantly, it will move EEG closer to being a real-time scientific tool.
Where CUDA and the GPU really excel is with very intense computations that use large matrices. We generate that type of data when we’re recording real-time brain activity across many electrodes.
NVIDIA: Describe the hardware/software platform currently in use by the development team. Adam: We primarily use Python, MATLAB and C/C++. Our software is routinely executed on a range of platforms, including Linux (running Fedora 18), Windows 7, and Mac OS (Snow Leopard and Lion).Hardware we currently make use of includes NVIDIA Tesla K20s (for calculations), NVIDIA Quadro 5000s (for visualization) and two Intel Quad-core CPUs.
We use Microsoft Visual Studio 2010 x64 with CUDA 5.0, with the TCC driver for the Tesla GPUs. The Nvidia Nsight debugging tools are used with Visual Studio to optimize the code performance and get a better idea of what is happening ‘under the hood’ of the GPUs in real time. Continue reading →
This guest post by Daniel Armyr and Dan Doherty from MathWorks describes how you can use MATLAB to support your development of CUDA C and C++ kernels. You will need MATLAB, Parallel Computing Toolbox™, and Image Processing Toolbox™ to run the code. You can request a trial of these products at www.mathworks.com/trial. For a more detailed description of this workflow, refer to the MATLAB for CUDA Programmers webinar and associated demo files.
NVIDIA GPUs are becoming increasingly popular for large-scale computations in image processing, financial modeling, signal processing, and other applications—largely due to their highly parallel architecture and high computational throughput. The CUDA programming model lets programmers exploit the full power of this architecture by providing fine-grained control over how computations are divided among parallel threads and executed on the device. The resulting algorithms often run much faster than traditional code written for the CPU.
While algorithms written for the GPU are often much faster, the process of building a framework for developing and testing them can be time-consuming. Many programmers write CUDA kernels integrated into C or Fortran programs for production. For this reason, they often use these languages to iterate on and test their kernels, which requires writing significant amounts of “glue code” for tasks such as transferring data to the GPU, managing GPU memory, initializing and launching CUDA kernels, and visualizing kernel outputs. This glue code is time-consuming to write and may be difficult to change if, for example, you want to run the kernel on different input data or visualize kernel outputs using a different type of plot.
Using an image white balancing example, this article describes how MATLAB® supports CUDA kernel development by providing a language and development environment for quickly evaluating kernels, analyzing and visualizing kernel results, and writing test harnesses to validate kernel results. Continue reading →