This week’s Spotlight is on Patrick Roye of Luna, Inc.
Patrick works on accelerating Luna’s processing algorithms using GPUs. He and a team of engineers and scientists are developing a prototype system that uses CUDA to calculate the shape of a fiber-optic sensor in real time.
Luna’s shape-sensing systems, which are currently in development, will be used to guide the next generation of medical robotic systems safely through a patient’s body. Read Patrick’s full Spotlight here. Excerpt:
NVIDIA: What are some applications of Luna’s technology?
Patrick: One of our key target markets is healthcare, including the area of Minimally Invasive Surgery (MIS). Luna’s shape-sensing systems, which are currently in development, calculate the shape of fiber-optic sensors in real time.
NVIDIA: Why did you choose to work with GPUs?
Patrick: The processing for our shape-sensing technology was initially developed on FPGAs, which allowed us to transfer and process data at extremely low latencies, on the order of milliseconds. But when higher levels of accuracy required us to increase both the number of points and the complexity of our algorithms, the FPGAs we were using were no longer a viable option.
Fortunately, at the same time the door closed on our FPGAs, NVIDIA opened a window with the announcement of GPUDirect RDMA. Since we had used CUDA a year earlier to accelerate our strain and temperature sensing calculations, we already had an idea of the advantages of GPU-accelerated processing. With GPUDirect RDMA and CUDA-accelerated processing, we determined that we could perform data acquisition and minimal processing on an FPGA, transfer our data directly to the GPU for processing and then transfer the results back to the FPGA fast enough to meet our real-time requirements.
NVIDIA: What approaches did you find useful for developing on the CUDA platform?
Patrick: The algorithm requires over 100 kernels, operating on tens of thousands of data points. All kernels must complete before the next data set arrives from the FPGA, so every kernel had to be optimized to run as fast as possible. Here are a few tips I learned from this extreme optimization process.
- Get it working first. There’s no point in doing something fast if you’re doing it wrong.
- Take time to generate comprehensive unit tests for each of the kernels. Once you begin optimizing, these unit tests will be invaluable for ensuring your optimizations don’t introduce new processing bugs.
- Implement each kernel a few different ways. There were a few times when an implementation I was almost certain would be slower turned out to be the fastest one. Additionally, thinking through multiple solutions to one kernel may give you an idea that helps accelerate a different kernel later on.
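The three tips above can be combined into one small workflow: keep a simple reference version, check every candidate against it, and only then compare timings. The sketch below is a sequential Python stand-in, not CUDA and not Luna's code. The boxcar filter and the function names `boxcar_reference`, `boxcar_running_sum`, and `agree` are hypothetical choices made for illustration.

```python
import math
import random
import timeit


def boxcar_reference(x, w):
    """Straightforward version: re-sum every window. Easy to trust, slow."""
    return [sum(x[i:i + w]) / w for i in range(len(x) - w + 1)]


def boxcar_running_sum(x, w):
    """Candidate optimization: slide the window and update one running sum."""
    acc = sum(x[:w])
    out = [acc / w]
    for i in range(w, len(x)):
        acc += x[i] - x[i - w]
        out.append(acc / w)
    return out


def agree(a, b, tol=1e-9):
    """Element-wise comparison with a tolerance, because reordering
    floating-point sums changes the low-order bits."""
    return len(a) == len(b) and all(
        math.isclose(p, q, rel_tol=tol, abs_tol=tol) for p, q in zip(a, b)
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the check is repeatable
    data = [rng.uniform(-1.0, 1.0) for _ in range(5_000)]
    expected = boxcar_reference(data, 64)
    for impl in (boxcar_reference, boxcar_running_sum):
        # Correctness first: a fast wrong answer is not worth timing.
        assert agree(impl(data, 64), expected), impl.__name__
        seconds = timeit.timeit(lambda: impl(data, 64), number=3) / 3
        print(f"{impl.__name__}: {seconds * 1e3:.2f} ms")
```

On a GPU the same pattern would apply per kernel, with the comparison run on results copied back to the host and the timing taken with a GPU-side timer rather than `timeit`.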
NVIDIA: Tell us about some of the computations performed by the many CUDA kernels you use.
Patrick: Our algorithms employ FFTs, filters, complex integrals, complex derivatives, phase unwrapping, and a host of proprietary algorithms. We use cuFFT for our large FFTs, and everything else is custom. The most difficult algorithm to parallelize was the final shape calculation, which integrates coordinate transform matrices to calculate the position and rotation of each point along the sensor.
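To see why a running product of transform matrices is hard to parallelize, and one common way around it, note that each point's pose depends on every transform before it. Because matrix multiplication is associative, such a chain can be evaluated with a parallel prefix scan. The sketch below is an assumption about the general technique, not a description of Luna's proprietary method. It is sequential Python rather than CUDA, it simplifies to a planar (2D) curve with 3x3 homogeneous matrices, and `segment`, `scan_sequential`, and `scan_hillis_steele` are hypothetical names.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3))
        for r in range(3)
    )


def segment(theta, length):
    """Hypothetical per-point transform: rotate the heading by theta,
    then step `length` along the new heading."""
    c, s = math.cos(theta), math.sin(theta)
    return ((c, -s, length * c), (s, c, length * s), (0.0, 0.0, 1.0))


def scan_sequential(mats):
    """Baseline: out[i] = mats[0] @ mats[1] @ ... @ mats[i], one step at a time."""
    out, acc = [], None
    for m in mats:
        acc = m if acc is None else matmul3(acc, m)
        out.append(acc)
    return out


def scan_hillis_steele(mats):
    """Inclusive prefix scan in about log2(n) passes.

    Within one pass every output depends only on the previous pass, so on a
    GPU each index could be handled by its own thread. The operand order
    (earlier block on the left) matters because matrix products do not commute.
    """
    cur = list(mats)
    d = 1
    while d < len(cur):
        cur = [
            cur[i] if i < d else matmul3(cur[i - d], cur[i])
            for i in range(len(cur))
        ]
        d *= 2
    return cur


if __name__ == "__main__":
    # Four unit steps, each turning 90 degrees, trace a closed square.
    poses = scan_hillis_steele([segment(math.pi / 2, 1.0)] * 4)
    for pose in poses:
        print(f"x={pose[0][2]:+.3f}  y={pose[1][2]:+.3f}")
```

The last column of each resulting matrix holds the position of that point, and the upper-left block holds its accumulated rotation. This scan does more total multiplications than the sequential loop, which is the usual trade of extra work for fewer dependent steps.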
NVIDIA: What types of parallel algorithms are being implemented?
Patrick: Many of our calculations are vector-based, making parallelization easy. But even after parallelizing, many of those operations had to be optimized for efficient global memory access. For more complicated calculations, we typically use reduction or partitioning.
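As a rough illustration of the reduction pattern mentioned here, the sketch below combines values pairwise in passes that halve the active range each time. It is sequential Python, not CUDA, and `tree_reduce` is a hypothetical name. It assumes the operator is associative and commutative (as sum, min, and max are) and pads the input with the operator's identity value up to a power of two.

```python
def tree_reduce(values, op, identity):
    """Reduce `values` with `op` using a halving-stride tree.

    Assumes `op` is associative and commutative and that `identity`
    leaves any value unchanged under `op`.
    """
    buf = list(values)
    n = 1
    while n < len(buf):
        n *= 2
    buf += [identity] * (n - len(buf))  # pad to a power of two

    stride = n // 2
    while stride > 0:
        # The `stride` updates in this pass touch disjoint slots, so on a
        # GPU they could run as independent threads with a barrier between
        # passes.
        for i in range(stride):
            buf[i] = op(buf[i], buf[i + stride])
        stride //= 2
    return buf[0]


if __name__ == "__main__":
    samples = [3.0, -1.5, 8.25, 0.5, 2.0]
    print(tree_reduce(samples, lambda a, b: a + b, 0.0))
    print(tree_reduce(samples, max, float("-inf")))
```

In each pass, neighboring indices read neighboring slots, which is the kind of access pattern that tends to map well onto coalesced memory reads on a GPU.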
NVIDIA: In your field, what are the biggest challenges going forward?
Patrick: In the future, we’d like to make our shape-sensing systems smaller and lighter. Currently, we require a motherboard, 64-bit CPU, and memory just so that we can set up the GPU and send kernels to it. That’s a lot of space and energy required for components that basically act as glue between the FPGA and GPU. We’re very excited about NVIDIA’s mobile SoC (System-on-a-Chip) roadmap toward Parker, which we hope will allow us to shrink our design considerably by combining a next-generation Maxwell GPU with a 64-bit ARM CPU in a single package.