Embedded Machine Learning with the cuDNN Deep Neural Network Library and Jetson TK1

GPUs have quickly become the go-to platform for accelerating machine learning applications for training and classification. Deep Neural Networks (DNNs) have grown in importance for many applications, from image classification and natural language processing to robotics and UAVs. To help researchers focus on solving core problems, NVIDIA introduced a library of primitives for deep neural networks called cuDNN. The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, but only for workstations and server-based machine learning applications.

In the meantime, the Jetson TK1 development kit has become a must-have for mobile and embedded parallel computing due to the amazing level of performance packed into such a low-power board. Demand for embedded machine learning has been incredible, so to address this demand, we’ve released cuDNN for ARM (Linux for Tegra—L4T).

The combination of these two powerful tools enables industry-standard machine learning frameworks, such as Berkeley’s Caffe or NYU’s Torch7, to run on a mobile device with excellent performance. Numerous machine learning applications will benefit from this platform, enabling advances in robotics, autonomous vehicles, and embedded computer vision.


CUDACasts Episode #6: CUDA on ARM with CUDA 5.5

In CUDACast #5, we saw how to use the new NVIDIA RPM and Debian packages to install the CUDA toolkit, samples, and driver on a supported Linux OS with a standard package manager. With CUDA 5.5, it is now possible to compile and run CUDA applications on ARM-based systems such as the Kayla development platform. In addition to native compilation on an ARM-based CPU system, it is also possible to cross-compile for ARM systems, allowing for greater development flexibility.

NVIDIA’s next-generation Logan system on a chip will contain a Kepler GPU supporting CUDA along with a multicore ARM CPU. The combination of ARM support in CUDA 5.5 and the Kayla platform gives developers a powerful toolset to prepare for the next step in the mobile visual computing revolution.

What amazing applications will you be able to create with a small and power-efficient CPU combined with a massively parallel Kepler GPU—the same GPU architecture powering some of the most powerful supercomputers in the world?


CUDA for ARM Platforms is Now Available

In 2012 alone, over 8.7 billion ARM-based chips were shipped worldwide. Many developers of GPU-accelerated applications are planning to port their applications to ARM platforms, and some have already started. I recently chatted about this with John Stone, the lead developer of VMD, a high-performance (and CUDA-accelerated) molecular visualization tool used by researchers all over the world. But first … some exciting news.

To help developers working with ARM-based computing platforms, we are excited to announce the public availability of the CUDA Toolkit version 5.5 Release Candidate (RC) with support for the ARM CPU architecture. This latest release of the CUDA Toolkit includes support for the following features and functionality on ARM-based platforms:

  • The CUDA C/C++ compiler (nvcc), debugging tools (cuda-gdb and cuda-memcheck), and the command-line profiler (nvprof). (Support for the NVIDIA Visual Profiler and Nsight Eclipse Edition is still to come; for now, I recommend capturing profiling data with nvprof and viewing it in the Visual Profiler.)
  • Native compilation on ARM CPUs, for fast and easy application porting.
  • Fast cross-compilation on x86 CPUs, which reduces development time for large applications by enabling developers to compile ARM code on faster x86 processors, and then deploy the compiled application on the target computer.
  • GPU-accelerated libraries including CUFFT (FFT), CUBLAS (linear algebra), CURAND (random number generation), CUSPARSE (sparse linear algebra), and NPP (NVIDIA Performance Primitives for signal and image processing).
  • Complete documentation, code samples, and more to help developers quickly learn how to take advantage of GPU-accelerated parallel computing on ARM-based systems.
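
To give a sense of what native compilation looks like in practice, here is a minimal sketch of a CUDA C/C++ program that should build unchanged on an x86 or an ARM host. It is not taken from the toolkit samples: the file name vecadd.cu, the kernel vecAdd, and the helpers check and countMismatches are illustrative names chosen for this example. Assuming nvcc is on the path, something like `nvcc vecadd.cu -o vecadd` on the ARM board itself would be the native build.

```cuda
// vecadd.cu (hypothetical file name): element-wise vector addition.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread adds one pair of elements: c[i] = a[i] + b[i].
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Illustrative helper: abort with a readable message if a runtime call fails.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Illustrative helper: count elements where c[i] differs from a[i] + b[i].
static int countMismatches(const float *a, const float *b, const float *c, int n)
{
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (c[i] != a[i] + b[i]) ++bad;
    return bad;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers with simple, exactly representable test data.
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    if (!ha || !hb || !hc) { fprintf(stderr, "host malloc failed\n"); return EXIT_FAILURE; }
    for (int i = 0; i < n; ++i) { ha[i] = (float)i; hb[i] = 2.0f * i; }

    // Device buffers and host-to-device copies.
    float *da, *db, *dc;
    check(cudaMalloc((void **)&da, bytes), "cudaMalloc(da)");
    check(cudaMalloc((void **)&db, bytes), "cudaMalloc(db)");
    check(cudaMalloc((void **)&dc, bytes), "cudaMalloc(dc)");
    check(cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice), "copy b");

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);
    check(cudaGetLastError(), "kernel launch");

    // The blocking copy back also waits for the kernel to finish.
    check(cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost), "copy c");

    const int bad = countMismatches(ha, hb, hc, n);
    printf("%s (%d mismatches)\n", bad == 0 ? "PASSED" : "FAILED", bad);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return bad == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The sketch sticks to explicit cudaMalloc and cudaMemcpy calls because they are part of the long-standing runtime API. Cross-compiling the same source on an x86 host needs additional nvcc options and an ARM cross toolchain, which are best taken from the toolkit documentation rather than guessed here.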
