Image recognition and GPUs go hand-in-hand, particularly when using deep neural networks (DNNs). The strength of GPU-based DNNs for image recognition has been unequivocally demonstrated by their success over the past few years in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and DNNs have recently achieved classification accuracy on par with trained humans, as Figure 1 shows. The new Low-Power Image Recognition Challenge (LPIRC) highlights the importance of image recognition on mobile and embedded devices.
DNNs with convolutional layers are biologically inspired artificial neural networks. These networks may have five or more layers with many neurons in each layer. Links similar to synapses connect the layers, forwarding information to the next layer. The training process adjusts the weights on these links, improving the network’s ability to classify the information presented to it. In general, the more data used to train a DNN, the better its classification performance. This big data requirement has resulted in heavy GPU use, because GPUs are designed for high throughput on highly parallel computations like those used in deep learning.
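As a rough illustration of what "adjusting weights on the links" means, here is a minimal sketch that trains a single sigmoid neuron with gradient descent. It is not a DNN and not code from any framework mentioned here; the function names, the toy OR dataset, and the learning rate are all assumptions chosen to keep the example small.

```python
import math
import random


def train_single_neuron(samples, epochs=500, learning_rate=0.5, seed=0):
    """Fit the weights and bias of one sigmoid neuron (hypothetical helper)."""
    rng = random.Random(seed)
    n_inputs = len(samples[0][0])
    # Small random starting weights, one per incoming link.
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1.0 / (1.0 + math.exp(-activation))
            # For a sigmoid output with cross-entropy loss, the gradient
            # with respect to the activation is simply (output - target).
            error = output - target
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


def classify(weights, bias, inputs):
    """Return 1 if the neuron's activation is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


# Toy training set (an assumption for illustration): logical OR.
samples = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
weights, bias = train_single_neuron(samples)
predictions = [classify(weights, bias, x) for x, _ in samples]
print(predictions)
```

A real convolutional network repeats this kind of update across millions of weights and images, which is the highly parallel workload that GPUs handle well.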
ImageNet is a great resource for imagery, hosting a large database of images organized according to a hierarchy of descriptive nouns. Each year, ImageNet hosts the ILSVRC, for which entrants develop algorithms for accurately recognizing objects in the images. ImageNet provides a large image set of over 1.2 million images from 1000 different object categories for training recognition algorithms. Academic as well as industrial participants have performed strongly, with competitors from Google, Stanford University, University of California, Berkeley, and Adobe (among many others) in recent years.
A Low-Power Challenge
To motivate improved image recognition on low-power devices, Yung-Hsiang Lu, Associate Professor of Electrical and Computer Engineering at Purdue University, and Alex Berg, Assistant Professor of Computer Science at UNC Chapel Hill, are organizing the Low-Power Image Recognition Challenge (LPIRC), a competition focused on identifying the best technology in both image recognition and energy conservation. Registration for the LPIRC is now open.
Achieving high performance while maintaining low power can be challenging, as these two parameters often increase together. Last year NVIDIA released the Jetson TK1 Development Kit, a low-power GPU-accelerated computing platform that is well-suited for image processing and computer vision applications. Jetson TK1’s low power requirements and image processing capabilities will make it a popular platform for LPIRC competitors. Continue reading →
The ArrayFire library is a high-performance software library with a focus on portability and productivity. It supports highly tuned, GPU-accelerated algorithms using an easy-to-use API. ArrayFire wraps GPU memory into a simple “array” object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.
ArrayFire has three back ends to enable portability across many platforms: CUDA, OpenCL and CPU. It even works on embedded platforms like NVIDIA’s Jetson TK1.
In a past post about ArrayFire we demonstrated ArrayFire’s capabilities and how it can increase your productivity. In this post I will show how you can use ArrayFire to exploit various kinds of parallelism on NVIDIA GPUs. Continue reading →
GPUs have quickly become the go-to platform for accelerating machine learning applications for training and classification. Deep Neural Networks (DNNs) have grown in importance for many applications, from image classification and natural language processing to robotics and UAVs. To help researchers focus on solving core problems, NVIDIA introduced a library of primitives for deep neural networks called cuDNN. The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, but only for workstations and server-based machine learning applications.
In the meantime, the Jetson TK1 development kit has become a must-have for mobile and embedded parallel computing due to the amazing level of performance packed into such a low-power board. Demand for embedded machine learning has been incredible, so to address this demand, we’ve released cuDNN for ARM (Linux for Tegra—L4T).
The combination of these two powerful tools enables industry standard machine learning frameworks, such as Berkeley’s Caffe or NYU’s Torch7, to run on a mobile device with excellent performance. Numerous machine learning applications will benefit from this platform, enabling advances in robotics, autonomous vehicles and embedded computer vision. Continue reading →
In the previous CUDACasts episode, we saw how to flash your Jetson TK1 to the latest release of Linux4Tegra, and install both the CUDA toolkit and OpenCV SDK. We’ll continue exploring the power efficiency the Jetson TK1 Kepler-based GPU brings to computer vision by porting a simple OpenCV sample to run on the GPU. We’ll explore computer vision further in a future CUDACast when we look at the VisionWorks toolkit from NVIDIA.
The Jetson TK1 development kit has fast become a must-have for mobile and embedded parallel computing due to the amazing level of performance packed into such a low-power board. In this and the following CUDACast, you’ll learn how to get started building computer vision applications on your Jetson TK1 using CUDA and the OpenCV library.
NVIDIA’s Tegra K1 (TK1) is the first ARM system-on-chip (SoC) with integrated CUDA. With 192 Kepler GPU cores and four ARM Cortex-A15 cores delivering a total of 327 GFLOPS of compute performance, TK1 has the capacity to process lots of data with CUDA while typically drawing less than 6W of power (including the SoC and DRAM). This brings game-changing performance to low-SWaP (Size, Weight and Power) and small form factor (SFF) applications in the sub-10W domain, all the while supporting a developer-friendly Ubuntu Linux software environment delivering an experience more like that of a desktop rather than an embedded SoC.
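The 327 GFLOPS figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes a peak GPU clock of 852 MHz and counts one fused multiply-add per core per cycle as two floating-point operations; neither number is stated in the text above, so treat both as assumptions. Under them, the GPU cores alone already account for the quoted total.

```python
# Figures taken from the text above.
cuda_cores = 192
typical_power_watts = 6.0  # "typically drawing less than 6W"

# Assumptions not stated in the text.
gpu_clock_ghz = 0.852          # assumed peak GPU clock (852 MHz)
flops_per_core_per_cycle = 2   # one fused multiply-add counted as 2 FLOPs

# Theoretical single-precision peak for the GPU cores only.
peak_gflops = cuda_cores * flops_per_core_per_cycle * gpu_clock_ghz

# Rough efficiency estimate: theoretical peak divided by the typical power
# draw. Real workloads reach only a fraction of peak, so this is an
# optimistic ceiling rather than a measurement.
gflops_per_watt = peak_gflops / typical_power_watts

print(f"Peak: {peak_gflops:.1f} GFLOPS")
print(f"Efficiency estimate: {gflops_per_watt:.1f} GFLOPS/W")
```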
Tegra K1 is plug-and-play and can stream high-bandwidth peripherals, sensors, and network interfaces via built-in USB 3.0 and PCIe gen2 x4/x1 ports. TK1 is geared for sensor processing and offers additional hardware-accelerated functionality asynchronous to CUDA, like H.264 encoding and decoding engines, dual MIPI CSI-2 camera interfaces, and image signal processors (ISPs). There are many exciting embedded applications for TK1 which leverage its natural ability as a media processor and low-power platform for quickly integrating devices and sensors.
GPU acceleration is particularly well-suited for data-parallel tasks like imaging, signal processing, autonomy and machine learning, and Tegra K1 extends these capabilities into the sub-10W domain. Code portability is now maintained from NVIDIA’s high-end Tesla HPC accelerators and the GeForce and Quadro discrete GPUs, all the way down through the low-power TK1. A full build of the CUDA 6 toolkit is available for TK1, including samples, math libraries such as cuFFT, cuBLAS, and NPP, and NVIDIA’s NVCC compiler. Developers can compile CUDA code natively on TK1 or cross-compile from a Linux development machine. Because the same CUDA libraries and development tools are available on both, CUDA applications can move between discrete GPUs and Tegra with little effort. OpenCV4Tegra and NVIDIA’s VisionWorks toolkit are also available. Additionally, the Ubuntu 14.04 repository is rich in pre-built packages for the ARM architecture, minimizing time spent tracking down and building dependencies. In many instances applications can simply be recompiled for ARM with little modification, as long as the source is available and doesn’t explicitly use x86-specific instructions like SSE, AVX, or x86 inline assembly. On ARM, the equivalent SIMD extension for Cortex-A series CPUs is NEON. Continue reading →
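One way to judge whether a code base will recompile cleanly for ARM is to look for the x86-specific constructs mentioned in the porting note above. The following is a hypothetical helper, not an NVIDIA tool: it flags lines that include the standard x86 intrinsics headers (such as `xmmintrin.h` for SSE or `immintrin.h` for AVX), call `_mm_`-style intrinsics, or contain inline assembly. The function name and the patterns are assumptions for illustration, and a simple text scan like this can miss cases or raise false alarms.

```python
import re

# Patterns for constructs that usually need attention when porting to ARM.
PORTING_PATTERNS = [
    ("x86 intrinsics header",
     re.compile(r"#\s*include\s*<[a-z]*mmintrin\.h>")),
    ("x86 SIMD intrinsic call",
     re.compile(r"\b_mm(256|512)?_\w+")),
    ("inline assembly (review manually)",
     re.compile(r"\b(__asm__|__asm|asm)\b")),
]


def find_x86_specific_lines(source_text):
    """Return (line_number, reason) pairs for lines that look x86-specific."""
    hits = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for reason, pattern in PORTING_PATTERNS:
            if pattern.search(line):
                hits.append((line_number, reason))
    return hits


# Small made-up C snippet used only to exercise the scanner.
sample_source = """#include <immintrin.h>
float first4_sum(const float *p) {
    __m128 v = _mm_loadu_ps(p);
    return p[0] + p[1] + p[2] + p[3];
}
"""

hits = find_x86_specific_lines(sample_source)
for line_number, reason in hits:
    print(f"line {line_number}: {reason}")
```

Flagged code typically needs either a portable fallback or a NEON implementation (via `arm_neon.h`) before it will build on TK1.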
NVIDIA® Nsight™ Eclipse Edition is a full-featured, integrated development environment that lets you easily develop CUDA® applications for either your local (x86) system or a remote (x86 or ARM) target. In this post, I will walk you through the process of remote-developing CUDA applications for the NVIDIA Jetson TK1, an ARM-based development kit.
Nsight supports two remote development modes: cross-compilation and “synchronize projects” mode. Cross-compiling for ARM on your x86 host system requires that all of the ARM libraries with which you will link your application be present on your host system. In synchronize-projects mode, on the other hand, your source code is synchronized between host and target systems and compiled and linked directly on the remote target, which has the advantage that all your libraries get resolved on the target system and need not be present on the host. Neither of these remote development modes requires an NVIDIA GPU to be present in your host system.
Note: CUDA cross-compilation tools for ARM are available only in the Ubuntu 12.04 DEB package of the CUDA 6 Toolkit. If your host system is running a Linux distribution other than Ubuntu 12.04, I recommend the synchronize-projects remote development mode, which I will cover in detail in a later blog post.
CUDA toolkit setup
The first step involved in cross-compilation is installing the CUDA 6 Toolkit on your host system. To get started, let’s download the required Ubuntu 12.04 DEB package from the CUDA download page. Installation instructions can be found in the Getting Started Guide for Linux, but I will summarize them below for CUDA 6. Continue reading →
Today, cars are learning to see pedestrians and road hazards; robots are becoming higher functioning; complex medical diagnostic devices are becoming more portable; and unmanned aircraft are learning to navigate autonomously. As a result, the computational requirements for these devices are increasing exponentially, while their size, weight, and power limits continue to decrease. Aimed at these and other embedded parallel computing applications, last week at the 2014 GPU Technology Conference NVIDIA announced an awesome new developer platform called Jetson TK1.
Jetson TK1 is a tiny but full-featured computer designed for development of embedded and mobile applications. Jetson TK1 is exciting because it incorporates Tegra K1, the first mobile processor to feature a CUDA-capable GPU. Jetson TK1 brings the capabilities of Tegra K1 to developers in a compact, low-power platform that makes development as simple as developing on a PC.
Jetson TK1 is aimed at two groups of people. The first are OEMs, including robotics, avionics, and medical device companies, who would like to develop new products that use Tegra K1 SoCs, and need a development platform that makes it easy to write software for these products. Once these companies are ready to move to production, they can work with one of our board partners to design the exact board that they need for their product. The second group is the large number of independent developers, researchers, makers, and hobbyists who would like a platform that will enable them to create amazing technology such as robots, security devices, or anything that needs substantial parallel computing or computer vision in a small, flexible and low-power platform. For this group, Jetson TK1 offers the size and adaptability of Raspberry Pi or Arduino, with the computational capability of a desktop computer. We’re excited to see what developers create with Jetson TK1!
Tegra K1 is NVIDIA’s latest mobile processor. It features a Kepler GPU with 192 cores, Continue reading →