NVIDIA® Nsight™ Eclipse Edition (NSEE) is a full-featured unified CPU+GPU integrated development environment(IDE) that lets you easily develop CUDA applications for either your local (x86_64) system or a remote (x86_64 or ARM) target system. In my last post on remote development of CUDA applications, I covered NSEE’s cross compilation mode. In this post I will focus on the using NSEE’s synchronized project mode.
For remote development of CUDA applications using synchronized-project mode, you can edit code on the host system and synchronize it with the target system. In this scenario, the code is compiled natively on the target system as Figure 1 shows.
In synchronized project mode the host system does not need an ARM cross-compilation tool chain, so you have the flexibility to use Mac OS X or any of the CUDA supported x86_64 Linux platforms as the host system. The remote target system can be a CUDA-supported x86_64 Linux target or an ARM-based platform like the Jetson TK1 system. I am using Mac OS X 10.8.5 on my host system (with Xcode 5.1.1 installed) and 64-bit Ubuntu 12.04 on my target system. Continue reading →
Our Spotlight is on Dr. Michela Taufer, Associate Professor at the University of Delaware.
Michela heads the Global Computing Lab (GCLab), which focuses on high performance computing (HPC) and its application to the sciences.
Her research interests include software applications and their advanced programmability in heterogeneous computing (i.e., multi-core platforms and GPUs); cloud computing and volunteer computing; and performance analysis, modeling and optimization of multi-scale applications.
The following is an excerpt from our interview (read the complete Spotlight here).
NVIDIA: Michela, what is the mission of the Global Computing Lab at the University of Delaware? Michela: We are engaged in the design and testing of efficient computational algorithms and adaptive scheduling policies for scientific computing on GPUs, the Cloud, and Volunteer Computing.
Interdisciplinary research with scientists and engineers in fields such as chemistry and chemical engineering, pharmaceutical sciences, seismology, and mathematics is at the core of our activities and philosophy.
NVIDIA: Tell us about your work with GPUs. Michela: My team’s work is all about rethinking application algorithms to fit on the GPU architecture in order to get the most out of its computing power, while preserving the scientific accuracy of the simulations. This has resulted in many exciting achievements!
NVIDIA: Can you provide an example? Michela: My group and I were the first to propose a completely-on-GPU PME (Particle Mesh Ewald) code for MD (molecular dynamics) simulations. We achieved that goal by changing the traditional way researchers algorithmically look at charges in long-range electrostatics and their interactions.
With our code empowered with the PME components, we could move the traditional scale for studying membranes like DMPC lipid bilayers from membranes on the order of 72 lipid molecules (17,004 atoms) to 16-times larger membranes of 1,152 lipid molecules (27,3936 atoms) in explicit solvent [see Figure 1].
Today we’re excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more.
CUDA 6.5 takes the next step, enabling CUDA on 64-bit ARM platforms. The heritage of ARM64 is in low-power, scale-out data centers and microservers, while GPUs are built for ultra-fast compute performance. When we combine the two, we have a compelling solution for HPC. ARM64 provides power efficiency, system configurability, and a large, open ecosystem. GPUs bring to the table high-throughput, power-efficient compute performance, a large HPC ecosystem, and hundreds of CUDA-accelerated applications. For HPC applications, ARM64 CPUs can offload the heavy lifting of computational tasks to GPUs. CUDA and GPUs make ARM64 competitive in HPC from day one.
Development platforms available now for CUDA on ARM64 include the Cirrascale RM1905D HPC Development Platform and the E4 ARKA EK003. Eurotech has announced a system available later this year. These platforms are built on Applied Micro X-Gene 8-core 2.4GHz ARM64 CPUs, Tesla K20 GPU Accelerators, and CUDA 6.5. As Figure 1 shows, performance of CUDA-accelerated applications on ARM64+GPU systems is competitive with x86+GPU systems.
Unified Memory is a CUDA feature that we’ve talked a lot about on Parallel Forall. CUDA 6 introduced Unified Memory, which dramatically simplifies GPU programming by giving programmers a single pointer to data which is accessible from either the GPU or the CPU. But this enhanced memory model has only been available to CUDA C/C++ programmers, until now. The new PGI Compiler release 14.7 enables Unified Memory in CUDA Fortran.
In a PGInsider article called CUDA Fortran Managed Memory, PGI Applications and Services Manager Brent Leback writes “using managed memory simplifies many coding tasks, makes source code cleaner, and enables a unified view of complicated data structures across host and device memories.” PGI 14.7 adds the managed keyword to the language, which you can use in host code similarly to the device keyword. Here’s part of an example Brent included in his article, showing the allocation of managed arrays. Continue reading →
Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application performance.
What is pointer aliasing?
Two pointers alias if the memory to which they point overlaps. When a compiler can’t determine whether pointers alias, it has to assume that they do. The following simple function shows why this is potentially harmful to performance:
At first glance it might seem that this function needs to perform three load operations from memory: one for a[i], one for b[i] and one for c[i]. This is incorrect because it assumes that c[i] can be reused once it is loaded. Consider the case where a and c point to the same address. In this case the first line modifies the value c[i] when writing to a[i]. Therefore the compiler must generate code to reload c[i] on the second line, in case it has been modified.
Because the compiler must conservatively assume the pointers alias, it will compile the above code inefficiently, even if the programmer knows that the pointers never alias.
What can I do about aliasing?
Fortunately almost all C/C++ compilers offer a way for the programmer to give the compiler information about pointer aliasing. Continue reading →
R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematics operations for statistics, data analysis, machine learning and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through CRAN (the Comprehensive R Archive Network).
However, R, like many other high-level languages, is not performance competitive out of the box with lower-level languages like C++, especially for highly data- and computation-intensive applications. R programs tend to process large amounts of data, and often have significant independent data and task parallelism. Therefore, R applications stand to benefit from GPU acceleration. This way, R users can benefit from R’s high-level, user-friendly interface while achieving high performance.
In this article, I will introduce the computation model of R with GPU acceleration, focusing on three topics:
accelerating R computations using CUDA libraries;
calling your own parallel algorithms written in CUDA C/C++ or CUDA Fortran from R; and
profiling GPU-accelerated R applications using the CUDA Profiler.
The GPU-Accelerated R Software Stack
Figure 1 shows that there are two ways to apply the computational power of GPUs in R:
use R GPU packages from CRAN; or
access the GPU through CUDA libraries and/or CUDA-accelerated programming languages, including C, C++ and Fortran.
Figure 1: The R + GPU software stack.
The first approach is to use existing GPU-accelerated R packages listed under High-Performance and Parallel Computing with R on the CRAN site. Examples include gputools and cudaBayesreg. These packages are very easy to install and use. On the other hand, the number of GPU packages is currently limited, quality varies, and only a few domains are covered. This will improve with time.
In an earlier post we showed how MATLAB® can support CUDA kernel prototyping and development by providing an environment for quick evaluation and visualization using the CUDAKernel object. In this post I will show you how to integrate an existing library of both host and device code implemented in C++ or another CUDA-accelerated language using MEX. With MEX you can extend and customize MATLAB, or use MATLAB as a test environment for your production code.
The MATLAB MEX compiler allows you to expose your libraries to the MATLAB environment as functions. You write your entry point in C, C++ or Fortran as a modified main() function which MATLAB invokes. MEX provides a framework for compiling this code, as well as an API for interacting with MATLAB and MATLAB data in your source code.
MATLAB’s Parallel Computing Toolbox™ provides constructs for compiling CUDA C and C++ with nvcc, and new APIs for accessing and using the gpuArray datatype which represents data stored on the GPU as a numeric array in the MATLAB workspace.
Feature Detection Example
I am going to use a feature detection example from MATLAB’s documentation for Computer Vision System Toolbox™. This uses tracked features to remove camera shake from an in-car road video. You will need MATLAB®, Parallel Computing Toolbox™, Image Processing Toolbox™ and Computer Vision System Toolbox™ to run the code. You can request a trial of these products at www.mathworks.com/trial. This example also depends on the OpenCV Computer Vision library, compiled with CUDA support.
Features are an essential prerequisite for many Computer Vision tasks; in this case, for instance, they might also be used to determine the motion of the car or to track other cars on the road.
To set up the example environment, I am using the following MATLAB code to load the video and display the first two frames superimposed (Figure 1). Continue reading →
We received a wide variety of entries from around the world, representing professors, students, researchers, scientists and domain experts. We had recognized several participants earlier as CUDA Spotlights, includingValerie Halyoof Princeton and Monica Syalof Advanced Rotorcraft Technology.
CUDA Fellow Lorena Barba comments: “This is a good way to remind people that women write code, participate in open-source projects, and invent things. It’s important to make the technology world more attractive to female students.” Dr. Barba is an associate professor of Engineering and Applied Science at George Washington University.
Congrats to the women on our inaugural Women and CUDA list:
Sonia Lopez Alarcon, Rochester Institute of Technology | Heterogeneous Computing Rommie Amaro, University of California, San Diego | Biological Systems Michela Becchi, University of Missouri | GPU Virtualization Continue reading →
Graph analysis is a fundamental tool for domains as diverse as social networks, computational biology, and machine learning. Real-world applications of graph algorithms involve tremendously large networks that cannot be inspected manually. Betweenness Centrality (BC) is a popular analytic that determines vertex influence in a graph. It has many practical use cases, including finding the best locations for stores within cities, power grid contingency analysis, and community detection. Unfortunately, the fastest known algorithm for computing betweenness centrality has time complexity for graphs with vertices and edges, making the analysis of large networks challenging.
This post describes how we used CUDA and NVIDIA GPUs to accelerate the BC computation, and how choosing efficient parallelization strategies results in an average speedup of 2.7x, and more than 10x speedup for road networks and meshes versus a naïve edge-parallel strategy.
Betweenness Centrality determines the importance of vertices in a network by measuring the ratio of shortest paths passing through a particular vertex to the total number of shortest paths between all pairs of vertices. Intuitively, this ratio determines how well a vertex connects pairs of vertices in the network. Formally, the Betweenness Centrality of a vertex is defined as:
where is the number of shortest paths between vertices and and is the number of those shortest paths that pass through . Consider Figure 1 above. Vertex 4 is the only vertex that lies on paths from its left (vertices 5 through 9) to its right (vertices 1 through 3). Hence vertex 4 lies on all the shortest paths between these pairs of vertices and has a high BC score. In contrast, vertex 9 does not belong on a path between any pair of the remaining vertices and thus it has a BC score of 0. Continue reading →
CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, its important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency hiding ability of a kernel.
Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementating the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each supported GPU architecture.
CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor produces an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.