This week NVIDIA has released the NVIDIA OpenACC Toolkit, a starting point for anyone interested in using OpenACC. OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing without significant programming effort. The toolkit includes the PGI OpenACC Compiler, the NVIDIA Visual Profiler with CPU and GPU profiling, and the new OpenACC Programming and Best Practices Guide. Academics can get a free renewable license to the PGI C,C++ and Fortran compilers with support for OpenACC.
OpenACC is an open specification for compiler directives for parallel programming. By using OpenACC, developers can rapidly accelerate existing C, C++, and Fortran applications using high-level directives that help retain application portability across processor architectures. Figure 1 shows some examples of real code speedups with OpenACC. The OpenACC specification is designed and maintained with the cooperation of many industry and academic partners, such as Cray, AMD, PathScale, University of Houston, Oak Ridge National Laboratory and NVIDIA.
When I program with and teach OpenACC I like to use a 4 step cycle to progressively accelerate the code.
Identify Parallelism: Profile the code to understand where the program is spending its time and how much parallelism is available to be accelerated in those important routines. GPUs excel when there’s a significant amount of parallelism to exploit, so look for loops and loop nests with a lot of independent iterations.
Express Parallelism: Placing OpenACC directives on the loops identified in step 1 tells the compiler to parallelize them. OpenACC is all about giving the compiler enough information to effectively accelerate the code, so during this step I add directives to as many loops as I believe I can and move as much of the computation to the GPU as possible.
Express Data Locality: The compiler needs to know not just what code to parallelize, but also which data will be needed on the accelerator by that code. After expressing available parallelism, I often find that the code has slowed down. As you’ll see later in this post, this slowdown comes from the compiler making cautious decisions about when data needs to be moved to the GPU for computation. During this step, I’ll express to the compiler my knowledge of when and how the data is really needed on the GPU.
Optimize – The compiler usually does a very good job accelerating code, but sometimes you can get more performance by giving the compiler a little more information about the loops or by restructuring the code to increase parallelism or improve data access patterns. Most of the time this leads to small improvements, but sometimes gains can be bigger.
Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA Applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with respect to each other. [See the post How to Overlap Data Transfers in CUDA C/C++ for an example]
When you execute asynchronous CUDA commands without specifying a stream, the runtime uses the default stream. Before CUDA 7, the default stream is a special stream which implicitly synchronizes with all other streams on the device.
Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application performance.
What is pointer aliasing?
Two pointers alias if the memory to which they point overlaps. When a compiler can’t determine whether pointers alias, it has to assume that they do. The following simple function shows why this is potentially harmful to performance:
At first glance it might seem that this function needs to perform three load operations from memory: one for a[i], one for b[i] and one for c[i]. This is incorrect because it assumes that c[i] can be reused once it is loaded. Consider the case where a and c point to the same address. In this case the first line modifies the value c[i] when writing to a[i]. Therefore the compiler must generate code to reload c[i] on the second line, in case it has been modified.
Because the compiler must conservatively assume the pointers alias, it will compile the above code inefficiently, even if the programmer knows that the pointers never alias.
What can I do about aliasing?
Fortunately almost all C/C++ compilers offer a way for the programmer to give the compiler information about pointer aliasing. Continue reading →
Graph analysis is a fundamental tool for domains as diverse as social networks, computational biology, and machine learning. Real-world applications of graph algorithms involve tremendously large networks that cannot be inspected manually. Betweenness Centrality (BC) is a popular analytic that determines vertex influence in a graph. It has many practical use cases, including finding the best locations for stores within cities, power grid contingency analysis, and community detection. Unfortunately, the fastest known algorithm for computing betweenness centrality has time complexity for graphs with vertices and edges, making the analysis of large networks challenging.
This post describes how we used CUDA and NVIDIA GPUs to accelerate the BC computation, and how choosing efficient parallelization strategies results in an average speedup of 2.7x, and more than 10x speedup for road networks and meshes versus a naïve edge-parallel strategy.
Betweenness Centrality determines the importance of vertices in a network by measuring the ratio of shortest paths passing through a particular vertex to the total number of shortest paths between all pairs of vertices. Intuitively, this ratio determines how well a vertex connects pairs of vertices in the network. Formally, the Betweenness Centrality of a vertex is defined as:
where is the number of shortest paths between vertices and and is the number of those shortest paths that pass through . Consider Figure 1 above. Vertex 4 is the only vertex that lies on paths from its left (vertices 5 through 9) to its right (vertices 1 through 3). Hence vertex 4 lies on all the shortest paths between these pairs of vertices and has a high BC score. In contrast, vertex 9 does not belong on a path between any pair of the remaining vertices and thus it has a BC score of 0. Continue reading →
When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved andTheoretical Occupancies. The Theoretical Occupancy is the ratio between the number of threads which may run on each multiprocessor (SM) and the maximum number of executable threads per SM (2048 on the Kepler architecture). This value is estimated from the size of the blocks and the amount of resources (registers and shared memory) used by those blocks for a particular GPU and is computed without running the kernel on the GPU. The Achieved Occupancy, on the other hand, is measured from the execution of the kernel (as the number of active warps divided by the number of active cycles compared to the maximum number of executable warps).
Recently, while working on a kernel for a finance benchmark, I could see an Achieved Occupancy of 41.52% whereas the Theoretical Occupancy was 50%. In NVIDIA Nsight Visual Studio Edition, the Instruction per Clock (IPC) showed a lot of load imbalance between the different SMs with respect to the number of executed instructions by the kernel (see the left graph in the figure below).
One of the main reasons for accelerating code on an NVIDIA GPU is for an increase in application performance. This is why it’s important to use the best tools available to help you get the performance you’re looking for. CUDA 6 includes great improvements to the guided analysis tool in the NVIDIA Visual Profiler. Watch today’s CUDACast to see how to use guided analysis to locate potential optimizations for your GPU code.