
CUDA Pro Tip: Do The Kepler Shuffle

When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a warp can read each other’s registers by using a new instruction called SHFL, or “shuffle”.
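As a preview, here is a minimal sketch (mine, not taken from the talk or posts mentioned below) of a warp-level sum reduction built on the __shfl_down() intrinsic. Each call reads a register from a lane a fixed offset higher in the warp, so no shared memory or __syncthreads() is required. On CUDA 9 and later toolkits the equivalent intrinsic is __shfl_down_sync(0xffffffff, val, offset).

// Sketch: sum a value across the 32 threads of a warp using Kepler shuffle.
__device__ int warpReduceSum(int val)
{
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down(val, offset);  // add the value held by the lane `offset` above
  return val;                         // lane 0 now holds the warp-wide sum
}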

In upcoming posts here on Parallel Forall we will demonstrate uses of shuffle. To prepare, I highly recommend watching the recording of Julien Demouth’s GTC 2013 talk, “Kepler’s SHUFFLE (SHFL): Tips and Tricks”. In the talk, Julien covers many uses for shuffle, including reductions, scans, transpose, and sorting, demonstrating that shuffle is always faster than safe uses of shared memory, and never slower than unsafe uses of shared memory.

Earlier this year on the Acceleware blog, Kelly Goss wrote a detailed post about shuffle, including a worked example. Like Julien, Kelly provided several reasons to use shuffle.

Peer-to-Peer Multi-GPU Transpose in CUDA Fortran (Book Excerpt)

This post is an excerpt from Chapter 4 of the book CUDA Fortran for Scientists and Engineers, by Gregory Ruetsch and Massimiliano Fatica. In this excerpt we extend the matrix transpose example from a previous post to operate on a matrix that is distributed across multiple GPUs. The data layout is shown in Figure 1 for an nx × ny = 1024 × 768 element matrix distributed among four devices. Each device holds a horizontal slice of the input matrix shown in the figure, as well as a horizontal slice of the output matrix. These input matrix slices of 1024 × 192 elements are divided into four tiles of 256 × 192 elements each, referred to as p2pTile in the code. As the name indicates, the p2pTiles are used for peer-to-peer transfers. After a p2pTile has been transferred to the appropriate device if necessary (tiles on the block diagonal do not need to be transferred, since their input and output tiles reside on the same device), a CUDA kernel launched on that device transposes the elements within the p2pTile using thread blocks that each process a smaller 32 × 32 tile.

Figure 1: Device data layout for peer-to-peer transpose of an nx × ny = 1024 × 768 matrix on four devices. Each device holds a 1024 × 192 horizontal slice of the input matrix (as well as a 768 × 256 horizontal slice of the output matrix). Each slice of the input matrix is broken into four tiles of 256 × 192 elements, which are used for peer-to-peer transfers. The CUDA kernel transposes each such tile using 48 thread blocks, each of which processes a 32 × 32 tile.

The full code is available on the website for the CUDA Fortran for Scientists and Engineers textbook [line numbers below refer to the file CUDAFortranCode/chapter4/P2P/transposeP2P.cuf in the source code archive]. In this post we pull in only the relevant parts for our discussion.
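One prerequisite the excerpt relies on is that peer access has been enabled between each pair of devices, so that the off-diagonal p2pTiles can be copied directly from GPU to GPU without staging through host memory. The book does this through the CUDA Fortran runtime API; the sketch below (mine, written in CUDA C for brevity, with a hypothetical numDevices count) shows the equivalent calls.

#include <cuda_runtime.h>

// Hedged sketch (not the book's code): enable peer access for every device
// pair that supports it, so peer-to-peer tile transfers bypass host memory.
void enableAllPeerAccess(int numDevices)
{
  for (int i = 0; i < numDevices; i++) {
    cudaSetDevice(i);
    for (int j = 0; j < numDevices; j++) {
      if (i == j) continue;
      int canAccess = 0;
      cudaDeviceCanAccessPeer(&canAccess, i, j);
      if (canAccess) cudaDeviceEnablePeerAccess(j, 0);  // flags argument must be 0
    }
  }
}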


CUDACasts Episode 13: Clock, Power, and Thermal Profiling with Nsight Eclipse Edition

In the world of high-performance computing, it is important to understand how your code affects the operating characteristics of your hardware. For example, if your program executes inefficient code, it may cause the GPU to work harder than it needs to, leading to higher power consumption and a potential slowdown due to thermal throttling.

A new profiling feature in CUDA 5.5 lets you profile the clocks, power, and thermal characteristics of the GPU as it executes your code. This feature is available in the NVIDIA Visual Profiler on Linux and 64-bit Windows 7/8, and in Nsight Eclipse Edition on Linux. Learn how to activate and use this feature by watching CUDACasts Episode 13.
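If you also want to sample these same operating characteristics programmatically, for example to log them during a long-running job, the NVML library exposes them. Here is a minimal sketch, assuming device 0 and linking against the NVML library (for example, -lnvidia-ml on Linux); this is not the profiler feature described above, just another way to read the same values.

#include <cstdio>
#include <nvml.h>

int main()
{
  nvmlInit();
  nvmlDevice_t dev;
  nvmlDeviceGetHandleByIndex(0, &dev);

  // Power in milliwatts, temperature in degrees C, SM clock in MHz.
  unsigned int powerMilliwatts, tempCelsius, smClockMHz;
  nvmlDeviceGetPowerUsage(dev, &powerMilliwatts);
  nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempCelsius);
  nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &smClockMHz);

  printf("power = %.1f W, temperature = %u C, SM clock = %u MHz\n",
         powerMilliwatts / 1000.0, tempCelsius, smClockMHz);

  nvmlShutdown();
  return 0;
}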



CUDA Pro Tip: Increase Performance with Vectorized Memory Access

Many CUDA kernels are bandwidth bound, and the increasing ratio of flops to bandwidth in new hardware results in more bandwidth-bound kernels. This makes it very important to take steps to mitigate bandwidth bottlenecks in your code. In this post I will show you how to use vector loads and stores in CUDA C/C++ to help increase bandwidth utilization while decreasing the number of executed instructions.

Let’s begin by looking at the following simple memory copy kernel.

__global__ void device_copy_scalar_kernel(int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Grid-stride loop: each thread copies one int per iteration.
  for (int i = idx; i < N; i += blockDim.x * gridDim.x) {
    d_out[i] = d_in[i];
  }
}

void device_copy_scalar(int* d_in, int* d_out, int N)
{
  int threads = 128;
  // MAX_BLOCKS caps the grid size; it is defined elsewhere in the full example.
  int blocks = min((N + threads - 1) / threads, MAX_BLOCKS);
  device_copy_scalar_kernel<<<blocks, threads>>>(d_in, d_out, N);
}

In this code I am using grid-stride loops, described in an earlier CUDA Pro Tip post. Figure 1 shows the throughput of the kernel in GB/s as a function of copy size.

Figure 1: Copy bandwidth as a function of copy size.
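The rest of the post develops the vectorized version of this copy. As a preview, here is a sketch of the idea (mine, not necessarily the exact kernel from the full post): reinterpreting the pointers as int4 issues one 128-bit load and one 128-bit store per iteration, quartering the number of load/store instructions. It assumes d_in and d_out are 16-byte aligned, which cudaMalloc guarantees.

__global__ void device_copy_vector4_kernel(int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Each iteration copies four ints with a single vectorized load and store.
  for (int i = idx; i < N / 4; i += blockDim.x * gridDim.x) {
    reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<int4*>(d_in)[i];
  }
  // Let the first few threads copy the 0-3 leftover elements when N % 4 != 0.
  int remainder = N % 4;
  if (idx < remainder) {
    int i = N - remainder + idx;
    d_out[i] = d_in[i];
  }
}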


An Efficient Matrix Transpose in CUDA C/C++

My last CUDA C++ post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses.

Matrix Transpose

The code we wish to optimize is a transpose of a matrix of single precision values that operates out-of-place, i.e. the input and output are separate arrays in memory. For simplicity of presentation, we’ll consider only square matrices whose dimensions are integral multiples of 32 on a side. The entire code is available on Github. It consists of several kernels as well as host code that performs typical tasks: allocating and transferring data between host and device, launching and timing the kernels, validating their results, and deallocating host and device memory. In this post I’ll only include the kernel code; you can view the rest or try it out on Github.

In addition to performing several different matrix transposes, we run simple matrix copy kernels because copy performance indicates the performance that we would like the matrix transpose to achieve. For both matrix copy and transpose, the relevant performance metric is effective bandwidth, calculated in GB/s by dividing twice the size in GB of the matrix (once for loading the matrix and once for storing) by time in seconds of execution.
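The core of the optimized transpose is a tile staged in shared memory: a thread block reads a 32 × 32 tile from global memory with coalesced loads and writes it back transposed with coalesced stores. Below is a sketch of that idea (the kernels in the post differ in details such as the thread block shape); the extra padding column avoids shared memory bank conflicts when the tile is read column-wise.

#define TILE_DIM 32

// Sketch: transpose one 32 x 32 tile per thread block of a square matrix
// whose dimension `width` is a multiple of TILE_DIM.
__global__ void transposeTile(float* odata, const float* idata, int width)
{
  __shared__ float tile[TILE_DIM][TILE_DIM + 1];  // +1 avoids bank conflicts

  int x = blockIdx.x * TILE_DIM + threadIdx.x;
  int y = blockIdx.y * TILE_DIM + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = idata[y * width + x];   // coalesced read

  __syncthreads();

  // Swap the block indices so the write is coalesced as well.
  x = blockIdx.y * TILE_DIM + threadIdx.x;
  y = blockIdx.x * TILE_DIM + threadIdx.y;
  odata[y * width + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}

This sketch would be launched with dim3 block(TILE_DIM, TILE_DIM) and dim3 grid(width/TILE_DIM, width/TILE_DIM).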

An Efficient Matrix Transpose in CUDA Fortran

My last CUDA Fortran post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance gains achievable using shared memory. Specifically, I will optimize a matrix transpose to show how to use shared memory to reorder strided global memory accesses into coalesced accesses.

Matrix Transpose

The code we wish to optimize is a transpose of a matrix of single precision values that operates out-of-place, i.e. the input and output are separate arrays in memory. For simplicity of presentation, we’ll consider only square matrices whose dimensions are integral multiples of 32 on a side. The entire code is available on Github. It consists of several kernels as well as host code that performs typical tasks: allocating and transferring data between host and device, launching and timing the kernels, validating their results, and deallocating host and device memory. In this post I’ll only include the kernel code; you can view the rest or try it out on Github.

In addition to performing several different matrix transposes, we run simple matrix copy kernels because copy performance indicates the performance that we would like the matrix transpose to achieve. For both matrix copy and transpose, the relevant performance metric is effective bandwidth, calculated in GB/s by dividing twice the size in GB of the matrix (once for loading the matrix and once for storing) by time in seconds of execution.

CUDA Pro Tip: Flush Denormals with Confidence

I want to keep this post fairly brief, so I will only give minimal background on floating point numbers. If you need a refresher on floating point representation, I recommend starting with the Wikipedia entry on floating point, and for more detail about NVIDIA GPU floating point, check out this excellent white paper. The Wikipedia entry on denormal numbers is a good start for this post, so I’ll begin by paraphrasing it.

What’s a denormal?

Normal floating point values have no leading zeros in the mantissa (or significand): the mantissa is normalized, and any leading zeros are moved into the exponent. Subnormal numbers (or denormal numbers) are floating point numbers where this normalized representation would result in an exponent that is too small to represent. So, unlike normal floating point numbers, subnormal numbers have leading zeros in the mantissa. Representing a value this way sacrifices significant digits, but loses less precision than flushing the mantissa to zero on underflow would. For single precision, for example, the smallest normal magnitude is about 1.18 × 10^-38, while subnormals extend the representable range down to about 1.4 × 10^-45. This allows what is known as “gradual underflow” when a result is very small, and helps avoid catastrophic division-by-zero errors.

Denormal numbers can incur extra computational cost. The Wikipedia entry explains that some platforms implement denormal numbers in software, while others handle them in hardware. On NVIDIA GPUs starting with the Fermi architecture (Compute Capability 2.0 and higher), denormal numbers are handled in hardware for operations that go through the Fused Multiply-Add pipeline (including adds and multiplies), so these operations carry no performance overhead. But multi-instruction sequences such as square root and (notably for my example later in this post) reciprocal square root must do extra work and take a slower path for denormal values (including the cost of detecting them).
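The rest of the post looks at when it is safe to flush denormals to zero and what doing so buys you; the mechanism itself is a compiler option. Here is a minimal sketch (mine, not the post's benchmark) of the kind of code affected, along with the relevant nvcc flag.

// A kernel dominated by single-precision reciprocal square roots. If x[i] is
// denormal, rsqrtf() takes the slower path unless denormals are flushed to zero.
__global__ void inv_sqrt(float* x, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = rsqrtf(x[i]);
}

// Compile with single-precision denormals flushed to zero:
//   nvcc -ftz=true kernel.cu
// (-use_fast_math implies -ftz=true, along with other fast-math settings.)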

How to Implement Performance Metrics in CUDA C/C++

In the first post of this series we looked at the basic elements of CUDA C/C++ by examining a CUDA C/C++ implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA C/C++ codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important.

CUDA performance measurement is most commonly done from host code, and can be implemented using either CPU timers or CUDA-specific timers. Before we jump into these performance measurement techniques, we need to discuss how to synchronize execution between the host and device.

Host-Device Synchronization

Let’s take a look at the data transfers and kernel launch of the SAXPY host code from the previous post:

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

saxpy<<<(N+255)/256, 256>>>(N, 2.0, d_x, d_y);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

The data transfers between the host and device using cudaMemcpy() are synchronous (or blocking) transfers. Synchronous data transfers do not begin until all previously issued CUDA calls have completed, and subsequent CUDA calls cannot begin until the synchronous transfer has completed. Therefore the saxpy kernel launch on the third line will not issue until the transfer from y to d_y on the second line has completed. Kernel launches, on the other hand, are asynchronous. Once the kernel is launched on the third line, control returns immediately to the CPU and does not wait for the kernel to complete. While this might seem to set up a race condition for the device-to-host data transfer in the last line, the blocking nature of the data transfer ensures that the kernel completes before the transfer begins.
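This asynchrony is exactly why naive CPU timing can mislead: a host timer stopped right after the kernel launch measures only the launch overhead. The CUDA-specific timers mentioned above are CUDA events; here is a minimal sketch (reusing the names from the code above) of bracketing the saxpy kernel with events to time it on the device.

// Time the kernel with CUDA events recorded into the default stream.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);

cudaEventSynchronize(stop);   // wait until the kernel and the stop event have completed
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("saxpy time: %f ms\n", milliseconds);

cudaEventDestroy(start);
cudaEventDestroy(stop);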

How to Implement Performance Metrics in CUDA Fortran

In the first post of this series we looked at the basic elements of CUDA Fortran by examining a CUDA Fortran implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA Fortran codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important.

CUDA performance measurement is most commonly done from host code, and can be implemented using either CPU timers or CUDA-specific timers. Before we jump into these performance measurement techniques, we need to discuss how to synchronize execution between the host and device.

Host-Device Synchronization

Let’s take a look at the data transfers and kernel launch of the SAXPY host code from the previous post:

x_d = x 
y_d = y 
call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
y = y_d

The data transfers between the host and device performed by assignment statements are synchronous (or blocking) transfers. Synchronous data transfers do not begin until all previously issued CUDA calls have completed, and subsequent CUDA calls cannot begin until the synchronous transfer has completed. Therefore the saxpy kernel launch on the third line will not be issued until the transfer y_d = y on the second line has completed. Kernel launches, on the other hand, are asynchronous. Once the kernel is launched on the third line, control returns immediately to the CPU and does not wait for the kernel to complete. While this might seem to set up a race condition for the device-to-host data transfer in the last line, the blocking nature of the data transfer ensures that the kernel completes before the transfer begins.