CUDA Pro Tip: Increase Performance with Vectorized Memory Access

Many CUDA kernels are bandwidth bound, and the increasing ratio of flops to bandwidth in new hardware results in more bandwidth bound kernels. This makes it very important to take steps to mitigate bandwidth bottlenecks in your code. In this post, I will show you how to use vector loads and stores in CUDA C/C++ to help increase bandwidth utilization while decreasing the number of executed instructions.

Let’s begin by looking at the following simple memory copy kernel.

__global__ void device_copy_scalar_kernel(int* d_in, int* d_out, int N) { 
  int idx = blockIdx.x * blockDim.x + threadIdx.x; 
  for (int i = idx; i < N; i += blockDim.x * gridDim.x) { 
    d_out[i] = d_in[i]; 

void device_copy_scalar(int* d_in, int* d_out, int N) 
  int threads = 128; 
  int blocks = min((N + threads-1) / threads, MAX_BLOCKS);  
  device_copy_scalar_kernel<<<blocks, threads>>>(d_in, d_out, N); 

In this code, I am using grid-stride loops, described in an earlier CUDA Pro Tip post. Figure 1 shows the throughput of the kernel in GB/s as a function of copy size.

Figure 1: Copy bandwidth as a function of copy size.

We can inspect the assembly for this kernel using the cuobjdump tool included with the CUDA Toolkit.

 %> cuobjdump -sass executable

The SASS for the body of the scalar copy kernel is the following:

/*0058*/ IMAD R6.CC, R0, R9, c[0x0][0x140]                
/*0060*/ IMAD.HI.X R7, R0, R9, c[0x0][0x144]              
/*0068*/ IMAD R4.CC, R0, R9, c[0x0][0x148]               
/*0070*/ LD.E R2, [R6]                                   
/*0078*/ IMAD.HI.X R5, R0, R9, c[0x0][0x14c]              
/*0090*/ ST.E [R4], R2

Here we can see a total of six instructions associated with the copy operation. The four IMAD instructions compute the load and store addresses and the LD.E and ST.E load and store 32 bits from those addresses.

We can improve performance of this operation by using the vectorized load and store instructions LD.E.{64,128} and ST.E.{64,128}. These operations also load and store data but do so in 64- or 128-bit widths. Using vectorized loads reduces the total number of instructions, reduces latency, and improves bandwidth utilization.

The easiest way to use vectorized loads is to use the vector data types defined in the CUDA C/C++ standard headers, such as int2, int4, or float2. You can easily use these types via type casting in C/C++. For example in C++ you can recast the int pointer d_in to an int2 pointer using reinterpret_cast<int2*>(d_in). In C99 you can do the same thing using the casting operator: (int2*(d_in)).

Dereferencing those pointers will cause the compiler to generate the vectorized instructions. However, there is one important caveat: these instructions require aligned data. Device-allocated memory is automatically aligned to a multiple of the size of the data type, but if you offset the pointer the offset must also be aligned. For example reinterpret_cast<int2*>(d_in+1) is invalid because d_in+1 is not aligned to a multiple of sizeof(int2).

You can safely offset arrays if you use an “aligned” offset, as in reinterpret_cast<int2*>(d_in+2). You can also generate vectorized loads using structures as long as the structure is a power of two bytes in size.

struct Foo { int a, int b, double c}; // 16 bytes in size
Foo *a, *b;

Now that we have seen how to generate vectorized instructions let’s modify the memory copy kernel to use vector loads.

__global__ void device_copy_vector2_kernel(int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for (int i = idx; i < N/2; i += blockDim.x * gridDim.x) {
    reinterpret_cast<int2*>(d_out)[i] = reinterpret_cast<int2*>(d_in)[i];

  //process remaining elements
  for(int i = idx + N/2 * 2; i<N; i += blockIdx.x*blockDim.x+threadIdx.x)
   d_out[i] = d_in[i];

void device_copy_vector2(int* d_in, int* d_out, int n) {
  threads = 128; 
  blocks = min((N/2 + threads-1) / threads, MAX_BLOCKS); 

  device_copy_vector2_kernel<<<blocks, threads>>>(d_in, d_out, N);

This kernel has only a few changes. First, the loop now executes only N/2 times because each iteration processes two elements. Second, we use the casting technique described above in the copy. Third, we handle any remaining elements which may arise if N is not divisible by 2. Finally, we launch half as many threads as we did in the scalar kernel.

Inspecting the SASS we see the following.

/*0088*/                IMAD R10.CC, R3, R5, c[0x0][0x140]              
/*0090*/                IMAD.HI.X R11, R3, R5, c[0x0][0x144]            
/*0098*/                IMAD R8.CC, R3, R5, c[0x0][0x148]             
/*00a0*/                LD.E.64 R6, [R10]                                      
/*00a8*/                IMAD.HI.X R9, R3, R5, c[0x0][0x14c]           
/*00c8*/                ST.E.64 [R8], R6

Notice that now the compiler generates LD.E.64 and ST.E.64. All the other instructions are the same. However, it is important to note that there will be half as many instructions executed because the loop only executes N/2 times. This 2x improvement in instruction count is very important in instruction-bound or latency-bound kernels.

We can also write a vector4 version of the copy kernel.

__global__ void device_copy_vector4_kernel(int* d_in, int* d_out, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  for(int i = idx; i < N/4; i += blockDim.x * gridDim.x) {
    reinterpret_cast<int4*>(d_out)[i] = reinterpret_cast<int4*>(d_in)[i];

  //process remaining elements
  for(int i = idx + N/4 * 4; i<N; i += blockIdx.x*blockDim.x+threadIdx.x)
   d_out[i] = d_in[i];

void device_copy_vector4(int* d_in, int* d_out, int N) {
  int threads = 128;
  int blocks = min((N/4 + threads-1) / threads, MAX_BLOCKS);

  device_copy_vector4_kernel<<<blocks, threads>>>(d_in, d_out, N);

The corresponding SASS is the following:

/*0090*/                IMAD R10.CC, R3, R13, c[0x0][0x140]              
/*0098*/                IMAD.HI.X R11, R3, R13, c[0x0][0x144]            
/*00a0*/                IMAD R8.CC, R3, R13, c[0x0][0x148]               
/*00a8*/                LD.E.128 R4, [R10]                               
/*00b0*/                IMAD.HI.X R9, R3, R13, c[0x0][0x14c]             
/*00d0*/                ST.E.128 [R8], R4

Here we can see the generated LD.E.128 and ST.E.128. This version of the code has reduced the instruction count by a factor of 4. You can see the overall performance for all 3 kernels in Figure 2.

Figure 2: Copy bandwidth as a function of copy size for vectorized kernels.

In almost all cases vectorized loads are preferable to scalar loads. Note however that using vectorized loads increases register pressure and reduces overall parallelism. So if you have a kernel that is already register limited or has very low parallelism, you may want to stick to scalar loads. Also, as discussed earlier, if your pointer is not aligned or your data type size in bytes is not a power of two you cannot use vectorized loads.

Vectorized loads are a fundamental CUDA optimization that you should use when possible, because they increase bandwidth, reduce instruction count, and reduce latency. In this post, I’ve shown how you can easily incorporate vectorized loads into existing kernels with relatively few changes.



About Justin Luitjens

Justin Luitjens
Justin Luitjens is a member of the Developer Technology team at NVIDIA where he works on accelerating applications on GPUs. He holds a Ph.D in Scientific Computing from the University of Utah.
  • Anton Anikin

    Thank you for interesting material. But it’s seems to be a mistake in the end of device_copy_vector2_kernel code:

    if (i<n) d_out[i]="d_in[i];" }=""…

    • Mark Harris

      Thanks Anton, I have fixed the formatting error.

      • tjhb

        The section above that (process remaining elements) dangles in mid air…

        • Mark Harris

          This is fixed now too. Thanks!

  • tjhb

    “Dereferencing those pointers will cause the compiler to generate the vectorized instructions.”

    I wonder where I can look for more on the conditions under which the compiler will take the initiative in this way, both for memory access and for local instructions? It seems a bit magical to a newbie.

    Another way of putting it: I want to understand *why* casting to a pointer to type int2 triggers automatic vectorization. (Happy to do my own reading, if someone would be kind enough to give helpful indirection.)

    • Mark Harris

      The compiler cannot guarantee that the pointer is int2 or int4 aligned and thus cannot generate vectorized loads in the absence of additional information. By casting to a vector type the user is giving the compiler more information by explicitly saying it is aligned. Of course if the user is wrong about the alignment the code is likely to produce a memory access fault. The Maximum vectorized load/store is 16 bytes. Depending on the size of the struct the compiler can do some interesting things, for example on 24-byte structs it could load 8,8,8 bytes but not 16,8 (since the 16 would only be aligned every second struct element).

      • Tim Baigent

        Dr Harris,

        Thank you for your helpful simplified explanation. I confess that for me… well, I just (confess *) and leave it at that. You can tell the depth of my ignorance below.

        Is the cast here not doing something better done by a declaration (and function definition), issuing a contract to the compiler, against which it could check at compile time, and which would therefore be more robust than a named cast, which can only be checked at run time?

        Here, why not

        __global__ void device_copy_vector2_kernel(int2 * d_in, int2 * d_out, int N) {


        (Assuming I had previously initialized the referent of each struct argument using the __align__(16) specifier?)

        What is gained by passing in a plain int *, then doing a reinterpret_cast?

        I may only be showing that I fundamentally don’t understand C++ well enough to digest this Pro Tip. And this may not be a great forum for these extra questions. Perhaps I should shift them to the CUDA forums.


        • Tim Baigent

          [The penny has dropped. Thanks.]

  • kuba jacobs

    quick question: how can I load four integers into registers (an array of four) which can be addressed by an index and not .x, .y, .z, .w?

  • Bulat Ziganshin

    >if(i < N)

    this statement in the second vectorized variant should be replaced with for(;i<N;i++)

    since you may need to copy up to 3 remaining items

  • Andrew Chap

    Is there a compute capability requirement? I’m on a Tesla C2070 with compute capability 2.0 and I don’t see any LD.64 when I look at the SASS for device_copy_vector2_kernel, I just see regular LD’s

    • Mark Harris

      The examples here use Kepler, which has a different SASS ISA than Fermi (CC 2.x). However, the approach of vectorizing loads and stores should be valuable on Fermi GPUs as well.