cuda_pro_tip

CUDA Pro Tip: Optimize for Pointer Aliasing

Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application performance.

What is pointer aliasing?

Two pointers alias if the memory to which they point overlaps. When a compiler can’t determine whether pointers alias, it has to assume that they do. The following simple function shows why this is potentially harmful to performance:

void example1(float *a, float *b, float *c, int i) {
  a[i] = a[i] + c[i];
  b[i] = b[i] + c[i];
}

At first glance it might seem that this function needs to perform three load operations from memory: one for a[i], one for b[i] and one for c[i]. This is incorrect because it assumes that c[i] can be reused once it is loaded. Consider the case where a and c point to the same address. In this case the first line modifies the value c[i] when writing to a[i]. Therefore the compiler must generate code to reload c[i] on the second line, in case it has been modified.

Because the compiler must conservatively assume the pointers alias, it will compile the above code inefficiently, even if the programmer knows that the pointers never alias.

What can I do about aliasing?

Fortunately almost all C/C++ compilers offer a way for the programmer to give the compiler information about pointer aliasing. The C99 standard includes the keyword restrict for use in C. In C++ there is no standard keyword, but most compilers allow the keywords __restrict__ or __restrict to be used for the same purpose as restrict in C.

By giving a pointer the restrict property, the programmer is promising the compiler that any data written to through that pointer is not read by any other pointer with the restrict property. In other words, the compiler doesn’t have to worry that a write to a restrict pointer will cause a value read from another restrict pointer to change. This greatly helps the compiler optimize code.

To show the performance benefits of restrict-decorated pointers, consider the following function:

void example2a(float *a, float *b, float *c) {
  for (int i = 0; i < 1024; i++) {
    a[i] = 0.0f;
    b[i] = 0.0f;
    for (int j = 0; j < 1024; j++) {
      a[i] = a[i] + c[i*1024 + j];
      b[i] = b[i] + c[i*1024 + j] * c[i*1024 + j];
    }
  }
}

This function is similar to our original example and, as before, the compiler generates sub-optimal code to ensure that it works with aliased pointers. Because the compiler must assume that a[i] and b[i] overlap, it must both read and write them every iteration of the inner loop.

If we know at compile time that our three pointers are not used to access overlapping regions, we can add __restrict__ to our pointers. Now the compiler knows that a[i] and b[i] cannot overlap, so it can optimize the inner loop by storing the running sum in a local variable and only writing it once at the end.

void example2b(float * __restrict__ a, float * __restrict__ b, float * __restrict__ c) {
  for (int i = 0; i < 1024; i++) {
    a[i] = 0.0f;
    b[i] = 0.0f;
    for (int j = 0; j < 1024; j++) {
      a[i] = a[i] + c[i*1024 + j];
      b[i] = b[i] + c[i*1024 + j] * c[i*1024 + j];
    }
  }
}

Timing these two functions:

Original (example2a): 3.13ms
Restrict (example2b): 1.05ms
Average timings over 10,000 calls, gcc 4.8.2, Intel® Xeon® CPU E5-2690 v2 @ 3.00GHz.

Just adding __restrict__ in this case produces 3x faster code! I could have achieved the same result by introducing local summation variables myself, but in real-world situations allowing the compiler to do this optimization is often easier.

Wait, where’s the CUDA?

I haven’t talked about GPUs or CUDA at all so far. This is because pointer aliasing is something developers of high-performance code need to be aware of on both the GPU and the CPU and, as demonstrated above, proper use can significantly improve performance.

There is, however, one potential GPU-specific benefit to __restrict__. Compute Capability 3.5 NVIDIA GPUs (e.g. Kepler) have a cache designed for read-only data which can, for some codes, improve data access performance. This cache can only be used for data that is read-only for the lifetime of the kernel. To use the read-only data cache, the compiler must determine that data is never written. Due to potential aliasing, the compiler can’t be sure a pointer references read-only data unless the pointer is marked with both const and __restrict__. Also, as the Kepler Tuning Guide points out, “adding these qualifiers where applicable can improve code generation quality via other mechanisms on earlier GPUs as well.”

In the following code I copy elements of array a into array b. These elements are chosen by reading an index in array c, which is initialized with random integers between 0 and the array length.

__global__ void example3a(float* a, float* b, int* c) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  b[index] = a[c[index]];
}

Note that in this case there are no redundant memory accesses due to potential pointer aliasing. Each thread reads one element of c and a and writes one element of b. However, because both a and c are read-only, and I know that the data does not overlap, I can add const and __restrict__ to the above code.

__global__ void example3b(const float* __restrict__ a, float* __restrict__ b, const int*  __restrict__ c) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  b[index] = a[c[index]];
}

This extra information allows the CUDA compiler to use the read-only data cache and improves performance by more than 2x.

Original (example3a): 47.6μs
Restrict (example3b): 22.5μs
Average timings over 10,000 calls, 256,000 threads, CUDA 6.5 RC, NVIDIA Kepler K40.

Conclusion

It’s important to understand pointer aliasing when writing code where every clock cycle counts. While you can sometimes explicitly write around performance problems caused by potential aliasing, using the __restrict__ keyword allows the compiler to do much of the work for you. It also allows the use of the GPU read-only data cache, potentially accelerating data movement to your kernel.

As with most code-level optimizations, your mileage may vary. Always profile your code and try to determine the bottlenecks and how far it is from hardware performance limits before spending too much time trying to optimize.

∥∀

About Jeremy Appleyard

Jeremy Appleyard
Jeremy Appleyard is a developer in NVIDIA's European Developer Technology team. Based near Oxford, England, he works with developers accelerating applications on GPUs. He holds a Ph.D in computational fluid dynamics from Cranfield University.
  • Joseph Pingenot

    I do love the restrict keyword (and const, despite being told in a previous life that you didn’t really need the const keyword). The Restrict Contract is an amusing way of putting down what the keyword means. A revised version was on my door for a while: http://cellperformance.beyond3d.com/articles/2006/05/demystifying-the-restrict-keyword.html

  • Christian Ahlstrand

    Why use pointers anyway if L2 is big enough?

  • ponce

    Is there a pragma to specify a loop has no loop-dependencies instead?

    • http://www.markmark.net/ Mark Harris

      Pointer aliasing and loop dependencies are two different, but related issues. Even if there are no loop dependencies, pointers may alias, and vice versa.

  • D_J_B

    When you write

    “By giving a pointer the restrict property, the programmer is promising the compiler that any data accessed through that pointer is not accessed in any other way. In other words, the compiler doesn’t have to worry about aliasing when using a pointer with the restrict property.”

    This is not true.

    Here is a trivial example:

    void example1(restrict float *a, restrict float *b, float *c, int i) {
    c[i] = a[i] + b[i];
    }

    In this example, a and b may alias (and a variant of this example is given in the C99 standard as how two restricted variables may still alias)

    C99/C11 require that the restricted storage be modified in the restrict block for any guarantee to hold.
    In fact, there are all sorts of messy/strange requirements on restrict that make it almost impossible for users to reason about whether it will still make their pointers alias. This is one reason that C++ does not have restrict in the standard yet. After implementing conforming support in GCC/LLVM, the number of issues where the compiler could not do what the user wanted, and the user became seriously confused as to why two restricted pointers were thought aliasing, made them sit down and think about whether it was really the right model.

    • http://www.markmark.net/ Mark Harris

      I’ve updated those two sentences to hopefully be true. The fact is that using __restrict does help performance in both the CPU and GPU examples demonstrated here, on two different compilers. Unlike the example you give, our examples all write to a restricted pointer.

  • Tom Szelag

    I’ve been trying to understand pointer aliasing and performance, and am glad I came across this article. However, tried compiling example1 with and without __restrict from Visual Studio (Microsoft Optimizing Compiler Version 18.00.30723.0). Admittedly I’m very new to attempting to read x86 assembly, but here’s what I get (both are in release configuration for optimization).

    In both cases we have:
    _TEXT SEGMENT
    _a$ = 8 ; size = 4
    _b$ = 12 ; size = 4
    _c$ = 16 ; size = 4
    _i$ = 20 ; size = 4

    Original
    ; 3 : void example1(float* a, float* b, float* c, int i) {

    push ebp
    mov ebp, esp

    ; 4 : a[i] = a[i] + c[i];

    mov edx, DWORD PTR _i$[ebp]
    mov ecx, DWORD PTR _c$[ebp]
    mov eax, DWORD PTR _a$[ebp]
    movss xmm0, DWORD PTR [ecx+edx*4]
    addss xmm0, DWORD PTR [eax+edx*4]
    movss DWORD PTR [eax+edx*4], xmm0

    ; 5 : b[i] = b[i] + c[i];

    mov eax, DWORD PTR _b$[ebp]
    movss xmm0, DWORD PTR [ecx+edx*4]
    addss xmm0, DWORD PTR [eax+edx*4]
    movss DWORD PTR [eax+edx*4], xmm0

    ; 6 : }

    pop ebp
    ret 0

    __restrict
    ; 3 : void example1(float* __restrict a, float* __restrict b, float* __restrict c, int i) {

    push ebp
    mov ebp, esp

    ; 4 : a[i] = a[i] + c[i];

    mov edx, DWORD PTR _i$[ebp]
    mov eax, DWORD PTR _c$[ebp]
    mov ecx, DWORD PTR _a$[ebp]
    movss xmm1, DWORD PTR [eax+edx*4]

    ; 5 : b[i] = b[i] + c[i];

    mov eax, DWORD PTR _b$[ebp]
    movss xmm0, DWORD PTR [ecx+edx*4]
    addss xmm0, xmm1
    addss xmm1, DWORD PTR [eax+edx*4]
    movss DWORD PTR [ecx+edx*4], xmm0
    movss DWORD PTR [eax+edx*4], xmm1

    ; 6 : }

    pop ebp
    ret 0

    In either case there are the same total operations (1 push, 1 pop, 5 mov, 4 movss, 2 addss), just in a slightly different order. Not obvious to me how one is advantageous compared to the other.

    • http://www.markmark.net/ Mark Harris

      Did you compare wall clock run time?

      • Tom Szelag

        I did not. The blog post implies that pointer aliasing will result in inefficient machine code because additional load operations are necessary (one additional operation in example1, if I read it correctly). However, the assembly code I’m seeing doesn’t support that. That’s more what I’m trying to understand – why or why aren’t there additional loads, rather than if a re-ordering of the same operations is any different.

        In any event it would be really insightful if this blog entry included some assembly language examples to really drive the point home.

        Thanks for the reply.

        • Jeremy Appleyard

          It’s worth remembering that the compiler is free to optimize however it likes, and additional information may not be used. The actual instructions generated are going to depend on both your compiler and your target architecture.

          For example: if I compile the first example for a compute capability 3.5 GPU using nvcc on linux, I get the following SASS assembler (viewed using: cuobjdump –dump-sass):

          Without restrict:

          MOV R1, c[0x0][0x44];
          MOV R9, c[0x0][0x158];
          MOV32I R10, 0×4;
          IMAD.U32.U32 R4.CC, R9, R10, c[0x0][0x140];
          IMAD.HI.X R5, R9, R10, c[0x0][0x144];
          IMAD.U32.U32 R6.CC, R9, R10, c[0x0][0x150];
          LD.E R3, [R4];

          IMAD.HI.X R7, R9, R10, c[0x0][0x154];
          LD.E R0, [R6];
          IMAD.U32.U32 R8.CC, R9, R10, c[0x0][0x148];
          IMAD.HI.X R9, R9, R10, c[0x0][0x14c];
          FADD R2, R3, R0;
          ST.E [R4], R2;
          LD.E R0, [R6];

          LD.E R3, [R8];
          FADD R0, R3, R0;
          ST.E [R8], R0;

          With restrict:

          MOV R1, c[0x0][0x44];
          MOV R7, c[0x0][0x158];
          MOV32I R8, 0×4;
          IMAD.U32.U32 R2.CC, R7, R8, c[0x0][0x150];
          IMAD.HI.X R3, R7, R8, c[0x0][0x154];
          LDG.E R3, [R2];
          IMAD.U32.U32 R4.CC, R7, R8, c[0x0][0x140];

          IMAD.HI.X R5, R7, R8, c[0x0][0x144];
          LD.E R0, [R4];
          IMAD.U32.U32 R6.CC, R7, R8, c[0x0][0x148];
          IMAD.HI.X R7, R7, R8, c[0x0][0x14c];
          TEXDEPBAR 0×0;
          FADD R2, R0, R3;
          ST.E [R4], R2;

          LD.E R0, [R6];
          FADD R0, R0, R3;
          ST.E [R6], R0;

          —–

          In this case the compiler generates four loads in the non-restrict version, and two normal loads and one texture cached load in the restrict version. The latter is clearly preferred.