An Even Easier Introduction to CUDA

This post is a super simple introduction to CUDA, the popular parallel computing platform and programming model from NVIDIA. I wrote a previous “Easy Introduction” to CUDA in 2013 that has been very popular over the years. But CUDA programming has gotten easier, and GPUs have gotten much faster, so it’s time for an updated (and even easier) introduction.

CUDA C++ is just one of the ways you can create massively parallel applications with CUDA. It lets you use the powerful C++ programming language to develop high performance algorithms accelerated by thousands of parallel threads running on GPUs. Many developers have accelerated their computation- and bandwidth-hungry applications this way, including the libraries and frameworks that underpin the ongoing revolution in artificial intelligence known as Deep Learning.

So, you’ve heard about CUDA and you are interested in learning how to use it in your own applications. If you are a C or C++ programmer, this blog post should give you a good start. To follow along, you’ll need a computer with a CUDA-capable GPU (Windows, Mac, or Linux, and any NVIDIA GPU should do), or a cloud instance with GPUs (AWS, Azure, IBM SoftLayer, and other cloud service providers have them). You’ll also need the free CUDA Toolkit installed.

Let’s get started!

Starting Simple

We’ll start with a simple C++ program that adds the elements of two arrays with a million elements each.

#include <iostream>
#include <math.h>

// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20; // 1M elements

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

First, compile and run this C++ program. Put the code above in a file and save it as add.cpp, and then compile it with your C++ compiler. I’m on a Mac so I’m using clang++, but you can use g++ on Linux or MSVC on Windows.

> clang++ add.cpp -o add

Then run it:

> ./add
 Max error: 0.000000

(On Windows you may want to name the executable add.exe and run it with .\add.)

As expected, it prints that there was no error in the summation and then exits. Now I want to get this computation running (in parallel) on the many cores of a GPU. It’s actually pretty easy to take the first steps.

First, I just have to turn our add function into a function that the GPU can run, called a kernel in CUDA. To do this, all I have to do is add the specifier __global__ to the function, which tells the CUDA C++ compiler that this is a function that runs on the GPU and can be called from CPU code.

// CUDA Kernel function to add the elements of two arrays on the GPU
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}

These __global__ functions are known as kernels, and code that runs on the GPU is often called device code, while code that runs on the CPU is host code.

Memory Allocation in CUDA

To compute on the GPU, I need to allocate memory accessible by the GPU. Unified Memory in CUDA makes this easy by providing a single memory space accessible by all GPUs and CPUs in your system. To allocate data in unified memory, call cudaMallocManaged(), which returns a pointer that you can access from host (CPU) code or device (GPU) code. To free the data, just pass the pointer to cudaFree().

I just need to replace the calls to new in the code above with calls to cudaMallocManaged(), and replace calls to delete [] with calls to cudaFree.

  // Allocate Unified Memory -- accessible from CPU or GPU
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  ...

  // Free memory
  cudaFree(x);
  cudaFree(y);

Finally, I need to launch the add() kernel, which invokes it on the GPU. CUDA kernel launches are specified using the triple angle bracket syntax <<< >>>, added to the call to add() just before its parameter list.

add<<<1, 1>>>(N, x, y);

Easy! I’ll get into the details of what goes inside the angle brackets soon; for now all you need to know is that this line launches one GPU thread to run add().

Just one more thing: I need the CPU to wait until the kernel is done before it accesses the results (because CUDA kernel launches don’t block the calling CPU thread). To do this I just call cudaDeviceSynchronize() before doing the final error checking on the CPU.

Here’s the complete code:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

CUDA files have the file extension .cu. So save this code in a file called add.cu and compile it with nvcc, the CUDA C++ compiler.

> nvcc add.cu -o add_cuda
> ./add_cuda
Max error: 0.000000

This is only a first step, because as written, this kernel is only correct for a single thread, since every thread that runs it will perform the add on the whole array. Moreover, there is a race condition, since multiple parallel threads would read and write the same locations.

Note: on Windows, you need to make sure you set Platform to x64 in the Configuration Properties for your project in Microsoft Visual Studio.

Profile it!

I think the simplest way to find out how long the kernel takes to run is to run it with nvprof, the command line GPU profiler that comes with the CUDA Toolkit. Just type nvprof ./add_cuda on the command line:

$ nvprof ./add_cuda
==3355== NVPROF is profiling process 3355, command: ./add_cuda
Max error: 0
==3355== Profiling application: ./add_cuda
==3355== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  463.25ms         1  463.25ms  463.25ms  463.25ms  add(int, float*, float*)
...

Above is the truncated output from nvprof, showing a single call to add. It takes about half a second on an NVIDIA Tesla K80 accelerator, and about the same time on the NVIDIA GeForce GT 750M in my 3-year-old MacBook Pro.

Let’s make it faster with parallelism.

Picking up the Threads

Now that you’ve run a kernel with one thread that does some computation, how do you make it parallel? The key is in CUDA’s <<<1, 1>>> syntax. This is called the execution configuration, and it tells the CUDA runtime how many parallel threads to use for the launch on the GPU. There are two parameters here, but let’s start by changing the second one: the number of threads in a thread block. CUDA GPUs run kernels using blocks of threads that are a multiple of 32 in size, so 256 threads is a reasonable size to choose.

add<<<1, 256>>>(N, x, y);

If I run the code with only this change, it will do the computation once per thread, rather than spreading the computation across the parallel threads. To do it properly, I need to modify the kernel. CUDA C++ provides keywords that let kernels get the indices of the running threads. Specifically, threadIdx.x contains the index of the current thread within its block, and blockDim.x contains the number of threads in the block. I’ll just modify the loop to stride through the array with parallel threads.

__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

The add function hasn’t changed that much. In fact, setting index to 0 and stride to 1 makes it semantically identical to the first version.

Save the file as add_block.cu and compile and run it in nvprof again. For the remainder of the post I’ll just show the relevant line from the output.

Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  2.7107ms         1  2.7107ms  2.7107ms  2.7107ms  add(int, float*, float*)

That’s a big speedup (463ms down to 2.7ms), but not surprising since I went from 1 thread to 256 threads. The K80 is faster than my little MacBook Pro GPU (at 3.2ms). Let’s keep going to get even more performance.

Out of the Blocks

CUDA GPUs have many parallel processors grouped into Streaming Multiprocessors, or SMs. Each SM can run multiple concurrent thread blocks. As an example, a Tesla P100 GPU based on the Pascal GPU Architecture has 56 SMs, each capable of supporting up to 2048 active threads. To take full advantage of all these threads, I should launch the kernel with multiple thread blocks.

By now you may have guessed that the first parameter of the execution configuration specifies the number of thread blocks. Together, the blocks of parallel threads make up what is known as the grid. Since I have N elements to process, and 256 threads per block, I just need to calculate the number of blocks to get at least N threads. I simply divide N by the block size (being careful to round up in case N is not a multiple of blockSize).

int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);
Figure 1: The CUDA parallel thread hierarchy. CUDA executes kernels using a grid of blocks of threads. This figure shows the common indexing pattern used in CUDA programs using the CUDA keywords gridDim.x (the number of thread blocks), blockDim.x (the number of threads in each block), blockIdx.x (the index of the current block within the grid), and threadIdx.x (the index of the current thread within the block).

I also need to update the kernel code to take into account the entire grid of thread blocks. CUDA provides gridDim.x, which contains the number of blocks in the grid, and blockIdx.x, which contains the index of the current thread block in the grid. Figure 1 illustrates the approach to indexing into an array (one-dimensional) in CUDA using blockDim.x, gridDim.x, and threadIdx.x. The idea is that each thread gets its index by computing the offset to the beginning of its block (the block index times the block size: blockIdx.x * blockDim.x) and adding the thread’s index within the block (threadIdx.x). The code blockIdx.x * blockDim.x + threadIdx.x is idiomatic CUDA.

__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

The updated kernel also sets stride to the total number of threads in the grid (blockDim.x * gridDim.x). This type of loop in a CUDA kernel is often called a grid-stride loop.
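As an aside, you may also see a “monolithic” version of this kind of kernel that has no loop at all. Here’s a quick sketch of what that looks like (not used in the rest of this post); it assumes the launch supplies at least one thread per array element, so each thread just guards against running off the end of the array:

__global__
void add(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)   // extra threads in the last block simply do nothing
    y[i] = x[i] + y[i];
}

We’ll stick with the grid-stride loop version, because it stays correct no matter how many blocks and threads you launch.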

Save the file as add_grid.cu and compile and run it in nvprof again.

Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  94.015us         1  94.015us  94.015us  94.015us  add(int, float*, float*)

That’s another 28x speedup, from running multiple blocks on all the SMs of a K80! We’re only using one of the 2 GPUs on the K80, but each GPU has 13 SMs. Note the GeForce in my laptop has 2 (weaker) SMs and it takes 680us to run the kernel.

Summing Up

Here’s a rundown of the performance of the three versions of the add() kernel on the Tesla K80 and the GeForce GT 750M.

Version             Laptop (GeForce GT 750M)      Server (Tesla K80)
                    Time       Bandwidth          Time       Bandwidth
1 CUDA Thread       411ms      30.6 MB/s          463ms      27.2 MB/s
1 CUDA Block        3.2ms      3.9 GB/s           2.7ms      4.7 GB/s
Many CUDA Blocks    0.68ms     18.5 GB/s          0.094ms    134 GB/s
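If you want to check the bandwidth numbers yourself: the add kernel reads x[i] and y[i] and writes y[i], so it moves 3 × 4 bytes per element, or about 12.6 MB for 2^20 floats. Dividing that by the kernel time gives the effective bandwidth; for the many-blocks run on the K80, 12.6 MB / 94 µs is roughly 134 GB/s.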

As you can see, we can achieve very high bandwidth on GPUs. The computation in this post is very bandwidth-bound, but GPUs also excel at heavily compute-bound computations such as dense matrix linear algebra, deep learning, image and signal processing, physical simulations, and more.

Exercises

To keep you going, here are a few things to try on your own. Please post about your experience in the comments section below.

  1. Browse the CUDA Toolkit documentation. If you haven’t installed CUDA yet, check out the Quick Start Guide and the installation guides. Then browse the Programming Guide and the Best Practices Guide. There are also tuning guides for various architectures.
  2. Experiment with printf() inside the kernel. Try printing out the values of threadIdx.x and blockIdx.x for some or all of the threads. Do they print in sequential order? Why or why not? (There is a minimal sketch to get you started just after this list.)
  3. Print the value of threadIdx.y or threadIdx.z (or blockIdx.y) in the kernel. (Likewise for blockDim and gridDim). Why do these exist? How do you get them to take on values other than 0 (1 for the dims)?
  4. If you have access to a Pascal-based GPU, try running add_grid.cu on it. Is performance better or worse than the K80 results? Why? (Hint: read about Pascal’s Page Migration Engine and the CUDA 8 Unified Memory API.) For a detailed answer to this question, see the post Unified Memory for CUDA Beginners.
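Here’s a minimal sketch to get you started on Exercise 2 (a toy whoami kernel, made up just for illustration; device-side printf works on any GPU new enough to support Unified Memory):

#include <cstdio>

// Each thread prints its own block and thread index
__global__
void whoami(void)
{
  printf("Hello from block %d, thread %d\n", (int)blockIdx.x, (int)threadIdx.x);
}

int main(void)
{
  whoami<<<2, 4>>>();        // 2 blocks of 4 threads each
  cudaDeviceSynchronize();   // wait for the GPU and flush the printf output
  return 0;
}

Compile it with nvcc as before and see whether the messages come out in block and thread order.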

Where To From Here?

I hope that this post has whetted your appetite for CUDA and that you are interested in learning more and applying CUDA C++ in your own computations. If you have questions or comments, don’t hesitate to reach out using the comments section below.

I plan to follow up this post with further CUDA programming material, but to keep you busy for now, there is a whole series of older introductory posts that you can continue with (and that I plan on updating / replacing in the future as needed).

There is also a series of CUDA Fortran posts mirroring the above, starting with An Easy Introduction to CUDA Fortran.

You might also be interested in signing up for the online course on CUDA programming from Udacity and NVIDIA.

There is a wealth of other content on CUDA C++ and other GPU computing topics here on the NVIDIA Parallel Forall developer blog, so look around!

96 Comments
  • Kirk

    Thanks for the post Mark. To get accurate profiling, is it still a good idea to put cudaDeviceReset() just prior to exiting? https://devblogs.nvidia.com/parallelforall/pro-tip-clean-up-after-yourself-ensure-correct-profiling/

    Also, is it possible to get this level of timing via code? cudaEventElapsedTime does not seem to have this same level of precision.

    • I think the Windows tools are more dependent on cudaDeviceReset(). I kept it out of this post to keep things simple. cudaEventElapsedTime() should have the same level of precision, but in more complex apps you may get things in your timing that you didn’t intend.

      I believe the most reliable way to accurately time is to run your kernel many times in a loop, followed by cudaDeviceSynchronize() or cudaStreamSynchronize(), and use a high precision CPU timer (like std::chrono) to wrap the whole loop and the sync. Then divide by the number of iterations.
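      In rough outline, that approach looks something like this (a sketch using the add_grid kernel and launch parameters from the post):

      #include <chrono>
      #include <iostream>

      __global__
      void add(int n, float *x, float *y)
      {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = blockDim.x * gridDim.x;
        for (int i = index; i < n; i += stride)
          y[i] = x[i] + y[i];
      }

      int main(void)
      {
        int N = 1<<20;
        float *x, *y;
        cudaMallocManaged(&x, N*sizeof(float));
        cudaMallocManaged(&y, N*sizeof(float));
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        int blockSize = 256;
        int numBlocks = (N + blockSize - 1) / blockSize;

        add<<<numBlocks, blockSize>>>(N, x, y);   // warm-up launch so later runs don't pay one-time costs
        cudaDeviceSynchronize();

        const int iterations = 100;
        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < iterations; i++)
          add<<<numBlocks, blockSize>>>(N, x, y);
        cudaDeviceSynchronize();                  // the sync is inside the timed region
        auto stop = std::chrono::high_resolution_clock::now();

        double ms = std::chrono::duration<double, std::milli>(stop - start).count();
        std::cout << "Average kernel time: " << ms / iterations << " ms" << std::endl;

        cudaFree(x);
        cudaFree(y);
        return 0;
      }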

  • Mike Agius

    Thanks Mark. I tried this and it did not work as a straight copy and paste. The cudaMallocManaged did not appear to do anything. This is on a Titan X with up to date drivers and NSight. I replaced the cudaMallocManaged functionality with the relevant cudaMalloc and cudaMemcpy, which sorted it. Am I missing something wrt cudaMallocManaged?

    • What do you mean did not appear to do anything? Did you get an error? Incorrect results? What CUDA version do you have installed? Is it an “NVIDIA Titan X” (Pascal) or “GeForce GTX Titan X” (Maxwell)?

      • Mike Agius

        CUDA 8 latest. It’s the Pascal card. After executing the cudaMallocManaged function the variable pointed to address 0x0. I’ll put error checking into your original code in the morning and try to get more diagnostic information.

        • Mark Harris

          Thanks. If there’s a bug I would like to fix it.

        • Kshitij Shah

          I think I had the same problem. Var pointed to ‘0x0’ and it was not accessible by the host. However, it was accessible by the device. I know it works the old way, using separate vars for host and device. But it would be good if we can make it work without all those memCpy calls, like this example does. Tell me if you have any luck.

          • Did you guys change the program at all? If you share your changes I can try to diagnose.

          • Mike Agius

            I just put the unchanged original on a box with a GT 755M and it fails similarly. What have I done wrong?
            :::

            #include "cuda_runtime.h"
            #include "device_launch_parameters.h"

            #include <iostream>
            #include <math.h>
            // Kernel function to add the elements of two arrays
            __global__
            void add(int n, float *x, float *y)
            {
              for (int i = 0; i < n; i++)
                y[i] = x[i] + y[i];
            }

            int main(void)
            {
              int N = 1 << 20;
              float *x, *y;

              // Allocate Unified Memory -- accessible from CPU or GPU
              cudaMallocManaged(&x, N*sizeof(float));
              cudaMallocManaged(&y, N*sizeof(float));

              // initialize x and y arrays on the host
              for (int i = 0; i < N; i++) {
                x[i] = 1.0f;
                y[i] = 2.0f;
              }

              // Run kernel on 1M elements on the GPU
              add<<<1, 1>>>(N, x, y);

              // Wait for GPU to finish before accessing on host
              cudaDeviceSynchronize();

              // Check for errors (all values should be 3.0f)
              float maxError = 0.0f;
              for (int i = 0; i < N; i++)
                maxError = fmax(maxError, fabs(y[i] - 3.0f));
              std::cout << "Max error: " << maxError << std::endl;

              // Free memory
              cudaFree(x);
              cudaFree(y);

              return 0;
            }

          • Mark Harris

            Looks like you added includes (which are automatically included for .cu files so not needed).

          • Mike Agius

            It is not sufficient to change the target to x64 in the CUDA properties and Active(x64) in the solution properties – you must change it in the solution configuration manager. It then execs fine. The error is simply a ‘not supported’ one.
            Thanks Mark – suspected it was my bad.

          • Mike Agius

            You need to change the solution properties to x64 in the Configuration Manager. Changing it in the CUDA props and in the Active platform dropdown doesn’t get it done.

          • Mark Harris

            Weird, didn’t know x64 is not the default on Windows! I’m a Mac user. I will update the post.

          • Kshitij Shah

            Thanks, that worked !

          • any chance you can post a screenshot of this, @mike_agius:disqus? Thanks!

          • Mike Agius
          • Moose McFearson

            This fixed it for me! Thanks!

  • Mike Agius

    It’s a really good post btw. Thanks – I have learned much.

  • Kshitij Shah

    I cannot access variables assigned using ‘cudaMallocManaged’ on the host. It throws a ‘0xC0000005: Access violation writing location 0x00000000’ error. I can access them fine in the device kernel. Am I missing something? I am using MS Visual Studio Community 2015 with CUDA Runtime 8.0 on a GTX 1070.

    • Mark Harris

      Can you post your changes to the code?

    • Did you change the code, or are you getting the error in the initialization loop?

      • Kshitij Shah

        I didn’t change the code at all. Just copied it to Visual Studio. Yes, I got the error on the initialization loop. So I put the initialization in the kernel. Got an error at verification as expected. Removed verification and it ran fine. From this I concluded that it’s not accessible by the host. Furthermore, in the VS debugging watch it shows it the same way as a var allocated by cudaMalloc, some “can’t read” or something indicating that it’s not accessible to the CPU as I understand. And the address is 0x0. Sorry for the long response. I really appreciate your help.

        • Cryptopone

          Make sure when you compile and run your program you select x64 from the drop down for the Solution Platform (At the top of the screen under the Debug menu item).

          When I created a new project to go through this example it set the Solution Platform to x86 which was causing the error for me.

  • Bhairav Pardiwala

    I have used this tech way back and am quite pleased with the results.

    The speedup is so addictive that I swear by my program.

    I encourage all to give CUDA a must-try to solve problems, even if it requires NVIDIA-specific hardware.

    This is my code, developed when I was a grad student:

    http://bit.ly/2ko6rNb

    • You’re a god. That CUDA-based file indexer+searcher+browser+parser has just made Windows Explorer Search Obsolete for me and I have long since longed for this moment. Thank you.

      • Thanks for the comment.
        Pls keep in mind to keep refreshing the index every so often, as the program searches on the basis of a snapshot taken of the hard drive.

  • zmchang

    Do you have a similar post for Python?

  • Nghi

    Hello,
    I profiled the add_grid code on a Titan X GPU (latest one), and only got about 3.5ms for the add() function. I don’t know what is wrong. The output of nvcc version is release 8.0, V8.0.44. And my code is:

    #include <iostream>
    #include <math.h>
    // Kernel function to add the elements of two arrays
    __global__
    void add(int n, float *x, float *y)
    {
      int index = blockIdx.x * blockDim.x + threadIdx.x;
      int stride = blockDim.x * gridDim.x;
      for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
    }

    int main(void)
    {
      int N = 1<<20;
      float *x, *y;
      int blockSize = 8*128;
      int numBlocks = (N + blockSize - 1) / blockSize;

      // Allocate Unified Memory -- accessible from CPU or GPU
      cudaMallocManaged(&x, N*sizeof(float));
      cudaMallocManaged(&y, N*sizeof(float));

      // initialize x and y arrays on the host
      for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
      }

      // Run kernel on 1M elements on the GPU
      add<<<numBlocks, blockSize>>>(N, x, y);

      // Wait for GPU to finish before accessing on host
      cudaDeviceSynchronize();

      // Check for errors (all values should be 3.0f)
      float maxError = 0.0f;
      for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
      std::cout << "Max error: " << maxError << std::endl;

      // Free memory
      cudaFree(x);
      cudaFree(y);

      return 0;
    }

  • Jonathan Joseph

    Mark,
    I’m a CUDA noobie and felt some intimidation entering the world of GPU processing. This introduction was quite helpful for me. I ran the examples on a p2.xlarge instance (1 Tesla K80 GPU) under Centos 7 on Amazon’s EC2, after installing the CUDA 8 Toolkit. All the examples ran as expected without any issues.

    The next day it occurred to me that I didn’t fully understand and appreciate the ‘Grid Size Loop’ in the final example that takes advantage of multiple SMs. That’s because I didn’t read the article on the Grid-Stride Loop. I couldn’t figure out why there was a ‘for’ loop, rather than the monolithic kernel (with just the if condition). Your post on the Grid-Stride Loop provided the insight I needed.

    To sum up:
    I had a lot of fun reading your article and running the examples.
    Thank you for this great introduction to CUDA!
    Jonathan Joseph

  • Jonathan Joseph

    Is there a tutorial for cuDNN similar to this CUDA Introduction?

    • cuDNN is really intended to be used by developers of Deep Learning Frameworks, so there isn’t a beginner’s introduction, as such. I suspect what you really want is a beginner’s introduction to Caffe, or TensorFlow, or Torch, etc. There are many such introductions available on the web.

      • Jonathan Joseph

        That makes sense. I’ve started looking at TensorFlow, which I have running on an Amazon p2.xlarge instance that hosts a Tesla K80 GPU. Thanks for your insights.

  • Alexander

    …example needs a continuation – how to engage the second GPU on the K80 with the same code?)

    • Not a bad suggestion! I’ll add it to my list. :)

  • Kunal Gulati

    We calculated the block size and the number of threads in each block so that each one of the one million elements is worked on by a single thread. I’ve seen CUDA code for the same problem, but a for loop wasn’t used in any of it. Can you please explain why we are using the for loop, and why can’t we just add all the elements with their individual threads?

  • PiotrLenarczyk

    The Udacity course Intro to Parallel Programming mentioned by the author is great!

  • Max

    Hi,

    What is the answer to the question “If you have access to a Pascal-based GPU, try running add_grid.cu on it. Is performance better or worse than the K80 results? Why?”? I have tried with a Pascal Titan X GPU, and add_cuda does indeed seem to run slower than on my GTX 750, but I didn’t really get why; does it have to do with the new unified memory?

  • Antoine

    Trying to get into parallel computing and CUDA. I had a look at the works from Kurt Keutzer (Berkeley) and Tim Mattson (Intel) on a language for parallel design patterns and it looked amazing to me. Is this work considered a reference in the domain, or is it just another attempt among many others? In which case what should I look at in terms of design patterns and consolidating my understanding from the long built experience of others?

  • Sungwoo

    Hi, I’m a newbie learning CUDA. I followed the same steps on this page and got pretty similar results except for the last experiment. I did exactly the same coding on my computer, but the last experiment shows the same elapsed time as the second-to-last one, about 3.6ms on both. My GPU is a GTX 1080 and I found that it can make 1048576 threads at the same time with the threads-max function. Can you give me a hint about this result? I really need someone to help me.

    • You might look at my answer to the question by Max below. GTX 1080 is a Pascal GPU. I still plan to write a followup post to explain this but I have been swamped with other work.

      • Sungwoo

        Thank you for fast answer.
        So what you are saying is 1080 is a Pascal GPU and when we see the results of nvprof, we see the whole time spent including memory transfer and stuff? Which makes sense about so little time difference between last and second last experiment. But your K80 does not?

        • First, remember that before the kernel runs, all the data in the array was last accessed on the CPU (during initialization). In both cases (K80 and 1080) nvprof is just timing the kernel execution. But on 1080, when the GPU accesses the Unified Memory array it page faults (because it’s resident in CPU memory). The threads that fault on each page have to wait for the page to be migrated to the GPU. These migrations get included in the kernel run time measured by nvprof.

          But K80 is incapable of page faulting. So when you launch the kernel, first the driver has to migrate all the pages touched by the CPU (whole array in this case) back to the GPU before running the kernel. Since it happens before the kernel runs on the GPU, nvprof doesn’t include that in the kernel run time.

          On your 1080, if you run the kernel twice, you’ll see the minimum kernel run time will be lower than the maximum — because once the memory is paged in by the first run, the page faults don’t happen on the second run.
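          If you want the measured kernel time on Pascal to exclude those migrations, one option (see the post Unified Memory for CUDA Beginners) is to prefetch the arrays to the GPU before launching. A rough sketch with the CUDA 8 API:

          int device = -1;
          cudaGetDevice(&device);
          // Migrate x and y to the GPU before the kernel runs
          cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL);
          cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL);

          add<<<numBlocks, blockSize>>>(N, x, y);
          cudaDeviceSynchronize();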

  • Yongduek Seo

    Thanks for the nice article.

    The code with N=24, in which case the number of blocks is 65536, resulted in

    Max error: 1

    What would be the cause of this error, please?
    Thank you in advance.

    • Yongduek Seo

      Including a flag for nvcc simply solved the problem:
      “nvcc add_grid.cu --gpu-architecture sm_50 -o add_grid”

      But I still don’t know how to find the maximum value for that.

  • Julius O

    Hi Dr. Harris:
    I copied verbatim and compiled add.cu with nvcc, but received a “segmentation fault (core dumped)” upon execution? Could it be that the GPU is too old? The setup is a GT 540M running Ubuntu 16.04.2 LTS and CUDA toolkit 8.0.61. Lastly, do you think the GeForce 1050-series cards would be good choice for a starter configuration for CUDA/ML/Tensor Flow applications? Thanks!

    • Hi Julius. Yes, that’s a Fermi-based GPU so it does not support cudaMallocManaged or Unified Memory. Unfortunately this post didn’t introduce error checking (I wanted to keep things as simple as possible for the introduction) — if you check the return values from cudaMallocManaged you will probably see an error code returned. This post teaches how to properly check for errors in the CUDA runtime API: https://devblogs.nvidia.com/parallelforall/how-query-device-properties-and-handle-errors-cuda-cc/

      GeForce 1050 series is a good starter GPU for CUDA. I am not sure how powerful it is for ML / TensorFlow — I haven’t used one myself. Someone asked the same question on Quora and got a mixed answer: https://www.quora.com/Is-NVIDIA-GTX-1050-Ti-4GB-sufficient-for-Deep-Learning

      • Julius O

        Great advice. Thanks very much!

      • Arthur Ianuzzi

        Mark, I’m having the same problem “segmentation fault (core dumped)” on dual GeForce GTX 1060 3GB cards. I installed cuda and apt installed nvidia-cuda-toolkit (since nvcc was not installed – got message below when trying to compile the first time). Do I have something installed incorrectly?

        Error message on running nvcc – The program ‘nvcc’ is currently not installed. You can install it by typing:
        sudo apt install nvidia-cuda-toolkit

        • I’m not sure exactly what the issue could be, but I checked, and we don’t officially support installation of the CUDA toolkit via apt. You should use apt to remove that package, and then download and install the toolkit from https://developer.nvidia.com/cuda-toolkit. (.deb installers are available there).

  • Moose McFearson

    I’m also running into runtime errors even with a straight copy-paste. I see sometimes this is caused by the card, so before I start messing with it too much: Is the M500M (maxwell-based) compatible with this setup?

    • Moose McFearson

      This is the error btw

      Exception thrown at 0x00B318E0 in CudaAccelerated1.exe: 0xC0000005: Access violation writing location 0x00000000.

      If there is a handler for this exception, the program may be safely continued.

  • Veera

    Thanks a lot for the helpful post Mark.
    I tried all the examples successfully and am able to write kernels where there is no dependency on the calculation done at a previous index or indexes at any point. How do I achieve that?

  • Yagami Kira

    Hi,
    I have installed CUDA 8.0 and I run it with Microsoft Visual Studio 2015, but when I tried your different functions my GPU times are slower than the CPU.
    I built it in x64 and debug mode; I have an error with <<< for the kernel but the program compiles and runs.

    Do you have any idea ?

    Thanks

  • Rory Gamble

    I got quite excited and then disappointed by the statement “…any NVIDIA GPU should do”. In reality it has to be a reasonably modern GPU. I will have to wait until I get access to a newer GPU before I get to play with CUDA. It’s an excellent article otherwise.

    • How old is your GPU? You could start with this older post, which doesn’t use Unified Memory (which requires a Kepler GPU or newer): https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/

      • Rory Gamble

        Thanks for the reply. Unfortunately all of the various ones I have access to in various computers are too old. As a programmer, they are basically chosen as “the cheapest one which is quiet and supports my screen”. I will keep an eye out when it is next time to upgrade…

  • Shaurakar Das

    Hello Mark,

    I have a small doubt (this is the case when only one block and only one thread is launched) –
    When I compile the code using x86 in visual studio (2017 with cuda 9) I get an error, something like this
    0xC0000005: Access violation writing location 0x00000000. The exception is thrown at
    for (int i = 0; i < N; i++) {
    x[i] = 1.0f; <---- exception thrown here, and it stops

    However, when I use x64, the program runs correctly. This should not happen, as x86 should be able to handle the size of (N * sizeof(float)).
    When I tested the size 1<<10 instead of 1<<20 it worked in x86.
    Please help me in understanding this. Correct me if there is something wrong in understanding.

    Thanks

  • Mohammed Anas Fareed

    When a kernel calls a thread, does a single thread have both the values it needs, in this case element from x and element from y. Or one thread gets access to element from x and another thread gets access to element y and the kernel accesses these two threads and writes the result to a new thread.
    Please can anyone clarify

    • Kernels don’t call threads. Threads run kernels. Think of the kernel (the __global__ function) as the program that is run by every thread that is launched in the <<< >>> call. So every thread runs the whole kernel program, and therefore every thread has access to all variables. Then each thread computes a unique address to index into the x and y arrays (in this kernel).

      • Mohammed Anas Fareed

        Thank You Mark,
        This cleared a lot of misunderstandings I had.
        The “unique address” you are talking about is it the thread ID ?
        And can you please tell if I am right about this
        when a thread is called it gets the data its needs from the memory, runs the kernel on this data, computes the result, stores the result in memory

          • The unique address in this example is the index `i` calculated to index into x and y. As for your “if I’m right”: more precisely, the kernel is run by many threads in parallel. *Inside* the kernel each thread loads data, computes, and stores results. It’s JUST LIKE in any normal C++ program, except that there are thousands of threads running the same program, each with its own indices.

  • Yongduek Seo

    Very helpful article for me to start CUDA programming. Thanks a lot.
    One question.

    When the size of the array was changed: N=25
    then: numBlocks = 65536
    and the add() function did not work well, resulting in ‘Max error:1’
    These were observed on Titan X and K4200.

    Would anyone explain why this happened?
    Up to N=24, everything was OK.

    • Do you mean N=1<<24? Or did you actually set it to only 24? I suspect that you are getting a CUDA error because you are trying to launch too many blocks. Maximum blocks is probably 65535 — (2^16-1).

      • Yongduek Seo

        Thanks for the replay.
        N=1<<25 caused an error.
        But deviceQuery for my Titan X said:

        Maximum threads per block: 1024
        Maximum dimension 0 of block: 1024
        Maximum dimension 1 of block: 1024
        Maximum dimension 2 of block: 64
        Maximum dimension 0 of grid: 2147483647
        Maximum dimension 1 of grid: 65535
        Maximum dimension 2 of grid: 65535

        According to this, the maximum grid size was expected to be as large as 2147483647 because our cuda code is using only one dimensional addition. Probably, there seems something else that I don't know yet.

        • Yes, but you set numBlocks to 65536 — it says right there that “maximum dimension 1 of grid” is 65535 — so you went one too high.

  • OnlyOnce

    How can your files compile without “#include <cuda_runtime.h>”?

    • When compiling .cu with nvcc it is automatically included.

  • OnlyOnce

    I’m getting weird results compared to yours on my MacBook. I’m running Windows, running an Intel i7-3740QM@2.70GHZ, with an NVIDIA 650M with 1GB memory.

    You don’t say how fast your CPU runs were, but my CPU runs are much faster, usually less than a millisecond to do 1 million operations. When I move that add function into the “.cu” file it slows down to 133-160 million operations/second or about 6-8ms per run. I don’t know if this is something to do with the NVIDIA compiler not optimizing as well as the Microsoft compiler.

    My (1,1) runs are about 453 milliseconds, or 2.2M operations/second. That is close to your results so I figure the base clock is probably close. But when I up the values it is always around 10-12M operations/second no matter what parameters I use. I know it’s working because I reset the values in both arrays before each run (which is really fast on the CPU). Any idea what the difference might be? I’m using cudaMalloc, but maybe the card is different and is using host memory for some reason? The CPU add function is getting the same speed no matter whether I allocate the memory normally or use cudaMalloc, so that makes me think that is what is happening if it should be slower using memory from cudaMalloc… If I get 1/15 the performance on the GPU, what’s the point?

    Found 1 devices

    Device 0
    name: GeForce GT 650M
    totalGlobalMem: 1073741824
    regsPerBlock: 65536
    warpSize: 32
    memPitch: 2147483647
    maxThreadsPerBlock: 1024
    maxThreadsDim[0]: 1024
    maxThreadsDim[1]: 1024
    maxThreadsDim[2]: 64
    maxGridSize[0]: 2147483647
    maxGridSize[1]: 65535
    maxGridSize[2]: 65535
    totalConstMem: 65536
    major: 3
    minor: 0
    clockRate: 900000
    textureAlignment: 512
    deviceOverlap: 1
    multiProcessorCount: 2
    kernelExecTimeoutEnabled: 1
    integrated: 0
    canMapHostMemory: 1
    computeMode: 0
    concurrentKernels: 1
    ECCEnabled: 0
    pciBusID: 1
    pciDeviceID: 0
    tccDriver: 0

    cpu_add (normal mem):1.00000000M in 0.00750510 sec, 133.242605M per second
    cuda_add(1, 1) : 1.00000000M in 0.45356481 sec, 2.20475657M per second
    cpu_add : 1.00000000M in 0.00613933 sec, 162.884155M per second
    cuda_add(1, 256) : 1.00000000M in 0.09324973 sec, 10.7238908M per second
    cuda_add(32, 32) : 1.00000000M in 0.08917370 sec, 11.2140686M per second
    cuda_add(64, 64) : 1.00000000M in 0.08583966 sec, 11.6496266M per second
    cuda_add(96, 96) : 1.00000000M in 0.08724686 sec, 11.4617294M per second
    cuda_add(128, 128) : 1.00000000M in 0.08895133 sec, 11.2421028M per second
    cuda_add(1024, 64) : 1.00000000M in 0.08821465 sec, 11.3359848M per second
    cuda_add(256, 256) : 1.00000000M in 0.08196660 sec, 12.2000899M per second
    cuda_add(1024, 256): 1.00000000M in 0.08266413 sec, 12.0971453M per second
    cuda_add(3907, 256): 1.00000000M in 0.08733581 sec, 11.4500561M per second

    • OnlyOnce

      I see, using cudaMallocManaged does memory transfers behind the scenes. So calling a kernel might take 2.8ms to transfer the data to the GPU memory, only 0.7ms to do the operations, then another 2.7ms to transfer the data back to CPU memory. Doing the copy operations manually gives me a big improvement in total time.

      Just looking at operations I get 252M/sec using the CPU compiled with nvcc, 1,150M/sec using the CPU compiled with cl.exe, and up to 2,200M/sec on the GPU. So I’d be wary of using regular code compiled with nvcc.exe and be mindful of data transfers.

      I got errors when blockSize is greater than 1024, which makes sense for my GPU. Best performance is about twice what I can get on the CPU for this simple task.

      • Keep in mind that 1M adds is nothing, even for a little notebook GPU. Try filling up the memory and processing that. Say 2^25 elements rather than 2^20. And as you realized, vectorAdd is bandwidth intensive, not compute intensive, so it’s not a great test vs. CPU, especially on a laptop GPU where the available memory bandwidth is high but not super high. As for your CPU code performance: nvcc is a compiler driver, not a compiler (see the docs), so it’s still using your host compiler behind the scenes to compile your CPU code. But it probably doesn’t have the same level of optimization set. You could try -O3, or look at what args you are passing to cl.exe and pass them through nvcc via “-Xcompiler”.

  • Gio

    I’ve been spoiled and always had an IT guy while Sound Editor on many animated shows. Even being an Emmy Winner, my head spins on what should be a driver update for NVIDIA GeForce 310M. Laptop says config is wrong with Video card driver. Ok, go to update driver but it says driver is ok, not true from 2009. Tried suggested NVIDIA Driver, not compatible? Now on my way Cuba. You’ve got to be kidding right? All I want is to update a driver that obviously needs updating. I have too much to do to learn what I should be, I accept it but thought to update a driver did not require developer skills, I envy you, but really?

  • Matthew Giovannucci

    Hi Mark,

    This is a very helpful guide and has taught me a lot about programming on my GPU. I am also working on a Mac laptop about 5 years old and whenever I run CUDA code my screen goes black. Is this normal or am I having a driver issue?

    Thanks!
    Matt

    • Thanks! I get that sometimes too on my MacBook, but usually only when my code does something wrong, like run too long or access memory out of bounds. Does the code produce correct results? In system preferences, under CUDA does it say you need to update?

      • Matthew Giovannucci

        The code does appear to be working, but only when I switch over to NVIDIA Web Drivers from the OS X default drivers, and I have to execute the code as root for it to not return a “no cuda compatible device detected” error. When using the NVIDIA web drivers, the CUDA drivers say it is up to date, but when I’m using the OS X driver, the CUDA driver says it needs to be updated. It’s very weird behavior, I know, and to add to that, when the NVIDIA web drivers are enabled sometimes my computer screen will go blank when I’m doing more graphically intensive computing. I think this happens when it switches from the integrated graphics card to the NVIDIA graphics card. Thank you for the help and hopefully you can shed some light on this.

  • Red Gator

    Very gratifying post until I got to the last example. My numbers are 198ms, 4.03ms, and 3.05ms, respectively. I have a “GeForce GTX 1050 Ti (0)” – are these numbers sane for the difference in hardware? And my run on a Pascal “Quadro P5000” is not much better: 188ms, 2.9ms, 2.3ms.

  • Alec Wu

    In Fig. 1,
    gridDim.x should be 16.

    • Actually 4096 as used in the figure is correct, since blockIdx.x ranges from 0 to 4095. Thanks!

  • Anton Neverov

    I have an NVIDIA Shield Tablet (NVIDIA Tegra K1). And I want to program CUDA on Android. Could you help me?

  • Fergal Cotter

    Hi Mark,
    Great blog post! I had a bit of trouble at 2 stages of your post:

    Running with a GTX 1080:
    * Tried profiling, got “no kernels were profiled”
    * Searched the NVIDIA forums for what’s causing this, found that adding the ‘--unified-memory-profiling off’ flag to nvprof fixes this.
    * Adding many cuda blocks gives me no speedup??
    * Found your blog post on unified memory, added code to preload memory into the GPU.
    * Removed the ‘--unified-memory-profiling off’ flag

    If this is meant to happen, I would suggest adding a little comment about newer GPU models in this blog post and mentioning the ‘--unified-memory-profiling’ flag. It looks like you wanted us to find this out by trying it, which now I’m glad I have, but I spent perhaps a little too long with some of these issues.

  • Adam S.

    Hi Mark,
    Thank you very much for this tutorial. I followed it and got the same results on a K80. It would be great to learn how the same code could take advantage of the second of the 2 GPUs on the K80, as you allude to at the end of the article. In running a direct comparison of your K80 example to the same (nonparallel) clang++ compilation on a MacBook Pro 2.8 GHz Intel Core i7 (mid 2015 model), a single processor is able to do the same in about 700 us, so “only” about 7x slower. It may be about the same if we were to compare to the 8 processes available on the laptop and write an MPI version. So it seems to me that the K80, when both of its GPUs are engaged, would be only about 2x faster than a single MacBook? How can that be?
    Thanks,
    Adam S.