In our last CUDA C/C++ post we discussed how to transfer data efficiently between the host and device. In this post, we discuss how to overlap data transfers with computation on the host, computation on the device, and in some cases other data transfers between the host and device. Achieving overlap between data transfers and other operations requires the use of CUDA streams, so first let’s learn about streams.
A stream in CUDA is a sequence of operations that execute on the device in the order in which they are issued by the host code. While operations within a stream are guaranteed to execute in the prescribed order, operations in different streams can be interleaved and, when possible, they can even run concurrently.
The default stream
All device operations (kernels and data transfers) in CUDA run in a stream. When no stream is specified, the default stream (also called the “null stream”) is used. The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device: no operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation (in any stream on the device) will begin.
Let’s look at some simple code examples that use the default stream, and discuss how operations progress from the perspective of the host as well as the device.
cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice); increment<<<1,N>>>(d_a) cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
In the code above, from the perspective of the device, all three operations are issued to the same (default) stream and will execute in the order that they were issued. Continue reading