Sync and Async in CUDA

Summary

In this post, I will introduce the synchronous and asynchronous behaviors of CUDA APIs, kernels, and streams.

Conclusion

The following code is handy for testing the behaviors of the CPU and of CUDA streams.

#include <cstdio>

// Busy-wait on the device for roughly clock_count cycles.
__device__ void clock_block(clock_t clock_count) {
    clock_t start = clock();
    while (clock() - start < clock_count) { }
}

__global__ void cuda_hello1(){
    clock_block(10000);
    printf("Hello World from GPU1!\n");
}

__global__ void cuda_hello2(){
    printf("Hello World from GPU2!\n");
    clock_block(10000);
}

void cpu_hello() {
    printf("hello world from cpu?\n");
}

/*
hello world from cpu?
hello world from cpu?
Hello World from GPU1!
Hello World from GPU2!
*/
int main1() {
    cudaStream_t streams;
    cudaStreamCreate(&streams);

    cpu_hello();
    cuda_hello1<<<1,1,0,streams>>>();
    // legacy default stream blocks other blocking streams
    cuda_hello2<<<1,1>>>();
    cpu_hello();
    cudaDeviceSynchronize();

    return 0;
}

/*
hello world from cpu?
hello world from cpu?
Hello World from GPU2!
Hello World from GPU1!
*/
int main2() {
    cudaStream_t streams;
    cudaStreamCreateWithFlags(&streams, cudaStreamNonBlocking);
    // legacy default stream does not block non-blocking streams
    cpu_hello();
    cuda_hello1<<<1,1,0,streams>>>();
    cuda_hello2<<<1,1>>>();
    cpu_hello();
    cudaDeviceSynchronize();

    return 0;
}

Details

There are two aspects to consider: kernels and streams.

1. Kernels

My main conclusions are:

  1. All kernel launches return to the host immediately, regardless of whether we use the default stream or custom streams;
  2. Kernels launched in different streams may run concurrently; kernels in the same stream (including the default stream) run sequentially;
  3. All host threads share the same default stream (the legacy stream) unless the per-thread option is turned on;
  4. When a kernel is launched into the legacy stream, on the device side it will not start until all previously issued kernels in all other blocking streams have finished (non-blocking streams are exempt). Use cudaStreamCreateWithFlags(&streams, cudaStreamNonBlocking) to create non-blocking streams.
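Conclusion 1 can be checked directly by timing a launch from the host. The sketch below is a standalone example (the kernel `busy_kernel` and the cycle count are illustrative, not from the post): the launch should return in microseconds, while cudaDeviceSynchronize waits for the full kernel runtime.

```cuda
#include <cstdio>
#include <chrono>

// A deliberately slow kernel: spins for roughly `cycles` clock ticks.
__global__ void busy_kernel(clock_t cycles) {
    clock_t start = clock();
    while (clock() - start < cycles) { }
}

int main() {
    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    busy_kernel<<<1, 1>>>(100000000);   // launch returns immediately
    auto t1 = clk::now();
    cudaDeviceSynchronize();            // host blocks here until the kernel finishes
    auto t2 = clk::now();

    auto us = [](auto d) {
        return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
    };
    printf("launch took %lld us, synchronize took %lld us\n",
           (long long) us(t1 - t0), (long long) us(t2 - t1));
    return 0;
}
```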

2. Default Stream (aka Stream ‘0’, the legacy stream)

Usually, we use APIs without the “Async” suffix. These APIs are synchronous with respect to both host and device: the host returns only after the call has finished. However, there are some exceptions.

Exceptions – asynchronous w.r.t. the host, returning immediately:

  1. Kernel launches in the default stream;
  2. cudaMemcpyAsync and cudaMemsetAsync;
  3. cudaMemcpy within the memory of a single device (device-to-device); note that cudaMemcpy from host to host is synchronous.

For example,

// Completely synchronous with respect to the device: the device executes
// cudaMalloc, cudaMemcpy (H2D), kernel2, kernel3, cudaMemcpy (D2H) in order.

cudaMalloc ( &dev1, size ) ;
double* host1 = (double*) malloc ( size ) ;
… 
cudaMemcpy ( dev1, host1, size, H2D ) ;
// Kernel launches are asynchronous with the host by default; the launch returns immediately.
kernel2 <<< grid, block, 0 >>> ( …, dev2, … );
// Potentially overlapped: the CPU executes this function concurrently with kernel2.
some_CPU_method ();

kernel3 <<< grid, block, 0 >>> ( …, dev3, … ); 
cudaMemcpy ( host4, dev4, size, D2H ) ;

Per thread stream

CUDA 7 introduces the per-thread default stream. It has the following effects.

  1. It gives each host thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently;
  2. These default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams;
  3. The per-thread default stream is not a non-blocking stream and will synchronize with the legacy default stream if both are used in a program.
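A minimal sketch of effect 1, assuming the file is compiled with the `--default-stream per-thread` option of nvcc (the kernel name and thread count are illustrative): each host thread launches into its own default stream, so the four kernels may overlap on the device.

```cuda
// Compile with: nvcc --default-stream per-thread per_thread.cu
#include <cstdio>
#include <thread>
#include <vector>

__global__ void spin_and_print(int id) {
    clock_t start = clock();
    while (clock() - start < 10000000) { }  // busy-wait so overlap is observable
    printf("kernel from host thread %d\n", id);
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) {
        // Each host thread launches into its own per-thread default stream,
        // so these kernels may run concurrently on the device.
        threads.emplace_back([i] { spin_and_print<<<1, 1>>>(i); });
    }
    for (auto& t : threads) t.join();
    cudaDeviceSynchronize();
    return 0;
}
```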

3. Customized Stream

Usually, we use APIs with the “Async” suffix.
Async APIs return control to the host immediately, with the following exceptions.

Exceptions – synchronous w.r.t. host, blocking,

  1. cudaMemcpy*Async from device memory to pageable host memory: the call blocks until the copy has completed.
  2. cudaMemcpy*Async from any host memory to any host memory: the call blocks until the copy has completed.
  3. All other transfers are fully asynchronous w.r.t. the host. If pageable memory must first be staged to pinned memory, the staging is handled asynchronously by a worker thread.

For example,

cudaStream_t stream1, stream2, stream3, stream4 ; 
cudaStreamCreate ( &stream1) ;
 ... 
cudaMalloc ( &dev1, size ) ; 
// pinned host memory, required for truly asynchronous copies
cudaMallocHost ( &host1, size ) ; 
// pageable host memory 
double* host2 = (double*) malloc ( size ) ;

// returns to the CPU immediately (host1 is pinned)
cudaMemcpyAsync ( dev1, host1, size, H2D, stream1 ) ; 
kernel2 <<< grid, block, 0, stream2 >>> ( …, dev2, … ) ; 
kernel3 <<< grid, block, 0, stream3 >>> ( …, dev3, … ) ; 
cudaMemcpyAsync ( host4, dev4, size, D2H, stream4 ) ; 
// blocks until the copy completes: D2H into pageable host memory is synchronous
cudaMemcpyAsync ( host2, dev1, size, D2H, stream1 ) ; 
some_CPU_method ();
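When only one stream's work matters, we do not need cudaDeviceSynchronize. A sketch of waiting on a single custom stream (the kernel name is illustrative):

```cuda
#include <cstdio>

__global__ void work_kernel() { printf("work done\n"); }

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    work_kernel<<<1, 1, 0, stream>>>();

    // Block the host until work in *this* stream finishes; other streams
    // (and the default stream) are unaffected, unlike cudaDeviceSynchronize.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    return 0;
}
```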

Reference

API synchronization behavior
Stream synchronization behavior
