{"id":632,"date":"2019-12-19T15:15:43","date_gmt":"2019-12-19T23:15:43","guid":{"rendered":"http:\/\/35.243.195.209\/?p=632"},"modified":"2020-04-20T20:45:13","modified_gmt":"2020-04-21T03:45:13","slug":"sync-and-async-in-cuda","status":"publish","type":"post","link":"https:\/\/nanzhou.cc\/index.php\/2019\/12\/19\/sync-and-async-in-cuda\/","title":{"rendered":"Sync and Async in CUDA"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>In this post, I will introduce the sync and async behaviors in CUDA.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>The following is some handy code for testing the behavior of the CPU and of streams.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp\">#include &lt;cstdio>\n\n__global__ void cuda_hello1(){\n    clock_block(10000);\n    printf(\"Hello World from GPU1!\\n\");\n}\n\n__global__ void cuda_hello2(){\n    printf(\"Hello World from GPU2!\\n\");\n    clock_block(10000);\n}\n\nvoid cpu_hello() {\n    printf(\"hello world from cpu?\\n\");\n}\n\n\/*\nhello world from cpu?\nhello world from cpu?\nHello World from GPU1!\nHello World from GPU2!\n*\/\nint main1() {\n    cudaStream_t streams;\n    cudaStreamCreate(&amp;streams);\n\n    cpu_hello();\n    cuda_hello1&lt;&lt;&lt;1,1,0,streams>>>();\n    \/\/ the legacy default stream blocks other blocking streams\n    cuda_hello2&lt;&lt;&lt;1,1>>>();\n    cpu_hello();\n    cudaDeviceSynchronize();\n    cudaStreamDestroy(streams);\n\n    return 0;\n}\n\n\/*\nhello world from cpu?\nhello world from cpu?\nHello World from GPU2!\nHello World from GPU1!\n*\/\nint main2() {\n    cudaStream_t streams;\n    cudaStreamCreateWithFlags(&amp;streams, cudaStreamNonBlocking);\n    \/\/ the legacy default stream does not block non-blocking streams\n    cpu_hello();\n    cuda_hello1&lt;&lt;&lt;1,1,0,streams>>>();\n    cuda_hello2&lt;&lt;&lt;1,1>>>();\n    cpu_hello();\n    cudaDeviceSynchronize();\n    cudaStreamDestroy(streams);\n\n    return 0;\n}<\/code><\/pre>\n\n\n\n<h2 
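class=\"wp-block-heading\">A Note on clock_block<\/h2>\n\n\n\n<p>The kernels above call <code>clock_block<\/code>, which is not a CUDA built-in; it is a helper of my own that busy-waits on the device for roughly a given number of clock cycles. A minimal sketch (the name and signature are my own, not part of any CUDA API):<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp\">\/\/ busy-wait on the device for about clock_count clock cycles\n__device__ void clock_block(long long clock_count) {\n    long long start = clock64();\n    while (clock64() - start &lt; clock_count) { }\n}<\/code><\/pre>\n\n\n\n<h2 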
class=\"wp-block-heading\">Details<\/h2>\n\n\n\n<p>There are two aspects: kernels and streams.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1. Kernels<\/h3>\n\n\n\n<p>Some of my conclusions are,<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>All kernel launches return to the host immediately, no matter whether we use the default stream or custom streams;<\/li><li>Kernels launched in different streams may run concurrently; kernels in the same stream (including the default stream) run sequentially;<\/li><li>Different host threads share the same default stream (called the legacy stream) unless the per-thread option is turned on;<\/li><li>When a kernel is launched into the legacy stream, on the device side it will not start until all previously issued kernels in all other streams (except for non-blocking streams) have finished; use <code>cudaStreamCreateWithFlags(&amp;streams, cudaStreamNonBlocking)<\/code> to create non-blocking streams.<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">2. Default Stream (aka Stream &#8216;0&#8217;, the legacy stream)<\/h3>\n\n\n\n<p>Usually, we will use APIs without the &#8220;Async&#8221; suffix. These APIs are synchronous w.r.t. both host and device: the call returns only after the operation has finished. However, there are some exceptions.<br>Exceptions \u2013 asynchronous w.r.t. 
host; these return immediately:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Kernel launches in the default stream;<\/li><li><code>cudaMemcpy*Async<\/code> and <code>cudaMemset*Async<\/code>;<\/li><li><code>cudaMemcpy<\/code> within the same device (device-to-device within a single GPU); note that <code>cudaMemcpy<\/code> from host memory to host memory is synchronous;<\/li><\/ol>\n\n\n\n<p>For example,<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp\">\/\/ completely synchronous with respect to the device: from the device's point of view,\n\/\/ the order is sequential: cudaMalloc, cudaMemcpy, kernel2, kernel3, cudaMemcpy\n\ncudaMalloc ( &amp;dev1, size ) ;\ndouble* host1 = (double*) malloc ( size ) ;\n\u2026\ncudaMemcpy ( dev1, host1, size, cudaMemcpyHostToDevice ) ;\n\/\/ GPU kernel launches are asynchronous w.r.t. the host by default; the launch returns immediately\nkernel2 &lt;&lt;&lt; grid, block, 0 >>> ( \u2026, dev2, \u2026 ) ;\n\/\/ potentially overlapped: the CPU executes this function concurrently with kernel2\nsome_CPU_method ();\n\nkernel3 &lt;&lt;&lt; grid, block, 0 >>> ( \u2026, dev3, \u2026 ) ;\ncudaMemcpy ( host4, dev4, size, cudaMemcpyDeviceToHost ) ;<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Per thread stream<\/h4>\n\n\n\n<p>CUDA 7 introduced the <code>per-thread default stream<\/code>. It has the following effects.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>It gives each host thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently;<\/li><li>These default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams;<\/li><li>The per-thread default stream is not a non-blocking stream, and it will synchronize with the legacy default stream if both are used in the same program.<\/li><\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">3. 
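Per-Thread Default Stream Demo<\/h3>\n\n\n\n<p>To see the per-thread default stream in action, here is a small sketch of my own (the kernel name and structure are assumptions, not from any official sample); compile it with <code>nvcc --default-stream per-thread<\/code>. Each host thread then launches into its own default stream, so the two kernels are not serialized against each other.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp\">#include &lt;cstdio>\n#include &lt;thread>\n\n__global__ void tagged_hello(int tag) {\n    printf(\"Hello from launch %d\\n\", tag);\n}\n\nint main() {\n    \/\/ with --default-stream per-thread, each std::thread gets its own default stream,\n    \/\/ so these two launches may run concurrently on the device\n    std::thread t1([]{ tagged_hello&lt;&lt;&lt;1,1>>>(1); cudaDeviceSynchronize(); });\n    std::thread t2([]{ tagged_hello&lt;&lt;&lt;1,1>>>(2); cudaDeviceSynchronize(); });\n    t1.join();\n    t2.join();\n    return 0;\n}<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">3. 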
Customized Stream<\/h3>\n\n\n\n<p>Usually, we will use APIs with the &#8220;Async&#8221; suffix.<br>Async APIs return control to the host immediately, with the following exceptions.<\/p>\n\n\n\n<p>Exceptions \u2013 synchronous w.r.t. host (blocking),<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><code>cudaMemcpy*Async<\/code> from device memory to pageable host memory: the call blocks until the copy has completed.<\/li><li><code>cudaMemcpy*Async<\/code> from any host memory to any host memory: the call blocks until the copy has completed.<\/li><li>For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this is handled asynchronously by a worker thread.<\/li><\/ol>\n\n\n\n<p>For example,<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code lang=\"cpp\" class=\"language-cpp\">cudaStream_t stream1, stream2, stream3, stream4 ;\ncudaStreamCreate ( &amp;stream1 ) ;\n...\ncudaMalloc ( &amp;dev1, size ) ;\n\/\/ pinned memory required on host\ncudaMallocHost ( &amp;host1, size ) ;\n\/\/ pageable memory\ndouble* host2 = (double*) malloc ( size ) ;\n\n\/\/ these return to the CPU immediately\ncudaMemcpyAsync ( dev1, host1, size, cudaMemcpyHostToDevice, stream1 ) ;\nkernel2 &lt;&lt;&lt; grid, block, 0, stream2 >>> ( \u2026, dev2, \u2026 ) ;\nkernel3 &lt;&lt;&lt; grid, block, 0, stream3 >>> ( \u2026, dev3, \u2026 ) ;\ncudaMemcpyAsync ( host4, dev4, size, cudaMemcpyDeviceToHost, stream4 ) ;\n\/\/ blocks until the copy completes, since host2 is pageable host memory\ncudaMemcpyAsync ( host2, dev1, size, cudaMemcpyDeviceToHost, stream1 ) ;\nsome_CPU_method ();<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Reference<\/h2>\n\n\n\n<p><a href=\"https:\/\/www.clear.rice.edu\/comp422\/resources\/cuda\/html\/cuda-driver-api\/api-sync-behavior.html#api-sync-behavior\">API synchronization behavior<\/a><br><a href=\"https:\/\/www.clear.rice.edu\/comp422\/resources\/cuda\/html\/cuda-driver-api\/stream-sync-behavior.html\">Stream synchronization 
behavior<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Summary In this post, I will introduce the Sync and Async behaviors in CUDA. Conclusion The followings are handy codes testing the behaviors of CPU and streams. Details There are two aspects, kernels and streams. 1. Kernels Some of my conclusions are, All kernels will return immediately no matter we use the default stream or&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[30,29,31],"tags":[],"class_list":["post-632","post","type-post","status-publish","format-standard","hentry","category-cuda","category-gpu","category-parallel-computation"],"_links":{"self":[{"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/posts\/632","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/comments?post=632"}],"version-history":[{"count":3,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/posts\/632\/revisions"}],"predecessor-version":[{"id":1038,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/posts\/632\/revisions\/1038"}],"wp:attachment":[{"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/media?parent=632"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/categories?post=632"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/tags?post=632"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}