Category: Parallel Computation

Compute Capabilities and Throughputs on NVIDIA’s GPUs

Summary: In this post, I will introduce the throughputs and compute capabilities of NVIDIA's GPUs. The post doesn't contain hardware ...

February 3, 2020 / fighternan / GPU, Parallel Computation

How to debug Async Kernels or APIs in CUDA

Summary: In this post, I will show how to debug async kernels or async APIs in CUDA. The async operations ...

December 19, 2019 / fighternan / CUDA, GPU, Parallel Computation

Sync and Async in CUDA

Summary: In this post, I will introduce the sync and async behaviors in CUDA. Conclusion: The following is handy code ...

December 19, 2019 / fighternan / CUDA, GPU, Parallel Computation

Thread Pool in C++

Summary: In this post, I will show how to build a simple thread pool in C++. Conclusion: The code is ...

August 4, 2019 / fighternan / C++, Multi-Thread, Parallel Computation, Programming Language

Mutex in C++ and Java

Summary: In this post, I will introduce the correct way to use a mutex in C++, compared with Java. Conclusion: Try ...

August 4, 2019 / fighternan / C++, Java, Multi-Thread, Parallel Computation, Programming Language

Profile Applications in CUDA

Summary: In this post, I will show how to use the nvprof tool to profile your CUDA applications. Details: It ...

May 17, 2019 / fighternan / CUDA, GPU

Install CUDA 10.1 and Driver 418

Summary: In this post, I will show how to install the newest CUDA and the corresponding NVIDIA driver on Ubuntu 16.04 ...

April 7, 2019 / fighternan / C++, CUDA, GPU, Programming Language

Intel Threading Building Blocks

Summary: In this post, I will show how to solve a parallel computation task using Intel Threading Building Blocks. Problem: ...

December 20, 2018 / fighternan / Multi-Thread, Parallel Computation

Compute Capabilities and Throughputs on NVIDIA’s GPUs

Posted on February 3, 2020 by fighternan

Summary: In this post, I will introduce the throughputs and compute capabilities of NVIDIA’s GPUs. The post doesn’t contain hardware details. Conclusion: It may seem like common sense that half-precision floats run faster on GPUs, as in this post by Intel. However, it is a different story on NVIDIA’s GPUs. For example, you may…

How to debug Async Kernels or APIs in CUDA

Posted on December 19, 2019 by fighternan

Summary: In this post, I will show how to debug async kernels or async APIs in CUDA. Async operations do not block CPU code. When we check the return status of a function call, it may be SUCCESS even though there are bugs such as "illegal memory access". On the other hand, when we find the…

Sync and Async in CUDA

Posted on December 19, 2019 (updated April 20, 2020) by fighternan

Summary: In this post, I will introduce the sync and async behaviors in CUDA. Conclusion: The following is handy code for testing the behaviors of the CPU and streams. Details: There are two aspects, kernels and streams. 1. Kernels: Some of my conclusions are: all kernels return immediately, no matter whether we use the default stream or…

Thread Pool in C++

Posted on August 4, 2019 by fighternan

Summary: In this post, I will show how to build a simple thread pool in C++. Conclusion: The code is from here. The thread pool only uses thread, mutex, and condition_variable. #include <thread> #include <mutex> #include <condition_variable> class ThreadPool; // our worker thread objects class Worker { public: Worker(ThreadPool &s) : pool(s) { } void…

Mutex in C++ and Java

Posted on August 4, 2019 by fighternan

Summary: In this post, I will introduce the correct way to use a mutex in C++, compared with Java. Conclusion: Try not to use std::mutex directly. Use std::unique_lock, std::lock_guard, or std::scoped_lock (since C++17) to manage locking in a more exception-safe manner. Undefined behavior occurs if a) a mutex is destroyed while still owned by any…

Profile Applications in CUDA

Posted on May 17, 2019 by fighternan

Summary: In this post, I will show how to use the nvprof tool to profile your CUDA applications. Details: When you want to optimize your applications, it is good practice to dig deeper and see how much time each kernel or each CUDA runtime API call takes. Intuition: It is not good to use any…

Install CUDA 10.1 and Driver 418

Posted on April 7, 2019 (updated April 20, 2020) by fighternan

Summary: In this post, I will show how to install the newest CUDA and the corresponding NVIDIA driver on Ubuntu 16.04. Details: I want to use CUDA for neural network inference. But after I compile the executable files and run them, it reports that the driver is not compatible with this version of CUDA. I have a GTX 1060 and…

Intel Threading Building Blocks

Posted on December 20, 2018 (updated May 17, 2019) by fighternan

Summary: In this post, I will show how to solve a parallel computation task using Intel Threading Building Blocks. Problem: On a deep learning platform, given an input of several thousand images, we want to analyze the data path of a certain deep learning model. The analysis of each image is identical; for example, we…