Summary
In this post, I will introduce how to use the tool nvprof
to profile your CUDA applications.
Details
When you want to optimize your application, it is good practice to dive deeper and see how much time each kernel and each CUDA runtime API call takes.
Intuition
It is not ideal to use an external timing framework, because on the host side what you measure is merely the time the host takes to launch the device functions. Launches are asynchronous, so to measure the actual device execution time you have to force a synchronization and make sure the kernels have finished at the point where you take the measurement. However, since kernels may run concurrently on different streams, those synchronizations can affect your application's actual performance.
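To make this concrete, below is a minimal sketch of what manual timing with CUDA events looks like (the kernel myKernel and all sizes are invented for this example). The explicit cudaEventSynchronize is exactly the forced synchronization mentioned above, and it is what a profiler lets you avoid writing into your application.

// Minimal sketch of manual kernel timing with CUDA events.
// myKernel and the sizes below are made up for illustration only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);  // launch returns immediately
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block the host until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}

Without the cudaEventSynchronize call, the host would read the timer right after the asynchronous launch and report a meaninglessly small number.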
Fortunately, NVIDIA provides official tools for profiling applications. nvprof is a lightweight profiling tool that enables you to collect and view profiling data from the command line.
In the next part of this post, I will introduce the basic usage of nvprof.
Basic Usage
Say I want to profile my neural network inference application.
According to this post, you have to switch to the root user first.
root@zhoun14: sudo su
You may need to manually locate nvprof inside your CUDA installation. Simply call nvprof ./myApp to start the profiling.
root@zhoun14:# /usr/local/cuda-10.1/bin/nvprof ./cmake-build-debug/run_inferencer -m FR.onnx -o last_layer.txt
model = FR.onnx
output = last_layer.txt
==15630== NVPROF is profiling process 15630, command: ./cmake-build-debug/run_inferencer -m FR.onnx -o last_layer.txt
==15630== Warning: Profiling results might be incorrect with current version of nvcc compiler used to compile cuda app. Compile with nvcc compiler 9.0 or later version to get correct profiling results. Ignore this warning if code is already compiled with the recommended nvcc version
Done!
==15630== Profiling application: ./cmake-build-debug/run_inferencer -m FR.onnx -o last_layer.txt
==15630== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 93.34% 68.179ms 604 112.88us 1.1200us 16.896ms [CUDA memcpy HtoD]
2.37% 1.7291ms 37 46.731us 6.4000us 170.98us sgemm_32x32x32_NN_vec
1.07% 785.00us 54 14.537us 1.8560us 115.94us im2col_gpu_kernel
0.77% 565.73us 260 2.1750us 768ns 12.704us fill_kernel
0.62% 455.17us 8 56.896us 53.761us 66.496us maxwell_sgemm_128x64_nn
0.57% 414.82us 79 5.2500us 1.6960us 34.241us void fast_batchnorm_kernel<float>(float*, float*, float*, float*, float*, float*, float, unsigned int, unsigned int, unsigned int)
0.49% 354.40us 7 50.628us 5.9520us 62.977us sgemm_32x32x32_NN
0.24% 173.31us 1 173.31us 173.31us 173.31us void gemv2T_kernel<int, int, float, float, float, int=128, int=16, int=4, int=4, bool=0, cublasGemvParams<cublasGemvTensorStridedBatched<float const >, cublasGemvTensorStridedBatched<float>, float>>(float const )
0.18% 133.95us 24 5.5810us 1.7920us 28.576us zeropad_kernel
0.16% 117.22us 24 4.8840us 1.9840us 11.232us void fast_add_kernel<float>(float**, float*, unsigned int, unsigned int)
0.14% 101.60us 25 4.0640us 1.3120us 23.360us void fast_leaky_relu_kernel<float>(float*, float*, float, int)
0.02% 16.096us 1 16.096us 16.096us 16.096us maxwell_sgemm_128x128_nn
0.02% 13.728us 8 1.7160us 1.0560us 6.0160us [CUDA memset]
0.00% 3.2640us 1 3.2640us 3.2640us 3.2640us void splitKreduce_kernel<float, float, float>(cublasSplitKParams<float>, float const *, float const *, float*, float const *, float const *)
0.00% 1.9520us 1 1.9520us 1.9520us 1.9520us add_bias_kernel
0.00% 1.4720us 1 1.4720us 1.4720us 1.4720us [CUDA memcpy DtoH]
API calls: 63.75% 349.74ms 235 1.4882ms 762ns 247.18ms cudaFree
22.87% 125.45ms 606 207.01us 1.4810us 109.92ms cudaMalloc
12.60% 69.095ms 605 114.21us 3.2550us 15.990ms cudaMemcpy
0.39% 2.1482ms 522 4.1150us 3.1760us 33.269us cudaLaunchKernel
0.25% 1.3453ms 285 4.7200us 95ns 201.08us cuDeviceGetAttribute
0.05% 292.69us 3 97.563us 95.337us 101.15us cuDeviceTotalMem
0.04% 214.00us 2046 104ns 76ns 623ns cudaGetLastError
0.03% 156.25us 3 52.082us 50.296us 53.649us cuDeviceGetName
0.01% 53.811us 467 115ns 84ns 402ns cudaPeekAtLastError
0.01% 27.846us 8 3.4800us 2.8330us 6.8830us cudaMemsetAsync
0.00% 20.851us 55 379ns 308ns 1.1510us cudaGetDevice
0.00% 11.753us 9 1.3050us 998ns 2.3830us cudaEventQuery
0.00% 11.078us 16 692ns 364ns 5.4220us cudaEventCreateWithFlags
0.00% 7.5690us 9 841ns 664ns 1.6170us cudaEventRecord
0.00% 3.6160us 11 328ns 211ns 936ns cudaDeviceGetAttribute
0.00% 2.5050us 1 2.5050us 2.5050us 2.5050us cudaSetDevice
0.00% 2.4690us 1 2.4690us 2.4690us 2.4690us cuDeviceGetPCIBusId
0.00% 2.4460us 5 489ns 131ns 1.2070us cuDeviceGetCount
0.00% 1.4280us 4 357ns 209ns 634ns cuDeviceGet
0.00% 1.1030us 2 551ns 548ns 555ns cuInit
0.00% 772ns 2 386ns 351ns 421ns cuDriverGetVersion
0.00% 513ns 3 171ns 167ns 176ns cuDeviceGetUuid
We have the following observations:
- The most expensive GPU activity is the memcpy from host to device, since we have to copy all the weights of the neural network from the CPU to global memory on the GPU;
- cudaFree and cudaMalloc dominate the API calls; together they take several times longer than the cudaMemcpy calls themselves (roughly 475 ms versus 69 ms);
- The convolution kernels (gemm and im2col) take longer than the other kernels, which is expected since convolutions dominate the computation in this network.
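If this summary is not detailed enough, nvprof can also print every GPU activity individually via its --print-gpu-trace option. For example, adding the flag to the same invocation as before (paths are from my setup, adjust to yours):

root@zhoun14:# /usr/local/cuda-10.1/bin/nvprof --print-gpu-trace ./cmake-build-debug/run_inferencer -m FR.onnx -o last_layer.txt

This lists each kernel launch and memory copy separately with its timestamp, duration, grid and block dimensions, and stream, which makes it easier to see exactly which of the 604 host-to-device copies are the expensive ones.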