{"id":38,"date":"2018-12-20T19:57:42","date_gmt":"2018-12-21T03:57:42","guid":{"rendered":"http:\/\/35.243.195.209\/?p=38"},"modified":"2019-05-17T14:04:52","modified_gmt":"2019-05-17T21:04:52","slug":"38","status":"publish","type":"post","link":"https:\/\/nanzhou.cc\/index.php\/2018\/12\/20\/38\/","title":{"rendered":"Intel Threading Building Blocks"},"content":{"rendered":"<h2>Summary<\/h2>\n<p>In this post, I show how to solve a parallel computation task using <a>Intel Threading Building Blocks<\/a>.<\/p>\n<h2>Problem<\/h2>\n<p>On our deep learning platform, given an input set of several thousand images, we want to analyze the data path of a certain deep learning model. The analysis is identical for every image: for example, we want to collect the minimum and maximum values of a certain layer in the model. Say that we only care about the output of the last Softmax layer, which is basically a 1-D array. The problem is then to collect the minimum and maximum values (two floats) of that layer over all the input images.<\/p>\n<h2>Settings<\/h2>\n<ol>\n<li>The deep learning platform runs on a compiler designed in-house, which only supports C++. Thus, our solution must be written in C++.<\/li>\n<li>Say that we have 5,000 images.<\/li>\n<li>The model is quite complex and takes about 2 minutes to produce the output for 1 image. If we process the images sequentially, one by one, it takes about $2 \\times 5,000 \/ 60 \\approx 167 \\text{ hours} \\approx 7 \\text{ days}$. We cannot afford to wait that long.<\/li>\n<\/ol>\n<h2>Research<\/h2>\n<p>Since every image is processed in the same way, the natural idea is to use multiple threads to run the computations in parallel. Creating threads is easy in modern C++ (C++11 and later) thanks to the standard thread library. 
However, it makes no sense to create 5,000 threads: the overhead of creating and destroying that many threads would ruin the performance (see this <a>reference<\/a>).<\/p>\n<p>We could maintain a thread pool manually. We first create <code>MAX_THREADS<\/code> threads (a typical value is the number of cores), and every time a thread finishes, we start a new one until all the images have been processed.<\/p>\n<p>It is not a good idea to reinvent the wheel, though. Two of the most popular libraries that implement thread pools are <a href=\"https:\/\/www.amazon.com\/Intel-Threading-Building-Blocks-Parallelism\/dp\/0596514808\/ref=sr_1_1?ie=UTF8&amp;qid=1502746643&amp;sr=8-1&amp;keywords=thread+building+block\">Intel Threading Building Blocks (TBB)<\/a> and <a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/dd492418.aspx\">Microsoft Parallel Patterns Library (PPL)<\/a>.<\/p>\n<p>After some research, I found that PPL is not well suited to cross-platform development. Thus, we chose TBB.<\/p>\n<h2>Install and Build<\/h2>\n<p>I ran into many problems before I got it to work. I only present the approach that succeeded, for you and for my own later review.<\/p>\n<ol>\n<li>Install the library. The easiest way is to use apt-get.\n<pre><code>sudo apt-get install libtbb-dev<\/code><\/pre>\n<p>This installs the TBB headers and libraries into the standard system locations, which helps the <code>find_package<\/code> call in CMake that follows.<\/p>\n<p>Try not to build TBB from source, since there is no <code>make install<\/code> and the scripts that set up the paths in TBB do not work well.<\/p><\/li>\n<li>CMake. In order to include the headers and link the .so files of TBB, you would otherwise have to write a well-defined Makefile. We use CMake instead, which is more readable. However, to make <code>find_package<\/code> work, we need to download another <a>file<\/a>. In your CMakeLists.txt, write the following code. 
The code is self-explanatory.\n<pre><code class=\"cmake\">###############\n# Add the directory containing FindTBB.cmake to the module path\nlist(APPEND CMAKE_MODULE_PATH \"${PROJECT_SOURCE_DIR}\/\")\n\n# Set RPATHS in executables\nset(CMAKE_INSTALL_RPATH \"${CMAKE_INSTALL_PREFIX}\/lib\")\n\n# ==============================================================================\n#\n# Print input variables used by FindTBB.cmake\n\nmessage(\"CMAKE_SYSTEM_NAME = '${CMAKE_SYSTEM_NAME}'\")\nmessage(\"CMAKE_BUILD_TYPE  = '${CMAKE_BUILD_TYPE}'\")\n\nmessage(\"User Input Variables:\")\nmessage(\"TBB_ROOT_DIR = '${TBB_ROOT_DIR}'\")\nmessage(\"TBB_INCLUDE_DIR = '${TBB_INCLUDE_DIR}'\")\nmessage(\"TBB_LIBRARY = '${TBB_LIBRARY}'\")\nmessage(\"TBB_tbb_LIBRARY = '${TBB_tbb_LIBRARY}'\")\nmessage(\"TBB_tbb_debug_LIBRARY = '${TBB_tbb_debug_LIBRARY}'\")\nmessage(\"TBB_tbbmalloc_LIBRARY = '${TBB_tbbmalloc_LIBRARY}'\")\nmessage(\"TBB_tbbmalloc_debug_LIBRARY = '${TBB_tbbmalloc_debug_LIBRARY}'\")\nmessage(\"TBB_tbb_preview_LIBRARY = '${TBB_tbb_preview_LIBRARY}'\")\nmessage(\"TBB_tbb_preview_debug_LIBRARY = '${TBB_tbb_preview_debug_LIBRARY}'\")\nmessage(\"TBB_USE_DEBUG_BUILD = '${TBB_USE_DEBUG_BUILD}'\")\n\nmessage(\"Environment Varaibles used by FindTBB:\")\nmessage(\"TBB_INSTALL_DIR = '${TBB_INSTALL_DIR}'\")\nmessage(\"TBBROOT         = '${TBBROOT}'\")\nmessage(\"LIBRARY_PATH    = '${LIBRARY_PATH}'\")\n\n#find_package(TBB COMPONENTS tbbmalloc tbbmalloc_proxy tbb_preview)\nfind_package(TBB)\n\n# ==============================================================================\n# Print output variables from FindTBB.cmake\n\nset(TBB_SEARCH_COMPOMPONENTS tbb_preview tbbmalloc_proxy tbbmalloc tbb)\n\nmessage(\"FindTBB Result Variables:\")\nmessage(\"TBB_FOUND = '${TBB_FOUND}'\")\nmessage(\"TBB_tbbmalloc_FOUND = '${TBB_tbbmalloc_FOUND}'\")\nmessage(\"TBB_tbbmalloc_proxy_FOUND = '${TBB_tbbmalloc_proxy_FOUND}'\")\nmessage(\"TBB_tbb_preview_FOUND = '${TBB_tbb_preview_FOUND}'\")\nmessage(\"TBB_VERSION = 
'${TBB_VERSION}'\")\nmessage(\"TBB_VERSION_MAJOR = '${TBB_VERSION_MAJOR}'\")\nmessage(\"TBB_VERSION_MINOR = '${TBB_VERSION_MINOR}'\")\nmessage(\"TBB_INTERFACE_VERSION = '${TBB_INTERFACE_VERSION}'\")\nforeach(_comp ${TBB_SEARCH_COMPOMPONENTS})\n   message(\"TBB_${_comp}_LIBRARY_RELEASE = '${TBB_${_comp}_LIBRARY_RELEASE}'\")\n   message(\"TBB_${_comp}_LIBRARY_DEBUG = '${TBB_${_comp}_LIBRARY_DEBUG}'\")\n   message(\"TBB_${_comp}_LIBRARY = '${TBB_${_comp}_LIBRARY}'\")\nendforeach()\n\nmessage(\"FindTBB Output Variables:\")\nmessage(\"TBB_INCLUDE_DIRS = '${TBB_INCLUDE_DIRS}'\")\nmessage(\"TBB_LIBRARIES_RELEASE = '${TBB_LIBRARIES_RELEASE}'\")\nmessage(\"TBB_LIBRARIES_DEBUG = '${TBB_LIBRARIES_DEBUG}'\")\nmessage(\"TBB_LIBRARIES = '${TBB_LIBRARIES}'\")\nmessage(\"TBB_DEFINITIONS = '${TBB_DEFINITIONS}'\")\n\nmessage(\"TBB_INCLUDE_DIRS = '${TBB_INCLUDE_DIRS}'\")\nmessage(\"TBB_LIBRARIES_RELEASE = '${TBB_LIBRARIES_RELEASE}'\")\nmessage(\"TBB_LIBRARIES_DEBUG = '${TBB_LIBRARIES_DEBUG}'\")\nmessage(\"TBB_LIBRARIES = '${TBB_LIBRARIES}'\")\nmessage(\"TBB_DEFINITIONS = '${TBB_DEFINITIONS}'\")<\/code><\/pre>\n<p>A successful CMake run prints output like the following.<\/p>\n<pre><code class=\"language-bash\">CMAKE_SYSTEM_NAME = 'Linux'\nCMAKE_BUILD_TYPE  = 'Debug'\nUser Input Variables:\nTBB_ROOT_DIR = ''\nTBB_INCLUDE_DIR = ''\nTBB_LIBRARY = ''\nTBB_tbb_LIBRARY = ''\nTBB_tbb_debug_LIBRARY = ''\nTBB_tbbmalloc_LIBRARY = ''\nTBB_tbbmalloc_debug_LIBRARY = ''\nTBB_tbb_preview_LIBRARY = ''\nTBB_tbb_preview_debug_LIBRARY = ''\nTBB_USE_DEBUG_BUILD = ''\nEnvironment Varaibles used by FindTBB:\nTBB_INSTALL_DIR = ''\nTBBROOT         = ''\nLIBRARY_PATH    = ''\nFindTBB Result Variables:\nTBB_FOUND = 'TRUE'\nTBB_tbbmalloc_FOUND = ''\nTBB_tbbmalloc_proxy_FOUND = ''\nTBB_tbb_preview_FOUND = ''\nTBB_VERSION = '4.4'\nTBB_VERSION_MAJOR = '4'\nTBB_VERSION_MINOR = '4'\nTBB_INTERFACE_VERSION = '9002'\nTBB_tbb_preview_LIBRARY_RELEASE = ''\nTBB_tbb_preview_LIBRARY_DEBUG = ''\nTBB_tbb_preview_LIBRARY = 
''\nTBB_tbbmalloc_proxy_LIBRARY_RELEASE = ''\nTBB_tbbmalloc_proxy_LIBRARY_DEBUG = ''\nTBB_tbbmalloc_proxy_LIBRARY = ''\nTBB_tbbmalloc_LIBRARY_RELEASE = ''\nTBB_tbbmalloc_LIBRARY_DEBUG = ''\nTBB_tbbmalloc_LIBRARY = ''\nTBB_tbb_LIBRARY_RELEASE = '\/usr\/lib\/x86_64-linux-gnu\/libtbb.so'\nTBB_tbb_LIBRARY_DEBUG = 'TBB_tbb_LIBRARY_DEBUG-NOTFOUND'\nTBB_tbb_LIBRARY = ''\nFindTBB Output Variables:\nTBB_INCLUDE_DIRS = '\/usr\/include'\nTBB_LIBRARIES_RELEASE = '\/usr\/lib\/x86_64-linux-gnu\/libtbb.so'\nTBB_LIBRARIES_DEBUG = ''\nTBB_LIBRARIES = '\/usr\/lib\/x86_64-linux-gnu\/libtbb.so'\nTBB_DEFINITIONS = ''\nTBB_INCLUDE_DIRS = '\/usr\/include'\nTBB_LIBRARIES_RELEASE = '\/usr\/lib\/x86_64-linux-gnu\/libtbb.so'\nTBB_LIBRARIES_DEBUG = ''\nTBB_LIBRARIES = '\/usr\/lib\/x86_64-linux-gnu\/libtbb.so'\nTBB_DEFINITIONS = ''<\/code><\/pre>\n<\/li>\n<\/ol>\n<h2>Usage Example<\/h2>\n<p>TBB has good <a>documentation<\/a>. Here we use <code>parallel_for_each<\/code>. Note that when updating the global analysis, we must avoid a race condition (here I use a mutex).<\/p>\n<pre><code class=\"language-c\">\/\/ Project-specific names (datapath_analysis_new::Layer, update, analyze_one_file,\n\/\/ all_inputs) are assumed to be declared elsewhere in the project.\n#include &lt;iostream&gt;\n#include &lt;mutex&gt;\n#include &lt;string&gt;\n#include &lt;unordered_map&gt;\n#include &lt;tbb\/parallel_for_each.h&gt;\n\nusing namespace std;\nusing namespace tbb;\n\n\/\/ use global variables to simplify the code\nunordered_map&lt;string, datapath_analysis_new::Layer&gt; layer_name_to_all_files_analysis;\nmutex all_files_analysis_mutex;\n\nvoid safe_analysis_update(unordered_map&lt;string, datapath_analysis_new::Layer&gt; layer_name_to_one_file_analysis) {\n    \/\/ while a thread is updating layer_name_to_all_files_analysis,\n    \/\/ other threads wait here; the lock is released when the guard goes out of scope\n    lock_guard&lt;mutex&gt; guard(all_files_analysis_mutex);\n    \/\/ now we can update layer_name_to_all_files_analysis safely\n    if (layer_name_to_all_files_analysis.empty()) {\n        \/\/ it is the first file\n        layer_name_to_all_files_analysis = layer_name_to_one_file_analysis;\n    } else {\n        update(layer_name_to_all_files_analysis, layer_name_to_one_file_analysis);\n    }\n}\n\nint main() {\n    parallel_for_each(all_inputs.begin(), 
all_inputs.end(), [&amp;] (const string&amp; input_file) {\n        try {\n            auto layer_name_to_one_file_analysis = analyze_one_file(input_file);\n            safe_analysis_update(layer_name_to_one_file_analysis);\n        } catch (const std::exception&amp; e) {\n            \/\/ report the failure for this file; the other files are still processed\n            cerr &lt;&lt; e.what() &lt;&lt; endl;\n        }\n    });\n}<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Summary In this post, I will introduce how to solve a parallel computation task using Intel Threading Building Blocks. Problem In the deep learning platform, given inputs contains several thousand images, we want to analyze the data path of a certain deep learning model. The analysis part of each image is identical, for example, we&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[32,31],"tags":[],"class_list":["post-38","post","type-post","status-publish","format-standard","hentry","category-muitl-thread","category-parallel-computation"],"_links":{"self":[{"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/posts\/38","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/comments?post=38"}],"version-history":[{"count":9,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/posts\/38\/revisions"}],"predecessor-version":[{"id":63,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/posts\/38\/revisions\/63"}],"wp:attachment":[{"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/media?parent=38"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:
\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/categories?post=38"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nanzhou.cc\/index.php\/wp-json\/wp\/v2\/tags?post=38"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}