| Field | Value |
|---|---|
| Student | 廖偉丞 Wei-Cheng Liao |
| Thesis Title | 改善相依性程式之排程法 (Dependency-Aware GPGPU Kernel Scheduling) |
| Advisor | 黃元欣 Yuan-Shin Hwang |
| Committee Members | 謝仁偉 Jen-Wei Hsieh, 賴祐吉 Yu-Chi Lai |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science |
| Publication Year | 2017 |
| Academic Year of Graduation | 106 |
| Language | Chinese |
| Pages | 41 |
| Keywords (Chinese) | general-purpose GPU computing, dependent kernel scheduling, thread block (CTA) scheduling, cache write-back policy |
| Keywords (English) | GPGPU, Dependent Kernel Scheduling, CTA Scheduling, Cache Write Policy |
GPUs are now widely used in many areas, such as image processing, deep learning, and artificial intelligence, and related research continues to appear, showing the growing importance of this field. Programs for deep learning and artificial intelligence typically require massive amounts of computation; this work is not handed to a single kernel but is divided among different sub-kernels. We classify sub-kernels that carry data dependences among them as dependent kernels.
Under current GPU programming practice, dependent kernels must be executed sequentially because of their data dependences, which reduces GPU parallelism. In addition, the GPU normally writes processed data back to memory, but dependent kernels will reuse that data, so these write-backs cause unnecessary memory accesses.
The method proposed in this thesis modifies the scheduler in the simulator to break the rule that dependent kernels must execute sequentially, and combines it with a suitable memory write-back policy that keeps data in the cache so it can be reused, thereby improving performance.
Modern GPUs are widely used in a variety of areas, such as image processing, deep learning, and artificial intelligence, and related research is constantly being published. Programs such as deep learning and artificial intelligence workloads do not execute as a single kernel; instead, their tasks are assigned to different sub-kernels that are data dependent on one another. We categorize these as dependent kernels.
With current GPU programming practice, these dependent kernels are executed sequentially because of their data dependences, which reduces GPU parallelism. In addition, the GPU usually writes processed data back to memory, but for dependent kernels this data will be used again, resulting in unnecessary memory accesses.
We propose a method that breaks the rule that dependent kernels must execute in sequence by modifying the scheduling method in the simulator, and that retains data in the cache with an appropriate write-back policy so it can be reused.
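The baseline behavior the abstract describes can be sketched in CUDA. This is a minimal illustration, not the thesis's benchmark code; the kernel names, sizes, and the elementwise computation are all assumed for the example.

```cuda
#include <cuda_runtime.h>

// Hypothetical producer kernel: writes one element per thread.
__global__ void producer(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 2.0f * i;
}

// Hypothetical consumer kernel: data dependent on producer's output.
__global__ void consumer(const float *buf, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] + 1.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *buf, *out;
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);
    // Both launches go to the default stream, so the hardware serializes
    // them: no CTA of consumer starts until every CTA of producer has
    // finished, even though here each consumer element needs only one
    // producer element. Between the two kernels, producer's results are
    // written back toward DRAM and then re-read by consumer -- the two
    // inefficiencies (lost parallelism, redundant memory traffic) that
    // the proposed scheduler and write-back policy target.
    producer<<<grid, block>>>(buf, n);
    consumer<<<grid, block>>>(buf, out, n);
    cudaDeviceSynchronize();

    cudaFree(buf);
    cudaFree(out);
    return 0;
}
```

The thesis's approach, by contrast, would allow consumer CTAs whose input CTAs have completed to be scheduled early, while a modified cache write policy keeps `buf` resident for reuse.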