Student: Zhao-Wei Qiu (邱兆偉)
Thesis Title: Memory Oversubscription Management in GPUs for Boosting Multiple Neural Networks Inference
Advisor: Ya-Shu Chen (陳雅淑)
Committee Members: Jen-Wei Hsieh (謝仁偉), Chin-Hsien Wu (吳晉賢)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2021
Academic Year of Graduation: 109 (2020-2021)
Language: English
Pages: 50
Keywords: Memory oversubscription, Memory management, Memory thrashing

    Modern intelligent devices usually run multiple neural networks to provide better services. However, system performance degrades significantly when the working set exceeds the physical memory capacity, a situation known as memory oversubscription. To support the execution of multiple neural networks with limited physical memory, this thesis explores resource management in GPUs with unified virtual memory and demand paging. We first analyze the relationship between the number of streaming multiprocessors (SMs) assigned to concurrently executing neural networks and the page fault penalty incurred by memory thrashing. To boost performance by reducing the page fault penalty, we propose batch-aware resource management (BARM), which comprises (1) a batch-aware SM dispatcher that increases the batch size of the page fault handler, and (2) a thrashing-prevention memory allocator that eliminates run-time thrashing. We evaluated the proposed methodology with a series of workloads; it reduces response latency significantly compared with a state-of-the-art page fault prefetcher and batch-aware TLP management. We also implemented the framework on a real platform and evaluated it through a case study, where response latency again improved significantly over the default Linux configuration.
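
    The abstract's setting can be reproduced with CUDA's unified virtual memory (UVM) API. The sketch below is an illustrative assumption, not code from the thesis: it allocates roughly 1.5x the device's physical memory with cudaMallocManaged, so kernel accesses are served by demand paging and, once device memory fills, by page eviction, i.e., memory oversubscription. The cudaMemAdvise hint is only a loose stand-in for the thesis' thrashing-prevention allocator.

        #include <cstdio>
        #include <cuda_runtime.h>

        // Touch every element so each page of the managed buffer is
        // demand-paged onto the GPU.
        __global__ void touch(float *data, size_t n) {
            size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] += 1.0f;
        }

        int main() {
            size_t free_bytes = 0, total_bytes = 0;
            cudaMemGetInfo(&free_bytes, &total_bytes);

            // Oversubscribe: request ~1.5x physical device memory. Unified
            // memory makes this legal on Pascal-or-newer GPUs; pages migrate
            // on demand and are evicted when device memory is full.
            size_t n = (size_t)(1.5 * (double)total_bytes) / sizeof(float);
            float *data = nullptr;
            cudaMallocManaged(&data, n * sizeof(float));

            // Hypothetical thrashing-avoidance hint: prefer to keep half of
            // the buffer on the host so it is not migrated back and forth.
            cudaMemAdvise(data, (n / 2) * sizeof(float),
                          cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

            touch<<<(unsigned)((n + 255) / 256), 256>>>(data, n);
            cudaDeviceSynchronize();
            printf("last error: %s\n", cudaGetErrorString(cudaGetLastError()));
            cudaFree(data);
            return 0;
        }

    On GPUs or platforms without oversubscription support (pre-Pascal devices, or Windows under WDDM), the oversized cudaMallocManaged may simply fail, which the final error print would report.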

    1 Introduction
    2 Background
      2.1 System Model
      2.2 Unified Virtual Memory in GPUs
    3 Motivation
    4 Approach
      4.1 Page Fault Prediction
      4.2 Batch-aware SM Dispatcher
      4.3 Thrashing-prevention Memory Allocator
    5 Performance Evaluation
      5.1 Experimental Environment
      5.2 Experimental Results
    6 Case Study
    7 Related Work
    8 Conclusion
    References

    Full-text release date: 2026/09/08 (campus network)
    Full text not authorized for public release (off-campus network)
    Full text not authorized for public release (National Central Library: Taiwan NDLTD system)