
Graduate Student: Kun-Sheng Liu (劉錕笙)
Thesis Title: A Batch Inference Scheduler for Parallel Inference Multi-Tenant Neural Networks on GPU (用於GPU上並行推理多個神經網絡的批量推理排程)
Advisor: Ya-Shu Chen (陳雅淑)
Committee Members: Jen-Wei Hsieh (謝仁偉), Chin-Hsien Wu (吳晉賢), Pi-Cheng Hsiu (修丕承)
Degree: Master's
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Thesis Publication Year: 2023
Graduation Academic Year: 111 (2022-2023)
Language: English
Pages: 39
Keywords (Chinese, translated): multi-neural-network computation, neural network batch inference, memory oversubscription, memory resource management, GPU resource scheduling
Keywords (English): Multi-tenant DNN, Batch inference, Memory oversubscription, Memory management, GPU resource schedule
Access Count: Views: 196; Downloads: 0
Chinese Abstract (translated to English): To improve the quality of service of today's intelligent devices, more and more applications need to run multiple different neural networks at the same time, and each neural network may receive multiple execution requests (referred to as batch inference). Batch inference improves memory and compute-resource utilization; however, on memory-constrained platforms such as embedded devices, the increased demand on compute resources can instead degrade performance. In this setting, avoiding contention for memory and compute resources while reducing execution time is a scheduling challenge.

    In this study, we propose a run-time scheduling scheme that jointly considers batch inference and the page-fault penalty: it decides the batch size for each layer of the neural networks and allocates the system's limited compute resources, thereby reducing application execution time. We tested the proposed scheduling method under various compute-resource and memory scenarios, and the results show that it significantly reduces execution time.


English Abstract: To enhance the service quality of modern intelligent devices, applications increasingly run multiple neural networks concurrently, each of which may serve multiple inference requests (referred to as batch inference). Although batch inference raises memory and compute utilization, the limited memory capacity of embedded systems can cause performance to degrade instead. The scheduling challenge in such applications arises from compute-resource contention and memory thrashing. In this study, we propose a runtime scheduler that, accounting for multiple requests and the page-fault penalty, determines the compute-resource and memory allocation for each layer and selects an appropriate batch size per layer to minimize the application's response time. We evaluated the proposed scheduler under various compute-resource and memory conditions, and the results show encouraging improvements in response time.
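
    The record does not reproduce the scheduler itself, so the following is a minimal illustrative sketch (in Python) of the kind of per-layer decision the abstract describes: splitting a layer's pending requests into sub-batches whose size trades the throughput benefit of batching against a page-fault penalty that appears once the layer's working set overflows the available GPU memory. The cost model, constants, and all names (Layer, layer_latency_ms, choose_batch) are assumptions made for illustration, not the algorithm proposed in the thesis.

    # Illustrative sketch only: a toy per-layer batch-size chooser. The linear
    # compute model and the page-fault penalty term are assumed for illustration,
    # not taken from the thesis.
    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        base_latency_ms: float        # latency of a single-request inference of this layer
        batch_scaling: float          # marginal cost of each extra request in a batch (0..1)
        activation_mb_per_req: float  # per-request activation / working-set footprint
        weight_mb: float              # weights are shared across the whole batch

    def layer_latency_ms(layer: Layer, batch: int, free_mem_mb: float,
                         fault_ms_per_mb: float = 0.5) -> float:
        """Estimated latency of one sub-batch: compute time plus a penalty
        proportional to how far the working set overflows the memory budget."""
        compute = layer.base_latency_ms * (1 + layer.batch_scaling * (batch - 1))
        working_set = layer.weight_mb + layer.activation_mb_per_req * batch
        overflow = max(0.0, working_set - free_mem_mb)
        return compute + fault_ms_per_mb * overflow  # page-fault (migration) cost

    def choose_batch(layer: Layer, requests: int, free_mem_mb: float) -> int:
        """Pick the sub-batch size that minimizes the total time to serve
        `requests` inferences of this layer."""
        best_batch, best_total = 1, float("inf")
        for b in range(1, requests + 1):
            n_subbatches = -(-requests // b)  # ceiling division
            total = n_subbatches * layer_latency_ms(layer, b, free_mem_mb)
            if total < best_total:
                best_batch, best_total = b, total
        return best_batch

    if __name__ == "__main__":
        conv = Layer("conv3_1", base_latency_ms=2.0, batch_scaling=0.35,
                     activation_mb_per_req=48.0, weight_mb=9.0)
        # With ample memory the full batch of 16 wins; under a tight budget the
        # chosen batch shrinks to stay below the oversubscription threshold.
        print(choose_batch(conv, requests=16, free_mem_mb=4096.0))  # -> 16
        print(choose_batch(conv, requests=16, free_mem_mb=256.0))   # -> a small batch

    Under such a model the best batch size is layer-dependent, which is why a per-layer decision, rather than a single application-wide batch size, matters on memory-constrained devices.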

Table of Contents:
    1 INTRODUCTION
    2 BACKGROUND AND SYSTEM MODEL
      2.1 Memory Oversubscription
      2.2 Batch Inference
      2.3 System Model
    3 RELATED WORK
      3.1 Batch Inference
      3.2 Resource Allocation
      3.3 Multiple NN Inference
    4 MOTIVATION
      4.1 Batch Inference
      4.2 Multiple NN Inference in Parallel
    5 APPROACH
      5.1 Batch-Aware SM Dispatcher
      5.2 Overhead-Reduction Batch Inference Scheduler
    6 EVALUATION
      6.1 Experimental Environment
      6.2 Experimental Results
    7 CONCLUSION
    References


Full-Text Release Date: 2028/08/30 (campus network); 2028/08/30 (off-campus network); 2028/08/30 (National Central Library: Taiwan NDLTD system)