
Graduate Student: Po-Hsun Chen (陳柏勳)
Thesis Title: Video Classification on Edge Devices (用於邊緣設備之影片分類法)
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Committee Members: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂), Kuen-Tsair Lay (賴坤財), Chien-Ching Chiu (丘建青), Shanq-Jang Ruan (阮聖彰)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 110 (2021-2022)
Language: English
Number of Pages: 57
Chinese Keywords: 影片分類, 弱監督式學習, 模型優化, 邊緣裝置
Keywords: Video classification, Weakly supervised learning, Model optimization, Edge device

Abstract:
    Video classification, such as recognizing driver behaviors, human actions, and falls, is one of the important topics in computer vision. However, due to a variety of factors such as camera angle, lighting, and weather, video classification remains a difficult task. In addition, implementation on edge devices is another challenge because of the massive computation required. In this thesis, we aim to implement a video classification network on the widely used edge device Jetson Nano that achieves accurate performance with low latency. Toward this end, we consider a weakly supervised video classification network. First, we choose MobileNet-V2 as our backbone network to obtain a good trade-off between computational cost and accuracy. Since MobileNet-V2 is a 2D convolutional neural network (CNN), it cannot extract temporal relationships, so we add the temporal shift module (TSM) to our network. TSM shifts some channels along the temporal dimension so that neighboring frames can exchange information with each other. To further enhance the accuracy, we also add the non-local operation to the network, which captures long-range dependencies directly by computing the interactions between positions. Afterwards, we consider the implementation of this network on Jetson Nano. For this, we employ the Tensor Virtual Machine (TVM) in our work. We first tune the model with an auto-tuning module (AutoTVM or AutoScheduler) over a remote procedure call (RPC) connection to obtain an optimized schedule for the model. Finally, we import the obtained schedule into TVM to optimize the model. Simulations on a variety of datasets demonstrate the efficacy of this method on Jetson Nano.
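
    To make the shift operation concrete, below is a minimal sketch of the temporal shift in Python/PyTorch, assuming a tensor laid out as (batch, time, channel, height, width); the shift_div hyperparameter (the fraction of channels that get shifted) follows the convention of the public TSM code and is illustrative rather than a detail taken from this thesis.

        import torch

        def temporal_shift(x, shift_div=8):
            # x: (batch, time, channel, height, width)
            n, t, c, h, w = x.size()
            fold = c // shift_div
            out = torch.zeros_like(x)
            # first fold of channels: shift backward in time (frame t sees frame t+1)
            out[:, :-1, :fold] = x[:, 1:, :fold]
            # second fold: shift forward in time (frame t sees frame t-1)
            out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]
            # remaining channels stay in place
            out[:, :, 2 * fold:] = x[:, :, 2 * fold:]
            return out

    Because the shift only moves data along the time axis, it adds essentially no computation, which is why it suits a lightweight 2D backbone such as MobileNet-V2.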
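
    The non-local operation can be sketched similarly as an embedded-Gaussian non-local block in the spirit of Wang et al.'s non-local neural networks; the channel-halving bottleneck below is a common choice, not a detail taken from this thesis.

        import torch
        import torch.nn as nn

        class NonLocalBlock(nn.Module):
            # embedded-Gaussian non-local block with a residual connection
            def __init__(self, channels):
                super().__init__()
                inter = channels // 2
                self.theta = nn.Conv2d(channels, inter, kernel_size=1)
                self.phi = nn.Conv2d(channels, inter, kernel_size=1)
                self.g = nn.Conv2d(channels, inter, kernel_size=1)
                self.out = nn.Conv2d(inter, channels, kernel_size=1)

            def forward(self, x):
                n, c, h, w = x.shape
                q = self.theta(x).flatten(2).transpose(1, 2)  # (n, hw, c/2)
                k = self.phi(x).flatten(2)                    # (n, c/2, hw)
                v = self.g(x).flatten(2).transpose(1, 2)      # (n, hw, c/2)
                # attention weights encode interactions between every pair of positions
                attn = torch.softmax(q @ k, dim=-1)           # (n, hw, hw)
                y = (attn @ v).transpose(1, 2).reshape(n, c // 2, h, w)
                return x + self.out(y)                        # residual connection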
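
    The tuning-and-compilation flow can be illustrated with a condensed Python sketch based on TVM's AutoScheduler API; the device key "jetson-nano", the tracker address, and the trial budget are placeholders, and it assumes an RPC tracker is already running with the Jetson Nano registered to it. A plain MobileNet-V2 stands in here for the full video network.

        import torch
        import torchvision
        import tvm
        from tvm import relay, auto_scheduler

        # Trace a stand-in model and convert it to a Relay module.
        model = torchvision.models.mobilenet_v2().eval()
        scripted = torch.jit.trace(model, torch.randn(1, 3, 224, 224))
        mod, params = relay.frontend.from_pytorch(scripted, [("input", (1, 3, 224, 224))])

        target = tvm.target.Target("cuda -arch=sm_53")  # Jetson Nano GPU

        # Extract the tuning tasks (one per fused operator) from the network.
        tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

        # Measure candidate schedules on the board itself through the RPC tracker.
        runner = auto_scheduler.RPCRunner("jetson-nano", host="127.0.0.1", port=9190)
        tune_option = auto_scheduler.TuningOptions(
            num_measure_trials=2000,  # illustrative budget
            runner=runner,
            measure_callbacks=[auto_scheduler.RecordToFile("tuning_log.json")],
        )
        auto_scheduler.TaskScheduler(tasks, task_weights).tune(tune_option)

        # Re-compile the model with the best schedules found during tuning.
        with auto_scheduler.ApplyHistoryBest("tuning_log.json"):
            with tvm.transform.PassContext(
                opt_level=3, config={"relay.backend.use_auto_scheduler": True}
            ):
                lib = relay.build(mod, target=target, params=params)

    The same flow applies with AutoTVM in place of AutoScheduler; in both cases the RPC link lets the search measure real latencies on the Jetson Nano rather than on the host machine.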

Table of Contents:
    Chinese Abstract i
    Abstract ii
    Acknowledgment iii
    Table of Contents iv
    List of Figures vii
    List of Tables x
    List of Acronyms xi
    1 Introduction 1
    2 Related Work 3
      2.1 Video Classification 3
      2.2 Weakly Supervised Learning 4
      2.3 Temporal Modeling 4
      2.4 Edge Computation 5
      2.5 Model Optimization 5
      2.6 Summary 6
    3 Proposed Method 7
      3.1 Proposed Architecture 7
      3.2 TSM Architecture 8
        3.2.1 Temporal Shift Module 8
        3.2.2 Non-Local Operation 9
        3.2.3 Segmental Consensus 10
        3.2.4 Loss Function 10
      3.3 Implementation on Edge Devices 11
        3.3.1 Tensor Virtual Machine 11
        3.3.2 Auto-tuning Module 13
        3.3.3 Remote Procedure Call 13
      3.4 Summary 14
    4 Experimental Results and Discussions 15
      4.1 Datasets 15
        4.1.1 Driver Monitoring Dataset 15
        4.1.2 UP Fall Dataset 17
        4.1.3 Our Driver Monitoring Dataset 18
        4.1.4 Our Fall Dataset 20
      4.2 Experimental Setup 20
        4.2.1 Model Parameters 20
        4.2.2 Data Augmentation 21
        4.2.3 Evaluation Metrics 22
      4.3 Experimental Results 22
        4.3.1 DMD Dataset Results 22
        4.3.2 UP Fall Dataset Results 24
        4.3.3 Our Driver Monitoring Dataset 26
        4.3.4 Our Fall Dataset 28
        4.3.5 Implementation on Jetson Nano 29
      4.4 Failure Cases and Error Analysis 32
        4.4.1 Imbalanced Numbers of Each Class 32
        4.4.2 Detailed Action of Fall Event 34
      4.5 Summary 37
    5 Conclusion and Future Works 38
      5.1 Conclusion 38
      5.2 Future Works 38
    References 39


Full Text Release Date: 2024/08/23 (campus network)
Full Text Release Date: 2024/08/23 (off-campus network)
Full Text Release Date: 2024/08/23 (National Central Library: Taiwan NDLTD system)