Student: | 范綱元 Gang-Yuan Fan
---|---
Thesis Title: | 具備共同決策的物聯網邊緣攝影機之環境感知稠密影片字幕生成 (Environment-aware Dense Video Captioning for IoT-enabled Edge Cameras with Joint-Decision)
Advisor: | 陸敬互 Ching-Hu Lu
Committee Members: | 蘇順豐 Shun-Feng Su, 鍾聖倫 Sheng-Luen Chung, 花凱龍 Kai-Long Hua, 黃正民 Zheng-Min Huang, 陸敬互 Ching-Hu Lu
Degree: | Master
Department: | Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: | 2021
Graduation Academic Year: | 109
Language: | Chinese
Pages: | 97
Keywords: | Dense Video Captioning, Lightweight Neural Network, Environment-aware Adaptation, Edge Computing, Internet of Things
In recent years, Artificial Intelligence of Things (AIoT) has driven the rapid development of edge computing, and existing video captioning systems now have the opportunity to be deployed directly on AIoT cameras (hereinafter referred to as edge cameras), whose computing resources are becoming increasingly powerful. This study therefore proposes a lightweight dense video captioning network based on the Transformer framework, which strengthens the correlations among events in a video and thereby reduces the number of layers and the complexity of the model, allowing it to run quickly on edge cameras. In addition, to investigate the effect of concept drift on the video captioning network, we also propose an environment-aware adaptation scheme built on a Transformer-encoder-based environment-aware detection network, which allows the captioning model to respond to environmental changes and produce more accurate captions. The experiments are divided into two parts: a speed-oriented model that can be deployed on edge cameras, and a quality-oriented model that outperforms existing models. For the speed-oriented design, the experimental results on dense video captioning show that BLEU-3 increases by 23.5%, BLEU-4 by 18.3%, and METEOR by 8%, while computation time decreases by 46.4%; the model runs at 27.63 FPS on an edge camera (NVIDIA Jetson TX2 platform), 4.7% faster than existing work, and this study is the first in this area to take running speed into account. For the quality-oriented design, BLEU-3, BLEU-4, and METEOR increase by 6%, 0.36%, and 6.1%, respectively, with ground-truth event proposals, and by 30%, 58.9%, and 7.4%, respectively, with learned event proposals. In addition, for the Transformer-based environment-aware detection network, the mAP (mean Average Precision) is 11.3% higher and the accuracy is 4.3% higher than in existing research. In conclusion, the proposed dense video captioning model performs better than previous work and offers greater application flexibility across different scenarios.
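The abstract describes the architecture only at a high level: a reduced-depth Transformer that encodes video features and decodes a caption for each detected event. The PyTorch sketch below is a minimal illustration of that general idea under stated assumptions, not the thesis's actual model; the module names, layer counts, dimensions, and the naive event-scoring head are all assumptions made for the example.

```python
# Minimal illustrative sketch (NOT the thesis's actual architecture): a shallow
# Transformer encoder over pre-extracted clip features plus a caption decoder.
# All names, dimensions, and layer counts below are assumptions for illustration.
import torch
import torch.nn as nn


class LightweightDenseCaptioner(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, vocab_size=10000,
                 num_layers=2, num_heads=8, max_events=10):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)  # project clip features (e.g. I3D-style)
        enc_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        # Few layers keep the parameter count and latency low for edge deployment.
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.event_head = nn.Linear(d_model, max_events)  # toy per-clip event-proposal scores
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.word_head = nn.Linear(d_model, vocab_size)

    def forward(self, clip_feats, caption_tokens):
        # clip_feats: (B, T, feat_dim); caption_tokens: (B, L) word ids generated so far.
        # (Positional encodings are omitted for brevity.)
        memory = self.encoder(self.input_proj(clip_feats))      # shared video context
        event_scores = self.event_head(memory)                  # (B, T, max_events)
        L = caption_tokens.size(1)
        causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        dec_out = self.decoder(self.word_emb(caption_tokens), memory, tgt_mask=causal_mask)
        return event_scores, self.word_head(dec_out)            # word logits per position


# Quick shape check with random inputs.
model = LightweightDenseCaptioner()
scores, logits = model(torch.randn(2, 30, 1024), torch.randint(0, 10000, (2, 12)))
print(scores.shape, logits.shape)  # torch.Size([2, 30, 10]) torch.Size([2, 12, 10000])
```

The shape check only shows the data flow; the thesis's actual event-proposal mechanism, the joint decision between proposals and captions, and the environment-aware detection branch are not reconstructed here.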