
Author: 黃旭輝 (Syu-Huei Huang)
Thesis Title: Sequence-aware Learnable Sparse Mask for Multi-perspective Frame-selectable End-to-End Dense Video Captioning for IoT Smart Cameras
Advisor: 陸敬互 (Ching-Hu Lu)
Committee Members: 蘇順豐 (Shun-Feng Su), 鍾聖倫 (Sheng-Luen Chung), 黃正民 (Cheng-Ming Huang), 李俊賢 (Jin-Shyan Lee)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 111 (2022–2023)
Language: Chinese
Number of Pages: 108
Keywords (Chinese): End-to-End Dense Video Captioning, Learnable Sparse Attention, Parallel Computing, Lightweight Neural Networks, Video Frame Selection, Edge Computing, Internet of Things
Keywords (English): End-to-End Dense Video Captioning, Learnable Attention, Frame Selection
    In recent years, Artificial Intelligence of Things (AIoT) technology has been widely adopted in various smart systems, accelerating the development of edge computing. TikTok introduced automatic video caption generation in 2021, and as computing resources keep growing, deploying video captioning systems directly on increasingly capable AIoT cameras (hereafter referred to as edge cameras) is within reach. However, existing studies have not further examined the consecutive video frames used for video captioning, so captioning systems are affected by redundant frames and generate incorrect descriptions. To overcome this problem, we propose a lightweight frame selection model built on our attention-based lightweight residual gated network, which reaches the desired accuracy at a small computational cost and removes redundant frames from a video to retain representative key frames. In addition, existing end-to-end dense video captioning studies adopt deformable attention and self-attention, which cannot attend to global information and easily attend to unimportant information. This study therefore proposes an end-to-end dense video captioning model based on a sequence-aware learnable sparse mask, which learns to attend to the important information in a video while ignoring the unimportant information, improving caption quality; it is then integrated with the aforementioned lightweight frame selection network to further improve caption quality. Experiments confirm that, compared with the latest study, the sequence-aware end-to-end dense video captioning network improves BLEU3 by 8.69%, BLEU4 by 12.62%, METEOR by 4.27%, and CIDEr by 22.50%. Compared with the latest study, the lightweight frame selection model reduces model parameters by 56.90%, FLOPs by 69.24%, memory usage by 2.50%, inference time by 1.7% on a cloud server and by 11.28% on an edge device (NVIDIA Jetson TX2), and power consumption by 8.33%, making it lighter and faster than existing work. In summary, the proposed end-to-end dense video captioning model with a sequence-aware learnable sparse mask is more accurate than previous studies, the lightweight frame selection network is more efficient, and the captions generated after frame selection are more correct.
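
    To make the frame selection idea above concrete, the following is a minimal sketch, assuming a PyTorch setting, of scoring frames with a small gated network and keeping only the top-scoring ones. It is not the thesis implementation; names such as FrameScorer, select_frames, and keep_ratio are illustrative assumptions.

    # Minimal sketch of top-k frame selection with a small gated scorer (assumed, not the thesis code).
    import torch
    import torch.nn as nn

    class FrameScorer(nn.Module):
        """Assigns an importance score in (0, 1) to each frame feature."""
        def __init__(self, feat_dim: int, hidden_dim: int = 128):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Linear(feat_dim, hidden_dim),
                nn.ReLU(inplace=True),
                nn.Linear(hidden_dim, 1),
                nn.Sigmoid(),
            )

        def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
            # frame_feats: (batch, num_frames, feat_dim) -> scores: (batch, num_frames)
            return self.gate(frame_feats).squeeze(-1)

    def select_frames(frame_feats: torch.Tensor, scorer: FrameScorer, keep_ratio: float = 0.5):
        """Keep the highest-scoring frames (in temporal order) and drop the rest as redundant."""
        scores = scorer(frame_feats)                                # (B, T)
        k = max(1, int(frame_feats.size(1) * keep_ratio))
        top_idx = scores.topk(k, dim=1).indices.sort(dim=1).values  # preserve temporal order
        batch_idx = torch.arange(frame_feats.size(0)).unsqueeze(-1)
        return frame_feats[batch_idx, top_idx], top_idx

    # Example: 2 clips, 32 frames each, 512-d features per frame; keep 25% of the frames.
    feats = torch.randn(2, 32, 512)
    scorer = FrameScorer(feat_dim=512)
    kept, idx = select_frames(feats, scorer, keep_ratio=0.25)
    print(kept.shape)  # torch.Size([2, 8, 512])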


    In recent years, Artificial Intelligence of Things (AIoT) technology has been widely applied in various smart systems, accelerating the development of edge computing. In 2021, TikTok introduced an automatic video captioning feature, and with the continuous growth of computing resources, directly deploying video captioning systems on increasingly powerful AIoT cameras, referred to as edge cameras, is within reach. However, existing studies have not extensively explored the consecutive video frames used for video captioning, so captioning systems are affected by redundant frames and generate incorrect captions. To overcome this issue, we propose a lightweight frame selection model that uses our attention-enhanced lightweight residual gated network to achieve the desired accuracy at a small computational cost and removes redundant frames to retain representative key frames. In addition, because existing end-to-end dense video captioning studies rely on deformable attention and self-attention, they cannot focus on global information and tend to attend to irrelevant details; our study therefore proposes an end-to-end dense video captioning model with a sequence-aware learnable sparse mask that focuses on essential information in the video while ignoring irrelevant details, thereby improving caption quality. The experiments show that the end-to-end dense video captioning network with the sequence-aware learnable sparse mask outperforms the latest study by 8.69% in BLEU3, 2.62% in BLEU4, 4.27% in METEOR, and 22.50% in CIDEr. Moreover, the lightweight frame selection model reduces model parameters by 56.90%, FLOPs by 69.24%, and power consumption by 8.33% compared to the latest study. In summary, the proposed end-to-end dense video captioning model with a sequence-aware learnable sparse mask achieves higher accuracy than previous studies, and the lightweight frame selection network is more efficient and produces more accurate captions after frame selection.
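
    The learnable sparse mask can be illustrated with the following minimal sketch, which gates standard attention weights with a learnable per-position mask and adds a sparsity penalty to the training objective. This is an assumed simplification rather than the sequence-aware mask or the Luna-based encoder described in the thesis; names such as SparseMaskedAttention are hypothetical.

    # Minimal sketch of attention gated by a learnable sparse mask (assumed, not the thesis code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMaskedAttention(nn.Module):
        def __init__(self, dim: int, seq_len: int):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)
            # One learnable logit per (query, key) position pair.
            self.mask_logits = nn.Parameter(torch.zeros(seq_len, seq_len))

        def forward(self, x: torch.Tensor):
            # x: (batch, seq_len, dim)
            q, k, v = self.q(x), self.k(x), self.v(x)
            scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5          # (B, T, T)
            attn = F.softmax(scores, dim=-1)
            mask = torch.sigmoid(self.mask_logits)                        # soft mask in (0, 1)
            attn = attn * mask                                            # suppress unimportant pairs
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-6)  # renormalize rows
            out = attn @ v
            # Penalty pushing mask entries toward zero (mask is non-negative, so mean acts as L1).
            sparsity_loss = mask.mean()
            return out, sparsity_loss

    # Example: batch of 2 videos, 16 frame tokens, 256-d features.
    x = torch.randn(2, 16, 256)
    attn_layer = SparseMaskedAttention(dim=256, seq_len=16)
    out, l_sparse = attn_layer(x)
    print(out.shape, l_sparse.item())  # torch.Size([2, 16, 256]) ...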

    Abstract (Chinese)
    Abstract (English)
    Acknowledgments
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1  Research Motivation
      1.2  Literature Review
        1.2.1  The issue of "easily attending to unimportant information"
          A. Single-sentence video captioning
          B. Dense video captioning
            i.   Component-based dense video captioning
            ii.  End-to-end dense video captioning based on recurrent neural networks
            iii. End-to-end dense video captioning based on Transformers
              ● Dense attention mechanisms
              ● Sparse attention mechanisms
        1.2.2  The issue of "redundant visual information from multi-camera image fusion"
          A. Frame selection networks based on reinforcement learning
          B. Frame selection networks based on supervised learning
            i.   Top-down approaches
            ii.  Bottom-up approaches
      1.3  Contributions and Thesis Organization
    Chapter 2  System Architecture Overview
      2.1  System Architecture
      2.2  Example Application Scenario
    Chapter 3  Lightweight Frame Selection Model Based on Multi-image Fusion
      3.1  Multi-image Fusion
      3.2  Frame Selection Model
        3.2.1  Video Feature Processing
        3.2.2  Graph Attention Network Block
        3.2.3  Frame Selection Strategy
        3.2.4  Gated Network
      3.3  Attention-based Lightweight Residual Gated Network
      3.4  Loss Function
    Chapter 4  End-to-End Dense Video Captioning Model with a Sequence-aware Learnable Sparse Mask
      4.1  Sequence-aware Learnable Sparse Mask
      4.2  Video Feature Processing
      4.3  Feature Encoder
        4.3.1  Positional Encoding
        4.3.2  Multi-head Linear Unified Nested Attention with the Sequence-aware Learnable Sparse Mask
        4.3.3  Feed-forward Network
      4.4  Parallel Decoding Network
        4.4.1  Decoder
        4.4.2  Event Localization Head
        4.4.3  Captioning Head
        4.4.4  Event Counter
      4.5  Loss Function
    Chapter 5  Experimental Results and Discussion
      5.1  Experimental Platform
      5.2  Datasets and Evaluation Metrics
        5.2.1  Datasets
        5.2.2  Evaluation Metrics
      5.3  End-to-End Dense Video Captioning Model with a Sequence-aware Learnable Sparse Mask
        5.3.1  Sequence-aware Learnable Sparse Mask Experiments
        5.3.2  Sparsity Loss Function Experiments
      5.4  Lightweight Frame Selection Model Based on Multi-image Fusion
        5.4.1  Ablation and Parameter-count Experiments
        5.4.2  Memory Usage Experiments
        5.4.3  Testing Time Experiments
        5.4.4  Power Consumption Experiments
      5.5  Comparison with Related Work
        5.5.1  Comparison of the Sequence-aware Sparse Mask Dense Video Captioning Model with Related Work
        5.5.2  Comparison of the Lightweight Frame Selection Model with Related Work
      5.6  Demonstration Applications
        5.6.1  Dense Video Captioning Demonstration
        5.6.2  System Integration Demonstration
    Chapter 6  Conclusions and Future Research Directions
    References

