
Graduate Student: 黃翎軒 (Ling-Hsuan Huang)
Thesis Title: 具備多視角的邊緣攝影機之平均稀疏注意力稠密影片字幕描述
Average Sparse Attention for Dense Video Captioning from Multi-perspective IoT Smart Cameras
Advisor: 陸敬互 (Ching-Hu Lu)
Committee Members: 蘇順豐 (Shun-Feng Su), 鍾聖倫 (Sheng-Luen Chung), 廖峻鋒 (Chun-Feng Liao), 黃正民 (Cheng-Min Huang)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2022
Graduation Academic Year: 110
Language: Chinese
Number of Pages: 112
Keywords (Chinese): 稠密影片字幕生成、稀疏注意力、輕量化神經網路、影像拼接、邊緣運算、物聯網
Keywords (English): dense video captioning, sparse attention, lightweight neural networks, image stitching, edge computing, Internet of Things
Chinese Abstract:
In recent years, the Artificial Intelligence of Things (AIoT) has given rise to a variety of smart systems and accelerated the development of edge computing; Google, for example, released a real-time captioning feature in 2019. In the near future, video captioning systems will be deployed directly on AIoT cameras (hereafter called edge cameras), whose computing resources continue to grow. However, because existing studies have not further explored edge-friendly multi-camera decision models, video captioning systems remain restricted to a single camera's viewing angle and cannot obtain a richer field of view. To overcome this limitation and to design a hardware-friendly model, we propose a lightweight image stitching model built mainly on our proposed inverted pruned residual SE-Net (Squeeze-and-Excitation Network) module, which reaches the desired accuracy at a small computational cost; the model stitches images from multiple cameras into a single global image. In addition, existing dense video captioning models all adopt dense attention mechanisms, which easily lose important information. This study therefore proposes a dense video captioning model with average sparse attention, which reduces the complexity of video feature information so that the model attends more to important information and ignores unimportant information, thereby improving caption quality; it is finally integrated with the aforementioned lightweight image stitching network to further improve the quality of the generated captions. Our experiments show that, compared with the latest related study, the proposed dense video captioning network with average sparse attention improves BLEU3 by 22.97%, BLEU4 by 35.04%, and METEOR by 7.51% with fixed random seeds, and improves BLEU3 by 20.14%, BLEU4 by 32.12%, and METEOR by 6.37% with non-fixed random seeds. The lightweight image stitching model reduces the total number of parameters by 13.40%, increases FPS by 25.91% on a cloud server and by 28.96% on an edge device (NVIDIA Jetson TX2 platform), and reduces energy consumption by 18.31%, making it lighter and faster at inference than existing work. In summary, the proposed dense video captioning network with average sparse attention is more accurate than previous studies, the lightweight image stitching model is more efficient, and the captions generated from the stitched multi-camera images are more complete.
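To illustrate the kind of building block described above, the following is a minimal PyTorch sketch of an inverted residual block gated by a Squeeze-and-Excitation (SE) module. It reflects our own assumption of the general structure implied by "inverted pruned residual SE-Net"; the thesis's actual layer sizes, pruning scheme, and wiring are not reproduced, and names such as `InvertedResidualSE` and the `expand`/`reduction` parameters are hypothetical.

```python
# Minimal sketch (assumed structure, not the thesis's exact module):
# an inverted residual block (1x1 expand -> 3x3 depthwise -> 1x1 project)
# whose expanded features are reweighted by a Squeeze-and-Excitation gate.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # excitation: per-channel reweighting

class InvertedResidualSE(nn.Module):
    """Hypothetical lightweight block: expand, depthwise conv, SE gate, project."""
    def __init__(self, channels, expand=4):
        super().__init__()
        hidden = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise conv
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            SEBlock(hidden),
            nn.Conv2d(hidden, channels, 1, bias=False), nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)  # residual connection keeps the input resolution

if __name__ == "__main__":
    x = torch.randn(1, 32, 64, 64)
    print(InvertedResidualSE(32)(x).shape)  # torch.Size([1, 32, 64, 64])
```

In a block like this, the depthwise convolution and the narrow 1x1 projection are what keep the parameter count low; the exact savings in the thesis additionally depend on its pruning strategy, which this sketch omits.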


Abstract:
In recent years, the Artificial Intelligence of Things (AIoT) has accelerated the development of edge computing. Since existing studies have not further explored on-edge multi-perspective camera decision models, dense video captioning is limited to a single camera's viewing angle and cannot obtain a richer perspective. To overcome this limitation and to design a hardware-friendly model, we propose a lightweight image stitching model, which mainly adopts our proposed inverted pruned residual SE-Net (Squeeze-and-Excitation Network) module to achieve the desired accuracy at a small computational cost. In addition, existing dense video captioning models adopt dense attention mechanisms, which easily lose important information; our study therefore proposes a dense video captioning model with average sparse attention, which reduces the complexity of video feature information so that the model can focus more on important information and ignore unimportant information, improving the quality of the generated captions. The experiments show that the dense video captioning network with average sparse attention improves BLEU3 by 22.97%, BLEU4 by 35.04%, and METEOR by 7.51% when random seeds are fixed, and improves BLEU3 by 20.14%, BLEU4 by 32.12%, and METEOR by 6.37% when random seeds are not fixed. Compared with the latest study, the lightweight image stitching model also has a 13.40% lower total number of parameters, 25.91% higher FPS on a cloud server, 28.96% higher FPS on an edge device (NVIDIA Jetson TX2 platform), and 18.31% lower energy consumption. In summary, it can be confirmed that the proposed dense video captioning network with average sparse attention is more accurate than previous studies, the lightweight image stitching model is more efficient, and the captions generated after stitching with the lightweight model are more complete.
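For a concrete picture of "average sparse attention", the following is a minimal PyTorch sketch under our own assumption: attention scores below each query's average score are masked out before the softmax, so only above-average (more important) keys receive weight. The function name `average_sparse_attention` is hypothetical, and the thesis's actual formulation may differ.

```python
# Minimal sketch (assumed formulation) of average-thresholded sparse attention:
# keys whose scaled dot-product score falls below the per-query mean score are
# masked out, so the softmax distributes weight only over above-average keys.
import torch
import torch.nn.functional as F

def average_sparse_attention(q, k, v):
    """q, k, v: tensors of shape (batch, heads, seq_len, head_dim)."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5      # (B, H, Lq, Lk)
    row_mean = scores.mean(dim=-1, keepdim=True)     # per-query average score
    scores = scores.masked_fill(scores < row_mean, float("-inf"))
    weights = F.softmax(scores, dim=-1)              # masked keys get zero weight
    return weights @ v

if __name__ == "__main__":
    q, k, v = (torch.randn(1, 2, 8, 16) for _ in range(3))
    print(average_sparse_attention(q, k, v).shape)   # torch.Size([1, 2, 8, 16])
```

Dropping below-average scores always keeps at least the best-matching key for each query (the maximum score is never below the mean), so the softmax remains well defined while the attention distribution becomes sparser.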

Table of Contents:
Chinese Abstract I
Abstract II
Acknowledgements III
Table of Contents IV
List of Figures VII
List of Tables X
Chapter 1 Introduction 1
 1.1 Research Motivation 1
 1.2 Literature Review 3
  1.2.1 The issue of "no multi-camera-based dense caption generation yet" 3
   ● Image stitching 3
  1.2.2 The issue of "important information is easily lost" 8
   ● Captioning of short videos 8
   ● Dense video captioning 8
   ● Dense attention mechanisms 13
   ● Sparse attention mechanisms 15
 1.3 Contributions and Thesis Organization 17
Chapter 2 System Design Philosophy and Architecture Overview 18
 2.1 System Architecture 18
 2.2 Example Application Scenario 19
Chapter 3 Multi-camera Image Stitching Model Based on a Lightweight Design 21
 3.1 Multi-camera Image Stitching Model 21
  3.1.1 Homography Detector 22
  3.1.2 Stitching Generator 24
 3.2 Lightweight Image Stitching Model 27
 3.3 Loss Function 31
Chapter 4 Dense Video Captioning Model with Average Sparse Attention 33
 4.1 Transformer Model with Average Sparse Attention 34
  4.1.1 Positional Encoding 35
  4.1.2 Multi-head Attention Mechanism 35
  4.1.3 Feed-forward Network 37
  4.1.4 Encoder and Decoder 39
  4.1.5 Average Sparse Attention Mechanism 40
 4.2 Video Feature Processing 41
 4.3 Event Proposal Detector 42
 4.4 Event Attention Module 45
 4.5 Caption Generation 49
 4.6 Loss Function 56
Chapter 5 Experimental Results and Discussion 57
 5.1 Experimental Platform 57
 5.2 Datasets and Evaluation Metrics 57
  5.2.1 Datasets 58
  5.2.2 Evaluation Metrics 59
 5.3 Dense Video Captioning Model with Sparse Attention 62
  5.3.1 Sparse Attention Mechanism 64
  5.3.2 Average Sparse Attention Mechanism 65
  5.3.3 Experimental Hypotheses 67
 5.4 Lightweight Multi-camera Image Stitching Model 70
 5.5 Comparison with Related Studies 72
 5.6 Demonstration Applications 75
  5.6.1 Dense Video Captioning Demonstration 76
  5.6.2 Multi-camera Image Stitching Demonstration 76
  5.6.3 System Integration Demonstration 81
Chapter 6 Conclusions and Future Research Directions 86
References 88
Committee Comments and Responses 92

