
Graduate student: 羅宥鈞 (You-Jyun Lo)
Thesis title: 利用窗口自注意力機制及多層級校準於多光譜物件偵測
Cross-modality fusion using shifted window self-attention and multi-level alignment for multispectral object detection
Advisor: 陳永耀 (Yung-Yao Chen)
Committee members: 林敬舜 (Ching-Shun Lin), 林淵翔 (Yuan-Hsiang Lin), 花凱龍 (Kai-Lung Hua)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2023
Graduation academic year: 111
Language: Chinese
Number of pages: 50
Chinese keywords: 窗口自注意力、多層級校準、多光譜物件偵測、物件偵測
外文關鍵詞: Cross-modality fusion, shifted window, multi-level alignment, multispectral object detection

Technology in the field of autonomous driving advances day by day. Applying multiple sensors and deep learning to vehicles has solved many problems that autonomous driving systems encountered in the past, and research on these technologies has greatly improved traffic safety and efficiency while bringing more convenient driving services to the public. This thesis therefore explores multi-sensor fusion in depth. Multispectral imagery (e.g., visible-light and thermal images) is among the most commonly used and effective inputs for training all-weather (day and night outdoor scene) object detection models. Conventional detection models train on image features from different modalities; without a good feature-screening mechanism and without handling the feature deviation between modalities, the training results are often negatively affected. How to effectively fuse and apply thermal-image and visible-image features in an object detection model is thus the focus of this thesis. To strengthen the image features of different spectra and mitigate the disadvantage of a single modality being degraded by environmental factors under different weather conditions, we introduce a self-attention mechanism and a modality feature-offset correction technique to improve multispectral image fusion. We propose a new neural-network feature-fusion method that uses an improved self-attention module to reinforce multispectral object features and adds the enhanced features to the network architecture for training. Our method also strengthens the weighting and training of different spectral features, and corrects the feature deviation between modalities before the self-attention stage so that fusion is less affected by dual-image feature deviation. In addition, an illumination-awareness mechanism and cross-modality region-of-interest deviation correction are added at the prediction stage, strengthening the model's adaptive prediction under different weather conditions. We implement the main architectural ideas on YOLOv5. First, our Cross-modal Shifted Window Based Transformer Module enhances dual-image feature extraction during training, significantly improving performance while effectively reducing the number of model parameters, which greatly shortens dual-image training time and reduces model size. Second, the multi-level Modality Feature Regulate module corrects feature offsets across weather conditions and between spectra of the same scene, strengthening dual-image detection decisions in both the training and inference stages. The results of this thesis are evaluated on two datasets, KAIST and the NTUST multispectral real road-scene dataset, and are compared with existing multispectral object detection models.
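The abstract above states that the feature deviation between modalities is corrected before attention-based fusion, but does not specify the correction. Purely as an illustration, a minimal stand-in is first-and-second-moment matching, shifting a thermal feature vector onto the RGB branch's statistics (the function name and moment-matching choice are assumptions, not the thesis's actual module):

```python
import statistics

def align_modality(feat_t, feat_rgb):
    """Match the mean/std of a thermal feature vector to the RGB branch.

    Hypothetical sketch of cross-modality deviation correction:
    standardize the thermal features, then rescale them to the RGB
    branch's mean and standard deviation before fusion.
    """
    mu_t, mu_r = statistics.fmean(feat_t), statistics.fmean(feat_rgb)
    sd_t = statistics.pstdev(feat_t) or 1.0  # guard against constant inputs
    sd_r = statistics.pstdev(feat_rgb)
    return [(x - mu_t) / sd_t * sd_r + mu_r for x in feat_t]
```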


The technology of autonomous driving advances day by day. Applying multiple sensors and deep learning to vehicles has solved many problems that automatic driving systems encountered in the past, and research on related technologies has greatly improved traffic safety and efficiency while bringing more convenient driving services to the public.
This thesis therefore explores multi-sensor fusion in depth. Multispectral imagery (e.g., visible-light and thermal images) is among the most commonly used and effective inputs for training all-weather (day and night outdoor scene) object detection models. In the past, common detection models trained on image features from different modalities; without a good feature-screening mechanism and without handling the feature deviation between modalities, the training results were often negatively affected. How to effectively integrate and apply thermal-image and visible-image features in an object detection model is the focus of this thesis. To improve the image features of different spectra and mitigate the disadvantage of a single image being degraded by environmental factors under different weather conditions, we introduce a self-attention mechanism and a modality feature-offset correction technique to improve multispectral image fusion. We propose a new neural-network feature-fusion method that uses an improved self-attention module to enhance multispectral object features and adds the enhanced features to the network architecture for training. Our method also strengthens the weighting and training of different spectral features, and corrects the feature deviation between modalities before the self-attention stage so that fusion is less affected by dual-image feature deviation. In addition, an illumination-awareness mechanism and cross-modality region-of-interest deviation correction are added at the prediction stage to strengthen the model's adaptive prediction under different weather conditions. We implemented the main new architectural ideas on YOLOv5.
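The illumination-awareness mechanism mentioned above is only named, not specified, in the abstract. One simple way such a mechanism is often realized is a brightness-driven sigmoid gate that blends the visible and thermal branches; the sketch below assumes that form (function names, gate shape, and parameters `k`, `t` are all hypothetical):

```python
import math

def illumination_weight(brightness: float, k: float = 10.0, t: float = 0.5) -> float:
    """Sigmoid gate mapping mean visible-image brightness in [0, 1] to an RGB weight.

    Bright daytime scenes lean on the visible branch; dark scenes lean
    on the thermal branch. (Assumed gate; the abstract only states that
    an illumination mechanism is used.)
    """
    return 1.0 / (1.0 + math.exp(-k * (brightness - t)))

def fuse_confidence(rgb_conf: float, thermal_conf: float, brightness: float) -> float:
    """Illumination-weighted blend of per-branch detection confidences."""
    w = illumination_weight(brightness)
    return w * rgb_conf + (1.0 - w) * thermal_conf
```

At `brightness = 0.5` the two branches contribute equally; as brightness moves toward 0 or 1 the fused score follows the thermal or visible branch respectively.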
First, we designed a Cross-modal Shifted Window Based Transformer Module to enhance dual-image feature extraction in the training phase; it significantly improves performance while effectively reducing the number of model parameters, which greatly shortens dual-image training time and reduces model size. In addition, the multi-level Modality Feature Regulate module corrects feature offsets across weather conditions and between spectra of the same scene, strengthening the decision-making ability for dual-image object detection in both the training and inference phases. The results of this thesis are evaluated on two datasets, KAIST and the NTUST multispectral real road-scene dataset, and are compared with existing multispectral object detection models.
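The shifted-window idea behind the module above (as in Swin Transformer) restricts self-attention to non-overlapping windows and cyclically shifts the grid in alternating layers so information flows between windows. A minimal sketch of just the partitioning step, on an `h x w` grid of spatial locations (the helper name is hypothetical; `h`, `w` must be multiples of `win`):

```python
def window_partition(h, w, win, shift=0):
    """Assign each location of an h x w feature map to a window id.

    With shift > 0 the grid is cyclically shifted before partitioning,
    so a location pair separated by a window boundary in one layer can
    share a window (and thus attend to each other) in the next layer.
    Returns an h x w grid of window indices.
    """
    grid = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            si = (i + shift) % h  # cyclic shift along height
            sj = (j + shift) % w  # cyclic shift along width
            grid[i][j] = (si // win) * (w // win) + (sj // win)
    return grid
```

For a 4x4 map with 2x2 windows, the corners (0,0) and (3,3) fall in different windows with `shift=0` but share a window after a cyclic shift of 1, which is what lets successive attention layers mix features across window boundaries.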

Table of contents:
Advisor recommendation letter
Committee approval letter
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of contents
List of figures
List of tables
Chapter 1: Introduction
  1.1 Preface
  1.2 Research motivation
  1.3 Contributions
Chapter 2: Related work
  2.1 Object detection techniques
    2.1.1 One-stage object detection methods
    2.1.2 Two-stage object detection methods
  2.2 Multispectral pedestrian detection techniques
  2.3 Attention-based models
  2.4 Vision Transformer
Chapter 3: Method
  3.1 Cross-modal Shifted Window Based Transformer Module
    3.1.1 CMSW-Yolov5x backbone
    3.1.2 Shifted window based self-attention block
  3.2 Multi-level modal alignment
  3.3 Loss function and multi-head sensor
Chapter 4: Experiments
  4.1 Experimental environment
  4.2 Datasets
    4.2.1 KAIST multispectral image dataset
    4.2.2 NTUST multispectral road image dataset
  4.3 Experimental methods
    4.3.1 Evaluation on the KAIST dataset
    4.3.2 Evaluation on the NTUST dataset
  4.4 Ablation study
  4.5 Experimental results
Chapter 5: Conclusion and future work
References


Full-text release date: 2028/01/11 (campus network)
Full-text release date: 2028/01/11 (off-campus network)
Full-text release date: 2028/01/11 (National Central Library: Taiwan NDLTD system)