
Student: 游淙舜 (Cong-Shun You)
Thesis title: The study of unsupervised domain adaptation for object detection using spatial attention pyramid networks (空間注意力金字塔網路用於無監督領域自適應物件偵測之研究)
Advisor: 陳郁堂 (Yie-Tarng Chen)
Oral defense committee: 陳郁堂 (Yie-Tarng Chen), 林銘波 (Ming-Bo Lin), 陳省隆 (Hsing-Lung Chen), 林昌鴻 (Chang-Hong Lin), 方文賢 (Wen-Hsien Fang)
Degree: Master
Department: Department of Electronic and Computer Engineering (電資學院 電子工程系)
Year of publication: 2022
Academic year of graduation: 111
Language: English
Number of pages: 74
Keywords (Chinese): object detection, convolutional neural network, domain discriminator, unsupervised learning, domain adaptation
Keywords (English): object detection, domain adaptation, CNN, discriminator, unsupervised learning
    In recent years, the most representative object detection methods have been Faster R-CNN and YOLO. These conventional supervised approaches typically rely on fully annotated datasets and assume that training and test data are drawn from the same distribution. When the test data come from a different distribution, the performance of a supervised detector is unsatisfactory. To achieve accurate object detection in diverse scenes, the detector must adapt to the domain shift, or domain gap, between training and test data. Unsupervised domain adaptation is one way to bridge this gap: it pulls the feature distributions of the source and target domains closer together so that a domain classifier cannot tell which domain a feature comes from. In this thesis, the baseline method is SAPNet, an attention-based domain classifier that aligns features only at the global level. Global features from the backbone and task-specific features are fused by channel concatenation; the global features carry image-level information, while the task-specific features come from the region proposal network and represent anchor confidences. To make the domain classifier more accurate, we introduce the multi-scale channel attention module (MS-CAM), which has two paths that process global-scale and local-scale information simultaneously, generating an attention weight for every spatial position of the fused feature and strengthening regions rich in domain-related information. To make the detector more robust on the target domain, we integrate the MEDM loss into the region proposal network. It consists of two terms: entropy minimization regularizes the class distribution on the target domain toward a peaky distribution, and diversity maximization prevents over-regularization. Experimental results show that the proposed domain classifier can bridge the domain gap between the source and target domains, and its performance on datasets covering different weather conditions, virtual-to-real transfer, and different camera parameters is among the state of the art.
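    To make the two-path attention concrete, below is a minimal PyTorch sketch of an MS-CAM-style module as described above: a local path of point-wise convolutions that keeps spatial resolution, and a global path that first applies global average pooling, with the two outputs summed and passed through a sigmoid to re-weight the fused feature. The class name, default channel count, and reduction ratio r are illustrative assumptions, not values taken from the thesis.

import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention (a sketch of the two-path design described
    above; `channels` and reduction ratio `r` are illustrative choices)."""
    def __init__(self, channels: int = 256, r: int = 4):
        super().__init__()
        mid = channels // r
        # Local path: point-wise convolutions keep the spatial resolution,
        # so the attention reacts to local context at every position.
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        # Global path: global average pooling squeezes the map to 1x1,
        # so the attention also sees image-level (global) context.
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # `fused` stands for the channel-concatenated backbone/RPN feature.
        w = self.sigmoid(self.local_att(fused) + self.global_att(fused))
        return fused * w  # re-weight every spatial position and channel

    In the architecture described above, such a module would sit between the concatenated backbone/RPN features and the domain classifier, so that the classifier sees features re-weighted toward domain-informative regions.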


    In this work, we study unsupervised domain adaptation for object detection. The most representative object detection methods, such as Faster R-CNN and YOLO, assume that training and test data follow an identical distribution. Consequently, these detectors suffer performance degradation when the training and test data have different distributions. To address this issue, unsupervised domain adaptation is investigated to adapt to the domain shift between a labeled source domain and an unlabeled target domain. Specifically, SAPNet uses an attention-based domain classifier that aligns features at the global level. The backbone feature and task-specific information are fused by channel concatenation, where the backbone feature carries image-level information and the task-specific information comes from the region proposal network in the form of anchor confidences. In this thesis, we enhance SAPNet with a more accurate domain classifier. First, we introduce the multi-scale channel attention module (MS-CAM), which has two paths that process information at the global and local scales simultaneously to generate attention weights highlighting regions of the fused feature with abundant domain information. Second, to make the detector more robust on the target domain, we integrate the MEDM loss into the region proposal network. It consists of two terms: entropy minimization regularizes the class distribution on the target domain toward a peaky distribution, and diversity maximization prevents over-regularization. As a result, the proposed domain classifier can bridge the domain gap between the source and target domains; experiments on different-weather, virtual-to-real, and cross-camera adaptation showcase the effectiveness of the proposed method.
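    The MEDM term can be sketched in a few lines; the version below is one common formulation consistent with the description here (the per-sample prediction entropy is minimized while the entropy of the batch-averaged prediction is maximized). The function name medm_loss, the balancing weight lam, and the small epsilon added for numerical stability are assumptions for the example, not values reported in the thesis.

import torch
import torch.nn.functional as F

def medm_loss(logits: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Minimal-entropy / diversity-maximization loss on unlabeled target
    predictions (a hedged sketch; `logits` has shape [batch, num_classes])."""
    eps = 1e-8
    p = F.softmax(logits, dim=1)                      # per-sample class distribution
    # Entropy minimization: push each prediction toward a peaky distribution.
    ent = -(p * torch.log(p + eps)).sum(dim=1).mean()
    # Diversity maximization: keep the batch-averaged prediction spread out,
    # i.e. maximize the entropy of the mean distribution (subtracted below).
    p_mean = p.mean(dim=0)
    div = -(p_mean * torch.log(p_mean + eps)).sum()
    return ent - lam * div

    In such a setup this term would be computed on unlabeled target data and added to the usual supervised detection loss on the source domain; the subtracted diversity term is what the abstract refers to as preventing over-regularization, since it keeps the batch-averaged prediction from collapsing onto a single class.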

    摘要 (Chinese abstract)
    Abstract
    Acknowledgment
    Table of contents
    List of Figures
    List of Tables
    List of Acronyms
    1 Introduction
    2 Related Work
      2.1 Object Detection
      2.2 Faster RCNN
      2.3 Adversarial Based Unsupervised Domain Adaptation
      2.4 Domain Adaptation for Object Detection
      2.5 Spatial Attention Pyramid Network
      2.6 Image-to-image Translation
    3 Proposed Method
      3.1 Proposed Architecture
      3.2 Multi-scale Channel Attention Module
      3.3 Entropy Based Domain Adaptation
        3.3.1 Entropy Minimization Only
        3.3.2 Minimal-entropy Diversity Maximization
      3.4 Summary
    4 Experimental Results and Discussion
      4.1 Datasets
      4.2 Evaluation Metric
      4.3 Experimental Setup
      4.4 Experimental Result
        4.4.1 Weather Adaptation
        4.4.2 Virtual to Real Adaptation
        4.4.3 Cross Camera Adaptation
      4.5 Ablation Study
        4.5.1 Ablation on Weather Adaptation
        4.5.2 Ablation on Virtual to Real Adaptation
      4.6 Failure Cases and Error Analysis
        4.6.1 Different Class, Similar Appearance
        4.6.2 Class Definition Difference between Source and Target Domain
      4.7 Summary
    5 Conclusion and Future Works
      5.1 Conclusion
      5.2 Future Works
    References


    Full text release date: 2024/12/09 (campus network)
    Full text release date: 2024/12/09 (off-campus network)
    Full text release date: 2024/12/09 (National Central Library: Taiwan NDLTD system)