
Graduate Student: Chen-Hung Chung (鍾鎮鴻)
Thesis Title: CLODA: Cross Language Image Matching Based on Out-of-Distribution Data and Convolutional Block Attention Module for Weakly Supervised Semantic Segmentation
Advisor: Jing-Ming Guo (郭景明)
Committee Members: 陸敬互, 高文忠, 張傳育, 宋啟嘉
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2023
Graduation Academic Year: 111
Language: Chinese
Pages: 80
Chinese Keywords: 深度學習, 弱監督語義分割, 類激活圖, 偽標籤, 語言圖像
English Keywords: Deep Learning, Weakly Supervised Semantic Segmentation, Class Activation Maps, Pseudo Mask, Language-Image
Hits: 265; Downloads: 0
    Fully supervised semantic segmentation requires detailed annotation of every pixel, and point-by-point pixel-level labeling is extremely time-consuming and laborious. To solve this problem, this thesis pursues semantic segmentation using image-level classification labels: only one overall class label needs to be provided for the whole image, with no detailed per-pixel annotation, which reduces the labeling workload and lowers the labor cost.
    Existing methods that use image-level labels usually take class activation maps (CAMs) as the first step to locate target objects: by training a classifier, the presence of objects in an image can be searched effectively. However, CAMs suffer from (1) over-focusing on specific object regions, capturing only the most salient and critical areas, and (2) readily misinterpreting frequently occurring background regions, so that foreground and background are confused.
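The CAM localization step described above can be sketched as a weighted sum of the last convolutional feature maps by the classifier weights of the target class, following Zhou et al. [19]. This is a minimal NumPy illustration with toy shapes, not the thesis's implementation:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Compute a class activation map.

    features:   (C, H, W) feature maps from the last convolutional layer
    fc_weights: (num_classes, C) weights of the final linear classifier
    class_idx:  index of the target class
    """
    # Weighted sum of the feature maps by the class's classifier weights.
    cam = np.tensordot(fc_weights[class_idx], features, axes=1)  # (H, W)
    cam = np.maximum(cam, 0.0)         # keep positive class evidence only
    if cam.max() > 0:
        cam = cam / cam.max()          # normalize to [0, 1] for thresholding
    return cam

# Toy example: 4 feature channels, 2 classes, 8x8 spatial resolution.
feats = np.random.rand(4, 8, 8)
w = np.random.rand(2, 4)
cam = class_activation_map(feats, w, class_idx=0)
```

Thresholding such a normalized map is what yields the (incomplete, over-focused) object seeds that the rest of the framework refines.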
    This thesis introduces a dual-branch concept into the cross-language image matching framework. On the attention branch, to solve the problem that CAMs over-focus on part of an object while the remaining regions are ignored, a convolutional block attention module is added to strengthen the capture of important image features and spread the attended area outward. To solve the problem of background regions being misinterpreted as objects, the out-of-distribution branch imports out-of-distribution data, using this extra information to supervise the separation of foreground and background and help the classification network obtain foreground/background region information; combined with cross-language image matching, this supervision activates more complete object regions while suppressing background regions, improving the misinterpreted areas of focus. Across the two branches, cross pseudo supervision drives the classification network to optimize the attended regions learned by the attention branch under the out-of-distribution semantic features it has already learned.
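The convolutional block attention module used on the attention branch applies channel attention (average- and max-pooled descriptors passed through a shared MLP) followed by spatial attention, per Woo et al. [24]. The sketch below is a simplified, dependency-free NumPy illustration: the real CBAM applies a 7x7 convolution to the stacked average/max spatial maps, for which a plain sum stands in here, and the weights are random placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbam_sketch(x, w1, w2):
    """Simplified CBAM-style attention over feature maps x of shape (C, H, W).

    w1: (C//r, C) and w2: (C, C//r) are the shared-MLP weights used by
    channel attention, where r is the channel-reduction ratio.
    """
    # Channel attention: squeeze H and W with avg- and max-pooling, pass
    # both descriptors through the shared two-layer MLP, then sum and gate.
    avg_desc = x.mean(axis=(1, 2))                         # (C,)
    max_desc = x.max(axis=(1, 2))                          # (C,)
    mlp = lambda d: w2 @ np.maximum(w1 @ d, 0.0)           # ReLU hidden layer
    channel_gate = sigmoid(mlp(avg_desc) + mlp(max_desc))  # (C,)
    x = x * channel_gate[:, None, None]
    # Spatial attention: pool across channels and gate every location.
    # (CBAM proper convolves the stacked avg/max maps with a 7x7 kernel;
    # a sum is substituted to keep this sketch self-contained.)
    spatial_gate = sigmoid(x.mean(axis=0) + x.max(axis=0))  # (H, W)
    return x * spatial_gate[None, :, :]

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 6, 6))
w1 = rng.normal(size=(2, 8)) * 0.1     # reduction ratio r = 4
w2 = rng.normal(size=(8, 2)) * 0.1
out = cbam_sketch(feat, w1, w2)
```

Because both gates lie in (0, 1), the module reweights rather than replaces features, which is what lets the attended region diffuse outward instead of collapsing onto the most discriminative part.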
    For the experiments, this thesis evaluates on the public PASCAL VOC 2012 image segmentation benchmark and compares against recent methods. The results show that the pseudo masks generated by the proposed architecture reach 75.3% mean Intersection-over-Union (mIoU) on the training set of this dataset, and that a semantic segmentation network trained with these pseudo masks reaches 72.3% and 72.1% mIoU on its validation and test sets, respectively.
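For reference, the mIoU figures above average per-class intersection-over-union between predicted and ground-truth masks; a minimal sketch of the metric (the toy masks are invented for illustration):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection-over-Union between two integer label masks."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                 # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])
gt   = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])
miou = mean_iou(pred, gt, num_classes=2)  # class 0: 3/4, class 1: 4/5
```

Here class 0 scores 0.75 and class 1 scores 0.8, so the mean is 0.775.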


    Fully supervised semantic segmentation requires detailed annotation of every pixel, which is time-consuming and laborious at the pixel-by-pixel level. To solve this problem, this thesis performs the semantic segmentation task using image-level categorical annotation.
    Existing methods using image-level annotation usually take CAMs (Class Activation Maps) as the first step to locate target objects. By training a classifier, the presence of objects in an image can be searched effectively. However, CAMs (1) focus excessively on specific regions, capturing only the most prominent and critical areas, and (2) tend to misinterpret frequently occurring background regions, so that foreground and background are confused.
    This thesis introduces a dual-branch concept into the cross-language image matching framework. A convolutional block attention module is added to the attention branch to solve the problem of CAMs focusing excessively on part of an object, and out-of-distribution data are imported on the out-of-distribution branch to help the classification network correct misinterpreted regions of focus. Cross pseudo supervision between the two branches optimizes the regions of interest learned by the attention branch.
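The cross pseudo supervision idea, in which each branch's hard (argmax) pseudo-label supervises the other branch's soft prediction through a pixel-wise cross-entropy, can be sketched as follows. This is a generic NumPy illustration of the mechanism, not the thesis's exact loss:

```python
import numpy as np

def pixel_cross_entropy(probs, labels):
    """Mean cross-entropy of per-pixel class probabilities (C, H, W)
    against hard integer labels (H, W)."""
    c, h, w = probs.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    # Gather each pixel's probability for its assigned label.
    return float(-np.mean(np.log(probs[labels, rows, cols] + 1e-8)))

def cross_pseudo_supervision(p1, p2):
    """Each branch's argmax pseudo-label supervises the other branch."""
    y1 = p1.argmax(axis=0)   # hard pseudo-label from branch 1
    y2 = p2.argmax(axis=0)   # hard pseudo-label from branch 2
    return pixel_cross_entropy(p1, y2) + pixel_cross_entropy(p2, y1)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(1)
p1 = softmax(rng.normal(size=(3, 4, 4)))  # branch 1 soft predictions
p2 = softmax(rng.normal(size=(3, 4, 4)))  # branch 2 soft predictions
loss = cross_pseudo_supervision(p1, p2)
```

Because the supervision is symmetric, each branch is pulled toward regions the other branch is already confident about, which is how the out-of-distribution semantics propagate into the attention branch.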
    Experimental results show that the pseudo masks generated by the proposed network can achieve 75.3% in mIoU with the PASCAL VOC 2012 training set. The performance of the segmentation network trained with the pseudo masks is up to 72.3% and 72.1% in mIoU on the validation and testing set of PASCAL VOC 2012.

    Abstract (Chinese) 1
    Abstract 2
    Acknowledgements 3
    Table of Contents 4
    List of Figures 6
    List of Tables 9
    Chapter 1 Introduction 10
      1.1 Background 10
      1.2 Motivation and Objectives 11
      1.3 Thesis Organization 13
    Chapter 2 Literature Review 14
      2.1 Deep Learning 14
        2.1.1 Artificial Neural Network (ANN) 14
        2.1.2 Convolutional Neural Network (CNN) [1] 18
      2.2 Learning Paradigms 23
        2.2.1 Supervised Learning 23
        2.2.2 Unsupervised Learning 24
        2.2.3 Semi-Supervised Learning 26
        2.2.4 Weakly Supervised Learning 26
        2.2.5 Transfer Learning 27
      2.3 Semantic Segmentation 28
        2.3.1 Fully Convolutional Network (FCN) [12] 28
        2.3.2 U-Net [13] 29
        2.3.3 DeepLab Series 30
      2.4 Weakly Supervised Semantic Segmentation 33
        2.4.1 Pseudo-Mask Generation 34
        2.4.2 Pseudo-Mask Refinement 41
    Chapter 3 Methodology 44
      3.1 Overall Architecture 46
      3.2 Class Activation Map Generation 46
      3.3 Out-of-Distribution Branch 47
      3.4 Attention Branch 48
      3.5 Cross-Language Image Matching 49
      3.6 Loss Functions 50
        3.6.1 Cross pseudo supervision loss 50
        3.6.2 Object region and text label matching loss [20] 51
        3.6.3 Background region and text label matching loss [20] 51
        3.6.4 Co-occurring background suppression loss [20] 52
        3.6.5 Pixel-level area regularization loss [20] 52
        3.6.6 Training Objective 52
    Chapter 4 Experimental Results 53
      4.1 Datasets 53
        4.1.1 PASCAL VOC 2012 53
        4.1.2 SBD Dataset 54
        4.1.3 OoD Dataset 56
      4.2 Experimental Environment 58
      4.3 Experimental Analysis and Results 58
        4.3.1 Evaluation Metrics 58
        4.3.2 Experimental Settings 59
        4.3.3 Ablation Study 62
        4.3.4 Results 64
    Chapter 5 Conclusion and Future Work 74
    References 75

    [1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
    [2] M.-C. Popescu, V. E. Balas, L. Perescu-Popescu, and N. Mastorakis, "Multilayer perceptron and neural networks," WSEAS Transactions on Circuits and Systems, vol. 8, no. 7, pp. 579-588, 2009.
    [3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
    [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84-90, 2017.
    [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE conference on computer vision and pattern recognition, 2009: IEEE, pp. 248-255.
    [6] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
    [7] C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1-9.
    [8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770-778.
    [9] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the royal statistical society. series c (applied statistics), vol. 28, no. 1, pp. 100-108, 1979.
    [10] H. Abdi and L. J. Williams, "Principal component analysis," Wiley interdisciplinary reviews: computational statistics, vol. 2, no. 4, pp. 433-459, 2010.
    [11] A. Ng, "Sparse autoencoder," CS294A Lecture notes, vol. 72, no. 2011, pp. 1-19, 2011.
    [12] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431-3440.
    [13] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical image computing and computer-assisted intervention, 2015: Springer, pp. 234-241.
    [14] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected crfs," arXiv preprint arXiv:1412.7062, 2014.
    [15] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834-848, 2017.
    [16] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1904-1916, 2015.
    [17] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
    [18] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801-818.
    [19] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2921-2929.
    [20] J. Xie, X. Hou, K. Ye, and L. Shen, "CLIMS: cross language image matching for weakly supervised semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4483-4492.
    [21] A. Radford et al., "Learning transferable visual models from natural language supervision," in International conference on machine learning, 2021: PMLR, pp. 8748-8763.
    [22] J. Lee, S. J. Oh, S. Yun, J. Choe, E. Kim, and S. Yoon, "Weakly supervised semantic segmentation using out-of-distribution data," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16897-16906.
    [23] J. Qin, J. Wu, X. Xiao, L. Li, and X. Wang, "Activation modulation and recalibration scheme for weakly supervised semantic segmentation," in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, vol. 36, no. 2, pp. 2117-2125.
    [24] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3-19.
    [25] J. Ahn, S. Cho, and S. Kwak, "Weakly supervised learning of instance segmentation with inter-pixel relations," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2209-2218.
    [26] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," International journal of computer vision, vol. 88, pp. 303-338, 2010.
    [27] T.-Y. Lin et al., "Microsoft coco: Common objects in context," in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 2014: Springer, pp. 740-755.
    [28] S. Zagoruyko and N. Komodakis, "Wide residual networks," arXiv preprint arXiv:1605.07146, 2016.
    [29] J. Lee, E. Kim, and S. Yoon, "Anti-adversarially manipulated attributions for weakly and semi-supervised semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4071-4080.
    [30] B. Kim, S. Han, and J. Kim, "Discriminative region suppression for weakly-supervised semantic segmentation," in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, no. 2, pp. 1754-1761.
    [31] S. Lee, M. Lee, J. Lee, and H. Shim, "Railroad is not a train: Saliency as pseudo-pixel supervision for weakly supervised semantic segmentation," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5495-5505.
    [32] P.-T. Jiang, Y. Yang, Q. Hou, and Y. Wei, "L2g: A simple local-to-global knowledge transfer framework for weakly supervised semantic segmentation," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16886-16896.
    [33] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv preprint arXiv:2010.11929, 2020.
    [34] Y. Lin et al., "Clip is also an efficient segmenter: A text-driven approach for weakly supervised semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15305-15314.

    Full-text release date: 2025/08/21 (campus network)
    Full-text release date: 2025/08/21 (off-campus network)
    Full-text release date: 2025/08/21 (National Central Library: Taiwan NDLTD system)