
Graduate Student: Li-An Tseng (曾立安)
Thesis Title: Combining Suppression and Attention with Online Augmentation for Weakly Supervised Semantic Segmentation (基於抑制及注意力模組結合線上資料增強之弱監督語意分割任務)
Advisor: Jing-Ming Guo (郭景明)
Committee Members: Chuan-Yu Chang (張傳育), Ching-Hu Lu (陸敬互), Chi-Chia Sun (宋啟嘉), Wen-Chung Kao (高文忠), Jing-Ming Guo (郭景明)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science (電資學院 - 電機工程系)
Publication Year: 2023
Graduation Academic Year: 111 (2022–2023)
Language: Chinese
Pages: 83
Keywords: Weakly Supervised Learning, Semantic Segmentation, Class Activation Map, Pseudo Mask, Deep Learning

Semantic segmentation is an important task in computer vision that aims to predict a class label for every pixel of an image, enabling fine-grained image analysis. However, traditional semantic segmentation methods require extensive pixel-level annotations to train the models, a process that is time-consuming and labor-intensive. To reduce the annotation burden, researchers have explored weakly supervised semantic segmentation methods, which achieve the same objective using coarser annotations.

Weakly supervised semantic segmentation trains models with only image-level annotations rather than pixel-level labels, typically using the class activation maps produced by a classification model as the initial step. Because the backbone network is optimized for classification, it over-attends to the most discriminative features of the target, so the generated class activation maps can differ substantially from the true object regions. Three main problems result: (1) the model focuses excessively on specific features and neglects other regions, (2) the activated area extends beyond the target object, and (3) object interiors are hard to identify because of their smooth coloring.

To address these problems, this thesis applies a suppression module that limits the regions the model over-attends to, and analyzes where in the network the module should be placed for the greatest gain in effectiveness. In addition, a channel attention layer is introduced to assign weights to the many channels of the feature map, producing higher-quality class activation maps. Finally, an improved online data augmentation algorithm is incorporated; it resolves the occlusion problem of the original algorithm and helps the model delineate object boundaries more accurately.

For the experimental results, the proposed method is compared with previous approaches on the public PASCAL VOC 2012 segmentation benchmark. A semantic segmentation network trained with the pseudo masks produced by the proposed architecture reaches a mean Intersection over Union (mIoU) of 72.46% on the validation set and 72.81% on the test set.
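The abstract's pipeline starts from class activation maps produced by a classification network. As background, here is a minimal sketch of the standard CAM computation (Zhou et al., 2016, which the thesis surveys in Section 2.3); the ResNet-50 backbone, forward hook, and random input are illustrative assumptions, not the thesis's actual setup.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Minimal CAM sketch: weight the last conv feature maps by the FC-layer
# weights of the target class, ReLU, normalize, and upsample.
model = models.resnet50(weights="IMAGENET1K_V2").eval()  # assumed backbone

feats = {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(maps=o))

image = torch.randn(1, 3, 224, 224)          # stand-in for a real image
with torch.no_grad():
    logits = model(image)
cls = logits.argmax(1).item()                 # predicted class index

w = model.fc.weight[cls]                      # (2048,) classifier weights
cam = torch.einsum("k,khw->hw", w, feats["maps"][0])
cam = F.relu(cam)
cam = cam / (cam.max() + 1e-8)                # normalize to [0, 1]
cam = F.interpolate(cam[None, None], size=image.shape[-2:],
                    mode="bilinear", align_corners=False)[0, 0]
```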
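The record does not spell out the suppression module's design. Since the thesis's literature review covers DRS (discriminative region suppression) and its ablations discuss a learnable suppression module, a DRS-style cap on peak activations is one plausible, clearly hypothetical sketch of limiting the regions the classifier over-attends to; the delta value is an assumption.

```python
import torch

def drs_style_suppression(x: torch.Tensor, delta: float = 0.55) -> torch.Tensor:
    """Cap each activation at delta * (its channel's spatial max).

    x: post-ReLU feature maps of shape (B, C, H, W). A smaller delta
    suppresses the most discriminative peaks more aggressively, spreading
    attention toward less discriminative parts of the object.
    """
    b, c, _, _ = x.shape
    ch_max = x.view(b, c, -1).max(dim=2).values.view(b, c, 1, 1)
    return torch.minimum(x, delta * ch_max)

feat = torch.relu(torch.randn(2, 512, 28, 28))   # toy features
suppressed = drs_style_suppression(feat)
```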
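Likewise, the channel attention layer is only named here. A squeeze-and-excitation block (SENet, also covered in the thesis's Chapter 2) is one minimal way to assign per-channel weights to a feature map; the reduction ratio of 16 is an assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """SENet-style channel attention: squeeze (global average pooling),
    excite (two-layer bottleneck MLP), then rescale each channel."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))       # (B, C) weights in (0, 1)
        return x * w.view(b, c, 1, 1)

attn = SEBlock(512)
out = attn(torch.randn(2, 512, 28, 28))
```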
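The reported 72.46% and 72.81% figures are mean Intersection over Union over the 21 PASCAL VOC classes (20 object classes plus background). Below is a minimal sketch of the usual confusion-matrix computation, assuming integer label maps and VOC's 255 boundary-ignore convention rather than the thesis's exact evaluation code.

```python
import numpy as np

def mean_iou(preds, gts, num_classes=21, ignore_index=255):
    """mIoU from label maps: IoU_c = TP_c / (TP_c + FP_c + FN_c),
    averaged over classes that occur in ground truth or prediction."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, gt in zip(preds, gts):
        mask = gt != ignore_index                # drop VOC boundary pixels
        conf += np.bincount(
            num_classes * gt[mask].astype(np.int64)
            + pred[mask].astype(np.int64),
            minlength=num_classes ** 2,
        ).reshape(num_classes, num_classes)
    tp = np.diag(conf)
    denom = conf.sum(0) + conf.sum(1) - tp       # TP + FP + FN per class
    valid = denom > 0
    return (tp[valid] / denom[valid]).mean()
```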

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
  1.1 Background
  1.2 Objectives and Research Motivation
  1.3 Thesis Organization
Chapter 2  Literature Review
  2.1 Deep Learning Architectures
    2.1.1 History of Deep Learning and Its Architectures
    2.1.2 ANN
      2.1.2.1 Forward Propagation
      2.1.2.2 Backpropagation
      2.1.2.3 Loss Functions
    2.1.3 CNN
    2.1.4 Semantic Segmentation Networks
      2.1.4.1 FCN
      2.1.4.2 UNet
      2.1.4.3 PSPNet
      2.1.4.4 DeepLab
  2.2 Weakly Supervised Semantic Segmentation
    2.2.1 Weakly Supervised Segmentation Model Architectures
      2.2.1.1 Puzzle CAM
      2.2.1.2 L2G
      2.2.1.3 PPC
      2.2.1.4 RCA
      2.2.1.5 SLRNet
      2.2.1.6 AdvCAM
    2.2.2 Weakly Supervised Segmentation Modules
      2.2.2.1 DRS
      2.2.2.2 AMR
    2.2.3 Weakly Supervised Segmentation Data Augmentation
      2.2.3.1 MixupCAM
      2.2.3.2 CDA
  2.3 Class Activation Maps
  2.4 Attention Modules
    2.4.1 SENet
    2.4.2 DANet
    2.4.3 PSA
Chapter 3  Research Methods
  3.1 Model Architecture
  3.2 Class Activation Pipeline and Architecture
  3.3 Suppression Module
  3.4 Channel Attention
  3.5 Improved CDA Algorithm
Chapter 4  Experimental Results
  4.1 Datasets
    4.1.1 VOC2012 Dataset
    4.1.2 SBD Dataset
    4.1.3 Saliency Dataset
  4.2 Test Environment
  4.3 Ablation Studies
    4.3.1 Quantitative Evaluation Metrics
    4.3.2 Test Parameters
      4.3.2.1 Classification Network
      4.3.2.2 Segmentation Network
    4.3.3 Effect of the Learnable Suppression Module on the Network
    4.3.4 Ablation Study on the Attention Module
    4.3.5 Ablation Study on the Improved CDA Algorithm
    4.3.6 Experimental Results
    4.3.7 Comparison with Mainstream Architectures
Chapter 5  Conclusion and Future Work
References

Full-Text Release Date: 2025/08/16 (campus network)
Full-Text Release Date: 2025/08/16 (off-campus network)
Full-Text Release Date: 2025/08/16 (National Central Library: Taiwan Networked Digital Library of Theses and Dissertations)