Graduate Student: 曾立安 (Li-An Tseng)
Thesis Title: 基於抑制及注意力模組結合線上資料增強之弱監督語意分割任務 (Combining Suppression and Attention with Online Augmentation for Weakly Supervised Semantic Segmentation)
Advisor: 郭景明 (Jing-Ming Guo)
Committee Members: 張傳育 (Chuan-Yu Chang), 陸敬互 (Ching-Hu Lu), 宋啟嘉 (Chi-Chia Sun), 高文忠 (Wen-Chung Kao), 郭景明 (Jing-Ming Guo)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2023
Academic Year: 111
Language: Chinese
Pages: 83
Keywords (Chinese): 弱監督學習, 語意分割任務, 類別活化圖, 偽標籤, 深度學習
Keywords (English): Weakly Supervised Learning, Semantic Segmentation, Class Activation Map, Pseudo Mask, Deep Learning
Semantic segmentation is an important task in computer vision that aims to predict a class label for every pixel of an image, enabling fine-grained image analysis. However, traditional semantic segmentation methods require large amounts of pixel-level annotation to train a model, a process that is time-consuming and labor-intensive. To reduce this annotation burden, researchers have explored weakly supervised semantic segmentation methods, which achieve the same goal using labels coarser than the pixel level.
Weakly supervised semantic segmentation trains with only image-level labels, without pixel-level annotation, and typically uses class activation maps produced by a classification model as the initial step. However, because the backbone network is trained for classification, it over-attends to the most discriminative features of the target class, so the resulting class activation maps deviate substantially from the true object extent. Three main problems follow: (1) the model over-focuses on specific features while other regions are ignored, (2) the activated region extends beyond the target object, and (3) smooth-colored object interiors are difficult to recognize.
To address these problems, this thesis applies a suppression module to restrict the regions on which the model over-focuses, and analyzes where in the model the suppression module should be placed to improve its effectiveness. In addition, a channel attention layer is introduced to assign weights to the many channels of the feature map, producing higher-quality class activation maps, and an improved online data augmentation algorithm is adopted that resolves the occlusion problem of the original algorithm and helps the model better recognize object boundaries.
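The suppression idea can be sketched minimally as follows. This is an illustrative NumPy version assuming a simple max-fraction clipping rule (in the spirit of discriminative region suppression); the function name and the threshold `delta` are assumptions for illustration, not the thesis's actual module:

```python
import numpy as np

def suppress_discriminative_regions(cam, delta=0.7):
    """Clip activations above a fraction `delta` of the per-map maximum,
    so the model cannot rely solely on the most discriminative object part
    and must spread attention to the rest of the object."""
    ceiling = delta * cam.max()
    return np.minimum(cam, ceiling)

# Toy activation map: one dominant peak (0.95) plus weaker object regions.
cam = np.array([[0.95, 0.30],
                [0.60, 0.10]])
print(suppress_discriminative_regions(cam, delta=0.7))
```

After clipping, the peak at 0.95 is capped at 0.665 while the weaker responses are untouched, flattening the gap between the most and less discriminative regions.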
For the experimental results, the proposed method is compared with prior work on the public PASCAL VOC 2012 segmentation benchmark. Training a semantic segmentation network with the generated pseudo masks, the proposed architecture achieves mean intersection-over-union (mIoU) scores of 72.46% on the validation set and 72.81% on the test set.
Semantic segmentation aims to predict a class label for each pixel in an image. However, traditional semantic segmentation methods require extensive pixel-level annotations to train the models, which is time-consuming and labor-intensive. To reduce the annotation requirement, researchers have started exploring weakly supervised semantic segmentation methods, which achieve the same objective using coarser annotations.
Weakly supervised semantic segmentation refers to training models with only image-level annotations, without the need for pixel-level annotations. It primarily uses class activation maps generated by classification models as an initial step. However, because the backbone network is optimized for the classification task, the generated class activation maps suffer from three main issues: (1) the model tends to over-focus on specific features, neglecting other regions, (2) the activated region extends beyond the target object, and (3) the interior of objects is difficult to discern due to color smoothness.
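The class activation map step can be sketched as follows. This is a minimal NumPy version of the original CAM formulation (Zhou et al., 2016): a weighted sum of the final convolutional feature maps using the classifier weights of the target class. The shapes and the [0, 1] normalization are illustrative assumptions:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Compute a CAM: weight the last conv layer's feature maps by the
    fully connected classifier weights for the target class.

    features:   (C, H, W) feature maps from the last conv layer
    fc_weights: (num_classes, C) classifier weights after global average pooling
    class_idx:  target class index
    """
    weights = fc_weights[class_idx]                         # (C,)
    cam = np.tensordot(weights, features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0)                                # keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                               # normalize to [0, 1]
    return cam

# Toy example: 4 channels, an 8x8 spatial map, 3 classes.
rng = np.random.default_rng(0)
feats = rng.random((4, 8, 8))
w = rng.random((3, 4))
cam = class_activation_map(feats, w, class_idx=1)
print(cam.shape)  # (8, 8)
```

Thresholding such a map yields the seed regions from which pseudo masks are grown, which is why the three issues above propagate directly into the pseudo-label quality.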
To address these issues, this thesis proposes the use of suppression modules to limit the model's excessive attention to specific regions, and analyzes where these modules should be placed in the model to maximize their effectiveness. Additionally, a channel attention module is introduced to assign weights to the numerous channels of the feature map, producing higher-quality class activation maps. Furthermore, an improved online data augmentation algorithm is incorporated that addresses the masking issue in the original algorithm and helps the model better delineate object boundaries.
In terms of experimental results, the proposed architecture, using its pseudo masks to train a semantic segmentation network, achieves mIoU scores of 72.46% and 72.81% on the validation and test sets of PASCAL VOC 2012, respectively.
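For reference, mIoU averages the per-class intersection-over-union between predicted and ground-truth label maps. A minimal sketch (ignoring the void/ignore-label handling used in the full PASCAL VOC protocol):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union: per-class IoU averaged over all
    classes that appear in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 label maps with classes {0, 1}:
# class 0: inter=1, union=2 -> 0.5; class 1: inter=2, union=3 -> 2/3.
pred = np.array([[0, 0], [1, 1]])
gt   = np.array([[0, 1], [1, 1]])
print(round(mean_iou(pred, gt, num_classes=2), 4))  # 0.5833
```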