| Field | Value |
|---|---|
| Author | 陳柏華 Po-Hua Chen |
| Thesis Title | 基於圖塊遮罩於稠密語意特徵對齊之少樣本語意分割 (Few-shot Semantic Segmentation with Mask-Based Dense Semantic Feature Alignment) |
| Advisor | 郭景明 Jing-Ming Guo |
| Committee | 郭景明 Jing-Ming Guo, 張傳育 Chuan-Yu Chang, 高文忠 Wen-Chung Kao, 陸敬互 Ching-Hu Lu, 宋啟嘉 Chi-Chia Sun |
| Degree | Master |
| Department | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2023 |
| Academic Year | 111 |
| Language | Chinese |
| Pages | 87 |
| Keywords (Chinese) | 少樣本語意分割、無監督訓練、非對稱卷積、特徵比對 |
| Keywords (English) | Few-shot segmentation, Unsupervised learning, Asymmetric convolution, Feature matching |
In conventional semantic segmentation, deep neural networks follow the standard supervised paradigm: learning is driven by large datasets with pixel-level annotations, from which the model acquires semantic features that help it understand and describe images, including object categories, regions, and context, and thereby classify every pixel accurately. In practice, however, collecting and accurately annotating large datasets poses many challenges, and this is especially true for dense-prediction tasks such as semantic segmentation: annotation must cover every pixel in the image and partition image regions into distinct classes, a process that is time-consuming and labor-intensive. Moreover, conventional segmentation architectures can only recognize classes seen during training; when a novel class appears in an image, the model usually fails to delineate the object's contour and assign the correct class. Taken together, the prerequisites of large, correctly annotated datasets and predefined classes limit the applicability of semantic segmentation.

To overcome these problems, few-shot semantic segmentation has attracted wide attention. It aims to build a class-agnostic network that, trained with only a small number of pixel-level annotated samples, can segment novel classes it has never seen. The key is to adapt the network effectively to diverse situations, exploiting the interplay of fine-grained correlations between the query image and the support set so that the limited information in the available images is fully used. For current few-shot segmentation methods, however, coarse segmentation results caused by information compression remain a problem, attributable to the limitations of prototype representations and the underuse of the semantic features within the data. This thesis designs a few-shot semantic segmentation network based on mask-guided dense semantic feature alignment. The proposed method effectively mines the correlations among pixels of the target class and strengthens the semantic concept of the target class shared between query and support features. In addition, an asymmetric feature fusion module is designed: the baseline architecture is restructured, the prior mask from earlier literature is introduced, and an asymmetric convolution structure is used to combine multi-scale features, further improving segmentation performance. Experimental results show that the proposed method makes effective use of a small number of pixel-level annotated samples and achieves competitive segmentation results on the public FSS-1000 dataset.
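The prior mask mentioned above (popularized by PFENet) can be illustrated with a minimal NumPy sketch: each query pixel takes its maximum cosine similarity against the foreground support pixels, and the resulting map is min-max normalized. Function and variable names here are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def prior_mask(query_feat, support_feat, support_mask, eps=1e-8):
    """Max cosine similarity of each query pixel against masked support
    pixels, normalized to [0, 1]. Features: (C, H, W); mask: (H, W)."""
    C, H, W = query_feat.shape
    q = query_feat.reshape(C, -1)                    # (C, HWq)
    s = support_feat.reshape(C, -1)                  # (C, HWs)
    m = support_mask.reshape(-1).astype(bool)
    s = s[:, m]                                      # keep foreground support pixels
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)
    sim = q.T @ s                                    # pairwise cosine similarities
    prior = sim.max(axis=1)                          # best match per query pixel
    prior = (prior - prior.min()) / (prior.max() - prior.min() + eps)
    return prior.reshape(H, W)
```

In practice this is computed on deep backbone features rather than raw pixels, and the normalized map is concatenated with the query features as a coarse localization cue.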
In the realm of traditional semantic segmentation, significant advancements have been made by leveraging substantial amounts of labeled training data. However, in data-scarce scenarios, collecting sufficient data poses challenges. Furthermore, annotating data, especially for dense prediction tasks, is a laborious and time-consuming process. The prerequisites of extensive labeled datasets and predefined class constraints hinder applicability.

To overcome these limitations, few-shot semantic segmentation (FSS) has garnered significant attention. It aims to develop a class-agnostic network that segments unseen classes given only a handful of annotated support images of the target class. The key challenge lies in effectively adapting the network to fully utilize the limited information by harnessing the intricate interplay of fine-grained correlations between query and support images. Nevertheless, coarse segmentation granularity remains a challenge for many existing approaches, primarily attributable to the limitations of prototype representation and the underutilization of information within semantic features. In this thesis, we present the Mask-Based Dense Semantic Feature Alignment Network (MDSFANet), which effectively explores pixel-wise correlations and enhances the conception of target-class semantics between paired query and support features. We further propose a feature integration module that utilizes asymmetric convolution to combine multi-scale features. Experiments show that MDSFANet achieves competitive performance on public benchmarks.
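The asymmetric-convolution idea used in the feature integration module can be sketched as two parallel 1-D branches (a 3x1 and a 1x3 kernel) whose responses are summed; by linearity this is exactly a single 3x3 convolution with a cross-shaped kernel. The sketch below is a minimal single-channel NumPy illustration under assumed names, not the thesis module.

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' 2-D cross-correlation for a single-channel map."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def asym_conv(x, k_v, k_h):
    """Sum of a vertical (3x1) and a horizontal (1x3) branch with 'same'
    padding; illustrative of asymmetric-convolution blocks."""
    xp = np.pad(x, 1)
    v = conv2d(xp, k_v.reshape(3, 1))[:, 1:-1]   # 3x1 branch
    h = conv2d(xp, k_h.reshape(1, 3))[1:-1, :]   # 1x3 branch
    return v + h
```

Because convolution is linear in the kernel, the two branches can be fused at inference time into one 3x3 kernel whose center column is `k_v` and center row is `k_h` (with the center element summed), which is why such blocks add no inference cost over a plain 3x3 convolution.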