| Field | Value |
|---|---|
| Author | 陳柏華 Po-Hua Chen |
| Thesis Title | 基於圖塊遮罩於稠密語意特徵對齊之少樣本語意分割 (Few-shot Semantic Segmentation with Mask-Based Dense Semantic Feature Alignment) |
| Advisor | 郭景明 Jing-Ming Guo |
| Committee | 郭景明 Jing-Ming Guo, 張傳育 Chuan-Yu Chang, 高文忠 Wen-Chung Kao, 陸敬互 Ching-Hu Lu, 宋啟嘉 Chi-Chia Sun |
| Degree | Master |
| Department | Department of Electrical Engineering, College of Electrical Engineering and Computer Science |
| Year of Publication | 2023 |
| Academic Year | 111 |
| Language | Chinese |
| Pages | 87 |
| Keywords (Chinese) | 少樣本語意分割、無監督訓練、非對稱卷積、特徵比對 |
| Keywords (English) | Few-shot segmentation, Unsupervised learning, Asymmetric convolution, Feature matching |
In conventional semantic segmentation, deep neural networks follow the standard supervised paradigm: learning is driven by large datasets with pixel-level annotations, from which the model acquires semantic features that help it understand and describe images, including object categories, regions, and context, and thereby classify every pixel accurately. In practice, however, collecting and accurately annotating large datasets poses many challenges, and this is especially true for dense-prediction tasks such as semantic segmentation: annotation must cover every pixel in the image and partition image regions into distinct classes, a process that is time-consuming and labor-intensive. Moreover, conventional segmentation architectures can only recognize classes seen during training; when a novel class appears in an image, the model usually fails to delineate the object's contour and assign the correct class. Taken together, the prerequisites of large, correctly annotated datasets and predefined classes limit the applicability of semantic segmentation.

To overcome these problems, few-shot semantic segmentation has attracted wide attention. It aims to build a class-agnostic network that, trained with only a small number of pixel-level annotated samples, can segment novel classes it has never seen. The key is to adapt the network effectively to diverse situations, exploiting the interplay of fine-grained correlations between the query image and the support set so that the limited information in the available images is fully used. For current few-shot segmentation methods, however, coarse segmentation results caused by information compression remain a problem, attributable to the limitations of prototype representations and the underuse of the semantic features within the data. This thesis designs a few-shot semantic segmentation network based on mask-guided dense semantic feature alignment. The proposed method effectively mines the correlations among pixels of the target class and strengthens the semantic concept of the target class shared between query and support features. In addition, an asymmetric feature fusion module is designed: the baseline architecture is restructured, the prior mask from earlier literature is introduced, and an asymmetric convolution structure is used to combine multi-scale features, further improving segmentation performance. Experimental results show that the proposed method makes effective use of a small number of pixel-level annotated samples and achieves competitive segmentation results on the public FSS-1000 dataset.
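The prior mask mentioned above (popularized by PFENet) can be illustrated with a minimal NumPy sketch: each query pixel takes its maximum cosine similarity against the foreground support pixels, and the resulting map is min-max normalized. Function and variable names here are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def prior_mask(query_feat, support_feat, support_mask, eps=1e-8):
    """Max cosine similarity of each query pixel against masked support
    pixels, normalized to [0, 1]. Features: (C, H, W); mask: (H, W)."""
    C, H, W = query_feat.shape
    q = query_feat.reshape(C, -1)                    # (C, HWq)
    s = support_feat.reshape(C, -1)                  # (C, HWs)
    m = support_mask.reshape(-1).astype(bool)
    s = s[:, m]                                      # keep foreground support pixels
    q = q / (np.linalg.norm(q, axis=0, keepdims=True) + eps)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + eps)
    sim = q.T @ s                                    # pairwise cosine similarities
    prior = sim.max(axis=1)                          # best match per query pixel
    prior = (prior - prior.min()) / (prior.max() - prior.min() + eps)
    return prior.reshape(H, W)
```

In practice this is computed on deep backbone features rather than raw pixels, and the normalized map is concatenated with the query features as a coarse localization cue.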
In the realm of traditional semantic segmentation, significant advancements have been made by leveraging substantial amounts of labeled training data. However, in data-scarce scenarios, collecting sufficient data poses challenges. Furthermore, annotating data, especially for dense prediction tasks, is a laborious and time-consuming process. The prerequisites of extensive labeled datasets and predefined class constraints hinder applicability.

To overcome these limitations, few-shot semantic segmentation (FSS) has garnered significant attention. It aims to develop a class-agnostic network that segments unseen classes given only a handful of annotated support images of the target class. The key challenge lies in effectively adapting the network to fully utilize the limited information by harnessing the intricate interplay of fine-grained correlations between query and support images. Nevertheless, coarse segmentation granularity remains a challenge for many existing approaches, primarily attributable to the limitations of prototype representation and the underutilization of information within semantic features. In this thesis, we present the Mask-Based Dense Semantic Feature Alignment Network (MDSFANet), which effectively explores pixel-wise correlations and enhances the conception of target-class semantics between paired query and support features. We further propose a feature integration module that utilizes asymmetric convolution to combine multi-scale features. Experiments show that MDSFANet achieves competitive performance on public benchmarks.
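The asymmetric-convolution idea used in the feature integration module can be sketched as two parallel 1-D branches (a 3x1 and a 1x3 kernel) whose responses are summed; by linearity this is exactly a single 3x3 convolution with a cross-shaped kernel. The sketch below is a minimal single-channel NumPy illustration under assumed names, not the thesis module.

```python
import numpy as np

def conv2d(x, k):
    """Naive 'valid' 2-D cross-correlation for a single-channel map."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def asym_conv(x, k_v, k_h):
    """Sum of a vertical (3x1) and a horizontal (1x3) branch with 'same'
    padding; illustrative of asymmetric-convolution blocks."""
    xp = np.pad(x, 1)
    v = conv2d(xp, k_v.reshape(3, 1))[:, 1:-1]   # 3x1 branch
    h = conv2d(xp, k_h.reshape(1, 3))[1:-1, :]   # 1x3 branch
    return v + h
```

Because convolution is linear in the kernel, the two branches can be fused at inference time into one 3x3 kernel whose center column is `k_v` and center row is `k_h` (with the center element summed), which is why such blocks add no inference cost over a plain 3x3 convolution.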