
Author: Hsin-Yu Lee (李欣諭)
Thesis Title: Contrastive Self-supervised Learning with a Little Help of Attention (借助注意力機制進行對比式自監督學習)
Advisor: Hsing-Kuo Pao (鮑興國)
Committee Members: Chinyang Henry Tseng (曾俊元), Wei-Chung Teng (鄧惟中), Tien-Ruey Hsiang (項天瑞)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2024
Graduation Academic Year: 112
Language: English
Number of Pages: 42
Keywords (Chinese): 自監督學習、對比學習、注意力機制、視覺轉換器、資料增強
Keywords (English): Self-Supervised Learning, Contrastive Learning, Attention Mechanism, Vision Transformer, Data Augmentation
Abstract (Chinese):
In recent years, self-supervised learning has drawn wide attention for its ability to learn useful representations without labels. Contrastive self-supervised learning, one of the mainstream approaches, generates views through image data augmentation and learns representations by contrasting the similarity and dissimilarity between views. However, augmentations such as random cropping and color distortion tend to follow human intuition and may lack interpretability and effectiveness. Worse, random cropping may discard semantic details during cropping, degrading view quality and, in turn, the quality of the learned representations.

To address these problems, we propose a view generation method that aims to reduce the reliance on data augmentation. Unlike traditional augmentation techniques, our method improves the quality of the extracted features by focusing attention on the main subject of the image, thereby lowering the need for data augmentation in contrastive self-supervised learning.

Moreover, whereas traditional contrastive self-supervised learning methods mostly focus on adjustments and designs after the encoder, our method instead adjusts the input processing before the encoder, so it can be combined directly with contrastive self-supervised learning methods that generate views through data augmentation, without changing their architecture.

Experimental results show that our method improves performance by about 2% on datasets such as STL-10 and Tiny ImageNet.

Keywords: Self-Supervised Learning, Contrastive Learning, Attention Mechanism, Vision Transformer, Data Augmentation
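The abstract describes generating one of the two contrastive views by attending to the main object instead of applying a second random augmentation. The sketch below illustrates one way this could look in PyTorch; it is a minimal, hypothetical sketch, not the thesis implementation. It assumes the CLS-to-patch attention of a ViT has already been extracted as a tensor, and the soft-masking scheme, function name, and parameters are all illustrative.

```python
# Hypothetical sketch of an attention-guided view: a ViT's CLS-to-patch
# attention is upsampled to image resolution and used as a soft mask that
# emphasizes the main object. Not the thesis's actual method.
import torch
import torch.nn.functional as F


def attention_guided_view(image: torch.Tensor,
                          attn_cls: torch.Tensor,
                          patch_size: int = 16) -> torch.Tensor:
    """image:    (B, 3, H, W) batch of images.
    attn_cls: (B, N) CLS-to-patch attention averaged over heads, with
              N = (H // patch_size) * (W // patch_size).
    Returns an image-shaped view in which low-attention (background)
    regions are attenuated so the main object dominates.
    """
    B, _, H, W = image.shape
    gh, gw = H // patch_size, W // patch_size

    # Reshape the patch attention into a 2-D grid and normalize it to [0, 1]
    # per image so it can serve as a soft mask.
    attn = attn_cls.view(B, 1, gh, gw)
    lo = attn.amin(dim=(2, 3), keepdim=True)
    hi = attn.amax(dim=(2, 3), keepdim=True)
    attn = (attn - lo) / (hi - lo + 1e-6)

    # Upsample the patch-level mask to pixel resolution.
    mask = F.interpolate(attn, size=(H, W), mode="bilinear", align_corners=False)

    # Soft-emphasize the attended region; the background is dimmed rather
    # than removed. Hard masking would be an alternative design.
    return image * (0.5 + 0.5 * mask)
```

Whether the background should be dimmed, hard-masked, or left untouched is exactly the kind of design choice the thesis examines in its ablation studies (see the background-masking experiment in the table of contents), so the soft mask above is only one plausible variant.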


Abstract (English):
In recent years, self-supervised learning (SSL) has gained popularity due to its ability to learn useful representations without labels. Contrastive self-supervised learning (contrastive SSL) is a primary SSL approach that uses data augmentation to generate views and learns representations by contrasting similar and dissimilar data. However, augmentations such as random cropping and color distortion often rely on human intuition and may lack interpretability, which limits their effectiveness. Random cropping, in particular, can miss important semantic details by removing the main object and leaving only the background, resulting in poor representations.

To address these issues, we propose a view generation method that reduces reliance on data augmentation. Instead of traditional augmentation techniques, our method enhances representations by concentrating on the main object in the image. This is achieved through an attention mechanism that eliminates the need for one of the augmented views typically used in contrastive SSL methods.

Unlike traditional contrastive SSL approaches, which focus on adjustments and design after the encoder, our approach modifies the input processing before the encoder. This allows integration with existing contrastive SSL methods that use data augmentation to generate views, without altering their architecture. As a result, our method achieves an approximately 2% performance improvement on datasets such as STL-10 and Tiny ImageNet.

Keywords—Self-Supervised Learning, Contrastive Learning, Attention Mechanism, Vision Transformer, Data Augmentation.
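Because the change happens before the encoder, integration with an existing contrastive SSL pipeline could, in principle, be as simple as swapping one of the two randomly augmented views for the attention-guided one. The sketch below shows this with a SimCLR-style NT-Xent (InfoNCE) loss; `encoder`, `projector`, `augment`, and `vit_attention` are placeholder callables, and `attention_guided_view` is the helper sketched earlier. None of this is taken from the thesis code.

```python
# Hypothetical integration with a SimCLR-style pipeline: one branch keeps the
# usual random augmentation, the other uses the attention-guided view.
# `attention_guided_view` is assumed to be defined as in the previous sketch.
import torch
import torch.nn.functional as F


def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Standard NT-Xent / InfoNCE loss over a batch of positive pairs."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2B, D), unit norm
    sim = z @ z.t() / temperature                         # scaled cosine similarities
    B = z1.size(0)
    # Exclude self-similarity from the softmax denominator.
    sim = sim.masked_fill(torch.eye(2 * B, dtype=torch.bool, device=z.device), float("-inf"))
    # Row i's positive is row i + B, and vice versa.
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(z.device)
    return F.cross_entropy(sim, targets)


def training_step(images, encoder, projector, augment, vit_attention):
    view_a = augment(images)                                        # standard augmented view
    view_b = attention_guided_view(images, vit_attention(images))   # attention-guided view
    z_a = projector(encoder(view_a))
    z_b = projector(encoder(view_b))
    return info_nce(z_a, z_b)
```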

Table of Contents:
Recommendation Letter
Approval Letter
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
List of Algorithms
1 Introduction
2 Related Work
  2.1 Contrastive Self-Supervised Learning
  2.2 Generates Views by Data Augmentation
  2.3 Attention Mechanism
3 Methodology
  3.1 View Generation
  3.2 Over-confident Problem in InfoNCE
  3.3 Align Loss
4 Experiments
  4.1 Datasets and Implementation Details
  4.2 Temperature Tuning
  4.3 Linear Classification
    4.3.1 Analyzing the Impact of Individual Components
    4.3.2 Linear Evaluation of Various Contrastive SSL
    4.3.3 Loss Analysis
  4.4 Feature Visualization
  4.5 Views Visualization
  4.6 Interval Training
  4.7 Semi-Supervised Learning via Fine-Tuning
  4.8 Similarity of Positive Pairs in Training
  4.9 Ablation Studies
    4.9.1 Should Random Cropping be Applied to the ViT-Based View?
    4.9.2 Whether the Effect Stems from Entire Image
    4.9.3 Background Masking
    4.9.4 Who Should be the Target View?
    4.9.5 Loss Coefficient λ
5 Conclusions
References
Appendix A
  A.1 ViT Model Configuration
  A.2 Enhancing Attention Magnitude Scale


Full-Text Availability: 2026/08/26 (campus network); 2027/08/26 (off-campus network); 2027/08/26 (National Central Library, Taiwan thesis system)