Graduate student: 李欣諭
Thesis title: Contrastive Self-supervised Learning with a Little Help of Attention (借助注意力機制進行對比式自監督學習)
Advisor: Hsing-Kuo Pao (鮑興國)
Oral defense committee: Chinyang Henry Tseng (曾俊元), Wei-Chung Teng, Tien-Ruey Hsiang
Degree: Master's
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
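Year of publication: 2024
Academic year of graduation: 112 (ROC calendar)
Language: English
Number of pages: 42
Keywords (Chinese): 自監督學習、對比學習、注意力機制、視覺轉換器、資料增強
Keywords (English): Self-Supervised Learning, Contrastive Learning, Attention Mechanism, Vision Transformer, Data Augmentation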
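  • In recent years, self-supervised learning, owing to its ability to learn
    useful representations without labels, [...] Experimental results show
    that our method improves performance by approximately 2% on datasets
    such as STL-10 and Tiny ImageNet.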

    In recent years, self-supervised learning (SSL) has gained popularity due to
    its ability to learn useful representations without labels. Contrastive
    self-supervised learning (contrastive SSL) is a primary SSL method that uses
    data augmentation to generate views and learns representations by
    contrasting similar and dissimilar data. However, augmentations such as
    random cropping and color distortion often rely on human intuition and may
    lack interpretability, which puts their effectiveness at risk. For example,
    random cropping can miss important semantic details by removing the main
    object and leaving only the background, resulting in poor representations.
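
The contrastive objective described above can be illustrated with a minimal sketch of an InfoNCE-style loss, the loss family named in the thesis's contents (Section 3.2). This is a pure-Python illustration under assumptions, not the thesis's implementation: the function names `cosine_similarity` and `info_nce`, the temperature of 0.5, and the toy 2-D embeddings are all chosen here for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one anchor: negative log-softmax of the positive.

    The temperature value is an illustrative assumption.
    """
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings, made up for illustration.
anchor = [1.0, 0.0]
aligned_view = [0.9, 0.1]   # points in nearly the same direction as the anchor
orthogonal = [0.0, 1.0]
opposite = [-1.0, 0.0]

# Positive pair is similar: low loss.
loss_good = info_nce(anchor, aligned_view, [orthogonal, opposite])
# Positive pair is dissimilar while a "negative" is similar: high loss.
loss_bad = info_nce(anchor, orthogonal, [aligned_view, opposite])
```

The loss is small when the positive view lies close to the anchor and larger when a negative is closer than the positive, which is the sense in which similar and dissimilar data are contrasted.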
    To address these issues, we propose a view generation method that reduces
    reliance on data augmentation. Instead of traditional augmentation
    techniques, our method enhances representations by concentrating on the
    main object in the image. This is achieved through an attention mechanism
    that eliminates the need for one of the augmented views typically used in
    contrastive SSL methods.
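
One way to picture attention-guided view generation is to crop the image to the region where an attention map is strongest. The sketch below is an illustration under stated assumptions, not the thesis's procedure: the attention map is hand-written rather than produced by a Vision Transformer, it is assumed to have the same resolution as the image, and `attention_crop_box`, `crop`, and `threshold_ratio` are hypothetical names.

```python
def attention_crop_box(attention, threshold_ratio=0.5):
    """Return (top, left, bottom, right), end-exclusive, bounding all cells
    whose attention is at least threshold_ratio times the peak attention."""
    peak = max(max(row) for row in attention)
    cutoff = threshold_ratio * peak
    rows = [r for r, row in enumerate(attention)
            if any(a >= cutoff for a in row)]
    cols = [c for c in range(len(attention[0]))
            if any(row[c] >= cutoff for row in attention)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(image, box):
    """Crop a 2-D list-of-lists image to an end-exclusive box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def object_pixels(view):
    """Count object pixels (value 1) in a view."""
    return sum(sum(row) for row in view)


# Toy 6x6 image: a 2x2 "object" (1) on a background (0).
SIZE = 6
OBJECT_CELLS = {(3, 3), (3, 4), (4, 3), (4, 4)}
image = [[1 if (r, c) in OBJECT_CELLS else 0 for c in range(SIZE)]
         for r in range(SIZE)]
# Hand-written attention map (assumption): high on the object, low elsewhere.
attention = [[0.9 if (r, c) in OBJECT_CELLS else 0.05 for c in range(SIZE)]
             for r in range(SIZE)]

box = attention_crop_box(attention)
attention_view = crop(image, box)
# A 2x2 corner crop, one position that random cropping could select.
corner_view = crop(image, (0, 0, 2, 2))
```

With this toy input the attention-based crop keeps all four object pixels, while the corner crop of the same size contains only background.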
    Unlike traditional contrastive SSL approaches, which focus on adjustments
    and design choices after the encoder, our approach modifies how the input
    is processed before the encoder. This allows it to be integrated, without
    architectural changes, into existing contrastive SSL methods that generate
    views through data augmentation. In our experiments, the method achieves
    an approximately 2% performance improvement on datasets such as STL-10 and
    Tiny ImageNet.
    Keywords: Self-Supervised Learning, Contrastive Learning, Attention
    Mechanism, Vision Transformer, Data Augmentation.

    Recommendation Letter
    Approval Letter
    Abstract in Chinese
    Abstract in English
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    List of Algorithms
    1 Introduction
    2 Related Work
        2.1 Contrastive Self-Supervised Learning
        2.2 Generates Views by Data Augmentation
        2.3 Attention Mechanism
    3 Methodology
        3.1 View Generation
        3.2 Over-confident Problem in InfoNCE
        3.3 Align Loss
    4 Experiments
        4.1 Datasets and Implementation Details
        4.2 Temperature Tuning
        4.3 Linear Classification
            4.3.1 Analyzing the Impact of Individual Components
            4.3.2 Linear Evaluation of Various Contrastive SSL
            4.3.3 Loss Analysis
        4.4 Feature Visualization
        4.5 Views Visualization
        4.6 Interval Training
        4.7 Semi-Supervised Learning via Fine-Tuning
        4.8 Similarity of Positive Pairs in Training
        4.9 Ablation Studies
            4.9.1 Should Random Cropping be Applied to the ViT-Based View?
            4.9.2 Whether the Effect Stems from Entire Image
            4.9.3 Background Masking
            4.9.4 Who Should be the Target View?
            4.9.5 Loss Coefficient λ
    5 Conclusions
    References
    Appendix A
        A.1 ViT Model Configuration
        A.2 Enhancing Attention Magnitude Scale

