
Author: Ju-Yu Huang (黃如郁)
Title: Multi-label Image Classification and Visualization Methods Based on Contrastive Learning and Global Pooling
Advisor: Bor-shen Lin (林伯慎)
Committee members: 羅乃維, 楊傳凱, 林伯慎
Degree: Master
Department: Department of Information Management, School of Management
Year of publication: 2023
Graduation academic year: 112
Language: Chinese
Number of pages: 70
Keywords (Chinese): 多標籤分類、對比式學習、後全域池化、空間視覺化、非監督圖像分類
Keywords (English): Multi-label Classification, Contrastive Learning, Post-global Pooling, Spatial Visualization, Unsupervised Image Classification
    Contrastive learning is a representation learning method whose training objective is to sharpen the contrast between the similarities of a reference sample to its positive and negative samples, maximizing the similarity to the positives while minimizing the similarity to the negatives. Training a model with this objective lets it learn discriminative latent features of image data that can be applied to different tasks. Multi-label learning is an important application direction of image classification: in a multi-label classification task, each image may be annotated with more than one category, and a category may be an abstract concept rather than a concrete object, such as "indoor" or "night". Unlike general object detection, the training images in multi-label classification carry no region-of-interest annotations, only the category labels contained in the image. The goal of this research is to improve multi-label classification and to visualize the classification results. We propose a classification architecture based on a post-global-pooling network, use the concept of contrastive learning to design the loss function over multi-layer feature maps, and study object visualization methods under this architecture. We evaluated the proposed architecture on the 24-category multi-label classification task of the MIR-Flickr 25k dataset. With VGG followed by fully connected layers as the classifier, micro-F1 was 0.7232; replacing the fully connected layers with the post-global-pooling network raised micro-F1 to 0.7389; adding the contrastive-learning loss function raised it further to 0.7981; and training with balanced positive and negative samples brought micro-F1 to 0.8539. The visualization results also confirmed that, even without annotated object boundaries in the training images, combining the post-global-pooling network with contrastive learning can still extract the locations of objects of each class; this not only greatly reduces the cost of image labeling but also takes a step toward unsupervised image recognition.
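This record does not specify the exact design of the post-global-pooling network. As a rough illustration of the general idea, in the spirit of class activation mapping, the following NumPy sketch maps backbone features to one response map per class and applies global pooling only after that mapping; the function name, the choice of max pooling, and the weight shapes are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np

def post_global_pooling_head(feature_maps, w):
    """Illustrative "pool after class mapping" head (CAM-like sketch).

    feature_maps: (C, H, W) backbone output
    w:            (K, C) weights of a 1x1 convolution mapping the
                  C feature channels to K per-class response maps

    Global pooling (here: max) is applied AFTER the class mapping,
    so the response maps double as coarse localization maps.
    Returns (per-class scores, per-class response maps).
    """
    c, h, width = feature_maps.shape
    flat = feature_maps.reshape(c, h * width)            # (C, H*W)
    class_maps = (w @ flat).reshape(-1, h, width)        # (K, H, W)
    scores = class_maps.reshape(len(w), -1).max(axis=1)  # (K,)
    return scores, class_maps
```

Because pooling happens after the per-class mapping, the argmax location in each class map gives a crude estimate of where the evidence for that class lies, which is the basis of the kind of visualization described in the abstract.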


    Contrastive learning is a representation learning method that sharpens the contrast between the similarities of a reference sample to similar and dissimilar samples: it increases the similarity to similar (positive) samples and reduces the similarity to dissimilar (negative) samples. Training a model with this objective lets it learn discriminative latent features of images that can be applied to various tasks, such as image categorization, clustering, or search. Multi-label image classification is an important variant of image classification in which each image may be labeled with more than one category, and categories may be abstract concepts rather than names of concrete objects, such as "indoor" or "night". It differs from object detection in that its training images are usually labeled only with the categories contained in the image, not with regions of interest.
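The "maximize positive similarity, minimize negative similarity" objective described above is commonly written as an InfoNCE-style loss. The sketch below is that generic formulation, not the thesis's multi-layer feature-map loss; the function name and temperature value are illustrative.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor.

    The loss is low when the anchor is close to its positive and far
    from every negative: cosine similarities become logits of a
    softmax, and the cross-entropy targets the positive sample.
    """
    unit = lambda v: v / np.linalg.norm(v)
    a = unit(anchor)
    # Cosine similarities: positive first, then the negatives.
    sims = np.array([a @ unit(positive)] + [a @ unit(n) for n in negatives])
    logits = sims / temperature
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # cross-entropy on the positive
```

Lowering the temperature sharpens the softmax, so the model is penalized more heavily for any negative that comes close to the anchor.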
    The goal of this research is to improve multi-label image classification and to visualize the classification results spatially. We propose a classification architecture that integrates a VGG-19 backbone with a post-global-pooling network, use contrastive learning to derive the loss function over multi-layer feature maps, and investigate visualization methods based on this architecture. The proposed approaches were tested on the 24-category multi-label classification task of the MIR-Flickr 25k dataset. When the conventional fully connected layer was replaced with the post-global-pooling network, micro-F1 increased from 0.7232 to 0.7389. With the contrastive-learning loss further applied, micro-F1 improved to 0.7981. When the model was trained with balanced positive and negative samples, micro-F1 finally reached 0.8539. In addition, the visualization results show that, even without labeled regions of interest in the training images, it is still feasible to extract the spatial locations of objects within an image. This not only reduces the cost of image labeling but also takes a step toward unsupervised image classification.
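The micro-F1 figures quoted above follow the standard micro-averaged definition: true/false positives and false negatives are pooled over all labels and samples before F1 is computed, so frequent labels weigh more than rare ones. A small self-contained sketch, with an illustrative function name:

```python
import numpy as np

def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label predictions.

    y_true, y_pred: binary arrays of shape (n_samples, n_labels).
    TP/FP/FN are counted over the whole matrix, then combined into
    a single precision, recall, and F1.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```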

    Table of Contents
    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Directions and Contributions
      1.3 Organization of the Thesis
    Chapter 2 Literature Review
      2.1 Convolutional Neural Networks
        2.1.2 The VGG Model
        2.1.3 The ResNet Model
      2.2 Contrastive Learning
      2.3 Multi-label Classification
        2.3.1 C-GMVAE
      2.4 Class Activation Mapping
      2.5 Performance Evaluation Metrics
      2.6 Chapter Summary
    Chapter 3 Classification Architecture and Improvements
      3.1 A Classification Architecture Based on Post-global Pooling
      3.2 Experimental Setup
      3.3 Baseline Experiments
        3.3.1 Generating Class Activation Maps
      3.4 Multi-mixture Contrastive Learning
        3.4.1 Objective Function
        3.4.2 Experiments and Analysis
        3.4.3 Balancing the Number of Training Samples
      3.5 Visualization Methods
      3.6 Chapter Summary
    Chapter 4 Conclusions and Future Directions
    References

