
Graduate Student: Tsao-Lun Chen (陳曹倫)
Thesis Title: Incorporating Attention Fusion Module and GAN for End-to-End Generalized Zero-Shot Learning (Chinese title: 利用注意力融合模型與對抗式生成網路用於端對端廣義零次學習)
Advisor: Shun-Feng Su (蘇順豐)
Oral Defense Committee: 蘇順豐 (Shun-Feng Su), 姚立德, 莊鎮嘉, 郭重顯, 陸敬互
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2022
Graduation Academic Year: 110 (2021–2022)
Language: English
Number of Pages: 59
Chinese Keywords: 零次學習, 廣義零次學習, 分類器, 視覺問答系統, 生成對抗網路, 多模態學習網路, 注意力機制
English Keywords: Zero-Shot Learning, Generalized Zero-Shot Learning, Classification, Visual Question Answering, Generative Adversarial Network, Multimodal Learning, Self-Attention


    Abstract:

    Zero-Shot Learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen classes. In this study, the end-to-end Generalized Zero-Shot Learning (GZSL) task is improved by combining an Attention Fusion Module (AFM) with a Generative Adversarial Network (GAN). The AFM is based on attention-based Visual Question Answering (VQA) methods and performs multi-modal learning between information from the visual domain and the textual domain. A self-attention mechanism is then added to the attention-based VQA model to make the regional features more representative. By incorporating the proposed method into the end-to-end GZSL task, a visual feature extractor that can extract semantic features is trained. However, the AFM cannot solve the problem of missing data for unseen classes. To avoid bias toward the seen classes, a feature generation method grafts a GAN with contrastive embedding in front of the classifier. This method combines synthesized unseen-class features with the real seen-class training features into a full dataset for training the classifier, which avoids the missing-data problem; the seen and unseen classes are then recognized with a softmax classifier. Finally, experiments are conducted on three ZSL benchmarks, CUB, AWA2, and SUN, and our approach achieves accuracies of 75.8%, 72.3%, and 42.5%, respectively. Compared with state-of-the-art ZSL methods, our model achieves superior or competitive performance on CUB and AWA2.
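
    To make the pipeline above concrete, the following is a minimal, illustrative PyTorch sketch, not the thesis implementation: the names (RegionSelfAttention, AttentionFusion, ConditionalGenerator, build_gzsl_training_set), the layer sizes (the 4096-unit hidden layer, the 128-dimensional noise, 300 synthesized features per class), and the attribute-guided fusion rule are all assumptions made only for illustration. It shows the two stages described in the abstract: self-attention over regional visual features fused with a class-attribute embedding, and a conditional generator whose synthesized unseen-class features are pooled with real seen-class features to train the final softmax classifier.

    import torch
    import torch.nn as nn

    class RegionSelfAttention(nn.Module):
        """Scaled dot-product self-attention over R regional feature vectors."""
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)

        def forward(self, regions):                        # regions: (B, R, dim)
            q, k, v = self.q(regions), self.k(regions), self.v(regions)
            attn = torch.softmax(q @ k.transpose(1, 2) / regions.size(-1) ** 0.5, dim=-1)
            return attn @ v + regions                      # residual connection

    class AttentionFusion(nn.Module):
        """Fuse attended regional features with a class-attribute (semantic) vector."""
        def __init__(self, vis_dim, attr_dim, hid_dim):
            super().__init__()
            self.self_attn = RegionSelfAttention(vis_dim)
            self.vis_proj = nn.Linear(vis_dim, hid_dim)
            self.attr_proj = nn.Linear(attr_dim, hid_dim)

        def forward(self, regions, attributes):            # (B, R, vis_dim), (B, attr_dim)
            v = self.vis_proj(self.self_attn(regions))     # (B, R, hid)
            a = self.attr_proj(attributes).unsqueeze(1)    # (B, 1, hid)
            weights = torch.softmax((v * a).sum(-1), dim=-1)   # attribute-guided region weights
            return (weights.unsqueeze(-1) * v).sum(1)      # fused feature, (B, hid)

    class ConditionalGenerator(nn.Module):
        """Synthesize visual features from class attributes plus noise (cGAN-style)."""
        def __init__(self, attr_dim, noise_dim, feat_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(attr_dim + noise_dim, 4096), nn.LeakyReLU(0.2),  # hidden size assumed
                nn.Linear(4096, feat_dim), nn.ReLU())

        def forward(self, attributes, noise):
            return self.net(torch.cat([attributes, noise], dim=1))

    def build_gzsl_training_set(real_seen_feats, real_seen_labels,
                                unseen_attrs, unseen_labels, gen, n_per_class=300):
        """Pool real seen-class features with synthesized unseen-class features.

        unseen_attrs: (num_unseen, attr_dim) tensor; unseen_labels: list of ints.
        The pooled set is then used to train an ordinary softmax classifier.
        """
        fake_feats, fake_labels = [], []
        for attr, label in zip(unseen_attrs, unseen_labels):
            noise = torch.randn(n_per_class, 128)                  # noise_dim assumed to be 128
            attrs = attr.unsqueeze(0).expand(n_per_class, -1)
            fake_feats.append(gen(attrs, noise).detach())
            fake_labels.append(torch.full((n_per_class,), label, dtype=torch.long))
        feats = torch.cat([real_seen_feats] + fake_feats)
        labels = torch.cat([real_seen_labels] + fake_labels)
        return feats, labels

    The sketch omits the training loops, the WGAN discriminator, and the contrastive embedding used in the thesis; it is only meant to show how the attention fusion and feature-generation stages fit together before the softmax classifier.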

    Chinese Abstract
    Abstract
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
        1.1 Background
        1.2 Motivations
        1.3 Contributions
        1.4 Thesis Organization
    Chapter 2 Related Work
        2.1 Generalized Zero-shot Learning
            2.1.1 Embedding projection methods
            2.1.2 Feature generation methods
            2.1.3 Part-based methods
            2.1.4 Hybrid methods
        2.2 Visual Question Answering
            2.2.1 Visual attention methods
            2.2.2 External knowledge methods
            2.2.3 Fusion-based methods
        2.3 Self-Attention Network
        2.4 Generative Adversarial Network
    Chapter 3 Methodology
        3.1 Problem setting and notations [72]
        3.2 Attention Fusion Module
            3.2.1 Visual self-attention
            3.2.2 Textual self-attention
            3.2.3 Visual-textual co-attention
            3.2.4 Loss function
            3.2.5 Classification
        3.3 Feature Generator
            3.3.1 Feature generation methods
            3.3.2 Contrastive Embedding
            3.3.3 CE-GZSL framework
            3.3.4 Classification
    Chapter 4 Experiments
        4.1 Datasets
            4.1.1 CUB-200-2011
            4.1.2 Animals with Attributes 2
            4.1.3 SUN Attribute
        4.2 Evaluation Protocol
        4.3 Implementation Details
            4.3.1 Preprocessing
            4.3.2 Training method organization
            4.3.3 Network Optimization
            4.3.4 Classification
            4.3.5 Feature generation
            4.3.6 Environment
        4.4 Comparison with State-of-the-arts
        4.5 Ablation Study
            4.5.1 Component analysis
            4.5.2 Training method analysis
            4.5.3 Self-attention analysis
    Chapter 5 Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
    References

    [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of The IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
    [2] O. Russakovsky, et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
    [3] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014. [Online]. Available: http://www.aclweb.org/anthology/D14-1162.
    [4] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset,” California Institute of Technology, Technical Report CNS-TR-2011-001, 2011.
    [5] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, “Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly,” IEEE Transactions On Pattern Analysis and Machine Intelligence, vol. 41, no. 9, pp. 2251–2265, 2018.
    [6] G. Patterson and J. Hays, “Sun attribute database: Discovering, annotating, and recognizing scene attributes,” 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2751–2758, 2012.
    [7] A. Nichol, J. Achiam, and J. Schulman, “On first-order meta-learning algorithms,” arXiv preprint arXiv:1803.02999, 2018.
    [8] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, vol. 32, 2019.
    [9] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata, “Feature generating networks for zero-shot learning,” Proceedings of The IEEE Conference on Computer Vision and Pattern Recognition, pp. 5542–5551, 2018.
    [10] R. Felix, I. Reid, G. Carneiro, et al., “Multi-modal cycle-consistent generalized zero-shot learning,” Proceedings of The European Conference on Computer Vision, pp. 21–37, 2018.
    [11] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata, “Generalized zero-and few-shot learning via aligned variational autoencoders,” Proceedings of The IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8247–8255, 2019.
    [12] Y. Xian, S. Sharma, B. Schiele, and Z. Akata, “f-VAEGAN-D2: A feature generating framework for any-shot learning,” Proceedings of The IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10275–10284, 2019.
    [13] S. Narayan, A. Gupta, F. S. Khan, C. G. Snoek, and L. Shao, “Latent embedding feedback and discriminative features for zero-shot classification,” European Conference on Computer Vision, pp. 479–495, 2020.
    [14] Z. Han, Z. Fu, S. Chen, and J. Yang, “Contrastive embedding for generalized zero-shot learning,” Proceedings of The IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2371–2381, 2021.
    [15] J. Song, C. Shen, Y. Yang, Y. Liu, and M. Song, “Transductive unbiased embedding for zero-shot learning,” Proceedings of The IEEE Conference on Computer Vision and Pattern Recognition, pp. 1024–1033, 2018.
    [16] G.-S. Xie et al., “Attentive region embedding network for zero-shot learning,” Proceedings of The IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9384–9393, 2019.
    [17] G.-S. Xie et al., “Region graph embedding network for zero-shot learning,” European Conference on Computer Vision, pp. 562–580, 2020.
    [18] W. Xu, Y. Xian, J. Wang, B. Schiele, and Z. Akata, “Attribute prototype network for zero-shot learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 21969–21980, 2020.
    [19] Y. Liu et al., “Goal-oriented gaze estimation for zero-shot learning,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3794–3803, 2021.
    [20] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, “Self-attention generative adversarial networks,” International Conference on Machine Learning, pp. 7354–7363, 2019.
    [21] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1757–1772, 2012.
    [22] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
    [23] Y. Liu et al., “Dual self-attention with co-attention networks for visual question answering,” Pattern Recognition, vol. 117, p. 107956, 2021.
    [24] Z. Yu, J. Yu, J. Fan, and D. Tao, “Multi-modal factorized bilinear pooling with co-attention learning for visual question answering,” Proceedings of the IEEE International Conference on Computer Vision, pp. 1821–1830, 2017.
    [25] H. Zheng, J. Fu, T. Mei, and J. Luo, “Learning multi-attention convolutional neural network for fine-grained image recognition,” Proceedings of the IEEE International Conference on Computer Vision, pp. 5209–5217, 2017.
    [26] S. Gidaris and N. Komodakis, “Dynamic few-shot visual learning without forgetting,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4367–4375, 2018.
    [27] C. Luo, J. Zhan, X. Xue, L. Wang, R. Ren, and Q. Yang, “Cosine normalization: Using cosine similarity instead of dot product in neural networks,” International Conference on Artificial Neural Networks, pp. 382–391, 2018.
    [28] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
    [29] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” International Conference on Machine Learning, pp. 1597–1607, 2020.
    [30] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958, 2009.
    [31] I. Goodfellow et al., “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014.
    [32] R. Felix, M. Sasdelli, I. Reid, and G. Carneiro, “Multi-modal ensemble classification for generalized zero shot learning,” arXiv preprint arXiv:1901.04623, 2019.
    [33] W.-L. Chao, S. Changpinyo, B. Gong, and F. Sha, “An empirical study and analysis of generalized zero-shot learning for object recognition in the wild,” European Conference on Computer Vision, pp. 52–68, 2016.
    [34] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, 2012.
    [35] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
    [36] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” International Conference on Machine Learning, pp. 214–223, 2017.
    [37] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” Advances in Neural Information Processing Systems, vol. 29, 2016.
    [38] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” Advances in Neural Information Processing Systems, vol. 30, 2017.
    [39] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of Wasserstein GANs,” Advances in Neural Information Processing Systems, vol. 30, 2017.
    [40] H. Sharma and A. S. Jalal, “A survey of methods, datasets and evaluation metrics for visual question answering,” Image and Vision Computing, vol. 116, p. 104327, 2021.
    [41] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, “Describing objects by their attributes,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1785, 2009.
    [42] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958, 2009.
    [43] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
    [44] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” Advances in Neural Information Processing Systems, vol. 26, 2013.
    [45] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object categories,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594–611, 2006.
    [46] Y. Xian, B. Schiele, and Z. Akata, “Zero-shot learning-the good, the bad and the ugly,” Proceedings of The IEEE Conference on Computer Vision and Pattern Recognition, pp. 4582–4591, 2017.
    [47] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell, “Zero-shot learning with semantic output codes,” Advances in Neural Information Processing Systems, vol. 22, 2009.
    [48] A. Frome et al., “DeViSE: A deep visual-semantic embedding model,” Advances in Neural Information Processing Systems, vol. 26, 2013.
    [49] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng, “Zero-shot learning through cross-modal transfer,” Advances in Neural Information Processing Systems, vol. 26, 2013.
    [50] J. Weston, S. Bengio, and N. Usunier, “Large scale image annotation: learning to rank with joint word-image embeddings,” Machine Learning, vol. 81, no. 1, pp. 21–35, 2010.
    [51] X. Kong et al., “En-compactness: Self-distillation embedding & contrastive generation for generalized zero-shot learning,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9306–9315, 2022.
    [52] G. Hinton, O. Vinyals, J. Dean, et al., “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, vol. 2, no. 7, 2015.
    [53] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
    [54] S. Antol et al., “VQA: Visual question answering,” Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433, 2015.
    [55] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29, 2016.
    [56] D. Yu, J. Fu, T. Mei, and Y. Rui, “Multi-level attention networks for visual question answering,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4709–4717, 2017.
    [57] J. Song, P. Zeng, L. Gao, and H. T. Shen, “From pixels to objects: Cubic visual attention for visual question answering,” arXiv preprint arXiv:2206.01923, 2022.
    [58] Q. Wu, P. Wang, C. Shen, A. Dick, and A. Van Den Hengel, “Ask me anything: Free-form visual question answering based on knowledge from external sources,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4622–4630, 2016.
    [59] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, “Freebase: a collaboratively created graph database for structuring human knowledge,” Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247–1250, 2008.
    [60] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, “DBpedia: A nucleus for a web of open data,” The Semantic Web, Springer, pp. 722–735, 2007.
    [61] P. Wang, Q. Wu, C. Shen, A. van den Hengel, and A. Dick, “Explicit knowledge-based reasoning for visual question answering,” arXiv preprint arXiv:1511.02570, 2015.
    [62] Q. Wu, C. Shen, P. Wang, A. Dick, and A. Van Den Hengel, “Image captioning and visual question answering based on attributes and external knowledge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1367–1381, 2017.
    [63] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” arXiv preprint arXiv:1606.01847, 2016.
    [64] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, and B.-T. Zhang, “Hadamard product for low-rank bilinear pooling,” arXiv preprint arXiv:1610.04325, 2016.
    [65] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, “MUTAN: Multimodal Tucker fusion for visual question answering,” Proceedings of the IEEE International Conference on Computer Vision, pp. 2612–2620, 2017.
    [66] X. Huang, S. Qian, Q. Fang, J. Sang, and C. Xu, “CSAN: Contextual self-attention network for user sequential recommendation,” Proceedings of the 26th ACM International Conference on Multimedia, pp. 447–455, 2018.
    [67] J. Fu et al., “Dual attention network for scene segmentation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3146–3154, 2019.
    [68] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” arXiv preprint arXiv:1803.02155, 2018.
    [69] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens, “Stand-alone self-attention in vision models,” Advances in Neural Information Processing Systems, vol. 32, 2019.
    [70] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
    [71] Z. Wu et al., “Unsupervised feature learning via non-parametric instance discrimination,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018.
    [72] F. Pourpanah et al., “A review of generalized zero-shot learning methods,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
    [73] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” Proceedings of The IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208, 2018.
    [74] S. Rahman, S. Khan, and F. Porikli, “A unified approach for conventional zero-shot, generalized zero-shot, and few-shot learning,” IEEE Transactions on Image Processing, vol. 27, no. 11, pp. 5652–5667, 2018.

    Full-text release date: 2023/08/31 (campus network)
    Full-text release date: 2027/08/31 (off-campus network)
    Full-text release date: 2027/08/31 (National Central Library: Taiwan Theses and Dissertations System)