
Graduate Student: Li-Syuan Pan (潘立玄)
Thesis Title: Transformer based Cross Attention Network for Visual Question Answering (基於轉譯器之跨注意力網路於視覺問答系統)
Advisor: Shun-Feng Su (蘇順豐)
Oral Defense Committee: Yung-Yao Chen (陳永耀), Kai-Lung Hua (花凱龍), Mei-Yung Chen (陳美勇), Ching-Hu Lu (陸敬互)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: English
Number of Pages: 69
Chinese Keywords: 視覺問答系統、多模態學習網路、轉換器、自注意力機制、跨注意力機制
English Keywords: visual question answering, multimodal learning, Transformer, self-attention, cross-attention

Visual Question Answering (VQA) is a task that spans Computer Vision and Natural Language Processing. A VQA model consists of an image model and a language model and is therefore a multimodal network architecture; what matters even more is how to learn the correspondence between images and language. This thesis therefore proposes a cross-attention learning network designed on the Transformer architecture. The network is composed mainly of self-attention and cross-attention layers to achieve a deeper understanding of images and language. Self-attention layers are first applied to the image and language input embeddings respectively to learn attended features. Because these attended features are learned within each modality separately, cross-attention layers are then used to add correspondence information between question words and image objects, so that better image and language features are learned. Even without pre-training, we successfully apply the Transformer architecture not only to the language model but also to the image model. In addition, we show that the designed network truly extracts information from the image rather than simply giving the most frequent answer to the question. The model achieves 70.06% accuracy on VQA v2.0, currently the largest VQA dataset, and the experimental results confirm that it performs better than other models that do not use a pre-training strategy.


Visual Question Answering (VQA) is a task which requires deep understanding of both visual concepts and language semantics. A VQA model is a multimodal learning architecture that combines an image model and a language model, and the relationship between images and language is especially important. Thus, a Transformer-based cross-attention learning network is proposed in this study. The network is mainly composed of self-attention and cross-attention layers to improve reasoning between images and language. A self-attention layer is first applied to the image and language inputs to learn attended features from the original input embeddings. Because these attended features are learned from each single modality separately, a cross-attention layer is adopted to learn a better representation by adding cross-modality information between image regions and question words. Without pre-training, the model achieves promising results by successfully incorporating the Transformer into not only the language model but also the image model. In addition, a visual-concept experiment shows that the network truly understands image concepts rather than generating the most frequent answer from the language prior. Our model is evaluated on VQA v2.0, currently the largest benchmark dataset for the VQA task, and reaches 70.06% accuracy, which is higher than that of other models that do not use pre-training.
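
To make the cross-modality design described above concrete, the following is a minimal, illustrative PyTorch sketch, not the thesis implementation: the class name CrossModalityBlock and the layer sizes (d_model=768, 8 heads, a 3072-unit feed-forward layer) are assumptions chosen only for illustration. Each modality first attends over its own tokens with self-attention, then image regions attend to question words and vice versa through cross-attention, followed by a position-wise feed-forward sub-layer with residual connections and layer normalization.

import torch
import torch.nn as nn

class CrossModalityBlock(nn.Module):
    """Hypothetical encoder block: self-attention per modality, then cross-attention."""
    def __init__(self, d_model=768, n_heads=8, d_ff=3072):
        super().__init__()
        # intra-modality self-attention (one per modality)
        self.self_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # cross-modality attention: queries from one modality, keys/values from the other
        self.cross_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # position-wise feed-forward networks
        self.ffn_img = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ffn_txt = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(6)])

    def forward(self, img, txt):
        # img: (batch, num_regions, d_model) region features; txt: (batch, num_words, d_model) word embeddings
        # self-attention: each modality attends over its own tokens
        img = self.norms[0](img + self.self_img(img, img, img, need_weights=False)[0])
        txt = self.norms[1](txt + self.self_txt(txt, txt, txt, need_weights=False)[0])
        # cross-attention: image regions attend to question words, and question words to image regions
        img_c = self.norms[2](img + self.cross_img(img, txt, txt, need_weights=False)[0])
        txt_c = self.norms[3](txt + self.cross_txt(txt, img, img, need_weights=False)[0])
        # feed-forward sub-layer with residual connection
        img_out = self.norms[4](img_c + self.ffn_img(img_c))
        txt_out = self.norms[5](txt_c + self.ffn_txt(txt_c))
        return img_out, txt_out

In a sketch like this, the fused image and question features would typically be pooled and fed to an answer classifier over the candidate answers; the exact pooling and classifier head used in the thesis are not specified here.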

Table of Contents:
Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Related Work
    1.2.1 Jointly Embedding Models
    1.2.2 Attention Models (without pre-training)
    1.2.3 Attention Models (with pre-training)
    1.2.4 LXMERT
  1.3 Contributions
  1.4 Thesis Organization
Chapter 2 Methodology
  2.1 Background
    2.1.1 Transformer
    2.1.2 Multi-Head Attention
  2.2 Input Embedding
    2.2.1 Word Embedding
    2.2.2 Image Embedding
  2.3 Network Architecture
    2.3.1 Previous Attempts
    2.3.2 Cross-Attention Network
    2.3.3 Cross-Modality Encoder
    2.3.4 Feed-Forward Network
    2.3.5 Output Classifier
  2.4 Positional Encoding
  2.5 Warmup Learning Rate
  2.6 Gradient Accumulation
Chapter 3 Experiments
  3.1 Dataset
    3.1.1 VQA v2.0 Dataset
    3.1.2 Visual Genome Dataset
  3.2 Evaluation Metric
  3.3 Experiments
    3.3.1 Cross-Attention Layer
    3.3.2 Learning Rate
    3.3.3 Gradient Accumulation
    3.3.4 Summary
  3.4 Analysis
    3.4.1 Visualization of Attention Maps
    3.4.2 Visual Concept Experiments
  3.5 Implementation Details
    3.5.1 Preprocessing
    3.5.2 Hyperparameters
  3.6 Environment
Chapter 4 Results
  4.1 Results
    4.1.1 Binary Cases
    4.1.2 Number Cases
    4.1.3 Other Cases
    4.1.4 Wrong Cases
  4.2 Comparison with SOTA approaches without pre-training
  4.3 Comparison with SOTA approaches with pre-training
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
    5.2.1 Tokenization
    5.2.2 Pre-Training Model
References


Full-text release date: 2026/02/04 (campus network, off-campus network, and the National Central Library's Taiwan Dissertations and Theses System)