
Graduate Student: Li-Syuan Pan (潘立玄)
Thesis Title: Transformer based Cross Attention Network for Visual Question Answering (基於轉譯器之跨注意力網路於視覺問答系統)
Advisor: Shun-Feng Su (蘇順豐)
Oral Defense Committee: Yung-Yao Chen (陳永耀), Kai-Lung Hua (花凱龍), Mei-Yung Chen (陳美勇), Ching-Hu Lu (陸敬互)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: English
Number of Pages: 69
Chinese Keywords: 視覺問答系統、多模態學習網路、轉換器、自注意力機制、跨注意力機制
English Keywords: visual question answering, multimodal learning, Transformer, self-attention, cross-attention

Visual Question Answering (VQA) is a task that spans Computer Vision and Natural Language Processing. A VQA model consists of an image model and a language model and is therefore a multimodal network architecture; what matters even more is how to learn the correspondence between images and language. This thesis therefore proposes a cross-attention learning network designed on the Transformer architecture. The network is composed mainly of self-attention and cross-attention layers to achieve a deeper understanding of images and language. Self-attention layers are first applied to the image and language input embeddings respectively to learn attended features. Because these attended features are learned within each modality separately, cross-attention layers are then used to add correspondence information between question words and image objects, so that better image and language features are learned. Even without pre-training, we successfully apply the Transformer architecture not only to the language model but also to the image model. In addition, we show that the designed network truly extracts information from the image rather than simply giving the most frequent answer to the question. The model achieves 70.06% accuracy on VQA v2.0, currently the largest VQA dataset, and the experimental results confirm that it performs better than other models that do not use a pre-training strategy.


Visual Question Answering (VQA) is a task which requires deep understanding of both visual concepts and language semantics. A VQA model is a multimodal learning architecture that combines an image model and a language model, and the relationship between images and language is especially important. Thus, a Transformer-based cross-attention learning network is proposed in this study. The network is mainly composed of self-attention and cross-attention layers to improve reasoning between images and language. A self-attention layer is first applied to the image and language inputs to learn attended features from the original input embeddings. Because these attended features are learned from each single modality separately, a cross-attention layer is adopted to learn a better representation by adding cross-modality information between image regions and question words. Without pre-training, the model achieves promising results by successfully incorporating the Transformer into not only the language model but also the image model. In addition, a visual-concept experiment shows that the network truly understands image concepts rather than generating the most frequent answer from the language prior. Our model is evaluated on VQA v2.0, currently the largest benchmark dataset for the VQA task, and reaches 70.06% accuracy, which is higher than that of other models that do not use pre-training.
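
To make the cross-modality design described above concrete, the following is a minimal, illustrative PyTorch sketch, not the thesis implementation: the class name CrossModalityBlock and the layer sizes (d_model=768, 8 heads, a 3072-unit feed-forward layer) are assumptions chosen only for illustration. Each modality first attends over its own tokens with self-attention, then image regions attend to question words and vice versa through cross-attention, followed by a position-wise feed-forward sub-layer with residual connections and layer normalization.

import torch
import torch.nn as nn

class CrossModalityBlock(nn.Module):
    """Hypothetical encoder block: self-attention per modality, then cross-attention."""
    def __init__(self, d_model=768, n_heads=8, d_ff=3072):
        super().__init__()
        # intra-modality self-attention (one per modality)
        self.self_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # cross-modality attention: queries from one modality, keys/values from the other
        self.cross_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # position-wise feed-forward networks
        self.ffn_img = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.ffn_txt = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(6)])

    def forward(self, img, txt):
        # img: (batch, num_regions, d_model) region features; txt: (batch, num_words, d_model) word embeddings
        # self-attention: each modality attends over its own tokens
        img = self.norms[0](img + self.self_img(img, img, img, need_weights=False)[0])
        txt = self.norms[1](txt + self.self_txt(txt, txt, txt, need_weights=False)[0])
        # cross-attention: image regions attend to question words, and question words to image regions
        img_c = self.norms[2](img + self.cross_img(img, txt, txt, need_weights=False)[0])
        txt_c = self.norms[3](txt + self.cross_txt(txt, img, img, need_weights=False)[0])
        # feed-forward sub-layer with residual connection
        img_out = self.norms[4](img_c + self.ffn_img(img_c))
        txt_out = self.norms[5](txt_c + self.ffn_txt(txt_c))
        return img_out, txt_out

In a sketch like this, the fused image and question features would typically be pooled and fed to an answer classifier over the candidate answers; the exact pooling and classifier head used in the thesis are not specified here.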

Table of Contents:
Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Related Work
    1.2.1 Jointly Embedding Models
    1.2.2 Attention Models (without pre-training)
    1.2.3 Attention Models (with pre-training)
    1.2.4 LXMERT
  1.3 Contributions
  1.4 Thesis Organization
Chapter 2 Methodology
  2.1 Background
    2.1.1 Transformer
    2.1.2 Multi-Head Attention
  2.2 Input Embedding
    2.2.1 Word Embedding
    2.2.2 Image Embedding
  2.3 Network Architecture
    2.3.1 Previous Attempts
    2.3.2 Cross-Attention Network
    2.3.3 Cross-Modality Encoder
    2.3.4 Feed-Forward Network
    2.3.5 Output Classifier
  2.4 Positional Encoding
  2.5 Warmup Learning Rate
  2.6 Gradient Accumulation
Chapter 3 Experiments
  3.1 Dataset
    3.1.1 VQA v2.0 Dataset
    3.1.2 Visual Genome Dataset
  3.2 Evaluation Metric
  3.3 Experiments
    3.3.1 Cross-Attention Layer
    3.3.2 Learning Rate
    3.3.3 Gradient Accumulation
    3.3.4 Summary
  3.4 Analysis
    3.4.1 Visualization of Attention Maps
    3.4.2 Visual Concept Experiments
  3.5 Implementation Details
    3.5.1 Preprocessing
    3.5.2 Hyperparameters
  3.6 Environment
Chapter 4 Results
  4.1 Results
    4.1.1 Binary Cases
    4.1.2 Number Cases
    4.1.3 Other Cases
    4.1.4 Wrong Cases
  4.2 Comparison with SOTA approaches without pre-training
  4.3 Comparison with SOTA approaches with pre-training
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
    5.2.1 Tokenization
    5.2.2 Pre-Training Model
References


Full-text release date: 2026/02/04 (campus network, off-campus network, and the National Central Library's Taiwan Dissertations and Theses System)