| Graduate Student | 何瑞騰 (Jui-Teng Ho) |
|---|---|
| Thesis Title | 雙語場景文字辨識之輪廓生成轉換器 (Contour Generation Transformer for Bilingual Scene Text Recognition) |
| Advisor | 徐繼聖 (Gee-Sern Hsu) |
| Committee | 鍾聖倫 (Sheng-Luen Chung), 陳祝嵩 (Chu-Song Chen), 洪一平 (Yi-Ping Hung), 林惠勇 (Huei-Yung Lin) |
| Degree | Master |
| Department | Department of Mechanical Engineering, College of Engineering |
| Year of Publication | 2023 |
| Academic Year | 111 |
| Language | Chinese |
| Pages | 60 |
| Chinese Keywords | 場景文字辨識, 轉換器, 資料庫 |
| Keywords | Scene Text Recognition, Transformer, Dataset |
Abstract:

We propose the Contour Generation Transformer (CGT) for bilingual Scene Text Recognition (STR). Most existing STR approaches focus on English; however, Chinese is also a major language, and scenes containing both languages are common in many regions and countries. We therefore consider English and Chinese jointly and propose the bilingual recognition model CGT. The CGT consists of a Contour Generator (CG) and a transformer with an embedded language model. The CG detects the character contours of the text and embeds the contour features into the transformer through a contour-query cross-attention layer, which better localizes each character and improves recognition performance. Training proceeds in two phases: the first trains on synthetic data for which character contour masks are available; the second trains on real data, where the contour masks can only be estimated by the model itself. The proposed CGT is evaluated on Chinese and several English benchmark datasets and compared with state-of-the-art methods.
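The contour-query cross-attention described above can be pictured as standard scaled dot-product cross-attention in which learned character queries attend over flattened contour-mask features. The following is a minimal NumPy sketch under assumed shapes and names — the function, dimensions, and random projections are illustrative, not the thesis's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contour_query_cross_attention(queries, contour_feats, d_k=64, seed=0):
    """Hypothetical sketch: character queries attend to contour features.

    queries:       (num_chars, d) learned character query embeddings
    contour_feats: (hw, d) flattened contour-mask feature map from the CG
    """
    rng = np.random.default_rng(seed)
    d = queries.shape[-1]
    # Projection matrices (randomly initialized here; learned in practice).
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q, K, V = queries @ Wq, contour_feats @ Wk, contour_feats @ Wv
    # Each query produces an attention map over spatial contour locations,
    # which is how contour cues help localize individual characters.
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (num_chars, hw)
    return attn @ V, attn                     # contour-aware query features

queries = np.zeros((25, 256)) + np.eye(25, 256)            # 25 character slots
contour = np.random.default_rng(1).standard_normal((8 * 32, 256))
out, attn = contour_query_cross_attention(queries, contour)
print(out.shape, attn.shape)  # (25, 64) (25, 256)
```

In the actual model the projections would be trained end-to-end and the attention output would feed the transformer decoder; this sketch only shows the data flow of contour features into the query stream.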