| Graduate Student | 何瑞騰 (Jui-Teng Ho) |
|---|---|
| Thesis Title | 雙語場景文字辨識之輪廓生成轉換器 (Contour Generation Transformer for Bilingual Scene Text Recognition) |
| Advisor | 徐繼聖 (Gee-Sern Hsu) |
| Committee | 鍾聖倫 (Sheng-Luen Chung), 陳祝嵩 (Chu-Song Chen), 洪一平 (Yi-Ping Hung), 林惠勇 (Huei-Yung Lin) |
| Degree | Master |
| Department | Department of Mechanical Engineering, College of Engineering |
| Year of Publication | 2023 |
| Academic Year | 111 |
| Language | Chinese |
| Pages | 60 |
| Chinese Keywords | 場景文字辨識, 轉換器, 資料庫 |
| Keywords | Scene Text Recognition, Transformer, Dataset |
Abstract:

We propose the Contour Generation Transformer (CGT) for bilingual Scene Text Recognition (STR). Most existing STR approaches focus on English; however, Chinese is also a major language, and scenes containing both languages are common in many regions and countries. We therefore consider English and Chinese jointly and propose the bilingual recognition model CGT. The CGT consists of a Contour Generator (CG) and a transformer with an embedded language model. The CG detects the character contours of the text and embeds the contour features into the transformer through a contour-query cross-attention layer, which better localizes each character and improves recognition performance. Training proceeds in two phases: the first trains on synthetic data for which character contour masks are available; the second trains on real data, where the contour masks can only be estimated by the model itself. The proposed CGT is evaluated on Chinese and several English benchmark datasets and compared with state-of-the-art methods.
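The contour-query cross-attention described above can be pictured as standard scaled dot-product cross-attention in which learned character queries attend over flattened contour-mask features. The following is a minimal NumPy sketch under assumed shapes and names — the function, dimensions, and random projections are illustrative, not the thesis's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contour_query_cross_attention(queries, contour_feats, d_k=64, seed=0):
    """Hypothetical sketch: character queries attend to contour features.

    queries:       (num_chars, d) learned character query embeddings
    contour_feats: (hw, d) flattened contour-mask feature map from the CG
    """
    rng = np.random.default_rng(seed)
    d = queries.shape[-1]
    # Projection matrices (randomly initialized here; learned in practice).
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Q, K, V = queries @ Wq, contour_feats @ Wk, contour_feats @ Wv
    # Each query produces an attention map over spatial contour locations,
    # which is how contour cues help localize individual characters.
    attn = softmax(Q @ K.T / np.sqrt(d_k))   # (num_chars, hw)
    return attn @ V, attn                     # contour-aware query features

queries = np.zeros((25, 256)) + np.eye(25, 256)            # 25 character slots
contour = np.random.default_rng(1).standard_normal((8 * 32, 256))
out, attn = contour_query_cross_attention(queries, contour)
print(out.shape, attn.shape)  # (25, 64) (25, 256)
```

In the actual model the projections would be trained end-to-end and the attention output would feed the transformer decoder; this sketch only shows the data flow of contour features into the query stream.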