
Author: Wei Li (李維)
Thesis Title: Generation of Style-Preserving Scene Text (風格保留之場景文字生成)
Advisor: Gee-Sern Hsu (徐繼聖)
Committee Members: Sheng-Luen Chung (鍾聖倫), Huei-Yung Lin (林惠勇), Chu-Song Chen (陳祝嵩), Yi-Ping Hung (洪一平)
Degree: Master
Department: Department of Mechanical Engineering, College of Engineering
Publication Year: 2023
Graduation Academic Year: 111
Language: Chinese
Pages: 57
Chinese Keywords: scene text editing (場景文字編輯), style transfer (風格轉換), machine learning (機器學習)
English Keywords: Scene Text Editing, StyleGAN2

We propose the Style-Preserving Generator (SPG), designed primarily for Scene Text Editing (STE). The goal of scene text editing is to replace the original text with the desired text while preserving the background and style of the original. This work makes three contributions. First, mainstream scene text editing architectures are quite complex; unlike them, SPG removes the image-inpainting mechanism from the model, reducing model complexity. Second, SPG incorporates a Transformer mechanism, which lets the model attend to both fine detail and global context during image generation. Third, SPG performs particularly well on scene text editing under large viewing angles.
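As a concrete illustration of the second contribution (combining local detail with global context), the following is a minimal PyTorch sketch of how a generator block might fuse convolutional features with a Transformer encoder. It is an assumption for illustration only, not the released SPG code; all module names and feature sizes are hypothetical.

import torch
import torch.nn as nn

class DetailGlobalBlock(nn.Module):
    """Hypothetical block: a conv branch for stroke-level detail plus a
    Transformer encoder layer for global layout/style context."""
    def __init__(self, channels=256, num_heads=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)

    def forward(self, feat):
        # feat: (B, C, H, W) features of the style (source) text image.
        local = self.conv(feat)                    # local stroke detail
        b, c, h, w = local.shape
        tokens = local.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        ctx = self.attn(tokens)                    # global self-attention
        ctx = ctx.transpose(1, 2).reshape(b, c, h, w)
        return local + ctx                         # fuse detail and context

if __name__ == "__main__":
    x = torch.randn(2, 256, 16, 64)                # features of a text crop
    print(DetailGlobalBlock()(x).shape)            # torch.Size([2, 256, 16, 64])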
In terms of applications, scene text editing has considerable potential for synthetic data generation: by transferring the text found in real data, we can generate large amounts of high-quality, realistic data. In this work we apply the realistic data generated by SPG to the task of Scene Text Recognition (STR), and our experiments show that adding SPG-generated data effectively improves recognition performance. Given the current trend toward ever-larger datasets, SPG can save much of the cost previously spent on data collection and annotation by using synthetic data to boost recognition performance on real datasets. Finally, we introduce a new license plate dataset (License Plate Dataset 2023, LP2023). We use SPG to produce text-edited synthetic data and evaluate our method on LP2023 to demonstrate its performance when learning from synthetic data.
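A minimal sketch of how SPG-generated crops might be mixed with the real training split for STR training; the directory layout, preprocessing, and the use of ImageFolder (one sub-folder per label string) are illustrative assumptions, not the thesis's actual pipeline.

from torch.utils.data import ConcatDataset, DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder

preprocess = transforms.Compose([
    transforms.Resize((32, 128)),   # a typical STR input size
    transforms.ToTensor(),
])

# Real license-plate crops and SPG-edited synthetic crops (hypothetical paths).
real_set = ImageFolder("data/LP2023/train", transform=preprocess)
synth_set = ImageFolder("data/LP2023-Synth/train", transform=preprocess)

# Train the recognizer on the union of real and synthetic samples.
train_loader = DataLoader(ConcatDataset([real_set, synth_set]),
                          batch_size=64, shuffle=True, num_workers=4)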


We propose a Style-Preserving Generator (SPG) designed primarily for scene text editing (STE). STE aims to replace the original text with the desired text while preserving the background and style of the original content. The complex architectural designs in existing scene text editing research have led to cumbersome workflows and increased training difficulty. SPG is a simpler yet effective framework that uses a more streamlined approach, offering improved performance and robustness. In addition, we address the problem of applying scene text editing to scene text recognition (STR). Collecting real data for STR is challenging and time-consuming, so we aim to enhance recognition performance on real datasets by leveraging synthetic data. We introduce a new license plate dataset (LP2023). Using SPG, we generate synthetic data with edited text and evaluate our method on LP2023 to demonstrate its performance when learning from synthetic data.
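For reference, a minimal sketch of the exact-match word accuracy commonly used to score STR models; whether the thesis reports exactly this metric is an assumption.

def word_accuracy(predictions, labels):
    """Fraction of crops whose predicted string exactly matches the ground truth."""
    assert len(predictions) == len(labels)
    correct = sum(p == g for p, g in zip(predictions, labels))
    return correct / len(labels)

# Example: one of two plates recognized correctly -> 0.5
print(word_accuracy(["ABC1234", "XYZ5678"], ["ABC1234", "XYZ5679"]))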

Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
  1.1  Background and Motivation
  1.2  Method Overview
  1.3  Contributions
  1.4  Thesis Organization
Chapter 2  Related Work
  2.1  SRNet
  2.2  MOSTEL
  2.3  Handwriting Transformer
  2.4  Segformer
  2.5  EfficientNet
  2.6  MADF
  2.7  StyleGAN2
Chapter 3  Proposed Method
  3.1  Overall Network Architecture
  3.2  Design of the Text Style-Preserving Generator
  3.3  Design of the Foreground Text Extractor
  3.4  Design of the Fusion Style Generator
  3.5  Loss Function Design
Chapter 4  Experimental Setup and Analysis
  4.1  Datasets
    4.1.1  LP2023
    4.1.2  LP2023-Synth
    4.1.3  AOLP
    4.1.4  Tamper-Synth
    4.1.5  Tamper-Scene
  4.2  Experimental Setup
    4.2.1  Data Splits and Settings
    4.2.2  Performance Evaluation Metrics
    4.2.3  Experimental Design
  4.3  Experimental Results and Analysis
  4.4  Comparison with Related Work
  4.5  Effect of Synthetic Data on Scene Text Recognition Performance
  4.6  Balanced Data Distribution Experiment
Chapter 5  Conclusions and Future Work
Chapter 6  References


Full-text release date: 2024/08/08 (campus network)
Full-text release date: 2024/08/08 (off-campus network)
Full-text release date: 2024/08/08 (National Central Library: Taiwan NDLTD system)