
Graduate Student: Di-Wen Chen (陳棣文)
Thesis Title: Label Generation with Normal Distribution and Result Merge for Scene Text Detection (基於常態分佈標籤生成與結果合併之文本檢測)
Advisor: Chang-Hong Lin (林昌鴻)
Committee Members: Sheng-Zhang Ruan (阮聖彰), Wei-Mei Chen (陳維美), Jin-Xian Wu (吳晋賢)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2022
Graduation Academic Year: 110 (ROC calendar; AY 2021-2022)
Language: English
Pages: 73
Chinese Keywords: 場景文本偵測, 卷積神經網路, 多語言文本偵測, 深度學習
Keywords: Scene text detection, convolutional neural network, multilingual text detection, deep learning
    Scene text detection has long been a common and practical research topic in computer vision. It usually serves as the first step of scene text recognition in applications such as intelligent surveillance, assistance for the blind, autonomous driving, and the conversion of printed text into digital data, after which the recognized text can be used downstream. If text detection is incomplete, the results of these applications suffer; the completeness of word detection is therefore critical, since it makes the text in an image easier to recognize.

    In this thesis, we propose a post-processing method named result merge and a label generation method based on the normal distribution, and we apply them to a U-Net-based backbone network with an attention mechanism. The post-processing method effectively fuses the detection results of the original image with those of its cropped sub-images, so that each text instance can be found in its entirety. The label generation method uses the normal distribution to assign different values according to the short-side length of each shrunk text region. Together, these methods improve the completeness of text detection and recover more small text instances, thereby improving recall.
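
    The following is a minimal sketch of these two ideas, assuming axis-aligned boxes with integer pixel coordinates for simplicity (the thesis itself handles oriented text regions via a border map and shrink mask). All function names, parameter values, and the IoU-based fusion rule are illustrative assumptions, not the thesis's exact algorithm.

import numpy as np

def gaussian_label_map(h, w, shrunk_boxes, sigma_scale=0.5):
    """Fill each shrunk text region with a 2-D normal-distribution
    profile whose spread is tied to the region's short-side length,
    so small regions get a sharper (more concentrated) label.
    shrunk_boxes: iterable of (x0, y0, x1, y1) integer boxes."""
    label = np.zeros((h, w), dtype=np.float32)
    for x0, y0, x1, y1 in shrunk_boxes:
        short_side = min(x1 - x0, y1 - y0)
        # sigma_scale is a guessed hyperparameter, not a value from the thesis.
        sigma = max(short_side * sigma_scale, 1.0)
        ys, xs = np.mgrid[y0:y1, x0:x1]
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        label[y0:y1, x0:x1] = np.maximum(label[y0:y1, x0:x1], g)
    return label

def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def merge_results(orig_boxes, crop_boxes, crop_offset, iou_thr=0.5):
    """Fuse detections from the original image with detections from a
    crop: shift crop boxes back into original-image coordinates, then
    replace each overlapping pair by its bounding union so partially
    detected text is completed rather than duplicated."""
    dx, dy = crop_offset
    shifted = [(x0 + dx, y0 + dy, x1 + dx, y1 + dy)
               for x0, y0, x1, y1 in crop_boxes]
    merged = list(orig_boxes)
    for c in shifted:
        for i, o in enumerate(merged):
            if iou(c, o) >= iou_thr:
                merged[i] = (min(c[0], o[0]), min(c[1], o[1]),
                             max(c[2], o[2]), max(c[3], o[3]))
                break
        else:
            merged.append(c)  # text found only in the crop
    return merged

    The design point mirrored here is that crop detections are mapped back to the original coordinate frame before fusion, so a text instance truncated in one view can be recovered in full from the other.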

    The method in this thesis is trained on the ICDAR2015, MSRA-TD500, and HUST-TR400 datasets and evaluated on the ICDAR2015 and MSRA-TD500 datasets; ablation experiments are also performed to compare the contribution of each component. On ICDAR2015, the method achieves a recall of 87.4%, a precision of 88.0%, an F-measure of 87.7%, and 8.5 FPS; on MSRA-TD500, a recall of 83.5%, a precision of 86.7%, and an F-measure of 85.1%. These results show that the proposed method compares favorably with state-of-the-art methods.
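
    As a quick arithmetic check on these figures, the F-measure is the harmonic mean of precision and recall, F = 2PR / (P + R); the short sketch below plugs in the reported values and reproduces both F-measures.

def f_measure(p, r):
    """Harmonic mean of precision (p) and recall (r)."""
    return 2 * p * r / (p + r)

print(round(f_measure(0.880, 0.874), 3))  # ICDAR2015  -> 0.877
print(round(f_measure(0.867, 0.835), 3))  # MSRA-TD500 -> 0.851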

    CHINESE ABSTRACT
    ABSTRACT
    ACKNOWLEDGEMENTS
    LIST OF CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1 INTRODUCTION
      1.1 Motivation
      1.2 Contributions
      1.3 Thesis Organization
    CHAPTER 2 RELATED WORKS
      2.1 Regression-based Methods
      2.2 Segmentation-based Methods
    CHAPTER 3 PROPOSED METHODS
      3.1 Data Augmentation
        3.1.1 Random Flip, Random Rotation, and Random Resize
        3.1.2 Random Hue and Saturation Adjustment
        3.1.3 Random Crop
      3.2 Network Architecture
        3.2.1 ResNet [42]
        3.2.2 Spatial Attention Network [44]
        3.2.3 The Decoder of U-Net
      3.3 Label Generation
        3.3.1 Density Map and Shrink Mask
        3.3.2 Border Map and Border Mask
      3.4 Loss Function
        3.4.1 Binary Cross Entropy Loss
        3.4.2 Mask L1 Loss
        3.4.3 Dice Loss [56]
        3.4.4 Inference Period
      3.5 Result Merge
    CHAPTER 4 EXPERIMENTAL RESULTS
      4.1 Experimental Environment
      4.2 Scene Text Dataset
        4.2.1 ICDAR2015 Dataset [58]
        4.2.2 MSRA-TD500 and HUST-TR400 Datasets [59, 60]
      4.3 Evaluation Methods
      4.4 Evaluation and Results
        4.4.1 Training Details
        4.4.2 ICDAR2015 Dataset [58]
        4.4.3 MSRA-TD500 Dataset [59]
        4.4.4 Ablation Study
    CHAPTER 5 CONCLUSIONS AND FUTURE WORKS
      5.1 Conclusions
      5.2 Future Works
    REFERENCES

    [1] Y. Huang, Y. Lin, and R. Miao, "An auxiliary blind guide system based on multi-sensor data fusion," in International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), pp. 395-398, 2017.
    [2] Q. Zhang, Z. Zhu, and D. Zhang, "Research of auxiliary recognition system of image for the blind based on tactile perception," The Open Automation and Control Systems Journal, vol. 7, no. 1, pp. 1181-1184, 2015.
    [3] V. Blobel, K. Claus, and M. Frank, "Fast alignment of a complex tracking detector using advanced track models," Computer Physics Communications, vol. 182, no. 9, pp. 1760-1763, 2011.
    [4] V. Fragoso, S. Gauglitz, S. Zamora, J. Kleban, and M. Turk, "TranslatAR: A mobile augmented reality translator," in IEEE Workshop on Applications of Computer Vision (WACV), pp. 497-502, 2011.
    [5] M. G. Ertosun and D. L. Rubin, "Probabilistic visual search for masses within mammography images using deep learning," in IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1310-1315, 2015.
    [6] J. Ahmad, K. Muhammad, and S. W. Baik, "Data augmentation-assisted deep learning of hand-drawn partially colored sketches for visual search," PLoS ONE, vol. 12, no. 8, p. e0183838, 2017.
    [7] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, "A survey of deep learning techniques for autonomous driving," Journal of Field Robotics, vol. 37, no. 3, pp. 362-386, 2020.
    [8] H. Fujiyoshi, T. Hirakawa, and T. Yamashita, "Deep learning-based image recognition for autonomous driving," IATSS Research, vol. 43, no. 4, pp. 244-252, 2019.
    [9] P. Lyu, M. Liao, C. Yao, W. Wu, and X. Bai, "Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes," in European Conference on Computer Vision (ECCV), pp. 67-83, 2018.
    [10] Z. Tian, M. Shu, P. Lyu, R. Li, C. Zhou, X. Shen, and J. Jia, "Learning shape-aware embedding for scene text detection," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4234-4243, 2019.
    [11] W. Wang, E. Xie, X. Li, W. Hou, T. Lu, G. Yu, and S. Shao, "Shape robust text detection with progressive scale expansion network," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9336-9345, 2019.
    [12] C. Xue, S. Lu, and F. Zhan, "Accurate scene text detection through border semantics awareness and bootstrapping," in European Conference on Computer Vision (ECCV), pp. 355-372, 2018.
    [13] Z. Zhang, C. Zhang, W. Shen, C. Yao, W. Liu, and X. Bai, "Multi-oriented text detection with fully convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4159-4167, 2016.
    [14] P. Lyu, C. Yao, W. Wu, S. Yan, and X. Bai, "Multi-oriented scene text detection via corner localization and region segmentation," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7553-7563, 2018.
    [15] E. Xie, Y. Zang, S. Shao, G. Yu, C. Yao, and G. Li, "Scene text detection with supervised pyramid context network," in AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 9038-9045, 2019.
    [16] C. Yao, X. Bai, N. Sang, X. Zhou, S. Zhou, and Z. Cao, "Scene text detection via holistic, multi-channel prediction," arXiv preprint arXiv:1606.09002, 2016.
    [17] J. K. Patel and C. B. Read, Handbook of the normal distribution. CRC Press, 1996.
    [18] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, and S. Lu, "ICDAR 2015 competition on robust reading," in IAPR International Conference on Document Analysis and Recognition (ICDAR), pp. 1156-1160, 2015.
    [19] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, "Detecting texts of arbitrary orientations in natural images," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1083-1090, 2012.
    [20] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, pp. 1137-1149, 2015.
    [21] M. Liao, B. Shi, X. Bai, X. Wang, and W. Liu, "TextBoxes: A fast text detector with a single deep neural network," in AAAI Conference on Artificial Intelligence, vol. 31, no. 01, 2017.
    [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision (ECCV), pp. 21-37, 2016.
    [23] M. Liao, B. Shi, and X. Bai, "TextBoxes++: A single-shot oriented scene text detector," IEEE Transactions on Image Processing, vol. 27, no. 8, pp. 3676-3690, 2018.
    [24] Y. Liu and L. Jin, "Deep matching prior network: Toward tighter multi-oriented text detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1962-1969, 2017.
    [25] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, "Arbitrary-oriented scene text detection via rotation proposals," IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111-3122, 2018.
    [26] W. He, X.-Y. Zhang, F. Yin, and C.-L. Liu, "Deep direct regression for multi-oriented scene text detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 745-753, 2017.
    [27] M. Liao, Z. Zhu, B. Shi, G. Xia, and X. Bai, "Rotation-sensitive regression for oriented scene text detection," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5909-5918, 2018.
    [28] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, "EAST: An efficient and accurate scene text detector," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5551-5560, 2017.
    [29] S. Long, J. Ruan, W. Zhang, X. He, W. Wu, and C. Yao, "TextSnake: A flexible representation for detecting text of arbitrary shapes," in European Conference on Computer Vision (ECCV), pp. 20-36, 2018.
    [30] B. Shi, X. Bai, and S. Belongie, "Detecting oriented text in natural images by linking segments," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2550-2558, 2017.
    [31] Z. Tian, W. Huang, T. He, P. He, and Y. Qiao, "Detecting text in natural image with connectionist text proposal network," in European Conference on Computer Vision (ECCV), pp. 56-72, 2016.
    [32] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed. The MIT Press, p. 13, 2009.
    [33] I. V. Tetko, D. J. Livingstone, and A. I. Luik, "Neural network studies. 1. Comparison of overfitting and overtraining," Journal of Chemical Information and Computer Sciences, vol. 35, no. 5, pp. 826-833, 1995.
    [34] C. K. Ch'ng and C. S. Chan, "Total-text: A comprehensive dataset for scene text detection and recognition," in IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 935-942, 2017.
    [35] F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao, "Geometry-aware scene text detection with instance transformation network," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1381-1389, 2018.
    [36] Y. Baek, B. Lee, D. Han, S. Yun, and H. Lee, "Character region awareness for text detection," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9365-9374, 2019.
    [37] L. Xing, Z. Tian, W. Huang, and M. R. Scott, "Convolutional character networks," in IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9126-9136, 2019.
    [38] M. Liao, Z. Wan, C. Yao, K. Chen, and X. Bai, "Real-time scene text detection with differentiable binarization," in AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 11474-11481, 2020.
    [39] Z. Chen, W. Wang, E. Xie, Z. Yang, T. Lu, and P. Luo, "FAST: Searching for a Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation," arXiv preprint arXiv:2111.02394, 2021.
    [40] Y. Dai, Z. Huang, Y. Gao, Y. Xu, K. Chen, J. Guo, and W. Qiu, "Fused text segmentation networks for multi-oriented scene text detection," in International Conference on Pattern Recognition (ICPR), pp. 3604-3609, 2018.
    [41] A. R. Smith, "Color gamut transform pairs," ACM Siggraph Computer Graphics, vol. 12, no. 3, pp. 12-19, 1978.
    [42] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan, "FOTS: Fast oriented text spotting with a unified network," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5676-5685, 2018.
    [43] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
    [44] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234-241, 2015.
    [45] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in European Conference on Computer Vision (ECCV), pp. 3-19, 2018.
    [46] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, "Residual attention network for image classification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156-3164, 2017.
    [47] B. R. Vatti, "A generic solution to polygon clipping," Communications of the ACM, vol. 35, no. 7, pp. 56-63, 1992.
    [48] M. Ahsanullah, B. Kibria, and M. Shakil, "Normal and Student's t distributions and their applications," Atlantis Press, pp. 7-50, 2014.
    [49] U. Ruby and V. Yendapalli, "Binary cross entropy with deep learning technique for image classification," International Journal of Advanced Trends in Computer Science and Engineering (IJATCSE), vol. 9, no. 10, 2020.
    [50] Y.-D. Ma, Q. Liu, and Z.-B. Qian, "Automated image segmentation using improved PCNN model based on cross-entropy," in International Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 743-746, 2004.
    [51] M. Yeung, E. Sala, C.-B. Schönlieb, and L. Rundo, "Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation," Computerized Medical Imaging and Graphics, vol. 95, p. 102026, 2022.
    [52] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, "Semantic segmentation using adversarial networks," NIPS Workshop on Adversarial Training, 2016.
    [53] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in IEEE International Conference on Computer Vision (ICCV), pp. 2980-2988, 2017.
    [54] J. Wei, S. Wang, and Q. Huang, "F³Net: fusion, feedback and focus for salient object detection," in AAAI Conference on Artificial Intelligence, vol. 34, no. 07, pp. 12321-12328, 2020.
    [55] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, "BASNet: Boundary-aware salient object detection," in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7479-7489, 2019.
    [56] S. Wu, G. Li, L. Deng, L. Liu, D. Wu, Y. Xie, and L. Shi, "L1-norm batch normalization for efficient training of deep neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 7, pp. 2043-2051, 2018.
    [57] T. A. Soomro, A. J. Afifi, J. Gao, O. Hellwich, M. Paul, and L. Zheng, "Strided U-Net model: Retinal vessels segmentation using dice loss," in International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1-8, 2018.
    [58] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, and L. Antiga, "PyTorch: An imperative style, high-performance deep learning library," in International Conference on Neural Information Processing Systems, vol. 32, pp. 8026-8037, 2019.
    [59] C. Yao, X. Bai, and W. Liu, "A unified framework for multioriented text detection and recognition," IEEE Transactions on Image Processing, vol. 23, no. 11, pp. 4737-4749, 2014.
    [60] L. Deng, M. Yang, Y. Qian, C. Wang, and B. Wang, "CNN based semantic segmentation for urban traffic scenes using fisheye camera," in IEEE Intelligent Vehicles Symposium (IV), pp. 231-236, 2017.
    [61] D. Deng, H. Liu, X. Li, and D. Cai, "PixelLink: Detecting scene text via instance segmentation," in AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
    [62] L. Deng, Y. Gong, Y. Lin, J. Shuai, X. Tu, Y. Zhang, Z. Ma, and M. Xie, "Detecting multi-oriented text with corner-based region proposals," Neurocomputing, vol. 334, pp. 134-142, 2019.
    [63] Z. Huang, Z. Zhong, L. Sun, and Q. Huo, "Mask R-CNN with pyramid attention network for scene text detection," in IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 764-772, 2019.
    [64] Q. Yang, M. Cheng, W. Zhou, Y. Chen, M. Qiu, and W. Lin, "IncepText: A new inception-text module with deformable PSROI pooling for multi-oriented scene text detection," in International Joint Conference on Artificial Intelligence (IJCAI), pp. 1071-1077, 2018.
    [65] X. Jiang, S. Xu, S. Zhang, and S. Cao, "Arbitrary-shaped text detection with adaptive text region representation," IEEE Access, vol. 8, pp. 102106-102118, 2020.
    [66] W. Wang, E. Xie, X. Li, X. Liu, D. Liang, Z. Yang, T. Lu, and C. Shen, "PAN++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5349-5367, 2021.

    Full-text release date: 2024/09/15 (campus network)
    Full-text release date: 2024/09/15 (off-campus network)
    Full-text release date: 2024/09/15 (National Central Library: Taiwan NDLTD system)