| Field | Value |
|---|---|
| Graduate student | 楊晴皓 Ching-Hao Yang |
| Thesis title | 基於密集多尺度金字塔模型之對稱性編解碼器架構於文件語義切割的應用 (DeepDoc: A Symmetric Encoder-Decoder Network with Multi-Task Learning and Densely Multi-scale Pyramid Module for Document Segmentation) |
| Advisor | 郭景明 Jing-Ming Guo |
| Committee members | 鍾國亮 Kuo-Liang Chung, 楊士萱 Shin-Hsuan Yang, 王乃堅 Nai-Jian Wang, 夏至賢 Chih-Hsien Hsia |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
| Publication year | 2020 |
| Academic year | 108 (ROC calendar) |
| Language | Chinese |
| Pages | 99 |
| Keywords (Chinese, translated) | document page segmentation, document analysis, semantic segmentation network, deep learning |
| Keywords (English) | Document Segmentation, Document Analysis, Semantic Segmentation, Deep Learning |
Abstract (translated from the Chinese):

Document segmentation is one of the challenging tasks in semantic segmentation, because document images contain large amounts of complex structural information such as figures, tables, text, and background; as a result, current semantic segmentation techniques still leave room for improvement on images with complex document structures. Document semantic segmentation currently faces three bottlenecks: (1) broken segmentation results and misclassification of document structures, (2) failure to segment small objects in documents, and (3) uneven segmentation contours at the boundaries between different structural objects.

For document segmentation, this thesis proposes a symmetric encoder-decoder architecture based on a densely multi-scale pyramid model to improve segmentation performance. To address breakage and misclassification, an efficient symmetric encoder-decoder architecture is adopted to compensate for the loss of spatial information while reducing the number of network parameters. In addition, a Densely Multi-scale Pyramid Module (DMPM) and a Feature Fusion Module (FFM) are designed to enlarge the network's field of view, enabling it to resolve failures in segmenting small objects. This thesis also designs an edge-supervision network to improve the contour segmentation of documents, and uses multi-task learning so that the network jointly learns document semantic analysis and edge information to boost overall performance. For evaluation, the thesis tests on the public segmentation competition dataset RDCL2017 and compares against previous methods; the results show that the proposed architecture reaches 92.02% in mean intersection-over-union (mIoU). On the public table dataset Marmot, the proposed method achieves 80.3% recall and 82.7% precision, which is likewise outstanding compared with previous methods. Taken together, these experimental results show that the designed architecture overcomes the problems in document segmentation and effectively improves segmentation performance.
Document segmentation is one of the most challenging tasks in semantic segmentation. Since page images comprise large numbers of figures, tables, text regions, and background, the results of state-of-the-art approaches remain unsatisfactory on these complex structures. Three bottlenecks in document segmentation are (1) misclassification of structural data or incomplete segmented regions, (2) missed tiny objects, and (3) poor boundary delineation of the objects among structures.
In this thesis, a densely multi-scale pyramid module embedded in a symmetric encoder-decoder structure is proposed to substantially improve document segmentation performance. To tackle misclassification and incomplete segmented regions, an efficient network with a symmetric encoder-decoder structure is designed to compensate for the loss of spatial information while reducing the number of parameters. In addition, a Densely Multi-scale Pyramid Module (DMPM) and a Feature Fusion Module (FFM) are designed and integrated into the network to extend the effective Fields-of-View (FoV) at multiple scales, so that the issue of missed tiny objects can be resolved. Furthermore, an edge-supervision network is proposed to enhance the contour delineation of document objects and to boost overall performance through multi-task learning. Experimental results show that the proposed network outperforms state-of-the-art methods on the public RDCL2017 dataset, reaching 92.02% in mIoU. On another public dataset, Marmot, the proposed network achieves 80.3% recall and 82.7% precision, respectively. Accordingly, the network proposed in this thesis is a very competitive candidate for document segmentation applications.
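The metrics quoted above (mIoU, recall, precision) are the standard pixel-level evaluation measures. As a point of reference, here is a minimal pure-Python sketch of how they are typically computed from a class confusion matrix; the class layout and counts below are illustrative, not taken from the thesis experiments.

```python
# conf[i][j] = number of pixels whose true class is i and predicted class is j.

def iou_per_class(conf):
    """Per-class intersection-over-union: TP / (TP + FP + FN)."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(n)) - tp  # predicted c, true other
        fn = sum(conf[c][r] for r in range(n)) - tp  # true c, predicted other
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    return ious

def mean_iou(conf):
    """mIoU: unweighted mean of the per-class IoU values."""
    ious = iou_per_class(conf)
    return sum(ious) / len(ious)

def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# Toy two-class example (e.g. table vs. non-table pixels), invented numbers:
conf = [[50, 10],
        [5, 35]]
print(round(mean_iou(conf), 4))  # → 0.7346
```

The same confusion counts drive both kinds of figures reported in the abstract: mIoU aggregates over all structural classes (figure, table, text, background), while the Marmot recall/precision numbers treat table detection as a single positive class.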