
Graduate Student: Ching-Hao Yang (楊晴皓)
Thesis Title: DeepDoc: A Symmetric Encoder-Decoder Network with Multi-Task Learning and Densely Multi-scale Pyramid Module for Document Segmentation (基於密集多尺度金字塔模型之對稱性編解碼器架構於文件語義切割的應用)
Advisor: Jing-Ming Guo (郭景明)
Committee Members: Kuo-Liang Chung (鍾國亮), Shin-Hsuan Yang (楊士萱), Nai-Jian Wang (王乃堅), Chih-Hsien Hsia (夏至賢)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Graduation Academic Year: 108 (2019-2020)
Language: Chinese
Number of Pages: 99
Keywords: Document Segmentation, Document Analysis, Semantic Segmentation, Deep Learning



Document segmentation is one of the most challenging tasks in semantic segmentation. Because document page images contain large numbers of figures, tables, text blocks, and background regions, segmentation results from state-of-the-art approaches remain unsatisfactory on these complex structures. The three main bottlenecks in document segmentation are (1) misclassification of structural content and incompletely segmented regions, (2) missed tiny objects, and (3) poor boundary delineation between structural objects.
In this thesis, a densely multi-scale pyramid module embedded in a symmetric encoder-decoder structure is proposed to substantially improve document segmentation performance. To tackle misclassification and incomplete segmented regions, an efficient network with a symmetric encoder-decoder structure is designed to compensate for the loss of spatial information while reducing the number of parameters. In addition, a Densely Multi-scale Pyramid Module (DMPM) and a Feature Fusion Module (FFM) are designed and embedded in the network to enlarge the effective Fields-of-View (FoV), so that tiny objects are no longer missed. Furthermore, an edge-supervision network is proposed to sharpen the contour delineation of document objects, and multi-task learning over the segmentation and edge tasks lifts the overall performance. Experimental results show that the proposed network outperforms state-of-the-art methods on the public RDCL2017 dataset, reaching 92.02% mIoU. On the public Marmot table dataset, the proposed network achieves 80.3% recall and 82.7% precision. Accordingly, the network proposed in this thesis is a highly competitive candidate for document segmentation applications.
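To make the multi-task objective and the reported metric concrete, below is a minimal PyTorch sketch. It is an illustration under assumptions, not the implementation from the thesis: the four-class label set (figure, table, text, background), the choice of cross-entropy and binary cross-entropy losses, and the edge_weight coefficient are all hypothetical.

import torch
import torch.nn as nn

NUM_CLASSES = 4  # assumed label set: figure, table, text, background

class MultiTaskLoss(nn.Module):
    """Joint objective: per-pixel segmentation loss plus a weighted
    auxiliary edge-supervision loss (all weights are assumptions)."""
    def __init__(self, edge_weight: float = 0.5):
        super().__init__()
        self.edge_weight = edge_weight
        self.seg_loss = nn.CrossEntropyLoss()    # (N, C, H, W) logits vs (N, H, W) labels
        self.edge_loss = nn.BCEWithLogitsLoss()  # (N, 1, H, W) binary boundary map

    def forward(self, seg_logits, edge_logits, seg_target, edge_target):
        return (self.seg_loss(seg_logits, seg_target)
                + self.edge_weight * self.edge_loss(edge_logits, edge_target))

def mean_iou(pred, target, num_classes=NUM_CLASSES):
    """mIoU: average over classes of TP / (TP + FP + FN)."""
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum().item()
        union = ((pred == c) | (target == c)).sum().item()
        if union > 0:  # skip classes absent from both prediction and label
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Smoke test with random tensors standing in for network outputs and labels.
criterion = MultiTaskLoss(edge_weight=0.5)
seg_logits = torch.randn(2, NUM_CLASSES, 64, 64)
edge_logits = torch.randn(2, 1, 64, 64)
seg_target = torch.randint(0, NUM_CLASSES, (2, 64, 64))
edge_target = torch.randint(0, 2, (2, 1, 64, 64)).float()
loss = criterion(seg_logits, edge_logits, seg_target, edge_target)
print(loss.item(), mean_iou(seg_logits.argmax(dim=1), seg_target))

The mean_iou function computes the quantity reported above: per-class intersection over union, TP / (TP + FP + FN), averaged across classes.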

Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
  1.1 Research Background
  1.2 Research Motivation and Objectives
  1.3 Thesis Organization
Chapter 2: Literature Review
  2.1 Artificial Neural Networks
    2.1.1 Forward Propagation
    2.1.2 Backward Propagation
    2.1.3 Limitations of Artificial Neural Networks
  2.2 Convolutional Neural Networks
    2.2.1 Convolution Operations
    2.2.2 Activation Functions
    2.2.3 Feature Extraction Mechanisms
    2.2.4 Training Convolutional Neural Networks
    2.2.5 Development of Convolutional Neural Networks
  2.3 Semantic Segmentation Networks
    2.3.1 Fully Convolutional Networks
    2.3.2 Symmetric Encoder-Decoder Architectures
    2.3.3 Multi-scale Models
  2.4 Document Segmentation Techniques
  2.5 Multi-Task Learning
Chapter 3: Symmetric Encoder-Decoder Architecture Based on a Densely Multi-scale Pyramid Module
  3.1 Feature Extraction Backbone
  3.2 Document Segmentation Architecture
  3.3 Densely Multi-scale Pyramid Module and Feature Fusion Module (see the sketch following this outline)
  3.4 Edge Supervision Network
Chapter 4: Experimental Results
  4.1 Public Datasets
    4.1.1 RDCL2017 Dataset
    4.1.2 Marmot Dataset
  4.2 DSCL2020 Dataset
  4.3 Experimental Results
    4.3.1 Quantitative Evaluation Metrics
    4.3.2 Network and Training Parameter Settings
    4.3.3 Analysis of Experimental Results
Chapter 5: Conclusions and Future Work
References
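Section 3.3 of the outline introduces the Densely Multi-scale Pyramid Module (DMPM) and Feature Fusion Module (FFM), which the abstract credits with enlarging the effective field-of-view. The thesis's exact design is not reproduced in this record; the sketch below only illustrates the generic idea of a densely connected dilated-convolution pyramid, with channel counts and dilation rates chosen arbitrarily for the example.

import torch
import torch.nn as nn

class DenseDilatedPyramid(nn.Module):
    """Schematic dense multi-scale pyramid: a chain of dilated 3x3
    convolutions in which each branch consumes the concatenation of
    the input and all earlier branch outputs, so later branches see
    progressively larger receptive fields built on reused features."""
    def __init__(self, in_ch: int, branch_ch: int = 64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.branches.append(nn.Sequential(
                nn.Conv2d(ch, branch_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            ))
            ch += branch_ch  # dense connectivity grows the next branch's input
        self.fuse = nn.Conv2d(ch, in_ch, 1)  # 1x1 fusion back to the input width

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))
        return self.fuse(torch.cat(feats, dim=1))

# Usage: spatial size and channel count are preserved, so the module can be
# dropped between an encoder and decoder stage.
out = DenseDilatedPyramid(in_ch=256)(torch.randn(1, 256, 32, 32))
print(out.shape)  # torch.Size([1, 256, 32, 32])

Because the dilated branches share features densely, the module covers several fields-of-view at once; this is the general mechanism by which such a pyramid helps recover tiny page objects that a single-scale network misses.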

Full Text Release Date: 2025/08/25 (campus network)
Full Text Release Date: 2025/08/25 (off-campus network)
Full Text Release Date: 2025/08/25 (National Central Library: Taiwan NDLTD system)