
Author: 簡志遠 (Chih-Yuan Chien)
Thesis title: Automatic Hypertext Table Understanding by using Logical Structure Description Algorithm
Advisors: 李漢銘 (Hahn-Ming Lee), 許鈞南 (Chun-Nan Hsu)
Oral defense committee: 項天瑞 (Tien-Ruey Hsiang), 何建明 (Jan-Ming Ho), 陳信希 (Hsin-Hsi Chen)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: English
Number of pages: 60
Keywords (Chinese): 表格理解, 超文件表格, 網路表格, 邏輯結構
Keywords (English): Table Understanding, Hypertext Table, Web Table, Logical Structure

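Tables make data easy to reuse, compare, and present, so they are widely used on Web pages. The main function of an automatic hypertext table understanding system is to analyze the structure of a table and extract the information it contains. A table has both a layout structure and a logical structure: the layout structure determines the presentation style of the table and the arrangement of its fields, while the logical structure determines the relationships among the contents of the table. As the arrangement of fields varies, a single logical structure can therefore be rendered as many different layout structures; we define this as the multi-layout problem. Because the multi-layout problem increases the complexity of layout presentation, traditional methods can handle only some tables, and the logical structures they extract may be incomplete.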
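In this thesis, we propose a logical structure description algorithm that automatically describes the logical structure of a table. Using the correspondences among table fields, the algorithm generates logical structure description rules without requiring the types of layout presentation to be defined. Finally, the experimental results show that our automatic hypertext table understanding system not only converts an input hypertext table into a relational table, but also fully preserves the logical structure of the original hypertext table in the output relational table.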


Tables make data easy to reuse, compare, and present, so they are widely used on Web pages. The main function of table understanding is to analyze the structure of Web tables and extract meaningful information from them, which makes table understanding an important task for information retrieval. Depending on how its fields are arranged, one logical structure can be mapped onto various layout structures; we define this as the multi-layout problem. Because the multi-layout problem increases the complexity of layout patterns, traditional approaches can handle only certain specific Web tables, and the logical structures they extract may be incomplete.
In this thesis, we propose a logical structure description algorithm, named the Structure Description Algorithm, that automatically describes the logical structures of Web tables. Based on the relationships among table fields, the algorithm generates logical structure description rules without requiring layout patterns to be defined. Finally, our experimental results demonstrate that our Table Understanding system can translate a Web table into a relational table while completely retaining the original logical structure in the output relational table.
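
The multi-layout problem described in the abstract can be illustrated with a small sketch. This is not the thesis's Structure Description Algorithm: the function name `to_relational`, the `attributes_in` parameter, and the sample data are all hypothetical, and the sketch assumes the header orientation is already known, which is exactly the layout knowledge the thesis aims to avoid defining by hand. It only shows that two different layouts of one logical structure should yield the same relational content.

```python
from typing import List, Set, Tuple

Row = List[str]
Triple = Tuple[str, str, str]  # (entity, attribute, value)


def to_relational(grid: List[Row], attributes_in: str) -> Set[Triple]:
    """Flatten a small table into (entity, attribute, value) triples.

    attributes_in="row":    the first row holds attribute names and the
                            first column holds entity names.
    attributes_in="column": the transposed layout (attribute names run
                            down the first column).
    The orientation is supplied by the caller (an assumption of this sketch).
    """
    if attributes_in == "column":
        # Transpose so that both layouts share one code path.
        grid = [list(col) for col in zip(*grid)]
    elif attributes_in != "row":
        raise ValueError("attributes_in must be 'row' or 'column'")

    header, *body = grid
    attributes = header[1:]  # skip the corner cell
    triples: Set[Triple] = set()
    for row in body:
        entity, values = row[0], row[1:]
        for attribute, value in zip(attributes, values):
            triples.add((entity, attribute, value))
    return triples


# Two layouts of the same logical structure (made-up data).
horizontal = [
    ["Item", "Price", "Stock"],
    ["A", "10", "3"],
    ["B", "20", "7"],
]
vertical = [
    ["Item", "A", "B"],
    ["Price", "10", "20"],
    ["Stock", "3", "7"],
]

same = to_relational(horizontal, "row") == to_relational(vertical, "column")
print(same)
```

Both calls produce the same set of triples, which is the sense in which one logical structure underlies several layout structures. A real Web table would also need handling for spanning cells and nested headers, which this sketch leaves out.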

Abstract
Acknowledgements
Content
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Motivation
  1.2 Challenges of Table Understanding
  1.3 Goals
  1.4 Outline of the Thesis
Chapter 2 Background
  2.1 Introduction to Table Characteristics
  2.2 Table Detection Approaches
  2.3 Table Interpretation Approaches
    2.3.1 Module-based Approaches
    2.3.2 Rule-based Approaches
    2.3.3 Grammar-based Approaches
  2.4 Summary of Related Work
Chapter 3 Table Understanding System
  3.1 The Concept of Table Understanding
    3.1.1 Table Definition
    3.1.2 Problem Definition
    3.1.3 Automatic Logical Structure Understanding
  3.2 System Architecture of Table Understanding System
    3.2.1 Table Preprocessor
    3.2.2 Layout Graph Generator
    3.2.3 Structure Description Algorithm
    3.2.4 Table Translator
Chapter 4 Experiments
  4.1 Description of Data Set
  4.2 Evaluation Metrics
  4.3 Experimental Results
Chapter 5 Conclusion
  5.1 Discussion
  5.2 Conclusion
  5.3 Further work
References

