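Graduate student: 陳水石 (Shui-shih Chen)
Thesis title: CRE: An Automatic Citation Record Extractor for Publication List Pages by Using Cluster Ensemble and Repeat Pattern Analysis
Original title: 使用聚類器整合及重複字串序列搜尋技術於個人著述列表網頁中自動化擷取引用文獻資訊
Advisors: 李漢銘 (Hahn-ming Lee)
何建明 (Jan-Ming Ho)
Oral defense committee: 蔡明祺 (Mi-ching Tsai)
莊庭瑞 (Tyng-Ruey Chuang)
鮑興國 (Hsing-Kuo Kenneth Pao)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2008
Academic year of graduation: 96 (ROC calendar)
Language: English
Number of pages: 78
Chinese keywords: 資訊擷取 (information extraction), 聚類器整合 (cluster ensemble), 字串搜尋 (string search)
English keywords: Information extraction, Cluster ensemble, Repeat pattern finding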
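  • With the rapid growth of the Internet and the ease of authoring personal web pages, a researcher's publication record, such as published papers and books, can be obtained from the researcher's personal publication list page. Analyzing these records enables many applications, such as academic community analysis and conflict-of-interest avoidance. However, most publication list pages are laid out by the researchers themselves, so the pages differ from one researcher to another. How to extract useful information from publication list pages efficiently and correctly for analysis therefore remains a challenging research problem.
    In this thesis, we propose an automatic citation record extraction system that extracts citation record information from web pages of different styles without human intervention. From our observations of how publication list pages are laid out, we found that most citation records are presented in a regular arrangement. We therefore propose using a cluster ensemble technique to represent the features of the page content as a continuous sequence of feature symbols, and then using a repeat pattern finding technique to locate the repeated segments in that sequence. These repeated segments may represent the layout patterns of citation records in the page. Finally, we use the characteristics of citations to select, from these candidate patterns, the ones that represent citation records. The experimental analysis shows that the proposed technique can indeed correctly extract citation records from web pages of different styles.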


    Today, a huge number of researchers’ publication list pages are available on the Web. They could be an important resource for many value-added applications, such as citation analysis, conflict-of-interest detection, and academic social networks. Gathering citation records from those pages efficiently and accurately is still a challenging problem, because many of the pages are crafted manually by the researchers themselves, and the page layouts and citation record formats can differ considerably depending on each researcher’s preferences.
    In this thesis, we propose a Citation Record Extractor (CRE) system, which is capable of correctly and automatically extracting citation records presented in the various layouts of publication list pages. Our approach is inspired by the regular arrangement of citation records within a given publication list page. A cluster ensemble framework is adopted to analyze the implicit features of pages, and a repeat pattern finding technique is used to reveal possible patterns that may represent citation records. Extensive experiments are conducted to measure the effects of all parameters and the system performance. The experimental results show that our approach performs stably and well (with an average F-measure of 88.9%) and outperforms the MDR [33] system and two naïve systems that are based on the DOM tree structure and a visual separation assumption, respectively. We also provide an analysis of our dataset to show the richness of the publication information contained in publication list pages.
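
    The repeat pattern finding step mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes the page has already been turned into a sequence of symbols (one per HTML object, e.g. a cluster label), it matches repeats exactly rather than by alignment, and it ranks candidates only by how many symbols they cover. The function names `find_repeats` and `best_pattern` and the single-letter symbols are hypothetical.

```python
from collections import defaultdict


def find_repeats(seq, min_len=2, min_count=2):
    """Return {pattern: [start positions]} for exactly repeated n-grams.

    Assumption: exact matching only; occurrences are kept greedily from
    left to right so that they never overlap each other.
    """
    repeats = {}
    for n in range(min_len, len(seq) // min_count + 1):
        positions = defaultdict(list)
        for i in range(len(seq) - n + 1):
            gram = tuple(seq[i:i + n])
            starts = positions[gram]
            # Skip an occurrence that overlaps the previously kept one.
            if not starts or i >= starts[-1] + n:
                starts.append(i)
        for gram, starts in positions.items():
            if len(starts) >= min_count:
                repeats[gram] = starts
    return repeats


def best_pattern(seq, **kwargs):
    """Pick the repeat covering the most symbols; ties go to the shorter one."""
    repeats = find_repeats(seq, **kwargs)
    if not repeats:
        return None, []
    gram = max(repeats, key=lambda g: (len(g) * len(repeats[g]), -len(g)))
    return gram, repeats[gram]


# Hypothetical page sequence: header (H), navigation (N), then three
# records laid out as author (A), title (T), venue (V), year (Y), footer (F).
page = list("HNATVYATVYATVYF")
pattern, starts = best_pattern(page)
records = [page[s:s + len(pattern)] for s in starts]
print(pattern, starts, records)
```

    In this toy sequence the selected pattern is the four-symbol record layout, and slicing the sequence at each start position yields one candidate record per occurrence. A real page would need approximate matching, since records in the same list often omit or reorder fields.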

    Chapter 1 Introduction
        1.1 The Challenges of Extraction Citation Records
        1.2 Motivations
        1.3 Goals
        1.4 Outline of the Thesis
    Chapter 2 Background
        2.1 Information Extraction
            2.1.1 Wrapper based methods
            2.1.2 Repeat pattern analysis methods
            2.1.3 Visual layout reasoning methods
        2.2 Cluster Ensemble
    Chapter 3 Citation Record Extractor (CRE) System
        3.1 Concept of Proposed Methodology
        3.2 System Architecture
        3.3 Web Page Sequence Builder
            3.3.1 HTML-object representation unit
            3.3.2 Base-cluster builder unit
            3.3.3 Graph-base cluster ensemble unit
            3.3.4 Inter-cluster similarity evaluation unit
            3.3.5 Web page sequencing unit
        3.4 Web Page Repeat Pattern Analyzer Module
            3.4.1 Repetitive pattern finding unit
            3.4.2 Alignment score matrix calculation unit
            3.4.3 Pattern ranking unit
            3.4.4 Record extraction unit
            3.4.5 Non-citation record filter unit
    Chapter 4 Experiments
        4.1 Experiment Design
            4.1.1 Construction of citation record extraction dataset
            4.1.2 Performance evaluation criteria
            4.1.3 Characteristic analysis of citation records extraction dataset
        4.2 Analysis of CRE system
            4.2.1 Influence of the granularity of basic instance
            4.2.2 Influence of pattern ranking methods
            4.2.3 Performance analysis of cluster ensemble
            4.2.4 The efficiency of non-citation record filter unit
        4.3 Performance Test
            4.3.1 Comparison with MDR and two naïve approaches
        4.4 Discussions
    Chapter 5 Conclusions and Future Work
        5.1 Conclusions
        5.2 Future Work
    References

    [1] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander, “OPTICS: Ordering Points to Identify the Clustering Structure,” in Proceedings of the ACM SIGMOD international conference on Management of data, pp. 49-60, 1999.
    [2] Ezekiel F. Adebiyi, Tao Jiang and Michael Kaufmann, “An Efficient Algorithm for Finding Short Approximate Non-tandem Repeats,” Bioinformatics, vol. 17, supplement 1, pp. S5-S12, 2001.
    [3] S. F. Altschul, “Amino Acid Substitution Matrices from an Information Theoretic Perspective,” Journal of Molecular Biology, vol. 219, pp. 555-565, 1991.
    [4] L. D. Baker and A. K. McCallum, “Distributional Clustering of Words for Text Classification,” in Proceedings of ACM Special Interest Group on Information Retrieval (SIGIR), pp. 96-103, 1998.
    [5] Eric Brill, “A Simple Rule-based Part Of Speech Tagger,” in Proceedings of the third Conference on Applied Natural Language Processing, pp. 152-155, 1992.
    [6] Chien-Chih Chen, Kai-Hsiang Yang, Hung-Yu Kao and Jan-Ming Ho, “BibPro: A Citation Parser Based on Sequence Alignment Techniques,” in workshop of Advanced Information Networking and Applications, pp. 1175-1180, 2008.
    [7] E. Cortez, A. S. da Silva, M.A. Gonçalves, F. Mesquita and E. S. de Moura, “FLUX-CIM: Flexible Unsupervised Extraction of Citation Metadata,” in Proceedings of the 2007 conference on Digital libraries, pp. 215-224, 2007.
    [8] C. H. Chang, C. N. Hsu and S. C. Lui, “Automatic Information Extraction from Semi-Structured Web Pages by Pattern Discovery,” Decision Support Systems Journal, vol. 35, Issue 1, pp. 129-147, Apr. 2003.
    [9] Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, “Extracting Content Structure for Web Pages Based on Visual Representation,” in Proceedings of Asia Pacific Web Conference, pp. 406-417, 2003.
    [10] Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma, “VIPS: a Vision-based Page Segmentation Algorithm,” Microsoft Research Technical Report, MSR-TR-2003-79, 2003.
    [11] W. Cohen, M. Hurst and L. Jensen, “A Flexible Learning System for Wrapping Tables and Lists in HTML Documents,” in Proceedings of the 11th International World Wide Web conference, pp. 232-241, 2002.
    [12] S. Chawathe, H. Garcia-Molina and J. Hammer, “The TSIMMIS Project: Integration of Heterogeneous Information Sources,” Journal of Intelligent Information Systems, vol. 8, pp. 117-132, 1997.
    [13] B. Chidlovskii, U. Borghoff, and P. Chevalier, “Towards Sophisticated Wrapping of Web-based Information Repositories,” in Proceedings of the 5th International RIAO Conference, pp. 123-135, 1997.
    [14] Robert D. Cameron, “A Universal Citation Database as a Catalyst for Reform in Scholarly Communication,” First Monday, vol. 2, no. 4, April 7th, 1997.
    [15] Scott Deerwester, Susan T. Dumais, George W. Furnas and Richard Harshman, “Indexing by Latent Semantic Analysis,” Journal of the American Society for Information Science, vol. 41, pp. 391-407, 1990.
    [16] Junlan Feng, Patrick Haffner, Mazin Gilbert, “A Learning Approach to Discovering Web Page Semantic Structures,” in Proceedings of the 8th International Conference on Document Analysis and Recognition, pp. 1055-1059, 2005.
    [17] Xiaoli Zhang Fern and Carla E. Brodley, “Solving Cluster Ensemble Problems by Bipartite Graph Partitioning,” in Proceedings of International Conference on Machine Learning, pp. 281-288, 2004.
    [18] Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog and Bernhard Krüpl, “Towards Domain-Independent Information Extraction from Web Tables,” in Proceedings of the 16th International World Wide Web Conference (WWW), pp. 71-80, 2007.
    [19] C. Lee Giles, Kurt D. Bollacker and Steve Lawrence, “CiteSeer: An Automatic Citation Indexing System,” in International Conference on Digital Libraries of the third ACM conference on Digital libraries, pp. 89-98, 1998.
    [20] Dan Gusfield, “Algorithms on Strings, Trees, and Sequences,” Cambridge University Press, New York, ISBN 0521585198, 1997.
    [21] D. W. Goodall, “A New Similarity Index Based on Probability,” Biometrics, vol. 22, pp. 882-907, 1966.
    [22] Eugene Garfield, “The Concept of Citation Indexing: A Unique and Innovative Tool for Navigating the Research Literature,” Current Contents, January 3, 1994.
    [23] Jaekyu Ha, R.M. Haralick, I. T. Phillips, “Recursive X-Y Cut Using Bounding Boxes of Connected Components,” in Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 952-955, 2005.
    [24] Hui Han, C. Lee Giles, Eren Manavoglu, Hongyuan Zha, Zhenyue Zhang and Edward A. Fox, “Automatic Document Metadata Extraction using Support Vector Machines,” in Proceedings of Joint Conference on Digital Libraries, pp. 37-48, May 2003.
    [25] H. Han, C. L. Giles, E. Manavoglu, H. Zha, Z. Zhang and E. A. Fox, “Automatic Document Metadata Extraction Using Support Vector Machines,” in Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital libraries, pp. 37-48, 2003.
    [26] Amy M. Hauth and Deborah A. Joseph, “Beyond Tandem Repeats: Complex Pattern Structures and Distant Regions of Similarity,” Bioinformatics, vol. 18, supplement 1, pp. S31-S37, 2002.
    [27] C. N. Hsu and M. T. Dung, “Generating Finite-state Transducers for Semi-structured Data Extraction from the Web,” Information Systems, vol. 23, issue 8, pp. 521-538, 1998.
    [28] Thorsten Joachims, Tamara Galor and Ron Elber, “Learning to Align Sequences: A Maximum-Margin Approach,” Lecture Notes in Computational Science and Engineering, ISSN 1439-7358, vol. 49, 2006.
    [29] Sheldon Krimsky and L. S. Rothenberg, “Conflict of Interest Policies in Science and Medical Journals: Editorial Practices and Author Disclosures,” Science and Engineering Ethics, vol. 7, no. 2, pp. 205-218, 2007.
    [30] N. Kushmerick, “Wrapper Induction: Efficiency and Expressiveness,” Artificial Intelligence, vol. 118, pp. 15-68, 2000.
    [31] Sampath K. Kannan and Eugene W. Myers, “An Algorithm For Locating Non-Overlapping Regions Of Maximum Alignment Score,” in Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, pp. 74-86, 1993.
    [32] George Karypis and Vipin Kumar, “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs,” SIAM Journal on Scientific Computing, vol. 20, pp. 359-392, 1998.
    [33] Bing Liu, Robert Grossman and Yanhong Zhai, “Mining Data Records in Web Pages,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 601-606, 2003.
    [34] Bing Liu and Yanhong Zhai, “NET - A System for Extracting Web Data from Flat and Nested Data Records,” in Proceedings of the 6th International Conference on Web Information Systems Engineering, pp. 163-168, 2005.
    [35] Cen Li and Gautam Biswas, “Unsupervised Learning with Mixed Numeric and Nominal Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, issue 4, pp. 673-690, July 2002.
    [36] Gad M. Landau, Jeanette P. Schmidt and Dina Sokol, “An Algorithm for Approximate Tandem Repeats,” in Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching, pp. 120-133, 1993.
    [37] Pavel Moravec, Michal Kolovrat and Vaclav Snasel, “LSI vs. Wordnet Ontology in Dimension Reduction for Information Retrieval,” in Proceedings of the Signal-Image Technology and Internet-Based Systems, pp. 254-259, 2004.
    [38] I. Muslea, S. Minton and C. Knoblock, “A Hierarchical Approach to Wrapper Induction,” the third annual conference on Autonomous Agents, pp. 190-197, 1999.
    [39] D. Pinto, A. McCallum, X. Wei and W. Bruce Croft, “Table Extraction Using Conditional Random Fields,” in Proceedings of the 26th ACM SIGIR Special Interest Group on Information Retrieval Conference, pp. 1-4, 2003.
    [40] Jie Ren and Richard N. Taylor, “Automatic and Versatile Publications Ranking for Research Institutions and Scholars,” Communications of the ACM, vol. 50, Issue 6, 2007.
    [41] Dina Sokol, Gary Benson and Justin Tojeira, “Tandem Repeats over the Edit Distance,” in Proceedings European Conference on Computational Biology, vol. 23, pp. e30-e35, 2006.
    [42] Kai Simon and Georg Lausen, “ViPER: Augmenting Automatic Information Extraction with Visual Perceptions,” in Proceedings of International Conference on Information and Knowledge Management (CIKM), pp. 381-388, 2005.
    [43] Fangting Sun and David Fernández-Baca, “Inverse Parametric Sequence Alignment,” Journal of Algorithms, vol. 53, Issue 1, pp. 36-54, October, 2004.
    [44] Alexander Strehl and Joydeep Ghosh, “Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions,” Journal of Machine Learning Research, vol. 3, pp. 583-617, 2002.
    [45] Kai-Hsiang Yang, Jen-Ming Chung and Jan-Ming Ho, “PLF: A Publication List Web Page Finder for Researchers,” IEEE International Conference on Web Intelligence, pp. 295-298, 2007.
    [46] Yanhong Zhai and Bing Liu, “Web Data Extraction Based on Partial Tree Alignment,” in Proceedings of the 16th International World Wide Web Conference (WWW), pp. 76-85, 2005.
    [47] CiteSeer, website: http://citeseer.ist.psu.edu/
    [48] Libra, website: http://libra.msra.cn/
    [49] Cobra: Java HTML Renderer & Parser, website: http://lobobrowser.org/
    [50] StarDict, website: http://stardict.sourceforge.net/index.php
    [51] Apache Lucene, website: http://lucene.apache.org/java/docs/index.html
    [52] ACM Digital Library, website: http://portal.acm.org/dl.cfm
    [53] Elsevier, website: http://www.elsevier.com/wps/find/homepage.cws_home
    [54] IEEE Xplore, website: http://ieeexplore.ieee.org/Xplore/dynhome.jsp
    [55] SpringerLink, website: http://www.springerlink.com
    [56] The DBLP Computer Science Bibliography, website: http://www.informatik.uni-trier.de/~ley/db/
    [57] CiteSeer: Scientific Literature Digital Library, website: http://citeseer.ist.psu.edu/
    [58] Google Scholar, website: http://scholar.google.com.tw
