
Graduate student: 江建毅 (Jian-Yi Jiang)
Thesis title: Extending Citation Relationships from Web Documents for Authorship Disambiguation (引用文獻之作者身分消歧-利用網路文件延伸引用文獻關係之研究)
Advisors: 李漢銘 (Hahn-Ming Lee), 何建明 (Jan-Ming Ho)
Oral examination committee: 廖宜恩 (I-En Liao), 李育杰 (Yuh-Jye Lee), 鮑興國 (Hsing-Kuo Pao)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2006
Academic year of graduation: 94 (ROC calendar)
Language: English
Number of pages: 85
Keywords: web mining, name disambiguation

Scholars' publications are usually recorded as citations, and bibliographic digital libraries index those citations to support convenient search and statistical analysis. However, we observe that bibliographic digital libraries cannot correctly index, by author identity, citations that share the same author name. Two problems cause the incorrect author indices: the information shortage problem and the information ambiguity problem. Because of these two problems, citations written by the same individual may be attributed to different individuals, and citations written by different individuals may be attributed to the same individual.

In this thesis, we propose a novel authorship disambiguation approach that addresses both the information shortage problem and the information ambiguity problem. The approach searches web documents to enrich the information available for each citation, and uses this web information to help disambiguate citation authorship. In addition, a clustering algorithm based on a learned pairing function groups the citations written by the same individual, and a proposed pair filter reduces the influence of highly ambiguous information on the disambiguation result. The experimental results show that the proposed approach reduces the disambiguation errors caused by the information shortage problem, and that applying the proposed pair filter improves disambiguation performance further.
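The pairwise clustering idea described in the abstract can be illustrated with a minimal Python sketch. This record does not specify the thesis's features, classifier settings, or filter rule, so everything below is an assumption made for illustration: `pair_score` is a hypothetical stand-in for the learned pairing function (a plain average of token Jaccard similarities, not a trained classifier), the `web_text` field stands for information gathered from web documents, the pair filter is approximated by skipping pairs whose score lies within `filter_margin` of the linking threshold, and the sample citations are invented.

```python
from itertools import combinations


def token_jaccard(a, b):
    """Jaccard overlap between the lowercase word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def pair_score(c1, c2):
    """Hypothetical stand-in for a learned pairing function.

    Averages the similarity of the citation fields and of the
    web-derived text; a real system would learn this from labeled pairs.
    """
    fields = ("coauthors", "title", "venue", "web_text")
    return sum(token_jaccard(c1[f], c2[f]) for f in fields) / len(fields)


def cluster_citations(citations, link_threshold=0.25, filter_margin=0.05):
    """Group citation indices so that linked pairs share a cluster."""
    parent = list(range(len(citations)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(citations)), 2):
        score = pair_score(citations[i], citations[j])
        # Assumed pair filter: ignore pairs too close to the threshold,
        # since their evidence is treated as highly ambiguous.
        if abs(score - link_threshold) < filter_margin:
            continue
        if score > link_threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(citations)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Invented sample data: three citations that share one author name.
citations = [
    {"coauthors": "a chen b wu",
     "title": "clustering methods for citation indexing",
     "venue": "digital libraries workshop",
     "web_text": "digital libraries machine learning lab"},
    {"coauthors": "b wu",
     "title": "citation indexing with web mining",
     "venue": "digital libraries workshop",
     "web_text": "machine learning lab homepage"},
    {"coauthors": "e park",
     "title": "protein folding simulation",
     "venue": "bioinformatics journal",
     "web_text": "computational biology group"},
]

print(cluster_citations(citations))  # prints [[0, 1], [2]]
```

In this sketch the first two citations are merged because they share a coauthor, a venue, and overlapping web text, while the third stays in its own cluster. Raising `link_threshold` to 0.55 places the first pair inside the filter margin, so the pair is discarded and no citations are merged.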

ABSTRACT
ACKNOWLEDGEMENTS
CONTENT
LIST OF FIGURES AND TABLES
Chapter 1 Introduction
  1.1 The Challenges in Authorship Disambiguation
  1.2 Motivations
  1.3 Goals
  1.4 Contributions
  1.5 Outline of This Thesis
Chapter 2 Background
  2.1 The Name Ambiguity Problem and Related Work
    2.1.1 Information Type
    2.1.2 Similarity Metric
    2.1.3 Disambiguation Process
  2.2 The Classifier in Proposed Disambiguation Approach
Chapter 3 Proposed Disambiguation Approach
  3.1 Concept of Proposed Disambiguation Approach
  3.2 System Architecture
    3.2.1 Citation Feature Extractor and Web Feature Generator
    3.2.2 Similarity Calculator
    3.2.3 Binary Classifier and Cluster Builder
    3.2.4 Pair Filter
  3.3 Characteristics of Proposed Approach
Chapter 4 Experiments
  4.1 Experimental Data
  4.2 Evaluation Design
    4.2.1 Cluster Precision
    4.2.2 Cluster Recall
    4.2.3 Disambiguation Accuracy
  4.3 Experimental Results
    4.3.1 The Influences of Attributes and Similarity Metrics on Authorship Disambiguation
    4.3.2 The Performance of Proposed Disambiguation Approach
  4.4 Discussions
Chapter 5 Conclusion and Further Work
  5.1 Conclusion
  5.2 Further Work
References

