Graduate student: Jian-Yi Jiang (江建毅)
Thesis title: Extending Citation Relationships from Web Documents for Authorship Disambiguation (引用文獻之作者身分消歧-利用網路文件延伸引用文獻關係之研究)
Advisors: Hahn-Ming Lee (李漢銘), Jan-Ming Ho
Oral defense committee: I-En Liao (廖宜恩), Yuh-Jye Lee, Hsing-Kuo Pao
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2006
Academic year of graduation: 94
Language: English
Number of pages: 85
Chinese keywords: 網路探勘, 姓名消歧
English keywords: web mining, name disambiguation
Scholarly publications are usually recorded in citation format. Bibliographic digital libraries index these citations to support convenient search and statistical analysis. However, we observe that bibliographic digital libraries cannot correctly index citations that share the same author name by authorship. Two problems cause the incorrect author indices: the information shortage problem and the information ambiguity problem. Because of these two problems, citations authored by the same individual may be attributed to different individuals, and citations authored by different individuals may be attributed to the same individual.

In this thesis, we propose a novel authorship disambiguation approach that addresses both the information shortage problem and the information ambiguity problem. The approach searches for web documents to enrich the citation information and uses this web information to help disambiguate authorship in citations. In addition, a clustering algorithm based on a learned pairing function groups the citations authored by the same individual, and a proposed pair filter alleviates the influence of highly ambiguous information on authorship disambiguation. The experimental results show that the proposed approach improves the disambiguation results degraded by the information shortage problem, and that performance improves further when the proposed pair filter is used.

ABSTRACT
ACKNOWLEDGEMENTS
CONTENT
LIST OF FIGURES AND TABLES
Chapter 1 Introduction
  1.1 The Challenges in Authorship Disambiguation
  1.2 Motivations
  1.3 Goals
  1.4 Contributions
  1.5 Outline of This Thesis
Chapter 2 Background
  2.1 The Name Ambiguity Problem and Related Work
    2.1.1 Information Type
    2.1.2 Similarity Metric
    2.1.3 Disambiguation Process
  2.2 The Classifier in Proposed Disambiguation Approach
Chapter 3 Proposed Disambiguation Approach
  3.1 Concept of Proposed Disambiguation Approach
  3.2 System Architecture
    3.2.1 Citation Feature Extractor and Web Feature Generator
    3.2.2 Similarity Calculator
    3.2.3 Binary Classifier and Cluster Builder
    3.2.4 Pair Filter
  3.3 Characteristics of Proposed Approach
Chapter 4 Experiments
  4.1 Experimental Data
  4.2 Evaluation Design
    4.2.1 Cluster Precision
    4.2.2 Cluster Recall
    4.2.3 Disambiguation Accuracy
  4.3 Experimental Results
    4.3.1 The Influences of Attributes and Similarity Metrics on Authorship Disambiguation
    4.3.2 The Performance of Proposed Disambiguation Approach
  4.4 Discussions
Chapter 5 Conclusion and Further Work
  5.1 Conclusion
  5.2 Further Work
References

