Graduate student: Chi-lin Hsu (許齊麟)
Thesis title: Using Wikipedia and Language Model to Analyze Academic Expertise without Historical Corpus (在沒有參考文集下使用維基百科及語言模型對學術論文進行個人專長分類)
Advisor: Yi-leh Wu (吳怡樂)
Oral defense committee: Jan-ming Ho (何建明), Cheng-yuan Tang, Yuh-jye Lee, Hsing-kuo Pao
Degree: Master's
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: English
Number of pages: 34
Chinese keywords: text mining (文字探勘), expert classification (專家分類), language model (語言模型), Wikipedia (維基百科)
English keywords: Text mining, expertise classification, language modeling, Wikipedia
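This thesis designs a simple and convenient modeling technique that analyzes the academic papers a person has published and classifies that person's areas of expertise. It applies language modeling as used in information retrieval, grounded in statistical theory, and combines it with Wikipedia, a product of the recently emerged Web 2.0 concept, using Wikipedia's rapidly updated and evolving content as the corpus. Used alone, without any historical corpus, this approach already yields fairly good results. The thesis also experiments with keywords as the feature for analysis; keywords are an item that has been attached to indexed papers over the past few decades. The technique can be used to assess and verify a person's areas of academic expertise and can be extended to expert search.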

In this work, we propose a simple and convenient modeling technique that analyzes the academic papers a person has published in order to determine his or her areas of expertise. We apply language modeling as used in information retrieval, which rests on a statistical foundation, and use Wikipedia as the corpus. Wikipedia follows the Web 2.0 concept and has the merits of frequent updates and continuous evolution. Our experiments suggest that using only Wikipedia as the corpus, without any historical corpus, can produce satisfactory results. In addition, we propose using keywords as a feature, since keywords have commonly accompanied journal papers in recent decades. The proposed model can be used to evaluate and validate a person's academic expertise, and can further be applied to expert search engines.
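The general idea summarized in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's actual model, smoothing method, preprocessing, or category set, so everything below is an assumption made for illustration: a unigram query-likelihood model with Jelinek-Mercer smoothing is swapped in as one common language-modeling approach from information retrieval, the function names `tokenize` and `rank_categories` are hypothetical, and the two tiny category texts merely stand in for text that would be gathered from Wikipedia categories.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs only.

    A crude stand-in for the filtering and stemming steps a real
    text-mining pipeline would apply.
    """
    return re.findall(r"[a-z]+", text.lower())


def rank_categories(query, category_texts, lam=0.5):
    """Rank categories by the smoothed log-likelihood of the query terms.

    category_texts maps a category name to the text collected for it
    (assumed here to come from Wikipedia). lam is the Jelinek-Mercer
    mixing weight; 0.5 is an arbitrary illustrative value.
    """
    # One unigram count table per category.
    models = {name: Counter(tokenize(text)) for name, text in category_texts.items()}

    # Background model: counts pooled over all categories.
    background = Counter()
    for counts in models.values():
        background.update(counts)
    bg_total = sum(background.values())

    query_terms = tokenize(query)
    scores = {}
    for name, counts in models.items():
        total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            p_bg = background[term] / bg_total if bg_total else 0.0
            if p_bg == 0.0:
                # Term never seen in any category: it cannot help
                # discriminate between categories, so skip it.
                continue
            p_cat = counts[term] / total if total else 0.0
            score += math.log(lam * p_cat + (1.0 - lam) * p_bg)
        scores[name] = score

    # Highest log-likelihood first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins for per-category Wikipedia text (not real data).
toy_categories = {
    "Data mining": "data mining clustering classification association rules data",
    "Computer networks": "network routing protocol packet network congestion",
}

# Keywords of a hypothetical paper used as the query.
ranking = rank_categories("clustering and classification of data", toy_categories)
for name, score in ranking:
    print(name, round(score, 3))
```

With these toy inputs, "Data mining" ranks first because every matching query term has a non-zero probability under that category's model. Using a paper's keywords rather than its full abstract as the query mirrors the feature the abstract proposes, but how the thesis weights or combines the two is not stated in this record.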

Chinese Abstract 4
Abstract 5
Acknowledgement 6
Chapter 1 Introduction 8
  §1-1 Motivation 8
  §1-2 Related Literatures 9
  §1-3 Research Method 9
  §1-4 Organization of This Thesis 11
  §2-1 Text Mining 12
    §2-1-1 Filtering 13
    §2-1-2 Stemming 13
    §2-1-3 Index Term Selection 14
  §2-2 Language modeling 15
  §2-3 Wikipedia 17
    §2-2-1 Categories in Wikipedia 17
    §2-2-2 Corpus of Language Modeling 18
Chapter 3 Experiment 20
  §3-1 Dataset 20
  §3-2 Process and Result 21
    §3-2-1 Preprocess the Dataset 21
    §3-2-2 Preprocess the Wikipedia 21
    §3-2-3 Language Modeling and Validation 22
      §3-2-3-1 Entropy of Wikipedia 22
      §3-2-3-2 Abstract and Keyword 22
    §3-2-4 Classification 26
    §3-2-5 Expertise Analysis with Real World Data 27
Chapter 4 Conclusion 31
References 32

