
Graduate student: 許齊麟
Chi-lin Hsu
Thesis title: 在沒有參考文集下使用維基百科及語言模型對學術論文進行個人專長分類
Using Wikipedia and Language Model to Analyze Academic Expertise without Historical Corpus
Advisor: 吳怡樂
Yi-leh Wu
Oral defense committee: 何建明
Jan-ming Ho
唐政元
Cheng-yuan Tang
李育杰
Yuh-jye Lee
鮑興國
Hsing-kuo Pao
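Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar; 2006-2007)
Language: English
Number of pages: 34
Chinese keywords: 文字探勘, 專家分類, 語言模型, 維基百科
English keywords: Text mining, expertise classification, language modeling, Wikipedia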

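This thesis designs a simple and convenient modeling technique that analyzes the academic papers an individual has published and classifies that individual's areas of expertise. It adopts language modeling as applied in information retrieval, building on its foundation in statistical theory, and combines it with Wikipedia, a product of the recently emerged Web 2.0 concept, whose rapid updates and evolving nature make it suitable as a corpus. Used on its own, without any historical corpus, this approach already gives good results. The thesis also tries using keywords as the feature for analysis; keywords are an item that has been added separately to papers when they are indexed over the past few decades. The technique can be used to evaluate and verify an individual's areas of expertise in academia, and can further be used for expert search.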


In this work, we propose a simple and convenient modeling technique that analyzes the academic papers a person has published in order to determine his or her areas of expertise. We apply language modeling, as used in information retrieval and grounded in statistical theory, and use Wikipedia as the corpus. Wikipedia, a product of the Web 2.0 concept, is updated frequently and evolves over time. Our experiments suggest that using Wikipedia alone as the corpus, without any historical corpus, produces satisfactory results. In addition, we propose using keywords as a feature; keywords have commonly been attached to academic journal papers in recent years. The proposed model can be used to evaluate and validate a person's academic expertise, and it can further be applied in an expertise search engine.
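
The abstract does not say which scoring function or smoothing method the thesis uses. As a rough illustration only, the sketch below assumes a unigram query-likelihood language model with Jelinek-Mercer smoothing, one common choice in language-model retrieval. The category names, the toy category texts standing in for Wikipedia category articles, and the helper names `unigram_model`, `log_likelihood`, and `classify` are all hypothetical.

```python
import math
from collections import Counter


def unigram_model(text):
    """Relative term frequencies of a lowercased, whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def log_likelihood(terms, category_model, background_model, lam=0.5):
    """log P(terms | category) under Jelinek-Mercer smoothing (assumed method)."""
    score = 0.0
    for term in terms:
        p = (lam * category_model.get(term, 0.0)
             + (1.0 - lam) * background_model.get(term, 0.0))
        if p > 0.0:  # terms unseen in every category are skipped
            score += math.log(p)
    return score


def classify(keywords, category_texts, lam=0.5):
    """Return the best-scoring category and the score of every category."""
    # The background model is estimated from all category texts together.
    background = unigram_model(" ".join(category_texts.values()))
    models = {name: unigram_model(text) for name, text in category_texts.items()}
    terms = " ".join(keywords).lower().split()
    scores = {name: log_likelihood(terms, model, background, lam)
              for name, model in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical toy texts standing in for Wikipedia category articles.
CATEGORY_TEXTS = {
    "information retrieval":
        "search engine query document index ranking retrieval relevance",
    "machine learning":
        "training classifier kernel support vector learning model regression",
}

best, scores = classify(["query", "ranking", "index"], CATEGORY_TEXTS)
print(best)
```

In this sketch the keywords of a paper play the role of the query, and each category is scored by how likely its language model is to have generated them. Smoothing against the background model keeps a category from being ruled out by a single unseen keyword.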

Chinese Abstract
Abstract
Acknowledgement
Chapter 1 Introduction
§1-1 Motivation
§1-2 Related Literatures
§1-3 Research Method
§1-4 Organization of This Thesis
§2-1 Text Mining
§2-1-1 Filtering
§2-1-2 Stemming
§2-1-3 Index Term Selection
§2-2 Language modeling
§2-3 Wikipedia
§2-2-1 Categories in Wikipedia
§2-2-2 Corpus of Language Modeling
Chapter 3 Experiment
§3-1 Dataset
§3-2 Process and Result
§3-2-1 Preprocess the Dataset
§3-2-2 Preprocess the Wikipedia
§3-2-3 Language Modeling and Validation
§3-2-3-1 Entropy of Wikipedia
§3-2-3-2 Abstract and Keyword
§3-2-4 Classification
§3-2-5 Expertise Analysis with Real World Data
Chapter 4 Conclusion
References

