
Graduate student: 周家慶 (Chia-Ching Chou)
Thesis title: 基於語言模型及社群網路分析之權威專家搜尋系統
AEFS: Authoritative Expert Finding System Based on a Language Model and Social Network Analysis
Advisor: 李漢銘 (Hahn-Ming Lee)
Oral defense committee: 王勝德 (Sheng-De Wang)
王榮英 (Jung-Ying Wang)
何建明 (Jan-Ming Ho)
李育杰 (Yuh-Jye Lee)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: English
Number of pages: 67
Chinese keywords: 專家搜尋, 語言模型, 社群網路分析
English keywords: expert finding, language model, social network analysis

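Finding experts on a given topic is a pressing problem in many real-world situations, for example when looking for collaborators or for a professional to solve a particular problem. Nevertheless, previous research has focused only on how often an expert candidate's name appears in topic-relevant documents to decide whether the candidate has expertise matching the given topic. Experts found this way may not be the most suitable, because there is no way to confirm that the candidate has reliable authority on the given topic. To solve this problem, we propose an expert finding system called the “Authoritative Expert Finding System”.
The system determines experts' expertise from the quality and authority of their publications. It uses non-textual information, namely the impact factor, to determine the quality of publications, and proposes a citation matching method based on concepts from social network analysis to remove duplicate entries. In our experiments, comparison with related methods shows that: (1) the proposed method achieves good performance in terms of F-measure, the harmonic mean of precision and recall; (2) the citation matching method can reduce the number of training examples; and (3) non-textual information is useful for finding experts.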


Searching for experts on a given topic is a critical problem in many real-world situations, such as finding collaborators or speakers. Even so, previous work has focused only on searching for experts based on the appearance of the topic query in an organization’s documents, which means that the experts selected might not be suitable for the task at hand.

To resolve this problem, we propose an Authoritative Expert Finding System, called AEFS, which ranks the publications of experts to indicate their level of expertise. AEFS uses non-textual information, such as impact factor, to represent the quality of publications, and provides a citation matching function that removes duplicated citations based on the concept of centrality in social network analysis (SNA). In our experiments, we compare a number of related approaches to show that: (1) the proposed approach achieves good performance in terms of the average F-measure; (2) citation matching can reduce the number of training examples required; and (3) impact factor is very effective for finding experts.
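As a rough illustration of two quantities named in the abstract, the Python sketch below computes normalized degree centrality on a small graph of candidate duplicate citations, and the F-measure of a set of predicted duplicate pairs. It is not the thesis's implementation: the function names, the toy citation identifiers (`c1` to `c4`), and the choice of degree centrality (rather than betweenness or closeness) are assumptions made only for this example.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality: degree / (n - 1) for each node."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall over two sets of items."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: an edge links two citation records that look
# similar enough to be candidate duplicates.
candidate_pairs = [("c1", "c2"), ("c1", "c3"), ("c1", "c4"), ("c2", "c3")]
centrality = degree_centrality(candidate_pairs)

# Hypothetical predicted vs. true duplicate pairs.
predicted_pairs = {("c1", "c2"), ("c1", "c3"), ("c1", "c4")}
true_pairs = {("c1", "c2"), ("c1", "c3"), ("c2", "c3")}
score = f_measure(predicted_pairs, true_pairs)
```

In this toy graph `c1` is linked to every other record, so it has the highest centrality and would be a natural representative for its cluster of duplicates. How AEFS actually uses centrality to generate pairs and build clusters is described in the thesis itself, not in this record.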

1 Introduction
  1.1 Motivation
  1.2 The Challenges in Expert Finding
  1.3 Goals
  1.4 Outlines of the Thesis
2 Background
  2.1 Similarity Metrics
    2.1.1 Character-based Similarity Metrics
    2.1.2 Token-based Similarity Metrics
    2.1.3 Hybrid Similarity Metrics
  2.2 Learned Pairing Function
  2.3 Social Network Analysis
    2.3.1 Degree Centrality
    2.3.2 Betweenness Centrality
    2.3.3 Closeness Centrality
  2.4 Language Model
    2.4.1 Smoothing Methods
  2.5 Expert Finding System
3 System Architecture
  3.1 Citation Matching
    3.1.1 Social Network Centrality Pair Generator
    3.1.2 Attribute Dependency Similarity Calculator
    3.1.3 Binary Classifier and Social Network Centrality Cluster Builder
  3.2 Citation Ranking
  3.3 Expert Ranking
4 Experiments
  4.1 Dataset
  4.2 Experimental Methodology
  4.3 The Performance of Citation Matching
    4.3.1 Compare with Other Methods
    4.3.2 The Performance of Attribute Dependency Calculator
    4.3.3 The Efficiency of CMSNC in Training Phase
  4.4 The Performances of Expert Finding Task
    4.4.1 The Performances of the Different Queries in Each Topic
    4.4.2 The Effectives of Non-textual Feature in Expert Finding Task
    4.4.3 Smoothing
  4.5 Discussion
5 Conclusion and Further Work
  5.1 Conclusion
  5.2 Further Work

