
Graduate student: 邱坤彥
Kun-Yan Chiou
Thesis title: 網際網路人名實體對應消歧-利用社會網路關係之鏈結基礎與內容基礎資訊
Disambiguation Web Appearance of People by Exploiting Link-based and Content-based Information of Social Network Relation
Advisor: 李漢銘
Hahn-Ming Lee
Oral defense committee: 王勝德
Sheng-De Wang
何建明
Jan-Ming Ho
李育杰
Yuh-Jye Lee
王榮英
Jung-Ying Wang
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2007
Academic year of graduation: 95
Language: English
Number of pages: 79
Chinese keywords (translated): identical personal names, link-based, content-based, social network
English keywords: name uncertainty, link-based, network motif
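  • Searching the Internet for information about a person is a convenient and common activity. However, with the rapid growth of online communities, large numbers of web pages related to personal names are produced continuously, and users must spend considerable time browsing and filtering the personal-name data they retrieve. The problem arises mainly because personal names are not unique: an Internet query returns many people who share the same name, among them the entity the user is actually interested in. Many search engines currently try to classify named entities, but classification for personal names is still scarce and its quality leaves room for improvement. We therefore attempt an entity-based clustering of the web pages that a search returns for a personal name. The result reduces the time users spend searching for personal-name data, lets them quickly extract public information about a person from a large number of pages, and in turn helps them obtain related information about, or contact details for, the person they are looking for.
    Methods that compute the similarity between pages using specific template rules or traditional document vectorization are common, but they have limitations in real applications. This thesis proposes a method based on the link characteristics of the virtual social network on the Internet: clustering is achieved through the hyperlinks in web pages, and anchor text is further used to assign some of the pages that contain no link information to the appropriate clusters. The greatest advantage of this method is that it requires no additional information from the user, which removes the need for the experience and knowledge that choosing extra keywords demands. Experiments show that the proposed method achieves good clustering results.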


    Searching for personal data about people of interest is one of the most popular types of search activity. However, with the exponential growth of the WWW, users must spend considerable time filtering out uninteresting web pages. This is because personal names are not unique: the web pages in a search result refer to many different individuals who share the same name. Many search engines currently provide functions for automatically classifying web pages, but few offer classification by personal name. We therefore aim to provide a system that automatically groups web pages according to the individual they refer to. The grouping reduces the time users spend searching, so they can efficiently browse the public web pages that refer to the target individual.
    Many previous studies used predefined template rules or the traditional content-based vector space model (VSM) to measure the similarity between web pages, but these methods have limitations in real-world use. In this thesis, we propose a link-based method that exploits social network relations in the WWW to group the retrieved web pages by the individual they refer to. In addition, content-based information is used to collect the web pages that have few link relations. The proposed method needs no additional background knowledge about the target personal name, which spares users the difficulty of choosing additional query keywords. The experimental results show that the proposed method generally performs well.
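
    The grouping step described in the abstract can be illustrated with a minimal sketch. It assumes that pairwise relations between result pages have already been extracted (from link structure or from anchor text) and that pages connected by a chain of relations belong to the same individual. The union-find grouping, the function name `group_pages`, and the sample data are illustrative assumptions; the record does not specify the thesis's actual grouping procedure.

```python
from collections import defaultdict


def group_pages(pages, pairwise_relations):
    """Group pages so that pages joined by a chain of relations share a cluster.

    This is a plain connected-components grouping (union-find), used here
    only as an assumed stand-in for the thesis's grouping step.
    """
    parent = {page: page for page in pages}

    def find(page):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for a, b in pairwise_relations:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for page in pages:
        clusters[find(page)].append(page)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Hypothetical search-result pages and relations (not data from the thesis).
pages = ["p1", "p2", "p3", "p4", "p5"]
link_relations = [("p1", "p2")]    # pairs related through hyperlink structure
anchor_relations = [("p2", "p3")]  # pairs related through anchor text
groups = group_pages(pages, link_relations + anchor_relations)
print(groups)
```

    In this sketch, pages with no relation to any other page stay in singleton groups; a real system would need a policy for such pages, which the abstract says is partly handled through content-based information.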

    Abstract
    Acknowledgements
    Content
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Introduction
        1.2 Motivation and Problem Statement
        1.3 Goal and Design
        1.4 Outline of This Thesis
    Chapter 2 Background
        2.1 A Short Review of Identity Uncertainty Problem
        2.2 Previous Research for Personal Name Uncertainty Problem
        2.3 Affection Factors on the Personal Name Uncertainty Problem
        2.4 Features in Web Page
        2.5 Network Structure of the WWW
        2.6 Link Motivation of Hyperlinks in Web Page
    Chapter 3 Disambiguation Web Appearance of Personal Name System
        3.1 The Concept of Operation Hyperlink and Anchor Text in DWAP
        3.2 Overview of the DWAP System
        3.3 The LSRE Module
            3.3.1 A Link Structure Construction and Link Data Database Usage
            3.3.2 Filtering Noisy Hyperlinks
            3.3.3 Network Motif Types
            3.3.4 Retrieving Network Motifs Detected in the Relations of R-Pages
        3.4 Anchor Text Parsing and Pairwise Relation Extraction
            3.4.1 Term Definitions and Description of Anchor Text
            3.4.2 Detection of the Pairwise Relations of R-Pages
            3.4.3 Solving Problems Caused by Using Anchor Text
        3.5 Grouping R-pages
            3.5.1 Combining the Retrieved Pairwise Relations of Network Motifs and Anchor Texts
        3.6 Characteristics of the DWAP System
            3.6.1 Comparison with Other Methods
    Chapter 4 Experiments
        4.1 Experimental data
        4.2 Evaluation Approach
        4.3 Experiment Results
            4.3.1 Effects of Different Network Motifs
            4.3.2 Effects of Different Motif Detection Methods
            4.3.3 Effects of Filtering Noisy Hyperlinks
            4.3.4 Effects of Different Extension Processes
            4.3.5 Combination of Network Motifs
            4.3.6 Combining Anchor Text with Network Motifs
    Chapter 5 Conclusion and Further Work

