
Author: 周麗玲
Li-Ling Chou
Thesis Title: 超文字與關鍵字相關度為基礎之主題式查詢-應用於網頁資訊檢索
Topic Hierarchy Generation Based on Anchor Text and Term-correlation
Advisor: 李漢銘
Hahn-Ming Lee
Committee: 許清琦
Ching-Chi Hsu
何建明
Jan-Ming Ho
何正信
Cheng-Seen Ho
李育杰
Yuh-Jye Lee
Degree: 碩士
Master
Department: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
Thesis Publication Year: 2005
Graduation Academic Year: 93
Language: English
Pages: 51
Keywords (in Chinese): 網頁目錄搜尋, 關鍵字, 超文字, 搜尋引擎
Keywords (in other languages): Topic Directory Query, Term-correlation, Anchor Text, Search Engine
  • 隨著網際網路的蓬勃發展,資訊越來越多元化,許多使用者借由網路資訊去取得新技術資料及課程內容。透過搜尋引擎提供網頁目錄查詢服務,並幫助使用者在很短的時間對此技術包含那些相關之子技術有所了解,即便成了一個重要服務。
    在此篇論文中,我們提出以超文字與關鍵字相關度的技術用來建構階層式主題搜尋,讓使用者迅速有效的熟知涵蓋主題範圍,有別於手動機制之建立。在我們的實驗分析結果,證明我們所提出的方法能有效搜尋相關階層式子技術,特別是那些主題無法經由手動機制建立的搜尋引擎得到服務,卻可以在本系統中得到較好之服務。因此有關階層式主題搜尋拓展研究的未來發展,仍然有許多問題在本篇論文中,如相關精確度與新主題搜尋問題尚需解決。
    然而我們期望藉由階層式主題搜尋的推廣,讓每個人學習新技術將不再是困擾,此外也希望本篇論文中提到之問題可以繼續被研究下去。


    As the Internet grows, a wide variety of information becomes available to users, such as information on new techniques and course content. Providing a "Topic Directory Query" service, which helps users understand within a short time the subtopics relevant to a technique of interest, has therefore become an important task.
    In this thesis, we propose an approach that uses anchor text and term correlation to construct and generate a topic hierarchy, so that users can search the scope of a topic of interest effectively and efficiently. This differs from manually constructed topic hierarchies such as the Open Directory Project or the Yahoo Web Directory. Our experimental analysis shows that the proposed approach is effective at finding relevant hierarchical subtopics, especially for topics that cannot be found in the manually constructed topic directories mentioned above but can be found in our system. Nevertheless, many issues concerning the "Topic Directory Query" service remain to be resolved, such as improving precision and detecting new topics.
    We still hope that learning new techniques will no longer be troublesome for anyone. Furthermore, by promoting the concept of topic hierarchy generation, we hope that the issues mentioned above will continue to be studied.
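
The abstract names two ingredients, anchor text and term correlation, without giving the algorithm itself. The following is only a minimal sketch of how such ingredients could be combined, not the thesis's method. All function names are hypothetical. The correlation measure (a conditional co-occurrence probability over anchor texts), the parent-child rule (a co-occurrence subsumption heuristic), and the 0.8 threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the visible text of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []    # finished, normalized anchor texts
        self._buffer = None  # text pieces of the anchor currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            # Collapse whitespace and lowercase the anchor text.
            text = " ".join("".join(self._buffer).split()).lower()
            if text:
                self.anchors.append(text)
            self._buffer = None


def extract_anchor_texts(html):
    """Return the normalized anchor texts found in an HTML string."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.anchors


def term_correlation(anchor_texts):
    """Estimate P(a | b): the share of anchors containing b that also contain a."""
    term_freq = defaultdict(int)
    pair_freq = defaultdict(int)
    for text in anchor_texts:
        terms = set(text.split())
        for a in terms:
            term_freq[a] += 1
            for b in terms:
                if a != b:
                    pair_freq[(a, b)] += 1
    return {(a, b): n / term_freq[b] for (a, b), n in pair_freq.items()}


def build_hierarchy(correlation, threshold=0.8):
    """Assumed rule: parent subsumes child when
    P(parent | child) >= threshold and P(child | parent) < 1."""
    children = defaultdict(list)
    for (parent, child), p in correlation.items():
        if p >= threshold and correlation.get((child, parent), 0.0) < 1.0:
            children[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children.items()}
```

For example, feeding the anchor texts "data mining", "data clustering", and "data mining tools" through `term_correlation` and `build_hierarchy` places "mining", "clustering", and "tools" under "data", and "tools" under "mining". A real system would also need the crawling, preprocessing, and noisy-term filtering stages that this sketch leaves out.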

    Contents
    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Motivation
      1.2 Challenges
        1.2.1 Definition of Query Scope
        1.2.2 Hierarchical Structure Generator Issues
      1.3 Our Goal and Design
      1.4 Outlines
    Chapter 2 Background
      2.1 Basic Definition
      2.2 Introduction of Search Engines
        2.2.1 Google
        2.2.2 Web Crawler
        2.2.3 IBM Focused Crawler
    Chapter 3 System Architecture
      3.1 Concept of Hierarchical Structure Generator
      3.2 Architecture of Hierarchical Structure Generator
        3.2.1 Crawler Agent
        3.2.2 Data Preprocessing Unit
        3.2.3 Noisy Terms Finder
        3.2.4 Candidate Terms Finder
        3.2.5 Correlation Analysis Unit
        3.2.6 Structure Generator
        3.2.7 Interface Agent
      3.3 Hierarchical Structure Generator (HSG) Program
    Chapter 4 Experiment
      4.1 Characteristics of Experimental Datasets
      4.2 Criteria Evaluation
      4.3 Experimental Results
      4.4 Discussion
        4.4.1 Characteristics of our proposed method
        4.4.2 Limitations of our proposed method
    Chapter 5 Conclusion
      5.1 Conclusion
      5.2 Future Work


