Topic Hierarchy Generation Based on Anchor Text and Term-correlation


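Graduate student:	周麗玲 Li-Ling Chou
Thesis title:	Topic Hierarchy Generation Based on Anchor Text and Term-correlation (超文字與關鍵字相關度為基礎之主題式查詢-應用於網頁資訊檢索)
Advisor:	李漢銘 Hahn-Ming Lee
Oral examination committee:	許清琦 Ching-Chi Hsu, 何建明 Jan-Ming Ho, 何正信 Cheng-Seen Ho, 李育杰 Yuh-Jye Lee
Degree:	Master
Department:	College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of publication:	2005
Academic year of graduation:	93 (ROC calendar)
Language:	English
Number of pages:	51
Keywords (translated from Chinese):	web directory search, keywords, hypertext, search engine
Keywords (English):	Topic Directory Query, Term-correlation, Anchor Text, Search Engine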

As the Internet grows rapidly, an increasingly diverse range of information is available to users, such as material on new techniques and course content. Providing a "Topic Directory Query" service, which helps users understand within a short time which subtopics are relevant to a technique of interest, has therefore become an important task.
In this thesis, we propose an approach that uses anchor text and term correlation to construct a topic hierarchy, so that users can grasp the scope of a topic of interest effectively and efficiently. This differs from manually constructed topic hierarchies such as the Open Directory Project or the Yahoo Web Directory. Our experimental results show that the proposed approach is effective at finding relevant hierarchical subtopics, especially for topics that cannot be found in the manually constructed topic directories mentioned above but can be found in our system. Nevertheless, many issues concerning the "Topic Directory Query" service remain to be resolved, such as improving precision and detecting new topics.
We hope that promoting the concept of topic hierarchy generation will make learning new techniques less troublesome for everyone, and that the issues mentioned above will continue to be studied.
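
The abstract does not say how term correlation is computed or how the hierarchy is assembled, so the following is only a minimal sketch under stated assumptions, not the thesis's method. It swaps in co-occurrence-based subsumption, a common way to derive concept hierarchies: term a is placed above term b when a appears in most anchor texts that contain b, but not the other way round. The function name `build_hierarchy`, the 0.8 threshold, and the sample anchor texts are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_hierarchy(anchor_texts, threshold=0.8):
    """Return {parent_term: sorted child terms} derived from anchor texts.

    Illustrative assumption: correlation is measured by co-occurrence
    within the same anchor text, and parent/child links follow a
    subsumption rule. This is not taken from the thesis.
    """
    doc_freq = defaultdict(int)   # number of anchor texts containing a term
    pair_freq = defaultdict(int)  # number of anchor texts containing both terms

    for text in anchor_texts:
        terms = set(text.lower().split())
        for term in terms:
            doc_freq[term] += 1
        # Sort so each unordered pair is always stored under the same key.
        for a, b in combinations(sorted(terms), 2):
            pair_freq[(a, b)] += 1

    children = defaultdict(list)
    for (a, b), both in pair_freq.items():
        p_a_given_b = both / doc_freq[b]
        p_b_given_a = both / doc_freq[a]
        # a subsumes b: a occurs in most texts containing b, but not vice versa.
        if p_a_given_b >= threshold and p_b_given_a < 1.0:
            children[a].append(b)
        elif p_b_given_a >= threshold and p_a_given_b < 1.0:
            children[b].append(a)

    return {parent: sorted(kids) for parent, kids in children.items()}


# Hypothetical anchor texts, for illustration only.
sample = ["clustering kmeans", "clustering hierarchical", "clustering"]
print(build_hierarchy(sample))
```

With the sample input, `kmeans` and `hierarchical` are grouped under `clustering`, because `clustering` occurs in every anchor text that contains either of them while the reverse does not hold. Terms that always occur together get no parent/child link, since neither is more general than the other.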

Content

Abstract
Acknowledgements
Content
List of Figures
List of Tables

Chapter 1 Introduction
1 Motivation
2 Challenges
2.1 Definition of Query Scope
2.2 Hierarchical Structure Generator Issues
3 Our Goal and Design
4 Outlines

Chapter 2 Background
1 Basic Definition
2 Introduction of Search Engines
2.1 Google
2.2 Web Crawler
2.3 IBM Focused Crawler

Chapter 3 System Architecture
1 Concept of Hierarchical Structure Generator
2 Architecture of Hierarchical Structure Generator
2.1 Crawler Agent
2.2 Data Preprocessing Unit
2.3 Noisy Terms Finder
2.4 Candidate Terms Finder
2.5 Correlation Analysis Unit
2.6 Structure Generator
2.7 Interface Agent
3 Hierarchical Structure Generator (HSG) Program

Chapter 4 Experiment
1 Characteristics of Experimental Datasets
2 Criteria Evaluation
3 Experimental Results
4 Discussion
4.1 Characteristics of our proposed method
4.2 Limitations of our proposed method

Chapter 5 Conclusion
1 Conclusion
2 Future Work

                                


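Full text availability: not authorized for public release (campus network)
Full text availability: not authorized for public release (off-campus network)
Full text availability: not authorized for public release (National Central Library: Taiwan master's and doctoral thesis system)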
