簡易檢索 / 詳目顯示

研究生: 林家正
Jia-Jeng Lin
論文名稱: 針對不完備查詢的階層式架構搜尋引擎
Search Engine with Hierarchical Structures for Incomplete Queries
指導教授: 鮑興國
Hsing-Kuo Pao
口試委員: 李育杰
Yuh-Jye Lee
張源俊
none
項天瑞
Tien-Ruey Hsiang
吳怡樂
none
學位類別: 碩士
Master
系所名稱: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
論文出版年: 2006
畢業學年度: 94
語文別: 中文
論文頁數: 34
中文關鍵詞: 分群不完備查詢搜尋引擎階層式架構
外文關鍵詞: clustering, incomplete query, search engine, hierarchical structure
相關次數: 點閱:257下載:2
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報

自從網際網路(Internet) 發展至今, 不過短短二三十年, 但是所擁有的資訊量, 卻是
相當龐大的, 而且還不斷的在持續增加中, 也就因為資訊增加的速度太快, 很多有用的
資訊我們就難以搜尋到; 最早期網際網路尚未普遍時, 網路上的資訊有限, 所以要上網
搜尋資料, 只需記得少數幾個常用的網址, 就能找到資料, 但是找到的資料卻也極其有
限, 而到了現在, 網路的資訊之多之廣, 已經不是少數幾個網址就能夠涵蓋的, 使用者
也不太可能去記得太多的網址, 因此, 搜尋引擎就應運而生了, 透過搜尋引擎, 我們就
能夠快速的尋找到我們所需要的資訊。不完備搜尋引擎是針對不完備查詢而設計的搜
尋引擎, 此搜尋引擎能夠觀察使用者的瀏覽行為以及時間, 來判斷使用者感興趣的資
料是哪些, 然後傳回使用者所感興趣的資料, 以達到節省搜尋時間的目的。但是不完備
搜尋引擎有其缺點在, 第一個缺點, 不完備搜尋引擎在前面幾頁必須要有使用者感興
趣的資料, 否則就沒有辦法發揮作用, 第二個缺點就是不完備搜尋引擎對雜訊非常的
敏感, 如果使用者一開始操作不當的話, 就有可能傳回的資料跟使用者真正想要的會
有很大差距。本研究為了改善不完備搜尋引擎的這兩個缺點, 加入了分群的概念, 提高
了前面幾頁出現使用者感興趣的資料的可能性。又以分群的概念為基礎, 實做出階層
式的架構, 利用階層式的架構特性, 縮小查詢的範圍, 以提高查全率。


In recent decades, Internet grows up quickly and provides very much information. It’s hard for us to find useful information because information is increasing rapidly. There are limited information on Internet before, so we
can find information with few URL. However, the information that we can
find are limited. Until now, the information on the Internet is so much that
it is impossible to use few URL to present these information. Furthermore,
users couldn’t remember too much URL. As a result, users need search engine.
We can find the information we needed quickly with search engine.
”Adaptive Search Engine for Incomplete Queries” is the search engine designed
for incomplete queries. It can observe the users’ behaviors and find
what kind of data users want, then return these data to users to save users’
search time. But this kind of search engine have some problems. First, there
must be positive data in initial pages, or search engine will not work.Second,
it is sensitive to ”noise”. If users didn’t use it well, it will return bad results. Our research use ”clustering” to improve the system. We increase the probability that the positive data appear in initial pages, and we construct the hierarchical structure with clustering. With the property of the hierarchical structure, we can reduce the range of query and increase the recall rate.

1 緒論 1.1 研究動機與目的. . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 研究方法與架構. . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 論文架構. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 相關研究及文獻探討 2.1 搜尋引擎簡介. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1.1 分類目錄式(category and directory search) . . . . . . . . 4 2.1.2 關鍵字查詢式(keyword query search) . . . . . . . . . . . 5 2.1.3 整合搜尋式(meta search) . . . . . . . . . . . . . . . . . 5 2.1.4 專屬搜尋式(specific purpose search) . . . . . . . . . . . 5 2.2 文件分類及向量空間模型(Document Classification and Vector Space Model) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 文件前處理與效能評估. . . . . . . . . . . . . . . . . . . 6 2.2.2 向量空間模型的建立. . . . . . . . . . . . . . . . . . . . 8 2.3 維度降低技術. . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.1 奇異值分解(Singular Value Decomposition) . . . . . . . 11 2.3.2 LLE (Locally Linear Embedding) . . . . . . . . . . . . 13 3 高查全率搜尋引擎與階層式的架構 3.1 不完備查詢. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 階層式架構. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.3 分群的功能. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.4 分群的方法. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.4.1 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.4.2 高斯混合模型:Gaussian Mixture Model . . . . . . . . . . 20 3.4.3 UPGMA (Unweighted Pair-Group Method Using Arithmetic averages) . . . . . . . . . . . . . . . . . . . . . . 22 4 實驗結果 4.1 實驗一. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2 實驗二. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2.1 以K-means 做分群. . . . . . . . . . . . . . . . . . . . . 24 4.2.2 以GMM+EM 做分群. . . . . . . . . . . . . . . . . . . 26 4.2.3 以UPGMA 做分群. . . . . . . . . . . . . . . . . . . . . 28 5 結論 5.1 未來工作. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

[1] Akiko Aizawa. The feature quantity: an information theoretic perspective
of tfidf-like measures. Proceedings of the 23rd annual international
ACM SIGIR conference on Research and development in information
retrieval, pages 104–111, 2000.
[2] R. Bellman. Adaptive Control Processes: A Guided Tour. NJ: Princeton
University Press, 1961.
[3] M. Berry, S. Dumais, T. Landauer, , and G. O‘Brien. Using linear
algebra for intelligent information retrieval. SIAM Review,, 37(4):573–
595, 1995.
[4] J.A. Bilmes. A gentle tutorial of em algorithm and its application to
parameter estimation for gaussian mixture and hidden markov models.
Technical Report, University of Berkeley, ICSI-TR-97-021, 1997.
[5] D. Boley, M. Gini, R. Gross, E.H. Han, K. Hastings, G. Karypis, V. Kumar,
B. Mobasher, and J. Moore. Partitioning-based clustering for
web document categorization. Decision Support Systems, 27(3):329–341,
2000.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual web
search engine. Proceedings of the seventh international conference on
World Wide Web, pages 107–117, 1998.
[7] C.Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for
clustering with the integrated completed likelihood. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 22(7):719–725, 2000.
[8] C.H. Chen. Adaptive search engine for incomplete queries. Master’s
thesis, National Taiwan University of Science and Technology, 2005.
[9] A.P. Dempster, N.M. Laird, and D.B Rubin. Maximum likelihood from
incomplete data via the em algorithm. J.Roy. Stat. Soc., 39(1):1–38,
1977.
[10] Chris Ding, Xiaofeng He, Hongyuan Zha, and Horst D. Simon. Adaptive
dimension reduction for clustering high dimensional data. Proceedings
of the 2002 IEEE International Conference on Data Mining (ICDM’02),
page 147, 2002.
[11] E.M.Voorhees. Implementing agglomerative hierarchical clustering algorithms
for use in document retrieval. Information Processing and
Management, 22:465–475, 1986.
[12] Paolo Ferragina and Antonio Gulli. A personalized search engine based
on web-snippet hierarchical clustering. Special interest tracks and posters
of the 14th international conference on World Wide Web, pages 10–14,
2005.
[13] M. Figueiredo and A.K. Jain. Unsupervised learning of finite mixture
models. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(3):381–396, 2002.
[14] H. Hartley. Maximum likelihood estimation from incomplete data. Biometrics,
14:174–194, 1958.
[15] AK Jain and RC Dubes. Algorithms for clustering data. Prentice-Hall,
Inc. Upper Saddle River, NJ, USA, 1988.
[16] Jyh-Shing Roger Jang. Data Clustering and Pattern Recognition.
[17] B.J. Jansen, A. Spink, J. Bateman, and T. Saracevic. Real life information
retrieval: a study of user queries on the web. ACM SIGIR Forum,
32(1):5–17, 1998.
[18] M. I. Jordan. Learning in Graphical Models. Kluwer Academic Publishers,
1998.
[19] M.I. Mauldin. Lycos: design choices in an internet search service. Expert,
IEEE, 12:8–11, 1997.
[20] S.J. Mckenna, S. Gong, and Y. Raja. Tracking colour objects using
adaptive mixture models. Image and Vision Computing, 17(3-4):225–
231, 1999.
[21] P.Willet. Recent trends in hierarchical document clustering: a critical
review. Information Processing and Management, 24:577–597, 1998.
[22] S. Roberts, D. Husmeier, I. Rezek, and W. Penny. Bayesian approaches
to gaussian mixture modeling. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 20(11):1133–1142, 1998.
[23] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally
linear embedding. Science, 290:2323–2326, 2000.
[24] Lawrence K. Saul and Sam T. Roweis. An introduction to locally linear
embedding. Report at AT & T Labs-Research, 2000.
[25] D. W. Scott. Multivariate Density Estimation. Multivariate Density
Estimation, Wiley, New York, 1992.
[26] C. Silverstein, M. Henzinger, H. Marais, and M. Moricz. Analysis of a
very large altavista query log. SRC Technical Note, 6(2), 1998.
[27] Carlo Tomasi. Estimating gaussian mixture densities with em - a tutorial.
2003.

QR CODE