
Graduate Student: Chih-Sung Lai (賴志松)
Thesis Title: Concept Separation by Manifold Learning Approach
Advisor: Hsing-Kuo Pao (鮑興國)
Committee Members: Yuan-Chin Chang (張源俊), Tyng-Luh Liu (劉庭祿), Yuh-Jye Lee (李育杰), Bi-Ru Dai (戴碧如)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2007
Academic Year of Graduation: 95
Language: English
Number of Pages: 71
Keywords: Incomplete Queries, Concept Separation, Manifold Learning, Clustering, Dimension Reduction, Isomap, Search Engine
Hits: 208 / Downloads: 0

Abstract
We usually rely on effective search engines to make the most of the web's rich information. In many situations, however, the query input is very short, and even a powerful search engine cannot infer the user's actual goal from it. Statistics report that approximately 77% of users submit only a single keyword to search engines, and 85% of users submit queries containing fewer than three keywords. We call such queries incomplete queries. Our goal is to separate the main concept corresponding to a query into sub-concepts, one of which may be related to the user's real interest. Documents usually lie in a very high-dimensional space, where the presence of each keyword contributes a dimension to be considered. We therefore propose a dimension reduction method, based on a manifold learning approach, for the clustering process; we adopt Isomap for dimension reduction. Experimentally, the dimension-reduced dataset produced by Isomap gives a better representation of the data. Moreover, because the data size is reduced, the execution time of the whole mining process can be close to real time. Several variants of Isomap are also studied in this thesis.
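The pipeline the abstract describes (embed the data in a high-dimensional space, reduce the dimension with Isomap, then cluster) can be illustrated with a minimal, self-contained Isomap sketch. This is not the thesis's own implementation; it is an assumed, textbook version of the classic Isomap algorithm (k-nearest-neighbor graph, shortest-path geodesic distances, classical MDS) run on a synthetic toy dataset:

```python
import numpy as np

def isomap(X, n_neighbors=6, n_components=2):
    """Minimal Isomap sketch: kNN graph -> geodesic distances -> classical MDS.

    Not the thesis's implementation; a plain NumPy illustration only.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances in the input (high-dimensional) space.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep only edges to each point's k nearest neighbors.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        idx = np.argsort(D[i])[1:n_neighbors + 1]  # skip self at position 0
        G[i, idx] = D[i, idx]
        G[idx, i] = D[i, idx]                      # symmetrize the graph
    # Floyd-Warshall shortest paths approximate geodesic distances.
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    # Classical MDS on the geodesic distance matrix (double centering).
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (G ** 2) @ H
    w, V = np.linalg.eigh(B)
    order = np.argsort(w)[::-1][:n_components]     # top eigenpairs
    return V[:, order] * np.sqrt(np.maximum(w[order], 0.0))

# Toy example: points on a 1-D helix embedded in 3-D.
rng = np.random.default_rng(0)
t = np.linspace(0, 3 * np.pi, 40)
X = np.c_[np.cos(t), np.sin(t), t] + 0.01 * rng.standard_normal((40, 3))
Y = isomap(X, n_neighbors=6, n_components=2)
print(Y.shape)  # (40, 2)
```

For text data, as in the thesis, `X` would instead be a term-by-document matrix (each keyword a dimension), and the reduced coordinates `Y` would then be fed to a clustering method such as k-means or a Gaussian mixture model.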



1 Introduction
  1.1 Motivation
  1.2 Proposed Method
  1.3 Previous Works
  1.4 Thesis Outline
2 Data Preprocessing
  2.1 Tokenization
  2.2 Term Weighting
  2.3 Term-by-Document Matrix
  2.4 Dissimilarity Measures
3 Dimensionality Reduction
  3.1 Intrinsic Dimensionality
  3.2 Isomap
  3.3 Comparison of Isomap and Other Methods
  3.4 Landmark Isomap
  3.5 Incremental Isomap
  3.6 Avoiding Outliers
4 Clustering Methods
  4.1 k-Means
  4.2 Gaussian Mixture Model
5 Experiment Results
  5.1 Introduction to Dataset
  5.2 Ex1: Using Dataset I
  5.3 Ex2: Using Dataset II
  5.4 Ex3: Using Dataset III
6 Conclusion and Future Work
  6.1 Conclusions
  6.2 Future Work

