
Graduate Student: 賴志松 (Chih-Sung Lai)
Thesis Title: Concept Separation by Manifold Learning Approach
Advisor: 鮑興國 (Hsing-Kuo Pao)
Committee Members: 張源俊 (Yuan-Chin Chang), 劉庭祿 (Tyng-Luh Liu), 李育杰 (Yuh-Jye Lee), 戴碧如 (Bi-Ru Dai)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2007
Graduation Academic Year: 95
Language: English
Pages: 71
Keywords: Incomplete Queries, Concept Separation, Manifold Learning, Clustering, Dimension Reduction, Isomap, Search Engine
  • Abstract
    We usually rely on effective search engines to make the most of the web's rich information. In many situations, however, the query input is short, and even a powerful search engine cannot infer the user's actual goal. Statistics report that approximately 77% of users submit only a single keyword to search engines, and 85% of users submit queries containing fewer than three keywords. We call such queries incomplete queries. Our goal is to separate the main concept corresponding to a query into sub-concepts, one of which may match the user's real interest. Documents usually lie in a very high-dimensional space, where each keyword corresponds to one dimension. We therefore propose a dimension reduction method based on a manifold learning approach for the clustering process, adopting Isomap for the reduction. Experimentally, the Isomap-reduced dataset gives a better representation of the data, and because the data size is reduced, the execution time of the whole mining process approaches real time. Several variants of Isomap are also studied in this thesis.
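    The pipeline the abstract describes (a term-by-document matrix, Isomap dimension reduction, then clustering to separate sub-concepts) can be sketched roughly as follows. This is a hypothetical illustration using scikit-learn, not the thesis's actual implementation; the toy documents for an ambiguous one-keyword query and the parameter choices (`n_neighbors`, `n_clusters`) are invented for demonstration.

    ```python
    # Hypothetical sketch: TF-IDF term-by-document matrix -> Isomap -> clustering.
    # Documents and parameters are illustrative, not from the thesis itself.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.manifold import Isomap
    from sklearn.cluster import KMeans

    # Toy search-result snippets for the incomplete one-keyword query "jaguar",
    # which mixes two sub-concepts (the car brand and the animal).
    docs = [
        "jaguar car dealership sells luxury sedan models",
        "the jaguar is a big cat native to the americas",
        "jaguar automobile engine and sedan performance review",
        "jaguar cats hunt in the rainforest of the americas",
        "luxury car review of the new jaguar sedan engine",
        "big cat conservation protects the jaguar in the rainforest",
    ]

    # Each keyword corresponds to one dimension of the document representation.
    X = TfidfVectorizer().fit_transform(docs).toarray()

    # Reduce the high-dimensional document vectors to a low-dimensional
    # embedding (a small neighborhood size suits this tiny toy dataset).
    embedding = Isomap(n_neighbors=3, n_components=2).fit_transform(X)

    # Cluster in the reduced space to separate the query's sub-concepts.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
    print(labels)
    ```

    Clustering in the two-dimensional embedding rather than the raw keyword space is what makes the mining step cheap enough to approach real time, as the abstract argues.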



  • Table of Contents
    1 Introduction 1
      1.1 Motivation 1
      1.2 Proposed Method 3
      1.3 Previous Works 5
      1.4 Thesis Outline 7
    2 Data Preprocessing 8
      2.1 Tokenization 8
      2.2 Term Weighting 10
      2.3 Term-by-Document Matrix 12
      2.4 Dissimilarity Measures 12
    3 Dimensionality Reduction 16
      3.1 Intrinsic Dimensionality 17
      3.2 Isomap 19
      3.3 Comparison of Isomap and Other Methods 23
      3.4 Landmark Isomap 24
      3.5 Incremental Isomap 27
      3.6 Avoiding Outliers 31
    4 Clustering Methods 36
      4.1 k-Means 36
      4.2 Gaussian Mixture Model 37
    5 Experiment Results 40
      5.1 Introduction to Dataset 41
      5.2 Ex1: Using Dataset I 42
      5.3 Ex2: Using Dataset II 46
      5.4 Ex3: Using Dataset III 49
    6 Conclusion and Future Work 52
      6.1 Conclusions 52
      6.2 Future Work 53

