
Graduate Student: Chih-hsiang Tseng (曾志翔)
Thesis Title: Semi-Supervised Learning for Text Data Using Generative Models (運用生成模型對文件類型資料做半監督式學習)
Advisor: Hsing-Kuo Pao (鮑興國)
Committee Members: Yuh-Jye Lee (李育杰), Chih-jen Lin (林智仁), Yuan-chin Chang (張源俊), Tien-Ruey Hsiang (項天瑞)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Publication Year: 2005
Graduation Academic Year: 93 (2004-2005)
Language: Chinese
Pages: 49
Keywords: semi-supervised learning (半監督式學習)
    Semi-supervised learning problems arise because, in real-world data sets, labeled data is usually hard to obtain; this is especially common for text data. The architecture of a generative model combined with a naive Bayes classifier has been shown to classify text data well, and the expectation-maximization (EM) statistical technique allows unlabeled data with simple structure to be incorporated into the generative model's learning process for semi-supervised learning. For data sets with more complex structure, we can build sub-topic structure into the model to adapt to such data, and model selection methods can help us determine the model's sub-topic structure and initial parameter values in advance.
    In this thesis, we propose a method that automatically determines the model structure, which we call the sub-topic structure learning algorithm. Experiments show that, for data sets with complex structure, our method can effectively find a suitable sub-topic structure and initial parameter values; it not only preserves good classification performance but also greatly reduces the time spent on model selection.


    The semi-supervised learning problem originates from the fact that labeled data is usually hard to obtain in the real world, especially for text data. The architecture of a generative model combined with a naïve Bayes classifier has been shown to achieve good classification performance on text data. Moreover, through the expectation-maximization (EM) statistical technique, we can incorporate unlabeled data with simple structure to enhance model learning. Even for data with complicated structure, we can model sub-topic structures into the model to adapt to such data. Model selection methods can help us determine the sub-topic structure and initial parameter values of the model.
    In this thesis, we propose an automatic model selection approach, called the sub-topic structure learning algorithm. Experimental results show that our approach can effectively learn a suitable sub-topic structure and initial parameter values. Not only does it preserve good classification performance, it also greatly reduces the model selection time.
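The scheme described above — a naive Bayes classifier trained on labeled word counts, then refined by EM over unlabeled documents — can be sketched as follows. This is a minimal illustration in the style of Nigam et al., not the thesis's own implementation; the toy vocabulary and data are invented for the example.

```python
import numpy as np

def train_nb(X, R, alpha=1.0):
    """M-step: estimate log class priors and log word probabilities from
    soft responsibilities R (n_docs x n_classes), with Laplace smoothing."""
    priors = (R.sum(axis=0) + alpha) / (R.sum() + alpha * R.shape[1])
    counts = R.T @ X  # expected word counts per class
    word_p = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    return np.log(priors), np.log(word_p)

def posteriors(X, log_prior, log_word_p):
    """E-step: per-document class posteriors P(c | d) under naive Bayes."""
    log_joint = X @ log_word_p.T + log_prior
    log_joint -= log_joint.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def em_nb(X_l, y_l, X_u, n_classes=2, n_iter=10):
    """Initialize from labeled data only, then alternate E/M steps that
    fold the unlabeled documents in via their soft class posteriors."""
    R_l = np.eye(n_classes)[y_l]  # hard labels for labeled docs
    log_prior, log_word_p = train_nb(X_l, R_l)
    X_all = np.vstack([X_l, X_u])
    for _ in range(n_iter):
        R_u = posteriors(X_u, log_prior, log_word_p)            # E-step
        log_prior, log_word_p = train_nb(X_all, np.vstack([R_l, R_u]))  # M-step
    return log_prior, log_word_p

# Toy word-count vectors: two topics with mostly disjoint vocabulary.
X_l = np.array([[5., 1, 0, 0], [0, 0, 4, 2]])
y_l = np.array([0, 1])
X_u = np.array([[6., 2, 0, 1], [1, 0, 5, 3], [4, 0, 1, 0]])
log_prior, log_word_p = em_nb(X_l, y_l, X_u)
pred = posteriors(X_u, log_prior, log_word_p).argmax(axis=1)
print(pred)  # [0 1 0]
```

On this separable toy data EM simply sharpens the labeled-only estimates; its practical value, as the abstract notes, shows up when labeled documents are scarce relative to unlabeled ones.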
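The thesis's algorithm is not reproduced here, but one common way to seed a sub-topic structure — cluster each class's labeled documents and turn the centroids into initial word distributions for that class's mixture components — can be sketched as below. The k-means routine, the per-class component counts, and the smoothing are all assumptions of this illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means: returns cluster centers for the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None]) ** 2).sum(-1)  # squared distances
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers

def init_subtopics(X, y, k_per_class):
    """For each class c, cluster its documents into k_per_class[c] sub-topics
    and normalize each centroid into a (Laplace-smoothed) word distribution,
    giving one initial mixture component per sub-topic."""
    comps = []
    for c, k in enumerate(k_per_class):
        centers = kmeans(X[y == c], k)
        comps.extend((ctr + 1) / (ctr + 1).sum() for ctr in centers)
    return np.array(comps)

# Toy word-count data: class 0 spans two vocabularies, class 1 spans one.
X = np.array([[5., 1, 0, 0], [4, 2, 0, 0], [0, 5, 1, 0],
              [0, 0, 4, 2], [0, 1, 5, 1]])
y = np.array([0, 0, 0, 1, 1])
comps = init_subtopics(X, y, k_per_class=[2, 1])
print(comps.shape)  # (3, 4): two sub-topic components for class 0, one for class 1
```

The resulting components would then serve as EM starting points; choosing `k_per_class` automatically, rather than by trial and error over many candidate structures, is precisely the model selection time the thesis aims to reduce.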

    Abstract (Chinese) … I
    Abstract (English) … II
    Acknowledgments … III
    Table of Contents … IV
    List of Figures … VI
    List of Tables … VII
    Chapter 1: Introduction
      1.1 Background … 1
      1.2 Motivation and Goals … 3
      1.3 Thesis Organization … 4
    Chapter 2: Related Work
      2.1 The Value of Unlabeled Data … 6
      2.2 Existing Methods … 8
    Chapter 3: Generative Models
      3.1 Document Modeling … 12
        3.1.1 Document Preprocessing … 12
        3.1.2 The Generative Model Architecture for Text Data … 13
      3.2 The Naive Bayes Text Classifier … 15
        3.2.1 Training the Classifier with Labeled Data … 16
        3.2.2 Classifying Documents … 17
      3.3 Incorporating Unlabeled Data with EM … 18
        3.3.1 The Expectation-Maximization Method … 18
        3.3.2 Experimental Results … 22
        3.3.3 Discussion … 26
      3.4 Modeling Sub-Topic Structure … 26
        3.4.1 The Many-Components-per-Class Assumption … 27
        3.4.3 Discussion … 30
    Chapter 4: Improving Learning Efficiency
      4.1 The Sub-Topic Structure Learning Algorithm … 31
      4.2 Experimental Results … 38
      4.3 Discussion … 41
    Chapter 5: Conclusion
      5.1 Discussion … 43
      5.2 Future Work … 43
    References … 45


    Full text release date: 2006/08/01 (campus network)
    Full text release date: not authorized for public access (off-campus network)
    Full text release date: not authorized for public access (National Central Library: Taiwan NDLTD system)