Graduate student: 鍾興志 (Hsin-Chih Chung)
Thesis title: MDL-based Model Trees for Classification of Hybrid Type Data
Advisor: 鮑興國 (Hsing-Kuo Pao)
Oral defense committee: 劉庭祿 (Tyng-Luh Liu)
Yuan-Chin Chang
Yuh-Jye Lee
Bi-Ru Dai
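Degree: Master's
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: English
Number of pages: 55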
Keywords: model tree, minimum description length, decision tree, support vector machine
We propose a model selection method for datasets of hybrid type, that is, datasets that contain both nominal and numeric attributes. Motivated by the effectiveness of decision trees on nominal data and the success of support vector machines (SVMs) on numeric data, we propose a model tree that combines both models. We first derive a synthesized Boolean attribute from the classification output of an SVM trained only on the numeric attributes. The SVM-synthesized attribute and all of the nominal attributes are then passed to decision tree induction, specifically the ID3 algorithm, which selects the "best" attribute at each node according to a goodness criterion. The concept of a model tree is not new. Unlike the model tree proposed by Chang et al. in 2004, our approach aims to improve performance through the Minimum Description Length (MDL) principle. The MDL principle is used to balance the choice between the SVM-synthesized attribute and a discrete attribute by also taking their model complexity into account. An SVM is considered a more complex model than a simple discrete classifier (such as "education = Master or Ph.D."), so an SVM classifier should incur a larger penalty than a discrete classifier when the best attribute is selected during decision tree induction. The penalty takes a form whose one-dimensional case coincides with the penalty proposed by Quinlan in 1996 for a simple numeric classifier (such as "age >= 22"). Our experiments show that this modification improves prediction accuracy on many real-world datasets.
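
The attribute selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The toy data, the function names, and the fixed linear decision function standing in for a trained SVM are all assumptions made for illustration. The abstract does not state the SVM penalty formula, so the value 0.9 below is arbitrary. The one-dimensional penalty is written as log2(N - 1) / |D|, which is our reading of Quinlan's 1996 correction for a threshold test on one numeric attribute with N distinct values over |D| cases.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Information gain obtained by splitting `labels` on a discrete attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def quinlan_penalty(num_distinct, num_cases):
    """Assumed 1-D penalty: log2(N - 1) / |D| for a numeric threshold test."""
    return math.log2(num_distinct - 1) / num_cases


def synthesized_attribute(rows, weights, bias):
    """Stand-in for the SVM step: the sign of a fixed linear decision function.
    A real model tree would train an SVM on the numeric attributes instead."""
    return [sum(w * x for w, x in zip(weights, row)) + bias >= 0 for row in rows]


def select_attribute(columns, labels, penalties):
    """Pick the attribute with the largest penalized information gain."""
    scores = {
        name: info_gain(col, labels) - penalties.get(name, 0.0)
        for name, col in columns.items()
    }
    return max(scores, key=scores.get), scores


# Toy hybrid dataset (hypothetical): two numeric attributes and one nominal one.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
numeric_rows = [(30, 5), (35, 6), (40, 7), (28, 6),
                (22, 2), (25, 1), (21, 3), (24, 2)]
education = ["MS", "MS", "PhD", "BS", "BS", "BS", "BS", "MS"]

columns = {
    # Boolean test on a nominal attribute, as in "education = Master or Ph.D."
    "education": [e in ("MS", "PhD") for e in education],
    # Boolean attribute synthesized from the numeric attributes
    "svm": synthesized_attribute(numeric_rows, weights=(0.1, 0.5), bias=-5.0),
}

# Without a complexity penalty, the synthesized attribute wins on raw gain.
best_unpenalized, _ = select_attribute(columns, labels, {})

# With an illustrative (arbitrary) penalty on the more complex model,
# the simpler discrete test is preferred.
best_penalized, penalized_scores = select_attribute(columns, labels, {"svm": 0.9})

print(best_unpenalized, best_penalized)
```

The sketch only shows how a complexity penalty can change which attribute is chosen at a node. How large the SVM penalty should be is precisely what the MDL-based coding length in the thesis is meant to determine.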

1 Introduction
  1.1 Motivation
  1.2 Proposed Method
  1.3 Thesis Outline
2 Classification of Hybrid Type Data
  2.1 Decision Trees
    2.1.1 Constructing a Decision Tree
    2.1.2 Incorporating Continuous-Valued Attributes
    2.1.3 Univariate and Multivariate Decision Trees
    2.1.4 Pruning Strategies
  2.2 Support Vector Machines
    2.2.1 Conventional Support Vector Machines
    2.2.2 Incorporating Nominal-Valued Attributes
3 Model Trees
  3.1 The Concept of Model Trees
  3.2 Building Model Trees
4 MDL-based Model Trees
  4.1 Minimum Description Length Principle
  4.2 Estimate the coding length of Support Vector Machines
5 Experiments
  5.1 Dataset Descriptions
  5.2 Numerical Results and Comparisons
6 Conclusion and future work
  6.1 Conclusions
  6.2 Future work

