
Graduate Student: 林威志 (Wie-Zhih Lin)
Thesis Title: 一個基於特徵組合之類別資料低維度轉換方法
(A Low Dimensional Categorical Data Transform Based on Feature Combination)
Advisor: 鄧惟中 (Wei-Chung Teng)
Committee Members: 項天瑞 (Tien-Ruey Hsiang), 王勝德 (Sheng-De Wang)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Year of Publication: 2017
Academic Year of Graduation: 105 (ROC calendar; 2016–2017)
Language: Chinese
Pages: 53
Chinese Keywords: 類別型資料, 低維度, 特徵結合, 前篩選, 編碼器
English Keywords: Categorical Data, Low-dimension, Feature Combination, Pre-Selection, Encoder
This thesis focuses on transforming categorical data into numerical data effectively. Although many encoders already exist for handling categorical data, most of them produce output of very high dimensionality. The goal is therefore to derive, from an encoder used as the base, a transform that yields low-dimensional yet effective features.

The core of the proposed method, named Feature Combination, combines the original categorical columns to generate new categorical columns. OneHotEncoder serves as the base encoder and is augmented with k-means, in an attempt to extract and retain more information. Although Feature Combination does improve the prediction model's performance, it suffers from a high-dimensionality bottleneck of its own. This thesis therefore proposes Pre-Selection, which uses Information Gain to filter out the more useful columns before Feature Combination runs, so that the final method reaches the intended goal.
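The Feature Combination step described above can be sketched roughly as follows. This is a minimal illustration, not the thesis' exact implementation: the column names, the pairwise combination scheme, and the use of k-means centroid distances as the final low-dimensional features are all assumptions made for the example.

```python
# Hypothetical sketch: concatenate pairs of categorical columns into new
# combined columns, one-hot encode the widened table, then use distances
# to k-means centroids as a low-dimensional numeric representation.
from itertools import combinations

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder


def feature_combination(df, n_clusters=4, random_state=0):
    # Build a combined column from every pair of original categorical columns.
    combined = df.copy()
    for a, b in combinations(df.columns, 2):
        combined[f"{a}+{b}"] = df[a].astype(str) + "_" + df[b].astype(str)
    # One-hot encode the widened categorical table (dense for k-means).
    onehot = OneHotEncoder().fit_transform(combined).toarray()
    # Reduce to n_clusters dimensions via distances to k-means centroids.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    km.fit(onehot)
    return km.transform(onehot)  # shape: (n_samples, n_clusters)


# Toy data with invented column names.
df = pd.DataFrame({"color": ["red", "red", "blue", "blue"],
                   "size":  ["S", "L", "S", "L"],
                   "shape": ["box", "box", "tube", "tube"]})
features = feature_combination(df, n_clusters=2)
print(features.shape)  # (4, 2): 4 samples reduced to 2 numeric columns
```

With this choice, the output dimensionality equals the number of clusters, which matches the abstract's observation that the final dimensionality depends on the k-means cluster count.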

The proposed method is evaluated on categorical datasets provided by UCI and CTU. In the final results, the data dimensionality lies between 1 and 4, depending on the number of k-means clusters, while the overall average score is nearly 2% higher than that of OneHotEncoder. Although the improvement is modest, the dimensionality is at least 20 times lower than the number of columns OneHotEncoder produces.


This research focuses on how to transform categorical data into numerical data efficiently. Although there are plenty of encoders for processing categorical data, each suffers from one of two severe defects: high information with high dimensionality, or low dimensionality with low information. Hence, the goal of the proposed method is to derive a new categorical transform, using an existing encoder as the base, so as to obtain low-dimensional and effective features.

First, the proposed method takes OneHotEncoder as the base encoder and improves it by combining it with k-means. Second, this research proposes a core technique, called Feature Combination, to extract and retain more information from the dataset. The idea is to combine the original categorical columns to create new columns. Feature Combination does improve the accuracy of the prediction model, but its bottleneck is its high-dimensional output. Therefore, this research proposes Pre-Selection, which selects important columns through Information Gain before executing Feature Combination, to remove this bottleneck and let the proposed method achieve the original goal.
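The Pre-Selection filtering idea can be sketched with scikit-learn's mutual-information estimator, which computes the same quantity as Information Gain for discrete variables. This is an illustrative sketch under assumed data and an assumed top-k cutoff, not the thesis' exact procedure.

```python
# Hypothetical sketch: rank categorical columns by mutual information
# (information gain) with the label, and keep only the top-k columns
# before the expensive Feature Combination step.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OrdinalEncoder


def pre_selection(df, y, top_k=2):
    # mutual_info_classif needs numeric input, so ordinal-encode each column;
    # the encoding is arbitrary but preserves the discrete structure.
    encoded = OrdinalEncoder().fit_transform(df)
    scores = mutual_info_classif(encoded, y, discrete_features=True,
                                 random_state=0)
    ranked = sorted(zip(df.columns, scores), key=lambda pair: -pair[1])
    return [name for name, _ in ranked[:top_k]]


# Toy data: the label is fully determined by "color"; the other columns
# carry no information about it.
df = pd.DataFrame({"color": ["red", "red", "blue", "blue"],
                   "size":  ["S", "L", "S", "L"],
                   "noise": ["a", "b", "a", "b"]})
y = [1, 1, 0, 0]
print(pre_selection(df, y, top_k=1))  # ['color']
```

Filtering before combining matters because pairwise column combination grows quadratically in the number of columns, so discarding uninformative columns up front directly attacks the high-dimensionality bottleneck the abstract describes.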

The proposed method is evaluated on categorical datasets from UCI and CTU. The final experimental results show that the features produced by the proposed transform have dimensionality between 1 and 4, according to the number of k-means clusters. Moreover, the accuracy across all the datasets with the proposed method is almost 2 percent higher than with OneHotEncoder. Although the improvement in accuracy is not as high as expected, the number of feature dimensions is at least 20 times lower than that of OneHotEncoder.

Table of Contents

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation
  1.2 Research Goals
  1.3 Thesis Organization
2 Related Work
  2.1 Encoders
    2.1.1 OrdinalEncoder
    2.1.2 BinaryEncoder
    2.1.3 SumEncoder
    2.1.4 HelmertEncoder
    2.1.5 BackwardDifferenceEncoder
    2.1.6 HashingEncoder
    2.1.7 OneHotEncoder
  2.2 Dimension Reduction
    2.2.1 PCA
    2.2.2 LDA
3 Methodology
  3.1 OneHotEncoder
  3.2 OneHotEncoder Dimension Reduction
    3.2.1 PCA (Principal Components Analysis)
    3.2.2 LDA (Linear Discriminant Analysis)
  3.3 Feature Combination
  3.4 Pre-Selection
4 Experimental Procedure
  4.1 Datasets
    4.1.1 UCI
    4.1.2 CTU
  4.2 Experiment and Comparison Procedures
    4.2.1 Experiment Procedure
    4.2.2 Comparison Procedure
5 Experimental Results and Analysis
  5.1 Comparison of Encoders
  5.2 Comparison of PCA and LDA
  5.3 Comparison of LDA Before and After Combining with k-means
  5.4 Comparison of Feature Combination With and Without Pre-Selection
  5.5 Overall Comparison and Conclusions
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
References

