
Author: Wei-Chih Lai (賴威志)
Thesis Title: A Parallel Structure for Understanding the Feature of Extreme High Dimensional Data (瞭解超高維度資料之特徵的平行架構)
Advisor: Yuh-Jye Lee (李育杰)
Committee Members: Chuan-Kai Yang (楊傳凱), Yi-Ren Yeh (葉倚任), Hsing-Kuo Pao (鮑興國)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2015
Graduation Academic Year: 103 (ROC calendar)
Language: English
Pages: 49
Keywords: minimum redundancy maximum relevance, MapReduce
Abstract:


    We propose a parallel computational structure for data understanding and feature selection.
    Nowadays, social media and environmental sensor applications automatically collect
    tremendous datasets. Such datasets, with large numbers of instances and high
    dimensionality, are hard to handle even with the simplest algorithms. Our toolbox
    transforms the dataset into a column-based sparse data format and splits it into many
    partitions. Each partition may contain a different number of columns because we roughly
    balance the partition sizes. With this parallel structure, our toolbox can provide basic
    information about the data for conventional algorithms. Moreover, extremely high
    dimensional data is difficult to process with traditional learning algorithms, so the
    data analysis community needs forward feature selection methods that scale to such data.
    Our toolbox provides an improved version of Minimum Redundancy Maximum Relevance (MRMR),
    called Two-Strips Minimum Redundancy Maximum Relevance (2sMRMR), to filter out redundant
    features. The computing time of 2sMRMR grows only linearly with the number of selected
    features, and to speed up the selection process further, 2sMRMR is built on the parallel
    structure. We test the proposed method on datasets of different scales. The numerical
    results show that the proposed method computes faster than a straightforward MRMR
    implementation on large datasets while selecting the same features. The source code is
    available on GitHub.
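
    As a worked illustration of the selection rule: at each round, MRMR picks the candidate
    feature f that maximizes I(f; c) - (1/|S|) * sum over s in S of I(f; s), where c is the
    class label, S is the set of already selected features, and I denotes mutual information.
    The Python sketch below shows this greedy loop with an incrementally maintained
    redundancy sum, so each round adds only one new mutual-information term per candidate;
    this is the standard way to make the cost grow linearly with the number of selected
    features, in the spirit of the linear-time claim above. The function name, the binning
    scheme, and the use of scikit-learn's mutual_info_score are illustrative assumptions;
    this is not the thesis's 2sMRMR or its MPI/Hadoop implementation.

    import numpy as np
    from sklearn.metrics import mutual_info_score

    def mrmr_select(X, y, n_select, n_bins=10):
        # Greedy MRMR forward selection (a sketch, not the thesis's 2sMRMR).
        n_features = X.shape[1]
        # Discretize each column so mutual information is well defined.
        Xd = np.empty(X.shape, dtype=int)
        for j in range(n_features):
            edges = np.histogram_bin_edges(X[:, j], bins=n_bins)
            Xd[:, j] = np.digitize(X[:, j], edges[1:-1])

        relevance = np.array([mutual_info_score(y, Xd[:, j])
                              for j in range(n_features)])
        redundancy = np.zeros(n_features)       # running sum of I(f; s) over selected s
        selected = [int(np.argmax(relevance))]  # first pick: most relevant feature

        while len(selected) < n_select:
            last = selected[-1]
            candidates = np.array([j for j in range(n_features) if j not in selected])
            for j in candidates:
                # Only the newest selected feature contributes a new term each
                # round, so round t costs O(n_features) MI evaluations rather
                # than O(t * n_features) for a naive recomputation.
                redundancy[j] += mutual_info_score(Xd[:, j], Xd[:, last])
            scores = relevance[candidates] - redundancy[candidates] / len(selected)
            selected.append(int(candidates[np.argmax(scores)]))
        return selected

    For example, mrmr_select(X, y, 50) returns 50 feature indices. The per-candidate MI
    evaluations inside the loop are column-wise work, which is exactly the kind of work the
    column-partitioned parallel structure described above could distribute.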

    Contents
    1 Introduction
      1.1 Motivation
      1.2 Contribution
      1.3 Organization of Thesis
    2 Feature Understanding
      2.1 Statistical Characteristic of Feature
      2.2 Symbolic Aggregate Approximation
    3 Feature Selection
      3.1 Pearson Correlation
      3.2 Mutual Information
      3.3 Minimum Redundancy Maximum Relevance (MRMR)
        3.3.1 Minimum Redundancy Maximum Relevance in iterator
    4 Parallel Framework Scheme
      4.1 Feature understanding and feature selection with parallel structure
    5 Minimum Redundancy Maximum Relevance in the Parallel Structure
      5.1 Minimum Redundancy Maximum Relevance with Hadoop
      5.2 Two-Strips Minimum Redundancy Maximum Relevance (2sMRMR)
    6 Numerical Result
      6.1 Experiment Environment
      6.2 Dataset
      6.3 Experiment Result
        6.3.1 2sMRMR vs. original MRMR
        6.3.2 2sMRMR vs. Hadoop MRMR
        6.3.3 High dimensional dataset
    7 Conclusion and Future Work
      7.1 Conclusion
      7.2 Future Work
    8 Appendix
      8.1 MPI Parallel Structure on Feature Understanding
        8.1.1 Minimum
        8.1.2 Maximum
        8.1.3 Mean
        8.1.4 Standard Deviation
      8.2 MPI Parallel Structure on Pearson Correlation
      8.3 MPI Parallel Structure on Mutual Info

