Graduate student: | 高君豪 Chiun-How Kao |
---|---|
Thesis title: | 巨量資料之矩陣視覺化 Matrix Visualization for Big Data |
Advisors: | 楊傳凱 Chuan-kai Yang, 陳君厚 Chun-houh Chen |
Oral defense committee: | 楊傳凱 Chuan-kai Yang, 陳君厚 Chun-houh Chen, 張源俊 Yuan-chin Ivan Chang, 李育杰 Yuh-Jye Lee, 吳漢銘 Han-Ming Wu |
Degree: | Doctor |
Department: | School of Management - Department of Information Management |
Year of publication: | 2018 |
Academic year of graduation: | 106 (ROC calendar) |
Language: | English |
Pages: | 133 |
Keywords (Chinese): | 矩陣視覺化、巨量資料、探索式資料分析、象徵式資料分析、廣義相關圖 |
Keywords (English): | Matrix Visualization, Big Data, Exploratory Data Analysis, Symbolic Data Analysis, Generalized Association Plots |
Continual innovation in biomedical and industrial technology, together with the ongoing development of computing, has dramatically changed how data are generated and collected. Data volumes are growing rapidly while data quality is becoming uneven, which creates demand for statistical methods to validate and analyze such data; computing techniques and statistical analysis methods for big data are now important research topics. Visualization and exploratory data analysis (EDA) will play essential roles in deep analytics on big data, but problems remain to be solved and techniques to be developed. Most current big data visualization environments present data mainly as dynamic network drawings based on node-link diagrams, built on conventional 2D and 3D scatterplots. These use relatively little memory, computing power, and display space, but the number of data dimensions they can show is comparatively limited.

This work first addresses two difficulties in applying matrix visualization (MV) to continuous big data: (1) the limits of computing power for calculating and permuting proximity matrices, and (2) the effective display of big data within a limited screen area. We combine generalized association plots (GAP) and symbolic data analysis (SDA) with the Hadoop/Spark computing environment to carry out matrix visualization and cluster analysis, overcoming these difficulties and providing an EDA tool for viewing continuous big data. As a real example, we apply the proposed techniques to the 2000 Longitudinal Health Insurance Database (LHID2000), the one-million sample file of the National Health Insurance Research Database (NHIRD) published by the National Health Research Institutes (NHRI) in Taiwan. Once the environment for continuous big data is complete, future work will extend matrix visualization to non-continuous big data, including binary, categorical, mixed, and cartographic types, which we expect to pose different and even more challenging difficulties.
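The general idea of pairing symbolic data analysis with matrix visualization can be illustrated with a small sketch. It is not the thesis's implementation: the function names, the toy data, the choice of a Euclidean distance on interval bounds, and the nearest-neighbour ordering are all assumptions made here for illustration, and GAP itself uses different seriation methods. The sketch aggregates raw rows into interval-valued symbolic objects (one per group), computes a proximity matrix between those objects, and permutes it so that similar objects sit next to each other.

```python
from collections import defaultdict
from math import sqrt

def to_intervals(rows, keys):
    """Aggregate raw rows into one interval-valued object per key.
    Each object stores a (min, max) pair for every variable."""
    groups = defaultdict(list)
    for key, row in zip(keys, rows):
        groups[key].append(row)
    return {
        key: [(min(col), max(col)) for col in zip(*members)]
        for key, members in groups.items()
    }

def interval_distance(a, b):
    """Euclidean distance on lower and upper interval bounds
    (an illustrative choice, not necessarily the thesis's measure)."""
    return sqrt(sum((lo1 - lo2) ** 2 + (hi1 - hi2) ** 2
                    for (lo1, hi1), (lo2, hi2) in zip(a, b)))

def proximity_matrix(symbolic, labels):
    """Pairwise distances between symbolic objects, in label order."""
    return [[interval_distance(symbolic[p], symbolic[q]) for q in labels]
            for p in labels]

def greedy_order(dist):
    """Nearest-neighbour chain starting at index 0: a simple stand-in
    for seriation, used only to show the permutation step."""
    order = [0]
    remaining = set(range(1, len(dist)))
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: (dist[last][j], j))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical toy data: eight records, two variables, four groups.
rows = [(1.0, 10.0), (2.0, 12.0), (9.0, 1.0), (8.0, 2.0),
        (1.5, 11.0), (2.5, 13.0), (7.0, 3.0), (8.0, 4.0)]
keys = ["A", "A", "B", "B", "C", "C", "D", "D"]

symbolic = to_intervals(rows, keys)
labels = sorted(symbolic)
dist = proximity_matrix(symbolic, labels)
order = greedy_order(dist)
ordered_labels = [labels[i] for i in order]
# Permuted matrix: this is what a heat map would display.
ordered_dist = [[dist[i][j] for j in order] for i in order]
```

In this toy case the eight records shrink to four symbolic objects, so the proximity matrix is 4 x 4 rather than 8 x 8. The same reduction is what would make both computation and display tractable at scale, with the aggregation step being the natural candidate for distributed execution.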