Graduate student: Yu-Jung Chang (張育榮)
Thesis title: On the Study of Various Preprocessing Approaches Comparison to Imputation Data in the Big Data Predictive Analytics for Multiple Prediction Problems
Thesis title (Chinese): 研究多重輸出預測問題之大數據預測分析中插補值的各種不同前處理方法之比較
Advisor: Shih-Che Lo (羅士哲)
Oral defense committee: Shih-Che Lo (羅士哲), Yu-Chung Tsao (曹譽鐘), Hung-Hsu Tsai (蔡鴻旭)
Degree: Master
Department: College of Management - Department of Industrial Management
Publication year: 2020
Academic year of graduation: 108 (ROC calendar)
Language: Chinese
Pages: 56
Keywords (Chinese): 大數據, 缺失值, 插補法, 機器學習, 人工神經網路
Keywords (English): Big Data, Missing Values, Data Imputation, Machine Learning, Artificial Neural Network
  • As science and technology advance, data are generated in ever larger volumes and at ever higher speed, which gave rise to the term Big Data. Imputation of missing values is one of the important issues in Big Data research: when one or more attributes of a dataset are missing or noisy, appropriate methods must be used to fill them in, so that the dataset is complete and the subsequent Big Data predictive analytics can be carried out.
    This thesis divides the experiments into two phases. In the first phase, records with missing values are removed from the initial incomplete dataset, and the remaining complete dataset is used to create test datasets with different missing rates. These test datasets are then imputed with several methods, including Random Forest, Classification and Regression Trees, Predictive Mean Matching, and simple statistics. We compare how well each imputation method restores the data under the different missing rates and evaluate whether these results carry over to the second phase. The second phase returns to the initial incomplete dataset: its missing values are imputed with the methods from the first phase, and an Artificial Neural Network model is then used to predict the machining quality of a multi-drill station in a smart manufacturing process. The experimental results show that adding imputation methods to improve data quality also improves the quality of Big Data predictive analytics.
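
    The first-phase protocol (start from complete data, create test datasets at several missing rates, impute, measure how well the original values are restored) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's code: the synthetic Gaussian column, the function names `mask_at_random`, `impute_with` and `restoration_rmse`, the missing rates of 10%, 30% and 50%, masking completely at random, and RMSE as the restoration measure are all hypothetical choices. Only the "simple statistics" imputers (mean and median) are shown; Random Forest, Classification and Regression Trees, and Predictive Mean Matching would be compared in the same loop.

```python
import math
import random
import statistics


def mask_at_random(values, rate, rng):
    """Return a copy of `values` with round(rate * n) entries replaced by None."""
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def impute_with(stat_fn, column):
    """Fill every None with one statistic computed from the observed values."""
    fill = stat_fn([v for v in column if v is not None])
    return [fill if v is None else v for v in column]


def restoration_rmse(truth, masked, imputed):
    """RMSE between true and imputed values, over the masked cells only."""
    sq_errors = [
        (t - p) ** 2 for t, m, p in zip(truth, masked, imputed) if m is None
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Hypothetical complete column (stands in for the data left after
# records with missing values have been removed).
rng = random.Random(0)
complete = [rng.gauss(50.0, 10.0) for _ in range(1000)]

# Assumed missing rates and imputers; the thesis does not list them here.
results = {}
for rate in (0.1, 0.3, 0.5):
    masked = mask_at_random(complete, rate, rng)
    for name, stat_fn in (("mean", statistics.mean), ("median", statistics.median)):
        imputed = impute_with(stat_fn, masked)
        results[(rate, name)] = restoration_rmse(complete, masked, imputed)

for (rate, name), rmse in sorted(results.items()):
    print(f"missing rate {rate:.0%}  {name:<6}  RMSE {rmse:.3f}")
```

    Scoring only the masked cells keeps the comparison fair across missing rates, since observed cells are restored exactly by construction.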

    ABSTRACT (Chinese)
    ABSTRACT
    ACKNOWLEDGEMENTS
    CONTENTS
    FIGURES
    TABLES
    CHAPTER 1 INTRODUCTION
      1.1 Motivation
      1.2 Objectives
      1.3 Research Structure
    CHAPTER 2 LITERATURE REVIEW
      2.1 Industry 4.0
      2.2 Big Data
      2.3 Data Imputation
      2.4 Artificial Neural Network
    CHAPTER 3 RESEARCH METHODS
      3.1 Big Data Predictive Analytics
      3.2 Missing Data
      3.3 The Classical Imputation Methods
      3.4 Predictive Mean Matching
      3.5 Classification and Regression Trees
      3.6 Random Forest
      3.7 Artificial Neural Network
      3.8 Forecasting Performance Measures
    CHAPTER 4 COMPUTATIONAL EXPERIMENTS
      4.1 First Experiment
      4.2 Second Experiment
    CHAPTER 5 CONCLUSIONS AND FUTURE RESEARCH
      5.1 Conclusions
      5.2 Future Research
    REFERENCES


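    Full-text release date: 2025/07/19 (on-campus network)
    Full-text release date: 2025/07/19 (off-campus network)
    Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)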