
Author (graduate student): 張家瑋 (Chia-Wei Chang)
Thesis title: Design of Ranger Random Forest Imputation and Ranger Random Forest Prediction in Industrial Big Data Predictive Analytics for Multiple Output Problems
Original Chinese title: 設計騎警隨機森林插補法與騎警隨機森林預測法於多重輸出問題之工業大數據預測分析
Advisor: 羅士哲 (Shih-Che Lo)
Oral defense committee: 范書愷 (Shu-Kai Fan), 曹譽鐘 (Yu-chung Tsao), 羅士哲 (Shih-Che Lo)
Degree: Master
Department: School of Management - Department of Industrial Management
Year of publication: 2021
Academic year of graduation: 109 (ROC calendar, i.e. 2020-2021)
Language: English
Number of pages: 53
Keywords: Big Data Predictive Analytics, Missing Values, Data Imputation, Machine Learning, Random Forest
  • In recent years, data have been generated in ever larger volumes and at ever higher speed, ushering in the era of Big Data. Industries of every kind can analyze Big Data to predict future trends, which has made it a digital technology under development across all sectors. However, missing data and signal noise cannot be completely avoided, and they often affect subsequent analysis and prediction. This thesis studies an industrial big dataset with multiple input attributes and multiple outputs, framed as a machining-quality prediction problem for multiple drills.
    The research is divided into two phases. In the first phase, records containing missing values were removed from the initial incomplete dataset, and the remaining complete dataset was used to create testing datasets with different missing rates. Two methods, Random Forest Imputation (RFI) and Ranger Random Forest Imputation (RRFI), were then used to impute these testing datasets, and the methods were compared on how well they restored the data and on the time they required under each missing rate. The second phase returns to the initial incomplete dataset: it was imputed with RFI and with RRFI, and the resulting complete datasets were then used with Random Forest Prediction (RFP) and Ranger Random Forest Prediction (RRFP), respectively, to predict the machining quality of the multi-drill machine. The experimental results show that the accuracy of the proposed RRFI with RRFP does not differ significantly from that of RFI with RFP. However, at small missing rates, RRFI with RRFP is faster than the traditional RFI with RFP.
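    The first-phase protocol described above (hide values in a complete dataset at a chosen missing rate, impute them, then measure restoration error and elapsed time) can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the function names are hypothetical, cells are assumed to be hidden completely at random, and a simple column-mean imputer stands in for RFI and RRFI, which are not reproduced here. RMSE over the hidden cells is used as the restoration measure, in line with the RMSE and time tables listed in the table of contents.

```python
import math
import random
import time


def mask_at_random(rows, missing_rate, rng):
    """Return (masked copy, hidden cell set) with about missing_rate of cells set to None.

    Assumption: cells are hidden completely at random; the record does not
    state which missingness mechanism the thesis used.
    """
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    hidden = set(rng.sample(cells, round(missing_rate * len(cells))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
              for i, row in enumerate(rows)]
    return masked, hidden


def impute_column_mean(masked):
    """Stand-in imputer (NOT RFI/RRFI): fill each gap with its column's observed mean."""
    means = []
    for j in range(len(masked[0])):
        observed = [row[j] for row in masked if row[j] is not None]
        # Fallback of 0.0 for a fully hidden column is an arbitrary choice.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in masked]


def rmse_on_hidden(truth, imputed, hidden):
    """Root mean squared error computed only over the cells that were hidden."""
    if not hidden:
        return 0.0
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(sum(squared) / len(squared))


def run_experiment(complete, missing_rates, imputer, seed=0):
    """For each missing rate: mask, impute, and record RMSE and elapsed seconds."""
    rng = random.Random(seed)
    results = []
    for rate in missing_rates:
        masked, hidden = mask_at_random(complete, rate, rng)
        start = time.perf_counter()
        imputed = imputer(masked)
        elapsed = time.perf_counter() - start
        results.append({
            "missing_rate": rate,
            "n_hidden": len(hidden),
            "rmse": rmse_on_hidden(complete, imputed, hidden),
            "seconds": elapsed,
        })
    return results


if __name__ == "__main__":
    # Synthetic stand-in data; the thesis's industrial dataset is not available here.
    data_rng = random.Random(42)
    complete = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
    for result in run_experiment(complete, [0.05, 0.10, 0.20], impute_column_mean):
        print(result)
```

    Passing the imputer as a function argument means a random-forest-based imputer could be dropped into the same loop, so that different methods are compared on RMSE and time across the same set of missing rates.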

    TABLE OF CONTENTS
    Abstract (Chinese)
    ABSTRACT
    ACKNOWLEDGEMENTS
    TABLE OF CONTENTS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1 INTRODUCTION
    1.1 Motivation
    1.2 Objectives
    1.3 Business Analytics
    1.4 Research Structure
    CHAPTER 2 LITERATURE REVIEW
    2.1 Industry 4.0
    2.2 Big Data
    2.3 Data Imputation
    2.4 Decision Trees
    2.5 Random Forest
    2.6 Ranger Random Forest
    CHAPTER 3 RESEARCH METHODS
    3.1 Big Data Predictive Analytics
    3.2 Missing Data
    3.3 The Classical Imputation Methods
    3.4 Decision Trees
    3.5 Random Forest Imputation and Random Forest Prediction
    3.6 Ranger Random Forest Imputation and Ranger Random Forest Prediction
    3.7 Forecasting Performance Measure
    CHAPTER 4 COMPUTATIONAL EXPERIMENTS
    4.1 First Experiment
    4.2 Second Experiment
    4.3 Chapter Summary
    CHAPTER 5 CONCLUSIONS AND FUTURE RESEARCH
    5.1 Conclusions
    5.2 Future Research
    REFERENCES

    LIST OF FIGURES
    Figure 1.1 Smart factory operation flowchart.
    Figure 1.2 Discover Artificial Intelligence.
    Figure 1.3 Applications of Artificial Intelligence.
    Figure 1.4 Four types of analytics.
    Figure 1.5 Framework of the research.
    Figure 3.1 Processes of the CART.
    Figure 3.2 The processes of the RF.
    Figure 4.1 System diagram of this thesis.
    Figure 4.2 Processes of first experiment.
    Figure 4.3 Processes of second experiment.

    LIST OF TABLES
    Table 3.1 Three types of data types.
    Table 3.2 Three types of missing data.
    Table 3.3 Algorithm 1: Random Forest Imputation.
    Table 3.4 Algorithm 2: Random Forest Prediction.
    Table 3.5 Algorithm 3: Ranger Random Forest Imputation.
    Table 3.6 Algorithm 4: Ranger Random Forest Prediction.
    Table 4.1 Amount of complete values and missing values.
    Table 4.2 Amount of complete records and records with missing values.
    Table 4.3 Time of partial dataset (sec).
    Table 4.4 RMSE of partial dataset.
    Table 4.5 Time of A1 dataset (sec).
    Table 4.6 Performance comparison of A1 dataset.
    Table 4.7 Performance comparison of A2 dataset.
    Table 4.8 Performance comparison of A3 dataset.
    Table 4.9 Performance comparison of A4 dataset.
    Table 4.10 Performance comparison of A5 dataset.
    Table 4.11 Performance comparison of A6 dataset.


    Full-text release date: 2024/07/15 (off-campus network)
    Full-text release date: 2024/07/15 (National Central Library: Taiwan theses and dissertations system)