Graduate student: 張育榮 (Yu-Jung Chang)
Thesis title: 研究多重輸出預測問題之大數據預測分析中插補值的各種不同前處理方法之比較
On the Study of Various Preprocessing Approaches Comparison to Imputation Data in the Big Data Predictive Analytics for Multiple Prediction Problems
Advisor: 羅士哲 (Shih-Che Lo)
Oral defense committee: 羅士哲 (Shih-Che Lo), 曹譽鐘 (Yu-Chung Tsao), 蔡鴻旭 (Hung-Hsu Tsai)
Degree: Master
Department: 管理學院 - 工業管理系 (Department of Industrial Management)
Year of publication: 2020
Academic year of graduation: 108
Language: Chinese
Number of pages: 56
Keywords (Chinese): 大數據, 缺失值, 插補法, 機器學習, 人工神經網路
Keywords (English): Big Data, Missing Values, Data Imputation, Machine Learning, Artificial Neural Network

隨著時代的演變,科技的發展也越來越進步,進而使得數據的產生越來越大
量而快速,因此產生了大數據這個名詞的誕生。在大數據研究中,遺失值的插補
方法研究是其中一個重要的議題,也就是我們必須在有某些資料集發生有單個或
多個屬性值的缺失或雜訊的情況下,利用適當的方法將其填補,以使得資料集完
整進而得以完成後續的大數據預測分析研究。
本論文將實驗分為兩個階段,第一階段實驗為將初始不完整的資料集中,含
有缺失值的資料刪除後,再使用剩下完整資料集來建立含有不同缺失比率的資料
集,接著利用不同的插補方法包括隨機森林、分類與迴歸樹、預測均值匹配法以
及簡單的統計量來將這些資料集做完整的補值動作,進而分析何種插補方法能夠
在資料比率不同的情況下,比較資料集的還原程度,以及是否能將此結果對應到
接下來的第二階段實驗。第二階段實驗回到初始不完整資料集,利用第一階段實
驗所使用的插補方法補完值後,使用人工神經網路來做後續智慧製造中預測多鑽
頭機台加工品質的動作。實驗結果顯示,藉由加入資料插補方法到資料集中提升
資料品質,大數據預測分析的品質亦能被提高。


In recent years, advances in science and technology have made data larger in volume and faster to generate, which gave rise to the term Big Data. Imputation of missing values is one of the important issues in Big Data research: when a dataset has missing or noisy values in one or more attributes, appropriate methods must be used to fill them in, so that the dataset is complete and the subsequent Big Data predictive analytics can be carried out.
This thesis divides the experiments into two phases. In the first phase, records with missing values are removed from the initial incomplete dataset, and the remaining complete dataset is used to create test datasets with different missing rates. These test datasets are then imputed with several methods, including Random Forest, Classification and Regression Trees, Predictive Mean Matching, and simple statistics. We compare how well each imputation method restores the data under the different missing rates and evaluate whether these results carry over to the second phase. The second phase returns to the original incomplete dataset, imputes its missing values with the methods from the first phase, and then uses an Artificial Neural Network to predict the machining quality of a multi-drill machine station in smart manufacturing. The experimental results show that improving data quality by applying imputation methods to datasets with missing values also improves the quality of Big Data predictive analytics.
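
The first-phase protocol described in the abstract (mask values in a complete dataset at several missing rates, impute them, then measure how well the originals are restored) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy data, the function names `mask_at_random`, `impute_with_statistic`, and `restoration_rmse`, the missing rates 0.1 to 0.3, and the use of RMSE as the restoration measure are not taken from the thesis. Only the "simple statistics" imputers (column mean and median) are shown; Random Forest, Classification and Regression Trees, and Predictive Mean Matching are not implemented here.

```python
import math
import random
import statistics


def mask_at_random(rows, missing_rate, rng):
    """Return a copy of rows with about missing_rate of all cells set to None."""
    masked = [list(row) for row in rows]
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    n_missing = round(missing_rate * len(cells))
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


def impute_with_statistic(rows, statistic):
    """Fill each None with a column statistic computed from the observed values."""
    fill = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        fill.append(statistic(observed))
    return [[fill[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def restoration_rmse(truth, masked, imputed):
    """RMSE between true and imputed values, computed over the masked cells only."""
    squared_errors = [
        (truth[i][j] - imputed[i][j]) ** 2
        for i, row in enumerate(masked)
        for j, v in enumerate(row)
        if v is None
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical complete dataset: 200 records with 3 numeric attributes.
rng = random.Random(0)
complete = [
    [rng.gauss(10, 2), rng.gauss(50, 5), rng.expovariate(1.0)]
    for _ in range(200)
]

# Compare the imputers at several (assumed) missing rates.
results = {}
for rate in (0.1, 0.2, 0.3):
    masked = mask_at_random(complete, rate, rng)
    for name, stat in (("mean", statistics.mean), ("median", statistics.median)):
        imputed = impute_with_statistic(masked, stat)
        results[(rate, name)] = restoration_rmse(complete, masked, imputed)

for (rate, name), rmse in sorted(results.items()):
    print(f"missing rate {rate:.1f}, {name} imputation: RMSE {rmse:.3f}")
```

Because the true values of the masked cells are known, each imputer can be scored directly on the cells it filled in. This is why the first phase starts from complete records rather than from the original incomplete dataset.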

摘要 (Chinese Abstract)
ABSTRACT
ACKNOWLEDGEMENTS
CONTENTS
FIGURES
TABLES
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Objectives
1.3 Research Structure
CHAPTER 2 LITERATURE REVIEW
2.1 Industry 4.0
2.2 Big Data
2.3 Data Imputation
2.4 Artificial Neural Network
CHAPTER 3 RESEARCH METHODS
3.1 Big Data Predictive Analytics
3.2 Missing Data
3.3 The Classical Imputation Methods
3.4 Predictive Mean Matching
3.5 Classification and Regression Trees
3.6 Random Forest
3.7 Artificial Neural Network
3.8 Forecasting Performance Measures
CHAPTER 4 COMPUTATIONAL EXPERIMENTS
4.1 First Experiment
4.2 Second Experiment
CHAPTER 5 CONCLUSIONS AND FUTURE RESEARCH
5.1 Conclusions
5.2 Future Research
REFERENCES

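Full-text release date: 2025/07/19 (on-campus network)
Full-text release date: 2025/07/19 (off-campus network)
Full-text release date: full text not authorized for public release (National Central Library: National Digital Library of Theses and Dissertations in Taiwan)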