Graduate student: | Christiawan Muljono |
---|---|
Thesis title: | 資料驅動觀點處理不平衡分類問題:以半導體製程之晶圓故障檢測為案 / Data-driven Perspective for Handling the Imbalanced Class: Case of Wafer Fault Detection in Semiconductor Manufacturing Process |
Advisors: | 林希偉 Shi-Woei Lin, 李強笙 Chiang-Sheng Lee |
Oral defense committee: | 陳威志 Wei-Chih Chen, 李強笙 Chiang-Sheng Lee, 林希偉 Shi-Woei Lin |
Degree: | 碩士 Master |
Department: | 管理學院 - 工業管理系 Department of Industrial Management |
Year of publication: | 2020 |
Academic year of graduation: | 108 (ROC calendar) |
Language: | English |
Number of pages: | 90 |
Keywords (Chinese): | 半導體 、感測器 、晶圓製造 、重抽樣 、特徵選擇 、分類 、不平衡數據 |
Keywords (English): | semiconductor, sensors, wafer fabrication, resampling, feature selection, classification, imbalanced data |
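The semiconductor industry has become one of the most capital-intensive and technologically advanced industries. A harsh competitive environment forces semiconductor manufacturers to predict wafer fabrication outcomes effectively and raise wafer yield, gaining a competitive edge through low-cost, fast, high-quality products. Today, semiconductor manufacturers can collect large, high-dimensional datasets directly from equipment sensors during wafer fabrication. Although experts and scholars have done considerable research on feature engineering for semiconductor process data, studies that use heuristic algorithms for feature selection remain comparatively scarce. This study therefore adopts a new hybrid algorithm, the Pearson-Binary Sine Cosine Algorithm (P-BSCA), for feature selection, and uses data from a real case to verify that it outperforms other feature selection techniques.
In this study, we jointly evaluate data imputation, feature selection, resampling strategies, and classification methods. The comparative analysis supports using P-BSCA as the feature selection technique, Synthetic Minority Oversampling Technique - Edited Nearest Neighbor (SMOTE-ENN) as the resampling strategy, and logistic regression as the classification model to reach higher evaluation metrics. This research not only provides an analytical framework for imbalanced classification problems, but also offers suggestions that can help process engineers find defects and their root causes faster from a managerial perspective.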
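As a rough illustration of the SMOTE-ENN resampling strategy named above, the following pure-Python sketch oversamples the minority class by interpolating between nearest minority neighbours and then applies edited-nearest-neighbour cleaning. It is not the thesis's implementation: the function names, the toy data, `k=3`, and the choices to oversample to parity and to clean both classes are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def nearest(idx, points, k):
    """Indices of the k points closest to points[idx], excluding itself."""
    order = sorted((j for j in range(len(points)) if j != idx),
                   key=lambda j: math.dist(points[idx], points[j]))
    return order[:k]


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


def edited_nearest_neighbours(X, y, k=3):
    """Drop every sample whose label disagrees with the majority of its k neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity (assumption), then clean with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = max(len(y) - 2 * len(minority), 0)
    synthetic = smote(minority, n_new, k, random.Random(seed))
    X_all = list(X) + synthetic
    y_all = list(y) + [minority_label] * len(synthetic)
    return edited_nearest_neighbours(X_all, y_all, k)


# Toy data (hypothetical): label 0 = pass, label 1 = fail.
# (5.1, 5.1) is a deliberately mislabelled "pass" inside the fail cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (0.8, 0.2),
     (5.1, 5.1), (5, 5), (5, 6), (6, 5), (6, 6)]
y = [0] * 8 + [1] * 4

X_res, y_res = smote_enn(X, y)
print(Counter(y_res))
```

In this toy run the SMOTE step adds minority samples inside the existing minority cluster, and the ENN step removes the mislabelled majority sample sitting inside it. That is the cleaning effect the combination is meant to provide.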
The semiconductor industry has evolved into one of the most capital-intensive and technologically advanced sectors. Its competitive environment pushes semiconductor manufacturers to deliver low-cost, fast, high-quality products, which requires predicting wafer fabrication faults effectively in order to increase wafer yield. Modern semiconductor manufacturers can collect vast amounts of data directly from sensors, creating high-dimensional datasets during wafer fabrication. While considerable attention has been paid to feature engineering for semiconductor manufacturing datasets, little research has investigated metaheuristics as a feature selection method. Therefore, a new hybrid feature selection method, the Pearson-Binary Sine Cosine Algorithm (P-BSCA), is introduced and shown to outperform the other feature selection techniques compared. In this research, we also evaluate different approaches to data imputation, feature selection, resampling, and classification. The comparative analysis supports the configuration of P-BSCA as the feature selection technique, Synthetic Minority Oversampling Technique - Edited Nearest Neighbor (SMOTE-ENN) as the resampling strategy, and logistic regression as the classifier, which yields superior evaluation metrics. This research not only aims to provide a framework, based on the SECOM dataset, for future imbalanced classification work, but also offers guidelines to help process engineers find defects and their root causes faster from a managerial point of view.
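To give a feel for the metaheuristic side of P-BSCA, here is a minimal sketch of a generic binary Sine Cosine Algorithm for feature selection. It is a sketch under stated assumptions, not the thesis's method. The position update follows the standard SCA form, and the sigmoid (S-shaped) transfer function is one common way to binarise it. The Pearson stage of the hybrid is omitted because this record does not describe how it is combined. `binary_sca`, `toy_fitness`, and `RELEVANT` are hypothetical names, and in a real study the fitness would typically be a classifier's validation error.

```python
import math
import random


def binary_sca(fitness, n_features, n_agents=10, n_iter=30, a=2.0, seed=0):
    """Minimise fitness(mask) over 0/1 feature masks (generic binary SCA sketch)."""
    rng = random.Random(seed)
    # Continuous positions; a sigmoid turns each coordinate into P(bit = 1).
    pos = [[rng.uniform(-1, 1) for _ in range(n_features)]
           for _ in range(n_agents)]

    def to_mask(p):
        return [1 if rng.random() < 1 / (1 + math.exp(-v)) else 0 for v in p]

    best_mask, best_fit, best_pos = None, math.inf, None
    for t in range(n_iter):
        # Evaluate every agent and remember the best mask seen so far.
        for p in pos:
            mask = to_mask(p)
            fit = fitness(mask)
            if fit < best_fit:
                best_mask, best_fit, best_pos = mask, fit, list(p)
        r1 = a - t * a / n_iter  # shrinks linearly: exploration -> exploitation
        for p in pos:
            for d in range(n_features):
                r2 = rng.uniform(0, 2 * math.pi)
                r3 = rng.uniform(0, 2)
                wave = math.sin(r2) if rng.random() < 0.5 else math.cos(r2)
                p[d] += r1 * wave * abs(r3 * best_pos[d] - p[d])
                p[d] = max(-6.0, min(6.0, p[d]))  # limit sigmoid saturation
    return best_mask, best_fit


RELEVANT = {0, 3}  # hypothetical indices of the informative features


def toy_fitness(mask):
    """Penalise missed relevant features, plus a small cost per selected feature."""
    missed = sum(1 for i in RELEVANT if not mask[i])
    return missed + 0.1 * sum(mask)


mask, fit = binary_sca(toy_fitness, n_features=6, seed=1)
print(mask, fit)
```

The per-feature cost in `toy_fitness` stands in for the usual trade-off between classification quality and subset size. The search is stochastic, so different seeds can return different masks.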