
Graduate student: Khairul Izyan Bin Anuar
Thesis title: Structure-based virtual sample generation using average-linkage clustering for small dataset problems
Advisor: Chih-Chieh Chang (張智傑)
Oral defense committee: Chien-Wei Ho (何建韋), Yu-Chi Chen (陳昱圻)
Degree: Master
Department: School of Management International (MBA)
Year of publication: 2023
Academic year of graduation: 111 (ROC calendar)
Language: English
Number of pages: 56
Keywords (English): Average Linkage, Virtual Sample Generation, Accuracy Improvements

    Small data sets are difficult to model because their limited sample size tends to cause overfitting and poor generalization performance. This research introduces a novel solution to these problems: average linkage virtual sample generation (ALVSG). ALVSG uses the underlying structure of the data to create virtual samples that augment the original data set.

    The ALVSG process consists of two steps. In the first step, average-linkage clustering is applied to the data set to produce a dendrogram. The dendrogram represents the hierarchical structure of the data set, and each merging operation in it is regarded as a linkage. The linkages are then combined into an average-based data set, which serves as a new representation of the original data. In the second step, virtual samples are generated from the average-based data set: the research project generates 100 virtual samples, distributed uniformly within the provided boundary. These virtual samples are then added to the original data set, producing a larger data set intended to improve the generalization performance of models trained on it.
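The two steps above can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: the abstract does not say how linkages are combined or how the boundary is defined, so the sketch assumes that each linkage is summarised by the mean of the points in the merged cluster and that the boundary is the per-feature min/max range of those means. The function names `average_based_dataset` and `generate_virtual_samples`, and the random `X_small` array, are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def average_based_dataset(X):
    """Return one averaged point per merge of an average-linkage dendrogram.

    Assumption: each linkage is summarised by the mean of all original
    points in the merged cluster (the abstract does not spell this out).
    """
    n = X.shape[0]
    Z = linkage(X, method="average")  # (n - 1) rows, one per merge
    members = {i: [i] for i in range(n)}  # cluster id -> original row indices
    averages = []
    for k, (a, b, _dist, _count) in enumerate(Z):
        # SciPy labels the cluster created by merge k as n + k.
        members[n + k] = members[int(a)] + members[int(b)]
        averages.append(X[members[n + k]].mean(axis=0))
    return np.asarray(averages)


def generate_virtual_samples(X, n_virtual=100, seed=0):
    """Draw virtual samples uniformly inside a per-feature bounding box.

    Assumption: the boundary is the min/max range of the average-based
    data set in each dimension.
    """
    avg = average_based_dataset(X)
    low, high = avg.min(axis=0), avg.max(axis=0)
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=(n_virtual, X.shape[1]))


rng = np.random.default_rng(42)
X_small = rng.normal(size=(12, 3))       # hypothetical small data set
virtual = generate_virtual_samples(X_small)
X_augmented = np.vstack([X_small, virtual])
print(X_augmented.shape)  # (112, 3)
```

In a regression setting the rows of `X_small` would presumably hold both the input attributes and the target, so that each virtual sample carries a target value; that choice is also an assumption of this sketch.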

    The efficacy of the ALVSG approach is validated through resampling experiments and t-tests on two small real-world data sets. The experiments use three forecasting models: the support vector machine for regression (SVR), a deep learning (DL) model, and XGBoost. The results show that the ALVSG approach outperforms the baseline methods in terms of mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE).
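For reference, the three error measures and a t-test comparison could be computed along the following lines. This is a sketch under stated assumptions: the abstract does not say which t-test variant is used, so a paired t-test over resampling rounds (`scipy.stats.ttest_rel`) is assumed here, and the per-round error arrays are synthetic placeholders, not results from the thesis. The helper name `regression_errors` is hypothetical.

```python
import numpy as np
from scipy import stats


def regression_errors(y_true, y_pred):
    """Compute MSE, RMSE and MAE for one set of predictions."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = float(np.mean(err ** 2))
    return {
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
        "MAE": float(np.mean(np.abs(err))),
    }


scores = regression_errors([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(scores)

# Synthetic per-round MSE values standing in for 30 resampling rounds.
# These numbers are placeholders only, not thesis results.
rng = np.random.default_rng(0)
mse_baseline = rng.uniform(0.8, 1.2, size=30)
mse_augmented = mse_baseline - rng.uniform(0.0, 0.2, size=30)

# Paired t-test (assumed variant): the same rounds are scored with and
# without virtual samples, so the two arrays are matched element-wise.
t_stat, p_value = stats.ttest_rel(mse_baseline, mse_augmented)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A paired test is a natural fit when each resampling round is evaluated both with and without virtual samples; an independent two-sample test (`scipy.stats.ttest_ind`) would be the alternative if the rounds were not matched.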

    The findings of this research suggest that ALVSG is a promising approach for addressing the challenges of small data sets: augmenting the original data set with virtual samples can improve generalization performance. The approach is also relatively easy to implement and can be used with various forecasting models.

    Abstract
    Acknowledgment
    1. Introduction
        1.1 Research Background
        1.2 Small Dataset
        1.3 Virtual Sample Generation
        1.4 Purpose of the Research
        1.5 Thesis Structure
    2. Literature Review
        2.1 The Impact of Small Dataset on Forecasting Models
        2.2 Utilisation of Virtual Sample Generation
        2.3 Cluster Analysis
        2.4 Evaluation of Forecasting Models
    3. Methodology
        3.1 Clustering
        3.2 Average-based Dataset
        3.3 Virtual Sample Generation
        3.4 Summary of Steps
    4. Experiment
        4.1 Parameter Settings
        4.2 Case 1: Radiotherapy Treatment of Bladder Cancer
        4.3 Case 2: Multi-layer Ceramic Capacitors (MLCC)
    5. Conclusions
        5.1 Findings
        5.2 Practical Implications
        5.3 Limitation and Future Research
    References
    Appendix

