
Author: Po-Han Chen (陳柏翰)
Title: A Study on New Player Churn Prediction Model Based on K-means and LightGBM
Advisor: Wen-Kai Tai (戴文凱)
Committee Members: Hsing-Kuo Pao (鮑興國), Yao-Xun Chang (章耀勳)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Academic Year of Graduation: 111 (2022-2023)
Language: English
Pages: 58
Keywords (Chinese): free-to-play games, big data, player churn prediction, data analysis, machine learning, clustering algorithms, classification algorithms, LightGBM
Keywords (English): LightGBM
    In recent years, many game developers have adopted free-to-play (F2P) as the main business model for their online and mobile games. An F2P game earns its revenue from players topping up and buying in-game currency (in-app purchase, IAP), so the more players stay in the game, the greater the purchase volume. Developers generally want to win back churning players in time, and the churn rate among new players is especially high. If the players about to churn can be predicted accurately and the causes of churn analyzed, the developer can retain them with corresponding in-game strategies; players who keep playing as a result can effectively increase revenue. The goal of this thesis is to use players' behavioral data for effective data mining and to train machine learning models that produce good prediction results.

    This thesis proposes a big data mining and machine learning framework for new player churn prediction, whose core consists of four stages: (1) Data Pre-Processing, (2) Data Analysis, (3) Machine Learning Model Training, and (4) Prediction Result Analysis. The training data come from the real play logs of a mahjong game; the raw data are processed and trained through the four stages of the framework, and the experimental results are analyzed so that the model can be applied to new player churn prediction. The model training stage combines K-means clustering with several classification algorithms: players with similar features are first grouped into the same cluster, and a classifier is then trained for each cluster, in the hope of improving model performance. The experiments show that the cluster-then-classify models outperform the classification-only models, with K-means combined with the LightGBM classifier performing best. The framework finally predicts which players tend to churn and then analyzes the player behaviors that may drive the churn.
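    A rough sketch of this cluster-then-classify idea, in Python with scikit-learn and LightGBM; the feature matrix X, churn labels y, cluster count, and all hyperparameters below are illustrative assumptions, not the settings used in the thesis:

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.model_selection import train_test_split
        import lightgbm as lgb

        # X: (n_players, n_features) behavioral features; y: 1 = churned, 0 = retained
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=42)

        # Stage 1: group players with similar features into clusters (k = 3 is illustrative)
        kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_train)

        # Stage 2: train one LightGBM classifier per cluster
        models = {}
        for c in range(kmeans.n_clusters):
            member = kmeans.labels_ == c
            models[c] = lgb.LGBMClassifier(n_estimators=200).fit(
                X_train[member], y_train[member])

        # Route each test player to the classifier of its nearest cluster
        test_clusters = kmeans.predict(X_test)
        y_pred = np.empty(len(X_test), dtype=int)
        for c, model in models.items():
            sel = test_clusters == c
            if sel.any():
                y_pred[sel] = model.predict(X_test[sel])

    Training a separate classifier inside each cluster lets each model specialize on one behavioral profile, which is the intuition behind the improvement reported above.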


    In recent years, numerous game companies have given players free access to their online and mobile games (free-to-play, F2P). The revenue of an F2P game comes from in-game purchases (in-app purchase, IAP). However, F2P differs from telecommunication services, where player churn can be easily identified by the user unsubscribing; an F2P player simply stops playing without any explicit signal. Moreover, compared with retaining current players, it is more challenging for a game operator to recruit new players. Therefore, new player churn prediction is worth studying for the game industry.

    In this thesis, we propose a big data mining and churn prediction framework that consists of four steps: (1) Data Pre-Processing, (2) Data Analysis, (3) Machine Learning, and (4) Result and Prediction Analysis. Our research data, containing players' behaviors, come from a mahjong mobile game. The raw data are processed and analyzed in the Data Pre-Processing and Data Analysis steps, respectively, and the analyzed data with key features become the training data for the Machine Learning step. In the Result and Prediction Analysis step, we evaluate the prediction results to find the best model and analyze the behaviors that cause churning. According to our experimental results, classification models combined with K-means clustering outperform the same models without K-means, and LightGBM with K-means performs best, with the highest recall values. Finally, our framework is able to predict which new players tend to churn.
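    A minimal sketch of how such an evaluation might look with scikit-learn, assuming y_test and a model's predicted churn probabilities proba are at hand; the 0.5 threshold and the reuse of one cluster model from the sketch above are assumptions for illustration:

        import numpy as np
        from sklearn.metrics import recall_score, roc_auc_score, average_precision_score

        y_hat = (proba >= 0.5).astype(int)                         # binarize at an assumed threshold
        print("recall :", recall_score(y_test, y_hat))             # share of true churners caught
        print("ROC-AUC:", roc_auc_score(y_test, proba))            # threshold-free ranking quality
        print("PR-AP  :", average_precision_score(y_test, proba))  # more telling under class imbalance

        # LightGBM also exposes per-feature split counts, a quick proxy for
        # which player behaviors drive the churn prediction
        top = np.argsort(models[0].feature_importances_)[::-1][:10]

    Recall is a natural headline metric here: a missed churner is a lost retention opportunity, while a false alarm merely costs an unneeded incentive.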

    Contents
    Abstract in Chinese
    Abstract in English
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    CHAPTER 1 Introduction
        1.1 Background and Motivation
        1.2 Research Goals
        1.3 Overview of Our Method
            1.3.1 Definition of Player Churn
            1.3.2 Definition of Player in-game-time Period
            1.3.3 Framework of Big Data Mining and Churn Prediction Models
        1.4 Contributions
    CHAPTER 2 Related Work
        2.1 Data Pre-Processing
        2.2 Churn Prediction Model
            2.2.1 Supervised Learning
            2.2.2 Unsupervised Learning
        2.3 Result Evaluation and Prediction Analysis
    CHAPTER 3 Method
        3.1 Data Pre-Processing
            3.1.1 Data Cleaning
            3.1.2 Data Integration
            3.1.3 Data Normalization
            3.1.4 Data Labeling
        3.2 Data Analysis
        3.3 Machine Learning
            3.3.1 Train-Test Split
            3.3.2 Machine Learning Model
            3.3.3 Imbalance Data Processing
            3.3.4 Hyperparameters Optimization
            3.3.5 Cross Validation
        3.4 Result Evaluation and Prediction Analysis
    CHAPTER 4 Experiment
        4.1 Experiment Setup
            4.1.1 Data Collection
        4.2 Data Pre-Processing
        4.3 Data Analysis
        4.4 Machine Learning
            4.4.1 Train-Test Split
            4.4.2 Imbalance Data Processing
            4.4.3 Machine Learning Model
        4.5 Result and Prediction Analysis
            4.5.1 ROC-AUC and PR-AP Results
            4.5.2 Feature Importance
    CHAPTER 5 Conclusions
        5.1 Future Work
    References

