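Fighting Game Agent Training Framework Based on Reinforcement Learning and Self-play | National Taiwan University of Science and Technology Electronic Theses and Dissertations System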

Author: 周圓 Yuan Zhou
Thesis title: 基於強化學習與自我對打之格鬥遊戲智能體訓練框架 Fighting Game Agent Training Framework Based on Reinforcement Learning and Self-play
Advisor: 戴文凱 Wen-Kai Tai
Committee members: 陳冠宇 Kuan-Yu Chen, 陳奕廷 Yi-Ting Chen
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Publication year: 2022
Academic year of graduation: 110
Language: Chinese
Pages: 77
Keywords: Machine learning, Reinforcement learning, Fighting game, Self-play, Behavior cloning, Game AI

One-on-one fighting games have played a pivotal role in the history of computer games and still retain a large base of loyal players. Virtual computer players (game AI) have existed since the genre first appeared. Game AI lets players enjoy fighting games in single-player mode, has greatly enriched the ways these games can be played, and has gradually become indispensable to the genre. However, traditional game AI is mostly built either on complex hand-designed rules or on behavior tree algorithms. The former requires substantial domain knowledge from the designer, involves an overly complicated design process, and generally yields weak AI; the latter requires a large amount of time for search-space exploration, so its training cost is too high. This thesis therefore aims to generate high-quality virtual players quickly in fighting games that feature many playable characters.

In this thesis, we use FightingICE as the experimental platform and propose a training framework based on reinforcement learning and self-play. The framework consists of four main parts. (1) Pre-processing: we collect match data from the models of previous competition entrants on the FightingICE platform and convert it into a form that the reinforcement learning model can consume in later stages. (2) Pre-training: we apply a behavior cloning algorithm to the data collected in (1) to perform imitation learning and obtain a pre-trained model. (3) Reinforcement learning: we run experiments with three algorithms, DQN, PPO, and SAC, and compare their performance on FightingICE. We also add rule-based judgment and an action masking mechanism to help accelerate reinforcement learning. (4) Self-play: to enrich the variety of opponent models during training, we let the main model play against older models saved at different training stages, which helps avoid overfitting.
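The action masking mentioned in part (3) can be illustrated with a minimal sketch. The abstract does not say how the mask is built or applied, so everything here is an assumption: the function names `masked_softmax` and `sample_action` are hypothetical, and the mask is simply a list of booleans marking which actions are currently allowed (for example, skills the character has enough energy to use). Invalid actions receive a logit of negative infinity, so they end up with probability zero and are never sampled.

```python
import math
import random


def masked_softmax(logits, valid_mask):
    """Turn raw policy scores into probabilities, zeroing out invalid actions.

    `logits` and `valid_mask` are hypothetical inputs for illustration:
    one score and one boolean per discrete action.
    """
    if len(logits) != len(valid_mask):
        raise ValueError("logits and valid_mask must have the same length")
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")
    # Invalid actions get -inf, which exp() maps to exactly 0.0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, valid_mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, valid_mask, rng=random):
    """Sample an action index from the masked distribution."""
    probs = masked_softmax(logits, valid_mask)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Action 1 is masked out, so only indices 0 and 2 can be drawn.
    print(masked_softmax([2.0, 1.0, 0.5], [True, False, True]))
    print(sample_action([2.0, 1.0, 0.5], [True, False, True]))
```

Masking at the logit level keeps the remaining probabilities normalized, so the agent spends no exploration steps on actions that the game would reject anyway. This is one plausible reason such a mechanism could speed up training.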

A comparison with the models of previous entrants on the FightingICE platform shows that our model outperforms several of them while requiring far less domain knowledge than most. We also examined how self-play training affects generalization: although training against a single opponent model can quickly reach a higher win rate against that opponent, the resulting model generalizes very poorly, and its performance drops sharply against new opponents.
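The difference between training against one fixed opponent and training against older snapshots can be sketched as an opponent pool. This is an illustrative assumption and not the thesis's implementation: `OpponentPool`, `run_self_play`, the pool size, the snapshot interval, and uniform sampling are all hypothetical choices, and a plain dictionary stands in for real network weights.

```python
import random


class OpponentPool:
    """Keeps snapshots of earlier policies and samples one as the opponent."""

    def __init__(self, max_size=10, seed=None):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self.snapshots = []
        self.rng = random.Random(seed)

    def add(self, snapshot):
        """Store a snapshot, discarding the oldest one when the pool is full."""
        self.snapshots.append(snapshot)
        if len(self.snapshots) > self.max_size:
            self.snapshots.pop(0)

    def sample(self):
        """Pick an opponent uniformly at random from the stored snapshots."""
        if not self.snapshots:
            raise LookupError("the opponent pool is empty")
        return self.rng.choice(self.snapshots)


def run_self_play(num_iterations, snapshot_every, pool):
    """Toy loop: returns the version of the opponent faced at each iteration."""
    policy = {"version": 0}  # stand-in for real model weights
    pool.add(dict(policy))   # e.g. the pre-trained starting policy
    faced = []
    for iteration in range(1, num_iterations + 1):
        opponent = pool.sample()
        faced.append(opponent["version"])
        # Placeholder for one reinforcement learning update against `opponent`.
        policy["version"] = iteration
        if iteration % snapshot_every == 0:
            pool.add(dict(policy))  # freeze a copy of the current policy
    return faced


if __name__ == "__main__":
    pool = OpponentPool(max_size=3, seed=0)
    print(run_self_play(num_iterations=20, snapshot_every=5, pool=pool))
```

Setting `max_size=1` with a snapshot that is never refreshed reduces this loop to the single-opponent case described above, which is the setting the abstract reports as generalizing poorly.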

Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
List of Algorithms
1 Introduction
1.1 Research Background and Motivation
1.2 Overview of the Research Method
1.3 Research Contributions
1.4 Structure of This Thesis
2 Literature Review
3 Research Method
3.1 Introduction to the Research Environment
3.2 Behavior Cloning (Pre-training)
3.3 Reinforcement Learning
3.4 Action Masking
3.5 Self-play
4 Experimental Design
4.1 Experimental System Framework
4.2 Imitation Learning Evaluation
4.3 Reinforcement Learning Model Evaluation
4.4 Self-play Mechanism Evaluation
5 Experimental Results and Analysis
5.1 Imitation Learning Results
5.2 Best Model Results
5.3 Self-play Results
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
References
Appendix 1: Data Format of Expert Transition Records
                                

