
Graduate Student: 范祐恩 (Yu-En Fan)
Thesis Title: 以深度強化學習網路玩非對稱遊戲 (Playing Asymmetric Games with Deep Reinforcement Learning)
Advisor: 洪西進 (Shi-Jinn Horng)
Committee Members: 楊竹星 (Chu-Sing Yang), 楊朝棟 (Chao-Tung Yang), 李正吉 (Cheng-Chi Lee)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Publication Year: 2021
Graduation Academic Year: 110 (2021-2022)
Language: Chinese
Number of Pages: 55
Keywords: Deep Learning, Reinforcement Learning
Access Statistics: Views: 321; Downloads: 0
Inspired by other deep reinforcement learning research, this study builds an asymmetric game environment for reinforcement learning agents. In the game, the two teams of agents have different goals to accomplish: the Ghosts need to catch the Humans, while the Humans need to escape from the Ghosts, forming an adversarial game between the two sides. After training through a carefully designed seven-stage curriculum, the agents quickly learn basic strategies and enter the final stage early to develop complex skills: the Ghosts learn to move between rooms to catch Humans, and the Humans learn to complete their tasks and escape while being chased, so that the two teams ultimately hold their own against each other and play out close, exciting matches. We then analyze the agents' movement distance and goal progress to show that the trained networks are capable of achieving their respective goals, and we use the details uncovered during the research to discuss how different modifications affect training; studying these details lets us train high-performance deep reinforcement learning agents better and faster.


    Inspired by other deep reinforcement learning research, we create an asymmetric game environment for reinforcement learning agents. In this game, each team of agents has a different goal to achieve: the Ghosts need to catch the Humans, while the Humans need to escape from the Ghosts, which creates an adversarial game. Training through a carefully designed seven-stage curriculum allows the agents to learn basic strategies quickly and to enter the final stage to develop complex skills as early as possible. The Ghosts learn to move between rooms to catch Humans, and the Humans learn to complete their tasks and escape from the game without being caught. Eventually, the two teams can compete with each other and play out close, exciting games. We then analyze the agents' movement distance and goal progress to show that the trained networks are indeed capable of accomplishing their respective goals. During this research, we also discovered many details of curriculum design that have a large impact on training results; by accounting for these details, we can train deep reinforcement learning agents better and faster.
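    To make the setup described in the abstract concrete, the following is a minimal, hypothetical Python sketch of a two-team pursuit-evasion environment with a stage parameter standing in for a training curriculum. It is not the thesis's actual environment, seven-stage curriculum, or reward design; the arena size, Ghost speed, task mechanic, and reward values below are illustrative assumptions only.

    # Hypothetical toy sketch (not the thesis's code): a two-team pursuit-evasion
    # game with a stage parameter standing in for a training curriculum.
    import random

    class ToyAsymmetricGame:
        def __init__(self, stage=1):
            # Assumed progression: later stages use a larger arena and faster Ghosts.
            self.size = 5 + 2 * stage            # arena is size x size cells
            self.ghost_speed = 1 if stage < 4 else 2
            self.reset()

        def reset(self):
            self.ghost = [0, 0]                  # Ghost starts in one corner
            self.human = [self.size - 1, self.size - 1]
            self.tasks_left = 3                  # Human escapes once all tasks are done
            return self._obs()

        def _obs(self):
            return tuple(self.ghost + self.human + [self.tasks_left])

        def step(self, ghost_move, human_move):
            # Moves are (dx, dy); the Ghost may move several cells in late stages.
            for _ in range(self.ghost_speed):
                self.ghost = [max(0, min(self.size - 1, p + d))
                              for p, d in zip(self.ghost, ghost_move)]
            self.human = [max(0, min(self.size - 1, p + d))
                          for p, d in zip(self.human, human_move)]
            caught = self.ghost == self.human
            if not caught and random.random() < 0.1:   # stand-in for completing a task
                self.tasks_left = max(0, self.tasks_left - 1)
            escaped = self.tasks_left == 0
            # Asymmetric, roughly zero-sum rewards: Ghost wants a catch, Human wants to escape.
            ghost_reward = 1.0 if caught else -0.01
            human_reward = -1.0 if caught else (1.0 if escaped else 0.01)
            done = caught or escaped
            return self._obs(), ghost_reward, human_reward, done

    # Usage: run one episode per "curriculum stage" with random policies.
    def random_move():
        return (random.choice([-1, 0, 1]), random.choice([-1, 0, 1]))

    for stage in range(1, 8):
        env, done = ToyAsymmetricGame(stage), False
        while not done:
            _, ghost_r, human_r, done = env.step(random_move(), random_move())

    The sketch only shows how asymmetric goals and a stage parameter can be wired into a single environment; in the thesis, the random policies in the usage loop are replaced by deep reinforcement learning agents trained stage by stage.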

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
        1.1  Motivation
        1.2  What is deep reinforcement learning
        1.3  What is an asymmetric game
        1.4  Outline
    Chapter 2  Related Work
        2.1  DeepMind Atari
        2.2  AlphaGo
        2.3  OpenAI hide-and-seek
    Chapter 3  Reinforcement Learning for an Asymmetric Game
        3.1  Research environment
        3.2  Game rules
        3.3  Model, hyperparameters, and training method
        3.4  Curriculum design
        3.5  Evolution and other lessons learned
        3.6  Unexpected learning results
    Chapter 4  Experiments
        4.1  The importance of the curriculum
        4.2  The Elo rating system and its derivatives
        4.3  Custom-metric experiments
            4.3.1  Average movement distance
            4.3.2  Ghosts' average goal progress
            4.3.3  Humans' average goal progress
            4.3.4  Idle rate
        4.4  Other experiments
            4.4.1  Optimal batch size and buffer size
            4.4.2  Effect on training of Humans disappearing after completing their tasks
            4.4.3  Effect on training of Ghost presence or absence in stages 2-4
            4.4.4  Effect on training of agent speed
    Chapter 5  Conclusion and Future Work
    References
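    For reference, Section 4.2 of the thesis builds on the Elo rating system; only the standard Elo formulation is recalled here (the thesis's derived variants are not reproduced). Each game yields an expected score for player A, which then drives the rating update:

    E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A)

    where R_A and R_B are the two players' current ratings, S_A \in \{0, 1/2, 1\} is the observed outcome for player A, and K scales the size of each update.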


    Full-text release date: 2024/12/05 (campus network)
    Full text not authorized for public release (off-campus network)
    Full text not authorized for public release (National Central Library: Taiwan NDLTD system)