Graduate Student: 何紹宇 Shao-Yu HE
Thesis Title: 強化式學習模型TD3之影像預測機械臂軌跡運動 (Image Prediction of Robotic Manipulator Trajectory by the Reinforcement Learning Model TD3)
Advisor: 施慶隆 Ching-Long Shih
Committee Members: 黃志良 Chih-Lyang Hwang, 李文猶 Wen-Yo Lee, 吳修明 Hsiu-Ming Wu
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2021
Academic Year of Graduation: 109 (2020-21)
Language: Chinese
Number of Pages: 57
Keywords: robot manipulator, twin delayed deep deterministic policy gradient (TD3), off-policy learning, deep Q-network (DQN)
Abstract:

This thesis implements image-based prediction of robot-manipulator trajectory motion, enabling the manipulator to grasp a target object in an unknown environment. The system consists of two subsystems, a host-side controller and the manipulator driver, whose control programs are written in Python and Verilog, respectively. The host captures the current image as the input to the reinforcement-learning network; the policy output of the model yields a transformation matrix, which is multiplied with the manipulator's homogeneous matrix, and the result is sent through an FT232 module to an FPGA development board that drives the manipulator's motion.
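The host-side loop described above can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the thesis's actual code: the camera interface, `policy.predict`, the multiplication order, the wire format, and the serial-port settings are all hypothetical stand-ins.

```python
import numpy as np
import serial  # pyserial: talks to the FT232 USB-to-serial module

def control_step(policy, camera, T_current, ser):
    """One cycle of the host-side loop: image -> policy -> pose -> FPGA."""
    frame = camera.capture()          # current image observation (hypothetical camera API)
    delta_T = policy.predict(frame)   # 4x4 transformation matrix from the model's policy output
    T_next = T_current @ delta_T      # compose with the manipulator's homogeneous matrix
                                      # (multiplication order is an assumption)
    x, y, z = T_next[:3, 3]           # commanded end-effector position
    ser.write(np.float32([x, y, z]).tobytes())  # assumed wire format for the FPGA side
    return T_next

# Example setup; the port name and baud rate are assumptions:
# ser = serial.Serial('/dev/ttyUSB0', 115200)
```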
TD3 (Twin Delayed Deep Deterministic Policy Gradient) was chosen as the network model because DQN cannot handle continuous control problems and DDPG tends to overestimate Q-values. TD3 is an off-policy method: before training, it collects data from the behavior policy's interaction with the environment, then samples that data to train the neural network. The trained model evaluates the quality of the current policy; a positive policy evaluation indicates that the model can correctly predict the target object's motion trajectory and grasp it.
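For concreteness, here is a minimal sketch of the TD3 update the abstract alludes to, written with PyTorch. The network sizes, hyperparameters (noise scale, delay, tau), and replay-buffer interface are illustrative assumptions, not the thesis's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Deterministic policy: state -> bounded continuous action."""
    def __init__(self, s_dim, a_dim, max_action=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, 256), nn.ReLU(),
                                 nn.Linear(256, a_dim), nn.Tanh())
        self.max_action = max_action
    def forward(self, s):
        return self.max_action * self.net(s)

class Critic(nn.Module):
    """Twin Q-networks; taking the minimum of the two counters overestimation."""
    def __init__(self, s_dim, a_dim):
        super().__init__()
        self.q1 = nn.Sequential(nn.Linear(s_dim + a_dim, 256), nn.ReLU(), nn.Linear(256, 1))
        self.q2 = nn.Sequential(nn.Linear(s_dim + a_dim, 256), nn.ReLU(), nn.Linear(256, 1))
    def forward(self, s, a):
        sa = torch.cat([s, a], dim=1)
        return self.q1(sa), self.q2(sa)

def td3_update(step, batch, actor, actor_t, critic, critic_t, actor_opt, critic_opt,
               gamma=0.99, tau=0.005, noise_std=0.2, noise_clip=0.5,
               policy_delay=2, max_action=1.0):
    # Off-policy: (s, a, r, s', done) tuples are sampled from a replay buffer
    # filled earlier by the behavior policy's interaction with the environment.
    s, a, r, s2, done = batch
    with torch.no_grad():
        # Target policy smoothing: add clipped noise to the target action.
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a2 = (actor_t(s2) + noise).clamp(-max_action, max_action)
        # Clipped double-Q target: the smaller of the two target critics.
        q1_t, q2_t = critic_t(s2, a2)
        y = r + gamma * (1.0 - done) * torch.min(q1_t, q2_t)
    q1, q2 = critic(s, a)
    critic_loss = F.mse_loss(q1, y) + F.mse_loss(q2, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    if step % policy_delay == 0:
        # Delayed policy update: the actor and the target networks move
        # less often than the critics.
        actor_loss = -critic(s, actor(s))[0].mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for net, net_t in ((actor, actor_t), (critic, critic_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```

The target networks `actor_t` and `critic_t` start as deep copies of the online networks. The two ingredients that distinguish TD3 from DDPG are visible here: the minimum over twin target critics counters the Q-value overestimation mentioned above, and the actor is updated only every `policy_delay` critic steps.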