
Graduate student: Shao-Yu HE (何紹宇)
Thesis title: Image Prediction of Robotic Manipulator Trajectory by the Reinforcement Learning Model TD3 (強化式學習模型TD3之影像預測機械臂軌跡運動)
Advisor: Ching-Long Shih (施慶隆)
Committee members: Chih-Lyang Hwang (黃志良), Wen-Yo Lee (李文猶), Hsiu-Ming Wu (吳修明)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2021
Graduation academic year: 109
Language: Chinese
Number of pages: 57
Chinese keywords: robot manipulator, twin delayed deep deterministic policy gradient (TD3) model, off-policy learning, deep Q-network
Foreign-language keywords: Robot Manipulator, TD3, Off-Policy, DQN
    This thesis implements image-based prediction of the trajectory motion of a robot manipulator, which can complete the task of grasping a target object in an unknown environment. The system consists of two subsystems, a host controller and a manipulator driver, whose control programs are written in Python and Verilog, respectively. The host system captures the current image as the input to the reinforcement-learning neural network; the model's policy output yields a transformation matrix, which is multiplied with the manipulator's homogeneous matrix, and the result is finally sent through an FT232 module to the FPGA development board to drive the motion of the manipulator. Because DQN cannot handle continuous control problems and DDPG tends to overestimate Q-values, the TD3 neural network model was chosen. TD3 is an off-policy method: before training, data from the interaction between the behavior policy and the environment are collected, and these data are then sampled to train the neural network; the trained model evaluates how good the current policy is. A positive policy-evaluation value indicates that the model can correctly predict the motion trajectory of the target object and grasp it.
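    The pose-update step described above, in which the policy's output transformation matrix is multiplied with the manipulator's current homogeneous matrix, can be illustrated with a short Python sketch. This is only an assumed illustration of that matrix composition, not the thesis code; the 4x4 matrix layout and the 5 mm translation step are hypothetical.

    import numpy as np

    def next_pose(T_current, T_delta):
        """Compose the current 4x4 homogeneous pose with the policy's 4x4 increment."""
        assert T_current.shape == (4, 4) and T_delta.shape == (4, 4)
        return T_delta @ T_current

    # Example: a pure translation increment of 5 mm along x, no rotation.
    T_now = np.eye(4)
    T_step = np.eye(4)
    T_step[0, 3] = 0.005
    print(next_pose(T_now, T_step))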


    This thesis implements image-based prediction of the trajectory motion of a robot manipulator for the task of grasping a target in an unknown environment. The system consists of two subsystems, the main control terminal and the manipulator driver, whose control programs are written in Python and Verilog, respectively. The control system captures the current image as the input to the reinforcement-learning neural network and obtains a transformation matrix from the output of the model's policy. Finally, the data are sent to the FPGA development board via an FT232 module to drive the movement of the manipulator. Because Deep Q-learning (DQN) cannot solve continuous control problems and the Deep Deterministic Policy Gradient (DDPG) overestimates Q-values, the Twin Delayed Deep Deterministic Policy Gradient (TD3) neural network model is chosen. TD3 is an off-policy method: it first collects data on the interaction between the behavior policy and the environment before training, and then samples these data for neural network training. After training, the model evaluates the quality of the current policy. A positive policy-evaluation value indicates that the model can correctly predict the trajectory of the target and grasp it.
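    The stated reason for choosing TD3 over DQN and DDPG rests on clipped double-Q learning (two critics, take the minimum) and target-policy smoothing. The following PyTorch sketch shows that target computation under assumed network interfaces (critics take a concatenated state-action vector) and illustrative noise scales; it is not the thesis implementation.

    import torch
    import torch.nn as nn

    def td3_target(q1_t, q2_t, actor_t, next_state, reward, done,
                   gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
        """Clipped double-Q target with target-policy smoothing (TD3)."""
        with torch.no_grad():
            mu = actor_t(next_state)
            # Target-policy smoothing: add clipped Gaussian noise to the target action.
            noise = (torch.randn_like(mu) * noise_std).clamp(-noise_clip, noise_clip)
            next_action = (mu + noise).clamp(-max_action, max_action)
            sa = torch.cat([next_state, next_action], dim=-1)
            # Clipped double-Q: the smaller of the two target critics curbs overestimation.
            q_next = torch.min(q1_t(sa), q2_t(sa))
            return reward + gamma * (1.0 - done) * q_next

    # Toy usage: state dim 4, action dim 2, batch of 8 transitions sampled from a replay buffer.
    actor = nn.Sequential(nn.Linear(4, 2), nn.Tanh())
    q1 = nn.Linear(6, 1)
    q2 = nn.Linear(6, 1)
    s, r, d = torch.zeros(8, 4), torch.zeros(8, 1), torch.zeros(8, 1)
    print(td3_target(q1, q2, actor, s, r, d).shape)  # torch.Size([8, 1])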

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1.1  Research Motivation
      1.2  Literature Review
      1.3  Thesis Outline
    Chapter 2  Manipulator System Architecture
      2.1  System Architecture
      2.2  Hardware Overview
        2.2.1  Six-Axis Manipulator
        2.2.2  Manipulator Base
        2.2.3  FPGA Development Board
      2.3  Servo Motor Control
        2.3.1  Communication Packet Format
        2.3.2  Frequency Divider
        2.3.3  PWM Signal Generator
    Chapter 3  Manipulator Motion Control
      3.1  Forward Kinematics
      3.2  Inverse Kinematics
      3.3  Manipulator Workspace Constraints
      3.4  Motion Trajectory Matrix
    Chapter 4  Reinforcement Learning Trajectory Prediction
      4.1  Introduction to the TD3 Model
        4.1.1  Actor-Critic Algorithm
        4.1.2  Q-learning Algorithm
        4.1.3  TD3 (Twin Delayed DDPG) Algorithm
      4.2  Neural Network Architecture
      4.3  Building the Manipulator Environment with Gym
        4.3.1  Reward Function Design
        4.3.2  Constructing the Manipulator Environment
      4.4  Replay Buffer Construction
    Chapter 5  Experimental Results and Analysis
      5.1  Experimental Scene
      5.2  Experimental Procedure
      5.3  Single-Target Reinforcement Learning Experiment
        5.3.1  Reward Design
        5.3.2  Experimental Data Collection
        5.3.3  Training on Experimental Data
        5.3.4  Experimental Results
      5.4  Multi-Object Reinforcement Learning Experiment
        5.4.1  Reward Design
        5.4.2  Experimental Data Collection
        5.4.3  Experimental Data
        5.4.4  Experimental Results
    Chapter 6  Conclusions and Suggestions
      6.1  Conclusions
      6.2  Suggestions
    References


    Full-text release date: 2023/08/05 (campus network)
    Full-text release date: 2025/08/05 (off-campus network)
    Full-text release date: 2025/08/05 (National Central Library: Taiwan Dissertations and Theses System)