
Author: Ren-Bin Lai (賴仁斌)
Thesis Title: Application of deep learning based on RGB image and point cloud to 6D pose estimation (基於RGB圖像與點雲圖之深度學習整合應用於六維姿態估測)
Advisor: Ming-Jong Tsai (蔡明忠)
Committee Members: Ming-Jong Tsai (蔡明忠), Yong-Lin Kuo (郭永麟), Chao-Chi Chan (詹朝基), Chan-Yun Yang (楊棧雲)
Degree: Master
Department: Graduate Institute of Automation and Control, College of Engineering
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar)
Language: Chinese
Number of Pages: 80
Keywords (Chinese): 六維姿態估測, 點雲圖, 深度學習, YOLO, 水五金零件
Keywords (English): 6D pose estimation, point cloud, deep learning, YOLO, water hardware parts
Abstract (translated from the Chinese):
    The 6D pose estimation of objects has become an important task in robot grasping applications. Existing methods mainly acquire object images through vision and determine position and orientation by image processing and mathematical computation based on object features. With the development of artificial intelligence, many studies have adopted deep learning for 6D object pose estimation. To improve model efficiency or reduce vision hardware cost, many methods predict from the RGB images of a 2D camera. Although RGB images already achieve fairly good estimation, they lack part of the geometric information, and misjudgments can still occur in situations such as multi-object stacking, occlusion, or similar textures. This study integrates the deep learning methods YOLO, PointNet, and Gen6D for 6D pose estimation of water hardware parts, dividing the whole process into two stages. In the first stage, YOLO detects the object and produces a bounding box, replacing the object proposals from Gen6D's detector; the target object is then segmented according to this bounding box and converted into a local point cloud that is fed into PointNet for noise filtering and point cloud feature extraction. In the second stage, the point cloud features extracted by PointNet are used to apply a scale correction to the result estimated by Gen6D, and the Euler angles are obtained through pose regression, finally yielding the object's 6D pose. By using YOLO to adjust Gen6D's detector, this study overcomes Gen6D's poor detection in scenes with multiple objects, mixed parts, or no object. In addition, a depth camera provides the real object depth in place of the depth value that Gen6D derives by mathematical computation, giving a more accurate predicted position, and the local transformation approach effectively reduces the training load of PointNet and improves training efficiency. Experiments with three different kinds of water hardware parts show that the method can be applied to scenes of mixed, stacked parts. Verified with a self-made fixture, the average detection errors of the Cartesian coordinates X, Y, and Z are ±1.07 mm, ±1.56 mm, and ±2.75 mm, respectively. With the target placed in different orientations, the average errors of the estimated angles are ±4.91° for Rx, ±10.56° for Ry, and ±2.37° for Rz.
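The first-stage step of converting a detector bounding box into a local point cloud corresponds to the standard pinhole back-projection of the depth pixels inside the box. Below is a minimal sketch of that step, assuming a depth image aligned to the RGB frame and known camera intrinsics (fx, fy, cx, cy); the function name, the NumPy implementation, and the fixed-size subsampling are illustrative assumptions, not the thesis's actual code.

    import numpy as np

    def bbox_to_local_point_cloud(depth_m, bbox, fx, fy, cx, cy, max_points=1024):
        """Back-project the depth pixels inside a bounding box into a local
        point cloud in the camera frame (pinhole model). Illustrative sketch."""
        x_min, y_min, x_max, y_max = [int(v) for v in bbox]
        crop = depth_m[y_min:y_max, x_min:x_max]   # depth in meters, 0 = no return

        # Pixel coordinates of the crop, expressed in full-image coordinates.
        vs, us = np.mgrid[y_min:y_max, x_min:x_max]
        valid = crop > 0                           # drop pixels without a depth value
        z = crop[valid]
        u = us[valid]
        v = vs[valid]

        # Pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points = np.stack([x, y, z], axis=1)

        # Subsample to a fixed size so a PointNet-style network receives a
        # constant-length input.
        if len(points) > max_points:
            idx = np.random.choice(len(points), max_points, replace=False)
            points = points[idx]
        return points

Restricting the back-projection to the detected bounding box is what keeps the point cloud "local" and the PointNet input small, which is the reduction in training load that the abstract refers to.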


Abstract (English):
    6D pose estimation of objects has become a crucial task in robot grasping applications, and many researchers have adopted deep learning approaches for it. Although RGB images yield considerable estimation accuracy, they lack certain geometric information, particularly in scenarios involving multiple stacked objects, occlusions, or similar textures, which can still lead to misjudgments. This study integrates the deep learning methods YOLO, PointNet, and Gen6D for the 6D pose estimation of water hardware parts. The process is divided into two stages. The first stage employs YOLO to recognize objects and obtain bounding boxes, replacing the objects predicted by Gen6D's detector. The bounding boxes are then used to segment the target object, which is converted into local point cloud information and fed into PointNet for noise filtering and feature extraction. In the second stage, the point cloud features extracted by PointNet are combined with the results estimated by Gen6D for scale calibration, and Euler angles are derived through pose regression, ultimately determining the object's 6D pose. Using YOLO to adjust Gen6D's detector addresses its poor performance in scenes with multiple objects, mixed objects, or no objects. Additionally, real object depths from a depth camera are used instead of mathematically derived depths, improving positional accuracy, and a local transformation approach mitigates PointNet's training load and improves training efficiency. Using three different types of water hardware parts, the study demonstrates applicability to scenes with mixed and stacked parts. The average detection errors for the Cartesian coordinates X, Y, and Z were within ±1.07 mm, ±1.56 mm, and ±2.75 mm, respectively. For the target object placed in various orientations, the average estimation errors for Rx, Ry, and Rz were ±4.91°, ±10.56°, and ±2.37°, respectively.
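The second stage's depth-based scale correction and Euler-angle output can be sketched as follows. This is a hedged illustration rather than the thesis code: it assumes the RGB-only translation from Gen6D is rescaled so its Z component matches the depth-camera measurement, and that Rx/Ry/Rz follow an R = Rz·Ry·Rx convention; the function names and the convention are assumptions.

    import numpy as np

    def correct_translation_with_depth(t_est, z_measured):
        """Rescale an RGB-only translation estimate so that its depth matches
        the depth-camera measurement. A pinhole projection fixes direction but
        not distance, so scaling the whole vector by z_measured / z_est keeps
        the same image-plane projection while correcting the metric depth."""
        return t_est * (z_measured / t_est[2])

    def rotation_matrix_to_euler_xyz(R):
        """Recover (Rx, Ry, Rz) in degrees from a rotation matrix, assuming
        the composition R = Rz @ Ry @ Rx (rotation about X applied first)."""
        ry = np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0]))
        if np.isclose(np.cos(ry), 0.0):        # gimbal lock: Ry is +/-90 degrees
            rx = np.arctan2(-R[1, 2], R[1, 1])
            rz = 0.0
        else:
            rx = np.arctan2(R[2, 1], R[2, 2])
            rz = np.arctan2(R[1, 0], R[0, 0])
        return np.degrees([rx, ry, rz])

Angles expressed in this form are directly comparable to the ±4.91°, ±10.56°, and ±2.37° average errors reported in the abstract.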

Table of Contents:
    Acknowledgements I
    Abstract (Chinese) II
    Abstract (English) III
    Table of Contents IV
    List of Figures VII
    List of Tables X
    Chapter 1 Introduction 1
        1.1 Research Background and Motivation 1
        1.2 Research Objectives 2
        1.3 Research Contributions 2
        1.4 Thesis Organization 3
    Chapter 2 Literature Review and Technical Background 4
        2.1 Literature Review of 6D Pose Estimation 4
            2.1.1 Studies on Traditional Feature Extraction 5
            2.1.2 RGB-Based Deep Learning Studies 5
            2.1.3 RGB-D-Based Deep Learning Studies 7
        2.2 Image Formats 8
            2.2.1 Bitmap Images 8
            2.2.2 Point Clouds 9
        2.3 Image Processing 12
            2.3.1 Coordinate System Transformation 12
        2.4 Deep Learning Applications 18
            2.4.1 Object Detection 18
            2.4.2 Image Segmentation 20
        2.5 6D Pose Estimation 21
            2.5.1 The Gen6D Model 23
    Chapter 3 System Architecture and Research Methods 26
        3.1 System Architecture 26
            3.1.1 Workflow 27
            3.1.2 Image Acquisition Environment 29
        3.2 6D Pose Estimation 32
            3.2.1 Model Usage 32
            3.2.2 Model Adjustment 34
            3.2.3 Limitations of Gen6D for Stacked-Object Grasping 36
        3.3 Object Detection Adjustment 37
            3.3.1 Model Usage 37
            3.3.2 Self-Made Dataset 39
            3.3.3 Filters 41
        3.4 Point Cloud Feature Extraction 43
            3.4.1 Model Usage 43
            3.4.2 Local Point Clouds 46
        3.5 Vision Coordinate Calibration 49
    Chapter 4 Experimental Results and Analysis 52
        4.1 Object Detection Results 52
        4.2 Point Cloud Segmentation Results 57
        4.3 6D Pose Estimation Results 59
            4.3.1 Complete Pipeline 59
            4.3.2 Multi-Object Detection Verification 62
            4.3.3 Accuracy Verification 62
            4.3.4 Result Analysis 71
    Chapter 5 Conclusions and Future Research Directions 74
        5.1 Conclusions 74
        5.2 Future Research Directions 75
    References 76

References:
    [1] 呂尚杰, 江博通, and 張俊隆, "Introduction to the application of vision-guided robots to pick-and-place of metal products" (in Chinese), 機械工業雜誌, no. 362, pp. 26–33, 2013.
    [2] 林宇宸, "Robotic random bin-picking system based on 3D vision and pose estimation algorithms" (in Chinese), Master's thesis, Graduate Institute of Automation and Control, National Taiwan University of Science and Technology, Taipei, 2021. [Online]. Available: https://hdl.handle.net/11296/8gmmns
    [3] 孫永富, "A study on building stereo vision for industrial robots with a low-cost stereo camera and its application to bulk part picking" (in Chinese), Master's thesis, Graduate Institute of Automation and Control, National Taiwan University of Science and Technology, Taipei, 2019. [Online]. Available: https://hdl.handle.net/11296/8d6ybt
    [4] 邱威堯, 謝伯璜, 張津魁, 呂尚杰, and 張俊隆, "Introduction to pick-and-place techniques for stacked objects" (in Chinese), 機械工業雜誌, no. 362, pp. 20–25, 2013.
    [5] 呂尚杰, 邱威堯, 江博通, and 張俊隆, "Introduction to pick-and-place techniques for stacked water hardware parts" (in Chinese), 機械工業雜誌, no. 374, pp. 57–64, 2014.
    [6] B. Drost, M. Ulrich, N. Navab, and S. Ilic, “Model globally, match locally: Efficient and robust 3D object recognition,” 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Jun. 2010, doi: 10.1109/cvpr.2010.5540108.
    [7] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes,” Robotics: Science and Systems XIV, Jun. 2018, doi: 10.15607/rss.2018.xiv.019.
    [8] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998, doi: 10.1109/5.726791.
    [9] M. A. Dede and Y. Genc, “Object aspect classification and 6DoF pose estimation,” Image and Vision Computing, vol. 124, p. 104495, Aug. 2022, doi: 10.1016/j.imavis.2022.104495.
    [10] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield, "Deep object pose estimation for semantic robotic grasping of household objects," arXiv preprint arXiv:1809.10790, 2018.
    [11] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2019, doi: 10.1109/cvpr.2019.00469.
    [12] J. Cheng, P. Liu, Q. Zhang, H. Ma, F. Wang, and J. Zhang, “Real-Time and Efficient 6-D Pose Estimation From a Single RGB Image,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–14, 2021, doi: 10.1109/tim.2021.3115564.
    [13] M. A. Yilmaz and A. M. Tekalp, “DFPN: Deformable Frame Prediction Network,” 2021 IEEE International Conference on Image Processing (ICIP), Sep. 2021, doi: 10.1109/icip42928.2021.9506210.
    [14] C.-Y. Wang, H.-Y. Mark Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, and I.-H. Yeh, “CSPNet: A New Backbone that can Enhance Learning Capability of CNN,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jun. 2020, doi: 10.1109/cvprw50498.2020.00203.
    [15] S. Hinterstoisser et al., “Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes,” Lecture Notes in Computer Science, pp. 548–562, 2013, doi: 10.1007/978-3-642-37331-2_42.
    [16] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother, “Learning 6D Object Pose Estimation Using 3D Object Coordinates,” Lecture Notes in Computer Science, pp. 536–551, 2014, doi: 10.1007/978-3-319-10605-2_35.
    [17] H. Zhang et al., “A Practical Robotic Grasping Method by Using 6-D Pose Estimation With Protective Correction,” IEEE Transactions on Industrial Electronics, vol. 69, no. 4, pp. 3876–3886, Apr. 2022, doi: 10.1109/tie.2021.3075836.
    [18] 王仁蔚, 李佳蓮, 陳健倫, 王俞程, 鄭詠珈, and 李志中, "Applying view-based experience transfer to robotic-arm grasping of randomly stacked objects" (in Chinese), 機械工業雜誌, no. 461, pp. 39–51, 2021.
    [19] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” 2017 IEEE International Conference on Computer Vision (ICCV), Oct. 2017, doi: 10.1109/iccv.2017.322.
    [20] S.-K. Huang, C.-C. Hsu, W.-Y. Wang, and C.-H. Lin, “Iterative Pose Refinement for Object Pose Estimation Based on RGBD Data,” Sensors, vol. 20, no. 15, p. 4114, Jul. 2020, doi: 10.3390/s20154114.
    [21] J. Wang, L. Qiu, G. Yi, S. Zhang, and Y. Wang, “Multiple geometry representations for 6D object pose estimation in occluded or truncated scenes,” Pattern Recognition, vol. 132, p. 108903, Dec. 2022, doi: 10.1016/j.patcog.2022.108903.
    [22] D.-C. Hoang, J. A. Stork, and T. Stoyanov, “Voting and Attention-Based Pose Relation Learning for Object Pose Estimation From 3D Point Clouds,” IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 8980–8987, Oct. 2022, doi: 10.1109/lra.2022.3189158.
    [23] Z. Ding, X. Han, and M. Niethammer, “VoteNet: A Deep Learning Label Fusion Method for Multi-atlas Segmentation,” Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, pp. 202–210, 2019, doi: 10.1007/978-3-030-32248-9_23.
    [24] T. Hodaň et al., “BOP: Benchmark for 6D Object Pose Estimation,” Lecture Notes in Computer Science, pp. 19–35, 2018, doi: 10.1007/978-3-030-01249-6_2.
    [25] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, Jun. 2017, doi: 10.1109/tpami.2016.2577031.
    [26] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature Pyramid Networks for Object Detection,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, doi: 10.1109/cvpr.2017.106.
    [27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, doi: 10.1109/cvpr.2016.91.
    [28] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
    [29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, doi: 10.1109/cvpr.2016.90.
    [30] C. Szegedy et al., “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, doi: 10.1109/cvpr.2015.7298594.
    [31] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, doi: 10.1109/cvpr.2017.16.
    [32] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, "PointNet++: Deep hierarchical feature learning on point sets in a metric space," Advances in Neural Information Processing Systems, vol. 30, 2017.
    [33] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic Graph CNN for Learning on Point Clouds,” ACM Transactions on Graphics, vol. 38, no. 5, pp. 1–12, Oct. 2019, doi: 10.1145/3326362.
    [34] A. Kendall, M. Grimes, and R. Cipolla, “PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization,” 2015 IEEE International Conference on Computer Vision (ICCV), Dec. 2015, doi: 10.1109/iccv.2015.336.
    [35] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox, “DeepIM: Deep Iterative Matching for 6D Pose Estimation,” International Journal of Computer Vision, vol. 128, no. 3, pp. 657–678, Nov. 2019, doi: 10.1007/s11263-019-01250-9.
    [36] J. L. Schonberger and J.-M. Frahm, “Structure-from-Motion Revisited,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, doi: 10.1109/cvpr.2016.445.
    [37] Y. Liu et al., “Gen6D: Generalizable Model-Free 6-DoF Object Pose Estimation from RGB Images,” Computer Vision – ECCV 2022, pp. 298–315, 2022, doi: 10.1007/978-3-031-19824-3_18.
    [38] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," arXiv preprint arXiv:2207.02696, 2022.
    [39] T. Hodan et al., “Photorealistic Image Synthesis for Object Instance Detection,” 2019 IEEE International Conference on Image Processing (ICIP), Sep. 2019, doi: 10.1109/icip.2019.8803821.
    [40] Intel Corp., "Intel® RealSense™ Camera 400 Series (DS5) Product Family," Datasheet, 5th ed., 2019, ch. 3, sec. 8, pp. 50–51. [Online]. Available: https://www.intel.com/content/dam/support/us/en/documents/emerging-technologies/intel-realsense-technology/Intel-RealSense-D400-Series--Datasheet.pdf

    Full text release date: 2026/08/23 (off-campus access)
    Full text release date: 2026/08/23 (National Central Library: Taiwan NDLTD system)