
Graduate Student: Chi-Hung Wu (吳基鴻)
Thesis Title: A Real-time CNN-based 2D Hand Joint Estimation from Monocular RGB Frame (基於卷積神經網路以單張RGB影像即時二維手部關節點估測系統)
Advisor: Nai-Jian Wang (王乃堅)
Committee Members: Shun-Feng Su (蘇順豐), Shun-Ping Chung (鍾順平), Shyue-Kung Lu (呂學坤), Jing-Ming Guo (郭景明), Nai-Jian Wang (王乃堅)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2020
Graduation Academic Year: 108
Language: Chinese
Pages: 66
Keywords (Chinese, translated): convolutional neural network, 2D hand joint location, soft-argmax, depth-wise separable convolution block, real-time
Keywords (English): CNNs, 2D hand joint localization, soft-argmax, depth-wise separable convolution blocks, real-time
Owing to their dexterity, hands are among the primary body parts humans use to manipulate objects large and small in daily life. Consequently, there is growing interest in extracting hand locations and poses from streaming video, in the belief that such information can enrich today's human-computer interaction. With the success of convolutional neural networks, researchers have applied them to nearly every area of machine vision and achieved breakthrough results.
In this thesis, we propose a lightweight, two-stage convolutional neural network that extracts 2D hand joint locations from a single RGB image. The pipeline first detects each hand region in the image and then estimates the locations of the 21 finger joints within each region. In the first-stage detection network, we adopt an asymmetric architecture to reduce parameters and computation; in the second-stage estimation network, we append a soft-argmax operation to the last layer of the CNN so that the network regresses the 2D joint coordinates directly. The most important step in making the network lightweight is replacing standard convolution filters with depth-wise separable convolution blocks, which allows the network to run in real time even on devices with limited computational power.
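As a rough illustration of why this substitution reduces the model size (a sketch with hypothetical layer sizes, not the thesis's actual configuration), the parameter counts of a standard convolution and a depth-wise separable block can be compared directly:

```python
# Parameter counts for a standard convolution vs. a depth-wise separable
# block (a depth-wise KxK convolution followed by a 1x1 point-wise
# convolution), biases omitted for simplicity.

def standard_conv_params(k, c_in, c_out):
    # Each of the c_out filters spans k x k x c_in weights.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depth-wise: one k x k filter per input channel.
    # Point-wise: a 1 x 1 x c_in filter per output channel.
    return k * k * c_in + c_in * c_out

# Hypothetical layer: 3x3 kernel, 128 input and 128 output channels.
std = standard_conv_params(3, 128, 128)        # 147456 weights
sep = depthwise_separable_params(3, 128, 128)  # 17536 weights
print(std, sep, round(std / sep, 1))           # roughly an 8x reduction
```

The same factorization also cuts the multiply-accumulate count by about the same factor, which is what makes real-time inference on weaker hardware plausible.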
Evaluated on the public Rendered Handpose Dataset, the proposed architecture achieves a mean EPE of 17.4, matching or exceeding the accuracy of other published methods on 2D hand joint localization.


Hands can be regarded as humans' primary tools for manipulation due to their flexibility. Therefore, people are interested in hand location and posture in streaming video and believe that such information can make human-computer interaction (HCI) more advanced. Since the success of convolutional neural networks (CNNs), researchers have applied them to almost every computer vision field and achieved breakthrough results.
In this thesis, we present an approach that estimates 2D hand joint locations from a monocular RGB frame, based on a two-stage lightweight CNN. The two-stage approach first predicts accurate hand regions and then localizes 21 hand joints per hand region. In the first stage, we use an asymmetric architecture to reduce the number of parameters. In the second stage, we add a soft-argmax operation after the last layer of the convolutional neural network, which makes the network predict the joint coordinates directly. More importantly, we use depth-wise separable convolution blocks instead of standard convolution filters, so the proposed approach achieves real-time performance even on devices with less computational power.
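The soft-argmax step can be sketched as follows: a spatial softmax turns the network's final heatmap into a probability map, and the expected (x, y) index under that map is the predicted coordinate. This is a minimal pure-Python version for a single heatmap; the thesis's exact formulation (e.g. any temperature or coordinate scaling) is not reproduced here.

```python
import math

def soft_argmax_2d(heatmap):
    """Differentiable expected (x, y) coordinate of a 2D heatmap.

    Applies a spatial softmax over all cells, then returns the
    probability-weighted average of the column (x) and row (y) indices.
    """
    m = max(max(row) for row in heatmap)  # subtract max for numerical stability
    exp = [[math.exp(v - m) for v in row] for row in heatmap]
    total = sum(sum(row) for row in exp)
    rows, cols = len(exp), len(exp[0])
    x = sum(c * exp[r][c] for r in range(rows) for c in range(cols)) / total
    y = sum(r * exp[r][c] for r in range(rows) for c in range(cols)) / total
    return x, y

# A heatmap strongly peaked at column 2, row 1 yields coordinates near (2, 1).
hm = [[0.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 8.0, 0.0],
      [0.0, 0.0, 0.0, 0.0]]
print(soft_argmax_2d(hm))
```

Unlike a hard argmax, every weight contributes to the output, so the operation stays differentiable and the coordinates can be supervised with an ordinary regression loss.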
Tested on the public Rendered Handpose Dataset, our approach achieves a mean EPE of 17.4, matching or outperforming the accuracy of state-of-the-art approaches on 2D hand joint localization.
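Mean end-point error (EPE) is the average Euclidean distance between predicted and ground-truth joint positions. A minimal sketch for a single set of 2D joints follows; how the thesis averages across joints and frames is assumed here, not taken from the text.

```python
import math

def mean_epe(pred, gt):
    """Mean end-point error: average Euclidean distance between
    corresponding predicted and ground-truth 2D joint coordinates."""
    assert len(pred) == len(gt)
    dists = [math.hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(dists) / len(dists)

# Two hypothetical joints: one off by a 3-4-5 right triangle, one exact.
pred = [(10.0, 10.0), (20.0, 20.0)]
gt   = [(13.0, 14.0), (20.0, 20.0)]
print(mean_epe(pred, gt))  # (5.0 + 0.0) / 2 = 2.5
```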

Abstract (Chinese)
Abstract (English)
Acknowledgments
Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background and Motivation
  1.2 Literature Review
  1.3 Objectives
  1.4 Thesis Organization
Chapter 2 Neural Networks
  2.1 Artificial Neural Networks
    2.1.1 Feed-forward Neural Networks
    2.1.2 Effect of Training Data on Network Performance
  2.2 Convolutional Neural Networks
    2.2.1 Standard Convolution Filters
    2.2.2 Depth-wise Separable Convolutional Blocks
    2.2.3 Activation Functions
    2.2.4 Max-pooling
    2.2.5 U-Net
  2.3 Dropout
  2.4 Adam (Adaptive Moment Estimation)
Chapter 3 Hand Joint Location Estimation System
  3.1 System Architecture
  3.2 Hand Region Detection Network
    3.2.1 Asymmetric Architecture
    3.2.2 Instance Normalization
    3.2.3 Bilinear Interpolation
    3.2.4 Loss Function of the Hand Region Detection Network
    3.2.5 Intersection over Union
  3.3 Finger Joint Location Estimation Network
    3.3.1 Obtaining the Square Hand Bounding Box
    3.3.2 Soft-argmax
    3.3.3 Loss Function of the Finger Joint Estimation Network
    3.3.4 Mean End-Point Error
Chapter 4 Experimental Results
  4.1 Rendered Handpose Dataset
  4.2 Model Training
    4.2.1 Data Augmentation
    4.2.2 Training of the Hand Region Detection Network
    4.2.3 Training of the Finger Joint Estimation Network
  4.3 Test Results
    4.3.1 Validation of the Hand Region Detection Network
    4.3.2 Validation of the Finger Joint Estimation Network
    4.3.3 System Validation
    4.3.4 Comparison with Other Works
Chapter 5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References

[1] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1831–1840.
[2] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
[3] S. Yuan, G. Garcia-Hernando, B. Stenger, G. Moon, J. Yong Chang, K. Mu Lee, P. Molchanov, J. Kautz, S. Honari, L. Ge, et al., “Depth-based 3D hand pose estimation: From current achievements to future goals,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2636–2645.
[4] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz, “Hand pose estimation via latent 2.5D heatmap regression,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 118–134.
[5] C. Zimmermann and T. Brox, “Learning to estimate 3D hand pose from single RGB images,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4903–4911.
[6] P. Panteleris, I. Oikonomidis, and A. Argyros, “Using a single RGB frame for real time 3D hand pose estimation in the wild,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 436–445.
[7] Y. Cai, L. Ge, J. Cai, and J. Yuan, “Weakly-supervised 3D hand pose estimation from monocular RGB images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 666–682.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[10] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, “Searching for MobileNetV3,” arXiv preprint arXiv:1905.02244, 2019.
[11] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[12] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[13] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[15] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[16] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
[17] Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.
[18] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” arXiv preprint arXiv:1512.09300, 2015.
[19] A. Nibali, Z. He, S. Morgan, and L. Prendergast, “Numerical coordinate regression with convolutional neural networks,” arXiv preprint arXiv:1801.07372, 2018.
[20] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields,” arXiv preprint arXiv:1812.08008, 2018.
[21] F. Gouidis, P. Panteleris, I. Oikonomidis, and A. Argyros, “Accurate hand keypoint localization on mobile devices,” in 2019 16th International Conference on Machine Vision Applications (MVA), 2019, pp. 1–6.
[22] J. Zhang, J. Jiao, M. Chen, L. Qu, X. Xu, and Q. Yang, “3D hand pose tracking and estimation using stereo matching,” arXiv preprint arXiv:1610.07214, 2016.

Full text available from: 2025/07/24 (campus network)
Full text available from: 2025/07/24 (off-campus network)
Full text available from: 2025/07/24 (National Central Library: NDLTD in Taiwan)