Graduate Student: 吳基鴻 (Chi-Hung Wu)
Thesis Title: 基於卷積神經網路以單張RGB影像即時二維手部關節點估測系統 (A Real-time CNN-based 2D Hand Joint Estimation from Monocular RGB Frame)
Advisor: 王乃堅 (Nai-Jian Wang)
Committee Members: 蘇順豐 (Shun-Feng Su), 鍾順平 (Shun-Ping Chung), 呂學坤 (Shyue-Kung Lu), 郭景明 (Jing-Ming Guo), 王乃堅 (Nai-Jian Wang)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electrical Engineering
Year of Publication: 2020
Academic Year: 108
Language: Chinese
Pages: 66
Keywords (Chinese): 卷積神經網路, 二維手部關節點位置, soft-argmax, 深度可分離卷積模組, 即時
Keywords (English): CNNs, 2D hand joint localization, soft-argmax, depth-wise separable convolution blocks, real-time
The hand, thanks to its flexibility, serves as one of the primary tools with which humans manipulate objects large and small. Consequently, there is growing interest in extracting hand location and posture from streaming video, in the belief that such information can enrich human-computer interaction (HCI). Since the success of convolutional neural networks (CNNs), researchers have applied them to almost every computer vision task and achieved breakthrough results.
In this thesis, we present a two-stage lightweight CNN that estimates 2D hand joint locations from a monocular RGB frame. The first stage detects the hand regions in the image, and the second stage localizes 21 hand joints within each region. The first-stage detection network adopts an asymmetric architecture to reduce parameters and computation. In the second-stage estimation network, we append a soft-argmax operation after the last convolutional layer so that the network regresses the 2D joint coordinates directly. Most importantly, we replace standard convolution filters with depth-wise separable convolution blocks, which lets the proposed approach run in real time even on devices with limited computational power.
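The soft-argmax operation appended after the last layer can be sketched as follows: a softmax turns the heatmap into a probability distribution over pixels, and the predicted coordinate is the expectation of the pixel grid under that distribution, which keeps the whole pipeline differentiable. This is a minimal NumPy sketch of the general technique, not the thesis code; the `beta` sharpening factor is an assumption for illustration.

```python
# Minimal sketch of a 2D soft-argmax: softmax over the heatmap, then the
# expected (x, y) coordinate under that distribution. beta is a sharpening
# factor chosen here for illustration, not a value from the thesis.
import numpy as np

def soft_argmax_2d(heatmap, beta=100.0):
    """Return the expected (x, y) coordinate of a 2D heatmap."""
    h, w = heatmap.shape
    # Numerically stable softmax over all pixels.
    flat = beta * heatmap.ravel()
    probs = np.exp(flat - flat.max())
    probs /= probs.sum()
    probs = probs.reshape(h, w)
    # Expected coordinates under the softmax distribution.
    ys, xs = np.mgrid[0:h, 0:w]
    return float((probs * xs).sum()), float((probs * ys).sum())

# A heatmap peaked at column 12, row 5 yields coordinates near (12.0, 5.0).
hm = np.zeros((32, 32))
hm[5, 12] = 1.0
x, y = soft_argmax_2d(hm)
```

Because the output is an expectation rather than a discrete argmax, gradients flow through the coordinate prediction during training, which is what allows the network to regress coordinates directly.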
Evaluated on the public Rendered Handpose Dataset, our approach achieves a mean EPE of 17.4, matching or outperforming the accuracy of state-of-the-art approaches to 2D hand joint localization.
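The parameter savings behind the lightweighting step can be seen with back-of-the-envelope arithmetic: a depth-wise separable block replaces one standard K×K convolution with a K×K depth-wise convolution (one filter per input channel) followed by a 1×1 point-wise convolution that mixes channels. The layer sizes below are illustrative, not taken from the thesis.

```python
# Parameter count of a standard KxK convolution versus a depth-wise
# separable block (KxK depth-wise conv + 1x1 point-wise conv), biases ignored.
def standard_conv_params(c_in, c_out, k):
    return k * k * c_in * c_out

def separable_conv_params(c_in, c_out, k):
    depthwise = k * k * c_in   # one KxK filter per input channel
    pointwise = c_in * c_out   # 1x1 conv mixing channels
    return depthwise + pointwise

# Example: a 3x3 conv mapping 128 channels to 128 channels.
std = standard_conv_params(128, 128, 3)   # 147456
sep = separable_conv_params(128, 128, 3)  # 17536
ratio = std / sep                         # roughly 8.4x fewer parameters
```

The same factor-of-(roughly K²) reduction also applies to multiply-accumulate operations, which is why the substitution translates into real-time inference on weaker hardware.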