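Graduate student: Min-Hsiang Chang (張閔翔)
Thesis title: Voxel-based Multi-View VAE for 3D Human Shape Reconstruction (基於體素之多視角三維人體重建自編碼網路)
Advisor: Gee-Sern Hsu (徐繼聖)
Oral examination committee: Chia-Wen Lin (林嘉文), Yen-Yu Lin (林彥宇), Sheng-Luen Chung (鍾聖倫), Jing-Ming Guo (郭景明), Gee-Sern Hsu (徐繼聖)
Degree: Master
Department: College of Engineering - Department of Mechanical Engineering
Publication year: 2020
Academic year of graduation: 108
Language: Chinese
Number of pages: 60
Chinese keywords: 三維人體重建 (3D human body reconstruction), 體素生成 (voxel generation), 多視角整合 (multi-view integration)
English keywords: 3D Human Shape Reconstruction, Voxel generation, Multi-View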
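  • This thesis proposes a multi-view autoencoder network for 3D human body reconstruction that uses voxels as its basic element. Through end-to-end learning on a large amount of 3D human body scan data, the network uses its end-to-end structure to reconstruct a 3D human body model from 2D human images. Other methods must first precisely localize the 2D and 3D joint positions and then run a relatively time-consuming fitting step with a parameterized linear 3D human model. The proposed network needs neither the 2D and 3D joint positions nor the fitting step; instead, it generates a detailed 3D human body model through a 2D encoder, a 3D decoder, a multi-view integration architecture, and a refiner. Because no existing database provides both 3D human body models and 2D human images from different viewpoints, this study projects the human models of the CAESAR [3] 3D human body scan database into 2D, and uses the Colorful Human database and the Kinect Head Pose [4] database for training and validation, respectively. The experimental results show that the proposed method has the following advantages: 1. faster 3D model reconstruction; 2. training data that does not require elaborate annotation; 3. an end-to-end 3D reconstruction architecture that is easier to train and use.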


    We propose the Voxel-based Multi-View VAE (VM-VAE) for 3D human body reconstruction from 2D images. The proposed VM-VAE addresses the following issues: 1) the shortage of 2D-to-3D human body databases, 2) the validation of a voxel-based 3D human body reconstruction model, and 3) the time-consuming reconstruction of existing approaches. The VM-VAE is composed of a 2D encoder, a 3D decoder, a multi-view fusion module, and a feature refiner. Given a set of multi-view 2D images, the encoder is trained to transform each 2D image into a deep feature representation, and the 3D decoder is trained to transform that representation into a coarse 3D model. The multi-view fusion module integrates the full set of coarse 3D models into a fused 3D model, which the feature refiner then refines with more 3D shape detail. To meet our training needs, we convert the CAESAR [3] human body scanning database into the 2D-to-3D CAESAR database. Experiments on the 2D-to-3D CAESAR database, the Colorful Human dataset, and the Kinect Head Pose [4] dataset show that the proposed approach can satisfactorily reconstruct 3D human body models regardless of the poses, body shapes, and backgrounds in the given 2D images.
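    The fusion step described in the abstract can be illustrated with a toy sketch. The abstract does not state the fusion rule or the evaluation metric, so everything below is an assumption made for illustration: per-voxel averaging stands in for the multi-view fusion module, and voxel intersection-over-union (IoU), a score commonly used for voxel reconstructions, stands in for the evaluation. The names `fuse_views`, `binarize`, and `voxel_iou` are hypothetical and do not come from the thesis.

```python
from typing import List, Sequence

# A flattened occupancy grid: one occupancy probability per voxel.
Grid = List[float]


def fuse_views(coarse_grids: Sequence[Grid]) -> Grid:
    """Fuse per-view coarse grids by averaging each voxel across views.

    Plain averaging is an assumption for illustration only; the thesis
    abstract does not say how its fusion module combines the views.
    """
    if not coarse_grids:
        raise ValueError("need at least one view")
    n_voxels = len(coarse_grids[0])
    if any(len(grid) != n_voxels for grid in coarse_grids):
        raise ValueError("all views must use the same grid size")
    n_views = len(coarse_grids)
    return [sum(grid[i] for grid in coarse_grids) / n_views
            for i in range(n_voxels)]


def binarize(grid: Grid, threshold: float = 0.5) -> List[bool]:
    """Mark a voxel as occupied when its probability reaches the threshold."""
    return [p >= threshold for p in grid]


def voxel_iou(pred: Sequence[bool], target: Sequence[bool]) -> float:
    """Intersection-over-union of two binary voxel grids of equal length."""
    if len(pred) != len(target):
        raise ValueError("grids must have the same number of voxels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    # Two empty grids agree perfectly, so define their IoU as 1.0.
    return inter / union if union else 1.0


if __name__ == "__main__":
    # Three hypothetical views of a 4-voxel grid.
    views = [
        [1.0, 0.0, 0.5, 0.0],
        [0.5, 0.0, 0.25, 0.0],
        [0.75, 0.75, 0.0, 0.0],
    ]
    fused = fuse_views(views)
    print(fused, voxel_iou(binarize(fused), [True, True, False, False]))
```

    Averaging treats every view as equally reliable. A learned fusion module would typically weight views per voxel, which is one reason the sketch should not be read as the thesis's method.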

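    Abstract (Chinese)
    Abstract
    Acknowledgments
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Research Background and Motivation
        1.2 Method Overview
        1.3 Contributions
        1.4 Thesis Organization
    Chapter 2 Literature Review
        2.1 3D Human Body Reconstruction Based on Parametric Linear Human Models
            2.1.1 SMPL Linear Human Model
            2.1.2 Keep it SMPL
            2.1.3 End-to-end Recovery of Human Shape and Pose (HMR)
        2.2 Voxel-Based 3D Reconstruction
            2.2.1 ShapeNet 3D Object Database
            2.2.2 3D-R2N2
    Chapter 3 Proposed Method
        3.1 2D Encoder
        3.2 3D Decoder
        3.3 Multi-View Integrator
        3.4 Feature Refiner
        3.5 Loss Function
    Chapter 4 Experimental Setup and Analysis
        4.1 Benchmark Databases
            4.1.1 CAESAR Database
            4.1.2 Colorful Human Database
            4.1.3 BIWI Kinect Head Pose Database
        4.2 Performance Evaluation Metrics
            4.2.1 Voxel-Based 3D Human Body Reconstruction Methods
            4.2.2 Linear-Human-Model-Based 3D Human Body Reconstruction Methods
            4.2.3 How to Compare the Two
        4.3 Experimental Setup and Result Analysis
            4.3.1 CAESAR Database
            4.3.2 Colorful Human Database
            4.3.3 BIWI Kinect Head Pose Database
    Chapter 5 Conclusion and Future Work
    Chapter 6 References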

    [1] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter V. Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. CoRR, abs/1607.08128, 2016.
    [2] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, and Peter V. Gehler. Unite the people: Closing the loop between 3D and 2D human representations. CoRR, abs/1701.02468, 2017.
    [3] Kathleen Robinette, Sherri Blackwell, Hein Daanen, Mark Boehmer, and Scott Fleming. Civilian American and European Surface Anthropometry Resource (CAESAR), final report. Volume 1: Summary. Page 74, June 2002.
    [4] Gabriele Fanelli, Matthias Dantone, Juergen Gall, Andrea Fossati, and Luc Van Gool. Random forests for real time 3D face analysis. Int. J. Comput. Vision, 101(3):437–458, February 2013.
    [5] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732, 2016.
    [6] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4929–4937, 2016.
    [7] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1302–1310, 2017.
    [8] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV), pages 483–499, 2016.
    [9] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3D human pose estimation. In IEEE International Conference on Computer Vision (ICCV), pages 2659–2668, 2017.
    [10] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1263–1272, 2017.
    [11] Grégory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. LCR-Net: Localization-classification-regression for human pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1216–1224, 2017.
    [12] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Towards 3D human pose estimation in the wild: A weakly-supervised approach. In IEEE International Conference on Computer Vision (ICCV), pages 398–407, 2017.
    [13] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In IEEE International Conference on Computer Vision (ICCV), pages 2621–2630, 2017.
    [14] Rıza Alp Güler, George Trigeorgis, Epameinondas Antonakos, Patrick Snape, Stefanos Zafeiriou, and Iasonas Kokkinos. DenseReg: Fully convolutional dense shape regression in-the-wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, page 5, 2017.
    [15] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
    [16] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: shape completion and animation of people. ACM Trans. Graph., 24(3):408–416, 2005.
    [17] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
    [18] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
    [19] A. Jain, J. Tompson, Y. LeCun, and C. Bregler. MoDeep: A deep learning framework using motion features for human pose estimation. In Asian Conference on Computer Vision (ACCV), volume 9004, pages 302–315, 2015.
    [20] T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In IEEE International Conference on Computer Vision (ICCV), pages 1913–1921, 2015.
    [21] T. Pfister, K. Simonyan, J. Charles, and A. Zisserman. Deep convolutional neural networks for efficient pose estimation in gesture videos. In Asian Conference on Computer Vision (ACCV), pages 538–552, 2014.
    [22] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1653–1660, 2014.
    [23] S. Geman and D. McClure. Statistical methods for tomographic image reconstruction. Bulletin of the International Statistical Institute, 52(4):5–21, 1987.
    [24] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, 2015.
    [25] Antonio Torralba, Bryan C Russell, and Jenny Yuen. LabelMe: Online image annotation and applications. Proceedings of the IEEE, 98(8):1467–1484, 2010.
    [26] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV, 2014.
    [27] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997.
    [28] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems 27, 2014.
    [29] F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015.
    [30] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
    [31] Z. C. Marton, R. B. Rusu, and M. Beetz. On fast surface reconstruction methods for large and noisy point clouds. In 2009 IEEE International Conference on Robotics and Automation, pages 3218–3223, 2009.
    [32] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In ICCV, 2019.
    [33] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172, 2019.
    [34] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, September 2014.
    [35] Jia Deng, R. Socher, Li Fei-Fei, Wei Dong, Kai Li, and Li-Jia Li. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, June 2009.
    [36] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
    [37] Ishaan Gulrajani et al. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 2017.
    [38] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
    [39] Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. BodyNet: Volumetric inference of 3D human body shapes. In ECCV, 2018.
    [40] http://mocap.cs.cmu.edu.
    [41] M. Loper, N. Mahmood, and M. J. Black. MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH Asia, 33(6):220:1–220:13, 2014.
    [42] E. Olson and P. Agarwal. Inference on networks of mixtures for robust robot mapping. Int. J. Robotics Research, 32(7):826–840, 2013.
    [43] S. Tulsiani and J. Malik. Viewpoints and keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1510–1519, 2015.
