Graduate Student: 張浩偉 Hao-Wei Zhang
Thesis Title: 以切換式像素對齊隱函數進行人體重建 PIS-FU: Pixel-Aligned Implicit Switching Function for Human Reconstruction
Advisor: 徐繼聖 Gee-Sern Hsu
Committee Members: 賴尚宏 Shang-Hong Lai, 王鈺強 Yu-Chiang Wang, 邱維辰 Wei-Chen Chiu, 郭景明 Jing-Ming Guo
Degree: Master (碩士)
Department: College of Engineering, Department of Mechanical Engineering (機械工程系)
Publication Year: 2021
Graduation Academic Year: 109
Language: Chinese
Pages: 65
Chinese Keywords: 切換式學習, 像素對齊隱函數, 三維人體重建
English Keywords: Switching Learning, Pixel-Aligned Implicit Function, 3D Human Reconstruction
Abstract:
We propose a network with the Pixel-Aligned Implicit Switching Function (PIS-Fu) as its core for 3D reconstruction of the human body from a 2D image. Our network consists of a dual-level encoder, a switching-learning surface estimator, a query network, and a model converter. Given a 2D human image as input, the dual-level encoder transforms the image into a feature representation that combines holistic geometry with local surface details. This representation is embedded with spatially sparse features extracted by the switching-learning surface estimator, which is composed of an outer shape estimator and an inner-outer surface approximator. The outer shape estimator removes cloud points that fall outside the estimated 3D body shape, and the inner-outer surface approximator searches for a 3D surface that separates the inner cloud points from the outer ones. The embedded feature is then processed by the query network to generate a rigid body model, which the model converter transforms into a parametric model to complete the 3D reconstruction. We verify our network settings on a small dataset and compare performance with other approaches on the RenderPeople dataset. Experiments show that our approach is highly competitive with state-of-the-art methods in both reconstruction accuracy and memory efficiency.
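The two-stage switching described in the abstract (first prune points outside an estimated outer shape, then classify the survivors as inside or outside the body surface) can be sketched in plain Python. Everything here is an illustrative assumption, not the thesis implementation: the outer shape is stood in for by an axis-aligned box, the body surface by an occupancy callback, and all function names are hypothetical.

```python
# Hypothetical sketch of the switching-learning surface estimator,
# using a box for the outer shape and a caller-supplied occupancy
# test for the inner-outer surface. Illustrative only.

def outer_shape_estimator(points, bbox):
    """Stage 1: discard sample points outside the estimated outer shape.

    bbox is ((xmin, ymin, zmin), (xmax, ymax, zmax)); a stand-in for
    the learned outer shape estimator that removes noise points.
    """
    lo, hi = bbox
    return [p for p in points
            if all(lo[i] <= p[i] <= hi[i] for i in range(3))]

def inner_outer_approximator(points, occupancy):
    """Stage 2: label each surviving point as inside (1) or outside (0)
    the body surface, via an occupancy predicate."""
    return [(p, occupancy(p)) for p in points]

def switching_surface_estimator(points, bbox, occupancy):
    """Switch between the two estimators: prune first, then classify."""
    kept = outer_shape_estimator(points, bbox)
    return inner_outer_approximator(kept, occupancy)

if __name__ == "__main__":
    pts = [(0.0, 0.0, 0.0), (0.2, 0.1, 0.0), (5.0, 5.0, 5.0)]
    box = ((-1, -1, -1), (1, 1, 1))
    # Toy occupancy: inside a sphere of radius 0.5 around the origin.
    occ = lambda p: 1 if sum(c * c for c in p) <= 0.25 else 0
    print(switching_surface_estimator(pts, box, occ))
```

In the actual network these stages operate on the embedded pixel-aligned features rather than raw coordinates; the sketch only shows the control flow, i.e. that the outer shape estimator gates which points the inner-outer approximator ever sees.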