
Graduate Student: Hao-Wei Zhang (張浩偉)
Thesis Title: PIS-Fu: Pixel-Aligned Implicit Switching Function for Human Reconstruction (以切換式像素對齊隱函數進行人體重建)
Advisor: Gee-Sern Hsu (徐繼聖)
Committee Members: Shang-Hong Lai (賴尚宏), Yu-Chiang Wang (王鈺強), Wei-Chen Chiu (邱維辰), Jing-Ming Guo (郭景明)
Degree: Master
Department: College of Engineering, Department of Mechanical Engineering
Year of Publication: 2021
Graduation Academic Year: 109 (Republic of China calendar)
Language: Chinese
Number of Pages: 65
Chinese Keywords: 切換式學習, 像素對齊隱函數, 三維人體重建
English Keywords: Switching Learning, Pixel-Aligned Implicit Function, 3D Human Reconstruction


    We propose a network with the Pixel-Aligned Implicit Switching Function (PIS-Fu) at its core for the 3D reconstruction of a 2D human image. The network consists of a dual-level encoder, a switching-learning surface estimator, a query network, and a model converter. Given a 2D human image as input, the dual-level encoder transforms it into a feature representation that combines holistic geometry with local surface details. This representation is embedded with the sparse spatial features extracted by the switching-learning surface estimator, which comprises an outer shape estimator and an inner-outer surface approximator. The outer shape estimator removes cloud points that fall outside the estimated 3D body shape, and the inner-outer surface approximator searches for a 3D surface that separates the inner cloud points from the outer ones. The embedded feature is then processed by the query network to generate a rigid body model, which the model converter transforms into a parametric model to complete the 3D reconstruction. We verify our network settings on a small dataset and compare performance with other approaches on the RenderPeople dataset. Experiments show that our approach is highly competitive with state-of-the-art methods in both reconstruction accuracy and memory efficiency.
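The query pipeline the abstract describes can be sketched in miniature: project each 3D query point into the image, sample a pixel-aligned feature, prune points that the outer shape estimate rejects, and classify the survivors as inside or outside the surface. Everything below is an illustrative assumption, not the thesis implementation: the spherical body proxy stands in for the learned outer shape estimator, and a single linear layer stands in for the query network's MLP.

```python
import numpy as np

def orthographic_project(points):
    """Project 3D query points (N, 3) to 2D image coordinates under an orthographic camera."""
    return points[:, :2]  # depth is simply dropped

def sample_pixel_features(feature_map, uv):
    """Nearest-neighbour sampling of an (H, W, C) feature map at (N, 2) uv coords in [0, 1]."""
    h, w, _ = feature_map.shape
    ij = np.clip((uv * [h - 1, w - 1]).round().astype(int), 0, [h - 1, w - 1])
    return feature_map[ij[:, 0], ij[:, 1]]

def outer_shape_mask(points, radius=1.0):
    """Switching stage 1: discard query points outside a coarse body proxy (a sphere here)."""
    return np.linalg.norm(points, axis=1) <= radius

def inner_outer_occupancy(features, depth, w, b):
    """Switching stage 2: a one-layer stand-in for the MLP separating inner from outer points."""
    logits = features @ w + b - depth  # toy rule: deeper points are less likely inside
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8, 4))          # stand-in for the dual-level encoder output
points = rng.uniform(-1.2, 1.2, size=(32, 3))     # sparse spatial query points

mask = outer_shape_mask(points)                   # prune exterior noise points
kept = points[mask]
uv = (orthographic_project(kept) + 1.2) / 2.4     # normalise coordinates to [0, 1]
feats = sample_pixel_features(feature_map, uv)
occ = inner_outer_occupancy(feats, kept[:, 2], rng.normal(size=4), 0.0)
print(occ.shape)                                  # one occupancy probability per surviving point
```

In the actual method the occupancy field produced this way would be meshed (e.g. via marching cubes) into the rigid body model before the model converter maps it to a parametric body.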

Abstract (Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
  1.1  Background and Motivation
  1.2  Method Overview
  1.3  Contributions
  1.4  Thesis Organization
Chapter 2  Literature Review
  2.1  DeepHuman
  2.2  PIFu
  2.3  PIFuHD
  2.4  Geo-PIFu
  2.5  SMPL
Chapter 3  Proposed Method
  3.1  Overall Network Architecture
  3.2  Review of the Pixel-Aligned Implicit Function (PIFu)
  3.3  Design of the Pixel-Aligned Implicit Switching Function (PIS-Fu)
  3.4  Model Converter
  3.5  Loss and Local Sampling Functions
Chapter 4  Experimental Setup and Analysis
  4.1  Datasets
    4.1.1  RenderPeople Dataset
    4.1.2  SIZER Dataset
  4.2  Experimental Setup
    4.2.1  Data Partitioning and Settings
    4.2.2  Evaluation Metrics
    4.2.3  Experiment Design
  4.3  Results and Analysis
    4.3.1  Single-View Human Reconstruction Comparison on RenderPeople
    4.3.2  Multi-View Human Reconstruction Comparison on RenderPeople
    4.3.3  Ablation Study on SIZER
    4.3.4  Parametric Model Comparison
Chapter 5  Conclusions and Future Work
Chapter 6  References

    [1] O. Langner et al. Presentation and validation of the Radboud Faces Database. Cognition and Emotion, 24(8):1377–1388, 2010. A. Nagrani, J. S. Chung, and A. Zisserman. Voxceleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
    [2] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Detailed human
    avatars from monocular video. In 2018 International Conference on 3D Vision (3DV), pages 98–109. IEEE, 2018.
    [3] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based
    reconstruction of 3d people models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8387–8397, 2018.
    [4] H. Bertiche, M. Madadi, and S. Escalera. Cloth3d: Clothed 3d humans. In
    European Conference on Computer Vision, pages 344–359. Springer, 2020.
    [5] B. L. Bhatnagar, C. Sminchisescu, C. Theobalt, and G. Pons-Moll. Combining
    implicit function learning and parametric models for 3d human reconstruction. In
    European Conference on Computer Vision (ECCV). Springer, August 2020.
    [6] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep
    it smpl: Automatic estimation of 3d human pose and shape from a single image.
    In European conference on computer vision, pages 561–578. Springer, 2016.
    [7] Y.-W. Cha, T. Price, Z. Wei, X. Lu, N. Rewkowski, R. Chabra, Z. Qin, H. Kim,
    Z. Su, Y. Liu, et al. Towards fully mobile 3d face, body, and environment capture
    using only head-worn cameras. IEEE transactions on visualization and computer
    graphics, 24(11):2993–3004, 2018.
    [8] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5939–5948, 2019.
    [9] J. Chibane and G. Pons-Moll. Implicit feature networks for texture completion
    from partial 3d data. In European Conference on Computer Vision, pages 717–
    725. Springer, 2020.
    [10] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In
    International conference on medical image computing and computer-assisted
    intervention, pages 424–432. Springer, 2016.
    [11] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H.
    Hoppe, A. Kirk, and S. Sullivan. High-quality streamable free-viewpoint video.
    ACM Transactions on Graphics (ToG), 34(4):1–13, 2015.
    [12] T. He, J. Collomosse, H. Jin, and S. Soatto. Geo-pifu: Geometry and pixel
    aligned implicit functions for single-view human reconstruction. arXiv preprint
    arXiv:2006.08072, 2020.
    [13] Z. Huang, T. Li, W. Chen, Y. Zhao, J. Xing, C. LeGendre, L. Luo, C. Ma, and H. Li. Deep volumetric video from very sparse multi-view performance capture. In
    Proceedings of the European Conference on Computer Vision (ECCV), pages 336–354, 2018.
    [14] H. Joo, T. Simon, and Y. Sheikh. Total capture: A 3d deformation model for
    tracking faces, hands, and bodies. In Proceedings of the IEEE conference on
    computer vision and pattern recognition, pages 8320–8329, 2018.
    [15] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of
    human shape and pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7122–7131, 2018.
    [16] S. Liu, Y. Zhang, S. Peng, B. Shi, M. Pollefeys, and Z. Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
    Recognition, pages 2019–2028, 2020.
    [17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
    [18] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A
    skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
    [19] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface
    construction algorithm. ACM siggraph computer graphics, 21(4):163–169, 1987.
    [20] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy
    networks: Learning 3d reconstruction in function space. In Proceedings of the
    IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages
    4460–4470, 2019.
    [21] Renderpeople, 2018. https://renderpeople.com/3d-people.
    [22] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose
    estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
    [23] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf:
    Learning continuous signed distance functions for shape representation. In
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
    Recognition, pages 165–174, 2019.
    [24] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3d hands, face, and body from a single
    image. In Proceedings of the IEEE/CVF Conference on Computer Vision and
    Pattern Recognition, pages 10975–10985, 2019.
    [25] A. Pumarola, J. Sanchez-Riera, G. Choi, A. Sanfeliu, and F. Moreno-Noguer.
    3dpeople: Modeling the geometry of dressed humans. In Proceedings of the
    IEEE/CVF International Conference on Computer Vision, pages 2242–2251, 2019.
    [26] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for
    biomedical image segmentation. In International Conference on Medical image
    computing and computer-assisted intervention, pages 234–241. Springer, 2015.
    [27] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. Pifu:
    Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
    pages 2304–2314, 2019.
    [28] S. Saito, T. Simon, J. Saragih, and H. Joo. Pifuhd: Multi-level pixel-aligned
    implicit function for high-resolution 3d human digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 84–93, 2020.
    [29] G. Tiwari, B. L. Bhatnagar, T. Tung, and G. Pons-Moll. Sizer: A dataset and
    model for parsing 3d clothing and learning size sensitive 3d clothing. In
    Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK,
    August 23–28, 2020, Proceedings, Part III 16, pages 1–18. Springer, 2020.
    [30] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a
    convolutional network and a graphical model for human pose estimation.
    Advances in neural information processing systems, 27:1799–1807, 2014.
    [31] M. Vakalopoulou, G. Chassagnon, N. Bus, R. Marini, E. I. Zacharaki, M.-P.
    Revel, and N. Paragios. Atlasnet: multiatlas non-linear deep networks for
    medical image segmentation. In International Conference on Medical Image
    Computing and Computer-Assisted Intervention, pages 658–666. Springer, 2018.
    [32] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. Bodynet: Volumetric inference of 3d human body shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2018.
    [33] I. Wald, S. Woop, C. Benthin, G. S. Johnson, and M. Ernst. Embree: a kernel
    framework for efficient cpu ray tracing. ACM Transactions on Graphics (TOG),
    33(4):1–8, 2014.
    [34] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition,
    pages 8798–8807, 2018.
    [35] Z. Zheng, T. Yu, Y. Wei, Q. Dai, and Y. Liu. Deephuman: 3d human
    reconstruction from a single image. In Proceedings of the IEEE/CVF
    International Conference on Computer Vision, pages 7739–7749, 2019.
    [36] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
