
Graduate student: 林鈺紘 (Yu-Hong Lin)
Thesis title: 基於預訓練像素對齊模型做為參考進行三維人體重建
Pretrained Pixel-Aligned Reference Network for 3D Human Reconstruction
Advisor: 徐繼聖 (Gee-Sern Hsu)
Committee members: 林嘉文, 莊永裕, 陳祝嵩, 郭景明, 徐繼聖
Degree: Master
Department: College of Engineering, Department of Mechanical Engineering
Year of publication: 2022
Academic year of graduation: 110
Language: Chinese
Pages: 51
Chinese keywords: 深度學習 (deep learning), 三維重建 (3D reconstruction)
Foreign keywords: Deep learning, 3D reconstruction
Access counts: Views: 261; Downloads: 0
    We propose a method for 3D human reconstruction that uses a pretrained pixel-aligned model as a reference. The method uses the reference surface given by the Pretrained Pixel-aligned Reference (PPR), generated by a pretrained model, together with its side-view normal map to better constrain the spatial occupancy queries, thereby producing a better 3D human model. Our network consists of a dual-path encoder and a query network. One path of the dual-path encoder extracts front-view features from the input image, and the other path extracts side-view features from the side-view normal map of the PPR. The front-view and side-view features are concatenated with additional spatial features, and the query network processes these features to estimate the reconstructed shape. When training the PPR network, we consider points around both the PPR surface and the target surface, so that the implicit surface function better captures the surface contour and the human pose; training with the embedded PPR surface is also more efficient than training without it. We validate our method on THuman 2.0, on a self-collected RenderPeople set, and on SIZER-PA, which we built from the SIZER database. Experimental results show that the PPR network outperforms other state-of-the-art methods.


    We propose the Pretrained Pixel-aligned Reference (PPR) network for 3D human reconstruction. The PPR, obtained via a pretrained model, offers a reference surface and side-view normals to better constrain the spatial query processing, leading to better 3D reconstruction. Our network consists of a dual-path encoder and a query network. The dual-path encoder uses one path to extract the front-view features from the input image, and the other path to extract the side-view features from the PPR normals. The front-view and side-view features are concatenated with additional spatial features, and processed by the query network for estimating the reconstructed shape. When training the PPR network, we consider both the points around the PPR surface and those around the target surface, making the implicit surface function better capture the surface contour and the human pose. The embedding of the PPR surface also makes the training more efficient than that without. We verify the performance of the PPR network through experiments on the RenderPeople dataset, the THuman 2.0 dataset, and our 3D human dataset SIZER-PA. The results show that the PPR network outperforms state-of-the-art approaches.
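    The dual-path query described above can be sketched roughly as follows. This is a hedged NumPy illustration of pixel-aligned feature sampling, not the thesis implementation: the normalized front-view (x, y) and side-view (z, y) projections, the feature-map sizes, and the toy sigmoid MLP standing in for the query network are all illustrative assumptions.

```python
import numpy as np

def bilinear_sample(feat, uv):
    """Sample a (C, H, W) feature map at continuous (u, v) coords in [0, 1].

    Returns an (N, C) array of pixel-aligned features, one row per query.
    """
    C, H, W = feat.shape
    x = uv[:, 0] * (W - 1)
    y = uv[:, 1] * (H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    out = (feat[:, y0, x0] * (1 - wx) * (1 - wy)
           + feat[:, y0, x1] * wx * (1 - wy)
           + feat[:, y1, x0] * (1 - wx) * wy
           + feat[:, y1, x1] * wx * wy)
    return out.T

def query_occupancy(points, front_feat, side_feat, mlp):
    """Dual-path pixel-aligned occupancy query for N points in [0, 1]^3.

    The front-view path is indexed by (x, y), the side-view path by
    (z, y), and the depth z is appended as an extra spatial feature
    before the query MLP estimates per-point occupancy.
    """
    f = bilinear_sample(front_feat, points[:, [0, 1]])  # front-view features
    s = bilinear_sample(side_feat, points[:, [2, 1]])   # side-view features
    z = points[:, 2:3]                                  # spatial feature
    return mlp(np.concatenate([f, s, z], axis=1))       # (N, 1) occupancy

# Toy usage: random feature maps and a random linear layer + sigmoid
# stand in for the trained encoder and query network.
rng = np.random.default_rng(0)
front = rng.standard_normal((8, 16, 16))
side = rng.standard_normal((8, 16, 16))
pts = rng.random((5, 3))
w = rng.standard_normal((17, 1))
mlp = lambda h: 1.0 / (1.0 + np.exp(-(h @ w)))
occ = query_occupancy(pts, front, side, mlp)  # 5 occupancy estimates in (0, 1)
```

    In a full reconstruction pipeline, such occupancy estimates would be evaluated on a dense 3D grid and a mesh extracted with Marching Cubes, following the PIFu family of methods reviewed in Chapter 2.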

    Abstract (Chinese) 2
    Abstract 3
    Acknowledgements 4
    Table of Contents 5
    List of Figures 6
    List of Tables 7
    Chapter 1 Introduction 8
      1.1 Background and Motivation 8
      1.2 Method Overview 10
      1.3 Contributions 12
      1.4 Thesis Organization 13
    Chapter 2 Literature Review 14
      2.1 SMPL 14
      2.2 DeepHuman 16
      2.3 PIFu 19
      2.4 PIFuHD 20
      2.5 Geo-PIFu 22
      2.6 PaMIR 24
      2.7 ICON 25
    Chapter 3 Proposed Method 26
      3.1 Review of the Pixel-Aligned Implicit Function (PIFu) 26
      3.2 Overall Network Architecture 28
      3.3 Surface-Switching Training Method 29
      3.4 Loss Function Settings 30
    Chapter 4 Experimental Setup and Analysis 32
      4.1 Datasets 32
        4.1.1 RenderPeople Database 32
        4.1.2 THuman 2.0 Dataset 34
        4.1.3 SIZER Dataset 34
      4.2 Experimental Setup 36
        4.2.1 Data Partitioning and Settings 36
        4.2.2 Evaluation Metrics 36
        4.2.3 Experiment Design 37
      4.3 Results and Analysis 40
        4.3.1 Comparison of Single-View Human Reconstruction with Other Methods 40
        4.3.2 Ablation Study 42
    Chapter 5 Conclusion and Future Work 47
    Chapter 6 References 48

    [1] Langner, Oliver, et al. "Presentation and validation of the Radboud Faces Database." Cognition and Emotion 24.8 (2010): 1377-1388; Nagrani, Arsha, Joon Son Chung, and Andrew Zisserman. "VoxCeleb: a large-scale speaker identification dataset." arXiv preprint arXiv:1706.08612 (2017).
    [2] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Detailed human avatars from monocular video. In 2018 International Conference on 3D Vision (3DV), pages 98–109. IEEE, 2018.
    [3] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8387–8397, 2018.
    [4] H. Bertiche, M. Madadi, and S. Escalera. Cloth3d: Clothed 3d humans. In European Conference on Computer Vision, pages 344–359. Springer, 2020.
    [5] B. L. Bhatnagar, C. Sminchisescu, C. Theobalt, and G. Pons-Moll. Combining implicit function learning and parametric models for 3d human reconstruction. In European Conference on Computer Vision (ECCV). Springer, August 2020.
    [6] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European conference on computer vision, pages 561–578. Springer, 2016.
    [7] Y.-W. Cha, T. Price, Z. Wei, X. Lu, N. Rewkowski, R. Chabra, Z. Qin, H. Kim, Z. Su, Y. Liu, et al. Towards fully mobile 3d face, body, and environment capture using only head-worn cameras. IEEE transactions on visualization and computer graphics, 24(11):2993–3004, 2018.
    [8] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5939–5948, 2019.
    [9] J. Chibane and G. Pons-Moll. Implicit feature networks for texture completion from partial 3d data. In European Conference on Computer Vision, pages 717–725. Springer, 2020.
    [10] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, pages 424–432. Springer, 2016.
    [11] A. Collet, M. Chuang, P. Sweeney, D. Gillett, D. Evseev, D. Calabrese, H. Hoppe, A. Kirk, and S. Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (ToG), 34(4):1–13, 2015.
    [12] T. He, J. Collomosse, H. Jin, and S. Soatto. Geo-pifu: Geometry and pixel aligned implicit functions for single-view human reconstruction. arXiv preprint arXiv:2006.08072, 2020.
    [13] Z. Huang, T. Li, W. Chen, Y. Zhao, J. Xing, C. LeGendre, L. Luo, C. Ma, and H. Li. Deep volumetric video from very sparse multi-view performance capture. In Proceedings of the European Conference on Computer Vision (ECCV), pages 336–354, 2018.
    [14] H. Joo, T. Simon, and Y. Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8320–8329, 2018.
    [15] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7122–7131, 2018.
    [16] S. Liu, Y. Zhang, S. Peng, B. Shi, M. Pollefeys, and Z. Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2019–2028, 2020.
    [17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
    [18] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):1–16, 2015.
    [19] W. E. Lorensen and H. E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, 1987.
    [20] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460–4470, 2019.
    [21] Renderpeople, 2018. https://renderpeople.com/3d-people.
    [22] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European conference on computer vision, pages 483–499. Springer, 2016.
    [23] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 165–174, 2019.
    [24] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019.
    [25] A. Pumarola, J. Sanchez-Riera, G. Choi, A. Sanfeliu, and F. Moreno-Noguer. 3dpeople: Modeling the geometry of dressed humans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2242–2251, 2019.
    [26] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
    [27] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2304–2314, 2019.
    [28] S. Saito, T. Simon, J. Saragih, and H. Joo. Pifuhd: Multilevel pixel-aligned implicit function for high-resolution 3d human digitization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 84–93, 2020.
    [29] G. Tiwari, B. L. Bhatnagar, T. Tung, and G. Pons-Moll. Sizer: A dataset and model for parsing 3d clothing and learning size sensitive 3d clothing. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23– 28, 2020, Proceedings, Part III 16, pages 1–18. Springer, 2020.
    [30] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. Advances in neural information processing systems, 27:1799–1807, 2014.
    [31] M. Vakalopoulou, G. Chassagnon, N. Bus, R. Marini, E. I. Zacharaki, M.-P. Revel, and N. Paragios. Atlasnet: multi-atlas non-linear deep networks for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 658–666. Springer, 2018.
    [32] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. Bodynet: Volumetric inference of 3d human body shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2018.
    [33] I. Wald, S. Woop, C. Benthin, G. S. Johnson, and M. Ernst. Embree: a kernel framework for efficient cpu ray tracing. ACM Transactions on Graphics (TOG), 33(4):1–8, 2014.
    [34] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018.
    [35] Z. Zheng, T. Yu, Y. Wei, Q. Dai, and Y. Liu. Deephuman: 3d human reconstruction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7739–7749, 2019.
    [36] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
    [37] Tao Yu, Zerong Zheng, Kaiwen Guo, Pengpeng Liu, Qionghai Dai, and Yebin Liu. "Function4D: Real-time human volumetric capture from very sparse consumer RGBD sensors." CVPR 2021.
    [38] Zheng, Zerong, et al. "Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction." IEEE transactions on pattern analysis and machine intelligence 44.6 (2021): 3170-3184.
    [39] Xiu, Yuliang, et al. "Icon: Implicit clothed humans obtained from normals." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

    Full text available from 2024/09/27 (campus network)
    Full text available from 2024/09/27 (off-campus network)
    Full text available from 2024/09/27 (National Central Library: Taiwan NDLTD system)