
Graduate Student: 林浚鉅 (Jyun-Jyu Lin)
Thesis Title: 基於單目深度估測之新視圖合成應用 (Monocular Depth Estimation for Novel View Synthesis)
Advisors: 方文賢 (Wen-Hsien Fang), 陳郁堂 (Yie-Tarng Chen)
Oral Defense Committee: 賴坤財 (Kuen-Tsair Lay), 鍾聖倫 (Sheng-Luen Chung), 丘建青 (Chien-Ching Chiu)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: English
Number of Pages: 68
Keywords: Novel View Synthesis, Image Inpainting, Depth Estimation

This thesis studies a continuous novel view synthesis method based on a monocular camera. The conventional approach to novel view synthesis uses the depth information of the source image and the camera parameters to project its RGB pixels into three-dimensional space, and then renders them onto the image plane of the target view through the relative pose. For continuous novel view synthesis, however, the depth information must be consistent across source images. We therefore obtain consistent depth by combining a convolutional neural network (CNN) based depth estimation model with a Structure-from-Motion (SfM) depth estimation method. The depth is then scaled to real-world (metric) scale using objects whose real-world size is known, such as lane lines. After lifting the source image into a Layered Depth Image (LDI) with the consistent depth, a learning-based image inpainting model repairs the occluded regions, and the result is rendered to the target view. Because the ITRI dataset does not provide the rotation matrix of the relative pose, we propose a three-dimensional search method based on the known translation vector and combine it with the above pipeline to estimate the rotation matrices for the ITRI dataset. Simulation results on the DSEC and ITRI datasets verify the effectiveness of the method.


This thesis investigates a continuous novel view synthesis method based on a monocular camera. The conventional approach to novel view synthesis uses the depth information of the source image and the camera parameters to project its RGB pixels into three-dimensional space, and then renders them onto the image plane of the target view through the relative pose. For continuous novel view synthesis, however, the depth information must be consistent across source images. We therefore obtain consistent depth by combining a convolutional neural network (CNN) based depth estimation model with a Structure-from-Motion (SfM) method. We then use real-world objects of known size, such as lane lines, to scale the depth to meters. Next, we lift the source image onto a Layered Depth Image (LDI) using the consistent depth, inpaint the occluded areas with a learning-based image inpainting model, and render the result to the target view. Because the ITRI dataset does not provide the rotation matrix of the relative pose, we propose a three-dimensional search method based on the known translation vector and combine it with the aforementioned pipeline to estimate the rotation matrices of the ITRI dataset. Experiments on the DSEC and ITRI datasets demonstrate the effectiveness of the proposed method.
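To make the conventional warping step described above concrete, here is a minimal NumPy sketch of it: back-project each source pixel with the depth map and the camera intrinsics, apply the relative pose, and re-project onto the target image plane. This is an illustration under our own assumptions, not the thesis pipeline itself (which lifts the image to an LDI and repairs occlusions with a learned inpainting model); the function name, the z-buffered nearest-pixel splat, and the variable layout are ours.

```python
import numpy as np

def warp_to_novel_view(src_rgb, src_depth, K, R, t):
    """Back-project source pixels with their depth, move them by the relative
    pose (R, t), and re-project them onto the target image plane.

    src_rgb   : (H, W, 3) uint8 source image
    src_depth : (H, W) per-pixel depth, assumed consistent and in meters
    K         : (3, 3) camera intrinsic matrix
    R, t      : relative rotation (3, 3) and translation (3,) source -> target
    """
    H, W = src_depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)

    # Back-project: X = depth * K^-1 [u, v, 1]^T  (3D points in the source frame)
    pts_src = (np.linalg.inv(K) @ pix) * src_depth.reshape(1, -1)

    # Transform into the target frame and project with the intrinsics.
    proj = K @ (R @ pts_src + t.reshape(3, 1))
    z = proj[2]
    u_t = np.round(proj[0] / np.maximum(z, 1e-6)).astype(int)
    v_t = np.round(proj[1] / np.maximum(z, 1e-6)).astype(int)

    # Z-buffered nearest-pixel splat; disoccluded regions stay empty (holes).
    target = np.zeros_like(src_rgb)
    zbuf = np.full((H, W), np.inf)
    colors = src_rgb.reshape(-1, 3)
    valid = (z > 0) & (u_t >= 0) & (u_t < W) & (v_t >= 0) & (v_t < H)
    for i in np.flatnonzero(valid):
        if z[i] < zbuf[v_t[i], u_t[i]]:
            zbuf[v_t[i], u_t[i]] = z[i]
            target[v_t[i], u_t[i]] = colors[i]
    return target
```

The holes left by such a splat are exactly the disoccluded regions that the thesis fills with its learning-based inpainting model before the final rendering.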
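The rotation estimation for the ITRI data can similarly be pictured as a search over the three rotation angles, scoring each candidate by how well the warped source matches the known target view. The sketch below is hedged: the Euler-angle grid, the mean-absolute photometric score, and the reuse of warp_to_novel_view from the previous sketch are our assumptions; the thesis's actual three-dimensional search and scoring may differ.

```python
import itertools
import numpy as np

def euler_to_rotation(yaw, pitch, roll):
    """Compose a rotation matrix from Euler angles in radians (Z-Y-X order)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    return Rz @ Ry @ Rx

def search_rotation(src_rgb, src_depth, tgt_rgb, K, t, span_deg=5.0, steps=11):
    """Grid-search yaw/pitch/roll around zero, given the known translation t,
    keeping the rotation whose warped source best matches the target image."""
    angles = np.deg2rad(np.linspace(-span_deg, span_deg, steps))
    best_R, best_err = np.eye(3), np.inf
    for yaw, pitch, roll in itertools.product(angles, repeat=3):
        R = euler_to_rotation(yaw, pitch, roll)
        warped = warp_to_novel_view(src_rgb, src_depth, K, R, t)  # sketch above
        covered = warped.sum(axis=-1) > 0        # ignore unfilled (hole) pixels
        if not covered.any():
            continue
        err = np.abs(warped[covered].astype(float) -
                     tgt_rgb[covered].astype(float)).mean()
        if err < best_err:
            best_R, best_err = R, err
    return best_R
```

A coarse-to-fine refinement around the best candidate would be a natural way to keep the grid small while still reaching fine angular resolution.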

Abstract (in Chinese)
Abstract
Acknowledgment
Table of Contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
  1.1 Motivations
  1.2 Summary of Thesis
  1.3 Contributions
  1.4 Thesis Outline
2 Related Work
  2.1 Novel View Synthesis
  2.2 Depth Estimation
  2.3 Image Inpainting
  2.4 Summary
3 Proposed Method
  3.1 Overall Pipeline
  3.2 Consistent Depth Estimation
    3.2.1 Data Pre-processing
    3.2.2 Test-Time Fine-tuning
  3.3 Depth Calibration
    3.3.1 Dotted Line Heuristic
    3.3.2 Scale the Depth to the Closest one in Meters
  3.4 Novel View Synthesis
  3.5 Synthetic Image Border Inpainting
  3.6 Relative Pose Estimation
  3.7 Summary
4 Experimental Results
  4.1 Datasets
    4.1.1 ITRI Synthesis Dataset
    4.1.2 DSEC Dataset [1]
  4.2 Experimental Setup
  4.3 Evaluation Protocol
    4.3.1 Structural Similarity [2]
    4.3.2 Learned Perceptual Image Patch Similarity [3]
  4.4 Experimental Result
    4.4.1 Pose Estimation
    4.4.2 Performance on the DSEC Dataset
    4.4.3 Performance on the ITRI Dataset
    4.4.4 Continuous Synthesis Results
  4.5 Ablation Studies
    4.5.1 Impact of image inpainting
  4.6 Limitation and Error Analysis
  4.7 Summary
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
Appendix A: Fine-Tune Faster R-CNN [4] by Synthetic Images
Appendix B: Custom Pose Panel
References

[1] M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza, "DSEC: A stereo event camera dataset for driving scenarios," IEEE Robotics and Automation Letters, 2021.
[2] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[3] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in CVPR, 2018.
[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2017.
[5] X. Luo, J.-B. Huang, R. Szeliski, K. Matzen, and J. Kopf, "Consistent video depth estimation," ACM Transactions on Graphics (TOG), vol. 39, no. 4, pp. 71–1, 2020.
[6] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun, "Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
[7] M.-L. Shih, S.-Y. Su, J. Kopf, and J.-B. Huang, "3D photography using context-aware layered depth inpainting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8028–8038, 2020.
[8] S. W. Oh, S. Lee, J.-Y. Lee, and S. J. Kim, "Onion-peel networks for deep video completion," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4403–4412, 2019.
[9] M. Levoy and P. Hanrahan, "Light field rendering," in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 31–42, 1996.
[10] S. J. Gortler, R. Grzeszczuk, R. Szeliski, and M. F. Cohen, "The lumigraph," in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 43–54, 1996.
[11] C. Buehler, M. Bosse, L. McMillan, S. Gortler, and M. Cohen, "Unstructured lumigraph rendering," in Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pp. 425–432, 2001.
[12] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, "Stereo magnification: Learning view synthesis using multiplane images," in SIGGRAPH, 2018.
[13] J. Flynn, M. Broxton, P. Debevec, M. DuVall, G. Fyffe, R. Overbeck, N. Snavely, and R. Tucker, "DeepView: View synthesis with learned gradient descent," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2367–2376, 2019.
[14] J. Adler and O. Öktem, "Learned primal-dual reconstruction," IEEE Transactions on Medical Imaging, vol. 37, no. 6, pp. 1322–1332, 2018.
[15] J. Adler and O. Öktem, "Solving ill-posed inverse problems using iterative deep neural networks," Inverse Problems, vol. 33, no. 12, p. 124007, 2017.
[16] P. P. Srinivasan, R. Tucker, J. T. Barron, R. Ramamoorthi, R. Ng, and N. Snavely, "Pushing the boundaries of view extrapolation with multiplane images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 175–184, 2019.
[17] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, "Local light field fusion: Practical view synthesis with prescriptive sampling guidelines," ACM Transactions on Graphics (TOG), 2019.
[18] R. Tucker and N. Snavely, "Single-view view synthesis with multiplane images," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 551–560, 2020.
[19] J. Shade, S. Gortler, L.-w. He, and R. Szeliski, "Layered depth images," in Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, pp. 231–242, 1998.
[20] S. Tulsiani, R. Tucker, and N. Snavely, "Layer-structured 3D scene inference via view synthesis," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 302–317, 2018.
[21] H. Dhamo, K. Tateno, I. Laina, N. Navab, and F. Tombari, "Peeking behind objects: Layered depth prediction from a single image," Pattern Recognition Letters, vol. 125, pp. 333–340, 2019.
[22] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
[23] H. Dhamo, N. Navab, and F. Tombari, "Object-driven multi-layer scene decomposition from a single image," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5369–5378, 2019.
[24] J. L. Schönberger and J.-M. Frahm, "Structure-from-motion revisited," in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[25] C. Godard, O. Mac Aodha, and G. J. Brostow, "Unsupervised monocular depth estimation with left-right consistency," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279, 2017.
[26] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into self-supervised monocular depth estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3828–3838, 2019.
[27] Z. Li, T. Dekel, F. Cole, R. Tucker, N. Snavely, C. Liu, and W. T. Freeman, "Learning the depths of moving people by watching frozen people," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4521–4530, 2019.
[28] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera, "Filling-in by joint interpolation of vector fields and gray levels," IEEE Transactions on Image Processing, vol. 10, no. 8, pp. 1200–1211, 2001.
[29] A. Levin, A. Zomet, and Y. Weiss, "Learning how to inpaint from global image statistics," in ICCV, vol. 1, pp. 305–312, 2003.
[30] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 417–424, 2000.
[31] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher, "Simultaneous structure and texture image inpainting," IEEE Transactions on Image Processing, vol. 12, no. 8, pp. 882–889, 2003.
[32] S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen, "Image melding: Combining inconsistent images using patch-based synthesis," ACM Transactions on Graphics (TOG), vol. 31, no. 4, pp. 1–10, 2012.
[33] J.-B. Huang, S. B. Kang, N. Ahuja, and J. Kopf, "Temporally coherent completion of dynamic video," ACM Transactions on Graphics (TOG), vol. 35, no. 6, pp. 1–11, 2016.
[34] A. Newson, A. Almansa, M. Fradet, Y. Gousseau, and P. Pérez, "Video inpainting of complex scenes," SIAM Journal on Imaging Sciences, vol. 7, no. 4, pp. 1993–2019, 2014.
[35] Y. Wexler, E. Shechtman, and M. Irani, "Space-time video completion," in Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), vol. 1, pp. I–I, IEEE, 2004.
[36] S. Iizuka, E. Simo-Serra, and H. Ishikawa, "Globally and locally consistent image completion," ACM Transactions on Graphics (TOG), vol. 36, no. 4, pp. 1–14, 2017.
[37] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2536–2544, 2016.
[38] D. Kim, S. Woo, J.-Y. Lee, and I. S. Kweon, "Deep video inpainting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5792–5801, 2019.
[39] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "End-to-end memory networks," arXiv preprint arXiv:1503.08895, 2015.
[40] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
[41] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul 2017.
[42] A. Kirillov, Y. Wu, K. He, and R. Girshick, "PointRend: Image segmentation as rendering," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9799–9808, 2020.
[43] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, pp. 740–755, Springer, 2014.
[44] M. Zhang, J. Lucas, J. Ba, and G. E. Hinton, "Lookahead optimizer: k steps forward, 1 step back," in Advances in Neural Information Processing Systems, pp. 9597–9608, 2019.
[45] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, "Image inpainting for irregular holes using partial convolutions," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 85–100, 2018.

Full text release date: 2024/09/24 (campus network)
Full text release date: 2031/09/24 (off-campus network)
Full text release date: 2031/09/24 (National Central Library: Taiwan NDLTD system)