
Graduate Student: 陳韋廷 (Wei-Ting Chen)
Thesis Title: 基於深度學習及光流法之影像重建系統 (View Synthesis with Optical Flow and Deep Neural Networks)
Advisor: 陳郁堂 (Yie-Tarng Chen)
Committee Members: 林銘波 (Ming-Bo Lin), 陳省隆 (Hsing-Lung Chen), 呂政修 (Jenq-Shiou Leu), 方文賢 (Wen-Hsien Fang), 陳郁堂 (Yie-Tarng Chen)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Year of Publication: 2020
Graduation Academic Year: 108
Language: English
Number of Pages: 57
Keywords (Chinese): 視圖合成 (view synthesis), 圖像修補 (image inpainting)
Keywords (English): Flow-Guided Video Inpainting, Depth-Guided Warping

    This thesis studies a novel view synthesis framework that uses a single camera: the source image is mapped to the target image through the relative pose and the estimated depth. Reconstructing the 3D scene is an important step in novel view synthesis, but conventional methods have difficulty recovering complete 3D information from a single image. This thesis therefore proposes a neural-network architecture that estimates the depth of the source view with a self-supervised learning scheme. The depth is then used to warp the source image to the target image; after this depth-guided projection, however, some pixels in the target image are missing. We repair these missing pixels with flow-guided video inpainting: instead of directly filling RGB pixels into the missing region of each frame, video inpainting is treated as a pixel-propagation problem. We first complete the optical flow of the missing region in a coarse-to-fine manner, and the completed flow then guides pixels from adjacent frames into the missing regions. Experiments on the KITTI and ITRI datasets demonstrate the effectiveness of the proposed method.
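
    The depth-guided warping described above amounts to back-projecting each source pixel with its estimated depth, applying the relative pose, and re-projecting it into the target view. The following Python/NumPy sketch is only an illustration of that geometry under simplifying assumptions (grayscale image, nearest-pixel forward splatting, NaN used to mark holes); the function name is made up and this is not the thesis implementation, whose rendering step is differentiable (cf. Sec. 3.2.2):

        # Minimal sketch (assumed names and simplifications, not the thesis code):
        # back-project every source pixel with its estimated depth, apply the
        # relative pose (R, t), and re-project into the target view with intrinsics K.
        import numpy as np

        def depth_guided_warp(src_img, depth, K, R, t):
            """Forward-warp a grayscale src_img into the target view.

            Pixels that receive no projection stay NaN; these are the missing
            pixels that the inpainting stages must fill.
            """
            h, w = depth.shape
            ys, xs = np.mgrid[0:h, 0:w]
            pix = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1)  # homogeneous pixel coordinates
            cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)      # back-project to 3D camera space
            cam_t = R @ cam + t.reshape(3, 1)                          # move into the target camera frame
            proj = K @ cam_t                                           # re-project with the intrinsics
            u = np.round(proj[0] / proj[2]).astype(int)
            v = np.round(proj[1] / proj[2]).astype(int)
            ok = (proj[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
            target = np.full((h, w), np.nan)
            target[v[ok], u[ok]] = src_img[ys.reshape(-1)[ok], xs.reshape(-1)[ok]]
            return target

    In the actual pipeline the warping is presumably realized with differentiable bilinear sampling in the spirit of spatial transformer networks [49], so that the depth and pose networks can be trained end to end; the sketch is only meant to show where the missing pixels come from.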


    This thesis investigates an architecture for novel view synthesis from a monocular camera and a relative pose between the source view and the target view. Reconstructing 3D scenes is an important step toward novel view synthesis. However, it is difficult for conventional methods to acquire complete 3D information from a single image. Taking advantage of deep neural networks, the proposed architecture first estimates the depth of a source-view image using a self-supervised learning scheme. Next, depth-guided warping maps the source-view image to a target-view image. However, some pixels in the target-view image are missing after depth-guided warping. We use flow-guided video inpainting and generative image inpainting to fill these missing pixels. In flow-guided video inpainting, instead of filling in the RGB pixels of each frame directly, inpainting is treated as a pixel-propagation problem. We first complete the missing optical flow, refining the flow fields in a coarse-to-fine manner. The completed flow field then guides the propagation of pixels from adjacent frames to fill the missing regions. Finally, to inpaint the remaining missing pixels in each frame, we adopt generative image inpainting with contextual attention, which consists of a coarse network and a refinement network and uses contextual attention to guide pixel filling in the missing regions. Experiments on the KITTI and ITRI datasets demonstrate the effectiveness of the proposed approach.
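
    As a concrete, deliberately simplified illustration of the pixel-propagation idea, the sketch below fills the missing pixels of one frame by following the completed forward flow into the next frame. The names and the single-neighbour, nearest-pixel propagation are assumptions made for brevity; flow-guided video inpainting [50] propagates pixels bidirectionally across many frames.

        # Minimal sketch (assumed interface, not the thesis code): use the completed
        # optical flow to fetch missing pixels from the adjacent (next) frame.
        import numpy as np

        def propagate_from_next(frame, next_frame, mask, next_mask, flow):
            """mask/next_mask: True where a pixel is missing; flow: (H, W, 2) forward flow."""
            h, w = mask.shape
            filled = frame.copy()
            ys, xs = np.nonzero(mask)                       # coordinates of missing pixels
            u = np.round(xs + flow[ys, xs, 0]).astype(int)  # follow the completed flow in x
            v = np.round(ys + flow[ys, xs, 1]).astype(int)  # ... and in y
            inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
            u_c, v_c = np.clip(u, 0, w - 1), np.clip(v, 0, h - 1)
            ok = inside & ~next_mask[v_c, u_c]              # the fetched pixel must itself be known
            filled[ys[ok], xs[ok]] = next_frame[v[ok], u[ok]]
            remaining = mask.copy()
            remaining[ys[ok], xs[ok]] = False               # leftovers go to the GAN-based inpainting stage
            return filled, remaining

    Pixels that cannot be reached from any adjacent frame are the ones handed to the generative image inpainting network with contextual attention [52].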

    Abstract
    Acknowledgment
    Table of Contents
    List of Figures
    List of Tables
    1 Introduction
      1.1 Motivations
      1.2 Summary of Thesis
      1.3 Contributions
      1.4 Thesis Outline
    2 Related Work
      2.1 View Synthesis
      2.2 Depth Estimation
      2.3 Learning-Based Motion Estimation
      2.4 Depth-Guided View Synthesis
      2.5 Video Inpainting
    3 Proposed Method
      3.1 Overall Methodology
      3.2 Unsupervised Depth and Ego-motion Learning from Monocular Video
        3.2.1 Euler's Rotation Theorem [1]
        3.2.2 Image-based Rendering of Differentiable Depth Estimation
        3.2.3 Photometric Loss
        3.2.4 Geometry Consistency Loss
        3.2.5 Depth Smoothness
        3.2.6 Depth-Guided Warping
      3.3 Flow-Guided Video Inpainting
        3.3.1 Subnetwork of Deep Flow Completion
        3.3.2 Optimize Optical Flow by Stacking
        3.3.3 Loss Function
        3.3.4 Optical Flow Guided Image Inpainting
        3.3.5 Image Inpainting by GAN
    4 Experimental Results
      4.1 Dataset
        4.1.1 KITTI Visual Odometry Dataset [2]
        4.1.2 CARLA Dataset [3]
      4.2 Evaluation Protocol
        4.2.1 Structural Similarity [4]
        4.2.2 Peak Signal-to-Noise Ratio
        4.2.3 L1 Pixel Error
      4.3 Experimental Results on Stereo Video
      4.4 Failure Cases and Difficult Cases Analysis
    5 Conclusion
    References
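
    The evaluation protocol in Section 4.2 relies on three standard metrics: SSIM [4], PSNR, and the L1 pixel error between the synthesized and the ground-truth target view. The sketch below shows one common way to compute them with NumPy and scikit-image; the helper name and the 8-bit assumption are illustrative, not taken from the thesis:

        # Sketch of the evaluation metrics (SSIM, PSNR, L1 pixel error); assumes
        # 8-bit images of identical shape, either (H, W) or (H, W, 3).
        import numpy as np
        from skimage.metrics import peak_signal_noise_ratio, structural_similarity

        def evaluate_view(pred, target):
            ssim = structural_similarity(target, pred,
                                         channel_axis=-1 if pred.ndim == 3 else None)
            psnr = peak_signal_noise_ratio(target, pred, data_range=255)  # 10 * log10(255^2 / MSE)
            l1 = np.mean(np.abs(pred.astype(np.float64) - target.astype(np.float64)))
            return {"SSIM": ssim, "PSNR": psnr, "L1": l1}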

    [1] J. Diebel, “Representing attitude: Euler angles, unit quaternions, and rotation vectors,” Matrix, vol. 58, no. 15-16, pp. 1-35, 2006.
    [2] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, pp. 1231-1237, Aug. 2013.
    [3] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Proceedings of the 1st Annual Conference on Robot Learning, pp. 1-16, 2017.
    [4] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600-612, 2004.
    [5] J. Bai, A. Agarwala, M. Agrawala, and R. Ramamoorthi, “Automatic cinemagraph portraits,” in Computer Graphics Forum, vol. 32, pp. 17-25, Wiley Online Library, 2013.
    [6] S. Liu, L. Yuan, P. Tan, and J. Sun, “Bundled camera paths for video stabilization,” ACM Transactions on Graphics (TOG), vol. 32, no. 4, pp. 1-10, 2013.
    [7] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, “High-quality video view interpolation using a layered representation,” ACM transactions on graphics (TOG), vol. 23, no. 3, pp. 600-608, 2004.
    [8] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4040-4048, 2016.
    [9] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, “Unsupervised learning of depth and ego-motion from video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851-1858, 2017.
    [10] S. E. Chen and L. Williams, “View interpolation for image synthesis,” in Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, pp. 279-288, ACM, 1993.
    [11] S. M. Seitz and C. R. Dyer, “View morphing,” in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 21-30, ACM, 1996.
    [12] P. E. Debevec, C. J. Taylor, and J. Malik, Modeling and rendering architecture from photographs. University of California, Berkeley, 1996.
    [13] A. Fitzgibbon, Y. Wexler, and A. Zisserman, “Image-based rendering using image-based priors,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 14-17, Oct. 2003.
    [14] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, pp. 541-551, Dec. 1989.
    [15] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, pp. 193-202, Apr. 1980.
    [16] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox, “Learning to generate chairs with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1538-1546, 2015.
    [17] K. Rematas, C. H. Nguyen, T. Ritschel, M. Fritz, and T. Tuytelaars, “Novel views of objects from a single image,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 8, pp. 1576-1590, 2016.
    [18] J. Xie, R. Girshick, and A. Farhadi, “Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks,” Lecture Notes in Computer Science, pp. 842-857, 2016.
    [19] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270-279, 2017.
    [20] Z. Yang, P. Wang, W. Xu, L. Zhao, and R. Nevatia, “Unsupervised learning of geometry with edge-aware depth-normal consistency,” 2017.
    [21] Z. Yang, P. Wang, Y. Wang, W. Xu, and R. Nevatia, “Lego: Learning edge with geometry all at once by watching videos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 225-234, 2018.
    [22] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2758-2766, 2015.
    [23] A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2938-2946, 2015.
    [24] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz, “Geometry-aware learning of maps for camera localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616-2625, 2018.
    [25] R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in European Conference on Computer Vision, pp. 740-756, Springer, 2016.
    [26] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki, “Sfm-net: Learning of structure and motion from video,” arXiv preprint arXiv:1704.07804, 2017.
    [27] R. Li, S. Wang, Z. Long, and D. Gu, “Undeepvo: Monocular visual odometry through unsupervised deep learning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7286-7291, IEEE, 2018.
    [28] C. Luo, Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, and A. Yuille, “Every pixel counts++: Joint learning of geometry and motion with 3d holistic understanding,” arXiv preprint arXiv:1810.06125, July 2018.
    [29] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cambridge university press, 2003.
    [30] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, “Deepstereo: Learning to predict new views from the world's imagery,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5515-5524, 2016.
    [31] P. Hedman, J. Philip, T. Price, J.-M. Frahm, G. Drettakis, and G. Brostow, “Deep blending for free-viewpoint image-based rendering,” ACM Transactions on Graphics (TOG), vol. 37, no. 6, pp. 1-15, 2018.
    [32] A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun, “Tanks and temples: Benchmarking large-scale scene reconstruction,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1-13, 2017.
    [33] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in European conference on computer vision, pp. 746-760, Springer, 2012.
    [34] Z. Li and N. Snavely, “Megadepth: Learning single-view depth prediction from internet photos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2041-2050, 2018.
    [35] Z. Li, T. Dekel, F. Cole, R. Tucker, N. Snavely, C. Liu, and W. T. Freeman, “Learning the depths of moving people by watching frozen people,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4521-4530, 2019.
    [36] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
    [37] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4401-4410, 2019.
    [38] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “Infogan: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in neural information processing systems, pp. 2172-2180, 2016.
    [39] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. Yang, “Hologan: Unsupervised learning of 3d representations from natural images,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 7588-7597, 2019.
    [40] C. Ballester, M. Bertalmio, V. Caselles, G. Sapiro, and J. Verdera, “Filling-in by joint interpolation of vector fields and gray levels,” IEEE Transactions on Image Processing, vol. 10, no. 8, pp. 1200-1211, 2001.
    [41] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pp. 417-424, 2000.
    [42] A. Levin, A. Zomet, and Y. Weiss, “Learning how to inpaint from global image statistics,” in Proceedings of the IEEE International Conference on Computer Vision, p. 305, IEEE, 2003.
    [43] M. Bertalmio, L. Vese, G. Sapiro, and S. Osher, “Simultaneous structure and texture image inpainting,” IEEE Transactions on Image Processing, vol. 12, pp. 882-889, Aug. 2003.
    [44] S. Darabi, E. Shechtman, C. Barnes, D. B. Goldman, and P. Sen, “Image melding,” ACM Transactions on Graphics, vol. 31, pp. 1-10, Aug. 2012.
    [45] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and locally consistent image completion,” ACM Transactions on Graphics, vol. 36, pp. 1-14, Jul. 2017.
    [46] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2536-2544, 2016.
    [47] J.-W. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, “Unsupervised scale-consistent depth and ego-motion learning from monocular video,” in Thirty-third Conference on Neural Information Processing Systems (NeurIPS), 2019.
    [48] X. Chen, J. Song, and O. Hilliges, “Monocular neural image based rendering with continuous view control,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 4090-4100, 2019.
    [49] M. Jaderberg, K. Simonyan, A. Zisserman, et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems, pp. 2017-2025, 2015.
    [50] R. Xu, X. Li, B. Zhou, and C. C. Loy, “Deep flow-guided video inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3723-3732, 2019.
    [51] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu, “Occlusion aware unsupervised learning of optical flow,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4884-4893, 2018.
    [52] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Generative image inpainting with contextual attention,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5505-5514, 2018.
    [53] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2462-2470, 2017.
    [54] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Globally and locally consistent image completion,” ACM Transactions on Graphics (ToG), vol. 36, no. 4, pp. 1-14, 2017.
    [55] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in neural information processing systems, pp. 5767-5777, 2017.
    [56] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, “View synthesis by appearance flow,” in European Conference on Computer Vision, pp. 286-301, Springer, 2016.

    Full-text release date: 2025/08/24 (campus network)
    Full-text release date: 2025/08/24 (off-campus network)
    Full-text release date: 2025/08/24 (National Central Library: Taiwan Dissertations and Theses System)