
Graduate Student: Kuei-Hao Hsu (許桂豪)
Thesis Title: Structure From Motion and Cycle Generator Adversarial Networks for Depth Reconstruction
(Chinese title: 基於循環生成對抗網路與運動回復結構之深度重建)
Advisor: Ching-Shun Lin (林敬舜)
Committee Members: Chang-Hong Lin (林昌鴻), Wei-Mei Chen (陳維美), Huan-Chun Wang (王煥宗)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Academic Year of Graduation: 107
Language: Chinese
Number of Pages: 51
Keywords (Chinese): 深度重建, 生成對抗網路, 運動恢復結構
Keywords (English): Structure from motion, Generative adversarial networks, Depth reconstruction
    Depth reconstruction is a key step in understanding the geometric relationships between objects and scenes in an image, as it provides three-dimensional environmental cues beyond planar information. In recent years, many deep-learning approaches to depth prediction have been proposed. However, for monocular depth prediction in indoor scenes, models are highly sensitive to object shape, complex textures, lighting, and occlusions, and the limited spatial information in a single monocular image further reduces prediction accuracy. In addition, because indoor scenes are filled with objects carrying high-level semantics, semantic segmentation is often required to help the model delineate the boundaries between objects and the depth distribution of each individual object, which incurs considerable manual labeling cost for the objects in the scene.
    This thesis proposes combining CycleGAN, which can translate images between different domains and generate relatively fine-grained images, with the more intuitive learning approach of Structure from Motion to predict depth from monocular images. First, the CycleGAN generators produce scene images and depth maps, the discriminators strengthen the generators' ability, and a cycle-consistency loss links the translations between the two domains. To counter the training instability that GANs are prone to, spectral normalization is added to stabilize training, and a self-attention mechanism is added to refine the generated depth maps. Next, a PoseNet is added to predict the 6-DOF camera motion parameters (position and orientation), and, together with the depth map, the rotation and translation of the scene are computed so that the pixel changes from frame to frame can be learned. This auxiliary signal supports the learning of image depth, ultimately enabling monocular depth reconstruction.
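    As a rough illustration of how the adversarial, cycle-consistency, and spectral-normalization pieces described above are typically combined, the following PyTorch-style code assembles the corresponding loss terms. This is a minimal sketch under assumed conventions: the module names (G_s2d, G_d2s, D_depth, D_scene), the least-squares adversarial form, and the weight lambda_cyc are illustrative choices, not the exact formulation used in this thesis.

    import torch
    import torch.nn as nn
    from torch.nn.utils import spectral_norm

    # Hypothetical patch discriminator: every convolution is wrapped in
    # spectral_norm to stabilize adversarial training, as the abstract describes.
    class PatchDiscriminator(nn.Module):
        def __init__(self, in_ch):
            super().__init__()
            self.net = nn.Sequential(
                spectral_norm(nn.Conv2d(in_ch, 64, 4, stride=2, padding=1)),
                nn.LeakyReLU(0.2),
                spectral_norm(nn.Conv2d(64, 128, 4, stride=2, padding=1)),
                nn.LeakyReLU(0.2),
                spectral_norm(nn.Conv2d(128, 1, 4, padding=1)),  # patch-level real/fake scores
            )

        def forward(self, x):
            return self.net(x)

    def cyclegan_generator_loss(G_s2d, G_d2s, D_depth, D_scene, scene, depth, lambda_cyc=10.0):
        """Adversarial + cycle-consistency losses for scene <-> depth translation."""
        l1, mse = nn.L1Loss(), nn.MSELoss()

        fake_depth = G_s2d(scene)        # scene image -> depth map
        fake_scene = G_d2s(depth)        # depth map -> scene image
        rec_scene = G_d2s(fake_depth)    # scene -> depth -> scene
        rec_depth = G_s2d(fake_scene)    # depth -> scene -> depth

        # Least-squares adversarial terms pushing generated samples toward "real".
        pred_d, pred_s = D_depth(fake_depth), D_scene(fake_scene)
        adv = mse(pred_d, torch.ones_like(pred_d)) + mse(pred_s, torch.ones_like(pred_s))

        # Cycle-consistency links the two unpaired domains.
        cyc = l1(rec_scene, scene) + l1(rec_depth, depth)
        return adv + lambda_cyc * cyc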


    Depth reconstruction is key to understanding the geometric relationship between objects and scenes in an image, since depth provides three-dimensional environmental information beyond the flat image plane. Recently, many deep-learning approaches to depth prediction have been proposed. However, for monocular depth prediction in indoor scenes, the model is very sensitive to object shape, complex texture, and lighting conditions. The intrinsically limited spatial information of a monocular image also reduces the prediction accuracy of the model. In addition, indoor scenes are filled with many objects carrying high-level semantics, so semantic segmentation is often needed to help the model learn the boundaries between objects and the depth distribution of a single object, which results in expensive manual effort for semantic labeling. In this thesis, we use CycleGAN to translate unpaired images between different domains and Structure from Motion (SfM) to predict the depth of a monocular image.
    We first use the CycleGAN generators to produce scene images and depth maps, the discriminators to strengthen the generators, and the cycle-consistency loss to link the translations between the two domains. Because GAN training is prone to mode collapse, spectral normalization is used to stabilize training and self-attention is used to refine the generated depth map. We also add PoseNet to predict the 6-DOF camera motion parameters, i.e., the position and orientation of the camera between views. Finally, the depth map is combined with the estimated camera motion to compute the rotation and translation of the scene, so that the pixel variation of each image can be learned; this auxiliary signal yields the ability to predict depth from a single monocular image.
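    To make the geometric step above concrete, the sketch below shows the usual inverse-warping operation used in SfM-style self-supervision: back-project the target pixels with the predicted depth, transform them by the 6-DOF relative pose, re-project them with the intrinsics, and bilinearly sample the source frame. It is a minimal sketch assuming PyTorch; the function name inverse_warp, the tensor shapes, and the (B, 3, 4) pose-matrix convention are assumptions for illustration, not the exact implementation in this thesis.

    import torch
    import torch.nn.functional as F

    def inverse_warp(src_img, depth, pose_mat, K):
        """Warp a source view into the target view using depth and relative pose.

        src_img:  (B, 3, H, W) source frame
        depth:    (B, 1, H, W) predicted depth of the target frame
        pose_mat: (B, 3, 4) relative camera pose [R | t] from target to source
        K:        (B, 3, 3) camera intrinsics
        """
        b, _, h, w = depth.shape
        device = depth.device

        # Pixel grid in homogeneous coordinates, shape (B, 3, H*W).
        ys, xs = torch.meshgrid(torch.arange(h, device=device),
                                torch.arange(w, device=device), indexing="ij")
        ones = torch.ones_like(xs)
        pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(b, -1, -1)

        # Back-project to 3-D camera coordinates of the target view.
        cam_points = torch.inverse(K) @ pix * depth.view(b, 1, -1)

        # Rigidly transform into the source view and re-project.
        cam_points = torch.cat([cam_points, torch.ones(b, 1, h * w, device=device)], dim=1)
        src_pix = K @ (pose_mat @ cam_points)
        src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)

        # Normalize to [-1, 1] for grid_sample and bilinearly sample the source image.
        grid = torch.stack([2 * src_pix[:, 0] / (w - 1) - 1,
                            2 * src_pix[:, 1] / (h - 1) - 1], dim=-1).view(b, h, w, 2)
        return F.grid_sample(src_img, grid, padding_mode="zeros", align_corners=True)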

    Acknowledgements
    Abstract (Chinese)
    Abstract (English)
    Table of Contents
    List of Figures
    List of Tables
    List of Abbreviations
    Chapter 1  Introduction
      1.1  Preface
      1.2  Image Generator Model
      1.3  Monocular Depth Estimation
      1.4  Organization of the Thesis
    Chapter 2  Literature Review
      2.1  Image Generation and Translation
      2.2  Depth Map Estimation
    Chapter 3  Experimental Architecture Design
      3.1  Scene Image to Depth Map
      3.2  Structure from Motion
        3.2.1  Camera Pose
        3.2.2  Rigid Flow
        3.2.3  Warp Image
        3.2.4  Rigid Structure Reconstructor
      3.3  Network Architecture and Objective Functions
        3.3.1  Generator
        3.3.2  Discriminator
        3.3.3  PoseNet
        3.3.4  Objective Functions
      3.4  Tricks for DNN Learning
        3.4.1  Spectral Normalization
        3.4.2  Self-Attention
        3.4.3  Nearest Neighbor Interpolation
    Chapter 4  Experimental Results
      4.1  Encoder-Decoder Structure Experiments
      4.2  Comparison of L1 Loss and L2 Loss for Depth Prediction
      4.3  Comparison of Depth Prediction with Different Lp Norms
      4.4  Comparison of Generated Results Across Models
    Chapter 5  Conclusion and Future Work
      5.1  Conclusion
      5.2  Future Work
    References


    Full-text release date: 2024/08/26 (campus network)
    Full-text release date: not authorized for public release (off-campus network)
    Full-text release date: not authorized for public release (National Central Library: Taiwan thesis and dissertation system)