
Graduate Student: Bui Anh Vu
Thesis Title: Monocular Depth Estimation Based on Multiscale Transformer (基於多尺度變壓器網路之單眼影像深度估測)
Advisor: Jing-Ming Guo (郭景明)
Committee Members: Jing-Ming Guo (郭景明), Ming-Lin Chuang (莊明霖), Chung-An Shen (沈中安), Tien-Ying Kuo (郭天穎), Huei-Yung Lin (林惠勇)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduation Academic Year: 111
Language: English
Number of Pages: 82
Keywords: monocular-depth-estimation, transformer, multiscale, deep-learning, encoder-decoder, mde, depth-map
Views: 205; Downloads: 0
Abstract: Transformers have achieved great success on dense prediction tasks thanks to their ability to capture long-range dependencies in 2D images. Recently, researchers have applied transformers to depth estimation and obtained better performance than traditional methods. However, these works mainly exploit transformers at a single scale of the image. Thus, this thesis proposes a novel hierarchical transformer-based encoder-decoder model for monocular depth estimation. The proposed architecture relies on stacked transformer encoders to capture depth features at multiple scales, which are then combined in the decoding phase. Furthermore, the model discretizes the image depth into bins whose centers are adaptively learned, and the value of each pixel in the depth map is a linear combination of the bin centers weighted by the per-pixel probability distribution over the bins. Finally, the proposed architecture is evaluated on the NYU-Depth V2 indoor dataset and the KITTI outdoor dataset with several metrics, and it achieves promising results on both compared with recent works.
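
The adaptive-bins depth prediction described in the abstract can be made concrete with a short sketch. The following PyTorch snippet is a minimal, hypothetical illustration rather than the thesis code: an image-level feature predicts normalized bin widths over the depth range, bin centers are taken as the midpoints of the resulting intervals, and each pixel's depth is the probability-weighted linear combination of those centers. All module names, feature dimensions, and depth-range values are assumptions made for illustration.

    # Minimal sketch of an adaptive-bins depth head (illustrative, not the thesis code).
    import torch
    import torch.nn as nn

    class AdaptiveBinsHead(nn.Module):
        def __init__(self, feat_dim=128, n_bins=256, min_depth=1e-3, max_depth=10.0):
            super().__init__()
            self.min_depth, self.max_depth = min_depth, max_depth
            # Predict one width per bin from a global (image-level) descriptor.
            self.bin_regressor = nn.Sequential(
                nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_bins))
            # Predict a per-pixel probability distribution over the bins.
            self.prob_head = nn.Conv2d(feat_dim, n_bins, kernel_size=1)

        def forward(self, feats):                     # feats: (B, C, H, W) decoded features
            global_feat = feats.mean(dim=(2, 3))      # (B, C) pooled descriptor
            widths = torch.softmax(self.bin_regressor(global_feat), dim=1)   # widths sum to 1
            depth_range = self.max_depth - self.min_depth
            right_edges = self.min_depth + depth_range * torch.cumsum(widths, dim=1)
            centers = right_edges - 0.5 * depth_range * widths               # (B, N) bin centers
            probs = torch.softmax(self.prob_head(feats), dim=1)              # (B, N, H, W)
            # Per-pixel depth = sum_i prob_i * center_i (linear combination of bin centers).
            depth = torch.einsum('bnhw,bn->bhw', probs, centers).unsqueeze(1)  # (B, 1, H, W)
            return depth, centers

With decoded features of, say, shape (B, 128, H/4, W/4), such a head would return a depth map at that resolution to be upsampled to the input size; in the thesis, the multiscale transformer encoder and the feature-aggregating decoder would supply those features.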

    Table of Contents:
    ABSTRACT
    ACKNOWLEDGEMENT
    TABLE OF CONTENTS
    ABBREVIATIONS AND SYMBOLS
    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1 INTRODUCTION
      1.1. Background
      1.2. Research Objective
      1.3. Research Scope and Assumptions
      1.4. Research Methodology
      1.5. Research Outline
    CHAPTER 2 LITERATURE REVIEW
      2.1. Traditional Approach for Monocular Depth Estimation
      2.2. Deep Learning for Monocular Depth Estimation
        2.2.1. Convolutional Neural Network
        2.2.2. Transformer
        2.2.3. Deep Learning Methods for MDE
    CHAPTER 3 METHODOLOGY
      3.1. Problem Definition
      3.2. Architecture
      3.3. Encoder
      3.4. Decoder
      3.5. Features Aggregating Module (FAM)
      3.6. Adaptively Discretizing Bin Widths
      3.7. Depth Map Prediction
    CHAPTER 4 EXPERIMENTS
      4.1. Dataset
        4.1.1. NYU-Depth v2
        4.1.2. KITTI
      4.2. Metrics
      4.3. Training Loss
      4.4. Implementation
      4.5. Evaluation
    CHAPTER 5 ABLATION STUDY
      5.1. Effect of FAM on Training
      5.2. Model Size
    CHAPTER 6 CONCLUSION
    REFERENCES

