
Author: Hao-Chen Huang (黃浩禎)
Thesis Title: Monocular Depth Estimation with Semantic Information and Multiple Constraints (具有語義訊息和多重約束的單目深度預測)
Advisor: Kai-Lung Hua (花凱龍)
Committee: Yung-Yao Chen (陳永耀), Kuo-Liang Chung (鍾國亮), Yi-Ling Chen (陳怡伶)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Thesis Publication Year: 2022
Graduation Academic Year: 110
Language: English
Pages: 40
Keywords (in Chinese): 單目深度估計、變形卷積、多重損失函數、語意分割
Keywords (in English): Monocular Depth Estimation, Deformable Convolution, Multi-Loss, Semantic Segmentation

Abstract:
Monocular depth estimation is the task of inferring depth information and scene geometry from a single 2D image. It is often used to assist other tasks such as autonomous driving and simultaneous localization and mapping (SLAM). Accurately estimating depth from a single image is challenging because a single 2D scene may correspond to multiple plausible depth orderings. In this thesis, we propose a monocular depth estimation model that leverages high-level and multi-scale information and dynamically adjusts the field of view to achieve state-of-the-art performance. Finally, we apply multiple losses to constrain feature learning and ensure accuracy after fusion.
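The abstract's "multiple losses" refer to a weighted combination of objectives applied during training. A minimal sketch of that idea follows; the specific loss terms (a scale-invariant log depth loss in the style of Eigen et al., an L1 term, and a precomputed segmentation loss) and the weights are illustrative assumptions, not the thesis's exact formulation:

```python
import math

def silog_loss(pred, gt, lam=0.85):
    """Scale-invariant log loss, a common depth-estimation objective.

    pred, gt: lists of positive depth values (meters).
    """
    d = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    n = len(d)
    return sum(x * x for x in d) / n - lam * (sum(d) / n) ** 2

def l1_loss(pred, gt):
    """Mean absolute depth error."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)

def multi_loss(pred_depth, gt_depth, seg_loss, w_silog=1.0, w_l1=0.1, w_seg=0.1):
    """Weighted sum of depth and (hypothetical) segmentation objectives.

    seg_loss stands in for the semantic-segmentation head's loss; the
    weights w_* are assumed hyperparameters, not values from the thesis.
    """
    return (w_silog * silog_loss(pred_depth, gt_depth)
            + w_l1 * l1_loss(pred_depth, gt_depth)
            + w_seg * seg_loss)
```

Constraining the shared features with several such terms at once is what the abstract means by limiting feature development: each loss pulls the intermediate representations toward properties (metric accuracy, sharp boundaries, semantic consistency) that a single objective would not enforce on its own.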

Contents
  Abstract in Chinese
  Abstract in English
  Acknowledgements
  Contents
  List of Figures
  List of Tables
  List of Algorithms
  1 Introduction
  2 Related Work
    2.1 Monocular Depth Estimation Models
    2.2 Feature Fusion
  3 Method
    3.1 Res2Net Fusion Block
    3.2 Deformable Convolution
    3.3 Semantic Segmentation Head
    3.4 Multi-loss Constrained Models
  4 Experiments
    4.1 Implementation Details
    4.2 KITTI Dataset
    4.3 NYUv2 Dataset
    4.4 Evaluation
    4.5 Comparison to the State-of-the-Art
    4.6 Ablation Study
  5 Conclusions
  References
  Letter of Authority


Full text public date: 2024/08/18 (Intranet, Internet, and National Library)