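Monocular Depth Estimation with Semantic Information and Multiple Constraints | National Taiwan University of Science and Technology Theses and Dissertations System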

Graduate student:	黃浩禎 Hao-Chen Huang
Thesis title:	Monocular Depth Estimation with Semantic Information and Multiple Constraints
Advisor:	花凱龍 Kai-Lung Hua
Committee members:	陳永耀 Yung-Yao Chen, 鍾國亮 Kuo-Liang Chung, 陳怡伶 Yi-Ling Chen
Degree:	Master
Department:	College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of publication:	2022
Academic year of graduation:	110 (ROC calendar)
Language:	English
Number of pages:	40
Keywords:	Monocular Depth Estimation, Deformable Convolution, Multi-Loss, Semantic Segmentation

Monocular depth estimation is the task of inferring depth information and scene geometry from a single 2D image. It often supports other tasks, such as autonomous driving and simultaneous localization and mapping (SLAM). Accurately estimating depth from a single image is challenging because a single 2D image may correspond to multiple plausible depth orderings. In this thesis, we propose a monocular depth estimation model that leverages high-level and multi-scale information and dynamically adjusts its receptive field to achieve state-of-the-art performance. Finally, we apply multiple loss terms to constrain feature learning and to preserve accuracy after fusion.

Contents
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
List of Algorithms
1 Introduction
2 Related Work
  2.1 Monocular Depth Estimation Models
  2.2 Feature Fusion
3 Method
  3.1 Res2Net Fusion Block
  3.2 Deformable Convolution
  3.3 Semantic Segmentation Head
  3.4 Multi-loss Constrained Models
4 Experiments
  4.1 Implementation Details
  4.2 KITTI Dataset
  4.3 NYUv2 Dataset
  4.4 Evaluation
  4.5 Comparison to the State of the Art
  4.6 Ablation Study
5 Conclusions
References
Letter of Authority

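Full-text release date: 2024/08/18 (campus network)
Full-text release date: 2024/08/18 (off-campus network)
Full-text release date: 2024/08/18 (National Central Library: Taiwan theses and dissertations system)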
