
Graduate Student: Che-Wei Chen (陳哲威)
Thesis Title: Depth Estimation by Adaptive Segmentation Bins (以適應式區段分割預測深度)
Advisors: Sheng-Luen Chung (鍾聖倫), Gee-Sern Hsu (徐繼聖)
Committee Members: Sheng-Luen Chung (鍾聖倫), Gee-Sern Hsu (徐繼聖), Shun-Feng Su (蘇順豐), Chung-Hsien Kuo (郭重顯), Kosin Chamnongthai
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2021
Academic Year of Graduation: 109
Language: English
Number of Pages: 60
Chinese Keywords: 深度預測、適應式區段分割 (depth estimation, adaptive segmentation bins)
English Keywords: Depth estimation, adaptive bins, segmentation

Accurate depth estimation from a single RGB scene image is challenging because a 2D image provides only limited color information. We propose a method for single-image depth estimation that fuses an adaptive-bin depth estimation network with an object segmentation network. The adaptive-bin depth estimation network consists of an RGB-Seg encoder-decoder and a vision transformer, while the segmentation network consists of an RGB encoder-decoder. We first obtain a segmentation map of the RGB image from the segmentation network, and then feed the RGB image together with its segmentation map into the depth estimation network to complete the depth estimation task. Whereas previous methods rely only on the color information in the image, we add object information obtained from segmentation to enrich the 2D visual cues, which makes the method more useful for machine vision. The novelties of this study are as follows: 1) the segmentation network identifies objects and performs scene analysis, increasing the amount of information available from the RGB image; 2) the depth estimation network integrates this additional input, which improves learning and brings the depth bins closer to the depth distribution of the scene; 3) the method achieves competitive performance on benchmark datasets compared with state-of-the-art works.
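To make the two-stage design described in the abstract concrete, the following is a minimal PyTorch-style sketch of the pipeline: a segmentation network first produces a segmentation map, which is concatenated with the RGB image and passed through an RGB-Seg encoder-decoder whose features feed a transformer head that predicts adaptive depth bins and per-pixel bin probabilities. All module names, layer sizes, the bin count, the depth range, and the bin-center formulation (borrowed from the AdaBins line of work) are illustrative assumptions, not the thesis's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoderDecoder(nn.Module):
    """Placeholder encoder-decoder; the thesis uses a much larger backbone."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                                 nn.ConvTranspose2d(32, out_ch, 4, stride=2, padding=1))

    def forward(self, x):
        return self.dec(self.enc(x))


class AdaptiveBinHead(nn.Module):
    """Transformer head predicting global adaptive bin widths and per-pixel bin logits."""
    def __init__(self, feat_ch=64, n_bins=128, d_min=1e-3, d_max=10.0):
        super().__init__()
        self.n_bins, self.d_min, self.d_max = n_bins, d_min, d_max
        self.proj = nn.Conv2d(feat_ch, 128, 1)
        layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.bin_mlp = nn.Linear(128, n_bins)       # global bin widths for the whole image
        self.prob_conv = nn.Conv2d(128, n_bins, 1)  # per-pixel bin logits

    def forward(self, feat):
        x = self.proj(feat)                                   # B x 128 x H x W
        tokens = self.transformer(x.flatten(2).transpose(1, 2))  # B x HW x 128
        widths = F.softmax(self.bin_mlp(tokens.mean(dim=1)), dim=1)  # B x n_bins, sums to 1
        # AdaBins-style bin centers: c_i = d_min + (d_max - d_min) * (w_i/2 + sum_{j<i} w_j)
        edges = torch.cumsum(widths, dim=1)
        centers = self.d_min + (self.d_max - self.d_min) * (edges - widths / 2)
        probs = F.softmax(self.prob_conv(x), dim=1)           # B x n_bins x H x W
        # Final depth: per-pixel expectation over the adaptive bin centers
        return (probs * centers[:, :, None, None]).sum(dim=1, keepdim=True)


class TwoStageDepthNet(nn.Module):
    """Segmentation network -> RGB-Seg encoder-decoder -> adaptive-bin transformer head."""
    def __init__(self, n_classes=40):
        super().__init__()
        self.seg_net = TinyEncoderDecoder(3, n_classes)            # RGB encoder-decoder
        self.rgb_seg_net = TinyEncoderDecoder(3 + n_classes, 64)   # RGB-Seg encoder-decoder
        self.head = AdaptiveBinHead(feat_ch=64)

    def forward(self, rgb):
        seg = self.seg_net(rgb).softmax(dim=1)                  # soft segmentation map
        feat = self.rgb_seg_net(torch.cat([rgb, seg], dim=1))   # fuse RGB and segmentation
        return self.head(feat)


if __name__ == "__main__":
    depth = TwoStageDepthNet()(torch.randn(1, 3, 32, 32))
    print(depth.shape)  # torch.Size([1, 1, 32, 32])

In this sketch the depth at each pixel is the expectation of the adaptive bin centers under the predicted per-pixel bin probabilities, and the segmentation map is passed as soft class scores so the whole pipeline remains differentiable; both choices are assumptions made for illustration.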

摘要 (Chinese Abstract)
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Chapter 1: Introduction
  1.1 Motivation
  1.2 Method Overview
  1.3 Contribution
  1.4 Thesis Organization
Chapter 2: Related Work
  2.1 Regression Based Method
  2.2 Hybrid Regression Based Method
  2.3 AdaBins
  2.4 Transformer
Chapter 3: Main Method
  3.1 Scene Segmentation Module
  3.2 RGB-Seg Encoder-Decoder
  3.3 Vision-Transformer
  3.4 Loss Functions
Chapter 4: Experiments and Results
  4.1 Database and Evaluation Metric
  4.2 Implementation Details
  4.3 Comparison with State-of-the-art
  4.4 Ablation Study
  4.5 Qualitative Result
  4.6 Competition with AdaBins
Chapter 5: Conclusion
References
Appendix A: Categories of Segmentation Tasks
Appendix B: Glossary


Full Text Release Date: 2023/09/22 (campus network)
Full Text Release Date: 2026/09/22 (off-campus network)
Full Text Release Date: 2026/09/22 (National Central Library: Taiwan Thesis and Dissertation System)