
Author: Ming-Xun Li
Title: Semantic Instance Aided High Resolution Depth Estimation
Advisor: Yie-Tarng Chen
Committee: Yie-Tarng Chen, Hsing-Lung Chen, Ming-Bo Lin, Wen-Hsien Fang, Chin-Ya Huang
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2021
Graduation Academic Year: 110
Language: English
Pages: 26
Keywords: Stereo Camera, Depth Estimation, Semantic Segmentation, Instance Segmentation

This thesis studies depth prediction from a stereo camera pair, aided by semantic and instance segmentation. In monocular and stereo depth prediction research, deep learning with convolutional neural networks has already achieved good results, yet it still falls short on object boundaries and fine details. We therefore propose a neural network architecture that takes stereo image pairs together with semantic and instance segmentation as input, so that on top of accurate depth prediction it performs better on object boundaries and details. Beyond using the stereo images and the segmentation maps for depth prediction, the network also predicts semantic and instance segmentation from the depth. Adding segmentation as input improves boundary and detail prediction, but during training the semantic and instance information may fade; the additional segmentation predictions therefore act as a constraint that preserves this information and further improves the depth estimates. Experiments on the KITTI 2015 and Carla datasets demonstrate the effectiveness of the method.
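The abstract describes adding auxiliary semantic and instance segmentation predictions as a constraint alongside the depth objective. A minimal sketch of how such a weighted multi-task objective might be combined (the loss forms, array shapes, and weights here are illustrative assumptions, not the thesis's exact formulation):

```python
import numpy as np

def depth_l1(pred, gt, mask):
    """Mean absolute disparity/depth error over valid pixels."""
    return np.abs(pred - gt)[mask].mean()

def seg_cross_entropy(logits, labels):
    """Per-pixel cross-entropy for an auxiliary segmentation head.
    logits: (H, W, C) scores; labels: (H, W) integer class ids."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    h, w = labels.shape
    return -log_prob[np.arange(h)[:, None], np.arange(w)[None, :], labels].mean()

def total_loss(pred_d, gt_d, valid, sem_logits, sem_gt,
               inst_logits, inst_gt, w_sem=0.1, w_inst=0.1):
    # The auxiliary segmentation terms act as a constraint so that the
    # semantic/instance cues are not lost while training for depth.
    return (depth_l1(pred_d, gt_d, valid)
            + w_sem * seg_cross_entropy(sem_logits, sem_gt)
            + w_inst * seg_cross_entropy(inst_logits, inst_gt))
```

In practice the weights `w_sem` and `w_inst` would be tuned so the auxiliary terms regularize, rather than dominate, the depth objective.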


We present a framework for high-resolution stereo depth estimation. In recent years, deep neural networks have made rapid progress in depth estimation for both monocular and stereo cameras. For new visual applications such as novel view synthesis, a high-resolution depth map is desired to render a high-quality novel view. However, the outputs of existing high-resolution depth estimation approaches still lack fine depth detail and sharp boundaries between objects. To address this issue, we propose a hierarchical stereo matching architecture that integrates semantic and instance segmentation, since semantic and instance information imposes spatial constraints on pixels, such as the shape boundaries of objects. We empirically evaluate our framework on the KITTI 2015 and Carla datasets and show the effectiveness of the proposed approach.
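One way to read the semantic and instance input augmentation described above is as channel-wise concatenation of segmentation maps with the stereo images before they enter the matching network. A minimal sketch under that assumption (the one-hot semantic encoding and instance-boundary mask are hypothetical choices, not taken from the thesis):

```python
import numpy as np

def augment_input(left_rgb, right_rgb, sem_left, sem_right,
                  inst_left, inst_right, num_classes):
    """Concatenate one-hot semantic maps and instance-boundary masks
    with the stereo RGB images along the channel axis.
    left_rgb/right_rgb: (H, W, 3) float images.
    sem_*: (H, W) integer class ids; inst_*: (H, W) 0/1 boundary masks."""
    def one_hot(ids):
        return np.eye(num_classes, dtype=np.float32)[ids]  # (H, W, C)
    left = np.concatenate(
        [left_rgb, one_hot(sem_left), inst_left[..., None]], axis=-1)
    right = np.concatenate(
        [right_rgb, one_hot(sem_right), inst_right[..., None]], axis=-1)
    return left, right  # each (H, W, 3 + num_classes + 1)
```

The extra channels give the matching network an explicit signal about where object boundaries lie, which is exactly where plain photometric matching tends to blur.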

Abstract
Acknowledgment
List of Figures
List of Tables
1 Introduction
1.1 Motivations
1.2 Summary of Thesis
1.3 Contributions
1.4 Thesis Outline
2 Related Work
2.1 Stereo Depth Estimation
2.2 Coarse-to-fine CNNs
2.3 Deep Learning for Segmentation Prediction
2.4 Multi-Task for Segmentation and Depth
3 Proposed Method
3.1 Overall Methodology
3.2 Semantic and Instance Input Augmentation
3.3 Hierarchical Depth Stereo Matching
3.4 Transfer Semantic Neural Network
3.5 Loss Functions
3.6 Novel View Synthesis
4 Experimental Results
4.1 Dataset
4.1.1 KITTI Dataset
4.1.2 CARLA Dataset
4.2 Evaluation Protocol
4.2.1 Root-Mean-Square Error
4.2.2 Absolute Relative Error
4.2.3 Three-Pixel Error
4.2.4 Structural Similarity
4.3 Experimental Results
4.3.1 Results of KITTI-2015
4.3.2 Results of CARLA
5 Conclusion and Future Works
5.1 Conclusion
5.2 Future Works
References
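The evaluation protocol in the outline relies on standard depth and disparity metrics. These are their commonly used definitions (the 3-px/5% outlier rule follows the usual KITTI convention; the thesis's exact thresholds may differ, and SSIM is omitted for brevity):

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error between predicted and ground-truth depth."""
    return np.sqrt(np.mean((pred - gt) ** 2))

def abs_rel(pred, gt):
    """Absolute relative error, normalized by ground-truth depth."""
    return np.mean(np.abs(pred - gt) / gt)

def three_pixel_error(pred_disp, gt_disp):
    """KITTI-style outlier rate: a pixel counts as wrong when its
    disparity error exceeds 3 px AND 5% of the ground-truth disparity."""
    err = np.abs(pred_disp - gt_disp)
    bad = (err > 3.0) & (err > 0.05 * gt_disp)
    return bad.mean()
```

RMSE and absolute relative error measure average accuracy, while the three-pixel error counts gross outliers, which is why it is the headline number on KITTI-style benchmarks.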

