
Graduate Student: Hung-Yi Liu (劉宏毅)
Thesis Title: The Study of Scale-Consistent Video Depth Estimation for Dynamic Objects
Advisor: Yie-Tarng Chen (陳郁堂)
Committee Members: Yie-Tarng Chen (陳郁堂), Wen-Hsien Fang (方文賢), Ming-Bo Lin (林銘波), Shanq-Jang Ruan (阮聖彰), Yung-Ho Leu (呂永和)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Publication Year: 2021
Graduation Academic Year: 109
Language: Chinese
Number of Pages: 35
Chinese Keywords: consistent depth estimation, object tracking and segmentation, object distance estimation, camera pose estimation
Foreign Keywords: consistent video depth, Multi-Object Tracking and Segmentation, object distance estimation, Structure from Motion
Access Count: Views: 148, Downloads: 0
Consistent video depth is regarded as a key ingredient for image-based applications. To generate geometrically consistent depth, Structure from Motion (SfM) can provide a strong geometric reference for depth estimation. However, SfM cannot handle dynamic objects: they usually degrade its results, and moving objects inevitably appear in driving scenes. To address this problem, we propose BDox-CDft, which replaces the SfM stage with learning-based camera pose estimation, and we train a recurrent neural network, DBox, that predicts and repairs the distance changes of dynamic objects from the optical expansion induced by the temporal changes of the bounding boxes and the camera. Finally, reference points indicated by optical flow help the depth estimation model learn consistent depth; combined with the previously estimated camera poses and dynamic-object distances, our method produces robust and consistent depth on driving sequences.
Experiments show that our method outperforms the state-of-the-art methods on KITTI and achieves comparable scores on DSEC. Although neither objects nor the ground show obvious depth artifacts when the model is run on all test sequences, its performance is limited to the regions determined by instance segmentation.
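
The record contains no code; as a rough, self-contained illustration of the optical-expansion cue the abstract refers to (the function name, the constant-object-size assumption, and the example numbers are mine, not taken from the thesis or from DBox), the sketch below propagates an object's distance using the change in its bounding-box height between frames, based on the pinhole-camera relation that apparent size scales inversely with distance.

    # Minimal sketch (not the thesis's DBox network): under a pinhole camera,
    # a rigid object's apparent height scales inversely with its distance,
    # so the ratio of bounding-box heights across frames gives a distance update.
    def update_distance(prev_distance_m, prev_box_height_px, curr_box_height_px):
        """Propagate an object's distance from frame t to frame t+1 using the
        optical expansion of its bounding box."""
        if prev_box_height_px <= 0 or curr_box_height_px <= 0:
            raise ValueError("bounding-box heights must be positive")
        expansion = curr_box_height_px / prev_box_height_px  # > 1: object looks larger, i.e. closer
        return prev_distance_m / expansion

    # Example: a car 20 m away whose box grows from 40 px to 50 px between frames
    # is estimated at roughly 20 * 40 / 50 = 16 m in the next frame.
    print(update_distance(20.0, 40.0, 50.0))  # 16.0

The DBox network described in the abstract presumably learns a more robust version of this mapping from sequences of bounding boxes and camera poses rather than applying the ratio directly.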


Dense depth estimation is a critical issue for novel view synthesis and virtual reality (VR). Conventional Structure from Motion (SfM) and multi-view geometry are popular approaches for dense depth estimation: they provide strong geometric constraints on depth, but perform poorly for dynamic objects and poorly textured regions. With the rapid progress of learning-based monocular depth estimation, the newer approach solves the feature-matching problem in textureless areas, yet it still cannot deliver precise depth for dynamic objects. To fill this gap, in this work we investigate an architecture that refines the depth of dynamic objects produced by a monocular depth estimator. Specifically, we integrate newly developed deep-learning techniques for camera motion estimation, multiple object tracking, and visual odometry to address this problem. First, we use a learning-based pose estimation scheme to replace conventional Structure from Motion (SfM), which can provide camera poses even in poorly textured areas. Then, we train a recurrent neural network to predict the distance of dynamic objects given bounding boxes from the object detector and the camera pose. Finally, we use optical flow with a forward-and-backward check to establish geometrically consistent estimates for pixels across the video. We verify the performance of our proposed system on the KITTI and DSEC datasets.
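
As a minimal sketch of the forward-and-backward flow check mentioned above (the function, array layout, and the 1-pixel threshold are assumptions of this illustration, not details taken from the thesis), the following NumPy code keeps only pixels whose forward flow, composed with the backward flow at the destination, returns close to the starting position; such pixels can then serve as geometrically reliable correspondences for the consistency fine-tuning.

    import numpy as np

    # Minimal sketch of a forward-backward optical-flow consistency check
    # (illustrative only; not the thesis's implementation).
    def forward_backward_mask(flow_fwd, flow_bwd, max_err_px=1.0):
        """flow_fwd maps frame t -> t+1, flow_bwd maps t+1 -> t; both have
        shape (H, W, 2) in (dx, dy) order. Returns a boolean (H, W) mask of
        pixels whose round trip ends within max_err_px of where it started."""
        h, w = flow_fwd.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w]

        # Where each pixel of frame t lands in frame t+1 under the forward flow.
        x1 = np.clip(np.round(xs + flow_fwd[..., 0]).astype(int), 0, w - 1)
        y1 = np.clip(np.round(ys + flow_fwd[..., 1]).astype(int), 0, h - 1)

        # Backward flow sampled at the landing position (nearest neighbour).
        bwd = flow_bwd[y1, x1]

        # For a consistent pixel, the forward and backward flows cancel out.
        err = np.hypot(flow_fwd[..., 0] + bwd[..., 0],
                       flow_fwd[..., 1] + bwd[..., 1])
        return err < max_err_px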

Abstract . . . . i
Acknowledgment . . . . iii
Table of contents . . . . iv
1 Introduction . . . . 1
2 Related Work . . . . 4
3 The Proposed Method . . . . 6
3.1 Data Pre-processing . . . . 7
3.2 Depth Estimation for Dynamic Objects . . . . 9
3.2.1 Network inputs . . . . 9
3.2.2 Network architecture . . . . 11
3.2.3 Loss function . . . . 12
3.3 Depth Inpainting for Dynamic Objects . . . . 13
3.3.1 Depth Re-scaling for Dynamic Objects . . . . 14
3.3.2 Fine-tune Networks for Consistent Depth Estimation . . . . 15
4 Experiment . . . . 17
4.1 Experimental Setup . . . . 17
4.1.1 DBox Training . . . . 17
4.1.2 Consistent Depth Testing . . . . 19
4.2 Experimental Results . . . . 20
4.3 Limitation . . . . 23
5 Conclusion . . . . 25
References . . . . 26


Full text not available for download.
Full text release date: 2024/09/27 (campus network)
Full text release date: 2026/09/27 (off-campus network)
Full text release date: 2026/09/27 (National Central Library: Taiwan NDLTD system)