| Field | Value |
|---|---|
| Graduate student | 劉宏毅 Hung-Yi Liu |
| Thesis title | 具尺度一致性的動態物件距離估測之研究 (The Study of Scale-Consistent Video Depth Estimation for Dynamic Objects) |
| Advisor | 陳郁堂 Yie-Tarng Chen |
| Committee members | 陳郁堂 Yie-Tarng Chen, 方文賢 Wen-Hsien Fang, 林銘波 Ming-Bo Lin, 阮聖彰 Shanq-Jang Ruan, 呂永和 Yung-Ho Leu |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering |
| Publication year | 2021 |
| Graduation academic year | 109 |
| Language | Chinese |
| Pages | 35 |
| Chinese keywords | consistent depth estimation, object tracking and segmentation, object distance estimation, camera pose estimation |
| English keywords | consistent video depth, Multi-Object Tracking and Segmentation, objects distance estimation, Structure from Motion |
Consistent video depth is regarded as a key ingredient in image-based applications. To generate geometrically consistent depth, a method called Structure from Motion (SfM) can provide a strong geometric reference for depth estimation. However, SfM cannot handle dynamic objects, which usually degrade its results, and moving objects are unavoidable in driving scenes. To address this problem, we propose BDox-CDft: we use learning-based camera pose estimation in place of the SfM stage, and we train a recurrent neural network, DBox, which predicts and repairs the distance changes of dynamic objects by responding to the temporal changes of their bounding boxes and to the optical expansion induced by camera motion. Finally, reference points indicated by optical flow help the depth estimation model learn consistent depth; combined with the camera poses and dynamic-object distances above, our method generates robust, consistent depth on driving sequences.

Experiments show that our method outperforms state-of-the-art methods on KITTI and achieves comparable scores on DSEC. Although the model produces no obvious depth artifacts on objects or the ground across all test sequences, its performance is limited by the regions determined by instance segmentation.
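The optical-expansion cue mentioned above reduces, under a pinhole camera model, to a simple relation: for an object of fixed physical size, apparent bounding-box height scales inversely with distance. The sketch below illustrates only that geometric relation; the function name and numbers are illustrative and are not the DBox model itself, which learns this mapping with a recurrent network.

```python
def distance_from_expansion(z_prev, h_prev, h_curr):
    """Pinhole-model update: apparent box height h is proportional to
    1 / distance for a rigid object, so the new distance follows from
    the ratio of consecutive box heights: z_curr = z_prev * h_prev / h_curr."""
    return z_prev * (h_prev / h_curr)

# A car at 20 m whose box height grows from 50 px to 55 px has
# approached to roughly 20 * 50 / 55 ≈ 18.2 m.
print(distance_from_expansion(20.0, 50.0, 55.0))
```

In practice this ratio is noisy frame to frame, which is one motivation for smoothing it with a recurrent model conditioned on camera pose, as the thesis describes.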
Dense depth estimation is a critical issue for novel view synthesis and virtual reality (VR). Conventional Structure from Motion (SfM) and multi-view geometry are popular approaches to dense depth estimation and provide strong geometric constraints on depth, but they suffer from poor depth estimation for dynamic objects and poorly textured regions. With the fast progress of learning-based monocular depth estimation, the new approach can solve the feature-matching problem in textureless areas, but it still cannot provide precise depth estimation for dynamic objects. To fill this gap, in this work we investigate an architecture that refines the depth of dynamic objects produced by a monocular depth estimator. Specifically, we integrate newly developed deep-learning techniques for camera motion estimation, multiple object tracking, and visual odometry to address this problem. First, we use a learning-based pose estimation scheme to replace conventional SfM, which can provide camera poses even in poorly textured areas. Then, we train a recurrent neural network to predict the distance of dynamic objects given the bounding boxes from the object detector and the camera poses. Finally, we use optical flow with a forward-backward check to establish geometrically consistent estimates on pixels across the video. We verify the performance of the proposed system on the KITTI and DSEC datasets.
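The forward-backward check can be sketched as follows: displace each pixel by the forward flow, sample the backward flow at the destination, and keep only pixels whose round trip returns near the starting point. This is a minimal NumPy sketch of the general technique, not the thesis's implementation; the pixel threshold and nearest-neighbor sampling are assumptions for illustration.

```python
import numpy as np

def forward_backward_check(flow_fw, flow_bw, thresh=1.0):
    """Return a boolean mask of geometrically consistent pixels.
    flow_fw, flow_bw: (H, W, 2) arrays of (dx, dy) displacements,
    frame t -> t+1 and t+1 -> t respectively."""
    H, W, _ = flow_fw.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Destination of each pixel under the forward flow.
    xf = xs + flow_fw[..., 0]
    yf = ys + flow_fw[..., 1]
    # Sample the backward flow at the (rounded, clipped) destination.
    xi = np.clip(np.round(xf).astype(int), 0, W - 1)
    yi = np.clip(np.round(yf).astype(int), 0, H - 1)
    bw = flow_bw[yi, xi]
    # Round-trip residual: start + forward + backward should be ~0.
    err = np.hypot(flow_fw[..., 0] + bw[..., 0],
                   flow_fw[..., 1] + bw[..., 1])
    return err < thresh
```

Pixels failing the check (occlusions, flow errors, object boundaries) are the ones that would otherwise inject inconsistent reference points into the depth model's training signal.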