
Graduate Student: 蔡孟軒 (Meng-Hsuan Tsai)
Thesis Title: 3D Vehicle Trajectory Estimation from YouTube Car Accident Videos Using Deep Neural Networks (利用深度學習估測 YouTube 車禍影片中三維車輛軌跡之研究)
Advisor: 陳郁堂 (Yie-Tarng Chen)
Committee Members: 陳郁堂 (Yie-Tarng Chen), 林銘波 (Ming-Bo Lin), 方文賢 (Wen-Hsien Fang), 呂政修 (Jenq-Shiou Leu), 陳省隆 (Hsing-Lung Chen)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Graduation Academic Year: 107 (2018-2019)
Language: English
Number of Pages: 51
Chinese Keywords: 視覺測距 (Visual Odometry), 即時定位與地圖構建 (Simultaneous Localization and Mapping), 物件偵測 (Object Detection), 透視變換 (Perspective Transformation), 前後向錯誤校正 (Forward-Backward Error Correction)
Foreign-Language Keywords: Visual Odometry, Simultaneous Localization and Mapping, Object Detection, Perspective Transformation, Forward and Backward Error
Owing to compelling recent applications such as self-driving cars, generating 3D vehicle trajectories from dash-cam videos has become a topic of growing interest. To prevent car accidents in advance, we must first understand their main causes and, based on these 3D vehicle trajectories, the driving patterns that tend to lead to accidents. In this thesis, we propose a 3D vehicle trajectory estimation scheme that generates vehicle trajectories from car-accident dash-cam videos posted on YouTube. First, we estimate camera ego-motion with a monocular visual odometry method that uses feature point pairs from two adjacent frames, as well as with the recently developed SfMLearner neural network. Because ego-motion estimation from image pairs suffers from scale ambiguity, only relative speed can be obtained. To address this problem, we first project each image onto a bird's-eye view via a perspective transformation, detect the lane lines, and then scale them to find the ratio between feature-point displacement and real-world distance. Finally, we compute the distance between each vehicle and the ego-motion camera and add the camera ego-motion to obtain the 3D trajectories of the other vehicles in world coordinates. Compared with other ego-motion estimation methods for monocular cameras, our approach has two advantages. First, it recovers vehicle trajectories at absolute scale. Second, the proposed method works not only on the KITTI dataset but also on lower-quality car-accident dash-cam videos from YouTube.
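As a rough illustration of the bird's-eye-view projection and lane-line scaling step described above, the following Python/OpenCV sketch warps a dash-cam frame with a perspective transform and converts a feature-point displacement to metres. The source/destination points, output size, and the 4 m dash length are illustrative assumptions, not values taken from the thesis.

```python
import cv2
import numpy as np

# Hypothetical road trapezoid in the dash-cam frame (pixels) and its
# rectangular target in the bird's-eye view; real values depend on the camera.
SRC = np.float32([[560, 420], [720, 420], [1150, 700], [130, 700]])
DST = np.float32([[320, 0], [960, 0], [960, 720], [320, 720]])

def to_birds_eye(frame, size=(1280, 720)):
    """Warp a dash-cam frame onto an assumed ground plane (inverse perspective)."""
    M = cv2.getPerspectiveTransform(SRC, DST)
    return cv2.warpPerspective(frame, M, size)

def metres_per_pixel(dash_length_px, dash_length_m=4.0):
    """Scale ratio from a detected lane dash; 4 m is an assumed marking length."""
    return dash_length_m / dash_length_px

def displacement_to_metres(p_prev, p_curr, scale):
    """Convert a matched feature pair's displacement in the bird's-eye view to metres."""
    return float(np.linalg.norm(np.asarray(p_curr) - np.asarray(p_prev))) * scale
```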


Generating 3D vehicle trajectories from dash-cam videos has attracted recent attention due to compelling applications in self-driving cars. To prevent car accidents in advance, we must first understand the main causes of accidents and the driving patterns that are prone to accidents, based on these 3D vehicle trajectories. In this thesis, we explore an intriguing scenario for 3D vehicle trajectory estimation: generating trajectories of vehicles involved in accidents from online dash-cam videos on YouTube. First, we estimate camera ego-motion using both structure from motion, which relies on the feature point pairs of two adjacent frames, and the recently developed SfMLearner neural network. Ego-motion estimation from image pairs suffers from scale ambiguity, so only a relative speed can be obtained. To address this problem, we first apply a novel vehicle depth estimation that combines an inverse perspective transform with a dashed-lane-line heuristic, and then sequentially compute a motion vector from adjacent frames in the bird's-eye view using matched feature pairs. Finally, we obtain 3D vehicle trajectories in world coordinates by adding the estimated camera ego-motion to the vehicle motion in car coordinates. Compared with other ego-motion estimation approaches for a monocular camera, our approach has two advantages. First, our method recovers vehicle trajectories at absolute scale. Second, the proposed method operates not only on the KITTI dataset but also on YouTube dash-cam accident videos.
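The classical structure-from-motion branch mentioned above (feature point pairs from two adjacent frames, essential matrix, then relative rotation and translation) can be sketched as follows. This is a minimal illustration assuming OpenCV, a known camera intrinsic matrix K, and ORB matching as a stand-in for whatever feature tracker the thesis actually uses; the recovered translation is unit-norm, i.e. the scale-ambiguous motion that the lane-line heuristic is meant to resolve.

```python
import cv2
import numpy as np

def relative_pose(img1, img2, K):
    """Estimate the up-to-scale camera motion between two adjacent frames."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Brute-force Hamming matching with cross-check; the thesis may instead use
    # a KLT-style tracker with forward-backward error checking.
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Essential matrix with RANSAC, then cheirality check to recover R and t.
    E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                      prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t  # t has unit norm: only relative (scale-ambiguous) motion

def accumulate(cur_R, cur_t, R, t, scale):
    """Chain per-frame motion into an ego trajectory, given an absolute scale."""
    cur_t = cur_t + scale * cur_R @ t
    cur_R = R @ cur_R
    return cur_R, cur_t
```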

Abstract
Acknowledgment
Table of Contents
List of Figures
1 Introduction
  1.1 Vehicle Trajectories Estimation
  1.2 Motivation
  1.3 Summary of the Proposed Approach
  1.4 Contributions
  1.5 Thesis Outline
2 Related Work
  2.1 Traditional Scene Geometry
  2.2 View Synthesis
  2.3 Learning-Based Motion Estimation
  2.4 Object Detection
    2.4.1 One-Stage Detector
    2.4.2 Two-Stage Detector
  2.5 Visual Odometry Benchmark [1]
3 The Proposed Approach for 3D Ego-Motion Estimation
  3.1 Overall Methodology
  3.2 Problem Description
  3.3 Region-of-Interest Selection
  3.4 Structure from Motion
    3.4.1 Monocular Visual Odometry
    3.4.2 Feature Extraction
    3.4.3 Feature Matching
    3.4.4 Essential Matrix Estimation
    3.4.5 Computing R, t from the Essential Matrix
    3.4.6 Ego-Motion Estimation in World Coordinate
  3.5 Ego-Motion Estimation Based on Deep Learning
    3.5.1 Overview
    3.5.2 Supervision by View Synthesis
    3.5.3 Differentiable Depth Image-Based Rendering Module
    3.5.4 Mask for the Model Limitation
    3.5.5 Depth Smoothness
    3.5.6 Loss Function
    3.5.7 Network Architecture
4 The Proposed 3D Trajectory Generation
  4.1 Absolute Scale Estimation for Monocular Visual Odometry
    4.1.1 Feature Point Pairs Selection
    4.1.2 Inverse Perspective Transformation
    4.1.3 Distance Estimation by Dotted Line Heuristic
    4.1.4 Scale Ratio Calculation
  4.2 Coordinate Transformation
    4.2.1 Coordinate Rotation
    4.2.2 Car Coordinate to World Coordinate Transformation
5 Experimental Results
  5.1 Datasets
    5.1.1 KITTI Visual Odometry Dataset [1]
    5.1.2 YouTube Accident Dataset
  5.2 Experimental Results on Monocular Visual Odometry
  5.3 Experimental Results on Absolute Scale Estimation
  5.4 Experimental Results on Car Accident Trajectory Generation
  5.5 Summary
6 Conclusion
References

[1] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, pp. 1231-1237, Aug. 2013.
[2] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, pp. 1147-1163, Aug. 2015.
[3] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, vol. 33, pp. 1255-1262, June 2017.
[4] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in European Conference on Computer Vision, pp. 834-849, Springer, 2014.
[5] Y. Furukawa, C. Hernández, et al., "Multi-view stereo: A tutorial," Foundations and Trends in Computer Graphics and Vision, vol. 9, pp. 1-148, June 2015.
[6] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[7] J. L. Schönberger and J.-M. Frahm, "Structure-from-motion revisited," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4104-4113, 2016.
[8] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, "Bundle adjustment - a modern synthesis," in International Workshop on Vision Algorithms, pp. 298-372, Springer, 1999.
[9] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 1052-1067, June 2007.
[10] J. Civera, A. J. Davison, and J. M. Montiel, "Inverse depth parametrization for monocular SLAM," IEEE Transactions on Robotics, vol. 24, pp. 932-945, Oct. 2008.
[11] A. Chiuso, P. Favaro, H. Jin, and S. Soatto, "Structure from motion causally integrated over time," IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 523-535, Aug. 2002.
[12] E. Eade and T. Drummond, "Scalable monocular SLAM," in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 1, pp. 469-476, IEEE Computer Society, 2006.
[13] H. Strasdat, J. M. Montiel, and A. J. Davison, "Visual SLAM: Why filter?," Image and Vision Computing, vol. 30, pp. 65-77, Feb. 2012.
[14] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon, "KinectFusion: Real-time dense surface mapping and tracking," in ISMAR, vol. 11, pp. 127-136, 2011.
[15] F. Endres, J. Hess, J. Sturm, D. Cremers, and W. Burgard, "3-D mapping with an RGB-D camera," IEEE Transactions on Robotics, vol. 30, pp. 177-187, Sep. 2013.
[16] S. E. Chen and L. Williams, "View interpolation for image synthesis," in Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, pp. 279-288, ACM, 1993.
[17] C. L. Zitnick, S. B. Kang, M. Uyttendaele, S. Winder, and R. Szeliski, "High-quality video view interpolation using a layered representation," in ACM Transactions on Graphics (TOG), vol. 23, pp. 600-608, ACM, 2004.
[18] S. M. Seitz and C. R. Dyer, "View morphing," in Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, pp. 21-30, ACM, 1996.
[19] P. E. Debevec, C. J. Taylor, and J. Malik, Modeling and Rendering Architecture from Photographs. University of California, Berkeley, 1996.
[20] A. Fitzgibbon, Y. Wexler, and A. Zisserman, "Image-based rendering using image-based priors," in Proceedings of the IEEE International Conference on Computer Vision, pp. 14-17, Oct. 2003.
[21] J. Flynn, I. Neulander, J. Philbin, and N. Snavely, "DeepStereo: Learning to predict new views from the world's imagery," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5515-5524, 2016.
[22] J. Xie, R. Girshick, and A. Farhadi, "Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks," in European Conference on Computer Vision, pp. 842-857, Springer, 2016.
[23] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros, "View synthesis by appearance flow," in European Conference on Computer Vision, pp. 286-301, Springer, 2016.
[24] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2758-2766, 2015.
[25] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040-4048, 2016.
[26] J. Thewlis, S. Zheng, P. Torr, and A. Vedaldi, "Fully-trainable deep matching," Sep. 2016.
[27] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2938-2946, 2015.
[28] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz, "Geometry-aware learning of maps for camera localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2616-2625, 2018.
[29] R. Garg, V. K. BG, G. Carneiro, and I. Reid, "Unsupervised CNN for single view depth estimation: Geometry to the rescue," in European Conference on Computer Vision, pp. 740-756, Springer, 2016.
[30] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki, "SfM-Net: Learning of structure and motion from video," arXiv preprint arXiv:1704.07804, Apr. 2017.
[31] A. Z. Zhu, W. Liu, Z. Wang, V. Kumar, and K. Daniilidis, "Robustness meets deep learning: An end-to-end hybrid pipeline for unsupervised learning of ego-motion," arXiv preprint arXiv:1812.08351, 2018.
[32] R. Li, S. Wang, Z. Long, and D. Gu, "UnDeepVO: Monocular visual odometry through unsupervised deep learning," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7286-7291, IEEE, 2018.
[33] C. Luo, Z. Yang, P. Wang, Y. Wang, W. Xu, R. Nevatia, and A. Yuille, "Every pixel counts++: Joint learning of geometry and motion with 3D holistic understanding," arXiv preprint arXiv:1810.06125, 2018.
[34] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proceedings of the European Conference on Computer Vision, pp. 21-37, 2016.
[35] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision, pp. 2999-3007, 2017.
[36] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, Apr. 2018.
[37] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, pp. 91-99, 2015.
[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 936-944, 2017.
[39] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proceedings of the European Conference on Computer Vision, pp. 740-755, 2014.
[40] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448, 2015.
[41] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Advances in Neural Information Processing Systems, pp. 379-387, 2016.
[42] M. Trajković and M. Hedley, "Fast corner detection," Image and Vision Computing, vol. 16, pp. 75-87, Feb. 1998.
[43] C. Tomasi and T. Kanade, "Detection and tracking of point features," Tech. Rep. CMU-CS-91-132, Carnegie Mellon University, 1991.
[44] Z. Kalal, K. Mikolajczyk, and J. Matas, "Forward-backward error: Automatic detection of tracking failures," in 2010 20th International Conference on Pattern Recognition, pp. 2756-2759, IEEE, 2010.
[45] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851-1858, 2017.
[46] C. Fehn, "Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3D-TV," in Stereoscopic Displays and Virtual Reality Systems XI, vol. 5291, pp. 93-105, International Society for Optics and Photonics, 2004.
[47] M. Jaderberg, K. Simonyan, A. Zisserman, et al., "Spatial transformer networks," in Advances in Neural Information Processing Systems, pp. 2017-2025, 2015.

Full-text release date: 2021/08/16 (campus network)
Full-text release date: 2024/08/16 (off-campus network)
Full-text release date: 2024/08/16 (National Central Library: Taiwan NDLTD system)