
Graduate Student: An-Rong Wu (吳侒融)
Thesis Title: Traffic Accident Detection Using Convolutional Neural Network and Self-attention Mechanism (卷積神經網路與自我關注機制於車禍偵測之應用)
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Committee Members: Sheng-Luen Chung (鍾聖倫), Kuen-Tsair Lay (賴坤財), Chien-Ching Chiu (丘建青), Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2020
Academic Year of Graduation: 108
Language: Chinese
Number of Pages: 69
Keywords (Chinese): accident detection, dashcam accident dataset, convolutional neural network, self-attention mechanism, autonomous driving
Keywords (English): Accident detection, Dashcam accident dataset, CNN, Self-attention, Autonomous vehicles
Access Counts: 201 views, 0 downloads
Chinese Abstract: Traffic accident detection, which requires the ability to recognize abnormal situations on the road, has received increasing attention owing to its wide range of applications, such as Advanced Driver Assistance Systems (ADAS), video surveillance, and traffic analysis. This thesis proposes a novel traffic accident detection architecture for dashcam videos. The method first uses a temporal relation network to process a sequence of video frames and generate spatio-temporal features; a bidirectional self-attention mechanism is then applied to effectively learn long-term temporal dependencies across frames. In addition, to facilitate training of the architecture, we collected car accident videos, mainly from Taiwan, from YouTube, 行車記錄器互助網 (VEDR.tw), and Facebook, and provided temporal annotations to improve the performance of traffic accident detection. We refer to this new dataset as the ITRI dataset; it covers more challenging conditions such as rain, varying brightness, day-night changes, and tunnel scenes. Finally, we evaluate our method on the commonly used DAD dataset and on the ITRI dataset; with the bidirectional self-attention mechanism, the proposed network achieves higher performance than previous methods.
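The self-attention mechanism referred to in the abstract is, in its standard form, scaled dot-product attention (listed in Section 3.3.1 of the table of contents below). As background, the usual formula in standard notation, stated here as a reminder rather than quoted from the thesis:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]

where \(Q\), \(K\), and \(V\) are the query, key, and value matrices and \(d_k\) is the key dimension.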


Abstract: Traffic accident detection, which needs to recognize abnormal movements on the road, is receiving more attention due to a wide range of applications, such as Advanced Driver Assistance Systems (ADAS), video surveillance, and traffic analysis. This thesis proposes a novel architecture for traffic accident detection in dashcam videos. The new method first utilizes a temporal relation network to process a sequence of frames and generate spatio-temporal features. Afterward, a bidirectional self-attention mechanism is employed to effectively learn the long-term temporal dependencies across frames. Furthermore, to facilitate the training of the network, we also collect a large number of car accident videos recorded in Taiwan from YouTube, VEDR.tw, and Facebook, and provide temporal annotations to boost the performance of traffic accident detection. This dataset, referred to as the ITRI dataset, contains a variety of challenging conditions, such as rain, varying illumination, day-night changes, and tunnel scenes. Finally, we evaluate our method on the commonly used DAD dataset and on the ITRI dataset. By invoking the bidirectional self-attention mechanism, the network provides superior performance compared with previous works.
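To make the pipeline described above concrete, the following is a minimal sketch of how per-frame CNN features might be scored with forward- and backward-masked self-attention. It only illustrates the general idea using standard PyTorch; the function and class names (masked_self_attention, AccidentScorer), the feature dimensions, and the fusion choices are hypothetical and are not taken from the thesis.

```python
# Illustrative sketch only: per-frame accident scoring that combines CNN frame
# features with forward- and backward-masked self-attention, in the spirit of
# the bidirectional self-attention described in the abstract. Names and
# dimensions are assumptions, not the thesis implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_self_attention(x, causal: bool):
    """Scaled dot-product self-attention over time for features of shape (B, T, D).
    causal=True restricts attention to the current and earlier frames;
    causal=False restricts it to the current and later frames."""
    B, T, D = x.shape
    scores = torch.matmul(x, x.transpose(1, 2)) / D ** 0.5         # (B, T, T)
    mask = torch.ones(T, T, dtype=torch.bool, device=x.device)
    mask = torch.tril(mask) if causal else torch.triu(mask)         # past vs. future
    scores = scores.masked_fill(~mask, float('-inf'))
    return torch.matmul(F.softmax(scores, dim=-1), x)               # (B, T, D)

class AccidentScorer(nn.Module):
    """Fuses the forward (past-only) and backward (future-only) attention
    outputs and predicts a per-frame accident probability."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)
        self.head = nn.Linear(2 * feat_dim, 1)

    def forward(self, frame_feats):                                  # (B, T, feat_dim)
        h = self.proj(frame_feats)
        fwd = masked_self_attention(h, causal=True)
        bwd = masked_self_attention(h, causal=False)
        return torch.sigmoid(self.head(torch.cat([fwd, bwd], dim=-1))).squeeze(-1)

# Usage: 16 frames of 512-d CNN features from one hypothetical dashcam clip.
scores = AccidentScorer()(torch.randn(1, 16, 512))                   # (1, 16) per-frame scores
```

In this sketch the two masked passes play the role of the forward and backward directions; the actual thesis architecture also involves a temporal relation network and multi-head attention, which are omitted here for brevity.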

Table of Contents:
摘要 (Chinese Abstract)
Abstract
Acknowledgment
Table of Contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
2 Related Work
  2.1 Accident Detection
  2.2 Object Detection
  2.3 CNN Architecture
  2.4 Attention Network
  2.5 Summary
3 Proposed Method
  3.1 Overall Methodology
  3.2 Spatial Feature Generation
  3.3 Transformer Encoder
    3.3.1 Scaled Dot-Product Attention
    3.3.2 Multi-Head Attention
  3.4 Bidirectional Self-Attention
  3.5 Multi-Scale Temporal Relations
  3.6 Loss Function
  3.7 Summary
4 Experimental Results and Discussion
  4.1 Accident Dataset
    4.1.1 Video Collection
    4.1.2 Data Annotation
    4.1.3 Dataset Distribution
  4.2 Experimental Setup
  4.3 Evaluation Metrics
  4.4 Ablation Studies
    4.4.1 Impact of Different Window Length
    4.4.2 Impact of the Different Segment
    4.4.3 Impact of Bidirectional Self-Attention
  4.5 Comparison with Previous Works
  4.6 Successful Cases and Error Analysis
    4.6.1 ITRI dataset
    4.6.2 DAD dataset
  4.7 Summary
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
Appendix A: Example images from the DAD dataset
References


Full text available from 2025/08/24 (campus network)
Full text available from 2025/08/24 (off-campus network)
Full text available from 2025/08/24 (National Central Library: Taiwan theses and dissertations system)