
Graduate Student: 王剡家 (Yan-Jia Wang)
Thesis Title: 應用具有空間和時間自注意力的自適應器於車禍預測 (Adapting Spatial and Temporal Modeling for Traffic Accident Anticipation)
Advisor: 方文賢 (Wen-Hsien Fang)
Committee Members: 方文賢 (Wen-Hsien Fang), 陳郁堂 (Yie-Tarng Chen), 賴坤財 (Kuen-Tsair Lay), 呂政修 (Jenq-Shiou Leu), 丘建青 (Chien-Ching Chiu)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of Publication: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: English
Number of Pages: 40
Keywords: Anomaly detection, Accident anticipation, Efficient tuning
Access Count: 208 views, 3 downloads
Abstract:
    Traffic accident anticipation is a crucial research field that aims to predict potential accidents before they occur, thereby preventing severe disasters and reducing traffic incidents. In this thesis, we adopt the Adapting Image Models (AIM) framework, improve its internal adapter architecture, and combine several techniques to enhance predictive performance for traffic accident anticipation. First, we deepen the adapter structure and use fully connected (FC) and one-dimensional convolution (Conv1D) layers to extract global and local features, improving the model's understanding of spatial information. Next, we introduce attention mechanisms within the adapter in both the spatial and temporal dimensions. In the spatial dimension, cross attention learns the positional relationships between large and small objects so that accident-prone regions can be located accurately. In the temporal dimension, weighted temporal attention learns the correlations between adjacent frames, enabling the time of a possible accident to be anticipated in advance. By integrating these enhancements into the Vision Transformer (ViT), we conduct experiments on two datasets; the results show a significant performance improvement in traffic accident anticipation, with strong accuracy and anticipation capability and good handling of objects and scenes of diverse scales. These contributions are expected to serve as valuable references for traffic safety research and practice and to foster broader applications in the field.
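
    To make the adapter design described in the abstract more concrete, below is a minimal PyTorch sketch, offered only as an illustration and not as the thesis's actual implementation, of two of the ideas it names: a bottleneck adapter whose down-projection combines a fully connected branch for global features with a Conv1D branch for local features, and a weighted temporal attention over per-frame tokens. The dimensions, kernel size, number of frames and heads, activation, and placement relative to the ViT blocks are all assumed values, and the cross attention between large- and small-object tokens is omitted for brevity.

import torch
import torch.nn as nn


class SpatialAdapter(nn.Module):
    """Bottleneck adapter: an FC branch (global) plus a Conv1D branch (local)."""

    def __init__(self, dim: int = 768, bottleneck: int = 128):
        super().__init__()
        self.down_fc = nn.Linear(dim, bottleneck)              # global features
        self.down_conv = nn.Conv1d(dim, bottleneck,
                                   kernel_size=3, padding=1)   # local features
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        glob = self.down_fc(x)
        loc = self.down_conv(x.transpose(1, 2)).transpose(1, 2)
        return x + self.up(self.act(glob + loc))               # residual adapter


class WeightedTemporalAttention(nn.Module):
    """Self-attention across frames, reweighted by learnable per-frame weights."""

    def __init__(self, dim: int = 768, num_frames: int = 16, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.frame_weight = nn.Parameter(torch.ones(num_frames))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) -- one token (e.g. the CLS token) per frame
        out, _ = self.attn(x, x, x)
        w = self.frame_weight.softmax(dim=0).view(1, -1, 1)    # (1, frames, 1)
        return x + w * out                                     # weighted residual


if __name__ == "__main__":
    tokens = torch.randn(2, 197, 768)   # (batch, spatial tokens, embed dim)
    print(SpatialAdapter()(tokens).shape)               # torch.Size([2, 197, 768])
    frames = torch.randn(2, 16, 768)    # (batch, frames, embed dim)
    print(WeightedTemporalAttention()(frames).shape)    # torch.Size([2, 16, 768])

    In an AIM-style efficient-tuning setup, modules like these would presumably sit alongside each frozen ViT block and be the only trainable parameters, which is what keeps the adaptation lightweight compared with full fine-tuning.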

Table of Contents:
    Abstract (Chinese)
    Abstract
    Acknowledgment
    Table of Contents
    List of Figures
    List of Tables
    List of Acronyms
    1 Introduction
    2 Related Work
        2.1 Traffic Accident Anticipation
        2.2 Attention Mechanism
        2.3 Efficient Tuning
        2.4 Summary
    3 Proposed Method
        3.1 Proposed Architecture
            3.1.1 Essence of ViT
            3.1.2 Adapters
        3.2 Enhanced Spatial Adaptation
        3.3 Enhanced Temporal Adaptation
        3.4 Enhanced Joint Adaptation
        3.5 Summary
    4 Experimental Results and Discussions
        4.1 Accident Dataset
            4.1.1 Driver Anomaly Detection Dataset (DAD)
            4.1.2 Car Crash Dataset (CCD)
        4.2 Experimental Setup
            4.2.1 Implementation Details
            4.2.2 Evaluation Metrics
        4.3 Ablation Studies
            4.3.1 Analysis of Conv1d Up & Down Module
            4.3.2 Analysis of Cross Attention
            4.3.3 Analysis of Weighted Temporal Attention
        4.4 Visualization Results
            4.4.1 Successful Cases
            4.4.2 Failure Cases and Error Analysis
        4.5 Comparison with the State-of-the-Art Works
        4.6 Summary
    5 Conclusion and Future Works
        5.1 Conclusion
        5.2 Future Works
    References
