
Graduate Student: 王剡家 (Yan-Jia Wang)
Thesis Title: 應用具有空間和時間自注意力的自適應器於車禍預測 (Adapting Spatial and Temporal Modeling for Traffic Accident Anticipation)
Advisor: 方文賢 (Wen-Hsien Fang)
Committee Members: 方文賢 (Wen-Hsien Fang), 陳郁堂 (Yie-Tarng Chen), 賴坤財 (Kuen-Tsair Lay), 呂政修 (Jenq-Shiou Leu), 丘建青 (Chien-Ching Chiu)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of Publication: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: English
Number of Pages: 40
Keywords: Anomaly detection, Accident anticipation, Efficient tuning
Access Count: 208 views, 3 downloads
Abstract:
    Traffic accident anticipation is a crucial research field that aims to predict potential accidents before they occur, thereby preventing severe disasters and reducing traffic incidents. In this thesis, we adopt the Adapting Image Models (AIM) framework, improve its internal adapter architecture, and combine several techniques to enhance predictive performance for traffic accident anticipation. First, we deepen the adapter structure and use fully connected (FC) and one-dimensional convolution (Conv1D) layers to extract global and local features, improving the model's understanding of spatial information. Next, we introduce attention mechanisms within the adapter in both the spatial and temporal dimensions. In the spatial dimension, cross attention learns the positional relationships between large and small objects so that accident-prone regions can be located accurately. In the temporal dimension, weighted temporal attention learns the correlations between adjacent frames, enabling the time of a possible accident to be anticipated in advance. By integrating these enhancements into the Vision Transformer (ViT), we conduct experiments on two datasets; the results show a significant performance improvement in traffic accident anticipation, with strong accuracy and anticipation capability and good handling of objects and scenes of diverse scales. These contributions are expected to serve as valuable references for traffic safety research and practice and to foster broader applications in the field.
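
    To make the adapter design described in the abstract more concrete, below is a minimal PyTorch sketch, offered only as an illustration and not as the thesis's actual implementation, of two of the ideas it names: a bottleneck adapter whose down-projection combines a fully connected branch for global features with a Conv1D branch for local features, and a weighted temporal attention over per-frame tokens. The dimensions, kernel size, number of frames and heads, activation, and placement relative to the ViT blocks are all assumed values, and the cross attention between large- and small-object tokens is omitted for brevity.

import torch
import torch.nn as nn


class SpatialAdapter(nn.Module):
    """Bottleneck adapter: an FC branch (global) plus a Conv1D branch (local)."""

    def __init__(self, dim: int = 768, bottleneck: int = 128):
        super().__init__()
        self.down_fc = nn.Linear(dim, bottleneck)              # global features
        self.down_conv = nn.Conv1d(dim, bottleneck,
                                   kernel_size=3, padding=1)   # local features
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        glob = self.down_fc(x)
        loc = self.down_conv(x.transpose(1, 2)).transpose(1, 2)
        return x + self.up(self.act(glob + loc))               # residual adapter


class WeightedTemporalAttention(nn.Module):
    """Self-attention across frames, reweighted by learnable per-frame weights."""

    def __init__(self, dim: int = 768, num_frames: int = 16, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.frame_weight = nn.Parameter(torch.ones(num_frames))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) -- one token (e.g. the CLS token) per frame
        out, _ = self.attn(x, x, x)
        w = self.frame_weight.softmax(dim=0).view(1, -1, 1)    # (1, frames, 1)
        return x + w * out                                     # weighted residual


if __name__ == "__main__":
    tokens = torch.randn(2, 197, 768)   # (batch, spatial tokens, embed dim)
    print(SpatialAdapter()(tokens).shape)               # torch.Size([2, 197, 768])
    frames = torch.randn(2, 16, 768)    # (batch, frames, embed dim)
    print(WeightedTemporalAttention()(frames).shape)    # torch.Size([2, 16, 768])

    In an AIM-style efficient-tuning setup, modules like these would presumably sit alongside each frozen ViT block and be the only trainable parameters, which is what keeps the adaptation lightweight compared with full fine-tuning.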

Table of Contents:
    Abstract (Chinese)
    Abstract
    Acknowledgment
    Table of Contents
    List of Figures
    List of Tables
    List of Acronyms
    1 Introduction
    2 Related Work
        2.1 Traffic Accident Anticipation
        2.2 Attention Mechanism
        2.3 Efficient Tuning
        2.4 Summary
    3 Proposed Method
        3.1 Proposed Architecture
            3.1.1 Essence of ViT
            3.1.2 Adapters
        3.2 Enhanced Spatial Adaptation
        3.3 Enhanced Temporal Adaptation
        3.4 Enhanced Joint Adaptation
        3.5 Summary
    4 Experimental Results and Discussions
        4.1 Accident Dataset
            4.1.1 Driver Anomaly Detection Dataset (DAD)
            4.1.2 Car Crash Dataset (CCD)
        4.2 Experimental Setup
            4.2.1 Implementation Details
            4.2.2 Evaluation Metrics
        4.3 Ablation Studies
            4.3.1 Analysis of Conv1d Up & Down Module
            4.3.2 Analysis of Cross Attention
            4.3.3 Analysis of Weighted Temporal Attention
        4.4 Visualization Results
            4.4.1 Successful Cases
            4.4.2 Failure Cases and Error Analysis
        4.5 Comparison with the State-of-the-Art Works
        4.6 Summary
    5 Conclusion and Future Works
        5.1 Conclusion
        5.2 Future Works
    References
