
Graduate Student: An-Rong Wu (吳侒融)
Thesis Title: 卷積神經網路與自我關注機制於車禍偵測之應用 (Traffic Accident Detection Using Convolutional Neural Network and Self-attention Mechanism)
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Committee Members: Sheng-Luen Chung (鍾聖倫), Kuen-Tsair Lay (賴坤財), Chien-Ching Chiu (丘建青), Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of Publication: 2020
Graduation Academic Year: 108 (ROC calendar)
Language: Chinese
Number of Pages: 69
Chinese Keywords: 車禍偵測、行車記錄器之車禍數據集、卷積神經網路、自我關注機制、自動駕駛
English Keywords: Accident detection, Dashcam accident dataset, CNN, Self-attention, Autonomous vehicles

Abstract (translated from Chinese): Traffic accident detection, which requires recognizing abnormal situations on the road, is receiving increasing attention because of its wide range of applications, such as Advanced Driver Assistance Systems (ADAS), video surveillance, and traffic analysis. This thesis proposes a novel traffic accident detection architecture for dashcam videos. The method first uses a temporal relation network to process a sequence of video frames and generate spatio-temporal features; a bidirectional self-attention mechanism is then employed to effectively learn long-term temporal dependencies across frames. In addition, to facilitate training of the architecture, we collected car accident videos, recorded mainly in Taiwan, from YouTube, the dashcam community site VEDR.tw, and Facebook, and provided temporal annotations to improve traffic accident detection performance. We call this new dataset the ITRI dataset; it covers more challenging conditions such as rain, varying illumination, day-night changes, and tunnel scenes. Finally, we evaluate our method on the commonly used DAD dataset and on the ITRI dataset. By invoking the bidirectional self-attention mechanism, the proposed network achieves higher performance than previous methods.


Abstract (English): Traffic accident detection, which needs to recognize abnormal movements on the road, is receiving growing attention due to a wide range of applications, such as Advanced Driver Assistance Systems (ADAS), video surveillance, and traffic analysis. This thesis proposes a novel architecture for traffic accident detection in dashcam videos. The new method first utilizes a temporal relation network to process a sequence of frames and generate spatio-temporal features. Afterward, a bidirectional self-attention mechanism is employed to effectively learn the long-term temporal dependencies across frames. Furthermore, to facilitate training, we also collected a large number of car accident videos recorded in Taiwan from YouTube, VEDR.tw, and Facebook, and provide temporal annotations to boost the performance of traffic accident detection. This dataset, referred to as the ITRI dataset, contains a variety of challenging conditions, such as rain, varying illumination, day-night changes, and tunnels. Finally, we evaluate our method on the commonly used DAD and ITRI datasets. By invoking the bidirectional self-attention mechanism, the network provides superior performance compared with previous works.
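
To make the pipeline described in the abstract concrete, below is a minimal PyTorch-style sketch of that kind of architecture: per-frame CNN features, a simple pairwise temporal-relation stage, and an unmasked (hence bidirectional) self-attention encoder that scores each position for an accident. This is illustrative only; the resnet18 backbone, the pairwise relation module, all layer sizes, and the use of nn.TransformerEncoder are assumptions for the sketch, not the thesis implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class AccidentDetectorSketch(nn.Module):
    """Illustrative accident-scoring pipeline; layer choices and sizes are assumptions."""

    def __init__(self, feat_dim=512, num_heads=8, num_layers=2):
        super().__init__()
        # Frame-level spatial features; in practice an ImageNet-pretrained backbone
        # would be used (weights omitted here so the sketch runs offline).
        backbone = models.resnet18()
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # -> (N, 512, 1, 1)
        self.proj = nn.Linear(512, feat_dim)
        # Stand-in for the temporal relation stage: fuse each pair of adjacent frames.
        self.relation = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        # Transformer encoder without a causal mask: every position attends to both
        # earlier and later frames, i.e. bidirectional self-attention.
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, 1)  # accident score per position

    def forward(self, frames):                           # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1)).flatten(1)    # (B*T, 512)
        x = self.proj(x).view(b, t, -1)                  # (B, T, D)
        pairs = torch.cat([x[:, :-1], x[:, 1:]], dim=-1) # adjacent-frame pairs
        x = self.relation(pairs)                         # (B, T-1, D)
        x = self.encoder(x)                              # bidirectional self-attention
        return torch.sigmoid(self.classifier(x)).squeeze(-1)  # (B, T-1) scores


if __name__ == "__main__":
    scores = AccidentDetectorSketch()(torch.randn(2, 8, 3, 224, 224))
    print(scores.shape)  # torch.Size([2, 7])
```

Leaving the encoder unmasked is what makes the attention bidirectional: each frame's representation is conditioned on both past and future frames in the window, which is the property the thesis exploits for detection (as opposed to causal, anticipation-style models).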

Table of Contents:
摘要 (Chinese Abstract)
Abstract
Acknowledgment
Table of contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
2 Related Work
2.1 Accident Detection
2.2 Object Detection
2.3 CNN Architecture
2.4 Attention Network
2.5 Summary
3 Proposed Method
3.1 Overall Methodology
3.2 Spatial Feature Generation
3.3 Transformer Encoder
3.3.1 Scaled Dot-Product Attention
3.3.2 Multi-Head Attention
3.4 Bidirectional Self-Attention
3.5 Multi-Scale Temporal Relations
3.6 Loss Function
3.7 Summary
4 Experimental Results and Discussion
4.1 Accident Dataset
4.1.1 Video Collection
4.1.2 Data Annotation
4.1.3 Dataset Distribution
4.2 Experimental Setup
4.3 Evaluation Metrics
4.4 Ablation Studies
4.4.1 Impact of Different Window Length
4.4.2 Impact of the Different Segment
4.4.3 Impact of Bidirectional Self-Attention
4.5 Comparison with Previous Works
4.6 Successful Cases and Error Analysis
4.6.1 ITRI dataset
4.6.2 DAD dataset
4.7 Summary
5 Conclusion and Future Works
5.1 Conclusion
5.2 Future Works
Appendix A: Example images from the DAD dataset
References


Full-Text Release Date: 2025/08/24 (campus network)
Full-Text Release Date: 2025/08/24 (off-campus network)
Full-Text Release Date: 2025/08/24 (National Central Library: Taiwan thesis and dissertation system)