
Graduate Student: An-Rong Wu (吳侒融)
Thesis Title: 卷積神經網路與自我關注機制於車禍偵測之應用 (Traffic Accident Detection Using Convolutional Neural Network and Self-attention Mechanism)
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Committee Members: Sheng-Luen Chung (鍾聖倫), Kuen-Tsair Lay (賴坤財), Chien-Ching Chiu (丘建青), Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of Publication: 2020
Graduation Academic Year: 108 (ROC calendar)
Language: Chinese
Number of Pages: 69
Chinese Keywords: 車禍偵測、行車記錄器之車禍數據集、卷積神經網路、自我關注機制、自動駕駛
English Keywords: Accident detection, Dashcam accident dataset, CNN, Self-attention, Autonomous vehicles

Abstract (translated from Chinese): Traffic accident detection, which requires recognizing abnormal situations on the road, is receiving increasing attention because of its wide range of applications, such as Advanced Driver Assistance Systems (ADAS), video surveillance, and traffic analysis. This thesis proposes a novel traffic accident detection architecture for dashcam videos. The method first uses a temporal relation network to process a sequence of video frames and generate spatio-temporal features; a bidirectional self-attention mechanism is then employed to effectively learn long-term temporal dependencies across frames. In addition, to facilitate training of the architecture, we collected car accident videos, recorded mainly in Taiwan, from YouTube, the dashcam community site VEDR.tw, and Facebook, and provided temporal annotations to improve traffic accident detection performance. We call this new dataset the ITRI dataset; it covers more challenging conditions such as rain, varying illumination, day-night changes, and tunnel scenes. Finally, we evaluate our method on the commonly used DAD dataset and on the ITRI dataset. By invoking the bidirectional self-attention mechanism, the proposed network achieves higher performance than previous methods.


Abstract (English): Traffic accident detection, which needs to recognize abnormal movements on the road, is receiving growing attention due to a wide range of applications, such as Advanced Driver Assistance Systems (ADAS), video surveillance, and traffic analysis. This thesis proposes a novel architecture for traffic accident detection in dashcam videos. The new method first utilizes a temporal relation network to process a sequence of frames and generate spatio-temporal features. Afterward, a bidirectional self-attention mechanism is employed to effectively learn the long-term temporal dependencies across frames. Furthermore, to facilitate training, we also collected a large number of car accident videos recorded in Taiwan from YouTube, VEDR.tw, and Facebook, and provide temporal annotations to boost the performance of traffic accident detection. This dataset, referred to as the ITRI dataset, contains a variety of challenging conditions, such as rain, varying illumination, day-night changes, and tunnels. Finally, we evaluate our method on the commonly used DAD and ITRI datasets. By invoking the bidirectional self-attention mechanism, the network provides superior performance compared with previous works.
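
To make the pipeline described in the abstract concrete, below is a minimal PyTorch-style sketch of that kind of architecture: per-frame CNN features, a simple pairwise temporal-relation stage, and an unmasked (hence bidirectional) self-attention encoder that scores each position for an accident. This is illustrative only; the resnet18 backbone, the pairwise relation module, all layer sizes, and the use of nn.TransformerEncoder are assumptions for the sketch, not the thesis implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class AccidentDetectorSketch(nn.Module):
    """Illustrative accident-scoring pipeline; layer choices and sizes are assumptions."""

    def __init__(self, feat_dim=512, num_heads=8, num_layers=2):
        super().__init__()
        # Frame-level spatial features; in practice an ImageNet-pretrained backbone
        # would be used (weights omitted here so the sketch runs offline).
        backbone = models.resnet18()
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # -> (N, 512, 1, 1)
        self.proj = nn.Linear(512, feat_dim)
        # Stand-in for the temporal relation stage: fuse each pair of adjacent frames.
        self.relation = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        # Transformer encoder without a causal mask: every position attends to both
        # earlier and later frames, i.e. bidirectional self-attention.
        enc_layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.classifier = nn.Linear(feat_dim, 1)  # accident score per position

    def forward(self, frames):                           # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        x = self.cnn(frames.flatten(0, 1)).flatten(1)    # (B*T, 512)
        x = self.proj(x).view(b, t, -1)                  # (B, T, D)
        pairs = torch.cat([x[:, :-1], x[:, 1:]], dim=-1) # adjacent-frame pairs
        x = self.relation(pairs)                         # (B, T-1, D)
        x = self.encoder(x)                              # bidirectional self-attention
        return torch.sigmoid(self.classifier(x)).squeeze(-1)  # (B, T-1) scores


if __name__ == "__main__":
    scores = AccidentDetectorSketch()(torch.randn(2, 8, 3, 224, 224))
    print(scores.shape)  # torch.Size([2, 7])
```

Leaving the encoder unmasked is what makes the attention bidirectional: each frame's representation is conditioned on both past and future frames in the window, which is the property the thesis exploits for detection (as opposed to causal, anticipation-style models).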

Table of Contents:
摘要 (Chinese Abstract)
Abstract
Acknowledgment
Table of contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
2 Related Work
2.1 Accident Detection
2.2 Object Detection
2.3 CNN Architecture
2.4 Attention Network
2.5 Summary
3 Proposed Method
3.1 Overall Methodology
3.2 Spatial Feature Generation
3.3 Transformer Encoder
3.3.1 Scaled Dot-Product Attention
3.3.2 Multi-Head Attention
3.4 Bidirectional Self-Attention
3.5 Multi-Scale Temporal Relations
3.6 Loss Function
3.7 Summary
4 Experimental Results and Discussion
4.1 Accident Dataset
4.1.1 Video Collection
4.1.2 Data Annotation
4.1.3 Dataset Distribution
4.2 Experimental Setup
4.3 Evaluation Metrics
4.4 Ablation Studies
4.4.1 Impact of Different Window Length
4.4.2 Impact of the Different Segment
4.4.3 Impact of Bidirectional Self-Attention
4.5 Comparison with Previous Works
4.6 Successful Cases and Error Analysis
4.6.1 ITRI dataset
4.6.2 DAD dataset
4.7 Summary
5 Conclusion and Future Works
5.1 Conclusion
5.2 Future Works
Appendix A: Example images from the DAD dataset
References


Full-Text Release Date: 2025/08/24 (campus network)
Full-Text Release Date: 2025/08/24 (off-campus network)
Full-Text Release Date: 2025/08/24 (National Central Library: Taiwan thesis and dissertation system)