| Field | Value |
|---|---|
| Graduate Student | 宋宣佑 Shiuan-You Sung |
| Thesis Title | 應用Transformer Encoder於車禍偵測之研究 (Anticipating Traffic Accidents Using Transformer Encoder Representations) |
| Advisor | 方文賢 Wen-Hsien Fang |
| Committee Members | 丘建青 Chien-Ching Chiu, 賴坤財 Kuen-Tsair Lay, 陳郁堂 Yie-Tarng Chen, 鍾聖倫 Sheng-Luen Chung |
| Degree | Master (碩士) |
| Department | College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering |
| Year of Publication | 2019 |
| Academic Year of Graduation | 107 |
| Language | English |
| Number of Pages | 63 |
| Keywords | accident, dashcam accident dataset, dynamic spatial attention, temporal dependency, transformer encoder |
This thesis presents an effective attention-based framework to anticipate accidents in dashcam videos. In contrast to existing work, the proposed method replaces Long Short-Term Memory (LSTM) with a multi-head self-attention mechanism, which can capture various temporal dependencies among frames, to enhance the learning capability. First, a dynamic spatial attention (DSA) module, which dynamically provides soft attention over every object, is invoked to aggregate information from the full-frame and object features generated by Faster R-CNN. Next, a transformer encoder is employed to effectively learn the temporal dependencies of specific objects. Thereafter, the full-frame features are combined with the aggregated object features to obtain the final feature representation. Finally, the final features are passed to a fully-connected layer to perform accident anticipation. Moreover, a new training strategy is devised to further improve the learning capability of the attention-based network. Simulations show that the proposed method outperforms the main state-of-the-art methods on the publicly available dashcam accident dataset (DAD).
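To make the described pipeline concrete, the following is a minimal PyTorch sketch of the steps outlined in the abstract: soft spatial attention over per-frame object features, a transformer encoder over time, fusion with the full-frame features, and a fully-connected classifier. All module names, dimensions, and the fusion by concatenation are illustrative assumptions, not the thesis implementation.

```python
# A minimal sketch of the anticipation pipeline described in the abstract.
# Dimensions, module names, and the concatenation-based fusion are assumptions.
import torch
import torch.nn as nn


class DynamicSpatialAttention(nn.Module):
    """Soft attention over per-frame object features, conditioned on the full-frame feature."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, frame_feat, obj_feats):
        # frame_feat: (B, T, D); obj_feats: (B, T, N, D) from an object detector such as Faster R-CNN
        B, T, N, D = obj_feats.shape
        frame_exp = frame_feat.unsqueeze(2).expand(-1, -1, N, -1)     # (B, T, N, D)
        attn = self.score(torch.cat([frame_exp, obj_feats], dim=-1))  # (B, T, N, 1)
        attn = torch.softmax(attn, dim=2)                             # soft attention over objects
        return (attn * obj_feats).sum(dim=2)                          # (B, T, D) aggregated object feature


class AccidentAnticipator(nn.Module):
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        self.dsa = DynamicSpatialAttention(dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)  # multi-head self-attention over time
        self.classifier = nn.Linear(2 * dim, 1)  # fully-connected layer for the per-frame accident score

    def forward(self, frame_feat, obj_feats):
        obj_agg = self.dsa(frame_feat, obj_feats)                  # spatially aggregated object features
        obj_temporal = self.encoder(obj_agg)                       # temporal dependencies via transformer encoder
        fused = torch.cat([frame_feat, obj_temporal], dim=-1)      # combine full-frame and object representations
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)   # (B, T) accident probability per frame


# Example usage with random stand-in features: 2 videos, 100 frames, 19 detected objects per frame.
model = AccidentAnticipator()
probs = model(torch.randn(2, 100, 512), torch.randn(2, 100, 19, 512))
print(probs.shape)  # torch.Size([2, 100])
```

In practice the object features would come from a Faster R-CNN detector and the full-frame features from a CNN backbone; the random tensors above merely stand in for them to show the expected shapes.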