
Graduate Student: 宋宣佑 (Shiuan-You Sung)
Thesis Title: 應用Transformer Encoder於車禍偵測之研究
(Anticipating Traffic Accidents Using Transformer Encoder Representations)
Advisor: 方文賢 (Wen-Hsien Fang)
Committee Members: 丘建青 (Chien-ching Chiu), 賴坤財 (Kuen-Tsair Lay), 陳郁堂 (Yie-Tarng Chen), 鍾聖倫 (Sheng-Luen Chung)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of Publication: 2019
Academic Year of Graduation: 107 (ROC calendar)
Language: English
Number of Pages: 63
Keywords: accident, dashcam accident dataset, dynamic spatial attention, temporal dependency, transformer encoder
Abstract: This thesis presents an effective attention-based framework for detecting accidents in dashcam videos. In contrast to existing work, the proposed method replaces the Long Short-Term Memory (LSTM) network with a multi-head self-attention mechanism, which can capture various temporal dependencies among frames and thereby enhances the learning capability. First, a dynamic spatial attention (DSA) module, which dynamically assigns a soft-attention weight to every object, aggregates the information from the full-frame and object features generated by Faster R-CNN. Next, a transformer encoder learns the various temporal dependencies of the attended objects. The full-frame features are then combined with the aggregated object features to obtain the final feature representation, which is passed to a fully connected layer to perform accident anticipation. Moreover, a new training strategy is devised to further improve the learning capability of the attention-based network. Simulations show that the proposed method outperforms the main state-of-the-art methods on the publicly available Dashcam Accident Dataset (DAD).


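As a rough illustration of the pipeline summarized in the abstract, the following minimal PyTorch sketch wires the described components together: Faster R-CNN features, a dynamic spatial attention over candidate objects, a transformer encoder over time, fusion with full-frame features, and a fully connected classification head. All names, layer sizes, and feature dimensions here are assumptions chosen for readability, not the thesis implementation; the positional encoding of Section 3.4.1 and the new training strategy are omitted.

```python
import torch
import torch.nn as nn


class AccidentAnticipator(nn.Module):
    """Minimal sketch of the pipeline described in the abstract (all sizes are assumptions)."""

    def __init__(self, feat_dim=4096, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project Faster R-CNN full-frame and per-object features to a common width.
        self.frame_proj = nn.Linear(feat_dim, d_model)
        self.obj_proj = nn.Linear(feat_dim, d_model)
        # Simplified stand-in for the dynamic spatial attention (Section 3.3):
        # score each object conditioned on the full-frame context.
        self.dsa_score = nn.Linear(2 * d_model, 1)
        # Transformer encoder over the temporal axis (one token per frame).
        # NOTE: the positional encoding of Section 3.4.1 is omitted in this sketch.
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Fully connected head producing a per-frame accident probability.
        self.classifier = nn.Linear(2 * d_model, 1)

    def forward(self, frame_feats, obj_feats):
        # frame_feats: (B, T, feat_dim); obj_feats: (B, T, N, feat_dim)
        f = self.frame_proj(frame_feats)                 # (B, T, d)
        o = self.obj_proj(obj_feats)                     # (B, T, N, d)

        # Dynamic spatial attention: soft weights over the N candidate objects.
        f_ctx = f.unsqueeze(2).expand_as(o)              # broadcast frame context to each object
        scores = self.dsa_score(torch.cat([f_ctx, o], dim=-1)).squeeze(-1)  # (B, T, N)
        alpha = scores.softmax(dim=-1)
        agg_obj = (alpha.unsqueeze(-1) * o).sum(dim=2)   # (B, T, d) aggregated object feature

        # Transformer encoder captures temporal dependencies of the attended objects.
        temporal = self.temporal_encoder(agg_obj)        # (B, T, d)

        # Fuse with full-frame features and classify every frame.
        fused = torch.cat([f, temporal], dim=-1)         # (B, T, 2d)
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # (B, T) accident probability


# Example with random features: 2 videos, 100 frames each, 19 candidate objects per frame.
model = AccidentAnticipator()
probs = model(torch.randn(2, 100, 4096), torch.randn(2, 100, 19, 4096))
print(probs.shape)  # torch.Size([2, 100])
```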

Table of Contents
    Abstract (p. i)
    Acknowledgment (p. ii)
    Table of Contents (p. iii)
    List of Figures (p. v)
    List of Tables (p. x)
    List of Acronyms (p. xi)
    1 Introduction (p. 1)
    2 Related Work (p. 3)
        2.1 Anticipating Accidents in Video (p. 3)
        2.2 Attention Mechanism (p. 3)
        2.3 Learning Long-Term Temporal Dependency (p. 4)
        2.4 Dashcam Datasets (p. 5)
    3 Proposed Method (p. 6)
        3.1 Overall Methodology (p. 6)
        3.2 Feature Generation (p. 8)
        3.3 Dynamic Spatial Attention (p. 9)
        3.4 Transformer Encoder (p. 10)
            3.4.1 Positional Encoding (p. 12)
            3.4.2 Multi-Head Attention (p. 13)
        3.5 Loss Function (p. 15)
    4 Experimental Results (p. 17)
        4.1 DAD Dataset (p. 17)
        4.2 Evaluation Protocol and Experimental Setup (p. 18)
        4.3 Ablation Studies (p. 19)
            4.3.1 Impact of Accident Length in Training (p. 19)
            4.3.2 Impact of Dynamic Spatial Attention (p. 21)
            4.3.3 Impact of Positional Encoding (p. 21)
            4.3.4 Impact of Transformer Encoder (p. 24)
        4.4 Comparison with the State-of-the-Art Method (p. 29)
        4.5 Error Analysis (p. 34)
    5 Conclusion and Future Works (p. 43)
        5.1 Conclusion (p. 43)
        5.2 Future Works (p. 43)
    Appendix A: Example Images from the Dataset (p. 44)
    References (p. 45)

List of Figures
    3.1 Overview of the proposed method. (p. 7)
    3.2 The dynamic spatial attention architecture. (p. 9)
    3.3 The transformer encoder architecture. (p. 11)
    3.4 The multi-head attention architecture. (p. 13)
    3.5 The loss function strategy. (p. 15)
    4.1 Examples of the effect of adding dynamic spatial attention to the framework: (a) attention weights without DSA, where blue bounding boxes mark the candidate objects; (b) attention weights with DSA, where the bounding boxes and the corresponding attention weights are shown in blue and red, respectively, and a bounding box whose attention weight exceeds 0.4 is drawn in green. (p. 22)
    4.2 Illustration of a failure case after adding DSA: (a) the attention weight corresponds to the accident; (b) the attention weight does not correspond to the accident, which happens far away. (p. 23)
    4.3 Results of the proposed module. (p. 25)
    4.4 Results of the scheme with DSA and LSTM. (p. 26)
    4.5 Results of the scheme with LSTM only. (p. 27)
    4.6 Results of the transformer encoder without positional encoding. (p. 28)
    4.7 A third-person accident video involving two motorbikes. (p. 31)
    4.8 A third-person accident video involving a motorbike and a truck. (p. 32)
    4.9 A first-person accident video involving a motorbike and a car. (p. 33)
    4.10 The accident object's movement looks like an ordinary right or left turn. (p. 36)
    4.11 The accident object only slightly brushes against another object and passes. (p. 37)
    4.12 The accident object is occluded by the object in front of it. (p. 38)
    4.13 The accident object is too small for the accident to be detected correctly. (p. 39)
    4.14 An object that is losing control is hard to detect as an accident. (p. 40)
    4.15 The scene is misclassified as an accident when an object is close to the camera. (p. 41)
    4.16 Objects on a crowded street are easily misclassified as an accident. (p. 42)
    5.1 Snapshots of the DAD dataset. (p. 44)
    Note: Figures 4.3–4.16 share a common legend: candidate objects are shown in blue; yellow, red, and dark boxes indicate high, medium, and low attention weights, respectively; a bounding box whose attention weight exceeds 0.4 is drawn in green; and an accident probability above 0.5 indicates that an accident has occurred (this convention is also summarized in the short sketch after these lists).

List of Tables
    4.1 Performance comparison of accident detection on the DAD dataset with different keyframes. (p. 20)
    4.2 Performance comparison of accident detection on the DAD dataset with various mechanisms. (p. 29)
    4.3 Performance comparison of accident detection on the DAD dataset. (p. 30)
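For concreteness, the fixed thresholds quoted in the figure captions above can be expressed as a tiny helper. This is only an illustration of the stated convention; the constant and function names are assumptions, not code from the thesis.

```python
# Thresholds quoted in the Chapter 4 figure captions.
ATTENTION_HIGHLIGHT = 0.4  # bounding boxes with a higher attention weight are drawn in green
ACCIDENT_THRESHOLD = 0.5   # frames with a higher accident probability are declared accidents


def draw_in_green(attention_weight: float) -> bool:
    """True if an object's bounding box should be highlighted in green."""
    return attention_weight > ATTENTION_HIGHLIGHT


def accident_flagged(accident_probability: float) -> bool:
    """True if the frame is declared to contain an accident."""
    return accident_probability > ACCIDENT_THRESHOLD
```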


Full-Text Availability:
    Campus network: full text available from 2024/08/22.
    Off-campus network: full text not authorized for public release.
    National Central Library (Taiwan NDLTD system): full text not authorized for public release.