
Graduate Student: 宋宣佑 (Shiuan-You Sung)
Thesis Title: 應用Transformer Encoder於車禍偵測之研究
(Anticipating Traffic Accidents Using Transformer Encoder Representations)
Advisor: 方文賢 (Wen-Hsien Fang)
Committee Members: 丘建青 (Chien-ching Chiu), 賴坤財 (Kuen-Tsair Lay), 陳郁堂 (Yie-Tarng Chen), 鍾聖倫 (Sheng-Luen Chung)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Year of Publication: 2019
Academic Year of Graduation: 107 (ROC calendar)
Language: English
Number of Pages: 63
Keywords: accident, dashcam accident dataset, dynamic spatial attention, temporal dependency, transformer encoder
Abstract: This thesis presents an effective attention-based framework for detecting accidents in dashcam videos. In contrast to existing work, the proposed method replaces the Long Short-Term Memory (LSTM) network with a multi-head self-attention mechanism, which can capture various temporal dependencies among frames and thereby enhances the learning capability. First, a dynamic spatial attention (DSA) module, which dynamically assigns a soft-attention weight to every object, aggregates the information from the full-frame and object features generated by Faster R-CNN. Next, a transformer encoder learns the various temporal dependencies of the attended objects. The full-frame features are then combined with the aggregated object features to obtain the final feature representation, which is passed to a fully connected layer to perform accident anticipation. Moreover, a new training strategy is devised to further improve the learning capability of the attention-based network. Simulations show that the proposed method outperforms the main state-of-the-art methods on the publicly available Dashcam Accident Dataset (DAD).


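As a rough illustration of the pipeline summarized in the abstract, the following minimal PyTorch sketch wires the described components together: Faster R-CNN features, a dynamic spatial attention over candidate objects, a transformer encoder over time, fusion with full-frame features, and a fully connected classification head. All names, layer sizes, and feature dimensions here are assumptions chosen for readability, not the thesis implementation; the positional encoding of Section 3.4.1 and the new training strategy are omitted.

```python
import torch
import torch.nn as nn


class AccidentAnticipator(nn.Module):
    """Minimal sketch of the pipeline described in the abstract (all sizes are assumptions)."""

    def __init__(self, feat_dim=4096, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # Project Faster R-CNN full-frame and per-object features to a common width.
        self.frame_proj = nn.Linear(feat_dim, d_model)
        self.obj_proj = nn.Linear(feat_dim, d_model)
        # Simplified stand-in for the dynamic spatial attention (Section 3.3):
        # score each object conditioned on the full-frame context.
        self.dsa_score = nn.Linear(2 * d_model, 1)
        # Transformer encoder over the temporal axis (one token per frame).
        # NOTE: the positional encoding of Section 3.4.1 is omitted in this sketch.
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.temporal_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Fully connected head producing a per-frame accident probability.
        self.classifier = nn.Linear(2 * d_model, 1)

    def forward(self, frame_feats, obj_feats):
        # frame_feats: (B, T, feat_dim); obj_feats: (B, T, N, feat_dim)
        f = self.frame_proj(frame_feats)                 # (B, T, d)
        o = self.obj_proj(obj_feats)                     # (B, T, N, d)

        # Dynamic spatial attention: soft weights over the N candidate objects.
        f_ctx = f.unsqueeze(2).expand_as(o)              # broadcast frame context to each object
        scores = self.dsa_score(torch.cat([f_ctx, o], dim=-1)).squeeze(-1)  # (B, T, N)
        alpha = scores.softmax(dim=-1)
        agg_obj = (alpha.unsqueeze(-1) * o).sum(dim=2)   # (B, T, d) aggregated object feature

        # Transformer encoder captures temporal dependencies of the attended objects.
        temporal = self.temporal_encoder(agg_obj)        # (B, T, d)

        # Fuse with full-frame features and classify every frame.
        fused = torch.cat([f, temporal], dim=-1)         # (B, T, 2d)
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # (B, T) accident probability


# Example with random features: 2 videos, 100 frames each, 19 candidate objects per frame.
model = AccidentAnticipator()
probs = model(torch.randn(2, 100, 4096), torch.randn(2, 100, 19, 4096))
print(probs.shape)  # torch.Size([2, 100])
```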

Table of Contents
    Abstract (p. i)
    Acknowledgment (p. ii)
    Table of Contents (p. iii)
    List of Figures (p. v)
    List of Tables (p. x)
    List of Acronyms (p. xi)
    1 Introduction (p. 1)
    2 Related Work (p. 3)
        2.1 Anticipating Accidents in Video (p. 3)
        2.2 Attention Mechanism (p. 3)
        2.3 Learning Long-Term Temporal Dependency (p. 4)
        2.4 Dashcam Datasets (p. 5)
    3 Proposed Method (p. 6)
        3.1 Overall Methodology (p. 6)
        3.2 Feature Generation (p. 8)
        3.3 Dynamic Spatial Attention (p. 9)
        3.4 Transformer Encoder (p. 10)
            3.4.1 Positional Encoding (p. 12)
            3.4.2 Multi-Head Attention (p. 13)
        3.5 Loss Function (p. 15)
    4 Experimental Results (p. 17)
        4.1 DAD Dataset (p. 17)
        4.2 Evaluation Protocol and Experimental Setup (p. 18)
        4.3 Ablation Studies (p. 19)
            4.3.1 Impact of Accident Length in Training (p. 19)
            4.3.2 Impact of Dynamic Spatial Attention (p. 21)
            4.3.3 Impact of Positional Encoding (p. 21)
            4.3.4 Impact of Transformer Encoder (p. 24)
        4.4 Comparison with the State-of-the-Art Method (p. 29)
        4.5 Error Analysis (p. 34)
    5 Conclusion and Future Works (p. 43)
        5.1 Conclusion (p. 43)
        5.2 Future Works (p. 43)
    Appendix A: Example Images from the Dataset (p. 44)
    References (p. 45)

List of Figures
    3.1 Overview of the proposed method. (p. 7)
    3.2 The dynamic spatial attention architecture. (p. 9)
    3.3 The transformer encoder architecture. (p. 11)
    3.4 The multi-head attention architecture. (p. 13)
    3.5 The loss function strategy. (p. 15)
    4.1 Examples of the effect of adding dynamic spatial attention to the framework: (a) attention weights without DSA, where blue bounding boxes mark the candidate objects; (b) attention weights with DSA, where the bounding boxes and the corresponding attention weights are shown in blue and red, respectively, and a bounding box whose attention weight exceeds 0.4 is drawn in green. (p. 22)
    4.2 Illustration of a failure case after adding DSA: (a) the attention weight corresponds to the accident; (b) the attention weight does not correspond to the accident, which happens far away. (p. 23)
    4.3 Results of the proposed module. (p. 25)
    4.4 Results of the scheme with DSA and LSTM. (p. 26)
    4.5 Results of the scheme with LSTM only. (p. 27)
    4.6 Results of the transformer encoder without positional encoding. (p. 28)
    4.7 A third-person accident video involving two motorbikes. (p. 31)
    4.8 A third-person accident video involving a motorbike and a truck. (p. 32)
    4.9 A first-person accident video involving a motorbike and a car. (p. 33)
    4.10 The accident object's movement looks like an ordinary right or left turn. (p. 36)
    4.11 The accident object only slightly brushes against another object and passes. (p. 37)
    4.12 The accident object is occluded by the object in front of it. (p. 38)
    4.13 The accident object is too small for the accident to be detected correctly. (p. 39)
    4.14 An object that is losing control is hard to detect as an accident. (p. 40)
    4.15 The scene is misclassified as an accident when an object is close to the camera. (p. 41)
    4.16 Objects on a crowded street are easily misclassified as an accident. (p. 42)
    5.1 Snapshots of the DAD dataset. (p. 44)
    Note: Figures 4.3–4.16 share a common legend: candidate objects are shown in blue; yellow, red, and dark boxes indicate high, medium, and low attention weights, respectively; a bounding box whose attention weight exceeds 0.4 is drawn in green; and an accident probability above 0.5 indicates that an accident has occurred (this convention is also summarized in the short sketch after these lists).

List of Tables
    4.1 Performance comparison of accident detection on the DAD dataset with different keyframes. (p. 20)
    4.2 Performance comparison of accident detection on the DAD dataset with various mechanisms. (p. 29)
    4.3 Performance comparison of accident detection on the DAD dataset. (p. 30)
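For concreteness, the fixed thresholds quoted in the figure captions above can be expressed as a tiny helper. This is only an illustration of the stated convention; the constant and function names are assumptions, not code from the thesis.

```python
# Thresholds quoted in the Chapter 4 figure captions.
ATTENTION_HIGHLIGHT = 0.4  # bounding boxes with a higher attention weight are drawn in green
ACCIDENT_THRESHOLD = 0.5   # frames with a higher accident probability are declared accidents


def draw_in_green(attention_weight: float) -> bool:
    """True if an object's bounding box should be highlighted in green."""
    return attention_weight > ATTENTION_HIGHLIGHT


def accident_flagged(accident_probability: float) -> bool:
    """True if the frame is declared to contain an accident."""
    return accident_probability > ACCIDENT_THRESHOLD
```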


Full-Text Availability:
    Campus network: full text available from 2024/08/22.
    Off-campus network: full text not authorized for public release.
    National Central Library (Taiwan NDLTD system): full text not authorized for public release.