
Graduate Student: 周渥餘 (WoYu Chou)
Thesis Title: 基於時軸特徵位移與注意力機制強化之異常偵測模型 (Self-Attention Augmented Temporal Shift Module for Anomaly Detection)
Advisor: 方文賢 (Wen-Hsien Fang)
Committee Members: 方文賢 (Wen-Hsien Fang), 陳郁堂 (Yie-Tarng Chen), 邱建青 (Chien-Ching Chiu), 賴坤財 (Kuen-Tsair Lay), 阮聖彰 (Shanq-Jang Ruan)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2023
Graduation Academic Year: 111
Language: English
Pages: 52
Chinese Keywords: 異常偵測 (anomaly detection), 弱監督式學習 (weakly supervised learning), 空間資訊強化 (spatial information augmentation), 時間關係理解模型 (temporal relation understanding model)
English Keywords: anomaly detection, weakly supervised learning, spatial augmentation, temporal modeling

Using artificial intelligence (AI) to detect abnormal events in surveillance systems has become a popular and widely studied topic. To this day, however, research on weakly supervised anomaly detection models is still searching for better ways to strengthen the understanding of spatial and temporal relations, and doing so remains challenging. Because of the nature of weak supervision, many different forms of abnormal behavior are all given the single label "anomaly," so the model must understand not only the actions in a video but also how the surrounding context changes over time in order to detect the different kinds of anomalies. This thesis proposes an improved model built on the temporal shift module (TSM) that uses several attention mechanisms to strengthen spatial and temporal features and their correlations; it achieves the best performance to date on UCF-Crime and near state-of-the-art performance on ShanghaiTech and XD-Violence. Our work strengthens spatial-temporal relation understanding from a new angle, opens new possibilities for pseudo-3D networks built from 2D convolutions and temporal-relation operations, and contributes to a more comprehensive understanding of abnormal event detection in the real world.


In the field of surveillance systems, detecting abnormal events using artificial intelligence (AI) has been a subject of extensive research. However, achieving a robust understanding of spatial-temporal relations in weakly supervised anomaly detection presents unique challenges. The model must comprehend not only short-term movements but also long-term correlations in videos, as actions with very different characteristics may all share the same "anomaly" label. In this paper, we propose a model that enhances the temporal shift module (TSM) with multiple attention-based augmentations. Our model achieves state-of-the-art performance on the UCF-Crime dataset and near state-of-the-art performance on the ShanghaiTech and XD-Violence datasets. By improving the TSM and incorporating attention mechanisms, our model effectively captures complex temporal relationships, enhancing its ability to detect anomalies in surveillance videos. This novel approach provides a new perspective on pseudo-3D CNNs for spatial-temporal relation understanding, paving the way for significant advancements in weakly supervised anomaly detection. The proposed model opens new possibilities for future research and applications in surveillance systems, contributing to a more comprehensive understanding of abnormal events in real-world scenarios.
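
As background for the abstract above, the sketch below illustrates the temporal shift operation that TSM is built on, followed by a squeeze-and-excitation-style channel gate of the general kind a channel-attention-augmented TSM would combine with it. This is a minimal PyTorch sketch under assumed conventions: the (batch, time, channels, height, width) layout, the 1/8 shift fraction, and all names (temporal_shift, ChannelGate) are illustrative, not the thesis's actual implementation.

import torch
import torch.nn as nn

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels one step along the time axis.

    x has shape (batch, time, channels, height, width). The first
    1/shift_div of the channels is shifted toward the past, the next
    1/shift_div toward the future, and the rest stay in place, letting
    the 2D convolutions that follow mix neighboring frames for free.
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift toward the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift toward the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # unshifted channels
    return out

class ChannelGate(nn.Module):
    """Squeeze-and-excitation-style channel attention (illustrative)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * time, channels, height, width)
        w = self.fc(x.mean(dim=(2, 3)))      # global average pool -> per-channel gate
        return x * w[:, :, None, None]       # reweight channels

# toy usage: 2 clips, 16 frames, 64 channels, 7x7 feature maps
feats = torch.randn(2, 16, 64, 7, 7)
shifted = temporal_shift(feats)
gated = ChannelGate(64)(shifted.flatten(0, 1)).view_as(feats)
print(gated.shape)  # torch.Size([2, 16, 64, 7, 7])

In practice the shift is typically inserted inside each residual block, so the following 2D convolution mixes the shifted channels and temporal modeling comes at zero extra parameters or FLOPs; this is what makes TSM a "pseudo-3D" alternative to full 3D convolution.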

Abstract (Chinese)
Abstract
Acknowledgment
Table of Contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
2 Related Work
  2.1 Temporal Modeling
  2.2 Anomaly Detection
  2.3 Attention Mechanism
  2.4 Summary
3 Proposed Method
  3.1 Self-Attention Augmented Multi-resolution Images
  3.2 Channel Attention TSM
    3.2.1 Elements of TSM
    3.2.2 Channel Attention TSM
  3.3 Consensus Fusion
4 Experimental Results
  4.1 Experimental Results
    4.1.1 Anomaly Datasets
    4.1.2 Experimental Setup
  4.2 Ablation Studies
    4.2.1 Evaluating the Modules through Ablation Studies
    4.2.2 Analysis of Self-Attention Augmented Multi-resolution Images
    4.2.3 Analysis of TSM with Channel Attention
    4.2.4 Analysis of Consensus Fusion
  4.3 Visualization Results
    4.3.1 Error Analysis
    4.3.2 Visualization of Failure Cases
    4.3.3 Visualization of Success Cases
  4.4 Compared to the State of the Art
  4.5 Summary
5 Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
References

