
Author: Rizard Renanda Adhi Pramono
Thesis Title: Relational Reasoning of Visual Contexts in Videos with Self-Attention (具有自註意力的視頻中視覺上下文的關係推理)
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Committee Members: Kuen-Tsair Lay (賴坤財), Yie-Tarng Chen (陳郁堂), Jenq-Shiou Leu (呂政修), Mark Liao (廖弘源), Chiou-Shann Fuh (傅楸善), Chien-Ching Chiu (丘建青), Jason Young (楊健生)
Degree: Doctor
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Academic Year of Graduation: 109
Language: English
Number of Pages: 142
Keywords: action localization, conditional random field, group activity recognition, person re-identification, self-attention

In recent years, understanding the visual contexts of videos has become one of the major topics in the computer vision community, owing to its numerous practical applications and the availability of large amounts of computational resources. Unlike still images, video data carry richer contexts, including spatial and temporal dependencies, and pose additional challenges such as camera movement, illumination changes, viewpoint variations, poor video resolution, and dynamic object interactions. Self-attention has proven effective in modelling the structure of data sequences in various natural language processing tasks. This dissertation therefore focuses on the use of self-attention to reason about the relational contexts within video data on three popular tasks in visual context understanding: action localization, group activity recognition, and video-based person re-identification (re-ID).
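Since scaled dot-product self-attention is the basic operation reused throughout the dissertation, the following minimal Python sketch illustrates it; the tensor sizes and projection matrices are illustrative assumptions, not values from the dissertation.

```python
# Minimal sketch of scaled dot-product self-attention over a set of feature vectors.
# All shapes and projection sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (N, d_in) sequence of N feature vectors; w_*: (d_in, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project into query/key/value spaces
    scores = q @ k.t() / (k.shape[-1] ** 0.5)    # pairwise affinities, scaled by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)          # normalize over the key dimension
    return weights @ v                           # each output is a relation-weighted mixture

# Toy usage: 8 actor/frame features of dimension 32, projected to dimension 16.
x = torch.randn(8, 32)
w = [torch.randn(32, 16) for _ in range(3)]
out = self_attention(x, *w)                      # (8, 16) relational context features
```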
First, a novel architecture is developed for spatial-temporal action localization in videos. The architecture first utilizes a two-stream 3D convolutional neural network (3D-CNN) to provide initial action detections. Next, a new hierarchical self-attention network (HiSAN), the core of the architecture, is developed to learn the spatial-temporal relationships of key actors. The combination of 3D-CNN and HiSAN effectively extracts both the spatial context information and the long-term temporal dependencies to improve action localization accuracy. Afterwards, a new fusion strategy is employed, which first re-scores the bounding boxes to resolve the inconsistent detection scores caused by background clutter or occlusion, and then aggregates the motion and appearance information from the two-stream network with the motion saliency to alleviate the impact of camera movement. Finally, a tube association network based on the self-similarity of the actors' appearance and spatial information across frames is introduced to efficiently construct the action tubes.
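To make the hierarchical idea concrete, the hedged sketch below stacks a spatial self-attention over the actors detected in each frame with a temporal self-attention across frames for each actor; the input layout, feature dimension, and head count are assumptions and do not reproduce the exact HiSAN design.

```python
# A hedged sketch of hierarchical self-attention over actor features:
# spatial attention within each frame, then temporal attention across frames.
import torch
import torch.nn as nn

class HierarchicalSelfAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                    # x: (T frames, A actors, D features) -- assumed layout
        s, _ = self.spatial(x, x, x)         # relate the actors within each frame
        s = s.transpose(0, 1)                # (A, T, D): one temporal sequence per actor
        t, _ = self.temporal(s, s, s)        # relate each actor's features across frames
        return t.transpose(0, 1)             # back to (T, A, D)

feats = torch.randn(16, 5, 256)              # 16 frames, 5 detected actors
ctx = HierarchicalSelfAttention()(feats)     # spatial-temporal relational contexts
```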
Next, an effective relational network for group activity recognition is introduced. The crux of the network is the integration of conditional random fields (CRFs) with self-attention to infer the temporal dependencies and spatial relationships of the actors. This combination exploits the capability of CRFs in modelling inter-dependent actor features and the capability of self-attention in learning the temporal evolution and spatial relational contexts of every actor in the video. The CRF and self-attention have two distinct facets. First, the pairwise energy of the new CRF relies on both temporal and spatial self-attention, which apply the self-attention mechanism to the features in time and space, respectively. Second, to address both local and non-local relationships in group activities, the spatial self-attention takes into account a collection of cliques with different scales of spatial locality. The associated mean-field inference can then be reformulated as a self-attention network to generate the relational contexts of the actors and their individual action labels. Lastly, a bidirectional universal transformer encoder (UTE) is utilized to aggregate the forward and backward temporal context information, scene information, and relational contexts for group activity recognition.
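The general flavour of unrolling mean-field CRF inference with an attention-derived pairwise term can be illustrated with the hedged sketch below; the compatibility transform, the number of iterations, and all shapes are assumptions rather than the dissertation's exact formulation.

```python
# A minimal sketch of mean-field updates in which self-attention weights over
# actor features supply the pairwise term. Shapes and iteration count are assumed.
import torch
import torch.nn.functional as F

def attention_mean_field(unary, feats, compat, iters=3):
    """unary: (N, C) unary logits per actor; feats: (N, D); compat: (C, C) label compatibility."""
    q = F.softmax(unary, dim=-1)                                          # initialize marginals from unaries
    attn = F.softmax(feats @ feats.t() / feats.shape[-1] ** 0.5, dim=-1)  # attention-based pairwise weights
    for _ in range(iters):
        message = attn @ q                   # aggregate the neighbours' current beliefs
        pairwise = message @ compat          # apply the label compatibility transform
        q = F.softmax(unary - pairwise, dim=-1)  # re-normalize the marginals
    return q

# Toy usage: 12 actors, 9 individual-action classes, 64-d features.
q = attention_mean_field(torch.randn(12, 9), torch.randn(12, 64), torch.eye(9) * 0.5)
```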
Lastly, a novel attentive graph network is established to tackle difficult scenarios in video re-ID such as occlusion, view misalignment, and pose variations. The network begins with a temporal-aware feature extractor, which makes use of the short-term temporal correlation of the fine-grained feature maps to generate multi-scale part-based CNN features. Subsequently, at the core of the network, a combination of self-attention and CRFs that stems from our relational network for group activity recognition is employed to leverage the strength of CRFs in learning the inter-dependency of the multi-scale part features and the strength of self-attention in directly attending to responses at distant positions. To deal with both local and non-local relationships of body parts in a sequence of frames, a new type of self-attention, applied to a set of cliques with different extents of temporal locality, is considered. In addition, the CRF inference is cast as a node clustering problem on the graph representation to aggregate the multi-scale part features related to the person of interest and de-emphasize the influence of unrelated background information for a more accurate image sequence representation.
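As an illustration of clique-restricted self-attention over a frame sequence, the hedged sketch below masks the attention map to cliques of different temporal extents and merges the results; the window sizes and the averaging fusion are illustrative choices only, not the dissertation's design.

```python
# A hedged sketch of self-attention restricted to temporal cliques of varying extent.
import torch
import torch.nn.functional as F

def clique_attention(x, window):
    """x: (T, D) per-frame part features; window: max temporal distance within a clique."""
    t = x.shape[0]
    idx = torch.arange(t)
    mask = (idx[:, None] - idx[None, :]).abs() > window   # True where frames fall outside the clique
    scores = x @ x.t() / x.shape[-1] ** 0.5
    scores = scores.masked_fill(mask, float('-inf'))      # restrict attention to the clique
    return F.softmax(scores, dim=-1) @ x

def multi_clique_attention(x, windows=(1, 4, 16)):
    # merge the contexts obtained with local, mid-range, and near-global cliques
    return torch.stack([clique_attention(x, w) for w in windows]).mean(dim=0)

seq = torch.randn(20, 128)                                # 20 frames of 128-d part features
ctx = multi_clique_attention(seq)
```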

Table of contents
Abstract
Related Publications
Acknowledgment
Table of contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
  1.1 Self-Attention
  1.2 Action Localization
  1.3 Group Activity Recognition
  1.4 Video-based Person Re-identification
  1.5 Contributions of the Dissertation
  1.6 Organization
2 Literature Review
  2.1 CNN for Action Recognition
  2.2 Learning Temporal Dependency
  2.3 Graphical Models
  2.4 Attention Models
  2.5 CNN for Object Detection
  2.6 Multi-object Tracking
  2.7 Image-based Person Re-identification
  2.8 Summary
3 Spatial-Temporal Action Localization with Hierarchical Self-Attention
  3.1 Introduction
  3.2 The Proposed Method
    3.2.1 Action Detection Network
    3.2.2 Hierarchical Self-Attention Network
    3.2.3 Fusion
    3.2.4 Action Tube Generation
  3.3 Experimental Results
    3.3.1 Datasets
    3.3.2 Experimental Settings
    3.3.3 Ablation Studies
    3.3.4 Comparison with the State-of-the-Art Works
    3.3.5 Efficiency Analysis
  3.4 Summary
4 Relational Reasoning with Self-Attention Augmented Conditional Random Fields for Group Activity Recognition
  4.1 Introduction
  4.2 Feature Extraction
  4.3 Proposed Relational Network
    4.3.1 Self-Attention Augmented Conditional Random Field
    4.3.2 Bidirectional UTE for Group Activity Recognition
    4.3.3 Loss Function
  4.4 Experimental Results
    4.4.1 Experimental Settings
    4.4.2 Ablation Studies
    4.4.3 Comparison with the State-of-the-Art Works
    4.4.4 Individual Action Classification Accuracy
    4.4.5 Efficiency Analysis
  4.5 Summary
5 Learning Relational Graphs with Self-Attention Augmented Conditional Random Fields for Video-Based Person Re-identification
  5.1 Introduction
  5.2 Proposed Method
    5.2.1 Temporal-Aware Feature Extractor
    5.2.2 Spatial-Temporal Graph Model
    5.2.3 Self-Attention Conditional Random Fields
    5.2.4 New Loss Function
  5.3 Experimental Results
    5.3.1 Experimental Settings
    5.3.2 Ablation Studies
    5.3.3 Comparison with the State-of-the-Art Works
  5.4 Summary
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work
References
Biography
