
Author: Rizard Renanda Adhi Pramono
Title: Relational Reasoning of Visual Contexts in Videos with Self-Attention
Advisors: Wen-Hsien Fang, Yie-Tarng Chen
Committee: Kuen-Tsair Lay, Yie-Tarng Chen, Jenq-Shiou Leu, Mark Liao, Chiou-Shann Fuh, Chien-Ching Chiu, Jason Young
Degree: Doctor
Department: Department of Electronic and Computer Engineering
Year of Publication: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 142
Keywords: action localization, conditional random field, group activity recognition, person re-identification, self-attention
Abstract:
    In recent years, understanding the visual contexts of videos has become one of the major topics in the computer vision community owing to its numerous practical applications and the availability of vast computational resources. Unlike still images, video data carry richer contexts, including spatial and temporal dependencies, and pose additional challenges such as camera movement, illumination changes, viewpoint variations, poor video resolution, and dynamic object interactions. Self-attention has proven effective in modelling the structure of sequential data in a variety of natural language processing tasks. This dissertation therefore focuses on the use of self-attention to reason about the relational contexts within video data on three popular tasks in visual context understanding: action localization, group activity recognition, and video-based person re-identification (re-ID).
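    To make the mechanism concrete before the three tasks are summarized, the following is a minimal sketch of scaled dot-product self-attention over a sequence of features, written in PyTorch; the tensor sizes and projection matrices are illustrative assumptions and do not correspond to any particular model in the dissertation.

    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # illustrative sketch; x: (sequence_length, feature_dim), e.g. one vector per frame or per actor
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        # every position attends to every other position in the sequence
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)
        return weights @ v            # relational context for each position

    # toy usage: 16 frames with 256-dimensional features, 64-dimensional projections (assumed sizes)
    x = torch.randn(16, 256)
    w_q, w_k, w_v = (torch.randn(256, 64) for _ in range(3))
    context = self_attention(x, w_q, w_k, w_v)    # shape: (16, 64)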
    First, a novel architecture is developed for spatial-temporal action localization in videos. The architecture first utilizes a two-stream 3D convolutional neural network (3D-CNN) to provide initial action detections. Next, a new hierarchical self-attention network (HiSAN), the core of this architecture, is developed to learn the spatial-temporal relationships of key actors. The combination of the 3D-CNN and HiSAN allows both the spatial context information and the long-term temporal dependency to be extracted effectively, improving action localization accuracy. Afterwards, a new fusion strategy is employed, which first re-scores the bounding boxes to settle the inconsistent detection scores caused by background clutter or occlusion, and then aggregates the motion and appearance information from the two-stream network with the motion saliency to alleviate the impact of camera movement. Finally, a tube association network based on the self-similarity of the actors' appearance and spatial information across frames is proposed to construct the action tubes efficiently.
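    The sketch below conveys the spirit of the hierarchical arrangement: self-attention is applied first across the actors detected in each frame and then across frames. It is only an illustrative approximation in PyTorch; the class name, head count, and tensor layout are assumptions rather than the actual HiSAN implementation.

    import torch
    import torch.nn as nn

    class HierarchicalSelfAttention(nn.Module):
        # illustrative two-level self-attention: actors within a frame, then frames over time
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, feats):
            # feats: (num_frames, num_actors, dim), one feature per detected actor per frame
            spatial_ctx, _ = self.spatial(feats, feats, feats)        # actors attend within a frame
            per_actor = spatial_ctx.transpose(0, 1)                   # (num_actors, num_frames, dim)
            temporal_ctx, _ = self.temporal(per_actor, per_actor, per_actor)  # each actor attends across frames
            return temporal_ctx.transpose(0, 1)                       # back to (num_frames, num_actors, dim)

    # toy usage: 8 frames with 5 detected actors each (assumed sizes)
    model = HierarchicalSelfAttention()
    out = model(torch.randn(8, 5, 256))    # (8, 5, 256) relational actor features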
    Next, an effective relational network for group activity recognition is introduced. The crux of the network is to integrate conditional random fields (CRFs) with self-attention to infer the temporal dependencies and spatial relationships of the actors. This combination takes advantage of the capability of CRFs in modelling mutually dependent actor features and the capability of self-attention in learning the temporal evolution and spatial relational contexts of every actor in videos. Additionally, the proposed CRF and self-attention have two distinct facets. First, the pairwise energy of the new CRF relies on both temporal and spatial self-attention, which apply the self-attention mechanism to the features in time and space, respectively. Second, to address both local and non-local relationships in group activities, the spatial self-attention takes into account a collection of cliques with different scales of spatial locality. The associated mean-field inference can then be reformulated as a self-attention network to generate the relational contexts of the actors and their individual action labels. Lastly, a bidirectional universal transformer encoder (UTE) is utilized to aggregate the forward and backward temporal context information, scene information, and relational contexts for group activity recognition.
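    To make the coupling of CRFs and self-attention more concrete, the following sketch mimics a few mean-field-style iterations in which the pairwise term is produced by self-attention over the actors' features; the unary/pairwise mixing weight, the number of iterations, and the function names are assumptions for illustration, not the dissertation's exact formulation.

    import torch
    import torch.nn.functional as F

    def attention_message(feats, w_q, w_k, w_v):
        # pairwise message computed with self-attention over actor features (illustrative)
        q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ v

    def mean_field_refine(unary_feats, w_q, w_k, w_v, iters=3, alpha=0.5):
        # unary_feats: (num_actors, dim) backbone features acting as the unary evidence
        q = unary_feats
        for _ in range(iters):
            pairwise = attention_message(q, w_q, w_k, w_v)    # relational context of the actors
            q = (1 - alpha) * unary_feats + alpha * pairwise  # combine unary and pairwise terms
        return q

    # toy usage: 12 actors with 128-dimensional features (assumed sizes)
    actors = torch.randn(12, 128)
    w_q, w_k, w_v = (torch.randn(128, 128) for _ in range(3))
    refined = mean_field_refine(actors, w_q, w_k, w_v)        # (12, 128) refined actor features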
    Finally, a novel attentive graph network is established to tackle difficult scenarios in video re-ID such as occlusion, view misalignment, and pose variations. The network begins with a temporal-aware feature extractor, which makes use of the short-term temporal correlation of the fine-grained feature maps to generate multi-scale part-based CNN features. Subsequently, central to the new network, a combination of self-attention and CRFs that stems from our relational network for group activity recognition is employed to exploit the strength of CRFs in learning the inter-dependency of the multi-scale part features and the strength of self-attention in directly attending to responses at distant positions. To deal with both local and non-local relationships of body parts in a sequence of frames, a new type of self-attention, applied to a set of cliques with different extents of temporal locality, is considered. Also, the CRF inference is cast as a node clustering problem on the graph representation to aggregate the multi-scale part features related to the person of interest and de-emphasize the influence of unrelated background information for a more accurate image sequence representation.
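    The idea of cliques with different temporal extents can be illustrated by masking self-attention so that each frame attends only to frames within a window, with several window sizes covering local and longer-range relationships. The masking scheme and window sizes below are assumptions used only to illustrate the concept, not the actual model.

    import torch
    import torch.nn.functional as F

    def clique_attention(part_feats, w_q, w_k, w_v, window):
        # part_feats: (num_frames, dim) features of one body part across a frame sequence
        T = part_feats.size(0)
        q, k, v = part_feats @ w_q, part_feats @ w_k, part_feats @ w_v
        scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
        # clique mask: frame t attends only to frames within +/- window of t
        idx = torch.arange(T)
        mask = (idx[:, None] - idx[None, :]).abs() > window
        scores = scores.masked_fill(mask, float('-inf'))
        return F.softmax(scores, dim=-1) @ v

    # toy usage: 10 frames, 128-dimensional part features, cliques of two temporal extents (assumed sizes)
    feats = torch.randn(10, 128)
    w_q, w_k, w_v = (torch.randn(128, 128) for _ in range(3))
    local_ctx = clique_attention(feats, w_q, w_k, w_v, window=2)   # short-range clique
    global_ctx = clique_attention(feats, w_q, w_k, w_v, window=9)  # clique covering the whole sequence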



    Table of contents
    Abstract
    Related Publications
    Acknowledgment
    Table of contents
    List of Figures
    List of Tables
    List of Acronyms
    1 Introduction
      1.1 Self-Attention
      1.2 Action Localization
      1.3 Group Activity Recognition
      1.4 Video-based Person Re-identification
      1.5 Contributions of the Dissertation
      1.6 Organization
    2 Literature Review
      2.1 CNN for Action Recognition
      2.2 Learning Temporal Dependency
      2.3 Graphical Models
      2.4 Attention Models
      2.5 CNN for Object Detection
      2.6 Multi-object Tracking
      2.7 Image-based Person Re-identification
      2.8 Summary
    3 Spatial-Temporal Action Localization with Hierarchical Self-Attention
      3.1 Introduction
      3.2 The Proposed Method
        3.2.1 Action Detection Network
        3.2.2 Hierarchical Self-Attention Network
        3.2.3 Fusion
        3.2.4 Action Tube Generation
      3.3 Experimental Results
        3.3.1 Datasets
        3.3.2 Experimental Settings
        3.3.3 Ablation Studies
        3.3.4 Comparison with the State-of-the-Art Works
        3.3.5 Efficiency Analysis
      3.4 Summary
    4 Relational Reasoning with Self-Attention Augmented Conditional Random Fields for Group Activity Recognition
      4.1 Introduction
      4.2 Feature Extraction
      4.3 Proposed Relational Network
        4.3.1 Self-Attention Augmented Conditional Random Field
        4.3.2 Bidirectional UTE for Group Activity Recognition
        4.3.3 Loss Function
      4.4 Experimental Results
        4.4.1 Experimental Settings
        4.4.2 Ablation Studies
        4.4.3 Comparison with the State-of-the-Art Works
        4.4.4 Individual Action Classification Accuracy
        4.4.5 Efficiency Analysis
      4.5 Summary
    5 Learning Relational Graphs with Self-Attention Augmented Conditional Random Fields for Video-Based Person Re-identification
      5.1 Introduction
      5.2 Proposed Method
        5.2.1 Temporal-Aware Feature Extractor
        5.2.2 Spatial-Temporal Graph Model
        5.2.3 Self-Attention Conditional Random Fields
        5.2.4 New Loss Function
      5.3 Experimental Results
        5.3.1 Experimental Settings
        5.3.2 Ablation Studies
        5.3.3 Comparison with the State-of-the-Art Works
      5.4 Summary
    6 Conclusion and Future Work
      6.1 Conclusion
      6.2 Future Work
    References
    Biography

    1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin,
    “Attention is all you need,” in Proceedings of the Advances in Neural Information Processing
    Systems, Long Beach, California, 2017, pp. 5998–6008.
    [2] V. Jain, M. S. Pillai, L. Chandra, R. Kumar, M. Khari, and A. Jain, “CamAspect: An intelligent
    automated real-time surveillance system with smartphone indexing,” IEEE Sensors Letters, vol. 4,
    no. 10, pp. 1–4, 2020.
    [3] M. Qi, J. Qin, A. Li, Y. Wang, J. Luo, and L. Van Gool, “StagNet: An attentive semantic RNN
    for group activity recognition,” in Proceedings of the European Conference on Computer Vision,
    Munich, Germany, 2018, pp. 101–117.
    [4] M. S. Ibrahim and G. Mori, “Hierarchical relational networks for group activity recognition and
    retrieval,” in Proceedings of the European Conference on Computer Vision, Munich, Germany, 2018,
    pp. 721–736.
    [5] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in
    videos,” in Proceedings of the Neural Information Processing Systems, Montr´eal, Canada, 2014,
    pp. 568–576.
    [6] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox, “Chained multi-stream networks exploiting
    pose, motion, and appearance for action classification and detection,” in Proceedings of the IEEE
    International Conference on Computer Vision, Venice, Italy, 2017, pp. 2904–2913.
    [7] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,”
    in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu,
    Hawaii, 2017, pp. 6299–6308.
    [8] V. Choutas, P. Weinzaepfel, J. Revaud, and C. Schmid, “PoTion: Pose motion representation for
    action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Salt Lake City, Utah, 2018, pp. 7024–7033.
    [9] S. Li, S. Bak, P. Carr, and X. Wang, “Diversity regularized spatiotemporal attention for video-based
    person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, Salt Lake City, Utah, 2018, pp. 369–378.
    [10] D. Chen, H. Li, T. Xiao, S. Yi, and X. Wang, “Video person re-identification with competitive
    snippet-similarity aggregation and co-attentive snippet embedding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, 2018, pp. 1169–1178.
    [11] R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen, “VRSTC: Occlusion-free video person reidentification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Long Beach, California, 2019, pp. 7183–7192. [12] Y. Fu, X. Wang, Y. Wei, and T. Huang, “STA: Spatial-temporal attention for large-scale video-based person re-identification,” in Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, 2019, pp. 8287–8294. [13] Y. Zhao, X. Shen, Z. Jin, H. Lu, and X.-s. Hua, “Attribute-driven feature disentangling and temporal
    aggregation for video person re-identification,” in Proceedings of the IEEE Conference on Computer
    Vision and Pattern Recognition, Long Beach, California, 2019, pp. 4913–4922.
    [14] A. Subramaniam, A. Nambiar, and A. Mittal, “Co-segmentation inspired attention networks for
    video-based person re-identification,” in Proceedings of the IEEE International Conference on Computer
    Vision, Seoul, South Korea, 2019, pp. 562–572.
    [15] G. Wu, X. Zhu, and S. Gong, “Spatio-temporal associative representation for video person reidentification.” in Proceedings of the British Machine Vision Conference, Cardiff, United Kingdom,
    2019, p. 278.
    [16] J. Li, J. Wang, Q. Tian, W. Gao, and S. Zhang, “Global-local temporal representations for video
    person re-identification,” in Proceedings of the IEEE International Conference on Computer Vision,
    Seoul, South Korea, 2019, pp. 3958–3967.
    [17] J. Yang, W.-S. Zheng, Q. Yang, Y.-C. Chen, and Q. Tian, “Spatial-temporal graph convolutional network for video-based person re-identification,” in Proceedings of the IEEE Conference on Computer
    Vision and Pattern Recognition, Seattle, Washington, 2020, pp. 3289–3299.
    [18] M. Qi, Y. Wang, J. Qin, A. Li, J. Luo, and L. Van Gool, “StagNet: An attentive semantic RNN for
    group activity and individual action recognition,” IEEE Transactions on Circuits and Systems for
    Video Technology, vol. 30, no. 2, pp. 549–565, 2020.
    [19] Z. Deng, A. Vahdat, H. Hu, and G. Mori, “Structure inference machines: Recurrent neural networks
    for analyzing relations in group activity recognition,” in Proceedings of the IEEE Conference on
    Computer Vision and Pattern Recognition, Las Vegas, Nevada, 2016, pp. 4772–4781.
    [20] T. Shu, S. Todorovic, and S.-C. Zhu, “CERN: confidence-energy recurrent network for group activity
    recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Honolulu, Hawaii, 2017, pp. 5523–5531.
    [21] S. Biswas and J. Gall, “Structural recurrent neural network (SRNN) for group activity analysis,”
    in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe,
    Nevada, 2018, pp. 1625–1632.
    [22] R. Yan, J. Tang, X. Shu, Z. Li, and Q. Tian, “Participation-contributed temporal dynamic model for
    group activity recognition,” in Proceedings of the ACM International Conference on Multimedia,
    Seoul, South Korea, 2018, pp. 1292–1300.
    [23] G. Hu, B. Cui, Y. He, and S. Yu, “Progressive relation learning for group activity recognition,” in
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Washington,
    2020, pp. 980–989. [24] L. Lu, Y. Lu, R. Yu, H. Di, L. Zhang, and S. Wang, “GAIM: Graph attention interaction model for
    collective activity recognition,” IEEE Transactions on Multimedia, vol. 22, no. 2, pp. 524–539, 2019.
    [25] Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multi-object tracking by decision
    making,” in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile,
    2015, pp. 4705–4713.
    [26] Y. Zha, T. Ku, Y. Li, and P. Zhang, “Deep position-sensitive tracking,” IEEE Transactions on Multimedia, vol. 22, no. 1, pp. 96–107, 2019.
    [27] S. Sun, N. Akhtar, H. Song, A. S. Mian, and M. Shah, “Deep affinity network for multiple object
    tracking,” To be published in IEEE Transactions on Pattern Analysis and Machine Intelligence,
    vol. 43, no. 1, pp. 104–119, 2019.
    [28] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser, “Universal transformers,” in Proceedings of the International Conference on Learning Representations, New Orleans, Louisiana,
    2019.
    [29] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “VideoBERT: A joint model for video
    and language representation learning,” in Proceedings of the IEEE International Conference on Computer
    Vision, Seoul, South Korea, 2019, pp. 7464–7473.
    [30] K. Gavrilyuk, R. Sanford, M. Javan, and C. G. Snoek, “Actor-transformers for group activity recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, Washington, 2020, pp. 839–848.
    [31] D. Purwanto, R. Renanda Adhi Pramono, Y.-T. Chen, and W.-H. Fang, “Extreme low resolution
    action recognition with spatial-temporal multi-head self-attention and knowledge distillation,” in
    Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, South
    Korea, 2019, pp. 961–969.
    [32] R. R. A. Pramono, Y.-T. Chen, and W.-H. Fang, “Hierarchical self-attention network for action
    localization in videos,” in Proceedings of the IEEE International Conference on Computer Vision,
    Seoul, South Korea, 2019, pp. 61–70.
    [33] L. Guo, J. Liu, X. Zhu, P. Yao, S. Lu, and H. Lu, “Normalized and geometry-aware self-attention
    network for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and
    Pattern Recognition, Seattle, Washington, 2020, pp. 10 327–10 336.
    [34] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image
    captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Seattle, Washington, 2020, pp. 10 578–10 587.
    [35] D. Purwanto, R. R. A. Pramono, Y.-T. Chen, and W.-H. Fang, “Three-stream network with bidirectional self-attention for action recognition in extreme low resolution videos,” IEEE Signal Processing Letters, vol. 26, no. 8, pp. 1187–1191, 2019.
    [36] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman, “Video action transformer network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, 2019, pp. 244–253.
    [37] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr,
    “Conditional random fields as recurrent neural networks,” in Proceedings of the IEEE International
    Conference on Computer Vision, Boston, Massachusetts, 2015, pp. 1529–1537.
    [38] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for
    human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, Honolulu, Hawaii, 2017, pp. 1831–1840.
    [39] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, 2019, pp. 4171–4186.
    [40] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., “Gradient flow in recurrent nets: the
    difficulty of learning long-term dependencies,” 2001.
    [41] I. S. Kim, H. S. Choi, K. M. Yi, J. Y. Choi, and S. G. Kong, “Intelligent visual surveillance—a
    survey,” International Journal of Control, Automation and Systems, vol. 8, no. 5, pp. 926–939, 2010.
    [42] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and
    T. Darrell, “Sequence to sequence-video to text,” in Proceedings of the IEEE International Conference
    on Computer Vision, Santiago, Chile, 2015, pp. 4534–4542.
    [43] T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, and S. Savarese, “Social scene understanding: End-to-end multi-person action localization and collective activity recognition,” in Proceedings of the IEEE
    Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, 2017, pp. 4315–4324.
    [44] J. Wu, L. Wang, L. Wang, J. Guo, and G. Wu, “Learning actor relation graphs for group activity
    recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Long Beach, California, 2019, pp. 9964–9974.
    [45] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep temporal
    model for group activity recognition,” in Proceedings of the IEEE Conference on Computer Vision
    and Pattern Recognition, Las Vegas, Nevada, 2016, pp. 1971–1980.
    [46] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “Mars: A video benchmark for
    large-scale person re-identification,” in European Conference on Computer Vision, Amsterdam, the
    Netherlands, 2016, pp. 868–884.
    [47] J. Luo, C. Papin, and K. Costello, “Towards extracting semantically meaningful key frames from
    personal video clips: from humans to computers,” IEEE Transactions on Circuits and Systems for
    Video Technology, vol. 19, no. 2, pp. 289–301, 2008.
    [48] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with
    3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer
    Vision, Santiago, Chile, 2015, pp. 4489–4497.
    [49] S. Ji,W. Xu, M. Yang, and K. Yu, “3D convolutional neural networks for human action recognition,”
    IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
    [50] Y. Zhou, X. Sun, Z.-J. Zha, and W. Zeng, “Mict: Mixed 3D/2D convolutional tube for human action
    recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Salt Lake City, Utah, 2018, pp. 449–458.
    [51] J. Li, X. Liu, W. Zhang, M. Zhang, J. Song, and N. Sebe, “Spatio-temporal attention networks for
    action recognition and detection,” IEEE Transactions on Multimedia, vol. 22, no. 11, pp. 2990–3001,
    2020.
    [52] N. Crasto, P. Weinzaepfel, K. Alahari, and C. Schmid, “Mars: Motion-augmented RGB stream
    for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, Long Beach, California, 2019, pp. 7882–7891.
    [53] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotemporal feature learning: Speedaccuracy
    trade-offs in video classification,” in Proceedings of the European Conference on Computer
    Vision, Munich, Germany, 2018, pp. 305–321.
    [54] C. C. Loy, T. Xiang, and S. Gong, “Modelling activity global temporal dependencies using time
    delayed probabilistic graphical model,” in Proceedings of the IEEE International Conference on
    Computer Vision, Kyoto, Japan, 2009, pp. 120–127.
    [55] Z. Li, K. Gavrilyuk, E. Gavves, M. Jain, and C. G. Snoek, “VideoLSTM convolves, attends and flows
    for action recognition,” Computer Vision and Image Understanding, vol. 166, pp. 41 – 50, 2018.
    [56] Y. Shi, B. Fernando, and R. Hartley, “Action anticipation with RBF kernelized feature mapping
    RNN,” in Proceedings of the European Conference on Computer Vision, Munich, Germany, 2018,
    pp. 305–322.
    [57] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei, “Recurrent tubelet proposal and recognition networks
    for action detection,” in Proceedings of the European Conference on Computer Vision, Munich,
    Germany, 2018, pp. 303–318.
    [58] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the
    IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, 2018, pp.
    7794–7803.
    [59] Z. Wang, Z. Gao, L. Wang, Z. Li, and G. Wu, “Boundary-aware cascade networks for temporal
    action segmentation,” in Proceedings of the European Conference on Computer Vision, Glasgow,
    United Kingdom, 2020, pp. 34–51.
    [60] V. I. Morariu and L. S. Davis, “Multi-agent event recognition in structured scenarios,” in Proceedings
    of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, Colorado,
    2011, pp. 3289–3296.
    [61] S. S. Intille and A. F. Bobick, “Recognizing planned, multiperson action,” Computer Vision and
    Image Understanding, vol. 81, no. 3, pp. 414–445, 2001.
    [62] Y. Xu, L. Qin, X. Liu, J. Xie, and S.-C. Zhu, “A causal and-or graph model for visibility fluent
    reasoning in tracking interacting objects,” in Proceedings of the IEEE Conference on Computer
    Vision and Pattern Recognition, Salt Lake City, Utah, 2018, pp. 2178–2187.
    [63] X. Wang and A. Gupta, “Videos as space-time region graphs,” in Proceedings of the European Conference on Computer Vision, Munich, Germany, 2018, pp. 399–417.
    [64] W.-H. Li, F.-T. Hong, and W.-S. Zheng, “Learning to learn relation for important people detection
    in still images,” in The IEEE Conference on Computer Vision and Pattern Recognition, Long Beach,
    California, 2019, pp. 5003–5011.
    [65] M. Xu, C. Zhao, D. S. Rojas, A. Thabet, and B. Ghanem, “G-TAD: Sub-graph localization for
    temporal action detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, Seattle, Washington, 2020, pp. 10 156–10 165.
    [66] M. Feng, S. Z. Gilani, Y. Wang, L. Zhang, and A. Mian, “Relation graph network for 3D object
    detection in point clouds,” IEEE Transactions on Image Processing, vol. 30, pp. 92–107, 2021.
    [67] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” in
    Proceedings of the European Conference on Computer Vision, Munich, Germany, 2018, pp. 803–
    818.
    [68] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang, “Residual attention
    network for image classification,” in Proceedings of the IEEE Conference on Computer Vision and
    Pattern Recognition, Honolulu, Hawaii, 2017, pp. 3156–3164.
    [69] X. Li and C. Change Loy, “Video object segmentation with joint re-identification and attention-aware
    mask propagation,” in Proceedings of the European Conference on Computer Vision, Munich,
    Germany, 2018, pp. 90–105.
    [70] H.-S. Fang, J. Cao, Y.-W. Tai, and C. Lu, “Pairwise body-part attention for recognizing human-object
    interactions,” in Proceedings of the European Conference on Computer Vision, Munich, Germany,
    2018, pp. 51–67.
    [71] W. Du, Y. Wang, and Y. Qiao, “Recurrent spatial-temporal attention network for action recognition
    in videos,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1347–1360, 2018.
    [72] R. R. A. Pramono, Y. T. Chen, and W. H. Fang, “Empowering relational network by self-attention
    augmented conditional random fields for group activity recognition,” in Proceedings of the European
    Conference on Computer Vision, Glasgow, United Kingdom, 2020, pp. 71–90.
    [73] H.Wu, X. Ma, and Y. Li, “Convolutional networks with channel and STIPs attention model for action
    recognition in videos,” IEEE Transactions on Multimedia, vol. 22, no. 9, pp. 2293–2306, 2020.
    [74] C. Chen, D. Gong, H. Wang, Z. Li, and K. Y. K. Wong, “Learning spatial attention for face superresolution,” IEEE Transactions on Image Processing, vol. 30, pp. 1219–1231, 2021.
    [75] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced deep-learning techniques for salient and
    category-specific object detection: a survey,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp.
    84–100, 2018.
    [76] D. Sidib´e, M. Rastgoo, and F. M´eriaudeau, “On spatio-temporal saliency detection in videos using
    multilinear PCA,” in Proceedings of the International Conference on Pattern Recognition, Las Vegas,
    Nevada, 2016, pp. 1876–1880.
    [77] D. Zhang, J. Han, Y. Zhang, and D. Xu, “Synthesizing supervision for learning deep saliency network
    without human annotation,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
    vol. 42, no. 7, pp. 1755–1769, 2019.
    [78] X. Peng and C. Schmid, “Multi-region two-stream R-CNN for action detection,” in Proceedings of
    the European Conference on Computer Vision, Amsterdam, the Netherlands, 2016, pp. 744–759.
    [79] E. H. P. Alwando, Y.-T. Chen, and W.-H. Fang, “CNN-based multiple path search for action tube
    detection in videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30,
    no. 1, pp. 104–116, 2020.
    [80] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin, “Online real-time multiple spatiotemporal action localisation and prediction,” in Proceedings of the IEEE International Conference on
    Computer Vision, Venice, Italy, 2017, pp. 3637–3646.
    [81] Z. Yang, J. Gao, and R. Nevatia, “Spatio-temporal action detection with cascade proposal and location anticipation,” in Proceedings of the British Machine Vision Conference, London, United Kingdom,
    2017, pp. 95.1–95.12.
    [82] V. Kalogeiton, P.Weinzaepfel, V. Ferrari, and C. Schmid, “Action tubelet detector for spatio-temporal
    action localization,” in Proceedings of the IEEE International Conference on Computer Vision,
    Venice, Italy, 2017, pp. 4405–4413.
    [83] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with
    region proposal networks,” in Proceedings of the Neural Information Processing Systems, Montr´eal,
    Canada, 2015, pp. 91–99.
    [84] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional
    networks,” in Proceedings of the Advances in Neural Information Processing Systems, Barcelona,
    Spain, 2016, pp. 379–387.
    [85] T.-Y. Lin, P. Doll´ar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks
    for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii.
    [86] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot
    multibox detector,” in Proceedings of the European Conference on Computer Vision, Amsterdam,
    the Netherlands, 2016, pp. 21–37.
    [87] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” in Proceedings of the IEEE Conference
    on Computer Vision and Pattern Recognition, Honolulu, Hawaii, 2017, pp. 6517–6525.
    [88] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Doll´ar, “Focal loss for dense object detection,” in
    Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp.
    2980–2988.
    [89] C. Zhu, Y. He, and M. Savvides, “Feature selective anchor-free module for single-shot object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long
    Beach, California, 2019, pp. 840–849.
    [90] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint
    arXiv:1804.02767, 2018.
    [91] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visual tracking with siamese region
    proposal network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Salt Lake City, Utah, 2018, pp. 8971–8980.
    [92] P. Bergmann, T. Meinhardt, and L. Leal-Taixe, “Tracking without bells and whistles,” in Proceedings
    of the IEEE International Conference on Computer Vision, Seoul, South Korea, 2019, pp. 941–951.
    [93] T. Xiao, H. Li,W. Ouyang, and X.Wang, “Learning deep feature representations with domain guided
    dropout for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and
    Pattern Recognition, Las Vegas, Nevada, 2016, pp. 1249–1258.
    [94] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, “Camera style adaptation for person reidentification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Salt Lake City, Utah, 2018, pp. 5157–5166.
    [95] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, “Omni-scale feature learning for person reidentification,” in Proceedings of the IEEE International Conference on Computer Vision, Seoul,
    South Korea, 2019, pp. 3702–3712.
    [96] Z. Zhang, C. Lan, W. Zeng, X. Jin, and Z. Chen, “Relation-aware global attention for person reidentification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Seattle, Washington, 2020, pp. 3186–3195.
    [97] Z. Zhu, X. Jiang, F. Zheng, X. Guo, F. Huang, X. Sun, and W. Zheng, “Aware loss with angular
    regularization for person re-identification,” in Proceedings of the AAAI Conference on Artificial
    Intelligence, New York City, New York, 2020, pp. 13 114–13 121.
    [98] F. Vannucci, G. Di Cesare, F. Rea, G. Sandini, and A. Sciutti, “A robot with style: can robotic
    attitudes influence human actions?” in Proceedings of the International Conference on Humanoid
    Robots, Beijing, China, 2018, pp. 1–6.
    [99] G. Gkioxari and J. Malik, “Finding action tubes,” in Proceedings of the IEEE Conference on Computer
    Vision and Pattern Recognition, Boston, Massachusetts, 2015, pp. 759–768.
    [100] P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “Learning to track for spatio-temporal action localization,” in Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile,
    2015, pp. 3164–3172.
    [101] L.Wang, Y. Qiao, X. Tang, and L. Van Gool, “Actionness estimation using hybrid fully convolutional
    networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las
    Vegas, Nevada, 2016, pp. 2708–2717.
    [102] S. Saha, G. Singh, M. Sapienza, P. H. Torr, and F. Cuzzolin, “Deep learning for detecting multiple
    space-time action tubes in videos,” in Proceedings of the British Machine Vision Conference, York,
    United Kingdom, 2016, pp. 58.1–58.13.
    [103] J. Zhao and C. G. Snoek, “Dance with flow: Two-in-one stream action detection,” in Proceedings of
    the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, 2019,
    pp. 9935–9944.
    [104] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz, “STEP: Spatio-temporal progressive
    learning for video action detection,” in Proceedings of the IEEE Conference on Computer Vision and
    Pattern Recognition, Long Beach, California, 2019, pp. 264–272.
    [105] L. Song, S. Zhang, G. Yu, and H. Sun, “TACNet: Transition-aware context network for spatio-temporal action detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, Long Beach, California, 2019, pp. 11 987–11 995.
    [106] R. Hou, C. Chen, and M. Shah, “Tube convolutional neural network (T-CNN) for action detection
    in videos,” in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy,
    2017, pp. 5822–5831.
    [107] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici,
    S. Ricco, R. Sukthankar et al., “AVA: A video dataset of spatio-temporally localized atomic visual
    actions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt
    Lake City, Utah, 2018, pp. 6047–6056.
    [108] K. Duarte, Y. Rawat, and M. Shah, “VideoCapsuleNet: A simplified network for action detection,”
    in Proceedings of the Advances in Neural Information Processing Systems, Montr´eal, Canada, 2018,
    pp. 7610–7619.
    [109] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid, “Actor-centric relation network,” in Proceedings of the European Conference on Computer Vision, Munich, Germany,
    2018, pp. 318–334.
    [110] Y. Ye, X. Yang, and Y. Tian, “Discovering spatio-temporal action tubes,” Journal of Visual Communication and Image Representation, vol. 58, pp. 515–524, 2019.
    [111] Z. Qiu, T. Yao, C.-W. Ngo, X. Tian, and T. Mei, “Learning spatio-temporal representation with
    local and global diffusion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, Long Beach, California, 2019, pp. 12 056–12 065.
    [112] Y. Li, W. Lin, J. See, N. Xu, S. Xu, K. Yan, and C. Yang, “CFAD: Coarse-to-fine action detector for
    spatiotemporal action localization,” in Proceedings of the European Conference on Computer Vision,
    Glasgow, United Kingdom, 2020, pp. 510–527.
    [113] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in Proceedings of the IEEE International Conference on Computer Vision, Seoul, South Korea, 2019, pp.
    6202–6211.
    [114] J. Wu, Z. Kuang, L. Wang, W. Zhang, and G. Wu, “Context-aware RCNN: A baseline for action
    detection in videos,” in Proceedings of the European Conference on Computer Vision, Glasgow,
    United Kingdom, 2020, pp. 440–456.
    [115] Y. Li, Z. Wang, L. Wang, and G. Wu, “Actions as moving points,” in Proceedings of the European
    Conference on Computer Vision, Glasgow, United Kingdom, 2020, pp. 68–84.
    [116] P. Mettes and C. G. Snoek, “Spatial-aware object embeddings for zero-shot localization and classification of actions,” in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 4443–4452.
    [117] Y. Zhang, P. Tokmakov, M. Hebert, and C. Schmid, “A structured model for action detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, 2019, pp. 9975–9984.
    [118] D. Li, T. Yao, Z. Qiu, H. Li, and T. Mei, “Long short-term relation networks for video action detection,” in Proceedings of the ACM International Conference on Multimedia, Nice, France, 2019, pp.
    629–637.
    [119] M. Guo, Y. Zhang, and T. Liu, “Gaussian transformer: a lightweight approach for natural language
    inference,” in Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, Hawaii,
    2019, pp. 6489–6496.
    [120] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3D CNNs retrace the history of 2D CNNs
    and imagenet?” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Salt Lake City, Utah, 2018, pp. 6546–6555.
    [121] T. Brox, A. Bruhn, N. Papenberg, and J.Weickert, “High accuracy optical flow estimation based on a
    theory for warping,” in Proceedings of the European Conference on Computer Vision, Prague, Czech
    Republic, 2004, pp. 25–36.
    [122] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao, “A multi-stream bi-directional recurrent
    neural network for fine-grained action detection,” in Proceedings of the IEEE Conference on Computer
    Vision and Pattern Recognition, Las Vegas, Nevada, 2016, pp. 1961–1970.
    [123] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and
    T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in
    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, 2015, pp. 2625–2634.
    [124] W. McGuire, R. H. Gallagher, and H. Saunders, Matrix Structural Analysis, 2000.
    [125] B. Yang, Z. Tu, D. F. Wong, F. Meng, L. S. Chao, and T. Zhang, “Modeling localness for self-attention networks,” in Proceedings of the Conference on Empirical Methods in Natural Language
    Processing, Brussels, Belgium, 2018, pp. 6489–6496.
    [126] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and
    Pattern Recognition, Columbus, Ohio, 2014, pp. 580–587.
    [127] A. Sadeghian, A. Alahi, and S. Savarese, “Tracking the untrackable: Learning to track multiple cues
    with long-term dependencies,” in Proceedings of the IEEE International Conference on Computer
    Vision, Venice, Italy, 2017, pp. 300–311.
    [128] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical
    image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Miami, Florida, 2009, pp. 248–255.
    [129] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from
    videos in the wild,” Center for Research in Computer Vision, 2012.
    [130] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, “Towards understanding action recognition,”
    in Proceedings of the IEEE Conference on Computer Vision, Portland, Oregon, 2013, pp. 3192–3199.
    [131] K. Soomro and A. R. Zamir, “Action recognition in realistic sports videos,” in Proceedings of the
    IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, Ohio, 2014.
    [132] J. He, Z. Deng, M. S. Ibrahim, and G. Mori, “Generic tubelet proposals for action localization,”
    in Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe,
    Nevada, 2018, pp. 343–351.
    [133] Y. Li, W. Lin, T. Wang, J. See, R. Qian, N. Xu, L. Wang, and S. Xu, “Finding action tubes with a
    sparse-to-dense framework.” in Proceedings of the AAAI Conference on Artificial Intelligence, New
    York City, New York, 2020, pp. 11,466–11.473.
    [134] Z. Fan, T. Lin, X. Zhao, W. Jiang, T. Xu, and M. Yang, “An online approach for gesture recognition
    toward real-world applications,” in Proceeding of the International Conference on Image and
    Graphics, Shanghai, China, 2017, pp. 262–272.
    [135] W. Choi and S. Savarese, “A unified framework for multi-target tracking and collective activity
    recognition,” in Proceedings of the European Conference on Computer Vision, Florence, Italy, 2012,
    pp. 215–230.
    [136] M. R. Amer, P. Lei, and S. Todorovic, “HiRF: Hierarchical random field for collective activity recognition in videos,” in Proceedings of the European Conference on Computer Vision, Z¨urich, Switzerland, 2014, pp. 572–585.
    [137] H. Hajimirsadeghi, W. Yan, A. Vahdat, and G. Mori, “Visual recognition by counting instances: A
    multi-instance cardinality potential kernel,” in Proceedings of the IEEE Conference on Computer
    Vision and Pattern Recognition, Boston, Massachusetts, 2015, pp. 2596–2605.
    [138] P. Zhang, Y. Tang, J.-F. Hu, and W.-S. Zheng, “Fast collective activity recognition under weak supervision, ”IEEE Transactions on Image Processing, vol. 29, pp. 29–43, 2019.
    [139] X. Li and M. Choo Chuah, “SBGAR: Semantics based group activity recognition,” in Proceedings
    of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 2876–2885.
    [140] J. Tang, X. Shu, R. Yan, and L. Zhang, “Coherence constrained graph LSTM for group activity
    recognition,” To be published in IEEE Transactions on Pattern Analysis and Machine Intelligence,
    2020.
    [141] Y. Tang, J. Lu, Z. Wang, M. Yang, and J. Zhou, “Learning semantics-preserving attention and contextual interaction for group activity recognition,” IEEE Transactions on Image Processing, vol. 28, no. 10, pp. 4997–5012, 2019.
    [142] X. Shu, L. Zhang, Y. Sun, and J. Tang, “Host–parasite: Graph LSTM-in-LSTM for group activity
    recognition,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 2, pp. 663–
    674, 2020.
    [143] X. Shu, J. Tang, G.-J. Qi, W. Liu, and J. Yang, “Hierarchical long short-term concurrent memory for
    human interaction recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
    vol. 43, no. 3, pp. 1110–1118, 2021.
    [144] M. Wang, B. Ni, and X. Yang, “Recurrent modeling of interaction context for collective activity
    recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Honolulu, Hawaii, 2017, pp. 3048–3056.
    [145] J. Butepage, M. J. Black, D. Kragic, and H. Kjellstrom, “Deep representation learning for human
    motion prediction and classification,” in Proceedings of the IEEE Conference on Computer Vision
    and Pattern Recognition, Honolulu, Hawaii, 2017, pp. 6158–6166.
    [146] S. M. Azar, M. G. Atigh, A. Nickabadi, and A. Alahi, “Convolutional relational machine for group
    activity recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, 2019, pp. 7892–7901.
    [147] R. Yan, L. Xie, J. Tang, X. Shu, and Q. Tian, “HiGCIN: Hierarchical graph-based cross inference
    network for group activity recognition,” To be published in IEEE Transactions on Pattern Analysis
    and Machine Intelligence, 2020.
    [148] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, and B. Yang, “Geom-GCN: Geometric graph convolutional
    networks,” in Proceedings of the International Conference on Learning Representations, Vienna,
    Austria, 2020.
    [149] D. Z¨ugner, A. Akbarnejad, and S. G¨unnemann, “Adversarial attacks on neural networks for graph
    data,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery &
    Data Mining, London, United Kingdom, 2018, pp. 2847–2856.
    [150] R. Yan, L. Xie, J. Tang, X. Shu, and Q. Tian, “Social adaptive module for weakly-supervised group
    activity recognition,” in Proceedings of the European Conference on Computer Vision, Glasgow,
    United Kingdom, 2020, pp. 208–224.
    [151] J. Chen, W. Bao, and Y. Kong, “Group activity prediction with sequential relational anticipation
    model,” in European Conference on Computer Vision, Glasgow, United Kingdom, 2020 , pp. 581–
    597.
    [152] D. M. Nguyen, R. Calderbank, and N. Deligiannis, “Geometric matrix completion with deep conditional random fields,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 9,
    pp. 3579–3593, 2020.
    [153] Q. Li, Y. Shi, X. Huang, and X. X. Zhu, “Building footprint generation by integrating convolution
    neural network with feature pairwise conditional random field (fpcrf),” IEEE Transactions on
    Geoscience and Remote Sensing, vol. 58, no. 11, pp. 7502–7519, 2020.
    [154] K. Sun, B. Xiao, D. Liu, and J.Wang, “Deep high-resolution representation learning for human pose
    estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Long Beach, California, 2019, pp. 5693–5703.
    [155] K. He, G. Gkioxari, P. Doll´ar, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE International
    Conference on Computer Vision, Venice, Italy, 2017, pp. 2961–2969.
    [156] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for
    segmenting and labeling sequence data,” in Proceedings of the International Conference on Machine
    Learning, Williamstown, Massachusetts, 2001, pp. 282–289.
    [157] P. Kr¨ahenb¨uhl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain,
    2011, pp. 109–117.
    [158] L. Ladick`y, C. Russell, P. Kohli, and P. H. Torr, “Associative hierarchical CRFs for object class
    image segmentation,” in Proceedings of the International Conference on Computer Vision, Kyoto,
    Japan, 2009, pp. 739–746.
    [159] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-RNN: Deep learning on spatiotemporal graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Las Vegas, Nevada, 2016, pp. 5308–5317.
    [160] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Li`o, and Y. Bengio, “Graph attention networks,” in Proceedings of the International Conference on Learning Representations, Vancouver,
    Canada, 2018.
    [161] H. Gao, J. Pei, and H. Huang, “Conditional random field enhanced graph convolutional neural networks,” in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery &
    Data Mining, Anchorage, Alaska, 2019, pp. 276–284.
    [162] H. Yuan and S. Ji, “StructPool: Structured graph pooling via conditional random fields,” in Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020.
    [163] A. Arnab, S. Jayasumana, S. Zheng, and P. H. Torr, “Higher order conditional random fields in deep
    neural networks,” in Proceedings of the European Conference on Computer Vision, Amsterdam, the
    Netherlands, 2016, pp. 524–540.
    [164] Y. Liu, C. Sun, L. Lin, and X. Wang, “Learning natural language inference using bidirectional lstm
    model and inner-attention,” arXiv preprint arXiv:1605.09090, 2016.
    [165] Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun, “UPSNet: A unified panoptic
    segmentation network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, Long Beach, California, 2019, pp. 8818–8826.
    [166] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. The MIT Press, 2016.
    [167] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan,
    “Supervised contrastive learning,” in Proceedings of the Advances in Neural Information Processing
    Systems, New York City, New York, 2020.
    [168] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning
    of visual representations,” in Proceedings of the International Conference on Machine Learning,
    Vienna, Austria, 2020, pp. 1597–1607.
    [169] W. Choi, K. Shahid, and S. Savarese, “What are they doing?: Collective activity classification using
    spatio-temporal relationship among people,” in Proceedings of the International Conference on
    Computer Vision Workshops, Kyoto, Japan, 2009, pp. 1282–1289.
    [170] ——, “Learning context for collective activity recognition,” in Proceedings of the IEEE Conference
    on Computer Vision and Pattern Recognition, Colorado Springs, Colorado, 2011, pp. 3273–3280.
    [171] C. Zach, T. Pock, and H. Bischof, “A duality based approach for realtime TV-L 1 optical flow,” in
    Proceedings of the Joint Pattern Recognition Symposium, Heidelberg, Germany, 2007, pp. 214–223.
    [172] T. Lan, Y.Wang,W. Yang, S. N. Robinovitch, and G. Mori, “Discriminative latent models for recognizing
    contextual group activities,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
    vol. 34, no. 8, pp. 1549–1562, 2011.
    [173] S. Asghari-Esfeden, M. Sznaier, and O. Camps, “Dynamic motion representation for human action
    recognition,” in Proceedings of the IEEE Winter Conference on Applications of Computer Vision,
    Snowmass, Colorado, 2020, pp. 557–566.
    [174] Z. Pan, S. Liu, A. K. Sangaiah, and K. Muhammad, “Visual attention feature (VAF): a novel strategy
    for visual tracking based on cloud platform in intelligent surveillance systems,” Journal of Parallel
    and Distributed Computing, vol. 120, pp. 182–194, 2018.
    [175] X. Gu, H. Chang, B. Ma, H. Zhang, and X. Chen, “Appearance-preserving 3D convolution for video-based person re-identification,” in Proceedings of the European Conference on Computer Vision,
    Glasgow, United Kingdom, 2020, pp. 228–243.
    [176] D. Chung, K. Tahboub, and E. J. Delp, “A two stream siamese convolutional neural network for
    person re-identification,” in Proceedings of the IEEE International Conference on Computer Vision,
    Venice, Italy, 2017, pp. 1983–1991.
    [177] W. Zhang, S. Hu, K. Liu, and Z. Zha, “Learning compact appearance representation for video-based
    person re-identification,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29,
    no. 8, pp. 2442–2452, 2019.
    [178] N. McLaughlin, J. M. Del Rincon, and P. Miller, “Recurrent convolutional network for video-based
    person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, Las Vegas, Nevada, 2016, pp. 1325–1334.
    [179] S. Xu, Y. Cheng, K. Gu, Y. Yang, S. Chang, and P. Zhou, “Jointly attentive spatial-temporal pooling
    networks for video-based person re-identification,” in Proceedings of the IEEE International
    Conference on Computer Vision, Venice, Italy, 2017, pp. 4733–4742.
    [180] Z. Zhou, Y. Huang, W. Wang, L. Wang, and T. Tan, “See the forest for the trees: Joint spatial and
    temporal recurrent neural networks for video-based person re-identification,” in Proceedings of the
    IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, 2017, pp. 4747–
    4756.
    [181] L. Wu, Y. Wang, J. Gao, and X. Li, “Where-and-when to look: Deep siamese attention networks for
    video-based person re-identification,” IEEE Transactions on Multimedia, vol. 21, no. 6, pp. 1412–
    1424, 2019.
    [182] R. Hou, H. Chang, B. Ma, S. Shan, and X. Chen, “Temporal complementary learning for video
    person re-identification,” in Proceedings of the European Conference on Computer Vision, Glasgow,
    United Kingdom, 2020, pp. 388–405.
    [183] Z. Zhang, C. Lan, W. Zeng, and Z. Chen, “Multi-granularity reference-aided attentive feature aggregation
    for video-based person re-identification,” in Proceedings of the IEEE Conference on Computer
    Vision and Pattern Recognition, Seattle, Washington, 2020, pp. 10407–10416.
    [184] P. Li, P. Pan, P. Liu, M. Xu, and Y. Yang, “Hierarchical temporal modeling with mutual distance
    matching for video based person re-identification,” IEEE Transactions on Circuits and Systems for
    Video Technology, vol. 31, no. 2, pp. 503–511, 2021.
    [185] W. Zhang, X. He, W. Lu, H. Qiao, and Y. Li, “Feature aggregation with reinforcement learning for
    video-based person re-identification,” IEEE Transactions on Neural Networks and Learning Systems,
    vol. 30, no. 12, pp. 3847–3852, 2019.
    [186] G. Chen, Y. Rao, J. Lu, and J. Zhou, “Temporal coherence or temporal motion: Which is more
    critical for video-based person re-identification?” in Proceedings of the European Conference on
    Computer Vision, Glasgow, United Kingdom, 2020, pp. 660–676.
    [187] R. Hou, H. Chang, B. Ma, R. Huang, and S. Shan, “BiCnet-TKS: Learning efficient spatial-temporal
    representation for video person re-identification,” in Proceedings of the IEEE Conference on Computer
    Vision and Pattern Recognition, Nashville, Tennessee, 2021, pp. 2014–2023.
    [188] X. Liu, P. Zhang, C. Yu, H. Lu, and X. Yang, “Watching you: Global-guided reciprocal learning for
    video-based person re-identification,” in Proceedings of the IEEE Conference on Computer Vision
    and Pattern Recognition, Nashville, Tennessee, 2021, pp. 13334–13343.
    [189] Y. Yan, J. Qin, J. Chen, L. Liu, F. Zhu, Y. Tai, and L. Shao, “Learning multi-granular hypergraphs for
    video-based person re-identification,” in Proceedings of the IEEE Conference on Computer Vision
    and Pattern Recognition, Seattle, Washington, 2020, pp. 2899–2908.
    [190] Y. Wu, O. E. F. Bourahla, X. Li, F. Wu, Q. Tian, and X. Zhou, “Adaptive graph representation
    learning for video person re-identification,” IEEE Transactions on Image Processing, vol. 29, pp.
    8821–8830, 2020.
    [191] J. Liu, Z.-J. Zha, W. Wu, K. Zheng, and Q. Sun, “Spatial-temporal correlation and topology learning
    for person re-identification in videos,” in Proceedings of the IEEE Conference on Computer Vision
    and Pattern Recognition, Nashville, Tennessee, 2021, pp. 4370–4379.
    [192] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani,
    M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
    Transformers for image recognition at scale,” in Proceedings of the International Conference on
    Learning Representations, Vienna, Austria, 2021.
    [193] Q.-H. Pham, T. Nguyen, B.-S. Hua, G. Roig, and S.-K. Yeung, “JSIS3D: joint semantic-instance
    segmentation of 3D point clouds with multi-task pointwise networks and multi-value conditional
    random fields,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
    Long Beach, California, 2019, pp. 8827–8836.
    [194] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou, “Learning discriminative features with multiple
    granularities for person re-identification,” in Proceedings of the ACM International Conference on
    Multimedia, Seoul, South Korea, 2018, pp. 274–282.
    [195] R. D. Alba, “A graph-theoretic definition of a sociometric clique,” Journal of Mathematical Sociology,
    vol. 3, no. 1, pp. 113–126, 1973.
    [196] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang, “Bag of tricks and a strong baseline for deep person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
    Workshops, Long Beach, California, 2019, pp. 1487–1495.
    [197] T. Wang, S. Gong, X. Zhu, and S. Wang, “Person re-identification by video ranking,” in Proceedings
    of the European Conference on Computer Vision, Zürich, Switzerland, 2014, pp. 688–703.
    [198] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, “Person re-identification by descriptive and
    discriminative classification,” in Proceedings of the Scandinavian Conference on Image Analysis,
    Ystad, Sweden, 2011, pp. 91–102.
    [199] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by GAN improve the person re-identification
    baseline in vitro,” in Proceedings of the IEEE International Conference on Computer
    Vision, Venice, Italy, 2017, pp. 3754–3762.
    [200] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph CNN
    for learning on point clouds,” ACM Transactions On Graphics, vol. 38, no. 5, pp. 1–12, 2019.
    [201] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le, “AutoAugment: Learning augmentation
    strategies from data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern
    Recognition, Long Beach, California, 2019, pp. 113–123.
    [202] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in
    Proceedings of the International Conference on Machine Learning, Long Beach, California,
    2019, pp. 6105–6114.
    [203] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized focal loss: learning
    qualified and distributed bounding boxes for dense object detection,” in Proceedings of the Advances
    in Neural Information Processing Systems, Vancouver, Canada, 2020, pp. 21002–21012.
