Simple Search / Detailed Record

Author: Didik Purwanto (狄騠克)
Thesis Title: Video Context Understanding by Temporal Dependency Modelling (通過時間依賴性建模了解視頻上下文)
Advisors: Fang, Wen-Hsien (方文賢); Chen, Yie-Tarng (陳郁堂)
Committee Members: Chen, Yie-Tarng (陳郁堂); Lay, Kuen-Tsair (賴坤財); Leu, Jenq-Shiou (呂政修); Chiu, Chien-Ching (丘建青); Liao, Mark (廖弘源); Fuh, Chiou-Shan (傅楸善); Young, Jason (楊健生); Fang, Wen-Hsien (方文賢)
Degree: Doctor
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Publication Year: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 96
Keywords: first-person action recognition, Hilbert-Huang transform, anomaly detection, low-resolution videos, conditional random fields, multi-instance learning
Access counts: Hits: 187; Downloads: 0
Video context understanding has attracted increasing interest owing to its potential applications in a wide range of areas. However, analyzing context within videos is not a straightforward task because of factors such as camera movement, multiple viewpoints, low-resolution quality, illumination, occlusion, and inter-class variation. Meanwhile, learning temporal dependency has been shown to benefit video understanding, as videos contain not only spatial but also temporal information. This dissertation therefore develops a set of algorithms that effectively recognize human behaviors in a variety of scenarios by leveraging temporal dependency across video frames. We focus on three difficult yet important tasks: first-person action recognition, extremely low-resolution action recognition, and anomaly detection.
First, we present a framework for first-person action recognition that combines temporal pooling with the Hilbert–Huang transform (HHT). It first adaptively performs temporal sub-action localization, treats each channel of the extracted trajectory-pooled convolutional neural network (CNN) features as a time series, and summarizes the temporal dynamics within each sub-action by temporal pooling. The temporal evolution across sub-actions is then modeled by rank pooling. Thereafter, to account for the highly dynamic scene changes in first-person videos, the HHT is employed to decompose the rank pooling features into a finite, and often small, number of data-dependent functions, called intrinsic mode functions (IMFs), through empirical mode decomposition. Hilbert spectral analysis is then applied to each IMF component, and four salient descriptors are extracted and aggregated into the final video descriptor. Such a framework can not only precisely capture both long- and short-term tendencies, but also cope with the significant camera motion in first-person videos, thereby yielding better accuracy.
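As a rough illustration of two of these ingredients, the sketch below approximates rank pooling over a sequence of sub-action descriptors and then computes four simple statistics from the analytic signal of the resulting 1-D channel via the Hilbert transform. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the names `rank_pooling` and `hilbert_descriptors`, the ridge parameter `lam`, and the choice of the four statistics are illustrative, and the empirical mode decomposition step is omitted (the input is assumed to already behave like an IMF).

```python
# Minimal sketch (assumptions, not the dissertation's code): approximate rank
# pooling over sub-action descriptors, plus four Hilbert-based statistics of the
# resulting 1-D signal. Empirical mode decomposition is omitted here.

import numpy as np
from scipy.signal import hilbert

def rank_pooling(features: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Approximate rank pooling: fit w so that w . m_t grows with time t,
    where m_t is the time-varying mean of the descriptors (ridge least squares).
    features: (T, D) sub-action descriptors in temporal order."""
    T, D = features.shape
    means = np.cumsum(features, axis=0) / np.arange(1, T + 1)[:, None]
    targets = np.arange(1, T + 1, dtype=float)            # desired ranking scores
    w = np.linalg.solve(means.T @ means + lam * np.eye(D), means.T @ targets)
    return w                                               # (D,) evolution descriptor

def hilbert_descriptors(x: np.ndarray) -> np.ndarray:
    """Four illustrative statistics from the analytic signal of one channel:
    mean/std of instantaneous amplitude and of instantaneous frequency."""
    analytic = hilbert(x)
    amplitude = np.abs(analytic)
    inst_freq = np.diff(np.unwrap(np.angle(analytic))) / (2.0 * np.pi)
    return np.array([amplitude.mean(), amplitude.std(),
                     inst_freq.mean(), inst_freq.std()])

# Toy usage with random data standing in for trajectory-pooled CNN features.
rng = np.random.default_rng(0)
sub_actions = rng.normal(size=(8, 64))        # 8 sub-actions, 64-D descriptors
evolution = rank_pooling(sub_actions)         # video-evolution descriptor
descriptor = hilbert_descriptors(evolution)   # 4 salient statistics
```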
Second, we present a novel three-stream network for action recognition in extremely low-resolution (LR) videos. In contrast to existing networks, the new network uses a trajectory-spatial stream, which is robust against visual distortion, instead of pose information to complement the two-stream network. In addition, the three-stream network is combined with the inflated 3D ConvNet (I3D) model pre-trained on Kinetics to produce more discriminative spatio-temporal features in blurred LR videos. Moreover, a bidirectional self-attention network is aggregated with the three-stream network to further capture the diverse temporal dependencies among the spatio-temporal features. A new fusion strategy is also devised to integrate the information from the three modalities.
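To make the late-fusion idea concrete, the following is a minimal sketch assuming each of the three streams (appearance, optical flow, and trajectory-spatial) has already produced a clip-level feature from its own I3D-style backbone; the module simply learns softmax-normalized per-stream weights and classifies the weighted combination. `ThreeStreamFusion`, `feat_dim`, and `num_classes` are hypothetical names, and this is neither the thesis's exact fusion strategy nor its bidirectional self-attention module.

```python
# Minimal late-fusion sketch in PyTorch (assumed names, not the thesis's exact
# fusion strategy): each stream's backbone is assumed to have already produced a
# clip-level feature; learned softmax weights combine the three streams.

import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.stream_weights = nn.Parameter(torch.zeros(3))  # one weight per stream
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, flow, traj):
        # rgb, flow, traj: (batch, feat_dim) clip-level features per stream.
        stacked = torch.stack([rgb, flow, traj], dim=1)      # (B, 3, D)
        w = torch.softmax(self.stream_weights, dim=0)        # (3,) fusion weights
        fused = (w[None, :, None] * stacked).sum(dim=1)      # (B, D)
        return self.classifier(fused)                        # (B, num_classes)

# Toy usage with random features standing in for the three backbones' outputs.
model = ThreeStreamFusion(feat_dim=1024, num_classes=12)
rgb, flow, traj = (torch.randn(4, 1024) for _ in range(3))
scores = model(rgb, flow, traj)               # (4, 12) action scores
```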
Third, we present a novel weakly supervised approach for anomaly detection, which begins with a relation-aware feature extractor that captures multi-scale CNN features from a video. Afterwards, self-attention is integrated with conditional random fields (CRFs), the core of the network, to exploit the ability of self-attention to capture short-range correlations of the features and the ability of CRFs to learn the inter-dependencies of these features. Such a framework can learn not only the dynamic interactions among the actors, which are important for detecting complex movements, but also their short- and long-term dependencies across frames. In addition, to handle both local and non-local relationships of the features, a new variant of self-attention is developed that takes into consideration a set of cliques with different temporal localities. Moreover, a new loss function that takes advantage of the contrastive loss with multi-instance learning is employed to widen the gap between normal and abnormal samples, resulting in more accurate discrimination of anomalies. Finally, the framework is also extended to an online setting, which enables real-time, low-latency anomaly detection and can be deployed on resource-limited devices such as the Jetson Nano.
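As a rough sketch of the multi-instance idea behind such a loss, the snippet below implements a standard MIL margin objective with sparsity and smoothness terms (in the spirit of Sultani et al.), under the assumption that each video is a bag of per-segment anomaly scores and only video-level labels are available. `mil_margin_loss`, the margin of 1.0, and the 8e-5 weights are illustrative placeholders, not the dissertation's contrastive MIL objective.

```python
# Minimal multi-instance margin loss sketch in PyTorch (an illustrative stand-in,
# not the dissertation's contrastive MIL loss): each video is a bag of per-segment
# anomaly scores, and only a video-level normal/abnormal label is known.

import torch

def mil_margin_loss(abnormal_scores: torch.Tensor,
                    normal_scores: torch.Tensor,
                    margin: float = 1.0) -> torch.Tensor:
    """abnormal_scores / normal_scores: (num_segments,) scores in [0, 1] for one
    weakly labeled abnormal video and one normal video."""
    top_abn = abnormal_scores.max()          # hardest segment of the abnormal bag
    top_nor = normal_scores.max()            # hardest segment of the normal bag
    ranking = torch.clamp(margin - top_abn + top_nor, min=0.0)  # widen the gap
    sparsity = abnormal_scores.sum()         # anomalies should be rare in a video
    smoothness = (abnormal_scores[1:] - abnormal_scores[:-1]).pow(2).sum()
    return ranking + 8e-5 * sparsity + 8e-5 * smoothness

# Toy usage with random scores standing in for the network's per-segment outputs.
loss = mil_margin_loss(torch.rand(32), torch.rand(32))
```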



Table of Contents
Abstract
Related Publications
Acknowledgment
Table of Contents
List of Figures
List of Tables
Abbreviations
1 Introduction
  1.1 Problems and Challenges
  1.2 Motivation
  1.3 Contributions of this Dissertation
  1.4 Organization
2 Literature Review
  2.1 Human Activity Recognition
  2.2 Unsupervised Anomaly Detection
  2.3 Attention Mechanism
  2.4 Graphical Model
  2.5 Edge Computing
3 First-Person Action Recognition with Temporal Pooling and Hilbert-Huang Transform
  3.1 Introduction
  3.2 The Proposed Method
    3.2.1 Overall Methodology
    3.2.2 Trajectory-Pooled Feature Extraction
    3.2.3 Adaptive Sub-Action Interval Division
    3.2.4 Temporal Pooling Strategy
    3.2.5 Rank Pooling
    3.2.6 HHT-Based Video Descriptor
  3.3 Experimental Results
    3.3.1 Datasets
    3.3.2 Evaluation Protocol and Experimental Setup
    3.3.3 Parameter Setting
    3.3.4 Ablation Studies
    3.3.5 Comparison with the State-of-the-Art Methods
  3.4 Summary
4 Three-Stream Network with Bidirectional Self-Attention for Action Recognition in Extreme Low Resolution Videos
  4.1 Introduction
  4.2 Proposed Method
    4.2.1 Three-Stream Network
    4.2.2 Spatio-Temporal Feature Extraction
    4.2.3 Bidirectional Self-Attention Network
    4.2.4 New Fusion Strategy
  4.3 Experimental Results and Discussions
    4.3.1 Low Resolution Datasets and Trajectory-Spatial Images
    4.3.2 Experimental Setup and Evaluation Protocol
    4.3.3 Ablation Studies
    4.3.4 Comparison with the State-of-the-Art Methods
  4.4 Summary
5 Dance with Self-Attention: A New Look of Conditional Random Fields on Anomaly Detection in Videos and its Online Implementation
  5.1 Introduction
  5.2 Proposed Method
    5.2.1 Feature Extraction
    5.2.2 Self-Attention Conditional Random Fields
    5.2.3 Contrastive Multi-Instance Learning
  5.3 Online Setting
  5.4 Experimental Results
    5.4.1 Datasets and Evaluation Metric
    5.4.2 Implementation Details
    5.4.3 Ablation Studies
    5.4.4 Performance Analysis
    5.4.5 Comparison with the State-of-the-Art Works
  5.5 Summary
6 Conclusion and Future Works
  6.1 Conclusion
  6.2 Future Works
References
Biography

Full-text release date: 2024/09/23 (campus network)
Full-text release date: 2027/09/23 (off-campus network)
Full-text release date: 2027/09/23 (National Central Library: Taiwan NDLTD System)