Student: Didik Purwanto (狄騠克)
Thesis Title: Action Recognition in First-Person Videos Using Hilbert-Huang Transform for Temporal Aggregation
Advisors: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂)
Committee Members: Wen-Hsien Fang (方文賢), Yie-Tarng Chen (陳郁堂), Kuen-Tsair Lay (賴坤財), Chien-Ching Chiu (丘建青)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2017
Graduation Academic Year: 105 (ROC calendar)
Language: English
Number of Pages: 46
Keywords: temporal pyramid pooling, temporal aggregation, first-person video, action recognition, Hilbert-Huang transform
Views: 276; Downloads: 0

This thesis presents a new convolutional neural network (CNN)-based approach for first-person video action recognition.
The approach aggregates both the short- and long-term trends of a video
from its CNN features using the Hilbert-Huang transform (HHT), a well-known time-frequency analysis tool. With HHT, the CNN feature sequence in each channel is first decomposed via empirical mode decomposition (EMD) into a set of intrinsic mode functions (IMFs). The Hilbert transform is then applied to the IMFs and analyzed to obtain a more precise feature representation. The resulting scheme facilitates the extraction of the salient features of activities in first-person videos. Moreover, to boost recognition performance, a new key frame selection scheme is introduced to discard redundant frames and simultaneously prune out noisy features. Simulations show that the proposed method outperforms the main state-of-the-art works on several widely used datasets.
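For illustration, the sketch below shows how one channel of a per-frame CNN feature trajectory could be aggregated with the Hilbert-Huang transform: EMD splits the trajectory into IMFs, and the Hilbert transform of each IMF yields instantaneous amplitude and frequency that are pooled into a fixed-length descriptor. This is a minimal sketch under stated assumptions, not the thesis implementation: it assumes the third-party PyEMD package (installed as EMD-signal) and SciPy, the function name hht_channel_descriptor is hypothetical, and the summary-statistic pooling is only an illustrative choice.

# Minimal sketch of HHT-based temporal aggregation for one CNN feature channel.
# Assumes: pip install numpy scipy EMD-signal  (PyEMD provides the EMD sifting).
import numpy as np
from PyEMD import EMD               # empirical mode decomposition (third-party)
from scipy.signal import hilbert    # analytic signal for the Hilbert transform

def hht_channel_descriptor(x, fps=30.0, max_imf=4):
    """Aggregate a 1-D per-frame feature trajectory x into a fixed-length vector."""
    x = np.asarray(x, dtype=float)
    imfs = EMD().emd(x, max_imf=max_imf)          # (n_imfs, T) intrinsic mode functions
    feats = []
    for imf in imfs:
        analytic = hilbert(imf)                   # Hilbert transform -> analytic signal
        amp = np.abs(analytic)                    # instantaneous amplitude
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) * fps / (2 * np.pi)   # instantaneous frequency (Hz)
        # Illustrative pooling: summary statistics of amplitude and frequency.
        # In practice a fixed subset of IMFs would be selected so that the
        # descriptor length is constant across videos.
        feats.extend([amp.mean(), amp.std(), inst_freq.mean(), inst_freq.std()])
    return np.asarray(feats)

# Usage example: a noisy per-frame activation from a hypothetical 200-frame clip.
t = np.arange(200) / 30.0
channel = np.sin(2 * np.pi * 1.5 * t) + 0.3 * np.random.randn(t.size)
print(hht_channel_descriptor(channel).shape)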


Table of Contents:
Abstract
Related Publications
Acknowledgment
Table of Contents
List of Figures
List of Tables
1 Introduction
  1.1 Action Recognition on First-Person Video
  1.2 Thesis Outline
2 Related Work
3 Proposed Method
  3.1 Overall Methodology
  3.2 Key Frame Selection Scheme
  3.3 Trajectory-Aligned CNN Features
  3.4 Temporal Aggregation Using HHT
    3.4.1 Empirical Mode Decomposition
    3.4.2 Hilbert Transform Analysis
    3.4.3 Feature Extraction
4 Experiments and Results
  4.1 Datasets
  4.2 Evaluation Protocol and Experimental Setup
  4.3 Assessment of Key Frame Selection Scheme
  4.4 Performance Evaluation of Temporal Aggregation via HHT
    4.4.1 IMF Selection
    4.4.2 Feature Evaluation
    4.4.3 Performance Assessment
  4.5 Comparison with State-of-the-Art Methods
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
References
Appendix A: Dataset
Biography
