Student: Yun-Cheng Hu (胡允誠)
Thesis Title: Real-Time Facial Expression Recognition Using Hybrid Deep Learning (運用混合式深度學習方法於即時人臉表情辨識)
Advisor: Jiann-Jone Chen (陳建中)
Committee Members: Hsueh-Ming Hang (杭學鳴), Kuo-Liang Chung (鍾國亮), Yi-Leh Wu (吳怡樂), Kai-Lung Hua (花凱龍), Jiann-Jone Chen (陳建中)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2018
Graduation Academic Year: 107
Language: Chinese
Number of Pages: 82
Keywords (Chinese): Deep Learning, Neural Network, Convolution, Expression Recognition, Real-Time Recognition
Keywords (English): Deep Learning, CNN, Convolution, Neural Network, Real-Time, Facial Expression Recognition
Facial expression recognition is one of the most popular research areas in computer vision and remains a challenging problem. As machine learning algorithms have grown more capable, deep learning applications have become increasingly widespread, and applying deep learning to facial expression recognition can significantly improve accuracy. Conventional machine learning methods fine-tune training parameters against one specific database so that the trained model performs best on that database. Deep learning not only achieves better training results, but the neural network can also learn autonomously, discovering the features of the training data on its own, so there is no need to hand-design a feature extraction model for the dataset at hand. However, the training process required by deep learning is computationally heavy and time-consuming: training an early convolutional neural network (CNN) architecture on the FER2013 facial expression database, for example, involves roughly 20 million parameters, and after repeated rounds of parameter fine-tuning, the total training time becomes considerable. This thesis designs deep learning networks based on Google's Inception and Xception architectures and proposes a concatenation method. The first network begins with two convolutional layers, each followed by a max-pooling layer, and then stacks three Inception modules. The second network also begins with two convolutional layers, followed by three Xception modules. These two architectures improve recognition accuracy while reducing computation, and a merge layer finally concatenates the two. Experimental results show that the proposed method achieves 70.1% facial expression recognition accuracy. To verify its practical effectiveness, we developed an integrated real-time facial expression recognition system, which demonstrates good real-time performance at 6 to 7 frames per second, i.e., only 0.143 seconds to recognize one image. In addition, this thesis also proposes a database updating method that lets users build new samples and adjust the database according to the needs of the current scenario.
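Much of the computational saving in Xception-style modules comes from depthwise separable convolutions, which factor a standard convolution into a per-channel spatial filter followed by a 1x1 pointwise mix. A minimal sketch of the parameter-count comparison follows; the kernel and channel sizes are hypothetical, chosen for illustration, and are not the thesis's actual layer configuration:

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution: every output channel
    filters every input channel."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise separable convolution: one k x k spatial filter per
    input channel, then a 1x1 pointwise convolution to mix channels."""
    return k * k * c_in + c_in * c_out

# Illustrative layer: 3x3 kernel, 128 -> 256 channels (hypothetical sizes).
std = standard_conv_params(3, 128, 256)   # 294,912 weights
sep = separable_conv_params(3, 128, 256)  # 33,920 weights
print(std, sep, round(std / sep, 1))      # roughly an 8.7x reduction
```

For a 3x3 kernel, the separable form needs nearly an order of magnitude fewer weights at these channel widths, which is the kind of saving that makes the hybrid network cheaper to train and to run in real time.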
Facial expression recognition has become an important and popular topic in computer vision applications. With advances in machine learning technology, deep learning methods can be developed to improve facial expression recognition performance. In general, one imposes a machine-learning module on a specific database and fine-tunes system parameters to yield the best classification performance. Deep learning algorithms can further boost this ability when dealing with new databases: by combining a neural network model with the machine-learning framework, deep learning possesses self-learning capability, so when the dataset changes it learns and classifies features on its own. However, deep learning requires a time-consuming training process. For example, training a traditional convolutional neural network (CNN) on the FER2013 dataset would produce about 20 million parameters requiring adjustment. In this research, we utilized two deep learning network frameworks, Google's Inception and Xception, and proposed a method to concatenate the two frameworks in a merge layer. Experiments showed that the proposed method achieves 70.1% recognition accuracy. Based on the training results, we integrated face detection and facial expression recognition modules into a real-time facial expression recognition system, which needs only 0.143 seconds to recognize one frame. In addition, we proposed a database updating method that lets users build new training data when facing new testing data, so that the system can still yield good real-time recognition performance under different circumstances.
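As a quick sanity check on the reported throughput, a per-frame recognition time of 0.143 seconds corresponds to roughly 7 frames per second, consistent with the 6-to-7 fps figure quoted for the real-time system:

```python
seconds_per_frame = 0.143          # reported recognition time per image
fps = 1.0 / seconds_per_frame      # frames processed per second
print(round(fps, 2))               # ~6.99, i.e. the reported 6-7 fps
```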