
Graduate Student: Cheng-Hung Li (李承紘)
Thesis Title: A Co-speech Emotive Gesture Generation Method Based on Autoencoders and Generative Adversarial Networks
Advisor: Chin-Shyurng Fahn (范欽雄)
Committee Members: Shaou-Gang Miaou (繆紹綱), Jung-Hua Wang (王榮華), Huei-Wen Ferng (馮輝文), Chin-Shyurng Fahn (范欽雄)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar, 2022-2023)
Language: English
Number of Pages: 57
Keywords: Deep learning, Generative AI, Co-speech emotive gestures, Generative adversarial networks, Multiple feature fusion
Abstract:
    In recent years, deep learning and generative AI have sparked discussion on many fronts, one of which is research on gesture generation. Its applications include robotics and virtual characters: generating authentic, natural gestures enriches the way humans interact with robots, and realistic gestures make interaction with virtual characters more immersive and believable. The purpose and motivation of this thesis are therefore to leverage deep learning and combine audio, text, and gesture features to train a gesture generation model. The model's objective is to generate coherent, plausible sequences of co-speech emotive gestures and apply them to virtual characters, so that the characters can display gestures that match the speech and its emotion, making them livelier and better able to interact with users.
    Most previous methods generate gestures from a single feature, such as text or speech alone, which limits how much information they can draw on. We therefore propose a method that fuses multiple features to capture context more effectively and produce gestures that closely resemble real ones. For the experiments, we use the TED gesture dataset and evaluate our model with three metrics: mean absolute joint error (MAJE), mean acceleration difference (MAD), and Fréchet gesture distance (FGD), for all of which lower is better. Our model achieves 22.95 MAJE, 2.57 MAD, and 3.85 FGD. Compared with Sp2AG, a recent state-of-the-art gesture generation model, these results correspond to reductions of 10.63% in MAJE, 13.18% in MAD, and 12.30% in FGD.
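    The abstract names the fused modalities but not the network itself. Purely as an illustration of what fusing audio, text, and pose features in an adversarial setup can look like, here is a minimal PyTorch sketch; it is not the architecture described in the thesis, and every dimension and layer choice (the 27-dimensional pose vector, GRU encoders, a single fusion layer) is an assumption.

        import torch
        import torch.nn as nn

        class GestureGenerator(nn.Module):
            """Toy generator: concatenates per-frame audio, text, and seed-pose
            features, fuses them with a linear layer, and decodes a pose
            sequence with a GRU. All sizes are hypothetical."""
            def __init__(self, audio_dim=32, text_dim=300, pose_dim=27, hidden=256):
                super().__init__()
                self.fuse = nn.Linear(audio_dim + text_dim + pose_dim, hidden)
                self.decoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
                self.out = nn.Linear(hidden, pose_dim)

            def forward(self, audio, text, seed_pose):
                # audio: (B, T, audio_dim), text: (B, T, text_dim), seed_pose: (B, T, pose_dim)
                fused = torch.relu(self.fuse(torch.cat([audio, text, seed_pose], dim=-1)))
                hidden_states, _ = self.decoder(fused)
                return self.out(hidden_states)            # (B, T, pose_dim)

        class GestureDiscriminator(nn.Module):
            """Toy discriminator: scores how plausible a pose sequence looks."""
            def __init__(self, pose_dim=27, hidden=256):
                super().__init__()
                self.encoder = nn.GRU(pose_dim, hidden, batch_first=True, bidirectional=True)
                self.score = nn.Linear(2 * hidden, 1)

            def forward(self, poses):
                hidden_states, _ = self.encoder(poses)
                return torch.sigmoid(self.score(hidden_states[:, -1]))  # (B, 1)

        # Shape check with random tensors (batch of 2, 34 frames, hypothetical sizes):
        # g, d = GestureGenerator(), GestureDiscriminator()
        # fake = g(torch.randn(2, 34, 32), torch.randn(2, 34, 300), torch.randn(2, 34, 27))
        # print(fake.shape, d(fake).shape)  # torch.Size([2, 34, 27]) torch.Size([2, 1])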

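    The metric definitions are likewise not shown on this page. The NumPy/SciPy sketch below shows one common way such metrics are computed, assuming pose sequences shaped (frames, joints, 3), a hypothetical frame rate for the acceleration term, and latent features from a pretrained gesture autoencoder for FGD.

        import numpy as np
        from scipy.linalg import sqrtm

        def maje(pred, real):
            """Mean absolute error over all joint coordinates and frames."""
            return float(np.mean(np.abs(pred - real)))

        def mad(pred, real, fps=15):
            """Mean absolute difference of accelerations, where acceleration is a
            second-order finite difference of joint positions (fps is assumed)."""
            dt = 1.0 / fps
            acc_pred = np.diff(pred, n=2, axis=0) / dt ** 2
            acc_real = np.diff(real, n=2, axis=0) / dt ** 2
            return float(np.mean(np.abs(acc_pred - acc_real)))

        def fgd(feat_real, feat_gen):
            """Frechet distance between Gaussians fitted to latent gesture features
            of shape (samples, dim), e.g. from a pretrained gesture autoencoder."""
            mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
            cov_r = np.cov(feat_real, rowvar=False)
            cov_g = np.cov(feat_gen, rowvar=False)
            cov_mean = sqrtm(cov_r @ cov_g)
            if np.iscomplexobj(cov_mean):
                cov_mean = cov_mean.real          # discard numerical noise
            return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * cov_mean))

    Under these definitions, lower values mean the generated motion tracks the ground truth more closely, which is consistent with the reductions reported above relative to Sp2AG.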
Table of Contents:
    Chinese Abstract i
    Abstract ii
    Acknowledgments iv
    List of Figures vii
    List of Tables viii
    Chapter 1 Introduction 1
      1.1 Overview 1
      1.2 Motivation 2
      1.3 System Description 5
      1.4 Thesis Organization 6
    Chapter 2 Related Work 8
      2.1 Literature Review 8
        2.1.1 Gesture generation for artificial agents 8
        2.1.2 Data-driven gesture generation methods 9
        2.1.3 Gesture generation methods with multimodal data 10
        2.1.4 Evaluating generative models 11
      2.2 Review of Neural Networks 12
        2.2.1 Artificial neural networks 12
        2.2.2 Long short-term memory 14
        2.2.3 Autoencoder 15
        2.2.4 Generative adversarial networks 16
    Chapter 3 Our Co-speech Emotive Gesture Generation Method 18
      3.1 Feature Extraction 18
        3.1.1 Audio feature processing 18
        3.1.2 Text feature processing 20
        3.1.3 Gesture feature processing 22
      3.2 Network Architecture 23
        3.2.1 Gesture generator 24
        3.2.2 Gesture discriminator 26
      3.3 Loss Functions and Optimizer 27
    Chapter 4 Experimental Results and Discussion 30
      4.1 Experimental Setup 30
        4.1.1 Developing environment setup 31
        4.1.2 Training dataset 32
        4.1.3 Evaluation metrics 33
      4.2 Analysis of Training Results 35
      4.3 Analysis of Testing Results 37
        4.3.1 The prediction results 37
        4.3.2 Comparison with other models 45
      4.4 Ablation Study 48
      4.5 Applications 49
    Chapter 5 Conclusions and Future Work 53
      5.1 Conclusions 53
      5.2 Future Work 54
    References 55


    Full text available from 2033/08/02 (campus network)
    Full text available from 2033/08/02 (off-campus network)
    Full text available from 2033/08/02 (National Central Library: Taiwan NDLTD system)