
Graduate Student: 吳博元 (Po-Yuan Wu)
Thesis Title: 透過文本內文、語音和說話者身分之三模態線索生成用於健康照護機器人的對話手勢
(Generation of Co-Speech Gestures of a Health Care Robot from Trimodal Cues: Contents of Text, Speech, and Speaker Identity)
Advisor: 范欽雄 (Chin-Shyurng Fahn)
Committee Members: 馮輝文, 王榮華, 鄭為民
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduating Academic Year: 111 (2022-2023)
Language: English
Number of Pages: 47
Keywords: deep learning, trimodal cue gesture generation, TED gesture dataset, generative adversarial networks, care robots
    In recent years, population aging has become an issue faced by countries all over the world, and elderly-care policies have gradually received more attention. This is especially true in Taiwan, where life expectancy is higher than the global average, which makes elderly care all the more important; and because the caregiving workforce cannot keep up with the proportion of elderly people who need long-term care, the development of care robots has been pushed forward. The purpose and motivation of this thesis is to use deep learning to train a model on the speech audio, speech content, and movements of international speakers, in order to generate the body gestures of a care robot while it speaks. Given audio and text as input, the trained model generates the corresponding positions of the robot's joints, so that the robot can express what it says more vividly when interacting with the elderly.
    There are still few studies that generate gestures with deep learning methods. Some of them process audio, extracting features with traditional methods or convolutional neural networks; others extract features through semantic analysis of the speech content and then generate gestures with a long short-term memory architecture; few use generative adversarial networks to generate gestures. We instead generate gestures from three modal cues: audio, speech content, and speaker identity. We train three different neural networks to extract the features of each modality, and then build a generative adversarial network, in which the generator produces gestures from the extracted features and the discriminator distinguishes generated gestures from real ones. Through this adversarial training, the generator comes to produce near-realistic gestures.
    As for the experimental results, we observe and analyze the generated gestures against the actual gestures with three evaluation metrics: the mean absolute error (MAE) of joint positions, the mean acceleration distance (MAD), and the accelerated Fréchet gesture distance (FGD), and compare them with several existing outstanding gesture generation models. On the TED gesture dataset, the experimental data show that, compared with Sp2AG, the current state-of-the-art gesture generation model, our generative model lowers the position error, the MAE of the joint positions, by 7.88%, and lowers the distance offset, the MAD of the gestures, by 10.23%, as well as the accelerated FGD by 8.75%.


    In recent years, population aging has become an issue in countries all over the world, and elderly-care policies have received growing attention. In particular, life expectancy in Taiwan is higher than the average life expectancy of most countries, which makes elderly care all the more important. Because the proportion of elderly people who need long-term care exceeds what the caregiving workforce can support, the development of care robots has been driven forward. The purpose and motivation of this thesis is to build a deep learning model and train it on trimodal cues: the contents of text, speech, and speaker identity. The trained model can generate corresponding gestures from these inputs, so that a care robot can express gestures more vividly during human-robot interaction.
    So far, there have been few studies on generating robot gestures with deep learning methods. In their data preprocessing, one group of methods only extracts audio features, while another only analyzes the semantics of the text content, and the model architecture is usually built on long short-term memory networks. In this thesis, we propose a generative model based on generative adversarial networks that generates gestures from the trimodal cues. We train three different neural networks to extract the features of each modality separately, and then build a generative adversarial network, in which the generator produces gestures from the extracted features and the discriminator judges whether a gesture is generated or real. After training, the generator can generate near-realistic gestures.
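    This record contains no code, so the sketch below is only a minimal illustration, written in PyTorch, of how a Bi-GRU generator could fuse the three modal features and how a recurrent discriminator could score real versus generated pose sequences. All class names, feature dimensions, frame counts, and layer choices here are assumptions made for illustration, not the thesis's actual network.

```python
# Minimal illustrative sketch (assumed PyTorch), not the thesis implementation:
# a Bi-GRU generator fusing text, audio, and speaker-identity features,
# and a recurrent discriminator that scores pose sequences as real or generated.
import torch
import torch.nn as nn

class TrimodalGenerator(nn.Module):
    def __init__(self, text_dim=300, audio_dim=128, speaker_dim=16,
                 hidden_dim=256, pose_dim=27):  # pose_dim: e.g., 9 joints x 3 coordinates
        super().__init__()
        fused_dim = text_dim + audio_dim + speaker_dim
        self.gru = nn.GRU(fused_dim, hidden_dim, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, pose_dim)  # 2x for the two GRU directions

    def forward(self, text_feat, audio_feat, speaker_feat):
        # text_feat:    (B, T, text_dim)   per-frame word-embedding features
        # audio_feat:   (B, T, audio_dim)  per-frame audio features
        # speaker_feat: (B, speaker_dim)   one style vector per clip
        T = text_feat.size(1)
        speaker_seq = speaker_feat.unsqueeze(1).expand(-1, T, -1)
        fused = torch.cat([text_feat, audio_feat, speaker_seq], dim=-1)
        hidden, _ = self.gru(fused)
        return self.out(hidden)  # (B, T, pose_dim): one pose per frame

class PoseDiscriminator(nn.Module):
    def __init__(self, pose_dim=27, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden_dim, 1)

    def forward(self, poses):  # poses: (B, T, pose_dim)
        hidden, _ = self.gru(poses)
        return torch.sigmoid(self.cls(hidden[:, -1]))  # real/fake score per sequence
```

    In the usual adversarial setup for co-speech gesture generation, a generator like this would be trained both to fool the discriminator and to regress toward the ground-truth poses (for example with an added L1 reconstruction term), which matches the interplay between generator and discriminator described above.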
    From the experimental results, we observe and analyze the differences between the generated gestures and the actual gestures using three evaluation metrics: the mean absolute error (MAE) of joint coordinates, the mean acceleration distance (MAD) of gestures, and the accelerated Fréchet gesture distance (FGD), and we then compare the performance with several existing outstanding generative models. Compared with Sp2AG, the state-of-the-art gesture generation model, our model lowers the position error, measured by the MAE of the joint coordinates, by 7.88%; the distance offset, measured by the MAD of the gestures, by 10.23%; and the accelerated FGD by 8.75%.
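    For readers who want a concrete reading of the three metrics, the snippet below shows one plausible way to compute them for pose sequences shaped (frames, joints, 3). The exact metric definitions and the feature extractor used for the accelerated FGD are not given in this record, so the Gaussian-fit Fréchet distance below (the same closed form used for FID) and the frame rate are illustrative assumptions.

```python
# Illustrative computation of MAE, MAD, and a Fréchet-style distance for gestures.
# Assumed shapes: pred, real are numpy arrays of shape (T, J, 3) = frames x joints x 3D.
import numpy as np
from scipy import linalg

def joint_mae(pred, real):
    """Mean absolute error over all joint coordinates."""
    return np.mean(np.abs(pred - real))

def mean_acceleration_distance(pred, real, fps=15):
    """Average distance between the per-joint accelerations of two motions (fps assumed)."""
    acc_pred = np.diff(pred, n=2, axis=0) * fps * fps  # second-order finite differences
    acc_real = np.diff(real, n=2, axis=0) * fps * fps
    return np.mean(np.linalg.norm(acc_pred - acc_real, axis=-1))

def frechet_distance(feat_gen, feat_real):
    """Fréchet distance between Gaussians fitted to generated and real feature sets.

    feat_gen, feat_real: (N, D) arrays of latent features, e.g., from a pretrained
    motion autoencoder applied to (accelerated) gesture clips.
    """
    mu1, mu2 = feat_gen.mean(axis=0), feat_real.mean(axis=0)
    cov1 = np.cov(feat_gen, rowvar=False)
    cov2 = np.cov(feat_real, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean))
```

    Lower values are better for all three metrics, which is why the reported reductions of 7.88%, 10.23%, and 8.75% relative to Sp2AG represent improvements.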

    Contents
    Chinese Abstract
    Abstract
    Acknowledgments (in Chinese)
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 System Description
      1.4 Thesis Organization
    Chapter 2 Related Work
      2.1 Literature Review
      2.2 Deep Neural Networks
        2.2.1 Multi-Layer Perceptron
        2.2.2 Recurrent Neural Networks
        2.2.3 Self-Attention Mechanism
        2.2.4 Transformer
        2.2.5 Squeeze and Excitation Networks
    Chapter 3 Deep-Learning-Based Gesture Generation Method
      3.1 Data Preprocessing
        3.1.1 Text Processing
        3.1.2 Audio Processing
        3.1.3 Speaker Identity Style Sampling
      3.2 Generative Adversarial Networks Model
        3.2.1 Network Architecture
        3.2.2 Bi-GRU Block Architecture
        3.2.3 Searching for Weighted Coefficients in Losses
    Chapter 4 Experimental Results and Discussion
      4.1 Datasets
      4.2 Experimental Environment and Training Details
      4.3 Data Visualization
      4.4 Training and Validation Results and Analysis
      4.5 Testing Results and Analysis
      4.6 Comparison with Baseline Methods
    Chapter 5 Conclusions and Future Work
      5.1 Conclusions and Contributions
      5.2 Future Work
    References

    References
    [1] World Health Organization. World Health Statistics Overview 2019: Monitoring Health for the Sustainable Development Goals (SDGs). 2019. Available online: https://apps.who.int/iris/bitstream/handle/10665/311696/WHO-DAD-2019.1-eng.pdf (accessed on 8 August 2022).
    [2] Ministry of Health and Welfare. 2018 Taiwan Health and Welfare Report. 2018. Available online: https://www.mohw.gov.tw/cp-137-47558-2.html (accessed on 8 August 2022).
    [3] N. Sadoughi and C. Busso. “Speech-driven animation with meaningful behaviors,” Speech Communication, vol. 110, pp. 90-100, 2019.
    [4] C.-M. Huang and B. Mutlu. “Learning-based modeling of multimodal behaviors for humanlike robots,” in Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, Bielefeld, Germany, 2014, pp. 57-64.
    [5] M. Kipp. Gesture Generation by Imitation: From Human Behavior to Computer Character Animation, Universal-Publishers: Irvine, California, 2005.
    [6] S. Levine, P. Krahenbuhl, S. Thrun, and V. Koltun. “Gesture controllers,” Transactions on Graphics, vol. 29, no. 4, pp. 1-11, 2010.
    [7] Y. Ferstl, M. Neff, and R. McDonnell. “Multi-Objective adversarial gesture generation,” in Proceedings of the ACM SIGGRAPH Conference on Motion, Interaction and Games, Newcastle Upon Tyne, United Kingdom, 2019, pp. 1-10.
    [8] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik. “Learning individual styles of conversational gesture,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, 2019, pp. 3497-3506.
    [9] S. Alexanderson, G. E. Henter, T. Kucherenko, and J. Beskow. “Style-controllable speech-driven gesture synthesis using normalizing flows,” Computer Graphics Forum, vol. 39, no. 2, pp. 487-496, 2020.
    [10] Y. Yoon, W.-R. Ko, M. Jang, J. Lee, J. Kim, and G. Lee. “Robots learn social skills: end-to-end learning of co-speech gesture generation for humanoid robots,” in Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, Canada, 2019, pp. 4303-4309.
    [11] T. Kucherenko, P. Jonell, S. van Waveren, G. E. Henter, S. Alexanderson, I. Leite, and H. Kjellström. “Gesticulator: a framework for semantically-aware speech-driven gesture generation,” in Proceedings of the ACM International Conference on Multimodal Interaction, Utrecht, Netherlands, 2020, pp. 242-250.
    [12] A. B. Hostetter and A. L. Potthoff. “Effects of personality and social situation on representational gesture production,” Gesture, vol. 12, no. 1, pp. 62-83, 2012.
    [13] T. Baltrušaitis, C. Ahuja, and L.-P. Morency. “Multimodal machine learning: a survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423-443, 2018.
    [14] C. Ahuja and L.-P. Morency. “Language2Pose: natural language grounded pose forecasting,” in Proceedings of the IEEE International Conference on 3D Vision, Quebec City, Canada, 2019, pp. 719-728.
    [15] M. Roddy, G. Skantze, and N. Harte. “Multimodal continuous turn-taking prediction using multiscale RNNs,” in Proceedings of the ACM International Conference on Multimodal Interaction, Boulder, Colorado, 2018, pp. 186-190.
    [16] D. Bahdanau, K. Cho, and Y. Bengio. “Neural machine translation by jointly learning to align and translate,” arXiv:1409.0473, 2015.
    [17] A. Aristidou, E. Stavrakis, P. Charalambous, Y. Chrysanthou, and S. Loizidou Himona. “Folk dance evaluation using Laban movement analysis,” Computing and Cultural Heritage, vol. 8, no. 4, pp. 1-19, 2015.
    [18] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. “Improved techniques for training GANs,” in Proceedings of the Conference on Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 2234-2242.
    [19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proceedings of the Conference on Neural Information Processing Systems, Long Beach, California, 2017, pp. 6626-6637.
    [20] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi. “Fréchet audio distance: a metric for evaluating music enhancement algorithms,” arXiv:1812.08466, 2018.
    [21] L. Medsker and L. C. Jain. Recurrent Neural Networks: Design and Applications, CRC Press: Boca Raton, Florida, 1999.
    [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. “Attention is all you need,” in Proceedings of the Conference on Neural Information Processing Systems, Long Beach, California, 2017, pp. 5998-6008.
    [23] J. Hu, L. Shen, and G. Sun. “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, 2018, pp. 7132-7141.
    [24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805, 2018.
    [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. “Generative adversarial nets,” in Proceedings of the Conference on Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672-2680.
    [26] Y. Yoon, B. Cha, J.-H. Lee, M. Jang, J. Lee, J. Kim, and G. Lee. “Speech gesture generation from the trimodal context of text, audio, and speaker identity,” Transactions on Graphics, vol. 39, no. 6, pp. 1-16, 2020.
    [27] U. Bhattacharya, E. Childs, N. Rewkowski, and D. Manocha. “Speech2affectivegestures: synthesizing co-speech gestures with generative adversarial affective expression learning,” in Proceedings of the ACM International Conference on Multimedia, Chengdu, China, 2021, pp. 2027-2036.

    Full text available to the public from 2033/02/06 (campus network, off-campus network, and National Central Library: Taiwan Dissertations and Theses System).