
Graduate Student: 吳博元 (Po-Yuan Wu)
Thesis Title: 透過文本內文、語音和說話者身分之三模態線索生成用於健康照護機器人的對話手勢
(Generation of Co-Speech Gestures of a Health Care Robot from Trimodal Cues: Contents of Text, Speech, and Speaker Identity)
Advisor: 范欽雄 (Chin-Shyurng Fahn)
Committee Members: 馮輝文, 王榮華, 鄭為民
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2023
Graduating Academic Year: 111 (2022-2023)
Language: English
Number of Pages: 47
Keywords: deep learning, trimodal cue gesture generation, TED gesture dataset, generative adversarial networks, care robots
    In recent years, population aging has become an issue faced by countries all over the world, and elderly-care policies have gradually received more attention. This is especially true in Taiwan, where life expectancy is higher than the global average, which makes elderly care all the more important; and because the caregiving workforce cannot keep up with the proportion of elderly people who need long-term care, the development of care robots has been pushed forward. The purpose and motivation of this thesis is to use deep learning to train a model on the speech audio, speech content, and movements of international speakers, in order to generate the body gestures of a care robot while it speaks. Given audio and text as input, the trained model generates the corresponding positions of the robot's joints, so that the robot can express what it says more vividly when interacting with the elderly.
    There are still few studies that generate gestures with deep learning methods. Some of them process audio, extracting features with traditional methods or convolutional neural networks; others extract features through semantic analysis of the speech content and then generate gestures with a long short-term memory architecture; few use generative adversarial networks to generate gestures. We instead generate gestures from three modal cues: audio, speech content, and speaker identity. We train three different neural networks to extract the features of each modality, and then build a generative adversarial network, in which the generator produces gestures from the extracted features and the discriminator distinguishes generated gestures from real ones. Through this adversarial training, the generator comes to produce near-realistic gestures.
    As for the experimental results, we observe and analyze the generated gestures against the actual gestures with three evaluation metrics: the mean absolute error (MAE) of joint positions, the mean acceleration distance (MAD), and the accelerated Fréchet gesture distance (FGD), and compare them with several existing outstanding gesture generation models. On the TED gesture dataset, the experimental data show that, compared with Sp2AG, the current state-of-the-art gesture generation model, our generative model lowers the position error, the MAE of the joint positions, by 7.88%, and lowers the distance offset, the MAD of the gestures, by 10.23%, as well as the accelerated FGD by 8.75%.


    In recent years, population aging has become an issue in countries all over the world, and elderly-care policies have received growing attention. In particular, life expectancy in Taiwan is higher than the average life expectancy of most countries, which makes elderly care all the more important. Because the proportion of elderly people who need long-term care exceeds what the caregiving workforce can support, the development of care robots has been driven forward. The purpose and motivation of this thesis is to build a deep learning model and train it on trimodal cues: the contents of text, speech, and speaker identity. The trained model can generate corresponding gestures from these inputs, so that a care robot can express gestures more vividly during human-robot interaction.
    So far, there have been few studies on generating robot gestures with deep learning methods. In their data preprocessing, one group of methods only extracts audio features, while another only analyzes the semantics of the text content, and the model architecture is usually built on long short-term memory networks. In this thesis, we propose a generative model based on generative adversarial networks that generates gestures from the trimodal cues. We train three different neural networks to extract the features of each modality separately, and then build a generative adversarial network, in which the generator produces gestures from the extracted features and the discriminator judges whether a gesture is generated or real. After training, the generator can generate near-realistic gestures.
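    This record contains no code, so the sketch below is only a minimal illustration, written in PyTorch, of how a Bi-GRU generator could fuse the three modal features and how a recurrent discriminator could score real versus generated pose sequences. All class names, feature dimensions, frame counts, and layer choices here are assumptions made for illustration, not the thesis's actual network.

```python
# Minimal illustrative sketch (assumed PyTorch), not the thesis implementation:
# a Bi-GRU generator fusing text, audio, and speaker-identity features,
# and a recurrent discriminator that scores pose sequences as real or generated.
import torch
import torch.nn as nn

class TrimodalGenerator(nn.Module):
    def __init__(self, text_dim=300, audio_dim=128, speaker_dim=16,
                 hidden_dim=256, pose_dim=27):  # pose_dim: e.g., 9 joints x 3 coordinates
        super().__init__()
        fused_dim = text_dim + audio_dim + speaker_dim
        self.gru = nn.GRU(fused_dim, hidden_dim, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, pose_dim)  # 2x for the two GRU directions

    def forward(self, text_feat, audio_feat, speaker_feat):
        # text_feat:    (B, T, text_dim)   per-frame word-embedding features
        # audio_feat:   (B, T, audio_dim)  per-frame audio features
        # speaker_feat: (B, speaker_dim)   one style vector per clip
        T = text_feat.size(1)
        speaker_seq = speaker_feat.unsqueeze(1).expand(-1, T, -1)
        fused = torch.cat([text_feat, audio_feat, speaker_seq], dim=-1)
        hidden, _ = self.gru(fused)
        return self.out(hidden)  # (B, T, pose_dim): one pose per frame

class PoseDiscriminator(nn.Module):
    def __init__(self, pose_dim=27, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.cls = nn.Linear(2 * hidden_dim, 1)

    def forward(self, poses):  # poses: (B, T, pose_dim)
        hidden, _ = self.gru(poses)
        return torch.sigmoid(self.cls(hidden[:, -1]))  # real/fake score per sequence
```

    In the usual adversarial setup for co-speech gesture generation, a generator like this would be trained both to fool the discriminator and to regress toward the ground-truth poses (for example with an added L1 reconstruction term), which matches the interplay between generator and discriminator described above.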
    From the experimental results, we observe and analyze the differences between the generated gestures and the actual gestures using three evaluation metrics: the mean absolute error (MAE) of joint coordinates, the mean acceleration distance (MAD) of gestures, and the accelerated Fréchet gesture distance (FGD), and we then compare the performance with several existing outstanding generative models. Compared with Sp2AG, the state-of-the-art gesture generation model, our model lowers the position error, measured by the MAE of the joint coordinates, by 7.88%; the distance offset, measured by the MAD of the gestures, by 10.23%; and the accelerated FGD by 8.75%.
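    For readers who want a concrete reading of the three metrics, the snippet below shows one plausible way to compute them for pose sequences shaped (frames, joints, 3). The exact metric definitions and the feature extractor used for the accelerated FGD are not given in this record, so the Gaussian-fit Fréchet distance below (the same closed form used for FID) and the frame rate are illustrative assumptions.

```python
# Illustrative computation of MAE, MAD, and a Fréchet-style distance for gestures.
# Assumed shapes: pred, real are numpy arrays of shape (T, J, 3) = frames x joints x 3D.
import numpy as np
from scipy import linalg

def joint_mae(pred, real):
    """Mean absolute error over all joint coordinates."""
    return np.mean(np.abs(pred - real))

def mean_acceleration_distance(pred, real, fps=15):
    """Average distance between the per-joint accelerations of two motions (fps assumed)."""
    acc_pred = np.diff(pred, n=2, axis=0) * fps * fps  # second-order finite differences
    acc_real = np.diff(real, n=2, axis=0) * fps * fps
    return np.mean(np.linalg.norm(acc_pred - acc_real, axis=-1))

def frechet_distance(feat_gen, feat_real):
    """Fréchet distance between Gaussians fitted to generated and real feature sets.

    feat_gen, feat_real: (N, D) arrays of latent features, e.g., from a pretrained
    motion autoencoder applied to (accelerated) gesture clips.
    """
    mu1, mu2 = feat_gen.mean(axis=0), feat_real.mean(axis=0)
    cov1 = np.cov(feat_gen, rowvar=False)
    cov2 = np.cov(feat_real, rowvar=False)
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean))
```

    Lower values are better for all three metrics, which is why the reported reductions of 7.88%, 10.23%, and 8.75% relative to Sp2AG represent improvements.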

    Contents
    Chinese Abstract
    Abstract
    Acknowledgments (in Chinese)
    Contents
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Background
      1.2 Motivation
      1.3 System Description
      1.4 Thesis Organization
    Chapter 2 Related Work
      2.1 Literature Review
      2.2 Deep Neural Networks
        2.2.1 Multi-Layer Perceptron
        2.2.2 Recurrent Neural Networks
        2.2.3 Self-Attention Mechanism
        2.2.4 Transformer
        2.2.5 Squeeze and Excitation Networks
    Chapter 3 Deep-Learning-Based Gesture Generation Method
      3.1 Data Preprocessing
        3.1.1 Text Processing
        3.1.2 Audio Processing
        3.1.3 Speaker Identity Style Sampling
      3.2 Generative Adversarial Networks Model
        3.2.1 Network Architecture
        3.2.2 Bi-GRU Block Architecture
        3.2.3 Searching for Weighted Coefficients in Losses
    Chapter 4 Experimental Results and Discussion
      4.1 Datasets
      4.2 Experimental Environment and Training Details
      4.3 Data Visualization
      4.4 Training and Validation Results and Analysis
      4.5 Testing Results and Analysis
      4.6 Comparison with Baseline Methods
    Chapter 5 Conclusions and Future Work
      5.1 Conclusions and Contributions
      5.2 Future Work
    References

    References
    [1] World Health Organization. World Health Statistics Overview 2019: Monitoring Health for the Sustainable Development Goals (SDGs). 2019. Available online: https://apps.who.int/iris/bitstream/handle/10665/311696/WHO-DAD-2019.1-eng.pdf (accessed on 8 August 2022).
    [2] Ministry of Health and Welfare. 2018 Taiwan Health and Welfare Report. 2018. Available online: https://www.mohw.gov.tw/cp-137-47558-2.html (accessed on 8 August 2022).
    [3] N. Sadoughi and C. Busso. “Speech-driven animation with meaningful behaviors,” Speech Communication, vol. 110, pp. 90-100, 2019.
    [4] C.-M. Huang and B. Mutlu. “Learning-based modeling of multimodal behaviors for humanlike robots,” in Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction, Bielefeld, Germany, 2014, pp. 57-64.
    [5] M. Kipp. Gesture Generation by Imitation: From Human Behavior to Computer Character Animation, Universal-Publishers: Irvine, California, 2005.
    [6] S. Levine, P. Krahenbuhl, S. Thrun, and V. Koltun. “Gesture controllers,” Transactions on Graphics, vol. 29, no. 4, pp. 1-11, 2010.
    [7] Y. Ferstl, M. Neff, and R. McDonnell. “Multi-Objective adversarial gesture generation,” in Proceedings of the ACM SIGGRAPH Conference on Motion, Interaction and Games, Newcastle Upon Tyne, United Kingdom, 2019, pp. 1-10.
    [8] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik. “Learning individual styles of conversational gesture,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, California, 2019, pp. 3497-3506.
    [9] S. Alexanderson, G. E. Henter, T. Kucherenko, and J. Beskow. “Style-controllable speech-driven gesture synthesis using normalizing flows,” Computer Graphics Forum, vol. 39, no. 2, pp. 487-496, 2020.
    [10] Y. Yoon, W.-R. Ko, M. Jang, J. Lee, J. Kim, and G. Lee. “Robots learn social skills: end-to-end learning of co-speech gesture generation for humanoid robots,” in Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, Canada, 2019, pp. 4303-4309.
    [11] T. Kucherenko, P. Jonell, S. van Waveren, G. E. Henter, S. Alexanderson, I. Leite, and H. Kjellström. “Gesticulator: a framework for semantically-aware speech-driven gesture generation,” in Proceedings of the ACM International Conference on Multimodal Interaction, Utrecht, Netherlands, 2020, pp. 242-250.
    [12] A. B. Hostetter and A. L. Potthoff. “Effects of personality and social situation on representational gesture production,” Gesture, vol. 12, no. 1, pp. 62-83, 2012.
    [13] T. Baltrušaitis, C. Ahuja, and L.-P. Morency. “Multimodal machine learning: a survey and taxonomy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423-443, 2018.
    [14] C. Ahuja and L.-P. Morency. “Language2Pose: natural language grounded pose forecasting,” in Proceedings of the IEEE International Conference on 3D Vision, Quebec City, Canada, 2019, pp. 719-728.
    [15] M. Roddy, G. Skantze, and N. Harte. “Multimodal continuous turn-taking prediction using multiscale RNNs,” in Proceedings of the ACM International Conference on Multimodal Interaction, Boulder, Colorado, 2018, pp. 186-190.
    [16] D. Bahdanau, K. Cho, and Y. Bengio. “Neural machine translation by jointly learning to align and translate,” arXiv:1409.0473, 2015.
    [17] A. Aristidou, E. Stavrakis, P. Charalambous, Y. Chrysanthou, and S. Loizidou Himona. “Folk dance evaluation using Laban movement analysis,” Computing and Cultural Heritage, vol. 8, no. 4, pp. 1-19, 2015.
    [18] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. “Improved techniques for training GANs,” in Proceedings of the Conference on Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 2234-2242.
    [19] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proceedings of the Conference on Neural Information Processing Systems, Long Beach, California, 2017, pp. 6626-6637.
    [20] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi. “Fréchet audio distance: a metric for evaluating music enhancement algorithms,” arXiv:1812.08466, 2018.
    [21] L. Medsker and L. C. Jain. Recurrent Neural Networks: Design and Applications, CRC Press: Boca Raton, Florida, 1999.
    [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. “Attention is all you need,” in Proceedings of the Conference on Neural Information Processing Systems, Long Beach, California, 2017, pp. 5998-6008.
    [23] J. Hu, L. Shen, and G. Sun. “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, 2018, pp. 7132-7141.
    [24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805, 2018.
    [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. “Generative adversarial nets,” in Proceedings of the Conference on Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672-2680.
    [26] Y. Yoon, B. Cha, J.-H. Lee, M. Jang, J. Lee, J. Kim, and G. Lee. “Speech gesture generation from the trimodal context of text, audio, and speaker identity,” Transactions on Graphics, vol. 39, no. 6, pp. 1-16, 2020.
    [27] U. Bhattacharya, E. Childs, N. Rewkowski, and D. Manocha. “Speech2affectivegestures: synthesizing co-speech gestures with generative adversarial affective expression learning,” in Proceedings of the ACM International Conference on Multimedia, Chengdu, China, 2021, pp. 2027-2036.

    Full text available to the public from 2033/02/06 (campus network, off-campus network, and National Central Library: Taiwan Dissertations and Theses System).