
Graduate Student: Wen-Jie Cai (蔡文傑)
Thesis Title: A ResNet-GRU-Based Lip Reading Model for Daily Mandarin Conversations (一個基於ResNet-GRU的中文日常生活對話之唇語辨識模型)
Advisor: Shanq-Jang Ruan (阮聖彰)
Committee Members: Po-Hung Li (力博宏), Jenq-Shiou Leu (呂政修)
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electronic and Computer Engineering
Publication Year: 2021
Graduation Academic Year: 109 (2020-2021)
Language: English
Pages: 72
Keywords: Visual Speech Recognition, Mandarin Lip Reading, Mandarin Lip Reading Dataset, Deep Learning
Lip reading, also known as visual speech recognition, aims to predict the sentence or word being spoken in a silent video solely by observing the movements of the speaker's lips. In recent years, the rapid development of deep learning and the emergence of large-scale lip reading datasets have brought remarkable progress to the field. However, most current research still targets English lip reading, so Mandarin lip reading studies and datasets remain scarce. We therefore propose a lip reading model that predicts daily Mandarin conversations, and we collect a Daily Mandarin Conversation Lip Reading (DMCLR) dataset. The dataset contains 1,000 videos, recorded by 10 subjects who each spoke 100 common daily-life sentences. The model consists of a spatiotemporal convolution layer, an 18-layer residual network, and a 3-layer bidirectional gated recurrent unit (Bi-GRU) network, and it reaches 94% accuracy on the DMCLR dataset. Furthermore, our experiments show that state-of-the-art models can already achieve strong recognition performance in Mandarin when the number of target sentences is constrained, which makes practical, real-life lip reading feasible in more restricted scenarios. Finally, we also obtain clear performance improvements on the two largest public lip reading datasets, LRW and LRW-1000, raising accuracy from 85.3% to 88.6% and from 41.4% to 56.1%, respectively. These results surpass all previous work and establish a new state-of-the-art benchmark.


Lip reading, also called visual speech recognition, is a technology that predicts the sentences being spoken in a video by reading human lip movements. Lip reading has achieved tremendous advances owing to the rapid development of deep learning and the establishment of a considerable number of lip reading datasets. However, most of these datasets are compiled in English; the datasets collected in Mandarin are inadequate. Therefore, we propose a lip reading model to predict daily Mandarin conversations and collect a new Daily Mandarin Conversation Lip Reading (DMCLR) dataset, consisting of 1,000 videos of 100 daily conversations spoken by 10 speakers. Our model consists of a spatiotemporal convolution layer, a ResNet-18 network, and a 3-layer Bi-GRU network, and it reaches 94% accuracy on the DMCLR dataset. Our experiments demonstrate that a state-of-the-art model can be effective for Mandarin when the set of target sentences is restricted, which makes it possible for lip reading applications to be practical in real life. Additionally, we improve accuracy from 85.3% to 88.6% and from 41.4% to 56.1% on the two largest publicly available lip reading datasets, the LRW dataset (English) and the LRW-1000 dataset (Mandarin), respectively. These results show that our method achieves state-of-the-art performance on these two challenging datasets.
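To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of such an architecture: a spatiotemporal (3D) convolution front-end, a per-frame ResNet-18 feature extractor, and a 3-layer bidirectional GRU back-end that classifies a clip into one of the target sentences. The kernel sizes, hidden width, input resolution, and the temporal mean-pooling classification head are illustrative assumptions and are not taken from the thesis.

import torch
import torch.nn as nn
import torchvision.models as models


class LipReadingModel(nn.Module):
    """Hypothetical sketch: 3D-conv front-end + ResNet-18 + 3-layer Bi-GRU."""

    def __init__(self, num_classes: int = 100, hidden_size: int = 256):
        super().__init__()
        # Front-end: one spatiotemporal convolution over the grayscale clip.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                         padding=(0, 1, 1)),
        )
        # Per-frame ResNet-18; the first conv and classifier are replaced so
        # it accepts the 64-channel front-end output and emits a 512-dim
        # feature vector per frame.
        resnet = models.resnet18()
        resnet.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2,
                                 padding=3, bias=False)
        resnet.fc = nn.Identity()
        self.resnet = resnet
        # Back-end: 3-layer bidirectional GRU over the frame features.
        self.gru = nn.GRU(input_size=512, hidden_size=hidden_size,
                          num_layers=3, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, frames, height, width), e.g. cropped mouth regions.
        feats = self.frontend(x)                   # (B, 64, T, H', W')
        b, c, t, h, w = feats.shape
        feats = feats.transpose(1, 2).reshape(b * t, c, h, w)
        feats = self.resnet(feats).view(b, t, -1)  # (B, T, 512)
        out, _ = self.gru(feats)                   # (B, T, 2 * hidden_size)
        return self.classifier(out.mean(dim=1))    # average over time


if __name__ == "__main__":
    model = LipReadingModel(num_classes=100)
    clip = torch.randn(2, 1, 29, 88, 88)  # 29-frame grayscale mouth crops
    print(model(clip).shape)               # torch.Size([2, 100])

In this sketch each sentence in the restricted target set is treated as one class, which matches the constrained-vocabulary setting the abstract describes; sequence-level decoding (e.g. CTC or attention) would be needed for open-vocabulary recognition.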

Table of Contents:
Recommendation Form
Committee Form
Chinese Abstract
English Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1. Background of the Lip Reading
  1.2. Previous Works
  1.3. Feature of Mandarin
  1.4. Overview of This Thesis
  1.5. Dataset of This Thesis
  1.6. Contributions of This Thesis
  1.7. Organization of This Thesis
Chapter 2 Related Works
  2.1. Audio-Visual Speech Recognition (AVSR)
  2.2. English Lip Reading Methods
  2.3. Mandarin Lip Reading Methods
Chapter 3 Characteristics of Mandarin
  3.1. Syllable
  3.2. Pinyin and Tone
Chapter 4 The Proposed Methods
  4.1. Front-end Module: ResNet with Spatiotemporal Layers
  4.2. Back-end Module: Bidirectional GRUs
Chapter 5 Dataset
  5.1. Building Dataset
  5.2. Data Processing
    5.2.1. Text Processing and Alignment
    5.2.2. Face Alignment and Data Augmentation
Chapter 6 Experimental Results
  6.1. Implementation Details
  6.2. Results and Comparison
    6.2.1. Comparison with Other Methods on LRW Dataset
    6.2.2. Comparison with Other Methods on LRW-1000 Dataset
Chapter 7 Conclusion
Reference
Appendix


Full-text release date: 2026/08/27 (campus network)
Full-text release date: 2026/08/27 (off-campus network)
Full-text release date: 2026/08/27 (National Central Library: Taiwan thesis system)