
Graduate Student: 郭曜銘 (Yao-Ming Kuo)
Thesis Title: Deep Learning-based Automated Classification of Chinese Speech Sound Disorders (基於深度學習的中文語音異常類別分類)
Advisor: 阮聖彰 (Shanq-Jang Ruan)
Committee Members: 阮聖彰 (Shanq-Jang Ruan), 林淵翔 (Yuan-Hsiang Lin), 塗雅雯 (Ya-Wen Tu)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Electronic and Computer Engineering
Publication Year: 2022
Graduation Academic Year: 110
Language: English
Pages: 68
Chinese Keywords: 語音障礙、構音障礙分類、中文構音障礙語料庫、機器學習、人工智能
Keywords: speech sound disorder, speech disfluency classification, Chinese speech sound disorder dataset, machine learning, artificial intelligence

This thesis presents a system that analyzes acoustic data for computer-assisted diagnosis and classification of children's speech sound disorders. The analysis focuses on identifying and classifying four distinct types of Chinese articulation errors. The study collected and compiled a corpus of 2,540 samples of stopping, backing, nasal-final rhymes, and affrication produced by 90 children aged 3 to 6 years with normal or pathological articulation. Each recording carries detailed diagnostic labels from two speech-language pathologists. The speech samples were classified with three well-established neural network models. Feature maps were built from three sets of MFCC parameters extracted from the speech and stacked into a three-dimensional data structure used as model input. Six data augmentation techniques were applied to enlarge the available dataset while avoiding overfitting. The experiments examined the feasibility of four-class articulation-error classification on Chinese words and single characters. Experiments on different data subsets show that the system can accurately detect the four targeted articulation disorders, with the best result reaching 74.4% accuracy using a single Chinese word.


This thesis describes a system for analyzing acoustic data in order to assist in the computer-aided diagnosis and classification of children's speech sound disorders. The analysis concentrates on identifying and categorizing four distinct types of Chinese articulation errors. The study collected and assembled a speech corpus containing 2,540 samples of stopping, backing, FCPD, and affrication from 90 children aged 3-6 years with normal or pathological articulatory features. Each recording was accompanied by a detailed diagnostic annotation from two speech-language pathologists. Classification of the speech samples was accomplished using three well-established neural network models for image classification. The feature maps are created from three sets of MFCC parameters extracted from the speech sounds and aggregated into a three-dimensional data structure as model input. We employ six data augmentation techniques to enlarge the available dataset while avoiding overfitting. The experiments examine the feasibility of four-class articulation-error classification on single Chinese phrases and on single Chinese characters. Experiments with different data subsets demonstrate the system's ability to accurately detect the analyzed pronunciation disorders; the best multi-class classification using a single Chinese phrase achieves an accuracy of 74.4 percent.

RECOMMENDATION FORM I
COMMITTEE FORM II
CHINESE ABSTRACT III
ABSTRACT IV
ACKNOWLEDGEMENTS V
TABLE OF CONTENTS VI
LIST OF TABLES X
1. INTRODUCTION 1
1.1. DISORDERS CHARACTERIZATIONS 4
1.1.1. Stopping 5
1.1.2. Backing 7
1.1.3. FCPD 9
1.1.4. Affrication 11
1.2. STATE OF THE ART 13
1.3. AIMS AND SCOPE 14
1.4. PAPER STRUCTURE 15
2. MATERIALS AND METHODS 16
2.1. COLLECTING AND LABELING AUDIO SAMPLES 17
2.2. DATA PRE-PROCESSING 22
2.3. MODELS 28
2.4. TRAINING ENVIRONMENTS 33
2.5. EXPERIMENT METHODS 34
2.5.1. Experiment 1 – Multi-Class classification using a single Chinese phrase 34
2.5.2. Experiment 2 – Binary classification using a single Chinese character 35
2.5.3. Experiment 3 – Multi-Class classification using a single Chinese character 37
2.5.4. Runtime of the Developed Application 38
3. RESULTS 39
3.1. EXPERIMENT 1 – MULTI-CLASS CLASSIFICATION USING A SINGLE CHINESE PHRASE 40
3.2. EXPERIMENT 2 – BINARY CLASSIFICATION USING A SINGLE CHINESE CHARACTER 42
3.3. EXPERIMENT 3 – MULTI-CLASS CLASSIFICATION USING A SINGLE CHINESE CHARACTER 43
3.4. RUNTIME OF THE DEVELOPED APPLICATION 45
4. DISCUSSION 46
5. CONCLUSIONS 48
A. APPENDIX 50
B. APPENDIX 58
REFERENCES 65

