Author: Chun-Chuan Wei (魏俊全)
Thesis title: Discriminative Analysis on Visual Features for Mandarin Speech Recognition (視覺語音特徵於國語語音辨識之鑑別分析研究)
Advisor: Bor-Shen Lin (林伯慎)
Oral examination committee: Hung-Yan Gu (古鴻炎), Hsin-Min Wang (王新民)
Degree: Master
Department: College of Management, Department of Information Management
Year of publication: 2009
Academic year of graduation: 97 (ROC calendar)
Language: Chinese
Number of pages: 71
Keywords: Visual speech (視覺語音), Mandarin speech recognition (國語語音辨識), HMM distance (隱藏式馬可夫模型距離), Discriminative analysis (鑑別式分析), Audio-visual recognition (影音整合語音辨識)

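Using mouth-shape image features to assist a speech recognition system provides robustness against noise and is markedly effective in noisy environments. In general multi-word recognition, however, mouth-shape image features alone cannot reach a level of performance that is acceptable for applications. Although these features carry information related to the speech content, their discriminative power is low compared with that of acoustic features.
This thesis analyzes the distances between visual-feature models to understand how discriminative the models are, with the aim of reaching performance that makes visual features usable in recognition applications at different levels of the recognition problem. We frame the recognition problem as two-class recognition of Mandarin syllable pairs. We first analyze the recognition capability of different visual features. Then, by computing model distances and analyzing the actual recognition results, we identify which Mandarin syllable pairs are discriminable.
According to the experimental results, visual features achieve an average error rate of 10.47% on Mandarin syllable-pair recognition, and 18.17% of the syllable pairs have an error rate below 2.5%. The model distance we adopt is closely related to the recognition error rate, so model-distance information can be used to screen for discriminable syllable pairs. In addition, we compare the model distances of acoustic-feature models and visual-feature models to find where the two kinds of features offer complementary discriminative power in syllable-pair recognition.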


Visual features can improve the performance of a speech recognition system in noisy environments. However, it is hard to achieve acceptable performance in a multi-word recognition task using visual features alone. Visual features carry less speech information than acoustic features.
In this thesis, we measure the distances between visual models to understand the discriminability of visual features. We then apply this analysis to the pairwise recognition task of Chinese syllable pairs. Based on the analysis of model distance and recognition error, we identify the discriminative pairs of Chinese syllables.
The experimental results show that the average error rate of this pairwise task is 10.47%, and that 18.17% of the model pairs have an error rate lower than 2.5%. The model distance is highly correlated with the recognition error. By comparison with the analysis of audio features, we find the model pairs that are more discriminative with visual features than with audio features.
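
The idea of screening syllable pairs by model distance can be illustrated with a minimal sketch. It is not the thesis's method: the abstract does not specify the distance measure, so the sketch assumes each syllable model is reduced to a single diagonal-covariance Gaussian and uses the symmetric Kullback-Leibler divergence as a stand-in distance. The function names, the toy syllables, their 2-D feature statistics, and the threshold are all hypothetical.

```python
from itertools import combinations


def symmetric_kl_diag(mean_a, var_a, mean_b, var_b):
    """Symmetric KL divergence, KL(a||b) + KL(b||a), between two
    diagonal-covariance Gaussians given as per-dimension means and variances."""
    total = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        diff_sq = (ma - mb) ** 2
        # The log-variance terms of the two KL directions cancel out.
        total += 0.5 * (va / vb + vb / va - 2.0)
        total += 0.5 * diff_sq * (1.0 / va + 1.0 / vb)
    return total


def select_discriminative_pairs(models, threshold):
    """Return (name_a, name_b, distance) for every model pair whose distance
    is at least `threshold`, sorted from most to least distant."""
    selected = []
    for (name_a, (mean_a, var_a)), (name_b, (mean_b, var_b)) in combinations(
        sorted(models.items()), 2
    ):
        distance = symmetric_kl_diag(mean_a, var_a, mean_b, var_b)
        if distance >= threshold:
            selected.append((name_a, name_b, distance))
    selected.sort(key=lambda item: item[2], reverse=True)
    return selected


# Toy, made-up statistics (mean vector, variance vector) for illustration only.
TOY_MODELS = {
    "ba": ([0.0, 0.0], [1.0, 1.0]),
    "ma": ([0.1, 0.0], [1.0, 1.0]),  # visually close to "ba" in this toy setup
    "a": ([3.0, 1.0], [1.0, 1.0]),
}

if __name__ == "__main__":
    for name_a, name_b, distance in select_discriminative_pairs(TOY_MODELS, 5.0):
        print(f"{name_a} vs {name_b}: distance {distance:.2f}")
```

In this toy setup the pair with nearly identical statistics falls below the threshold and is filtered out, while the two well-separated pairs are kept. A real system would replace the single-Gaussian distance with a distance defined over full HMMs and would choose the threshold from the observed relationship between distance and error rate.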

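Chapter 1 Introduction
1.1 Background
1.2 Research Motivation and Objectives
1.3 Thesis Organization
Chapter 2 Corpus Collection
2.1 Corpora
2.1.1 Single-speaker Mandarin continuous-digit corpus
2.1.2 Single-speaker Mandarin isolated-syllable corpus
2.1.3 Multi-speaker Mandarin continuous-speech corpus
2.2 Corpus Processing and Tools
2.3 Chapter Summary
Chapter 3 Visual Speech Features
3.1 System Architecture
3.2 Mouth-Shape Feature Extraction and Model Building
3.2.1 Frequency-domain features based on the discrete cosine transform
3.2.2 Spatial features based on mouth-shape feature points
3.2.3 Coded features based on vector quantization
3.3 Visual Feature Experiments: Results and Discussion
3.3.1 Dynamic feature experiments
3.3.2 Visual feature comparison experiments
3.4 Chapter Summary
Chapter 4 HMM Model Distance Analysis
4.1 HMM Distance Computation
4.1.1 Gaussian distribution distance
4.1.2 Gaussian mixture model distance based on pseudo-divergence
4.1.3 HMM distance
4.2 Chapter Summary
Chapter 5 Mandarin Syllable-Pair Recognition and Model Distance Analysis
5.1 Experimental Setup
5.2 Experimental Results and Discussion
5.2.1 Mandarin syllable-pair recognition
5.2.2 Analysis of the relationship between model distance and recognition error rate
5.2.3 Screening syllable pairs by model distance
5.2.4 Comparison with acoustic initial/final models
5.2.5 Command-pair selection
5.3 Chapter Summary
Chapter 6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Research and Improvements
Appendix 1 Corpus Recording: Mapping between Mandarin Syllables and Spoken Characters
Appendix 2 Mandarin Initial/Final Models
References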

