
Author: 許瓊之
Qiong-zhi Hsu
Thesis Title: 整合聲學指引規則至HMM最佳路徑搜尋之歌聲分段方法
Singing Voice Signal Segmentation Methods Integrating Acoustic-guiding Rules into HMM Based Best-Path Searching
Advisor: 古鴻炎
Hung-Yan Gu
Committee: 王新民
Hsin-Min Wang
余明興
Ming-Shing Yu
林伯慎
Bor-Shen Lin
Degree: 碩士
Master
Department: 電資學院 - 資訊工程系
Department of Computer Science and Information Engineering
Thesis Publication Year: 2015
Graduation Academic Year: 103
Language: 中文 (Chinese)
Pages: 95
Keywords (in Chinese): 歌聲信號、隱藏式馬可夫模型、聲學指引規則、維特比解碼、外顯式狀態時長
Keywords (in other languages): singing voice signal, HMM, acoustic-guiding rules, Viterbi decoding algorithm, explicit state-duration
Reference times: Clicks: 210, Downloads: 1

本論文對於歌聲信號裡聲、韻母時間位置之自動分段的問題,提出了一種整合聲學指引規則至HMM維特比解碼中作最佳路徑搜尋的方法,可用以大幅提升基於HMM之基本分段方法的準確率。我們製作的聲、韻母自動分段程式,分成三種版本,分別使用不同的維特比解碼演算法,來作相互的效能比較。在HMM訓練階段,我們使用HTK軟體對TCC-300語料庫中選出的語句,去訓練出聲、韻母HMM模型;然後透過強制對齊,對自備的歌聲語料,分析各聲、韻母HMM之各狀態上的駐留時長參數,如此就可帶入伽瑪(gamma)機率分佈,去計算外顯式狀態時長機率。在測試階段,實驗的結果顯示,使用外顯式狀態時長機率之修正的維特比解碼可以比基本維特比解碼在10 ms之容忍度內提升7.55% 的準確率;進一步依各音框偵測出的基頻值與能量值,並依聲學知識去設計聲、韻母相關的限制規則,再把規則整合至維特比解碼的步驟中,如此比起基本的維特比解碼,可讓準確率從31.73% 提升到61.33%;接著再把聲、韻母HMM駐留時長的限制規則整合進去,則可讓準確率再提升至66.86%;最後當再加入一種靜音相關的後處理步驟來更正聲、韻母邊界,則準確率更可提升到68.45%。


In this thesis, we propose singing voice signal segmentation methods that integrate acoustic-guiding rules into HMM (hidden Markov model) based best-path searching, greatly improving the accuracy of HMM-based segmentation of syllable initials and finals. In practice, we implemented three versions of the automatic initial/final segmentation program, each using a different Viterbi decoding algorithm, and compared their performance. In the training stage, the HTK software package is used to train syllable-initial and syllable-final HMMs on sentences selected from the TCC-300 corpus. Next, through forced alignment of our recorded singing voice corpus, we estimate the state-duration parameters of each HMM state; these parameters are then plugged into a gamma distribution to compute explicit state-duration probabilities. In the testing stage, experimental results show that the modified Viterbi decoding using explicit state-duration probabilities achieves an accuracy rate 7.55% higher than basic Viterbi decoding, which uses implicit state-transition probabilities, under a tolerance of 10 ms. Furthermore, we use the fundamental frequency and energy detected in each frame to design initial- and final-related constraint rules based on acoustic knowledge, and integrate these acoustic-guiding rules into the Viterbi decoding steps; compared with basic Viterbi decoding, this raises the accuracy rate from 31.73% to 61.33%. Integrating additional rules that constrain the durations of initial and final HMMs raises the accuracy rate further to 66.86%. Finally, adding a silence-related post-processing step that corrects initial and final boundaries raises the accuracy rate to 68.45%.
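The abstract describes estimating state-duration parameters from forced-aligned singing data and plugging them into a gamma distribution to obtain explicit state-duration probabilities. The following is a minimal sketch of one way this can be done; the method-of-moments fit and the function names are illustrative assumptions, not the thesis's actual implementation.

```python
import math

def fit_gamma(durations):
    """Method-of-moments fit of a gamma distribution to a list of
    observed state durations (in frames): k = mean^2/var, theta = var/mean."""
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    k = mean * mean / var        # shape parameter
    theta = var / mean           # scale parameter
    return k, theta

def log_gamma_duration_prob(d, k, theta):
    """Log of the gamma pdf at duration d:
    log f(d) = (k-1)·log d − d/θ − log Γ(k) − k·log θ."""
    return (k - 1) * math.log(d) - d / theta - math.lgamma(k) - k * math.log(theta)
```

With this sketch, the fitted density peaks at the mode (k − 1)·θ, so durations near the typical dwell time of a state score higher than very short or very long ones.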

Abstract (in Chinese)
Abstract (in English)
Acknowledgments
Table of Contents
List of Figures and Tables
Chapter 1  Introduction
  1.1 Research Motivation
  1.2 Literature Review
    1.2.1 Automatic Speech Segmentation
  1.3 Research Methods
  1.4 Thesis Organization
Chapter 2  Corpus Preparation and Feature Extraction
  2.1 HMM Training Corpus
  2.2 Singing Voice Corpus
  2.3 Feature Parameter Extraction
Chapter 3  HMM Model Training
  3.1 The HTK Toolkit
  3.2 HMM Model Training Procedure
    3.2.1 Corpus Preprocessing
    3.2.2 Naming HMM Models and Converting Phonetic Units
    3.2.3 Computing MFCC Coefficients
    3.2.4 Building Prototype HMM Models
    3.2.5 Parameter Substitution
    3.2.6 Training Initial and Final HMM Models
  3.3 Computing Explicit State-Duration Probabilities
    3.3.1 State-Duration Parameters
    3.3.2 Gamma Distribution for Explicit State-Duration Probabilities
Chapter 4  Automatic Segmentation of Initials and Finals
  4.1 Hidden Markov Models
  4.2 Viterbi Decoding
  4.3 Viterbi Decoding with Explicit Duration Probabilities
  4.4 Extended Viterbi Decoding with Integrated Acoustic Rules
    4.4.1 Silence Detection
    4.4.2 The Extended Viterbi Decoding Algorithm
Chapter 5  Experimental Results and Discussion
  5.1 Experimental Corpus
  5.2 Performance Evaluation Method
  5.3 Experiments on Basic Viterbi Decoding
  5.4 Experiments on Viterbi Decoding with Explicit State-Duration Probabilities
  5.5 Experiments on Extended Viterbi Decoding with Integrated Acoustic Rules
  5.6 Segmentation Error Analysis, Post-Processing Step, and Number of HMM Training Iterations
  5.7 Performance Comparison
Chapter 6  Conclusion
References
Appendix 1: Titles and Lyrics of the 40 Recorded Songs
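Chapters 4.3–4.4 of the outline concern Viterbi decoding with explicit state durations. As an illustrative sketch only (not the thesis's code), an explicit-duration best-path search over a fixed left-to-right sequence of units (e.g. initial then final) might look like this; `log_obs`, `log_dur`, and `max_dur` are assumed interfaces:

```python
NEG_INF = float("-inf")

def segment_explicit_duration(log_obs, log_dur, max_dur):
    """Explicit-duration Viterbi: segment frames into a fixed left-to-right
    sequence of units by maximizing observation + duration log-probabilities.

    log_obs[s][t] : log-likelihood of frame t under unit s
    log_dur(s, d) : log-probability that unit s lasts d frames
    max_dur       : maximum duration (in frames) considered per unit
    Returns the end frame (exclusive) of each unit along the best path."""
    n_states = len(log_obs)
    n_frames = len(log_obs[0])
    # cum[s][t] = sum of log_obs[s][0..t-1], for O(1) segment scores
    cum = [[0.0] * (n_frames + 1) for _ in range(n_states)]
    for s in range(n_states):
        for t in range(n_frames):
            cum[s][t + 1] = cum[s][t] + log_obs[s][t]
    # delta[s][t] = best score with units 0..s covering frames 0..t-1
    delta = [[NEG_INF] * (n_frames + 1) for _ in range(n_states)]
    back = [[0] * (n_frames + 1) for _ in range(n_states)]
    for s in range(n_states):
        for t in range(1, n_frames + 1):
            for d in range(1, min(max_dur, t) + 1):
                # previous units must end exactly where this one starts
                prev = delta[s - 1][t - d] if s > 0 else (0.0 if t == d else NEG_INF)
                if prev == NEG_INF:
                    continue
                score = prev + log_dur(s, d) + cum[s][t] - cum[s][t - d]
                if score > delta[s][t]:
                    delta[s][t] = score
                    back[s][t] = t - d
    # backtrack the unit boundaries from the final frame
    bounds = [0] * n_states
    t = n_frames
    for s in range(n_states - 1, -1, -1):
        bounds[s] = t
        t = back[s][t]
    return bounds
```

In this sketch the inner loop over `d` is what makes the duration explicit: instead of geometric dwell times implied by self-transitions, each candidate segment is scored with an arbitrary duration log-probability (e.g. the gamma-based one described in the abstract), at the cost of a factor `max_dur` in running time.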

[1] S. Young, “The HTK Hidden Markov Model Toolkit: Design and Philosophy”, Technical Report TR.153, Department of Engineering, Cambridge University, UK, 1993.
[2] S. Dusan and L.-R. Rabiner, “On the Relation between Maximum Spectral Transition Positions and Phone Boundaries”, in Proc. Interspeech 2006, pp. 17–21, 2006.
[3] G. Almpanidis, M. Kotti and C. Kotropoulos, “Robust Detection of Phone Boundaries Using Model Selection Criteria with Few Observations”, IEEE Transactions on Audio, Speech and Language Processing, vol. 17, no. 2, pp. 287-298, 2009.
[4] 林宥余, Phoneme Segmentation Using Sample-Point Based Acoustic Parameters (in Chinese), Master's thesis, Institute of Communications Engineering, National Chiao Tung University, 2010.
[5] B. Pellom and J. Hansen, “Automatic segmentation of speech recorded in unknown noisy channel characteristics”, Speech Commun., vol. 25, no. 1–3, pp. 97–116, 1998.
[6] G. D. Forney, Jr., “The Viterbi Algorithm: A Personal History”, http://arxiv.org/abs/cs/0504020v2
[7] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev and P. Woodland, The HTK Book (for HTK version 3.2.1), Cambridge University Engineering Department, 2002.
[8] J.-W. Kuo, H.-Y. Lo and H.-M. Wang, “Improved HMM/SVM methods for automatic phoneme segmentation”, in Proc. Interspeech, Antwerp, Belgium, pp. 2057-2060, 2007.
[9] 吳昌益, A Study on Mandarin Speech Synthesis Using Spectrum Evolution Models (in Chinese), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2007.
[10] 吳俊欣, A Speaker Adaptation Method Based on Coordinate-System Mapping of MFCC Feature Spaces (in Chinese), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2003.
[11] 林祐靖, Mandarin Singing Voice Synthesis Combining HMM Spectrum Models and ANN Vibrato Models (in Chinese), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2014.
[12] ACLCLP, Mandarin microphone speech corpus – TCC300, http://www.aclclp.org.tw/use_mat.php#tcc300edu.
[13] 校園民歌回顧 (A Retrospective of Campus Folk Songs, in Chinese), 一品文化出版, Taipei, 1985.
[14] K. Sjölander and J. Beskow, WaveSurfer, Centre for Speech Technology, KTH, http://www.speech.kth.se/wavesurfer/.
[15] 王小川, Speech Signal Processing (語音訊號處理, revised 2nd ed., in Chinese), 全華圖書公司, 2009.
[16] L.-R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[17] D. O’Shaughnessy, Speech Communication: Human and Machine, Addison-Wesley Publishing Company, 1987.
[18] J.-S. Jang, HTK Example: Digit Recognition, http://mirlab.org/jang/books/audiosignalprocessing/htkBasicExample.asp?title=17-2%20HTK%20Example:%20Digit%20Recognition%20(HTK%20%B0%F2%A5%BB%BDd%A8%D2%A4@%A1G%BC%C6%A6r%BF%EB%C3%D1)
[19] Geek Garden, CCITT 16-bit CRC, http://geek-garden.blogspot.tw/2012/07/ccitt-16-bit-crc.html
[20] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[21] R.L. Scheaffer, Introduction to Probability and Its Applications, PWS Publishing, 1995.
[22] 林秉正, Speaking Rate Adjustment Using Adaptive Interval Models (in Chinese), Master's thesis, Department of Computer Science and Information Engineering, National Cheng Kung University, 2002.
[23] 賴名彥, A Mandarin Speech Synthesis System Combining HMM Spectrum Models and ANN Prosody Models (in Chinese), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2009.
[24] 黃國勛, A Study on Voice Command Recognition on Mobile Devices (in Chinese), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 2007.
[25] GNU Operating System, The GSL-GNU Scientific Library, http://www.gnu.org/software/gsl/
[26] J. Yuan, N. Ryant and M. Liberman, “Automatic Phonetic Segmentation in Mandarin Chinese: Boundary Models, Glottal Features and Tone”, in Proc. ICASSP 2014, 2014.
[27] 林政源, A Study on Automatic Segmentation for Mandarin Speech and Singing Voice Synthesis (in Chinese), PhD dissertation, Department of Computer Science, National Tsing Hua University, 2007.
