Author: 旭法
Sivabalan - Adinarayanan
Thesis Title: 文句不相關語者驗證使用支援向量機
Text-Independent Speaker Verification Using Support Vector Machine
Advisor: 洪西進
Shi-Jinn Horng
Committee: 王有禮
Yue-Li Wang
Hsing Mei
Jeen-Shing Wang
Chang-Biau Yang
Degree: Master's
Department: College of Electrical Engineering and Computer Science -
Department of Computer Science and Information Engineering
Thesis Publication Year: 2005
Graduation Academic Year: 93
Language: English
Pages: 83
Keywords (in Chinese): 梅爾倒頻譜參數 (Mel-Frequency Cepstral Coefficients), 支援向量機 (Support Vector Machine), 語者驗證 (Speaker Verification)
Keywords (in other languages): Support Vector Machine, Speaker Verification, MFCC

In the system implementation, Mel-Frequency Cepstral Coefficients (MFCCs) serve as the speaker features and are combined with a Support Vector Machine (SVM) to build speaker-dependent models.
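
The MFCC front end mentioned above can be illustrated with a minimal sketch that follows the stages listed in the thesis outline (frame blocking, windowing, Fourier transform, mel-frequency wrapping, cepstrum). The frame length, hop size, filter count, and number of coefficients below are illustrative assumptions, not the values used in the thesis, and a naive DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath
import math

def hz_to_mel(f):
    # Common mel-scale formula; the thesis may use a different variant.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def frame_blocking(signal, frame_len, hop):
    """Split the signal into overlapping frames of frame_len samples, hop samples apart."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]

def hamming(frame):
    """Apply a Hamming window to taper the frame edges."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def power_spectrum(frame):
    """Naive O(N^2) DFT power spectrum; a real system would use an FFT."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))) ** 2
            for k in range(n // 2 + 1)]

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def mfcc(signal, sample_rate, frame_len=64, hop=32, n_filters=12, n_ceps=8):
    """Return one list of n_ceps cepstral coefficients per frame (c0 is skipped)."""
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    feats = []
    for frame in frame_blocking(signal, frame_len, hop):
        spec = power_spectrum(hamming(frame))
        # Log filterbank energies, floored to avoid log(0).
        log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
                 for row in bank]
        # DCT-II of the log energies gives the cepstral coefficients.
        feats.append([sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                          for j, e in enumerate(log_e))
                      for c in range(1, n_ceps + 1)])
    return feats
```

Each returned row is one feature vector; in a system of the kind described here, such vectors would be the inputs to the SVM.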

This dissertation explores speaker recognition technology, specifically the
techniques used in current state-of-the-art systems. Current state-of-the-art
speaker verification systems are based on discriminatively trained generative
models. In these systems, discrimination is achieved with a linear function. We
studied the use of support vector machines (SVMs) for text-independent speaker
verification. Two main approaches were considered. The first is an approach using
linear SVMs. The second is an utterance-based approach using kernel SVMs.
State-of-the-art speaker verification systems rely on generative models to recognize
speakers. This is a curious result, since discriminative approaches to classification
should in theory be better than generative ones: the former are explicitly optimized
to minimize the classification error rate, whereas the latter are not.
The polynomial kernel and the radial basis function (RBF) kernel are widely used
for the speaker verification task. We examine the properties of linear SSVMs in
comparison. By doing so, we can study or adopt a simpler system with faster
execution time that yields accuracy equal or close to that of current kernel
methods. The approach using linear SVMs studies how simply and how quickly the
method can reduce the error rate. The aim is to overcome the difficulties that arise
when complex kernel SVMs are applied to speaker verification. We begin with an
investigation into similar kernel functions such as the polynomial and RBF kernels.
These techniques were tested on the YOHO database, one of the top ten databases,
and then evaluated on the more difficult custom-built text-independent database.
This separation of development from evaluation is important to ensure that the
methods are general and that the classifiers have not been tuned to one particular
database.
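
To make the speed argument concrete, the sketch below contrasts scoring with a kernel SVM and with a linear one. It assumes the standard SVM decision function f(x) = sum_i (alpha_i * y_i) K(x_i, x) + b; the function names, kernel parameters, and threshold are illustrative and not taken from the thesis.

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_score(x, support_vectors, dual_coefs, bias, kernel):
    """Kernel SVM score: one kernel evaluation per support vector."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

def collapse_linear(support_vectors, dual_coefs):
    """For the linear kernel, the sum over support vectors folds into one weight vector w."""
    dim = len(support_vectors[0])
    return [sum(c * sv[d] for sv, c in zip(support_vectors, dual_coefs))
            for d in range(dim)]

def linear_score(x, w, bias):
    """Linear SVM score: a single dot product, independent of the support vector count."""
    return linear_kernel(w, x) + bias

def verify(score, threshold=0.0):
    """Accept the identity claim when the score reaches the threshold."""
    return score >= threshold
```

With a polynomial or RBF kernel, the cost of scoring one feature vector grows with the number of support vectors, whereas the collapsed linear model needs one dot product per vector. That is one plausible reading of the execution-time advantage described above.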
Experimentally, the linear SVMs not only outperform current state-of-the-art
classifiers on the YOHO text-independent speaker verification database but also
come close to the results of the kernel functions, with faster execution time. This
thesis reports an equal error rate of 1.81% on the YOHO database and an equal
error rate of 0.65% with our own-built text-independent database.
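
The equal error rate (EER) quoted in the abstract is the operating point at which the false acceptance rate (FAR) equals the false rejection rate (FRR). The following is a minimal sketch of how an EER can be estimated from verification scores; the threshold sweep over observed scores and the averaging at the closest point are assumptions made for illustration, not the thesis's documented procedure.

```python
def far_frr(client_scores, impostor_scores, threshold):
    """FAR: fraction of impostor trials accepted. FRR: fraction of client trials rejected."""
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in client_scores) / len(client_scores)
    return far, frr

def equal_error_rate(client_scores, impostor_scores):
    """Try every observed score as a threshold and keep the one where FAR and FRR are closest."""
    best = None
    for t in sorted(set(client_scores) | set(impostor_scores)):
        far, frr = far_frr(client_scores, impostor_scores, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]  # (estimated EER, threshold)
```

For example, `equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.5])` returns an EER of 0.25 at threshold 0.5, because one of four impostor scores is accepted and one of four client scores is rejected there.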

Abstract
Acknowledgements
List of Figures
List of Tables

1 INTRODUCTION
1.1 Introduction
1.2 Goals and Motivation
1.3 Overview
2 BACKGROUND
2.1 Acoustic Models
2.2 Speech Production
2.3 Previous Work
2.4 Applications
2.5 Pros and Cons of Speaker Recognition
2.6 Elementary Concepts and Terminology
2.6.1 Speaker Identification
2.6.2 Speaker Verification
2.7 Text-Dependent
2.8 Text-Independent
3 Feature Extraction (MFCC)
3.1 Introduction
3.2 Mel-Frequency Cepstrum Coefficients Processor (MFCC)
3.2.1 Frame Blocking
3.2.2 Windowing
3.2.3 Fast Fourier Transform (FFT)
3.2.4 Mel-Frequency Wrapping
3.2.5 Cepstrum
4 Vector Quantization (VQ)
4.1 Introduction
4.2 Vector Quantization
5 Support Vector Machine (SVM)
5.1 Speaker Modeling
5.2 Conventional Support Vector Machine
5.3 Variational Support Vector Machine
6 Experiments & Results
6.1 Error Reporting
6.2 Corpora
6.2.1 Custom-Build Database
Custom-Build Text-Dependent Database
Experimental Procedure
Custom-Build Text-Independent Database
Experimental Procedure
6.2.2 The YOHO Voice Verification Corpus
Experimental Procedure
7 Conclusion
7.1 Conclusion
REFERENCES

List of Tables
Table 3.1: A summary of some common windowing functions
Table 6.1: Speaker Verification Error Rate

List of Figures
Figure 1.1: Speaker Verification Flow
Figure 1.2: Training and Testing Flow
Figure 2.1: Schematic and circuit model of the vocal tract [4]
Figure 2.2: Acoustic tube model of speech production
Figure 2.3: Speech production mechanism [8]
Figure 2.4: Applying speaker recognition in speech recognition
Figure 2.5: Areas of voice (speaker) recognition
Figure 2.6: Basic structure of speaker identification system
Figure 2.7: Basic structure of speaker verification system
Figure 3.1: An example of speech signal
Figure 3.2: Frame-based analysis
Figure 3.3: Mel scale filterbank
Figure 3.4: Block diagram of the MFCC processor
Figure 3.5: How the parameters N and M are utilized in the frame blocker
Figure 3.6: Common time windows, with durations normalized to unity
Figure 3.7: An example of Mel-spaced filterbank
Figure 4.1: Conceptual diagram illustrating vector quantization codebook formation. One speaker can be discriminated from another based on the location of centroids [34]
Figure 4.2: The process of VQ codebook generation; the features are shown by blue dots, the group boundary in green and the centroids in red
Figure 4.3: Flow of the binary split codebook generation algorithm [5]
Figure 5.1: Training data is perfectly linearly separated by multiple hyperplanes in R2
Figure 5.2: A hyperplane is found by the SVM in R2. The support vectors are circled
Figure 5.3: The support vectors (circled points) of a soft margin SVM in R2
Figure 5.4: An example of kernel mapping
Figure 6.1: Imposter Scores Distribution and FAR
Figure 6.2: Client Scores Distribution and FRR
Figure 6.3: Overlapping of distribution of the client and the imposter, FAR and FRR
Figure 6.4: Reporting classifier performance on ROC and DET curves
(a): An ROC curve illustrates the trade-off between probability of false acceptances (horizontal axis) against the true acceptance probability or one minus the false rejection probability (vertical axis)
(b): The DET curves corresponding to the ROCs shown in (a). The false acceptance probability (horizontal axis) is plotted against the false rejection probability (vertical axis). The conversion from probabilities to the normal deviate scale is shown in (c)
(c): The normal deviate is found by computing the percentage area under the normal distribution
Figure 6.5: Text-Dependent DET Plot (100 Speakers)
Figure 6.6: FAR & FRR Versus Threshold (100 Speakers)
Figure 6.7: Text-Independent DET Plot (30 Speakers)
Figure 6.8: FAR & FRR Versus Threshold (30 Speakers)
Figure 6.9: Text-Independent DET Plot (138 Speakers)
Figure 6.10: FAR & FRR Versus Threshold (138 Speakers)

[1] Prabhakar, S., Pankanti, S., and Jain, A. Biometric recognition: security and
privacy concerns. IEEE Security & Privacy Magazine 1 (2003), 33–42.
[2] The Biometric Consortium. Webpage, December 2003.
[3] Kittler, J., and Nixon, M., Eds. 4th International Conference on Audio- and
Video-Based Biometric Person Authentication (AVBPA 2003). Lecture Notes
in Computer Science. Springer-Verlag, Berlin, 2003.
[4] J.R. Flanagan. Speech Analysis, Synthesis and Perception, chapter 3. Springer-
Verlag, 1972.
[5] L. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Englewood
Cliffs, New Jersey: Prentice Hall, pp. 14-17, pp. 52-65, pp. 112-117, pp. 183-
191, 1993.
[6] B. S. Atal, "Automatic recognition of speakers from their voices," Proc. IEEE,
vol. 64, pp. 460-475, 1976.
[7] A. E. Rosenberg, and F. K. Soong, "Recent research in automatic speaker
recognition," in Advances in Speech Signal Processing, S. Furui, M. Sondhi,
Eds. New York: Marcel Dekker Inc., pp. 701-737, 1992.
[8] G. J. Tortora and S. R. Grabowski, Principles of Anatomy and Physiology, (8th
Ed.) New York: Harper Collins, p. 709, 1996.
[9] D. A. Reynolds, "Automatic speaker recognition using gaussian mixture
speaker models," Lincoln Laboratory Journal, vol. 8, no. 2, pp. 173-192, 1995.
[10] S. Furui, "An overview of speaker recognition technology," in Automatic
Speech and Speaker Recognition, C. H. Lee, F. K. Soong, and K. K. Paliwal,
Eds. Boston: Kluwer Academic, pp. 31-56, 1996.
[11] G.R. Doddington. Speaker recognition - Identifying people by their voices.
Proceedings of the IEEE, 73(11):1651-1663, November 1985.
[12] G.R. Doddington. Speaker recognition based on idiolectal differences between
speakers. In Proc. Eurospeech, pages 2521-2524, Aalborg, September 2001.
[13] W.D. Andrews, M.A. Kohler, and J.P. Campbell. Phonetic speaker recognition.
In Proc. Eurospeech, pages 2517-2520, Aalborg, September 2001.
[14] C.R. Jankowski Jr., T.F. Quatieri, and D.A. Reynolds. Measuring fine structure
in speech: Application to speaker identification. In Proc. ICASSP, pages 325-328,
Detroit, May 1995.
[15] H.A. Murthy, F. Beaufays, L.P. Heck, and M. Weintraub. Robust text-independent
speaker identification over telephone channels. IEEE Trans. on
Speech and Audio Processing, 7(5):554-568, September 1999.
[16] Rose, P. Forensic Speaker Identification. Taylor & Francis, London, 2002.
[17] Niemi-Laitinen, T. Thesis, University of Helsinki, Department of Phonetics,
Helsinki, Finland, 1999.
[18] Kuhn, R., Junqua, J.-C., Nguyen, P., and Niedzielski, N. Rapid speaker
adaptation in eigenvoice space. IEEE Trans. on Speech and Audio Processing 8
(2000), 695–707.
[19] Martin, A., and Przybocki, M. Speaker recognition in a multi-speaker
environment. In Proc. 7th European Conference on Speech Communication and
Technology (Eurospeech 2001) (Aalborg, Denmark, 2001), pp. 787–790.
[20] Lapidot, I., Guterman, H., and Cohen, A. Unsupervised speaker recognition
based on competition between self-organizing maps. IEEE Transactions on
Neural Networks 13 (2002), 877–887.
[21] Liu, D., and Kubala, F. Fast speaker change detection for broadcast news
transcription and indexing. In Proc. 6th European Conference on Speech
Communication and Technology (Eurospeech 1999) (Budapest, Hungary,
1999), pp. 1031–1034.
[22] Kwon, S., and Narayanan, S. Speaker change detection using a new weighted
distance measure. In Proc. Int. Conf. on Spoken Language Processing (ICSLP
2002) (Denver, Colorado, USA, 2002), pp. 2537–2540.
[23] Brunelli, R., and Falavigna, D. Person identification using multiple cues. IEEE
Trans. on Pattern Analysis and Machine Intelligence 17, 10 (1995), 955–966.
[24] Toh, K.-A. Fingerprint and speaker verification decisions fusion. In Proc. 12th
Int. Conf. on Image Analysis and Processing (ICIAP’03) (2003), pp. 626–631.
[25] Kittler, J., and Nixon, M., Eds. 4th International Conference on Audio- and
Video-Based Biometric Person Authentication (AVBPA 2003). Lecture Notes
in Computer Science. Springer-Verlag, Berlin, 2003.
[26] Zetterholm, E. The significance of phonetics in voice imitation. In Proc. 8th
Australian Int. Conf. on Speech Science and Technology (2000), pp. 342–347.
[27] Campbell, J. P. (1997), “Speaker Recognition: A Tutorial”, in ‘Proceedings of
the IEEE’, Vol. 85, pp. 1437–1462.
[28] Furui, S. Recent advances in speaker recognition. Pattern Recognition Letters
18, 9 (1997), 859–872.
[29] Reynolds, D. (2002), “An Overview of Automatic Speaker Recognition
Technology”, in ‘Proceedings of the International Conference on Acoustics,
Speech and Signal Processing. ICASSP 2002’, Vol. 4, pp. 4072–4075.
[30] Naik, J. M. (1990), ‘Speaker Verification: A Tutorial’, IEEE Communications
Magazine 28, 42–48.
[31] Che, C., Lin, Q. & Yuk, D.-S. (1996), “An HMM Approach to Text-Prompted
Speaker Verification”, in ‘Proceedings of the International Conference on
Acoustics, Speech and Signal Processing. ICASSP ’96’, pp. CD–ROM.
[32] Matsui, T. & Furui, S. (1993), “Concatenated Phoneme Models for Text-
Variable Speaker Recognition”, in ‘Proceedings of the International
Conference on Acoustics, Speech and Signal Processing. ICASSP ’93’, Vol. 2,
pp. 391–394.
[33] Gish, H. & Schmidt, M. (1994), ‘Text-Independent Speaker Identification’,
IEEE Signal Processing Magazine 11(4), 18–32.
[34] F.K. Soong, A.E. Rosenberg and B.H. Juang, “A vector quantisation approach to
speaker recognition”, AT&T Technical Journal, Vol. 66-2, pp. 14-26, March
[35] A. E. Rosenberg, and F. K. Soong, "Recent research in automatic speaker
recognition," in Advances in Speech Signal Processing, S. Furui, M. Sondhi,
Eds. New York: Marcel Dekker Inc., pp. 701-737, 1992.
[36] F. K. Soong, A. E. Rosenberg, and B. H. Juang, "A vector quantization
approach to speaker recognition," AT & T Journal, vol. 66, no. 2, pp. 14-26,
[37] F. K. Soong, A. E. Rosenberg, and B. H. Juang, "A vector quantization
approach to speaker recognition," Proc. ICASSP'85, (Tampa, Florida), March
1985, pp. 387-390.
[38] Y. Linde, A. Buzo, and R. M. Gray, "An algorithm for vector quantizer design,"
IEEE Trans. Communications, vol. COM-28, no. 1, pp. 84-95, January 1980.
[39] J. Fritsch, Hierarchical Connectionist Acoustic Modeling for Domain-Adaptive
Large Vocabulary Speech Recognition, Ph. D. dissertation, University of
Karlsruhe, Germany, 2000.
[40] V.N. Vapnik, Statistical Learning Theory, John Wiley, New York, NY, USA,
[41] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York,
[42] C. J. C. Burges. A tutorial on support vector machines for pattern recognition.
Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[43] V. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer,
New York, 1982.
[44] T. Joachims. Learning to Classify Text Using Support Vector Machines.
Kluwer Academic Publishers, Norwell, Massachusetts, 2002.
[45] A. J. Robinson, Dynamic Error Propagation Networks, Ph.D. dissertation,
Cambridge University, UK, February 1989.
[46] T. Joachims, SVMLight: Support Vector Machine http://ai.informatik.
ml, University of Dortmund, November 1999.
[47] Y. LeCun, et al., “Handwritten Digit Recognition with Backpropagation
Network,” Advances in Neural Information Processing Systems-2, Morgan
Kaufmann, pp. 396-404, 1990.
[48] T. Joachims, “Text Categorization with Support Vector Machines: Learning
with Many Relevant Features,” Technical Report 23, LS VIII, University of
Dortmund, Germany, 1997.
[49] M. Schmidt, H. Gish, “Speaker Identification Via Support Vector
Classifiers,” Proceedings of the International Conference on Acoustics, Speech
and Signal Processing, pp. 105-108, Atlanta, GA, USA, May 1996.
[50] S. Fine, J. Navratil and R. A. Gopinath. Hybrid GMM/SVM Approach to
Speaker Identification, Proceedings of the International Conference on
Acoustics, Speech and Signal Processing, Salt Lake City, Utah, USA, 2001.
[51] A. Ganapathiraju, J. Hamaker and J. Picone, “Support Vector Machines for
Speech Recognition,” Proceedings of the International Conference on Spoken
Language Processing, pp. 2923-2926, Sydney, Australia, November 1998.
[52] C. Philip and P. Moreno, “On the Use of Support Vector Machines for Phonetic
Classification,” Proceedings of the International Conference on Acoustics,
Speech and Signal Processing, Phoenix, Arizona, USA, 1999.
[53] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, Chichester,
second edition, 1987.
[54] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector
Machines. Cambridge University Press, Cambridge, 2000.
[55] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal
margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM
Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA,
July 1992. ACM Press.
[56] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297,
1995.
[57] M. Brown, W. Grundy, D. Lin, N. Cristianini, C. Sugnet, M. Ares, and D.
Haussler. Support vector machine classification of microarray gene expression
data. Technical Report UCSC-CRL-99-09, Department of Computer Science,
University of California, Santa Cruz, 1999.
[58] C. Chen and O. L. Mangasarian. Smoothing methods for convex inequalities
and linear complementarity problems. Mathematical Programming, 71(1):51-69,
[59] C. Chen and O. L. Mangasarian. A class of smoothing functions for nonlinear
and mixed complementarity problems. Computational Optimization and
Applications, 5(2):97-138, 1996.
[60] Y.-J. Lee and O. L. Mangasarian. SSVM: A smooth support vector machine.
Computational Optimization and Applications, 20:5-22, 2001. Data Mining
Institute, University of Wisconsin, Technical Report 99-03.
[61] G. Fung and O. L. Mangasarian. A feature selection Newton method for
support vector machine classification. Computational optimization and
applications, pages 1-18, 2003.
[62] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines.
In Advances in Neural Information Processing Systems 07, 2003.
[63] Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines.
Technical Report 00-07, Data Mining Institute, Computer Sciences Department,
University of Wisconsin, Madison, Wisconsin, July 2000. Proceedings of the
First SIAM International Conference on Data Mining, Chicago, April 5-7, 2001,
CD-ROM Proceedings.
[64] O. L. Mangasarian. Generalized support vector machines. In A. Smola, P.
Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large
Margin Classifiers, pages 135-146, Cambridge, MA, 2000. MIT Press.
[65] R. Kohavi. Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree
hybrid. In Proc. of the 2nd Int. Conf. on Knowledge Discovery and Data
Mining, 1996, Cambridge, MA 02142, 1996. The AAAI Press/The MIT Press.
[66] J. Platt. Sequential minimal optimization: A fast algorithm for training support
vector machines. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors,
Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT
Press, 1999.
[67] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic
Press, London, 1981.
[68] Y.-J. Lee, H.-Y. Lo, and S.-Y. Huang. Incremental reduced support vector
machine. In Proceedings of the 2003 International Conference on Informatics,
Cybernetics, and Systems (ICICS 2003), Kaohsiung, Taiwan, 2003.
[69] J. A. Swets, editor. Signal Detection and Recognition by Human Observers.
John Wiley & Sons, Inc., 1964.
[70] J. P. Egan. Signal Detection Theory and ROC. Academic Press, 1975.
[71] A. Martin, G. Doddington, T. Kamm, M. Ordowski, and M. Przybocki. The
DET curve in assessment of detection task performance. In Proc. EuroSpeech,
pages 1895-1898, September 1997.
[72] J. P. Campbell Jr. Testing with the YOHO CD-ROM voice verification corpus.
In Proc. ICASSP, volume 1, pages 341-344, 1995.
[73] Hou Fenglei, Wang Bingxi, “Text-independent speaker recognition using
support vector machine”, Info-tech and Info-net, 2001. Proc. ICII 2001-Beijing.
2001 International Conferences on Vol. 3, 29 Oct.-1 Nov.2001, pp 402-407
[74] V.Wan and W.M.Campbell, “Support Vector Machines for speaker verification
and identification”, in Proc. Neural Networks for Signal Processing X,2000, pp.
[75] V.Wan and S.Renals, “Evaluation of kernel methods for speaker verification
and identification”, in Proc. ICASSP, vol. 1, 2002, pp. 669-672
[76] Lifeng Sang; Zhaohui Wu; Yingchun Yang; Wanfeng Zhang; Multimedia and
Expo, 2003. ICME '03. Proceedings. 2003 International Conference on Volume
3, 6-9 July 2003 Page(s):III - 613-16 vol.3
[77] Zhiyou Ma; Yingchun Yang; Zhaohui Wu; Systems, Man and Cybernetics,
2003. IEEE International Conference on Volume 5, 5-8 Oct. 2003 Page(s):
4153 - 4158 vol.5
[78] D. O’Shaughnessy, Speech Communication: Human and Machine,
Addison-Wesley, New York, New York, USA, 1987.
[79] L.R Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice-
Hall, Englewood Cliffs, N.J., 1978.
[81] “MatlabVOICEBOX”
[82] comp.speech Frequently Asked Questions WWW site,