
Graduate Student: 黃偉愷 (Wei-Kai Huang)
Thesis Title: 一個輕量化關鍵詞檢測模型使用新穎的資料擴增框架 (A Small-Footprint Keyword Spotting Model using Novel Data Augmentation Framework)
Advisor: 陳冠宇 (Kuan-Yu Chen)
Committee Members: 王新民 (Hsin-Min Wang), 林伯慎 (Bor-Shen Lin), 王緒翔 (Syu-Siang Wang)
Degree: Master
Department: 電資學院 - 資訊工程系 (Department of Computer Science and Information Engineering)
Year of Publication: 2023
Graduation Academic Year: 111
Language: English
Number of Pages: 69
Chinese Keywords: 資料擴增、關鍵詞檢測、語音辨識、對抗式訓練
English Keywords: data augmentation, keyword spotting, speech recognition, adversarial training
    A keyword spotting (KWS) system is an important human-computer interaction medium in smart devices. However, it is very challenging to demand robust performance from a KWS model that has only a small number of parameters. In this thesis, we therefore design ConvKWS, a novel small-footprint KWS model whose internal convolutional mixing module mixes the spatial and channel information of the input features to increase their interaction, so that the model performs well even with few parameters.
    Besides the model architecture, how to augment the training data is another important research topic for building a strong KWS system. In speech-related research, data augmentation methods fall into two categories: methods that augment the speech waveform, represented by speed perturbation, and methods that augment the spectrogram of the speech features, represented by SpecAugment. Following this taxonomy, we design two novel speech data augmentation methods, WavCutMix and SpecADV. WavCutMix randomly replaces segments of the input waveform with other speech segments, while SpecADV augments the spectrogram of the speech features through adversarial training.
    To investigate the generality of WavCutMix and SpecADV, we apply them not only to the KWS task but also to the speech recognition task. The KWS experiments use the Google Speech Commands V2 dataset, and the speech recognition experiments use AISHELL-1, LibriSpeech 100h, and LibriSpeech 960h. The experimental results show that training with WavCutMix and SpecADV effectively improves model performance. In particular, on Google Speech Commands V2, ConvKWS with WavCutMix and SpecADV reaches an accuracy of 98.81%, outperforming other state-of-the-art models while using fewer parameters.


    A keyword spotting (KWS) system is an important human-computer interaction medium in smart devices. However, requiring a KWS model to maintain robust performance with a small number of parameters is extremely challenging. We therefore propose ConvKWS, a novel small-footprint KWS model that uses an internal Conv-Mixing module to combine the spatial and channel information of the input features. This mixing increases the interaction between features, allowing the model to perform well even with few parameters.
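    As a rough illustration, the PyTorch sketch below shows one way such a mixing block could alternate temporal ("spatial") mixing and channel mixing over a (batch, channels, time) feature map. The class name ConvMixBlock, the kernel size, and the depthwise/pointwise split are assumptions made for illustration only, not the exact Conv-Mixing module defined in the thesis.

        # Minimal sketch of a convolutional mixing block (PyTorch).
        # Assumption: "spatial" mixing = depthwise convolution over the time axis,
        # channel mixing = pointwise (1x1) convolution. Illustrative only.
        import torch
        import torch.nn as nn

        class ConvMixBlock(nn.Module):
            """Mixes temporal and channel information of a (batch, channels, time) map."""

            def __init__(self, channels: int, kernel_size: int = 9):
                super().__init__()
                # Depthwise conv: each channel is filtered over time independently,
                # mixing information along the time ("spatial") axis.
                self.temporal_mix = nn.Conv1d(channels, channels, kernel_size,
                                              padding=kernel_size // 2, groups=channels)
                # Pointwise conv: mixes information across channels at every frame.
                self.channel_mix = nn.Conv1d(channels, channels, kernel_size=1)
                self.norm1 = nn.BatchNorm1d(channels)
                self.norm2 = nn.BatchNorm1d(channels)
                self.act = nn.GELU()

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                x = x + self.act(self.norm1(self.temporal_mix(x)))  # residual temporal mixing
                x = x + self.act(self.norm2(self.channel_mix(x)))   # residual channel mixing
                return x

        # Example: 8 utterances, 64-dimensional features, 98 frames.
        feats = torch.randn(8, 64, 98)
        print(ConvMixBlock(channels=64)(feats).shape)  # torch.Size([8, 64, 98])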
    In addition to the model architecture, how to augment speech data is an important research topic for developing robust KWS systems. In speech-related research, data augmentation methods can be applied either to the speech waveform (e.g., speed perturbation) or to the acoustic features (e.g., SpecAugment). Following this line of research, we designed two novel data augmentation methods for KWS, WavCutMix and SpecADV. WavCutMix randomly replaces segments of the input waveform with segments from other utterances, while SpecADV augments the spectrogram of the acoustic features through adversarial training.
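    The sketch below illustrates both ideas at the level described above, under stated assumptions: wav_cutmix picks a purely random segment position (the thesis additionally defines alignment-based and equalization-based replacement), and spec_adv performs a single FGSM-style gradient step on the acoustic features (the thesis describes a two-stage clean/noise training schedule). The function names and hyperparameters are illustrative, not the thesis implementation.

        # Hedged sketches of the two augmentations described above (PyTorch).
        import torch

        def wav_cutmix(wav_a: torch.Tensor, wav_b: torch.Tensor,
                       max_ratio: float = 0.3) -> torch.Tensor:
            """Replace a random segment of 1-D waveform wav_a with a segment of wav_b.

            Assumes both waveforms are long enough for the sampled segment length.
            """
            seg_len = int(torch.randint(1, max(2, int(max_ratio * wav_a.numel())) + 1, (1,)))
            seg_len = min(seg_len, wav_b.numel())
            start_a = int(torch.randint(0, wav_a.numel() - seg_len + 1, (1,)))
            start_b = int(torch.randint(0, wav_b.numel() - seg_len + 1, (1,)))
            mixed = wav_a.clone()
            mixed[start_a:start_a + seg_len] = wav_b[start_b:start_b + seg_len]
            return mixed

        def spec_adv(model, feats: torch.Tensor, labels: torch.Tensor,
                     loss_fn, epsilon: float = 0.01) -> torch.Tensor:
            """Return adversarially perturbed acoustic features (single FGSM-style step)."""
            feats = feats.clone().detach().requires_grad_(True)
            loss_fn(model(feats), labels).backward()
            # Nudge the features in the direction that increases the task loss.
            adv = feats + epsilon * feats.grad.sign()
            model.zero_grad()  # discard gradients accumulated on the model by this step
            return adv.detach()

    In training, the mixed waveforms from wav_cutmix and the perturbed features from spec_adv would simply be fed to the model alongside, or in place of, the clean examples.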
    To investigate the generality of WavCutMix and SpecADV, we applied them to both the KWS and automatic speech recognition (ASR) tasks. The KWS experiments were conducted on the Google Speech Commands V2 dataset, while the ASR experiments used three datasets: AISHELL-1, LibriSpeech 100h, and LibriSpeech 960h. The results show that training with WavCutMix and SpecADV significantly improves model performance. In particular, on Google Speech Commands V2, ConvKWS trained with WavCutMix and SpecADV achieved an accuracy of 98.81%, outperforming other state-of-the-art models while using fewer parameters.

    1. Introduction
    2. Related Work
        2.1. Keyword Spotting Models
            2.1.1. MatchboxNet
            2.1.2. ConvMixer
            2.1.3. BC-ResNet
            2.1.4. Keyword Transformer (KWT)
            2.1.5. Wav2KWS
        2.2. End-to-End ASR Models
            2.2.1. Speech-Transformer
            2.2.2. Hybrid CTC/Attention End-to-End ASR Model
        2.3. Data Augmentation for Speech Recognition
            2.3.1. SpecAugment
            2.3.2. SpecSwap
            2.3.3. MixSpeech
            2.3.4. Speed Perturbation
            2.3.5. Aligned Data Augmentation (ADA)
        2.4. Adversarial Attack/Training
    3. Proposed Methods
        3.1. ConvKWS
        3.2. WavCutMix
            3.2.1. Alignment-based replacement
            3.2.2. Equalization-based replacement
        3.3. SpecADV
            3.3.1. Clean stage
            3.3.2. Noise stage
        3.4. Combination of WavCutMix and SpecADV
    4. Experiments
        4.1. Experiment Datasets
            4.1.1. Google Speech Commands V2
            4.1.2. AISHELL-1
            4.1.3. LibriSpeech
        4.2. Experimental Setup
            4.2.1. Experimental setup for KWS
            4.2.2. Experimental setup for ASR
        4.3. KWS Experimental Results
            4.3.1. ConvKWS Experimental Results on Google Speech Commands V2-12
            4.3.2. Ablation Studies of WavCutMix and SpecADV for ConvKWS
            4.3.3. Ablation Studies of ConvKWS Model Architecture
        4.4. ASR Experimental Results
            4.4.1. SpecADV Experimental Results on AISHELL-1
            4.4.2. WavCutMix Experimental Results on AISHELL-1
            4.4.3. Combination Experimental Results on AISHELL-1
            4.4.4. SpecADV Experimental Results on LibriSpeech 100h
            4.4.5. WavCutMix Experimental Results on LibriSpeech 100h
            4.4.6. Combination Experimental Results on LibriSpeech 100h
            4.4.7. LibriSpeech 960h Experimental Results
            4.4.8. Comparison with Other Methods
    5. Conclusion
    6. References

