
Author: HSIUNG WEI WEI (熊蓶蓶)
Title: Deep Learning for Audio Denoising, Identification, Clustering, and Dimensionality Reduction (深度學習應用於音訊除噪、辨識、分群與降維)
Advisor: ChingShun Lin (林敬舜)
Committee: Wei-Mei Chen (陳維美), Chang Hong Lin (林昌鴻), Huan-Chun Wang (王煥宗), ChingShun Lin (林敬舜)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2017
Academic year of graduation: 105 (ROC calendar)
Language: Chinese
Pages: 51
Keywords (Chinese): deep learning, noise removal, pattern identification, reverberation clustering, stacked autoencoder, multiposition room impulse response
Keywords (English): Deep learning, Audio denoising, Pattern identification, Autoencoder clustering, Dimensionality reduction, Multiposition room impulse response
  • In recent years, as computing power has grown dramatically, deep-learning research has flourished. In this study, we propose applications of deep neural networks to audio denoising, pattern identification, reverberation clustering, and dimensionality reduction. For audio denoising, we extract frequency-domain features from a priori clean speech and noisy speech and feed them to a recurrent neural network (RNN) for noise removal. Because the RNN is a time-series model that can learn dependencies between states at different time steps, it is well suited to speech signals. In denoising experiments on the TIMIT corpus, a deep long short-term memory (LSTM) RNN raises the signal-to-noise ratio more than the other deep neural networks. For bird-call identification, we design a scaled short-time time-frequency feature that highlights the visual patterns of the acoustic data in the spectrogram before it is passed to a deep neural network. With an RNN classifier, this feature achieves up to 91% accuracy across as many as 360 bird-call categories. In addition, the autoencoder can reduce data dimensionality in an unsupervised fashion. In the dimensionality-reduction and clustering experiments, better results are obtained when multiposition room impulse responses are first reduced in dimension and then clustered along the omnidirectional, horizontal, and vertical directions. For the direct sound, the clustering results in all three directions are related to the distance from the sound source, with responses of similar source intensity grouped together. For reflections and reverberation, the autoencoder is not only an unsupervised dimensionality-reduction method but also helps to reveal the major and minor virtual sound sources.


    With massive amounts of computational power now available, deep learning has been widely studied in recent years. In this study, we propose several systems for audio denoising, identification, clustering, and dimensionality reduction based on deep neural networks (DNNs). First, we use a recurrent neural network (RNN) for audio noise removal based on a priori frequency-domain representations of both clean and noisy speech. Because it models sequential data, the RNN provides a time-series structure with lags that is better suited to audio modeling than other DNNs. An RNN with long short-term memory (LSTM) modules raises the signal-to-noise ratio on the noisy TIMIT corpus. Moreover, we propose a scaled short-time Fourier transform (SSTFT) that provides effective features for avian call identification. The scaled representation enhances the acoustic characteristics of the visual patterns fed into the RNN classifier, which achieves up to a 91% hit rate when more than 360 avian species are involved. On the other hand, we use autoencoders for unsupervised dimensionality reduction on a multiposition room impulse response database, including horizontal, elevational, and omnidirectional subsets, to find similarities among room responses. For direct-sound clustering, the distribution mainly depends on the location of the sound source, whereas for reflection and reverberation clustering, the unsupervised stacked autoencoder (SAE) not only provides dimensionality reduction but also reveals the major and minor virtual sound sources.
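    The scaled time-frequency feature described above can be pictured as a log-compressed magnitude spectrogram. The following is only a minimal numpy sketch of that idea; the frame length, hop size, Hann window, and log1p scaling are illustrative assumptions, not the thesis's exact SSTFT definition.

    ```python
    import numpy as np

    def scaled_stft_features(signal, frame_len=256, hop=128):
        """Log-scaled magnitude spectrogram as a 2-D feature map.

        A sketch of 'scaled' short-time spectral features for a pattern
        classifier; window and scaling choices are assumptions here.
        """
        window = np.hanning(frame_len)
        n_frames = 1 + (len(signal) - frame_len) // hop
        frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                           for i in range(n_frames)])
        spectrum = np.fft.rfft(frames, axis=1)   # per-frame FFT
        magnitude = np.abs(spectrum)
        return np.log1p(magnitude)               # compress dynamic range

    # Example: a 1 kHz tone sampled at 8 kHz
    fs = 8000
    t = np.arange(fs) / fs
    features = scaled_stft_features(np.sin(2 * np.pi * 1000 * t))
    # Bin spacing is fs/frame_len = 31.25 Hz, so the tone peaks at bin 32
    peak_bin = int(np.argmax(features.mean(axis=0)))
    ```

    Each row of the feature map is one time frame, so the 2-D array can be fed to a recurrent classifier frame by frame.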

    1 Introduction
      1.1 Deep Learning Background
      1.2 Related Research
      1.3 Chapter Organization
    2 Denoising
      2.1 Recurrent Neural Network
      2.2 Tricks for DNN Learning
    3 Identification
      3.1 Scaled Time-Frequency Representation
    4 Dimensionality Reduction and Clustering
    5 Experimental Results
    6 Conclusion and Future Research
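    The dimensionality-reduction-then-clustering pipeline of Chapter 4 can be sketched in a few lines of numpy. This is not the thesis's stacked autoencoder: it uses a single tied-weight linear bottleneck trained by gradient descent, a two-cluster Lloyd's loop, and synthetic vectors standing in for the room-impulse-response data set; every shape and hyperparameter below is an assumption.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for a room-impulse-response data set:
    # two groups of 40-dimensional responses around different prototypes.
    proto_a, proto_b = rng.normal(size=40), rng.normal(size=40)
    data = np.vstack([proto_a + 0.1 * rng.normal(size=(50, 40)),
                      proto_b + 0.1 * rng.normal(size=(50, 40))])

    # Tied-weight linear autoencoder with a 2-D bottleneck, trained by
    # gradient descent on the squared reconstruction error.
    W = 0.01 * rng.normal(size=(40, 2))
    for _ in range(500):
        code = data @ W                      # encode: 100 x 2
        recon = code @ W.T                   # decode: 100 x 40
        err = recon - data
        # gradient of ||recon - data||^2 w.r.t. the tied weights W
        grad = data.T @ err @ W + err.T @ data @ W
        W -= 1e-3 * grad / len(data)

    codes = data @ W                         # low-dimensional representation

    # Two-means clustering on the codes (a few Lloyd iterations),
    # seeded with one point from each end of the data set.
    centers = codes[[0, -1]].copy()
    for _ in range(10):
        dists = ((codes[:, None] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        centers = np.stack([codes[labels == k].mean(0) for k in (0, 1)])
    ```

    Because the bottleneck is trained without labels, the grouping that emerges in the codes is purely unsupervised, mirroring how the thesis clusters impulse responses after autoencoder compression.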

    [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. of Int. Conf. on Neural Information Processing Systems, pp. 1097–1105, Dec. 2012.
    [2] L. R. Medsker and L. C. Jain, Recurrent Neural Networks: Design and Applications, 1st ed. CRC Press, 2001.
    [3] H. Bourlard and Y. Kamp, “Auto-association by multilayer perceptrons and singular value decomposition,” Biological Cybernetics, vol. 59, no. 4, pp. 291–294, Jan. 1988.
    [4] D. P. Mandic and J. A. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. John Wiley, 2001.
    [5] F. Briggs, R. Raich, and X. Z. Fern, “Audio classification of bird species: A statistical manifold approach,” in Proc. of IEEE Int. Conf. on Data Mining, pp. 51–60, Dec. 2009.
    [6] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. of Int. Conf. on Machine Learning, pp. 1096–1103, Jul. 2008.
    [7] X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in Proc. of Int. Conf. on Machine Learning, pp. 97–110, Jun. 2011.
    [8] T.-E. Chen, S.-I. Yang, L.-T. Ho, K.-H. Tsai, Y.-H. Chen, Y.-F. Chang, Y.-H. Lai, S.-S. Wang, Y. Tsao, and C.-C. Wu, “S1 and S2 heart sound recognition using deep neural networks,” IEEE Trans. on Biomedical Engineering, vol. 64, no. 2, pp. 372–380, Apr. 2017.
    [9] M. A. Acevedo, “Automated classification of bird and amphibian calls using machine learning: A comparison of methods,” Ecological Informatics, vol. 4, no. 4, pp. 206–214, Sep. 2009.
    [10] L. Cohen, Time-Frequency Analysis. Prentice-Hall, 1995.
    [11] M. Marcarini, G. A. Williamson, and L. de Sisternes Garcia, “Comparison of methods for automated recognition of avian nocturnal flight calls,” in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 2029–2032, Mar. 2008.
    [12] A. L. McIlraith and H. C. Card, “Birdsong recognition using backpropagation and multivariate statistics,” IEEE Trans. on Signal Processing, vol. 45, no. 11, pp. 2740–2748, Nov. 1997.
    [13] S. Bharitkar and C. Kyriakakis, “Visualization of multiple listener room acoustic equalization with the Sammon map,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 542–551, Jan. 2007.
    [14] F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu, “Learning deep representations for graph clustering,” in Proc. of AAAI Conf. on Artificial Intelligence, pp. 1293–1299, Jul. 2014.
    [15] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and Intelligent Laboratory Systems, vol. 2, no. 1-3, pp. 37–52, Aug. 1987.
    [16] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal component analysis,” in Proc. of Int. Conf. on Artificial Neural Networks, pp. 583–588, Oct. 1997.
    [17] M. Balasubramanian and E. L. Schwartz, “The Isomap algorithm and topological stability,” Science, vol. 295, no. 5552, p. 7, Jan. 2002.
    [18] J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, Dec. 2000.
    [19] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Proc. of Int. Conf. on Neural Information Processing Systems, pp. 585–591, Dec. 2002.
    [20] J. W. Sammon, “A nonlinear mapping for data structure analysis,” IEEE Trans. on Computers, vol. C-18, no. 5, pp. 401–409, May 1969.
    [21] X. Chen, A. Eversole, G. Li, D. Yu, and F. Seide, “Pipelined back-propagation for context-dependent deep neural networks,” in Proc. of the Ann. Conf. of the International Speech Communication Association, pp. 26–29, Sep. 2012.
    [22] T. Merritt, R. A. J. Clark, Z. Wu, J. Yamagishi, and S. King, “Deep neural network-guided unit selection synthesis,” in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 5145–5149, Mar. 2016.
    [23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
    [24] A. Graves, A. R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 6645–6649, May 2013.
    [25] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proc. of Int. Conf. on Machine Learning, pp. 1310–1318, Jun. 2013.
    [26] G. B. Orr and K. R. Müller, Neural Networks: Tricks of the Trade. Springer, 1998.
    [27] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, Jul. 2012.
    [28] L. Deng, D. Yu, and J. Platt, “Scalable stacking and learning for building deep architectures,” in Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 2133–2136, Mar. 2012.
    [29] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, Dec. 2010.
    [30] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Proc. of Int. Conf. on Neural Information Processing Systems, pp. 153–160, Dec. 2006.
    [31] S. Haykin, Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, 1994.
    [32] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1,” [Online] Available: https://catalog.ldc.upenn.edu/ldc93s1.
    [33] Y. Hu and P. Loizou, “Subjective evaluation and comparison of speech enhancement algorithms,” Speech Communication, vol. 49, no. 7, pp. 588–601, Jul. 2007.
    [34] T. Kabaya and M. Matsuda, The Songs and Calls of 420 Birds in Japan. Shogakkan, 2001.
    [35] A. Benyassine, E. Shlomot, H.-Y. Su, and E. Yuen, “A robust low complexity voice activity detection algorithm for speech communication systems,” in Proc. of IEEE Workshop on Speech Coding for Telecommunications, pp. 97–98, Sep. 1997.
    [36] C. C. Chang and C. J. Lin, “LIBSVM: A Library for Support Vector Machines,” [Online] Available: http://www.csie.ntu.edu.tw/cjlin/libsvm

    Full-text release date: 2022/08/08 (campus network)
    Full-text release date: not authorized for public release (off-campus network)
    Full-text release date: not authorized for public release (National Central Library: Taiwan thesis and dissertation system)