
Graduate Student: Yang Yi (易洋)
Thesis Title: Music Genre Classification Using CNN-based Feature Extractor and Compound Classifier Structures (使用CNN為基礎之特徵擷取器及複合分類器結構之音樂曲風分類)
Advisors: Hung-Yan Gu (古鴻炎), Kuan-Yu Chen (陳冠宇)
Committee Members: Hsin-Min Wang (王新民), Bor-Shen Lin (林伯慎)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Graduation Academic Year: 107 (2018-2019)
Language: Chinese
Keywords (Chinese): music genre classification, multi-layer classifier structure, classifier-mixing structure, CNN feature extraction, mel-scale spectrum, mel-frequency cepstral coefficients
Keywords (English): music genre classification, hierarchical classifier structure, classifier-mixing structure, CNN feature extractor, mel-frequency spectrum, MFCC
Abstract (Chinese): This thesis studies two compound classifier structures, a hierarchical structure and a classifier-mixing structure, for improving the accuracy of music genre classification. First, four kinds of frequency-domain features are analyzed from the input music as basic acoustic features: the mel-frequency spectrum, the mel-frequency cepstrum, the modulation spectrum, and the percussive spectrum. Then, mean and standard deviation, principal component analysis, and convolutional neural networks are used to reduce the dimensionality of the basic acoustic features and to extract advanced acoustic feature vectors. Next, the advanced acoustic feature vectors obtained in these ways are used to train four basic classifiers: support vector machine, k-nearest neighbors, Gaussian mixture model, and multilayer perceptron. Through experiments, we then select, for each of the four basic acoustic features, the combination of dimension-reduction method and basic classifier with the best classification performance, and use these combinations to construct a hierarchical classifier structure. In addition, we train one CNN-based expert network for each of the four basic acoustic features and use these experts to construct a classifier-mixing structure. The genre classification experiments show that both compound classifier structures improve classification accuracy on different datasets; the hierarchical classifier structure achieves up to 87.1% accuracy, and the classifier-mixing structure achieves up to 88.0%.
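As a rough illustration of the front end described above, the sketch below computes three of the four basic acoustic features (mel-frequency spectrum, mel-frequency cepstrum, and a percussive spectrum obtained via harmonic/percussive separation; the modulation spectrum is omitted) and applies the mean-and-standard-deviation reduction. It assumes librosa and NumPy; the sampling rate, FFT size, hop length, and filter counts are illustrative assumptions, not the settings used in the thesis.

    # Minimal sketch of basic-acoustic-feature extraction and the
    # mean/std dimension reduction; parameter values are illustrative.
    import numpy as np
    import librosa

    def extract_basic_features(path, sr=22050, n_fft=2048, hop=512,
                               n_mels=128, n_mfcc=20):
        y, sr = librosa.load(path, sr=sr, mono=True)
        # Mel-frequency spectrum (log power), shape (n_mels, n_frames)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)
        # Mel-frequency cepstral coefficients, shape (n_mfcc, n_frames)
        mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=n_mfcc)
        # Percussive component via harmonic/percussive separation,
        # then its log-power mel spectrum
        _, y_perc = librosa.effects.hpss(y)
        perc = librosa.power_to_db(
            librosa.feature.melspectrogram(y=y_perc, sr=sr, n_fft=n_fft,
                                           hop_length=hop, n_mels=n_mels))
        return log_mel, mfcc, perc

    def mean_std_reduce(feature):
        # Collapse the time axis into per-band mean and standard deviation,
        # yielding one fixed-length advanced feature vector per clip.
        return np.concatenate([feature.mean(axis=1), feature.std(axis=1)])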


Abstract (English): In this thesis, we study two types of compound classifier structures, a hierarchical structure and a classifier-mixing structure, in order to improve the accuracy of music genre classification. First, four kinds of spectral features are extracted from an input music signal: the mel-frequency spectrum, the mel-frequency cepstrum, the modulation spectrum, and the percussive spectrum. These are taken as basic acoustic features (BAF). Then, three dimension-reduction methods, mean and standard deviation, principal component analysis (PCA), and a convolutional neural network (CNN), are used to derive advanced acoustic features (AAF) from the BAF. Next, the AAF are used to train four types of basic classifiers: support vector machine, k-nearest neighbors, Gaussian mixture model, and multilayer perceptron. By combining different AAF and basic classifiers, we pick the best-performing combination of AAF and basic classifier for each kind of BAF, and the four selected combinations are used to construct a hierarchical classifier structure. In addition, each kind of BAF is used to train a corresponding CNN-based expert network, and the four expert networks are combined to form a classifier-mixing structure. In music genre classification experiments on different datasets, both compound classifier structures yield considerable improvements in classification accuracy: the hierarchical classifier structure achieves up to 87.1% accuracy, whereas the classifier-mixing structure achieves a higher 88.0%.
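To make the classifier side concrete, here is a minimal, hedged sketch (not the thesis code) of evaluating one dimension-reduction-plus-classifier combination (PCA followed by an SVM) on advanced acoustic feature vectors, together with a simple late-fusion stand-in for the classifier-mixing idea that averages the class posteriors of per-feature expert classifiers. It assumes scikit-learn and NumPy; the PCA size, SVM settings, train/test split, and the averaging rule are assumptions for illustration, not the configurations reported above.

    # Hedged sketch: one PCA + SVM combination, plus posterior averaging
    # as an illustrative stand-in for the classifier-mixing structure.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def evaluate_combination(X, y, n_components=30):
        # X: (n_clips, feature_dim) advanced acoustic features; y: genre labels.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=0)
        model = make_pipeline(StandardScaler(),
                              PCA(n_components=n_components),
                              SVC(kernel='rbf', probability=True))
        model.fit(X_train, y_train)
        return model, model.score(X_test, y_test)

    def mix_experts(experts, X_per_feature):
        # Average the class posteriors of the per-feature expert classifiers
        # and take the most probable class (simple late fusion, not the
        # gating network used in the thesis).
        probs = [m.predict_proba(X) for m, X in zip(experts, X_per_feature)]
        return np.mean(probs, axis=0).argmax(axis=1)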

Table of Contents:
Abstract (English); Abstract (Chinese)
Chapter 1 Introduction: 1.1 Research Motivation; 1.2 Research Methods
Chapter 2 Literature Review: 2.1 Representation of Audio Data (2.1.1 Sample-level Features; 2.1.2 Frame-level Features); 2.2 Basic Classification Methods; 2.3 Methods for Improving Classification Performance; 2.4 Literature Directly Related to This Thesis (2.4.1 Extracting Advanced Acoustic Features with a CNN; 2.4.2 Application of the Modulation Spectrum to Music Genre Classification)
Chapter 3 Basic Acoustic Features and Dimension-Reduction Methods: 3.1 Dataset Preparation (3.1.1 GTZAN Dataset; 3.1.2 FMA Dataset); 3.2 Audio Preprocessing; 3.3 Basic Acoustic Features (3.3.1 Mel-Frequency Spectrum; 3.3.2 Mel-Frequency Cepstrum; 3.3.3 Modulation Spectrum; 3.3.4 Percussive Spectrum; 3.3.5 Implementation); 3.4 Feature Dimension-Reduction Methods (3.4.1 Mean and Standard Deviation; 3.4.2 Principal Component Analysis; 3.4.3 Convolutional Neural Network)
Chapter 4 Basic Classifiers: 4.1 Gaussian Mixture Model; 4.2 Support Vector Machine; 4.3 K-Nearest Neighbors; 4.4 Multilayer Perceptron
Chapter 5 Compound Classifier Structures: 5.1 Hierarchical Classifier Structure; 5.2 Classifier-Mixing Structure
Chapter 6 Experiments and Discussion: 6.1 Experiments with Basic Classifiers (6.1.1 Normalization; 6.1.2 Feature Dimension Reduction; 6.1.3 Genre Classification Accuracy); 6.2 Experiments with the Hierarchical Structure; 6.3 Experiments with the Mixing Structure (6.3.1 Experiments with Expert Classifiers; 6.3.2 Experiments with Expert-Network Mixing)
Chapter 7 Conclusion
References

