
Author: 黃勤翔 (Chin-Hsiang Huang)
Thesis Title: 使用深度學習技術取得流行歌樂譜 (Using Deep Learning Techniques to Get Pop Sheet Music)
Advisor: 洪西進 (Shi-Jinn Horng)
Committee: 楊竹星 (Zhu-Xing Yang), 李正吉 (Jung-Gil Lee), 謝仁偉 (Ren-wei Xie), 林祝興 (Zhu-Xing Lin)
Degree: Master (碩士)
Department: 電資學院 - 資訊工程系 (Department of Computer Science and Information Engineering)
Thesis Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese (中文)
Pages: 32
Keywords (in Chinese): 深度學習 (deep learning), 人工智慧 (artificial intelligence), 歌唱轉譜 (singing transcription)
Keywords (in other languages): EfficientNet, Exponential Moving Average
Reference times: Clicks: 228, Downloads: 0

This thesis applies deep learning techniques to the singing transcription task, surpassing the results of the traditional approach based on hidden Markov models, and also identifies the most important evaluation metric for singing transcription. It solves a problem faced by many music lovers who want to play a favorite song but cannot because no sheet music exists. As long as the song can be found on YouTube, or the user has an audio file of it, anyone can use the model proposed in this thesis to transcribe the song and play the main melody sung by the singer. The model outputs a JSON file containing note information as well as a MIDI file whose notes can be listened to directly; using the per-note confidence the model provides in the JSON file, together with listening to the MIDI file, users can correct the predicted score, which makes the model more practical.
The base model used in this thesis is EfficientNetV2. An attention mechanism is added on top of EfficientNetV2 so that the model pays more attention to the note currently being predicted, and EMA processing prevents the model from attending to input received too long ago, improving note-prediction accuracy. With our new model and a recently released larger dataset, the model in this thesis achieves higher accuracy on the singing transcription task than previous deep learning approaches and traditional hidden-Markov-model implementations.


In this thesis, an automatic singing transcription (AST) system based on deep learning techniques is proposed; it outperforms systems based on hidden Markov models, and we also identify the key evaluation criteria for singing transcription. The proposed method helps people who want to play their favorite music but have no sheet music for it. Any song obtained from YouTube or elsewhere can be transcribed into sheet music with the proposed method and then played. The proposed model outputs not only a JSON file containing both note data and prediction confidence, but also a MIDI file that can be listened to directly. Users can combine these two files to correct individual notes, which makes the proposed model more practical.
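The abstract describes a JSON output pairing each note with a prediction confidence, which a user can cross-check against the audible MIDI file. A minimal sketch of that review workflow is shown below; the field names (`onset`, `offset`, `pitch`, `confidence`) and the 0.5 threshold are illustrative assumptions, not the thesis's actual schema.

```python
import json

# Hypothetical per-note records as the model might emit them: start/end time
# in seconds, MIDI pitch number, and the model's prediction confidence.
notes = [
    {"onset": 0.00, "offset": 0.48, "pitch": 60, "confidence": 0.93},
    {"onset": 0.48, "offset": 0.96, "pitch": 62, "confidence": 0.41},
]

# Flag low-confidence notes for manual correction while auditioning the
# MIDI rendering, as the abstract suggests users do.
suspect = [n for n in notes if n["confidence"] < 0.5]
print(json.dumps(suspect))
```

In practice the threshold would be tuned by the user: a stricter cutoff flags more notes for review, trading effort for accuracy.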
EfficientNetV2 is the backbone of the proposed model, and an attention module is added so that the model focuses more on the note currently being predicted. By applying an exponential moving average (EMA) to the input data, the proposed model does not attend to data from too far in the past. Both the attention module and the EMA step improve the accuracy of the predicted notes. With our proposed model and a larger, recently published dataset, our model scores higher on various metrics than past deep learning models and the traditional hidden Markov model in automatic singing transcription research.
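The EMA step described above can be sketched in a few lines. This is only an illustration of the smoothing recurrence, assuming a scalar feature per frame; the decay factor `alpha` and its value are assumptions, as the abstract does not specify how the EMA is parameterized inside the model.

```python
def ema_smooth(frames, alpha=0.9):
    """Exponential moving average over a sequence of frame features.

    Each output mixes the current frame with an exponentially decaying
    memory of earlier frames, so information from long ago fades out --
    the effect the abstract attributes to the EMA step.
    """
    smoothed = []
    state = 0.0
    for x in frames:
        state = alpha * x + (1 - alpha) * state
        smoothed.append(state)
    return smoothed

# With alpha = 0.5, an impulse decays by half each frame.
print(ema_smooth([1.0, 0.0, 0.0, 0.0], alpha=0.5))
```

A larger `alpha` weights the current frame more heavily and forgets history faster, which is the knob that controls how far back the model can "see".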

Table of Contents
Abstract (Chinese)
Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
 1.1 Research Background and Motivation
 1.2 Related Work
Chapter 2 Environment and Hardware
 2.1 Software Environment
 2.2 Hardware
Chapter 3 Background
 3.1 Introduction to Deep Learning
  3.1.1 Deep Neural Networks
  3.1.2 Convolutional Neural Networks
 3.2 Introduction to Attention
 3.3 Introduction to MIDI
 3.4 Introduction to Source Separation
 3.5 Introduction to EfficientNet
 3.6 Introduction to the Constant-Q Transform
Chapter 4 Model Architecture
 4.1 EfficientNetV2, a Leading Convolutional Neural Network
 4.2 Adding an Attention Module to EfficientNetV2
 4.3 Adding an EMA Module to EfficientNetV2
Chapter 5 Experimental Design
 5.1 Data Preprocessing
 5.2 Model Selection
 5.3 Experimental Details
 5.4 Results Comparison
 5.5 Ablation Study
Chapter 6 Conclusion and Future Work
References

List of Figures
Figure 1 Forward pass and reconstruction in a restricted Boltzmann machine
Figure 2 Deep belief network built by stacking restricted Boltzmann machines
Figure 3 The convolution concept
Figure 4 The max-pooling concept
Figure 5 Differences between the constant-Q transform and the short-time Fourier transform
Figure 6 EfficientNetV2 compared with similar models
Figure 7 MB convolution versus fused MB convolution
Figure 8 Illustration of sampling points
Figure 9 Experiment flowchart

List of Tables
Table 1 Software environment
Table 2 Hardware
Table 3 Results comparison
Table 4 Ablation study


Full text public date: 2025/09/08 (Intranet)
Full text public date: 2032/09/08 (Internet)
Full text public date: 2032/09/08 (National Library)