
Author: Chin-Hsiang Huang (黃勤翔)
Thesis Title: Using Deep Learning Techniques to Get Pop Sheet Music (使用深度學習技術取得流行歌樂譜)
Advisor: Shi-Jinn Horng (洪西進)
Committee: Zhu-Xing Yang (楊竹星), Jung-Gil Lee, Ren-wei Xie, Zhu-Xing Lin
Degree: Master's (碩士)
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science (電資學院 - 資訊工程系)
Thesis Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese
Pages: 32
Keywords (in Chinese): 深度學習 (deep learning), 人工智慧 (artificial intelligence), 歌唱轉譜 (singing transcription)
Keywords (in other languages): EfficientNet, Exponential Moving Average
Reference times: Clicks: 228; Downloads: 0


In this thesis, an automatic singing transcription (AST) system based on deep learning techniques is proposed; it outperforms systems based on the hidden Markov model, and we also specify evaluation criteria for singing transcription. The proposed method serves people who would like to play their favorite music but have no sheet music: any song or piece of music obtained from YouTube or elsewhere can be transcribed into sheet music with the proposed method and then played. The proposed model outputs not only a JSON file containing both the note data and the prediction confidence, but also a MIDI file that can be listened to immediately. Users can use these two files to fix incorrect notes, which makes the proposed model more practical.
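The two output files described above can be sketched as follows. This is a minimal illustration, not the thesis's actual code: the field names (`pitch`, `onset`, `offset`, `confidence`) and the review threshold are assumptions, chosen only to show how a user could use the per-note confidence to find notes worth fixing by hand.

```python
import json

# Hypothetical predicted notes: MIDI pitch number, onset/offset in
# seconds, and the model's prediction confidence for each note.
notes = [
    {"pitch": 60, "onset": 0.00, "offset": 0.48, "confidence": 0.97},
    {"pitch": 62, "onset": 0.50, "offset": 0.95, "confidence": 0.91},
    {"pitch": 64, "onset": 1.00, "offset": 1.40, "confidence": 0.42},
]

def flag_uncertain(notes, threshold=0.5):
    """Return the notes whose confidence falls below the threshold,
    so a user can review and correct them by hand."""
    return [n for n in notes if n["confidence"] < threshold]

json_out = json.dumps(notes, indent=2)  # content of the JSON output file
to_review = flag_uncertain(notes)       # notes worth double-checking
```

The MIDI file would be rendered from the same note list (e.g. with a MIDI library), so the user can listen to the transcription while inspecting the JSON.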
EfficientNetV2 is the backbone of the proposed model, and an attention module is added so that the model focuses more on nearby predicted note data. By applying an exponential moving average (EMA) to the input data, the model avoids attending to data from too far in the past. Both the attention module and the EMA measure improve the correctness of the predicted notes. Using the proposed model together with a larger dataset published recently, our model achieves higher scores on various metrics than previous deep learning models and the traditional hidden Markov model in automatic singing transcription research.
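The EMA measure described above can be sketched as follows. This is a minimal illustration under assumed conditions (scalar input frames, a hypothetical decay factor `alpha`), not the thesis's actual implementation: each frame is blended with an exponentially decaying running average, so frames far in the past contribute almost nothing.

```python
def exponential_moving_average(frames, alpha=0.3):
    """Smooth a sequence of input values with an EMA.

    alpha controls the decay: a higher alpha forgets old frames
    faster, so distant past data stops influencing the output.
    """
    smoothed = []
    current = frames[0]  # initialize with the first frame
    for x in frames:
        current = alpha * x + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

smoothed = exponential_moving_average([0.0, 1.0, 1.0, 1.0])
```

In the thesis the same idea is applied to the model's input features; here a plain list of numbers stands in for those features.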

Table of Contents
Abstract (in Chinese)
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1: Introduction
  1.1 Research Background and Motivation
  1.2 Related Work
Chapter 2: Environment Configuration and Hardware
  2.1 Environment Configuration
  2.2 Hardware
Chapter 3: Background
  3.1 Introduction to Deep Learning
    3.1.1 Deep Neural Networks
    3.1.2 Convolutional Neural Networks
  3.2 Introduction to Attention
  3.3 Introduction to MIDI
  3.4 Introduction to Source Separation
  3.5 Introduction to EfficientNet
  3.6 Introduction to the Constant-Q Transform
Chapter 4: Model Architecture
  4.1 EfficientNetV2, a Standout among Convolutional Neural Networks
  4.2 Adding an Attention Module to EfficientNetV2
  4.3 Adding an EMA Module to EfficientNetV2
Chapter 5: Research Design
  5.1 Data Preprocessing
  5.2 Model Selection
  5.3 Experimental Details
  5.4 Results Comparison
  5.5 Ablation Study
Chapter 6: Conclusion and Future Work
References

List of Figures
Figure 1: Forward pass and reconstruction in a restricted Boltzmann machine
Figure 2: Deep belief network built by stacking restricted Boltzmann machines
Figure 3: Convolution concept
Figure 4: Max pooling concept
Figure 5: Difference between the CQT and the short-time Fourier transform
Figure 6: Comparison of EfficientNetV2 with similar models
Figure 7: Comparison of MB convolution and fused MB convolution
Figure 8: Sampling point diagram
Figure 9: Experiment flowchart

List of Tables
Table 1: Software environment
Table 2: Hardware
Table 3: Results comparison
Table 4: Ablation study


Full text public date 2025/09/08 (Intranet public)
Full text public date 2032/09/08 (Internet public)
Full text public date 2032/09/08 (National library)