
Graduate Student: Chien Peng (彭謙)
Thesis Title: Implementing Popular Music to Piano Music Transformation Using Transformer (使用Transformer實現流行音樂轉換鋼琴音樂)
Advisor: Shi-Jinn Horng (洪西進)
Committee Members: Chu-Hsing Lin (林祝興), Chu-Hsing Yang (楊竹興), Cheng-Chi Lee (李正吉), Cheng-An Yen (顏成安)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2023
Academic Year of Graduation: 111 (2022-2023)
Language: Chinese
Number of Pages: 29
Keywords: Transformer, deep learning, data processing, music transformation
  • This thesis aims to solve the problem that music lovers cannot play the songs they like because no sheet music is available. To this end, we apply deep learning and study the singing-transcription task with a Transformer model as the foundation. Compared with traditional approaches that use Hidden Markov Models, our model achieves better results, and it also surpasses the results of the earlier EfficientNetV2 model. In this work we train on t5-small. With our model, as long as the target song has a related resource or audio file on YouTube, any user can convert it into playable sheet music and thereby reproduce the singer's main melody. In addition, the model can output a playable MIDI file; by actually listening to the MIDI file, users can proofread the sheet music the model generates. This approach raises users' confidence in the model's predictions and ensures that the generated sheet music meets the requirements of actual performance. In summary, this thesis shows that a Transformer model achieves good results on the singing-transcription task, reaching an accuracy of 57.9 when tested under the COnPOff metric, and provides music lovers with an effective way to obtain sheet music for their favorite songs.
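
    The thesis's own code and checkpoints are not public, but Pop2Piano (Choi and Lee, 2023), a closely related T5-based pop-audio-to-piano system, ships with the Hugging Face transformers library. The following is a minimal sketch of the same audio-in, MIDI-out pipeline described above, assuming the transformers Pop2Piano port (v4.31+) and the sweetcocoa/pop2piano checkpoint; it illustrates the approach, not the thesis's own model.

        import librosa
        from transformers import Pop2PianoForConditionalGeneration, Pop2PianoProcessor

        # Load the pop recording (e.g., downloaded from YouTube as a local file).
        audio, sr = librosa.load("song.wav", sr=44100)

        # T5-style encoder-decoder that maps audio features to note-event tokens.
        model = Pop2PianoForConditionalGeneration.from_pretrained("sweetcocoa/pop2piano")
        processor = Pop2PianoProcessor.from_pretrained("sweetcocoa/pop2piano")

        # Feature extraction, token generation, and token-to-note decoding.
        inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")
        tokens = model.generate(input_features=inputs["input_features"], composer="composer1")
        midi = processor.batch_decode(token_ids=tokens,
                                      feature_extractor_output=inputs)["pretty_midi_objects"][0]

        # Write a playable MIDI file so the user can audit the transcription by ear.
        midi.write("song.mid")

    Exporting to MIDI is what makes the proofreading loop possible: the user renders the file, compares it with the original vocal melody, and corrects any mistranscribed notes.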


    This thesis aims to address the issue faced by music enthusiasts who are unable to play their favorite songs due to a lack of sheet music. To tackle this problem, we employed deep learning techniques and conducted research on the task of singing transcription using a Transformer model as our foundation. Our model achieved superior results compared to traditional approaches that rely on Hidden Markov Models, and it surpassed the performance of the previous EfficientNetV2 model. In our study, we trained the model using the t5-small architecture. With our model, any user can convert a desired song into playable sheet music as long as relevant resources or music files are available on YouTube, allowing them to recreate the main melody sung by the vocalist. Additionally, our model can generate MIDI files that can be played back: users can listen to the MIDI files and correct the generated sheet music based on the actual playback. This approach enhances user confidence in the model's predictions and ensures that the generated sheet music meets the requirements of actual performance. In conclusion, this thesis demonstrates the promising results achieved using a Transformer model for the task of singing transcription, reaching an accuracy of 57.9 under the COnPOff metric, and provides music lovers with an effective method to obtain sheet music for their favorite tracks.
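
    The 57.9 figure reported above is a COnPOff-style score, under which an estimated note counts as correct only if its onset, pitch, and offset all match a reference note within tolerances. Below is a minimal sketch of such an evaluation using the mir_eval package; the thesis does not state which implementation it used, and the note lists here are hypothetical toy data.

        import numpy as np
        import mir_eval

        # Hypothetical reference and estimated notes: (onset, offset) intervals in
        # seconds plus pitches in Hz (mir_eval expects Hz, not MIDI note numbers).
        ref_intervals = np.array([[0.00, 0.50], [0.50, 1.00]])
        ref_pitches   = np.array([261.63, 293.66])            # C4, D4
        est_intervals = np.array([[0.02, 0.49], [0.50, 1.08]])
        est_pitches   = np.array([261.63, 293.66])

        # COnPOff tolerances: onset within 50 ms, pitch within 50 cents, offset
        # within 20% of the reference duration (or 50 ms, whichever is larger).
        precision, recall, f1, _ = mir_eval.transcription.precision_recall_f1_overlap(
            ref_intervals, ref_pitches, est_intervals, est_pitches,
            onset_tolerance=0.05, pitch_tolerance=50.0, offset_ratio=0.2)
        print(f"COnPOff  P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")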

    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1: Introduction
        1.1 Research Background and Motivation
        1.2 Related Work
    Chapter 2: Environment Configuration and Hardware
        2.1 Environment Configuration
        2.2 Hardware
    Chapter 3: Deep Learning Model and Preprocessing
        3.1 Transformer
            3.1.1 Attention
            3.1.2 BERT
            3.1.3 Transformer t5
        3.2 Preprocessing
            3.2.1 MIDI
            3.2.2 Converting Audio to MIDI
            3.2.3 Aligning Piano and Pop Music
            3.2.4 Introduction to Source Separation
            3.2.5 Adjusting Note Lengths
            3.2.6 Constant-Q Transform
    Chapter 4: Model Architecture
        4.1 Input and Output
        4.2 Model Architecture
    Chapter 5: Experiments
        5.1 Dataset
        5.2 Model Selection
        5.3 Experimental Details
        5.4 Comparison of Results
        5.5 Ablation Study
    Chapter 6: Conclusion and Future Work
    References


    Full-text release date: 2053/08/21 (off-campus network)
    Full-text release date: 2053/08/21 (National Central Library: Taiwan NDLTD system)