| Field | Value |
|---|---|
| Graduate Student | 黃勤翔 (Chin-Hsiang Huang) |
| Thesis Title | 使用深度學習技術取得流行歌樂譜 (Using Deep Learning Techniques to Get Pop Sheet Music) |
| Advisor | 洪西進 (Shi-Jinn Horng) |
| Committee Members | 楊竹星 (Zhu-Xing Yang), 李正吉 (Jung-Gil Lee), 謝仁偉 (Ren-wei Xie), 林祝興 (Zhu-Xing Lin) |
| Degree | Master (碩士) |
| Department | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering |
| Publication Year | 2022 |
| Graduation Academic Year | 110 |
| Language | Chinese |
| Pages | 32 |
| Chinese Keywords | 深度學習 (deep learning), 人工智慧 (artificial intelligence), 歌唱轉譜 (singing transcription) |
| Foreign Keywords | EfficientNet, Exponential Moving Average |
This thesis applies deep learning techniques to automatic singing transcription (AST), outperforming the traditional approach based on hidden Markov models, and identifies the most important evaluation metric for the task. It addresses a common frustration: many music lovers want to play a favorite song but cannot because no sheet music exists for it. As long as a song can be found on YouTube, or its audio file is available, anyone can use the proposed model to transcribe it and play the main melody sung by the singer. The model outputs a JSON file containing the note information, including the model's confidence for each predicted note, and a MIDI file whose notes can be listened to directly. By checking the per-note confidence in the JSON file and listening to the MIDI file, users can correct the predicted score, which makes the proposed model considerably more practical.
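As a rough illustration of this correction workflow, the sketch below converts such a JSON note list into a MIDI file, setting aside low-confidence notes for manual review. The JSON field names (`pitch`, `onset`, `offset`, `confidence`), the file names, and the 0.5 threshold are assumptions for illustration only; the thesis does not specify the actual schema.

```python
# Hypothetical sketch: render the model's JSON note output to MIDI,
# skipping notes whose predicted confidence falls below a threshold.
import json
import pretty_midi

def json_to_midi(json_path, midi_path, min_confidence=0.5):
    with open(json_path) as f:
        notes = json.load(f)  # assumed: a list of note dicts

    pm = pretty_midi.PrettyMIDI()
    inst = pretty_midi.Instrument(program=0)  # acoustic grand piano
    for n in notes:
        if n["confidence"] < min_confidence:
            continue  # leave uncertain notes for the user to review
        inst.notes.append(pretty_midi.Note(
            velocity=100,
            pitch=int(n["pitch"]),    # MIDI note number
            start=float(n["onset"]),  # seconds
            end=float(n["offset"]),
        ))
    pm.instruments.append(inst)
    pm.write(midi_path)

json_to_midi("song_notes.json", "song_melody.mid")
```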
The backbone of the proposed model is EfficientNetV2, extended with an attention module so that the model focuses on the note it is currently predicting. The input is further processed with an exponential moving average (EMA) so that the model does not keep receiving information from inputs too far in the past, which improves the accuracy of the predicted notes. Combining this new model with a recently published, larger dataset, the proposed method scores higher on the various metrics of automatic singing transcription than both previous deep learning models and the traditional hidden Markov model.
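The abstract does not state exactly where in the pipeline the EMA is applied; the sketch below shows the standard recursive form applied to a sequence of feature frames, which matches the stated intent that information from long ago decays away. The smoothing factor `alpha = 0.3` is an illustrative choice, not a value from the thesis.

```python
import numpy as np

def ema(frames, alpha=0.3):
    """Exponential moving average over a (T, D) sequence of feature frames.

    s[t] = alpha * x[t] + (1 - alpha) * s[t-1], so an input k steps in
    the past is weighted by alpha * (1 - alpha)**k: old information
    decays exponentially instead of accumulating indefinitely.
    """
    smoothed = np.empty_like(frames, dtype=float)
    smoothed[0] = frames[0]
    for t in range(1, len(frames)):
        smoothed[t] = alpha * frames[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# e.g. smooth a spectrogram of 1000 frames x 128 mel bins
spec = np.random.rand(1000, 128)
spec_ema = ema(spec, alpha=0.3)
```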