| Field | Value |
|---|---|
| Student | 李萱瑜 Hsuan-Yu Li |
| Thesis title | 基於卷積生成對抗網絡進行複音音樂生成 (Study on Convolutional Generative Adversarial Networks Based Polyphonic Music Generation) |
| Advisor | 蘇順豐 Shun-Feng Su |
| Committee members | 陳永耀 Yung-Yao Chen, 花凱龍 Kai-Lung Hua, 陳美勇 Mei-Yung Chen, 林顯易 Hsien-I Lin |
| Degree | Master |
| Department | College of Electrical Engineering and Computer Science, Department of Electrical Engineering |
| Publication year | 2021 |
| Academic year of graduation | 109 (ROC calendar, 2020–2021) |
| Language | English |
| Pages | 78 |
| Keywords | Generative Adversarial Nets, Music Generation, Piano Roll, Polyphonic Music, Deep Learning |
To improve the quality of the generated music and address the problem of note fragmentation, this study examines the model parameters and, in particular, modifies the loss function to suit the requirements of music. Automatic music generation is the process of creating music with minimal human intervention. Because convolutional neural networks are well suited to learning local features, many recent models use generative adversarial networks (GANs) to generate symbolic music. This study adopts a convolutional GAN as the model architecture, divided into three parts: a generator, a refiner, and a discriminator.

Without affecting the training stability of the GAN, a term measuring the distance between the Qualified Note (QN) scores of real and generated music is added to the original loss function to reduce the proportion of fragmented notes. In addition, as the number of iterations increases, the coefficient of this additional term on the generator is gradually reduced, and the reduction stops when the evaluation score approaches that of the training data. In this way, the loss functions of the generator and the discriminator remain in the state best suited to the model. Finally, to bring the evaluation scores as close as possible to those of the training data, activation functions are selected according to the characteristics of the model.

This study uses lastfm_alternative_8b_phrase, filtered from the Lakh Piano Dataset (LPD), for training and testing. The Qualified Note (QN) score of the generated music reaches 84%, while the Polyphony (PP) and Tonal Distance (TD) scores are 46% and 95%, respectively. Among all existing approaches, these scores are the closest to those of the training data.
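The modified generator objective described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the QN threshold of three consecutive time steps follows the MuseGAN-style metric convention, and the exponential decay schedule for the extra coefficient is an assumption, since the abstract only states that the coefficient shrinks as iterations grow.

```python
import numpy as np

def qualified_note_ratio(pianoroll, min_len=3):
    """Fraction of notes lasting at least `min_len` time steps.

    `pianoroll` is a binary array of shape (time, pitch); a note is a
    maximal run of consecutive 1s in one pitch column. The threshold
    `min_len=3` is an assumed MuseGAN-style convention.
    """
    total, qualified = 0, 0
    for pitch in range(pianoroll.shape[1]):
        col = pianoroll[:, pitch]
        # Pad with zeros so every note run has both a start and an end.
        padded = np.concatenate(([0], col, [0]))
        starts = np.flatnonzero(np.diff(padded) == 1)
        ends = np.flatnonzero(np.diff(padded) == -1)
        for s, e in zip(starts, ends):
            total += 1
            if e - s >= min_len:
                qualified += 1
    return qualified / total if total else 0.0

def generator_loss(adv_loss, qn_real, qn_fake, step, decay=1e-4, lam0=1.0):
    """Adversarial loss plus a decaying penalty on the QN-score gap.

    The exponential schedule for `lam` is illustrative; in the thesis
    the coefficient is simply reduced as iterations increase and frozen
    once the evaluation score nears that of the training data.
    """
    lam = lam0 * np.exp(-decay * step)
    return adv_loss + lam * abs(qn_real - qn_fake)
```

In this sketch the QN-distance term steers the generator away from fragmented (very short) notes early in training, while the shrinking coefficient hands control back to the pure adversarial objective once the generated QN score is close enough to the real one.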