
Graduate Student: Hsuan-Yu Li (李萱瑜)
Thesis Title: Study on Convolutional Generative Adversarial Networks based Polyphonic Music Generation (基於卷積生成對抗網絡進行複音音樂生成)
Advisor: Shun-Feng Su (蘇順豐)
Committee Members: Yung-Yao Chen (陳永耀), Kai-Lung Hua (花凱龍), Mei-Yung Chen (陳美勇), Hsien-I Lin (林顯易)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 109 (2020-2021)
Language: English
Number of Pages: 78
Keywords: Generative Adversarial Nets, Music Generation, Piano Roll, Polyphonic Music, Deep Learning
In order to improve the quality of the generated music and to address the problem of music fragmentation, this study examines the model parameters and, above all, improves the loss function. Automatic music generation is the process of creating music with minimal human intervention. Because convolutional neural networks are well suited to learning local features, many recent models use generative adversarial networks (GANs) to generate symbolic music. This thesis adopts a convolutional GAN as its model architecture, divided into three parts: a Generator, a Refiner, and a Discriminator.
Without affecting the training stability of the GAN, a term measuring the distance between the Qualified Note (QN) scores of the real and generated data is added to the original loss function in order to reduce the proportion of fragmented notes. In addition, the coefficient of this additional generator loss is gradually reduced as the number of iterations increases, and the reduction stops once the evaluation score approaches that of the training data. In this way, the loss functions of the generator and the discriminator remain in the state best suited to the model. Finally, to bring the evaluation scores as close as possible to those of the training data, the most suitable activation functions are chosen according to the characteristics of the model. The model is trained and tested on lastfm_alternative_8b_phrase, a subset filtered from the Lakh Pianoroll Dataset (LPD); the generated music reaches a Qualified Note (QN) score of 84%, with Polyphony (PP) and Tonal Distance (TD) scores of 46% and 95%, respectively. Among all existing approaches, these scores are the closest to those of the training data.
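The Qualified Note (QN) score referred to above is, in MuseGAN-style evaluation, the fraction of generated notes that are not fragmented. The following is a rough illustrative sketch of such a metric computed from a binary piano roll; the function name, the (time_steps, pitches) layout, and the three-time-step threshold are assumptions for illustration, not the thesis's actual evaluation code.

```python
import numpy as np

def qualified_note_ratio(pianoroll: np.ndarray, min_length: int = 3) -> float:
    """Fraction of notes lasting at least `min_length` time steps.

    Assumes a binary piano roll shaped (time_steps, pitches); any positive
    entry means the pitch is sounding at that step.
    """
    qualified, total = 0, 0
    for pitch in range(pianoroll.shape[1]):
        # 1 where the pitch is on, 0 where it is off, padded so that every
        # note produces exactly one onset and one offset.
        column = (pianoroll[:, pitch] > 0).astype(int)
        padded = np.concatenate(([0], column, [0]))
        onsets = np.flatnonzero(np.diff(padded) == 1)
        offsets = np.flatnonzero(np.diff(padded) == -1)
        for onset, offset in zip(onsets, offsets):
            total += 1
            if offset - onset >= min_length:  # note length in time steps
                qualified += 1
    return qualified / total if total else 0.0
```

Under this reading, the 84% reported above would mean that 84% of the generated notes reach the minimum length; heavily fragmented music pushes the ratio down.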


In order to improve the quality of the generated music and solve the problem of music fragmentation, this study examines the model parameters and mainly focuses on modifying the loss function to suit the requirements of music. Automatic generation of music is a process of music creation with minimal human intervention. Because convolutional neural networks are well suited to learning local features, many existing models use GANs to generate symbolic music. This study uses a convolutional generative adversarial network as the model architecture, which is divided into three parts: a Generator, a Refiner, and a Discriminator. Without affecting the training stability of the GAN, and in order to reduce the proportion of fragmented notes, a term that measures the distance between the real and generated Qualified Note scores is added to the original loss function. In addition, as the number of iterations increases, the coefficient of this additional generator loss is gradually reduced, and the reduction stops once the evaluation score approaches that of the training data. In this way, the loss functions of the generator and the discriminator can be kept in the state most suitable for the model. Finally, in order to bring the evaluation scores as close as possible to those of the training data, more suitable activation functions are selected according to the characteristics of the model. This thesis uses lastfm_alternative_8b_phrase, filtered from the Lakh Pianoroll Dataset (LPD), as the dataset for training and testing; the Qualified Note score of the generated music reaches 84%, and the Polyphony and Tonal Distance scores are 46% and 95%, respectively. Among all existing approaches, these scores are the closest to those of the training data.
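The modification described in the abstract combines an adversarial term with a penalty on the QN-score gap whose weight shrinks over training. Below is a minimal sketch of that idea, assuming a WGAN-style generator term, a linear decay schedule, and a small tolerance `tol` for deciding when the scores are "close"; these choices and the function names are illustrative assumptions rather than the thesis's actual formulation.

```python
def qn_coefficient(step: int, total_steps: int, lam0: float = 1.0) -> float:
    """Weight of the QN penalty at a given training step.

    The abstract only states that the coefficient shrinks as iterations
    increase; the linear decay used here is an illustrative assumption.
    """
    return lam0 * max(0.0, 1.0 - step / total_steps)


def generator_loss(critic_fake: float, qn_fake: float, qn_real: float,
                   step: int, total_steps: int, tol: float = 0.02) -> float:
    """Adversarial term plus a decaying penalty on the QN-score gap.

    critic_fake: mean critic output on generated piano rolls (WGAN-style).
    qn_fake / qn_real: Qualified Note scores of generated and training data.
    The penalty is dropped once the two scores are within `tol` of each
    other, mimicking the stopping rule described in the abstract.
    """
    adversarial = -critic_fake  # generator tries to raise the critic score
    gap = abs(qn_fake - qn_real)
    if gap <= tol:
        return adversarial
    return adversarial + qn_coefficient(step, total_steps) * gap
```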

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background
  1.2 Motivations
  1.3 Baseline Model
  1.4 Thesis Contributions
  1.5 Thesis Organization
Chapter 2 Related Work
  2.1 Generative Adversarial Nets
  2.2 MidiNet
  2.3 MuseGAN
  2.4 bMuseGAN
Chapter 3 Methodology
  3.1 Structure
    3.1.1 Deep Convolution Generative Adversarial Networks
    3.1.2 Generator
    3.1.3 Refiner
    3.1.4 Discriminator
  3.2 Loss Function
    3.2.1 GAN
    3.2.2 WGAN
    3.2.3 WGAN-GP
    3.2.4 Qualified Note Loss
  3.3 Activation Functions
  3.4 Parameters Discussion
Chapter 4 Experiments
  4.1 Dataset
  4.2 Implementation Environment
    4.2.1 Hardware
    4.2.2 Software
  4.3 Training and Testing Process
    4.3.1 Training Process
    4.3.2 Testing Process
  4.4 Hyper-parameters
    4.4.1 Batch Size
    4.4.2 Epoch
    4.4.3 Other Parameters Setting
  4.5 Evaluation Function
  4.6 Experiment Results
    4.6.1 Original Loss Function
    4.6.2 Modified Loss Function
    4.6.3 Modified Loss Function with Activation Functions
    4.6.4 Comparison
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
References

[1] G. Papadopoulos and G. Wiggins, "AI Methods for Algorithmic Composition: A Survey, a Critical View and Future Prospects," 1999.
[2] P. Westergaard, L. Hiller, and L. M. Isaacson, "Experimental Music: Composition with an Electronic Computer," Journal of Music Theory, vol. 3, p. 302, 1959.
[3] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription," in Proceedings of the 29th International Conference on Machine Learning (ICML), vol. 2, 2012.
[4] P. Todd and G. Loy, "Modeling the Perception of Tonal Structure with Neural Nets," Computer Music Journal, vol. 13, pp. 128-137, 1989.
[5] D. Eck and J. Schmidhuber, "Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks," pp. 747-756, 2002.
[6] A. Graves, "Generating Sequences with Recurrent Neural Networks," 2013.
[7] C.-Z. A. Huang, T. Cooijmans, A. Roberts, A. C. Courville, and D. Eck, "Counterpoint by Convolution," ArXiv, vol. abs/1903.07227, 2017.
[8] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," 2016.
[9] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, "Conditional Image Generation with PixelCNN Decoders," 2016.
[10] J.-P. Briot, G. Hadjeres, and F.-D. Pachet, "Deep learning techniques for music generation: a survey," arXiv preprint arXiv:1709.01620, 2017.
[11] E. Waite, D. Eck, A. Roberts, and D. Abolafia, "Project Magenta: Generating Long-Term Structure in Songs and Stories," 2016. [Online]. Available: https://magenta.tensorflow.org/blog/2016/07/15/lookback-rnn-attention-rnn/.
[12] S. Mehri et al., "SampleRNN: An Unconditional End-to-End Neural Audio Generation Model," 2016.
[13] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, "MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation Using 1D and 2D Conditions," 2017.
[14] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, "MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment," 2018.
[15] H.-W. Dong and Y.-H. Yang, "Convolutional Generative Adversarial Networks with Binary Neurons for Polyphonic Music Generation," in ISMIR, 2018.
[16] I. Goodfellow et al., "Generative Adversarial Nets," ArXiv, 2014.
[17] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient," 2016.
[18] B. Sturm, J. Santos, O. Ben-Tal, and I. Korshunova, "Music Transcription Modelling and Composition Using Deep Learning," 2016.
[19] H. Chu, R. Urtasun, and S. Fidler, "Song From PI: A Musically Plausible Network for Pop Music Generation," 2016.
[20] G. Hadjeres, F. Pachet, and F. Nielsen, "DeepBach: A Steerable Model for Bach Chorales Generation," 2017.
[21] "Binary Stochastic Neurons in TensorFlow," 2016. [Online]. Available: https://r2rt.com/binary-stochastic-neurons-in-tensorflow.html.
[22] Y. Bengio, N. Léonard, and A. Courville, "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation," 2013.
[23] T. White, "Sampling Generative Networks: Notes on a Few Effective Techniques," 2016.
[24] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," ArXiv, vol. abs/1701.07875, 2017.
[25] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved Training of Wasserstein GANs," ArXiv, vol. abs/1704.00028, 2017.
[26] A. Radford, L. Metz, and S. Chintala, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks," CoRR, vol. abs/1511.06434, 2016.
[27] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved Techniques for Training GANs," 2016.
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Identity Mappings in Deep Residual Networks," in European Conference on Computer Vision, Springer, 2016, pp. 630-645.
[29] C. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, vol. 27, pp. 379-423, 1948.
[30] A. F. Agarap, "Deep Learning Using Rectified Linear Units (ReLU)," ArXiv, vol. abs/1803.08375, 2018.
[31] A. L. Maas, "Rectifier Nonlinearities Improve Neural Network Acoustic Models," 2013.
[32] C. Raffel, "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching," 2016.
[33] T. Bertin-Mahieux, D. Ellis, B. Whitman, and P. Lamere, "The Million Song Dataset," pp. 591-596, 2011.
[34] J. Chung, S. Ahn, and Y. Bengio, "Hierarchical Multiscale Recurrent Neural Networks," 2016.
[35] C. Harte, M. Sandler, and M. Gasser, "Detecting Harmonic Change in Musical Audio," 2006.
[36] W.-Y. Hsiao, Y.-C. Yeh, Y.-S. Huang, J. Liu, T.-K. Hsieh, and H.-T. Hung, "Jamming with Yating: Interactive Demonstration of a Music Composition AI," 2019.
[37] S. Lattner and M. Grachten, "High-Level Control of Drum Track Generation Using Learned Patterns of Rhythmic Interaction," in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 35-39.

Full text available from 2026/02/04 (campus network)
Full text available from 2026/02/04 (off-campus network)
Full text available from 2026/02/04 (National Central Library: Taiwan Dissertations and Theses System)