
Graduate Student: Hsuan-Yu Li (李萱瑜)
Thesis Title: Study on Convolutional Generative Adversarial Networks based Polyphonic Music Generation (基於卷積生成對抗網絡進行複音音樂生成)
Advisor: Shun-Feng Su (蘇順豐)
Committee Members: Yung-Yao Chen (陳永耀), Kai-Lung Hua (花凱龍), Mei-Yung Chen (陳美勇), Hsien-I Lin (林顯易)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 109 (2020-2021)
Language: English
Number of Pages: 78
Keywords: Generative Adversarial Nets, Music Generation, Piano Roll, Polyphonic Music, Deep Learning
In order to improve the quality of the generated music and to address the problem of music fragmentation, this study examines the model parameters and, above all, improves the loss function. Automatic music generation is the process of creating music with minimal human intervention. Because convolutional neural networks are well suited to learning local features, many recent models use generative adversarial networks (GANs) to generate symbolic music. This thesis adopts a convolutional GAN as its model architecture, divided into three parts: a Generator, a Refiner, and a Discriminator.
Without affecting the training stability of the GAN, a term measuring the distance between the Qualified Note (QN) scores of the real and generated data is added to the original loss function in order to reduce the proportion of fragmented notes. In addition, the coefficient of this additional generator loss is gradually reduced as the number of iterations increases, and the reduction stops once the evaluation score approaches that of the training data. In this way, the loss functions of the generator and the discriminator remain in the state best suited to the model. Finally, to bring the evaluation scores as close as possible to those of the training data, the most suitable activation functions are chosen according to the characteristics of the model. The model is trained and tested on lastfm_alternative_8b_phrase, a subset filtered from the Lakh Pianoroll Dataset (LPD); the generated music reaches a Qualified Note (QN) score of 84%, with Polyphony (PP) and Tonal Distance (TD) scores of 46% and 95%, respectively. Among all existing approaches, these scores are the closest to those of the training data.
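The Qualified Note (QN) score referred to above is, in MuseGAN-style evaluation, the fraction of generated notes that are not fragmented. The following is a rough illustrative sketch of such a metric computed from a binary piano roll; the function name, the (time_steps, pitches) layout, and the three-time-step threshold are assumptions for illustration, not the thesis's actual evaluation code.

```python
import numpy as np

def qualified_note_ratio(pianoroll: np.ndarray, min_length: int = 3) -> float:
    """Fraction of notes lasting at least `min_length` time steps.

    Assumes a binary piano roll shaped (time_steps, pitches); any positive
    entry means the pitch is sounding at that step.
    """
    qualified, total = 0, 0
    for pitch in range(pianoroll.shape[1]):
        # 1 where the pitch is on, 0 where it is off, padded so that every
        # note produces exactly one onset and one offset.
        column = (pianoroll[:, pitch] > 0).astype(int)
        padded = np.concatenate(([0], column, [0]))
        onsets = np.flatnonzero(np.diff(padded) == 1)
        offsets = np.flatnonzero(np.diff(padded) == -1)
        for onset, offset in zip(onsets, offsets):
            total += 1
            if offset - onset >= min_length:  # note length in time steps
                qualified += 1
    return qualified / total if total else 0.0
```

Under this reading, the 84% reported above would mean that 84% of the generated notes reach the minimum length; heavily fragmented music pushes the ratio down.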


In order to improve the quality of the generated music and solve the problem of music fragmentation, this study examines the model parameters and mainly focuses on modifying the loss function to suit the requirements of music. Automatic generation of music is a process of music creation with minimal human intervention. Because convolutional neural networks are well suited to learning local features, many existing models use GANs to generate symbolic music. This study uses a convolutional generative adversarial network as the model architecture, which is divided into three parts: a Generator, a Refiner, and a Discriminator. Without affecting the training stability of the GAN, and in order to reduce the proportion of fragmented notes, a term that measures the distance between the real and generated Qualified Note scores is added to the original loss function. In addition, as the number of iterations increases, the coefficient of this additional generator loss is gradually reduced, and the reduction stops once the evaluation score approaches that of the training data. In this way, the loss functions of the generator and the discriminator can be kept in the state most suitable for the model. Finally, in order to bring the evaluation scores as close as possible to those of the training data, more suitable activation functions are selected according to the characteristics of the model. This thesis uses lastfm_alternative_8b_phrase, filtered from the Lakh Pianoroll Dataset (LPD), as the dataset for training and testing; the Qualified Note score of the generated music reaches 84%, and the Polyphony and Tonal Distance scores are 46% and 95%, respectively. Among all existing approaches, these scores are the closest to those of the training data.
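The modification described in the abstract combines an adversarial term with a penalty on the QN-score gap whose weight shrinks over training. Below is a minimal sketch of that idea, assuming a WGAN-style generator term, a linear decay schedule, and a small tolerance `tol` for deciding when the scores are "close"; these choices and the function names are illustrative assumptions rather than the thesis's actual formulation.

```python
def qn_coefficient(step: int, total_steps: int, lam0: float = 1.0) -> float:
    """Weight of the QN penalty at a given training step.

    The abstract only states that the coefficient shrinks as iterations
    increase; the linear decay used here is an illustrative assumption.
    """
    return lam0 * max(0.0, 1.0 - step / total_steps)


def generator_loss(critic_fake: float, qn_fake: float, qn_real: float,
                   step: int, total_steps: int, tol: float = 0.02) -> float:
    """Adversarial term plus a decaying penalty on the QN-score gap.

    critic_fake: mean critic output on generated piano rolls (WGAN-style).
    qn_fake / qn_real: Qualified Note scores of generated and training data.
    The penalty is dropped once the two scores are within `tol` of each
    other, mimicking the stopping rule described in the abstract.
    """
    adversarial = -critic_fake  # generator tries to raise the critic score
    gap = abs(qn_fake - qn_real)
    if gap <= tol:
        return adversarial
    return adversarial + qn_coefficient(step, total_steps) * gap
```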

Chinese Abstract
Abstract
Acknowledgements
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Background
  1.2 Motivations
  1.3 Baseline Model
  1.4 Thesis Contributions
  1.5 Thesis Organization
Chapter 2 Related Work
  2.1 Generative Adversarial Nets
  2.2 MidiNet
  2.3 MuseGAN
  2.4 bMuseGAN
Chapter 3 Methodology
  3.1 Structure
    3.1.1 Deep Convolution Generative Adversarial Networks
    3.1.2 Generator
    3.1.3 Refiner
    3.1.4 Discriminator
  3.2 Loss Function
    3.2.1 GAN
    3.2.2 WGAN
    3.2.3 WGAN-GP
    3.2.4 Qualified Note Loss
  3.3 Activation Functions
  3.4 Parameters Discussion
Chapter 4 Experiments
  4.1 Dataset
  4.2 Implementation Environment
    4.2.1 Hardware
    4.2.2 Software
  4.3 Training and Testing Process
    4.3.1 Training Process
    4.3.2 Testing Process
  4.4 Hyper-parameters
    4.4.1 Batch Size
    4.4.2 Epoch
    4.4.3 Other Parameters Setting
  4.5 Evaluation Function
  4.6 Experiment Results
    4.6.1 Original Loss Function
    4.6.2 Modified Loss Function
    4.6.3 Modified Loss Function with Activation Functions
    4.6.4 Comparison
Chapter 5 Conclusions and Future Work
  5.1 Conclusions
  5.2 Future Work
References

[1] G. Papadopoulos and G. Wiggins, "AI Methods for Algorithmic Composition: A Survey, a Critical View and Future Prospects," 1999.
[2] P. Westergaard, L. Hiller, and L. M. Isaacson, "Experimental Music: Composition with an Electronic Computer," Journal of Music Theory, vol. 3, p. 302, 1959.
[3] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription," in Proceedings of the 29th International Conference on Machine Learning (ICML), vol. 2, 2012.
[4] P. Todd and G. Loy, "Modeling the Perception of Tonal Structure with Neural Nets," Computer Music Journal, vol. 13, pp. 128-137, 1989.
[5] D. Eck and J. Schmidhuber, "Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks," pp. 747-756, 2002.
[6] A. Graves, "Generating Sequences with Recurrent Neural Networks," 2013.
[7] C.-Z. A. Huang, T. Cooijmans, A. Roberts, A. C. Courville, and D. Eck, "Counterpoint by Convolution," ArXiv, vol. abs/1903.07227, 2017.
[8] A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," 2016.
[9] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu, "Conditional Image Generation with PixelCNN Decoders," 2016.
[10] J.-P. Briot, G. Hadjeres, and F.-D. Pachet, "Deep learning techniques for music generation: a survey," arXiv preprint arXiv:1709.01620, 2017.
[11] E. Waite, D. Eck, A. Roberts, and D. Abolafia, "Project Magenta: Generating Long-Term Structure in Songs and Stories," 2016. [Online]. Available: https://magenta.tensorflow.org/blog/2016/07/15/lookback-rnn-attention-rnn/.
[12] S. Mehri et al., "SampleRNN: An Unconditional End-to-End Neural Audio Generation Model," 2016.
[13] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, "MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation Using 1D and 2D Conditions," 2017.
[14] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, "MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment," 2018.
[15] H.-W. Dong and Y.-H. Yang, "Convolutional Generative Adversarial Networks with Binary Neurons for Polyphonic Music Generation," in ISMIR, 2018.
[16] I. Goodfellow et al., "Generative Adversarial Nets," ArXiv, 2014.
[17] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient," 2016.
[18] B. Sturm, J. Santos, O. Ben-Tal, and I. Korshunova, "Music Transcription Modelling and Composition Using Deep Learning," 2016.
[19] H. Chu, R. Urtasun, and S. Fidler, "Song From PI: A Musically Plausible Network for Pop Music Generation," 2016.
[20] G. Hadjeres, F. Pachet, and F. Nielsen, "DeepBach: A Steerable Model for Bach Chorales Generation," 2017.
[21] "Binary Stochastic Neurons in TensorFlow," 2016. [Online]. Available: https://r2rt.com/binary-stochastic-neurons-in-tensorflow.html.
[22] Y. Bengio, N. Léonard, and A. Courville, "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation," 2013.
[23] T. White, "Sampling Generative Networks: Notes on a Few Effective Techniques," 2016.
[24] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," ArXiv, vol. abs/1701.07875, 2017.
[25] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved Training of Wasserstein GANs," ArXiv, vol. abs/1704.00028, 2017.
[26] A. Radford, L. Metz, and S. Chintala, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks," CoRR, vol. abs/1511.06434, 2016.
[27] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved Techniques for Training GANs," 2016.
[28] K. He, X. Zhang, S. Ren, and J. Sun, "Identity Mappings in Deep Residual Networks," in European Conference on Computer Vision, Springer, 2016, pp. 630-645.
[29] C. Shannon, "A Mathematical Theory of Communication," Bell System Technical Journal, vol. 27, pp. 379-423, 1948.
[30] A. F. Agarap, "Deep Learning Using Rectified Linear Units (ReLU)," ArXiv, vol. abs/1803.08375, 2018.
[31] A. L. Maas, "Rectifier Nonlinearities Improve Neural Network Acoustic Models," 2013.
[32] C. Raffel, "Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching," 2016.
[33] T. Bertin-Mahieux, D. Ellis, B. Whitman, and P. Lamere, "The Million Song Dataset," pp. 591-596, 2011.
[34] J. Chung, S. Ahn, and Y. Bengio, "Hierarchical Multiscale Recurrent Neural Networks," 2016.
[35] C. Harte, M. Sandler, and M. Gasser, "Detecting Harmonic Change in Musical Audio," 2006.
[36] W.-Y. Hsiao, Y.-C. Yeh, Y.-S. Huang, J. Liu, T.-K. Hsieh, and H.-T. Hung, "Jamming with Yating: Interactive Demonstration of a Music Composition AI," 2019.
[37] S. Lattner and M. Grachten, "High-Level Control of Drum Track Generation Using Learned Patterns of Rhythmic Interaction," in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 35-39.

Full text available from 2026/02/04 (campus network)
Full text available from 2026/02/04 (off-campus network)
Full text available from 2026/02/04 (National Central Library: Taiwan Dissertations and Theses System)