| Field | Value |
|---|---|
| Graduate Student | 游瀚鈞 Han-Chun Youh |
| Thesis Title | 基於深度神經網路之樂器聲音波形生成 (Waveform Synthesis of Instrument Music via Deep Neural Network) |
| Advisor | 林伯慎 Bor-Shen Lin |
| Oral Defense Committee | 羅乃維 Nai-Wei Lo, 陳柏琳 Ber-Lin Chen |
| Degree | Master |
| Department | College of Management, Department of Information Management |
| Year of Publication | 2021 |
| Academic Year | 109 |
| Language | Chinese |
| Pages | 73 |
| Keywords (Chinese) | 波形生成, 音色誤差計算, 音樂生成, 音頻生成器 |
| Keywords (English) | waveform synthesis, timbre loss, music synthesis, audio generator |
With the rapid development of artificial neural networks, more and more music processing technologies, such as music tagging, chord recognition, and music generation, make extensive use of them. Today, neural networks for audio signal generation are applied mainly to speech synthesis, and their architectures are mostly adapted from image synthesis networks. WaveNet, for example, can be adapted to generate instrumental sounds, but its architecture is complex and lacks flexibility and convenience for music creation. Generative adversarial networks have also been used to synthesize music, but they require a large amount of training data, the generated output is hard to control, and further refinement or application to music creation is difficult. In fact, the representation and generation of sound differ considerably from those of images, so reusing similar generative architectures is questionable; moreover, the sound-production process of a musical instrument is simpler than that of human speech, so a simpler network architecture should suffice for synthesis. This thesis therefore studies generating instrument sounds with multilayer neural networks, exploring simpler, more controllable generative architectures that still produce good sound quality. We propose an instrument-sound synthesis method based on a periodic impulse sequence and a multilayer network, trained with a timbre loss defined over frequency-band energy ratios and a volume loss defined over per-period energy. Only a single sound file is needed to train the generative model, which occupies very little storage and trains quickly. We also verify that, by controlling the period and amplitude of the impulse input, the network can produce natural instrument sounds with varying pitch and volume, as well as chord and portamento effects.
With the rapid development of neural networks, more and more music processing technologies make wide use of neural networks, such as music tagging, chord recognition, and music synthesis. Nowadays, audio synthesis networks are often adapted from architectures designed for speech or image synthesis. WaveNet, for example, may be used to generate music as well as speech, but its sophisticated architecture does not lend itself readily to music creation. GAN architectures may be used for audio generation as well as image generation, but they require a large amount of data for training, and it is difficult to control the effect of the generated music or to apply them flexibly to music creation. In fact, the generative process of instrumental sound is simpler than that of speech and quite different from that of images, so it is unreasonable to reuse similar generative networks. Therefore, this thesis explores the use of simpler neural networks to generate natural instrumental sounds that are handy for music creation. We propose a synthesis architecture that takes a periodic impulse sequence as input, followed by a multilayer convolutional network, which is trained with a timbre loss based on energy ratios in the frequency domain and an amplitude loss in the time domain. Only a single sound is required to train the synthesis model, so the storage space is small and training is fast. The network has been shown to synthesize natural instrumental sounds of different pitches or volumes by simply controlling the period or amplitude of the impulse sequence, and it can produce sounds with multiple or dynamic pitches, such as melodies, chords, and portamento. The proposed approach is therefore handy, flexible, and controllable for music creation.
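To make the two core ideas above concrete (a pitch- and volume-controllable periodic impulse excitation, and a timbre loss based on frequency-band energy ratios), here is a minimal NumPy sketch. It is illustrative only, not the thesis's actual implementation: the function names, the 16 kHz sample rate, the 1024-point FFT, and the 8-band split are all assumptions.

```python
import numpy as np

def impulse_train(freq_hz, duration_s, sr=16000, amplitude=1.0):
    """Periodic impulse sequence: one impulse every round(sr / freq_hz)
    samples. The period sets the pitch; the impulse height sets the volume."""
    n = int(duration_s * sr)
    period = int(round(sr / freq_hz))
    x = np.zeros(n)
    x[::period] = amplitude
    return x

# A chord can be fed in as a superposition of impulse trains (here C4-E4-G4).
chord_input = sum(impulse_train(f, 1.0) for f in (261.63, 329.63, 392.00))

def timbre_loss(pred, target, n_fft=1024, n_bands=8):
    """Band-energy-ratio loss: compare the *fraction* of spectral energy
    falling in each frequency band, so that timbre is matched independently
    of overall volume (which a separate amplitude loss would handle)."""
    def band_ratios(x):
        spec = np.abs(np.fft.rfft(x, n_fft)) ** 2      # power spectrum
        bands = np.array_split(spec, n_bands)          # coarse frequency bands
        energy = np.array([b.sum() for b in bands])
        return energy / (energy.sum() + 1e-12)         # normalize to ratios
    return float(np.mean((band_ratios(pred) - band_ratios(target)) ** 2))
```

Because the loss compares normalized band energies, it is zero for identical waveforms and grows as the spectral envelopes diverge; scaling a waveform up or down leaves it unchanged, which is the motivation for pairing it with a separate time-domain amplitude loss.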