
Graduate Student: 游瀚鈞 (Han-Chun Youh)
Thesis Title: 基於深度神經網路之樂器聲音波形生成 (Waveform Synthesis of Instrument Music via Deep Neural Network)
Advisor: 林伯慎 (Bor-Shen Lin)
Committee Members: 羅乃維 (Nai-Wei Lo), 陳柏琳 (Ber-Lin Chen)
Degree: Master
Department: Department of Information Management, College of Management
Year of Publication: 2021
Academic Year of Graduation: 109 (2020-2021)
Language: Chinese
Pages: 73
Keywords: waveform synthesis, timbre loss, music synthesis, audio generator
    With the rapid development of artificial neural networks, more and more music processing techniques make extensive use of them, including music tagging, chord recognition, and music generation. Neural networks applied to audio signal generation today are aimed mostly at speech synthesis, and their architectures are largely adapted from image synthesis. WaveNet, for example, can be modified to generate musical tones, but its architecture is complex and offers little flexibility or convenience for composition. Generative adversarial networks have also been used to synthesize music, but they need a large amount of training data, their output is hard to control, and the results are difficult to refine further or to apply to music creation. In fact, the representation and production of sound differ substantially from those of images, so borrowing similar generative architectures is questionable; moreover, an instrument produces sound by a process simpler than human speech, so a simpler network should suffice for synthesis. This thesis therefore studies generating instrument sounds with multilayer neural networks, exploring simpler and more controllable generative architectures that still yield good sound quality. We propose an instrument sound synthesis method based on a periodic impulse sequence and a multilayer network, trained with a timbre loss on band-energy ratios and an amplitude loss on per-period energy. Only a single audio file is needed to train the generative model, so its storage footprint is tiny and training is fast. We also verify that, by controlling the period and amplitude of the input impulses, the network can produce natural tones with different pitches and volume variations, as well as chord and portamento effects.
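    To make the control scheme concrete, here is a minimal sketch (not code from the thesis) of the periodic impulse input described above: the spacing between impulses sets the pitch and the impulse height sets the volume, while chords arise from superposing several trains. The 16 kHz sample rate and the chord frequencies are assumptions chosen for illustration.

        import numpy as np

        def impulse_train(freq_hz, duration_s, amplitude=1.0, sr=16000):
            # One unit impulse per pitch period: the period (in samples)
            # controls the perceived pitch, the height controls the volume.
            n = int(duration_s * sr)
            x = np.zeros(n, dtype=np.float32)
            period = max(1, int(round(sr / freq_hz)))
            x[::period] = amplitude
            return x

        # A4 (440 Hz) at full volume for one second.
        a4 = impulse_train(440.0, 1.0)
        # An A-major-like chord as a superposition of three impulse trains.
        chord = sum(impulse_train(f, 1.0, amplitude=0.5)
                    for f in (440.0, 554.37, 659.26))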


    With the rapid development of neural networks, more and more music processing technologies make wide use of them, such as music tagging, chord recognition, and music synthesis. Nowadays, audio synthesis networks are often adapted from architectures designed for speech or image synthesis. WaveNet, for example, can generate music as well as speech, but its sophisticated architecture does not lend itself to music creation. GAN architectures can likewise be used for audio generation, but they require a large amount of training data, and it is difficult to control the generated music or to apply such models flexibly to music creation. In fact, the generative process of instrumental sound is simpler than that of speech and quite different from that of images, so reusing such generative networks is questionable. This thesis therefore explores the use of simpler neural networks to generate natural instrumental sounds that are handy for music creation. We propose a synthesis architecture that feeds a periodic impulse sequence into a multilayer convolutional network, trained with a timbre loss on energy ratios in the frequency domain and an amplitude loss in the time domain. Only a single sound file is required to train the synthesis model, so its storage footprint is small and training is fast. The network is shown to synthesize natural instrumental sounds of different pitches or volumes simply by controlling the period or amplitude of the impulse sequence, and it can produce sounds with multiple or dynamic pitches, such as melodies, chords, and portamento. The proposed approach is therefore handy, flexible, and controllable for music creation.
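    The two training objectives can likewise be sketched in code. The snippet below is a hypothetical NumPy/SciPy rendering of a DCT-based band-energy-ratio timbre loss and a per-period amplitude loss; the number of bands, the contiguous band partition, and the squared-error distance are assumptions, since the abstract does not give the exact formulation.

        import numpy as np
        from scipy.fft import dct

        def band_energy_ratios(frame, n_bands=32):
            # DCT takes the frame to the frequency domain; grouping squared
            # coefficients into bands and normalizing yields energy ratios,
            # which describe timbre independently of overall loudness.
            coeffs = dct(frame, norm='ortho')
            energies = coeffs ** 2
            band_energy = np.array([b.sum()
                                    for b in np.array_split(energies, n_bands)])
            return band_energy / (band_energy.sum() + 1e-12)

        def timbre_loss(generated, target, n_bands=32):
            # Frequency-domain loss: squared difference of band-energy ratios.
            g = band_energy_ratios(generated, n_bands)
            t = band_energy_ratios(target, n_bands)
            return float(np.mean((g - t) ** 2))

        def amplitude_loss(generated, target, period):
            # Time-domain loss on per-period energy (assumes equal-length signals).
            def period_energy(x):
                n = (len(x) // period) * period
                return (x[:n] ** 2).reshape(-1, period).sum(axis=1)
            return float(np.mean((period_energy(generated)
                                  - period_energy(target)) ** 2))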

    Chapter 1: Introduction
      1.1 Research Background and Motivation
      1.2 Main Research Results
      1.3 Thesis Organization
    Chapter 2: Musical Properties and Related Work
      2.1 Basic Characteristics of Music
      2.2 Principles of Instrument Sound Production
      2.3 Convolutional Neural Networks
        2.3.1 One-Dimensional Convolutional Neural Networks
      2.4 Traditional Electronic Instrument Sound Generation
        2.4.1 Samplers
        2.4.2 Synthesizers
      2.5 Survey of Generative Models
        2.5.1 Likelihood-Based Models
        2.5.2 Variational Auto-Encoders (VAE)
        2.5.3 Generative Adversarial Networks (GAN)
      2.6 Chapter Summary
    Chapter 3: Waveform Synthesis Methods for Instrument Sounds
      3.1 Introduction
      3.2 Embedding Generator
      3.3 Processing Pipeline and Architecture
      3.4 Dataset
      3.5 Impulse Sequence and Generator
        3.5.1 Pitch Variation
        3.5.2 Monophonic Melody Lines and Portamento
        3.5.3 Chords
        3.5.4 Volume Variation
      3.6 Discrete Cosine Transform (DCT)
      3.7 Normalization of Energy Ratios
      3.8 Timbre Loss Computation
      3.9 Chapter Summary
    Chapter 4: Investigation and Improvement of the Generative Model
      4.1 Model Improvements
        4.1.1 Learnable Impulse Sequence
        4.1.2 Amplitude Loss
      4.2 Analysis of the Convolutional Network
        4.2.1 Number of Layers and Kernel Length
        4.2.2 Effect of Bias
      4.3 Advantages and Applications of the Generative Model
        4.3.1 Comparison with Traditional Samplers
        4.3.2 Comparison with Neural Network Models
        4.3.3 Practical Applications
      4.4 Chapter Summary
    Chapter 5: Conclusion
    References

    [1] S. Dieleman, J. Pons, and J. Lee, "Tutorial: Waveform-based music processing with deep learning", ISMIR Tutorial, 2019.
    [2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition", Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
    [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks", in Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
    [4] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, "Face recognition: A convolutional neural network approach", IEEE Transactions on Neural Networks, pp. 98-113, 1997.
    [5] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes", in International Conference on Learning Representations (ICLR), 2014.
    [6] I. Goodfellow et al., "Generative adversarial nets", in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
    [7] A. van den Oord et al., "WaveNet: A generative model for raw audio", arXiv preprint arXiv:1609.03499, 2016.
    [8] A. van den Oord et al., "Conditional image generation with PixelCNN decoders", CoRR, abs/1606.05328, 2016.
    [9] A. van den Oord et al., "Neural discrete representation learning", in Advances in Neural Information Processing Systems (NIPS), 2017.
    [10] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient neural audio synthesis", CoRR, abs/1802.08435, 2018.
    [11] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model", in ICLR, 2017.
    [12] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform", IEEE Transactions on Computers, pp. 90-93, 1974.
    [13] C. Roads, The Computer Music Tutorial, pp. 117-133, 1995.
    [14] K. Kumar et al., "MelGAN: Generative adversarial networks for conditional waveform synthesis", arXiv preprint arXiv:1910.06711, 2019.
    [15] C. Donahue, J. McAuley, and M. Puckette, "Adversarial audio synthesis", in ICLR, 2019.
    [16] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks", in ICLR, 2016.
