| Field | Value |
|---|---|
| Author | MOLION SURYA PRADANA |
| Thesis Title | Continuous Music Generation Using Video Based on Circumplex Model of Affect |
| Advisor | Chuan-Kai Yang (楊傳凱) |
| Committee | Yuan-Cheng Lai (賴源正), Bor-Shen Lin (林伯慎) |
| Degree | Master |
| Department | Department of Information Management, College of Management |
| Thesis Publication Year | 2023 |
| Graduation Academic Year | 111 |
| Language | English |
| Pages | 59 |
| Keywords | Continuous music generation, Russell Circumplex model, Valence-arousal, Multimodal approach |
This study presents a novel approach to continuous music generation that uses the valence-arousal dimensions of the Russell Circumplex model as the basis for emotional representation. The proposed system combines facial emotion recognition, transfer learning, and symbolic music generation to dynamically generate music from continuously evolving emotional states. The process begins with face detection, followed by facial landmark modeling to predict the emotional state of the detected face. To handle situations where no face is present, a transfer learning model is trained to estimate valence and arousal from images that lack facial features. The most effective transfer learning model identified in this study is EfficientNetB2_0.001_32, which achieves promising results with root mean square error (RMSE) values of 0.327 for valence and 0.223 for arousal prediction.

The predicted valence-arousal values are then used to condition the symbolic music generation system. This multimodal approach translates continuous-valued emotions into corresponding musical characteristics such as tempo, melody, and harmony. Music is generated at regular intervals, with new output produced every 60 seconds to adapt to the evolving emotional states. The proposed system demonstrates smooth operation and robustness in predicting valence and arousal values, as well as in generating coherent music reflective of the detected emotions. The integration of facial emotion recognition and symbolic music generation provides a comprehensive framework for continuous music generation in real time, with potential applications in domains such as interactive media, entertainment, and therapy.
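To make the described flow concrete, the sketch below outlines one possible structure for the pipeline: per-frame valence-arousal estimation with a landmark-based model, a transfer-learning fallback when no face is detected, and a 60-second generation loop. This is a minimal illustration, not the thesis implementation; the helpers `detect_face`, `predict_va_from_landmarks`, `predict_va_from_scene`, and `generate_symbolic_music` are hypothetical names standing in for the components described in the abstract.

```python
# Illustrative sketch of the described pipeline (not the thesis code).
# Hypothetical components: detect_face(), predict_va_from_landmarks(),
# predict_va_from_scene() (e.g., an EfficientNet-B2 fallback regressor),
# and generate_symbolic_music().
import time

GENERATION_INTERVAL_SEC = 60  # new music is generated every 60 seconds


def estimate_valence_arousal(frame):
    """Return a (valence, arousal) estimate for a single video frame."""
    face = detect_face(frame)                    # hypothetical face detector
    if face is not None:
        return predict_va_from_landmarks(face)   # landmark-based emotion model
    return predict_va_from_scene(frame)          # fallback when no face is present


def run(video_source):
    """Continuously predict emotion and trigger music generation periodically."""
    last_generation = time.time()
    va_history = []
    for frame in video_source:                   # assumed iterable of frames
        va_history.append(estimate_valence_arousal(frame))
        if time.time() - last_generation >= GENERATION_INTERVAL_SEC and va_history:
            # Average recent predictions to smooth noisy per-frame estimates,
            # then condition the symbolic music generator on the result.
            valence = sum(v for v, _ in va_history) / len(va_history)
            arousal = sum(a for _, a in va_history) / len(va_history)
            generate_symbolic_music(valence, arousal)   # hypothetical generator
            va_history.clear()
            last_generation = time.time()
```

Averaging the per-frame predictions over each interval is only one plausible smoothing choice; the abstract states only that new music is generated every 60 seconds from the evolving emotional state.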