
Author: MOLION SURYA PRADANA
Thesis Title: Continuous Music Generation Using Video Based on Circumplex Model of Affect
Advisor: Chuan-Kai Yang (楊傳凱)
Committee: Yuan-Cheng Lai (賴源正); Bor-Shen Lin (林伯慎)
Degree: Master
Department: Department of Information Management, School of Management
Thesis Publication Year: 2023
Graduation Academic Year: 111
Language: English
Pages: 59
Keywords: Continuous music generation, Russell Circumplex model, Valence-arousal, Multimodal approach
Reference counts: Clicks: 264; Downloads: 0

Abstract: This study presents a novel approach to continuous music generation that uses the valence-arousal dimensions of Russell's circumplex model of affect as the basis for emotional representation. The proposed system combines facial emotion recognition, transfer learning, and symbolic music generation to dynamically generate music from continuously evolving emotional states. The process begins with face detection, followed by facial landmark modeling to predict the emotional state of the detected face. To account for situations where no face is present, a transfer learning model is trained to predict emotion from images that lack facial features. The most effective transfer learning model identified in this study is EfficientNetB2_0.001_32, which achieves promising results with root mean square error (RMSE) values of 0.327 for valence and 0.223 for arousal prediction. The predicted valence-arousal values are then used to condition the symbolic music generation system. This multimodal approach translates continuous-valued emotions into corresponding musical characteristics such as tempo, melody, and harmony. Music is generated at regular intervals, with a new musical segment produced every 60 seconds to adapt to the evolving emotional state. The proposed system operates smoothly and robustly, both in predicting valence and arousal values and in generating coherent music that reflects the detected emotions. The integration of facial emotion recognition and symbolic music generation provides a comprehensive framework for real-time continuous music generation, with potential applications in domains such as interactive media, entertainment, and therapy.
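
The routing step the abstract describes — detect a face, use the landmark-based model when one is found, and fall back to the transfer-learning image model otherwise — can be illustrated with a short sketch. This is a minimal illustration assuming OpenCV for face detection; predict_va_from_face and predict_va_from_scene are hypothetical stand-ins for the thesis's two models, not the actual implementation.

```python
import cv2
import numpy as np

# Haar cascade face detector shipped with opencv-python (illustrative choice,
# not necessarily the detector used in the thesis).
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def predict_va_from_face(face_crop):
    # Placeholder for the landmark-based facial emotion model.
    return 0.5, 0.2

def predict_va_from_scene(frame):
    # Placeholder for the transfer-learning model used when no face is visible.
    return 0.0, -0.1

def predict_valence_arousal(frame):
    """Route a video frame to the face model or the no-face fallback model."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        return predict_va_from_face(frame[y:y + h, x:x + w])
    return predict_va_from_scene(frame)

if __name__ == "__main__":
    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a captured frame
    print(predict_valence_arousal(frame))
```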
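As an illustration of the transfer-learning setup, the sketch below builds an EfficientNetB2 backbone with a two-unit regression head for valence and arousal in TensorFlow/Keras and tracks RMSE, the metric reported in the abstract. It assumes the name EfficientNetB2_0.001_32 denotes EfficientNetB2 trained with learning rate 0.001 and batch size 32; the head layout, input size, and dropout rate are illustrative guesses, not the author's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import EfficientNetB2

def build_va_regressor(input_shape=(260, 260, 3), learning_rate=1e-3):
    """Two-output regressor (valence, arousal) on a frozen EfficientNetB2 backbone."""
    backbone = EfficientNetB2(include_top=False, weights="imagenet",
                              input_shape=input_shape, pooling="avg")
    backbone.trainable = False  # train only the regression head at first

    inputs = layers.Input(shape=input_shape)
    x = backbone(inputs, training=False)
    x = layers.Dropout(0.3)(x)  # illustrative regularization choice
    outputs = layers.Dense(2, activation="linear", name="valence_arousal")(x)

    model = Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse",
        # RMSE matches the evaluation metric reported in the abstract.
        metrics=[tf.keras.metrics.RootMeanSquaredError(name="rmse")],
    )
    return model

# Batch size 32 would be passed to model.fit(...) on the prepared image dataset.
model = build_va_regressor()
model.summary()
```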
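The abstract's mapping from continuous valence-arousal values to musical characteristics, with a new segment generated every 60 seconds, could be organized along the following lines. This is a minimal sketch under assumed conventions: valence and arousal in [-1, 1], hypothetical scaling constants, and a placeholder generate_segment() standing in for the thesis's conditioned symbolic music generator.

```python
import time

def va_to_music_params(valence: float, arousal: float) -> dict:
    """Map valence/arousal in [-1, 1] to coarse musical characteristics."""
    tempo_bpm = int(60 + (arousal + 1.0) * 60)       # roughly 60-180 BPM, rising with arousal
    mode = "major" if valence >= 0.0 else "minor"    # brighter harmony for positive valence
    velocity = int(64 + 40 * max(-1.0, min(1.0, arousal)))  # louder playing when aroused
    base_pitch = 60 + int(12 * valence)              # shift the melody register with valence
    return {"tempo_bpm": tempo_bpm, "mode": mode,
            "velocity": velocity, "base_pitch": base_pitch}

def generate_segment(params: dict) -> None:
    # Placeholder for the conditioned symbolic music generator; the thesis
    # renders the symbolic output to audio (e.g. via FluidSynth).
    print("Generating a 60-second segment with", params)

if __name__ == "__main__":
    for valence, arousal in [(0.6, 0.8), (-0.4, -0.6)]:  # example predicted emotions
        generate_segment(va_to_music_params(valence, arousal))
        time.sleep(60)  # re-generate once per minute as the emotional state evolves
```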

Table of Contents:
ABSTRACT iv
ACKNOWLEDGEMENT v
TABLE OF CONTENTS vi
LIST OF FIGURES viii
LIST OF TABLES ix
CHAPTER I INTRODUCTION 1
I.1 BACKGROUND 1
I.2 OBJECTIVE 3
I.3 CONTRIBUTIONS 3
I.4 RESEARCH OUTLINE 3
CHAPTER II LITERATURE REVIEW 5
II.1 FACE LANDMARK DETECTION, FEATURE EXTRACTION, AND EXPRESSION RECOGNITION 5
II.2 RUSSELL'S CIRCUMPLEX MODEL OF AFFECT 9
II.3 TRANSFER LEARNING IMAGENET MODEL 11
II.4 FLUIDSYNTH 15
II.5 PYDUB 16
II.6 PYGAME 17
II.7 SYMBOLIC MUSIC GENERATION 17
CHAPTER III PROPOSED METHOD 20
III.1 MODEL 1 20
III.1.1 OPEN AFFECTIVE STANDARDIZED IMAGE SET (OASIS) DATASET 20
III.1.2 DATA PRE-PROCESSING 21
III.1.3 DATA AUGMENTATION 21
III.1.4 ARCHITECTURE 22
III.1.5 TRAINING 24
III.1.6 PERFORMANCE EVALUATIONS 26
III.2 ARCHITECTURE OF CONTINUOUS MUSIC GENERATION 28
III.2.1 MODEL 2 29
III.2.2 MODEL 3 30
III.2.3 OPTIMIZATION FOR CONTINUOUS MUSIC GENERATION 34
CHAPTER IV EXPERIMENTS & RESULTS 36
IV.1 TRAINING 36
IV.1.1 WITHOUT DATA AUGMENTATION 36
IV.1.2 WITH DATA AUGMENTATION 37
IV.1.3 TRANSFER LEARNING WITH DATA AUGMENTATION 38
IV.1.4 TRANSFER LEARNING WITHOUT DATA AUGMENTATION 39
IV.2 RESULT OF CONTINUOUS MUSIC GENERATION 52
CHAPTER V CONCLUSION & FUTURE WORKS 55
V.1 CONCLUSIONS 55
V.2 FUTURE WORK 55
REFERENCES 56


Full text public date 2025/08/20 (Intranet public)
Full text public date 2025/08/20 (Internet public)
Full text public date 2025/08/20 (National library)