| Field | Value |
|---|---|
| Author | MOLION SURYA PRADANA |
| Thesis Title | Continuous Music Generation Using Video Based on Circumplex Model of Affect |
| Advisor | Chuan-Kai Yang (楊傳凱) |
| Committee | Yuan-Cheng Lai (賴源正), Bor-Shen Lin (林伯慎) |
| Degree | Master |
| Department | Department of Information Management, College of Management |
| Thesis Publication Year | 2023 |
| Graduation Academic Year | 111 |
| Language | English |
| Pages | 59 |
| Keywords | Continuous music generation, Russell Circumplex model, Valence-arousal, Multimodal approach |
This study presents a novel approach to continuous music generation that uses the valence-arousal dimensions of the Russell Circumplex model as the basis for emotional representation. The proposed system combines facial emotion recognition, transfer learning, and symbolic music generation to dynamically generate music from continuously evolving emotional states. The process begins with face detection, followed by facial landmark modeling to predict the emotional state of the detected face. To handle situations where no face is present, a transfer learning model is trained to estimate valence and arousal from images that lack facial features. The most effective transfer learning model identified in this study is EfficientNetB2_0.001_32, which achieves promising results with root mean square error (RMSE) values of 0.327 for valence and 0.223 for arousal prediction.

The predicted valence-arousal values are then used to condition the symbolic music generation system. This multimodal approach translates continuous-valued emotions into corresponding musical characteristics such as tempo, melody, and harmony. Music is generated at regular intervals, with new output produced every 60 seconds to adapt to the evolving emotional states. The proposed system demonstrates smooth operation and robustness in predicting valence and arousal values, as well as in generating coherent music reflective of the detected emotions. The integration of facial emotion recognition and symbolic music generation provides a comprehensive framework for continuous music generation in real time, with potential applications in domains such as interactive media, entertainment, and therapy.
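To make the described flow concrete, the sketch below outlines one possible structure for the pipeline: per-frame valence-arousal estimation with a landmark-based model, a transfer-learning fallback when no face is detected, and a 60-second generation loop. This is a minimal illustration, not the thesis implementation; the helpers `detect_face`, `predict_va_from_landmarks`, `predict_va_from_scene`, and `generate_symbolic_music` are hypothetical names standing in for the components described in the abstract.

```python
# Illustrative sketch of the described pipeline (not the thesis code).
# Hypothetical components: detect_face(), predict_va_from_landmarks(),
# predict_va_from_scene() (e.g., an EfficientNet-B2 fallback regressor),
# and generate_symbolic_music().
import time

GENERATION_INTERVAL_SEC = 60  # new music is generated every 60 seconds


def estimate_valence_arousal(frame):
    """Return a (valence, arousal) estimate for a single video frame."""
    face = detect_face(frame)                    # hypothetical face detector
    if face is not None:
        return predict_va_from_landmarks(face)   # landmark-based emotion model
    return predict_va_from_scene(frame)          # fallback when no face is present


def run(video_source):
    """Continuously predict emotion and trigger music generation periodically."""
    last_generation = time.time()
    va_history = []
    for frame in video_source:                   # assumed iterable of frames
        va_history.append(estimate_valence_arousal(frame))
        if time.time() - last_generation >= GENERATION_INTERVAL_SEC and va_history:
            # Average recent predictions to smooth noisy per-frame estimates,
            # then condition the symbolic music generator on the result.
            valence = sum(v for v, _ in va_history) / len(va_history)
            arousal = sum(a for _, a in va_history) / len(va_history)
            generate_symbolic_music(valence, arousal)   # hypothetical generator
            va_history.clear()
            last_generation = time.time()
```

Averaging the per-frame predictions over each interval is only one plausible smoothing choice; the abstract states only that new music is generated every 60 seconds from the evolving emotional state.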