
Graduate Student: 林啟光 (Chi-Kuang Lin)
Thesis Title: Transformer人臉單元情緒辨識結合大型語言模型之機器人控制互動系統設計
(Design of a Robot Control Interaction System Combined with Transformer-Based Facial Action Unit Emotion Recognition and Large Language Models)
Advisor: 謝易錚 (Yi-Zeng Hsieh)
Committee Members: 彭盛裕, 林士勛, 謝易錚
Degree: Master
Department: College of Electrical Engineering and Computer Science, Department of Electrical Engineering
Year of Publication: 2024
Academic Year of Graduation: 112 (2023-2024)
Language: Chinese
Number of Pages: 54
Keywords (Chinese): 表情辨識、臉部動作編碼系統、注意力圖、邊緣運算裝置、特徵提取、生成式AI
Keywords (English): Facial Expression Recognition, Facial Action Coding System, Attention Map, Edge Devices, Feature Extraction, Generative AI

Abstract (Chinese):
Existing facial expression recognition techniques can already identify facial expressions accurately; however, they make it hard to explain the basis on which the computer judges a particular expression. This thesis therefore takes the action units of the Facial Action Coding System as its foundation and adds information about the facial features to improve the model's recognition performance.
This thesis designs an expression recognition system combined with robot interaction. By coupling the expression recognition algorithm with generative AI, the robot's interaction becomes more lifelike. However, the robot's hardware space is highly constrained: large computing equipment cannot be installed, and only small edge computing devices can be used. This thesis therefore modifies the feature extraction network and architecture of the action unit recognition algorithm, reducing model complexity without degrading performance so as to match the computing capability of edge devices. The expression recognition algorithm uses EfficientNet, proposed by Google, whose depthwise separable convolutions reduce the number of model parameters, making it suitable for resource-constrained edge devices. Attention maps are generated from the feature maps and combined with the original face image for expression recognition; the recognition result then serves as input to the generative AI, which produces a corresponding action description and robot motion commands.
The attention map and action unit recognition algorithm proposed in this thesis performs comparably to other action unit recognition algorithms on the BP4D and DISFA datasets, and the expression recognition algorithm achieves 60.64% accuracy on the AffectNet dataset, higher than the compared methods.
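The expression-to-robot hand-off summarized above can be sketched in a few lines. This is a hypothetical illustration: the emotion labels, prompt wording, and function name are placeholders, not the thesis's actual ChatGPT instructions (those appear in its Appendix 1).

```python
# Hypothetical sketch of the recognition-to-generative-AI hand-off.
# Emotion labels and prompt text are illustrative placeholders.

EMOTIONS = {"happy", "sad", "angry", "surprised", "fearful",
            "disgusted", "neutral"}

def build_prompt(emotion: str) -> str:
    """Turn a recognized expression label into a prompt asking the
    generative model for an action description and a robot command."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion label: {emotion}")
    return (f"The user currently looks {emotion}. "
            "Respond with a short action description and a single "
            "robot motion command that matches this emotion.")

print(build_prompt("happy"))
```

Guarding the label set mirrors the pipeline's assumption that the generative model only ever receives one of the classifier's known output classes.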


Abstract (English):
Existing facial expression recognition technologies can already identify facial expressions accurately. However, they make it difficult to explain the basis on which a computer judges a particular expression. This thesis therefore proposes a method built on the action units of the Facial Action Coding System, augmented with information about the facial features, to improve the model's recognition performance.
This thesis designs an expression recognition system combined with robot interaction. By integrating the expression recognition algorithm with generative AI, the robot's interaction becomes more lifelike. However, the robot's hardware space is highly constrained: large computing equipment cannot be installed, and only small edge computing devices can be used. This thesis therefore modifies the feature extraction network and the architecture of the action unit recognition algorithm, reducing model complexity without degrading performance so that the model fits the computing capability of edge devices. The expression recognition algorithm uses EfficientNet, proposed by Google, whose depthwise separable convolutions reduce the number of model parameters, making it suitable for running on resource-constrained edge devices. Attention maps are generated from the feature maps and combined with the original face image for expression recognition; the recognition result is then used as input to the generative AI, which generates a corresponding action description and robot motion commands.
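The parameter saving from depthwise separable convolution can be checked by counting weights: a k x k standard convolution needs k*k*Cin*Cout weights, while the depthwise-plus-pointwise factorization needs only k*k*Cin + Cin*Cout. A minimal sketch; the kernel and channel sizes below are illustrative, not the thesis's actual EfficientNet configuration:

```python
# Weight counts for one convolutional layer (biases omitted).

def standard_conv_params(k, c_in, c_out):
    """k x k standard convolution mixing all input channels."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution that mixes channels."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)        # 73728
sep = depthwise_separable_params(k, c_in, c_out)  # 576 + 8192 = 8768
print(std, sep, round(std / sep, 1))              # 73728 8768 8.4
```

For 3 x 3 kernels the factorization reduces the weight count by close to a factor of k*k = 9 when Cout is large, which is why architectures such as EfficientNet and MobileNets fit on edge devices.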
The attention map and action unit recognition algorithm proposed in this thesis performs comparably to other action unit recognition algorithms on the BP4D and DISFA datasets. The expression recognition algorithm achieves an accuracy of 60.64% on the AffectNet dataset, higher than the compared methods.
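AU detection on BP4D and DISFA is conventionally compared by per-AU F1 score (the thesis lists its action unit evaluation metrics in Section 4.3.1). A minimal sketch of that metric, using made-up frame-level labels for a single action unit:

```python
# Binary F1 score for one action unit over a sequence of frames.

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground truth and predictions for one AU (e.g. AU12):
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(round(f1_score(y_true, y_pred), 3))  # 0.75
```

In multi-AU benchmarks this score is computed per AU and then averaged, so "comparable" results usually mean a similar mean F1 across the evaluated AUs.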

Abstract (Chinese) I
Abstract (English) III
Acknowledgements V
Table of Contents VI
List of Figures VIII
List of Tables IX
Chapter 1 Introduction 1
1.1 Research Motivation 1
1.2 Research Objectives 3
1.3 Thesis Organization 4
Chapter 2 Related Work 5
2.1 Facial Action Coding System 5
2.2 Visualization Networks 8
2.2.1 Class Activation Mapping 8
2.2.2 Attention Branch Network 9
2.2.3 Facial Action Unit Detection with Transformers 9
Chapter 3 Methodology 10
3.1 Algorithm Pipeline and Architecture 10
3.2 Attention Map and Action Unit Recognition Algorithm 11
3.2.1 Feature Extraction 11
3.2.2 ROI Attention Module 13
3.2.3 Per-AU Embeddings Module 14
3.2.4 AU Correlation Module 15
3.2.5 Loss Functions 16
3.3 Expression Recognition Algorithm 19
3.4 Robot Action Generation 21
Chapter 4 Experimental Results and Comparison 22
4.1 Hardware and Software Architecture 22
4.2 Datasets 22
4.2.1 BP4D 22
4.2.2 DISFA 23
4.2.3 AffectNet 24
4.3 Evaluation Metrics 25
4.3.1 Action Unit Evaluation Metrics 25
4.3.2 Expression Recognition Evaluation Metrics 26
4.4 Action Unit Recognition Results and Comparison 27
4.5 Attention Map Results 30
4.6 Expression Recognition Results and Comparison 32
4.7 Robot Action Description Generation and Application 35
Chapter 5 Conclusion and Future Work 44
5.1 Conclusion 44
5.2 Future Work 45
References 46
Appendices 49
Appendix 1: ChatGPT Instructions 49
Appendix 2: Robot Action Figures 52

[1] P. Ekman and E. Rosenberg, What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System, 2nd ed. Oxford University Press, 2005.
[2] G. Jacob and B. Stenger, “Facial Action Unit Detection with Transformers,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 7676-7685, Jun. 2021.
[3] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, Jun. 2016.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, Jun. 2016.
[5] CMU School of Computer Science. FACS - Facial Action Coding System [Online]. Available: https://www.cs.cmu.edu/~face/facs.htm
[6] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921-2929, Jun. 2016.
[7] H. Fukui, T. Hirakawa, T. Yamashita, and H. Fujiyoshi, “Attention Branch Network: Learning of Attention Mechanism for Visual Explanation,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 10697-10706, Jun. 2019.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention Is All You Need,” in Advances in Neural Information Processing Systems, pp. 5998-6008, Dec. 2017.
[9] J. Peng, X. Bu, M. Sun, Z. Zhang, T. Tan, and J. Yan, “Large-Scale Object Detection in the Wild from Imbalanced Multi-Labels,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 9706-9715, Jun. 2020.
[10] F. Milletari, N. Navab, and S. Ahmadi, “V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation,” in Fourth International Conference on 3D Vision, pp. 565-571, Oct. 2016.
[11] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A Discriminative Feature Learning Approach for Deep Face Recognition,” in European Conference on Computer Vision, pp. 499-515, Sep. 2016.
[12] M. Tan and Q. Le, “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,” in International Conference on Machine Learning, pp. 6105-6114, May 2019.
[13] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems, pp. 1877-1901, Dec. 2020.
[14] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, Mar. 2023.
[15] X. Zhang, L. Yin, J. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. Girard, “BP4D-Spontaneous: A High-Resolution Spontaneous 3D Dynamic Facial Expression Database,” Image and Vision Computing, pp. 692-706, Oct. 2014.
[16] S. Mavadati, M. Mahoor, K. Bartlett, P. Trinh, and J. Cohn, “DISFA: A Spontaneous Facial Action Intensity Database,” IEEE Transactions on Affective Computing, pp. 151-160, Apr. 2013.
[17] A. Mollahosseini, B. Hasani, and M. Mahoor, “AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild,” IEEE Transactions on Affective Computing, pp. 18-31, Jan. 2019.
[18] C. Corneanu, M. Madadi, and S. Escalera, “Deep Structure Inference Network for Facial Action Unit Recognition,” in European Conference on Computer Vision, pp. 298-313, Sep. 2018.
[19] X. Niu, H. Han, S. Yang, Y. Huang, and S. Shan, “Local Relationship Learning with Person-Specific Shape Regularization for Facial Action Unit Detection,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 11909-11918, Jun. 2019.
[20] G. Li, X. Zhu, Y. Zeng, Q. Wang, and L. Lin, “Semantic Relationships Guided Representation Learning for Facial Action Unit Recognition,” in AAAI Conference on Artificial Intelligence, pp. 8594-8601, Jul. 2019.
[21] W. Li, F. Abtahi, Z. Zhu, and L. Yin, “EAC-Net: Deep Nets with Enhancing and Cropping for Facial Action Unit Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2583-2596, Nov. 2018.
[22] Z. Shao, Z. Liu, J. Cai, and L. Ma, “Deep Adaptive Attention for Joint Facial Action Unit Detection and Face Alignment,” in European Conference on Computer Vision, pp. 705-720, Sep. 2018.
[23] Z. Shao, Z. Liu, J. Cai, Y. Wu, and L. Ma, “Facial Action Unit Detection Using Attention and Relation Learning,” IEEE Transactions on Affective Computing, pp. 1274-1289, Jul. 2022.
[24] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li, “ImageNet: A Large-Scale Hierarchical Image Database,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255, Jun. 2009.
[25] D. King, “Dlib-ml: A Machine Learning Toolkit,” Journal of Machine Learning Research, pp. 1755-1758, Dec. 2009.
[26] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv preprint arXiv:1704.04861, Apr. 2017.
[27] Z. Liu, Y. Wang, S. Vaidya, F. Ruehle, J. Halverson, M. Soljačić, T. Hou, and M. Tegmark, “KAN: Kolmogorov-Arnold Networks,” arXiv preprint arXiv:2404.19756, Apr. 2024.
[28] K. Wang, X. Peng, J. Yang, D. Meng, and Y. Qiao, “Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition,” IEEE Transactions on Image Processing, pp. 4057-4069, Jan. 2020.
[29] K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, “Suppressing Uncertainties for Large-Scale Facial Expression Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 6896-6905, Jun. 2020.

Full-text release date: 2026/08/26 (campus network)
Full-text release date: not authorized for public release (off-campus network)
Full-text release date: not authorized for public release (National Central Library: Networked Digital Library of Theses and Dissertations in Taiwan)