
Author: 林八林 (Ba-Lin Lin)
Thesis Title: 應用於電腦視覺之影像人物凝視注意力偵測模型 (GazeVAE: Gaze Visual Attention Estimator)
Advisor: 花凱龍 (Kai-Lung Hua)
Committee Members: 陳永耀 (Yung-Yao Chen), 陳駿丞 (Jun-Cheng Chen), 楊傳凱 (Chuan-Kai Yang), 陸敬互 (Ching-Hu Lu)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese
Pages: 43
Keywords: Gaze following, Gaze target detection, Gaze estimation, Visual attention, Saliency

Detecting where a person in an image is looking provides rich information for fields such as human social interaction and action analysis. The goal of a gaze visual attention estimation model is, given a full scene image and a crop of the target person's head, to predict through deep learning where in the image that person is looking. Recent work on this problem has shown that supplying depth information and generating angular masks helps the model make this prediction, but these models rely on many additional pre-trained networks to reach better performance. We therefore propose a two-stage architecture. In the first stage, we use pseudo labels derived from depth information to train a 3D gaze direction on the dataset and decompose it into a 2D image-plane mask and a 1D depth mask. In the second stage, we use the original image, the head position, and the first-stage outputs to predict whether the person's gaze target lies inside or outside the image; if the target is inside the image, we further predict its location. Apart from an existing depth estimation model, our architecture uses no other pre-trained models, and we propose a novel equivalent angle loss that improves 2D angular accuracy. Our experiments show that even without a pre-trained backbone, our model outperforms several state-of-the-art baselines in area under the curve (AUC) and achieves very close results on other metrics such as distance.
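The decomposition described above can be pictured with a short sketch. The following Python/NumPy code is a minimal illustration, not the thesis implementation: it assumes the 3D gaze direction is a unit vector whose first two components lie in the image plane and whose third component points along the depth axis, that the head position is given in normalized image coordinates, and that the depth map comes from an off-the-shelf monocular depth estimator. All names and parameters (decompose_gaze, fov_deg, the depth convention) are illustrative assumptions.

import numpy as np

def decompose_gaze(gaze_3d, head_pos, depth_map, fov_deg=60.0):
    # gaze_3d : unit vector (gx, gy, gz); (gx, gy) lies in the image plane,
    #           gz points along the depth axis (positive = away from camera).
    # head_pos: (x, y) head location in normalized [0, 1) image coordinates.
    # depth_map: H x W relative depth from a monocular depth estimator,
    #            assumed here to grow with distance from the camera.
    gaze_3d = np.asarray(gaze_3d, dtype=float)
    h, w = depth_map.shape
    ys, xs = np.mgrid[0:h, 0:w]

    # Direction from the head position to every pixel, in the image plane.
    dx = xs / w - head_pos[0]
    dy = ys / h - head_pos[1]
    norm = np.sqrt(dx ** 2 + dy ** 2) + 1e-8

    # 2D field-of-view mask: cosine similarity between the in-plane gaze
    # direction and each head-to-pixel direction, restricted to a cone.
    g2d = gaze_3d[:2] / (np.linalg.norm(gaze_3d[:2]) + 1e-8)
    cos_sim = (dx * g2d[0] + dy * g2d[1]) / norm
    fov_mask = np.clip(cos_sim, 0.0, 1.0)
    fov_mask[cos_sim < np.cos(np.deg2rad(fov_deg / 2))] = 0.0

    # Depth mask: keep pixels whose depth relative to the head is consistent
    # with the sign of the depth component gz.
    iy = min(int(head_pos[1] * h), h - 1)
    ix = min(int(head_pos[0] * w), w - 1)
    rel = depth_map - depth_map[iy, ix]
    depth_mask = np.clip(np.sign(gaze_3d[2]) * rel, 0.0, None)
    depth_mask = depth_mask / (depth_mask.max() + 1e-8)

    return fov_mask, depth_mask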


A person's gaze can reveal where their interest or attention lies in a social scenario. Detecting a person's gaze is essential in multiple domains (e.g., security, psychology, or medical diagnosis). Visual attention models therefore aim to automate this task and determine where the gazes of multiple people in a scene fall. Most existing works in this field depend on multiple pre-trained models. We propose a two-stage framework, the Gaze Visual Attention Estimator (GazeVAE). In the first stage, we train the 3D gaze direction on the GazeFollow dataset with pseudo labels to produce the field of view. We then decompose the 3D direction into a 2D image-plane gaze and a depth-channel gaze to obtain the depth mask image. In the second stage, we concatenate the scene image, the outputs from stage one, and the head position to predict the gaze target's location. We also propose a novel equivalent loss to further reduce the angle error. We train the model from scratch except for the off-the-shelf depth network. Our model outperforms the baseline models in AUC and achieves competitive results on the GazeFollow and VideoAttentionTarget datasets without pre-training.
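As a rough illustration of the second stage, the PyTorch sketch below concatenates the scene image, a head-position map, and the stage-one masks channel-wise, encodes them, and feeds the shared features to both a heatmap decoder and an in/out-of-frame classifier. The layer sizes, module names (HeatmapRegressor, inout_head), and the simple encoder-decoder structure are assumptions made for illustration only; they are not the architecture reported in the thesis.

import torch
import torch.nn as nn

class HeatmapRegressor(nn.Module):
    # Schematic second stage: scene RGB (3 ch) + head-position map (1 ch)
    # + field-of-view mask (1 ch) + depth mask (1 ch) -> 6 input channels.
    def __init__(self, in_channels=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),  # 1-channel gaze-target heatmap (reduced resolution)
        )
        self.inout_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 1)  # in-frame logit
        )

    def forward(self, scene, head_pos_map, fov_mask, depth_mask):
        # Channel-wise concatenation of the scene and the stage-one cues.
        x = torch.cat([scene, head_pos_map, fov_mask, depth_mask], dim=1)
        feat = self.encoder(x)
        heatmap = self.decoder(feat)
        inout_logit = self.inout_head(feat)
        return heatmap, inout_logit

In this kind of pipeline the heatmap is commonly supervised with a regression loss against a Gaussian centered on the annotated gaze point, and the in/out logit with binary cross-entropy; the exact objective used by GazeVAE is described in the thesis itself.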

Contents
Abstract in Chinese
Abstract in English
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Related Work
  2.1 Gaze Target Prediction
  2.2 Gaze Direction Estimation
  2.3 Visual Saliency
3 Method
  3.1 Overview
  3.2 Visual Attention Module
  3.3 Heatmap Regression Module
  3.4 Objective Function
4 Experiments
  4.1 GazeFollow
  4.2 VideoAttentionTarget
  4.3 Implementation Details
  4.4 Experimental Results
  4.5 Ablation Study
  4.6 Examples of Failure
5 Conclusions
  5.1 Future works
References
Letter of Authority

