
Author: Chia-Hao Chuang (莊家豪)
Title: Improving the Accuracy of One-Stage Scene Graph Generation Using Counterfactual Causality Relationships
Advisor: Yung-Ho Leu (呂永和)
Committee Members: Wei-Ning Yang (楊維寧), Yun-Shiow Chen (陳雲岫)
Degree: Master
Department: College of Management - Department of Information Management
Publication Year: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: English
Pages: 42
Keywords (Chinese): 場景圖生成, 場景圖, 長尾問題, 反事實因果, 總直接效果, 一階段
Keywords (English): Scene Graph Generation, Scene Graph, Long-Tail Problem, Counterfactual Causality, Total Direct Effect, One-Stage

Scene Graph Generation (SGG) predicts the relationships between objects in a given image. However, because scene graph datasets often exhibit extremely imbalanced relationship distributions, predictions tend to be biased toward specific relationship classes. To address this long-tail problem, many improvement methods have been proposed in recent years.
Recently, a method called the Relation Transformer (RelTR) was proposed, which generates scene graphs in a one-stage manner. Compared with two-stage SGG methods, it achieves faster inference with fewer parameters, but at the cost of lower accuracy in relationship prediction.
In this thesis, we consider how to improve relationship prediction accuracy while maintaining inference speed. We therefore apply the concept of the Total Direct Effect (TDE) on top of the RelTR architecture and generate counterfactual image features to reduce biased relationship predictions.
We conduct several experiments, dividing the relationship classes in the VG150 dataset into head, body, and tail groups by instance count and evaluating each group separately. The results show that our method outperforms RelTR on the mean recall-based metric, but its recall-based performance on the tail group is lower. Finally, we visualize the experimental results to demonstrate that our method can produce high-quality scene graphs.


Scene Graph Generation (SGG) aims to predict informative relationships between objects in a given image. However, because benchmark datasets typically have extremely imbalanced relationship class distributions, prediction results tend to be biased toward specific relationship classes. To address this long-tail problem, many refinement approaches have been proposed in recent years.
Recently, a novel one-stage SGG approach named the Relation Transformer (RelTR) was proposed, which achieves faster inference speed with fewer parameters than two-stage SGG approaches, but at the cost of lower relationship prediction accuracy.
In this thesis, we aim to maintain RelTR's inference speed while improving its prediction performance. We therefore apply the Total Direct Effect (TDE) concept on top of RelTR, generating counterfactual image features to alleviate biased relationship predictions. Moreover, our approach operates only at the prediction stage and introduces no additional parameters that must be trained from scratch.
We conduct several experiments and evaluate on Scene Graph Detection (SGDet), one of the sub-tasks of Relationship Retrieval (RR). We further divide the relationship classes into head, body, and tail groups according to their instance counts in the VG150 dataset, and evaluate each group against RelTR. The results show that our approach outperforms RelTR on the mean recall-based metric but degrades on the recall-based metric for the tail group. We also visualize qualitative results demonstrating that our approach can produce high-quality scene graphs.
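The TDE debiasing idea described above can be sketched in code. This is a minimal illustration rather than the thesis's actual RelTR-based implementation: `predicate_head` is a stand-in for the relationship classifier, the counterfactual features are produced by replacing each instance's visual feature with the batch mean (one common choice for "wiping" visual evidence in TDE-style debiasing), and the weighting factor `alpha` is an assumed hyperparameter.

```python
import torch

torch.manual_seed(0)

# Stand-in for the relationship (predicate) classifier head.
# 50 predicate classes, matching the VG150 benchmark.
predicate_head = torch.nn.Linear(256, 50)

def tde_predicate_logits(visual_feats: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Debias predicate logits via the Total Direct Effect (TDE).

    TDE = logits(factual features) - alpha * logits(counterfactual features),
    where the counterfactual wipes the instance-specific visual evidence
    (here: every feature is replaced by the batch-mean feature).
    """
    factual = predicate_head(visual_feats)                          # Y_x
    wiped = visual_feats.mean(dim=0, keepdim=True).expand_as(visual_feats)
    counterfactual = predicate_head(wiped)                          # Y_x-bar
    return factual - alpha * counterfactual                         # TDE

feats = torch.randn(8, 256)      # features for 8 candidate subject-object pairs
debiased = tde_predicate_logits(feats)
print(debiased.shape)            # torch.Size([8, 50])
```

Because the subtraction happens only on the output logits, this fits the thesis's claim that the debiasing is applied purely at prediction time, without retraining any parameters.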

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
Chapter 1. Introduction
  1.1 Research Background
  1.2 Research Purpose
  1.3 Research Motivation
  1.4 Research Overview
Chapter 2. Related Work
  2.1 Scene Graph
  2.2 Scene Graph Generation
    2.2.1 Two-Stage Approaches
    2.2.2 One-Stage Approach
Chapter 3. Research Method
  3.1 Research Structure
  3.2 Dataset Description
  3.3 Relationship Retrieval
Chapter 4. Experiments and Results
  4.1 Experimental Environment
  4.2 Implementation Details
  4.3 Quantitative Results and Evaluations
Chapter 5. Conclusions
Appendix
References

