
Author: Arces A. Talavera
Thesis title: Layout and Context Understanding for Image Synthesis with Scene Graphs
Advisor: Kai-Lung Hua (花凱龍)
Committee members: Kai-Lung Hua (花凱龍), Arnulfo Azcarraga, Hsing-Kuo Pao (鮑興國), Chuan-Kai Yang (楊傳凱), Chao-Lung Yang (楊朝龍)
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2019
Graduation academic year: 107 (2018-2019)
Language: English
Number of pages: 43
Keywords (Chinese): Generative Models, Image Synthesis, Scene Graphs
Keywords (English): Generative Models, Image Synthesis, Scene Graphs
Abstract: Advances in text-to-image synthesis can generate remarkable images from textual descriptions. However, these methods are designed to generate only one object with varying attributes, and they struggle with complex descriptions containing multiple arbitrary objects, since such descriptions require information about the placement and size of each object in the image. Recently, a method that infers object layouts from scene graphs was proposed to address this problem. However, that method describes the layout using only object labels, which fails to capture the appearance of some objects. Moreover, the model is biased towards generating rectangular shapes in the absence of ground-truth masks. In this work, we propose an object encoding module that captures object features and feeds them to the image generation network as additional information. We also introduce a graph-cuts-based segmentation method that infers object masks from bounding boxes to better model object shapes. Our method produces more discernible images with more realistic shapes than those generated by the current state-of-the-art method.
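
The record does not show the thesis's actual implementation, so the sketch below is only a hedged illustration of the general idea of inferring an object mask from its bounding box with graph-cut segmentation; it uses OpenCV's GrabCut, a graph-cuts method initialized from a rectangle. The helper name mask_from_bbox and its arguments are hypothetical and are not the author's code.

    # Minimal sketch (assumption, not the thesis implementation): infer a binary
    # object mask from a bounding box via graph-cut segmentation (OpenCV GrabCut).
    import cv2
    import numpy as np

    def mask_from_bbox(image_bgr, bbox, iters=5):
        """image_bgr: HxWx3 uint8 image (BGR); bbox: (x, y, w, h) of the object."""
        mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
        # Internal GMM models that GrabCut updates for background and foreground.
        bgd_model = np.zeros((1, 65), dtype=np.float64)
        fgd_model = np.zeros((1, 65), dtype=np.float64)
        # Initialize from the rectangle: pixels outside are background, pixels
        # inside are "probably foreground"; GrabCut then refines the labels.
        cv2.grabCut(image_bgr, mask, tuple(bbox), bgd_model, fgd_model,
                    iters, cv2.GC_INIT_WITH_RECT)
        # Keep definite and probable foreground pixels as the object mask.
        return ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)

A mask obtained this way could stand in for the rectangular box when composing the scene layout, which is the role the abstract describes for its graph-cuts step.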

Table of Contents:
Abstract
Acknowledgements
Contents
List of Figures
List of Tables
1 Introduction
2 Related Works
  2.1 Related Works
    2.1.1 Image Generation from Text
    2.1.2 Image Generation from Semantic Layouts
3 Method
  3.1 Overview of Proposed Framework
  3.2 Mask Generation from Bounding Box
  3.3 Graph Convolution Network
  3.4 Object Encoding Module
  3.5 Layout Prediction
  3.6 Generator and Discriminators
    3.6.1 Generator
    3.6.2 Discriminators
  3.7 Objective
4 Experimental Results
  4.1 Implementation Details
    4.1.1 Network Architecture
    4.1.2 Training
  4.2 Dataset
  4.3 Ablation Study
    4.3.1 Depiction of Object Appearance
    4.3.2 Application of Masks to the Layout
    4.3.3 Predicted Layout
  4.4 User Study
5 Conclusion
  5.1 Future Work
References
