Author: | Arces A. Talavera
---|---
Thesis Title: | Layout and Context Understanding for Image Synthesis with Scene Graphs
Advisor: | Kai-Lung Hua (花凱龍)
Committee: | Kai-Lung Hua (花凱龍), Arnulfo Azcarraga, Hsing-Kuo Pao (鮑興國), Chuan-Kai Yang (楊傳凱), Chao-Lung Yang (楊朝龍)
Degree: | Master
Department: | College of Electrical Engineering and Computer Science, Department of Computer Science and Information Engineering
Publication Year: | 2019
Graduation Academic Year: | 107
Language: | English
Pages: | 43
Keywords: | Generative Models, Image Synthesis, Scene Graphs
Advances in text-to-image synthesis generate remarkable images from textual descriptions. However, these methods are designed to generate only one object with varying attributes, and they struggle with complex descriptions containing multiple arbitrary objects, since such descriptions require information about the placement and size of each object in the image. Recently, a method that infers object layouts from scene graphs was proposed as a solution to this problem. However, that method describes the layout using only object labels, which fails to capture the appearance of some objects. Moreover, the model is biased toward generating rectangle-shaped objects in the absence of ground-truth masks. In this paper, we propose an object encoding module that captures object features and feeds them to the image generation network as additional information. We also introduce a graph-cuts-based segmentation method that infers object masks from bounding boxes to better model object shapes. Our method produces more discernible images with more realistic shapes than the images generated by the current state-of-the-art method.
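The abstract describes inferring object masks from bounding boxes with graph cuts. As a minimal sketch of that idea (not the thesis's released implementation), the snippet below uses OpenCV's GrabCut, a standard graph-cuts segmenter, to turn a bounding box into a binary object mask; the function name `infer_mask_from_bbox` and the iteration count are illustrative assumptions.

```python
import cv2
import numpy as np

def infer_mask_from_bbox(image, bbox, iters=5):
    """Infer a binary object mask from a bounding box with GrabCut.

    image: HxWx3 uint8 BGR image.
    bbox:  (x, y, w, h) bounding box around the object.
    Returns an HxW uint8 mask (1 = object, 0 = background).
    """
    mask = np.zeros(image.shape[:2], np.uint8)
    # GrabCut's internal GMM parameters; the (1, 65) shape is fixed by OpenCV.
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    # Initialize from the rectangle: pixels outside it are labeled background,
    # pixels inside are "probably foreground" and are refined by graph cuts.
    cv2.grabCut(image, mask, bbox, bgd_model, fgd_model, iters,
                cv2.GC_INIT_WITH_RECT)
    # Keep definite and probable foreground as the object mask.
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                    1, 0).astype(np.uint8)

# Example usage (hypothetical file and box):
# mask = infer_mask_from_bbox(cv2.imread("scene.jpg"), (30, 40, 120, 90))
```

Because the refined mask follows image edges inside the box rather than the box itself, a generator trained on such masks is less prone to the rectangle-shaped-object bias the abstract mentions.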