
Graduate Student: 林正曜 (Cheng-Yao Lin)
Thesis Title: SSGAN: 基於語意相似性的文字生成圖像模型
(SSGAN: A Text-to-Image Generation Model Based on Semantic Similarity)
Advisor: 項天瑞 (Tien-Ruey Hsiang)
Committee Members: 陳建中, 吳怡樂
Degree: Master
Department: Department of Computer Science and Information Engineering, College of Electrical Engineering and Computer Science
Publication Year: 2022
Graduation Academic Year: 110
Language: Chinese
Pages: 50
Chinese Keywords: 文字生成圖像 (text-to-image generation)
Foreign Keywords: text-to-image
  • Generating a high-quality, photorealistic image from text is a challenging task with many practical applications. Previous studies first generate a 64×64 low-resolution image with rough outlines and colors and then refine it into a 256×256 high-resolution image through various methods. However, current text-to-image methods have three main problems. First, they rely heavily on the 64×64 low-resolution image of the initial stage; if the generated outline is not good enough, the subsequent refinement process can hardly recover, and a high-quality image ultimately cannot be produced. Second, existing methods extract sentence features with RNNs or LSTMs, which suffer from limitations such as long-range dependency. Finally, semantic consistency remains a difficult problem that every researcher in text-to-image generation must face. In this study, we propose the SSGAN architecture for image generation and use a similarity module to address poor initial image generation. Following the concept of CycleGAN, we convert both the 64×64 low-resolution image and the original corresponding image into sentences and then evaluate their similarity, which encourages better generation of the 64×64 low-resolution image. At the same time, using semantic similarity as the loss function in the similarity module alleviates the semantic-consistency problem. Finally, because BERT has been very successful in NLP and works well for sentence understanding and application, we adopt BERT for semantic extraction and sentence generation. To verify the effectiveness of the proposed SSGAN, we conduct experiments on the Caltech-UCSD Birds-200 dataset and compare with AttnGAN and MirrorGAN. The experimental results show that, compared with other models, our SSGAN produces better image quality and achieves significant improvements on various evaluation metrics.


    Generating a high-quality photorealistic image from text is a challenging task with many practical applications.
    In previous studies, a 64×64 low-resolution image with rough outlines and colors is generated first and then refined into a 256×256 high-resolution image through various methods.
    However, current text-to-image methods have three main problems. First, these methods rely heavily on the 64×64 low-resolution image of the initial stage; if the generated outline is not good enough, the subsequent refinement process can hardly recover, and high-quality images ultimately cannot be generated.
    Second, existing methods extract sentence features with RNNs or LSTMs, but both have limitations: when the input sentence is too long, the extracted global feature captures mainly the meaning of the second half while the meaning of the first half is ignored.
    Finally, semantic consistency remains a difficult problem that every researcher in text-to-image generation must face.
    In this study, we propose the SSGAN architecture to generate images and use a similarity module to address poor initial image generation. Following the concept of CycleGAN, we convert both the 64×64 low-resolution image and the original corresponding image into sentences and then evaluate their similarity, which encourages better generation of the 64×64 low-resolution image. At the same time, using semantic similarity as the loss function in the similarity module alleviates the semantic-consistency problem. Finally, because BERT has been highly successful in NLP and works well for sentence understanding and application, we adopt BERT for semantic extraction and sentence generation.
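    As a minimal sketch of the semantic-similarity loss idea described above (an illustration only, not code from the thesis), assume the original description and the caption re-generated from the 64×64 image have already been encoded into fixed-size sentence embeddings by a BERT-based encoder; the function name, the embedding dimension of 768, and the toy inputs below are assumptions:

        import torch
        import torch.nn.functional as F

        def semantic_similarity_loss(emb_generated: torch.Tensor,
                                     emb_reference: torch.Tensor) -> torch.Tensor:
            """1 - cosine similarity between two batches of sentence embeddings.

            emb_generated: embeddings of captions re-generated from the 64x64 images.
            emb_reference: embeddings of the original text descriptions.
            Both are (batch, dim) tensors from a BERT-based sentence encoder.
            """
            cos = F.cosine_similarity(emb_generated, emb_reference, dim=-1)  # (batch,)
            return (1.0 - cos).mean()  # 0 when directions match, larger when they diverge

        # Toy usage with random tensors standing in for BERT sentence embeddings.
        emb_a = torch.randn(4, 768)
        emb_b = torch.randn(4, 768)
        print(semantic_similarity_loss(emb_a, emb_b))

    Minimizing such a term pulls the caption of the generated 64×64 image toward the original description, which is the semantic-consistency signal the similarity module is intended to provide.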
    To verify the effectiveness of the proposed SSGAN, we conduct experiments on the Caltech-UCSD Birds-200 dataset and compare it with AttnGAN and MirrorGAN. Experimental results show that, compared with other models, our SSGAN achieves better image quality and significant improvements on various evaluation metrics.
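    One of the metrics used in the evaluation is the Inception Score [35]. As a rough sketch of how that score is computed from classifier outputs (an assumption for illustration; it omits the usual data splits and the pretrained Inception-v3 feature extractor a real evaluation would use):

        import torch

        def inception_score(probs: torch.Tensor, eps: float = 1e-12) -> float:
            """IS = exp( mean over images of KL( p(y|x) || p(y) ) ).

            probs: (N, num_classes) softmax outputs p(y|x) of a pretrained
            classifier (typically Inception-v3) on N generated images.
            """
            p_y = probs.mean(dim=0, keepdim=True)  # marginal class distribution p(y)
            kl = (probs * (torch.log(probs + eps) - torch.log(p_y + eps))).sum(dim=1)
            return float(torch.exp(kl.mean()))

        # Toy usage: 8 fake "images" over 10 classes.
        toy_probs = torch.softmax(torch.randn(8, 10), dim=1)
        print(inception_score(toy_probs))

    A higher score indicates that individual images yield confident class predictions while the generated set as a whole covers diverse classes.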

    Chinese Abstract
    English Abstract
    Table of Contents
    List of Figures
    List of Tables
    1 Introduction
    1.1 Motivation and Purpose
    1.2 Thesis Organization
    2 Related Work
    2.1 Generative Adversarial Networks
    2.2 Text-to-Image Generation
    2.3 Image Captioning
    2.3.1 Encoder-Decoder Deep Learning Methods
    2.3.2 BERT-Based Methods
    3 Method and Architecture
    3.1 Overall Architecture
    3.2 Semantic Extraction Module
    3.3 Generative Network Module
    3.4 Semantic Similarity Module
    3.4.1 Image Captioning
    3.4.2 Sentence Semantic Similarity
    3.5 Loss Functions
    4 Experiments and Evaluation
    4.1 Dataset
    4.2 Evaluation Methods
    4.2.1 Inception Score
    4.2.2 Similarity Score
    4.3 Experimental Results and Validation
    4.3.1 Qualitative Evaluation
    4.3.2 Quantitative Evaluation
    4.3.3 Ablation Study
    4.3.4 Objective Function Weight Experiments
    4.4 Experiment Summary
    5 Conclusion
    References

    [1] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Gaugan: semantic image synthesis with spatially adaptive normalization,” in ACM SIGGRAPH 2019 Real-Time Live!, pp. 1–1, 2019.
    [2] S. Surya, A. Setlur, A. Biswas, and S. Negi, “Restgan: A step towards visually guided shopper experience via text-to-image synthesis,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1200–1208, 2020.
    [3] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1316–1324, 2018.
    [4] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134, 2017.
    [5] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE international conference on computer vision, pp. 2223–2232, 2017.
    [6] X. Zhu, A. B. Goldberg, M. Eldawy, C. R. Dyer, and B. Strock, “A text-to-picture synthesis system for augmenting communication,” in AAAI, vol. 7, pp. 1590–1595, 2007.
    [7] J. Agnese, J. Herrera, H. Tao, and X. Zhu, “A survey and taxonomy of adversarial neural networks for text-to-image synthesis,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 10, no. 4, p. e1345, 2020.
    [8] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, “Semantics disentangling for text-to-image generation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2327–2336, 2019.
    [9] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “Vl-bert: Pre-training of generic visual-linguistic representations,” arXiv preprint arXiv:1908.08530, 2019.
    [10] X. Li, X. Yin, C. Li, X. Hu, P. Zhang, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, and J. Gao, “Oscar: Object-semantics aligned pre-training for vision-language tasks,” in ECCV, 2020.
    [11] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” arXiv preprint arXiv:1908.10084, 2019.
    [12] T. Qiao, J. Zhang, D. Xu, and D. Tao, “Mirrorgan: Learning text-to-image generation by redescription,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505–1514, 2019.
    [13] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International conference on machine learning, pp. 1060–1069, PMLR, 2016.
    [14] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proceedings of the IEEE international conference on computer vision, pp. 5907–5915, 2017.
    [15] Z. Zhang, Y. Xie, and L. Yang, “Photographic text-to-image synthesis with a hierarchically-nested adversarial network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6199–6208, 2018.
    [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
    [17] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
    [18] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
    [19] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, pp. 234–241, Springer, 2015.
    [20] “Google.” https://www.google.com/.
    [21] “Flickr.” https://www.flickr.com/.
    [22] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” Advances in neural information processing systems, vol. 29, 2016.
    [23] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan++: Realistic image synthesis with stacked generative adversarial networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 8, pp. 1947–1962, 2018.
    [24] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164, 2015.
    [25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
    [26] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137, 2015.
    [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
    [28] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7008–7024, 2017.
    [29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
    [30] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, pp. 2048–2057, PMLR, 2015.
    [31] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5659–5667, 2017.
    [32] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” Advances in neural information processing systems, vol. 32, 2019.
    [33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “Visualbert: A simple and performant baseline for vision and language,” arXiv preprint arXiv:1908.03557, 2019.
    [34] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 dataset,” 2011.
    [35] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, 2016.

    Full text release date: 2025/09/28 (campus network)
    Full text release date: not authorized for public release (off-campus network)
    Full text release date: not authorized for public release (National Central Library: Taiwan Dissertation and Thesis System)