
Graduate Student: 楊上寬 (Shang-Kuan Yang)
Thesis Title: 以基於 Transformer 的生成對抗網路進行動漫風格場景轉換 (Anime Scene Style Transfer Using Transformer-based GAN)
Advisor: 戴文凱 (Wen-Kai Tai)
Committee Members: 紀明德 (Ming-Te Chi), 金台齡 (Tai-Ling Jin)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Year of Publication: 2023
Academic Year of Graduation: 111 (2022-2023)
Language: English
Number of Pages: 68
Chinese Keywords: 風格遷移 (style transfer), 風格化 (stylization), 生成對抗網路 (generative adversarial network), Transformer, 動漫 (anime)
English Keywords: Style Transfer, Stylization, Generative Adversarial Network, Transformer, Anime
    In recent years, as anime-related fields have flourished, the number of anime projects and the value of the anime market have continued to climb, and audiences' expectations for animation quality have risen accordingly. However, the age structure of the animation workforce is gradually growing older, and its workload keeps getting heavier. During animation production, real-world scenes are often used as references, and artists must work out how to turn a real scene into one with an anime style, a process that consumes a great deal of time. This thesis therefore aims to develop a tool that can quickly generate high-quality anime scenes to serve as references for artists and speed up the production process.

    This thesis adjusts the dataset, the model architecture, and the loss function to generate anime-style scenes. For the dataset, vit-gpt2 and Stable Diffusion are used to generate high-quality anime images, and a filtering method built on the clip-ViT-B-32 model selects the data suitable for training. For the model, the StyTR$^2$ architecture is adapted so that it pays closer attention to the structure of real scenes and performs the transformation based solely on the real photos. For the loss function, the losses of AnimeGAN and CartoonGAN are referenced, and an HSV loss is further proposed to blend in anime-style colors while respecting the colors of objects in the real photos.
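
    A minimal sketch of how such a caption-generate-filter pipeline could be assembled from the named off-the-shelf models is shown below. It is an illustration only, not the thesis's implementation: the prompt wording, the Ghibli-Diffusion checkpoint, the filtering criterion (CLIP cosine similarity between each generated image and its source photo), and the 0.7 threshold are assumptions; the actual pipeline and filtering rule are specified in Chapter 3 of the thesis.

```python
# Hypothetical sketch of the caption -> generate -> filter idea described above,
# assembled from off-the-shelf Hugging Face models named in the abstract.
# Prompt wording, checkpoint choice, similarity criterion, and threshold are
# illustrative assumptions, not the thesis's actual settings.
from typing import Optional

import torch
from PIL import Image
from transformers import pipeline
from diffusers import StableDiffusionPipeline
from sentence_transformers import SentenceTransformer, util

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
generator = StableDiffusionPipeline.from_pretrained(
    "nitrosocke/Ghibli-Diffusion", torch_dtype=torch.float16
).to("cuda")
clip_model = SentenceTransformer("clip-ViT-B-32")

def make_style_image(photo_path: str, threshold: float = 0.7) -> Optional[Image.Image]:
    """Caption a real photo, synthesize an anime-style counterpart, and keep it
    only if its CLIP embedding stays close enough to the original scene."""
    photo = Image.open(photo_path).convert("RGB")
    caption = captioner(photo)[0]["generated_text"]              # vit-gpt2 caption
    candidate = generator(f"ghibli style, {caption}").images[0]  # Stable Diffusion sample

    # Filtering phase: cosine similarity between the two CLIP image embeddings.
    score = util.cos_sim(clip_model.encode(photo), clip_model.encode(candidate)).item()
    return candidate if score >= threshold else None
```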

    According to a human perceptual study and quantitative comparisons, the proposed method is competitive with current models in terms of generation quality, model size, and inference time.


    In recent years, the growing popularity of anime has led to increasing demands for higher animation quality. However, the population of anime creators is gradually aging, and their workload has become increasingly heavy. During the animation production process, creators often refer to real-world scenes and must work out how to transform them into scenes with an anime style, which consumes a significant amount of time. Therefore, our objective is to develop a tool that can rapidly generate high-quality anime scenes, providing a reference for artists and accelerating the production process.

    In this thesis, we adjust the dataset, model architecture, and loss function to achieve the goal of generating anime-style scenes. Regarding the dataset, we utilize vit-gpt2 and Stable Diffusion to generate high-quality anime images and establish a filtering pipeline using the clip-ViT-B-32 model to eliminate unsuitable training data. Regarding the model, we modify the StyTR$^2$ architecture to be more attentive to the structure of the content images and to perform the transformation based solely on the content images. Regarding the loss function, we reference the loss functions from AnimeGAN and CartoonGAN and additionally propose an HSV loss to incorporate anime-style colors while preserving the colors of objects in the content images.
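
    The precise formulation of the HSV loss appears in Chapter 3 of the thesis; the sketch below only illustrates the general idea of comparing the generated image and the content photo channel-wise in HSV space, with the hue channel weighted most heavily so object colors stay anchored to the photo. The channel weights and the kornia-based conversion are assumptions, not details taken from the thesis.

```python
# Illustrative sketch of an HSV-space colour loss; NOT the thesis's exact
# formulation, which is defined in Chapter 3. The per-channel L1 terms, the
# weights, and the use of kornia for the RGB->HSV conversion are assumptions.
import torch
import torch.nn.functional as F
from kornia.color import rgb_to_hsv

def hsv_loss(generated: torch.Tensor, content: torch.Tensor,
             w_h: float = 1.0, w_s: float = 0.5, w_v: float = 0.1) -> torch.Tensor:
    """Compare generated and content images channel-wise in HSV space.

    Both inputs are RGB tensors of shape (B, 3, H, W) in [0, 1]. A large hue
    weight keeps object colours anchored to the real photo, while a small value
    weight leaves room for anime-style brightness shifts. The circular nature
    of hue (kornia returns H in [0, 2*pi]) is ignored here for simplicity.
    """
    gen_hsv = rgb_to_hsv(generated)
    con_hsv = rgb_to_hsv(content)
    h_loss = F.l1_loss(gen_hsv[:, 0], con_hsv[:, 0])
    s_loss = F.l1_loss(gen_hsv[:, 1], con_hsv[:, 1])
    v_loss = F.l1_loss(gen_hsv[:, 2], con_hsv[:, 2])
    return w_h * h_loss + w_s * s_loss + w_v * v_loss
```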

    Based on user studies and quantitative comparisons, our proposed method demonstrates competitiveness in terms of generation quality, model size, and inference time when compared to existing models.

    1 Introduction
    2 Related Work
      2.1 Neural Style Transfer
      2.2 Generative Adversarial Networks (GAN)
      2.3 AnimeGAN
      2.4 Transformer-based Models
        2.4.1 Transformer
        2.4.2 Transformer in Computer Vision
        2.4.3 StyTr$^2$
      2.5 Stable Diffusion
    3 Method
      3.1 Style Images Creation Pipeline
      3.2 Model Architecture
      3.3 Loss Function
    4 Experimental Results
      4.1 Effectiveness of the Filtering Phase of the Style Images Creation Pipeline
      4.2 Implementation Details
      4.3 Comparison with SOTA Methods
        4.3.1 Quantitative Evaluation
        4.3.2 Qualitative Evaluation
      4.4 Analysis of HSV Loss
      4.5 Human Perceptual Study
    5 Conclusions
    References

    [1] B. Li, Y. Zhu, Y. Wang, C.-W. Lin, B. Ghanem, and L. Shen, “Anigan: Style-guided generative adversarial networks for unsupervised anime face generation,” IEEE Transactions on Multimedia, vol. 24, pp. 4077–4091, 2021.

    [2] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.

    [3] J. Chen, G. Liu, and X. Chen, “Animegan: A novel lightweight gan for photo animation,” in International Symposium on Intelligence Computation and Applications, pp. 242–256, Springer, 2020.

    [4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.

    [5] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.

    [6] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.

    [7] Y. Deng, F. Tang, W. Dong, C. Ma, X. Pan, L. Wang, and C. Xu, “Stytr2: Image style transfer with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11326–11336, 2022.

    [8] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510, 2017.

    [9] L. Sheng, Z. Lin, J. Shao, and X. Wang, “Avatar-net: Multi-scale zero-shot style transfer by feature decoration,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8242–8250, 2018.

    [10] D. Y. Park and K. H. Lee, “Arbitrary style transfer with style-attentional networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5880–5888, 2019.

    [11] Y. Yao, J. Ren, X. Xie, W. Liu, Y.-J. Liu, and J. Wang, “Attention-aware multi-stroke style transfer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1467–1475, 2019.

    [12] Y. Deng, F. Tang, W. Dong, W. Sun, F. Huang, and C. Xu, “Arbitrary style transfer via multi-adaptation network,” in Proceedings of the 28th ACM International Conference on Multimedia, pp. 2719–2727, 2020.

    [13] Y. Deng, F. Tang, W. Dong, H. Huang, C. Ma, and C. Xu, “Arbitrary video style transfer via multi-channel correlation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1210–1217, 2021.

    [14] J. An, S. Huang, Y. Song, D. Dou, W. Liu, and J. Luo, “Artflow: Unbiased image style transfer via reversible neural flows,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 862–871, 2021.

    [15] S. Liu, T. Lin, D. He, F. Li, M. Wang, X. Li, Z. Sun, Q. Li, and E. Ding, “Adaattn: Revisit attention mechanism in arbitrary neural style transfer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6649–6658, 2021.

    [16] H. Chen, Z. Wang, H. Zhang, Z. Zuo, A. Li, W. Xing, D. Lu, et al., “Artistic style transfer with internal-external learning and contrastive learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 26561–26573, 2021.

    [17] X. Wu, Z. Hu, L. Sheng, and D. Xu, “Styleformer: Real-time arbitrary style transfer via parametric style composition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14618–14627, 2021.

    [18] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

    [19] “Ghibli-Diffusion.” https://huggingface.co/nitrosocke/Ghibli-Diffusion. Accessed: 2023-06-08.

    [20] H. Masuda, T. Sudo, T. Koudate, A. Matsumoto, K. Rikukawa, T. Ishida, Y. Kameyama, Y. Mori, and M. Hasegawa, “Anime industry report 2022.” https://aja.gr.jp/download/2022_anime_ind_rpt_summary_en, 2022.

    [21] Japan Animation Creators Association (JAniCA), “アニメーション制作者実態調査報告書2015” [Animation Creators Survey Report 2015]. http://www.janica.jp/survey/survey2015Report.pdf, 2015.

    [22] Japan Animation Creators Association (JAniCA), “アニメーション制作者実態調査報告書2019” [Animation Creators Survey Report 2019]. http://www.janica.jp/survey/survey2019Report.pdf, 2019.

    [23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.

    [24] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.

    [25] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8110–8119, 2020.

    [26] T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila, “Alias-free generative adversarial networks,” in Proc. NeurIPS, 2021.

    [27] E. Richardson, Y. Alaluf, O. Patashnik, Y. Nitzan, Y. Azar, S. Shapiro, and D. Cohen-Or, “Encoding in style: A stylegan encoder for image-to-image translation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2287–2296, 2021.

    [28] “clip-ViT-B-32.” https://huggingface.co/sentence-transformers/clip-ViT-B-32. Accessed: 2023-06-08.

    [29] Y. Chen, Y.-K. Lai, and Y.-J. Liu, “Cartoongan: Generative adversarial networks for photo cartoonization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9465–9474, 2018.

    [30] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,” Advances in Neural Information Processing Systems, vol. 30, 2017.

    [31] Y. Jiang, S. Chang, and Z. Wang, “Transgan: Two pure transformers can make one strong gan, and that can scale up,” in Advances in Neural Information Processing Systems (M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, eds.), vol. 34, pp. 14745–14758, Curran Associates, Inc., 2021.

    [32] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, “ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness,” in International Conference on Learning Representations, 2019.

    [33] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.

    [34] “vit-gpt2-image-captioning.” https://huggingface.co/nlpconnect/vit-gpt2-image-captioning. Accessed: 2023-06-09.

    [35] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley, “Least squares generative adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2017.

    [36] M. Afifi, M. A. Brubaker, and M. S. Brown, “Histogan: Controlling colors of gan-generated and real images via color histograms,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.

    [37] “Landscape pictures.” https://www.kaggle.com/datasets/arnaud58/landscape-pictures?resource=download&select=00000008.jpg. Accessed: 2023-07-01.

    [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

    [39] L. Zhang and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” arXiv preprint arXiv:2302.05543, 2023.

    [40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

    Full-text release date: 2033/08/02 (off-campus network)
    Full-text release date: 2033/08/02 (National Central Library: Taiwan NDLTD system)