
Graduate Student: Yi-Che Yu (游宜哲)
Thesis Title: Study of Normalization on Video Generation by Semantic Segmentation (語意分割的視頻生成之正規化研究)
Advisors: Wen-Hsien Fang (方文賢) and Yie-Tarng Chen (陳郁堂)
Oral Defense Committee: Sheng-Luen Chung (鍾聖倫), Kuen-Tsair Lay (賴坤財), Chien-Ching Chiu (丘建青)
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2021
Graduation Academic Year: 109
Language: Chinese
Number of Pages: 67
Keywords: video synthesis, generative adversarial network, convolutional neural network, data generation, semantic segmentation

In the field of deep learning, collecting datasets is often the most resource-intensive task. To increase the amount of video resembling the real world, we use semantic segmentation to train a model that converts input videos into realistic videos, thereby producing the required datasets. In this thesis we propose a method to improve the quality of the generated videos. In the low-resolution part, the method modifies the Video-to-Video Synthesis architecture so that the foreground generation network receives more information from previous frames; we then adopt Spatially-Adaptive Normalization (SPADE) so that the model not only avoids losing the semantic segmentation features but also captures a richer set of features. In the high-resolution part, we add Class-Adaptive Normalization (CLADE), which keeps the parameter count from growing too much while improving the quality of the generated videos. Finally, we use differentiable augmentation to fine-tune the model so that it can also be applied to other datasets. With this approach, modifying the vid2vid architecture and adding the different normalization schemes effectively raises video quality, and differentiable augmentation makes the model less prone to collapse when the dataset changes. The continuity and quality of the generated videos are verified on semantic segmentation videos from Cityscapes, Waymo, and Carla.


In the field of deep learning, collecting datasets usually consumes the most resources. In order to increase the amount of video resembling the real world, we train a model to convert semantic segmentation videos into realistic videos.
In this thesis, we propose a method to improve the quality of the generated videos; the low- and high-resolution parts of the model are treated differently. In the low-resolution part, we modify the video-to-video synthesis architecture so that the foreground generation network obtains more information from the previous frames, and then apply Spatially-Adaptive Normalization (SPADE) to prevent the semantic segmentation features from being lost in the model.
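As a rough illustration of how SPADE conditions the normalization on the label map, the PyTorch-style sketch below modulates batch-normalized activations with a per-pixel scale and bias predicted from the segmentation map. The hidden width, kernel sizes, and choice of BatchNorm here are illustrative assumptions, not the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Spatially-Adaptive Normalization: modulation parameters are
    predicted per pixel from the semantic segmentation map."""
    def __init__(self, num_features, label_channels, hidden=128):
        super().__init__()
        # parameter-free normalization of the incoming activations
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        # shared conv trunk applied to the one-hot segmentation map
        self.shared = nn.Sequential(
            nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, num_features, 3, padding=1)
        self.beta = nn.Conv2d(hidden, num_features, 3, padding=1)

    def forward(self, x, segmap):
        # segmap: (N, label_channels, H, W) one-hot label map
        normalized = self.norm(x)
        # resize the label map to the feature resolution
        segmap = F.interpolate(segmap, size=x.shape[2:], mode="nearest")
        actv = self.shared(segmap)
        return normalized * (1 + self.gamma(actv)) + self.beta(actv)
```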
In the high-resolution part, we add the Class-Adaptive Normalization (CLADE) scheme, which keeps the number of parameters from growing too much while improving the quality of the generated videos. Finally, differentiable augmentation (DiffAug) is employed to fine-tune the model, making it applicable to other datasets as well.
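A minimal sketch of the CLADE idea follows: instead of predicting the modulation per pixel with convolutions, one (gamma, beta) pair is learned per semantic class and looked up by class index, which keeps the parameter count close to that of a plain normalization layer. The class-index input format and the BatchNorm backbone are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLADE(nn.Module):
    """Class-Adaptive Normalization: per-class affine parameters are
    looked up by class index rather than computed by convolutions."""
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_features, affine=False)
        # one learnable (gamma, beta) pair per semantic class
        self.gamma = nn.Embedding(num_classes, num_features)
        self.beta = nn.Embedding(num_classes, num_features)

    def forward(self, x, label_map):
        # label_map: (N, H, W) integer class indices
        normalized = self.norm(x)
        label_map = F.interpolate(label_map.unsqueeze(1).float(),
                                  size=x.shape[2:], mode="nearest")
        idx = label_map.squeeze(1).long()               # (N, H, W)
        gamma = self.gamma(idx).permute(0, 3, 1, 2)     # (N, C, H, W)
        beta = self.beta(idx).permute(0, 3, 1, 2)
        return normalized * (1 + gamma) + beta
```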
With this approach, we effectively enhance the video quality by modifying the model architecture and adding the different normalization schemes, while also making the model less vulnerable to variations in the dataset. Simulations conducted in this thesis verify the continuity and quality of the videos generated by the proposed method on semantic segmentation videos from the Cityscapes, Waymo, and Carla datasets.
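The differentiable-augmentation step used for fine-tuning can be pictured with the sketch below, which applies the same stochastic, gradient-friendly transforms to both real and generated frames before they reach the discriminator. The particular transforms (brightness jitter and a circular shift) and their strengths are simplified placeholders, not the exact augmentation policy of the thesis.

```python
import torch

def diff_augment(x, brightness=0.5, shift_ratio=0.125):
    """Apply random, differentiable augmentations to a batch of frames."""
    # random per-sample brightness jitter
    x = x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * brightness
    # random circular shift as a simplified stand-in for translation
    _, _, h, w = x.shape
    sx, sy = int(h * shift_ratio), int(w * shift_ratio)
    tx = torch.randint(-sx, sx + 1, (1,)).item()
    ty = torch.randint(-sy, sy + 1, (1,)).item()
    return torch.roll(x, shifts=(tx, ty), dims=(2, 3))

# During GAN fine-tuning, the discriminator D sees augmented batches of
# both real and generated frames, e.g.:
#   d_real = D(diff_augment(real_frames))
#   d_fake = D(diff_augment(fake_frames))
```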

Table of Contents
Abstract (Chinese)
Abstract (English)
Acknowledgements
Table of Contents
List of Figures
List of Tables
List of Abbreviations
1 Introduction
  1.1 Preface
  1.2 Motivation
  1.3 Contributions
  1.4 Organization
2 Background Review
  2.1 Generative Adversarial Networks
  2.2 Image-to-Image Translation
  2.3 Normalization
  2.4 Data Augmentation
  2.5 Video Synthesis
  2.6 Summary
3 Proposed Method
  3.1 Method Overview
  3.2 Generator
    3.2.1 Foreground Model
    3.2.2 Low-Resolution Model
    3.2.3 Coarse-to-Fine Generator
    3.2.4 High-Resolution Image Generation
  3.3 Discriminator
    3.3.1 Image Discriminator
    3.3.2 Video Discriminator
  3.4 Loss Functions
    3.4.1 Image Loss Function
    3.4.2 Video Loss Function
    3.4.3 Feature Matching Loss
  3.5 Differentiable Augmentation
  3.6 Summary
4 Simulation Results and Discussion
  4.1 Datasets
    4.1.1 Data Preprocessing
  4.2 Experimental Setup
  4.3 Evaluation Metrics
  4.4 Method Comparison
    4.4.1 Foreground Model
    4.4.2 SPADE Normalization at 256×128 Resolution
    4.4.3 CLADE Normalization at 512×256 Resolution
    4.4.4 Fine-Tuning with Differentiable Augmentation
  4.5 Success Cases and Error Analysis
    4.5.1 Cityscapes Dataset
    4.5.2 Carla Dataset
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Appendix 1
References

[1] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro, “Video-to-video synthesis,” arXiv preprint arXiv:1808.06601, 2018.
[2] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “Semantic image synthesis with spatially-adaptive normalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2337–2346, 2019.
[3] Z. Tan, D. Chen, Q. Chu, M. Chai, J. Liao, M. He, L. Yuan, G. Hua, and N. Yu, “Efficient semantic image synthesis via class-adaptive normalization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[4] S. Zhao, Z. Liu, J. Lin, J.-Y. Zhu, and S. Han, “Differentiable augmentation for data-efficient gan training,” arXiv preprint arXiv:2006.10738, 2020.
[5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems, vol. 27, 2014.
[6] E. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep generative image models using a laplacian pyramid of adversarial networks,” arXiv preprint arXiv:1506.05751, 2015.
[7] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
[8] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” Advances in Neural Information Processing Systems, vol. 29, pp. 469–477, 2016.
[9] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.
[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134, 2017.
[11] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2223–2232, 2017.
[12] A. Brock, J. Donahue, and K. Simonyan, “Large scale gan training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
[13] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” in International Conference on Machine Learning, pp. 2642–2651, PMLR, 2017.
[14] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324, 2018.
[15] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International Conference on Machine Learning, pp. 1060–1069, PMLR, 2016.
[16] S. Hong, D. Yang, J. Choi, and H. Lee, “Inferring semantic layout for hierarchical text-to-image synthesis,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7986–7994, 2018.
[17] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine, “Stochastic adversarial video prediction,” arXiv preprint arXiv:1804.01523, 2018.
[18] Y. Zhao, C. Li, P. Yu, J. Gao, and C. Chen, “Feature quantization improves gan training,” arXiv preprint arXiv:2004.02088, 2020.
[19] J. Kim, M. Kim, H. Kang, and K. Lee, “U-gat-it: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation,” arXiv preprint arXiv:1907.10830, 2019.
[20] Z. Zhao, Y. Guo, H. Shen, and J. Ye, “Adaptive object detection with dual multi-label prediction,” in European Conference on Computer Vision, pp. 54–69, Springer, 2020.
[21] H.-K. Hsu, C.-H. Yao, Y.-H. Tsai, W.-C. Hung, H.-Y. Tseng, M. Singh, and M.-H. Yang, “Progressive domain adaptation for object detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 749–757, 2020.
[22] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797, 2018.
[23] X. Liu, G. Yin, J. Shao, X. Wang, and H. Li, “Learning to predict layout-to-image conditional convolutions for semantic image synthesis,” arXiv preprint arXiv:1910.06809, 2019.
[24] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807, 2018.
[25] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134, 2017.
[26] P. Zhu, R. Abdal, Y. Qin, and P. Wonka, “Sean: Image synthesis with semantic region-adaptive normalization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5104–5113, 2020.
[27] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help optimization?,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 2488–2498, 2018.
[28] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[29] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
[30] Y. Wu and K. He, “Group normalization,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19, 2018.
[31] V. Dumoulin, J. Shlens, and M. Kudlur, “A learned representation for artistic style,” arXiv preprint arXiv:1610.07629, 2016.
[32] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510, 2017.
[33] X. Gong, W. Chen, T. Chen, and Z. Wang, “Sandwich batch normalization,” arXiv preprint arXiv:2102.11382, 2021.
[34] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
[35] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
[36] M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial nets with singular value clipping,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 2830–2839, 2017.
[37] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” Advances in Neural Information Processing Systems, vol. 29, pp. 613–621, 2016.
[38] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “Decomposing motion and content for natural video sequence prediction,” arXiv preprint arXiv:1706.08033, 2017.
[39] J. Walker, C. Doersch, A. Gupta, and M. Hebert, “An uncertain future: Forecasting from static images using variational autoencoders,” in European Conference on Computer Vision, pp. 835–851, Springer, 2016.
[40] Y. Chen, Y. Pan, T. Yao, X. Tian, and T. Mei, “Mocycle-gan: Unpaired video-to-video translation,” in Proceedings of the 27th ACM International Conference on Multimedia, pp. 647–655, 2019.
[41] O. Gafni, L. Wolf, and Y. Taigman, “Vid2game: Controllable characters extracted from real-world videos,” arXiv preprint arXiv:1904.08379, 2019.
[42] T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro, “Few-shot video-to-video synthesis,” arXiv preprint arXiv:1910.12713, 2019.
[43] A. Mallya, T.-C. Wang, K. Sapra, and M.-Y. Liu, “World-consistent video-to-video synthesis,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VIII 16, pp. 359–378, Springer, 2020.
[44] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[45] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223, 2016.
[46] P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454, 2020.
[47] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
[48] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on Robot Learning, pp. 1–16, PMLR, 2017.
[49] A. Tao, K. Sapra, and B. Catanzaro, “Hierarchical multi-scale attention for semantic segmentation,” arXiv preprint arXiv:2005.10821, 2020.
[50] A. Kirillov, Y. Wu, K. He, and R. Girshick, “Pointrend: Image segmentation as rendering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9799–9808, 2020.
[51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[52] M. Seitzer, “pytorch-fid: FID score for PyTorch.” https://github.com/mseitzer/pytorch-fid, August 2020. Version 0.1.1.

Full-text release date: 2024/09/08 (campus network)
Full-text release: not authorized for public access (off-campus network)
Full-text release: not authorized for public access (National Central Library: Taiwan NDLTD system)