
Student: You-Ren Wu (吳祐任)
Thesis Title: A Semi-Supervised Deep Co-Training Approach to Fine-Grained Semantic Segmentation: Taking Anime Character Parsing as an Example
Advisor: Cin-Syong Fan (范欽雄)
Committee Members: Ren-Wei Xie (謝仁偉), Jian-De Li (李建德), Rong-Hua Wang (王榮華)
Degree: Master
Department: College of Electrical Engineering and Computer Science - Department of Computer Science and Information Engineering
Publication Year: 2022
Graduation Academic Year: 110 (ROC calendar)
Language: English
Pages: 71
Keywords: semantic segmentation, human parsing, semi-supervised learning, anime illustration, deep learning
Access counts: 199 views, 0 downloads


Abstract:
    Semantic segmentation is a typical pixel-level classification problem, currently applied in medical image analysis, robot perception, augmented reality, and other fields. In this thesis, semantic segmentation is applied to the domain of anime illustration to classify the human body and clothing parts of anime characters at the pixel level.
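    To make the pixel-level formulation concrete, the following is a minimal sketch (not code from the thesis) of how a segmentation network's output is turned into a per-pixel class map; the tensor shapes are illustrative, assuming the 13-category setting described later.

```python
import torch

# Illustrative shapes only: a batch of 2 images and 13 classes,
# matching the 13-category setting of this thesis.
num_classes = 13
logits = torch.randn(2, num_classes, 256, 256)  # model output: (N, classes, H, W)

# Pixel-level classification: argmax over the class dimension gives
# one integer label per pixel.
pred = logits.argmax(dim=1)  # (N, H, W), each entry in [0, num_classes)
print(pred.shape)  # torch.Size([2, 256, 256])
```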
    In anime image colorization research, several papers have tried to use semantic information as a training reference. For example, Tag2Pix adds the features of a classifier to training, and Illu2Vec adds the feature maps of an illustration classification neural network to training, and both obtain good colorization results. These examples show that the colorization field is closely related to semantic segmentation, and we hope that the research in this thesis can contribute to anime-related deep learning.
    Since no public dataset for segmentation in the anime domain has been released in recent years, we annotated our own datasets of 16 and 13 categories, of which the training set contains 1,017 images and the validation set contains 131 images. For validation, we classify the images into three levels of segmentation difficulty: Easy, Medium, and Hard, containing 48, 45, and 25 images, respectively. Easy denotes images similar to those in the training set; Medium denotes images whose semantic segmentation is harder than that of Easy; and Hard denotes illustrations with complex backgrounds and strong personal styles, used to probe the limits and weaknesses of our model.
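    As a hedged illustration of how such per-difficulty validation might be tallied (the thesis does not publish its evaluation code, and the subset variables below are hypothetical), mean IoU can be computed from a confusion matrix accumulated over each difficulty subset:

```python
import numpy as np

def mean_iou(preds, gts, num_classes=13):
    """Mean IoU over classes, from integer label maps of shape (H, W)."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(preds, gts):
        # Row = ground-truth class, column = predicted class.
        idx = g.reshape(-1) * num_classes + p.reshape(-1)
        conf += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0  # skip classes absent from both prediction and ground truth
    return (inter[valid] / union[valid]).mean()

# Hypothetical per-difficulty evaluation over the validation subsets
# (48 Easy, 45 Medium, 25 Hard images); each subset holds
# (prediction, ground-truth) label-map pairs:
# for name, pairs in {"easy": easy, "medium": medium, "hard": hard}.items():
#     miou = mean_iou([p for p, _ in pairs], [g for _, g in pairs])
#     print(f"{name}: {miou:.2%}")
```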
    In this thesis, we use ResNet50 as the backbone network, and we propose C2E2P (Cross Pseudo Supervision with Context Embedding with Edge Perceiving), an architecture that combines CPS (Cross Pseudo Supervision) and CE2P (Context Embedding with Edge Perceiving), for anime character parsing. Under the same experimental setting (13 categories, batch size 8), our model achieves better semantic segmentation results than both CE2P and CPS: relative to CE2P's 62.91% MIoU, our model improves by 3.91%, and relative to CPS's 63.40% MIoU, it improves by 3.42%, reaching 66.82%. Increasing the batch size to 16 raises the MIoU by a further 1.66%, to 68.07%.
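    The cross pseudo supervision mechanism that C2E2P inherits from CPS can be sketched as follows. This is a minimal illustration of the published CPS loss [4], not the thesis implementation; `net_a` and `net_b` stand in for the two identically structured, differently initialized segmentation networks.

```python
import torch
import torch.nn.functional as F

def cps_loss(net_a, net_b, unlabeled):
    """Cross pseudo supervision on an unlabeled batch: each network is
    trained against the hard pseudo labels produced by its peer."""
    logits_a = net_a(unlabeled)  # (N, num_classes, H, W)
    logits_b = net_b(unlabeled)

    # Hard pseudo labels: per-pixel argmax, detached so gradients do not
    # flow back through the network that produced the labels.
    pseudo_a = logits_a.argmax(dim=1).detach()  # (N, H, W)
    pseudo_b = logits_b.argmax(dim=1).detach()

    # Each network learns from the other's pseudo labels.
    return F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)

# Full objective (sketch, following [4]): supervised cross-entropy on the
# labeled batch for both networks, plus a weighted CPS term:
#   loss = sup_loss_a + sup_loss_b + trade_off * cps_loss(net_a, net_b, x_u)
```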

    Table of Contents:
    Chinese Abstract i
    Abstract ii
    Acknowledgements iv
    List of Figures vii
    List of Tables ix
    Chapter 1 Introduction 1
      1.1 Overview 1
      1.2 Motivation 2
      1.3 System Description 4
      1.4 Thesis Organization 6
    Chapter 2 Related Work 7
      2.1 Semantic Segmentation 7
      2.2 Human Parsing 10
      2.3 Weakly and Semi-Supervised Learning 12
        2.3.1 Class activation map 12
        2.3.2 Consistent regularization and pseudo labeling 13
    Chapter 3 Dataset and Class Definition 15
      3.1 Class Definition 15
      3.2 Dataset Detail 21
    Chapter 4 Our Proposed Semantic Segmentation Method 24
      4.1 Training Step 24
      4.2 Unsupervised Training Structure 27
      4.3 Supervised Training Structure 30
        4.3.1 Parsing module 32
        4.3.2 Edge module 36
        4.3.3 Fusion module 39
    Chapter 5 Experimental Results and Discussion 41
      5.1 Experimental Setting 41
      5.2 Experimental Results and Ablation Study 43
        5.2.1 Experiment I 44
        5.2.2 Experiment II 46
      5.3 Semantic Segmentation Results 49
    Chapter 6 Conclusions and Future Work 59
      6.1 Conclusions 59
      6.2 Future Work 60
    References 61

    [1] H. Kim, H. Y. Jhoo, E. Park, and S. Yoo, “Tag2Pix: Line art colorization using text tag with SECat and changing loss,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 2019, pp. 9056-9065.
    [2] Z. Dou, N. Wang, B. Li, Z. Wang, H. Li, and B. Liu, “Dual color space guided sketch colorization,” IEEE Transactions on Image Processing, vol. 30, no. 8, pp. 7292-7304, 2021.
    [3] C. W. Seo and Y. Seo, “Seg2pix: Few shot training line art colorization with segmented image data,” Applied Sciences, vol. 11, no. 4, pp. 1464-1479, 2021.
    [4] X. Chen, Y. Yuan, G. Zeng, and J. Wang, “Semi-supervised semantic segmentation with cross pseudo supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 2021, pp. 2613-2622.
    [5] T. Liu, T. Ruan, Z. Huang, Y. Wei, S. Wei, Y. Zhao, and T. Huang, “Devil in the details: Towards accurate single and multiple human parsing,” arXiv preprint arXiv:1809.05996, 2019.
    [6] L. Zhang, Y. Ji, X. Lin, and C. Liu, “Style transfer for anime sketches with enhanced residual U-Net and auxiliary classifier GAN,” in Proceedings of the 4th IAPR Asian Conference on Pattern Recognition, Nanjing, China, 2017.
    [7] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, 2017, pp. 6230-6239.
    [8] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, Massachusetts, 2015, pp. 640-651.
    [9] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 2015, pp. 234-241.
    [10] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler, “Gated-SCNN: Gated shape CNNs for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 2019, pp. 5229-5238.
    [11] A. Kirillov, Y. Wu, K. He, and R. Girshick, “PointRend: Image segmentation as rendering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 2020, pp. 9799-9808.
    [12] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision, Munich, Germany, 2018.
    [13] L. Zhang, Y. Ji, and C. Liu, “DanbooRegion: An illustration region dataset,” in Proceedings of the European Conference on Computer Vision, Virtual, 2020.
    [14] R. Cao, H. Mo, and C. Gao, “Line art colorization based on explicit region segmentation,” Computer Graphics Forum, vol. 40, no. 7, 2021.
    [15] KichangKim, “DeepDanbooru,” 2020. [Online]. Available: https://github.com/KichangKim/DeepDanbooru.
    [16] M. Saito and Y. Matsui, “Illustration2vec: A semantic vector representation of illustrations,” in Proceedings of the SIGGRAPH Asia 2015 Technical Briefs, Kobe, Japan, 2015, pp. 1-4.
    [17] H.-S. Fang, G. Lu, X. Fang, J. Xie, Y.-W. Tai, and C. Lu, “Weakly and semi supervised human body part parsing via pose-guided knowledge transfer,” arXiv preprint arXiv:1805.04310, 2018.
    [18] J. Li, J. Zhao, Y. Wei, C. Lang, Y. Li, T. Sim, S. Yan, and J. Feng, “Multiple-human parsing in the wild,” arXiv preprint arXiv:1705.07206, 2017.
    [19] X. Liang, K. Gong, X. Shen, and L. Lin, “Look into person: Joint body parsing & pose estimation network and a new benchmark,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 4, pp. 871-885, 2018.
    [20] W. Wang, H. Zhu, J. Dai, Y. Pang, J. Shen, and L. Shao, “Hierarchical human parsing with typed part-relation reasoning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 2020, pp. 8929-8939.
    [21] P. Li, Y. Xu, Y. Wei, and Y. Yang, “Self-correction for human parsing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 3260-3271, 2020.
    [22] Z.-H. Zhou, “A brief introduction to weakly supervised learning,” National Science Review, vol. 5, no. 1, pp. 44-53, 2018.
    [23] S. Jo and I.-J. Yu, “Puzzle-CAM: Improved localization via matching partial and full features,” in Proceedings of the 2021 IEEE International Conference on Image Processing, Anchorage, Alaska, 2021, pp. 639-643.
    [24] T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv:1708.04552, 2017.
    [25] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 2019, pp. 6023-6032.
    [26] Z. Ke, D. Qiu, K. Li, Q. Yan, and R. W. H. Lau, “Guided collaborative training for pixel-wise semi-supervised learning,” in Proceedings of the European Conference on Computer Vision, Virtual, 2020, pp. 429-445.
    [27] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C.-L. Li, “FixMatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in Neural Information Processing Systems, vol. 33, pp. 596-608, 2020.
    [28] Z. Ke, D. Wang, Q. Yan, J. Ren, and R. W. H. Lau, “Dual student: Breaking the limits of the teacher in semi-supervised learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 2019, pp. 6728-6736.
    [29] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in Neural Information Processing Systems, vol. 30, 2017.
    [30] A. Bréhéret, “Pixel annotation tool,” 2017. [Online]. Available: https://github.com/abreheret/PixelAnnotationTool.
    [31] D. Filipiak, P. Tempczyk, and M. Cygan, “n-CPS: Generalising cross pseudo supervision to n networks for semi-supervised semantic segmentation,” arXiv preprint arXiv:2112.07528, 2021.
    [32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Nevada, 2016, pp. 770-778.
    [33] M. Berman, A. R. Triki, and M. B. Blaschko, “The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, 2018, pp. 4413-4421.

    Full-text release date: 2027/07/26 (campus network)
    Full-text release date: 2032/07/26 (off-campus network)
    Full-text release date: 2032/07/26 (National Central Library: Taiwan Thesis and Dissertation System)