| Author: | 林詠翔 Yong-Xiang Lin |
|---|---|
| Thesis Title: | 基於遮罩門控鑑別器之自適應城市場景語意分割模型 Adapting Semantic Segmentation of Urban Scenes via Mask-aware Gated Discriminator |
| Advisor: | 花凱龍 Kai-Lung Hua |
| Committee: | 花凱龍 Kai-Lung Hua, 楊傳凱 Chuan-Kai Yang, 陳駿丞 Jun-Cheng Chen, 鐘國亮 Kuo-Liang Chung, 郭景明 Jing-Ming Guo |
| Degree: | 碩士 Master |
| Department: | 電資學院 - 資訊工程系 Department of Computer Science and Information Engineering |
| Thesis Publication Year: | 2020 |
| Graduation Academic Year: | 108 |
| Language: | 中文 Chinese |
| Pages: | 46 |
| Keywords (in Chinese): | 電腦視覺、域適應、語意分割、深度學習 |
| Keywords (in other languages): | Computer vision, Domain adaptation, Semantic segmentation, Deep learning |
訓練深度神經網絡進行語意分割依賴於像素級標籤進行監督,但是為大型數據集收集像素級標籤非常昂貴且耗時。一種解決方法是利用合成數據集,我們可以生成帶有相應標籤的數據。不幸的是,在合成數據上訓練的網絡在真實圖像上表現不佳,這被解釋為域移位問題。針對此問題,科學家提出了域適應技術,域適應技術已顯示出將從合成數據學習的知識轉移到現實世界數據的潛力。之前的工作主要利用對抗性訓練對特徵進行全局對齊。但是,我們觀察到背景對象在不同域中的變化較小;反之,前景類別在不同域中的變化較大。利用上述觀察,我們提出了一種域自適應方法,可以分別對前景對象和背景對象進行建模和調整。我們的方法從畫風轉移開始,以緩解域移位的問題;接下來是前景自適應模塊,它基於預測出的前景遮罩,搭配我們提出的門控鑑別器進行學習,以便分別適應前景和背景類別。我們在實驗中證明,我們的模型在平均交並比(mIoU)方面優於幾個最先進的基線。
Training a deep neural network for semantic segmentation relies on pixel-level ground truth labels for supervision. However, collecting large datasets with pixel-level annotations is very expensive and time-consuming. One workaround is to utilize synthetic data, where we can generate potentially unlimited data with corresponding ground truth labels. Unfortunately, networks trained on synthetic data perform poorly on real images due to the domain shift problem. Domain adaptation techniques have shown potential in transferring knowledge learned from synthetic data to real-world data. Prior works have mostly leveraged adversarial training to perform a global alignment of features. However, we observe that background objects vary less across domains than foreground objects do. Using this insight, we propose a domain adaptation method that models and adapts foreground objects and background objects separately. Our approach starts with a fast style transfer to match the appearance of the inputs. This is followed by a foreground adaptation module that learns a foreground mask, which our gated discriminator uses to adapt foreground and background objects separately. We demonstrate in our experiments that our model outperforms several state-of-the-art baselines in terms of mean intersection over union (mIoU).
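To make the mask-aware gated discriminator idea concrete, here is a minimal PyTorch-style sketch of how a predicted soft foreground mask could gate a patch discriminator into separate foreground and background streams during adversarial adaptation. It is an illustration only: the class and function names, channel widths, and loss formulation below are assumptions made for the example, not the architecture published in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedDiscriminator(nn.Module):
    """Illustrative patch discriminator whose features are gated by a foreground mask.

    The mask splits the segmenter's prediction into a foreground stream and a
    background stream so the two groups of classes can be aligned separately
    (hypothetical layout, not the thesis's exact architecture).
    """

    def __init__(self, num_classes, ndf=64):
        super().__init__()
        self.stem = nn.Conv2d(num_classes, ndf, 4, stride=2, padding=1)
        self.fg_branch = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf, ndf * 2, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 2, 1, 4, stride=2, padding=1),
        )
        self.bg_branch = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf, ndf * 2, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 2, 1, 4, stride=2, padding=1),
        )

    def forward(self, seg_softmax, fg_mask):
        # seg_softmax: (N, C, H, W) class probabilities from the segmenter
        # fg_mask:     (N, 1, H, W) soft foreground mask in [0, 1]
        feat = self.stem(seg_softmax)
        gate = F.interpolate(fg_mask, size=feat.shape[2:], mode="bilinear",
                             align_corners=False)
        fg_logits = self.fg_branch(feat * gate)          # foreground-gated stream
        bg_logits = self.bg_branch(feat * (1.0 - gate))  # background-gated stream
        return fg_logits, bg_logits


def adversarial_alignment_loss(disc, tgt_softmax, tgt_fg_mask):
    """Encourage target-domain predictions to look source-like in both streams."""
    fg_logits, bg_logits = disc(tgt_softmax, tgt_fg_mask)
    source_label = torch.ones_like(fg_logits)  # "looks like source" target for the generator
    return (F.binary_cross_entropy_with_logits(fg_logits, source_label) +
            F.binary_cross_entropy_with_logits(bg_logits, source_label))
```

In such an adversarial setup, the segmenter would minimize `adversarial_alignment_loss` on target images while the discriminator is trained separately to distinguish source from target predictions in both streams.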
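The abstract reports results as mean intersection over union (mIoU). For reference, below is a small NumPy sketch of that standard metric; the helper name and the `ignore_index=255` void-label convention (commonly used for Cityscapes-style labels) are assumptions rather than details taken from the thesis.

```python
import numpy as np


def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Per-class intersection over union averaged across classes (mIoU)."""
    valid = gt != ignore_index          # drop void pixels from the evaluation
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue                     # class absent from both prediction and ground truth
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```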