
Student: Andreas Simon
Thesis title: Aligned-VAEGAN: A Cross-Modal Embedding Approach by Utilizing VAEGANs on Generalized Zero-Shot Learning
Advisor: Jing-Ming Guo (郭景明)
Committee members: Jing-Ming Guo (郭景明), Jui-Sheng Chou (周瑞生), Jian-Jiun Ding (丁建均), Gee-Sern Hsu (徐繼聖)
Degree: Master
Department: Department of Electrical Engineering, College of Electrical Engineering and Computer Science
Year of publication: 2020
Academic year of graduation: 108 (2019-2020)
Language: English
Pages: 93
Keywords: Generalized Zero-shot Learning, Conditional GAN, Variational Autoencoder, Cross-modal Embedding, VAEGAN

Abstract:
Zero-shot learning aims to learn a classifier that can predict the labels of images from novel classes excluded from the training phase by exploiting the class embeddings of the instances. This matters in real-world settings, where new classes continually emerge and too few instances of particular classes are available to meet the training requirement. Many studies have shown promising results, among them models that rely on feature generation and cross-modal embedding. This research extends common cross-modal embedding models by combining cross-modal VAEs with feature-generating GANs. The model learns features in a shared latent space by cross-aligning the reconstructed features and distribution-aligning the latent representations of the VAE networks. In addition, it trains conditional discriminator networks to distinguish real from synthetic features for each class. The features in the shared latent space are then used to train a softmax classifier. The model also employs the Dissimilar Network Update Iteration (DNUI), which updates the VAE and discriminator networks a different number of times in each iteration. The experimental results show that the proposed model surpasses state-of-the-art methods on the AWA2 dataset, suggesting that the proposed VAEGAN design can be adopted to tackle the zero-shot learning problem.
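The cross-alignment and distribution-alignment objectives described above can be sketched concretely. The PyTorch-style code below is a minimal illustration, not the thesis implementation: the module names (ModalityVAE, img_vae, attr_vae), the L1 cross-reconstruction term, and the 2-Wasserstein distribution-alignment term are assumptions chosen for clarity.

import torch
import torch.nn as nn

class ModalityVAE(nn.Module):
    # One VAE branch: encodes one modality into a diagonal-Gaussian latent and decodes it back.
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, in_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

def cross_alignment_loss(img_vae, attr_vae, x_img, x_attr):
    # Cross-reconstruction: decode each modality's latent with the *other* modality's decoder.
    mu_i, lv_i = img_vae.encode(x_img)
    mu_a, lv_a = attr_vae.encode(x_attr)
    z_i = img_vae.reparameterize(mu_i, lv_i)
    z_a = attr_vae.reparameterize(mu_a, lv_a)
    loss = (attr_vae.dec(z_i) - x_attr).abs().mean() + (img_vae.dec(z_a) - x_img).abs().mean()
    return loss, (mu_i, lv_i), (mu_a, lv_a)

def distribution_alignment_loss(mu_i, lv_i, mu_a, lv_a):
    # 2-Wasserstein distance between the two diagonal-Gaussian latent distributions.
    std_i, std_a = torch.exp(0.5 * lv_i), torch.exp(0.5 * lv_a)
    return ((mu_i - mu_a).pow(2).sum(dim=1) + (std_i - std_a).pow(2).sum(dim=1)).sqrt().mean()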


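The conditional discriminator described in the abstract can be pictured as a network that scores a visual feature together with the class embedding it is claimed to belong to. The concatenation-based conditioning, layer sizes, and binary cross-entropy objective below are illustrative assumptions; the thesis may use a different adversarial formulation (for example a Wasserstein-style critic).

import torch
import torch.nn as nn

class ConditionalDiscriminator(nn.Module):
    # Scores a (visual feature, class embedding) pair: high for real features, low for synthetic ones.
    def __init__(self, feat_dim, attr_dim, hidden_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + attr_dim, hidden_dim),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feat, attr):
        # Conditioning by concatenation: the score depends on both the feature and its claimed class.
        return self.net(torch.cat([feat, attr], dim=1))

def discriminator_loss(disc, real_feat, fake_feat, attr):
    # Standard non-saturating GAN discriminator loss, shown here only for illustration.
    bce = nn.functional.binary_cross_entropy_with_logits
    real_score = disc(real_feat, attr)
    fake_score = disc(fake_feat.detach(), attr)
    return bce(real_score, torch.ones_like(real_score)) + bce(fake_score, torch.zeros_like(fake_score))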
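Finally, the Dissimilar Network Update Iteration (DNUI) amounts to stepping the discriminator and the VAE networks a different number of times per training iteration. The loop below is a sketch under assumed helpers (disc_step and vae_step returning scalar losses) and example counts; the actual update ratio is a hyperparameter of the thesis and is not reproduced here. The shared-latent features obtained after training are then used to fit the softmax classifier mentioned in the abstract.

def train_epoch_dnui(loader, opt_vae, opt_disc, vae_step, disc_step, n_vae=1, n_disc=5):
    # One epoch in which the two sides of the model receive dissimilar numbers of updates.
    for x_img, x_attr in loader:
        for _ in range(n_disc):            # discriminator: n_disc gradient steps per batch
            opt_disc.zero_grad()
            disc_step(x_img, x_attr).backward()
            opt_disc.step()
        for _ in range(n_vae):             # VAE branches: n_vae gradient steps per batch
            opt_vae.zero_grad()
            vae_step(x_img, x_attr).backward()
            opt_vae.step()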

Table of Contents:
Abstract
Acknowledgment
Table of Contents
List of Figures
List of Tables
List of Acronyms
1 Introduction
2 Background
  2.1 Image Classification
    2.1.1 AlexNet
    2.1.2 VGGNet
    2.1.3 Inception-GoogLeNet
    2.1.4 ResNet
  2.2 Transfer Learning
  2.3 Generalized Zero-Shot Learning
  2.4 Autoencoder
  2.5 Variational Autoencoder (VAE)
  2.6 Generative Adversarial Network (GAN)
3 Related Work
  3.1 Non-Generative Approaches
  3.2 Cross-Modal Latent Distribution-Alignment Approaches
  3.3 Cross-Modal Reconstruction-Alignment Approaches
  3.4 VAEGAN in Feature Generation Models
4 Method
  4.1 Variational Autoencoder (VAE) Component
  4.2 Conditional GAN Component
  4.3 Proposed Model
  4.4 Implementation Details
5 Experimental Results
  5.1 Datasets and Experiment Setting
  5.2 Analysis
    5.2.1 Ablation Studies
    5.2.2 Effect of Dissimilar Network Update Iteration
    5.2.3 Analysis on Utilizing Latent Distribution as Classifier Input
    5.2.4 Analysis on the Effect of Conditional Discriminator
  5.3 Comparisons with Benchmark Datasets
6 Conclusions and Future Works
  6.1 Conclusions
  6.2 Future Works
References
Appendix: Example images and object classes from the datasets
  A Animals with Attributes 2 (AWA2)
    A.1 Image Samples
    A.2 Object Classes
  B Caltech-UCSD Birds 200 (CUB)
    B.1 Image Samples
    B.2 Object Classes
  C SUN Attributes (SUN)
    C.1 Image Samples
    C.2 Object Classes


Full text available from 2025/08/20 (campus network)
Full text available from 2025/08/20 (off-campus network)
Full text available from 2025/08/20 (National Central Library: Taiwan NDLTD system)