Learning Robust Visual-Semantic Embeddings
Yao-Hung Hubert Tsai, Liang-Kang Huang, Ruslan Salakhutdinov

TL;DR
This paper introduces an end-to-end unsupervised learning framework for robust visual-semantic embeddings, leveraging auto-encoders and domain adaptation to improve image and text representation across labeled and unlabeled data.
Contribution
It presents a novel combination of auto-encoders and Maximum Mean Discrepancy loss for joint embedding learning, including an unsupervised-data adaptation inference technique.
Findings
Outperforms state-of-the-art on multiple datasets
Effective in zero and few-shot recognition tasks
Improves robustness of multi-modal representations
Abstract
Many of the existing methods for learning joint embedding of images and text use only supervised information from paired images and its textual attributes. Taking advantage of the recent success of unsupervised learning in deep neural networks, we propose an end-to-end learning framework that is able to extract more robust multi-modal representations across domains. The proposed method combines representation learning models (i.e., auto-encoders) together with cross-domain learning criteria (i.e., Maximum Mean Discrepancy loss) to learn joint embeddings for semantic and visual features. A novel technique of unsupervised-data adaptation inference is introduced to construct more comprehensive embeddings for both labeled and unlabeled data. We evaluate our method on Animals with Attributes and Caltech-UCSD Birds 200-2011 dataset with a wide range of applications, including zero and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDomain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
