Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks
Tanmay Gupta, Kevin Shih, Saurabh Singh, and Derek Hoiem

TL;DR
This paper demonstrates that aligning image and word representations in vision-language models enhances transfer learning across tasks like visual recognition and VQA, especially benefiting categories with limited labels.
Contribution
It introduces a vision-language embedding aligned across tasks, improving cross-task transfer and recognition performance over standard multi-task learning methods.
Findings
Better transfer from recognition to VQA tasks.
Improved recognition for categories with few labels.
Enhanced generalization in vision systems.
Abstract
An important goal of computer vision is to build systems that learn visual representations over time that can be applied to many tasks. In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multi-task learning. In particular, the task of visual recognition is aligned to the task of visual question answering by forcing each to use the same word-region embeddings. We show this leads to greater inductive transfer from recognition to VQA than standard multitask learning. Visual recognition also improves, especially for categories that have relatively few recognition training labels but appear often in the VQA setting. Thus, our paper takes a small step towards creating more general vision systems by showing the benefit of interpretable, flexible, and trainable core representations.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
