Unifying Vision-and-Language Tasks via Text Generation
Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal

TL;DR
This paper introduces a unified generative framework for various vision-and-language tasks, simplifying architecture design and improving generalization, while achieving competitive results across multiple benchmarks.
Contribution
The authors propose a single architecture using text generation for multiple vision-and-language tasks, replacing task-specific models and enabling effective multi-task learning.
Findings
Achieves comparable performance to state-of-the-art models on 7 benchmarks.
Demonstrates better generalization on rare-answer questions.
Enables multi-task learning with a single model and one shared set of parameters.
Abstract
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. For example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning, etc. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture)…
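As a rough illustration of the idea in the abstract, the sketch below casts three of the named tasks (visual question answering, referring expression comprehension, image captioning) as pairs of input text and target text, so that one text-generation objective could cover all of them. It is a minimal sketch under stated assumptions, not the authors' implementation: the prefix strings, the `Example` class, the `to_text_pair` function, and the `<region_k>` target token are all hypothetical names chosen for this example, and the visual inputs that the model would also condition on are left out.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical task prefixes; the exact strings are an assumption of this sketch.
TASK_PREFIXES = {
    "vqa": "vqa:",
    "grounding": "visual grounding:",
    "caption": "caption:",
}


@dataclass
class Example:
    task: str                # one of the keys in TASK_PREFIXES
    text: str                # question, referring expression, or "" for captioning
    label: Union[str, int]   # answer text, region index, or caption text


def to_text_pair(ex: Example) -> Tuple[str, str]:
    """Turn a task example into (source text, target text).

    Every task ends up with a plain-text target, so a single
    sequence-to-sequence model could be trained on all of them.
    """
    source = f"{TASK_PREFIXES[ex.task]} {ex.text}".strip()
    if ex.task == "grounding":
        # Hypothetical convention: name the selected image region with a text token.
        target = f"<region_{ex.label}>"
    else:
        target = str(ex.label)
    return source, target


examples = [
    Example("vqa", "what is the man holding?", "a frisbee"),
    Example("grounding", "the dog on the left", 2),
    Example("caption", "", "a man throwing a frisbee to a dog"),
]
pairs = [to_text_pair(ex) for ex in examples]
```

Because every label becomes a string, examples from different tasks can share one batch and one loss, which is what would make multi-task training with a single set of parameters straightforward in this setup.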
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: VL-T5
