12-in-1: Multi-Task Vision and Language Representation Learning
Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

TL;DR
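This paper introduces a single multi-task vision-language model trained jointly on 12 datasets. Compared with independently trained single-task models, it cuts the total parameter count from approximately 3 billion to 270 million while improving performance across tasks such as VQA, image retrieval, and grounding.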
Contribution
Develops a large-scale multi-task training regime that yields a single model trained on 12 vision-and-language datasets spanning four task categories, replacing separately trained single-task models while improving average performance.
Findings
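Total parameters drop from approximately 3 billion (independently trained single-task models) to 270 million (one multi-task model).
Performance improves by 2.05 points on average across tasks, relative to the single-task models.
Fine-tuning from the multi-task model matches or surpasses the state of the art.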
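The parameter figures above can be put in relative terms with a little arithmetic. The following is a back-of-envelope sketch using only the two totals quoted in the abstract; the per-model average additionally assumes one single-task model per dataset, which the abstract does not state explicitly.

```python
# Totals as quoted in the abstract (the 3 billion figure is approximate).
single_task_total = 3_000_000_000   # all independently trained single-task models
multi_task_total = 270_000_000      # the one multi-task model
num_datasets = 12

# How many times smaller the multi-task model is than the single-task collection.
compression_ratio = single_task_total / multi_task_total

# Share of parameters eliminated, in percent.
percent_reduction = 100 * (1 - multi_task_total / single_task_total)

# ASSUMPTION: one single-task model per dataset, all of equal size.
avg_single_task_model = single_task_total / num_datasets

print(f"compression ratio: about {compression_ratio:.1f}x")
print(f"parameter reduction: about {percent_reduction:.0f}%")
print(f"implied average single-task model: {avg_single_task_model / 1e6:.0f}M parameters")
```

Under that assumption, the multi-task model is only slightly larger than one average single-task model, which is consistent with most parameters being shared across tasks.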
Abstract
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth…
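To make the idea of training one model across several task categories concrete, here is a minimal scheduling sketch. It is not the authors' training procedure: the round-robin order, the `shared_trunk` and head bookkeeping, and the function names are all illustrative assumptions. Only the four category names come from the abstract.

```python
from itertools import cycle, islice

# The four task categories named in the abstract.
TASK_CATEGORIES = [
    "visual question answering",
    "caption-based image retrieval",
    "grounding referring expressions",
    "multi-modal verification",
]

def round_robin_schedule(tasks, num_steps):
    """Return the task trained at each step, cycling through tasks in order.

    ASSUMPTION: plain round-robin, chosen for simplicity; the paper's actual
    schedule is not described in the text above.
    """
    return list(islice(cycle(tasks), num_steps))

def count_updates(schedule):
    """Count gradient updates under a hypothetical shared-trunk design.

    The shared trunk is updated on every step; each task-specific head is
    updated only on steps that draw a batch from its own task.
    """
    head_updates = {}
    for task in schedule:
        head_updates[task] = head_updates.get(task, 0) + 1
    return {"shared_trunk": len(schedule), "heads": head_updates}

schedule = round_robin_schedule(TASK_CATEGORIES, 8)
updates = count_updates(schedule)
print(updates)
```

The point of the sketch is the asymmetry: shared parameters see every batch from every task, while task-specific parameters see only their own, which is one way overlapping skills could be learned once rather than once per task.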
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
