12-in-1: Multi-Task Vision and Language Representation Learning

Jiasen Lu; Vedanuj Goswami; Marcus Rohrbach; Devi Parikh; Stefan Lee

arXiv:1912.02315 · cs.CV · April 28, 2020 · 36 cites

TL;DR

This paper introduces a single multi-task vision-and-language model trained on 12 datasets. It uses far fewer parameters than separate single-task models while improving performance on tasks such as VQA, caption-based image retrieval, and grounding referring expressions.

Contribution

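Develops a large-scale multi-task training regime that yields a single model covering 12 datasets from four broad categories of vision-and-language task, reducing total parameters and improving performance relative to independently trained single-task models.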

Findings

01 Compared with independently trained single-task models, total parameters fall from approximately 3 billion to 270 million.

02 Performance improves by 2.05 points on average across tasks.

03 Fine-tuning task-specific models from the multi-task model matches or surpasses the state of the art.

Abstract

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth…
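To put the parameter figures from the abstract in proportion, the short sketch below computes the relative reduction and the compression factor. Both inputs are the approximate values quoted above ("approximately 3 billion" and "270 million"), so the results are rough estimates; the function name `reduction_summary` is illustrative and does not come from the paper.

```python
# Approximate figures quoted in the abstract (not exact counts).
single_task_params = 3_000_000_000  # independently trained single-task models, combined
multi_task_params = 270_000_000     # the single multi-task model


def reduction_summary(before: int, after: int) -> tuple[float, float]:
    """Return (fraction of parameters removed, compression factor)."""
    return 1 - after / before, before / after


fraction_removed, factor = reduction_summary(single_task_params, multi_task_params)
print(f"{fraction_removed:.0%} fewer parameters, about {factor:.1f}x smaller")
# prints "91% fewer parameters, about 11.1x smaller"
```

On these rounded inputs the multi-task model is roughly an eleventh of the combined size of the single-task models.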

Code & Models

Videos

12-in-1: Multi-Task Vision and Language Representation Learning · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling