Unifying Vision-and-Language Tasks via Text Generation

Jaemin Cho; Jie Lei; Hao Tan; Mohit Bansal

arXiv:2102.02779 · cs.CL · May 25, 2021 · 64 cites

TL;DR

This paper introduces a unified generative framework for various vision-and-language tasks, simplifying architecture design and improving generalization, while achieving competitive results across multiple benchmarks.

Contribution

The authors propose a single architecture using text generation for multiple vision-and-language tasks, replacing task-specific models and enabling effective multi-task learning.

Findings

01 Achieves performance comparable to state-of-the-art task-specific models on 7 benchmarks.

02 Generalizes better on questions with rare answers.

03 Enables multi-task learning with a single architecture and one shared set of parameters.

Abstract

Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. For example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning, etc. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture)…
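To make the idea of "generating labels in text" concrete, here is a minimal Python sketch of how differently shaped tasks could be cast into one (input text, target text) format. It illustrates the general text-to-text framing only and is not the authors' implementation: the prefixes `vqa:`, `grounding:`, and `caption:`, the `<region_N>` token, and all function and class names are hypothetical, and the visual inputs are left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextToTextExample:
    """One training example; every task shares this same shape."""
    source_text: str  # text input, given to the model alongside the visual input
    target_text: str  # the label, expressed as plain text to be generated


def as_vqa(question: str, answer: str) -> TextToTextExample:
    # Hypothetical prefix. The answer is generated as text instead of
    # being selected by a multi-label answer classifier.
    return TextToTextExample(f"vqa: {question}", answer)


def as_grounding(phrase: str, region_index: int) -> TextToTextExample:
    # Hypothetical encoding. The referred region is named by a text token
    # instead of being picked by a dedicated region scorer.
    return TextToTextExample(f"grounding: {phrase}", f"<region_{region_index}>")


def as_caption(caption: str) -> TextToTextExample:
    # Captioning is already a generation task; only the prefix is added.
    return TextToTextExample("caption:", caption)


def build_multitask_batch() -> List[TextToTextExample]:
    # Examples from different tasks can sit in one batch because they all
    # reduce to "generate target_text given source_text".
    return [
        as_vqa("what is the man holding?", "a frisbee"),
        as_grounding("the dog on the left", 3),
        as_caption("a man throws a frisbee to a dog"),
    ]


if __name__ == "__main__":
    for example in build_multitask_batch():
        print(f"{example.source_text!r} -> {example.target_text!r}")
```

Because every example has the same shape, a single sequence-to-sequence model with one language modeling loss could in principle be trained on the mixed batch, which is the property the abstract emphasizes.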

Code & Models

Repositories

Models

sonoisa/vl-t5-base-japanese (Hugging Face model · 3 downloads · 2 likes)

Videos

Unifying Vision-and-Language Tasks via Text Generation · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: VL-T5