Large-Scale Adversarial Training for Vision-and-Language Representation Learning

Zhe Gan; Yen-Chun Chen; Linjie Li; Chen Zhu; Yu Cheng; Jingjing Liu

arXiv:2006.06195 · cs.CV · October 26, 2020 · 287 cites

TL;DR

VILLA introduces a large-scale adversarial training framework in the embedding space for vision-and-language models, significantly improving performance across multiple tasks by promoting invariance and robustness.

Contribution

It pioneers large-scale adversarial training in the embedding space for V+L models, combining task-agnostic pre-training and task-specific finetuning with a novel regularization approach.

Findings

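01. Achieved new state-of-the-art results on a wide range of V+L tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.

02. Demonstrated the effectiveness of adversarial training applied in the embedding space of each modality rather than on image pixels and textual tokens.

03. Showed improved robustness and invariance in learned representations.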

Abstract

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the "free" adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.
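The objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a toy linear softmax classifier over a single embedding vector, an L2 norm-ball constraint on the perturbation, a symmetric form of the KL term, and hypothetical names (`ascent_step`, `adversarial_objective`, `alpha`, `beta`) chosen only for illustration. It also uses a plain gradient-ascent step on the perturbation and does not reproduce the gradient reuse of the "free" adversarial training strategy.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def symmetric_kl(p, q):
    # Assumption: the regularizer is taken in both directions.
    return kl(p, q) + kl(q, p)

def project_l2(delta, max_norm):
    # Rescale delta so that its L2 norm does not exceed max_norm.
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= max_norm or norm == 0.0:
        return list(delta)
    return [d * max_norm / norm for d in delta]

def linear_logits(weights, embedding):
    # Toy stand-in for a V+L model: one linear layer, one row per class.
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]

def ascent_step(weights, embedding, delta, label, step_size, max_norm):
    # One gradient-ascent step on the perturbation, applied to the
    # embedding rather than to raw pixels or tokens.
    perturbed = [x + d for x, d in zip(embedding, delta)]
    probs = softmax(linear_logits(weights, perturbed))
    residual = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # Gradient of cross-entropy w.r.t. the embedding for a linear model.
    grad = [sum(residual[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(embedding))]
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(delta)
    stepped = [d + step_size * g / norm for d, g in zip(delta, grad)]
    return project_l2(stepped, max_norm)

def adversarial_objective(weights, embedding, delta, label, alpha=1.0, beta=1.0):
    # Clean loss + adversarial loss + KL-based consistency term.
    clean = softmax(linear_logits(weights, embedding))
    perturbed = [x + d for x, d in zip(embedding, delta)]
    adv = softmax(linear_logits(weights, perturbed))
    clean_loss = -math.log(clean[label])
    adv_loss = -math.log(adv[label])
    return clean_loss + alpha * adv_loss + beta * symmetric_kl(clean, adv)

if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical 2-class, 2-dim model
    x = [2.0, 0.5]                 # hypothetical embedding
    delta = ascent_step(W, x, [0.0, 0.0], label=0, step_size=0.5, max_norm=1.0)
    print(adversarial_objective(W, x, [0.0, 0.0], 0))
    print(adversarial_objective(W, x, delta, 0))
```

In this toy setting the second printed value is larger than the first, since the perturbation both raises the adversarial loss and moves the perturbed prediction away from the clean one, which is what the KL term penalizes.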

Code & Models

Videos

Large-Scale Adversarial Training for Vision-and-Language Representation Learning · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling