Vision Language Transformers: A Survey

Clayton Fields; Casey Kennington

arXiv:2307.03254 · cs.CV · July 10, 2023 · 2 cites

Open Access

TL;DR

This survey reviews the development and impact of vision language transformers, highlighting their architecture, transfer learning capabilities, and potential for advancing multimodal AI tasks.

Contribution

It provides a comprehensive synthesis of current research on vision language transformers, analyzing their strengths, limitations, and open questions.

Findings

01. Transformers have significantly improved vision language task performance.

02. Pretraining on large datasets enables effective transfer to various tasks.

03. Current models face limitations in data efficiency and understanding complex contexts.

Abstract

Vision language tasks, such as answering questions about an image or generating captions that describe it, are difficult for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced by Vaswani et al. (2017) to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining on large generic datasets and transferring what they learn to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks that require both vision and language. In this paper, we provide a broad synthesis of the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Dropout