How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey

Yayun Qi; Hongxi Li; Yiqi Song; Xinxiao Wu; Jiebo Luo

arXiv:2412.08158·cs.CV·December 12, 2024

TL;DR

This survey reviews how large pre-trained models have significantly advanced vision-language tasks like captioning and question answering, highlighting recent progress, challenges, and future research directions.

Contribution

It provides a comprehensive overview of the impact of pre-trained models on vision-language tasks, including challenges, recent advances, and potential risks.

Findings

01 Pre-trained models improve performance in vision-language tasks.

02 Challenges remain despite advancements in pre-trained models.

03 Future research should address inherent limitations of pre-trained models.

Abstract

The exploration of various vision-language tasks, such as visual captioning, visual question answering, and visual commonsense reasoning, is an important area of artificial intelligence that continuously attracts the research community's attention. Despite improvements in overall performance, classic challenges persist in vision-language tasks and hinder the development of this area. In recent years, the rise of pre-trained models has been driving research on vision-language tasks. Thanks to the massive scale of their training data and model parameters, pre-trained models have exhibited excellent performance on numerous downstream tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to solve the classic challenges. Such methods have become mainstream in current research, attracting increasing attention and advancing rapidly. In this paper, we present a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need