How Vision-Language Tasks Benefit from Large Pre-trained Models: A Survey
Yayun Qi, Hongxi Li, Yiqi Song, Xinxiao Wu, Jiebo Luo

TL;DR
This survey reviews how large pre-trained models have advanced vision-language tasks such as captioning and question answering, covering recent progress, open challenges, and future research directions.
Contribution
It provides a comprehensive overview of the impact of pre-trained models on vision-language tasks, including challenges, recent advances, and potential risks.
Findings
Pre-trained models improve performance in vision-language tasks.
Challenges remain despite advancements in pre-trained models.
Future research should address inherent limitations of pre-trained models.
Abstract
The exploration of various vision-language tasks, such as visual captioning, visual question answering, and visual commonsense reasoning, is an important area in artificial intelligence and continuously attracts the research community's attention. Despite improvements in overall performance, classic challenges still exist in vision-language tasks and hinder the development of this area. In recent years, the rise of pre-trained models has been driving research on vision-language tasks. Thanks to the massive scale of their training data and model parameters, pre-trained models have exhibited excellent performance in numerous downstream tasks. Inspired by the powerful capabilities of pre-trained models, new paradigms have emerged to solve the classic challenges. Such methods have become mainstream in current research, with increasing attention and rapid advances. In this paper, we present a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need
