Vision and Language: from Visual Perception to Content Creation
Tao Mei, Wei Zhang, Ting Yao

TL;DR
This paper reviews recent advances in how vision and language interact, covering tasks such as captioning, visual question answering, and language-driven visual content creation, and highlights both technological progress and real-world applications.
Contribution
It provides a comprehensive overview of recent developments in vision-language research, emphasizing the bidirectional influence between visual perception and linguistic understanding.
Findings
Significant growth in vision-to-language applications like captioning and VQA.
Emergence of language-driven visual content creation techniques.
Discussion of real-world deployment and services of vision-language systems.
Abstract
Vision and language are two fundamental capabilities of human intelligence. Humans routinely perform tasks through the interaction between vision and language, which supports the uniquely human capacity to talk about what they see or to hallucinate a picture from a natural-language description. The question of how language interacts with vision motivates researchers to expand the horizons of computer vision. In particular, "vision to language" has probably been one of the most popular topics of the past five years, with significant growth in both the volume of publications and the range of applications, e.g., captioning, visual question answering, visual dialog, language navigation, etc. Such tasks boost visual perception with more comprehensive understanding and diverse linguistic representations. Going beyond the progress made in "vision to language," language can also contribute to…
