From Pixels to Prompts: Vision-Language Models
Khang Hoang Nhat Vo

TL;DR
This paper provides a clear, structured overview of Vision-Language Models. It aims to help readers understand the field's core concepts and progress, and to design their own systems even as the field changes rapidly.
Contribution
It offers a durable, intuitive mental map of Vision-Language Models, moving beyond datasets and benchmarks to foster deeper understanding and system design.
Findings
Provides a structured overview of Vision-Language Models
Helps readers understand core concepts and progress
Aims to enable confident system design
Abstract
When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: it is too easy to get lost. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want…
