How Well Can Vision Language Models See Image Details?
Chenhui Gou, Abdulwahab Felemban, Faizan Farooq Khan, Deyao Zhu, Jianfei Cai, Hamid Rezatofighi, Mohamed Elhoseiny

TL;DR
This paper investigates the ability of vision-language models to perceive image details beyond semantics by introducing a pixel value prediction task, showing that adapting the vision encoder enhances detailed perception and downstream task performance.
Contribution
It introduces a pixel value prediction task to evaluate and improve VLMs' detailed image perception, demonstrating the importance of vision encoder adaptation for better downstream task results.
Findings
Fine-tuning only the connection module and LLM yields limited pixel prediction accuracy.
Adapting the vision encoder significantly improves pixel prediction precision.
Pixel prediction as a pre-training task boosts downstream image understanding and decision-making performance.
Abstract
Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains unclear. In our study, we introduce a pixel value prediction task (PVP) to explore "How Well Can Vision Language Models See Image Details?" and to assist VLMs in perceiving more details. Typically, these models comprise a frozen CLIP visual encoder, a large language model, and a connecting module. After fine-tuning VLMs on the PVP task, we find: 1) existing VLMs struggle to predict precise pixel values by only fine-tuning the connection module and LLM; and 2) prediction precision is significantly improved when the vision encoder is also adapted. Additionally, our research reveals that incorporating pixel value prediction as one of the VLM pre-training…
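The pixel value prediction task described in the abstract can be pictured as a question-answer pair about a single pixel. The sketch below is only an illustration under stated assumptions: the prompt wording, the answer format, the error metric, and the helper names (`make_pvp_sample`, `parse_rgb`, `mean_abs_error`) are hypothetical and are not taken from the paper.

```python
import re
from typing import List, Optional, Tuple

RGB = Tuple[int, int, int]


def make_pvp_sample(image: List[List[RGB]], x: int, y: int) -> Tuple[str, str]:
    """Build one (question, answer) pair for the pixel at column x, row y.

    The image is a nested list indexed as image[y][x]. The prompt wording
    is an assumption made for illustration, not the paper's template.
    """
    r, g, b = image[y][x]
    question = f"What is the RGB value of the pixel at (x={x}, y={y})?"
    answer = f"({r}, {g}, {b})"
    return question, answer


def parse_rgb(text: str) -> Optional[RGB]:
    """Read the first three integers from a model reply, clamped to 0..255.

    Returns None when the reply contains fewer than three integers.
    """
    numbers = re.findall(r"\d+", text)
    if len(numbers) < 3:
        return None
    r, g, b = (min(int(n), 255) for n in numbers[:3])
    return (r, g, b)


def mean_abs_error(pred: RGB, target: RGB) -> float:
    """Average absolute difference over the three channels (assumed metric)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / 3


# Tiny 2x2 image used only to exercise the helpers.
image = [
    [(0, 0, 0), (255, 0, 0)],
    [(0, 255, 0), (10, 20, 30)],
]
question, answer = make_pvp_sample(image, x=1, y=1)
```

Scoring a reply would then mean parsing it with `parse_rgb` and comparing it to the ground-truth pixel with `mean_abs_error`; a reply that cannot be parsed could be counted as a failure.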
Taxonomy
Topics: Categorization, perception, and language · Language, Metaphor, and Cognition
Methods: Contrastive Language-Image Pre-training
