How Well Can Vision Language Models See Image Details?

Chenhui Gou; Abdulwahab Felemban; Faizan Farooq Khan; Deyao Zhu; Jianfei Cai; Hamid Rezatofighi; Mohamed Elhoseiny

arXiv:2408.03940 · cs.CV · August 8, 2024

TL;DR

This paper investigates the ability of vision-language models to perceive image details beyond semantics by introducing a pixel value prediction task, showing that adapting the vision encoder enhances detailed perception and downstream task performance.

Contribution

It introduces a pixel value prediction task to evaluate and improve VLMs' detailed image perception, demonstrating the importance of vision encoder adaptation for better downstream task results.

Findings

01. Fine-tuning only the connection module and LLM yields limited pixel prediction accuracy.

02. Adapting the vision encoder significantly improves pixel prediction precision.

03. Pixel prediction as a pre-training task boosts downstream image understanding and decision-making performance.

Abstract

Large Language Model-based Vision-Language Models (LLM-based VLMs) have demonstrated impressive results in various vision-language understanding tasks. However, how well these VLMs can see image detail beyond the semantic level remains unclear. In our study, we introduce a pixel value prediction task (PVP) to explore "How Well Can Vision Language Models See Image Details?" and to assist VLMs in perceiving more details. Typically, these models comprise a frozen CLIP visual encoder, a large language model, and a connecting module. After fine-tuning VLMs on the PVP task, we find: 1) existing VLMs struggle to predict precise pixel values by only fine-tuning the connection module and LLM; and 2) prediction precision is significantly improved when the vision encoder is also adapted. Additionally, our research reveals that incorporating pixel value prediction as one of the VLM pre-training…
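To make the pixel value prediction idea concrete, here is a minimal sketch of how one question/answer pair could be built from an image and how a predicted value could be scored. The prompt wording, the `[R, G, B]` answer format, the mean-absolute-error metric, and all function names (`make_pvp_sample`, `parse_rgb`, `mean_abs_error`) are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # image[y][x] -> (R, G, B), each channel in 0..255


def make_pvp_sample(image: Image, rng: random.Random) -> dict:
    """Pick a random pixel and phrase it as a question/answer pair.

    The prompt text and answer format are hypothetical.
    """
    height, width = len(image), len(image[0])
    x = rng.randrange(width)
    y = rng.randrange(height)
    r, g, b = image[y][x]
    return {
        "question": f"What is the RGB value of the pixel at (x={x}, y={y})?",
        "answer": f"[{r}, {g}, {b}]",
        "coord": (x, y),
        "target": (r, g, b),
    }


def parse_rgb(text: str) -> Pixel:
    """Parse a model reply of the assumed form '[R, G, B]'."""
    parts = text.strip().strip("[]").split(",")
    r, g, b = (int(p) for p in parts)
    return (r, g, b)


def mean_abs_error(pred: Pixel, target: Pixel) -> float:
    """Average absolute per-channel difference (an assumed metric)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / 3.0


# Toy 2x2 image standing in for a real input.
image: Image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (128, 128, 128)],
]
sample = make_pvp_sample(image, random.Random(0))
print(sample["question"], "->", sample["answer"])
```

In such a setup, a model reply would be passed through `parse_rgb` and compared with `sample["target"]`; a lower error would indicate that more pixel-level detail survives the path from vision encoder to language model.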

Taxonomy

Topics: Categorization, perception, and language · Language, Metaphor, and Cognition

Methods: Contrastive Language-Image Pre-training