Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness

Xin Hu; Haomiao Ni; Yunbei Zhang; Jihun Hamm; Zechen Li; Zhengming Ding

arXiv:2602.19615·cs.CV·February 24, 2026

Open Access

TL;DR

This paper presents a plug-and-play module that enhances vision language models' reasoning on rare objects by refining visual tokens and enriching prompts, without requiring finetuning, leading to significant performance improvements.

Contribution

It introduces a lightweight, plug-and-play approach that leverages prior knowledge and synonym-augmented descriptions to improve rare object reasoning in VLMs without finetuning.

Findings

01 Significant gains in rare object recognition and reasoning

02 Improved focus on relevant image regions

03 Enhanced fine-grained object details

Abstract

Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive when finetuning VLMs and do not fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without finetuning the VLMs. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples.…
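
To make the two ideas in the abstract concrete (refining visual tokens and enriching the text prompt with class embeddings), here is a minimal sketch. It is not the authors' implementation: the abstract is truncated and does not give the actual formulation. The function names `refine_tokens` and `enrich_prompt`, the similarity-weighted mixing rule, the `alpha` and `temperature` parameters, the hint wording, and the toy class names are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def refine_tokens(tokens, class_embs, alpha=0.5, temperature=0.1):
    """Assumed refinement rule: add a similarity-weighted mix of class
    embeddings to each visual token. The frozen VLM itself is untouched."""
    refined = []
    for tok in tokens:
        weights = softmax([cosine(tok, c) / temperature for c in class_embs])
        mix = [sum(w * c[i] for w, c in zip(weights, class_embs))
               for i in range(len(tok))]
        refined.append([t + alpha * m for t, m in zip(tok, mix)])
    return refined

def enrich_prompt(prompt, tokens, class_embs, class_names, top_k=1):
    """Assumed enrichment rule: append the names of the classes whose
    embeddings best match any visual token as a textual hint."""
    scores = [max(cosine(tok, c) for tok in tokens) for c in class_embs]
    ranked = sorted(range(len(class_names)), key=lambda k: scores[k], reverse=True)
    hints = ", ".join(class_names[k] for k in ranked[:top_k])
    return f"{prompt} (Hint: the image may contain: {hints}.)"

# Toy 2-D example; real visual tokens and class embeddings are high-dimensional.
tokens = [[1.0, 0.0], [0.9, 0.1]]
class_embs = [[1.0, 0.0], [0.0, 1.0]]
class_names = ["pangolin", "axolotl"]  # hypothetical rare classes

refined = refine_tokens(tokens, class_embs)
prompt = enrich_prompt("What animal is shown?", tokens, class_embs, class_names)
print(prompt)
```

In this sketch both steps act only on the VLM's inputs, which is what makes such a module "plug-and-play" in the sense the abstract describes: the refined tokens and enriched prompt would be passed to an unmodified model.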

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Domain Adaptation and Few-Shot Learning