See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang

TL;DR
This paper introduces Bi-directional Perceptual Shaping (BiPS), a training method that enhances multimodal reasoning in vision-language models by improving visual evidence utilization and generalization across domains.
Contribution
BiPS is a novel training approach that uses bidirectional visual cues and consistency constraints to improve visual evidence reliance and domain generalization in VLMs.
Findings
BiPS improves Qwen2.5-VL-7B performance by 8.2% on average across benchmarks.
BiPS enhances out-of-domain generalization to unseen datasets and image types.
BiPS effectively encourages models to rely on fine-grained visual evidence.
Abstract
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
