See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Shuoshuo Zhang; Yizhen Zhang; Jingjing Fu; Lei Song; Jiang Bian; Yujiu Yang; Rui Wang

arXiv:2512.22120·cs.CV·February 6, 2026

TL;DR

This paper introduces Bi-directional Perceptual Shaping (BiPS), a training method that enhances multimodal reasoning in vision-language models by improving visual evidence utilization and generalization across domains.

Contribution

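BiPS is a training approach that turns question-conditioned masked views into bidirectional where-to-look signals: a KL-consistency constraint against an evidence-preserving view and a KL-separation constraint against an evidence-ablated view. Together these push VLMs to rely on fine-grained visual evidence and to generalize across domains.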

Findings

01. BiPS improves Qwen2.5-VL-7B performance by 8.2% on average across benchmarks.

02. BiPS enhances out-of-domain generalization to unseen datasets and image types.

03. BiPS effectively encourages models to rely on fine-grained visual evidence.

Abstract

Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer…
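The two constraints described in the abstract can be illustrated with a small numerical sketch. The abstract is truncated here, so the exact loss is not shown on this page. Everything below is an assumption for illustration only: the function names, the KL directions, the hinge with a `margin` for the separation term, and the loss weights are hypothetical. Plain NumPy arrays of logits stand in for a model's answer distributions on the original, evidence-preserving, and evidence-ablated views.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # Mean KL(p || q) over the batch; eps guards against log(0).
    per_row = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(per_row.mean())

def bips_style_loss(logits_orig, logits_preserved, logits_ablated,
                    margin=1.0, w_cons=1.0, w_sep=1.0):
    """Hypothetical combination of a consistency and a separation term.

    Assumptions (not taken from the paper):
    - consistency is KL(preserved || original), minimised directly;
    - separation is a hinge that is zero once KL(original || ablated)
      exceeds `margin`, so the two distributions are pushed apart.
    """
    p_orig = softmax(logits_orig)
    p_keep = softmax(logits_preserved)
    p_abl = softmax(logits_ablated)

    consistency = kl_divergence(p_keep, p_orig)
    separation_kl = kl_divergence(p_orig, p_abl)
    separation = max(0.0, margin - separation_kl)

    total = w_cons * consistency + w_sep * separation
    return total, consistency, separation_kl

if __name__ == "__main__":
    orig = np.array([[2.0, 0.0, -1.0]])
    ablated = np.array([[-5.0, 5.0, 0.0]])
    # Preserved view agrees with the original; ablated view disagrees.
    print(bips_style_loss(orig, orig, ablated))
```

In this sketch the consistency term is zero when the evidence-preserving view yields the same distribution as the original image, and the separation term stops contributing once the evidence-ablated view yields a sufficiently different distribution. A bounded hinge is one common way to keep a "push apart" objective from growing without limit; whether BiPS uses this form cannot be determined from the truncated abstract.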

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning