TL;DR
This paper introduces Perception Programs (P²), a training-free method that converts raw visual tool outputs into structured, language-native summaries, significantly improving multimodal language models' visual reasoning capabilities.
Contribution
P² is a novel, training-free, model-agnostic approach that enhances visual reasoning in MLLMs by rewriting tool outputs into language-native summaries, outperforming prior methods.
Findings
P² improves accuracy from 41.35% to 86.47% on multi-view reasoning.
P² achieves a 22% average gain across perception-centric tasks.
P² surpasses prior agentic, supervised, and RL-based tool-use methods without training or model modifications.
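As a quick check on the multi-view figure above, the short sketch below works out the absolute and relative gain from the two reported accuracies. Only the numbers 41.35 and 86.47 come from the summary; the variable names are illustrative. The summary does not say whether the 22% average gain is absolute or relative, so this calculation leaves it out.

```python
# Reported multi-view reasoning accuracies (percent), taken from the findings.
before, after = 41.35, 86.47

# Absolute gain is measured in percentage points.
absolute_gain = after - before

# Relative gain expresses the improvement as a fraction of the baseline.
relative_gain = absolute_gain / before * 100

print(f"absolute gain: {absolute_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.1f}% of the baseline")
```

In other words, the reported multi-view accuracy more than doubles relative to the baseline.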
Abstract
Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs; it is how tool outputs are represented. We introduce Perception Programs (P²), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six…
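To make the idea of a language-native summary concrete, here is a minimal sketch that turns a dense depth map into a few lines of text, one per grid cell. It is not the paper's actual P² format, which this summary does not specify. The function name `summarize_depth`, the 2x2 grid, and the `cell(...)` line layout are all assumptions chosen for illustration.

```python
from statistics import mean


def summarize_depth(depth, rows=2, cols=2):
    """Summarize a dense depth map (list of lists of floats) as text.

    Hypothetical format: one line per grid cell with its mean depth.
    This is an illustrative stand-in, not the paper's P² specification.
    """
    h, w = len(depth), len(depth[0])
    lines = []
    for r in range(rows):
        for c in range(cols):
            # Collect all depth values that fall inside this grid cell.
            cell = [
                depth[y][x]
                for y in range(r * h // rows, (r + 1) * h // rows)
                for x in range(c * w // cols, (c + 1) * w // cols)
            ]
            lines.append(f"cell(row={r}, col={c}): mean_depth={mean(cell):.2f}")
    return "\n".join(lines)


# Toy 4x4 depth map standing in for a real depth-estimation tool's output.
depth = [
    [1.0, 1.0, 4.0, 4.0],
    [1.0, 1.0, 4.0, 4.0],
    [2.0, 2.0, 8.0, 8.0],
    [2.0, 2.0, 8.0, 8.0],
]
summary = summarize_depth(depth)
print(summary)
```

The resulting text could be placed in a model's prompt instead of the raw depth image, which is the kind of representation change the abstract argues for.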
