Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

Muhammad Kamran Janjua; Hugo Silva; Di Niu; Bahador Rashidi

arXiv:2604.12896·cs.CV·April 15, 2026

TL;DR

This paper introduces Perception Programs (P²), a training-free method that converts raw visual tool outputs into structured, language-native summaries, significantly improving multimodal language models' visual reasoning capabilities.

Contribution

P² is a novel, training-free, model-agnostic approach that enhances visual reasoning in MLLMs by rewriting tool outputs into language-native summaries, outperforming prior methods.

Findings

01

P² improves accuracy on multi-view reasoning from 41.35% to 86.47%, a gain of 45.12 percentage points.

02

P² achieves a 22% average gain across perception-centric tasks.

03

P² surpasses prior agentic, supervised, and RL-based tool-use methods without training or model modifications.

Abstract

Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not more tool calls or larger MLLMs; it is how tool outputs are represented. We introduce Perception Programs (P²), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six…
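To make the idea of rewriting a dense tool output into a language-native summary concrete, here is a minimal sketch in Python. It is not the paper's implementation. The function name `summarize_depth`, the region boxes, the "larger value means farther" depth convention, and the output text format are all assumptions made for illustration. The abstract does not specify what an actual Perception Program looks like.

```python
from statistics import median


def summarize_depth(depth, regions):
    """Turn a dense depth map into a short, structured text summary.

    Hypothetical helper, not the paper's API.
    depth:   2D list of floats; larger values are assumed to be farther away.
    regions: dict mapping a region name to a half-open box (row0, col0, row1, col1).
    """
    stats = {}
    for name, (r0, c0, r1, c1) in regions.items():
        # Collect every depth value inside the box and reduce it to one number.
        values = [depth[r][c] for r in range(r0, r1) for c in range(c0, c1)]
        stats[name] = median(values)

    # Sort region names from nearest to farthest by median depth.
    ordered = sorted(stats, key=stats.get)

    lines = ["tool: depth (hypothetical format)"]
    for name in ordered:
        lines.append(f"region {name}: median_depth={stats[name]:.2f}")
    lines.append("order_near_to_far: " + " < ".join(ordered))
    return "\n".join(lines)


# Toy 2x4 depth map: the left half is near, the right half is far.
depth = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
]
summary = summarize_depth(depth, {"A": (0, 0, 2, 2), "B": (0, 2, 2, 4)})
print(summary)
```

In this sketch the model would receive a few lines of text stating which region is closer, rather than a rendered depth image. This is the kind of compact, parseable representation the abstract argues for.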

Code & Models

Repositories

aismartperception/perception-programs (GitHub)