SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Niccolo Avogaro; Nayanika Debnath; Li Mi; Thomas Frick; Junling Wang; Zexue He; Hang Hua; Konrad Schindler; Mattia Rigotti

arXiv:2602.06566·cs.CV·February 11, 2026


Open Access

TL;DR

SPARC introduces a modular approach that separates perception and reasoning in vision-language models, enabling more flexible, efficient, and accurate test-time scaling and adaptation for visual reasoning tasks.

Contribution

It proposes a novel framework that decouples perception from reasoning, allowing independent scaling, optimization, and efficient processing in vision-language models.

Findings

01  Outperforms monolithic baselines on visual reasoning benchmarks.

02  Improves Qwen3VL-4B accuracy by 6.7 percentage points on V* VQA.

03  Reduces token budget by 200x while maintaining or improving performance.

Abstract

Despite recent successes, test-time scaling (i.e., dynamically expanding the token budget during inference as needed) remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final…
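The two-stage control flow described in the abstract (localize question-relevant regions first, then reason only over those regions) can be illustrated with a minimal sketch. This is not the authors' implementation: `two_stage_answer`, `crop`, `toy_locate`, and `toy_reason` are hypothetical names, the image is a plain nested list rather than a real tensor, and the two toy callables stand in for what would be VLM calls in practice.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]           # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive


@dataclass
class Region:
    box: Box
    pixels: Image  # the cropped sub-image for this box


def crop(image: Image, box: Box) -> Image:
    """Cut the rectangle described by `box` out of `image`."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def two_stage_answer(
    image: Image,
    question: str,
    locate: Callable[[Image, str], List[Box]],
    reason: Callable[[str, List[Region]], str],
) -> str:
    # Stage 1 (perception): find the regions relevant to the question.
    boxes = locate(image, question)
    regions = [Region(box, crop(image, box)) for box in boxes]
    # Stage 2 (reasoning): answer using only the localized regions,
    # never the full image, so the two stages stay decoupled.
    return reason(question, regions)


# Hypothetical stand-ins for the two model calls (assumptions for illustration).
def toy_locate(image: Image, question: str) -> List[Box]:
    """Return one bounding box around all non-zero pixels, if any."""
    coords = [(x, y) for y, row in enumerate(image) for x, v in enumerate(row) if v]
    if not coords:
        return []
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return [(min(xs), min(ys), max(xs) + 1, max(ys) + 1)]


def toy_reason(question: str, regions: List[Region]) -> str:
    """Sum the pixel values inside the localized regions."""
    total = sum(v for region in regions for row in region.pixels for v in row)
    return f"{total} marked pixels"


if __name__ == "__main__":
    image = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0],
    ]
    print(two_stage_answer(image, "How many pixels are marked?", toy_locate, toy_reason))
```

Because `locate` and `reason` are passed in as separate callables, either stage could in principle be swapped or given a different compute budget without touching the other, which is the kind of independent scaling the summary above attributes to the framework.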


Taxonomy

Topics: Multimodal Machine Learning Applications · Face Recognition and Perception · Domain Adaptation and Few-Shot Learning