ProxyThinker: Test-Time Guidance through Small Visual Reasoners
Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, Vicente Ordonez

TL;DR
ProxyThinker is a novel inference-time method that enhances large vision-language models' reasoning abilities by leveraging small visual reasoners, significantly improving performance and inference speed without additional training.
Contribution
It introduces ProxyThinker, a training-free technique that transfers reasoning capabilities from small models to large models during inference, enabling efficient and improved visual reasoning.
Findings
Boosts performance on visual reasoning benchmarks
Achieves up to 38x faster inference than a naive implementation
Enables untuned models to match RFT counterparts
Abstract
Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits the slow-thinking reasoning demonstrated by the emerged sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks on spatial,…
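The logit arithmetic described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes all three models share one vocabulary, and the function names, the `alpha` scaling factor, and the toy logit values are hypothetical. The difference between the small RFT reasoner and its small base model is added to the large base model's logits before the next token is chosen.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_logits(large_base, small_expert, small_base, alpha=1.0):
    """Shift the large model's logits by (small RFT expert - small base).

    `alpha` is an assumed guidance strength; alpha=0 recovers the
    unguided large base model.
    """
    if not (len(large_base) == len(small_expert) == len(small_base)):
        raise ValueError("all logit vectors must share one vocabulary")
    return [
        lb + alpha * (se - sb)
        for lb, se, sb in zip(large_base, small_expert, small_base)
    ]


def greedy_next_token(large_base, small_expert, small_base, alpha=1.0):
    """Return the index of the highest-scoring token after guidance."""
    scores = guided_logits(large_base, small_expert, small_base, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 3-token vocabulary (made-up numbers for illustration only).
large_base = [2.0, 1.0, 0.0]    # large untuned model prefers token 0
small_expert = [0.0, 2.0, 0.0]  # small RFT reasoner prefers token 1
small_base = [1.0, 0.0, 0.0]    # small base model prefers token 0

print(guided_logits(large_base, small_expert, small_base))
print(greedy_next_token(large_base, small_expert, small_base))
```

In this toy case the guidance term moves the greedy choice from token 0 to token 1, while setting `alpha=0` leaves the large model's own choice unchanged.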
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper targets a valuable problem: the training cost of RL. 2. The paper delivers a surprising and useful finding that reasoning behavior can be transferred from a small expert model to a large base model, alleviating the burden of training cost. 3. The authors have thoroughly addressed the practical viability of using three models at inference. By leveraging vLLM and optimized tensor parallelism, they demonstrate a ~38x speedup over a naive implementation. Their system adds only a minor
1. A Significant Trade-off in Reasoning Diversity (Pass@k): The paper's claim of "striking a balance" in reasoning exploration (Sec 4.2) is an oversimplification. The data in Figure 5 clearly shows that while Pass@1 performance (greedy decoding) is improved, the Pass@k performance for $k>4$ drops below that of the unguided base model. This suggests the guidance narrows the large model's reasoning diversity, forcing it down the single "slow-thinking" path favored by the small expert. This trade-o
# Strengths 1. Practical inference-time approach: The method requires no additional training of the large model, addressing the high cost of RFT for VLMs. It is simple to implement (just logit arithmetic) and can leverage existing small RFT models. 2. Empirical gains on visual reasoning tasks: ProxyThinker consistently improves accuracy on spatial and math reasoning benchmarks. In many cases the base VLM closes most of the gap to a fully RFT-trained model. For example, applying a small visual
# Weaknesses 1. Overall, the novelty of the proposed method is below par. ProxyThinker closely mirrors existing logit-guidance techniques. The existence of current methods such as DExperts, Proxy-Tuning and DoLa makes ProxyThinker insufficiently novel. 2. Weak justification for VLM-specific focus. The authors motivate the work by the expense of RFT on large VLMs, but they do not identify any modality-specific challenge that makes ProxyThinker inherently necessary for vision. The technique appea
1. The paper is exceptionally well-written, clearly articulated, and easy to follow. The motivation is strong and well-grounded, directly addressing the significant and timely challenge of improving the scalability of reinforcement learning for large-scale VLMs. 2. The core idea is highly intuitive and logically sound. It builds upon the key insight that RFT methods like GRPO often do not introduce new external knowledge but rather reshape the model's output distribution to elicit a step-by-ste
1. Lack of Principled Analysis on Expert Model Selection: The paper's primary contribution relies on guidance from a small "expert" model, yet the criteria for selecting this expert seem somewhat ad-hoc. While the authors experiment with three public models chosen based on "differing training paradigms and data selection strategies" (line 259), the paper falls short of addressing a crucial question: What properties define an optimal expert for the PROXYTHINKER framework? A deeper investigation i
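For readers unfamiliar with the Pass@k metric raised in the reviews above, the sketch below shows the commonly used unbiased estimator: given `n` sampled answers of which `c` are correct, it estimates the probability that at least one of `k` randomly drawn samples is correct. The excerpt does not say how the paper computes Pass@k, so treating it as this estimator is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate Pass@k as 1 - C(n - c, k) / C(n, k).

    n: total sampled answers, c: number of correct answers,
    k: number of samples the metric is allowed to draw.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples and 5 correct, a single draw succeeds half the time.
print(pass_at_k(10, 5, 1))
```

Because Pass@k for larger `k` rewards having any correct answer among diverse samples, a method that concentrates probability on one reasoning path can raise Pass@1 while lowering Pass@k, which is the trade-off the reviewer describes.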
Taxonomy
Topics: Software Engineering Research · Multimodal Machine Learning Applications · Machine Learning and Data Classification
Methods: Balanced Selection
