No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers
Damiano Marsili, Georgia Gkioxari

TL;DR
This paper introduces an annotation-free training framework for visual reasoning that leverages AI-powered verifiers to improve reasoning and grounding without requiring labeled data, outperforming existing methods.
Contribution
It presents a novel training approach using LLM and VLM verifiers with reinforcement learning and hard-negative mining, eliminating the need for ground truth labels.
Findings
Improves visual reasoning performance across diverse tasks.
Outperforms open-source and proprietary models.
Enhances visual grounding accuracy.
Abstract
Visual reasoning is challenging, requiring both precise object grounding and an understanding of complex spatial relationships. Existing methods fall into two camps: language-only chain-of-thought approaches, which demand large-scale (image, query, answer) supervision, and program-synthesis approaches, which use pre-trained models and avoid training but suffer from flawed logic and erroneous grounding. We propose an annotation-free training framework that improves both reasoning and grounding. Our framework uses AI-powered verifiers: an LLM verifier refines LLM reasoning via reinforcement learning, while a VLM verifier strengthens visual grounding through automated hard-negative mining, eliminating the need for ground truth labels. This design combines the strengths of modern AI systems: advanced language-only reasoning models for decomposing spatial queries into simpler subtasks, and strong…
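To make the grounding idea in the abstract concrete, the following is a minimal sketch of verifier-filtered pseudo-labeling, not the authors' implementation. Everything named here is hypothetical: `Detection`, `mine_pseudo_labels`, the `verifier` callable (a stand-in for a VLM judging one candidate box), and the IoU threshold of 0.5. It also assumes that detections rejected by the verifier are the ones kept as hard negatives, which the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


@dataclass
class Detection:
    """Hypothetical container for one candidate box from a pre-trained detector."""
    box: Box
    label: str
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mine_pseudo_labels(
    detections: List[Detection],
    verifier: Callable[[Detection], bool],
    iou_threshold: float = 0.5,  # assumed value, for illustration only
) -> Tuple[List[Detection], List[Detection]]:
    """Split detections into verifier-accepted pseudo-labels and rejected hard negatives.

    `verifier` stands in for a VLM call; here it is any function returning True
    when the box is judged to contain the named object.
    """
    accepted: List[Detection] = []
    hard_negatives: List[Detection] = []
    # Visit high-confidence boxes first so duplicates resolve toward the best score.
    for det in sorted(detections, key=lambda d: d.score, reverse=True):
        if not verifier(det):
            hard_negatives.append(det)  # assumption: rejected boxes serve as negatives
            continue
        is_duplicate = any(
            det.label == kept.label and iou(det.box, kept.box) >= iou_threshold
            for kept in accepted
        )
        if not is_duplicate:
            accepted.append(det)
    return accepted, hard_negatives
```

In this sketch no ground-truth boxes appear anywhere: the only supervision signal is the verifier's accept/reject decision, which is the property the abstract emphasizes. The accepted boxes could then be used as training targets for a grounding model, with the rejected ones as negatives.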
Peer Reviews
Decision·ICLR 2026 Poster
- The paper is clearly written and well structured.
- It introduces an innovative training framework that enables the model to jointly improve both itself and the tools it invokes, entirely without human supervision.
- The ablation studies are convincing, showing that verifier-based RL enhances reasoning logic, while verifier-based pseudo-labeling improves visual grounding, with cumulative performance gains when both are combined.

- The visual grounding module in VALOR is further fine-tuned using verifier-generated pseudo-labels, while all baselines still rely on the frozen pre-trained detector (Table 1). This makes the comparison with baselines partially unfair, since the tools they are allowed to invoke are in fact not properly aligned.
- The paper lacks a quantitative analysis of verifier errors, which would help assess the reliability of verifier supervision.
The following are the strengths of the paper:
**Originality:** The paper presents an integration of verifier-guided reinforcement learning with explicit tool use for visual reasoning, eliminating the need for labeled supervision. The structured multi-head verifier and verifier-filtered pseudo-label pipeline seem to be novel and effective extensions of prior VLM paradigms.
**Technical Quality:** The paper is empirically strong. The experimental design cleanly isolates the effects of reasoning-l
The following are the weaknesses of the paper:
- The paper introduces a rich multi-head verifier reward but does not ablate the contribution of each component. Since logic, spatial, attribute, syntax, and adherence rewards are argued to fix distinct reasoning errors, omitting a head-wise ablation leaves uncertainty about which rewards are essential versus redundant.
- The three-stage verifier pipeline for generating pseudo-labels in grounding (coarse filter, per-crop verification, deduplicatio
- The paper is well written and clearly structured. It effectively presents both the high-level motivation and the technical components of the method, with helpful examples that aid comprehension.
- It tackles an important and timely problem in visual reasoning, particularly the issue of grounding, which remains a key bottleneck for reliable program-based reasoning systems.
- The annotation-free design is compelling and addresses an important challenge in scaling visual reasoning systems.
- My only concern lies in the VLM verifier quality. Since the proposed method uses the VLM to generate the training data for grounding, the VLM may itself produce imperfect outputs. Do you think the fine-tuned model can ever outperform the VLM that labeled the data? Did you compare VALOR against GPT-5-mini at some point, as in Table 4? It might be helpful to discuss this topic more explicitly in the paper. For example, should the long-term strategy be to continually use the strongest available V
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Topic Modeling
