Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation
Yuhan Liu, Lianhui Qin, Shengjie Wang

TL;DR
The paper introduces Speculative Verdict, a training-free framework that combines lightweight draft experts and a verdict model to improve reasoning over dense, information-rich images efficiently and accurately.
Contribution
It proposes a novel speculative decoding-inspired method that enhances visual reasoning by integrating multiple lightweight drafts with a strong verdict model without additional training.
Findings
Achieves consistent improvements on multiple visual question answering benchmarks.
Reduces computational cost while maintaining high accuracy.
Effectively synthesizes correct insights from multiple reasoning paths.
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a…
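The two stages described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable type, the `Draft` record, the function names, and the prompt wording are all hypothetical, and the additional mechanism that the truncated abstract begins to describe is not modeled here. A "model" is assumed to be any callable that takes an image and a prompt and returns text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: any callable mapping (image bytes, prompt) -> text.
Model = Callable[[bytes, str], str]


@dataclass
class Draft:
    expert: str     # name of the small VLM that produced this path
    reasoning: str  # its free-form reasoning path, including its answer


def draft_stage(experts: Dict[str, Model], image: bytes, question: str) -> List[Draft]:
    """Ask every lightweight draft expert for a reasoning path."""
    prompt = (
        f"Question: {question}\n"
        "Reason step by step and name the image regions you rely on."
    )
    return [Draft(name, model(image, prompt)) for name, model in experts.items()]


def verdict_stage(verdict: Model, image: bytes, question: str, drafts: List[Draft]) -> str:
    """Make one call to the strong model, which sees all drafts and answers."""
    paths = "\n\n".join(f"[{d.expert}]\n{d.reasoning}" for d in drafts)
    prompt = (
        f"Question: {question}\n\n"
        f"Candidate reasoning paths from draft experts:\n{paths}\n\n"
        "Check the paths against the image and give the final answer."
    )
    return verdict(image, prompt)


def speculative_verdict(
    experts: Dict[str, Model], verdict: Model, image: bytes, question: str
) -> str:
    """Draft with small models, then synthesize with a single verdict call."""
    drafts = draft_stage(experts, image, question)
    return verdict_stage(verdict, image, question, drafts)
```

In this sketch the large model is invoked once per question, while the small models supply the candidate localizations and reasoning. Real VLM clients would replace the callables.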
Peer Reviews
Decision: ICLR 2026 Poster
- The investigated problem of solving dense-layout image reasoning tasks using ensemble learning is of great practical value. - The performance is promising, surpassing tool-based methods such as DeepEyes.
- Unclear Connection to Speculative Decoding: The paper's framing as "speculative decoding" is confusing. Traditional speculative decoding aims at inference acceleration, whereas this work operates more as an LLM-as-a-Judge paradigm where candidate answers are evaluated by a verdict model. The paper lacks discussion and comparison with existing judging frameworks (e.g., [1, 2]), which weakens its positioning within the literature. - Limited Technical Contribution: Viewing this work through the L
1) This paper accurately pinpoints VLMs’ core flaws in information-intensive images—poor dense cue localization and error-prone multi-hop reasoning—and clarifies limitations of existing solutions, ensuring relevance. 2) The “Draft-Verdict” two-stage structure (lightweight experts for coverage + large VLM for synthesis) and consensus selection balance accuracy and efficiency, with clear alignment to solving target challenges. 3) Experiments on diverse benchmarks (InfographicVQA, HR-Bench 4K) and
1) How does the proposed method compare with specialized models? 2) Inference speed is not reported in the experiments section. Does the method add much computational cost over the baseline and thus slow down inference? If so, could you report the speed? 3) On HR-Bench 4K, SV w/ GPT-4o Verdict performs worse than SV w/ Qwen2.5-VL-72B-Instruct Verdict, and even worse than several open-source VLMs. Please explain why.
- Originality: The paper presents a novel adaptation of speculative decoding for visual reasoning quality improvement rather than its original purpose of inference acceleration. - Quality: 1. The proposed approach is effective. 2. The experimental evaluation is comprehensive, covering multiple benchmarks and comparing against strong baselines. 3. The ablation studies provide insights into the importance of different components. - Clarity: 1. The paper is well-structured and clearl
1. The tables lack clarity: they offer no direct comparison between the baselines and SV, such as the increment from GPT-4o (line 332) to SV+4o (line 341) or the increment on Qwen2.5VL-72B. Readers therefore cannot easily compare the performance gain across base models without referring to other places in the paper (such as line 377). Although I like the figures and plots, the tables really need improving. 2. Limited analysis of computational effi
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks
