Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

Mohammadjavad Ahmadpour; Amirmahdi Meighani; Payam Taebi; Omid Ghahroodi; Amirmohammad Izadi; Mahdieh Soleymani Baghshah

arXiv:2512.11109·cs.LG·December 15, 2025

Open Access

TL;DR

This paper systematically evaluates test-time scaling in vision-language models, revealing its variable effectiveness depending on model type, task, and dataset, and highlighting the need for adaptive strategies.

Contribution

The paper provides the first comprehensive empirical analysis of test-time scaling in vision-language models across diverse benchmarks and both open-source and closed-source models.

Findings

01 Closed-source models consistently benefit from structured reasoning and iterative self-refinement.

02 Open-source models show inconsistent gains; external verification is the most reliable method, while iterative refinement often degrades performance.

03 Gains from test-time scaling (TTS) are dataset-dependent: clear on multi-step reasoning tasks but limited on perception tasks.
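To make the external-verification finding concrete, here is a minimal sketch of best-of-N selection, one common form of test-time scaling with an external verifier. It is an illustration only and not the paper's implementation: `best_of_n`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy functions stand in for a real VLM sampler and a real verifier model.

```python
from typing import Callable, List, Tuple


def best_of_n(
    generate: Callable[[str, int], str],
    verify: Callable[[str, str], float],
    prompt: str,
    n: int = 4,
) -> Tuple[str, List[float]]:
    """Sample n candidate answers and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Extra inference-time compute: draw n candidates instead of one.
    candidates = [generate(prompt, seed) for seed in range(n)]
    # The verifier is external: it scores each candidate independently.
    scores = [verify(prompt, candidate) for candidate in candidates]
    # Ties resolve to the earliest candidate with the top score.
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores


# Hypothetical stand-in for a VLM: returns a fixed answer per seed.
def toy_generate(prompt: str, seed: int) -> str:
    answers = ["3", "4", "5", "4"]
    return answers[seed % len(answers)]


# Hypothetical stand-in for a verifier: rewards one specific answer.
def toy_verify(prompt: str, answer: str) -> float:
    return 1.0 if answer == "4" else 0.0


answer, scores = best_of_n(
    toy_generate, toy_verify, "How many chairs are in the image?", n=4
)
print(answer, scores)
```

In this sketch the generator never sees the verifier's scores. That separation is what distinguishes external verification from iterative self-refinement, where the same model critiques and rewrites its own answer.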

Abstract

Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied to both open-source and closed-source VLMs across different benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)