When to Think and When to Look: Uncertainty-Guided Lookback

Jing Bi; Filippos Bellos; Junjia Guo; Yayuan Li; Chao Huang; Yolo Y. Tang; Luchuan Song; Susan Liang; Zhongfei Mark Zhang; Jason J. Corso; Chenliang Xu

arXiv:2511.15613·cs.CV·March 30, 2026


TL;DR

This paper analyzes how explicit reasoning chains affect the performance of large vision-language models (LVLMs). It finds that short, image-referential lookback phrases improve reasoning accuracy, and it proposes an uncertainty-guided decoding method that improves results across multiple benchmarks.

Contribution

The paper provides the first systematic analysis of how test-time thinking affects visual reasoning in LVLMs, identifies short lookback strategies that are effective, and introduces an uncertainty-guided decoding approach that improves visual reasoning performance.

Findings

01 Long reasoning chains often yield long, wrong trajectories that ignore the image and underperform the same models run in standard instruct mode.

02 Short, image-referential lookback phrases improve reasoning accuracy.

03 Uncertainty-guided lookback improves performance across multiple benchmarks.

Abstract

Test-time thinking (that is, generating explicit intermediate reasoning chains) is known to boost performance in large language models and has recently shown strong gains for large vision-language models (LVLMs). However, despite these promising results, there is still no systematic analysis of how thinking actually affects visual reasoning. We provide the first such analysis with a large-scale, controlled comparison of thinking for LVLMs, evaluating ten variants from the InternVL3.5 and Qwen3-VL families on MMMU-val under generous token budgets and multi-pass decoding. We show that more thinking is not always better; long chains often yield long wrong trajectories that ignore the image and underperform the same models run in standard instruct mode. A deeper analysis reveals that certain short lookback phrases, which explicitly refer back to the image, are strongly enriched in…
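
The abstract is truncated before it describes the decoding method, so the following is only a toy sketch of one plausible reading of "uncertainty-guided lookback", not the authors' algorithm. It assumes that uncertainty is measured as the Shannon entropy of the next-token distribution and that a fixed lookback phrase is injected when that entropy crosses a threshold. The names `token_entropy`, `should_look_back`, `decode_with_lookback`, the phrase text, the threshold of 1.0 nats, and the cap on lookbacks are all hypothetical choices made for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_look_back(probs, threshold=1.0):
    """Assumed trigger: look back when entropy exceeds a hand-picked threshold."""
    return token_entropy(probs) > threshold


def decode_with_lookback(
    step_distributions,
    tokens,
    lookback_phrase="Looking back at the image,",  # hypothetical phrase
    threshold=1.0,  # hypothetical threshold in nats
    max_lookbacks=1,  # hypothetical cap to keep the chain short
):
    """Replay a precomputed token sequence, injecting the lookback phrase
    before any step whose distribution is too uncertain.

    A real decoder would re-run the model after injecting the phrase, since
    the new context changes every later distribution. This toy version only
    shows where the trigger would fire.
    """
    output = []
    used = 0
    for probs, token in zip(step_distributions, tokens):
        if used < max_lookbacks and should_look_back(probs, threshold):
            output.append(lookback_phrase)
            used += 1
        output.append(token)
    return output


if __name__ == "__main__":
    confident = [0.97, 0.01, 0.01, 0.01]
    uncertain = [0.25, 0.25, 0.25, 0.25]
    steps = [confident, uncertain, uncertain]
    print(decode_with_lookback(steps, ["The", "answer", "is"]))
```

In this sketch the cap on lookbacks reflects the finding that short lookbacks help while long chains can hurt; the actual trigger, phrase set, and budget used in the paper may differ.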
