TL;DR
This paper evaluates how internal reasoning traces, called thought streams, influence video scene understanding in Gemini vision-language models, revealing that more reasoning yields diminishing returns and highlighting differences between model versions.
Contribution
It introduces three evaluation metrics for reasoning in vision-language models (Contentfulness, Thought-Final Coverage, and Dominant Entity Analysis) and analyzes how thought streams affect scene understanding and model behavior.
Findings
Quality improvements plateau after a few hundred tokens.
Flash Lite balances quality and token efficiency effectively.
When the reasoning budget is constrained, models sometimes hallucinate content in the final output that does not appear in the thought stream.
Abstract
We benchmark how internal reasoning traces, which we call thought streams, affect video scene understanding in vision-language models. Using four configurations of Google's Gemini 2.5 Flash and Flash Lite across scenes extracted from 100 hours of video, we ask three questions: does more thinking lead to better outputs, where do the gains stop, and what do these models actually think about? We introduce three evaluation metrics. Contentfulness measures how much of the thought stream is useful scene content versus meta-commentary. Thought-Final Coverage measures how faithfully the thought stream translates into the final output. Dominant Entity Analysis identifies which subjects, actions, and settings the model focuses on. GPT-5 serves as an independent judge. We find that quality gains from additional thinking plateau quickly, with most improvement occurring in the first few hundred…
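As a rough illustration of the first two metrics, the sketch below treats Contentfulness as the share of thought-stream words that sit in segments labeled as scene content, and Thought-Final Coverage as the share of items mentioned in the thought stream that also appear in the final output. These formulas, the `Segment` type, and the function names are assumptions made for illustration only. The abstract does not give the paper's exact definitions, and in the paper the judgments come from GPT-5 rather than from hand-written labels.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One piece of a thought stream (hypothetical structure)."""
    text: str
    is_scene_content: bool  # assumed label, e.g. assigned by a judge model


def contentfulness(segments: list[Segment]) -> float:
    """Fraction of thought-stream words that are scene content (assumed formula)."""
    total = sum(len(s.text.split()) for s in segments)
    if total == 0:
        return 0.0
    useful = sum(len(s.text.split()) for s in segments if s.is_scene_content)
    return useful / total


def thought_final_coverage(thought_items: set[str], final_items: set[str]) -> float:
    """Fraction of thought-stream items that reach the final output (assumed formula)."""
    if not thought_items:
        return 0.0
    return len(thought_items & final_items) / len(thought_items)


def unreasoned_items(thought_items: set[str], final_items: set[str]) -> set[str]:
    """Items in the final output that never appeared in the thought stream."""
    return final_items - thought_items


# Toy inputs, invented for this example only.
segments = [
    Segment("A dog runs across the yard", True),
    Segment("Let me think about this", False),
]
thought = {"dog", "yard", "ball"}
final = {"dog", "yard", "car"}

print(contentfulness(segments))
print(thought_final_coverage(thought, final))
print(unreasoned_items(thought, final))
```

Under these assumptions, a low coverage score means much of the reasoning was dropped before the final answer, while a non-empty set from `unreasoned_items` flags final-output content that the thought stream never mentioned.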
