Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
Tianyi Zhang, Mahtab Bigverdi, Ranjay Krishna

TL;DR
This paper introduces the Ablate-to-Validate principle and Token Replacement Test (TRT) to rigorously assess whether vision-language models genuinely utilize continuous thought tokens for reasoning, revealing that many models do not rely on token content as assumed.
Contribution
The paper formalizes a diagnostic method, the Token Replacement Test (TRT), for testing whether vision-language models actually use their latent tokens, and demonstrates it across multiple models and systems.
Findings
Models often retain performance even when token content is replaced or corrupted.
Gains from continuous tokens may stem from confounds rather than actual reasoning.
TRT reveals a gap between token presence and genuine utilization in models.
Abstract
Vision-language models (VLMs) are increasingly augmented with continuous or latent non-textual tokens intended to support "visual thinking." Improved task accuracy alone, however, does not show that models actually use these tokens for reasoning: gains may arise from confounds such as added context length, special-token anchoring, or training-time regularization. We formalize a diagnostic principle, Ablate-to-Validate, for testing whether latent-token content is genuinely utilized, and instantiate it as the Token Replacement Test (TRT), a standardized suite of content-replacement ablations. TRT holds the prompt, image, token budget, and decoding fixed while replacing intermediate tokens with zero, random, first-repeat, or oracle alternatives, isolating whether performance depends on token content or merely on token presence. As a controlled testbed, we study relative depth…
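The four replacement conditions named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `replace_latent_tokens`, the list-of-lists token representation, and the exact meaning of each mode are assumptions made for illustration: "random" is taken to be standard Gaussian noise, "first_repeat" is taken to mean copying the first latent token into every position, and "oracle" is taken to be an externally supplied reference sequence. In every mode the output has the same number of tokens and the same dimensionality as the input, so the token budget stays fixed.

```python
import random
from typing import List, Optional

Tokens = List[List[float]]  # one row per latent token, one column per hidden dimension


def replace_latent_tokens(
    tokens: Tokens,
    mode: str,
    oracle: Optional[Tokens] = None,
    seed: int = 0,
) -> Tokens:
    """Return a replacement for `tokens` with identical shape (hypothetical helper).

    Mode semantics are assumptions, not taken from the paper:
      zero         -> every entry set to 0.0
      random       -> i.i.d. standard Gaussian noise, seeded for repeatability
      first_repeat -> the first token copied into every position
      oracle       -> caller-supplied tokens of the same shape
    """
    if not tokens or not tokens[0]:
        raise ValueError("tokens must be a non-empty 2-D list")
    n_tokens, dim = len(tokens), len(tokens[0])

    if mode == "zero":
        return [[0.0] * dim for _ in range(n_tokens)]
    if mode == "random":
        rng = random.Random(seed)
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
    if mode == "first_repeat":
        return [list(tokens[0]) for _ in range(n_tokens)]
    if mode == "oracle":
        if oracle is None:
            raise ValueError("oracle mode requires oracle tokens")
        if len(oracle) != n_tokens or any(len(row) != dim for row in oracle):
            raise ValueError("oracle tokens must match the original shape")
        return [list(row) for row in oracle]
    raise ValueError(f"unknown mode: {mode!r}")
```

In a full test, each replaced sequence would be fed to the model in place of its own latent tokens, with the prompt, image, and decoding settings unchanged, and accuracy would be compared across conditions. If accuracy barely moves under the zero, random, or first-repeat conditions, that points to the model relying on token presence rather than token content.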
