Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens

Yiming Qin; Bomin Wei; Jiaxin Ge; Konstantinos Kallidromitis; Stephanie Fu; Trevor Darrell; XuDong Wang

arXiv:2511.19418 · cs.CV · December 2, 2025

TL;DR

This paper introduces COVT, a framework that enhances vision-language models by enabling reasoning with continuous visual tokens, leading to improved dense perceptual understanding and multimodal performance.

Contribution

COVT is a novel method that distills rich visual perceptual cues into compact tokens, allowing VLMs to reason visually in a continuous space with improved accuracy and interpretability.

Findings

01. COVT improves VLM performance by 3% to 16% across diverse benchmarks.

02. The framework enables dense visual reasoning with high efficiency.

03. Integrating COVT enhances interpretability and grounded multimodal understanding.

Abstract

Vision-Language Models (VLMs) excel at reasoning in linguistic space but struggle with perceptual understanding that requires dense visual perception, e.g., spatial reasoning and geometric awareness. This limitation stems from the fact that current VLMs have limited mechanisms to capture dense visual information across spatial dimensions. We introduce Chain-of-Visual-Thought (COVT), a framework that enables VLMs to reason not only in words but also through continuous visual tokens: compact latent representations that encode rich perceptual cues. Within a small budget of roughly 20 tokens, COVT distills knowledge from lightweight vision experts, capturing complementary properties such as 2D appearance, 3D geometry, spatial layout, and edge structure. During training, the VLM with COVT autoregressively predicts these visual tokens to reconstruct dense supervision signals (e.g., depth,…
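
To make the training idea in the abstract concrete, here is a toy sketch of reconstructing a dense signal from a small budget of continuous tokens. It is not the authors' implementation: the token width, the linear decoder, the 4x4 target map, the squared-error loss, and every name in the code are assumptions chosen for illustration. Only the budget of roughly 20 tokens comes from the abstract.

```python
import random

NUM_VISUAL_TOKENS = 20  # "roughly 20 tokens", per the abstract
TOKEN_DIM = 8           # hypothetical latent width (assumption)
MAP_H, MAP_W = 4, 4     # hypothetical tiny dense target, e.g. a depth map


def make_decoder(rng):
    """Hypothetical linear decoder: one weight row per output pixel."""
    n_in = NUM_VISUAL_TOKENS * TOKEN_DIM
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
            for _ in range(MAP_H * MAP_W)]


def decode(tokens, weights):
    """Flatten the continuous tokens and project them to a dense map."""
    flat = [value for token in tokens for value in token]
    return [sum(w * x for w, x in zip(row, flat)) for row in weights]


def mse(pred, target):
    """Mean squared error, standing in for a reconstruction loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


rng = random.Random(0)
# Stand-in for the continuous visual tokens a VLM would predict.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(TOKEN_DIM)]
          for _ in range(NUM_VISUAL_TOKENS)]
weights = make_decoder(rng)
# Stand-in for the dense supervision signal from a vision expert.
target = [rng.random() for _ in range(MAP_H * MAP_W)]

pred = decode(tokens, weights)
loss = mse(pred, target)
print(len(tokens), len(pred), loss >= 0.0)
```

In a real system the tokens would come from the VLM's hidden states and the decoder would be a learned module trained jointly. The sketch only shows the shape of the objective: a few latent vectors in, a dense map out, and a reconstruction error to minimize.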

Code & Models

Datasets

Wakals/CoVT-Dataset
dataset · 2.2k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications