CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

Ziyang Ding; Linjian Meng; Yiming Wu; Yuhan Li; Yuhao Liu; Zhen Zhao

arXiv:2605.08802·cs.CV·May 13, 2026


TL;DR

CoLVR introduces a contrastive training framework that makes the latent visual reasoning of multimodal large language models more exploratory, with reported improvements of 5.83% on VSP, 8.00% on Jigsaw, and 3.40% on out-of-domain benchmarks.

Contribution

It proposes a novel latent contrastive training method and a trajectory contrastive reward to foster diverse reasoning behaviors in latent visual reasoning.

Findings

01 Achieved 5.83% improvement on VSP

02 Achieved 8.00% improvement on Jigsaw

03 Outperformed existing models on out-of-domain benchmarks with 3.40% gain

Abstract

Because Latent Visual Reasoning has the potential for exploratory reasoning, recent works enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives that force latent representations to match predefined visual features, which severely limits the exploratory capacity of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning), a latent contrastive training framework for more exploratory visual reasoning. First, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embeddings.…
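The abstract does not spell out the latent contrastive objective, so the following is only a minimal sketch of the general idea, not the CoLVR implementation. It assumes that a positive sample is built by rotating a latent vector by a fixed angle in a random direction, and that the loss is a standard InfoNCE over cosine similarities. The names `perturb_by_angle` and `info_nce`, the temperature value, and the choice of negatives are all hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def perturb_by_angle(z, angle, rng):
    """Return a unit vector at `angle` radians from z (hypothetical helper).

    The rotation direction is random: Gaussian noise is projected onto the
    subspace orthogonal to z, then mixed with z using cos/sin of the angle.
    """
    u = normalize(z)
    noise = [rng.gauss(0.0, 1.0) for _ in z]
    proj = dot(noise, u)
    ortho = normalize([n - proj * x for n, x in zip(noise, u)])
    return [math.cos(angle) * x + math.sin(angle) * o for x, o in zip(u, ortho)]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Standard InfoNCE loss for one anchor, using cosine similarity.

    This is an assumed stand-in for the paper's objective, which the
    truncated abstract does not define.
    """
    a = normalize(anchor)
    sims = [dot(a, normalize(v)) / temperature
            for v in [positive] + list(negatives)]
    m = max(sims)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_denominator - sims[0]


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(16)]
negatives = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
positive = perturb_by_angle(z, 0.3, rng)
loss = info_nce(z, positive, negatives)
```

Under these assumptions the angle acts as a softness knob: a positive at a small angle yields a lower loss than one at a large angle for the same negatives, so the latent is pulled toward a neighborhood rather than toward a single fixed target as a hard alignment objective would do.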

Code & Models

Repositories

Oscar-dzy/CoLVR
github