Monet: Reasoning in Latent Visual Space Beyond Images and Language

Qixun Wang; Yang Shi; Yifei Wang; Yuanxing Zhang; Pengfei Wan; Kun Gai; Xianghua Ying; Yisen Wang

arXiv:2511.21395 · cs.CV · December 1, 2025

TL;DR

Monet introduces a novel training framework enabling multimodal large language models to perform reasoning directly within the latent visual space using continuous embeddings, enhancing abstract visual reasoning capabilities.

Contribution

The paper presents Monet, a training pipeline that combines a three-stage distillation-based supervised fine-tuning (SFT) process with reinforcement learning to improve latent visual reasoning in multimodal models.

Findings

01. Monet-7B outperforms existing models on perception and reasoning benchmarks.

02. The approach achieves strong out-of-distribution generalization.

03. The dataset Monet-SFT-125K supports effective training for latent visual reasoning.

Abstract

"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of…
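
To make the idea of "continuous embeddings that function as intermediate visual thoughts" concrete, here is a minimal toy sketch of how a latent step can differ from an ordinary text step. It is not the paper's implementation: the feedback rule (appending the last hidden state directly to the input sequence), the tiny random "model", and every name in the block (`toy_hidden_state`, `decode_token`, `generate`, `DIM`, `VOCAB`) are illustrative assumptions.

```python
import math
import random

DIM = 4    # toy hidden size (assumption, not from the paper)
VOCAB = 6  # toy vocabulary size (assumption, not from the paper)

rng = random.Random(0)
# Hypothetical toy parameters standing in for a real MLLM.
EMBED = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
W = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]


def toy_hidden_state(inputs):
    """Stand-in for a forward pass: returns one hidden vector for the sequence."""
    mean = [sum(vec[i] for vec in inputs) / len(inputs) for i in range(DIM)]
    return [math.tanh(sum(W[r][c] * mean[c] for c in range(DIM)))
            for r in range(DIM)]


def decode_token(hidden):
    """Text step: pick the vocabulary entry whose embedding scores highest."""
    scores = [sum(h * e for h, e in zip(hidden, emb)) for emb in EMBED]
    return max(range(VOCAB), key=scores.__getitem__)


def generate(prompt_ids, num_latent_steps, num_text_steps):
    """Run latent steps first, then ordinary token decoding."""
    inputs = [EMBED[t] for t in prompt_ids]
    latents, tokens = [], []
    for _ in range(num_latent_steps):
        hidden = toy_hidden_state(inputs)
        latents.append(hidden)
        inputs.append(hidden)        # continuous embedding fed back unchanged
    for _ in range(num_text_steps):
        token = decode_token(toy_hidden_state(inputs))
        tokens.append(token)
        inputs.append(EMBED[token])  # discrete token looked up and re-embedded
    return latents, tokens


latents, tokens = generate(prompt_ids=[0, 1], num_latent_steps=3, num_text_steps=2)
print(len(latents), "latent vectors, then tokens", tokens)
```

The contrast to notice is in the two `inputs.append` calls: a latent step never passes through the vocabulary, so the intermediate "thought" is not restricted to something a token or an external tool could express, while a text step collapses the hidden state to a single discrete choice.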

Code & Models

Models

NOVAglow646/Monet-7B (model, 334 downloads, 4 likes)

Datasets

NOVAglow646/Monet-SFT-125K (dataset, 566 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection