Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs

Jintao Tong; Jiaqi Gu; Yujing Lou; Lubin Fan; Yixiong Zou; Yue Wu; Jieping Ye; Ruixuan Li

arXiv:2512.16584 · cs.CV · December 19, 2025

Open Access

TL;DR

SkiLa is a unified reasoning framework for multimodal models: it lets them generate continuous visual embeddings and feed them back into multi-step reasoning, improving performance on vision-centric tasks.

Contribution

The paper presents SkiLa, a paradigm in which MLLMs natively generate latent visual tokens, unifying visual and textual reasoning without predefined toolkits.
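To make the idea of interleaving latent visual tokens with text tokens concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. The names `toy_hidden`, `nearest_token`, `interleaved_decode`, the `<sketch>` vocabulary entry, and the fixed mode schedule are all hypothetical. The only point it illustrates is that, in a shared feature space, a decoding step can append either the embedding of a sampled text token or the continuous hidden vector itself to the context.

```python
import math
from typing import List, Optional, Tuple

Vector = List[float]
DIM = 4
# Hypothetical toy vocabulary and embedding table (one DIM-sized row per token).
VOCAB = ["<sketch>", "the", "cube", "rotates", "<eos>"]
EMBED = [[math.sin(i * DIM + j + 1.0) for j in range(DIM)] for i in range(len(VOCAB))]


def toy_hidden(context: List[Vector]) -> Vector:
    """Stand-in for a transformer forward pass: a squashed mean of the context."""
    n = len(context)
    return [math.tanh(sum(v[j] for v in context) / n + 0.1 * n) for j in range(DIM)]


def nearest_token(hidden: Vector) -> int:
    """Text head: index of the vocabulary row with the highest dot-product score."""
    scores = [sum(h * e for h, e in zip(hidden, row)) for row in EMBED]
    return max(range(len(VOCAB)), key=scores.__getitem__)


def interleaved_decode(
    prompt_ids: List[int], schedule: List[str]
) -> Tuple[List[Tuple[str, Optional[str]]], List[Vector]]:
    """Decode one step per schedule entry ("text" or "latent").

    Assumption: the mode schedule is supplied by the caller. How a real model
    decides when to switch modes is not specified in the text above.
    """
    if not prompt_ids:
        raise ValueError("prompt_ids must not be empty")
    context = [EMBED[i] for i in prompt_ids]
    trace: List[Tuple[str, Optional[str]]] = []
    for mode in schedule:
        hidden = toy_hidden(context)
        if mode == "latent":
            # Latent step: feed the continuous vector back in, with no discretization.
            context.append(hidden)
            trace.append(("latent", None))
        elif mode == "text":
            # Text step: pick a discrete token, then append its embedding.
            tok = nearest_token(hidden)
            context.append(EMBED[tok])
            trace.append(("text", VOCAB[tok]))
        else:
            raise ValueError(f"unknown mode: {mode}")
    return trace, context


if __name__ == "__main__":
    trace, context = interleaved_decode([1, 2], ["text", "latent", "latent", "text"])
    print(trace)
    print(len(context), "vectors in the shared context")
```

Both branches append a vector of the same dimension to one context list, which is the toy analogue of visual and text information living in the same feature space. A real system would replace `toy_hidden` with the MLLM forward pass and would need a training signal for the latent steps, neither of which is modeled here.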

Findings

01 Outperforms existing models on vision-centric tasks

02 Demonstrates strong generalization across multi-modal benchmarks

03 Enables seamless integration of visual imagination in reasoning processes

Abstract

While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Current works either rely on predefined external toolkits or generate images during thinking. Humans, by contrast, can form flexible visual-text imagination and interactions during thinking without predefined toolkits, one important reason being that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, and given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, so that, ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection