Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs
Jintao Tong, Jiaqi Gu, Yujing Lou, Lubin Fan, Yixiong Zou, Yue Wu, Jieping Ye, Ruixuan Li

TL;DR
SkiLa introduces a unified reasoning framework for multimodal models that enables them to generate and incorporate continuous visual embeddings during multi-step reasoning, improving performance on vision-centric tasks.
Contribution
It presents SkiLa, a novel paradigm allowing MLLMs to natively generate latent visual tokens, unifying visual and textual reasoning without predefined toolkits.
Findings
Outperforms existing models on vision-centric tasks
Demonstrates strong generalization across multimodal benchmarks
Enables seamless integration of visual imagination in reasoning processes
Abstract
While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Current works rely on predefined external toolkits or generate images during thinking. Humans, by contrast, can form flexible visual-text imagination and interactions during thinking without predefined toolkits; one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, and given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally all visual imagination processes can be encoded by latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection
