Mull-Tokens: Modality-Agnostic Latent Thinking

Arijit Ray; Ahmed Abdelkader; Chengzhi Mao; Bryan A. Plummer; Kate Saenko; Ranjay Krishna; Leonidas Guibas; Wen-Sheng Chu

arXiv:2512.10941 · cs.CV · May 1, 2026

TL;DR

Mull-Tokens are modality-agnostic latent tokens that let a model hold intermediate reasoning in either image or text form, without calling specialist tools, generating images, or relying on handcrafted reasoning data. They improve performance on spatial reasoning tasks.

Contribution

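The paper proposes Mull-Tokens, a pre-trained latent-token framework for free-form multimodal reasoning. The tokens are first trained with supervision from interleaved text-image traces, then fine-tuned using only the final answers. The approach is reported to outperform existing methods.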

Findings

01

Mull-Tokens improve scores on spatial reasoning benchmarks by 3% on average.

02

Mull-Tokens achieve up to a 16% improvement on puzzle-solving tasks.

03

Mull-Tokens reason across image and text modalities, and the final fine-tuning stage needs no supervision beyond the final answers.

Abstract

Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative -- Mull-Tokens -- modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities to let the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging…
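The second training stage mentioned in the abstract, fine-tuning with only the final answers, can be pictured as a loss mask over a sequence that contains latent slots. What follows is a minimal sketch under stated assumptions, not the authors' implementation: the `<mull>` token name, the number of latent slots, and the helpers `build_example` and `masked_mean` are all hypothetical, and the first stage (supervision from interleaved text-image traces) is not modeled here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder name for one latent "thinking" slot.
MULL = "<mull>"


@dataclass
class Example:
    tokens: List[str]
    loss_mask: List[int]  # 1 = position contributes to the answer loss


def build_example(prompt: List[str], answer: List[str], num_latents: int) -> Example:
    """Lay out prompt, latent slots, then answer; supervise only the answer."""
    if num_latents < 0:
        raise ValueError("num_latents must be non-negative")
    tokens = prompt + [MULL] * num_latents + answer
    # Prompt and latent positions get no direct supervision in this sketch.
    loss_mask = [0] * (len(prompt) + num_latents) + [1] * len(answer)
    return Example(tokens, loss_mask)


def masked_mean(per_token_loss: List[float], mask: List[int]) -> float:
    """Average per-token losses over the supervised positions only."""
    if len(per_token_loss) != len(mask):
        raise ValueError("loss and mask must have the same length")
    count = sum(mask)
    if count == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / count


example = build_example(["Which", "block", "fits", "?"], ["B"], num_latents=3)
```

In this layout the latent slots receive gradient only through their effect on the answer positions, which is one way to read "without any supervision" on the intermediate thoughts.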

Code & Models

Repositories

arijitray1993/mull (GitHub)
