Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer
Yanan Li, Christina Yi Jin, Yuan Jin, Manli Luo, Tie Xu, Shuai Jiao, Wei He, Qing Zhang

TL;DR
This paper introduces One Tokenizer, a unified architecture for multimodal integration in large language models that eliminates the modality gap and improves performance on biological reasoning tasks.
Contribution
The paper provides a theoretical characterization of the modality gap and proposes a native architecture that maps all modalities into a shared token space, achieving zero-gap integration.
Findings
One Tokenizer outperforms encoder-based models on DNA-text tasks.
A unified token space enables deeper cross-modal reasoning.
Theoretical analysis shows that a unified vocabulary maintains a zero-gap state across all hidden layers.
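To make the notion of a gap between modalities concrete, the sketch below computes one commonly used proxy: the Euclidean distance between the mean embedding of each modality. This is an assumption for illustration only; the paper's formal definition of the modality gap is not reproduced on this page and may differ. The function names `centroid` and `centroid_gap` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the per-dimension mean of a non-empty set of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def centroid_gap(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Euclidean distance between the centroids of two embedding sets.

    Assumed proxy for a modality gap, not the paper's formal definition.
    """
    return math.dist(centroid(a), centroid(b))


# Toy 2-D embeddings (made-up numbers, not results from the paper).
text_embeddings = [[1.0, 0.0], [1.0, 2.0]]  # centroid (1, 1)
dna_embeddings = [[4.0, 5.0], [4.0, 5.0]]   # centroid (4, 5)

gap = centroid_gap(text_embeddings, dna_embeddings)
print(gap)  # 5.0
```

Under this proxy, two modalities whose hidden states share the same mean at a given layer have a gap of zero at that layer. Repeating the measurement on each layer's hidden states would give a per-layer profile.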
Abstract
A central challenge in developing Multimodal Large Language Models (MLLMs) is effectively integrating heterogeneous inputs into a cohesive reasoning engine. Current paradigms predominantly rely on modular architectures that introduce modality-specific encoders and cross-modal fusion mechanisms. However, these designs are fundamentally bottlenecked by a geometric modality gap, forcing the LLM to expend significant computational capacity on geometric reconciliation rather than deep cross-modal reasoning. In this work, we formally characterize this modality gap and theoretically demonstrate that native architectures, specifically those employing a unified vocabulary, intrinsically maintain a zero-gap state across all hidden layers. Guided by these theoretical findings, we propose One Tokenizer, a native architecture that maps all modalities directly into a shared token space. We…
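The idea of a unified vocabulary can be illustrated with a minimal sketch in which text tokens and DNA tokens draw their ids from a single table, so both would index the same embedding matrix with no modality-specific encoder in between. Everything here is an assumption for illustration: the k-mer length `DNA_K`, the `<dna:...>` token format, whitespace word splitting, and the function names `build_vocab` and `encode` are hypothetical and are not taken from the paper.

```python
from itertools import product
from typing import Dict, List, Tuple

DNA_K = 3  # assumed k-mer length; the paper's DNA tokenization is not stated here


def build_vocab(text_words: List[str], k: int = DNA_K) -> Dict[str, int]:
    """Build one id table covering both text words and DNA k-mers."""
    vocab: Dict[str, int] = {"<unk>": 0}
    for word in text_words:
        vocab.setdefault(word, len(vocab))
    # All 4**k DNA k-mers get ids from the same contiguous range as the text.
    for kmer in product("ACGT", repeat=k):
        vocab.setdefault("<dna:" + "".join(kmer) + ">", len(vocab))
    return vocab


def encode(segments: List[Tuple[str, str]], vocab: Dict[str, int], k: int = DNA_K) -> List[int]:
    """Map a mixed sequence of (modality, content) segments to one id sequence."""
    ids: List[int] = []
    for modality, content in segments:
        if modality == "text":
            tokens = content.lower().split()
        elif modality == "dna":
            # Non-overlapping k-mers; a trailing remainder shorter than k is dropped.
            tokens = [f"<dna:{content[i:i + k]}>" for i in range(0, len(content) - k + 1, k)]
        else:
            raise ValueError(f"unknown modality: {modality}")
        ids.extend(vocab.get(token, vocab["<unk>"]) for token in tokens)
    return ids


vocab = build_vocab(["is", "this", "a", "promoter"])
ids = encode([("text", "Is this a promoter"), ("dna", "ACGTTT")], vocab)
print(ids)  # [1, 2, 3, 4, 11, 68]
```

Because every id comes from the same table, a downstream model would need only a single embedding lookup for the whole mixed sequence. An encoder-based design would instead pass the DNA segment through a separate encoder and then project its output into the language model's input space.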
