Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Donghwan Chi; Hyomin Kim; Yoonjin Oh; Yongjin Kim; Donghoon Lee; Daejin Jo; Jongmin Kim; Junyeob Baek; Sungjin Ahn; Sungwoong Kim

arXiv:2505.17726 · cs.CV · May 20, 2026

TL;DR

This paper introduces Slot-MLLM, an object-centric visual tokenizer that enhances multimodal LLMs by encoding detailed local visual information aligned with semantics, improving performance on vision-language tasks.

Contribution

It presents the first object-centric slot attention-based visual tokenizer integrated with MLLMs, enabling detailed visual understanding and generation at the object level.

Findings

01. Significant performance improvements over previous visual tokenizers.

02. First demonstration of object-centric slot attention with MLLMs on natural images.

03. Effective encoding of local visual details while maintaining high-level semantics.

Abstract

Recently, multimodal large language models (MLLMs) have emerged as a key approach to achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively as both input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can…
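The abstract names residual vector quantization as one of the components behind the discretized slot tokens. As a rough illustration only, and not the authors' implementation, the sketch below shows the general idea in plain Python: each stage picks the nearest entry in its own codebook, and the next stage quantizes whatever residual is left. The function names, the two-dimensional toy "slot", and the hand-made codebook values are all assumptions for the example; in a real tokenizer the codebooks would be learned and the vectors much larger.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage codes the residual left by earlier stages."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage sees only the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical, hand-made codebooks (not learned): a coarse stage, then a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

slot = [1.2, 0.9]                     # stand-in for one continuous slot vector
codes = rvq_encode(slot, codebooks)   # one discrete index per stage
recon = rvq_decode(codes, codebooks)  # approximate reconstruction of the slot
print(codes, recon)
```

In this toy setup each slot becomes a short tuple of integer indices, which is the kind of discrete token an LLM vocabulary can be extended with; adding stages trades a longer code for a smaller reconstruction error.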

Taxonomy

Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Diffusion · ALIGN