Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM
Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, Sungwoong Kim

TL;DR
This paper introduces Slot-MLLM, an object-centric visual tokenizer that enhances multimodal LLMs by encoding detailed local visual information aligned with semantics, improving performance on vision-language tasks.
Contribution
The paper presents the first object-centric, Slot Attention-based visual tokenizer integrated with MLLMs, enabling detailed visual understanding and generation at the object level.
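To make the term "slot attention" concrete, the following is a minimal sketch of one generic Slot Attention update in plain Python. It is not the paper's tokenizer: the learned query/key/value projections, layer normalization, and recurrent slot update of the standard formulation are omitted, and the function name `slot_attention_step` and the toy feature vectors are assumptions made for illustration. The one property it keeps is the defining one: the softmax is taken over slots, so slots compete for each input feature.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def slot_attention_step(slots: List[Vector],
                        inputs: List[Vector]) -> Tuple[List[Vector], List[List[float]]]:
    """One simplified Slot Attention update (hypothetical helper, no learned weights)."""
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)

    # attn[i][j] = share of input i claimed by slot j.
    # The softmax runs over slots (index j), so each row sums to 1.
    attn: List[List[float]] = []
    for x in inputs:
        logits = [scale * dot(x, s) for s in slots]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        attn.append([e / total for e in exps])

    # Each slot becomes the attention-weighted mean of the inputs it claimed.
    new_slots: List[Vector] = []
    for j in range(len(slots)):
        weights = [attn[i][j] for i in range(len(inputs))]
        norm = sum(weights) + 1e-8
        new_slots.append([
            sum(w * x[d] for w, x in zip(weights, inputs)) / norm
            for d in range(dim)
        ])
    return new_slots, attn


# Toy data: two loose groups of 2-D "patch features" and two initial slots.
inputs = [[4.0, 0.0], [4.2, 0.1], [0.0, 4.0], [0.1, 4.2]]
slots = [[1.0, 0.0], [0.0, 1.0]]
new_slots, attn = slot_attention_step(slots, inputs)
```

In this toy run each slot drifts toward the group of features it already resembles most, which is the intuition behind calling the resulting tokens object-centric.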
Findings
Significant performance improvements over previous visual tokenizers on vision-language tasks.
First demonstration of object-centric slot attention with MLLMs on natural images.
Effective encoding of local visual details while maintaining high-level semantics.
Abstract
Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can…
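The abstract names residual vector quantization as the step that discretizes slot vectors. As a rough illustration of that general technique only, the sketch below quantizes one vector with a stack of codebooks, where each stage encodes whatever the previous stages left over. The codebook values, the two-stage depth, and the names `rvq_encode` and `rvq_decode` are assumptions chosen for the example; the paper's actual codebook sizes, training procedure, and losses are not described in this summary.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Codebook = List[Vector]


def nearest_code(x: Sequence[float], codebook: Codebook) -> int:
    """Index of the codebook entry closest to x in squared Euclidean distance."""
    def sq_dist(c: Sequence[float]) -> float:
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def rvq_encode(x: Vector, codebooks: List[Codebook]) -> Tuple[List[int], Vector]:
    """Return one index per stage plus the final unquantized residual."""
    residual = list(x)
    indices: List[int] = []
    for codebook in codebooks:
        k = nearest_code(residual, codebook)
        indices.append(k)
        # The next stage only sees what this stage failed to capture.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, residual


def rvq_decode(indices: List[int], codebooks: List[Codebook]) -> Vector:
    """Sum the selected entry from every stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical toy codebooks: a coarse stage followed by a finer stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
slot = [1.2, 0.9]
indices, residual = rvq_encode(slot, codebooks)
reconstruction = rvq_decode(indices, codebooks)
print(indices)         # [1, 1]
print(reconstruction)  # [1.25, 1.0]
```

The list of indices is the discrete token sequence for the vector, and adding more stages shrinks the residual at the cost of more tokens per slot.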
Taxonomy
Topics: Natural Language Processing Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need · Diffusion · ALIGN
