Auto-Encoding Morph-Tokens for Multimodal LLM
Kaihang Pan, Siliang Tang, Juncheng Li, Zhaoyu Fan, Wei Chow, Shuicheng Yan, Tat-Seng Chua, Yueting Zhuang, Hanwang Zhang

TL;DR
This paper introduces morph-tokens, a novel encoding for images in multimodal LLMs, enabling simultaneous state-of-the-art performance in visual comprehension and image generation tasks.
Contribution
It proposes a dual-purpose encoding scheme for visual tokens that resolves the conflict between abstracting visuals (for comprehension) and preserving them (for generation) in multimodal LLMs, achieving new state-of-the-art results on both.
Findings
Morph-tokens enable improved multimodal comprehension.
Morph-tokens facilitate high-quality image reconstruction.
Morph-tokens achieve a new SOTA for multimodal comprehension and generation simultaneously.
Abstract
For multimodal LLMs, the synergy of visual comprehension (textual output) and generation (visual output) presents an ongoing challenge. This is due to a conflicting objective: for comprehension, an MLLM needs to abstract the visuals; for generation, it needs to preserve the visuals as much as possible. Thus, the objective is a dilemma for visual-tokens. To resolve the conflict, we propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing MLLM to generate texts; for generation, they take on a different, non-conflicting role as complete visual-tokens for image reconstruction, where the missing visual cues are recovered by the MLLM. Extensive experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously. Our project is available at https://github.com/DCDmllm/MorphTokens.
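The dual role described in the abstract can be illustrated with a toy sketch. Everything in it is hypothetical and is not the authors' implementation: the function names, the averaging used as "abstraction", and the stored residual that stands in for the MLLM recovering missing visual cues are all assumptions chosen only so the example runs. It shows the data flow: an abstracted token sequence serves as a visual prompt for text, while a separate recovery step yields complete tokens for reconstruction.

```python
from typing import List


def abstract_tokens(visual: List[float], group: int = 2) -> List[float]:
    """Toy abstraction (assumption): average each group of values, discarding detail."""
    return [
        sum(visual[i:i + group]) / len(visual[i:i + group])
        for i in range(0, len(visual), group)
    ]


def build_prompt(morph_tokens: List[float], instruction: str) -> List[str]:
    """Comprehension role: the tokens are placed before the text as a visual prompt."""
    image_part = ["<img>"] + [f"m{t:.1f}" for t in morph_tokens] + ["</img>"]
    return image_part + instruction.split()


def recover_tokens(morph_tokens: List[float], residual: List[float],
                   group: int = 2) -> List[float]:
    """Generation role: stand-in for the MLLM recovering the missing cues.

    Assumption: the "recovered" detail is just a stored residual added back
    to the upsampled tokens. A real model would have to predict it.
    """
    upsampled = [t for t in morph_tokens for _ in range(group)]
    return [u + r for u, r in zip(upsampled, residual)]


# Hypothetical "visual features" for a tiny image.
visual = [1.0, 3.0, 2.0, 6.0]

# Abstracted tokens lose detail, which suits comprehension but not generation.
morph = abstract_tokens(visual)
upsampled = [t for t in morph for _ in range(2)]
residual = [v - u for v, u in zip(visual, upsampled)]

prompt = build_prompt(morph, "Describe the image")
complete = recover_tokens(morph, residual)
```

In this sketch, `morph` alone cannot reproduce `visual`, which mirrors the conflict the abstract describes; only after the recovery step do the tokens carry enough detail for reconstruction.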
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
