Auto-Encoding Morph-Tokens for Multimodal LLM

Kaihang Pan; Siliang Tang; Juncheng Li; Zhaoyu Fan; Wei Chow; Shuicheng Yan; Tat-Seng Chua; Yueting Zhuang; Hanwang Zhang

arXiv:2405.01926 · cs.CV · May 6, 2024

TL;DR

This paper introduces morph-tokens, a novel encoding for images in multimodal LLMs, enabling simultaneous state-of-the-art performance in visual comprehension and image generation tasks.

Contribution

It proposes a dual-purpose encoding scheme for visual tokens that resolves conflicting objectives in multimodal LLMs, achieving new state-of-the-art results.

Findings

01

Morph-tokens enable improved multimodal comprehension.

02

Morph-tokens facilitate high-quality image reconstruction.

03

Morph-tokens achieve a new SOTA for multimodal comprehension and generation simultaneously.

Abstract

For multimodal LLMs, the synergy of visual comprehension (textual output) and generation (visual output) presents an ongoing challenge. This is due to a conflicting objective: for comprehension, an MLLM needs to abstract the visuals; for generation, it needs to preserve the visuals as much as possible. Thus, the objective poses a dilemma for visual-tokens. To resolve the conflict, we propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing the MLLM to generate text; for generation, they take on a different, non-conflicting role as complete visual-tokens for image reconstruction, where the missing visual cues are recovered by the MLLM. Extensive experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously. Our project is available at https://github.com/DCDmllm/MorphTokens.
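
The dual role described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the truncation used as "abstraction", and the zero-padding used as "recovery" are all assumptions made purely for illustration. The point it tries to show is that the tokens fed into the MLLM (abstract prompts) and the tokens scored for reconstruction (complete visual tokens) are different objects, so the two objectives need not constrain the same representation.

```python
# Toy sketch only. All names are hypothetical and are NOT taken from the MorphTokens code.
from typing import List

Token = List[float]


def abstract_tokens(visual_tokens: List[Token], keep: int) -> List[Token]:
    """Stand-in for abstraction: keep only the first `keep` features of each token."""
    return [tok[:keep] for tok in visual_tokens]


def recover_tokens(morph_tokens: List[Token], full_dim: int) -> List[Token]:
    """Stand-in for the MLLM recovering missing cues.

    Here each token is simply zero-padded back to `full_dim`; a real model
    would predict the missing features rather than fill them with zeros.
    """
    return [tok + [0.0] * (full_dim - len(tok)) for tok in morph_tokens]


def reconstruction_loss(predicted: List[Token], target: List[Token]) -> float:
    """Mean squared error over all features of all tokens."""
    total = 0.0
    count = 0
    for p_tok, t_tok in zip(predicted, target):
        for p, t in zip(p_tok, t_tok):
            total += (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    complete = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]  # complete visual tokens
    prompts = abstract_tokens(complete, keep=2)               # role 1: abstract visual prompts
    recovered = recover_tokens(prompts, full_dim=4)           # role 2: tokens scored for reconstruction
    print("prompt tokens:", prompts)
    print("reconstruction loss:", reconstruction_loss(recovered, complete))
```

In this sketch the reconstruction loss is computed only on the recovered tokens, never on the abstract prompts, which mirrors the non-conflicting split of roles that the abstract describes.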

Code & Models

Repositories

DCDmllm/MorphTokens (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling