EMMA: Efficient Multimodal Understanding, Generation, and Editing with a Unified Architecture
Xin He, Longhui Wei, Jianbo Ouyang, Minghui Liao, Lingxi Xie, Qi Tian

TL;DR
EMMA is a unified multimodal architecture that combines efficient encoding, reduced token usage, and a shared network to improve understanding, generation, and editing across visual and textual modalities.
Contribution
EMMA introduces a novel architecture that combines a high-compression autoencoder, channel-wise token concatenation, and a shared-and-decoupled network, advancing efficiency and performance in multimodal tasks.
Findings
EMMA-4B outperforms state-of-the-art models in efficiency and accuracy.
The autoencoder achieves 32x compression, reducing token requirements.
EMMA demonstrates competitive results with recent multimodal models.
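To make the token-reduction claim concrete, here is a minimal arithmetic sketch. It assumes that "32x compression" means each image side is downsampled by a factor of 32, which the summary does not state explicitly; the function name `num_visual_tokens`, the 1024x1024 resolution, and the comparison factors 8 and 16 are illustrative choices, not figures from the paper.

```python
def num_visual_tokens(height: int, width: int, downsample: int) -> int:
    """Count latent-grid tokens when each image side shrinks by `downsample`.

    Assumption: one token per latent grid cell, no extra patchify step.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


if __name__ == "__main__":
    # Compare hypothetical per-side downsampling factors on a 1024x1024 image.
    for factor in (8, 16, 32):
        print(f"{factor}x per side -> {num_visual_tokens(1024, 1024, factor)} tokens")
```

Under this reading, doubling the per-side factor cuts the token count by four, so moving from 8x to 32x would shrink the sequence by a factor of 16.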
Abstract
We propose EMMA, an efficient and unified architecture for multimodal understanding, generation, and editing. Specifically, EMMA primarily consists of 1) An efficient autoencoder with a 32x compression ratio, which significantly reduces the number of tokens required for generation. This also ensures a training balance between understanding and generation tasks by applying the same compression ratio to images. 2) Channel-wise concatenation, instead of token-wise concatenation, of visual understanding and generation tokens, which further reduces the number of visual tokens in unified architectures. 3) A shared-and-decoupled network that enables mutual improvement across tasks while meeting task-specific modeling requirements. 4) A mixture-of-experts mechanism adopted for the visual understanding encoder, which substantially improves perceptual capabilities with only a small increase in parameters.…
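The difference between token-wise and channel-wise concatenation in item 2 can be illustrated with a small shape-level sketch. This is not EMMA's implementation: the helper names, the toy sizes, and the use of plain Python lists are assumptions made for illustration, and the sketch assumes both token streams have the same sequence length so they can be fused position by position.

```python
def token_wise_concat(und, gen):
    """Append one token sequence after the other: length adds, width is unchanged."""
    return und + gen


def channel_wise_concat(und, gen):
    """Fuse tokens position by position: length is unchanged, width adds."""
    if len(und) != len(gen):
        raise ValueError("channel-wise concatenation needs equal sequence lengths")
    return [u + g for u, g in zip(und, gen)]


def shape(tokens):
    """Return (sequence length, feature width) of a list of token vectors."""
    return (len(tokens), len(tokens[0]))


# Toy example: 4 understanding tokens and 4 generation tokens, 3 features each.
n_tokens, dim = 4, 3
und = [[0.0] * dim for _ in range(n_tokens)]
gen = [[1.0] * dim for _ in range(n_tokens)]

if __name__ == "__main__":
    print("token-wise:  ", shape(token_wise_concat(und, gen)))
    print("channel-wise:", shape(channel_wise_concat(und, gen)))
```

In this toy setting the token-wise variant doubles the sequence length, while the channel-wise variant keeps it fixed and widens each token instead, which is the sense in which it reduces the number of visual tokens.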
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
