MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Jian Yang, Dacheng Yin, Yizhou Zhou, Fengyun Rao, Wei Zhai, Yang Cao, Zheng-Jun Zha

TL;DR
MMAR introduces a lossless, continuous-valued multi-modal auto-regressive framework that improves image understanding and generation, surpassing existing models in accuracy and quality while maintaining scalability.
Contribution
The paper proposes a novel MMAR framework that avoids information loss by using continuous image tokens and disentangles diffusion from auto-regressive modeling, with proven training techniques.
Findings
Outperforms existing multi-modal models on 18 benchmarks
Generates high-quality images with improved understanding capabilities
Scalable with larger datasets and model sizes
Abstract
Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we have identified that recent methods suffer from loss of image information during understanding tasks, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way. Differing from diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone model by employing a light-weight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation,…
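The idea of a light-weight diffusion head sitting on top of each auto-regressed image patch embedding can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `DiffusionHead`, the two-layer MLP, the noise-prediction objective, and the cosine noise schedule are all assumptions chosen for illustration, and the code uses only the Python standard library. The backbone's output for one patch is represented by a plain conditioning vector `z`; the head never feeds back into the backbone.

```python
import math
import random


def init_linear(rng, n_in, n_out):
    """Random weights and zero biases for one dense layer (hypothetical helper)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def linear(x, w, b):
    """Dense layer: returns w @ x + b for plain Python lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


class DiffusionHead:
    """Assumed tiny MLP that predicts the noise in one continuous image token,
    conditioned on the backbone's embedding z for that patch."""

    def __init__(self, token_dim, cond_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Input is the noisy token, the condition z, and a scalar timestep.
        self.l1 = init_linear(rng, token_dim + cond_dim + 1, hidden_dim)
        self.l2 = init_linear(rng, hidden_dim, token_dim)

    def predict_noise(self, x_t, t, z):
        h = linear(x_t + z + [t], *self.l1)
        h = [max(0.0, v) for v in h]  # ReLU
        return linear(h, *self.l2)


def diffusion_loss(head, x0, z, rng):
    """Noise-prediction MSE for one clean token x0 given condition z.
    The cosine schedule below is an assumption, not taken from the paper."""
    t = rng.random()  # timestep in [0, 1)
    alpha_bar = math.cos(0.5 * math.pi * t) ** 2
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    eps_hat = head.predict_noise(x_t, t, z)
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps)) / len(x0)


# Usage: one continuous image token and one stand-in backbone embedding.
rng = random.Random(0)
head = DiffusionHead(token_dim=4, cond_dim=8, hidden_dim=16)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]  # clean continuous token
z = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for a backbone output
loss = diffusion_loss(head, x0, z, rng)
```

In this sketch the noisy token `x_t` is seen only by the head, while the backbone would operate on the clean continuous tokens, which is one way to read the disentanglement described in the abstract.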
Taxonomy
Topics: Machine Learning and Data Classification · Neural Networks and Applications
Methods: Diffusion · Contrastive Language-Image Pre-training
