MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling

Jian Yang; Dacheng Yin; Yizhou Zhou; Fengyun Rao; Wei Zhai; Yang Cao; Zheng-Jun Zha

arXiv:2410.10798 · cs.CV · June 5, 2025

Open Access

TL;DR

MMAR introduces a lossless, continuous-valued multi-modal auto-regressive framework that improves image understanding and generation, surpassing existing models in accuracy and quality while maintaining scalability.

Contribution

The paper proposes a novel MMAR framework that avoids information loss by using continuous image tokens and disentangles diffusion from auto-regressive modeling, with proven training techniques.

Findings

1. Outperforms existing multi-modal models on 18 benchmarks
2. Generates high-quality images with improved understanding capabilities
3. Scalable with larger datasets and model sizes

Abstract

Recent advancements in multi-modal large language models have propelled the development of joint probabilistic models capable of both image understanding and generation. However, we find that recent methods lose image information during understanding tasks, due to either image discretization or diffusion denoising steps. To address this issue, we propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way. Unlike diffusion-based approaches, we disentangle the diffusion process from the auto-regressive backbone model by employing a lightweight diffusion head on top of each auto-regressed image patch embedding. In this way, when the model transitions from image generation to understanding through text generation,…
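The idea of a lightweight diffusion head on top of each auto-regressed image patch embedding can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the head architecture, the noise schedule, or any dimensions, so the two-layer MLP, the cosine schedule, and every name below (init_head, head_forward, diffusion_head_loss) are assumptions made for illustration only. The sketch shows only the general pattern: the backbone emits one embedding z per patch, and a small head learns to denoise the continuous token for that patch conditioned on z.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_head(token_dim, cond_dim, hidden_dim):
    """Random weights for a small two-layer MLP head (hypothetical sizes)."""
    in_dim = token_dim + cond_dim + 1  # noisy token + backbone embedding + timestep
    return {
        "w1": rng.normal(0.0, 0.02, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.02, (hidden_dim, token_dim)),
        "b2": np.zeros(token_dim),
    }

def head_forward(params, x_t, z, t):
    """Predict the noise in x_t, conditioned on embedding z and time t in [0, 1]."""
    h = np.concatenate([x_t, z, t[:, None]], axis=-1)
    h = np.tanh(h @ params["w1"] + params["b1"])
    return h @ params["w2"] + params["b2"]

def diffusion_head_loss(params, x0, z):
    """Noise-prediction MSE for clean continuous tokens x0 given embeddings z."""
    n = x0.shape[0]
    t = rng.uniform(0.0, 1.0, size=n)
    alpha_bar = np.cos(0.5 * np.pi * t) ** 2  # assumed cosine schedule
    eps = rng.normal(size=x0.shape)
    # Variance-preserving forward noising of each patch token.
    x_t = np.sqrt(alpha_bar)[:, None] * x0 + np.sqrt(1.0 - alpha_bar)[:, None] * eps
    eps_hat = head_forward(params, x_t, z, t)
    return float(np.mean((eps_hat - eps) ** 2))

# Toy shapes: 8 image patches, 16-dim continuous tokens, 32-dim backbone embeddings.
num_patches, token_dim, cond_dim = 8, 16, 32
x0 = rng.normal(size=(num_patches, token_dim))  # stand-in for continuous image tokens
z = rng.normal(size=(num_patches, cond_dim))    # stand-in for backbone outputs
params = init_head(token_dim, cond_dim, hidden_dim=64)
loss = diffusion_head_loss(params, x0, z)
```

In this arrangement the noising and denoising happen only inside the head, so the backbone itself always consumes clean continuous tokens, which is the separation between diffusion and auto-regressive modeling that the abstract describes.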

Taxonomy

Topics: Machine Learning and Data Classification · Neural Networks and Applications

Methods: Diffusion · Contrastive Language-Image Pre-training