MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

Haozhe Zhao; Zefan Cai; Shuzheng Si; Liang Chen; Jiuxiang Gu; Wen Xiao; Minjia Zhang; Junjie Hu

arXiv:2507.09574 · cs.CV · January 27, 2026

TL;DR

MENTOR introduces an efficient autoregressive framework for multimodal image generation that achieves precise control and high-quality results without auxiliary modules, outperforming existing methods on key benchmarks.

Contribution

The paper presents MENTOR, a novel two-stage training paradigm for autoregressive multimodal image generation that improves alignment, control, and efficiency without auxiliary components.

Findings

01. Outperforms baselines on DreamBench++ benchmark.

02. Achieves superior image reconstruction fidelity.

03. Demonstrates broad task adaptability and training efficiency.

Abstract

Recent text-to-image models produce high-quality results but still struggle with precise visual control and with balancing multimodal inputs, and they require extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite…
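To make the idea of token-level conditioning without adapters or cross-attention more concrete, here is a minimal sketch of one common way to set it up: the multimodal condition tokens are placed as a prefix in the same sequence as the target image tokens, and a next-token loss is computed only on the image positions. This is an illustrative assumption, not the authors' implementation; the function names `build_sequence` and `masked_cross_entropy`, the integer token IDs, and the toy vocabulary size are all hypothetical.

```python
import math


def build_sequence(condition_tokens, image_tokens):
    """Concatenate condition (prefix) and target image tokens into one sequence.

    Returns (inputs, targets, loss_mask) for next-token prediction, where
    loss_mask[i] is 1 only when targets[i] is an image token.
    Hypothetical helper: the paper's actual data layout is not given in the abstract.
    """
    seq = list(condition_tokens) + list(image_tokens)
    inputs = seq[:-1]   # the model sees tokens 0..n-2
    targets = seq[1:]   # and predicts tokens 1..n-1
    n_cond = len(condition_tokens)
    # targets[i] is seq[i + 1]; it is an image token when i + 1 >= n_cond
    loss_mask = [1 if (i + 1) >= n_cond else 0 for i in range(len(targets))]
    return inputs, targets, loss_mask


def masked_cross_entropy(logits, targets, loss_mask):
    """Average next-token cross-entropy over positions where loss_mask is 1.

    logits is a list of per-position score lists (one score per vocabulary id).
    """
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue
        # Numerically stable log-sum-exp for the softmax normaliser
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Toy usage: three condition tokens followed by two image tokens, vocabulary of 8 ids.
inputs, targets, loss_mask = build_sequence([5, 6, 7], [1, 2])
uniform_logits = [[0.0] * 8 for _ in targets]  # stand-in for a model's output
loss = masked_cross_entropy(uniform_logits, targets, loss_mask)
```

Because the condition lives in the same token stream as the output, ordinary causal self-attention is enough for every generated image token to attend to the inputs, which is one way to read the abstract's point about avoiding auxiliary modules.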

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques