Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation

Zhiyang Xu; Jiuhai Chen; Zhaojiang Lin; Xichen Pan; Lifu Huang; Tianyi Zhou; Madian Khabsa; Qifan Wang; Di Jin; Michihiro Yasunaga; Lili Yu; Xi Victoria Lin; Shaoliang Nie

arXiv:2506.10395 · cs.CV · July 15, 2025

TL;DR

Pisces is an auto-regressive multimodal foundation model that pairs a decoupled visual encoding architecture with tailored training techniques to achieve competitive performance in both image understanding and image generation.

Contribution

Introduces Pisces, a unified multimodal model with a decoupled visual encoding architecture and training techniques tailored to multimodal generation, aimed at narrowing the gap between unified and specialized models in both understanding and generation.

Findings

01. Strong performance on 20+ image understanding benchmarks

02. Robust capabilities demonstrated on GenEval for image generation

03. Decoupled visual encoding enhances task-specific performance

Abstract

Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning