Pisces: An Auto-regressive Foundation Model for Image Understanding and Generation
Zhiyang Xu, Jiuhai Chen, Zhaojiang Lin, Xichen Pan, Lifu Huang, Tianyi Zhou, Madian Khabsa, Qifan Wang, Di Jin, Michihiro Yasunaga, Lili Yu, Xi Victoria Lin, Shaoliang Nie

TL;DR
Pisces is an auto-regressive multimodal foundation model that pairs a decoupled visual encoding architecture with tailored training techniques, achieving competitive performance in both image understanding and image generation.
Contribution
Introduces Pisces, a unified multimodal model that combines separate visual encoders with specialized training to address the gap between unified and specialized models in both image understanding and image generation.
Findings
Strong performance on 20+ image understanding benchmarks
Robust capabilities demonstrated on GenEval for image generation
Decoupled visual encoding enhances task-specific performance
Abstract
Recent advances in large language models (LLMs) have enabled multimodal foundation models to tackle both image understanding and generation within a unified framework. Despite these gains, unified models often underperform compared to specialized models in either task. A key challenge in developing unified models lies in the inherent differences between the visual features needed for image understanding versus generation, as well as the distinct training processes required for each modality. In this work, we introduce Pisces, an auto-regressive multimodal foundation model that addresses this challenge through a novel decoupled visual encoding architecture and tailored training techniques optimized for multimodal generation. Combined with meticulous data curation, pretraining, and finetuning, Pisces achieves competitive performance in both image understanding and image generation. We…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
