Image Generators are Generalist Vision Learners
Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie

TL;DR
This paper demonstrates that image generation training enables models to learn powerful, general visual representations, achieving state-of-the-art performance across diverse vision tasks through a unified, generative approach.
Contribution
It introduces Vision Banana, a generalist vision model trained via instruction-tuning that reframes perception as image generation, achieving SOTA results on multiple vision tasks.
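To illustrate what "perception as image generation" can look like in practice, here is a minimal sketch of reading a segmentation result back out of a generated image. It assumes the model is instructed to paint each class in a fixed color; the palette, the `decode_mask` helper, and the simulated output are all hypothetical and are not taken from the paper.

```python
import numpy as np

# Hypothetical palette: class name -> RGB color the generator is asked to paint.
# These classes and colors are illustrative assumptions, not the paper's encoding.
PALETTE = {
    "background": (0, 0, 0),
    "person": (255, 0, 0),
    "car": (0, 0, 255),
}

def decode_mask(rgb, palette=PALETTE):
    """Map each pixel of a generated H x W x 3 image to its nearest palette class."""
    names = list(palette)
    colors = np.array([palette[n] for n in names], dtype=np.float32)  # (K, 3)
    pixels = rgb.astype(np.float32)[:, :, None, :]                    # (H, W, 1, 3)
    dist = np.sum((pixels - colors) ** 2, axis=-1)                    # (H, W, K)
    return np.argmin(dist, axis=-1), names

# Simulated generator output whose colors are slightly off the exact palette,
# as one might expect from a generative model.
generated = np.array(
    [[[250, 5, 3], [2, 1, 0]],
     [[10, 8, 240], [255, 0, 0]]],
    dtype=np.uint8,
)
labels, names = decode_mask(generated)
print(labels)
```

Nearest-color matching is used here because generated pixels rarely hit palette values exactly; a dense task such as depth would need a different, continuous encoding.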
Findings
Vision Banana outperforms domain-specific models on segmentation and depth estimation.
Lightweight instruction-tuning preserves image generation capabilities.
Image generation pretraining acts as a universal interface for vision tasks.
Abstract
Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output…
Code & Models
- 🤗 phanerozoic/dense-plantain · model · ♡ 1
- 🤗 phanerozoic/deep-plantain · model · 66 downloads
- 🤗 phanerozoic/sonic-plantain · model
- 🤗 phanerozoic/echo-plantain · model
- 🤗 phanerozoic/moving-plantain · model
- 🤗 phanerozoic/bumpy-plantain · model
- 🤗 phanerozoic/otherview-plantain · model
- 🤗 phanerozoic/after-plantain · model
- 🤗 phanerozoic/measure-plantain · model
- 🤗 phanerozoic/pulse-plantain · model
