Write and Paint: Generative Vision-Language Models are Unified Modal Learners
Shizhe Diao, Wangchunshu Zhou, Xinsong Zhang, Jiawei Wang

TL;DR
This paper introduces DaVinci, a unified vision-language generative model that learns to write (image-to-text generation) and paint (text-to-image generation) concurrently, reporting competitive results on 27 tasks and benchmarking self-supervised objectives for multi-modal pre-training.
Contribution
The paper proposes a simple, scalable, and versatile unified model for vision-language generation, trained on image-text pairs with prefix language modeling and prefix image modeling so that writing and painting are learned concurrently.
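As a rough illustration of the prefix-modeling idea named above, the sketch below shows how one image-text pair could yield a "writing" example (image plus a text prefix, predict the remaining text) and a "painting" example (text plus an image-token prefix, predict the remaining image tokens). This is a generic sketch of prefix modeling, not DaVinci's implementation: the function names, the token lists, and the uniform sampling of the prefix length are all assumptions made for illustration.

```python
import random


def split_prefix(tokens, prefix_len):
    """Split a token sequence into a visible prefix and a suffix to predict."""
    if not 0 <= prefix_len <= len(tokens):
        raise ValueError("prefix_len must be within the sequence length")
    return tokens[:prefix_len], tokens[prefix_len:]


def prefix_attention_mask(seq_len, prefix_len):
    """Return mask[i][j], True when position i may attend to position j.

    Prefix positions are visible from every position (bidirectional context);
    suffix positions are visible only from themselves and later positions.
    """
    return [[j < prefix_len or j <= i for j in range(seq_len)]
            for i in range(seq_len)]


def make_example(condition, target, rng):
    """Build one example: condition + target prefix -> target suffix.

    The prefix length is sampled uniformly (an assumption), always leaving
    at least one target token to predict.
    """
    prefix_len = rng.randint(0, len(target) - 1)
    prefix, suffix = split_prefix(target, prefix_len)
    return {"context": condition + prefix, "predict": suffix}


# Hypothetical tokenized image-text pair (placeholder tokens).
image_tokens = ["img0", "img1", "img2", "img3"]
text_tokens = ["a", "dog", "on", "grass"]

rng = random.Random(0)
writing = make_example(image_tokens, text_tokens, rng)   # prefix language modeling
painting = make_example(text_tokens, image_tokens, rng)  # prefix image modeling
```

The two examples are built by the same function with the roles of the modalities swapped, which is the symmetry the summary describes: a single generative objective applied in both directions.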
Findings
DaVinci achieves competitive results on 27 tasks.
Unified pre-training enhances multi-modal capabilities.
Benchmarking reveals the effectiveness of self-supervised objectives.
Abstract
Recent advances in vision-language pre-training have pushed the state-of-the-art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate if these two essential capabilities can be learned together and boost each other, making a versatile and powerful multi-modal foundation model. In this work, we disclose the potential of symmetric generative vision-language pre-training in learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
