Write and Paint: Generative Vision-Language Models are Unified Modal Learners

Shizhe Diao; Wangchunshu Zhou; Xinsong Zhang; Jiawei Wang

arXiv:2206.07699 · cs.CV · February 20, 2023 · 5 cites

TL;DR

This paper introduces DaVinci, a unified vision-language generative model that learns to write and paint simultaneously, demonstrating strong performance across diverse tasks and establishing new benchmarks for multi-modal pre-training.

Contribution

The paper proposes a simple, scalable, and versatile unified model for vision-language generation, combining prefix language and image modeling for concurrent multi-modal learning.

Findings

01. DaVinci achieves competitive results on 27 tasks.

02. Unified pre-training enhances multi-modal capabilities.

03. Benchmarking reveals the effectiveness of self-supervised objectives.

Abstract

Recent advances in vision-language pre-training have pushed the state-of-the-art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies investigate if these two essential capabilities can be learned together and boost each other, making a versatile and powerful multi-modal foundation model. In this work, we disclose the potential of symmetric generative vision-language pre-training in learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, a simple generative self-supervised objective on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and…
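The prefix modeling objective named in the abstract can be illustrated with a small attention-mask sketch. This is a minimal illustration of the general prefix language modeling pattern (bidirectional attention over a prefix, left-to-right attention over the remainder), not code from the DaVinci repository; the paper's actual architecture may realize the same idea differently. The function names `prefix_attention_mask` and `loss_positions` are hypothetical and chosen only for this example.

```python
from typing import List


def prefix_attention_mask(prefix_len: int, total_len: int) -> List[List[bool]]:
    """Build a prefix-LM attention mask (illustrative, hypothetical helper).

    mask[i][j] is True when position i may attend to position j.
    Every position sees the whole prefix; positions after the prefix
    additionally see earlier non-prefix positions and themselves.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


def loss_positions(prefix_len: int, total_len: int) -> List[int]:
    """Positions that are predicted (and contribute to the loss).

    Assumption for this sketch: only tokens after the prefix are targets.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")
    return list(range(prefix_len, total_len))


if __name__ == "__main__":
    # Two prefix tokens followed by two tokens to be generated.
    for row in prefix_attention_mask(prefix_len=2, total_len=4):
        print("".join("1" if allowed else "0" for allowed in row))
    print(loss_positions(prefix_len=2, total_len=4))
```

With `prefix_len=0` the mask reduces to ordinary causal (left-to-right) attention, and with `prefix_len == total_len` it becomes fully bidirectional, which is why a single prefix length parameter can cover both extremes. In this sketch the same pattern would apply whether the sequence holds text tokens or discrete image tokens; how images are tokenized is outside what the abstract states.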

Code & Models

Repositories

shizhediao/davinci · PyTorch · Official

Videos

Write and Paint: Generative Vision-Language Models are Unified Modal Learners · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling