ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

Peng Wang; Shijie Wang; Junyang Lin; Shuai Bai; Xiaohuan Zhou; Jingren Zhou; Xinggang Wang; Chang Zhou

arXiv:2305.11172·cs.CV·May 19, 2023·42 cites

Open Access 2 Repos 1 Models

TL;DR

This paper introduces ONE-PEACE, a scalable, extensible model with 4B parameters that aligns and integrates multiple modalities like vision, audio, and language through a novel architecture and pretraining tasks, enabling broad multi-modal applications.

Contribution

The paper presents ONE-PEACE, a highly extensible general representation model designed to scale toward unlimited modalities. It combines a flexible architecture with modality-agnostic pretraining tasks and does not rely on other pretrained models.

Findings

01 Achieves leading results on diverse uni-modal and multi-modal tasks.

02 Supports seamless extension to new modalities by adding adapters and FFNs.

03 Demonstrates effective cross-modal alignment and fine-grained intra-modal understanding.
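To make finding 02 concrete, here is a minimal pure-Python sketch of the layout the abstract describes: per-modality adapters, one shared self-attention step, and per-modality FFNs. It is a toy under stated assumptions, not the released model. The names `ToyModel`, `add_modality`, and `DIM` are hypothetical, the adapters are plain linear projections, the attention is a parameter-free single head, and each FFN is a single linear layer with ReLU.

```python
import math
import random

random.seed(0)
DIM = 4  # hypothetical shared hidden size


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(weight, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Parameter-free single-head attention over all tokens (a simplification)."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * t[i] for w, t in zip(weights, tokens))
                    for i in range(len(q))])
    return out


class ToyModel:
    def __init__(self):
        self.adapters = {}  # modality name -> input projection (assumed linear)
        self.ffns = {}      # modality name -> FFN weight (assumed one layer)

    def add_modality(self, name, input_dim):
        # Extending to a new modality only adds an adapter and an FFN.
        self.adapters[name] = rand_matrix(DIM, input_dim)
        self.ffns[name] = rand_matrix(DIM, DIM)

    def forward(self, inputs):
        """inputs: dict of modality name -> list of raw feature vectors."""
        tokens, owners = [], []
        for name, feats in inputs.items():
            for feat in feats:
                tokens.append(linear(self.adapters[name], feat))
                owners.append(name)
        mixed = self_attention(tokens)  # shared step: tokens of all modalities interact
        out = []
        for tok, mix, name in zip(tokens, mixed, owners):
            hidden = [t + m for t, m in zip(tok, mix)]  # residual connection
            ffn = [max(0.0, v) for v in linear(self.ffns[name], hidden)]
            out.append([h + f for h, f in zip(hidden, ffn)])
        return out, owners


model = ToyModel()
model.add_modality("vision", 6)
model.add_modality("language", 3)
out, owners = model.forward({
    "vision": [[0.1] * 6, [0.2] * 6],
    "language": [[1.0, 0.0, 0.0]],
})

# Adding audio later leaves the existing vision and language branches untouched.
model.add_modality("audio", 5)
audio_out, audio_owners = model.forward({"audio": [[0.3] * 5]})
```

The point of the sketch is the routing: every token passes through the same attention step, while the adapter and FFN are looked up by modality name, so a new modality registers two new components without modifying existing ones.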

Abstract

In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the…
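The abstract names cross-modal aligning contrast but does not give its formula. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE loss over paired embeddings from two modalities. The function names and the `temperature` default of 0.07 are assumptions for this example, not values taken from the paper.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """emb_a[i] and emb_b[i] are assumed to describe the same sample."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity of every a-b pair, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(va, vb)) / temperature for vb in b]
              for va in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


image_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(image_emb, [[0.9, 0.1], [0.1, 0.9]])
shuffled = symmetric_contrastive_loss(image_emb, [[0.1, 0.9], [0.9, 0.1]])
```

With correctly paired embeddings the loss is near zero, and it grows when the pairs are swapped, which is the pressure that pulls matching samples from different modalities into a shared semantic space.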

Code & Models

Repositories

Models

🤗
linhuixiao/Awesome-Visual-Grounding
model · ♡ 1

Taxonomy

Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech Recognition and Synthesis