ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang, Shijie Wang, Junyang Lin, Shuai Bai, Xiaohuan Zhou, Jingren Zhou, Xinggang Wang, Chang Zhou

TL;DR
This paper introduces ONE-PEACE, a scalable, extensible 4B-parameter model that aligns and integrates vision, audio, and language representations through a modular architecture and two modality-agnostic pretraining tasks, enabling broad multi-modal applications.
Contribution
The paper presents ONE-PEACE, a highly extensible general representation model designed to scale toward unlimited modalities. It combines a flexible architecture with modality-agnostic pretraining tasks and does not rely on pretrained models.
Findings
Achieves leading results on diverse uni-modal and multi-modal tasks.
Supports seamless extension to new modalities by adding adapters and FFNs.
Demonstrates effective cross-modal alignment and fine-grained intra-modal understanding.
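The extension mechanism noted in the findings (one adapter and one FFN per modality, with self-attention shared across all of them) can be illustrated with a toy sketch. This is not the authors' implementation: the class name `ToyOnePeaceBlock`, the random linear adapters, the parameter-free single-head attention, and the single-matrix ReLU "FFN" are all simplifying assumptions chosen so the example runs with the standard library only.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def shared_self_attention(tokens):
    """Parameter-free single-head attention over ALL tokens (toy stand-in)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

class ToyOnePeaceBlock:
    """Hypothetical block: per-modality adapter + shared attention + per-modality FFN."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.adapters = {}  # modality name -> projection into the shared width
        self.ffns = {}      # modality name -> modality-specific feed-forward weights

    def add_modality(self, name, input_dim):
        # Adding a modality only creates new parameters; existing ones are untouched.
        self.adapters[name] = rand_matrix(self.dim, input_dim, self.rng)
        self.ffns[name] = rand_matrix(self.dim, self.dim, self.rng)

    def forward(self, inputs):
        """inputs: dict mapping modality name -> list of raw feature vectors."""
        tokens, owners = [], []
        for name, feats in inputs.items():
            for x in feats:
                tokens.append(matvec(self.adapters[name], x))
                owners.append(name)
        mixed = shared_self_attention(tokens)  # fusion happens here
        outputs = []
        for name, tok, att in zip(owners, tokens, mixed):
            resid = [a + b for a, b in zip(tok, att)]
            ffn = [max(0.0, v) for v in matvec(self.ffns[name], resid)]
            outputs.append([a + b for a, b in zip(resid, ffn)])
        return owners, outputs

model = ToyOnePeaceBlock(dim=4)
model.add_modality("vision", input_dim=3)
model.add_modality("language", input_dim=5)
owners, outputs = model.forward({
    "vision": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "language": [[1.0, 0.0, 0.0, 0.0, 1.0]],
})

# Extend to a third modality without modifying the existing adapters or FFNs.
vision_before = [row[:] for row in model.adapters["vision"]]
model.add_modality("audio", input_dim=6)
owners3, outputs3 = model.forward({
    "vision": [[0.1, 0.2, 0.3]],
    "audio": [[0.5] * 6],
})
```

In this sketch, tokens from every modality are concatenated before attention, so fusion happens in the shared layer while modality-specific processing stays in the per-modality parts.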
Abstract
In this work, we explore a scalable way for building a general representation model toward unlimited modalities. We release ONE-PEACE, a highly extensible model with 4B parameters that can seamlessly align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises modality adapters, shared self-attention layers, and modality FFNs. This design allows for the easy extension of new modalities by adding adapters and FFNs, while also enabling multi-modal fusion through self-attention layers. To pretrain ONE-PEACE, we develop two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, which align the semantic space of different modalities and capture fine-grained details within modalities concurrently. With the scaling-friendly architecture and pretraining tasks, ONE-PEACE has the…
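The abstract does not spell out the loss formulas, so the following is only a generic illustration of what a cross-modal contrastive objective commonly looks like: a symmetric InfoNCE-style loss over paired embeddings from two modalities. The function names and the temperature value of 0.07 are assumptions for this sketch, not details taken from the paper.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]

def symmetric_contrastive_loss(a_embs, b_embs, temperature=0.07):
    """a_embs[i] and b_embs[i] are a matched pair; all other pairs are negatives."""
    a = [l2_normalize(v) for v in a_embs]
    b = [l2_normalize(v) for v in b_embs]
    n = len(a)
    # Cosine similarity matrix scaled by the temperature.
    sim = [[sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b] for ai in a]
    a_to_b = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    b_to_a = sum(
        cross_entropy_row([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (a_to_b + b_to_a)

image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(image_embs, text_aligned)
loss_swapped = symmetric_contrastive_loss(image_embs, text_swapped)
```

Matched pairs that point in the same direction yield a loss near zero, while swapping the pairing makes the loss large, which is the pressure that pulls the semantic spaces of the two modalities together.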
Taxonomy
Topics: Multimodal Machine Learning Applications · Music and Audio Processing · Speech Recognition and Synthesis
