Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback

Jiaming Ji; Jiayi Zhou; Hantao Lou; Boyuan Chen; Donghai Hong; Xuyao Wang; Wenqi Chen; Kaile Wang; Rui Pan; Jiahao Li; Mohan Wang; Josef Dai; Tianyi Qiu; Hua Xu; Dong Li; Weipeng Chen; Jun Song; Bo Zheng; Yaodong Yang

arXiv:2412.15838·cs.AI·December 31, 2024

Open Access · 1 Repo · 3 Models · 5 Datasets

TL;DR

This paper introduces a novel framework for aligning all-modality models with human preferences using language feedback, addressing data scarcity and evaluation challenges in multi-modal instruction following.

Contribution

It presents the align-anything framework with 200k annotated preference data and a new evaluation system for all-modality models, advancing cross-modal alignment techniques.

Findings

01. Enhanced instruction-following in all-modality models.

02. Effective learning from unified language feedback.

03. Open-sourced data, models, and evaluation tools.

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the instruction-following capabilities of large language models; however, it remains underexplored in the cross-modality domain. As the number of modalities increases, aligning all-modality models with human intentions, such as instruction following, becomes a pressing challenge. In this work, we make the first attempt to fine-tune all-modality models (i.e., models whose input and output can be of any modality, also called any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring their behavior aligns with human intentions. This endeavor presents several challenges. First, there is no large-scale all-modality human preference data in existing open-source resources, as most datasets are limited to specific modalities, predominantly text and image. Second, the…

Code & Models

Repositories

pku-alignment/align-anything
PyTorch · Official

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Speech and dialogue systems · Natural Language Processing Techniques